Rapid Container Restart Loop

What this means

A crash loop occurs when an ECS container starts, fails almost immediately, gets replaced by the service scheduler, and the replacement fails in the same way. This creates a rapid cycle of start-crash-replace that can repeat dozens of times before anyone notices. Unlike occasional restarts that might be caused by transient issues, a crash loop means the container is unable to run in its current configuration.

smplogs detects this pattern by looking for more than 5 container restart events concentrated within a 30-minute window. The time concentration distinguishes a crash loop from distributed restarts that happen over hours or days. A container that restarts once a day for a week is a different kind of problem (likely a slow memory leak) than one that restarts 8 times in 20 minutes (almost certainly a configuration or startup failure).

The most common causes of crash loops in ECS are missing or incorrect environment variables, unreachable dependencies that the application checks at startup (a database, a config server, an external API), incompatible container images (wrong architecture, missing shared libraries), and entrypoint scripts that exit on the first error. On Fargate, a crash loop is particularly expensive because each restart attempt provisions a new Fargate task, and you are billed for the minimum 1-minute runtime of each attempt even if the container dies in 3 seconds.
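The effect of the one-minute billing minimum can be worked through with simple arithmetic. The sketch below is illustrative only: `billed_seconds` is a hypothetical helper, and it assumes the only rule in play is a 60-second floor per task. Real Fargate billing also counts image pull time, so actual per-attempt durations are longer than the container's own runtime.

```python
def billed_seconds(runtimes_s, minimum_s=60):
    """Total billed seconds when every task is charged at least minimum_s."""
    return sum(max(minimum_s, runtime) for runtime in runtimes_s)


# Nine attempts that each die after 3 seconds.
attempts = [3] * 9
actual = sum(attempts)             # seconds the containers really ran
billed = billed_seconds(attempts)  # seconds charged under the 60 s floor
print(f"ran {actual}s, billed {billed}s")
```

Under these assumptions, nine 3-second attempts run for 27 seconds in total but are billed as 540 seconds.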

Unlike Kubernetes, ECS has no CrashLoopBackOff equivalent for containers that start and then exit. The service scheduler throttles relaunches only when tasks repeatedly fail to reach the RUNNING state; a task that reaches RUNNING and then crashes is simply replaced, and the scheduler never gives up on its own. This means a crash loop generates a high volume of CloudWatch log streams (one per attempt), burns through your Fargate compute budget, and can exhaust the available IP addresses in your subnet if each attempt claims and releases an ENI. The deployment circuit breaker (introduced in 2020) is the primary mechanism to stop this behavior, but it must be explicitly enabled on your service.

Detection criteria

HIGH

More than 5 container restart events in a log export spanning less than 30 minutes. The time density of restarts distinguishes a crash loop from occasional, isolated restart events.
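The rule can be expressed as a sliding window over restart timestamps. This is a minimal sketch of the idea, not smplogs' actual implementation: the function name `is_crash_loop` and its parameters are hypothetical, and it assumes restart events have already been parsed into `datetime` values.

```python
from datetime import datetime, timedelta


def is_crash_loop(restart_times, max_restarts=5, window=timedelta(minutes=30)):
    """Return True if more than max_restarts fall inside any single window."""
    times = sorted(restart_times)
    start = 0
    for end, current in enumerate(times):
        # Drop events that are a full window (or more) older than the current one.
        while current - times[start] >= window:
            start += 1
        if end - start + 1 > max_restarts:
            return True
    return False


base = datetime(2024, 1, 1, 14, 2, 11)
# Eight restarts, one every 2.5 minutes: a dense burst.
burst = [base + timedelta(seconds=150 * i) for i in range(8)]
# Seven restarts, one per day: spread out, not a loop.
daily = [base + timedelta(days=i) for i in range(7)]
print(is_crash_loop(burst), is_crash_loop(daily))
```

The two sample series mirror the contrast drawn earlier: a burst of restarts minutes apart trips the rule, while one restart a day for a week does not.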

Example finding

HIGH Rapid Container Restart Loop

9 container restarts detected within a 22-minute window. All tasks exited with code 1. Consistent failure pattern suggests a configuration or startup issue.

14:02:11 Task abc-001 started, exited code 1 at 14:02:14

14:04:33 Task abc-002 started, exited code 1 at 14:04:36

14:07:01 Task abc-003 started, exited code 1 at 14:07:04

Error: Missing required environment variable DATABASE_URL

How to fix

Stop the loop before debugging. Update the service desired count to 0 to stop ECS from launching new tasks: aws ecs update-service --cluster my-cluster --service my-service --desired-count 0. This stops the bleeding: no more failed tasks accumulate in your logs or consume compute. If you know the last working task definition revision, you can instead roll the service back to it, which stops the loop and restores service in one step.

Read the first and last log entries from the failed task's log stream. The first entries show what the container did during initialization. The last entries (just before the crash) show what failed. Common patterns: "Error: Cannot find module" (wrong working directory or missing dependency), "Error: Missing required environment variable" (env var not set in task definition or Secrets Manager reference failed), "Error: EACCES: permission denied" (non-root user cannot bind to port or access a volume), "exec format error" (ARM image on x86 or vice versa).
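The pattern matching described above can be sketched as a small lookup that scans a log stream from the end, since the last entries usually hold the failure. The table and the `likely_cause` function are hypothetical helpers built only from the four messages listed here; real logs will need more patterns.

```python
# (substring to look for, probable cause) pairs taken from the list above.
LIKELY_CAUSES = [
    ("Cannot find module", "wrong working directory or missing dependency"),
    ("Missing required environment variable",
     "env var not set in task definition or Secrets Manager reference failed"),
    ("EACCES: permission denied",
     "non-root user cannot bind to port or access a volume"),
    ("exec format error", "image built for a different CPU architecture"),
]


def likely_cause(log_lines):
    """Scan from the newest line backwards and return the first known cause."""
    for line in reversed(log_lines):
        for needle, cause in LIKELY_CAUSES:
            if needle in line:
                return cause
    return None  # nothing recognised; read the stream manually


sample = [
    "Starting application",
    "Error: Missing required environment variable DATABASE_URL",
]
print(likely_cause(sample))
```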

Enable the deployment circuit breaker. This is the single most important preventive measure. Add "deploymentCircuitBreaker": { "enable": true, "rollback": true } to your service's deployment configuration. During a deployment, the circuit breaker counts consecutive task launch failures and, after reaching a threshold (based on desired count), marks the deployment as failed and automatically rolls back to the last stable deployment. Without this, ECS keeps replacing failed tasks indefinitely.
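One way to apply that setting from a script is through the boto3 ECS client's `update_service` call. The sketch below is an assumption about how you might wire it up, not a prescribed procedure: `enable_circuit_breaker` is a hypothetical wrapper, the cluster and service names are placeholders, and the client is passed in so the function can be exercised without AWS credentials.

```python
def enable_circuit_breaker(ecs_client, cluster, service):
    """Turn on the deployment circuit breaker with automatic rollback."""
    return ecs_client.update_service(
        cluster=cluster,
        service=service,
        deploymentConfiguration={
            "deploymentCircuitBreaker": {"enable": True, "rollback": True}
        },
    )


# Typical use (requires boto3 and AWS credentials; names are placeholders):
#   import boto3
#   enable_circuit_breaker(boto3.client("ecs"), "my-cluster", "my-service")
```

Check the resulting deployment configuration afterwards (for example with describe-services) to confirm the other deployment settings are still what you expect.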

Test the container locally before deploying. Run the exact image with the same environment variables and resource limits: docker run --memory 512m --env-file .env my-image:tag. If it crashes locally with the same error, you can iterate much faster than deploying to ECS each time. For Secrets Manager references, pull the secrets with the AWS CLI and pass them as --env flags during local testing.
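For the Secrets Manager case, the local run can be assembled programmatically once the values are in hand. This is a rough sketch under stated assumptions: `docker_run_args` is a hypothetical helper, the image tag and variable values are placeholders, and the secrets are assumed to have been fetched already (for instance with aws secretsmanager get-secret-value) into a plain dictionary.

```python
def docker_run_args(image, env, memory="512m"):
    """Build a docker run argument list with one --env flag per variable."""
    args = ["docker", "run", "--rm", "--memory", memory]
    for name, value in sorted(env.items()):
        args += ["--env", f"{name}={value}"]
    args.append(image)
    return args


# Placeholder values; in practice these come from your secret store.
env = {"DATABASE_URL": "postgres://localhost/dev", "LOG_LEVEL": "debug"}
args = docker_run_args("my-image:tag", env)
print(" ".join(args))
# To actually launch it: subprocess.run(args, check=False)
```

Passing the arguments as a list avoids shell quoting problems when a secret contains spaces or special characters.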

Add a startup health check with a generous start period. Set the startPeriod in your container health check to give the application time to initialize (database migrations, cache warming, dependency checks) before failed checks count against it. A container that takes 45 seconds to start but has a 10-second start period can be marked unhealthy and replaced before it finishes initializing, creating a false crash loop. Set the start period to at least 2x your observed startup time.
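The 2x guideline can be turned into a small helper that produces a health check block for a task definition. Treat this as a sketch: `health_check` is a hypothetical function, the probe command and port are placeholders, and the defaults for interval, timeout, and retries are assumptions you should tune. The result is capped at 300 seconds, the largest startPeriod ECS accepts.

```python
def health_check(observed_startup_s, command, interval_s=30, timeout_s=5, retries=3):
    """Build a container healthCheck dict with startPeriod = 2x observed startup."""
    start_period = min(300, max(0, 2 * observed_startup_s))  # ECS allows 0-300 s
    return {
        "command": ["CMD-SHELL", command],
        "interval": interval_s,
        "timeout": timeout_s,
        "retries": retries,
        "startPeriod": start_period,
    }


# Placeholder probe; replace the URL with your application's health endpoint.
check = health_check(45, "curl -f http://localhost:8080/health || exit 1")
print(check["startPeriod"])
```

For the 45-second startup used as the example above, this yields a 90-second start period.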

Related patterns

Container Restart Events: Non-zero exit codes indicating crash restarts

Container OOM Kills Detected: Out-of-memory terminations with exit code 137

