Container Restart Events

What this means

ECS records the exit code of every container in a task. When an essential container exits, the task stops, and if the task belongs to a service, the scheduler launches a replacement to restore the desired count. Each replacement appears in CloudWatch as a new log stream with a fresh task ID, so the pattern is easy to miss unless you are specifically looking for it.

The exit code tells you a great deal about what went wrong. Exit code 1 is a generic application error: an unhandled exception, a failed assertion, or a missing configuration value. Codes above 128 follow the convention of 128 plus the number of the signal that terminated the process. Exit code 137 (128 + 9) means the process received SIGKILL, usually from the OOM killer (see the OOM Kills pattern). Exit code 139 (128 + 11) is a segmentation fault. Exit code 143 (128 + 15) means the container received SIGTERM and exited, which is the normal graceful-shutdown path during deployments but abnormal if it happens outside a deployment window.
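The mapping from exit code to cause can be sketched in a few lines of Python. This is an illustrative helper, not part of any ECS tooling: the function name describe_exit_code and the label strings are assumptions, and the signal names come from the platform's own signal table (Linux is assumed).

```python
import signal

# Labels for the non-signal codes discussed above (wording is illustrative).
KNOWN_CODES = {
    0: "clean exit",
    1: "generic application error",
}


def describe_exit_code(code: int) -> str:
    """Return a short label for a container exit code."""
    if code in KNOWN_CODES:
        return KNOWN_CODES[code]
    if 128 < code < 256:
        # Convention: 128 + N means the process was terminated by signal N.
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not defined on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return "application-defined error"
```

On Linux, describe_exit_code(137) returns "terminated by SIGKILL". The label says nothing about who sent the signal, so a 137 still needs to be cross-checked against memory metrics before you call it an OOM kill.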

On Fargate, ECS automatically replaces failed tasks if they belong to a service. The service scheduler respects the minimumHealthyPercent and maximumPercent settings during deployments. It throttles relaunches only when tasks repeatedly fail to reach the RUNNING state; a container that starts and then crashes moments later is replaced as fast as ECS can provision new tasks. On EC2 launch type, the Docker daemon's own restart policy interacts with the ECS agent, and you may see the container restart on the same host without a new task ARN.
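To make the two percentage settings concrete, the following sketch computes the task-count bounds they imply for a given desired count. The function name task_bounds is an assumption. The rounding follows the ECS documentation's description: the minimum is rounded up and the maximum is rounded down.

```python
def task_bounds(desired_count: int,
                minimum_healthy_percent: int,
                maximum_percent: int) -> tuple[int, int]:
    """Return (min_running, max_total) task counts for a deployment."""
    # Ceiling division: the healthy floor is rounded up.
    min_running = -(-desired_count * minimum_healthy_percent // 100)
    # Floor division: the upper limit is rounded down.
    max_total = desired_count * maximum_percent // 100
    return min_running, max_total
```

For a service with desiredCount=3, minimumHealthyPercent=50, and maximumPercent=200, this gives a floor of 2 running tasks and a ceiling of 6 running or pending tasks. A crash loop that holds runningCount below the floor can therefore stall a rolling deployment.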

Frequent restarts cause problems beyond the obvious downtime. Each restart means a cold start: fresh JVM warmup, empty caches, and new database connection pools. If your service is behind an Application Load Balancer, the target group keeps deregistering and re-registering targets, which increases latency and can produce 502 errors for clients during the deregistration drain period.

Detection criteria

HIGH

More than 3 container restart events detected across log windows. Multiple non-zero exit codes or repeated task replacement patterns.

ELEVATED

At least 1 container restart event with a non-zero exit code. Isolated restarts may indicate transient issues but should be monitored.
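One possible reading of these two thresholds is sketched below. It assumes each restart event has already been reduced to its exit code. The function name classify_restarts is hypothetical, and the secondary HIGH signals (mixed exit codes, repeated task replacement) are not modeled.

```python
from typing import Optional, Sequence


def classify_restarts(exit_codes: Sequence[int]) -> Optional[str]:
    """Map a list of restart exit codes to a severity, or None if quiet."""
    # HIGH: more than 3 restart events across the analyzed windows.
    if len(exit_codes) > 3:
        return "HIGH"
    # ELEVATED: at least one restart with a non-zero exit code.
    if any(code != 0 for code in exit_codes):
        return "ELEVATED"
    return None
```

With seven events carrying codes 1 (x4), 137 (x2), and 139 (x1), the function returns "HIGH". A single restart with code 143 returns "ELEVATED".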

Example finding

HIGH Container Restart Events

7 container restart events detected across 15 log windows. Exit codes observed: 1 (x4), 137 (x2), 139 (x1). Service stability is degraded.

Container exited with code 1: Error: Cannot find module '/app/config/production.json'

Task stopped: arn:aws:ecs:us-east-1:123456:task/api-cluster/def456

Service api-service: task set updated, desiredCount=3, runningCount=1

How to fix

Examine the exit codes in your stopped tasks. Open the ECS console, navigate to your cluster, click "Tasks," and filter by "Stopped." Each stopped task shows the exit code and a brief reason. Group these by exit code to understand whether you are dealing with one failure mode or several. You can also query this programmatically with aws ecs describe-tasks --cluster my-cluster --tasks $(aws ecs list-tasks --cluster my-cluster --desired-status STOPPED --query 'taskArns' --output text). Pass --cluster to both commands; without it, describe-tasks looks in the default cluster and will not find the tasks.
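Grouping by exit code is easy to script once the describe-tasks output is saved as JSON. The sketch below assumes the standard response shape, a top-level "tasks" list whose entries each carry a "containers" list with an optional "exitCode". The function names are illustrative.

```python
from collections import Counter


def exit_code_counts(describe_tasks_output: dict) -> Counter:
    """Count container exit codes across all tasks in the response."""
    counts: Counter = Counter()
    for task in describe_tasks_output.get("tasks", []):
        for container in task.get("containers", []):
            # exitCode is absent when the container never started.
            counts[container.get("exitCode")] += 1
    return counts


def format_counts(counts: Counter) -> str:
    """Render counts as e.g. '1 (x4), 137 (x2)' in ascending code order."""
    known = sorted((c, n) for c, n in counts.items() if c is not None)
    parts = [f"{code} (x{n})" for code, n in known]
    if counts.get(None):
        parts.append(f"no exit code (x{counts[None]})")
    return ", ".join(parts)
```

Load the CLI output with json.load and pass the resulting dict to exit_code_counts. A single dominant code usually points to one root cause, while a mix such as 1 and 137 suggests separate application and memory problems.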

Check the last lines of each stopped task's log stream. CloudWatch preserves the log stream even after the task stops. The final log entries before the exit usually contain the stack trace or error message that caused the crash. If the container dies before writing any logs, the issue is likely in the entrypoint script or a missing environment variable that prevents the process from starting at all.
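Fetching the tail of a stopped task's log stream can be scripted as well. This sketch takes any client object that exposes the CloudWatch Logs get_log_events call, for example one created with boto3.client("logs"). The function name and the default of 20 lines are assumptions.

```python
def last_log_lines(logs_client, log_group: str, log_stream: str,
                   limit: int = 20) -> list[str]:
    """Return the messages of the most recent events in one log stream."""
    response = logs_client.get_log_events(
        logGroupName=log_group,
        logStreamName=log_stream,
        limit=limit,
        startFromHead=False,  # ask for the latest events, not the earliest
    )
    return [event["message"] for event in response["events"]]
```

With the awslogs driver, the stream name has the form prefix/container-name/task-id, so the task ID from a stopped task's ARN is enough to locate its stream. An empty result for a task that did stop supports the entrypoint or environment-variable explanation above.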

Verify SIGTERM handling in your application. During rolling deployments, ECS sends SIGTERM to your container and waits for the stopTimeout period (default 30 seconds on Fargate) before sending SIGKILL. If your application does not trap SIGTERM, it will be killed ungracefully. In Node.js, register a handler with process.on('SIGTERM', ...). In Go, use signal.Notify. In Java, add a shutdown hook. Graceful shutdown should close the HTTP listener, drain active connections, flush buffers, and then exit with code 0.
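The same pattern in Python, shown here only to keep the examples in one language, looks roughly like the sketch below. The handler sets a flag and the main loop does the cleanup, so the shutdown work runs outside the signal handler. The names install_handler and serve, and the callbacks handle_request and cleanup, are hypothetical placeholders for your own code.

```python
import signal
import threading

# Set when SIGTERM arrives; the serving loop checks it between requests.
stop_requested = threading.Event()


def on_sigterm(signum, frame):
    """Signal handler: only record the request, do no cleanup here."""
    stop_requested.set()


def install_handler():
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, on_sigterm)


def serve(handle_request, cleanup, poll_seconds: float = 0.1) -> int:
    """Process work until SIGTERM is seen, then clean up and return 0."""
    while not stop_requested.is_set():
        handle_request()
        # Wake early if the flag is set while waiting.
        stop_requested.wait(poll_seconds)
    cleanup()  # close listeners, drain connections, flush buffers
    return 0
```

A typical entrypoint would call install_handler() and then sys.exit(serve(...)), which yields exit code 0 after a graceful stop. The cleanup must finish within stopTimeout, or ECS follows up with SIGKILL and the task is recorded with exit code 137.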

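Set appropriate restart policies. For ECS services, configure the deployment circuit breaker with deploymentCircuitBreaker: { enable: true, rollback: true } in the service's deploymentConfiguration. This prevents ECS from endlessly restarting a broken deployment. The circuit breaker trips once consecutive task failures reach a threshold that ECS derives from the service's desired count, and it then rolls back to the last stable task definition. It evaluates deployments only, so it does not stop a crash loop that begins after a deployment has completed.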
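Enabling the circuit breaker on an existing service can be done through the UpdateService API. The sketch below accepts any client exposing update_service, such as boto3.client("ecs"). The function name is an assumption, and the cluster and service names you pass are your own.

```python
def enable_circuit_breaker(ecs_client, cluster: str, service: str):
    """Turn on the deployment circuit breaker with automatic rollback."""
    return ecs_client.update_service(
        cluster=cluster,
        service=service,
        deploymentConfiguration={
            "deploymentCircuitBreaker": {"enable": True, "rollback": True},
        },
    )
```

Before relying on this, check whether your service also needs its existing maximumPercent and minimumHealthyPercent values passed in the same deploymentConfiguration, since this sketch sends only the circuit breaker block.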

Add health checks to your task definition. The healthCheck block in the container definition lets you specify a command (such as curl -f http://localhost:8080/health) that ECS runs at intervals. If the health check fails repeatedly, ECS marks the task as unhealthy and replaces it, which gives you a controlled restart path rather than waiting for an outright crash.
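As an illustration, the healthCheck block can be built as a plain dictionary and placed in a container definition. The field names match the ECS container definition schema. The helper name health_check and the 60-second start period are assumptions, while the interval, timeout, and retries defaults mirror the ECS defaults.

```python
def health_check(command: str, interval: int = 30, timeout: int = 5,
                 retries: int = 3, start_period: int = 60) -> dict:
    """Build a healthCheck block for an ECS container definition."""
    return {
        # CMD-SHELL runs the string with the container's default shell.
        "command": ["CMD-SHELL", f"{command} || exit 1"],
        "interval": interval,         # seconds between checks
        "timeout": timeout,           # seconds before a check counts as failed
        "retries": retries,           # consecutive failures before unhealthy
        "startPeriod": start_period,  # grace period after container start
    }
```

The command runs inside the container, so a curl-based check only works if curl is present in the image. Setting startPeriod at least as long as the application's warmup time avoids replacing tasks that are merely slow to start.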

Related patterns

Container OOM Kills Detected: out-of-memory terminations with exit code 137

Rapid Container Restart Loop: many restarts in a short window, indicating a crash loop
