Service Health Degradation

What this means

This is a compound finding that fires only when two separate patterns overlap: your containers are being killed by the OOM killer, and at the same time the surviving containers are producing errors at a rate above 10%. The combination is far worse than either problem in isolation because it creates a reinforcing failure loop. OOM kills reduce the number of healthy tasks, which concentrates traffic on the survivors, which increases their memory pressure and error rate, which triggers more OOM kills.

In a typical ECS service with three running tasks behind an Application Load Balancer, one OOM kill drops capacity by a third. With typical health check settings, the ALB takes 10-30 seconds to detect the failed target and stop routing to it. During that window, some requests are routed to the dead task and receive 502 errors. Meanwhile, the two surviving tasks absorb 50% more traffic each, which may push them past their own memory limits. If a second task is OOM-killed before the replacement launches, you are down to a single task handling triple the normal load. This is what makes the pattern critical.
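The arithmetic in the three-task scenario above can be sketched as follows. This is a minimal illustration that assumes the load balancer spreads requests evenly across healthy tasks; the function name is hypothetical and not part of any AWS tooling.

```python
def per_task_load_multiplier(desired_count: int, running_count: int) -> float:
    """Traffic carried by each surviving task relative to its normal share.

    Assumes total traffic stays constant and is spread evenly across
    the tasks that are still running.
    """
    if desired_count <= 0:
        raise ValueError("desired_count must be positive")
    if running_count <= 0:
        raise ValueError("no running tasks: the service is fully down")
    return desired_count / running_count


if __name__ == "__main__":
    # Three desired tasks, losing one task at a time to OOM kills.
    for running in (3, 2, 1):
        multiplier = per_task_load_multiplier(3, running)
        print(f"running={running} per-task load x{multiplier}")
```

With three desired tasks, two survivors each carry 1.5 times their normal load and a single survivor carries 3 times, which matches the 50% and triple figures above.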

On Fargate, replacement tasks take 30-90 seconds to provision and start, depending on image size and initialization time. On EC2 launch type, if the cluster has spare capacity, replacement is faster (the container starts on an existing host), but if the cluster is fully packed, ECS must wait for Auto Scaling to launch a new instance, which adds 2-5 minutes. During this entire window, your service is degraded.

This pattern is most commonly triggered by deployments that introduce a memory regression (a larger payload parser, an unbounded cache, a dependency upgrade that increased baseline memory) combined with a load spike that exercises the new code path. It also appears in long-running services with slow memory leaks: after days of uptime, the leak finally pushes the container past its limit, and the resulting churn causes the error rate to spike.

Detection criteria

CRITICAL

OOM kill events detected (exit code 137 or OOMKilled messages) AND errors present in more than 10% of log windows. Both conditions must hold at the same time.
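As a rough sketch of how this compound check could be evaluated, the snippet below treats each log window as a list of lines. The regular expressions, the window representation, and the function name are assumptions made for illustration; the detector's actual matching rules are not described here.

```python
import re

# Assumed patterns: an OOM kill line mentions "OOMKilled" or exit code 137,
# and an error line contains the token "ERROR".
OOM_PATTERN = re.compile(r"OOMKilled|exit(?: code)? 137", re.IGNORECASE)
ERROR_PATTERN = re.compile(r"\bERROR\b")


def evaluate_windows(windows, error_threshold=0.10):
    """Return OOM kill count, error-window rate, and the CRITICAL verdict."""
    oom_kills = sum(
        1 for window in windows for line in window if OOM_PATTERN.search(line)
    )
    error_windows = sum(
        1 for window in windows
        if any(ERROR_PATTERN.search(line) for line in window)
    )
    error_rate = error_windows / len(windows) if windows else 0.0
    return {
        "oom_kills": oom_kills,
        "error_rate": error_rate,
        # Both conditions must hold for the compound finding to fire.
        "critical": oom_kills > 0 and error_rate > error_threshold,
    }


if __name__ == "__main__":
    # Synthetic data: 110 windows, 20 with errors, 4 with OOM kill lines.
    windows = (
        [["ERROR: connection reset by peer"]] * 20
        + [["OOMKilled: task stopped (exit 137)"]] * 4
        + [["INFO: request served"]] * 86
    )
    result = evaluate_windows(windows)
    print(result["oom_kills"], f"{result['error_rate']:.1%}", result["critical"])
```

The synthetic input is shaped to match the figures in the example finding: 4 OOM kills and 20 error windows out of 110, or about 18.2%.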

Example finding

CRITICAL Service Health Degradation

4 OOM kills detected with 18.2% error rate across 110 log windows. Service is critically unstable with cascading failures across tasks.

OOMKilled: task/api-cluster/task-7a8b stopped (exit 137)

ERROR: connection reset by peer (downstream: inventory-svc)

Service api-service: runningCount=1, desiredCount=3

WARN: ALB target group has 2 unhealthy targets

How to fix

Immediately increase the memory allocation. This is the highest-priority action. Register a new task definition revision with at least double the current memory limit, then deploy it: aws ecs update-service --cluster my-cluster --service my-service --task-definition my-task:new-revision --force-new-deployment. The command is the same for the Fargate and EC2 launch types; on Fargate, the new memory value must be one of the supported CPU and memory combinations. The rolling deployment will gradually replace OOM-prone tasks with tasks that have more headroom. Do not wait for root cause analysis. Stop the bleeding first.

Scale out to more tasks immediately. While the new task definition deploys, increase the desired count to spread load: aws ecs update-service --cluster my-cluster --service my-service --desired-count 6. More tasks means each individual task handles less traffic and is less likely to hit its memory ceiling. This is a temporary measure; you can scale back down once the memory issue is resolved.

Check for recent deployments. Look at the ECS service events tab for the most recent deployment. If the OOM kills started after a deployment, the new code or its dependencies likely introduced a memory regression. Roll back by updating the service to use the previous task definition revision: aws ecs update-service --cluster my-cluster --service my-service --task-definition my-task:previous-revision. If you have the deployment circuit breaker enabled with rollback, ECS may have already initiated an automatic rollback.

Investigate whether the error rate is caused by OOM kills or is independent. If the errors are all "connection reset" or "upstream unavailable" errors that correlate with OOM kill timestamps, the errors are a symptom of the OOM kills, and fixing the memory issue will resolve both. But if the errors are application-level (database errors, validation failures, authentication issues), they are a separate problem that needs its own investigation. Check the High Container Error Rate pattern for guidance on diagnosing standalone error rate issues.
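One way to run this triage is to compare error timestamps against OOM kill timestamps. The sketch below is illustrative only: the 60-second correlation window, the marker strings, and the function name are assumptions, and you would need to parse the timestamps and messages out of your own logs first.

```python
from datetime import datetime, timedelta

# Assumed markers for errors that usually come from task churn.
SYMPTOM_MARKERS = ("connection reset", "upstream unavailable")


def classify_errors(errors, oom_times, window_seconds=60):
    """Split errors into likely OOM symptoms and independent problems.

    errors:    list of (datetime, message) tuples
    oom_times: list of datetimes at which OOM kills occurred
    """
    window = timedelta(seconds=window_seconds)
    symptoms, independent = [], []
    for timestamp, message in errors:
        near_oom = any(abs(timestamp - oom) <= window for oom in oom_times)
        looks_like_churn = any(m in message.lower() for m in SYMPTOM_MARKERS)
        if near_oom and looks_like_churn:
            symptoms.append((timestamp, message))
        else:
            independent.append((timestamp, message))
    return symptoms, independent


if __name__ == "__main__":
    kill = datetime(2024, 1, 1, 12, 0, 0)  # hypothetical OOM kill time
    errors = [
        (kill + timedelta(seconds=5), "ERROR: connection reset by peer"),
        (kill + timedelta(seconds=10), "ERROR: database timeout"),
        (kill + timedelta(minutes=10), "ERROR: connection reset by peer"),
    ]
    symptoms, independent = classify_errors(errors, [kill])
    print(len(symptoms), "likely symptoms;", len(independent), "to investigate")
```

If nearly everything lands in the first bucket, fixing memory should clear the errors too. A large second bucket suggests a separate problem.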

Set up proactive monitoring to prevent recurrence. Create CloudWatch alarms on both MemoryUtilization (published per service in the AWS/ECS namespace; Container Insights adds task-level MemoryUtilized and MemoryReserved if you need finer granularity) and a metric filter for error-level log messages. Alert at 80% memory utilization, well before the OOM kill threshold, and at 5% error rate. Use SNS to route these alarms to your on-call rotation. Consider adding a CloudWatch composite alarm that triggers only when both conditions are met, which directly mirrors this detection pattern.
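The composite alarm's logic can be sketched as below. The alarm names are hypothetical placeholders, and the helper functions are illustrative rather than part of any AWS SDK; only the ALARM("name") AND ALARM("name") rule expression follows CloudWatch's composite alarm rule syntax.

```python
MEMORY_THRESHOLD_PCT = 80.0  # alert well before the OOM kill threshold
ERROR_RATE_THRESHOLD_PCT = 5.0


def composite_alarm_rule(memory_alarm: str, error_alarm: str) -> str:
    """Build a rule expression that fires only when both child alarms fire."""
    return f'ALARM("{memory_alarm}") AND ALARM("{error_alarm}")'


def both_conditions_met(memory_pct: float, error_rate_pct: float) -> bool:
    """Local mirror of the composite alarm, handy for dry runs."""
    return (
        memory_pct >= MEMORY_THRESHOLD_PCT
        and error_rate_pct >= ERROR_RATE_THRESHOLD_PCT
    )


if __name__ == "__main__":
    # Hypothetical alarm names; substitute the alarms you created.
    rule = composite_alarm_rule("api-memory-high", "api-error-rate-high")
    print(rule)
    print(both_conditions_met(memory_pct=86.0, error_rate_pct=7.5))
    # The rule string would be passed as AlarmRule when creating the
    # composite alarm, for example via boto3's put_composite_alarm
    # (not executed here, since it requires AWS credentials).
```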

Related patterns

Container OOM Kills Detected

Out-of-memory terminations with exit code 137

High Container Error Rate

Percentage of log windows containing container errors
