Container OOM Kills Detected

What this means

When an ECS container exceeds its allocated memory limit, the Linux kernel's OOM killer terminates the process with signal 9 (SIGKILL), which surfaces as exit code 137 (128 + 9) in your task logs. This is not a graceful shutdown: the container receives no warning and has no opportunity to flush buffers, close connections, or complete in-flight requests. Any work in progress at the moment of the kill is lost.
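The relationship between signal 9 and exit code 137 follows the common shell convention of reporting 128 plus the signal number. A minimal sketch of decoding an exit code that way is shown here; the function name describe_exit_code is hypothetical, and signal names are resolved by Python's signal module on the machine running the script (assumed to be Linux).

```python
import signal


def describe_exit_code(exit_code: int) -> str:
    """Explain a container exit code using the 128 + signal-number convention."""
    if exit_code == 0:
        return "exited normally"
    if exit_code > 128:
        signum = exit_code - 128
        try:
            # Resolve the signal number to a name such as SIGKILL or SIGTERM.
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with application error code {exit_code}"


if __name__ == "__main__":
    for code in (0, 1, 137, 143):
        print(code, "->", describe_exit_code(code))
```

Exit code 137 only shows that the process received SIGKILL. Treat it as an OOM kill when it appears alongside an "OOMKilled" or out-of-memory message, since other senders of SIGKILL produce the same code.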

On Fargate, memory limits are defined at the task level. If your task definition allocates 512 MB and the combined memory usage of the task's containers crosses that boundary, the OOM kill is immediate. On the EC2 launch type, the container-level memoryReservation (soft limit) and memory (hard limit) interact with the host's available memory. A container can burst past its soft limit if the host has headroom, but the hard limit is enforced by cgroups and triggers the OOM kill.
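As an illustration of the soft and hard limits on the EC2 launch type, the sketch below builds the memory portion of a container definition. The helper name container_memory_settings and the example values are hypothetical. The check that the hard limit exceeds the soft limit reflects the ECS rule that memory must be greater than memoryReservation when both are set.

```python
def container_memory_settings(soft_limit_mib: int, hard_limit_mib: int) -> dict:
    """Return the memory fields of an ECS container definition (values in MiB)."""
    if soft_limit_mib <= 0 or hard_limit_mib <= 0:
        raise ValueError("memory values must be positive")
    if hard_limit_mib <= soft_limit_mib:
        # When both are set, the hard limit has to sit above the soft limit.
        raise ValueError("memory (hard limit) must exceed memoryReservation (soft limit)")
    return {
        "memoryReservation": soft_limit_mib,  # soft limit: reserved on the host
        "memory": hard_limit_mib,             # hard limit: enforced by cgroups
    }


if __name__ == "__main__":
    # Example values only: reserve 256 MiB, allow bursts up to 512 MiB.
    print(container_memory_settings(256, 512))
```

With these example values the container is scheduled against 256 MiB of host memory but is only killed if it crosses 512 MiB.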

OOM kills are one of the most common causes of task instability in ECS. They often appear as intermittent failures — your service works fine under normal load but dies under traffic spikes or when processing large payloads. Memory leaks in long-running services cause the same symptom, except the kills happen on a predictable schedule that correlates with uptime rather than load.

The distinction between a one-off OOM kill and repeated kills is significant. A single kill might mean an unusually large request triggered an allocation spike. Three or more kills in a short window almost always means the memory limit is structurally too low for the workload, or there is a genuine memory leak that needs profiling.

Detection criteria

CRITICAL

More than 3 OOM kill events detected across log windows (exit code 137 or "OutOfMemoryError" / "OOMKilled" messages).

HIGH

At least 1 OOM kill event detected. Even a single kill warrants investigation since it indicates the container is operating near its memory ceiling.
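The two thresholds above can be sketched as a small classifier. This is an illustrative approximation, not the detector's actual implementation: the regular expression, the function names, and the rule of counting at most one event per log line are all assumptions.

```python
import re
from typing import Iterable, Optional

# Markers named in the detection criteria above.
OOM_PATTERN = re.compile(r"exit code 137|OutOfMemoryError|OOMKilled", re.IGNORECASE)


def count_oom_events(log_lines: Iterable[str]) -> int:
    """Count log lines that contain at least one OOM marker."""
    return sum(1 for line in log_lines if OOM_PATTERN.search(line))


def classify_severity(oom_events: int) -> Optional[str]:
    """Map an OOM event count to a severity, or None when nothing was found."""
    if oom_events > 3:
        return "CRITICAL"
    if oom_events >= 1:
        return "HIGH"
    return None


if __name__ == "__main__":
    logs = ["Container killed: exit code 137 (OOMKilled)"] * 5 + ["GET /health 200"]
    events = count_oom_events(logs)
    print(events, classify_severity(events))
```

A line that contains both "exit code 137" and "OOMKilled" is counted once here, which avoids double-counting a single kill.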

Example finding

CRITICAL Container OOM Kills Detected

5 OOM kill events detected across 12 log windows. Containers terminated with exit code 137 indicating kernel OOM killer intervention.

Container killed: exit code 137 (OOMKilled)

Task arn:aws:ecs:us-east-1:123456:task/my-cluster/abc123

Memory limit: 512 MB | Peak usage: 512 MB

How to fix

Increase the memory allocation in your task definition. Open the ECS console or your infrastructure-as-code template and raise the memory value. On Fargate, valid memory/CPU combinations are constrained — for example, 1 vCPU supports 2 GB through 8 GB. A good starting point is to double the current limit, deploy, and monitor whether peak usage stays below 75% of the new ceiling. If you are on EC2 launch type, also confirm the container instance has enough free memory to honor the new hard limit.
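The doubling and 75% headroom guidance can be written down as a quick check. This is a rough sketch with hypothetical function names. The only Fargate constraint encoded is the 1 vCPU range of 2 GB through 8 GB quoted above; it does not model the allowed increments or any other CPU size.

```python
# From the text above: 1 vCPU supports 2 GB through 8 GB (expressed in MiB).
FARGATE_1_VCPU_RANGE_MIB = (2048, 8192)


def doubled_limit(current_limit_mib: int) -> int:
    """Starting point suggested above: double the current memory limit."""
    return current_limit_mib * 2


def has_headroom(peak_usage_mib: float, limit_mib: int, target: float = 0.75) -> bool:
    """True when peak usage stays below the target fraction of the limit."""
    return peak_usage_mib < limit_mib * target


def fits_one_vcpu(limit_mib: int) -> bool:
    """True when the limit lies inside the 1 vCPU memory range quoted above."""
    low, high = FARGATE_1_VCPU_RANGE_MIB
    return low <= limit_mib <= high


if __name__ == "__main__":
    new_limit = doubled_limit(512)  # 1024 MiB
    print(new_limit, has_headroom(700, new_limit), fits_one_vcpu(new_limit))
```

In this example, doubling 512 MiB gives 1024 MiB, and a 700 MiB peak stays under the 768 MiB headroom line. 1024 MiB falls outside the 1 vCPU range, so a task at that CPU size would need at least 2 GB.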

Profile your application's memory usage. Connect to a running container with aws ecs execute-command (ECS Exec must be enabled on the task) and inspect resident memory with top or cat /sys/fs/cgroup/memory/memory.usage_in_bytes. On hosts that use cgroup v2, the equivalent file is /sys/fs/cgroup/memory.current. For JVM applications, review heap settings (-Xmx) and metaspace. For Node.js, check the --max-old-space-size flag. For Go services, look at goroutine counts and buffer pool sizes. The goal is to understand whether the usage pattern is a leak (grows over time) or a spike (grows under load).
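One way to tell a leak from a spike is to look at the trend of memory samples collected over time. The sketch below fits a least-squares slope and checks whether usage rises almost monotonically. The function names and both thresholds (5 MiB per hour, 80% of steps rising) are hypothetical and should be tuned to your workload.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (hours since container start, memory in MiB)


def slope_mib_per_hour(samples: List[Sample]) -> float:
    """Least-squares slope of memory usage over time."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one point in time")
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    return cov / var_t


def classify_usage(samples: List[Sample], min_slope: float = 5.0) -> str:
    """Label a series as leak-like (steady growth) or spike-like/flat."""
    values = [m for _, m in samples]
    rising_steps = sum(1 for a, b in zip(values, values[1:]) if b >= a)
    mostly_rising = rising_steps / (len(values) - 1) >= 0.8
    if slope_mib_per_hour(samples) >= min_slope and mostly_rising:
        return "leak-like"
    return "spike-like or flat"


if __name__ == "__main__":
    steady_growth = [(0, 200), (1, 230), (2, 260), (3, 290), (4, 320)]
    one_burst = [(0, 200), (1, 205), (2, 480), (3, 210), (4, 200)]
    print(classify_usage(steady_growth), "|", classify_usage(one_burst))
```

A leak-like result points toward heap profiling, while a spike-like result points toward request size or concurrency under load.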

Enable Container Insights. CloudWatch Container Insights gives you per-task memory utilization metrics at one-minute granularity. This lets you see the ramp-up pattern before the kill happens, which is essential for right-sizing. Enable it at the cluster level with aws ecs update-cluster-settings --cluster my-cluster --settings name=containerInsights,value=enabled. Create a CloudWatch alarm on the MemoryUtilization metric that fires at 85%, so you are alerted before the container is killed at 100%.
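The 85% alarm could be defined along these lines. This sketch only builds the keyword arguments for the CloudWatch PutMetricAlarm call, so it runs without AWS credentials. It assumes the service-level MemoryUtilization metric in the AWS/ECS namespace; the alarm name, the service name my-service, the Maximum statistic, and the three-period evaluation window are illustrative choices, not values from the text.

```python
def memory_alarm_params(cluster: str, service: str, threshold_percent: float = 85.0) -> dict:
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": f"{cluster}-{service}-memory-high",  # hypothetical naming scheme
        "Namespace": "AWS/ECS",                           # assumed metric namespace
        "MetricName": "MemoryUtilization",
        "Dimensions": [
            {"Name": "ClusterName", "Value": cluster},
            {"Name": "ServiceName", "Value": service},
        ],
        "Statistic": "Maximum",
        "Period": 60,              # seconds, matching one-minute granularity
        "EvaluationPeriods": 3,    # assumed: three consecutive breaches
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }


if __name__ == "__main__":
    params = memory_alarm_params("my-cluster", "my-service")
    print(params)
    # With boto3 installed and credentials configured, the alarm could be created with:
    #   import boto3
    #   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Adding an AlarmActions entry with an SNS topic ARN would route the alert to a notification channel.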

Separate memory-intensive work into sidecar containers. If your main application container handles both request serving and background processing (image resizing, report generation, data transformation), move the heavy work into a sidecar with its own memory allocation. This isolates failures: if the sidecar is marked non-essential (essential set to false in its container definition), it can OOM without taking down the request-serving container.

Use swap on EC2 launch type. ECS on EC2 supports the maxSwap and swappiness container definition parameters. While swap is not a fix for a genuine leak, it can absorb transient spikes that would otherwise trigger OOM kills. This option is not available on Fargate.
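In a container definition, maxSwap and swappiness sit under linuxParameters. A minimal sketch of that fragment is shown below, with a hypothetical helper name and example values. It assumes maxSwap is expressed in MiB and swappiness ranges from 0 to 100, and it presumes swap space has already been enabled on the container instance.

```python
def swap_settings(max_swap_mib: int, swappiness: int = 60) -> dict:
    """Return the linuxParameters fragment that enables swap for a container."""
    if max_swap_mib < 0:
        raise ValueError("maxSwap must be zero or a positive number of MiB")
    if not 0 <= swappiness <= 100:
        raise ValueError("swappiness must be between 0 and 100")
    return {
        "linuxParameters": {
            "maxSwap": max_swap_mib,   # total swap the container may use, in MiB
            "swappiness": swappiness,  # 0 avoids swapping; 100 swaps aggressively
        }
    }


if __name__ == "__main__":
    # Example values only: allow up to 512 MiB of swap with moderate swappiness.
    print(swap_settings(512, 30))
```

Merging this fragment into the container definition alongside the memory settings lets short spikes spill into swap instead of triggering an immediate kill.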

Related patterns

Container Restart Events: Non-zero exit codes indicating crash restarts

Service Health Degradation: Combined OOM kills and high error rate
