ECS Container OOM Kills

How to diagnose exit code 137 and out-of-memory container terminations from your CloudWatch logs.

TL;DR

An OOM kill happens when your ECS container exceeds its hard memory limit and the Linux kernel terminates it with SIGKILL (exit code 137). The container gets no graceful shutdown. Check your task definition memory limit against actual usage with Container Insights, then either increase the limit or fix the underlying memory problem. For Java, set -XX:MaxRAMPercentage=75. For Node.js, set --max-old-space-size. Always size your hard limit to at least 1.5x your observed peak usage.

What is an ECS OOM kill?

Every ECS task definition includes a memory limit for each container. This limit tells the Linux kernel exactly how much RAM the container is allowed to consume. When a container's resident memory usage reaches that ceiling, the kernel's out-of-memory (OOM) killer steps in and terminates the process immediately with a SIGKILL signal. There is no warning, no graceful shutdown hook, and no chance to flush buffers or close connections. The process simply stops.

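ECS surfaces this event as exit code 137. The number comes from the Unix convention of encoding signal-killed processes as 128 plus the signal number. SIGKILL is signal 9, so 128 + 9 = 137. Exit code 137 in your ECS task events therefore always means the container was forcefully killed with SIGKILL rather than shutting down on its own. The OOM killer is the most common cause, but ECS also sends SIGKILL when a container ignores SIGTERM past its stop timeout, so confirm the cause with the stopped reason before assuming a memory problem.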
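
The 128-plus-signal arithmetic can be sketched as a small decoder. This is a minimal illustration, not part of any ECS tooling: the function name describeExitCode and the short signal table are assumptions made for the example, and the table only lists a few Linux signal numbers.

```javascript
// Linux signal numbers for a few signals commonly seen in container exits.
const SIGNAL_NAMES = { 6: 'SIGABRT', 9: 'SIGKILL', 11: 'SIGSEGV', 15: 'SIGTERM' };

// Decode a container exit code. Codes from 129 to 255 follow the
// "128 + signal number" convention; anything else is a plain exit status.
function describeExitCode(code) {
  if (code > 128 && code < 256) {
    const signal = code - 128;
    return {
      killedBySignal: true,
      signal,
      name: SIGNAL_NAMES[signal] || `signal ${signal}`,
    };
  }
  return { killedBySignal: false, signal: null, name: null };
}

console.log(describeExitCode(137));
console.log(describeExitCode(1));
```

Running this against the exit codes in your stopped-task events makes it easy to separate signal kills (137, 134) from ordinary non-zero exits.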

If the container is part of an ECS service, the service scheduler will automatically restart the task to maintain the desired count. But if the OOM condition persists, you end up in a restart loop: the container starts, consumes memory, gets killed, restarts, and repeats. This pattern destabilizes deployments, triggers alarms, and can cascade into broader outages if dependent services start timing out on the restarting containers.

This is fundamentally different from how Lambda handles memory. Lambda allocates memory at the function level and the runtime tracks it, producing a clear Max Memory Used field in every REPORT line. ECS, by contrast, delegates enforcement to the Linux kernel's cgroup memory controller. The container is a regular Linux process group with a hard memory ceiling enforced by cgroups, and the OOM killer is the enforcement mechanism. This means the failure mode is harsher: no structured error message from ECS itself, just a dead container and an exit code.

ECS task definitions also support a memoryReservation (soft limit) separate from the hard memory limit. The soft limit is used for scheduling and bin-packing tasks onto EC2 instances but does not trigger OOM kills. Only the hard limit triggers the kernel OOM killer. If you only set memoryReservation without a hard limit on EC2 launch type, the container can technically consume all available host memory, potentially starving other containers on the same instance.

Identifying OOM kills in CloudWatch logs

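OOM kills leave traces at multiple levels. The most direct signal comes from ECS task state change events. ECS publishes these through EventBridge, and they show up in CloudWatch Logs once they are routed to a log group, for example by an EventBridge rule or the Container Insights lifecycle events option. Look for the stopped reason message: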

Essential container in task exited
  "stoppedReason": "OutOfMemoryError: Container killed due to memory usage"
  "containers": [{
    "exitCode": 137,
    "reason": "OutOfMemoryError: Container killed due to memory usage"
  }]

Stopped reasons such as CannotPullContainerError or ResourceInitializationError indicate image pull or task startup failures, not memory exhaustion. On both Fargate and EC2, exit code 137 together with the OutOfMemoryError reason is the definitive OOM pattern.
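
As a rough sketch, the check described above can be automated against a task state change event. The function name findOomKilledContainers and the sample event are hypothetical, and the code assumes the event carries a detail.containers array with name, exitCode, and reason fields, matching the excerpt shown earlier.

```javascript
// Return the names of containers in an ECS task state change event that
// look OOM-killed: exit code 137 plus an OutOfMemory reason string.
function findOomKilledContainers(event) {
  const containers = (event.detail && event.detail.containers) || [];
  return containers
    .filter((c) => c.exitCode === 137 && /OutOfMemory/i.test(c.reason || ''))
    .map((c) => c.name);
}

// Hypothetical, trimmed-down event used for illustration only.
const sampleEvent = {
  'detail-type': 'ECS Task State Change',
  detail: {
    lastStatus: 'STOPPED',
    containers: [
      {
        name: 'app',
        exitCode: 137,
        reason: 'OutOfMemoryError: Container killed due to memory usage',
      },
      { name: 'log-router', exitCode: 0 },
    ],
  },
};

console.log(findOomKilledContainers(sampleEvent));
```

Requiring both the exit code and the reason avoids counting containers that were SIGKILLed for other reasons, such as a stop timeout.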

Before the kernel kills the container, your application may emit its own out-of-memory errors in the application logs. These are often the most useful for diagnosis because they tell you what the application was doing when memory ran out. Common application-level OOM patterns include:

# Java
java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: GC overhead limit exceeded
java.lang.OutOfMemoryError: Metaspace

# Node.js
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory

# Python
MemoryError
MemoryError: Unable to allocate 2.50 GiB for an array with shape (335544320,)

# Go
runtime: out of memory
fatal error: runtime: out of memory

These application-level errors typically appear a few seconds before the container actually dies. Sometimes the application logs them and continues running briefly in a degraded state before the kernel delivers the final SIGKILL. In other cases, particularly with Go and Node.js, the runtime aborts immediately upon hitting its own internal memory limit, and the exit code may be 1, 2 (a Go fatal error), or 134 (128 + 6, SIGABRT) rather than 137. The distinction matters: exit code 137 means the process was killed from outside with SIGKILL, while other exit codes mean the runtime gave up on its own before the kernel had to.
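
The signatures listed above can be matched with a few regular expressions. This is a minimal sketch: the names OOM_SIGNATURES and classifyOomLine are made up for the example, and the patterns cover only the messages quoted in this guide.

```javascript
// One pattern per runtime, built from the log messages quoted above.
const OOM_SIGNATURES = [
  { runtime: 'java', pattern: /java\.lang\.OutOfMemoryError/ },
  { runtime: 'nodejs', pattern: /JavaScript heap out of memory/ },
  { runtime: 'python', pattern: /\bMemoryError\b/ },
  { runtime: 'go', pattern: /runtime: out of memory/ },
];

// Return the runtime whose OOM signature matches the log line, or null.
function classifyOomLine(line) {
  const hit = OOM_SIGNATURES.find((s) => s.pattern.test(line));
  return hit ? hit.runtime : null;
}

console.log(classifyOomLine('java.lang.OutOfMemoryError: Java heap space'));
console.log(classifyOomLine('GET /health 200 3ms'));
```

Counting matches per runtime over a log window gives a quick view of whether the application noticed the memory problem before the kernel did.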

If you have Container Insights enabled, you can also watch the MemoryUtilized and MemoryReserved metrics. A container that steadily climbs toward 100% memory utilization and then suddenly disappears from the metrics is a strong indicator of an OOM kill, even if you missed the task state event.

Common causes of ECS OOM kills

Memory limit set too low in the task definition

The simplest cause: the application legitimately needs more memory than you allocated. This is common after adding new features, upgrading dependencies, or moving a workload from a large EC2 instance to a container without profiling its actual memory footprint. A task definition allocating 512 MB to a Spring Boot application that routinely needs 800 MB will OOM every time.

Memory leaks

The container starts fine and runs for hours or days, then suddenly dies. This is the classic memory leak pattern. Common culprits include unclosed database connections that accumulate over time, in-memory caches that grow without eviction, event listener registrations that are never removed, and HTTP client connection pools that are created per-request instead of shared. The container works fine under testing because tests don't run long enough for the leak to matter.

Java heap not sized to the container limit

This is one of the most common ECS OOM causes for Java workloads. Older JVMs (before Java 10, or Java 8 before update 191) size the default maximum heap from the physical host memory, not the container's cgroup limit. On such a JVM running in a container with 1 GB allocated on a 16 GB host, the JVM might try to grow the heap to 4 GB, a quarter of physical RAM, because it sees 16 GB. Newer JVMs respect cgroup limits, but their default maximum heap is only about 25% of the container memory, which may not be enough. The JVM also needs memory outside the heap for metaspace, thread stacks, native libraries, and direct byte buffers, so setting the heap to 100% of the container limit will also OOM.

Large file processing loading entire files into memory

A container that downloads a file from S3 and reads it entirely into memory before processing will OOM on any file larger than the available headroom. This is especially common with CSV parsing, JSON deserialization of large arrays, and image processing. If the file size is variable and occasionally spikes, the container works most of the time and only OOMs on the largest inputs.

Spike in concurrent requests

Each incoming request typically allocates memory for parsing the request body, creating response objects, holding intermediate computation state, and buffering the response. Under normal traffic, the peak concurrent request count might be 50, using 400 MB total. A sudden traffic spike to 500 concurrent requests could push memory to 4 GB, far beyond the container limit. This is different from a leak because memory would return to normal if the spike subsided, but the kernel kills the container before that can happen.
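
The arithmetic in this example can be worked through in a few lines. The sketch below assumes memory grows linearly with concurrency, which real workloads only approximate, and the function names are invented for illustration.

```javascript
// Average memory attributable to one in-flight request.
function perRequestMiB(totalMiB, concurrentRequests) {
  return totalMiB / concurrentRequests;
}

// Projected memory at a given concurrency, on top of an optional idle baseline.
function estimateRequestMiB(perRequest, concurrentRequests, baselineMiB = 0) {
  return baselineMiB + perRequest * concurrentRequests;
}

// Highest concurrency that still fits under the hard limit.
function maxConcurrency(limitMiB, perRequest, baselineMiB = 0) {
  return Math.floor((limitMiB - baselineMiB) / perRequest);
}

const perRequest = perRequestMiB(400, 50);          // 8 MiB per request
console.log(estimateRequestMiB(perRequest, 500));   // 4000 MiB, about 4 GB
console.log(maxConcurrency(2048, perRequest));      // 256 requests fit in 2048 MiB
```

The last figure is the useful one for capacity planning: it tells you at what concurrency you need to scale out or cap in-flight requests.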

Sidecar containers eating into shared memory

On Fargate, the task-level memory is shared across all containers in the task. If you have an application container and sidecar containers (Datadog agent, Envoy proxy, log router), each sidecar consumes part of the task memory budget. A Datadog agent typically uses 256-512 MB. An Envoy sidecar can use 100-300 MB. If you sized the task memory for the application alone without accounting for sidecars, the application container gets OOM killed even though it's working within its expected memory range.
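
A quick budget calculation shows how much of the task memory is actually left for the application. This is a sketch only: the function name appMemoryBudget, the 2048 MiB task size, and the sidecar figures (the upper ends of the ranges above) are illustrative assumptions, not measurements.

```javascript
// Subtract expected sidecar usage from the task-level memory to find
// what remains for the application container.
function appMemoryBudget(taskMiB, sidecars) {
  const sidecarTotalMiB = sidecars.reduce((sum, s) => sum + s.miB, 0);
  return { sidecarTotalMiB, appBudgetMiB: taskMiB - sidecarTotalMiB };
}

const budget = appMemoryBudget(2048, [
  { name: 'datadog-agent', miB: 512 },
  { name: 'envoy', miB: 300 },
]);

console.log(budget);
```

If the application's observed peak is above the remaining budget, the task size needs to grow even though the application itself has not changed.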

How smplogs detects OOM kills

When you upload ECS CloudWatch logs to smplogs, the WASM analysis engine scans for OOM patterns across both task state events and application-level errors. It counts occurrences of exit code 137, matches the common application OOM signatures for Java, Node.js, Python, and Go, and checks for restart frequency that indicates an OOM restart loop. The engine produces a severity-ranked finding based on the count and frequency of OOM events:

CRITICAL OOM Kills Detected

5 containers were terminated due to out-of-memory conditions (exit code 137).

-> Increase container memory limits or investigate memory leaks. Review application heap settings.

The severity level scales with the number of OOM events found in the log file. A single OOM event in a multi-hour log produces a HIGH finding. Repeated OOM events, especially more than three within a short window, escalate to CRITICAL because they indicate the container is in a restart loop and likely affecting availability.

smplogs also detects the secondary effects of OOM kills. When containers repeatedly restart due to OOM, the engine identifies the pattern as a "Container Restart Storm" and links it to the underlying OOM cause in the root cause analysis:

HIGH Container Restart Storm

12 container restarts detected in a 30-minute window. Most restarts correlated with OOM exit codes.

-> Investigate memory allocation. Consider increasing task memory or reducing per-request memory consumption.

The root cause analysis section connects the dots: it shows the OOM kill count, the restart frequency, and the memory-related application errors together, giving you a single view of the problem instead of having to piece together scattered log entries.

Step-by-step fixes

Fix 1: Increase memory in the task definition

Start by understanding how much memory the container actually uses. Check Container Insights for the MemoryUtilized metric over the last 7 days and find the peak. If the peak is close to the hard limit, the limit is too low. Increase the hard memory limit to at least 1.5x the observed peak.

# Task definition - increase memory
{
  "containerDefinitions": [{
    "name": "app",
    "memory": 2048,           // hard limit (MiB) - OOM kill threshold
    "memoryReservation": 1024,  // soft limit (MiB) - used for scheduling
    ...
  }]
}

On Fargate, set the task-level memory accounting for all containers including sidecars. On EC2 launch type, you can set per-container hard limits independently.

Fix 2: Set JVM heap flags for Java applications

Do not rely on JVM defaults in containers. Explicitly configure the heap as a percentage of container memory, leaving room for non-heap allocations:

# Dockerfile or JAVA_OPTS environment variable
ENV JAVA_OPTS="-XX:+UseContainerSupport \
  -XX:MaxRAMPercentage=75.0 \
  -XX:InitialRAMPercentage=50.0 \
  -XX:+ExitOnOutOfMemoryError"

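MaxRAMPercentage=75 lets the heap grow to 75% of the container memory and leaves 25% for metaspace, thread stacks, native memory, and the OS. UseContainerSupport is already on by default in Java 10 and later, so listing it only makes the intent explicit. The +ExitOnOutOfMemoryError flag tells the JVM to exit immediately on heap exhaustion rather than limping along in a broken state, which gives ECS a clean restart signal. Note that the JVM does not read JAVA_OPTS on its own: your entrypoint has to pass it to the java command, or you can use JAVA_TOOL_OPTIONS, which the JVM picks up automatically.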
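
To see what a given percentage means in megabytes, the split can be approximated as follows. The helper jvmMemorySplit is hypothetical and only mirrors the percentage arithmetic; the JVM's own calculation may round differently.

```javascript
// Approximate heap vs. non-heap budget for a container of the given size.
function jvmMemorySplit(containerMiB, maxRamPercentage = 75) {
  const heapMiB = Math.floor((containerMiB * maxRamPercentage) / 100);
  return { heapMiB, nonHeapMiB: containerMiB - heapMiB };
}

console.log(jvmMemorySplit(1024));      // with MaxRAMPercentage=75
console.log(jvmMemorySplit(1024, 25));  // with the 25% default on container-aware JVMs
```

For a 1024 MiB container this gives a 768 MiB heap at 75%, compared with 256 MiB at the default, which is why the default often feels too small.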

Fix 3: Set --max-old-space-size for Node.js

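The default V8 heap limit depends on the Node.js version: older releases cap old-space at roughly 1.5 GB on 64-bit systems regardless of the container size, and newer releases derive the limit from the memory they detect, which may not match the container limit. If your container has 512 MB allocated and the heap limit is 1.5 GB, V8 will let the heap grow well past 512 MB before collecting aggressively, and the container gets OOM killed first. Set the limit explicitly: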

# Set to ~75% of container memory (in MB)
# For a 1024 MB container:
ENV NODE_OPTIONS="--max-old-space-size=768"

Leave headroom for V8's young generation, native addons, and the OS. A good rule of thumb is 75% of the container memory for old-space, which gives you 25% buffer for everything else.
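
The 75% rule of thumb, plus a way to confirm what limit the running process actually received, might look like this. The helper names are hypothetical; v8.getHeapStatistics() is the built-in Node.js API for reading the effective heap limit.

```javascript
const v8 = require('v8');

// Build the flag for a container of the given size, using a fraction
// of the container memory for old-space.
function oldSpaceFlag(containerMiB, fraction = 0.75) {
  return `--max-old-space-size=${Math.floor(containerMiB * fraction)}`;
}

// Report the heap limit V8 is actually enforcing in this process, in MiB.
function effectiveHeapLimitMiB() {
  return Math.round(v8.getHeapStatistics().heap_size_limit / 1048576);
}

console.log(oldSpaceFlag(1024));
console.log(`effective heap limit: ${effectiveHeapLimitMiB()} MiB`);
```

Logging the effective limit once at startup is a cheap way to catch a NODE_OPTIONS value that never reached the process.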

Fix 4: Add memory profiling to find leaks

If the container runs for hours before OOM, you likely have a memory leak. Instrument the application to expose memory metrics and take periodic heap snapshots:

// Node.js - log memory usage every 60 seconds
setInterval(() => {
  const mem = process.memoryUsage();
  console.log(JSON.stringify({
    rss_mb: Math.round(mem.rss / 1048576),
    heap_used_mb: Math.round(mem.heapUsed / 1048576),
    heap_total_mb: Math.round(mem.heapTotal / 1048576),
    external_mb: Math.round(mem.external / 1048576)
  }));
}, 60000);

For Java, enable native memory tracking with -XX:NativeMemoryTracking=summary and periodically dump the summary to logs. For production leak hunting, use async-profiler or Eclipse MAT on a heap dump taken before the OOM occurs.
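
Once memory samples are in the logs, a simple trend estimate helps separate a leak from normal fluctuation. The sketch below fits a least-squares line through the samples; the function name and the sample shape (tSeconds, rssMiB) are assumptions for the example, and the data is synthetic.

```javascript
// Least-squares slope of RSS over time, expressed in MiB per hour.
// samples: array of { tSeconds, rssMiB }.
function rssGrowthMiBPerHour(samples) {
  const n = samples.length;
  if (n < 2) return 0;
  const meanT = samples.reduce((sum, p) => sum + p.tSeconds, 0) / n;
  const meanM = samples.reduce((sum, p) => sum + p.rssMiB, 0) / n;
  let numerator = 0;
  let denominator = 0;
  for (const p of samples) {
    numerator += (p.tSeconds - meanT) * (p.rssMiB - meanM);
    denominator += (p.tSeconds - meanT) ** 2;
  }
  return denominator === 0 ? 0 : (numerator / denominator) * 3600;
}

// Synthetic samples: one per minute, growing 0.5 MiB each minute.
const leakSamples = [0, 1, 2, 3, 4].map((i) => ({
  tSeconds: i * 60,
  rssMiB: 300 + i * 0.5,
}));

console.log(rssGrowthMiBPerHour(leakSamples).toFixed(1), 'MiB/hour');
```

A slope that stays positive across several hours of steady traffic is a stronger leak signal than any single high reading.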

Fix 5: Implement streaming for large file processing

If your container processes files from S3 or other sources, never load the entire file into memory. Use streaming APIs instead:

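// Assumes s3 is an AWS SDK v3 aggregated client, where Body is a readable stream,
// and that streamToString and processRecord are helpers defined elsewhere.
const readline = require('readline');

// Bad - loads entire file into memory
const data = await s3.getObject(params);
const body = await streamToString(data.Body);
const records = JSON.parse(body);

// Good - processes line by line (assumes newline-delimited JSON, one record per line)
const stream = (await s3.getObject(params)).Body;
const rl = readline.createInterface({ input: stream, crlfDelay: Infinity });
for await (const line of rl) {
  if (line.trim() === '') continue; // skip blank lines
  processRecord(JSON.parse(line));
}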

The streaming approach keeps memory roughly constant regardless of file size, bounded by the longest single line. For JSON that is not newline-delimited, consider using a streaming JSON parser like jsonstream2 or stream-json instead of JSON.parse() on the full payload.

Fix 6: Add memory-based health checks

Catch memory pressure before the OOM killer does. Configure your ECS health check or ALB target group health check to monitor memory usage and return unhealthy when memory exceeds a threshold:

// Express health check endpoint
app.get('/health', (req, res) => {
  const memUsage = process.memoryUsage();
  const memLimitMB = parseInt(process.env.MEMORY_LIMIT_MB || '1024');
  const usedMB = memUsage.rss / 1048576;
  const memPercent = (usedMB / memLimitMB) * 100;

  if (memPercent > 85) {
    console.warn(`Memory pressure: ${usedMB.toFixed(0)}MB / ${memLimitMB}MB (${memPercent.toFixed(1)}%)`);
    return res.status(503).json({ status: 'unhealthy', reason: 'memory_pressure' });
  }
  res.json({ status: 'healthy', memory_percent: memPercent.toFixed(1) });
});

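When the health check fails, ECS marks the task unhealthy, stops it, and starts a replacement. Stopping sends SIGTERM first and SIGKILL only after the stop timeout, so this is far better than an OOM kill: the container gets time to finish in-flight requests and close connections cleanly. MEMORY_LIMIT_MB in the example is an environment variable you set yourself in the task definition to match the container's hard limit. Set the threshold at 85% of that limit to give enough warning before the kernel acts.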

Prevention best practices

Enable Container Insights for memory utilization metrics

Container Insights gives you per-container memory utilization as a CloudWatch metric. Set up a CloudWatch alarm at 80% memory utilization so you get notified before the container hits the limit. Without this, the first indication of a memory problem is an OOM kill in production.

Size memory to 1.5x observed peak usage

After running in production for at least a week, check the maximum MemoryUtilized value. Set the hard limit to 1.5x that peak. This gives enough headroom for traffic spikes, garbage collection surges, and gradual growth from new features. If you observe the container consistently using only 30% of its limit, you're over-provisioned and wasting money on Fargate.
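
The sizing rule reduces to two small calculations. The function names below are invented for this sketch, and the 30% over-provisioning threshold is taken from the paragraph above rather than from any AWS guidance.

```javascript
// Hard limit recommended for an observed peak, using the 1.5x headroom rule.
function recommendHardLimitMiB(peakMiB, headroom = 1.5) {
  return Math.ceil(peakMiB * headroom);
}

// True when the observed peak uses no more than the given share of the limit.
function isOverProvisioned(peakMiB, limitMiB, threshold = 0.3) {
  return peakMiB / limitMiB <= threshold;
}

console.log(recommendHardLimitMiB(800));    // 1200
console.log(isOverProvisioned(300, 1024));  // true
console.log(isOverProvisioned(800, 1024));  // false
```

On Fargate, round the result up to the next supported task memory size, since task memory can only be set to specific values for each CPU setting.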

Use memory reservation separately from the hard limit

On EC2 launch type, set memoryReservation (soft limit) to the typical steady-state memory usage and the memory (hard limit) to the burst ceiling. This lets the ECS scheduler pack tasks efficiently during normal operation while still allowing headroom for spikes. For example, a container that normally uses 512 MB but can spike to 1 GB during deployments should have a reservation of 512 MB and a hard limit of 1024 MB.

Implement graceful degradation under memory pressure

Design your application to shed load when memory is high. This can mean rejecting new requests with a 503 status code, reducing in-memory cache sizes, or stopping background processing until memory drops. The goal is to prevent the OOM kill by voluntarily reducing memory consumption instead of letting the kernel make the decision for you.
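
One way to shed load is a middleware that rejects new requests while memory is high. This is a minimal sketch under stated assumptions: createLoadShedder is a hypothetical helper, the 80% threshold and the Retry-After value are arbitrary choices, and the function uses only the (req, res, next) shape that Express and similar frameworks share.

```javascript
// Build middleware that answers 503 while resident memory is above a
// fraction of the container limit, and passes requests through otherwise.
// readRssBytes is injectable so the behavior can be tested without load.
function createLoadShedder({
  limitMiB,
  threshold = 0.8,
  readRssBytes = () => process.memoryUsage().rss,
}) {
  const limitBytes = limitMiB * 1048576;
  return function shedLoad(req, res, next) {
    if (readRssBytes() / limitBytes > threshold) {
      res.statusCode = 503;
      res.setHeader('Retry-After', '5');
      res.end('Service temporarily overloaded');
      return;
    }
    next();
  };
}

// Usage with Express (assumed to be set up elsewhere):
// app.use(createLoadShedder({ limitMiB: 1024 }));
```

Keeping the shedding threshold below the health check threshold lets the container recover under pressure before ECS decides to replace it.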

Run load tests to find the memory ceiling before production

Before deploying a new service or after significant code changes, run a load test that ramps up concurrent connections until the container OOMs. This tells you the actual breaking point. Record the peak memory at each concurrency level and use that data to set your hard limit with appropriate headroom. A container that OOMs at 200 concurrent requests should not be deployed to a service that might see 500 concurrent requests without either scaling horizontally or increasing the memory limit.

Debugging container crashes? Drop your ECS CloudWatch logs into smplogs to instantly identify OOM patterns, restart storms, and memory pressure trends.
