Container Timeout Errors

What this means

Timeout errors occur when your ECS container waits on a downstream dependency (a database query, an HTTP API call, a message queue operation) and the response does not arrive within the configured time limit. The container's application code gives up waiting and logs the timeout. Unlike connection errors, where the attempt fails immediately, a timeout means the other end stayed silent: either the connection attempt itself went unanswered (a connect timeout) or the connection was established but no response came back in time.

Timeouts are a subtler failure than connection refusals. A connection refusal happens in milliseconds and tells you clearly that nothing is listening. A timeout forces your container to hold resources (a thread, a connection pool slot, memory for the pending request) for the entire timeout duration before it can move on. In a Node.js application with a 30-second HTTP timeout, a single slow upstream holds a connection pool slot for 30 seconds. In a Java application with a thread-per-request model, each timeout holds a thread hostage, and enough concurrent timeouts can exhaust the thread pool, causing the entire service to stop responding.
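
To make the thread-pool arithmetic concrete, the sketch below estimates how many threads a fully stalled dependency would hold. The pool size and request rate are hypothetical figures, not measurements; only the 30-second timeout comes from the example above.

```python
def threads_demanded(requests_per_second: float, timeout_seconds: float) -> float:
    """Threads held at steady state if every call waits out the full timeout."""
    return requests_per_second * timeout_seconds


def seconds_until_exhausted(pool_size: int, requests_per_second: float) -> float:
    """Time for stalled calls to occupy every thread, assuming none return."""
    return pool_size / requests_per_second


# Hypothetical service: 200 worker threads, 10 requests/s to the slow dependency.
demand = threads_demanded(requests_per_second=10, timeout_seconds=30)
time_to_stall = seconds_until_exhausted(pool_size=200, requests_per_second=10)
print(f"threads demanded: {demand:.0f}, pool exhausted after {time_to_stall:.0f}s")
```

Under these assumed numbers the stalled calls would need 300 threads, so a 200-thread pool fills after 20 seconds, before the first timeout has even fired.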

Timeouts in ECS workloads generally stem from one of three sources. The downstream dependency may be genuinely slow — a database under heavy load, an external API with degraded performance, or an ElastiCache cluster that is running hot. The container itself may be resource-starved — if it is CPU-throttled (running on a shared EC2 instance or a Fargate task with insufficient CPU shares), the application may not be able to process responses fast enough, causing what looks like a timeout even though the downstream responded in time. Or network-level issues like packet loss or NAT gateway throttling can increase round-trip times past the timeout threshold.

In ECS services behind an Application Load Balancer, timeouts create a cascade. The ALB has its own idle timeout (default 60 seconds). If your container times out on a downstream call and takes too long to respond to the ALB, the ALB drops the connection and returns a 504 to the client. Meanwhile, the container may still be waiting on the downstream, unaware that the client has already received an error. This leads to wasted compute and can create phantom load on downstream services.
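
One way to keep the container from outliving the ALB's patience is to give each request a time budget and cap every downstream timeout by what is left of it. The sketch below is illustrative only: the Deadline class, its method names, and the 5-second safety margin are assumptions, not part of any AWS SDK.

```python
import time

ALB_IDLE_TIMEOUT_SECONDS = 60.0  # ALB default cited above


class Deadline:
    """Tracks how much of a request's time budget remains (hypothetical helper)."""

    def __init__(self, budget_seconds: float, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_seconds

    def remaining(self) -> float:
        return max(0.0, self._expires_at - self._clock())

    def timeout_for(self, configured_timeout: float) -> float:
        """Return the timeout to use for the next downstream call."""
        remaining = self.remaining()
        if remaining <= 0:
            # The client has likely received a 504 already; skip the call.
            raise TimeoutError("request budget exhausted")
        return min(configured_timeout, remaining)


# Assumed margin: stop 5 seconds before the ALB would give up.
deadline = Deadline(ALB_IDLE_TIMEOUT_SECONDS - 5.0)
db_timeout = deadline.timeout_for(3.0)  # pass this to the database client
```

Each downstream call then receives either its configured timeout or the remaining budget, whichever is smaller, so work stops at roughly the point the ALB would have returned an error anyway.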

Detection criteria

HIGH

Timeout error patterns detected in more than 2 log windows. Matches "timeout," "ETIMEDOUT," "deadline exceeded," "request timed out," "read timeout," and "socket timeout" patterns in error-level log lines.
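
The rule above can be sketched roughly as follows. This is not smplogs' actual implementation: the function names, the way error-level lines are recognized, and the None result below the threshold are all assumptions made for illustration.

```python
import re

# Patterns quoted in the detection criteria; matched case-insensitively.
TIMEOUT_PATTERN = re.compile(
    r"timeout|ETIMEDOUT|deadline exceeded|request timed out|read timeout|socket timeout",
    re.IGNORECASE,
)
# Assumption: error-level lines carry an ERROR or FATAL token.
ERROR_LEVEL = re.compile(r"\b(ERROR|FATAL)\b")


def is_timeout_error(line: str) -> bool:
    """True for an error-level line that matches any timeout pattern."""
    return bool(ERROR_LEVEL.search(line)) and bool(TIMEOUT_PATTERN.search(line))


def count_timeout_windows(windows: list[list[str]]) -> int:
    """Count the log windows that contain at least one timeout error."""
    return sum(1 for lines in windows if any(is_timeout_error(l) for l in lines))


def severity(windows: list[list[str]]):
    """Return "HIGH" when more than 2 windows contain timeout errors."""
    return "HIGH" if count_timeout_windows(windows) > 2 else None
```

Counting windows rather than individual lines means a single burst of retries in one window does not trigger the finding on its own.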

Example finding

HIGH Container Timeout Errors

Timeout errors detected in 6 of 38 log windows. Primary timeout targets: RDS PostgreSQL query execution (5000ms), external payment gateway API (10000ms).

ERROR: Query read timeout after 5000ms - SELECT * FROM orders WHERE...

ERROR: ETIMEDOUT connecting to payments.stripe.com:443

ERROR: gRPC deadline exceeded: inventory-service.internal:50051

How to fix

Identify which downstream dependencies are timing out. Group the timeout log entries by target host or service name. If 90% of your timeouts are to a single database, that is a focused problem. If timeouts are spread across multiple services, you may have a network-layer issue (NAT gateway throttling, DNS resolution delays) rather than a specific dependency problem. smplogs' log signature clusters help you see this grouping automatically.
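
If you want to do this grouping by hand, a minimal sketch might look like the following. It assumes the target appears in the log line as a dotted hostname, optionally followed by a port; lines without one (such as a bare query timeout) fall into an "unknown" bucket. The function names and the 90% default are illustrative.

```python
import re
from collections import Counter

# Assumption: targets look like host.domain or host.domain:port.
HOST_PATTERN = re.compile(
    r"\b((?:[a-z0-9-]+\.)+[a-z][a-z0-9-]*)(?::\d+)?", re.IGNORECASE
)


def timeout_targets(lines: list[str]) -> Counter:
    """Count timeout log lines per target host."""
    counts = Counter()
    for line in lines:
        match = HOST_PATTERN.search(line)
        counts[match.group(1).lower() if match else "unknown"] += 1
    return counts


def dominant_target(counts: Counter, share: float = 0.9):
    """Return the host that accounts for at least `share` of timeouts, if any."""
    total = sum(counts.values())
    if total == 0:
        return None
    target, hits = counts.most_common(1)[0]
    return target if hits / total >= share else None
```

A non-None result from dominant_target points to a single-dependency problem; None with many distinct hosts points toward the network layer.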

Check the downstream dependency's health independently. If the timeouts point to an RDS database, check its CloudWatch metrics: CPUUtilization, DatabaseConnections, ReadLatency, WriteLatency. For an ElastiCache Redis cluster, check EngineCPUUtilization and CurrConnections. For external APIs, check their status page or set up synthetic monitoring with CloudWatch Synthetics. The goal is to determine whether the timeout is on the caller's side or the callee's side.

Review and adjust timeout values. Default timeouts are often too generous or too aggressive. A 30-second database query timeout might be appropriate for a batch processing task but far too long for an API endpoint that needs to respond within 2 seconds. Set timeouts based on the expected response time of the dependency plus a reasonable buffer. For most synchronous API calls, 3-5 seconds is a good starting point. For database queries in a request path, 2-3 seconds. For batch operations, 30-60 seconds. Make timeouts configurable via environment variables so you can tune them with a task definition update rather than a code change and image rebuild.
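
A minimal sketch of environment-driven timeouts is shown below. The variable names are hypothetical, and the defaults simply take the upper end of the starting points suggested above; invalid or non-positive values fall back to the default rather than crashing the container at startup.

```python
import math
import os

# Hypothetical variable names; defaults follow the starting points above.
DEFAULT_TIMEOUTS_SECONDS = {
    "API_TIMEOUT_SECONDS": 5.0,
    "DB_QUERY_TIMEOUT_SECONDS": 3.0,
    "BATCH_TIMEOUT_SECONDS": 60.0,
}


def load_timeout(name: str, environ=None) -> float:
    """Read a timeout in seconds from the environment, falling back to a default."""
    environ = os.environ if environ is None else environ
    default = DEFAULT_TIMEOUTS_SECONDS[name]
    raw = environ.get(name)
    if raw is None:
        return default
    try:
        value = float(raw)
    except ValueError:
        return default
    # Reject zero, negative, NaN, and infinite values.
    return value if math.isfinite(value) and value > 0 else default


api_timeout = load_timeout("API_TIMEOUT_SECONDS")
```

In ECS the values would be set in the container definition's environment section of the task definition.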

Increase your task's CPU allocation. On Fargate, CPU-starved containers exhibit timeout-like symptoms because they cannot process network I/O fast enough. If your task is allocated 256 CPU units (0.25 vCPU) and handles concurrent requests, it may not have enough CPU cycles to read socket responses in time. Check the service's CPUUtilization metric in CloudWatch, or compare CpuUtilized against CpuReserved if Container Insights is enabled. Sustained utilization above 80% strongly suggests CPU starvation. Upgrade to the next CPU tier (512 or 1024 units) and observe whether timeouts decrease.
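
"Sustained" is a judgment call. One hedged way to check it against exported CPU utilization percentages is to look for a run of consecutive datapoints above the threshold. The function name and the five-datapoint run length below are assumptions, not an AWS definition.

```python
def sustained_above(datapoints: list[float], threshold: float = 80.0,
                    min_consecutive: int = 5) -> bool:
    """True if at least `min_consecutive` datapoints in a row exceed `threshold`."""
    run = 0
    for value in datapoints:
        # Extend the current run, or reset it when utilization drops back.
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False
```

With one-minute datapoints, the default corresponds to five straight minutes above 80%; a single spike does not count.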

Implement circuit breakers for downstream calls. A circuit breaker tracks the failure rate of calls to a dependency. When the failure rate exceeds a threshold (e.g., 50% of calls fail within a 10-second window), the circuit "opens" and immediately rejects new calls to that dependency without waiting for the timeout. This prevents timeout cascades and preserves your container's resources for requests it can actually serve. Libraries like opossum (Node.js), resilience4j (Java), or gobreaker (Go) provide this pattern out of the box.
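
To show the mechanism, here is a deliberately simplified circuit breaker. It is a sketch, not a substitute for the libraries above: the class and parameter names are made up for illustration, and it omits the half-open probing state that production libraries typically provide, closing the circuit outright once the reset period has passed.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=0.5, window_seconds=10.0,
                 min_calls=5, reset_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds
        self.min_calls = min_calls        # avoid tripping on a tiny sample
        self.reset_seconds = reset_seconds
        self.clock = clock
        self.results = deque()            # (timestamp, succeeded) pairs
        self.opened_at = None

    def _failure_rate(self, now):
        # Drop results that have aged out of the sliding window.
        while self.results and now - self.results[0][0] > self.window_seconds:
            self.results.popleft()
        if len(self.results) < self.min_calls:
            return 0.0
        failures = sum(1 for _, ok in self.results if not ok)
        return failures / len(self.results)

    def _record(self, ok):
        now = self.clock()
        self.results.append((now, ok))
        if self._failure_rate(now) > self.failure_threshold:
            self.opened_at = now

    def call(self, func, *args, **kwargs):
        now = self.clock()
        if self.opened_at is not None:
            if now - self.opened_at < self.reset_seconds:
                # Fail fast instead of waiting out another timeout.
                raise CircuitOpenError("circuit open; call rejected")
            # Reset period elapsed: close the circuit and start fresh.
            self.opened_at = None
            self.results.clear()
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record(False)
            raise
        self._record(True)
        return result
```

You would wrap each downstream call as breaker.call(fetch_inventory, item_id), where fetch_inventory stands in for your own client function, and treat CircuitOpenError as a signal to return a fallback or a fast error.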

Related patterns

Network Connectivity Failures: Connection errors to downstream services

High Container Error Rate: Percentage of log windows containing container errors
