Lambda Timeout Debugging

How to find "Task timed out" errors in your CloudWatch logs, trace the root cause, and prevent timeouts.

TL;DR

Lambda kills your function when it hits the configured timeout (default 3 seconds, max 15 minutes). CloudWatch logs show Task timed out after X.XX seconds and the REPORT line will show Duration equal to the timeout value. The most common cause is a downstream call (database query, HTTP request) that hangs because no client-side timeout was set. Fix: always set HTTP/SDK timeouts shorter than your Lambda timeout, add checkpoint logging before slow operations, and move long-running work to Step Functions or SQS.

What happens when a Lambda times out?

Every Lambda function has a configured timeout. When you create a function, the default is 3 seconds. You can increase it up to 15 minutes (900 seconds). If your function's execution reaches this limit, Lambda forcibly terminates the invocation. There is no graceful shutdown, no chance to run cleanup code, no finally block. The process is killed mid-execution.

This abrupt termination has several consequences that make timeouts particularly painful to debug:

  • Your function gets no opportunity to close database connections, flush buffers, or send error responses. Any in-flight work is lost.
  • The invocation is billed for the full timeout duration. A function configured with a 15-minute timeout that hangs on a dead database connection is billed for 15 minutes, even though it accomplished nothing.
  • Downstream resources are left in an indeterminate state. If your function was halfway through writing a batch to DynamoDB, some items are written and some are not. If it had an open connection to RDS, that connection may linger as a zombie until the database server times it out on its side.
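  • For synchronous invocations (API Gateway, SDK invoke), the caller receives a generic error with no meaningful error message from your code. Through API Gateway this is typically a 502, or a 504 if API Gateway's own integration timeout expires first.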
  • For asynchronous invocations (S3 events, SNS, EventBridge), Lambda will automatically retry twice before sending the event to the dead letter queue (if configured). Each retry hits the same timeout, tripling the wasted compute time.
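
The billing impact described in the bullets above can be estimated with simple arithmetic. The sketch below computes the billed GB-seconds for an asynchronous invocation that hangs until the timeout on the initial attempt and on both retries. The function name and the 512 MB / 900-second figures are illustrative assumptions, not measurements.

```javascript
// Billed compute for an invocation that runs to the timeout on every attempt.
// Lambda bills duration in GB-seconds: memory (GB) x duration (s).
function wastedGbSeconds(memoryMb, timeoutSeconds, attempts = 3) {
  // attempts = 1 initial try + 2 automatic retries for async invocations
  return (memoryMb / 1024) * timeoutSeconds * attempts;
}

// 0.5 GB x 900 s x 3 attempts
console.log(wastedGbSeconds(512, 900)); // 1350
```

Multiplying the result by your region's per-GB-second price gives the cost of a single stuck event.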

CloudWatch records the timeout with a specific log line: Task timed out after X.XX seconds. This appears immediately after the last log line your function emitted before being killed. The accompanying REPORT line shows a Duration value exactly equal to your configured timeout, and the function's status is marked as an error.

Identifying timeouts in CloudWatch logs

When a Lambda invocation times out, CloudWatch records two critical log lines. The first is the timeout notification itself:

2024-01-15T10:30:00.000Z abc-123 Task timed out after 30.00 seconds

The second is the REPORT line, which shows Duration matching your configured timeout exactly:

REPORT RequestId: abc-123  Duration: 30000.00 ms  Billed Duration: 30000 ms  Memory Size: 512 MB  Max Memory Used: 256 MB

The hallmark of a timeout is that Duration in the REPORT line matches your function's timeout configuration (in practice it can overshoot by a few milliseconds). If your function is configured for 30 seconds and the REPORT shows Duration: 30000.00 ms, that invocation was killed by the timeout, even if you don't see the "Task timed out" message (which can sometimes be missing in edge cases).

There is also a near-timeout category worth watching. If an invocation's Duration reaches 95% or more of the configured timeout (for example, 28.5 seconds out of a 30-second timeout), the function completed just in time but is at risk. A slightly slower database response or a minor network hiccup next time will push it over. These near-misses are early warning signs.
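
The timeout and near-timeout categories can be checked mechanically. This is a minimal sketch that classifies a single REPORT line, assuming you already know the function's configured timeout. The function name, the return labels, and the 0.95 default are illustrative choices.

```javascript
// Classify an invocation from its REPORT line.
// timeoutMs is the configured timeout; nearRatio defines the near-miss band.
function classifyReport(reportLine, timeoutMs, nearRatio = 0.95) {
  // The first "Duration:" in a REPORT line is the execution duration.
  const match = /Duration: ([\d.]+) ms/.exec(reportLine);
  if (!match) return 'unknown';
  const durationMs = parseFloat(match[1]);
  if (durationMs >= timeoutMs) return 'timeout';
  if (durationMs >= timeoutMs * nearRatio) return 'near-timeout';
  return 'ok';
}

const line =
  'REPORT RequestId: abc-123  Duration: 30000.00 ms  Billed Duration: 30000 ms  Memory Size: 512 MB  Max Memory Used: 256 MB';
console.log(classifyReport(line, 30000)); // timeout
```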

To find timeouts at scale across multiple log groups, use CloudWatch Logs Insights:

fields @timestamp, @requestId, @message
| filter @message like /Task timed out/
| sort @timestamp desc
| limit 100

This query returns the 100 most recent timeout events. From here, use the @requestId to find all log lines for that specific invocation and trace what your function was doing right before it was killed.
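
Once you have the raw messages for a log stream, the "last line before the kill" lookup can be automated. The sketch below assumes messages arrive oldest first and follow the Node.js runtime layout of timestamp, request ID, then the rest of the line. The function name and the returned shape are hypothetical.

```javascript
// For each timed-out request, return the last application log line it emitted
// before the "Task timed out" notice.
function lastLinesBeforeTimeout(messages) {
  const lastSeen = new Map(); // requestId -> most recent application log line
  const findings = [];
  for (const message of messages) {
    // Skip platform lines, which do not start with a timestamp.
    if (/^(START|END|REPORT|INIT_START)\b/.test(message)) continue;
    const parts = /^(\S+)\s+(\S+)\s+(.*)$/s.exec(message.trim());
    if (!parts) continue;
    const [, , requestId, rest] = parts;
    if (rest.startsWith('Task timed out')) {
      findings.push({ requestId, lastLine: lastSeen.get(requestId) ?? null });
    } else {
      lastSeen.set(requestId, rest);
    }
  }
  return findings;
}
```

A lastLine of null means the function logged nothing before it was killed, which is itself a signal that checkpoint logging is missing.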

Common causes of Lambda timeouts

Slow database queries

DynamoDB Scan operations on large tables, full table scans on RDS without proper indexes, or queries that accidentally return millions of rows. DynamoDB Scans read every item in the table and are O(n) on table size. An RDS query without an index on the WHERE clause column will do a sequential scan that gets slower as the table grows. If you're connecting to RDS without connection pooling (or without RDS Proxy), you may also exhaust the database's connection limit, causing new connection attempts to queue and eventually time out.

Downstream API calls without timeouts

This is the single most common cause of Lambda timeouts. When you make an HTTP request using axios, node-fetch, the AWS SDK, or Python's requests library, the default timeout is often infinite or unreasonably long. If the downstream service is slow or unresponsive, your Lambda function sits there waiting for a response until Lambda kills it at the timeout limit. The fix is simple but frequently overlooked: always set an explicit timeout on every outbound HTTP call, and make that timeout shorter than your Lambda's configured timeout.

Dead connections from VPC/ENI issues

Lambda functions in a VPC attach to Elastic Network Interfaces (ENIs). If the function's security group or subnet doesn't allow outbound traffic to the destination, the TCP connection attempt hangs silently until it times out at the OS level (on the order of two minutes with default Linux settings), which is usually long after Lambda has already killed the invocation. There's no "connection refused" error because the packets simply disappear. This manifests as your function appearing to freeze on the first network call. Common triggers include: missing NAT Gateway for internet access, security group not allowing outbound HTTPS (port 443), or VPC endpoint misconfiguration for AWS services.

Large payload processing

Downloading large objects from S3, processing multi-GB files line by line, or deserializing massive JSON payloads. If your function calls s3.getObject() on a 2 GB file and tries to buffer the entire response in memory, the download alone can exceed your timeout. Even when you stream the data, CPU-intensive parsing (XML, CSV with millions of rows) can push execution time beyond the limit.

Cold start eating into the timeout budget

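A cold start with a large deployment package or slow initialization code (connecting to databases, loading ML models, warming caches) can consume a significant portion of the timeout window. Strictly, the configured timeout covers the handler's execution: for on-demand functions, code at module scope runs in a separate Init phase with its own 10-second limit. Setup performed lazily inside the handler, or an Init phase that overruns and is rerun as part of the first invocation, does count against the timeout. If your function has a 10-second timeout and that first-invocation setup takes 6 seconds, you only have 4 seconds left for actual execution. Combined with a moderately slow downstream call, this easily exceeds the limit. This is especially problematic for Java and .NET runtimes where cold starts can reach 5-10 seconds.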

Infinite loops or recursive calls

A bug in your code that creates an infinite loop, or a Lambda function that triggers itself (for example, an S3-triggered function that writes back to the same bucket, re-triggering itself). The loop doesn't crash because there's no stack overflow; it just runs until the timeout kills it. These are easy to spot in logs because the function's last log line will be inside the loop, and there will be no error message other than the timeout notification.

How smplogs detects timeouts

When you upload a Lambda log file to smplogs, the WASM analysis engine scans every log line for the Task timed out pattern and cross-references it with REPORT lines to count how many invocations hit the timeout limit. It produces a finding card like this:

ELEVATED Timeout Pattern

3 invocations timed out at the 30s limit.

-> Review downstream service latency and consider increasing timeout or optimizing slow operations.

The engine doesn't stop at counting timeouts. It also performs root cause correlation, looking at the full context of timed-out invocations to surface related patterns. For example, if your logs also contain DynamoDB throttling errors (ProvisionedThroughputExceededException) or connection errors (ETIMEDOUT, ECONNREFUSED), smplogs links these to the timeout finding in the root cause analysis panel.

The duration metrics card also highlights the relationship between timeouts and your overall latency distribution. If 5% of invocations are timing out at 30 seconds while the median duration is 200 milliseconds, that's a clear sign of a bimodal distribution caused by an intermittent downstream failure rather than a generally slow function.

Step-by-step fixes

Fix 1: Set HTTP client timeouts on every outbound call

The single most effective fix. Every HTTP request your function makes should have an explicit timeout that is shorter than your Lambda timeout. If your Lambda is configured for 30 seconds, set your HTTP client timeout to 25 seconds at most, leaving a buffer for cleanup and retries.

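import axios from 'axios';
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { NodeHttpHandler } from '@smithy/node-http-handler';

// Bad - no timeout, will hang indefinitely
const unbounded = await axios.get('https://api.example.com/data');

// Good - explicit 5-second timeout
// `timeout` bounds the wait for a response; `signal` also aborts a stalled connection attempt
const response = await axios.get('https://api.example.com/data', {
  timeout: 5000,
  signal: AbortSignal.timeout(5000)
});

// AWS SDK v3 - set connection and request timeouts (milliseconds)
const client = new DynamoDBClient({
  requestHandler: new NodeHttpHandler({
    connectionTimeout: 3000,
    requestTimeout: 5000
  })
});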

Fix 2: Use connection pooling for databases

If your function connects to RDS or Aurora, use RDS Proxy to manage connection pooling. Without it, every Lambda container opens its own database connection, and under high concurrency you can exhaust the database's max_connections limit. New connection attempts queue up and eventually time out. RDS Proxy multiplexes hundreds of Lambda connections onto a smaller pool of database connections. For DynamoDB, connection pooling is handled by the SDK, but ensure you're creating the client outside the handler so the connection is reused across invocations.
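
The "create the client outside the handler" advice looks like this in practice. The sketch uses a hypothetical createClient stand-in instead of a real driver or SDK client so that it runs anywhere. The counter exists only to show that the constructor runs once.

```javascript
let connectionsOpened = 0;

// Hypothetical stand-in for an expensive client constructor such as a
// database driver or AWS SDK client; it only counts how often it is called.
function createClient() {
  connectionsOpened += 1;
  return { query: async (text) => ({ text, rows: [] }) };
}

// Module scope: runs once per execution environment, then is reused while warm.
const client = createClient();

async function handler(event) {
  // No new connection here; every warm invocation shares `client`.
  return client.query(event.sql);
}
```

Moving the createClient() call inside handler would open a new connection on every invocation, which is the pattern that exhausts max_connections under load.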

Fix 3: Increase the Lambda timeout if the work is legitimately slow

Sometimes the function genuinely needs more time. If your function processes a batch of SQS messages and each message requires a few API calls, the total execution time might legitimately need 60 seconds. In this case, increase the timeout. But don't set it to 15 minutes "just in case." Set it to approximately 2x your expected maximum execution time. This gives headroom for variance while still failing fast if something goes wrong. A function that hangs for 15 minutes on a dead connection wastes money and delays error detection.

Fix 4: Add circuit breakers for downstream services

If a downstream service is consistently slow or down, every Lambda invocation that calls it will time out. A circuit breaker pattern prevents this cascade: after N consecutive failures, the circuit "opens" and immediately rejects requests without making the downstream call. This fails fast instead of burning through your timeout, saves cost, and prevents your function from amplifying the problem. Libraries like opossum (Node.js) or pybreaker (Python) make this straightforward to implement.
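
To make the pattern concrete, here is a minimal hand-rolled circuit breaker. It is a sketch of the idea rather than how opossum or pybreaker are implemented, and the class name, option names, and defaults are all assumptions. State lives in memory, so each Lambda execution environment tracks failures independently.

```javascript
// Opens after `threshold` consecutive failures and rejects calls until
// `cooldownMs` has elapsed, then lets a trial call through.
class CircuitBreaker {
  constructor({ threshold = 3, cooldownMs = 10000, now = Date.now } = {}) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.now = now; // injectable clock, useful for tests
    this.failures = 0;
    this.openedAt = null;
  }

  isOpen() {
    if (this.openedAt === null) return false;
    // Once the cooldown has passed, report closed so a trial call can run.
    return this.now() - this.openedAt < this.cooldownMs;
  }

  async call(fn) {
    if (this.isOpen()) throw new Error('Circuit open: failing fast');
    try {
      const result = await fn();
      this.failures = 0; // any success resets the breaker
      this.openedAt = null;
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.threshold) this.openedAt = this.now();
      throw err;
    }
  }
}
```

Create one breaker per downstream dependency at module scope and wrap each outbound call in breaker.call(() => ...), so an open circuit fails in microseconds instead of waiting for a client timeout.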

Fix 5: Move long-running work to Step Functions or SQS

If your function's workload genuinely requires more than a few minutes, Lambda may not be the right execution model. Step Functions let you orchestrate multi-step workflows where each step is a separate Lambda invocation with its own timeout. The overall workflow can run for up to a year. Alternatively, break the work into smaller chunks, publish each chunk to an SQS queue, and process them in separate Lambda invocations. This pattern turns one long-running function into many short-running ones, each well within the timeout limit.
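
The chunking approach can be sketched as a pure function that turns one large work list into SQS batch entries. It relies on the SQS limit of 10 entries per SendMessageBatch request. The function name and the JSON message body format are assumptions, and each chunk is assumed to stay under the SQS message size limit.

```javascript
// Split a work list into chunks, then group the chunks into arrays that can be
// passed as `Entries` to an SQS SendMessageBatch request.
function toSqsBatches(items, chunkSize, maxEntriesPerBatch = 10) {
  const chunks = [];
  for (let i = 0; i < items.length; i += chunkSize) {
    chunks.push(items.slice(i, i + chunkSize));
  }
  const batches = [];
  for (let i = 0; i < chunks.length; i += maxEntriesPerBatch) {
    batches.push(
      chunks.slice(i, i + maxEntriesPerBatch).map((chunk, j) => ({
        Id: String(i + j), // must be unique within a batch
        MessageBody: JSON.stringify(chunk)
      }))
    );
  }
  return batches;
}

// 250 items in chunks of 10 -> 25 messages -> 3 batch requests
const batches = toSqsBatches(Array.from({ length: 250 }, (_, i) => i), 10);
console.log(batches.length); // 3
```

Each message then becomes its own short Lambda invocation, so a single slow chunk times out in isolation instead of taking the whole job with it.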

Fix 6: Add structured logging before each slow operation

When a function times out, the last log line tells you where it was stuck. But if your function only logs at the start and end, you have no visibility into which step caused the hang. Add structured log lines (with timestamps) before and after each slow operation: database queries, HTTP calls, S3 operations, file processing.

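import { QueryCommand } from '@aws-sdk/client-dynamodb';

// Assumes `dynamo` is a DynamoDBClient created outside the handler
// and `params` is a valid Query input for the orders table.

// Checkpoint before the call: if this is the last line logged, the query hung
console.log(JSON.stringify({
  step: 'dynamodb_query_start',
  table: 'orders',
  ts: Date.now()
}));

const result = await dynamo.send(new QueryCommand(params));

// Checkpoint after the call: confirms the query returned and how much came back
console.log(JSON.stringify({
  step: 'dynamodb_query_end',
  table: 'orders',
  items: result.Items?.length ?? 0,
  ts: Date.now()
}));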

If the function times out and the last log line you see is dynamodb_query_start, you know the DynamoDB query is the culprit. Without these checkpoints, you're guessing.
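
Writing the start and end lines by hand for every call gets repetitive, so the checkpoints can be folded into a small wrapper. This is one possible shape, not a standard API: withCheckpoint, its arguments, and the _start/_end/_error suffixes are all hypothetical.

```javascript
// Wrap an async operation with start/end checkpoint logs.
// `log` defaults to console.log and is injectable for testing.
async function withCheckpoint(step, details, operation, log = console.log) {
  const startedAt = Date.now();
  log(JSON.stringify({ step: `${step}_start`, ...details, ts: startedAt }));
  try {
    const result = await operation();
    const endedAt = Date.now();
    log(JSON.stringify({ step: `${step}_end`, ...details, ms: endedAt - startedAt, ts: endedAt }));
    return result;
  } catch (err) {
    log(JSON.stringify({ step: `${step}_error`, ...details, error: err.message, ts: Date.now() }));
    throw err;
  }
}

// Usage inside a handler (dynamo and params as in the snippet above):
// const result = await withCheckpoint('dynamodb_query', { table: 'orders' },
//   () => dynamo.send(new QueryCommand(params)));
```

A timed-out invocation leaves a _start line with no matching _end or _error line, which points directly at the operation that hung.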

Prevention best practices

Timeouts are easier to prevent than debug after the fact. These practices catch timeout risks before they reach production:

Set timeout to 2x expected max duration, not 15 minutes

Measure your function's actual execution time distribution. If P99 duration is 8 seconds, set the timeout to 15-20 seconds. Setting it to 900 seconds "for safety" means that when something does go wrong, you won't know about it for 15 minutes, and you'll be billed for every second of that wait. A tight timeout acts as a fast failure detector. Configure CloudWatch alarms on the Duration metric and alert when P95 or P99 exceeds 75-80% of the configured timeout.
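
The 2x rule can be turned into a small calculation over measured durations. The sketch below uses a nearest-rank percentile and clamps the result to Lambda's 1-900 second range. The function names and the sample data are illustrative assumptions.

```javascript
// Nearest-rank percentile over a non-empty list of durations.
function percentile(values, p) {
  if (values.length === 0) throw new Error('no durations provided');
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Suggest a timeout of `factor` x P99, rounded up to whole seconds.
function recommendTimeoutSeconds(durationsMs, factor = 2) {
  const p99Seconds = percentile(durationsMs, 99) / 1000;
  return Math.min(900, Math.max(1, Math.ceil(p99Seconds * factor)));
}

// 98 fast invocations and 2 slow ones: P99 is 8 seconds
const sample = Array(98).fill(200).concat([8000, 8000]);
console.log(recommendTimeoutSeconds(sample)); // 16
```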

Always set HTTP client timeouts shorter than Lambda timeout

This should be a team-wide rule enforced by linting or code review. The contract is: if your Lambda has a 30-second timeout, no outbound call should wait longer than 25 seconds. This ensures your code gets a chance to handle the failure (retry, fallback, error response) instead of being killed mid-flight. For AWS SDK calls, configure both connectionTimeout (how long to wait for a TCP connection) and requestTimeout (how long to wait for a response).
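
Rather than hard-coding 25 seconds, the outbound timeout can be derived from the time the invocation actually has left. The sketch relies on context.getRemainingTimeInMillis(), which the Lambda Node.js runtime provides. The helper name, the 5-second reserve, and the 25-second cap are assumptions to tune per function.

```javascript
// Derive a per-call timeout from the invocation's remaining time, keeping a
// reserve so the handler can still log, clean up, and return an error.
function outboundTimeoutMs(context, { reserveMs = 5000, maxMs = 25000 } = {}) {
  const budget = context.getRemainingTimeInMillis() - reserveMs;
  return Math.max(0, Math.min(budget, maxMs));
}

// Usage inside a handler:
// const res = await fetch(url, { signal: AbortSignal.timeout(outboundTimeoutMs(context)) });
```

This matters most for the second or third outbound call in a handler, when much of the 30 seconds may already be spent.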

Use CloudWatch alarms on Duration approaching timeout

Create a CloudWatch alarm on your function's Duration metric. Set the threshold to 75-80% of your configured timeout. For a 30-second timeout, alarm when P99 Duration exceeds 22.5 to 24 seconds. This gives you advance warning that your function is getting close to the limit, so you can investigate and fix the issue before users are impacted. Also alarm on the Errors metric, which includes timeouts alongside other failures. To isolate timeouts specifically, create a CloudWatch Logs metric filter on the Task timed out pattern and alarm on that.
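
As a starting point, the alarm definition can be generated from the function's timeout. The sketch builds an input object for the CloudWatch PutMetricAlarm API, for example to pass to PutMetricAlarmCommand in the AWS SDK v3. The alarm name, period, and evaluation settings are assumptions to adapt.

```javascript
// Alarm when P99 Duration exceeds `ratio` of the configured timeout.
// The Lambda Duration metric is reported in milliseconds.
function durationAlarmParams(functionName, timeoutSeconds, ratio = 0.75) {
  return {
    AlarmName: `${functionName}-duration-near-timeout`, // hypothetical naming scheme
    Namespace: 'AWS/Lambda',
    MetricName: 'Duration',
    Dimensions: [{ Name: 'FunctionName', Value: functionName }],
    ExtendedStatistic: 'p99',
    Period: 300,          // seconds per datapoint
    EvaluationPeriods: 3, // consecutive breaching datapoints required
    Threshold: timeoutSeconds * 1000 * ratio,
    ComparisonOperator: 'GreaterThanThreshold',
    TreatMissingData: 'notBreaching'
  };
}

console.log(durationAlarmParams('orders-api', 30).Threshold); // 22500
```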

Implement dead letter queues for async invocations

For functions triggered by S3, SNS, EventBridge, or any async source, always configure a dead letter queue (SQS) or on-failure destination. When a timeout causes the invocation to fail after all retries, the event is sent to the DLQ instead of being silently dropped. Without a DLQ, timed-out events are lost forever. Monitor the DLQ's ApproximateNumberOfMessagesVisible metric and alarm when it's non-zero.
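
An on-failure destination is configured per function through Lambda's PutFunctionEventInvokeConfig API. The sketch below builds that request's input, for example for PutFunctionEventInvokeConfigCommand in the AWS SDK v3. The helper name, the one-hour event age, and the ARN in the usage line are placeholder assumptions.

```javascript
// Route events that fail every attempt to an SQS queue instead of dropping them.
function asyncFailureConfig(functionName, failureQueueArn) {
  return {
    FunctionName: functionName,
    MaximumRetryAttempts: 2,        // allowed range 0-2; 2 is Lambda's default
    MaximumEventAgeInSeconds: 3600, // allowed range 60-21600; illustrative value
    DestinationConfig: {
      OnFailure: { Destination: failureQueueArn }
    }
  };
}

const config = asyncFailureConfig(
  'orders-worker',
  'arn:aws:sqs:us-east-1:123456789012:orders-worker-failures' // placeholder ARN
);
console.log(config.DestinationConfig.OnFailure.Destination);
```

The function's execution role also needs permission to send messages to the destination queue.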

Log timestamps at key checkpoints to trace slow paths

Build structured logging into your function from day one, not after the first timeout incident. Log a timestamp before and after every external call (database, API, S3, SQS). Include the operation name and relevant parameters (table name, endpoint URL, S3 key). When a timeout occurs, the logs tell you exactly which operation was in progress and how long previous operations took. This turns a multi-hour debugging session into a five-minute diagnosis. For Python, the aws_lambda_powertools library provides structured logging with automatic correlation IDs. For Node.js, a simple JSON.stringify wrapper around console.log is sufficient.

Tired of hunting through logs for timeouts? Drop your CloudWatch JSON into smplogs to instantly see timeout patterns, duration breakdowns, and root cause correlations.
