Connection Failures
Invocations failing to connect to downstream services
What this means
Connection failures occur when your Lambda function cannot establish a TCP connection to a downstream service. These appear in logs as ECONNREFUSED (the target host actively refused the connection), ETIMEDOUT (the connection attempt timed out with no response), ENOTFOUND (DNS resolution failed), or ENETUNREACH (no route to the network). Unlike application-level errors where a service responds with an error code, connection failures mean your function never even reached the service.
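As a rough illustration of how these four codes can be pulled out of a log line, here is a minimal Python sketch. The function name `classify_connection_error` and the lookup table are hypothetical, the meanings are copied from the paragraph above, and the approach is plain string matching rather than anything smplogs is known to do.

```python
import re
from typing import Optional

# Meanings copied from the description above; the table itself is illustrative.
ERROR_MEANINGS = {
    "ECONNREFUSED": "target host actively refused the connection",
    "ETIMEDOUT": "connection attempt timed out with no response",
    "ENOTFOUND": "DNS resolution failed",
    "ENETUNREACH": "no route to the network",
}

# Match any of the known codes as a whole word.
_CODE_PATTERN = re.compile(r"\b(" + "|".join(ERROR_MEANINGS) + r")\b")


def classify_connection_error(message: str) -> Optional[str]:
    """Return the first connection-failure code found in a log message, or None."""
    match = _CODE_PATTERN.search(message)
    return match.group(1) if match else None
```

A message that contains none of these codes (for example, an HTTP 500 response from the service) returns `None`, which matches the distinction drawn above between connection failures and application-level errors.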
The root cause depends heavily on whether your Lambda function runs inside a VPC. For VPC-attached functions, the most common culprits are misconfigured security groups (outbound rules blocking the destination port), a missing or misconfigured NAT Gateway (VPC functions have no internet access without one), subnet route table issues, or network ACLs that deny traffic. For non-VPC functions, connection failures almost always point to the downstream service being unavailable, a DNS issue, or the destination's firewall blocking Lambda's IP ranges.
Connection failures are expensive because they often consume the full socket timeout before failing. If your HTTP client has a 30-second connect timeout and the target host is unreachable (no ICMP response, packet silently dropped by a firewall), each invocation will block for 30 seconds before throwing an error. Combined with SDK retry logic that may attempt the connection 3 times, a single invocation can burn 90 seconds of Lambda execution time on a call that was never going to succeed. This drives up costs and consumes concurrency capacity.
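The arithmetic above can be written as a small helper. This is a simplified sketch: the function name is hypothetical, and it ignores any backoff delay between retries, which would only add to the total.

```python
def worst_case_blocked_seconds(connect_timeout_s: float, attempts: int) -> float:
    """Upper bound on time spent waiting for connections that never succeed.

    Assumes every attempt consumes the full connect timeout and ignores
    retry backoff (which would make the real figure larger).
    """
    return connect_timeout_s * attempts


# Figures from the text: 30-second connect timeout, 3 attempts.
slow = worst_case_blocked_seconds(30, 3)
# Same retry count with a 3-second connect timeout, for comparison.
fast = worst_case_blocked_seconds(3, 3)
```

With the same retry count, shortening the connect timeout from 30 seconds to 3 seconds reduces the worst case by a factor of ten.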
Detection criteria
smplogs triggers this finding when:
- HIGH: More than 5 invocations contain connection failure errors
- ELEVATED: At least 1 invocation contains a connection failure error
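The two thresholds can be expressed as a short function. This is only a sketch of the rule as stated in the list above, not smplogs' actual implementation; the function name is hypothetical.

```python
from typing import Optional


def connection_failure_severity(failed_invocations: int) -> Optional[str]:
    """Map a count of invocations with connection errors to a severity label."""
    if failed_invocations > 5:
        return "HIGH"
    if failed_invocations >= 1:
        return "ELEVATED"
    return None  # no connection failures, so no finding
```

Note that exactly 5 failing invocations is still ELEVATED, because HIGH requires more than 5.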
Example finding
28 invocations failed with connection errors. Primary error: "connect ETIMEDOUT 10.0.3.47:5432" (PostgreSQL). All failures originated from subnet-0a1b2c3d in us-east-1b. The other subnet, subnet-0e4f5g6h in us-east-1a, had zero connection errors.
How to fix
Verify security group outbound rules
Check the security group attached to your Lambda function's ENI. The outbound rules must allow traffic to the destination IP and port. A common mistake is restricting outbound traffic to specific security groups without including the destination service's group. For RDS, the database security group must also allow inbound traffic from the Lambda security group. Use VPC Flow Logs to confirm whether traffic is being blocked: look for REJECT entries on the destination IP and port combination. Because security groups are stateful, the Lambda side needs only the outbound rule; return traffic for an allowed connection is permitted automatically.
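To make the Flow Logs check concrete, the sketch below filters flow log lines for REJECT entries on a given destination IP and port. It assumes the default (version 2) space-separated record format; custom formats use a different field order and would need a different field list. The function name and any sample records you feed it are illustrative.

```python
from typing import Dict, Iterable, List

# Field order of the default (version 2) VPC Flow Logs record format.
DEFAULT_FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]


def rejected_flows(lines: Iterable[str], dst_ip: str, dst_port: int) -> List[Dict[str, str]]:
    """Return parsed records with action REJECT for the given destination."""
    hits = []
    for line in lines:
        parts = line.split()
        if len(parts) != len(DEFAULT_FIELDS):
            continue  # header, custom format, or malformed line: skip it
        record = dict(zip(DEFAULT_FIELDS, parts))
        if (
            record["action"] == "REJECT"
            and record["dstaddr"] == dst_ip
            and record["dstport"] == str(dst_port)
        ):
            hits.append(record)
    return hits
```

If this returns records whose `interface_id` is the Lambda ENI, a security group or network ACL is rejecting the traffic. An empty result with continuing ETIMEDOUT errors suggests the problem lies elsewhere, such as routing.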
Check NAT Gateway and route tables for internet access
Lambda functions in a VPC are placed in your specified subnets and have no internet access by default. To reach public endpoints (third-party APIs, public AWS endpoints without VPC endpoints), the Lambda subnets must be private subnets with a route to a NAT Gateway in a public subnet. Check the route table associated with each Lambda subnet: it should have a 0.0.0.0/0 route pointing to a NAT Gateway. If the NAT Gateway is in a different AZ from the Lambda subnet, verify that the subnet's route table actually targets that gateway and that the route is active rather than blackholed. Also confirm the NAT Gateway itself has an Elastic IP and its public subnet has an Internet Gateway route.
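The route table check can be sketched as a pure function over the `Routes` list that the EC2 `DescribeRouteTables` API returns (for example via boto3's `describe_route_tables`). The function name is hypothetical, and fetching the route table is left out so the logic stays easy to test.

```python
from typing import Any, Dict, List


def has_nat_default_route(routes: List[Dict[str, Any]]) -> bool:
    """True if the routes include an active 0.0.0.0/0 route to a NAT Gateway.

    `routes` is expected in the shape of the `Routes` list from the EC2
    DescribeRouteTables response.
    """
    for route in routes:
        if (
            route.get("DestinationCidrBlock") == "0.0.0.0/0"
            and route.get("NatGatewayId")        # target is a NAT Gateway
            and route.get("State") == "active"   # not "blackhole"
        ):
            return True
    return False
```

A default route whose target is an Internet Gateway (`GatewayId` starting with `igw-`) does not count here: that makes the subnet public, and a Lambda ENI has no public IP address to use it with.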
Use VPC endpoints for AWS services
If your Lambda function calls AWS services (DynamoDB, S3, SQS, Secrets Manager), create VPC interface endpoints (PrivateLink) or gateway endpoints for those services. This routes traffic through the AWS private network instead of through the NAT Gateway, eliminating a common failure point and reducing latency. Gateway endpoints (S3, DynamoDB) are free. Interface endpoints cost about $0.01/hour per AZ plus data processing fees (pricing varies by region), but they remove the NAT Gateway from the path for those calls, so a NAT misconfiguration or bottleneck no longer affects them.
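As one possible starting point, the sketch below builds the parameters for creating a gateway endpoint. The helper name is hypothetical and the IDs in the usage note are placeholders; the parameter names are those of the EC2 `CreateVpcEndpoint` API.

```python
from typing import Any, Dict, Iterable

# Services named in the text as supporting (free) gateway endpoints.
GATEWAY_SERVICES = ("s3", "dynamodb")


def gateway_endpoint_params(
    vpc_id: str, region: str, service: str, route_table_ids: Iterable[str]
) -> Dict[str, Any]:
    """Build CreateVpcEndpoint parameters for a gateway endpoint."""
    if service not in GATEWAY_SERVICES:
        raise ValueError(f"{service!r} needs an interface endpoint, not a gateway endpoint")
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.{service}",
        # Gateway endpoints work by adding routes to these route tables.
        "RouteTableIds": list(route_table_ids),
    }
```

With boto3 installed and credentials configured, the result could be passed along as `boto3.client("ec2").create_vpc_endpoint(**params)`. Include the route tables of every Lambda subnet, otherwise traffic from the omitted subnets still goes through the NAT Gateway.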
Set aggressive connect timeouts
Configure short connect timeouts (3-5 seconds) on all outbound connections. If a host is unreachable, you want to find out in 3 seconds, not 30. The connect timeout is separate from the read timeout — a 3-second connect timeout with a 15-second read timeout means you wait at most 3 seconds to establish the TCP connection, then up to 15 seconds for the response. This limits the blast radius of unreachable hosts and preserves concurrency capacity for requests that can actually succeed.
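The split between connect and read timeouts can be shown with the Python standard library alone. This is a minimal sketch with a hypothetical helper name; the 3-second and 15-second defaults are the figures from the paragraph above.

```python
import socket


def open_connection(
    host: str,
    port: int,
    connect_timeout_s: float = 3.0,
    read_timeout_s: float = 15.0,
) -> socket.socket:
    """Connect with a short timeout, then allow longer waits for responses.

    Raises an OSError subclass (for example a timeout or
    ConnectionRefusedError) if the TCP connection cannot be established
    within `connect_timeout_s`.
    """
    # The timeout passed here bounds the TCP handshake.
    sock = socket.create_connection((host, port), timeout=connect_timeout_s)
    # From now on, blocking reads and writes use the longer timeout.
    sock.settimeout(read_timeout_s)
    return sock
```

Most HTTP clients and SDKs expose the same pair of settings. For example, `requests` accepts `timeout=(3, 15)` as a (connect, read) tuple, and botocore's `Config` takes `connect_timeout` and `read_timeout`.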
Verify DNS resolution and network ACLs
If you see ENOTFOUND errors, DNS resolution is failing. VPC-attached Lambda functions use the VPC DNS resolver (the .2 address in your VPC CIDR). Ensure DHCP options are configured correctly and that you have not set a custom DNS server that is unreachable. For ENETUNREACH or persistent ETIMEDOUT errors, check network ACLs on both the Lambda subnet and the destination subnet — unlike security groups, NACLs are stateless and require explicit inbound and outbound rules for both the request and the ephemeral port range used by return traffic (1024-65535).
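For the DNS part of this check, the sketch below computes the ".2" resolver address from a VPC CIDR and tests whether a hostname resolves from the current environment. Both function names are hypothetical. For a VPC with several CIDR blocks, pass the primary one.

```python
import ipaddress
import socket


def vpc_resolver_address(vpc_cidr: str) -> str:
    """Return the VPC DNS resolver address: the network base address plus two."""
    network = ipaddress.ip_network(vpc_cidr)
    return str(network.network_address + 2)


def can_resolve(hostname: str) -> bool:
    """True if the hostname resolves using the system's configured resolver."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        # A failed lookup; Node.js reports the equivalent failure as ENOTFOUND.
        return False
```

Running `can_resolve` inside the function itself, for instance in a temporary diagnostic handler, exercises the same resolver path that the failing invocations use.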