Network Connectivity Failures

What this means

Your ECS containers are failing to establish or maintain network connections to other services. These errors appear in the logs as "connection refused," "connection timed out," "no route to host," "ECONNRESET," or DNS resolution failures. Unlike application-level errors, connectivity failures indicate infrastructure or network-layer problems that prevent your code from reaching the resources it depends on.

ECS networking differs significantly between Fargate and EC2 launch types. On Fargate, each task gets its own Elastic Network Interface (ENI) in your VPC subnet. The ENI is subject to security group rules, route table entries, and subnet NACLs. On EC2 launch type with the default bridge network mode, containers share the host's network stack through Docker's port mapping. With awsvpc network mode on EC2, each task gets its own ENI just like Fargate, but the number of ENIs per instance is limited by the instance type (a t3.medium supports only 3 ENIs, for example, and one of those is the instance's own primary interface).

Connection errors that appear in multiple log windows (not just a one-off blip) usually point to a structural networking issue. Common root causes include security group rules that do not permit outbound traffic to the target port, missing NAT gateway routes for tasks in private subnets trying to reach the internet, exhausted ENI capacity on EC2 instances, or a downstream service that is genuinely down. In ECS deployments with AWS Cloud Map service discovery, stale DNS entries during rolling updates can cause intermittent connection failures as tasks try to reach deregistered IP addresses.

If your ECS tasks connect to an RDS database, ElastiCache cluster, or any other VPC resource, the security group attached to that resource must explicitly allow inbound traffic from the task's security group. This is one of the most frequently missed configurations, especially when teams create new ECS services and forget to update the database security group's inbound rules.

Detection criteria

HIGH

Connection error patterns detected in more than 2 log windows. Includes "connection refused," "connection timed out," "ECONNRESET," "ENOTFOUND," and similar network-level error signatures.
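The rule above can be sketched as a small classifier. This is a minimal illustration, not the detector's actual implementation: the function names, the representation of a log window as a list of lines, and the exact signature list (the criteria mention "similar" signatures without enumerating them) are assumptions.

```python
import re

# Assumed signature list; extend it to match the errors your runtime emits.
CONNECTION_ERROR_RE = re.compile(
    r"connection refused|connection timed out|no route to host"
    r"|ECONNREFUSED|ECONNRESET|ETIMEDOUT|ENOTFOUND",
    re.IGNORECASE,
)


def count_error_windows(windows):
    """Count log windows containing at least one connection error line.

    `windows` is a list of windows; each window is a list of log lines.
    """
    return sum(
        1
        for lines in windows
        if any(CONNECTION_ERROR_RE.search(line) for line in lines)
    )


def is_high_severity(windows, threshold=2):
    """Return True when errors appear in more than `threshold` windows."""
    return count_error_windows(windows) > threshold
```

Counting windows rather than individual lines keeps a single noisy burst from triggering the finding; only errors that recur across windows do.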

Example finding

HIGH Network Connectivity Failures

Connection errors detected in 8 of 45 log windows. Primary targets: postgres.internal:5432 (connection refused), redis.internal:6379 (connection timed out).

Error: connect ECONNREFUSED 10.0.3.42:5432

Error: getaddrinfo ENOTFOUND orders-service.local

Error: connect ETIMEDOUT 10.0.5.18:6379
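To get from raw error lines like these to the "primary targets" summary, one option is to group them by target and error code. The sketch below assumes Node.js-style messages of the form shown above; the regular expression and the function name are illustrative, and lines without a target (for example a bare "read ECONNRESET") are skipped.

```python
import re
from collections import Counter

# Matches "connect ECONNREFUSED 10.0.3.42:5432" and
# "getaddrinfo ENOTFOUND orders-service.local".
ERROR_LINE_RE = re.compile(
    r"(?:connect|getaddrinfo)\s+(?P<code>E[A-Z]+)\s+(?P<target>\S+)"
)


def summarize_targets(lines):
    """Return a Counter keyed by (target, error_code)."""
    counts = Counter()
    for line in lines:
        match = ERROR_LINE_RE.search(line)
        if match:
            counts[(match.group("target"), match.group("code"))] += 1
    return counts
```

Calling summarize_targets(lines).most_common(3) then lists the three most frequently failing targets.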

How to fix

Verify security group rules for both the ECS task and the target resource. The task's security group needs an outbound rule permitting traffic to the target's port. The target's security group needs an inbound rule permitting traffic from the task's security group (referencing it by security group ID, not by IP range). Use aws ec2 describe-security-groups --group-ids sg-task sg-target to inspect both sets of rules. For Fargate tasks, the security group is specified in the networkConfiguration block of your service definition.
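The inbound half of that check can be automated against the JSON that describe-security-groups returns. The sketch below assumes you have loaded one entry of the SecurityGroups array into a dict; the function name is hypothetical, and it deliberately checks only rules that reference a security group ID, ignoring CIDR-based rules.

```python
def allows_from_group(target_group, source_group_id, port):
    """Return True if `target_group` has an inbound TCP rule covering `port`
    that references `source_group_id`.

    `target_group` is one element of the "SecurityGroups" list from
    `aws ec2 describe-security-groups`.
    """
    for permission in target_group.get("IpPermissions", []):
        protocol = permission.get("IpProtocol")
        if protocol == "-1":
            # "-1" means all protocols and all ports.
            port_matches = True
        elif protocol == "tcp":
            low = permission.get("FromPort", 0)
            high = permission.get("ToPort", 65535)
            port_matches = low <= port <= high
        else:
            port_matches = False
        if not port_matches:
            continue
        pairs = permission.get("UserIdGroupPairs", [])
        if any(pair.get("GroupId") == source_group_id for pair in pairs):
            return True
    return False
```

A False result for the database's group and the task's group ID on port 5432 is a strong hint that the inbound rule is the missing piece.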

Check DNS resolution. If the error is ENOTFOUND, the task cannot resolve the hostname. For Cloud Map service discovery names, verify the namespace exists and the service has healthy instances registered. For private hosted zones in Route 53, ensure the VPC is associated with the hosted zone. Use ECS Exec to connect to a running task and run nslookup target-hostname from inside the container to test resolution directly.
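If the container image does not ship nslookup but does include a Python interpreter (an assumption about your image), the same resolution test can be run through the system resolver. The helper name below is illustrative.

```python
import socket


def resolve(hostname, port=None):
    """Return the sorted unique IP addresses for `hostname`.

    Returns an empty list when the resolver cannot find the name, which is
    the condition Node.js reports as ENOTFOUND.
    """
    try:
        results = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return sorted({sockaddr[0] for *_, sockaddr in results})
```

An empty list for a Cloud Map name such as orders-service.local, run from inside the task, points at the namespace, the registered instances, or the VPC's DNS settings rather than at the application.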

Confirm route table and NAT gateway configuration. Tasks in private subnets need a route to a NAT gateway for internet-bound traffic, or VPC endpoints for AWS services. Check the route table associated with your task's subnet. A common mistake is placing tasks in a public subnet without a public IP assignment (Fargate tasks in public subnets need assignPublicIp: ENABLED). For VPC endpoints, verify the endpoint policy allows the necessary API actions and that the endpoint's security group permits traffic from your task.
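The route table check can be expressed as a small function over the JSON from aws ec2 describe-route-tables. This is a sketch under the assumption that you pass in one element of the RouteTables array for the task's subnet; the function name and the returned labels are made up for illustration.

```python
def default_route_kind(route_table):
    """Classify the IPv4 default route (0.0.0.0/0) of a route table.

    Returns "nat", "igw", "blackhole", "other", or "none".
    """
    for route in route_table.get("Routes", []):
        if route.get("DestinationCidrBlock") != "0.0.0.0/0":
            continue
        if route.get("State") == "blackhole":
            # The route's target (for example a deleted NAT gateway) is gone.
            return "blackhole"
        if route.get("NatGatewayId"):
            return "nat"
        if route.get("GatewayId", "").startswith("igw-"):
            return "igw"
        return "other"
    return "none"
```

A private subnet would normally report "nat"; "none" or "blackhole" means internet-bound traffic from the task has nowhere to go, and "igw" means the subnet is public, so a Fargate task there needs a public IP.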

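Check for ENI exhaustion on EC2 launch type. Each awsvpc task on EC2 consumes one ENI, and the instance's primary interface counts against the same limit. If all ENIs on the instance are in use, no further awsvpc tasks can be placed on it. Count the interfaces currently attached with aws ec2 describe-network-interfaces --filters Name=attachment.instance-id,Values=i-xxx. Consider ENI trunking to raise the per-instance task limit. Trunking is turned on through the awsvpcTrunking account setting (aws ecs put-account-setting --name awsvpcTrunking --value enabled) and is available only on supported instance types; the ECS_ENABLE_TASK_ENI agent setting enables awsvpc task networking itself, not trunking.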
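The capacity arithmetic is simple enough to script. The sketch below assumes trunking is not enabled, that you look up the instance type's ENI limit yourself, and that the dict passed in is the parsed output of the describe-network-interfaces command filtered to one instance; the function name is hypothetical.

```python
def remaining_task_slots(max_enis, describe_output):
    """Return how many more awsvpc tasks fit on an instance without trunking.

    `max_enis` is the ENI limit for the instance type (3 for a t3.medium).
    `describe_output` is the parsed JSON from
    `aws ec2 describe-network-interfaces` filtered by attachment.instance-id;
    it includes the instance's primary interface.
    """
    attached = len(describe_output.get("NetworkInterfaces", []))
    return max(0, max_enis - attached)
```

For a t3.medium with only its primary interface attached, this returns 2: the instance can host at most two awsvpc tasks.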

Implement connection retry logic with exponential backoff. Transient network issues are inevitable in distributed systems. Your application should retry failed connections with increasing delays (e.g., 100ms, 200ms, 400ms, up to a maximum of 5 seconds) and use circuit breakers to stop hammering a down service. Many client libraries have built-in retry or reconnect options that only need to be configured (ioredis for Redis exposes a retryStrategy option, for example); check what your Postgres pool, such as pg-pool, provides before writing your own retry wrapper.
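Where the client library offers no retry option, the schedule described above can be wrapped around any connect call. The following is a minimal, language-agnostic sketch written in Python; the function names, the default of 8 attempts, and the injectable sleep parameter are assumptions made for illustration, and it omits the circuit breaker.

```python
import time


def backoff_delays(base=0.1, factor=2.0, cap=5.0, attempts=7):
    """Yield `attempts` delays in seconds: base, base*factor, ... up to `cap`."""
    delay = base
    for _ in range(attempts):
        yield min(delay, cap)
        delay *= factor


def connect_with_retry(connect, max_attempts=8, sleep=time.sleep):
    """Call `connect()` until it succeeds or `max_attempts` is reached.

    Retries on OSError, which covers ConnectionRefusedError,
    ConnectionResetError, TimeoutError, and socket.gaierror.
    The final failure is re-raised to the caller.
    """
    delays = list(backoff_delays(attempts=max_attempts - 1))
    for attempt in range(max_attempts):
        try:
            return connect()
        except OSError:
            if attempt == max_attempts - 1:
                raise
            sleep(delays[attempt])
```

A call such as connect_with_retry(lambda: socket.create_connection(("redis.internal", 6379), timeout=3)) would wait 100ms, 200ms, 400ms and so on between attempts, never more than 5 seconds. Passing sleep as a parameter keeps the function easy to test without real delays.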

Related patterns

High Container Error Rate: Percentage of log windows containing container errors

Container Timeout Errors: Timeout errors across container log windows
