High Container Error Rate

What this means

smplogs divides your CloudWatch log export into time-based windows and counts how many of those windows contain error-level messages. A high error rate means a significant fraction of your container's operational time is producing errors — not just an occasional blip, but a sustained pattern of failure across multiple time intervals.
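The windowing idea can be sketched in a few lines of Python. This is an illustration only, not smplogs' implementation: the 5-minute window size, the `"ERROR"` substring test, and the function name `window_error_rate` are all assumptions, and only windows that contain at least one log event are counted.

```python
from datetime import datetime, timedelta, timezone


def window_error_rate(events, window=timedelta(minutes=5)):
    """Return (rate, errored_windows, total_windows).

    events: iterable of (timestamp, message) pairs with timezone-aware
    datetimes. The window size and the "ERROR" substring test are
    assumptions made for this sketch.
    """
    windows = {}  # window index -> True once an error-level line is seen
    for ts, message in events:
        index = int(ts.timestamp() // window.total_seconds())
        is_error = "ERROR" in message
        windows[index] = windows.get(index, False) or is_error
    if not windows:
        return 0.0, 0, 0
    errored = sum(windows.values())
    return errored / len(windows), errored, len(windows)
```

Because the rate is computed per window, a single window with one error and a window with a thousand errors both count once.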

Error messages in ECS container logs come from multiple layers. Application-level errors (unhandled exceptions, failed API calls, database query failures) are the most common. But you will also see infrastructure-level errors from the ECS agent, the container runtime, or the awslogs log driver itself. The distinction matters because application errors require code or configuration fixes, while infrastructure errors may require changes to your task definition, cluster capacity, or networking setup.

Interpret the rate in the context of traffic. Because the rate counts affected windows rather than individual requests, 5% in a low-traffic service might represent only a handful of errors per hour, while the same 5% in a service handling thousands of requests per second can mean thousands of failed requests. A rate between 5% and 10% can be normal for services that handle unreliable upstream data, such as webhook receivers or log processors. A rate above 20% almost always indicates a systemic problem.

On ECS with a service mesh (AWS App Mesh, or a self-managed sidecar proxy such as Envoy), error rates can be misleading if the sidecar proxy logs its own errors to the same log group as the application. A failing upstream service will cause the Envoy sidecar to log 503 responses, inflating the container's apparent error rate even though the container itself is healthy. Separating proxy logs into their own log group helps isolate the signal.
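One way to separate the two streams is to give each container its own `awslogs` log group in the task definition. The sketch below builds the `containerDefinitions` portion as a Python structure in the shape expected by the ECS API; the container names, image names, log group names, and region are hypothetical placeholders.

```python
def awslogs_config(group, region, stream_prefix):
    """Build an awslogs logConfiguration block for one container."""
    return {
        "logDriver": "awslogs",
        "options": {
            "awslogs-group": group,
            "awslogs-region": region,
            "awslogs-stream-prefix": stream_prefix,
        },
    }


# Hypothetical names: the application and the proxy write to
# different log groups so their error rates can be read separately.
CONTAINER_DEFINITIONS = [
    {
        "name": "app",
        "image": "example/app:latest",
        "logConfiguration": awslogs_config("/ecs/orders/app", "us-east-1", "app"),
    },
    {
        "name": "envoy",
        "image": "example/envoy:latest",
        "logConfiguration": awslogs_config("/ecs/orders/envoy", "us-east-1", "envoy"),
    },
]
```

With this layout, an export of the application log group excludes the proxy's 503 lines entirely.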

Detection criteria

CRITICAL

Error rate exceeds 20% of log windows. More than one in five time intervals contains error-level messages.

HIGH

Error rate exceeds 10% of log windows. Consistent errors across multiple intervals suggest a persistent issue.

ELEVATED

Error rate exceeds 5% of log windows. May be acceptable for some workloads but warrants monitoring.
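The three thresholds map to a simple ordered check. The function name `classify_error_rate` is hypothetical; the cut-offs are the ones listed above, and each comparison is strict because every level is described as "exceeds".

```python
def classify_error_rate(rate):
    """Map a window error rate (0.0 to 1.0) to a severity label.

    Returns None when the rate does not exceed the lowest threshold.
    """
    if rate > 0.20:
        return "CRITICAL"
    if rate > 0.10:
        return "HIGH"
    if rate > 0.05:
        return "ELEVATED"
    return None
```

For instance, 47 affected windows out of 201 gives a rate of roughly 0.234, which falls in the CRITICAL band.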

Example finding

CRITICAL High Container Error Rate

23.4% of log windows contain error-level messages. 47 out of 201 windows affected. Dominant error signatures: database connection pool exhaustion, HTTP 503 from payment-service.

ERROR: pool exhausted - cannot acquire connection within 5000ms

ERROR: upstream service payment-service returned 503

ERROR: health check failed: connection refused on port 8080

How to fix

Use the log signature clusters to identify the top error categories. smplogs groups similar log lines into clusters. Sort by frequency to find the one or two error signatures that account for the bulk of the error rate. Fixing the most frequent error pattern often drops the overall rate below the threshold. Common culprits in ECS workloads are database connection pool exhaustion, DNS resolution failures for service discovery endpoints, and missing environment variables after a task definition update.
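If you want to reproduce a rough version of this grouping on a raw export, normalizing the variable parts of each line and counting the results is often enough. This is a minimal sketch and not the clustering smplogs performs; the function names and the normalization rules (long hex strings and digit runs replaced by placeholders) are assumptions.

```python
import re
from collections import Counter

_HEX_ID = re.compile(r"\b[0-9a-f]{8,}\b")  # request IDs, container IDs, hashes
_NUMBER = re.compile(r"\d+")               # durations, ports, status codes


def signature(line):
    """Collapse the variable parts of a log line into placeholders."""
    line = _HEX_ID.sub("<id>", line.lower())
    return _NUMBER.sub("<n>", line)


def top_signatures(lines, n=3):
    """Return the n most frequent signatures as (signature, count) pairs."""
    return Counter(signature(line) for line in lines).most_common(n)
```

Lines that differ only in a timeout value or an ID collapse into one signature, so the most frequent failure modes rise to the top of the list.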

Check your ECS health check configuration. If the container health check is too aggressive (short interval, low failure threshold), ECS will mark tasks as unhealthy and restart them, generating more errors in the process. A reasonable starting point is a 30-second interval, 5-second timeout, 3 retries, and a start period of 60 seconds to allow the application to initialize. If the service sits behind an ALB, also verify the target group health check: the two checks are independent, and either can cause task replacement.
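The starting values above correspond to the `healthCheck` block of a container definition. The sketch below expresses that block as a Python dictionary and adds a small range check. The `/health` path is an assumption, the port is taken from the example finding, and the command assumes `curl` is installed in the image; the ranges in `LIMITS` reflect the documented ECS limits as best I know them and should be verified against the current API reference.

```python
# healthCheck block for a container definition (values from the text above).
HEALTH_CHECK = {
    # Assumed endpoint; requires curl inside the container image.
    "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
    "interval": 30,     # seconds between checks
    "timeout": 5,       # seconds before a single check counts as failed
    "retries": 3,       # consecutive failures before the task is unhealthy
    "startPeriod": 60,  # grace period while the application initializes
}

# Allowed ranges (assumed from the ECS documentation; verify before relying on them).
LIMITS = {
    "interval": (5, 300),
    "timeout": (2, 60),
    "retries": (1, 10),
    "startPeriod": (0, 300),
}


def out_of_range(check):
    """Return the names of health check fields outside the allowed ranges."""
    return [
        key
        for key, (low, high) in LIMITS.items()
        if key in check and not low <= check[key] <= high
    ]
```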

Review downstream connectivity. ECS tasks in private subnets need NAT gateways or VPC endpoints to reach AWS services and external APIs. A misconfigured route table or an exhausted NAT gateway port allocation will cause connection errors that show up as a high error rate in your logs. Check the VPC flow logs for rejected traffic from your task's ENI. If you use AWS Cloud Map for service discovery, verify that the namespace and service are correctly registered and that DNS TTLs are not causing stale records.
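A quick probe run from inside the task (for example through ECS Exec) can tell a DNS failure apart from a blocked or refused TCP connection. This is a minimal sketch using only the standard library; the function name `probe` and the shape of its return value are assumptions for illustration.

```python
import socket


def probe(host, port, timeout=3.0):
    """Resolve host, then attempt a TCP connection to the first address.

    Returns a dict naming the stage that was reached ("dns" or "tcp"),
    whether it succeeded, and a short detail string.
    """
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        # Name resolution failed: check Cloud Map registration or VPC DNS.
        return {"stage": "dns", "ok": False, "detail": str(exc)}

    family, socktype, proto, _, sockaddr = infos[0]
    try:
        with socket.socket(family, socktype, proto) as sock:
            sock.settimeout(timeout)
            sock.connect(sockaddr)
    except OSError as exc:
        # Timeouts usually point at routing or security groups;
        # "connection refused" means nothing is listening on the port.
        return {"stage": "tcp", "ok": False, "detail": str(exc)}
    return {"stage": "tcp", "ok": True, "detail": f"connected to {sockaddr[0]}"}
```

A failure at the `dns` stage points toward service discovery, while a timeout at the `tcp` stage points toward route tables, NAT, or security groups.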

Scale your task count or resource allocation. If errors correlate with traffic volume, the service may be under-provisioned. Enable ECS Service Auto Scaling (which is built on Application Auto Scaling) with target tracking on CPU or memory utilization, or track a custom CloudWatch metric based on request latency or queue depth. On Fargate, new tasks usually launch quickly (often within a minute or so, depending on image size), so aggressive scale-out policies with conservative scale-in cooldowns work well.
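A target tracking policy on CPU utilization might look like the following sketch, which uses the Application Auto Scaling API through boto3. The 60% target, the cooldown values, the task limits, the policy name, and both function names are assumptions to adjust for your workload; `apply_policy` needs boto3 and AWS credentials, while `target_tracking_policy` only builds the request.

```python
def target_tracking_policy(cluster, service, target_cpu=60.0):
    """Build put_scaling_policy arguments for an ECS service."""
    return {
        "PolicyName": f"{service}-cpu-target-tracking",  # hypothetical name
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_cpu,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
            },
            "ScaleOutCooldown": 60,   # react quickly to load
            "ScaleInCooldown": 300,   # remove tasks conservatively
        },
    }


def apply_policy(cluster, service, min_tasks=2, max_tasks=10):
    """Register the service as a scalable target and attach the policy."""
    import boto3  # imported here so building the policy does not require boto3

    client = boto3.client("application-autoscaling")
    policy = target_tracking_policy(cluster, service)
    client.register_scalable_target(
        ServiceNamespace=policy["ServiceNamespace"],
        ResourceId=policy["ResourceId"],
        ScalableDimension=policy["ScalableDimension"],
        MinCapacity=min_tasks,
        MaxCapacity=max_tasks,
    )
    client.put_scaling_policy(**policy)
```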

Implement structured logging with severity levels. If your application logs errors and warnings to the same stream without structured severity fields, the error rate calculation may be inflated by warnings that are not actionable. Adopt JSON structured logging with a level field so that smplogs (and CloudWatch Logs Insights) can accurately distinguish errors from warnings and info messages.
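For a Python service, a small formatter on top of the standard `logging` module is enough to emit one JSON object per line with a `level` field. This is a minimal sketch: the class name `JsonFormatter`, the helper `configure_logging`, and the field names other than `level` are assumptions, and other languages have equivalent libraries.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format each log record as a single-line JSON object."""

    def format(self, record):
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,  # e.g. "ERROR", "WARNING", "INFO"
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def configure_logging(level=logging.INFO):
    """Send JSON-formatted records from the root logger to stderr."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    root = logging.getLogger()
    root.handlers = [handler]
    root.setLevel(level)
```

After calling `configure_logging()`, a call such as `logging.getLogger("db").warning("pool at %d%%", 80)` is emitted with `"level": "WARNING"`, so it can be filtered out of the error count.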

Related patterns

Network Connectivity Failures

Connection errors to downstream services

Container Timeout Errors

Timeout errors across container log windows
