CRITICAL

Elevated Error Rate

Percentage of invocations failing above the 1% threshold

What this means

An elevated error rate means a significant percentage of your Lambda invocations are completing with an error status instead of returning a successful response. Every invocation that throws an unhandled exception, reports an error back to the runtime, or crashes due to an out-of-memory condition counts toward your error rate. When this rate climbs above 1%, it signals a systemic issue rather than an occasional transient blip.

Lambda errors fall into two broad categories: function errors and runtime errors. Function errors occur when your code throws an exception or reports an error to the runtime (for example, by rejecting the handler's Promise in Node.js). Runtime errors happen at a lower level, where the Lambda execution environment itself fails, typically due to memory exhaustion, a missing handler export, or a segmentation fault in a native dependency. Both types appear in CloudWatch as entries with an ERROR status in the REPORT line, but they require different remediation strategies.

Detection criteria

smplogs triggers this finding when:

  • CRITICAL — Error rate exceeds 20% of total invocations
  • HIGH — Error rate exceeds 10% of total invocations
  • ELEVATED — Error rate exceeds 5% of total invocations
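To make the thresholds concrete, here is a small Python sketch that maps an error count to the severity levels in the list. It assumes strict "exceeds" comparisons, as the list states, and is only an illustration, not smplogs' actual detection code. The function name `classify_error_rate` is hypothetical.

```python
from typing import Optional


def classify_error_rate(errors: int, invocations: int) -> Optional[str]:
    """Map an error count to a severity using the >20% / >10% / >5% thresholds.

    Returns None when there are no invocations or the rate does not exceed 5%.
    """
    if invocations <= 0:
        return None
    # Multiply before dividing so round thresholds (e.g. 5 of 100) compare exactly.
    rate = errors * 100 / invocations
    if rate > 20:
        return "CRITICAL"
    if rate > 10:
        return "HIGH"
    if rate > 5:
        return "ELEVATED"
    return None


# The numbers from the example finding: 247 errors out of 1,055 invocations.
rate = 247 * 100 / 1055
print(f"{rate:.1f}%", classify_error_rate(247, 1055))  # 23.4% CRITICAL
```

A rate of exactly 20% does not "exceed" 20%, so it lands in HIGH under this reading.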

Example finding

CRITICAL · 247 invocations · Elevated Error Rate

23.4% of invocations returned errors (247 of 1,055). Top error: "Cannot read properties of undefined (reading 'Items')" appeared in 189 invocations.

-> Investigate error patterns. Add error handling or retry logic for transient failures.

How to fix

Identify the dominant error pattern

Before fixing anything, find which error message accounts for the majority of failures. A 20% error rate caused by a single TypeError is a very different problem from a 20% rate spread across dozens of distinct exceptions. Focus on the top contributor first — fixing one root cause often drops the overall rate dramatically.
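A minimal Python sketch of this triage step, assuming you have already extracted one error message per failed invocation from your logs (that extraction step is not shown, and the "some other error" filler below is a hypothetical placeholder). It counts messages with `collections.Counter` and reports the top one with its share of all failures.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def top_error(messages: Iterable[str]) -> Optional[Tuple[str, int, float]]:
    """Return (message, count, share_of_failures) for the most frequent error."""
    counts = Counter(messages)
    if not counts:
        return None
    message, count = counts.most_common(1)[0]
    share = count / sum(counts.values())
    return message, count, share


# Reproduces the example finding: 189 of 247 failures share one message.
errors = ["Cannot read properties of undefined (reading 'Items')"] * 189
errors += ["some other error"] * 58
msg, count, share = top_error(errors)
print(f"{count} of {len(errors)} failures ({share:.1%}): {msg}")
# 189 of 247 failures (76.5%): Cannot read properties of undefined (reading 'Items')
```

With roughly three quarters of failures coming from one message, fixing that single code path would remove most of the error rate.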

Add structured error handling with try/catch

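Wrap your handler logic in try/catch blocks and return meaningful error responses instead of letting exceptions propagate to the runtime. In Node.js, catch both synchronous exceptions and rejected Promises (for example, by awaiting them inside the try block). In Python, use a broad except clause at the handler level that logs the full traceback before returning a structured error. Because the handler then returns normally, Lambda no longer records the invocation as a failure. For event sources that depend on a failed invocation to trigger a retry, such as SQS or asynchronous invocations, log the error and re-raise it rather than swallowing it.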
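A minimal Python sketch of a handler-level guard, assuming an API Gateway proxy integration (so a returned `statusCode: 500` object is an HTTP error to the client but not a Lambda function error). The `process` function is a hypothetical stand-in for your business logic. `logger.exception` logs at ERROR level and appends the current traceback.

```python
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def process(event):
    """Hypothetical business logic: raises KeyError if 'Items' is missing."""
    return {"items": event["Items"]}


def handler(event, context):
    try:
        result = process(event)
        return {"statusCode": 200, "body": json.dumps(result)}
    except Exception:
        # Log the full traceback so the root cause still reaches CloudWatch Logs.
        logger.exception("Unhandled error while processing event")
        # Return a structured error instead of letting the exception escape.
        return {
            "statusCode": 500,
            "body": json.dumps({"error": "Internal server error"}),
        }
```

Catching `Exception` rather than using a bare `except:` leaves `KeyboardInterrupt` and `SystemExit` alone.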

Implement retry logic for transient failures

If the errors come from downstream services (DynamoDB throttling, HTTP 503s from an API, or SDK timeout errors), add exponential backoff with jitter directly in your function code. The AWS SDK for JavaScript v3 has built-in retry configuration: set maxAttempts to 3 to 5 and use the standard retry mode. For non-AWS services, use a library such as p-retry for Node.js or tenacity for Python.
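To show what exponential backoff with jitter does, here is a stdlib-only Python sketch using the "full jitter" variant: each wait is a random value between zero and a capped exponential delay. `TransientError` and `retry_with_backoff` are hypothetical names; in practice tenacity covers this, and boto3 clients accept `botocore.config.Config(retries={"max_attempts": 5, "mode": "standard"})` for AWS calls.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (throttling, HTTP 503, timeouts)."""


def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
    retry_on: Tuple[Type[BaseException], ...] = (TransientError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying errors in retry_on with exponential backoff and full jitter.

    Other exceptions propagate immediately; the last retryable error is
    re-raised once max_attempts calls have failed.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 1
    while True:
        try:
            return fn()
        except retry_on:
            if attempt >= max_attempts:
                raise
            # Exponential growth (base, 2*base, 4*base, ...) capped at max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: wait a random time in [0, cap] so concurrent
            # invocations do not retry in lockstep.
            sleep(random.uniform(0, cap))
            attempt += 1
```

Keep the worst-case total wait well inside the function's configured timeout; otherwise the retries themselves turn into timeout errors. The injectable `sleep` parameter exists so the logic can be tested without real delays.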

Validate input before processing

Many Lambda error spikes are caused by unexpected event payloads: a missing field from API Gateway, a malformed SQS message body, or a null value in a DynamoDB stream record. Add input validation at the top of your handler to catch these early and return a 400-level response instead of crashing midway through processing. Libraries like Zod, Joi, or Pydantic make this straightforward.
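A stdlib-only Python sketch of early validation for an API Gateway proxy event. The order schema (`orderId`, `quantity`) and the names `parse_order` and `ValidationError` are hypothetical. In practice a Pydantic model would replace the hand-written checks; they are spelled out here to keep the example dependency-free.

```python
import json
from typing import Any, Dict


class ValidationError(ValueError):
    """Raised when an incoming payload does not match the expected shape."""


def parse_order(body: str) -> Dict[str, Any]:
    """Validate a hypothetical order message: {"orderId": str, "quantity": int > 0}."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError as exc:
        raise ValidationError(f"body is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValidationError("body must be a JSON object")
    order_id = data.get("orderId")
    if not isinstance(order_id, str) or not order_id:
        raise ValidationError("orderId must be a non-empty string")
    quantity = data.get("quantity")
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(quantity, int) or isinstance(quantity, bool) or quantity <= 0:
        raise ValidationError("quantity must be a positive integer")
    return {"orderId": order_id, "quantity": quantity}


def handler(event, context):
    """Reject malformed input with a 400 before doing any real work."""
    try:
        # API Gateway may send a null body, so fall back to an empty string.
        order = parse_order(event.get("body") or "")
    except ValidationError as exc:
        return {"statusCode": 400, "body": json.dumps({"error": str(exc)})}
    return {"statusCode": 200, "body": json.dumps(order)}
```

A bad payload now produces a clear 400 with a reason instead of a TypeError deep inside the handler that counts toward the error rate.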

Drop your CloudWatch export into smplogs to check for this.
