Upstream Service Failures

What this means

Upstream service failures indicate that your Lambda function successfully connected to an API or service it depends on, but that service returned an error response. These are typically HTTP 5xx errors (500 Internal Server Error, 502 Bad Gateway, 503 Service Unavailable, 504 Gateway Timeout) from REST APIs, gRPC status codes indicating server errors, or application-level error responses from services like Stripe, Twilio, Salesforce, or internal microservices. Unlike connection failures, the network is working; it is the remote service itself that is unhealthy.

These errors are outside your direct control, which makes them both inevitable and frustrating. Every external dependency will eventually return errors — it is not a question of if, but when and how often. A payment processor may return 503 during a deployment, an internal user service may return 500 due to a database deadlock, or a partner API may rate-limit your requests with a 429 response. What matters is whether your Lambda function handles these failures gracefully or lets them turn into a user-visible outage.

Grouping these errors by originating service helps you see which dependency is the least reliable. If 80% of your upstream errors come from a single service, that gives you a clear target for mitigation.
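As a rough illustration of that grouping, the sketch below counts 5xx responses per service and computes each service's share of the total. It assumes you have already extracted (service, status code) pairs from your logs; the function name and the input shape are hypothetical and are not part of smplogs.

```python
from collections import Counter


def failures_by_service(errors):
    """Summarize 5xx errors per service.

    errors: iterable of (service, status_code) pairs (assumed input shape).
    Returns a list of (service, count, share_of_total), most frequent first.
    """
    counts = Counter(service for service, status in errors if 500 <= status <= 599)
    total = sum(counts.values())
    # An empty input yields an empty list, so there is no division by zero.
    return [(service, n, n / total) for service, n in counts.most_common()]
```

A service whose share is far above the others is usually the first candidate for retries, a circuit breaker, or a fallback.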

Detection criteria

smplogs triggers this finding when:

● ELEVATED: At least 1 invocation contains an upstream service error (HTTP 5xx, service error codes)

Example finding

ELEVATED · 12 invocations · Upstream Service Failures

12 invocations received error responses from downstream services. Primary source: "POST https://api.stripe.com/v1/charges" returned HTTP 503 in 8 invocations. Secondary: internal user-service returned HTTP 500 in 4 invocations.

-> Implement circuit breaker and retry with exponential backoff. Add fallbacks.

How to fix

Implement retry with exponential backoff and jitter

Many upstream 5xx errors are transient and clear within seconds. Retry the request with exponential backoff (100ms, 200ms, 400ms, 800ms) plus random jitter to avoid thundering herd problems when many Lambda instances retry simultaneously. Retry only on 5xx status codes and 429 (rate limited); do not retry other 4xx client errors, which indicate a problem with your request. For 429 responses, honor the Retry-After header if present. Libraries like axios-retry and p-retry (Node.js) or tenacity (Python) handle this pattern cleanly.
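If you prefer not to add a dependency, the pattern can be sketched with the Python standard library alone. This is a minimal illustration, not a replacement for the libraries above: the names call_with_retry and backoff_delay are hypothetical, the request is assumed to raise urllib.error.HTTPError on failure, and only the integer-seconds form of Retry-After is handled.

```python
import random
import time
import urllib.error

# Status codes worth retrying: rate limiting plus common server errors.
RETRYABLE = {429, 500, 502, 503, 504}


def backoff_delay(attempt, base=0.1, cap=2.0):
    """Delay in seconds for a 0-based attempt: 0.1, 0.2, 0.4, ... plus up to `base` of jitter."""
    return min(cap, base * (2 ** attempt)) + random.uniform(0, base)


def call_with_retry(send, max_attempts=4, sleep=time.sleep):
    """Call send() until it succeeds, hits a non-retryable status, or runs out of attempts.

    send is a zero-argument callable that returns a response or raises HTTPError,
    for example: lambda: urllib.request.urlopen(request, timeout=3)
    """
    for attempt in range(max_attempts):
        try:
            return send()
        except urllib.error.HTTPError as err:
            is_last_attempt = attempt == max_attempts - 1
            if err.code not in RETRYABLE or is_last_attempt:
                raise
            delay = backoff_delay(attempt)
            # Honor Retry-After on 429 when it is given as whole seconds.
            retry_after = err.headers.get("Retry-After") if err.headers else None
            if err.code == 429 and retry_after and retry_after.isdigit():
                delay = float(retry_after)
            sleep(delay)
```

Keep max_attempts small enough that the worst-case total delay fits comfortably inside the function's configured timeout.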

Add a circuit breaker

A circuit breaker tracks the failure rate of a downstream service and stops making calls when the rate exceeds a threshold. When open, it returns a cached or default response immediately instead of waiting for a call that will likely fail. This protects your function from wasting time and concurrency on a degraded service. In Lambda, circuit breaker state needs to be shared across invocations — store it in a module-scoped variable (works within a single execution environment) or in a shared store like ElastiCache or DynamoDB for cross-environment coordination. The opossum library (Node.js) provides a well-tested implementation.
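To make the mechanics concrete, here is a minimal sketch of a breaker held in a module-scoped variable. It is a simplification under stated assumptions: it counts consecutive failures rather than a failure rate, it treats any exception as a failure, and the class, its parameters, and the payments_breaker name are all hypothetical. Its state is visible only within one execution environment.

```python
import time


class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures, retries after `reset_timeout` seconds."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()  # open: skip the upstream call entirely
            # Timeout elapsed (half-open): fall through and allow one trial call.
        try:
            result = func()
        except Exception:
            self.failures += 1
            # A failed trial call, or reaching the threshold, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


# Module scope: reused across invocations in the same execution environment only.
payments_breaker = CircuitBreaker()
```

Inside a handler this would be used as payments_breaker.call(charge_card, use_default), where both callables are yours to define. Sharing the open/closed state across environments would mean replacing the instance attributes with reads and writes to a shared store.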

Design fallback responses

Not every upstream failure needs to be a user-facing error. Consider what your function can return when a dependency is unavailable. Can you serve cached data? Can you return a degraded response with a subset of information? For example, if a product recommendation service is down, show popular items instead of a blank section. If a currency conversion API fails, use the last known exchange rate. Define a fallback strategy for each dependency based on its criticality — some are essential (payment processing), while others are nice-to-have (analytics, recommendations).
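The exchange-rate case above might look like the following sketch. The function and variable names are hypothetical, fetch_rate stands in for whatever client calls your currency API, and the cache lives in module scope, so it is empty after a cold start.

```python
# Module scope: last successful rates, kept for the life of the execution environment.
_last_known_rates = {}


def get_exchange_rate(pair, fetch_rate):
    """Return (rate, is_stale) for a currency pair such as "USD/EUR".

    fetch_rate is a callable that queries the upstream API and raises on failure.
    """
    try:
        rate = fetch_rate(pair)
    except Exception:
        if pair in _last_known_rates:
            # Degraded response: serve the last known value and flag it as stale.
            return _last_known_rates[pair], True
        raise  # Nothing cached for this pair, so surface the failure.
    _last_known_rates[pair] = rate
    return rate, False
```

Returning the is_stale flag lets the caller decide whether a stale value is acceptable for that request, which matters more for essential dependencies than for nice-to-have ones.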

Use asynchronous processing for non-critical calls

If the upstream call is not essential to the response your function returns (sending an email, updating analytics, logging to a third-party service), decouple it by sending a message to SQS or EventBridge instead of making the call synchronously. This removes the upstream service from your critical path entirely. The message is processed by a separate Lambda function that can retry independently without affecting user-facing latency. This pattern eliminates an entire category of upstream failures from your synchronous request flow.
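A sketch of the SQS variant, assuming a boto3 SQS client created once at module scope and a queue URL supplied by your own configuration. The function name, message shape, and event type are hypothetical; send_message with QueueUrl and MessageBody is the standard boto3 call.

```python
import json

# In a real function the client would be created once at module scope:
#   import boto3
#   sqs_client = boto3.client("sqs")


def enqueue_side_effect(sqs_client, queue_url, event_type, payload):
    """Queue a non-critical task instead of calling the upstream service inline.

    sqs_client is any object exposing boto3's SQS send_message(QueueUrl=..., MessageBody=...).
    """
    body = json.dumps({"type": event_type, "payload": payload})
    return sqs_client.send_message(QueueUrl=queue_url, MessageBody=body)
```

The user-facing handler then calls something like enqueue_side_effect(sqs_client, queue_url, "welcome_email", {"user_id": 42}) and returns immediately, while a separate consumer function performs the upstream call and retries on its own schedule.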

Connection Failures

When the service is unreachable at the network level

Elevated Error Rate

Upstream failures contribute to overall error rate
