Degraded API Endpoint

What this means

A degraded endpoint is a specific API Gateway resource-method combination (e.g., POST /api/orders or GET /api/users/{id}) where the error rate exceeds 20% across a meaningful sample of requests. While aggregate API-wide error metrics might look acceptable, individual endpoints can be completely broken without moving the overall numbers much. If your API has 50 endpoints with roughly equal traffic and one of them is returning errors on every request, the aggregate 5xx rate is only about 2% -- well within normal bounds -- but that endpoint is effectively down.

This pattern often goes undetected by standard CloudWatch alarms that monitor aggregate metrics. A developer deploys a change that breaks the /api/payments/webhook endpoint, but since that endpoint receives relatively few requests compared to the rest of the API, the overall error rate barely moves. Meanwhile, payment confirmations are silently failing, orders are stuck in "pending" state, and customers are not receiving their purchases. Per-endpoint monitoring is the only way to catch these localized failures.

The detection requires a minimum of 5 requests to the endpoint to avoid false positives from low-traffic routes where a single error could produce a misleadingly high error rate. When an endpoint crosses the 20% error threshold with sufficient request volume, it strongly indicates a systematic problem: a code bug, a broken dependency, a misconfigured integration, or a permissions issue specific to that endpoint's Lambda function or backend resource.

Detection criteria

HIGH

A specific endpoint has an error rate above 20% across at least 5 requests.
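
The criterion can be sketched in Python as follows. This is a minimal illustration, not the detector's actual implementation. It assumes each request is available as a (method, resource path, status code) tuple and that "error" means a 5xx status; neither assumption is stated above, and the function and parameter names are hypothetical.

```python
from collections import defaultdict


def find_degraded_endpoints(requests, threshold=0.20, min_requests=5):
    """Return {(method, path): (errors, total, rate)} for degraded endpoints.

    `requests` is an iterable of (method, path, status) tuples.
    Assumption: a status in the 500-599 range counts as an error.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for method, path, status in requests:
        key = (method, path)
        totals[key] += 1
        if 500 <= status <= 599:
            errors[key] += 1

    degraded = {}
    for key, total in totals.items():
        rate = errors[key] / total
        # Flag only endpoints with enough traffic and a rate strictly above the threshold.
        if total >= min_requests and rate > threshold:
            degraded[key] = (errors[key], total, rate)
    return degraded


# Made-up sample: 49 healthy endpoints and one fully broken one, 100 requests each.
sample = [("GET", f"/api/items/{i}", 200) for i in range(49) for _ in range(100)]
sample += [("POST", "/api/orders", 502)] * 100

aggregate_rate = sum(1 for _, _, status in sample if status >= 500) / len(sample)
degraded = find_degraded_endpoints(sample)
```

On this sample the aggregate error rate is 2%, while POST /api/orders is flagged with every one of its 100 requests failing.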

Example finding

HIGH Degraded API Endpoint

Endpoint POST /api/orders has a 64.3% error rate (45 errors out of 70 requests). All errors are HTTP 502 Bad Gateway, indicating the Lambda integration is returning malformed responses or throwing unhandled exceptions.

Recommendation: Investigate the endpoint's backend for misconfigurations or integration failures. Check the Lambda function associated with POST /api/orders for recent deployments or dependency changes.

How to fix

Isolate the error type for this endpoint

Filter your API Gateway access logs to the specific endpoint (resource path + HTTP method) and examine the status code distribution. Are the errors all 502s (Lambda integration failures), 504s (timeouts), 500s (API Gateway internal errors), or a mix? Each type has a distinct root cause. Also check whether the errors started at a specific time -- correlate with recent deployments using aws lambda list-versions-by-function to see if a code change was deployed around when errors began.
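
A rough sketch of that filtering step in Python is shown below. It assumes the access logs are written as one JSON object per line with fields named httpMethod, resourcePath, and status (populated from $context.httpMethod, $context.resourcePath, and $context.status). Field names depend entirely on the log format you configured, so adjust them to match; the function name is hypothetical.

```python
import json
from collections import Counter


def status_distribution(log_lines, method, resource_path):
    """Count status codes for one endpoint in JSON-formatted access log lines.

    Assumption: each line is a JSON object with "httpMethod",
    "resourcePath", and "status" keys.
    """
    counts = Counter()
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(entry, dict):
            continue  # skip JSON values that are not objects
        if entry.get("httpMethod") == method and entry.get("resourcePath") == resource_path:
            # Normalize to a string: the status may be logged quoted or unquoted.
            counts[str(entry.get("status"))] += 1
    return counts
```

Calling status_distribution(lines, "POST", "/api/orders") returns a Counter keyed by status code, which makes it easy to see whether the failures are all one type or a mix.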

Check the endpoint's Lambda function logs

Open the CloudWatch log group for the Lambda function that backs this endpoint. Search for ERROR, Task timed out, or Runtime.UnhandledPromiseRejection. Lambda functions that share code across endpoints may have a bug that only manifests for certain input shapes -- the broken endpoint may be the only one that exercises a particular code path. Check the event object for unusual request parameters, large payloads, or missing required fields that cause the handler to crash.
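
If you have exported the log events to text, a small script can tally the failure markers mentioned above. This is only a sketch: the labels and the function name are hypothetical, and the patterns cover just the three search terms named in this section.

```python
import re
from collections import Counter

# Markers named in the text above; more specific patterns come first so a line
# that contains both "ERROR" and a specific marker is counted under the latter.
PATTERNS = {
    "timeout": re.compile(r"Task timed out"),
    "unhandled_rejection": re.compile(r"Runtime\.UnhandledPromiseRejection"),
    "error": re.compile(r"\bERROR\b"),
}


def classify_log_lines(lines):
    """Count log lines by the first failure marker they match."""
    counts = Counter()
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
                break  # count each line once
    return counts
```

A result dominated by "timeout" points toward a slow dependency or an undersized timeout, whereas a spike in "error" or "unhandled_rejection" points toward a handler bug.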

Verify the integration configuration

In the API Gateway console, inspect the endpoint's integration request settings. Confirm the Lambda function ARN or HTTP URL is correct and points to the intended target. Check that the request mapping template (for REST API non-proxy integrations) correctly transforms the incoming request. A common issue is a misconfigured stage variable in the integration URI: ${stageVariables.lambdaAlias} referencing an alias that no longer exists after a deployment pipeline change. Also verify the integration's HTTP method: Lambda integrations must use POST regardless of the method clients call, while HTTP integrations should use the method the backend expects.

Roll back the last deployment

If the endpoint was working before a recent change, the fastest remediation is to roll back. For Lambda, repoint the alias to the previous version: aws lambda update-alias --function-name <name> --name prod --function-version <previous-version>. For API Gateway configuration changes, point the stage back at the previous deployment. If you use canary deployments, set the canary traffic percentage to 0% so that all requests go back to the base deployment. Investigate the root cause after service is restored -- do not debug in production while the endpoint is down.
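
The alias rollback can also be scripted. The sketch below assumes a boto3 Lambda client is passed in (for example boto3.client("lambda")) and that "previous version" means the highest published version number below the one the alias currently points to. The helper names are hypothetical, and only the first page of list_versions_by_function results is read, so a function with many versions would need pagination.

```python
def previous_version(versions, current):
    """Return the highest numeric version below `current`, or None."""
    if not current.isdigit():
        return None  # e.g. the alias points at $LATEST
    older = sorted(int(v) for v in versions if v.isdigit() and int(v) < int(current))
    return str(older[-1]) if older else None


def roll_back_alias(lambda_client, function_name, alias="prod"):
    """Repoint `alias` to the version published before the current one."""
    current = lambda_client.get_alias(
        FunctionName=function_name, Name=alias
    )["FunctionVersion"]
    # Assumption: the first page of results contains the version we need.
    listed = lambda_client.list_versions_by_function(FunctionName=function_name)
    versions = [v["Version"] for v in listed["Versions"]]

    target = previous_version(versions, current)
    if target is None:
        raise RuntimeError(f"No earlier version than {current} to roll back to")

    lambda_client.update_alias(
        FunctionName=function_name, Name=alias, FunctionVersion=target
    )
    return target
```

Passing the client in as a parameter keeps the selection logic easy to test without AWS credentials.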

Add per-endpoint CloudWatch alarms

Prevent future blind spots by creating CloudWatch alarms for each critical endpoint. Use the API Gateway metric 5XXError with the ApiName, Stage, Resource, and Method dimensions; API Gateway only publishes these per-method metrics when detailed CloudWatch metrics are enabled on the stage. Alarm on the Average statistic, which represents the error rate, and fire when it exceeds 10% over a 5-minute period that contains at least 10 requests. Route alarms to SNS topics that trigger PagerDuty or Slack notifications. For APIs with many endpoints, use CloudWatch Contributor Insights rules on the access log group to automatically surface the highest-error resources without manually creating individual alarms. See our 502 errors guide for alarm setup examples.
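
As a starting point, the helper below builds the arguments for a single per-endpoint alarm. It is a simplified sketch: the function name, alarm naming scheme, and defaults are hypothetical, and it omits the minimum-request condition, which would need a metric math expression. It also includes the Stage dimension, which API Gateway publishes alongside ApiName, Resource, and Method for per-method metrics.

```python
def endpoint_alarm_params(api_name, stage, resource, method, sns_topic_arn,
                          threshold=0.10, period_seconds=300):
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": f"{api_name}-{stage}-{method}-{resource}-5xx-rate",
        "Namespace": "AWS/ApiGateway",
        "MetricName": "5XXError",
        "Dimensions": [
            {"Name": "ApiName", "Value": api_name},
            {"Name": "Stage", "Value": stage},
            {"Name": "Resource", "Value": resource},
            {"Name": "Method", "Value": method},
        ],
        # The Average of 5XXError is the fraction of requests that failed.
        "Statistic": "Average",
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        # Periods with no traffic should not trigger the alarm.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   params = endpoint_alarm_params("orders-api", "prod", "/api/orders", "POST", topic_arn)
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Keeping the parameters in a plain dictionary lets you generate one alarm per critical endpoint from a list and review the settings before anything is created.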

High Server Error Rate (5xx) — degraded endpoints contribute to the aggregate 5xx rate
Slow API Endpoint — endpoints can be both slow and error-prone simultaneously
