Intermediate · 8 min read · Updated May 1, 2026

Retries and Timeouts: Testing APIs That Live on Unreliable Networks

The network is unreliable. Here's how clients should retry, how servers should behave, and how to test both.

The three failure modes

The network between client and server can fail in three ways:

  1. Request lost — client sent, server never received. Retrying is safe.
  2. Response lost — server received and processed, response never arrived. Retrying is not safe for non-idempotent operations.
  3. Slow server — server is processing, just slowly. Waiting is safe; retrying might cause duplicates.

From the client's perspective, all three look the same: a timeout. That's the core problem.

Timeout hierarchy

Three timeouts to set, in increasing duration:

  • Connection timeout — time to open the TCP connection. Short: 1–3 seconds. If this takes longer, DNS or routing is broken.
  • Request timeout — time from connection open to first byte of response. Medium: 5–30 seconds, depends on operation.
  • Overall timeout — wall-clock budget for the entire call including retries. Bounded by user patience: 30–60 seconds for user-facing, minutes for batch jobs.

If any of these is missing, a single slow call can hang your system indefinitely. Unbounded waits like this are a common cause of cascading outages.
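One way to keep the three budgets consistent is to derive each attempt's timeouts from whatever remains of the overall budget. The sketch below is illustrative only: `Deadline`, `attempt_timeouts`, and the default values are assumed names and numbers, not part of any HTTP library.

```python
import time


class Deadline:
    """Tracks the overall wall-clock budget for one logical call."""

    def __init__(self, budget_seconds, clock=time.monotonic):
        # A monotonic clock is used so wall-clock adjustments cannot
        # stretch or shrink the budget. `clock` is injectable for tests.
        self._clock = clock
        self._expires_at = clock() + budget_seconds

    def remaining(self):
        # Never negative: zero means the budget is spent.
        return max(0.0, self._expires_at - self._clock())

    def expired(self):
        return self.remaining() <= 0.0


def attempt_timeouts(deadline, connect_timeout=3.0, request_timeout=30.0):
    """Return (connect, request) timeouts clamped to the remaining budget.

    Raises TimeoutError when the overall budget is already spent, so the
    caller stops retrying instead of starting a call it cannot finish.
    """
    remaining = deadline.remaining()
    if remaining <= 0.0:
        raise TimeoutError("overall timeout exceeded")
    return min(connect_timeout, remaining), min(request_timeout, remaining)
```

The returned pair has the (connect, read) shape that some HTTP clients accept as a timeout argument, so each retry can call `attempt_timeouts(deadline)` and pass the result through.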

Retry strategy: exponential backoff with jitter

Don't retry immediately. Don't retry at the same interval.

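import random
import time

class RetriableError(Exception):
    """Raised by func for failures worth retrying (timeouts, 503s, and so on)."""

def retry_with_backoff(func, max_retries=5):
    # max_retries counts total attempts, including the first one.
    for attempt in range(max_retries):
        try:
            return func()
        except RetriableError:
            # Out of attempts: surface the last error to the caller.
            if attempt == max_retries - 1:
                raise
            # Exponential delay capped at 60s, plus up to 1s of jitter.
            delay = min(60, 2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)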

Why each piece:

  • Exponential (1s, 2s, 4s, 8s): gives the server time to recover. Fixed-interval retries (1s, 1s, 1s) don't.
  • Capped, min(60, 2 ** attempt): without the cap, the delay after ten doublings would be 1,024 seconds, about 17 minutes.
  • Jitter, random.uniform(0, 1): if 10,000 clients all time out simultaneously, without jitter they all retry at the same instant. With jitter, they spread out.
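To see what the cap and the jitter do to the actual numbers, the following sketch computes the delay schedule without sleeping. `backoff_delay` and `schedule` are hypothetical helpers that mirror the formula in the retry loop; the seeded generator is there only to make the output reproducible.

```python
import random


def backoff_delay(attempt, cap=60.0, rng=random):
    """Delay before the retry that follows failed attempt `attempt` (0-based)."""
    base = min(cap, 2 ** attempt)      # exponential growth, capped
    return base + rng.uniform(0, 1)    # plus up to one second of jitter


def schedule(attempts, cap=60.0, seed=None):
    """Return the delays for the first `attempts` retries."""
    rng = random.Random(seed)
    return [backoff_delay(n, cap=cap, rng=rng) for n in range(attempts)]


if __name__ == "__main__":
    for n, delay in enumerate(schedule(8, seed=42)):
        print(f"retry {n + 1}: wait {delay:.2f}s")
```

From the seventh retry onward (attempt index 6, where 2 ** 6 = 64), the base delay stays at the 60-second cap and only the jitter varies.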

Which errors to retry

Not every error is retriable.

Status                      | Retry?                    | Why
408 Request Timeout         | Yes                       | Server says "try again"
429 Too Many Requests       | Yes (respect Retry-After) | Rate limited
500 Internal Server Error   | Sometimes                 | Could be transient
502 Bad Gateway             | Yes                       | Upstream flaky
503 Service Unavailable     | Yes (respect Retry-After) | Server explicitly says so
504 Gateway Timeout         | Yes                       | Upstream timed out
4xx (other)                 | No                        | Client error, retrying won't help
Network error (no response) | Yes, if idempotent        | See below
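The table can be turned into a small decision function. This is a sketch under assumptions: `should_retry` and `retry_after_seconds` are hypothetical names, and treating "Sometimes" for 500 as "only when a repeat is safe" is one possible policy, not the only one. Retry-After is parsed in both of its forms, delay-seconds and HTTP-date.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

ALWAYS_RETRY = {408, 429, 502, 503, 504}
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS"}


def should_retry(status, method, has_idempotency_key=False):
    """Decide whether to retry. `status` is None when no response arrived."""
    safe_to_repeat = method.upper() in IDEMPOTENT_METHODS or has_idempotency_key
    if status is None:
        return safe_to_repeat          # network error: only if a repeat is safe
    if status in ALWAYS_RETRY:
        return True
    if status == 500:
        # "Sometimes" in the table; this sketch retries only safe repeats.
        return safe_to_repeat
    return False                       # other 4xx and everything else


def retry_after_seconds(header_value, now=None):
    """Parse a Retry-After value into seconds, or None if absent or unparseable."""
    if header_value is None:
        return None
    value = header_value.strip()
    if value.isdigit():
        return float(value)            # delay-seconds form
    try:
        when = parsedate_to_datetime(value)   # HTTP-date form
    except (TypeError, ValueError):
        return None                    # fall back to normal backoff
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

A client would typically wait for the larger of its own backoff delay and the parsed Retry-After value.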

Idempotency keys

For POST and other non-idempotent methods, retries risk duplicates. The solution: idempotency keys.

The client generates a unique key (UUID) for each logical operation and sends it in a header:

POST /api/v1/orders
Idempotency-Key: f47ac10b-58cc-4372-a567-0e02b2c3d479

{ "amount": 100 }

Server behavior:

  • First request with this key → process normally, cache {key → response}.
  • Second request with the same key → return the cached response, do not re-process.

POST /api/v1/orders
Create an order with an idempotency key so that retries won't create a duplicate.
curl -X POST 'https://demo.totalshiftleft.ai/api/v1/orders' \
  -H 'Content-Type: application/json' \
  -d '{"idempotency_key":"test-123","amount":100}'

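Send the same body twice. Because the idempotency_key matches, the second call returns the same order ID as the first, and no duplicate is created. In this example the key travels in the JSON body rather than in an Idempotency-Key header; the server-side behavior is the same.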
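The server behavior described above can be sketched with an in-memory cache. Everything here is illustrative: `IdempotentOrders`, the 201 status, and the dictionary-backed store are assumptions, and a real service would need a shared, persistent store. The body fingerprint lets the server tell a genuine retry from a reused key with a different payload.

```python
import hashlib
import json
import uuid


class IdempotentOrders:
    """In-memory stand-in for an order service with an idempotency cache."""

    def __init__(self):
        self._cache = {}    # key -> (body fingerprint, response)
        self.orders = []    # recorded side effects, so tests can count them

    @staticmethod
    def _fingerprint(body):
        # Canonical JSON so key order inside the body does not matter.
        canonical = json.dumps(body, sort_keys=True).encode()
        return hashlib.sha256(canonical).hexdigest()

    def create_order(self, idempotency_key, body):
        fingerprint = self._fingerprint(body)
        cached = self._cache.get(idempotency_key)
        if cached is not None:
            cached_fingerprint, cached_response = cached
            if cached_fingerprint != fingerprint:
                # Same key, different body: refuse rather than guess.
                return 422, {"error": "IDEMPOTENCY_CONFLICT"}
            return cached_response      # replay, no new side effect
        order = {"id": str(uuid.uuid4()), "amount": body["amount"]}
        self.orders.append(order)
        response = (201, order)
        self._cache[idempotency_key] = (fingerprint, response)
        return response
```

Calling `create_order("test-123", {"amount": 100})` twice returns the same order ID both times and leaves exactly one entry in `orders`.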

What to test — client side

If you own the client:

  1. Connection timeout fires. Point the client at a blackhole (firewall-drop); assert it gives up within the timeout.
  2. Request timeout fires. Point at a slow mock (sleeps 60s); assert it gives up within the timeout.
  3. Retry on 503. Mock returns 503 three times then 200; assert the client retries and eventually succeeds.
  4. Backoff timing. Instrument the client; assert retries happen with exponential delays.
  5. Jitter is present. Run 100 simultaneous retries; assert they don't all hit the server in the same second.
  6. Max retries respected. Mock always returns 503; assert client gives up after N attempts and surfaces the error.
  7. Idempotency key passed. Hit a POST; assert the header is present and unique per logical operation.
  8. Idempotency key reused on retry. On a retry, assert the same key is sent (not a new one).
  9. 4xx not retried. Mock returns 400; assert client gives up immediately.
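Tests 3, 4, 6, and 9 from the list above can be sketched without a network by scripting a fake endpoint and injecting the sleep function. The names `FakeEndpoint` and `FatalError` are hypothetical, and the retry loop is repeated here (with an injectable `sleep`) only so the example is self-contained.

```python
import random
import time


class RetriableError(Exception):
    pass


class FatalError(Exception):
    pass


def retry_with_backoff(func, max_retries=5, sleep=time.sleep, rng=random):
    """Retry loop with `sleep` injected so tests can record delays instantly."""
    for attempt in range(max_retries):
        try:
            return func()
        except RetriableError:
            if attempt == max_retries - 1:
                raise
            sleep(min(60, 2 ** attempt) + rng.uniform(0, 1))


class FakeEndpoint:
    """Returns the scripted statuses in order, repeating the last one forever."""

    def __init__(self, *statuses):
        self.statuses = list(statuses)
        self.calls = 0

    def __call__(self):
        status = self.statuses[min(self.calls, len(self.statuses) - 1)]
        self.calls += 1
        if status == 503:
            raise RetriableError(status)
        if status >= 400:
            raise FatalError(status)
        return status


def test_retries_on_503_then_succeeds():
    endpoint, delays = FakeEndpoint(503, 503, 503, 200), []
    assert retry_with_backoff(endpoint, sleep=delays.append) == 200
    assert endpoint.calls == 4
    # Exponential with up to 1s of jitter: delay n lies in [2^n, 2^n + 1].
    assert len(delays) == 3
    assert all(2 ** n <= d <= 2 ** n + 1 for n, d in enumerate(delays))


def test_gives_up_after_max_retries():
    endpoint = FakeEndpoint(503)
    try:
        retry_with_backoff(endpoint, max_retries=3, sleep=lambda s: None)
    except RetriableError:
        assert endpoint.calls == 3
    else:
        raise AssertionError("expected the error to surface")


def test_400_is_not_retried():
    endpoint = FakeEndpoint(400)
    try:
        retry_with_backoff(endpoint, sleep=lambda s: None)
    except FatalError:
        assert endpoint.calls == 1
    else:
        raise AssertionError("expected FatalError")
```

Injecting `sleep` keeps the suite fast and turns the backoff schedule into plain data that can be asserted on, instead of something measured with a stopwatch.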

What to test — server side

If you own the server:

  1. Idempotency happy path. Two POSTs with the same key → identical response, one side-effect.
  2. Idempotency with different bodies. Same key, different body: the server should return 422 (IDEMPOTENCY_CONFLICT), not accept the request silently.
  3. Idempotency TTL. After the retention window (typically 24 hours), the key is reusable. Test at the boundary.
  4. Concurrent retries. Two requests with the same key at exactly the same time — one processes, the other waits and returns the same response. No duplicate creation.
  5. Retry-After sent on 429 and 503. Header present, value in seconds or HTTP-date.
  6. Graceful slow responses. Send a request that takes 10s. The server shouldn't hold the connection open beyond its own request timeout, and should return a clean 504 if that timeout is exceeded.
  7. Connection abort handling. Client hangs up mid-request — server shouldn't leak resources.
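Test 4 (concurrent retries) is the easiest one to get wrong, so here is a minimal in-process sketch. Threads stand in for simultaneous HTTP requests, and `ConcurrentSafeStore` and `run_concurrently` are hypothetical names. A real service would likely rely on a shared store, for example a unique constraint in a database, and would also need to handle a first attempt that fails, which this sketch omits.

```python
import threading
import time


class ConcurrentSafeStore:
    """Idempotency cache where simultaneous requests with one key run the handler once."""

    def __init__(self):
        self._guard = threading.Lock()
        self._entries = {}    # key -> {"done": Event, "response": ...}

    def execute(self, key, handler):
        with self._guard:
            entry = self._entries.get(key)
            owner = entry is None
            if owner:
                # First caller claims the key before releasing the lock.
                entry = {"done": threading.Event(), "response": None}
                self._entries[key] = entry
        if owner:
            try:
                entry["response"] = handler()
            finally:
                entry["done"].set()
        else:
            entry["done"].wait()    # later callers wait for the first result
        return entry["response"]


def run_concurrently(store, key, handler, workers=2):
    """Fire `workers` requests with the same key at (nearly) the same instant."""
    results = []
    barrier = threading.Barrier(workers)

    def worker():
        barrier.wait()              # release all threads together
        results.append(store.execute(key, handler))

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

A test then passes a handler that records a side effect and asserts that the side effect happened once while every caller received the same response.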

Chaos-style scenario tests

Once the happy-path and negative cases are covered, move on to chaos scenarios:

  • Inject a 500ms delay on 10% of requests — assert p99 stays within budget.
  • Drop connections at random points (post-send, pre-response) — assert idempotency holds.
  • Return 503 for 5 seconds then resume — assert clients recover without manual intervention.
  • Slow the server to 5s per request; have 100 clients retry — assert the server doesn't topple from retry storms.

ShiftLeft (and similar tools) can inject these failures at the proxy layer during test runs without code changes.
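For teams without a fault-injecting proxy, the same idea can be approximated in-process by wrapping a handler. The sketch below is a hypothetical wrapper, not a description of how ShiftLeft or any other tool works; `FaultInjector` and its parameters are assumed names, and the seeded generator makes a test run reproducible.

```python
import random
import time


class FaultInjector:
    """Wraps a handler and injects delays or 503s with fixed probabilities."""

    def __init__(self, handler, delay_rate=0.0, delay_seconds=0.5,
                 error_rate=0.0, seed=None, sleep=time.sleep):
        self._handler = handler
        self._delay_rate = delay_rate
        self._delay_seconds = delay_seconds
        self._error_rate = error_rate
        self._rng = random.Random(seed)
        self._sleep = sleep             # injectable so tests need not wait
        self.injected_delays = 0
        self.injected_errors = 0

    def __call__(self, request):
        if self._rng.random() < self._error_rate:
            self.injected_errors += 1
            return 503, {"error": "injected"}
        if self._rng.random() < self._delay_rate:
            self.injected_delays += 1
            self._sleep(self._delay_seconds)
        return self._handler(request)
```

Wrapping a handler with `delay_rate=0.1` and `delay_seconds=0.5` corresponds to the first scenario in the list; `error_rate=1.0` for a fixed window corresponds to the third.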

Common bugs these tests catch

1. No timeout at all. Default HTTP clients in some languages have no timeout. A slow upstream freezes your service for hours.

2. Retrying non-idempotent POSTs without an idempotency key. Leads to double charges, duplicate emails, duplicate orders.

3. Retrying 400s. Wastes the server's time and your client's time. 400 means "the request is wrong" — retrying won't fix it.

4. No jitter. Thundering herd on every upstream hiccup.

5. Idempotency cache too small or too short. Retry 25 hours later, key expired, duplicate created.

6. Idempotency keys not scoped per user. Client A's key accidentally returns Client B's response. Serious data leak.

What's next

Retries cover "what if things fail transiently." Negative testing covers "what if a malicious or confused client does something weird on purpose."
