Retries and Timeouts: Testing APIs That Live on Unreliable Networks
The network is unreliable. Here's how clients should retry, how servers should behave, and how to test both.
The three failure modes
A call between client and server can fail in three ways:
- Request lost — client sent, server never received. Retrying is safe.
- Response lost — server received and processed, response never arrived. Retrying is not safe for non-idempotent operations.
- Slow server — server is processing, just slowly. Waiting is safe; retrying might cause duplicates.
From the client's perspective, all three look the same: a timeout. That's the core problem.
Timeout hierarchy
Three timeouts to set, in increasing duration:
- Connection timeout — time to open the TCP connection. Short: 1–3 seconds. If this takes longer, DNS or routing is broken.
- Request timeout — time from connection open to first byte of response. Medium: 5–30 seconds, depends on operation.
- Overall timeout — wall-clock budget for the entire call including retries. Bounded by user patience: 30–60 seconds for user-facing, minutes for batch jobs.
Missing any of these means a single slow call can hang the caller indefinitely, and hung callers pile up. That pile-up is a common cause of cascading outages.
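One way to keep the three timeouts consistent is to derive each attempt's timeouts from what is left of the overall budget. The sketch below is a minimal illustration, not a library API: `Deadline` and `attempt_timeouts` are hypothetical names, and the defaults are simply picked from the ranges above.

```python
import time


class Deadline:
    """Tracks an overall wall-clock budget shared by all attempts of one call."""

    def __init__(self, budget_seconds, clock=time.monotonic):
        self._clock = clock  # injectable so tests can control time
        self._expires_at = clock() + budget_seconds

    def remaining(self):
        """Seconds left in the budget, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def attempt_timeouts(self, connect_timeout=3.0, request_timeout=30.0):
        """Per-attempt (connect, request) timeouts, clipped to the remaining budget."""
        left = self.remaining()
        if left <= 0:
            raise TimeoutError("overall timeout exceeded")
        return (min(connect_timeout, left), min(request_timeout, left))


if __name__ == "__main__":
    deadline = Deadline(budget_seconds=45.0)
    print(deadline.attempt_timeouts())
```

Many HTTP clients accept a pair like this directly. With the `requests` library, for example, it can be passed as `timeout=(connect, read)`, though its read timeout measures the gap between bytes rather than strictly the time to first byte.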
Retry strategy: exponential backoff with jitter
Don't retry immediately. Don't retry at the same interval.
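    import random
    import time

    class RetriableError(Exception):
        """Raised by func for failures that are safe to retry."""

    def retry_with_backoff(func, max_retries=5):
        for attempt in range(max_retries):
            try:
                return func()
            except RetriableError:
                # Out of attempts: surface the last error to the caller.
                if attempt == max_retries - 1:
                    raise
                # Exponential delay capped at 60s, plus up to 1s of jitter.
                delay = min(60, 2 ** attempt) + random.uniform(0, 1)
                time.sleep(delay)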
Why each piece:
- Exponential: 1s, 2s, 4s, 8s gives the server time to recover. Fixed-interval retries (1s, 1s, 1s) don't.
- Capped: `min(60, 2 ** attempt)` means you don't wait 17 minutes on the 10th attempt.
- Jitter: `random.uniform(0, 1)` spreads retries out. If 10,000 clients all time out simultaneously, without jitter they all retry in the same second. With jitter, they spread across it.
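To see what jitter buys you, a small simulation helps. The sketch below reuses the delay formula from the function above and counts how many retries land in the same 100 ms window. `backoff_delay` and `peak_load` are illustrative names, and the bucket size and seed are arbitrary choices.

```python
import random
from collections import Counter


def backoff_delay(attempt, cap=60.0, jitter=1.0, rng=random):
    """Capped exponential delay plus up to `jitter` seconds of random spread."""
    return min(cap, 2 ** attempt) + rng.uniform(0, jitter)


def peak_load(clients, attempt, jitter, bucket=0.1, seed=0):
    """Largest number of retries that land in one `bucket`-second window."""
    rng = random.Random(seed)
    buckets = Counter(
        int(backoff_delay(attempt, jitter=jitter, rng=rng) / bucket)
        for _ in range(clients)
    )
    return max(buckets.values())


if __name__ == "__main__":
    print("no jitter:  ", peak_load(10_000, attempt=2, jitter=0.0))
    print("with jitter:", peak_load(10_000, attempt=2, jitter=1.0))
```

Without jitter every client lands in the same window. With one second of jitter the same load is spread over roughly ten windows.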
Which errors to retry
Not every error is retriable.
| Status | Retry? | Why |
|---|---|---|
| 408 Request Timeout | Yes | Server says "try again" |
| 429 Too Many Requests | Yes (respect Retry-After) | Rate limited |
| 500 Internal Server Error | Sometimes | Could be transient |
| 502 Bad Gateway | Yes | Upstream flaky |
| 503 Service Unavailable | Yes (respect Retry-After) | Server explicitly says so |
| 504 Gateway Timeout | Yes | Upstream timed out |
| 4xx (other) | No | Client error, retrying won't help |
| Network error (no response) | Yes, if idempotent | See below |
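The table translates into a small decision function. This is a sketch that mirrors the table as written: `should_retry` and `retry_after_seconds` are hypothetical helpers, and treating 500 as opt-in is one possible reading of "Sometimes".

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

RETRIABLE_STATUSES = {408, 429, 502, 503, 504}


def should_retry(status, idempotent, retry_on_500=False):
    """status is an int HTTP code, or None for a network error with no response."""
    if status is None:
        return idempotent  # no response: only safe when the operation is idempotent
    if status == 500:
        return retry_on_500  # "Sometimes": leave the choice to the caller
    return status in RETRIABLE_STATUSES


def retry_after_seconds(header_value, now=None):
    """Parse Retry-After (seconds or HTTP-date) into seconds; None if absent or invalid."""
    if not header_value:
        return None
    value = header_value.strip()
    if value.isdecimal():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

When the server sends `Retry-After`, a client would typically wait at least that long instead of its own computed backoff delay.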
Idempotency keys
For POST and other non-idempotent methods, retries risk duplicates. The solution: idempotency keys.
The client generates a unique key (UUID) for each logical operation and sends it in a header:
    POST /api/v1/orders
    Idempotency-Key: f47ac10b-58cc-4372-a567-0e02b2c3d479

    { "amount": 100 }
Server behavior:
- First request with this key → process normally, cache `{key → response}`.
- Second request with the same key → return the cached response, do not re-process.
The demo endpoint /api/v1/orders takes the key in the request body instead of a header:

    curl -X POST 'https://demo.totalshiftleft.ai/api/v1/orders' \
      -H 'Content-Type: application/json' \
      -d '{"idempotency_key":"test-123","amount":100}'

Send the same body twice. Because the `idempotency_key` matches, the second call returns the same order ID as the first, with no duplicate created.
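On the server, the behavior described above can be sketched with an in-memory cache. Everything here is illustrative: `IdempotentOrders`, `create_order`, and the `client_id` scoping are assumptions, not the demo API's implementation, and a real service would use a shared store with a retention window rather than a dict.

```python
import threading
import uuid


class IdempotencyConflict(Exception):
    """Same key reused with a different request body (would map to HTTP 422)."""


class IdempotentOrders:
    def __init__(self):
        self._lock = threading.Lock()
        self._cache = {}  # (client_id, key) -> (body, response)
        self.orders = []

    def create_order(self, client_id, key, body):
        scoped_key = (client_id, key)  # per-client scope: keys never collide across users
        with self._lock:  # serialise so concurrent retries cannot both process
            if scoped_key in self._cache:
                cached_body, response = self._cache[scoped_key]
                if cached_body != body:
                    raise IdempotencyConflict(key)
                return response  # replay the cached response, no new side effect
            response = {"order_id": str(uuid.uuid4()), "amount": body["amount"]}
            self.orders.append(response)
            self._cache[scoped_key] = (body, response)
            return response


if __name__ == "__main__":
    service = IdempotentOrders()
    first = service.create_order("alice", "test-123", {"amount": 100})
    second = service.create_order("alice", "test-123", {"amount": 100})
    print(first == second, len(service.orders))
```

Holding one lock for the whole operation is the simplest way to make the check and the write atomic. It also serialises unrelated keys, which a production design would want to avoid.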
What to test — client side
If you own the client:
- Connection timeout fires. Point the client at a blackhole (firewall-drop); assert it gives up within the timeout.
- Request timeout fires. Point at a slow mock (sleeps 60s); assert it gives up within the timeout.
- Retry on 503. Mock returns 503 three times then 200; assert the client retries and eventually succeeds.
- Backoff timing. Instrument the client; assert retries happen with exponential delays.
- Jitter is present. Run 100 simultaneous retries; assert they don't all hit the server in the same second.
- Max retries respected. Mock always returns 503; assert client gives up after N attempts and surfaces the error.
- Idempotency key passed. Hit a POST; assert the header is present and unique per logical operation.
- Idempotency key reused on retry. On a retry, assert the same key is sent (not a new one).
- 4xx not retried. Mock returns 400; assert client gives up immediately.
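Several of these checks need no network at all if the client lets you inject its transport and its sleep function. The sketch below assumes a hypothetical client, `post_with_retries`, and a scripted `FakeServer`. Jitter is left out so the delay assertions stay deterministic.

```python
import uuid

RETRIABLE = (408, 429, 502, 503, 504)


class FakeServer:
    """Returns the scripted statuses in order and records every request."""

    def __init__(self, statuses):
        self._statuses = list(statuses)
        self.requests = []

    def send(self, method, path, headers, body):
        self.requests.append({"method": method, "path": path, "headers": dict(headers)})
        return self._statuses.pop(0)


def post_with_retries(send, path, body, max_retries=5, sleep=lambda seconds: None):
    """Hypothetical client under test: one idempotency key per logical operation."""
    headers = {"Idempotency-Key": str(uuid.uuid4())}
    delays = []
    for attempt in range(max_retries):
        status = send("POST", path, headers, body)
        if status not in RETRIABLE or attempt == max_retries - 1:
            return status, delays
        delay = min(60, 2 ** attempt)  # jitter omitted for determinism
        delays.append(delay)
        sleep(delay)


def test_retries_503_then_succeeds():
    server = FakeServer([503, 503, 503, 200])
    status, delays = post_with_retries(server.send, "/api/v1/orders", {"amount": 100})
    assert status == 200
    assert len(server.requests) == 4
    assert delays == [1, 2, 4]  # exponential backoff
    keys = {r["headers"]["Idempotency-Key"] for r in server.requests}
    assert len(keys) == 1  # same key reused on every retry


def test_400_not_retried():
    server = FakeServer([400, 200])
    status, _ = post_with_retries(server.send, "/api/v1/orders", {"amount": 100})
    assert status == 400
    assert len(server.requests) == 1


def test_max_retries_respected():
    server = FakeServer([503] * 10)
    status, _ = post_with_retries(server.send, "/api/v1/orders", {}, max_retries=3)
    assert status == 503
    assert len(server.requests) == 3
```

Injecting `sleep` keeps the suite fast: the test asserts on the delays the client asked for instead of actually waiting for them.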
What to test — server side
If you own the server:
- Idempotency happy path. Two POSTs with the same key → identical response, one side-effect.
- Idempotency with different bodies. Same key, different body: should return 422 (`IDEMPOTENCY_CONFLICT`), not accept silently.
- Idempotency TTL. After the retention window (typically 24 hours), the key is reusable. Test at the boundary.
- Concurrent retries. Two requests with the same key at exactly the same time: one processes, the other waits and returns the same response. No duplicate creation.
- `Retry-After` sent on 429 and 503. Header present, value in seconds or HTTP-date.
- Graceful slow responses. Send a request that takes 10s. The server shouldn't hold the connection open beyond its own request timeout, and should return a clean 504 if that timeout is exceeded.
- Connection abort handling. Client hangs up mid-request — server shouldn't leak resources.
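The concurrency and TTL cases are the easiest to get wrong, so here is a sketch of both against a toy store. `IdempotencyStore` is a stand-in for your real implementation. The injected clock is what makes the 24-hour boundary testable without waiting.

```python
import itertools
import threading
import time


class IdempotencyStore:
    """Hypothetical in-memory store with a retention window."""

    def __init__(self, ttl_seconds=24 * 3600, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._entries = {}  # key -> (stored_at, response)
        self.side_effects = 0

    def handle(self, key, process):
        with self._lock:  # the second caller waits here, then gets the cached response
            entry = self._entries.get(key)
            if entry and self._clock() - entry[0] < self._ttl:
                return entry[1]
            response = process()
            self.side_effects += 1
            self._entries[key] = (self._clock(), response)
            return response


def test_concurrent_retries_create_one_order():
    store = IdempotencyStore()
    order_ids = itertools.count(1)
    barrier = threading.Barrier(8)  # release all threads at the same moment
    results = []

    def worker():
        barrier.wait()
        results.append(store.handle("k1", lambda: {"order_id": next(order_ids)}))

    threads = [threading.Thread(target=worker) for _ in range(8)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    assert store.side_effects == 1
    assert results == [{"order_id": 1}] * 8


def test_ttl_boundary():
    now = [0.0]
    store = IdempotencyStore(ttl_seconds=86400, clock=lambda: now[0])
    store.handle("k1", lambda: "first")
    now[0] = 86399.0  # one second before expiry: still cached
    assert store.handle("k1", lambda: "second") == "first"
    now[0] = 86400.0  # at the boundary: the key is reusable
    assert store.handle("k1", lambda: "second") == "second"
```

Whether the key expires exactly at the boundary or one tick after it is a design decision. The point of the test is that the decision is written down and enforced.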
Chaos-style scenario tests
Once the happy and negative paths are covered, add chaos-style scenarios:
- Inject a 500ms delay on 10% of requests — assert p99 stays within budget.
- Drop connections at random points (post-send, pre-response) — assert idempotency holds.
- Return 503 for 5 seconds then resume — assert clients recover without manual intervention.
- Slow the server to 5s per request; have 100 clients retry — assert the server doesn't topple from retry storms.
ShiftLeft (and similar tools) can inject these failures at the proxy layer during test runs without code changes.
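If you don't have a fault-injecting proxy, the "response lost" scenario can be approximated in-process. The sketch below wraps a toy handler so that the side effect happens and then the response is thrown away. All names (`lossy`, `make_order_handler`, `call_with_retries`) are hypothetical, and the 50% drop rate and seed are arbitrary.

```python
import random


class ResponseLost(Exception):
    """The server processed the request but the reply never reached the client."""


def lossy(handler, drop_rate, rng):
    """Wrap a handler so a fraction of responses are dropped after processing."""
    def wrapped(key, body):
        response = handler(key, body)  # the side effect happens first
        if rng.random() < drop_rate:
            wrapped.dropped += 1
            raise ResponseLost()  # then the response is "lost"
        return response
    wrapped.dropped = 0
    return wrapped


def make_order_handler():
    """Toy idempotent handler: one order per key, cached response on replay."""
    orders, cache = [], {}

    def handler(key, body):
        if key not in cache:
            orders.append(body)
            cache[key] = {"order_id": len(orders)}
        return cache[key]

    return handler, orders


def call_with_retries(send, key, body, max_retries=64):
    for _ in range(max_retries):
        try:
            return send(key, body)
        except ResponseLost:
            continue  # backoff omitted to keep the test fast
    raise ResponseLost()


def test_idempotency_holds_when_responses_are_dropped():
    handler, orders = make_order_handler()
    send = lossy(handler, drop_rate=0.5, rng=random.Random(42))
    for i in range(100):
        call_with_retries(send, key=f"op-{i}", body={"amount": i})
    assert send.dropped > 0  # the fault was actually injected
    assert len(orders) == 100  # one order per logical operation, despite retries
```

Remove the cache check from the handler and the final assertion fails, which is a quick way to confirm the test can catch the bug it targets.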
Common bugs these tests catch
1. No timeout at all. Default HTTP clients in some languages have no timeout. A slow upstream freezes your service for hours.
2. Retrying non-idempotent POSTs without an idempotency key. Leads to double charges, duplicate emails, duplicate orders.
3. Retrying 400s. Wastes the server's time and your client's time. 400 means "the request is wrong" — retrying won't fix it.
4. No jitter. Thundering herd on every upstream hiccup.
5. Idempotency cache too small or too short. Retry 25 hours later, key expired, duplicate created.
6. Idempotency not per-user. Client A's key accidentally returns Client B's response. Serious data leak.
What's next
Retries cover "what if things fail transiently." Negative testing covers "what if a malicious or confused client does something weird on purpose."