Manab's Notes - SECTION 3: Resilience & Fault Tolerance

Partial failure is the norm, not the exception. Design for it explicitly.

Strategies:

Timeouts everywhere. Never make an unbounded call. Set connection timeout and read timeout independently.
Circuit breaker. After N consecutive failures to a dependency, open the circuit and fail fast for a cooldown period, then let a trial request through to test recovery. This stops one failing dependency from dragging down its callers. (Resilience4j; Hystrix, now in maintenance mode; also built into service meshes)
Bulkhead. Isolate resources per dependency. Separate thread pools / connection pools for each downstream service so a slow service doesn't exhaust shared resources.
Fallbacks. Define what to return when a dependency is unavailable: cached data, degraded response, default value, or a clear error that allows partial UI rendering.
Graceful degradation. A checkout page can still work if the product recommendation service is down. Design non-critical paths to be optional.
Retry with exponential backoff and jitter. Don't retry immediately or all at once: synchronized retries hammer a service that is already failing. Add randomised jitter to spread retry load, cap the number of attempts, and retry only idempotent operations.
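
A minimal sketch of the retry strategy above, assuming the "full jitter" variant (each delay is drawn uniformly between zero and an exponentially growing ceiling). The names call_with_retry and backoff_delays, the default limits, and the choice of retryable exceptions are illustrative assumptions, not part of any library.

```python
import random
import time


def backoff_delays(max_attempts, base=0.1, cap=5.0, rng=random.random):
    """Yield one full-jitter delay per retry: uniform in [0, min(cap, base * 2**n))."""
    for attempt in range(max_attempts - 1):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng() * ceiling


def call_with_retry(operation, max_attempts=4, base=0.1, cap=5.0,
                    retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Run operation(); on a retryable error, wait a jittered delay and try again."""
    delays = backoff_delays(max_attempts, base, cap)
    while True:
        try:
            return operation()
        except retry_on:
            delay = next(delays, None)
            if delay is None:   # attempts exhausted: surface the last error
                raise
            sleep(delay)
```

The sleep function is injectable so the retry schedule can be checked without waiting. The operation passed in should be idempotent, since it may run more than once.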

12. What patterns help maintain consistency across multiple services?

Consistency in distributed systems is a spectrum — choose the right point for each use case.

Patterns:

Saga (compensating transactions) — Eventual consistency via choreography or orchestration. Appropriate when operations span services and strict atomicity is not required.
Outbox pattern - Makes a DB write and its event publish succeed or fail together, eliminating lost events (row committed, event never sent) and phantom events (event sent, row rolled back). The service writes the database update and the outgoing event payload into an outbox table within the same local database transaction. A separate background relay reads the outbox table and publishes each event to the message broker, retrying until it succeeds. Delivery is therefore at-least-once, so consumers must tolerate duplicates.
Event sourcing — The event log is the source of truth. All services derive their state by consuming the same log. Consistency is eventual but auditable and replayable.
Two-phase commit (2PC) - Atomic commit across distributed resources. Expensive: participants hold locks from the prepare phase until the coordinator's decision arrives, and a coordinator failure can leave them blocked. Only use when absolutely needed (rare in modern architectures).
Change Data Capture (CDC) - Treat the database transaction log as an event stream (Debezium + Kafka). A dedicated tool such as Debezium reads the source database’s transaction log directly, converts each committed row change into a structured event, and streams it to downstream services. This is a database-level alternative to an outbox relay that polls a table: the application code contains no messaging logic. The trade-off is that the events mirror table rows rather than domain events, so consumers are coupled to the schema unless CDC is pointed at an outbox table.
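
A small sketch of the outbox pattern from the list above, using an in-memory SQLite database so it runs anywhere. The table layout, the function names (create_schema, place_order, relay_once) and the publish callback standing in for a broker client are all assumptions made for illustration.

```python
import json
import sqlite3


def create_schema(conn):
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
    conn.execute(
        "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "topic TEXT, payload TEXT, published INTEGER DEFAULT 0)"
    )


def place_order(conn, order_id):
    """Write the business row and its event in ONE local transaction."""
    with conn:  # commits on success, rolls back both inserts on any exception
        conn.execute("INSERT INTO orders (id, status) VALUES (?, 'PLACED')", (order_id,))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("order-events", json.dumps({"type": "OrderPlaced", "order_id": order_id})),
        )


def relay_once(conn, publish):
    """Publish pending rows in id order; mark a row only after publish() returns."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, payload)  # if this raises, the row stays pending and is retried
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

If the relay crashes after publish() returns but before the UPDATE commits, the same event is published again on the next pass. That is the at-least-once behaviour consumers need to handle, typically by deduplicating on an event id.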

General rule: Prefer eventual consistency at service boundaries; use strong consistency within a service's own data store.

13. How do you design a resilient system that survives external dependency failures?

Assume every external dependency will fail, be slow, or return garbage at some point.

Design patterns:

Circuit breaker + fallback — Open circuit on failure threshold; return cached/default response while open.
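Cache-aside with stale-while-revalidate / stale-if-error - Serve the cached value immediately while refreshing it in the background, and keep serving the stale value when the source is unavailable. For many use cases, slightly stale data is far better than an error.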
Timeout + deadline propagation — Set aggressive timeouts. Propagate deadlines across service calls so downstream services know when to abort.
Async + queue buffer — For write paths, queue requests so that if the dependency is temporarily down, work is not lost (e.g., write to SQS, process when downstream recovers).
Multi-region / multi-vendor — For critical external dependencies (payment gateways, SMS providers), maintain a secondary provider and fail over automatically.
Chaos engineering — Proactively inject failures (Chaos Monkey, Gremlin) to verify your resilience patterns actually work before production teaches you they don't.
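
A minimal sketch of the "circuit breaker + fallback" item above. It is single-threaded and counts consecutive failures only; production libraries such as Resilience4j use richer policies. The class name, defaults and injectable clock are assumptions for illustration.

```python
import time


class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures, then fails fast for `cooldown` seconds."""

    def __init__(self, failure_threshold=3, cooldown=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                return fallback()  # open: fail fast, do not touch the dependency
            # cooldown elapsed: half-open, let this one call through as a probe
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed probe
            return fallback()
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result
```

A call site might look like breaker.call(fetch_recommendations, lambda: cached_recommendations), where both names are placeholders for the real dependency call and the cached or default response.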

14. What's your approach to detecting and isolating slow services?

Detection:

Distributed tracing (Jaeger, Zipkin, OpenTelemetry) — Trace every request across services with timing at each hop. Immediately identifies which service in a chain is adding latency.
P99 / P999 latency dashboards — Averages hide tail latency. Monitor 99th and 99.9th percentile response times per service.
Timeout alerts — Alert on timeout rate, not just error rate. A service returning 200 after 10 seconds is a problem even if it's technically "succeeding."
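
A small worked example of why averages hide tail latency. The latency numbers are made up for illustration, and percentile is a simple nearest-rank helper rather than what a metrics backend would use (those usually work from histograms).

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in ms: 98% of requests are fast, 2% hit a slow path.
latencies_ms = [20] * 980 + [2000] * 20

mean_ms = statistics.mean(latencies_ms)  # 59.6
p50_ms = percentile(latencies_ms, 50)    # 20
p99_ms = percentile(latencies_ms, 99)    # 2000
```

The mean of about 60 ms looks healthy, yet one request in fifty takes two seconds. Only the P99 makes that visible.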

Isolation:

Bulkhead pattern — Give slow dependencies their own thread pool/connection pool. Slowness can't propagate to other dependencies.
Async offload — Move slow operations off the request path where possible.
Service mesh sidecar — (Istio, Linkerd) Enforce timeouts and circuit breakers at the infrastructure layer, independent of application code.
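Hedged requests - For read-heavy systems, send the same request to a second replica and use whichever responds first. Sending both at once roughly doubles load; delaying the second request until the first has exceeded a high-percentile latency (e.g. P95) keeps most of the tail-latency benefit for a small amount of extra load. Only hedge idempotent requests.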
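
A sketch of hedged requests with asyncio, assuming the delayed variant: the backup request goes out only if the primary has not answered within hedge_after seconds (a value of 0 sends both at once). The function name hedged and the make_request callback are hypothetical; make_request stands in for whatever async client call reaches a replica.

```python
import asyncio


async def hedged(make_request, replicas, hedge_after):
    """Ask replicas[0]; if no reply within hedge_after seconds, also ask replicas[1]. First reply wins."""
    primary = asyncio.create_task(make_request(replicas[0]))
    done, _ = await asyncio.wait({primary}, timeout=hedge_after)
    if done:
        return primary.result()
    backup = asyncio.create_task(make_request(replicas[1]))
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # stop the loser so it does not keep consuming capacity
    return next(iter(done)).result()
```

This sketch returns whichever task finishes first, including a failure. A fuller version would wait for the other request when the first one to finish raised an error.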

15. How do you design caching layers (L1/L2) to avoid stale data and thundering herds?

Cache layers:

L1 (in-process/local cache): Sub-millisecond latency, no network hop. Risk: stale data per instance, high memory pressure. Use for immutable or rarely changing data (config, feature flags, reference data). Short TTLs (seconds).

L2 (distributed cache - Redis, Memcached): Shared across instances, network latency (~1ms). The authoritative cached copy that L1 caches refill from (the database remains the system of record). Use for session data, computed results, rate limit counters.
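
A rough sketch of how the two layers combine on the read path: check L1, then L2, then the system of record. DictL2 is a hypothetical in-memory stand-in for a Redis or Memcached client, and TwoTierCache, its TTL default and the loader callback are illustrative assumptions.

```python
import time


class DictL2:
    """Stand-in for a shared cache client exposing get/set."""

    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value):
        self.data[key] = value


class TwoTierCache:
    """Read path: L1 (per-process dict with TTL) -> L2 (shared cache) -> loader."""

    def __init__(self, l2, loader, l1_ttl=5.0, clock=time.monotonic):
        self.l1 = {}          # key -> (value, expires_at)
        self.l2 = l2
        self.loader = loader  # reads the system of record on a full miss
        self.l1_ttl = l1_ttl
        self.clock = clock

    def get(self, key):
        entry = self.l1.get(key)
        if entry is not None and entry[1] > self.clock():
            return entry[0]               # L1 hit
        value = self.l2.get(key)
        if value is None:                 # L2 miss: load and populate L2
            value = self.loader(key)
            self.l2.set(key, value)
        self.l1[key] = (value, self.clock() + self.l1_ttl)
        return value

    def invalidate(self, key):
        """Drop the local copy, e.g. when an invalidation event arrives."""
        self.l1.pop(key, None)
```

The short L1 TTL bounds how long one instance can serve a value that has already changed in L2. This sketch treats None as a miss, so it cannot cache None values.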

Stale data strategies:

TTL + cache-aside: Simple but creates stale windows. Accept or shorten TTL for critical data.
Write-through cache: On every DB write, also update cache. Keeps cache fresh but adds write latency.
Event-driven invalidation: On data change, publish an invalidation event; all cache instances evict that key. More complex but precise.

Thundering herd (cache stampede) prevention:

Probabilistic early expiration (PER): Each request may trigger a recompute before the TTL expires, with a probability that rises as expiry approaches. Refreshes are spread over time instead of all landing at the moment of expiry, which flattens the spike.
Lock-based recomputation: First request to find a cold cache acquires a lock to recompute; other requests wait or serve stale. (Redis SET with NX and an expiry, e.g. SET lock:key token NX PX 5000, so a crashed holder cannot keep the lock forever)
Request coalescing / promise caching: Multiple concurrent requests for the same cold key are merged into one upstream call.
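
A sketch of request coalescing within one process, assuming threads as the unit of concurrency. The first caller for a cold key becomes the leader and runs the loader; callers that arrive while it is running wait on the same Future. CoalescingCache and its fields are illustrative names, and TTL handling is left out to keep the example short.

```python
import threading
from concurrent.futures import Future


class CoalescingCache:
    """Concurrent misses on the same key share one loader call."""

    def __init__(self, loader):
        self.loader = loader
        self.values = {}     # filled cache entries
        self.in_flight = {}  # key -> Future for a load that is currently running
        self.lock = threading.Lock()

    def get(self, key):
        with self.lock:
            if key in self.values:
                return self.values[key]
            future = self.in_flight.get(key)
            leader = future is None
            if leader:  # first caller for this key does the load
                future = Future()
                self.in_flight[key] = future
        if not leader:
            return future.result()  # followers block until the leader finishes
        try:
            value = self.loader(key)
        except Exception as exc:
            with self.lock:
                del self.in_flight[key]
            future.set_exception(exc)  # followers see the same failure
            raise
        with self.lock:
            self.values[key] = value
            del self.in_flight[key]
        future.set_result(value)
        return value
```

This only merges requests inside a single instance. Across instances, the lock-based approach above (or coalescing at a shared proxy layer) is still needed.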
