Partial failure is the norm, not the exception. Design for it explicitly.
Strategies:
Timeouts everywhere. Never make an unbounded call. Set connection timeout and read timeout independently.
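Circuit breaker. After N consecutive failures to a dependency, open the circuit and fail fast for a cooldown period, then allow a trial request through (half-open) and close the circuit only if it succeeds. This stops one failing dependency from dragging down its callers. (Resilience4j; Hystrix, now in maintenance mode; built into service meshes such as Istio.)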
Bulkhead. Isolate resources per dependency. Separate thread pools / connection pools for each downstream service so a slow service doesn't exhaust shared resources.
Fallbacks. Define what to return when a dependency is unavailable: cached data, degraded response, default value, or a clear error that allows partial UI rendering.
Graceful degradation. A checkout page can still work if the product recommendation service is down. Design non-critical paths to be optional.
Retry with exponential backoff and jitter. Don't retry immediately or in lockstep: synchronized retries create a thundering herd that hammers the failing service. Add randomised jitter to spread retry load, and cap the number of attempts.
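To make two of the strategies above concrete, here is a minimal Python sketch of a circuit breaker combined with capped, jittered retries. It is an illustration under stated assumptions, not a substitute for a library such as Resilience4j: the names CircuitBreaker, CircuitOpenError, backoff_delays and call_with_retry are hypothetical, the thresholds are arbitrary defaults, and the breaker is not thread-safe.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being attempted."""


class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures, then fails fast for `cooldown` seconds."""

    def __init__(self, failure_threshold=3, cooldown=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                raise CircuitOpenError("dependency is cooling down")
            # Cooldown elapsed: half-open, so this one trial call is let through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()   # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def backoff_delays(retries, base=0.1, cap=5.0, rng=random.random):
    """Yield one 'full jitter' delay per retry: uniform in [0, min(cap, base * 2**n))."""
    for n in range(retries):
        yield rng() * min(cap, base * 2 ** n)


def call_with_retry(fn, attempts=4, base=0.1, cap=5.0, sleep=time.sleep,
                    retry_on=(Exception,)):
    """Call fn up to `attempts` times in total, sleeping a jittered delay between tries."""
    delays = backoff_delays(attempts - 1, base, cap)
    while True:
        try:
            return fn()
        except CircuitOpenError:
            raise                   # breaker is open: retrying would defeat fail-fast
        except retry_on:
            delay = next(delays, None)
            if delay is None:
                raise               # out of retries: surface the last error
            sleep(delay)
```

A caller would typically wrap the breaker inside the retry, for example call_with_retry(lambda: breaker.call(fetch_recommendations, user_id)), where fetch_recommendations stands in for any downstream call, and fall back to a cached or default value when CircuitOpenError is raised. Retries of this kind are only safe for idempotent operations.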
12. What patterns help maintain consistency across multiple services?
Consistency in distributed systems is a spectrum — choose the right point for each use case.
Patterns:
Saga (compensating transactions). A sequence of local transactions, one per service, each paired with a compensating action that undoes it if a later step fails. Coordinated by choreography (services react to each other's events) or orchestration (a central coordinator drives the steps). Consistency is eventual. Appropriate when operations span services and strict atomicity is not required.
Outbox pattern. Guarantees that a DB write and the recording of its event happen atomically, eliminating lost events (row committed, event never sent) and phantom events (event sent, row rolled back). The service writes the database update and the outgoing event payload into an outbox table within the same local database transaction. A separate background relay reads the outbox table and publishes each event to the message broker, retrying until it succeeds. Delivery is therefore at-least-once, so consumers should be idempotent.
Event sourcing. The event log is the source of truth. All services derive their state by consuming the same log. Consistency is eventual but auditable and replayable.
Two-phase commit (2PC). Strong consistency across distributed resources. Expensive: participants hold locks from prepare until the coordinator's decision arrives, and they stay blocked if the coordinator fails in between. Only use when absolutely needed (rare in modern architectures).
Change Data Capture (CDC). Treat the database transaction log as an event stream (for example Debezium + Kafka). A dedicated tool such as Debezium reads the source database's transaction log directly; whenever a service commits a row change, the tool detects it, converts it into a structured event, and streams it to downstream services. CDC is a database-level alternative to a hand-written Outbox relay that keeps messaging logic out of application code. Note that consumers then see row-level changes rather than domain events.
General rule: Prefer eventual consistency at service boundaries; use strong consistency within a service's own data store.
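As an illustration of the Outbox pattern from the list above, the following Python sketch uses the standard-library sqlite3 module. It is a minimal sketch under assumptions: the orders and outbox tables, the order.placed topic, and the function names are all hypothetical, and publish stands in for whatever broker client a real service would use.

```python
import json
import sqlite3


def init_schema(conn):
    """Create a business table and an outbox table in the same database."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS orders (
            id   INTEGER PRIMARY KEY,
            item TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS outbox (
            id        INTEGER PRIMARY KEY AUTOINCREMENT,
            topic     TEXT NOT NULL,
            payload   TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
    """)


def place_order(conn, order_id, item):
    """Write the business row and its event in ONE local transaction."""
    with conn:  # commits on success, rolls back both inserts if either raises
        conn.execute("INSERT INTO orders (id, item) VALUES (?, ?)", (order_id, item))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("order.placed", json.dumps({"order_id": order_id, "item": item})),
        )


def relay_once(conn, publish):
    """Publish pending outbox rows in id order; return how many were published.

    A row is marked as published only after publish() returns, so a crash
    between the two steps causes a duplicate publish, never a lost event.
    """
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, json.loads(payload))  # may raise: the row stays pending
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

In a running service, relay_once would be called in a loop by a background worker (or replaced by a CDC tool that tails the outbox table). Because a failed or interrupted relay re-sends the same row later, consumers need to deduplicate, for example on the outbox row id.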