5. How do you choose between synchronous and asynchronous communication?
Synchronous (REST, gRPC): Use when the caller needs the result immediately to proceed, the operation is fast and bounded, and failure of the dependency is a hard failure (the caller can't continue without it).
Asynchronous (queues, events, message buses): Use when the caller doesn't need an immediate result, operations are long-running, the system must tolerate downstream unavailability, or you need to decouple producers from consumers for independent scaling.
Decision signals:
Is the response needed to return to the user? → Sync
Can the user be notified later (email, webhook, polling)? → Async
Is the downstream slow or unreliable? → Async + queue
Is strict ordering important? → Sync or ordered streams (Kafka with single partition)
Are there fan-out requirements (one event → many consumers)? → Async pub/sub
Warning: Mixing sync and async carelessly produces the worst outcome — a synchronous call that blocks waiting for an async operation to complete, combining the latency of async with the coupling of sync.
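The async side of this split can be sketched with nothing but the standard library (names like `submit_job` are illustrative, not from any framework): the request handler enqueues work and returns a job id immediately, a background consumer drains the queue independently, and the caller polls for status instead of blocking.

```python
import queue
import threading
import time
import uuid

jobs = {}                    # job_id -> status; stands in for a real status store
work_queue = queue.Queue()   # stands in for a real broker (SQS, RabbitMQ, ...)

def submit_job(payload):
    """Synchronous edge: accept the request, enqueue, return immediately."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = "pending"
    work_queue.put((job_id, payload))
    return job_id            # caller polls (or gets a webhook) instead of blocking

def worker():
    """Asynchronous side: processes jobs at its own pace, decoupled from callers."""
    while True:
        job_id, payload = work_queue.get()
        time.sleep(0.01)     # simulate a slow downstream call
        jobs[job_id] = "done"
        work_queue.task_done()

threading.Thread(target=worker, daemon=True).start()
job = submit_job({"order": 42})
work_queue.join()            # in real life the caller polls; here we just wait
print(jobs[job])             # "done"
```

Note that the handler's latency is now independent of the downstream's: if the worker is slow or down, requests still return instantly and the queue absorbs the backlog.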
6. How do you design idempotent operations for systems with retries?
An operation is idempotent if calling it N times produces the same result as calling it once.
Techniques:
Idempotency keys. The client generates a unique key (UUID) per logical operation. The server stores the key and result on first execution. On retry, it returns the stored result without re-executing. Used by Stripe for payments.
Natural idempotency. Design operations to be naturally idempotent: PUT /users/123 {email: "a@b.com"} is idempotent; POST /users/increment-count is not. Prefer state-setting over state-modifying operations.
Conditional writes. Use database constraints (INSERT ... ON CONFLICT DO NOTHING) or optimistic locking (WHERE version = X) to make duplicate writes safe.
Deduplication at the queue. Message brokers like SQS support deduplication IDs. Within a window, duplicate messages are discarded.
Critical for: payment processing, order creation, email sending, inventory mutations. The idempotency key TTL must exceed your maximum retry window.
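Both conditional-write techniques can be demonstrated against SQLite (table and column names here are illustrative; `INSERT OR IGNORE` is SQLite's spelling of Postgres's `ON CONFLICT DO NOTHING`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, amount INTEGER, version INTEGER)")

# Duplicate-safe insert: the unique constraint absorbs retries.
for _ in range(2):  # the same logical write, retried
    conn.execute("INSERT OR IGNORE INTO orders VALUES ('ord-1', 500, 1)")
count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
assert count == 1   # 1, not 2: the retry was a no-op

# Optimistic locking: update only if the version we read is still current.
won = conn.execute("UPDATE orders SET amount = 600, version = 2 "
                   "WHERE id = 'ord-1' AND version = 1")
assert won.rowcount == 1    # we won the race
lost = conn.execute("UPDATE orders SET amount = 700, version = 2 "
                    "WHERE id = 'ord-1' AND version = 1")
assert lost.rowcount == 0   # stale version: read again and retry
```

A `rowcount` of zero is the signal that another writer got there first; the application re-reads the row and retries rather than silently overwriting.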
7. How would you prevent race conditions in a high-throughput workflow?
Race conditions occur when correctness depends on execution order and that order isn't guaranteed.
Imagine the last concert ticket is available and two people click “buy” at the same instant. The program checks the stock for each request, sees one ticket left in both cases, and both users get a “Purchase Confirmed” email, even though there was only one ticket.
Our logic wasn’t wrong, but timing betrayed us. Race conditions make correct code act incorrectly, breaking things unexpectedly.
A race condition happens when two or more operations try to access and modify the same data (shared resources) at the same time, and the final outcome depends on the exact order in which these operations run.
Think of two people editing the same document at the same time. Without coordination, changes can be lost or overwritten.
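The ticket scenario can be reproduced deterministically. This hedged sketch uses a `threading.Barrier` to force the bad interleaving (both buyers check the stock before either writes), which in production would only happen occasionally under load:

```python
import threading

stock = {"tickets": 1}
confirmations = []
barrier = threading.Barrier(2)     # forces both buyers to check before either writes

def buy(user):
    available = stock["tickets"] > 0   # step 1: check
    barrier.wait()                     # both threads have now seen "1 left"
    if available:                      # step 2: act on a stale check
        stock["tickets"] -= 1
        confirmations.append(user)

threads = [threading.Thread(target=buy, args=(u,)) for u in ("alice", "bob")]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(confirmations)        # both buyers confirmed for a single ticket
print(stock["tickets"])     # -1: oversold
```

The check and the act are individually correct; the bug lives in the gap between them.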
Approaches by scope:
Database-level: Pessimistic locking (SELECT FOR UPDATE) — holds a lock until transaction commits. Safe but reduces throughput. Optimistic locking — read with a version/timestamp, write only if version unchanged, retry on conflict. Better for low-contention scenarios.
Application-level: Serialise access to shared state through a single-writer pattern. Route all mutations for a given entity through one actor/thread/goroutine (e.g., Akka actors, Kafka consumer single partition per entity).
Distributed-level: Use distributed locks (Redis SET NX PX, or Redlock for multi-node). Always set a TTL to prevent deadlock on crash.
Design-level: The best race condition is one you've designed away. Use append-only event logs (event sourcing) — concurrent writers append events; a single projector applies them in order. No concurrent mutation, no race.
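The application-level single-writer pattern above can be sketched with a plain thread and queue (illustrative; an Akka actor or a per-entity Kafka partition plays this role in production). Many producers submit mutations, but only one thread ever touches the entity, so its check-then-act is safe without locks:

```python
import queue
import threading

commands = queue.Queue()
inventory = {"tickets": 1}   # shared entity; mutated ONLY by the writer thread
confirmed = []

def writer():
    """Single writer: all mutations for this entity are serialised here."""
    while True:
        user = commands.get()
        if user is None:          # sentinel: shut down
            break
        if inventory["tickets"] > 0:   # safe check-then-act: no other writer exists
            inventory["tickets"] -= 1
            confirmed.append(user)

t = threading.Thread(target=writer)
t.start()
for user in ("alice", "bob"):    # two buyers race for the last ticket
    commands.put(user)
commands.put(None)
t.join()
assert len(confirmed) == 1       # exactly one confirmation, never two
```

Contrast with the oversold-ticket failure: the same logic becomes correct once concurrency is removed at the entity level, at the cost of routing every mutation through one consumer.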
High-throughput key insight: Locking reduces throughput. Design workflows so that entities that need to be atomically updated are co-located (same DB row, same aggregate) rather than spanning services.