Distributed Systems

If you're building anything beyond a simple application, you'll eventually need to understand distributed systems—not because it's a trend, but because monolithic architectures have inherent scaling limitations that distributed systems are designed to solve.

What Is a Distributed System?

At its core, a distributed system is a collection of computers that work together to appear as a single coherent system to users. These computers communicate over a network, and this network communication introduces fundamental challenges that don't exist in single-machine systems. Networks are inherently unreliable; they have latency, and they can fail in unpredictable ways. The engineering challenge of distributed systems is managing this complexity while maintaining the illusion of a unified, reliable service for end users.

Most applications begin as monolithic architectures. A monolith is a traditional model in which the user interface, business logic, and database access all live in a single, unified codebase. These concerns may be separated into modules internally, but they are built and deployed together as one unit.

However, monolithic architectures eventually encounter scaling limitations. Amazon's transformation in the early 2000s serves as a compelling case study. Their entire e-commerce platform was originally a single monolithic application. As the company grew and development teams expanded, the codebase became increasingly difficult to manage. Any update to one component required a complete redeployment of the entire site, leading to significant delays and reduced productivity. By 2001, Amazon recognised that its monolithic architecture could no longer scale to meet customer demand.

The Microservices Solution

Amazon's solution was to decompose the monolith into small, independently deployable services—what we now call microservices. Each service handles a specific business function: payment processing, product catalog, user reviews, recommendations, and so on. This architectural shift enabled independent scaling of individual services. If the payment service experienced heavy load, it could be scaled up without affecting the product catalog or review services. This modularity provided agility and the ability to adapt to rapidly changing business requirements. For developers, this represents a fundamental shift from a single unified codebase to a distributed network of communicating services.

The Fallacies of Distributed Computing

The transition from monolithic to distributed systems requires abandoning several assumptions that hold in single-machine programming. These misconceptions were identified at Sun Microsystems as the "Eight Fallacies of Distributed Computing." Understanding and designing around these fallacies is fundamental to building reliable distributed systems.

The Network Is Reliable
In a monolithic application, a function call succeeds unless the server itself fails. In distributed systems, a request sent over the network might never arrive, might arrive but have its acknowledgment lost, or might be processed multiple times due to retries. This ambiguity is the source of many distributed systems challenges and requires careful handling of timeouts, retries, and idempotency.

Latency Is Zero
Moving data between services across data centres—or across continents—takes significantly longer than in-memory function calls. What might be a microsecond operation in a monolith becomes a millisecond or even second-scale operation when network communication is involved. This latency must be accounted for in system design and user experience.

Bandwidth Is Infinite
While modern network infrastructure is powerful, the aggregate volume of data transferred between microservices can create bottlenecks. This can lead to dropped packets, increased latency, and degraded performance. Bandwidth constraints are real and must be considered when designing inter-service communication patterns.

The remaining five fallacies are assumptions about:

- security (the network is not inherently secure)
- topology (the network structure changes as servers are added and removed)
- administration (distributed systems span multiple teams and organisations)
- transport costs (network operations consume resources and carry financial costs)
- homogeneity (systems must interoperate across different platforms, operating systems, and protocols)

Recognising these as false assumptions enables engineers to design systems that handle the reality of heterogeneous, unreliable networks.

Communication Patterns

The choice of communication pattern fundamentally affects system performance, resilience, and complexity. There are two primary patterns: synchronous and asynchronous.

Synchronous Communication
Synchronous communication follows a request-response pattern where the client sends a request and blocks until the server provides a response. This pattern is intuitive and provides immediate feedback, which is essential for operations like user authentication where immediate confirmation is required. However, synchronous communication introduces tight coupling between services. If Service A makes a synchronous call to Service B, and Service B experiences performance degradation, Service A's performance is directly impacted.

This can lead to cascading failures where a single slow component at the bottom of the call stack causes the entire user-facing system to become unresponsive.

Asynchronous Communication
Asynchronous communication allows a service to send a message and continue processing without waiting for a response. This is typically implemented using message queues such as Kafka or RabbitMQ.

The primary advantage is decoupling—the sender and receiver don't need to be available simultaneously. If the receiving service is temporarily unavailable, messages remain in the queue until the service recovers. This architecture improves resilience and allows systems to handle traffic spikes by processing the queue at a sustainable rate rather than attempting to handle every request immediately.

The trade-off is increased complexity, as developers must handle eventual consistency and manage message ordering and delivery guarantees.
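
The fire-and-forget pattern can be sketched with Python's standard-library `queue.Queue` standing in for a broker like Kafka or RabbitMQ; the function and message names here are hypothetical, and a real broker would add durability and delivery guarantees this sketch omits:

```python
import queue
import threading

# In-memory stand-in for a message broker such as Kafka or RabbitMQ.
message_queue = queue.Queue()

def send_notification(user_id, message):
    """Producer: enqueue the work and return immediately (fire-and-forget)."""
    message_queue.put({"user_id": user_id, "message": message})

def worker(processed):
    """Consumer: drains the queue at its own pace, even if it starts late."""
    while True:
        msg = message_queue.get()
        if msg is None:  # sentinel value used to stop the worker
            break
        processed.append(msg)
        message_queue.task_done()

processed = []
send_notification(1, "Welcome!")        # the sender does not block
send_notification(2, "Order shipped")   # messages wait in the queue

t = threading.Thread(target=worker, args=(processed,))
t.start()
message_queue.put(None)  # signal shutdown after the backlog is drained
t.join()
print(len(processed))  # 2
```

Note that the producer returns before any message is processed: the consumer could be started minutes later and the messages would still be delivered, which is exactly the decoupling described above.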

Synchronous communication is used for operations requiring immediate responses (authentication, checkout processes), while asynchronous communication handles background tasks (email notifications, analytics processing, data synchronisation). The choice depends on whether the operation requires immediate user feedback or can be processed in the background.

Load Balancing and Traffic Distribution

In distributed environments, applications typically run across multiple server instances. Load balancers distribute incoming requests across this pool of servers to prevent any single instance from becoming overwhelmed. Load balancers operate at different layers of the OSI networking model, each offering distinct capabilities.

Layer 4 Load Balancing
Layer 4 load balancers operate at the transport layer (TCP/UDP) and make routing decisions based on network information such as source IP address and port numbers. Because they don't inspect packet contents, Layer 4 balancers are extremely fast and efficient, making them ideal for high-volume traffic scenarios such as video streaming or gaming applications where throughput is prioritised over intelligent routing.

Layer 7 Load Balancing
Layer 7 load balancers operate at the application layer (HTTP/HTTPS) and can inspect the actual content of requests. This enables intelligent routing based on URL paths, HTTP headers, or cookies. For example, a Layer 7 balancer can route requests for images to specialised image servers and requests for API endpoints to application servers. Layer 7 balancers also support session persistence (sticky sessions), ensuring that users maintain connections to the same server throughout their session, which is important for maintaining state such as shopping cart data.
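
The path-based routing decision can be sketched in a few lines; the server pool names are hypothetical, and a real Layer 7 balancer (nginx, HAProxy, an AWS ALB) would be configured declaratively rather than coded like this:

```python
from itertools import cycle

# Hypothetical backend pools; round-robin within each pool.
IMAGE_SERVERS = cycle(["img-1.internal", "img-2.internal"])
APP_SERVERS = cycle(["app-1.internal", "app-2.internal"])

def route(path):
    """Layer 7 routing: inspect the request URL, then pick the next
    server round-robin from the matching backend pool."""
    pool = IMAGE_SERVERS if path.startswith("/images/") else APP_SERVERS
    return next(pool)

print(route("/images/logo.png"))    # img-1.internal
print(route("/api/users"))          # app-1.internal
print(route("/images/banner.png"))  # img-2.internal
```

A Layer 4 balancer could not make the first branch at all: without inspecting the HTTP payload, the URL path is invisible to it.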

Data Management

Data management presents some of the most complex challenges in distributed systems. While application code is typically stateless and straightforward to replicate, data must persist and be kept consistent across multiple locations. Common strategies include:

Data Replication

Data replication involves creating and maintaining multiple copies of the same database across different servers to ensure that data remains accessible and performant. This strategy serves two primary purposes: high availability and read scalability. By distributing data across several nodes (servers), the system remains resilient; if one database node fails, others can immediately take over, ensuring service continuity. Furthermore, by spreading read queries across multiple replicas, the load on any single machine is significantly reduced, which drastically improves overall read throughput for the application.

However, replication introduces significant consistency challenges, often described as a trade-off between speed and accuracy. When a user updates data on a "Primary" node (used for writes), there is a "replication lag"—a temporal window where different replicas may still hold the old data. This leads to the concept of eventual consistency, where the system guarantees all replicas will eventually align, but temporary "time-travel" bugs (where a user refreshes a page and sees old data) are possible. While this trade-off is perfectly acceptable for social media likes or engagement metrics, it is often unacceptable for financial transactions or inventory management, where "strong consistency" is required.

To manage this, architectures typically split roles into Write and Read Replicas. The Write Replica (or Primary) acts as the single source of truth, handling all data modifications to prevent conflicts. In contrast, multiple Read Replicas stay synchronised with the Primary to handle the heavy lifting of search and retrieval. This "Read-Write Separation" allows a system to scale horizontally; if the website gets more traffic, the engineering team simply spins up more Read Replicas to handle the demand without putting additional stress on the core writing engine.
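
Read-write separation can be sketched as follows. This is a deliberately simplified model, with hypothetical class and method names: it replicates synchronously, so it does not exhibit the replication lag described above, whereas real replicas sync asynchronously:

```python
import itertools

class ReplicatedDatabase:
    """Sketch of read-write separation: all writes go to the primary,
    reads are spread round-robin across read replicas."""

    def __init__(self, num_replicas=3):
        self.primary = {}
        self.replicas = [dict() for _ in range(num_replicas)]
        self._next_replica = itertools.cycle(range(num_replicas))

    def write(self, key, value):
        # The primary is the single source of truth for modifications.
        self.primary[key] = value
        # Simplification: replicate synchronously. A real system streams
        # changes asynchronously, which is where replication lag comes from.
        for replica in self.replicas:
            replica[key] = value

    def read(self, key):
        # Round-robin across read replicas to spread query load.
        return self.replicas[next(self._next_replica)].get(key)

db = ReplicatedDatabase(num_replicas=3)
db.write("user:1", "Alice")
print(db.read("user:1"))  # Alice
```

Scaling reads then amounts to raising `num_replicas`: the write path is untouched, which is the horizontal-scaling property the text describes.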

A fascinating real-world example of this is OpenAI’s strategy for scaling ChatGPT. Despite their massive global scale, they famously lean heavily on a Single-Primary PostgreSQL architecture for their core relational data. To support millions of concurrent users, they don't just use a few replicas; they utilise a massive fleet of nearly 50 read replicas. They often employ "Cascading Replication," where the Primary sends updates to a few "hub" replicas, which in turn propagate the data to others. This prevents the Primary from being overwhelmed by the overhead of talking to 50 different machines at once, allowing OpenAI to maintain a simple, reliable source of truth while serving a global audience.


Data Sharding

Sharding, or horizontal partitioning, involves dividing a large dataset into smaller, manageable segments called shards and distributing them across different physical servers. Unlike replication, where every server maintains a complete copy of the same data, sharding ensures that each server holds only a unique portion of the total dataset. This architectural shift becomes necessary when the sheer volume of data exceeds the storage capacity of a single machine, or when the write throughput—the frequency of data being saved or updated—reaches a level that a single database instance can no longer process.

The primary challenge introduced by sharding is routing complexity. Because the data is fragmented, the application or a middleware layer must know exactly which shard contains the specific piece of information requested. This requires a robust partitioning strategy to direct queries to the correct server. Furthermore, sharding complicates "cross-shard" operations; performing joins or aggregations that require data from multiple different servers becomes significantly more difficult and computationally expensive than it would be in a non-sharded, monolithic database.

To manage this distribution effectively, systems often use a Sharding Key—a specific field, like a User ID or Zip Code—to determine where data lives. If the sharding key is poorly chosen, it can lead to "hot spots," where one server is overwhelmed with traffic while others sit idle. To prevent this and ensure an even distribution of data as the cluster grows, engineers typically implement advanced routing techniques like consistent hashing, which allows the system to add or remove shards with minimal disruption to the existing data.
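
A minimal consistent-hash ring illustrates the routing idea; the shard names are hypothetical, and production implementations (as in Cassandra or DynamoDB) add replication and careful virtual-node tuning on top of this:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent hashing: each shard is placed at many points
    ('virtual nodes') on a ring, and a key routes to the first shard
    clockwise from its hash. Adding or removing a shard only remaps the
    keys between neighbouring points, not the whole dataset."""

    def __init__(self, shards, vnodes=100):
        self._ring = []  # sorted list of (hash, shard) pairs
        for shard in shards:
            self.add_shard(shard, vnodes)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_shard(self, shard, vnodes=100):
        for i in range(vnodes):
            bisect.insort(self._ring, (self._hash(f"{shard}#{i}"), shard))

    def get_shard(self, key):
        h = self._hash(key)
        idx = bisect.bisect(self._ring, (h, "")) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["shard-a", "shard-b", "shard-c"])
print(ring.get_shard("user:12345"))  # always routes to the same shard
```

The virtual nodes are what smooth out the "hot spot" problem: with enough points per shard, keys spread roughly evenly even when shards are added or removed.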


The CAP Theorem

The CAP theorem is fundamental to understanding distributed systems trade-offs. It states that during a network partition, a distributed system can provide only two of three guarantees: Consistency, Availability, and Partition Tolerance.

Consistency means all nodes see the same data at the same time. Availability means every request receives a response, even if some nodes are unavailable. Partition Tolerance means the system continues operating despite network failures that prevent nodes from communicating.

In practice, partition tolerance is non-negotiable because network failures are inevitable in distributed systems. Therefore, system designers must choose between consistency and availability when partitions occur. A system optimised for consistency (CP) will refuse requests when it cannot guarantee all nodes have identical data, prioritising correctness over availability. Financial systems typically follow this model, as incorrect balances are unacceptable. Conversely, a system optimised for availability (AP) continues serving requests even when consistency cannot be guaranteed, accepting temporary inconsistencies in favour of uninterrupted service. Social media platforms often follow this model, as temporary discrepancies in like counts or follower numbers are tolerable compared to service unavailability.

Resilience Patterns

Distributed systems must be designed to handle failures gracefully. Several well-established patterns help build self-healing systems.

Retries with Exponential Backoff and Jitter
When a request fails, retrying is often appropriate, but naive retry strategies can exacerbate problems. If a service is failing due to overload, immediate retries increase the load further. Exponential backoff addresses this by progressively increasing wait times between retry attempts (for example, 1 second, then 2, then 4, then 8). Jitter introduces randomness to these intervals, preventing thousands of clients from retrying simultaneously—a phenomenon known as the thundering herd problem that can prevent service recovery.
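
The retry loop can be sketched as follows, using the "full jitter" variant (wait a random time between zero and the exponentially growing cap); the operation and parameter names are illustrative:

```python
import random
import time

def call_with_retries(operation, max_attempts=5, base_delay=1.0, max_delay=30.0):
    """Retry a failing operation with exponential backoff and full jitter,
    so thousands of clients don't all retry at the same instant."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure to the caller
            cap = min(max_delay, base_delay * (2 ** attempt))  # 1s, 2s, 4s, 8s...
            time.sleep(random.uniform(0, cap))  # jitter: randomise the wait

# Hypothetical flaky operation: fails twice, then succeeds.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

print(call_with_retries(flaky, base_delay=0.01))  # ok
```

The `min(max_delay, ...)` cap matters in practice: without it, a long outage would push individual waits into minutes and make recovery detection sluggish.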

Circuit Breaker Pattern
The circuit breaker pattern monitors the health of remote services and prevents cascading failures. When a service begins failing frequently, the circuit breaker trips to an open state, immediately failing all requests without attempting to contact the failing service. This gives the failing service time to recover while preventing the calling service from wasting resources on timeouts. After a cooldown period, the circuit enters a half-open state, allowing a test request to determine if the service has recovered before fully closing the circuit and resuming normal operation.
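
A minimal version of the state machine looks like this; thresholds and names are illustrative, and production libraries (Resilience4j, Polly, and similar) add thread safety and richer failure accounting:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after `threshold` consecutive failures the
    circuit opens and calls fail fast; once `cooldown` seconds elapse, the
    next call is allowed through (half-open) to probe for recovery."""

    def __init__(self, threshold=3, cooldown=30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            # Cooldown elapsed: half-open, let one test request through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit again
        return result

breaker = CircuitBreaker(threshold=2, cooldown=60.0)

def failing_service():
    raise ConnectionError("service down")

for _ in range(2):  # two consecutive failures trip the breaker
    try:
        breaker.call(failing_service)
    except ConnectionError:
        pass

try:
    breaker.call(failing_service)  # fails fast; the service is never contacted
except RuntimeError as e:
    print(e)  # circuit open: failing fast
```

The key property is the fail-fast branch: while the circuit is open, the failing service receives no traffic at all, which is what gives it room to recover.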

Idempotency
Idempotency is the property where an operation produces the same result whether executed once or multiple times. In distributed systems with unreliable networks, this property is essential. If a network failure occurs during a payment transaction and it's unclear whether the transaction was completed, the system needs to safely retry without charging the customer twice. Implementing idempotency typically involves using unique transaction identifiers (idempotency keys) that allow servers to recognise duplicate requests and return the original result rather than reprocessing the operation.

Observability
In monolithic applications, debugging typically involves examining logs from a single server. In distributed systems, a single user action can trigger requests across dozens of services, making troubleshooting significantly more complex. Observability provides the tools and practices to understand system behaviour in production.

The Three Pillars of Observability

1. Metrics provide quantitative measurements of system health, such as CPU usage, error rates, and latency percentiles. Metrics are aggregated numeric data that indicate when something is wrong. For example, a spike in error rate or an increase in 95th percentile latency signals that investigation is needed. Metrics are efficient to store and ideal for real-time monitoring dashboards and alerting systems.

2. Traces follow individual requests as they traverse the system. Each request is assigned a unique trace ID that links together all the service calls involved in handling that request. Distributed tracing reveals bottlenecks and pinpoints which specific service or operation is causing performance problems. Tracing systems like Jaeger and Zipkin enable engineers to visualize the entire request flow and identify where latency is introduced.

3. Logs provide detailed, event-level information about what occurred in the system. While metrics tell you that something is wrong and traces tell you where, logs provide the forensic detail needed to understand why. Logs contain error messages, stack traces, and contextual information that are essential for root cause analysis. However, logs are expensive to store at scale and typically require sampling or filtering strategies.

The observability workflow typically proceeds as follows: metrics indicate a problem exists, traces identify the problematic service or component, and logs provide the specific details needed to diagnose and fix the issue. Common tools include Prometheus for metrics, Jaeger for distributed tracing, and the ELK Stack (Elasticsearch, Logstash, Kibana) for log aggregation and analysis.

Conclusion

Any application requiring high availability, horizontal scaling, geographic distribution, or resilience to hardware failures is inherently a distributed system. The building blocks—asynchronous code, API design, state management—are familiar to most developers. What changes is the arrangement of these components and the explicit handling of failure scenarios.

The challenge lies in rewiring your mental model to prioritise failure handling, understand consistency trade-offs, and optimise for performance within these constraints. Once this mindset shift occurs, you gain the capability to build systems that serve billions of requests, survive data centre outages, and scale globally.

The journey from implementing a simple authentication flow to architecting petabyte-scale distributed databases is substantial, but it follows consistent principles: decomposing complex problems into small, manageable components that coordinate to achieve outcomes greater than the sum of their parts. This is the essence of distributed systems engineering.

This was an introduction to distributed systems. In upcoming articles, I will dig deeper into each of these areas.
