This is a short yet fun online book for learning more about distributed systems. I never took a distributed systems course, so I started with this one. I did read Designing Data-Intensive Applications, but that one I need to keep revisiting often.

It has only five chapters, so I read it in a day. I found it quite helpful: it categorizes and summarizes the roadmap for learning distributed systems pretty well, which helps me go deeper into other books.

Chapter 1: Basics

Distributed programming is the art of solving the same problem that you can solve on a single computer using multiple computers.

Things become tricky as size grows:

Scalability

  • Size scalability
  • Geographic scalability
  • Administrative scalability: adding more nodes should not increase the administrative costs of the system.

Performance (and latency)

  • Short response time/low latency for a given piece of work
  • High throughput
  • Low utilization of computing resources

The author emphasizes the link between "latent" and latency: latency is the period during which something is latent, i.e., the delay between when something is initiated and when it takes effect.

Availability (and fault tolerance)

A distributed system can take a bunch of unreliable components and build a reliable system on top of them.

availability = uptime / (uptime + downtime)
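A quick sanity check of the formula in Python (my own sketch; 8.76 hours is the standard yearly downtime budget for "three nines" over an 8760-hour year):

```python
def availability(uptime: float, downtime: float) -> float:
    """availability = uptime / (uptime + downtime)"""
    return uptime / (uptime + downtime)

# A system that is down 8.76 hours out of a 8760-hour year:
a = availability(8760 - 8.76, 8.76)
print(f"{a:.3f}")  # 0.999, i.e. "three nines"
```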

Difference between an error and an anomaly - an error is incorrect behavior, while an anomaly is unexpected behavior.

Abstractions make things more manageable by removing real-world aspects that are not relevant to solving a problem. Models describe the key properties of a distributed system in a precise manner.

  • System model (async/sync)
  • Failure model (crash-fail, partitions, Byzantine)
  • Consistency model (strong, eventual)

Design techniques

  • Partitioning: split data across multiple nodes to allow for more parallel processing
  • Replication: data is copied or cached on different nodes to reduce the distance between the client and the server and for greater fault tolerance
    • Replication improves performance and availability.
    • Replication is also the source of many problems, since there are now independent copies of the data that have to be kept in sync on multiple machines.

Chapter 2: Up and down the level of abstraction

What do we mean when we say X is more abstract than Y? First, that X does not introduce anything new or fundamentally different from Y.

Reminder: using abstraction is very powerful.

A system model is used to enumerate the many assumptions associated with a particular system design.

Nodes

State is mainly local: nodes execute programs and store data in volatile memory and on stable storage.

Nodes can fail or misbehave. (my note: error and anomaly?)

Communication

Communication links nodes.

Generally it is preferable to consider the network to be unreliable.

Node partition vs. node failure: partitioned nodes just have network issues, and other clients can still access them; failed nodes can't be accessed by anyone.

Time/Ordering assumptions

The synchronous system model imposes many constraints on time and order. (Easier)

Asynchronicity is a non-assumption: it just assumes that you can't rely on timing (or a "time sensor").

Consensus problem (consistency/agreement)

FLP impossibility and CAP

FLP examines a very weak form of the consensus problem under a set of assumptions.

The result: algorithms that solve the consensus problem must either give up safety or liveness when the guarantees regarding bounds on message delivery do not hold.

The CAP theorem is a related theorem that is more relevant to practitioners: it makes slightly different assumptions (network failures rather than node failures), and has more clear implications for practitioners choosing between system designs.

CAP Theorem

I always forget the definition, write down the notes here:

  • Consistency: all nodes see the same data at the same time.
  • Availability: node failures do not prevent survivors from continuing to operate.
  • Partition tolerance: the system continues to operate despite message loss due to network and/or node failure

A CA system does not distinguish between node failures and network failures, and hence must stop accepting writes everywhere to avoid introducing divergence.

Partition tolerance is an important property for modern systems.

Consistency Models

Strong and weak.

Strong consistency models allow you as a programmer to replace a single server with a cluster of distributed nodes and not run into any problems.

On the other hand, weak consistency models behave in a way that is distinguishable from a non-replicated system.

Eventual consistency is a very weak one (I thought it was kinda strong).

  • how long is "eventually"?
  • how do replicas agree on the same value? One way to decide is that the value with the largest timestamp always wins.
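That "largest timestamp wins" rule (last-write-wins) is easy to sketch in Python; the data layout here is my own, not the book's:

```python
# Last-write-wins reconciliation: among divergent replica values,
# the value carrying the largest timestamp wins.
def reconcile(replicas):
    # replicas: list of (timestamp, value) pairs from divergent copies
    return max(replicas, key=lambda tv: tv[0])[1]

print(reconcile([(3, "b"), (5, "c"), (1, "a")]))  # "c": timestamp 5 is largest
```

Note that this silently discards the other writes, which is exactly why eventual consistency is a weak guarantee.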

Chapter 3: Time and order

Order is important because we want a distributed system to work like a single machine, where the order of operations is preserved.

Total and partial order

The natural state in a distributed system is partial order.

If a ≤ b and b ≤ a then a = b (antisymmetry);
If a ≤ b and b ≤ c then a ≤ c (transitivity).

We would lose information by forcing a total order where none existed.

Time

A source of order.

In some sense, time is just like any other integer counter.

Global clock

Examples: Facebook Cassandra and Google's Spanner (TrueTime API)

Imposing a total order is possible, but expensive.

"Local-clock" assumption

You cannot meaningfully compare timestamps from two different machines.

"No-clock" assumption

Logical time: essentially a counter, used to determine order with communications.

Vector clocks

Lamport clocks and vector clocks are replacements for physical clocks which rely on counters and communication to determine the order of events across a distributed system.
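The vector-clock mechanics can be sketched in a few lines of Python (the node names and helper functions are my own, not from the book):

```python
# A minimal vector clock: one counter per node; merging takes the
# element-wise max. Event a happened-before b when a's clock is
# dominated by b's; if neither dominates, the events are concurrent.
def merge(a: dict, b: dict) -> dict:
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}

def happened_before(a: dict, b: dict) -> bool:
    return all(a.get(n, 0) <= b.get(n, 0) for n in a.keys() | b.keys()) and a != b

x = {"A": 2, "B": 1}
y = {"A": 1, "B": 3}
print(happened_before(x, y))             # False: neither dominates (concurrent)
print(happened_before(x, merge(x, y)))   # True: the merged clock dominates both
```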

A Lamport clock is simple. Each process maintains a counter using the following rules:

  • Whenever a process does work, increment the counter
  • Whenever a process sends a message, include the counter
  • When a message is received, set the counter to max(local_counter, received_counter) + 1
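The three rules above can be sketched as a tiny Python class (my own sketch; the method names are hypothetical):

```python
class LamportClock:
    def __init__(self):
        self.counter = 0

    def tick(self):                       # rule 1: doing work increments
        self.counter += 1
        return self.counter

    def send(self):                       # rule 2: include the counter in the message
        return self.counter

    def receive(self, received_counter):  # rule 3: max(local, received) + 1
        self.counter = max(self.counter, received_counter) + 1
        return self.counter

a, b = LamportClock(), LamportClock()
a.tick()            # a.counter == 1
msg = a.send()      # the message carries counter value 1
b.receive(msg)      # b.counter == max(0, 1) + 1 == 2
```

The resulting counters give an order consistent with causality, but concurrent events may still end up with arbitrary relative counters, which is what vector clocks fix.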

Failure detectors

The amount of time spent waiting can provide clues about whether a system is partitioned or merely experiencing high latency.

A failure detector is a way to abstract away the exact timing assumptions. Failure detectors are implemented using heartbeat messages and timers.
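A naive heartbeat-plus-timer detector might look like this (my own sketch; the timeout choice trades accuracy, i.e., not wrongly suspecting a slow node, against detection speed):

```python
class FailureDetector:
    def __init__(self, timeout: float):
        self.timeout = timeout
        self.last_heartbeat = {}  # node -> time of the last heartbeat seen

    def heartbeat(self, node: str, now: float):
        self.last_heartbeat[node] = now

    def is_suspected(self, node: str, now: float) -> bool:
        # Suspect a node if we have never heard from it, or if its
        # last heartbeat is older than the timeout.
        last = self.last_heartbeat.get(node)
        return last is None or now - last > self.timeout

fd = FailureDetector(timeout=5.0)
fd.heartbeat("n1", now=0.0)
print(fd.is_suspected("n1", now=3.0))   # False: heartbeat is recent
print(fd.is_suspected("n1", now=10.0))  # True: too long since the last heartbeat
```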

Chapter 4: Replication

Mainly two approaches to deal with replication:

  • Maintain a single copy
  • Allow divergence (multi-master systems)

Synchronous replication

Each request waits until all the replica servers reply. This means no tolerance of server loss, and the slowest server determines the system speed.

Asynchronous replication

Opposite: the system is fast and has high availability, but it has weak durability guarantees: if the only server that contains the data is lost before the data replicates to other servers, the data is permanently lost.

Chapter 5: Replication: weak consistency model protocols

This chapter focuses on allowing divergence and reaching agreement eventually.

Amazon Dynamo

Availability > Consistency:
Replicas are allowed to diverge, and when a key is read there is a "read reconciliation" phase that attempts to reconcile differences between replicas before returning the value to the client.

Consistent hashing

Used to map keys to a set of nodes: a client can locate keys without having to query the system for the location of each key, which saves system resources since hashing is generally faster than an RPC.
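A bare-bones hash ring in Python (my own sketch; real Dynamo-style rings also use virtual nodes for better balance): a key maps to the first node clockwise from its hash, so any client holding the node list can locate a key locally.

```python
import bisect
import hashlib

def h(s: str) -> int:
    # Hash a string to a point on the ring.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # Sort nodes by their hash positions on the ring.
        self.ring = sorted((h(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        hashes = [pos for pos, _ in self.ring]
        # First node clockwise from the key's hash; wrap around with modulo.
        i = bisect.bisect(hashes, h(key)) % len(self.ring)
        return self.ring[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))  # the same key always maps to the same node
```

Adding or removing one node only remaps the keys in its arc of the ring, rather than reshuffling everything as `hash(key) % len(nodes)` would.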

Partial quorums

Similar to quorum system but a majority is not required and different subsets of the quorum may contain different versions of the same data.

R + W > N, where N is the number of replicas for a value

  • R = 1, W = N: fast reads, slow writes
  • R = N, W = 1: fast writes, slow reads
  • R = N/2 and W = N/2 + 1: favorable to both
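A brute-force check of why R + W > N matters (my own sketch): any read quorum of size R and write quorum of size W drawn from N replicas must then share at least one replica, so a read always contacts some replica that saw the latest write.

```python
from itertools import combinations

def quorums_overlap(n: int, r: int, w: int) -> bool:
    # True iff every read quorum of size r intersects every write quorum
    # of size w chosen from n replicas.
    replicas = range(n)
    return all(set(rq) & set(wq)
               for rq in combinations(replicas, r)
               for wq in combinations(replicas, w))

print(quorums_overlap(3, 2, 2))  # True:  R + W = 4 > N = 3
print(quorums_overlap(3, 1, 2))  # False: R + W = 3 = N, quorums can miss each other
```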

Note that R + W > N is not the same as "strong consistency".

Conflict detection and read repair

Systems that allow replicas to diverge must have a way to eventually reconcile two different values.

One way to achieve this is by detecting conflicts at read time; in general this is done by tracking the causal history of a piece of data by supplementing it with some metadata. Clients must keep this metadata when they read data from the system and must pass it back when writing to the database.

CRDT (convergent replicated data types)

  • Associative (a+(b+c)=(a+b)+c), so that grouping doesn't matter
  • Commutative (a+b=b+a), so that order of application doesn't matter
  • Idempotent (a+a=a), so that duplication does not matter
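These three properties show up concretely in a grow-only counter (G-Counter), the classic CRDT example; this sketch is mine, not the book's:

```python
# G-Counter: each node keeps its own increment count; merge is the
# element-wise max, which is associative, commutative, and idempotent,
# so replicas can exchange state in any order and still converge.
def merge(a: dict, b: dict) -> dict:
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}

def value(counter: dict) -> int:
    return sum(counter.values())

x = {"A": 3, "B": 1}   # node A incremented 3 times, node B once
y = {"B": 2, "C": 5}

assert merge(x, y) == merge(y, x)   # commutative: order doesn't matter
assert merge(x, x) == x             # idempotent: duplicates don't matter
print(value(merge(x, y)))           # 3 + 2 + 5 = 10
```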

CALM (consistency as logical monotonicity)

It states that logically monotonic programs are guaranteed to be eventually consistent.

Monotony: if sentence φ is a consequence of a set of premises Γ, then it can also be inferred from any set Δ of premises extending Γ.

I found it really hard to understand, but the examples are quite useful:
Non-monotonic: For example, if we learn that Tweety is a bird, we'll assume that Tweety can fly; but if we later learn that Tweety is a penguin, then we'll have to revise our conclusion.

Monotonic: Once we know that Tweety is a bird (and that we're reasoning using monotonic logic), we can safely conclude that Tweety can fly and that nothing we learn can invalidate that conclusion.

Another quote from the book: "Adding two numbers is monotonic, but calculating an aggregation over two nodes containing numbers is not. What's the difference? One of these is a computation (adding two numbers), while the other is an assertion (calculating an aggregate)."