Ever confused on how to prepare system design and wanted to know more about it? This book is a great start.

Part I. Foundation of Data Systems

Chapter 1: Reliable, Scalable and Maintainable Applications

The fact is that applications are shifting from Compute-intensive to Data-intensive.

Defines a couple metrics including
Reliability: the ability to continue to function
Scalability: the ability to work as scale goes up
Maintainability: does the future engineer hate the system ;) as well as how easy this system can evolve

A couple terminologies to review:
p50, p95 and p99
Falult vs Failure

Chapter 2: Data Models and Query Languages

NoSQL (interpreted as "Not Only SQL") was invented in 2010s?!?! (I thought it was earlier like 90s).

  • very high write throughput
  • specialized query that are not well supported by the relational model

Impedance mismatch: the conflict between data model and OO objects.

Some believe JSON is the cure for "impedance mismatch", but there are also problems with JSON.

One advantage of using ID: it means nothing to a human therefore it never need to change (hey I never thought about this point!).

Some side knowledge mentioned: MySQL copies the entire table on ALTER TABLE, while most relational database system execute the statement in a few milliseconds.

Network Model vs Relational DBMS

One is like a tree to model the relationship - updating and look up is very hard - fits one to many usually.

You know RDBMS, it won and took the whole world. - fits many to many data model better.

Imperative vs Declarative

One tells the actions to perfrom in certain order, the other specify the pattern and condition, but not how to achieve the goal.

Declarative language lend themselves to parallel execution; while imperative code is very hard to parallelize.

CSS is a wonderful example, as browser vendor can improve certain CSS functions, but it won't break how the page is rendered.

Chapter 3: Storage and Retrieval

Based on hashmap (in memory) as the index

Index的一个常用做法就是key-value pair,那么当更新次数越来越多的时候,就会占用很多空间。如果把每个key-value update log作为一个segment,这时候可以用所谓的“compaction”来合并记录。 每个segment都是只读的,所以要把合并的数据写入新的segment,然后删除旧的即可。

  • Compaction: Compaction means throwing away duplicate keys in the log, and keeping only the most recent update for each key.

建立在log是“append only”这个基础上。append only显然有好处,就是又快又简单,并且concurrency和crash recovery都更加简单。

Limitation: poor sequential query and hash table has to fit in the memory, can't deal with large map; restore is time-costing when crash (since need read from the disk to restore the hashmap).

SSTable and LSM-Trees

SSTable: Sorted String Table, above hashmap but with key sorted (smart!).

SSTTable makes merge and search easy. But how to maintain such a table? - Use trees, such as red-black trees or AVL trees.

Still one problem left: if database crashes, the most recent writes are lost. - Use log.

LSM-Tree: Log-Structured Merge-Tree

B-Tree

branching factor: the number of references to child pages in one page of the B-Tree.

LSM-Tree VS B-Tree

write amplification: one write to db resulting in multiple writes to the disk.

LSM-trees are typically able to sustain higher write throughput than B-Trees, partly because they have lower write amplification and partly because they sequentially write compact SSTable files rather than having to overwrite several pages in the tree.

LSM-tress can be compressed better.

The downside is the compaction process can interfere with the performance. If throughput is high, it can also happen that compaction can't keep up with the rate of incoming writes, causing run out of disk space.

Column Compression

Only load columns from the disk can help reduce storage, furthermore, column-oriented storage often lends itself very well to compression.

One particularly effective technique is bitmap encoding.
9, 1 -> 9 zeros, 1 one, reset zeros
5, 4, 3, 3, -> 5 zeros, 4 ones, 3 zeros, 3 ones, reset zeros
etc.

(It's very similar to the compression technique I learned when I was in middle school, describe 1 and 0 instead of present them.)

Chapter 4: Encoding and Evolution

Backward compatibility & Forward compatibility

Newer code can read data that was written by older code. (However, I didn't realize the importance of Forward compatibility before, as older code should be able to read data written by newer code!).

Forward compatibility usually is trickier, as it requires older code to ignore additions made by a newer version of the code.

JSON, XML, CSV

Good for most cases, but has slight issues such as can't encode integer larger than 2^53, there are work around but are also hacky.

Besides, uses larger size than binary encoding.

Thrift & Protocol Buffers

I've used both. These are binary encoding libraries and both require a schema for any data that is encoded.

You can change the name of a field in the schema, since the encoded data never refers to field names, but you cannot change a field’s tag, since that would make all existing encoded data invalid.

This numbered tag field mechanism ensures both backward compatibility as the old fields stays the same; and forward compatibility as new fields can just be ignored by older cold when it can't recognize.

What about change data type?

It's doable, but has risk.

Avro

Avro is another lib for Java. Avro uses a bit different yet intersting approach: there is no number in the schema like Thrift. The schema resolution depends on the field name instead.

null is not a case for Avro. If you want to allow a field to be null, you have to use a union type.

Avro is friendlier to dynamically generated schemas, for its field name based not number tag based. In this sense, for Thrift, you have to assign by hands.

REST & SOAP

I know REST.

SOAP is an XML-based protocol for making network API requests. The API of a SOAP web service is described using an XML-based language called the Web Services Description Language, or WSDL. WSDL is not designed to be human-readable, therefore users of SOAP rely heavily on tool support, code generation and IDEs.

Problem with RPCs

One problem, as I encountered when write iOS test, is passing references. For example, you can't simply pass object references in RPC calls, like you can't use objects when use EDO in iOS test. You can only easily use primitivies like numbers or strings.

Part II. Distributed Data

Chapter 5: Replication

All of the difficulty in replication lies in handling changes to replicated data.

Since if data never change, then just to copy it to every node...

Leader-Followers (Master-Slave)

Read can query either leader or any of the follower, but write can only goes to the leader.

It's a model, not just restricted to database.

Sync VS Async

One disadvantage of complete async: if leader fails write, then it is not recoverable, any writes haven't been replicated to followers are lost.

Adding a new follower

It is not practical to lock the DB and copy all the data. One way to do is take a snapshot, copy to the new follower node. The follower connects to the leader and requests backlog of data change, once it's done (called caught up), then this node can continue process data as other nodes.

Handle Node Outages

Failed Follower Follower each keeps a log, can request from the leader to catch up from the failed transaction.

Failed Leader One of the followers needs to be promoted to be the new leader, clients need to be reconfigurated to send writes to the new leader and the other followers start consuming data from the new leader. - Called failover

split brain: two nodes both believe that they are the leader. It is dangerous as if both accept writes and there is no process for resolving conflicts, data is likely to be lost or corrupted.

Implementation of Replication logs

Problem with Write-ahead log (WAL): it is quite low level to the storage engine, therefore it is typically not possible to run different versions of the strage format of the database

An alternative is to describe writes to db: such replication log is called a logical log. It's decoupled from the storage engine internals.

Problem with replica

To achieve a sort thing called read-after-write consistency, which means users usually need to see their update. It can be even more complicate when the same user access service from multiple devices.

Monotonic reads: to guarantee the result won't go backwards, meaning to avoid the case that the user read data from a follower, then the second read from another follower which hasn't caught up yet causing the data that the user has seen seems "lost" and causes confusion.

consistent prefix: Guarantee if a sequence of write happens in a certain order, then anyone reading those writes will see them appear in the same order.

Multi-Leader Replication

One benefit is that can provide support for client offline operation. For example, calendar app. Each device can have a local db that acts as a leader, once the user has the internet access, write to the followers.

Collaborative editing is an application of such multi-leader model.

Handle Write Conflicts

Multi-Leader Replication Topologies

Leaderless Replication

One problem is to avoid reading from stale node which has stale data, may due to it was rebooting and didn't catch up. To solve this problem, when a client reads from the DB, it sends read requests to serveral nodes in parallel and pick the latest version of data.

Read repair: can be achieved since client reads from several nodes, so it can detect any stale responses.

Anti-entropy process: DB itself have a background process that consistantly looks for differences in the data between replicas and copies missing data from one replica to another.

Concurrent writes

We simply call two operations concurrent if they are both unaware of each other, regardless of the physical time at which they occurred.

There is a very nice example of shopping cart.

Chapter 6: Partitioning

If partitioning is unfair, that some partitions have more data or queries than others, call it skewed. However, random assign will cause it very hard to locate data, thus the solution is to use a hash function, that can distribute keys evenly between a big range and can consistently locate by the key.

Partitioning and Secondary Indexes

Downside: lose the ability to do efficient range queries. -> Refer to another key to index. For example, use (user_id, update_timestamp) for posts, where different users store on different partitions, but within each user, the updates are ordered on a single partition.

Still it could cause problem - What if the same key is read and write heavily (for example some celebrity on social media). -> One simple solution is to mitigate from the client side, adding simple random number to the key so that to split the write and read (because the key changed).

scatter/gather: different index key results the same data in different partitions, making read expensive.

Rebalance Partitions

Request Routing

This is for the client, it needs to know which node to connect.

Refer to ZooKeeper: maintain track of cluster metadata.

Chapter 7: Transactions

Harsh reality can cause many things to go wrong. However, fault tolerance mechanism is complicated to implement, therefore transactions have been the mechanism of choice for simplifying these issues.

There are arguments about whether using transactions, but the truth is not that simple, transactions like every other technical design choice, have advantages and limitations.

ACID

Atomicity, Consistency, Isolation and Durability.

BASE: Basically Available, Soft state and Eventually consistency. (lame :P)

Application can be implemented without transactions, however, error handling becomes much more complicated without atomicity, and the lack of isolation can cause concurrency problems.

Errors will inevitably happen, but many software developers prefer to think only about the happy path rather than the intricacies of error handling.

TRUE. As an engineer, need consider bad paths - to myself.

Weak Isolation levels

serializable: guarantee transactions have the same effect as if they ran serially. It also has a performance cost.

Snapshot Isolation

Typically use write locks to prevent dirty writes, which means that a transaction that makes a write can block the process of another transaction that writes to the same object. However, reads do not require any locks.

MVCC: Multiversion concurrency control.

It has different names in different databases. Oracle it is called serializable, and in PostgreSQL and MySQL, it is called repeatable read.

Lost update problem

Tow transactions modified the same value concurrently, one of the modifications can be lost, because the second write does not include the first modification.

Norm: read-modify-write cycles.

Compare-and-set: if current value does not match what you previously read, the upadte has no effect and the read-modify-write cycle must be retried.

commutative: Atomic operations that can apply in a different order on different replicas and still get the same result.

Write skew

One example is for hospital oncall: there are need to be at least 1 doctor oncall. Now Alice and Bob should be oncall, they both feel uncomfortable and decided to take off. Unfortunately they submit the offcall request at the same time, Two transaction both checked with the system and get there are "2" doctors oncall thus the system approves both to take off, resulting no one is oncall.

Generally, write skew happens when two transactions read the same objects and then update some of those objects.
It's definitely a race condition: if two transactions had run one after another, the second doctor would have been prevented from taking off. The anomalous behavior was only possible because the transactions ran concurrently.

phantom: A write in one transaction changes the result of a search query in another transaction.
materializing conflicts: Takes a phantom and turns it into a lock conflict on a concrete set of rows that exist in the database. (It is hard and ugly to implement, considered as a last resort if no alternative is possible.)

Chapter 8: The Trouble with Distributed Systems

Building a reliable system from unreliable components.

Assume everything that can go wrong will go wrong: unreliable network, clocks, etc.

Unreliable Network

The unreliable network could make falut hard to tell, as you can't tell if it is due to network or the node that is failing.

Unreliable Clocks

NTP: A protocol which allows the computer clock to be adjusted according to the time reported by a group of servers. The servers in turn get their time from a more accurate time source, such as a GPS receiver.

Time-of-day clocks have some problems. In particular, if the local clock is too far ahead of the NTP erver, it may be forcibly reset and appear to jump back to a previous ahead of the NTP server,. These jumps, as well as the fact that they often ignore leap seconds, make time-of-day clocks unsuitable for measuring elapsed time.

Monotonic clocks

A monotonic clock is suitable for measuring a duration, such as a timeout or a service's response time.

The name comes from the fact that they are guaranteed to always move forward

LWW: last write wins. Due to the time sync issue, newer data from other nodes can have an "earlier" timestamp, causing the node that should sync the data thinks that the data is "old", thus drop the write.

fending: an extra counter that is assigned when the lock server grants a lock or lease, which increases every time a lock is granted. Then every time a client sends a write request to the storage service, it must include its current fencing token. This is to prevent a client pause, for example due to garbage collection, and when it comes back, it thinks itself is still the leader thus make write requests, while its lease would've been expired. Such writes will corrupt the data (since there are other nodes assigned as leaders and issuing writes).

One example, zookeeper could be used as lock service, its zxid or the node cversion can be used as fencing token.

The problem becomes harder when the nodes can tell "lie" - known as Byzantine fault, but usually can safely assume that there are no such faults.

Chapter 9 Consistency and Consensus (Need re-read)

The best way of building fault-tolerant systems is to find some general-purpose abstractions with useful guarantees, implement them once, and then let applications rely on those guarantees.

Eventually consistent is actually a weak guarantee.

Linearizability

To make a system appear as if there were only one copy of the data and all operations on it are atomic.

To test a system is linearizable: recording the timing of all requests and responses, and checking whether they can be arranged into a valid sequential order.

Linearizability is important in a couple scenarios:

  • Locking and leader election -> make sure there is indeed only one leader
  • Constraints and uniqueness guarantees
  • Cross-channel timing dependencies -> Image upload, the resizer could fetch old image causing the thumbnail and actual impage inconsistent if the system is not linearizable.

Implement linearizable systems

Ordering Guarantee

logic clock: an algorithm to generate a sequence of numbers to identify operations, typically using counters that are incremented for every option. It provides a total order, which is important for linearizability.

Lamport timestamps

Each node has a uniqute identifier and each node keeps a counter of the number of operations it has processed. The id is as (counter, node ID).

If you have two timestamps, the one with a greater counter value is the greater timestamp, if the counter values are the same, the one with the greater node ID is the greater timestamp. Besides, each node carries the maximum counter value it has seen. When a node receives a request or response with a maximum counter value greater than its own counter value, it immediately increases its own counter to that maximum.

Total Order Broadcast

Lamport timestamps is still not enough. For example, it can't solve two users register the same user name at the same time, as you need to know the total order is finalized. Therefore a broadcast is needed.

Total order broadcast is exactly what you need for database replication: if every mes‐ sage represents a write to the database, and every replica processes the same writes in the same order, then the replicas will remain consistent with each other.

Part III. Derived Data

Chapter 10: Batch Processing

Introduced Unix philosophy in the beginning, which I already know most of it.

Separating the input/output wiring from the program logic makes it easier to compose small tools into bigger systems.

MapReduce

NAS: Network Attached Storage
SAN: Storage Area Network

HDFS as the example, which is based on the shared-nothing principle.