Thursday, March 7, 2013

The CAP theorem made simple

The basic idea of the CAP theorem is that in a distributed system, you can have at most two of the following three properties: data consistency (C), availability (A), and partition tolerance (P).  Here I'll try to show how the trade-off between these three goals quickly emerges in designing a simple distributed key-value store.

First, why do we want the key-value store to be distributed at all?  One good reason is performance -- we want multiple nodes available to service read requests on items in the data store, spreading the load across the cluster. So fast performance is a fourth design goal, one that is not always highlighted.

Let's suppose clients are connected to each node in the cluster and, for each request, a server is chosen at random. Our first consistency requirement is that, in general, it should not matter which node receives a read request -- the same value should be returned either way.  One way to satisfy this is to require that nodes propagate write requests to the other nodes in the cluster: when a write request is received, the node writes the new value to its local copy of the store and then forwards the request to the other nodes.  After some period of time, barring node failure or loss of network connectivity, the updated value should be present on all nodes.
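
Here's a minimal sketch of this scheme in Python.  The Node class and its method names are made up for illustration; a real store would send forwarded writes over the network rather than calling peer objects directly.

    # A naive write-propagation sketch: every node keeps a full local
    # copy and forwards each write it receives to all its peers.
    class Node:
        def __init__(self, peers):
            self.store = {}      # this node's local copy of the data
            self.peers = peers   # the other nodes in the cluster

        def handle_read(self, key):
            # Any node can answer a read from its local copy.
            return self.store.get(key)

        def handle_write(self, key, value, forward=True):
            # Apply the write locally first...
            self.store[key] = value
            # ...then propagate it to every other node.  Barring
            # failures, all copies eventually hold the new value.
            if forward:
                for peer in self.peers:
                    peer.handle_write(key, value, forward=False)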

However, this leads to another consistency problem.  Suppose two clients write different values to the same key at nearly the same time, and these two write requests are processed by different nodes in the cluster.  Because the forwarded writes can arrive at different nodes in different orders, the system can easily end up in an inconsistent state, where some nodes have one value for the key and others have a different value, and the value returned for a given read request is unpredictable.
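
To make the race concrete, here's one possible interleaving, reusing the illustrative Node class from the sketch above:

    # Two nodes that replicate to each other.
    a = Node(peers=[])
    b = Node(peers=[])
    a.peers, b.peers = [b], [a]

    # Each node applies its own client's write first...
    a.store["x"] = "from-client-1"
    b.store["x"] = "from-client-2"

    # ...and then the forwarded writes cross on the network.
    b.handle_write("x", "from-client-1", forward=False)
    a.handle_write("x", "from-client-2", forward=False)

    print(a.store["x"])  # "from-client-2"
    print(b.store["x"])  # "from-client-1" -- the replicas disagree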

One way to address this consistency challenge is to impose the constraint that, at any point in time, only a single node is allowed to modify the store.  All writes must first be routed to this master node, which updates its copy of the store and then forwards the request to the rest of the cluster. Concurrent write requests are serialized through the master, ensuring that the later request takes precedence.  A protocol like Paxos can be used to guarantee that only a single master is ever active, and that if the master goes down, a new one will be elected.
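
As a rough sketch of this arrangement (again with made-up names, and with leader election itself left out), the master can stamp each write with a sequence number that defines a single agreed order of updates:

    class MasterNode(Node):
        def __init__(self, peers):
            super().__init__(peers)
            self.next_seq = 0    # the master numbers writes in arrival order

        def handle_write(self, key, value, forward=True):
            # The master alone decides the order of writes...
            self.next_seq += 1
            self.store[key] = value
            # ...and pushes each one to the replicas in that order.
            for peer in self.peers:
                peer.apply(self.next_seq, key, value)

    class Replica(Node):
        def apply(self, seq, key, value):
            # A real replica would use seq to detect reordered or
            # missing updates; here writes simply arrive in order.
            self.store[key] = value

    replicas = [Replica(peers=[]) for _ in range(2)]
    master = MasterNode(peers=replicas)
    master.handle_write("x", "v1")   # every copy of "x" now agrees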

So now we have a system with good consistency and read performance.  Requiring that all writes go through a single node means we have sacrificed a little bit of write performance.  What about availability and partition tolerance?  Well, as long as network connectivity is good and no nodes fail, we have no problems with availability -- write and read requests continue to be processed. We have C and A, but we have not yet explored what happens when a partition actually occurs.  Let's consider the case where a network partition divides the cluster into two groups of nodes which cannot communicate with each other.

Now we have a decision to make.  Will we allow a master to be elected on both sides of the partition?  In that case writes sent to either side will be processed, preserving both read and write availability.  However, we will quickly get into an inconsistent state, where values on one side differ from those on the other.  We have A and P, but not C.
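
Reusing the illustrative MasterNode from above, a split-brain sketch might look like this -- each side elects its own master and keeps accepting writes:

    left = MasterNode(peers=[])    # master elected on the left side
    right = MasterNode(peers=[])   # master elected on the right side

    left.handle_write("x", "left-value")    # a client on the left writes
    right.handle_write("x", "right-value")  # a client on the right writes

    # Both sides stayed available, but consistency is gone.
    print(left.store["x"], right.store["x"])  # left-value right-value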

Another approach would be to say that on the side containing a minority of the nodes, no writes or reads will be accepted.  Since only one side of the partition can contain a majority of the nodes, at most one master can ever be active.  Then our consistency requirements are met, but we have sacrificed some performance.  Also, for clients on the minority side of the partition, we have lost read and write availability. Here we have C and P, but not A.
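
A sketch of the usual way to enforce this, assuming a fixed, known cluster size: a node serves requests only while it can reach a strict majority of the cluster, which guarantees that at most one side of any partition stays active.

    CLUSTER_SIZE = 5

    def has_quorum(reachable_nodes):
        # reachable_nodes counts this node plus the peers it can
        # currently contact; a strict majority is required.
        return reachable_nodes > CLUSTER_SIZE // 2

    print(has_quorum(3))  # True: the majority side keeps serving requests
    print(has_quorum(2))  # False: the minority side rejects requests (loses A)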

Hope this post helps some folks get a better grasp of CAP.  Feedback welcome on how this description could be improved!