In the past few months I was involved in many of the NoSQL discussions. I must admit that I really enjoyed those discussions, as it felt that we finally started to break away from the “one size fits all” dogma and look at data management solutions in a more pragmatic manner. That in itself sparks lots of interesting and innovative ideas that can revolutionize the entire database market, such as the introduction of the document model, map-reduce and the new query semantics that come with it. As with any new movement, we seem to be going through the classic hype cycle, and right now it seems to me that we're getting close to the peak of that hype. One of the challenges I see when a technology reaches the peak of its hype is that people stop questioning the reason for doing things and jump on a new technology just because X did it. NoSQL is no different in that regard.
In this post I wanted to spend some time on the CAP theorem and clarify some of the confusion that I often see when people associate CAP with scalability without fully understanding the implications that come with it and the alternative approaches.
I chose to name this post NoCAP specifically to illustrate the idea that you can achieve scalability without compromising on consistency, at least not to the degree that many of the disk-based NoSQL implementations impose.

Recap on CAP

Quoting the definition on Wikipedia:

The CAP theorem, also known as Brewer's theorem, states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:[1][2]

  • Consistency (all nodes see the same data at the same time)
  • Availability (node failures do not prevent survivors from continuing to operate)
  • Partition Tolerance (the system continues to operate despite arbitrary message loss)


Many of the disk-based NoSQL implementations originated from the need to deal with write scalability. This was largely due to the changes in traffic behavior brought by social networking, in which most of the content is generated by the users and not by the site owner.

In a traditional database approach, achieving data consistency requires synchronous writes to disk and distributed transactions (known as the ACID properties).

It was clear that the demand for write scalability would conflict with the traditional approaches for achieving consistency (synchronous writes to a central disk and distributed transactions).

The solution to that was to:

  1. Break the centralized disk access through partitioning of the data into distributed nodes.
  2. Achieve high availability through redundancy (replication of the data into multiple nodes).
  3. Use asynchronous replication to reduce the write latency.
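As a rough sketch of point 1, independent of any particular product (the node names and function are my own illustration), partitioning can be as simple as hashing a key to pick an owning node:

```python
import hashlib

# Hypothetical node names -- any stable identifiers would do.
NODES = ["node-a", "node-b", "node-c", "node-d"]

def partition_for(key: str, nodes=NODES) -> str:
    """Point 1 above: spread writes across nodes by hashing the key,
    instead of funneling everything through one central disk."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]
```

Every key deterministically maps to one primary node, so the write load spreads roughly evenly; real systems typically layer consistent hashing on top so that adding a node doesn't remap most of the keys.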

The assumption behind point 3 above is going to be the focus of this post.


Graphical representation of the CAP theorem. (Source)


The Consistency Challenge

One of the common assumptions behind many of the NoSQL implementations is that to achieve write scalability, we need to push as many operations as possible on the write path to a background process, so that we can minimize the time in which a user transaction is blocked on a write.

The implication is that with asynchronous writes we lose consistency between write and read operations, i.e. a read operation can return an older version of the data than the latest write.

Different algorithms were developed to address this type of consistency challenge, often referred to as Eventual Consistency.

For those interested in more information in that regard, I would recommend looking at Jeremiah Peschka's post Consistency models in nonrelational dbs. Jeremiah provides a good (and short!) summary of the CAP theorem, the Eventual Consistency model and other common principles that come with it, such as BASE (Basically Available, Soft-state, Eventually consistent), NRW and vector clocks.
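To make the vector clock idea concrete, here is a minimal sketch (my own illustration, not tied to any particular implementation): each replica keeps a per-node counter, and comparing two clocks tells us whether one version causally descends from the other or whether the two are concurrent, i.e. a conflict that must be reconciled.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a new clock with `node`'s counter bumped by one."""
    c = dict(clock)
    c[node] = c.get(node, 0) + 1
    return c

def descends(a: dict, b: dict) -> bool:
    """True if version `a` is causally at or after version `b`."""
    return all(a.get(n, 0) >= t for n, t in b.items())

def concurrent(a: dict, b: dict) -> bool:
    """Neither version descends from the other: a write conflict."""
    return not descends(a, b) and not descends(b, a)
```

If two replicas both extend the same version independently, neither clock descends from the other; detecting and resolving exactly that situation is the price eventual-consistency systems pay for asynchronous writes.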

Do we really need Eventual Consistency to achieve write scalability?

Before I dive into this topic I wanted to start with a quick introduction to the term “scalability”, which is often used interchangeably with throughput. Quoting Steve Haines:

The terms “performance” and “scalability” are commonly used interchangeably, but the two are distinct: performance measures the speed with which a single request can be executed, while scalability measures the ability of a request to maintain its performance under increasing load

(See previous post on that regard:  The true meaning of linear scalability)

In our specific case that means that write scalability can be delivered primarily through points 1 and 2 above (1 - break the centralized disk access through partitioning of the data into distributed nodes; 2 - achieve high availability through redundancy and replication of the data into multiple nodes), whereas point 3 (use asynchronous replication to those replicas to avoid the replication overhead on write) is mostly related to write throughput and latency, not scalability. Which brings me to the point behind this post:

Eventual consistency has little or no direct impact on write scalability.

To be more specific, my argument is that it is quite often enough to break our data model into partitions (a.k.a. shards) and break out of the centralized disk model to achieve write scalability. In many cases we may find that we can achieve sufficient throughput and latency just by doing that.

We should consider the use of asynchronous write algorithms to optimize write performance and latency, but due to the inherent complexity that comes with them we should turn to them only after we have tried simpler alternatives such as database shards, flash disks or memory-based devices.

Achieving write throughput without compromising consistency or scalability

The diagram below illustrates one example of how we could achieve write scalability and throughput without compromising on consistency.


As with the previous examples, we break our data into partitions to spread our write scaling between nodes. To achieve high throughput we use in-memory storage instead of disk. As in-memory devices tend to be significantly faster and more concurrent than disk, and since network speed is no longer a bottleneck, we can achieve high throughput and low latency even when we use synchronous writes to the replica.

The only place in which we’ll use asynchronous writes is the write to the long-term storage (disk). As user transactions don’t access the long-term storage directly through the read or write path, they are not exposed to the potential inconsistency between the memory storage and the long-term storage. The long-term storage can be any of the disk-based alternatives, from a standard SQL database to any of the existing disk-based NoSQL engines.
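A minimal sketch of this flow, with illustrative names rather than any real product API: the write path updates the primary and its replica synchronously, while a background thread drains writes to the long-term store asynchronously.

```python
import queue
import threading

class InMemoryPartition:
    """Illustrative sketch: synchronous in-memory replication on the
    write path, asynchronous write-behind to long-term storage."""

    def __init__(self, replica, long_term_store):
        self.data = {}                 # primary in-memory copy
        self.replica = replica         # replica node, modeled as a dict
        self.store = long_term_store   # disk-based store, modeled as a dict
        self.backlog = queue.Queue()   # async path to the long-term store
        threading.Thread(target=self._drain, daemon=True).start()

    def write(self, key, value):
        self.data[key] = value      # primary copy
        self.replica[key] = value   # synchronous replication: write()
                                    # returns only after the replica has it
        self.backlog.put((key, value))  # disk write happens in background

    def read(self, key):
        # Reads never touch the long-term store, so clients can never
        # observe the lag between memory and disk.
        return self.data[key]

    def _drain(self):
        while True:
            key, value = self.backlog.get()
            self.store[key] = value  # asynchronous write-behind
            self.backlog.task_done()
```

The key property is in `read()`: since the read/write path is served entirely from memory, the asynchronous leg to disk is invisible to user transactions, which is why no eventual-consistency machinery is needed on the client-facing path.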

The other benefit of this approach is that it is significantly simpler: simpler not just in terms of development but also simpler to maintain compared with the Eventual Consistency alternatives. In distributed systems, simplicity often correlates with reliability and deterministic behavior.


Final words

It is important to note that in this post I was referring mostly to the C in CAP and not to CAP in its broad definition. My point was not to say don't use solutions that are based on the CAP/Eventual Consistency model, but rather to say don't jump on Eventual Consistency based solutions before you have considered the implications and the alternative approaches. There are potentially simpler approaches to deal with write scalability, such as using database shards or in-memory data grids.

As we reach the age of tera-scale devices such as Cisco UCS, where we can achieve huge capacity of memory, network and compute power in a single box, the areas in which we can consider putting our entire data in memory get significantly broader, as we can easily store terabytes of data in just a few boxes. The case of Foursquare's MongoDB outage is interesting in that regard. 10gen's CEO Dwight Merriman argued that the entire data set actually needs to be served completely in memory:

For various reasons the entire DB is accessed frequently so the working set is basically its entire size. Because of this, the memory requirements for this database were the total size of data in the database. If the database size exceeded the RAM on the machine, the machine would thrash, generating more I/O requests than the four disks could service.

It is a common misconception to think that putting part of the data in an LRU-based cache on top of a disk-based storage could yield better performance, as noted in the Stanford research paper The Case for RAMClouds:

..even a 1% miss ratio for a DRAM cache costs a factor of 10x in performance. A caching approach makes the deceptive suggestion that “a few cache misses are OK” and lures programmers into configurations where system performance is poor..

In that case, using a pure in-memory data grid as a front end and disk-based storage as long-term storage could potentially work better, with significantly lower maintenance overhead and higher determinism. The capacity of data in this specific case (<100GB) shouldn't be hard to fit into a single UCS box or a few EC2 boxes.
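The RAMClouds point is easy to verify with back-of-the-envelope arithmetic; the latency figures below are illustrative round numbers (DRAM roughly three orders of magnitude faster than disk), not measurements:

```python
RAM_US = 5        # assumed DRAM access time, microseconds
DISK_US = 5000    # assumed disk access time, microseconds

def avg_latency(miss_ratio: float) -> float:
    """Average access time for a DRAM cache in front of disk."""
    return (1 - miss_ratio) * RAM_US + miss_ratio * DISK_US

# All-in-memory: 5us per access. A "mere" 1% miss ratio:
# 0.99 * 5 + 0.01 * 5000 = ~55us, i.e. roughly 10x slower.
slowdown = avg_latency(0.01) / avg_latency(0.0)
```

The misses dominate the average, which is why a cache that holds "most" of the data behaves nothing like a store that holds all of it.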



  • I’m sorry but I think you’re missing the point. Your synchronous communication can still fail because of partitioning. Disk/memory doesn’t influence the CAP theorem; it’s just a matter of speed-up.

  • Quoting from a previous response in that regard:

    How partition failure works:

    1) Network partition between primary and replica node: if a replica node fails, the partition that is still available continues to serve user requests and logs all the transactions into a log buffer till the communication is restored. The failed node restores its state upon startup from the current state of the available node and by replaying the log (to fill in the gap since it started its recovery process).

    2) If a primary node fails, users are routed to an alternate node (one of the replicas) that becomes the new primary. The recovery process for the failed primary is exactly the same as I mentioned above, as it basically becomes the new backup when it comes back alive.

    In this model the asynchronous replication part happens only during the recovery process and not on a continuous basis, and thus you don’t have to compromise between consistency and partition tolerance at the level that you would otherwise.
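    A toy sketch of that recovery flow (the class and field names are illustrative, not GigaSpaces APIs): while its peer is down, the surviving node buffers writes in a log; the recovering node copies the current state and replays the buffered log to fill any remaining gap.

```python
class SurvivingNode:
    """The node that stays available during a network partition."""
    def __init__(self):
        self.state = {}
        self.recovery_log = []   # buffers writes while the peer is down
        self.peer_up = True

    def write(self, key, value):
        self.state[key] = value
        if not self.peer_up:
            self.recovery_log.append((key, value))

class RecoveringNode:
    """The node that failed and is coming back up."""
    def __init__(self):
        self.state = {}

    def recover_from(self, peer):
        self.state = dict(peer.state)         # copy the current state
        for key, value in peer.recovery_log:  # replay buffered writes to
            self.state[key] = value           # fill the remaining gap
        peer.recovery_log.clear()
        peer.peer_up = True                   # resume synchronous replication
```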

    There are even more efficient ways to deal with large failures than to keep many redundant nodes active all the time and pay the overhead in bandwidth, capacity and complexity (a replication factor of 3 means that for every GB of data we have 2GB of redundant information; with large data systems that can turn into huge numbers and cost).

    A better way IMO is to add more backups on demand, i.e. only when one of the nodes fails. So you are guaranteed to have two nodes always available (with the exception of losing the two nodes at the SAME time; statistically I would argue that it has the same chance as losing two data centers at the same time, as in most cases the two nodes would live in separate data centers).

    In GigaSpaces there is also an option to mix synchronous and asynchronous replicas under the same group. This is in fact how we store data into long-term storage, so you have more options to control the tradeoff between throughput and redundancy in that regard.

    Over the years of dealing with those tradeoffs we found that even in the most mission-critical systems that you can think of, customers chose the backup-on-demand option rather than having more backups, even though they could.