Caches: an unpopular opinion, explained
I have an unpopular opinion: caches, when used for live data, indicate a general failure. This statement tends to ruffle feathers, because caches are very common, accepted as a standard salve for a very common problem; I’ve been told that cloud vendors say caches are the key to scaling well in the cloud.
They’re wrong and they’re right. In fact, they’re mostly right – as long as some crucial conditions are fulfilled. We’ll get to those conditions, but first we need to get a few details clear.
Caches are fine for many situations.
In data, it’s important to think in terms of read/write ratios. Along that spectrum, you have read-only data (like, oh, a list of the US States), read-mostly data (user profiles), read/write data (live order information), and write-mostly or write-only data (audit transactions or event logging). Obviously someone might read audit logs some day, so the definitions aren’t purely strict, but we’re talking in the scope of a given application’s normal use, so it might be appropriate to think of audit logs as write-only, because the application that writes them might never read them itself.
The read/write status is crucial for determining whether a cache is appropriate. It’s entirely appropriate to cache read-only data, given resources.
In addition, temporary data that’s reused can be cached. Think of, oh, a rendered stylesheet or a blog post: it changes rarely (unless they’re written by me, where they get constantly edited), is requested often (unless it’s written by me, in which case it has one reader), and the render phase is slow in comparison to the delivery of the content (because Markdown, while very fast, isn’t as fast as retrieving already-rendered content.)
Caches for Live Data
The use of a cache on live data, where the data is read-mostly to write-only, is what I find distasteful. There are circumstances which justify it, as I’ve already said, but in general the existence of a cache indicates an opportunity for improvement.
In my opinion, you should plan on spending zero lines of code on caching of live data. That said, let’s point out when that’s not true or possible.
Let’s look at the typical architecture in place for caches:
In this design, you have an application that, on read, checks the cache (a “side cache”) for a given key value; if the key exists in the cache, the cached data is used. If not, the database is read, and the cache is updated with the key value (for future use). When writes occur, the cache is updated at the same time the database is, so that for a given cache node the database and the cache are synchronized.
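That side-cache flow can be sketched in a few lines. This is a minimal illustration only: the two dicts stand in for a real cache node and a real database, and the key names are made up.

```python
# Minimal side-cache ("cache-aside") sketch. The dicts stand in for a real
# cache node and a real system of record; names are illustrative only.
database = {"user:1": {"name": "Alice"}}
cache = {}

def read(key):
    # On read: check the cache first...
    if key in cache:
        return cache[key]
    # ...on a miss, fall through to the database and populate the cache.
    value = database[key]
    cache[key] = value
    return value

def write(key, value):
    # On write: update the database and the cache together, so this node's
    # cache stays synchronized with the system of record.
    database[key] = value
    cache[key] = value

read("user:1")                    # miss: loads from the database, fills the cache
write("user:1", {"name": "Bob"})  # both copies updated together
```

Note that the synchronization in `write` only covers *this* node’s cache – which is exactly the seam the next section pries open.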
The database here is the “System of Record,” the authoritative data source for a given data element. The cache holds a copy of the data element, and shouldn’t be considered “correct” unless it agrees with the system of record.
You can probably see a quick issue (and one that’s addressed by many caching systems): distribution. If you have many client systems, you have many caches, and therefore many copies, each considered accurate as long as they agree with the system of record. If one system updates the system of record, the other cached copies are now wrong until they are synchronized.
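A tiny sketch makes the staleness concrete. Here two application nodes each keep a private side cache over one shared database (all names illustrative):

```python
# Two application nodes, each with a private side cache over one shared
# system of record. Names are illustrative.
database = {"price": 100}
cache_a, cache_b = {}, {}

def read(cache, key):
    if key not in cache:
        cache[key] = database[key]
    return cache[key]

def write(cache, key, value):
    # The writer updates the database and *its own* cache only.
    database[key] = value
    cache[key] = value

read(cache_a, "price")        # node A caches 100
read(cache_b, "price")        # node B caches 100
write(cache_a, "price", 120)  # node A updates the system of record

# Node B's copy no longer agrees with the system of record:
read(cache_b, "price")        # still returns the stale 100
```

Node B considers its copy accurate, and by its own cache-aside logic it has no reason to re-read the database.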
Depending on the nature of the data, maintaining accurate copies could require checking the database even before the cached copy is used. Cache performance in this situation gets flushed down the tubes, because a cache only provides a real benefit when fetching the data itself costs meaningfully more than the validation check does. (A data item that’s 70 kb, for example, is probably going to take a few milliseconds to transfer – longer than checking a timestamp would – and therefore you’d still see a benefit even while checking timestamps.)
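That validate-before-use pattern looks something like this – a minimal sketch, with a version number standing in for a timestamp and illustrative names throughout:

```python
# Validate-before-use: keep the (large) payload cached, but confirm a cheap
# version stamp against the database before trusting it. Illustrative names;
# the version field stands in for a timestamp.
database = {"report": {"version": 1, "payload": "x" * 70_000}}  # ~70 kb item
cache = {}

def read_validated(key):
    current = database[key]["version"]   # cheap: fetch only the version stamp
    entry = cache.get(key)
    if entry is not None and entry["version"] == current:
        return entry["payload"]          # skip the large transfer entirely
    cache[key] = dict(database[key])     # expensive: fetch the whole item
    return cache[key]["payload"]

read_validated("report")  # first call: caches the payload and its version
```

The cache still pays off here only because the payload transfer dwarfs the version check; for small items the check erases the benefit.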
Some caching systems (most of them, really) provide a distributed cache, so that a write to one cache node is reflected in the other caches, too. This avoids the whole “out of sync” issue, but introduces some other concerns… and distribution is something you should insist on before even considering a cache.
If you’re going to use a cache, it should be distributed. You should look for a peer-to-peer distribution model, and transaction support; if your cache doesn’t have these two features, you should look for another cache. (Or you could just use GigaSpaces XAP, which does this and more; read on for an explanation.)
So what should you really do?
To me, the problem lies in the determination of the System of Record. A cached value is a copy; I don’t think it’s normally necessary to have that copy, and it’s actually fairly dangerous. So what’s the alternative?
Why not use a System of Record that’s as fast as a cache? If the data happens to be cached, you don’t care beyond reaping the benefits; your data architecture gets much simpler (no more code for caches). Your application speeds up (dramatically, actually, because data access time is in microseconds rather than milliseconds… or seconds). Your transactions collide less because the transaction times go down so much. Everybody wins.
The term for this kind of thing is “data grid.” Most data grids are termed “in-memory data grids,” which sounds scary for a System of Record, but there are easy ways to add reliability. Let’s work out how that happens for GigaSpaces, because that’s who I work for.
In a typical environment, you’d have a group of nodes participating in a given cluster. These nodes have one of three roles: primary, backup, and mirror (with the mirror being a single node). A backup node is assigned to one and only one primary; a primary can have multiple backups. The primaries are peers, using a mesh-style network topology and communicating directly with each other.
Let’s talk about reads first, because writes deserve more examination than reads do. A client application has a direct connection to the primaries; depending on the nature of the queries, requests for a given data element are either distributed across all primaries (much like a map/reduce operation) or routed directly to a calculated primary (i.e., the client knows where a given piece of data should be, and doesn’t spam the network).
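The “calculated primary” part is just deterministic routing: hash the routing field of the key, take it modulo the partition count, and you know which primary owns it. A simplified sketch, with made-up names (real grids use their own stable routing hash; this uses CRC32 so the result is reproducible across runs):

```python
# Content-based routing sketch: hash the routing key, modulo the partition
# count, to find the one primary that owns a given data element.
# Names and partition count are illustrative.
import zlib

PARTITIONS = 4  # number of primaries in the cluster

def partition_for(routing_key: str) -> int:
    # zlib.crc32 is deterministic across processes, unlike Python's hash().
    return zlib.crc32(routing_key.encode()) % PARTITIONS

# The client computes the owning partition and sends the request only there,
# instead of broadcasting the query to every primary.
partition_for("order:12345")
```

Every client computes the same answer for the same key, so no coordination is needed to find the data.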
Now, before we jump into writes, let’s consider my originating premise, stated too simply: caches are bad, because they exist to compensate for slow data retrieval. In this situation, the reads themselves are very, very fast because you’re talking to an in-memory data grid, not a filesystem at all, but you still have to factor in network travel time, right?
No, not always. XAP is an application platform, not just a data grid. The cluster nodes we’ve been talking about can hold your application, not just your application’s data – you can co-locate your business logic (and presentation logic) right alongside your data. If you partition your data carefully, you might not talk to the network at all in the process of a transaction.
Co-located data and business logic comprise the ideal architecture; in this case, you have in-memory access speeds (just like a cache) with far more powerful query capabilities. And with that, we jump to data updates, because that’s the next logical consideration: what happens when we update our data grid that’s as fast as a cache? It’s in memory, so it’s dangerous, right?
No. Being in-memory doesn’t imply unreliability, because of the primary/backup/mirror roles, and synchronization between them. When an update is made to data in a primary, it immediately copies the updated data (and only the updated data) to its backups. This is normally a synchronous operation (so the write operation doesn’t conclude until the primary has replicated the update to its backups).
If a primary should crash (because someone unplugged a server, perhaps, or perhaps someone cut a network cable), a backup is immediately promoted, a new backup is allocated as a replacement for the promoted backup, and the process resumes.
The mirror is a sink; updates are replicated to it, too, but asynchronously. (If the mirror has an issue, mirror updates will queue up so when the mirror resumes function all of the mirror updates occur in sequence.)
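The write path described in the last few paragraphs (synchronous to the backup, queued and asynchronous to the mirror) can be sketched like this. It’s a toy model with illustrative names, not the actual replication machinery:

```python
from collections import deque

# Sketch of the write path: synchronous replication to the backup, queued
# asynchronous replication to the mirror. Dicts stand in for nodes.
primary, backup, mirror = {}, {}, {}
mirror_queue = deque()

def write(key, value):
    primary[key] = value
    # Synchronous: the write doesn't conclude until the backup has it.
    backup[key] = value
    # Asynchronous: the mirror update is queued and applied later, in order.
    mirror_queue.append((key, value))

def drain_mirror():
    # When the mirror is reachable, queued updates are applied in sequence,
    # so a mirror outage delays but never loses or reorders updates.
    while mirror_queue:
        key, value = mirror_queue.popleft()
        mirror[key] = value

write("order:1", "placed")
write("order:1", "shipped")   # mirror still lags behind at this point
drain_mirror()                # now the mirror catches up, in order
```

The important property is that the primary and backup are never allowed to disagree, while the mirror is permitted to lag.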
In this configuration – again, this is typical – the database becomes fairly static storage. Nothing writes directly to it, because it’s secondary; the actual system of record is the data grid. The secondary storage is appropriate for read-only operations – reports, for example.
Does this mean that the data grid is a “cache with primacy”? No. The data grid in this configuration is not a cache; it’s where the data “lives,” the database is a copy of the data grid and not vice versa.
Does it mean we have to use a special API to get to our data? No. We understand that different data access APIs have different strengths, and different users have different requirements. As a result, we have many APIs to reach your data: a native API, multiple key/value APIs (a map abstraction as well as memcached), the Java Persistence API, JDBC, and more.
Does it mean we still might want a cache for read-only data? Again, no. The data grid is able to provide cache features for you; we actually refer to the above configuration as an “inline cache,” despite its role as a system of record (which makes it not a cache at all, in my opinion). But should you need to hold temporary or rendered data, it’s entirely appropriate to use the data grid as a pure cache, because you can tell the data grid what should and should not be mirrored or backed up.
The caching scenarios documentation on our wiki actually points out cases where a data grid can help you even if it’s not the system of record, too. For example, we’re a powerful side cache when you don’t have enough physical resources (in terms of RAM, for example) to hold your entire active data set.
Side Cache on Steroids
One of the things that I don’t like about side caches is that they typically infect your code. You know the drill: you write a service that queries the cache for a key, and if it isn’t found, you run to the database and load the cache.
This drives me crazy. It’s ugly, unreliable, and forces the cache in your face when it’s not really supposed to be there.
What XAP does is really neat: it can hold a subset of data in the data grid (just like a standard cache would), and it can also load data from secondary storage on demand for you with the external data source API. This means you get all the benefits of a side cache, with none of the code; you write your code as if XAP were the system of record even if it’s not.
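The read-through idea – the store invoking a loader on a miss so the caller never branches on cache state – can be sketched generically. This is an illustration of the pattern, not the XAP external data source API itself; all names are made up:

```python
# Read-through sketch: on a miss, the store invokes a loader itself, so the
# calling code reads as if the store were the system of record.
# Illustrative names; not a real grid API.
class ReadThroughStore:
    def __init__(self, loader):
        self._data = {}
        self._loader = loader  # fetches from secondary storage on a miss

    def read(self, key):
        if key not in self._data:
            # The miss is handled here, inside the store, not by the caller.
            self._data[key] = self._loader(key)
        return self._data[key]

secondary_storage = {"user:1": "Alice"}
store = ReadThroughStore(lambda key: secondary_storage[key])

# Application code: one call, no cache-miss branching in sight.
store.read("user:1")
```

Compare this `read` call site with the side-cache service described above: the miss-handling logic has moved out of your code and into the store.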
So do I REALLY hate caches?
If the question is whether I truly dislike caches or not, the answer would have to be a qualified “no.” Caches really are acceptable in a wide range of circumstances. What I don’t like is the use of caches in a way that exposes their weaknesses and flaws to programmers, forcing them to hand-manage the nuances of a tool that ought to stay out of sight.
XAP provides features out of the box that protect programmers from having to compensate for the cache. As a result, it provides an ideal environment in which caches are available in multiple ways, configurable by the user to fit specific requirements and service levels.
A final note
You know, I say things like “caches are evil,” because it attracts attention and helps drive discussion in which I can qualify the statement. As I said above, I don’t actually think that – and there are lots of situations in which people have to adapt to local conditions regardless of the “right way” to do things.
Plus, being honest, there’s more than one way to skin a cat, so to speak. My “right way” is very strong, but your “right way” works, too.
And that’s the bottom line, really. Does it work? If yes, then the solution is right by definition, bloody hack or not. Pragmatism wins. What I’ve been discussing is a shorter path to a solution for a problem people run into over and over again, when they really shouldn’t. I think it’s the shortest path. (I’ve found no shorter, and I have looked.) It’s not the only path.