Hadoop is a popular system for batch processing massive amounts of data on potentially vast farms of commodity hardware. However, Hadoop deployments are rarely standalone. Hadoop is one component of a big data stack that also includes systems for delivering massive data streams and for delivering calculated results interactively. As organizations become accustomed to using Hadoop, they often long for the ability to shrink or eliminate the “batch window” and access at least some results immediately. Nathan Marz blogged (and later wrote a book) about an architecture, called the “Lambda Architecture”, for systems that provide high data throughput and velocity along with both batch and real-time results. The Lambda Architecture is a useful way of thinking about big data systems. This post discusses how in-memory computing (IMC) fits as a component in this type of architecture.
The Big Data stack and the Lambda Architecture
The Lambda Architecture (https://www.manning.com/books/big-data) describes a design for big (and high-velocity) data systems. It decomposes such a system into three layers of functionality:
1. Batch layer. Performs batch computations to create views and aggregations for later consumption by users or other systems. In the context of this post, the batch layer is Hadoop, the dominant batch processing engine for very large unstructured datasets. Map/reduce processing creates outputs that are sent to the serving layer for consumption.
2. Speed layer. Provides real-time processing to compensate for the delays in the batch layer. This is an area where in-memory processing plays a big role. The same data destined for Hadoop and eventual batch processing is also sent here for real-time aggregation, calculation, enrichment, and filtering. Results are sent to the serving layer for immediate consumption.
3. Serving layer. Presents results from the batch and speed layers to end users. It may simply provide access to precomputed results, or it may provide ad-hoc querying capability. This is another area where IMC can play a role by providing caching, high-performance distributed search, and additional processing.
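To make the division of labor concrete, here is a minimal, single-process sketch of the three layers in Python. It is only an illustration under stated assumptions: the page-view event format, the `cutoff` timestamp, and the names `batch_view`, `SpeedView`, and `serve` are all hypothetical, and a plain `Counter` stands in for what Hadoop or a streaming engine would compute at scale.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical event format: (timestamp, page) pairs representing page views.
Event = Tuple[int, str]


def batch_view(master_dataset: List[Event], cutoff: int) -> Counter:
    """Batch layer: recompute counts from scratch over all events before the cutoff."""
    return Counter(page for ts, page in master_dataset if ts < cutoff)


class SpeedView:
    """Speed layer: incrementally count events the batch run has not covered yet."""

    def __init__(self, cutoff: int) -> None:
        self.cutoff = cutoff
        self.counts: Counter = Counter()

    def on_event(self, event: Event) -> None:
        ts, page = event
        if ts >= self.cutoff:  # older events are already in the batch view
            self.counts[page] += 1


def serve(batch: Counter, speed: SpeedView, page: str) -> int:
    """Serving layer: merge the precomputed batch view with the real-time view."""
    return batch[page] + speed.counts[page]


# The same events feed both the batch layer and the speed layer.
events = [(1, "home"), (2, "docs"), (3, "home"), (4, "home"), (5, "docs")]
cutoff = 4  # assumed time of the last completed batch run

batch = batch_view(events, cutoff)
speed = SpeedView(cutoff)
for event in events:
    speed.on_event(event)

print(serve(batch, speed, "home"))  # batch count plus real-time count
```

The design point is that neither view alone is complete: the batch view is accurate but stale, the speed view is fresh but partial, and the serving layer combines the two at query time.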
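An example implementation might combine Kafka for message delivery, Hadoop for the batch layer, Storm for stream processing, and Druid for storing results for consumption.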
There is no “one size fits all” for Lambda Architecture implementations. Technology decisions will be guided by the speed and volume of incoming data, the need (or lack thereof) for high availability in the different layers, retention times, consumption and ingestion patterns, and many other factors. “Kafka” could be replaced by your favorite message queue, “Storm” by “Spark”, and “Druid” by “Cassandra”. If data velocity and volume are significantly smaller than the extremes, the technology choices are even more extensive.
IMC And Big Data
In-memory computing (IMC) is typically introduced into an architecture to overcome bottlenecks and/or to provide very low latency and high throughput. Storm is an example of a technology focused on stream processing. Whereas the example above stored results in Druid for consumption, Storm itself could serve real-time views from in-memory computed results, and/or store results in a separately queryable in-memory database or grid (e.g. Memcached).
Depending on system requirements, this architecture can be simplified further by replacing the purpose-built streaming component (Storm) with an IMC store that supports streaming, storage, and computation.
Depending on the data volumes and velocity, a further simplification can be made by having the IMC store handle message queuing as well. This reduces the number of moving parts and scope of expertise required to manage and program such a system.
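As a rough illustration of that consolidation, the sketch below folds message queuing, stream computation, and queryable storage into one in-memory object. It is a toy, single-process stand-in written for this post: `InMemoryStore` and its methods are hypothetical names, not the API of XAP or any other product, and a real IMC store would add partitioning, replication, and persistence.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Tuple

Message = Tuple[str, float]  # hypothetical (key, value) message format


class InMemoryStore:
    """Toy stand-in for an IMC store that queues, processes, and serves data."""

    def __init__(self) -> None:
        self.queue: Deque[Message] = deque()                   # message queuing
        self.raw: List[Message] = []                           # stored events
        self.totals: Dict[str, float] = defaultdict(float)     # real-time view

    def publish(self, key: str, value: float) -> None:
        """Ingest: producers append messages to the in-memory queue."""
        self.queue.append((key, value))

    def process_pending(self) -> int:
        """Speed layer: drain the queue, store each event, update the aggregate."""
        processed = 0
        while self.queue:
            key, value = self.queue.popleft()
            self.raw.append((key, value))
            self.totals[key] += value
            processed += 1
        return processed

    def query(self, key: str) -> float:
        """Serving layer: read the current aggregate straight from memory."""
        return self.totals.get(key, 0.0)


store = InMemoryStore()
store.publish("sensor-a", 1.5)
store.publish("sensor-b", 2.0)
store.publish("sensor-a", 0.5)
processed = store.process_pending()
print(processed, store.query("sensor-a"))
```

With everything in one component there is a single thing to deploy and program against, which is the reduction in moving parts described above. The trade-off is that this one component must then scale with both the ingest rate and the query load.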
As Hadoop usage has matured, the desire to close the “batch window” with real-time views has increased. IMC technology has a critical part to play in both the speed layer and the serving layer. Several IMC solutions support the scenarios above. One of them is GigaSpaces XAP (I work for GigaSpaces). XAP is an in-memory computing platform for real-time transaction and event processing. It enables real-time big data processing and real-time insights into masses of data. Read more about XAP or come visit GigaSpaces at booth #451 at the Strata + Hadoop World conference.