In my previous post, I discussed the architecture of Storm and described some Cloudify recipes that I wrote to deploy and manage Storm (and Zookeeper). In this post, I'll leave the Cloudify world and address the combination of Storm and XAP: why it makes sense, and how I did it.
How Storm Is Different From XAP For Streaming
You may wonder why I would use Storm for streaming if I have XAP, which can also stream and has been battle-hardened for years in demanding applications. Storm is aimed squarely at the streaming problem and is optimized for that use case. To achieve extremely high throughput, it pushes responsibility for reliability outside of its own framework. Because of its streaming focus, it also provides higher-level abstractions that make reasoning about streaming easier than in XAP. An illustration may help.
If I wish to "stream" data through XAP, I effectively mean that I want to write data to XAP. XAP is a general-purpose, in-memory data processing and storage platform, and as such it wants your data to be safe. Its architecture is oriented toward making data in memory nearly as reliable as data on disk. Thus, writing into XAP involves some level of serialization and perhaps a network hop as well. Storm doesn't aspire to this level of reliability; instead, it gives the suppliers and consumers of data the means to provide it themselves. Storm is "optimistic" in roughly the same sense that an optimistic lock in a database is optimistic: it assumes success is far more likely than failure, and so is willing to take big performance hits when failures occur because they are so rare. XAP is more pessimistic in this sense. It is designed to be a source of truth for the data it holds, and goes to great lengths to achieve that.
When you write to XAP, you know the data is safe. Because of this, XAP exacts a small performance cost on each read and write. Storm (and to some degree the streaming use case as a whole) doesn't care about persistence. Persistence and reliability can be provided externally, while the streaming framework concerns itself with efficiently processing tuples. In other words, the framework supports various flavors of reliability and accuracy, but doesn't implement them directly. For the reasons cited above, there is no way, even in principle, for XAP to match Storm's throughput, at least when there is no persistence. This caveat is critical, however, since real-world systems almost always need persistence, and ultra-fast in-memory persistence is one of XAP's main strengths.
I also mentioned that Storm has higher-level abstractions for streaming, which make programming streaming applications more straightforward. In XAP you could implement streaming as a series of event-driven processing stages, but there is no concept of a "stream" or any kind of "flow" at the API level.
Spouts and XAPStream
In my previous post, I described the function of spouts in a Storm data flow. Basically, spouts provide the source of tuples for Storm processing. For spouts to be maximally performant and reliable, they need to provide tuples in batches, and be able to replay failed batches when necessary. Of course, in order to have batches, you need storage, and to be able to replay batches, you need reliable storage. XAP is among the highest-performing reliable sources of data out there, so a spout that serves tuples from XAP is a natural combination. Recall that Storm is a stream processing framework and runtime, and this presupposes the existence of a stream for it to read from. So there are really two artifacts needed for XAP to provide a spout to Storm: a "stream" in XAP, and of course the spout that reads from it.
Realizing this, I wrote a simple service for XAP, called XAPStream, that leverages XAP's FIFO capabilities. It is a standalone (Storm-independent) service that lets clients dynamically create and destroy streams, and of course read from and write to them, in both batch and non-batch modes. That sounds more difficult than it was, because XAP provides all the pieces to build it quite easily. To provide a "tuple" abstraction, I simply used the XAP SpaceDocument concept so the model could be flexible. A "stream" is simply a FIFO-ordered sequence of SpaceDocument-based tuples. The source is available on GitHub.
The client API is simple, and adequate to support the needs of a Storm spout.
The basic operations are obvious. To work with a stream, the goal is to get a XAPStream reference, which is then used to read from and write to the stream.
It is this interface that the XAP spout uses to connect to a stream and read the tuples it supplies to Storm.
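The real client interface lives in the XAPStream source. Purely as an illustration of the operations described above (create and destroy streams, read and write in batch and non-batch modes), here is a minimal in-memory sketch. Every name in it (`TupleStreamSketch`, `StreamRegistrySketch`, and their methods) is hypothetical and is not the XAPStream API. An `ArrayDeque` stands in for XAP's FIFO support, and a plain `Map` stands in for a SpaceDocument tuple.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for a FIFO tuple stream; NOT the XAPStream API.
class TupleStreamSketch {
    private final Deque<Map<String, Object>> fifo = new ArrayDeque<>();

    // Non-batch write: append one tuple at the tail.
    synchronized void write(Map<String, Object> tuple) {
        fifo.addLast(tuple);
    }

    // Batch write: append all tuples, preserving their order.
    synchronized void writeBatch(List<Map<String, Object>> tuples) {
        for (Map<String, Object> t : tuples) {
            fifo.addLast(t);
        }
    }

    // Non-batch read: remove and return the oldest tuple, or null if empty.
    synchronized Map<String, Object> read() {
        return fifo.pollFirst();
    }

    // Batch read: remove and return up to max tuples, oldest first.
    synchronized List<Map<String, Object>> readBatch(int max) {
        List<Map<String, Object>> batch = new ArrayList<>();
        while (batch.size() < max && !fifo.isEmpty()) {
            batch.add(fifo.pollFirst());
        }
        return batch;
    }

    synchronized int size() {
        return fifo.size();
    }
}

// Hypothetical registry that creates, looks up, and destroys named streams.
class StreamRegistrySketch {
    private final Map<String, TupleStreamSketch> streams = new HashMap<>();

    synchronized TupleStreamSketch createStream(String name) {
        if (streams.containsKey(name)) {
            throw new IllegalStateException("stream exists: " + name);
        }
        TupleStreamSketch stream = new TupleStreamSketch();
        streams.put(name, stream);
        return stream;
    }

    // Returns null when no stream has that name.
    synchronized TupleStreamSketch getStream(String name) {
        return streams.get(name);
    }

    // Returns true if a stream with that name existed and was removed.
    synchronized boolean destroyStream(String name) {
        return streams.remove(name) != null;
    }
}
```

A client would call `createStream("words")` once, then write with `write` or `writeBatch` and read with `read` or `readBatch`. The sketch drops tuples as soon as they are read, so it says nothing about durability or replay, which is where a reliable store such as XAP earns its keep.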
The XAP Spout
With the XAPStream service in place, the path to a XAP-based spout was clear. Storm has a few flavors of spouts that can be implemented, ranging from a "fire and forget" model all the way to a reliable, processed-once, batched, and partitioned model. The spout I implemented has pretty much the kitchen sink, with the exception of partitioning. This may be added at some point in the future; there is no technical barrier on the XAP side. The source for the spout is on GitHub.
Frankly, there isn't much to it. Since the XAPStream project was written to be compatible with Storm's expectations, all the spout needs to do is invoke the XAPStream API to read or replay batches. The spout implements the ITridentSpout interface, which provides for batching, pipelining, and transactional stream reads.
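To make "read or replay batches" concrete, here is a rough sketch of the bookkeeping involved. It is not the spout's actual code and it uses neither the Storm nor the XAPStream API. The class `ReplayableBatchSourceSketch` and its methods are hypothetical, loosely named after the emit/success style of callbacks a transactional spout receives. The idea is that the first request for a transaction id reads a fresh batch, and any repeated request for the same id returns exactly the same tuples.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of replayable batch reads keyed by transaction id.
// It does NOT use the Storm or XAPStream APIs.
class ReplayableBatchSourceSketch {
    // Tuples not yet handed out, in FIFO order.
    private final Deque<String> pending = new ArrayDeque<>();
    // Batches handed out but not yet acknowledged, keyed by transaction id.
    private final Map<Long, List<String>> inFlight = new HashMap<>();

    synchronized void write(String tuple) {
        pending.addLast(tuple);
    }

    // First call for a txid reads a fresh batch; later calls with the same
    // txid replay exactly the same tuples.
    synchronized List<String> emitBatch(long txid, int maxSize) {
        List<String> batch = inFlight.get(txid);
        if (batch == null) {
            batch = new ArrayList<>();
            while (batch.size() < maxSize && !pending.isEmpty()) {
                batch.add(pending.pollFirst());
            }
            inFlight.put(txid, batch);
        }
        return new ArrayList<>(batch); // copy so callers cannot alter the stored batch
    }

    // Called once the batch is fully processed; after this it cannot be replayed.
    synchronized void success(long txid) {
        inFlight.remove(txid);
    }

    synchronized int inFlightCount() {
        return inFlight.size();
    }
}
```

In this sketch the in-flight batches live in a `HashMap`, so they would be lost if the process died. Keeping them in reliable storage is what makes replay trustworthy, and that is the role XAP plays in the real spout.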
XAP And State Storage
Storm provides a mechanism (Trident State) for persisting arbitrary state during stream processing. Details are available on the Storm wiki site. In a Storm topology that persists state via this mechanism, the overall throughput is almost certainly constrained by the performance of the state persistence. This is where XAP can step in and provide extremely high-performance persistence for stream processing state. The initial XAP Trident state implementation is available on GitHub. This particular implementation supports the now-standard "word count" processing demo.
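As a hedged illustration of why replayable batches and persisted state fit together, the sketch below shows the general transactional-state idea for a word count: store the id of the last applied batch next to each count, and skip a batch that has already been applied. This is not the XAP Trident state implementation. `TxWordCountStateSketch` and its methods are hypothetical, a `HashMap` stands in for XAP, and the sketch assumes non-negative transaction ids with batches applied in id order.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical transactional word-count state; a HashMap stands in for XAP.
// NOT the XAP Trident state implementation.
class TxWordCountStateSketch {
    // Stored value: the running count plus the id of the last batch applied to it.
    private static final class TxCount {
        long count = 0L;
        long lastTxid = -1L; // -1 means no batch applied yet (assumes txids >= 0)
    }

    private final Map<String, TxCount> store = new HashMap<>();

    // Apply the per-word counts from one batch. Replaying a batch with the
    // same txid leaves the state unchanged.
    synchronized void applyBatch(long txid, Map<String, Long> batchCounts) {
        for (Map.Entry<String, Long> e : batchCounts.entrySet()) {
            TxCount current = store.get(e.getKey());
            if (current == null) {
                current = new TxCount();
                store.put(e.getKey(), current);
            }
            if (current.lastTxid == txid) {
                continue; // this batch was already applied to this word
            }
            current.count += e.getValue();
            current.lastTxid = txid;
        }
    }

    // Returns 0 for words that have never been counted.
    synchronized long count(String word) {
        TxCount current = store.get(word);
        return current == null ? 0L : current.count;
    }
}
```

Every batch costs one read and one write per distinct word, which is why the speed of the backing store tends to set the ceiling on topology throughput.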
The integration of XAP with Storm can provide great value both for current users of XAP and for those wishing to improve the performance of Storm, especially when persisting state. In my next post, I'll move back into the Cloudify world to describe the complete set of recipes that implement the full XAP-integrated Storm system, running a word count topology in the cloud.