
Mirroring GigaSpaces XAP to NoSQL

Ah, caches[0]. Caches are easy to use, but not always easy to load. What typically happens is that you modify your reads and writes to check a cache first: a read uses data from the cache if present and loads the cache if not, while a write stores data in the cache as well as in the data store.
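The read and write paths just described can be sketched in a few lines of Java. This is a minimal sketch, not anything from XAP: the CacheAside class and its method names are made up for illustration, and a plain Map stands in for the real data store.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical cache-aside wrapper; the names are illustrative, not from any library.
final class CacheAside<K, V> {
    private final Map<K, V> cache = new HashMap<>();
    private final Map<K, V> dataStore; // stand-in for the real data store

    CacheAside(Map<K, V> dataStore) {
        this.dataStore = dataStore;
    }

    // Read: use the cached value if present; otherwise load from the store and cache it.
    V read(K key) {
        V cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        V loaded = dataStore.get(key);
        if (loaded != null) {
            cache.put(key, loaded);
        }
        return loaded;
    }

    // Write: update the data store and the cache together.
    void write(K key, V value) {
        dataStore.put(key, value);
        cache.put(key, value);
    }

    boolean isCached(K key) {
        return cache.containsKey(key);
    }
}
```

Every caller has to go through read() and write() for this to stay consistent, which is exactly the "modify your reads and writes" burden.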

GigaSpaces XAP is not a cache, however. It’s a data grid; data grids by their nature can be seen as caches, but XAP adds features that a cache would find problematic.

For example: XAP is not a key/value store! You can query the data grid for any data you like with the support of indexed data (as you configure it); you can even use JDBC or a SQL-like language to query data. Also, XAP supports both write-through and write-behind synchronization to external datastores[1], so you can write a piece of data into the data grid, and have it mirrored transparently (and asynchronously, if desired) to a backend datastore.

The backend datastore has traditionally been a relational database, as the relational database is the de facto standard for data repositories today. The default mirroring mechanism in XAP reflects this reality, and writes to a relational database.

However, in today’s environments, some applications have exceeded the optimal capabilities of the relational database. That doesn’t mean a relational database won’t work for a given problem; it only means the relational database won’t work optimally. For these applications, enter NoSQL.

NoSQL is an unfortunate name, honestly. It sounds like “No SQL,” denying the presence of SQL altogether, when what it really means is “not necessarily a relational model,” but “NNARM” doesn’t roll off the tongue like “NoSQL” does[2]. Nati Shalom has described “NoSQL” as “not only SQL,” which is far more appropriate, but I don’t think it’s caught on; too many hotheads have latched onto the more incendiary meaning.

Anyway, NoSQL is a broad term referring to almost any datastore that is non-relational. It can mean schemaless data models (and usually does); it can mean hierarchical models; it can also mean strict key/value stores. It encompasses all of these and more; it’s easy (but wrong) to assume that one definition applies to all NoSQL products.

XAP is a NoSQL product, literally speaking, even though it provides APIs that are definitely SQL – JDBC and JPA, as well as a SQL-like query language. Because XAP supports POJOs, it can provide a definite schema, even while providing schemaless document models (including converting between the two, which is a neat trick).

However, back to our regularly scheduled programming, eh?

XAP can, as stated, mirror data to a traditional relational model. However, because it exposes the APIs it uses to implement that mirroring process, it provides the ability to synchronize data to any data source – including NoSQL data sources.

The mirror-parent project, part of our best practices project on GitHub, contains two external data sources, one for MongoDB and one for Cassandra. (It’s called “mirror-parent” because that project name makes more sense for Maven.) The data sources implement two different pieces of functionality that users will find relevant, with a few caveats.

First, let’s describe what the datasources do, before describing potential future enhancements.

A data grid can define a reference to an external synchronization point (a write-only mechanism) which receives all changes made to the data grid (meaning writes, updates, and deletions). These changes are then mirrored as a bulk update to the external data store, which then executes them.
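To make the bulk idea concrete, here is a rough model of the grid side of write-behind mirroring. It is a sketch under loud assumptions: MirrorSketch, Op, Change, and flushTo are all invented for this post, the "grid" and the "external store" are plain maps, and none of this is the XAP API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of write-behind mirroring; not the XAP API.
final class MirrorSketch {
    enum Op { WRITE, UPDATE, REMOVE }

    // One change made in the grid, queued for the external store.
    static final class Change {
        final Op op;
        final String key;
        final String value;

        Change(Op op, String key, String value) {
            this.op = op;
            this.key = key;
            this.value = value;
        }
    }

    private final Map<String, String> grid = new HashMap<>();
    private final List<Change> pending = new ArrayList<>();

    // Apply the change to the grid first, then queue it for mirroring.
    void write(String key, String value) {
        Op op = grid.containsKey(key) ? Op.UPDATE : Op.WRITE;
        grid.put(key, value);
        pending.add(new Change(op, key, value));
    }

    // Only removals that actually deleted something are queued.
    void remove(String key) {
        if (grid.remove(key) != null) {
            pending.add(new Change(Op.REMOVE, key, null));
        }
    }

    // Hand every pending change to the external store as one bulk, in order.
    void flushTo(Map<String, String> externalStore) {
        for (Change c : pending) {
            if (c.op == Op.REMOVE) {
                externalStore.remove(c.key);
            } else {
                externalStore.put(c.key, c.value);
            }
        }
        pending.clear();
    }

    int pendingCount() {
        return pending.size();
    }
}
```

The client only ever talks to the grid; the external store catches up whenever the queue is flushed, which is what makes the asynchronous mode possible.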

A data grid can also declare an initial load data source (which would normally mirror the external synchronization point, one would imagine). When the data grid is started up, this data source is asked for all of its data, which then gets loaded into the data grid.[3]

Therefore, an external data source has to implement reads (from the external data to the data grid, via the initial load) and writes (for changes mirrored from the data grid).

There are two interfaces that provide this capability: BulkDataPersister, and ManagedDataSource. The BulkDataPersister interface controls the updates (facilitated through the executeBulk(List<BulkItem>) method); the primary entry point for the ManagedDataSource is the initialLoad() method, which returns a DataIterator that provides an iteration through all of the data to be loaded.
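The shape of that contract is easier to see with a toy implementation. The interfaces below are deliberately simplified stand-ins that I made up for this sketch (BulkWriterSketch, InitialLoaderSketch, ItemSketch); they only borrow the executeBulk and initialLoad method names from the text, use a plain Iterator instead of DataIterator, and should not be mistaken for the real XAP types or signatures.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-ins for the two roles described above; NOT the XAP interfaces.
interface BulkWriterSketch {
    void executeBulk(List<ItemSketch> items);
}

interface InitialLoaderSketch {
    Iterator<Map.Entry<String, String>> initialLoad();
}

// One mirrored operation; remove == true means "delete this key".
final class ItemSketch {
    final String key;
    final String value;
    final boolean remove;

    ItemSketch(String key, String value, boolean remove) {
        this.key = key;
        this.value = value;
        this.remove = remove;
    }
}

// An in-memory "external data source" that serves both roles.
final class InMemoryEds implements BulkWriterSketch, InitialLoaderSketch {
    private final Map<String, String> rows = new LinkedHashMap<>();

    // Write side: apply each mirrored operation in the order received.
    @Override
    public void executeBulk(List<ItemSketch> items) {
        for (ItemSketch item : items) {
            if (item.remove) {
                rows.remove(item.key);
            } else {
                rows.put(item.key, item.value);
            }
        }
    }

    // Read side: iterate over a copy so later bulks do not disturb a load in progress.
    @Override
    public Iterator<Map.Entry<String, String>> initialLoad() {
        return new LinkedHashMap<String, String>(rows).entrySet().iterator();
    }
}
```

A grid starting up would drain the iterator from initialLoad() once, then send every later change through executeBulk().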

The mirror project consolidates most of the core functionality for these interfaces in a class called “AbstractNoSQLEDS,” which then provides some lifecycle methods as well as abstract update stubs; the initialLoad() is implemented, but only logs an error if called.

One facet of the AbstractNoSQLEDS is the provision of an artificial primary key for NoSQL, which embeds a source data type (from the data grid) as well as the data grid element’s primary key. This is currently the only piece of metadata preserved from the data grid into the persisted object, and it’s really quite important if the EDS is to recreate the object properly.
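One way such an artificial key could be built is sketched below. The format here (type name, a '#' separator, then the grid key) is purely an assumption for illustration; the post does not say how AbstractNoSQLEDS actually encodes its keys, and the ArtificialKey class is hypothetical.

```java
// Hypothetical encoding of an artificial key: "<type name>#<grid primary key>".
final class ArtificialKey {
    private static final char SEPARATOR = '#';

    final String typeName;
    final String id;

    ArtificialKey(String typeName, String id) {
        // Java type names never contain '#', so the first '#' always ends the type.
        if (typeName.indexOf(SEPARATOR) >= 0) {
            throw new IllegalArgumentException("type name must not contain " + SEPARATOR);
        }
        this.typeName = typeName;
        this.id = id;
    }

    // Produce the key stored in the external data store.
    String encode() {
        return typeName + SEPARATOR + id;
    }

    // Recover the grid type and the grid primary key from a stored key.
    static ArtificialKey decode(String encoded) {
        int split = encoded.indexOf(SEPARATOR);
        if (split < 0) {
            throw new IllegalArgumentException("not an artificial key: " + encoded);
        }
        return new ArtificialKey(encoded.substring(0, split), encoded.substring(split + 1));
    }
}
```

Decoding is what lets an initial load work out which data grid type a stored record should become.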

The MongoEDS class accepts a connection to an external MongoDB process, and works with a predefined collection named “IGSEntry.”[4] In order to write data into the datastore, it looks at the updated data (provided through the IGSEntry interface) and writes each data element into a MongoDB document.

For updates, it does almost the same thing – the difference is that it reads the data into a document and modifies that, instead of creating a new document from whole cloth.

The initial load from MongoDB is simply a matter of iterating through the entire IGSEntry collection; if it’s able to create a XAP data element[5] from a given document from MongoDB, it does so and returns it to the data grid, where the data element becomes available for clients.

The Cassandra EDS does almost the same thing. The primary difference is that the Cassandra EDS treats writes and updates as the same operation, because a Cassandra write overwrites the columns it supplies and leaves the row’s other columns untouched. (In other words, every write is an update as well; it’s just that a write to a new row isn’t overwriting already-existing data.)
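The difference between the two write paths can be modeled without either database. In this sketch both stores are in-memory maps with invented names (DocumentStyleStore, RowStyleStore); neither class talks to MongoDB or Cassandra, and throwing on an update of a missing document is my assumption, not documented MongoEDS behavior.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory stand-ins only; neither class talks to MongoDB or Cassandra.
final class DocumentStyleStore {
    private final Map<String, Map<String, Object>> docs = new HashMap<>();

    // Write: build a fresh document from the entry's fields.
    void insert(String key, Map<String, Object> fields) {
        docs.put(key, new HashMap<>(fields));
    }

    // Update: read the existing document and modify it in place.
    // Failing when the document is missing is an assumption of this sketch.
    void update(String key, Map<String, Object> changed) {
        Map<String, Object> doc = docs.get(key);
        if (doc == null) {
            throw new IllegalStateException("no document for " + key);
        }
        doc.putAll(changed);
    }

    Map<String, Object> get(String key) {
        return docs.get(key);
    }
}

final class RowStyleStore {
    private final Map<String, Map<String, Object>> rows = new HashMap<>();

    // One operation for both write and update: supplied columns overwrite,
    // columns that are not supplied are left alone.
    void upsert(String key, Map<String, Object> columns) {
        rows.computeIfAbsent(key, k -> new HashMap<>()).putAll(columns);
    }

    Map<String, Object> get(String key) {
        return rows.get(key);
    }
}
```

With the row-style store, the EDS never needs to know whether a row already exists, which is why one code path is enough.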

Future Directions

The EDS implementations in the GitHub project are not complete. Given their scope, this is fine, because they’re really meant to serve as proofs of concept more than complete implementations, but that (obviously) doesn’t mean there’s not room for improvement.

For one thing, neither one handles nested objects; they’re designed for simple POJOs (or documents, in MongoDB’s case) where there are single levels of data representation. In other words, if you have a Person object, with an embedded Address object (which itself has a street address, city, etc.), the serialization will fail, because it can’t handle the nested Address.
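Here is a small illustration of what "single level only" means in practice. It is a sketch, not the code from the GitHub project: FlatSerializer, PersonExample, and AddressExample are invented names, and rejecting a nested object with an exception is simply the behavior I chose for the example.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.LinkedHashMap;
import java.util.Map;

// Example types for the sketch: a Person with an embedded Address.
final class AddressExample {
    final String city;

    AddressExample(String city) {
        this.city = city;
    }
}

final class PersonExample {
    final String name;
    final AddressExample address;

    PersonExample(String name, AddressExample address) {
        this.name = name;
        this.address = address;
    }
}

// Hypothetical single-level serializer: one column per simple field.
final class FlatSerializer {
    static Map<String, Object> toColumns(Object pojo) {
        Map<String, Object> columns = new LinkedHashMap<>();
        for (Field field : pojo.getClass().getDeclaredFields()) {
            if (Modifier.isStatic(field.getModifiers()) || field.isSynthetic()) {
                continue;
            }
            field.setAccessible(true);
            Object value;
            try {
                value = field.get(pojo);
            } catch (IllegalAccessException e) {
                throw new IllegalStateException("cannot read field " + field.getName(), e);
            }
            // Anything that is not a simple value would need another level of structure.
            if (value != null && !isSimple(value)) {
                throw new IllegalArgumentException("nested object in field " + field.getName());
            }
            columns.put(field.getName(), value);
        }
        return columns;
    }

    private static boolean isSimple(Object value) {
        return value instanceof String || value instanceof Number
                || value instanceof Boolean || value instanceof Character;
    }
}
```

A Person without an address serializes to two columns; a Person with one is rejected, because the Address has no single column to live in.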

In addition, the Cassandra EDS requires a schema; while the MongoEDS can handle schema-free data models without problems (assuming they’re single-level, of course), the Cassandra EDS cannot. The reason goes back to metadata and primary keys; documents don’t have the same kinds of primary keys that POJOs do, and it’s not trivial to infer the right object schema from the Cassandra row.

It’s not impossible, of course; nothing is. However, given the scope of the EDS as implemented in the github project, it’s more work than is justified.

For end users, ideally you’d implement a custom EDS for your data model. The implementations as written in the github project are a good first step, but they’re not full-featured enough to be production-ready, and they make assumptions about the form of the data in MongoDB and Cassandra with which you, as an end user, might not want to comply. By controlling your own schema, you gain more flexibility in types and more reliability in representation.

The EDS examples as written could be enhanced; the best thing they could do is embed metadata (or write it in a separate column family/collection) so the EDS can properly reproduce the data as it was written. However, this isn’t trivial and has speed implications as well.

Any questions? Feel free to ask!


 


[0] Full disclosure: for anything other than rendered content – i.e., stuff that’s temporary and everyone knows it – I really, really dislike caches. I find them unnecessary; they’re basically this giant red flag saying “I chose the wrong data store because mine’s too slow” to everyone who looks. By the way, this is “footnote 0” because I added it after all the others and didn’t feel like renaming them all. This would have been easier in this post than any other, since it has fewer footnotes… but I was feeling lazy.

[1] There’s also a read facility – namely, an initial load capability – that factors heavily here. But we’ll get there, I promise. And soon.

[2] Plus, “narm” is a meme making fun of overly dramatic acting. No one wants to see that when talking about data… I hope.

[3] And at last we get to “initial load.”

[4] MongoDB is a document store. It organizes data into “collections,” which contain documents. It uses BSON to manage the representation of the data with clients.

[5] This is where the artificial primary key from the EDS comes in; the primary key is decoded to derive the actual datagrid type.

 
© GigaSpaces on Application Scalability | Open Source PaaS and More