At the heart of every big data solution, such as XAP, is data synchronization. While there are a number of solutions and approaches available today, they tend to be either overly complex or inefficient. Fortunately, we have been able to combine ideas from several of these protocols into a simple yet efficient data synchronization framework.

The Popular Data Sync Technologies of Today

Programmers have a handful of data synchronization technologies available. Below we will explore the many pros and cons of these different solutions, starting with what is arguably the most popular: Oracle RMI.

Oracle RMI

Java RMI (Remote Method Invocation) allows programmers to create distributed Java-to-Java applications. It lets users invoke methods on remote Java objects from other virtual machines, possibly running on different hosts.

Why It Works

Java RMI offers a number of benefits, such as:

  • There is no need for a pre-compilation stage, as there is with CORBA or Google Protobuf
  • It integrates seamlessly into Java (users can send a combination of serializable and RMI-exported objects)
  • It supports dynamic code loading; that is, class definitions are transferred between the client and the server when needed
  • The framework is well known and easy for Java programmers to use
  • The framework performs distributed garbage collection (i.e. if a user allocates an object on the server and passes a reference to it to a client, the JVM keeps the object alive on the server until no client references it any longer, and only then releases it)
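
To make the programming model concrete, here is a minimal sketch of a Java RMI round trip that runs inside a single JVM. The names RmiEchoSketch, Echo, and EchoImpl are made up for illustration, and the loopback address and anonymous ports are assumptions chosen so the example needs no external setup.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

class RmiEchoSketch {
    /** Remote interface: every remote method must declare RemoteException. */
    public interface Echo extends Remote {
        String echo(String message) throws RemoteException;
    }

    /** Plain implementation; exporting it is what makes it remotely callable. */
    static class EchoImpl implements Echo {
        @Override
        public String echo(String message) {
            return "echo: " + message;
        }
    }

    /** Exports an object, looks it up through a registry, and calls it via its stub. */
    static String roundTrip(String message) throws Exception {
        // Assumption for the sketch: keep all traffic on the loopback interface.
        System.setProperty("java.rmi.server.hostname", "127.0.0.1");
        EchoImpl impl = new EchoImpl();
        // Port 0 asks RMI to pick an anonymous port for the exported object.
        Echo stub = (Echo) UnicastRemoteObject.exportObject(impl, 0);
        Registry registry = LocateRegistry.createRegistry(0);
        try {
            registry.rebind("echo", stub);
            Echo client = (Echo) registry.lookup("echo");
            // The calling thread blocks here until the reply arrives.
            return client.echo(message);
        } finally {
            // Unexport both objects so the RMI threads stop and the JVM can exit.
            UnicastRemoteObject.unexportObject(impl, true);
            UnicastRemoteObject.unexportObject(registry, true);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("hello"));
    }
}
```

No stub compiler or interface definition file is involved: the remote interface is ordinary Java, and the stub is generated at run time when the object is exported.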

Why It’s Not Ideal

Java RMI does come with a few notable drawbacks:

  • It is not built with NIO and thus supports only the thread-per-request model, which is not scalable. Furthermore, every exported object opens its own server socket, which wastes machine resources
  • It is tied to Java serialization, meaning it is not payload agnostic
  • The API is blocking (i.e. if you make a remote call, the calling client thread is blocked until it gets a result)
  • Java serialization does not readily support extending or evolving data. This differs from XML or JSON, where readers can simply ignore new fields
  • It is difficult to configure if a firewall is in place
  • It uses dynamic port binding on the server side, which makes it a poor fit for deployment in the cloud (where explicit security group rules are needed) or in a Docker container

REST

Unlike RPC, which is more action focused, REST is about resources. With REST, the URL references only the resource, and the HTTP method and the body of the request define what to do with that resource. It is a great fit for public-facing APIs, as it does not require much pre-existing knowledge about the service to be used effectively.

Why It Works

Programmers use REST for a variety of reasons:

  • REST is built on top of HTTP (i.e. users can use all of the HTTP headers for caching, redirects, and standard error codes)
  • It is firewall friendly
  • There are numerous tools available for viewing and debugging REST communication
  • It is very easy to debug
  • It can be triggered easily from any programming language and environment (even directly from the browser)
  • It is payload agnostic (it is not opinionated about how to encode the payload other than HTTP limitations)
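
The resource-oriented style can be sketched with nothing but the JDK (Java 11 or later). The example below is illustrative only: the /users/{id} resource, the class name RestSketch, and the JSON payload are hypothetical. It shows the URL naming the resource, the HTTP method selecting the action, and standard status codes reporting the outcome.

```java
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class RestSketch {
    /** Starts a tiny in-process server exposing the hypothetical resource /users/{id}. */
    static HttpServer startServer() throws Exception {
        Map<String, String> users = new ConcurrentHashMap<>();
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/users/", exchange -> {
            String id = exchange.getRequestURI().getPath().substring("/users/".length());
            String method = exchange.getRequestMethod();
            byte[] body = new byte[0];
            int status;
            if (method.equals("PUT")) {            // create or replace the resource
                byte[] payload = exchange.getRequestBody().readAllBytes();
                users.put(id, new String(payload, StandardCharsets.UTF_8));
                status = 204;
            } else if (method.equals("GET") && users.containsKey(id)) {
                body = users.get(id).getBytes(StandardCharsets.UTF_8);
                status = 200;
            } else if (method.equals("GET")) {
                status = 404;                      // unknown resource
            } else {
                status = 405;                      // method not allowed
            }
            // A length of -1 tells the server that no response body follows.
            exchange.sendResponseHeaders(status, body.length == 0 ? -1 : body.length);
            if (body.length > 0) {
                exchange.getResponseBody().write(body);
            }
            exchange.close();
        });
        server.start();
        return server;
    }

    /** PUTs a user, GETs it back, and returns "putStatus getStatus body". */
    static String demo() throws Exception {
        HttpServer server = startServer();
        try {
            URI uri = URI.create("http://127.0.0.1:" + server.getAddress().getPort() + "/users/42");
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest put = HttpRequest.newBuilder(uri)
                    .PUT(HttpRequest.BodyPublishers.ofString("{\"name\":\"Ada\"}"))
                    .build();
            int putStatus = client.send(put, HttpResponse.BodyHandlers.discarding()).statusCode();
            HttpRequest get = HttpRequest.newBuilder(uri).GET().build();
            HttpResponse<String> got = client.send(get, HttpResponse.BodyHandlers.ofString());
            return putStatus + " " + got.statusCode() + " " + got.body();
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo());
    }
}
```

The server never inspects the payload format, which is what makes the approach payload agnostic: the same handler would store XML or plain text just as well.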

Why It’s Not Ideal

There are a few reasons why REST may not work for a user:

  • It is built on top of HTTP (which is relatively slow and verbose) and does not support pipelining
  • The non-blocking API is only a thin wrapper on top of a blocking API. This is an inherent issue and not specific to particular frameworks: it stems from the fact that REST is based on HTTP and HTTP does not support

Apache Thrift

Apache Thrift is another protocol and framework that is used as a foundation for RPC. It was originally developed at Facebook for “scalable cross-language services development.” It is now developed under the Apache Software Foundation and is used by Cloudera and Last.fm, among others.

Why It Works

Apache Thrift delivers a handful of benefits, including:

  • It consistently generates both the client interfaces and the server for a given service (this means that client calls will generally be less prone to error)
  • The framework supports a variety of protocols and not just HTTP
  • It is a well-tested and widely used piece of software
  • It is portable across programming languages and has bindings to all major languages
  • It’s binary and therefore more efficient than text-based protocols

Why It’s Not Ideal

For all of its benefits, there are some drawbacks:

  • The framework is poorly documented
  • It takes more work to get it running on the client side (though less work for the service owner if they are building libraries for clients)
  • It requires generating the client-side proxies
  • It’s binary and therefore often hard to debug
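
The trade-off between the last pro and the last con can be seen in a small size comparison. This is only a sketch: it uses fixed-width JDK encoding rather than Thrift's (or Protobuf's) actual wire format, which also carries field tags and may use variable-length integers, and the field names userId and score are invented.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class BinaryVsTextSketch {
    /** Fixed-width binary encoding: 8 bytes for the long plus 4 bytes for the int. */
    static byte[] encodeBinary(long userId, int score) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeLong(userId);
            out.writeInt(score);
        }
        return bytes.toByteArray();
    }

    /** Text encoding of the same two fields as a small JSON object. */
    static byte[] encodeText(long userId, int score) {
        String json = "{\"userId\":" + userId + ",\"score\":" + score + "}";
        return json.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        long userId = 1234567890123L;
        int score = 98765;
        System.out.println("binary bytes: " + encodeBinary(userId, score).length);
        System.out.println("text bytes:   " + encodeText(userId, score).length);
        // The text form can be read directly; the binary form needs a decoder.
        System.out.println(new String(encodeText(userId, score), StandardCharsets.UTF_8));
    }
}
```

For these values the binary form takes 12 bytes and the JSON form 38, but only the JSON form can be read off the wire without tooling.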

Google Protocol Buffers

This serialization format was developed by Google for internal use, and protocol compilers for Java, C++, and Python have been made available to the public under an open source license. It is the only binary serialization format you will need in order to use gRPC, although it does require an extra step to generate programming language bindings. It is similar to Apache Thrift except that it does not include a concrete RPC protocol stack for defined services. Otherwise the pros and cons are extremely similar, with Protobufs and Thrift operating at almost the

Exploring Your Other Options

There are other data serialization options apart from the above, such as BSON (not nearly as compact as other binary formats) and XML message-based RPC (which is not very readable). These options also require users to add a transport.

The Challenges of Serialization in a Large-Scale Data Cluster

Numerous challenges arise when attempting serialization and synchronization in a large scale data cluster environment.

The first challenge is achieving low latency. There will be delays when data is being recalled and synchronized, but the delays should not be noticeable to users.

Another problem is scalability. The server needs to handle a large number of concurrent requests and connections without adding too much overhead, yet high-level abstractions add overhead of their own. A very fine balancing act must therefore take place to keep systems running smoothly and efficiently.

There are also a number of properties required from the protocol (ten, to be exact):

  • Authentication
  • Encryption
  • Backward Compatibility (i.e. add new values to objects)
  • Schema
  • Easy Language Interoperability
  • Simplicity
  • Payload Agnostic (transfer the data in the format you choose. In RMI for example, you’re bound to Java)
  • Blocking and Non-Blocking I/O support
  • Cancellation and Timeout
  • Standardized Status Codes
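
Several of these properties can be illustrated together with the JDK's CompletableFuture (Java 9 or later). The sketch below is not tied to any particular protocol: lookupAsync is a hypothetical call that simulates network latency with a delayed executor, and the status strings OK, TIMEOUT, and ERROR are made up for the example.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class CallStylesSketch {
    /** Hypothetical non-blocking call: completes on another thread after delayMillis. */
    static CompletableFuture<String> lookupAsync(String key, long delayMillis) {
        return CompletableFuture.supplyAsync(
                () -> "value-for-" + key,
                CompletableFuture.delayedExecutor(delayMillis, TimeUnit.MILLISECONDS));
    }

    /** Blocking form built on the non-blocking one: the caller waits for the result. */
    static String lookup(String key, long delayMillis) {
        return lookupAsync(key, delayMillis).join();
    }

    /** Applies a timeout and maps the outcome to a simple status string. */
    static String lookupWithTimeout(String key, long delayMillis, long timeoutMillis) {
        try {
            return "OK " + lookupAsync(key, delayMillis)
                    .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS)
                    .join();
        } catch (CompletionException e) {
            // join() wraps the failure; the cause tells a timeout from other errors.
            return e.getCause() instanceof TimeoutException ? "TIMEOUT" : "ERROR";
        }
    }

    public static void main(String[] args) {
        System.out.println(lookup("a", 10));
        System.out.println(lookupWithTimeout("b", 10, 2000));
        System.out.println(lookupWithTimeout("c", 2000, 50));
    }
}
```

Offering the blocking call as a thin layer over the non-blocking one keeps both styles available without duplicating the underlying logic.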

Unfortunately, there is no generic solution that solves all of the above challenges, which is why most developers eventually end up developing their own proprietary RPC protocol (as MongoDB and Cassandra have done).

Introducing AsyncRMI

In light of the above, we created our own open source framework to address this all too common problem that programmers face. We call it AsyncRMI.

What Is AsyncRMI and How It Addresses the Above Challenges

AsyncRMI uses Java RMI interfaces and dynamic code loading, and is a drop-in replacement for Java RMI, which makes it easy for RMI users to adopt. All objects can be published through a single server socket, so it consumes few resources and is firewall friendly. Resource usage is also kept low by using NIO to serve requests with a fixed number of threads. Pipelining lets users limit the number of open connections and reduce the latency of the call round trip, allowing for high throughput, low latency, and reduced resource use. It also supports asynchronous invocation by returning a standard Java CompletableFuture from async calls.
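
To illustrate what serving many connections through one server socket with a fixed number of threads looks like, here is a minimal single-threaded NIO echo loop. It is a generic JDK sketch, not AsyncRMI's implementation; the class name NioEchoSketch and the helper echoOnce are invented for the example.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedSelectorException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;

class NioEchoSketch {
    /** Serves every channel registered with the selector from the calling thread. */
    static void serve(Selector selector) {
        ByteBuffer buffer = ByteBuffer.allocate(1024);
        try {
            while (selector.isOpen()) {
                selector.select(); // wait until at least one channel is ready
                Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
                while (keys.hasNext()) {
                    SelectionKey key = keys.next();
                    keys.remove();
                    if (key.isAcceptable()) {
                        SocketChannel client = ((ServerSocketChannel) key.channel()).accept();
                        if (client != null) {
                            client.configureBlocking(false);
                            client.register(selector, SelectionKey.OP_READ);
                        }
                    } else if (key.isReadable()) {
                        SocketChannel client = (SocketChannel) key.channel();
                        buffer.clear();
                        if (client.read(buffer) < 0) {
                            client.close(); // peer closed the connection
                            continue;
                        }
                        buffer.flip();
                        while (buffer.hasRemaining()) {
                            client.write(buffer); // acceptable for tiny payloads in a sketch
                        }
                    }
                }
            }
        } catch (ClosedSelectorException | IOException e) {
            // The selector was closed or a channel failed: stop serving.
        }
    }

    /** Starts the loop on one daemon thread, sends a message, and returns the echoed reply. */
    static String echoOnce(String message) throws Exception {
        try (ServerSocketChannel server = ServerSocketChannel.open();
             Selector selector = Selector.open()) {
            server.bind(new InetSocketAddress("127.0.0.1", 0));
            server.configureBlocking(false);
            server.register(selector, SelectionKey.OP_ACCEPT);
            Thread loop = new Thread(() -> serve(selector), "nio-loop");
            loop.setDaemon(true);
            loop.start();

            byte[] payload = message.getBytes(StandardCharsets.UTF_8);
            try (SocketChannel client = SocketChannel.open(server.getLocalAddress())) {
                client.write(ByteBuffer.wrap(payload));
                ByteBuffer reply = ByteBuffer.allocate(payload.length);
                while (reply.hasRemaining() && client.read(reply) >= 0) {
                    // keep reading until the whole echo has arrived
                }
                return new String(reply.array(), 0, reply.position(), StandardCharsets.UTF_8);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(echoOnce("ping"));
    }
}
```

The number of threads here does not grow with the number of connections, which is the property that the thread-per-request model lacks.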

This framework delivers other important benefits, such as:

  • Cancellation and Timeout
  • Blocking and Non-Blocking
  • Authentication
  • Encryption
  • Easy Language Interoperability
  • Standardized Status Codes (using Java exceptions)

For the full list of AsyncRMI features, click here.

Most importantly, through AsyncRMI, users can send many notifications regardless of client delays (i.e. even though some of the clients receiving notifications may be very slow, the server can still send out multiple notifications without any issue).

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

List<CompletableFuture<Void>> pendings = new ArrayList<>(listeners.size());
for (Listener listener : listeners) {
    // Asynchronously notify the client; the call returns immediately.
    CompletableFuture<Void> pendingResult = listener.notify(event);
    // Register a callback that cancels the client listener
    // if the notification fails.
    pendingResult.exceptionally(throwable -> {
        cancelListener(listener);
        return null;
    });
    // Store the future so that a timer thread can process it
    // after the notify timeout has expired.
    pendings.add(pendingResult);
}

Here is a complete walkthrough of this example.

Later, from a timer thread, once sufficient time has passed for the notify call to reach the client and return to the server, call cancelPending(pendings):

private void cancelPending(List<CompletableFuture<Void>> pendings) {
    for (CompletableFuture<Void> pending : pendings) {
        // Cancel any notification that has not completed in time.
        if (!pending.isDone()) {
            pending.cancel(true);
        }
    }
}
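
The same pattern can be tried without a server by simulating the listeners. The following self-contained sketch uses only the JDK (Java 9 or later), not the AsyncRMI API: notifyListener is a hypothetical stand-in for a remote notify call, and Thread.sleep stands in for the timer thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

class NotifyFanOutSketch {
    /** Hypothetical listener: acknowledges a notification after delayMillis. */
    static CompletableFuture<Void> notifyListener(long delayMillis) {
        return CompletableFuture.runAsync(
                () -> { },
                CompletableFuture.delayedExecutor(delayMillis, TimeUnit.MILLISECONDS));
    }

    /**
     * Notifies every listener without waiting for any of them, then cancels
     * the ones that have not answered within timeoutMillis.
     * Returns the number of notifications that were cancelled.
     */
    static int notifyListeners(long[] listenerDelaysMillis, long timeoutMillis)
            throws InterruptedException {
        List<CompletableFuture<Void>> pendings = new ArrayList<>(listenerDelaysMillis.length);
        for (long delay : listenerDelaysMillis) {
            // Returns immediately, so a slow listener does not hold up the others.
            pendings.add(notifyListener(delay));
        }
        Thread.sleep(timeoutMillis); // stands in for the timer thread
        int cancelled = 0;
        for (CompletableFuture<Void> pending : pendings) {
            if (!pending.isDone() && pending.cancel(true)) {
                cancelled++;
            }
        }
        return cancelled;
    }

    public static void main(String[] args) throws InterruptedException {
        // Two fast listeners and one very slow one.
        int cancelled = notifyListeners(new long[] {10, 10, 5000}, 500);
        System.out.println("cancelled notifications: " + cancelled);
    }
}
```

With two fast listeners and one slow one, only the slow notification is cancelled, and the fast listeners are never delayed by it.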

Because AsyncRMI applies the principles of RMI in a performance-optimized manner, the programming model stays simple without incurring the usual performance cost.

In short, AsyncRMI delivers the best of both worlds: it has the simplicity of Java RMI and the performance of a Java NIO-based framework. It looks just like Java RMI (unlike other proprietary protocols), so it can easily be swapped for an RMI-based solution. It therefore introduces no lock-in, and it provides a variety of other features needed in production-grade software.

Building a Low Latency Highly Scalable Data Serialization Protocol
Barak Bar-Orion
Head of In Memory Computing R&D Group @ GigaSpaces