

Big-Data-Real-Time-Performance - Enjoying both worlds with one Architecture!

The efficiency of business processes is everything. Companies must be in a position to react quickly to new opportunities as they arise. A successful organization today must be able to extract critical business information from incoming raw data and put it at the fingertips of its decision makers. This process keeps the organization running and competitive.

Companies are dealing with huge amounts of raw incoming data. The analytics process may take minutes, hours, days or even longer to extract information from that data. Providing the right information in the right context to the right location at the right time gives an organization the insight it needs to achieve real business agility. It is no longer good enough to just perform analytics; real-time analytics is needed. New ways need to be found to fulfill these requirements.

Analytics performed today on big data typically requires the use of a NoSQL DB. How can real-time analytics be accomplished?

GigaSpaces allows you to combine its In-Memory Data Grid (IMDG) with a NoSQL DB such as Apache Cassandra from DataStax to perform real-time analytics. A great example is processing market data events that arrive at incredible speed (a few million events/sec) from the different market feeders. These events need to be processed in real time to make buy/sell decisions. Later they are analyzed via backtesting systems to construct better portfolio distribution strategies.

Combining IMDG and NoSQL creates a two-tiered architecture. The IMDG provides the real-time data processing engine, which very different applications written in different programming languages and software frameworks can access in real time. The Apache Cassandra NoSQL DB provides the long-term storage for Business Intelligence (BI), whether through real-time analytics (via Cassandra) or batch analysis (via Hadoop and Hive). DataStax Enterprise 3.0 is a big data platform that uses a production-ready version of Cassandra for real-time analytics, with an integrated Hadoop distribution for batch analysis. It also provides Apache Solr for enterprise search operations.

Combining the two technologies raises some interesting questions:

- What does IMDG real-time data access mean in absolute numbers?

- What overhead is introduced by having an IMDG in front of a NoSQL big data solution?

Read Throughput Benchmark

The benchmark results below demonstrate how combining the Cassandra NoSQL DB with GigaSpaces IMDG may improve your real time analytics performance for data retrieval operations. The benchmark simply reads data based on a particular key:

| Client Threads | Cassandra w/o GigaSpaces (TPS) | Cassandra with GigaSpaces (TPS) |
|---|---|---|
| 1 | 3,279 | 3,400,320 |
| 2 | 6,441 | 7,306,737 |
| 3 | 9,039 | 12,302,141 |
| 4 | 11,371 | 18,496,255 |
| 5 | 13,391 | 21,572,181 |
| 6 | 15,810 | 30,330,604 |
| 7 | 17,981 | 34,354,142 |
| 8 | 21,487 | 39,576,531 |
| 10 | 20,900 | 44,381,324 |

- Cassandra w/o GigaSpaces: TPS for reading data from Cassandra directly (using CQL), without the GigaSpaces IMDG. With 10 threads it delivers 20,900 reads/sec.

- Cassandra with GigaSpaces: TPS for reading data from the IMDG after it has been loaded from Cassandra. With 10 threads the combination delivers more than 44.3 million reads/sec. Blazing fast!
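To put the two columns side by side, here is a small sketch that computes the speedup for each thread count. The figures are copied from the benchmark results above; the names `RESULTS` and `speedup` are just illustrative helpers and not part of any GigaSpaces or Cassandra API.

```python
# Throughput figures copied from the read benchmark results above (reads/sec).
# client threads -> (Cassandra alone, Cassandra with GigaSpaces IMDG)
RESULTS = {
    1: (3_279, 3_400_320),
    2: (6_441, 7_306_737),
    3: (9_039, 12_302_141),
    4: (11_371, 18_496_255),
    5: (13_391, 21_572_181),
    6: (15_810, 30_330_604),
    7: (17_981, 34_354_142),
    8: (21_487, 39_576_531),
    10: (20_900, 44_381_324),
}


def speedup(threads: int) -> float:
    """Ratio of IMDG read throughput to direct Cassandra read throughput."""
    direct, cached = RESULTS[threads]
    return cached / direct


for threads in sorted(RESULTS):
    print(f"{threads:>2} threads: {speedup(threads):,.0f}x")
```

Across all thread counts the ratio falls between roughly 1,000x and 2,100x.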

Here is how the performance improvement looks visually when combining Cassandra with GigaSpaces:

Write Latency Benchmark

The following benchmark measures GigaSpaces write latency, which is the time in microseconds to write data directly into the IMDG. It captures the overhead of pushing data into the IMDG once it has been loaded from Cassandra, without taking any network latency into consideration.

As we can see it’s basically negligible.
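For readers who want to see what "write latency in microseconds" means in practice, here is a minimal timing sketch. It is not the harness used for this benchmark: a plain Python `dict` stands in for the IMDG, and `measure_write_latency_us` is a hypothetical helper. The number it prints says nothing about GigaSpaces XAP itself; it only shows the measurement approach (time many in-process writes and divide by the count).

```python
import time


def measure_write_latency_us(store: dict, n_writes: int = 10_000) -> float:
    """Return the mean time per write, in microseconds, for n_writes writes."""
    start = time.perf_counter_ns()
    for i in range(n_writes):
        store[i] = ("payload", i)  # in-process write, no network hop involved
    elapsed_ns = time.perf_counter_ns() - start
    return elapsed_ns / n_writes / 1_000  # nanoseconds -> microseconds


grid = {}  # hypothetical stand-in for the in-memory data grid
mean_us = measure_write_latency_us(grid)
print(f"mean write latency: {mean_us:.3f} microseconds")
```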

All benchmarks were run on a Cisco UCS server with 8 cores and RAID 6 hard disks, using Apache Cassandra 1.2.1 and GigaSpaces XAP 9.1 on JDK 1.7.

The architecture used in this benchmark is a side cache with an external data source:

The system writes data into Cassandra, storing large amounts of data on the file system. Read operations use a key/value data retrieval approach and go through the IMDG. If the item has already been read, it is cached within the client local cache or the master IMDG. If it is not present within the IMDG, it is automatically loaded from Cassandra. The item stays cached for a specified amount of time (time to live) or is evicted based on an eviction policy (LRU) that takes into account the amount of memory available to the IMDG. If needed, the IMDG can scale elastically, based on a pre-defined SLA, to increase its capacity transparently. This can be done by starting new VMs on a private cloud, a public cloud or your regular in-house data-center infrastructure and expanding the IMDG while the system is running.
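The read path described above (check the cache, load from the backing store on a miss, expire by time to live, evict by LRU) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the GigaSpaces implementation: `ReadThroughCache` is a hypothetical class, a plain `dict` stands in for Cassandra, and eviction is driven by an entry count rather than by available memory.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Hashable


class ReadThroughCache:
    """Side cache: serve reads from memory, fall back to a loader on a miss."""

    def __init__(self, loader: Callable[[Hashable], Any], max_entries: int = 1000,
                 ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._loader = loader            # reads from the external data source
        self._max_entries = max_entries  # capacity limit (assumption: entry count)
        self._ttl = ttl_seconds          # time to live for each cached item
        self._clock = clock
        self._entries = OrderedDict()    # key -> (value, expires_at), LRU first
        self.hits = 0
        self.misses = 0

    def get(self, key: Hashable) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None:
            value, expires_at = entry
            if now < expires_at:
                self._entries.move_to_end(key)  # mark as most recently used
                self.hits += 1
                return value
            del self._entries[key]              # time to live has passed
        self.misses += 1
        value = self._loader(key)               # load from the backing store
        self._entries[key] = (value, now + self._ttl)
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)   # evict least recently used
        return value


# Hypothetical stand-in for Cassandra: a dict with made-up keys and values.
backing_store = {"key-1": "value-1", "key-2": "value-2", "key-3": "value-3"}
cache = ReadThroughCache(backing_store.__getitem__, max_entries=2, ttl_seconds=30.0)

cache.get("key-1")  # miss: loaded from the backing store
cache.get("key-1")  # hit: served from memory
cache.get("key-2")  # miss
cache.get("key-3")  # miss: cache is over capacity, so key-1 (LRU) is evicted
print(cache.hits, cache.misses)  # prints "1 3"
```

The `clock` parameter is there only so the expiry logic can be exercised without waiting for real time to pass.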

From the benchmarks we learn:

1. Both Cassandra and GigaSpaces IMDG scale nicely, each with different absolute numbers.

2. Leveraging the GigaSpaces IMDG local cache with Cassandra makes read operations three orders of magnitude faster.

3. The impact of GigaSpaces on write performance is negligible. It takes a few microseconds to write the data into the IMDG once it has been loaded from the NoSQL DB.

GigaSpaces provides a deployment and management fabric for both its IMDG and Cassandra via Cloudify. Any failure of the system is detected automatically, and GigaSpaces recovers any failed process automatically to guarantee 100% system availability.

Summary

Let’s enjoy both worlds with one Architecture!

The GigaSpaces-NoSQL hybrid architecture combines two technologies into a powerful mix. It still allows you to use each separately when needed.

Use IMDG for:

- Fast key-based data access

- Real-time pre-processing of incoming data, leveraging the IMDG event-based engine

- In-memory SQL queries

- Real-time map-reduce on the data in memory

- Transactional updates

- Collocating business logic with the data in memory, avoiding network overhead and serialization

Use Cassandra/DataStax Enterprise for:

- Large scale real time queries and data loading

- Batch operations

- Map-reduce activities on the entire data

 

Shay Hassidim

 
© GigaSpaces on Application Scalability | Open Source PaaS and More