Analytics for Big Data – Venturing with the Twitter Use Case

Performing analytics on Big Data is a hot topic these days. Many organizations realized that they can gain valuable insight from the data that flows through their systems, both in real time and in researching historical data. Imagine what Google, Facebook or Twitter can learn from the data that flows through their systems. And indeed the boom of real time analytics is here: Google Analytics, Facebook Social Analytics, Twitter paid tweet analytics and many others, including start-ups that specialize in real time analytics.

But the challenge of analyzing such huge volumes of data is enormous. Mining Terabytes and Petabytes of data is no simple task, one that traditional databases cannot meet, which drove the invention of new NoSQL database technologies such as Hadoop, Cassandra and MongoDB. Analyzing data in real time is yet another challenging task, when dealing with very high throughputs. For example, according the Twitter’s statistics, the number of tweets sent  on March 11 2011 was 177 million! Now, analyzing this stream, that’s a challenge!

Standing up to the challenge

When discussing the challenge of real time analytics on Big Data, I oftentimes use the Twitter use case as an example to illustrate the challenges. People find it easy to relate to this use case and appreciate the volumes and challenges of such a system. A few weeks ago, when planning a workshop on real time analytics for Big Data, I was challenged to take up this use case and design a proof of concept that meets up to the challenge of analyzing real tweets from real time feeds. Well, challenge is my name, and solution architecture is my game…

Note that it is by no means a complete solution, but more of a thought exercise, and I’d like to share these thoughts with you, as well as the code itself, which is shared on GitHub (see at the bottom of the post). I hope this will only be the starting point of a joint community effort to make it into a complete reference example. So let’s have a look at what I sketched up.

What kind of analytics?

First of all, we need to see what kinds of analytics are of interest in such use case. Looking at Twitter analytics, we see various types of analytics that I group in 3 categories:

Counting: real-time counting analytics such as how many requests per day, how many sign-ups, how many times a certain word appears, etc.

Correlation: near-real-time analytics such as desktop vs. mobile users, which devices fail at the same time, etc.

Research: more in-depth analytics that run in batch mode on the historical data such as what features get re-tweeted, detecting sentiments, etc.

When approaching the architecture of a solution that covers all of the above types of analytics, we need to realize the different nature of the real time vs. historical analysis, and leverage on appropriate technologies to meet each challenge on its own ground, and then combine the technologies into a single harmonious solution. I’ll get back to that point…

In my sample application I wanted to listen on the public timeline of Twitter, perform some sample real-time analytics of word-counting, as well as preparing the data for research analytics, and combining it into a single solution to handle all aspects.

Feeding in a hefty stream of tweets

I chose to listen on the Twitter public timeline (so I can share the application with you without having to give you my user/password).

For integration with Twitter I chose to use Spring Social, which offers a built-in Twitter connector integrating with Twitter’s REST API. I wanted to integrate with Twitter’s  Streaming API, but unfortunately it appears that currently Spring Social does not support this API, so I settled for repetitive calls to the regular API.

A feeder took in the tweets and converted them in an ETL style to a canonical Document-oriented representation, which semi-structured nature makes it ideal for the evolving nature of tweet structure, and wrote them to an in-memory data grid (IMDG) on the server side.

The design needs to accommodate a very high throughput of around 10k tweets/sec, with latency of milliseconds. For that end I chose to implement the feeder as an independent processing unit in GigaSpaces XAP, so that I can cope with the write scalability requirement by simply adding more parallel feeders to handle the stream. Since the feeder is a stateless service, scaling out by adding instances is relatively easy. Trying to do the same with stateful services will prove to be much more challenging, as we’re about to find out …

Let’s pick the brains of my accumulated tweets

On the server side, I wanted to store the tweets and prepare them for batch-mode historical research. For the same reason of semi-structured data, I also chose a Document-oriented database to store the Tweets. In this case, I chose the open source Apache Cassandra, which has become a prominent NoSQL database that is in use by Twitter itself, as well as by many other companies. The API to interact with Cassandra is Hector.

To avoid tight coupling of my application code with Cassandra, I followed the Inversion of Control principle (IoC) and created an abstraction layer for the persistent store, where Cassandra is just one implementation, and provided another implementation for testing purposes of persistence to the local file system. Leveraging on Spring Framework wiring capabilities (see below), switching between implementations becomes a configuration choice, with no code changes.

Easy configuration

For easy configuration and wiring I used Spring Framework, leveraging on its wiring capabilities as well as properties injection and parameter configuration. Using these features I made the application highly configurable, exposing the Twitter connection details, buffer sizes, thread pool sizes, etc. This means that the application can be tuned to any load, depending on the hardware, network and JVM specifications.

What can I learn from the tweet stream on the fly?

In addition to persisting the data, I also wanted to perform some sample on-the-fly real-time analytics on the data. For this experiment I chose to perform word counting (or more exactly token counting, as token can also be an expression or a combination of symbols).

At first glance you may think it’s a simple task to implement, but when facing the volumes and throughput of Twitter you’ll quickly realize that we need a highly scalable and distributed architecture that uses an appropriate technology and that takes into account data sharding and consistency, data model de-normalization, processing reliability, message locality, throughput balancing to avoid backlog build-up etc.

Processing workflow

I chose to meet these challenges by employing Event-Driven Architecture (EDA) and designed a workflow to run through the different stages of the processing (parsing, filtering, persisting, local counting, global aggregation, etc.) where each stage of the workflow is a processor. To meet the above challenges of throughput, backlog build-up, distribution etc., I designed the processor with the following characteristics:

  1. Each processor has a thread pool (of a configurable size) to enable concurrent processing.
  2. Each processor thread can process events in batches (of a configurable size) to balance between input and output streams and avoid backlog build-up.
  3. Processors are co-located with the (sharded) data, so that most of the data processing is performed locally, within the same JVM, avoiding distributed transactions, network, and serialization overhead.

The overall workflow looks as follows:

For the implementation of the workflow and the processors I chose XAP Polling Container, which runs inside the in-memory data grid co-located with the data and enables easy implementation of the above characteristics.

The events that drive the workflow are simple POJOs on which I listen and which state changes trigger the events. This is a very useful characteristic of the XAP platform, which saved me the need to generate message objects and place them in a message broker.

Atomic counters

For the implementation of atomic counters I used XAP’s MAP API, which allows using the in-memory data grid as a transactional key-value store, where the key is the token and the value is the count, and each such entry can be locked individually to achieve atomic updates, very similarly to ConcurrentHashMap.

Making it all play together in harmony

So we have a deployment that incorporates a feeder, a processor and a Cassandra persistent store, each such service having multiple instances and needing to scale in/out dynamically based on demand. Designing my solution for the real deal, I’m about to face 10s-100s of instances of each service. Manual deployment or scripting will not be manageable, not to mention the automatic scaling of each service, monitoring and troubleshooting. How do I manage that automatically as a single cohesive solution?

For that I used GigaSpaces Cloudify, which allows me to integrate any stack of services by writing Groovy-based recipes describing declaratively the lifecycle of the application and its services.

I can then deploy and manage the end-to-end application using the CLI and the Web Console.

Conclusion

This was a thought exercise on real-time analytics for big data. I used Twitter use case because I wanted to aim high up the big data challenge and, well, you can’t get much bigger than that.

The end-to-end solution included a clustered Cassandra NoSQL database for the elaborated batch analytics of the historical data, GigaSpaces XAP platform for distributed In-Memory Data Grid with co-located with real-time processing, Spring Social for feeding in Tweets from twitter, Spring Framework for configuration and wiring capabilities, and GigaSpaces Cloudify for deployment, management and monitoring. I used event-driven architecture with semi-structured Documents, POJOs and atomic counters, and with write-behind eviction.

This is just the beginning. My design hardly utilized the capabilities of the chosen technology stack, and it barely scratched the surface of the analytics you can perform on Twitter. Imagine for example what it would take to calculate not just real-time word counts but also the reach of tweets, as done on the tweetreach service.

This project is just the starting point, and I would like to share this project with you and invite you to stand up to the challenge with me and together make it into a complete reference solution for real-time analytics architecture for big data.

The project is found on GitHub under git@github.com:dotanh/rt-analytics.git.

You’re welcome to contribute!


Posted in Big Data, Cassandra, Cloudify, EDA, GigaSpaces, Real Time Analytics, Solution Architecture, syndicated, Twitter | Tagged , , | Leave a comment

GigaSpaces Cloudify & VMware CloudFoundary the new PaaS Jailbreaker

I was reading Krishnan Subramanian's post, Two Events That “Clouded” Our Thinking In 2011. The thing that caught my attention was Krishnan's comments on why PaaS is a superior alternative to DevOps: 

The shift in the thinking about the enterprise cloud consumption also poured water into the “DevOps” concept advocated by vendors and pundits with their foot in the IaaS world. When organizations embrace PaaS instead of infrastructure services, we don’t need the DevOps marriage and the associated cultural change (believe me, this cultural change is giving sleepless nights to many IT managers and some consultants are even making money helping organizations realize this cultural change). With PaaS, organizations can keep the existing distinction between the Ops and Dev teams without worrying about the cultural change. In fact, with cloud computing, the role of the Ops is not going away but it stays in the background offering an interface which developers can manage themselves.

Krishnan represents one of the common attitudes and subjects of debate between two main paradigms for developing and managing applications on the cloud:

  • PaaS -- PaaS takes a developer, application-driven approach. A PaaS platform provides generic application containers to run your code. The PaaS platforms deals with all the operational aspects needed to run your code such as deployment, scaling, fail-over, etc.
  • DevOps -- DevOps takes a more operations-driven approach. With DevOps, you get tools to automate your operational environment through scripts and recipes, and keep full visability and control over the underlying infrastructure.

The Difference Between PaaS and DevOps

Both PaaS and DevOps aim toward the same goal -- reducing the complexity of managing and deploying applications on the cloud. But they take a fairly different approach to deliver on that promise.

Christoper Knee summarized it in his blog post DevOps and PaaS -- Friend or Foe? as the difference between Developers and SysAdmin:

  • Developers may ask: "if I have a self-service portal for deploying applications (aka PaaS), do I need SysAdmins at all?"
  • SysAdmins may ask: "isn't PaaS just a monstrous black box that prevents me from provisioning the specific services we need to deploy real-world apps?" ... The typical SysAdmin thinks that they can get to 75% of PaaS functionality with DevOps tools like Chef without giving up any systems architecture flexibility.

In one of my earlier post on the subject, Productivity vs. Control tradeoffs in PaaS, I tried to outline the main limitation of most of the current "blackbox PaaS" implementations:

I thought that Carlos Ble's post Goodbye Google App Engine (GAE) is a good example that illustrates why the initial perception behind GAE as a simple platform that provides extreme productivity can be completely wrong.

...developing on GAE introduced such design complexity that working around it pushes us 5 months behind schedule.

Part of the reason that brought Carlos through that experience IMO is that in the course of trying to make GAE extremely productive the owner made the platform too opinionated, to the point where you lose all the potential productivity gains by trying to adopt their model. In addition, with a platform like GAE you have very little freedom to leverage existing frameworks such as your own database, or messaging system, or any other third-party service that can in itself be a huge contributor to productivity.

Instead, you're completely dependent on the platform provider's stack and pace of development, and that in itself can work against agility and productivity in yet another dimension. In this specific example, Carlos couldn’t use a specific version of a Python library that would have made his productivity higher, and instead had to work around issues that were already solved elsewhere. This is a good example how the lack of flexibility leads to poorer productivity even in the case of simple applications.

Putting DevOps & PaaS togatehr 

It looks like more people in the the industry have come to recognize that rather than looking at DevOps and PaaS as two competing paradigms, it might be best to combine the two, as Christoper Knee pointed out in his post:
 
What if you could get a PaaS that wasn't a black box, enabling developers to deploy apps easily while still giving SysAdmins the ability to provision any services they needed (a la Cloud Foundry)?
I also came to that realization myself, as I pointed out in my  2012 predictions
In 2012, we’ll see many of the DevOps tools, such as Chef and Puppet, integrated into application platforms, making it easier to deploy complex applications onto the cloud. In the same way, we’re going to see more Application Platforms adopting the automation and recipe model from the DevOps world into the application platform. The latter have the potential to transform the opinionated PaaS offerings as we know them today, with Heroku and GAE leading that trend, into a more open PaaS offering that better fits the way users develop apps today, and provide more freedom to choose your own stack, cloud, and application blueprint.

What Makes a Cloudify and CloudFoundary a PaaS Jailbreaker?

The CIO Maginze article Cloud computing disrupts the vendor landscape defines a new class of PaaS platforms that I think represent the definition for a DevOps PaaS:
A PaaS that allows developers to use whatever tool they want to build their cloud applications and the platform tackles the deployment, scaling and management of these apps in the cloud data center.
VMware CloudFoundry is one of the more notable references in that category. Quoting Christoper Knee:
CloudFoundry runs anywhere, incuding on your laptop. CloudFoundry's service container concept is particularly strong, kind of an appliance on steroids.

These ideas were the founding concept behind Cloudify, i.e. putting DevPops & PaaS togather in a single framework. As with CloudFoundry, Cloudify enables you to break away from the "blackbox PaaS". However, even though CloudFoundry is significantly more open than most other PaaS alternatives, at the core it is still based on the "my way or the highway" approach (aka opininated architecture), which forces you to fit into a specific blueprint that is mandated by the platform. Cloudify, on other hand, pushes the envelope even further by adopting the concpet of recipes that was first introduced by DevOps frameworks such as Chef and Puppet. It introduces  more application-driven recipes through the introduction of Domain Specific Language that extends upon the groovy language.

The Cloudify recipes give you the full power to plug in any application stack on any cloud (including a non-virtualized environment) and manage them in similar way to the way you would manage them in your own datacenter or machines. You can also call Chef and Puppet from within the recipes. All this, without hacking the framework itself. As with other similar DSLs, the Cloudify DSL was designed to express even the most complex application management tasks, such as <recovery from a data center failure> in a single line and avoid all the verbose script and API calls that are often involved when you work at the infrastructure level.

All this makes Cloudify an even more open alternative that fits a large variety of the current enterprise application stacks, including:

  1. JEE applications
  2. Big Data applications
  3. Multi-tier applications
  4. Native applications (C++,..)
  5. .Net, Ruby, Python, PHP applications
  6. Multi-site applications
  7. Low-latency applications (that can't run on VMs)

It also makes Cloudify more open for special customization of the existing stack, like in the case of:

  1. An application that needs a certain version of MySQL (not the one that comes with the framework)
  2. An application that needs to run on Redhat (not Ubuntu). or even more interesting -- a case where there are mutiple applications, each needing a different OS served at the same time.

Cloudify also provides a more advanced level of control geared for mission-critical apps, including:

  1. Monitoring the application stack and topology
  2. Adding custom application metrics
  3. Adding custom SLAs

It can also work on wide variety of cloud environments, including Microsoft Azure and non-virtual enviroments.

One of the great powers of the recipe is that it is a great collaboration tool. Once you develop a recipe, it is very easy to share it wtih different groups -- whether internal groups like developement, QA, and operations, where the recipie provides a programmatic definition of thier environment, or it can be shared between the product team and the pro-services team, where your sales and pro-services can easily install and update product versions in a consistent way, as well as reproduce customer scenarios and share them with the support team. Recipes are also a great tool for collaboration through a wider community network, where users can collaborate by sharing common recipes and best practices over the web.

Quick Introduction to Cloudify Recipes

Below is a short snippet that shows a simple application recipe, of a typical java-based web application based on JBoss as the web container and MySQL as a database. The application recipe describes the services that comprise the application, and thier dependencies. The details on how to run MySQL and JBoss is provided in a seperate recipe for each of the individual services. A more detailed description of how a service recipe would look like can be seen here

application {
       name="simple app"
       service {
              name = "mysql-service"
      }
       service {

              name = "jboss-service"
              dependsOn = ["mysql-service"]
       }
}

To get a taste, you can try one of the available recepes such as Cassandra, MongoDB, Tomcat, JBoss, Solar, etc., just as a simple way to try out these products on your own desktop or on any of the supported clouds, without the hassle that is often involved in doing so and without even direct relation to Cloudify per-se.

 

References

Posted in Cloud, GigaSpaces, JavaSpaces, syndicated | Tagged , | 2 Comments

Terabyte Elastic Cache clusters on Cisco UCS and Amazon EC2

Overview

Last week I was working on a new opportunity. The prospect needs to store 1 Terabyte of data in memory to address scalability challenges and were interested in using GigaSpaces. I was tasked with creating a demonstration of this and want to share my experience as a blog post.

Scope

The project also needed to demonstrate high-availability and full lifecycle aspects of any data grid used for a mission-critical application. This application serves millions of simultaneous users worldwide. Loading the cache with data, read throughput (max reads/second), write throughput (max writes/second) and automatic failover were most important aspects.

The first issue I had to address was hardware. The concern was that I only had about a week to deliver the demo, and I needed machines with over 1 terabyte of available memory. We have some lab machines that I could hobble together to come up with a terabyte of memory, but these are used for internal tests by our R&D and QA teams and getting these exclusively for the demo might not be an option. An Amazon EC2 cluster was the next best option that seemed viable. We also reached out to our contacts in Cisco asking for help.

The very first requirement of this demo is the need to run on any deployment environment given that where it would run was still an open question. This capability includes being able to develop on a laptop and then deploy to the actual demo hardware without any code changes.

Cloudify

For addressing the dynamic deployment issue, luckily we have Cloudify, which does exactly that. It makes any application agnostic to the underlying deployment environment. With Cloudify, the same application can be deployed to a cloud infrastructure, a non-cloud infrastructure or even a local machine without any changes to code. Deploying, managing and monitoring an application on any of this environment becomes very easy, and the look and feel of the management and monitoring tools remains consistent.

Hardware

EC2

For the EC2 version, I used the stock Amazon Linux image on the largest configuration available, which is the High-Memory Quadruple Extra Large Instance. Since these instances come with 68 GB of memory, we needed 16 machines for a 1 Terabyte cluster. Our prospect also wanted to see hot backup and automatic failover, so that required another 16 machines, for 2 Terabytes in total. Running these 32 machine instances on EC2 costs $64 per hour, which is pretty cheap for a demo or development environment.

Cisco UCS

Cisco was kind enough to lend us a C260 UCS Server, (referred to as UCS below) for our tests. The UCS offers large memory capacity and 40 ultra-fast cores, so they are an ideal platform for memory-bound and CPU-intensive applications. These were exactly the type of conditions that the demo is subject to and this server was the ideal hardware we needed. Since UCS machines have 1 Terabyte of memory and retail for about $40,000, they are also very attractive for production environments.

Cloudify Recipe

An application deployed using Cloudify has to define Application and Service Recipes, which are Groovy scripts. The Application Recipe describes the application components, which are its services and their dependencies. The Service Recipe describes service characteristics, such as the number of service instances, lifecycle events, scripts that handle these events, monitoring configurations, scaling rules and custom alert rules. More information on Recipe and how Cloudify uses it is described here.

The Application Recipe (datagrid-application.groovy) consists of a single service:

application {

   name="datagrid-application"

   service {

      name = "datagrid-space"

   }

}

This service is an Elastic Cache Service (Data Grid) and the following Application Recipe describes the EC2 version of it:

service {

icon "icon.png"

name "datagrid-space"

statefulProcessingUnit {

binaries "datagrid-space" //can be a folder, jar or a war file  

sla {

memoryCapacity 2048000

maxMemoryCapacity 2048000

highlyAvailable true

memoryCapacityPerContainer 64000

}

}

}

For the UCS version, the application code remains the same and only the memory capacity settings are adjusted to the available memory on the machine (1000000 MB).

Cloud Driver

The cloud driver acts as the specification for the new machines that Cloudify provisions. Cloudify spins up new machines when an application is deployed, scales out or on failure of a machine. The cloud driver for EC2 is configured as follow:

cloud {

provider "aws-ec2"

user "YOUR_EC2_ACCESS_KEY_ID"

apiKey "YOUR_EC2_SECRET_ACCESS_KEY_ID"

// relative path to gigaspaces directory

localDirectory "tools/cli/plugins/esc/ec2/upload"

remoteDirectory "/home/ec2-user/gs-files"

imageId "us-east-1/ami-1b814f72"

machineMemoryMB "68100"

hardwareId "m2.4xlarge"

// Security group which has the appropriate ports configured to be open for incoming and outgoing traffic

securityGroup "default"

// YOUR keypair file and name of the keypair

keyFile "cloud-demo.pem"

keyPair "cloud-demo"

// S3 URL location where GigaSpaces is saved. Update the access properties of this location to everyone

cloudifyUrl "https://s3.amazonaws.com/cloudify/gigaspaces.zip"

machineNamePrefix "gs_esm_gsa_"

dedicatedManagementMachines true

managementOnlyFiles ([])

connectedToPrivateIp false

sshLoggingLevel java.util.logging.Level.WARNING

managementGroup "management_machine"

numberOfManagementMachines 2

zones (["agent"])

reservedMemoryCapacityPerMachineInMB 1024

No cloud driver was needed for the UCS version because it ran on existing machines and no hardware provisioning was needed.

Deployment and monitoring

Starting the Cloudify infrastructure is straightforward: Log in to the Cloudify shell and run either “bootstrap-localcloud” for starting a local cloud or “bootstrap-cloud ec2” for starting EC2 cloud. This starts the GigaSpaces management infrastructure which includes a GSA, GSM, LUS, web-ui and a rest service.

For EC2, this bootstrap process takes about 2-3 minutes. This includes time to provision the new machines on EC2, copy GigaSpaces software and start the processes listed above.

For UCS the bootstrapping was much faster and took less than 1 minute.

Once the management infrastructure is ready, the application can be deployed. This is done using the “install-application” command with an argument for the location of the Application Recipe folder. It’s the same command for both EC2 and UCS versions of the demo.

EC2 deployment took about 10 minutes. In this time, 32 new machines were provisioned on EC2, GigaSpaces software was copied, GigaSpaces agent processes were started, GSC’s were started and the application was deployed across all the machines.

Voilà, a 2TB cluster was up and running!

Deployment of the UCS version was just as easy, and the cluster was up and running even faster as no machine provisioning is involved here.

Performance

Other objective of the demo was to show performance numbers that meet the application requirements. This was the easy part.

We were able to demonstrate easily the peak load requirements of the application and how either of the clusters can keep up with their expected loads.

In both the environments (Cisco UCS and EC2) we had very good results – initial load task on EC2 managed to load 500,000 objects per second (1k size). During the initial load all the machines consumed 99% of their CPU capacity. Based on the initial load throughput numbers we saw when loading 320 million objects, it can be projected that 1 Billion objects can be loaded to the cluster in around 36 minutes (if cluster had enough memory to hold these objects as pointed by Michal Frajt in comments below). Objects were loaded into both the primary and backup which was running on a different VM (on EC2 also a different machine). The data was synthetic account data generated during the load (not loaded from a database). 

In read test we had single client with 50 threads performing a read operation based on a random key. The data grid handled 10,000 read per second when the client used sleep of 4 milliseconds after each read and over 100,000 reads per second without any sleep (2000 reads per second per client thread). During read tests we were also running a writer client which was creating new objects at the rate of 2000 objects per second. Grid nodes on EC2 consumed about 5% CPU during this test. We can project that the grid capacity is about 2 million reads per second from remote clients.

We also tested reads using local cache (aka near cache). With local cache enabled, test client managed to read data at the rate of 5 million objects per second (with local cache size of 1 million objects). As client caches recently read data in local JVM, it avoids remote calls, improving performance dramatically. During local cache tests, client machine consumed 80% CPU as the data was being served from the local cache.

Failover

GigaSpaces + Cloudify make the automatic failover in cloud environments a reality. GigaSpaces detects machine failures and automatically provisions new machines to meet the application SLA (which can be available memory or CPU cores).

To demonstrate this, we simulated a machine failure using the AWS console (“Terminate” function) and then watched as the application automatically recovered from this event by spinning up a new machine. This all occurred transparently and with no performance impact to the clients.

Conclusion

As applications have to manage and manipulate more data (thanks to Big Data and the analytics that can be unearthed out of larger and larger datasets), using in-memory access greatly helps to speed things up. Using GigaSpaces you can manage Terabytes of data across any number of machines, and on any platform.

For applications that have to work with heavy, constant loads, using a Cisco UCS Server infrastructure is a perfect fit.

For applications that only have to work with these large data set occasionally, using a Amazon EC2 infrastructure (or any other cloud provider like RackSpace) is a really good option.

You can download the datagrid recipes I used from here.

Please contact me using comments if you are looking for the demo source code.

Updated 1/5/12Performance section was updated to clarify.

Facebook Twitter Linkedin Reddit Buzz Email
Posted in Benchmarks, Big Data, Caching, Cloud, Data Grid, GigaSpaces, Web UI | 7 Comments

Architecting Massively-Scalable Near-Real-Time Risk Analysis Solutions

Recently I held a webinar around architecting solutions for scalable and near-real-time risk analysis solutions based on the experience gathered with our Financial Services customers. In the webinar I also had the honor of hosting Mr. Larry Mitchel, a leading expert in the Financial Services industry, who provided background on the Risk Management domain. Following the general interest in the webinar, I decided to dedicate a post to the subject.

What goes on in the Risk Management domain?

The Finance world continually undergoes changes driven for the most part by the lessons learned from the 2008 financial crash, in an attempt to prevent such catastrophes from reoccurring. Regulations such as Dodd-Frank, EMIR, and Basel III have further formalized it, imposing tighter control and supervision. We see financial institutions addressing these conformance goals by assigning dedicated projects with dedicated budgets (which means more work for solutions architects, lucky me). One of the aspects of this conformance is reducing the risk by shortening the settlement cycles to near-real-time, as seen by initiatives such as Straight-Through Processing.

Traditional architectures, new challenges

Conforming to the new regulations mandates an entirely different approach to risk analysis. This means that the old systems, which relied on overnight batch risk calculations and predefined queries, can no longer suffice, and a more real time approach to risk calculation, with on-the-fly queries, is needed.

From a solution architecture point of view, Risk Analysis is a compute-intensive and a data-intensive process. Looking at our customers’ systems, we see ever-increasing volumes (number of calculated positions and assets, number of re-calculations, data affinity, etc.) and on the other hand we see an ever-increasing demand to reduce the response time, to conform with the regulations or for competitive edge. That makes it a classic Big Data analytics problem.

From a technology point of view, risk analysis solutions traditionally relied on designated compute grid products for the calculations and on relational databases as the data store. That was fine for overnight batch processing, but with the introduction of the new real-time demands databases tend to become bottlenecks under the load, due to the disk and network resources.

Risk Analysis solution architecture revisited

Our experience with such solutions shows that the effective architecture to meet these challenges is a Big Data multi-tiered architecture, in which intraday data is cached in-memory for low-latency response, while historical data is kept in a database for more extensive data mining and reporting. Simple caching solutions cannot provide the scalability of the intraday data under such write-intensive flows (streaming market data, calculation results, and such), and it is therefore an In-Memory Data Grid that has become the standard technology in modern solutions for storing intraday data. Intelligent data grids such as GigaSpaces XAP also provide on-the-fly SQL querying capabilities, which overcome the limitation of predefined queries in traditional architectures.  As for historical data, we see a clear shift from relational databases to NoSQL databases, which perform much better for mining these volumes of semi-structured data.

A piece of the architecture that is often overlooked on initial architecture discussions is the system orchestration. Surprisingly, many of the customers I visit tend to think of risk analysis solutions as the mere sum of a Compute Grid product (for computation scalability) and a Data Grid product (for data scalability). But they neglect to consider the orchestration logic to handle the intersection between the data grid and the compute grid, taking care to avoid duplicate calculations, handling cancellation of calculations, monitoring the state of ongoing calculations, feeding ticks and updates to the client UI, end more. All this amounts to a significant orchestration layer that is traditionally developed in-house.

A much more effective architecture is to embed the orchestration logic together with the data grid within one platform, thereby abstracting the complexities from the clients and removing the need of the clients to interact with anything but the unified platform. GigaSpaces XAP offers the co-location of processing and messaging together with the data, which makes implementing such architectures quite easy. This also enables pre-/post-processing on the data, such as data formatting prior to processing, and result aggregation after calculations, which are requirements often seen in such solutions.

Event-Driven Architecture is highly useful for streaming calculation results to the awaiting clients as they arrive and streaming ticks and other updates to the UI. Using GigaSpaces XAP the implementation of such architecture is made simple by leveraging on the Asynchronous API and on the messaging layer which can treat each data mutation as an event.

To address the real time analytics challenge on the end-to-end Big Data architecture, across both the intraday data (which resides in-memory within the data grid) and the historical data (which resides within a relational/NoSQL database), requires a holistic view of the multi-tier architecture. Intraday data is changed at an extremely high rate with frequent event feeds, whereas historical data can be written in a more relaxed manner, using a write-behind (write-back) caching architectural approach, and consolidating queries across the data stores, making it seem as one unified source for query purposes. Such consolidation is traditionally achieved by combining the various products, but GigaSpaces offers a Real-Time Analytics solution, enabling you to focus on your business logic and leave the rest to the platform.

Future directions

There’s more to discuss in such architectures, such as multi-site deployments over WAN, support for cloud bursting, and more, which should be considered when approaching such solutions. I will not get into these concerns on this post, but you can see coverage of future directions on my webinar.

To get more information on the domain and its challenges, and to hear more on the suggested architecture for Big Data risk analysis solutions I’d recommend watching the full webinar.


Posted in architecture, Big Data, Financial Services, GigaSpaces, Market Analytics, Real Time Analytics, Risk Management, Risk Managment, Scalability, syndicated | Tagged , | Leave a comment

Moving away from Mainframe to Commodity – How?

Moving away from Mainframe to Commodity – How?

Mainframe (Z/OS) based systems running COBOL programs are legacy systems in many organizations. These are planned to be replaced with low cost commodity servers running Java or .Net based systems, saving the cost of the expensive mainframe MIPS and COBOL-based development.

Using GigaSpaces XAP can simplify the migration effort from mainframe based systems and reduce the cost of the legacy applications. In addition, having GigaSpaces XAP act as a front-end layer for mainframe based systems may boost the system performance and improve the overall system response time on peak load.

GigaSpaces' ability to deploy, manage and scale services along with the data (that can be partitioned and replicated across multiple commodity machines) will enable your mainframe applications to access GigaSpaces XAP's In-Memory Data Grid (IMDG) with minimal re-factoring of existing application code without having to introduce additional third party products, dramatically reducing implementation times and minimizing incremental costs in software licenses and hardware.

GigaSpaces Intelligent Mainframe Front-end Architecture

GigaSpaces XAP provides an extremely flexible persistence layer (known as the mirror service) that enables transparent communication between the GigaSpaces IMDG and virtually any type of back-end application or database system.

When used with a database, the Mirror service is one of the primary reasons allowing GigaSpaces XAP to overcome database locking issues experienced on peak load periods. The Mirror service offloads the database access, since the IMDG operates as the primary interface to the application data while handling persistence as a back-end durable ordered activity, delegating in-memory transactions to the database running on the mainframe.

mainframeIntegration

Any access to the data done primarily from the IMDG using one of the standard interfaces GigaSpaces XAP supports (POJO/Spring, JPA, JDBC, key/value, or Document APIs). If the desired data item cannot be found within the IMDG, it will be accessed through the database running on the mainframe, retrieving the relevant data item, loading it into the IMDG to be reused for subsequent transactions and passing it back to the client application. This approach saves the need for accessing the mainframe on every application data access by using an in-memory layer that may scale on demand.

Controlled, Reliable, and Optimized Mainframe Access

XAP's Mirror service has a central coordinator for all back-end store updates, enabling you to batch data and persist in-memory transactions via a continuous background access to the mainframe where the frequency of access is pre-configured. This allows the system to minimize the number of mainframe connections and interactions reducing MIPS consumption while controlling the data consistency level and synchronization between the in-memory representation of the data and the copy on the mainframe.

Many mainframe-based applications that perform nightly batch jobs drive a large number of data updates being made to back-end stores. In this context, GigaSpaces' inherent ability to maintain transactional integrity is critical. In-Memory transactions can be fully committed preserved in multiple different physical locations using GigaSpaces' high-availability mechanism, and ultimately persisted to the database with zero risk of the mainframe and GigaSpaces being out of sync for a long duration.

For more details see the Mainframe Integration Best Practice.

 
Shay Hassidim

Facebook Twitter Linkedin Reddit Buzz Email
Posted in Application Architecture, Caching, Cloud, Data Grid, Development, GigaSpaces, sba, SOA, space-based architecture, Spring Framework | Leave a comment