Terabyte Elastic Cache clusters on Cisco UCS and Amazon EC2
Last week I was working on a new opportunity. The prospect needed to store 1 Terabyte of data in memory to address scalability challenges and was interested in using GigaSpaces. I was tasked with creating a demonstration of this, and I want to share my experience in this blog post.
The project also needed to demonstrate high availability and the full lifecycle of a data grid used for a mission-critical application, one that serves millions of simultaneous users worldwide. The most important aspects were loading the cache with data, read throughput (maximum reads per second), write throughput (maximum writes per second) and automatic failover.
The first issue I had to address was hardware. I had only about a week to deliver the demo, and I needed machines with over 1 terabyte of available memory. We have some lab machines that I could cobble together to reach a terabyte of memory, but these are used for internal tests by our R&D and QA teams, and getting exclusive use of them for the demo might not be an option. An Amazon EC2 cluster was the next best viable option. We also reached out to our contacts at Cisco for help.
The very first requirement of the demo was that it run on any deployment environment, given that where it would run was still an open question. This included being able to develop on a laptop and then deploy to the actual demo hardware without any code changes.
Luckily, for the dynamic deployment issue we have Cloudify, which does exactly that: it makes any application agnostic to the underlying deployment environment. With Cloudify, the same application can be deployed to a cloud infrastructure, a non-cloud infrastructure or even a local machine without any code changes. Deploying, managing and monitoring an application on any of these environments becomes very easy, and the look and feel of the management and monitoring tools remains consistent.
For the EC2 version, I used the stock Amazon Linux image on the largest configuration available, which is the High-Memory Quadruple Extra Large Instance. Since these instances come with 68 GB of memory each, we needed 16 machines for a 1 Terabyte cluster (16 × 68 GB = 1,088 GB). Our prospect also wanted to see hot backup and automatic failover, which required another 16 machines, for 2 Terabytes in total. Running these 32 machine instances on EC2 costs $64 per hour, which is pretty cheap for a demo or development environment.
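The sizing above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not part of the demo code: the function name and the dictionary keys are hypothetical, 1 Terabyte is taken as 1,024 GB, and the $2.00 hourly rate per instance is simply derived from the $64 per hour quoted for 32 machines.

```python
import math


def cluster_plan(target_gb, memory_per_machine_gb, with_backup, hourly_rate):
    """Estimate machine count, raw memory and hourly cost for a cluster.

    Hypothetical helper for illustration only. It counts raw machine
    memory and ignores OS and JVM overhead.
    """
    # Round up: a partially used machine still has to be provisioned.
    primaries = math.ceil(target_gb / memory_per_machine_gb)
    # One hot backup machine per primary doubles the machine count.
    machines = primaries * 2 if with_backup else primaries
    return {
        "primary_machines": primaries,
        "total_machines": machines,
        "primary_memory_gb": primaries * memory_per_machine_gb,
        "total_memory_gb": machines * memory_per_machine_gb,
        "hourly_cost": machines * hourly_rate,
    }


# Assumed inputs: 1 Terabyte = 1,024 GB, 68 GB per instance, and
# $2.00 per instance-hour (derived from $64 per hour for 32 machines).
plan = cluster_plan(1024, 68, with_backup=True, hourly_rate=2.00)
print(plan)
```

Rounding up matters here: 1,024 GB divided by 68 GB is just over 15, so a sixteenth primary machine is needed.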
Cisco was kind enough to lend us a C260 UCS Server (referred to as UCS below) for our tests. The UCS offers large memory capacity and 40 ultra-fast cores, so it is an ideal platform for memory-bound and CPU-intensive applications. These were exactly the conditions the demo was subject to, which made this server the hardware we needed. Since UCS machines have 1 Terabyte of memory and retail for about $40,000, they are also very attractive for production environments.
An application deployed using Cloudify has to define Application and Service Recipes, which are Groovy scripts. The Application Recipe describes the application components, which are its services and their dependencies. The Service Recipe describes service characteristics, such as the number of service instances, lifecycle events, scripts that handle these events, monitoring configurations, scaling rules and custom alert rules. More information on Recipes and how Cloudify uses them is described here.
The Application Recipe (datagrid-application.groovy) consists of a single service:
name = "datagrid-space"
This service is an Elastic Cache Service (Data Grid), and the following Service Recipe describes the EC2 version of it:
binaries "datagrid-space" //can be a folder, jar or a war file
For the UCS version, the application code remains the same and only the memory capacity settings are adjusted to the available memory on the machine (1000000 MB).
The cloud driver acts as the specification for the new machines that Cloudify provisions. Cloudify spins up new machines when an application is deployed, when it scales out or when a machine fails. The cloud driver for EC2 is configured as follows:
// relative path to gigaspaces directory
// Security group which has the appropriate ports configured to be open for incoming and outgoing traffic
// YOUR keypair file and name of the keypair
// S3 URL location where GigaSpaces is saved. Update the access properties of this location to everyone
No cloud driver was needed for the UCS version because it ran on existing machines and no hardware provisioning was needed.
Deployment and monitoring
Starting the Cloudify infrastructure is straightforward: log in to the Cloudify shell and run either “bootstrap-localcloud” to start a local cloud or “bootstrap-cloud ec2” to start an EC2 cloud. This starts the GigaSpaces management infrastructure, which includes a GSA, GSM, LUS, web-ui and a REST service.
For EC2, this bootstrap process takes about 2-3 minutes. This includes time to provision the new machines on EC2, copy GigaSpaces software and start the processes listed above.
For UCS the bootstrapping was much faster and took less than 1 minute.
Once the management infrastructure is ready, the application can be deployed. This is done using the “install-application” command with an argument for the location of the Application Recipe folder. It’s the same command for both EC2 and UCS versions of the demo.
EC2 deployment took about 10 minutes. In this time, 32 new machines were provisioned on EC2, GigaSpaces software was copied, GigaSpaces agent processes were started, GSCs were started and the application was deployed across all the machines.
Voilà, a 2TB cluster was up and running!
Deployment of the UCS version was just as easy, and the cluster was up and running even faster as no machine provisioning is involved here.
Another objective of the demo was to show performance numbers that meet the application requirements. This was the easy part.
We were easily able to demonstrate the application's peak load requirements and show that either cluster can keep up with its expected load.
In both environments (Cisco UCS and EC2) we had very good results. The initial load task on EC2 loaded 500,000 objects per second (1k size each). During the initial load, all the machines consumed 99% of their CPU capacity. Based on the throughput we saw when loading 320 million objects, we can project that 1 Billion objects could be loaded into the cluster in around 36 minutes (provided the cluster had enough memory to hold them, as pointed out by Michal Frajt in the comments below). Objects were loaded into both the primary and the backup, which ran on a different VM (on EC2, also a different machine). The data was synthetic account data generated during the load, not loaded from a database.
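To make the load projection easier to follow, here is a minimal sketch of the arithmetic. The function names are hypothetical, and "1k" is assumed to mean 1,000 bytes of raw payload per object, with no per-object overhead counted.

```python
def minutes_to_load(object_count, objects_per_second):
    """Time in minutes to load object_count at a constant rate."""
    return object_count / objects_per_second / 60


def implied_rate(object_count, minutes):
    """Average objects per second implied by a total load time."""
    return object_count / (minutes * 60)


ONE_BILLION = 1_000_000_000
OBJECT_BYTES = 1_000  # assumption: "1k size" read as 1,000 bytes

# At the quoted rate of 500,000 objects per second.
at_quoted_rate = minutes_to_load(ONE_BILLION, 500_000)

# Average rate implied by the projected 36 minutes.
average_rate = implied_rate(ONE_BILLION, 36)

# Raw payload for the primary copies only; backups double it.
primary_payload_gb = ONE_BILLION * OBJECT_BYTES / 1e9

print(round(at_quoted_rate, 1), "minutes at the quoted rate")
print(round(average_rate), "objects per second implied by 36 minutes")
print(primary_payload_gb, "GB of raw payload before backups")
```

At a constant 500,000 objects per second, 1 Billion objects would take a little over 33 minutes, so the 36-minute projection corresponds to an average rate of roughly 463,000 objects per second. The payload figure also shows why the memory caveat matters: 1 Billion objects of this size is about 1 Terabyte before backups.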
In the read test we had a single client with 50 threads, each performing read operations based on a random key. The data grid handled 10,000 reads per second when the client slept for 4 milliseconds after each read, and over 100,000 reads per second without any sleep (2,000 reads per second per client thread). During the read tests we were also running a writer client that created new objects at a rate of 2,000 objects per second. Grid nodes on EC2 consumed about 5% CPU during this test, so we can project that the grid capacity is about 2 million reads per second from remote clients.
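The read numbers can be cross-checked with a simple model. This is a rough sketch under stated assumptions, not a measurement: each client thread issues one blocking read at a time, the per-read latency is inferred from the 2,000 reads per second per thread figure, and grid throughput is assumed to scale linearly with CPU use, which ignores network and other bottlenecks. The function names are hypothetical.

```python
def client_reads_per_second(threads, read_latency_s, sleep_s=0.0):
    """Reads per second for threads that each read, then sleep, in a loop."""
    return threads / (read_latency_s + sleep_s)


def projected_capacity(observed_reads_per_second, cpu_fraction):
    """Linear extrapolation of grid capacity from observed CPU use."""
    return observed_reads_per_second / cpu_fraction


THREADS = 50
# 2,000 reads per second per thread implies about 0.5 ms per read.
READ_LATENCY_S = 1 / 2000

no_sleep = client_reads_per_second(THREADS, READ_LATENCY_S)
with_sleep = client_reads_per_second(THREADS, READ_LATENCY_S, sleep_s=0.004)
capacity = projected_capacity(no_sleep, cpu_fraction=0.05)

print(round(no_sleep), "reads per second without sleep")
print(round(with_sleep), "reads per second with a 4 ms sleep")
print(round(capacity), "reads per second projected grid capacity")
```

The model gives about 11,000 reads per second with the 4 ms sleep, in line with the roughly 10,000 observed, and reproduces the 2 million reads per second projection from 100,000 reads per second at 5% CPU.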
We also tested reads using a local cache (also known as a near cache). With the local cache enabled, the test client read data at a rate of 5 million objects per second (with a local cache size of 1 million objects). Because the client caches recently read data in its local JVM, it avoids remote calls, which improves performance dramatically. During the local cache tests, the client machine consumed 80% CPU as the data was being served from the local cache.
GigaSpaces + Cloudify make automatic failover in cloud environments a reality. GigaSpaces detects machine failures and automatically provisions new machines to meet the application SLA (which can be defined in terms of available memory or CPU cores).
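The idea behind a near cache can be illustrated with a small, self-contained sketch. This is not the GigaSpaces local cache API: the class, the fake remote read function and the least-recently-used eviction policy are all assumptions made for illustration, and the sketch leaves out the invalidation and update handling a real near cache needs.

```python
from collections import OrderedDict


class NearCache:
    """Hypothetical bounded client-side cache in front of a remote read."""

    def __init__(self, remote_read, max_entries):
        self._remote_read = remote_read
        self._max_entries = max_entries
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, key):
        if key in self._entries:
            # Served locally: no remote call is made.
            self._entries.move_to_end(key)
            self.hits += 1
            return self._entries[key]
        # Not cached: go to the grid and remember the result.
        self.misses += 1
        value = self._remote_read(key)
        self._entries[key] = value
        if len(self._entries) > self._max_entries:
            # Evict the least recently used entry (assumed policy).
            self._entries.popitem(last=False)
        return value


remote_calls = []


def fake_remote_read(key):
    """Stand-in for a remote grid read; records each call."""
    remote_calls.append(key)
    return {"account_id": key}


cache = NearCache(fake_remote_read, max_entries=2)
for key in (1, 1, 2, 3, 1):
    cache.read(key)

print("hits:", cache.hits, "misses:", cache.misses)
print("remote calls:", remote_calls)
```

Only cache misses reach the remote side, so the more often the client reads recently used keys, the fewer remote calls it makes.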
To demonstrate this, we simulated a machine failure using the AWS console (“Terminate” function) and then watched as the application automatically recovered from this event by spinning up a new machine. This all occurred transparently and with no performance impact to the clients.
As applications have to manage and manipulate more data (thanks to Big Data and the analytics that can be unearthed out of larger and larger datasets), using in-memory access greatly helps to speed things up. Using GigaSpaces you can manage Terabytes of data across any number of machines, and on any platform.
For applications that have to work with heavy, constant loads, using a Cisco UCS Server infrastructure is a perfect fit.
For applications that only have to work with these large data sets occasionally, using an Amazon EC2 infrastructure (or any other cloud provider like RackSpace) is a really good option.
You can download the datagrid recipes I used from our github repository.
Please contact me using comments if you are looking for the demo source code.
Updated 1/5/12 – Performance section was updated to clarify.
Updated 2/6/12 – Cloudify Recipes moved to github repository.