
Terabyte Elastic Cache clusters on Cisco UCS and Amazon EC2

Overview

Last week I was working on a new opportunity. The prospect needed to store 1 terabyte of data in memory to address scalability challenges and was interested in using GigaSpaces. I was tasked with creating a demonstration, and I want to share the experience in this blog post.

Scope

The project also needed to demonstrate high availability and full lifecycle aspects of any data grid used for a mission-critical application. This application serves millions of simultaneous users worldwide. The most important aspects were loading the cache with data, read throughput (max reads/second), write throughput (max writes/second) and automatic failover.

The first issue I had to address was hardware. I had only about a week to deliver the demo, and I needed machines with over 1 terabyte of available memory. We have some lab machines that I could cobble together to come up with a terabyte of memory, but these are used for internal tests by our R&D and QA teams, and getting them exclusively for the demo might not have been an option. An Amazon EC2 cluster was the next option that seemed viable. We also reached out to our contacts at Cisco to ask for help.

The first requirement of the demo was therefore the ability to run in any deployment environment, since where it would run was still an open question. This includes being able to develop on a laptop and then deploy to the actual demo hardware without any code changes.

Cloudify

Luckily, Cloudify addresses exactly this deployment-portability issue. It makes an application agnostic to the underlying deployment environment: the same application can be deployed to a cloud infrastructure, a non-cloud infrastructure or even a local machine without any code changes. Deploying, managing and monitoring an application in any of these environments becomes very easy, and the look and feel of the management and monitoring tools remains consistent.

Hardware

EC2

For the EC2 version, I used the stock Amazon Linux image on the largest configuration available, which is the High-Memory Quadruple Extra Large Instance. Since these instances come with 68 GB of memory, we needed 16 machines for a 1 Terabyte cluster. Our prospect also wanted to see hot backup and automatic failover, so that required another 16 machines, for 2 Terabytes in total. Running these 32 machine instances on EC2 costs $64 per hour, which is pretty cheap for a demo or development environment.
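The cluster sizing and cost work out as follows (a back-of-the-envelope check; the $2.00/hour on-demand price per m2.4xlarge instance is inferred from the $64/hour total quoted above):

```python
# Back-of-the-envelope sizing for the EC2 cluster described above.
# Assumes $2.00/hour on-demand for m2.4xlarge (inferred from the $64/hour total).
MEM_PER_INSTANCE_GB = 68
PRICE_PER_INSTANCE_HOUR = 2.00

primaries = 16                      # 16 x 68 GB ~ 1 TB of primary capacity
backups = 16                        # the same again for hot backup
instances = primaries + backups

total_memory_tb = instances * MEM_PER_INSTANCE_GB / 1024
hourly_cost = instances * PRICE_PER_INSTANCE_HOUR
print(instances, total_memory_tb, hourly_cost)  # 32 2.125 64.0
```

So 32 instances give just over 2 TB of raw memory, enough for 1 TB of primary data plus its hot backup.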

Cisco UCS

Cisco was kind enough to lend us a C260 UCS Server (referred to as UCS below) for our tests. The UCS offers large memory capacity and 40 ultra-fast cores, so it is an ideal platform for memory-bound and CPU-intensive applications. These are exactly the conditions the demo would be subject to, so this server was ideal hardware for it. Since a UCS machine has 1 terabyte of memory and retails for about $40,000, it is also very attractive for production environments.

Cloudify Recipe

An application deployed using Cloudify has to define Application and Service Recipes, which are Groovy scripts. The Application Recipe describes the application's components: its services and their dependencies. The Service Recipe describes service characteristics, such as the number of service instances, lifecycle events, the scripts that handle those events, monitoring configuration, scaling rules and custom alert rules. More information on recipes and how Cloudify uses them is described here.

The Application Recipe (datagrid-application.groovy) consists of a single service:

application {
    name = "datagrid-application"
    service {
        name = "datagrid-space"
    }
}

This service is an Elastic Cache Service (Data Grid), and the following Service Recipe describes the EC2 version of it:

service {
    icon "icon.png"
    name "datagrid-space"
    statefulProcessingUnit {
        binaries "datagrid-space" // can be a folder, a jar or a war file
        sla {
            memoryCapacity 2048000
            maxMemoryCapacity 2048000
            highlyAvailable true
            memoryCapacityPerContainer 64000
        }
    }
}

For the UCS version, the application code remains the same and only the memory capacity settings are adjusted to the available memory on the machine (1000000 MB).
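For illustration, the UCS version's SLA block might look like this (a sketch adapted from the EC2 recipe above; keeping the same 64000 MB per-container sizing on the UCS box is an assumption):

```groovy
sla {
    memoryCapacity 1000000            // ~1 TB available on the C260
    maxMemoryCapacity 1000000
    highlyAvailable true
    memoryCapacityPerContainer 64000  // assumed: same container sizing as the EC2 recipe
}
```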

Cloud Driver

The cloud driver acts as the specification for the new machines that Cloudify provisions. Cloudify spins up new machines when an application is deployed, when it scales out, or when a machine fails. The cloud driver for EC2 is configured as follows:

cloud {
    provider "aws-ec2"
    user "YOUR_EC2_ACCESS_KEY_ID"
    apiKey "YOUR_EC2_SECRET_ACCESS_KEY_ID"

    // relative path to the gigaspaces directory
    localDirectory "tools/cli/plugins/esc/ec2/upload"
    remoteDirectory "/home/ec2-user/gs-files"

    imageId "us-east-1/ami-1b814f72"
    machineMemoryMB "68100"
    hardwareId "m2.4xlarge"

    // security group with the appropriate ports open for incoming and outgoing traffic
    securityGroup "default"

    // YOUR keypair file and keypair name
    keyFile "cloud-demo.pem"
    keyPair "cloud-demo"

    // S3 URL where the GigaSpaces distribution is stored; make this location readable by everyone
    cloudifyUrl "https://s3.amazonaws.com/cloudify/gigaspaces.zip"

    machineNamePrefix "gs_esm_gsa_"
    dedicatedManagementMachines true
    managementOnlyFiles ([])
    connectedToPrivateIp false
    sshLoggingLevel java.util.logging.Level.WARNING
    managementGroup "management_machine"
    numberOfManagementMachines 2
    zones (["agent"])
    reservedMemoryCapacityPerMachineInMB 1024
}

No cloud driver was needed for the UCS version because it ran on existing machines and no hardware provisioning was needed.

Deployment and monitoring

Starting the Cloudify infrastructure is straightforward: log in to the Cloudify shell and run either “bootstrap-localcloud” to start a local cloud or “bootstrap-cloud ec2” to start an EC2 cloud. This starts the GigaSpaces management infrastructure, which includes a GSA, a GSM, a LUS, the web UI and a REST service.

For EC2, this bootstrap process takes about 2-3 minutes. This includes time to provision the new machines on EC2, copy GigaSpaces software and start the processes listed above.

For UCS the bootstrapping was much faster and took less than 1 minute.

Once the management infrastructure is ready, the application can be deployed. This is done using the “install-application” command with an argument for the location of the Application Recipe folder. It’s the same command for both EC2 and UCS versions of the demo.
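Put together, the whole EC2 flow from the Cloudify shell looks roughly like this (a sketch; the bootstrap-cloud and install-application commands are as described above, while the shell launcher name and recipe folder path are assumptions):

```
$ cloudify.sh                          # launch the Cloudify shell (launcher name assumed)
cloudify@default> bootstrap-cloud ec2
  ... 2-3 minutes: management machines provisioned, GSA/GSM/LUS/web-ui/REST started ...
cloudify@default> install-application datagrid-application
  ... ~10 minutes on EC2: 32 machines provisioned, GSCs started, space deployed ...
```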

EC2 deployment took about 10 minutes. In this time, 32 new machines were provisioned on EC2, the GigaSpaces software was copied over, GigaSpaces agent processes were started, GSCs were started and the application was deployed across all the machines.

Voilà, a 2TB cluster was up and running!

Deployment of the UCS version was just as easy, and the cluster was up and running even faster, as no machine provisioning was involved.

Performance

Another objective of the demo was to show performance numbers that meet the application's requirements. This turned out to be the easy part.

We were easily able to demonstrate the application's peak load requirements and show that either cluster can keep up with its expected load.

In both environments (Cisco UCS and EC2) we had very good results. The initial load task on EC2 managed to load 500,000 objects per second (1 KB object size). During the initial load, all the machines consumed 99% of their CPU capacity. Based on the throughput we saw when loading 320 million objects, we project that 1 billion objects could be loaded into the cluster in around 36 minutes (assuming the cluster had enough memory to hold them, as pointed out by Michal Frajt in the comments below). Objects were loaded into both the primary and its backup, which ran in a different VM (and, on EC2, on a different machine). The data was synthetic account data generated during the load (not loaded from a database).
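The load projection is simple arithmetic (a sanity check at the observed steady-state rate; the "around 36 minutes" figure above presumably allows some margin for ramp-up overhead):

```python
# Sanity-check the initial-load projection quoted above.
LOAD_RATE = 500_000            # objects/second observed during the initial load

loaded = 320_000_000           # objects actually loaded in the test
projected = 1_000_000_000      # the projected 1 billion objects

test_minutes = loaded / LOAD_RATE / 60          # actual test duration at that rate
projected_minutes = projected / LOAD_RATE / 60  # steady-state time for 1 billion
print(round(test_minutes, 1), round(projected_minutes, 1))  # 10.7 33.3
```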

In the read test we had a single client with 50 threads performing read operations by random key. The data grid handled 10,000 reads per second when the client slept for 4 milliseconds after each read, and over 100,000 reads per second without any sleep (2,000 reads per second per client thread). During the read tests we also ran a writer client that created new objects at a rate of 2,000 objects per second. Grid nodes on EC2 consumed about 5% CPU during this test. From this we can project a grid capacity of about 2 million reads per second from remote clients.
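The 2-million-reads projection follows from the CPU headroom (an extrapolation, assuming read throughput scales roughly linearly with grid CPU until saturation):

```python
# Extrapolate grid read capacity from measured client throughput and grid CPU usage.
# Assumption: read throughput scales roughly linearly with grid CPU until saturation.
measured_reads_per_sec = 100_000    # one 50-thread client, no sleep between reads
grid_cpu_percent = 5                # ~5% CPU on the grid nodes during this test

headroom = 100 // grid_cpu_percent  # ~20x before the grid CPUs saturate
projected_capacity = measured_reads_per_sec * headroom
print(projected_capacity)  # 2000000
```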

We also tested reads using a local cache (aka near cache). With the local cache enabled, the test client read data at a rate of 5 million objects per second (with a local cache size of 1 million objects). Because the client caches recently read data in its local JVM, it avoids remote calls, improving performance dramatically. During the local cache tests, the client machine consumed 80% CPU, as the data was served from the local cache.
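The near-cache effect can be illustrated with a toy wrapper (illustrative only, not the GigaSpaces local-cache API): reads of hot keys are answered from an in-process map and never pay the remote-call cost.

```python
from collections import OrderedDict

class NearCache:
    """Toy read-through local cache in front of a remote data grid (illustrative)."""

    def __init__(self, remote_read, capacity):
        self.remote_read = remote_read      # function that fetches a key from the grid
        self.capacity = capacity
        self.cache = OrderedDict()
        self.hits = self.misses = 0

    def read(self, key):
        if key in self.cache:
            self.hits += 1
            self.cache.move_to_end(key)     # LRU: keep recently read keys fresh
            return self.cache[key]
        self.misses += 1
        value = self.remote_read(key)       # remote call only on a miss
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used entry
        return value

# A client repeatedly reading a hot subset of keys is served almost entirely locally.
grid = {k: f"account-{k}" for k in range(10_000)}   # stand-in for the remote space
cache = NearCache(grid.__getitem__, capacity=1_000)
for _ in range(100):
    for key in range(1_000):                # hot subset fits entirely in the cache
        cache.read(key)
print(cache.misses, cache.hits)  # 1000 99000 -- only the first pass goes remote
```

This is why, as the comments below discuss, a near cache only helps when clients read a subset of the data small enough to fit in it.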

Failover

GigaSpaces and Cloudify together make automatic failover in cloud environments a reality. GigaSpaces detects machine failures and automatically provisions new machines to meet the application's SLA (which can be expressed as available memory or CPU cores).

To demonstrate this, we simulated a machine failure using the AWS console (“Terminate” function) and then watched as the application automatically recovered from this event by spinning up a new machine. This all occurred transparently and with no performance impact to the clients.

Conclusion

As applications have to manage and manipulate more data (thanks to Big Data and the analytics that can be unearthed from larger and larger datasets), in-memory access greatly helps to speed things up. With GigaSpaces you can manage terabytes of data across any number of machines, on any platform.

For applications that face heavy, constant loads, a Cisco UCS Server infrastructure is a perfect fit.

For applications that only work with such large data sets occasionally, an Amazon EC2 infrastructure (or another cloud provider such as Rackspace) is a really good option.

You can download the datagrid recipes I used from our GitHub repository.

Please contact me using comments if you are looking for the demo source code.

Updated 1/5/12 – Performance section was updated for clarity.

Updated 2/6/12 – Cloudify Recipes moved to github repository.

 
Comments
  • Michal Frajt

    Hi Sean,

    You were using 16 EC2 machines (partitions) with 68GB memory each. The total memory given to your space was 1088GB ~ 1TB. Could you please explain how do you manage to store 1TB of application data with the 24GB overhead only? Just your test with the 1 billion objects with the 1k size is filling complete memory.

    To help you: your EntryCacheInfo, EntryHolder, EntryData and ConcurrentHashMap entry have around 256B overhead for each cached object (using 64GB heap space without compressed oops). The Java VM you are using has another enormous overhead. Plus I don't think you were able to run the Java heap with the 99% occupied. The CMS cycle run could be around 20 minutes while you had to keep another extra free perm heap space for the new written objects (2000/sec) promoted by the newgen. All together minimum 50% overhead.
    Thanks,
    Michal

    • sean (http://www.gigaspaces.com)

      Michal,
      Good catch. Billion objects loading in 36 minutes is a projected number. I updated the blog to clarify this.

      Thanks.
      Sean


  • Michal Frajt

    Hi Sean,
    Your test client description says that you are reading random objects out of the 380 million objects in the partitioned space. The test case with the local cache (configured to cover 1 million objects) is presented with the 5 million reads per second. The local cache size is able to cover 0.3% of all space objects (1m / 380m). It means that only 0.3% of random reads can be answered from the local cache; all other reads are remote reads. The theoretical maximum improvement is then not more than 0.3%, but your local cache test result is 50 times faster.
    Could you please explain how the local cache improves the test client to read 5 million random objects per second?

    Thanks
    Michal

    • sean (http://www.gigaspaces.com)

      Local cache should not be used if you are going to read the entire data set of 320 million objects. It is recommended only if your client is expected to read a subset of data. In local cache test above, client was reading randomly between 1 million objects.

  • Michal Frajt

    Hi Sean,
    Would you agree with this – the GigaSpaces data grid technology has 220% memory overhead? Or is there something I misinterpreted?
    The requirement was to store 1TB application data. The 1TB GigaSpaces data grid can handle 320 million 1K application objects. The necessary data grid memory to handle 1 billion objects (1TB) is then 220% x 1TB (overhead) + 1TB = 3.2 TB to be given to the GigaSpaces data grid.
    Would you be able to provide a memory overhead comparison with other data storage/grid technologies? Is the 220% memory overhead a good or bad result?
    Thanks
    Michal


© GigaSpaces on Application Scalability | Open Source PaaS and More