Multi-tenancy: does it have to be that hard?

Posted 15 March 2010 @ 4:30 pm by Nati Shalom

Multi-tenancy is a term that is used often in the context of SaaS applications. In general, multi-tenancy refers to the ability to run multiple users of an application on a shared infrastructure. The main motivation for doing this is efficiency, or in other words -- reducing the cost per user in comparison to a dedicated system where each user has their own dedicated environment.

image

Over the past few months I've had discussions with various SaaS providers, in which I tried to learn about the main challenges that SaaS providers face when they deal with multi-tenancy. As it happens, I came across an interesting article on this very topic, Why Multi-tenancy Matters in the Cloud by Alok Misra. The article later sparked an interesting discussion thread on the cloud mailing list, which I also find to be a good reference for diverse thoughts on this subject.

In this post I want to summarize some of my findings on multi-tenancy, as well as suggest a model that can make multi-tenancy significantly simpler.

The common challenge: Multi-tenancy at the data layer

There are many approaches and levels of multi-tenancy, depending on the application layer and type of application you're dealing with. For example, trying to separate user accounts in a CRM application is quite different than separating various batch processing jobs in a risk management application. That said, what seems a common challenge in almost every SaaS application is multi-tenancy at the data layer. To put it simply, the challenge is how to share data resources (database in most cases) among multiple users, while at the same time ensuring data isolation between those users, as if they are running on completely physically seperated servers.

In the remaining part of this post I'll address how multi-tenancy is currently being dealt with at the data layer, and how the existing challenges can be met.

How is multi-tenancy being done today?

For this part of the story, I think it might be useful to first look at the two extreme approaches: The fine-grained approach and the cross-grained approach.

Fine-grain multi-tenancy (Salesforce.com model)

For many SaaS providers the “poster child” of multi-tenancy is Salesforce.com. There is a very useful videocast explaining the Salesforce approach to multi-tenancy. In a nutshell, their approach is to share the same database instance between different users, and define the data model so that each table has a primary key identifying the customer ID. In addition, there is a separate metadata table that translates every query into a user-specific query before the query hits the database.

  • Pros:
    • Very fine-grained multi-tenancy resulting in optimum cost efficiency per user.
  • Cons
    • Complexity: Requires changes in the data model.
    • Doesn’t fit to existing application: Existing applications require a complete rewrite.
    • Modest degree of isolation: Users sharing the same resource can impact one another under heavy load situations. A maintenance/patch release required for a particular user might affect other users. In a more extreme scenario, there is also exposure to a higher degree of vulnerability. For example, a bug in the query mapping could point one user to data belonging to other users just by picking up the wrong customer ID.

image

Cross-grain multi-tenancy (hosted service model)

Most hosting providers (including hosted IT providers) enable outsourcing of certain IT applications. Efficiency is gained by sharing maintenance overhead, labor costs, and other data center costs such as network, electricity, and more.

  • Pros:
    • No code changes are required, resulting in greater simplicity.
    • High degree of isolation.
  • Cons
    • Lower efficiency and therefore limited cost margins.

image

Challenges with today's approach to multi-tenancy

The challenge with the fine-grained model is that it is fairly complex to maintain and implement. If you have an existing application, it requires a complete rewrite and also forces fairly significant changes in your existing data model. The challenge with the cross-grained model is that cost margins per user are still rather high. In a competitive environment, even a single competitor offering similar servicse to yours using the fine-grained approach could easily put you out of business.

The elasticity challenge

The other challenge that neither approaches addresses today is dynamic elasticity. What happens when a customer grows beyond their allocated shared capacity? Both approaches seem to rely on certain capacity assumptions, and can scale by splitting multiple users between their shared resources. However, none of the approaches seem to handle a situation where a particular user's capacity grows beyond the allocated resources.

Jayarama Shenoy provides a good introduction to that challenge in his comment in the discussion thread:

So, how does elasticity work in a multi-tenancy case? Say 100 tenants share an app on a server (by which I would think they go all the way down to a shared database). Then,say, five of them grow quite rapidly such that they need to be moved off that server, and the other 95 can continue to exist on the original server.
So 'something' in the application has to know how to pick off the records that belong to these five, copy it off into a second instance of that database and kill, move & restore application service for those tenants. (Of course, without being privy to the data contained in those records). And of course there's an implied load balancer element in front of all this to keep up appearances to the clients.

Based on this analysis I believe that the main challenge with the current approaches is that we have too few options available. No option seems to provide a good solution to the elasticity challenge, and this is what is making multi-tenancy “so damn hard”. This is especially true for existing businesses with existing products.

The solution: Making multitenancy simple

Learning from the past operating system experience

Some of us probably remember the days when Bill Gates came up with DOS, which was basically a single-user system. At the time, if you wanted to provide a multi-user application you had to write it yourself, and sometimes that required going down to the core interrupt level and program in assembly language. This made multi-user systems on DOS very uncommon. Unix, on the hand, was built for multitasking, multi-user environments, and therefore it was quite common to build multi-user systems in the Unix environment. Later on, when Windows NT came out, all that changed. Today we can’t even imagine a scenario where we would build an application that wouldn’t support multiple users. 

Lesson learned:

As long as the underlying infrastructure (the operating system in the example, and the database in our specific discussion) is not built for multi-tenancy, trying to solve the multi-tenancy challenge at the application level is doomed to be as complex as writing a multi-user system on DOS. 

In fact, once multi-tenancy was dealt with properly at the operating system level, writing a multi-tenant application became significantly simpler and therefore widely adopted. The same should apply to databases as well.

The database is the problem!

Almost all of the attempts I've seen in both the Salesforce approach and the hosted environment approach were built under the assumption that the database is a fixed “concrete” construct that cannnot be broken down easily into “smaller bricks”, let alone scale dynamically. It is therefore not surprising that most of the attempts to deal with multi-tenancy have tried to address the challenge outside the database.

Now, imagine what your approach to the challenge would be if a database could be easily broken down into small “bricks” without huge overhead, and if it could scale dynamically?

Multi-tenant data service

Similar to the experience we had with operating system, I believe that the solution for multi-tenancy has to be dealt with at the infrastructure level. In our specific discussion, that would be the database. In this section I’d like to draw the principles for a multi-tenant data service:

  • Simplicity: With a multi-tenant database, applications can be written to a single database tenant just as they are today. How tenants are allocated and shared with other tenant needs would be completely abstracted from the application.
  • Efficiency: Efficiency is maintained by the fact that all data tenants can share the same hardware (memory, CPU) with other tenants.
  • Dynamic scaling: Each database tenant can be distributed and scale dynamically across multiple machines, if necessary, to meet demand. This includes moving data to other machines if necessary. Scaling must be supported seamlessly without any downtime.
  • Support for multiple isolation/sharing levels: With a multi-tenant database, the level of sharing/isolation is set on demand, based on the specific requirements. At that level, a user can choose a dedicated tenant, in which case the multi-tenant database would allocate tenants on a set of machines that is dedicated to that specific user. Clearly, the benefit of this is isolation, at the cost of efficiency. Another type of sharing is available to the user at the same time: Choosing to share with any individual or group under the user's control. In other words, rather then dictating either extreme (full isolation or full sharing) for all users, with a multi-tenant database we can dynamically choose the right level that best fits our purpose.

image

Where we are today?

Cloud providers such as Amazon, Microsoft, and Google have already realized that they can’t expect their users to deal with multi-tenancy themselves. So they've come up with data services that were designed with multi-tenancy in mind. Amazon recently introduced Amazon RDS (Relational Database as a Service) and Microsoft introduced SQL Azure. Both are an attempt to take existing databases (MySQL in the case of Amazon and SQL Server in the case of Microsoft) and add some of the characteristics that I discussed above. For example In both cases the user works with a single tenant and is completely abstracted from the underlying resource sharing optimizations. 

And yet, both solutions are still fairy limited when it comes to providing elasticity: With the Amazon solution, you're expected to scale-up (use a bigger machine) and go through a period of downtime when you switch between one degree of scaling and another. Google AppEngine has taken a different approach. Instead of relying on an existing RDBMS and wrapping it up with multi-tenancy and dynamic scaling, they took their own NoSQL alternative (BigTable) that was already built for dynamic scaling and, through a mapping layer, provided SQL semantics on top of it. As with RDS and SQL Azure, the user is completely abstracted from the way Google manages multi-tenancy in their underlying infrastructure.

Niether Amazon, Microsoft, nor Google provide WRT flexibility to the isolation level, and more importantly, all these solutions seem to be completely tied into their own cloud data center, which means that I cannot port it to my own local environment. From a throughput efficiency perspective, SQL Azure and Google seem to trade scaling for throughput by introducing an additional layer of indirection (load balancer in the case of Azure) that makes their multi-tenant solution fit only low-end applications.

At GigaSpaces we took an approach very similar to the Google approach with our new Elastic Middleware Services, where we use our existing distributed NoSQL In-Memory Data Grid and used it as a muti-tenant data service.

That is it for similarity. Here are the differences:

  • You can easily download and install GigaSpaces in your local environment or use it in the cloud and get the full multi-tenancy and dynamic scaling experience. We provide an open SPI that enables you to plug in any resource pool, such as starting from a simple SSH call through a complete virtualization pool.
  • You can control the level of isolation, scaling, and other SLA directly through an API.
  • Our data services are designed for extreme efficiency by maximizing the utilization of memory resources. This normally translates into the ability to run at least 10x more user transactions on the same hardware, which in turn translates into huge cost savings.
  • We provide the same consistent approach for other layers of the stack such as messaging and map/reduce through the same environment.

You can learn more about the Elastic Middleware solution in my previous post.

This week (Tuseday) I will be speaking about some of these challenges at the CloudConnect conference where i will be joining the Care and Feeding of a Cloud Application panel led by Shlomo Swidler, Founder, Orchestratus, together with  Thorsten von Eicken, CTO and Founder, RightScale, Ezra Zygmuntowicz, Senior Fellow & Co-Founder, Engine Yard, Brian Lucas, Lead Architect, Web Technology, Sling Media, Thor Muller, CTO and Co-Founder, Get Satisfaction,  Also joining me is one of our SaaS customers, Abbas Valliani, Managing Director, Primatics Financial, who will be speaking about their experience building a risk management solution as a service.

Later in the week (Thursday) I will be speaking about Elastic Data on the Cloud: Hype or Reality? during TheServerSide conference in Las Vegas together with Uri Cohen, our product manager, were we are also planning to discuss and hopefully demonstrate a combination of the two approaches that I discussed earlier regarding the NoSQL approach with cloud-based RDMS. I'm hoping to get enough interesting feedback to include in my next post.

If you happen to be at any of these events, I'd be happy to hear about your experience and your view on this topic. I plan to be near the GoGrid booth through most of the CloudConnect conference.

References


Read more...

New Service Grid Admin API for .NET

Posted 10 March 2010 @ 6:46 pm by .Net Team

New to XAP.NET 7.1 is the service grid admin API. This API is vast and provides capabilities for managing the entire GigaSpaces environment in a simple fashion. It can be used for many different uses such as monitoring statistics and the state of the different components. Another common usage would be to write agents programatically that monitor the deployed applications state and can automatically scale up or down according to demand. Such agent is demonstrated in the Scaling Agent example.

The API is very straight forward, for instance this is how you can programatically start a new Grid Service Container and increase the number of instances of an already deployed processing unit:

.NET service grid admin api sample

For full details about this new powerful feature, please read the Administration and Monitoring API page.

Eitan


Read more...

The Missing Piece in the Virtualization Stack (Part 2)

Posted 25 February 2010 @ 6:40 pm by Nati Shalom

In the first part of this post, I discussed how virtualization and cloud computing, as we know it today, is only a small part of the solution for today’s IT inefficiencies. While new technologies and delivery models have made it much simpler to manage the infrastructure, this is not where our core inefficiencies lie. Virtualization principles must be extended to higher levels of the application stack, to make it easier for all of us to manage, tune and integrate applications. Otherwise we will continue to spend most of our time on things that don’t provide real value to the business – infrastructure, installation, management, tuning, and integration.

In this post, as promised, I’ll show how this missing piece can be filled in practice, by something called Elastic Middleware Services. These are middleware services that can be deployed with one API call, and are completely abstracted from the applications that use them. Elastic Middleware Services are very similar to cloud-based middleware like Amazon SimpleDB, but the major difference is that they are enterprise-grade, and can be purchased and installed in your private data center, for use by your internal enterprise applications.

To better illustrate this idea, I’ll discuss a reference implementation of Elastic Middleware Services which will shortly be released as part of GigaSpaces XAP 7.1. The discussion is intended to illustrate in concrete terms how to fill the missing piece in the virtualization stack. I’ve intentionally focused on the conceptual and fundamental elements of the GigaSpaces implementation, not on specific features and benefits, to make the discussion useful even if you are not using or considering GigaSpaces products.

By the way, I recently hosted a webinar on this topic, together with Uri Cohen, GigaSpaces’ product manager. In the webinar we presented GigaSpaces’ upcoming Elastic Middleware Services. The recorded version of this presentation is available below:


Elastic Middleware Services

The idea is to provide middleware services (messaging, data, Map/Reduce, and more), like those the ones provided by cloud providers such as Amazon and Microsoft Azure, in your own local environment. These services simply extend the basic principles of virtualization, as I outlined in the previous post, to the middleware stack:

Basic virtualization principles

  1. Break big physical resources into smaller logical units
  2. Decouple the application from the physical resources
  3. Provide an abstraction that makes all the small units look like one big unit

From a middleware perspective, this means that instead of having one big database, you have lots of small units of data services (also called partitions or shards), which are grouped together through a client side proxy that exposes all of these small units as one big unit to the client that is using them. The client proxy is also responsible for abstracting the client from the physical location of those resources. To scale, all you need to do is add more of those units. The client experiences these additional units as extra capacity added to the service.

Deploying middleware with one API call

From an API perspective, this is a much simpler approach than traditional middleware, because it introduces a higher-level abstraction to the user, as can be seen in the code snippet below:

clip_image002

With this new elastic middleware API, users define requirements in a language that is closer to their domain. For example, “I want to create a data service that will have 10GB of data and can grow up to 100GB of data, I need hot failover and I'm willing to share my deployment with other users of my own organization but not with users of other organizations.”

In response, the Elastic Middleware Services would automatically do the following:

  • Allocate the right number of partitions, based on machine availability and the available memory in each machine, to address the capacity required.
  • Automatically launch a hot backup in for each partition and ensure that it runs on separate machine than the primary partition, to address the high-availability requirement. 
  • Make sure that if another organization’s application is already running on the machine, the application won’t deploy on that machine, to address the security requirement.

Fully automated through built-in SLA

As I noted in my previous post, one of the major costs of enterprise applications over a period of three years is the cost of maintenance and operations. The main component of maintenance costs is the labor cost, closely tied to the amount of manual work you have to put into your system to make sure that it meets the application’s SLA. The cost grows proportionally to the demand for scale.

With Elastic Middleware, maintenance cost can be substantially reduced through built-in SLA and automation that covers the following aspects:

  • Scaling SLA – scale when there is a memory shortage, high CPU utilization, Garbage Collector hiccups, etc. Users can choose to scale automatically when any of these events occurs, and customize their own thresholds (e.g. scale when CPU utilization hits 90%). The Elastic Middleware is integrated with a cloud or virtualization framework, to enable it to automatically pull or release of machines as needed.
  • Failover – when a machine fails, the Elastic Middleware does one of two things:
    • Scale down – if there are available resources to meet the current workload, the system will automatically scale down and continue to service the application.
    • Rebalance – if there are not enough resources to serve the application, the Elastic Middleware calls the cloud/virtualization pool and starts a new machine. As soon as the new machine starts, it is added to the existing pool of resources, and the application is re-balanced to take advantage of the additional capacity available through the new machine. It is important to point out that if you don’t have a dynamic pool of resources available through virtualization or cloud computing, you could still start a machine manually. Once the machine is started, adding it to the application’s resources and rebalancing would happen in exactly the same way.
  • Continuous rebalancing – when new machines are added to the Elastic Middleware, it immediately detect them and re-balances the assets currently running on the new machines (if necessary), ensuring optimum utilization of all services.

Carrying out this dynamic SLA was deigned to happen while the application is running, ensuring no transaction or data loss during the process. In-flight transactions continue to be served with no noticeable hick-ups. 

Concerned of losing control?

Whenever I introduce the concept of built-in SLA and automation, I get two type of responses. One is “Wow, that’s cool!”, the other is “It sounds like I’ll lose control.” The concern about losing control is very valid, as in many cases, especially in non-cloud environments the concept of spare capacity rarely exist and therefore adding a new machine would often involve some manual intervention. In addition, when something goes wrong, other parts of the system or even other parts of the organization need to be involved, so we can’t always assume that full automation is possible.

The cruise control analogy

The approach GigaSpaces has taken is similar to the way cruise control works in our cars. You can choose to give up full control under a certain threshold, but at each point in time you can take the wheel back and resume full control. Like many of the cloud providers, we provides this type of cruise control for our services through an open API that enables management and control of every aspect of our cluster. You can query the available node, CPU, application partitions and service components. GigaSpaces Elastic Middleware was built as a layer on top of that API, which means that if the built-in functionality that comes with the Elastic Middleware doesn’t fit your needs, you can go one layer down and write your own custom behavior. For example, you can specify that in your system, scaling or failover events send an alert that will trigger a manual process for adding a new machine. In other words, you have full flexibility to choose when to turn on “cruise control” and when to take ownership.

SaaS-enabled with built-in multi-tenancy

Multi-tenancy often means that can run multiple users/customer applications on a shared resource, thereby reducing the cost of ownership per user. This is considered an area of high complexity, specifically in SaaS applications, because with today’s middleware, the application itself is responsible for mapping between users and application tenants. With Elastic Middleware Services, multi-tenancy is built into our implementation and API. Thus the burden of sharing resources is moved from the application down to the middleware. Each application works with its own dedicated middleware service (data, messaging, etc.) but that middleware service can share resources with other middleware services. In other words, instead of having one big database running on a dedicated machine and split it out between users at the application level, you can have lots of small databases spread between machines, where each application can have its own dedicated database and still have that database shared with other database instances that runs on that same machine.

This not only significantly simplifies your code, it also provides better isolation between multiple users, as well as independent life cycle management for each tenant.

There are various tradeoffs between sharing and isolation. With isolation you get better security and control of your own environment, but lower utilization and higher cost. With sharing, you can reduce cost and still achieve reasonable isolation, but not at the same level as when running on a dedicated resource. The Elastic Middleware Services make it possible to define the isolation/sharing at the application level. It does this by introducing the notion of an isolation level:

image

Currently we support three isolation levels:

  • Dedicated – guarantees a dedicated machine allocated per instance of the application.
  • Shared private – multiple instances of the application or organization share the same resources, but other departments or organizations are isolated.
  • Shared public – everyone shares everything.

 

Available both on your local network and on the cloud

To try the new service you don’t need to have a private cloud or run your application on a public cloud. You don’t even need a virtualization layer. All you need is to launch a single GigaSpaces agent (Java process) per machine (normally that process will be started automatically at machine boot).

Once you’ve done this, you can start interacting with your machines and an create the desired middleware service through a simple API call. The following snippet shows how you might use this API:

Admin admin = new AdminFactory().createAdmin();

ElasticServiceManager elasticServiceManager = admin.getElasticServiceManagers().waitForAtLeastOne();

// Start a new data-grid

ProcessingUnit pu = elasticServiceManager.deploy(new ElasticDataGridDeployment("mygrid") // give it a name

.isolationLevel(IsolationLevel.DEDICATED) //isolation level

.highlyAvailable(true) // set the high availability level

.elasticity("2GB", "6GB") // set the required capacity range

.jvmSize("512MB") // configure the size per VM

.addSla(new MemorySla(70))); // define Memory SLA

That’s it! – you just got a full cluster deployed and ready to use.

What does it mean for you?

To sum up, I here are the main benefits that Elastic Middleware Services bring to each type of user:

For developers

As a developer, you can get access to the service you want just by calling an API, without worrying about installation or cluster setup. You get access to the service from your existing platform. In other words, you don’t need to run your application on GigaSpaces XAP to use any of those services. You can also pick and choose – use only the services you want, e.g. just the data, data and messaging, etc.

For the (private) data center

From a data-center perspective, you could install Elastic Middleware Services only once in your data-center. The same installation and resources would be shared amongst all users in your organization. If you happen to have virtualization in place, you can do this very easily using the GigaSpaces virtual appliances. 

Once you install the system, other users in the organization don’t need to go through the installation process – they can start consuming the middleware services just by calling an API, like they would do on the Amazon cloud when they launch a SimpleDB instance.

In a way, the Elastic Middleware Services gives the data center the power to provide higher-lever services to the business, rather than just plain infrastructure services, enabling the organization to become more agile.

For public cloud providers

Public cloud providers can be viewed as an outsourced version of the local data center, so they should experience much the same benefits as data centers. In addition, with Elastic Middlware Services, public cloud providers can offer a set of middleware services that are enterprise-ready with extremely high utilization per service. Because the Elastic Middleware Services are fully memory-based, you can use as much as 10X less machines to achieve the same throughput requirements, compared to a disk-based approach.

Ricky Ho published some interesting performance characteristics of Amazon SQS and SimpleDB in one of his recent posts:

  • Network latency and throughput: 20 - 100 ms for SQS access, SimpleDB domain write throughput is 30 - 40 items/sec.
  • Eventual consistency: 2 simultaneous requests to dequeue from SQS can both get the same message. SQS sometimes reports empty when there are still messages in the queue

While both SQS and SimpleDB provides a scalable data store and messaging services, their performance are at least 10-50X slower than an equivalent GigaSpaces data and messaging service. Implementing a Map/Reduce scenario with GigaSpaces is going to be significantly simpler and closer to real-time compared to Amazon Elastic Map Reduce (and that’s a topic for a separate post). Another cool feature is the ability to execute your code within the service container, which gives you an even bigger improvement in performance and latency.

The Elastic Middleware Services have much higher utilization and simpler maintenance, compared to equivalent services offered separately to the end-user, because they are provided using a shared cluster with a unified clustering model. With Elastic Middleware Services you get JPA, memcached, key/value (NoSQL), Spring, JMS, Remoting and Map/Reduce in Java or .Net, in a single deployment.

Existing GigaSpaces XAP users

Existing GigaSpaces XAP users can use the Elastic Middleware Services as part of XAP.  They’ll benefit from the simplicity provided through the high-level abstraction. In addition, the Elastic Middleware Services make it easier to plug-in the GigaSpaces components into existing application servers or development environments. They also make it significantly simpler to manage and deploy GigaSpaces XAP in various groups in the organization, because they can all share the same virtual pool of machines in the data center.

Availability

The Elastic Middleware Services will be available as part of our upcoming XAP 7.1 release, due for end of March 2010. It is currently available for private beta testing. We will also release a public beta of the Elastic Middleware Services through our upcoming 7.1 release candidate version. As always, we’ll welcome feedback and will be very happy to hear about your specific requirements. You could either post comment to this post or send an email to <pm at gigaspaces dot com>.


Read more...

Custom Matching-Two Dimensional Cartesian space Comparison using GigaSpaces

Posted 22 February 2010 @ 2:31 am by Shay Hassidim

Usually you index and execute queries using primitive fields (long, float, string, etc). The fields may be within the root level of the space object, or embedded within nested objects within the space object. You may construct a query using a template object or SQL to specify the criteria you would like to use when the matching phase is performed within the space when looking for the relevant objects.

In some cases you might want to use a custom data type with a custom business logic to find matching objects within the space, rather the usual primitive data type comparison. To allow the space to invoke your business logic when the matching process is conducted, the Comparable interface should be implemented for a class that stores the data you would like to use with your custom business logic.

Such custom business logic might be useful when comparing vector data (2 dimensional Cartesian space). These may represent sound, maps, pictures or any other 2 or 3 dimensional artifacts. You may use this technique to query data based on any other mathematical or financial related formulas such as Time value of money like Present Value of a Cash Flow Series, Future Value of a Cash Flow Series, etc. Other areas where such custom matching is relevant, are Pattern recognition, Sequence analysis, Surveillance, Forensic, Social network behavior etc.

The Best Practices Custom Matching page includes a detailed example how you may use this technique to locate matching vectors stored within the space using the Euclidean distance formula:

EuclideanDistance

Here is an example of a target vector, and a matching vector found using a custom matching implementation:

custommatching

Such a custom matching implementation can locate a matching vector within few milisec where the space was storing 100,000 vectors.

Impressive - ha?

Shay Hassidim


Read more...

Interview with Michael Di Stefano from Integrasoft on their CEP Cloud Services using Esper GigaSpaces

Posted 13 February 2010 @ 12:59 pm by Nati Shalom

During the past few weeks I had the honor to have a discussion with one of our partners Integrasoft who developed a distributed Complex Event Processing engine on top of Esper a popular opensource Complex Event Processing engine and recently integrated their solution into GigaSpaces.

As many of GigaSpaces users already design their application in an event driven fashion I thought that the CEP solution proposed by Integrasoft may  be a natural fit.

Below is a summary of a discussion i had with Michael Di Stefano where he provides an overview on Integrasoft and  CEP and later explain their specific solution and how GigaSpaces users can benefit from it:

Michael can you tell us a bit about yourself and Integrasoft?

We help clients and product companies optimally match business need with the technology that best meets that need. This is the fundamental philosophy is embedded in Integrasoft, the company I founded in 1997. Integrasoft embraced Grid computing when hit Wall Street almost a decade ago, and quickly promoted the importance of data affinity and the shift of the most important resource in an IT environment from CPU to the network. Today Clouds and Complex Event Processing (sometimes called CEP, Event stream Processing, and Business Event Processing) are the leading technologies addressing business need in not just Financial Services but across industry sectors.

True to form Integrasoft has taken CEP’s “server” approach to event processing and have adopted it to a pure distributed computing architecture. CEP Service Clouds abstracts CEP engines and provides a framework to network them together forming a Cloud of CEP that can host “services” in the cloud. These services can take advantage offered by CEP -AND- the virtualization and scale of a cloud. With CEP Service Cloud one can push the processing out to the fringes of the cloud eliminating unnecessary data movement from a data source to today’s centralized CEP Servers.

What CEP stands for?

A definition of Complex Event Processing can be found on Wikipedia

Complex event processing, or CEP, is primarily an event processing concept that deals with the task of processing multiple events with the goal of identifying the meaningful events within the event cloud. CEP employs techniques such as detection of complex patterns of many events, event correlation and abstraction, event hierarchies, and relationships between events such as causality, membership, and timing, and event-driven processes. CEP is to discover information contained in the events happening across all the layers in an organization and then analyze its impact from the macro level as "complex event" and then take subsequent action plan in real time.

And like all “buzz words” CEP is now commonly applied to many products. Is CEP a pure marketing buzz or something else? What is different about CEP form other middleware infrastructures of recent years (Grid, Data Grids, distributed caches, etc) is that CEP itself is an evolution of what IT professionals have been doing on Wall Street for almost 15 years. Custom code and rules engines are the fore runners to today’s CEP products. Today’s products are general-purpose tools that allow developers to take advantage of all it has to offer without having to know the intricacies happening inside the engine. Is this to say that a rules engine or custom code is not CEP? Not at all, they well address event processing in the scope of the problems they are designed to meet.

Events or transactions occur throughout the enterprise. These events may or may not be dependent on time, may or may not be dependent on sequence, and may come from one or many systems or applications within a business group or across business groups. So in order to consume the events regardless of source, understand the event, correlate it across other events with regards to time and source, or sequence in events, one need to leverage CEP as part of the architecture.

The CEP technology has been around for years and is leveraged by many firms. With the technology evolving over the years with a family of CEP engines available through vendors and open source, the choices of which engine(s) to leverage are wide. The key things to be concerned about as this technology is being leveraged is the consumption of these events in a way that is scalable and out of the box thus creating a distributed CEP Cloud Services.

What is a typical architecture of a CEP solution?

As with most architectures, the 50,000 foot view is similar to all other event driven systems. Event sources, processing of the events, event consumers and all the integration necessary to join (either tight or loose coupling) the systems together.

  • Event source: – This would be the actual messaging provider (GigaSpaces in our specific case)
  • CEP engine – This is the heart of the CEP. Matching engine is responsible for processing incoming event against users queries. Users queries may require a history of events to fulfill a given query. Users can express their queries in an SQL like semantics. The matching engine will trigger the appropriate listener when a query conditions has been met.
  • Event consumers– The specific logic components (e.g. external systems or other middleware components such as GigaSpaces) that are be triggered when a certain condition happens.

However as you dive closer into”CEP” architecture significant differences emerge. Joining event streams together in logically similar fashion as joining tables in a database greatly simplifies implementations while reducing latency and increasing throughput. The result is smaller and simpler code base therefore a faster delivery of function. However, if not careful these benefits can quickly erode by trying to do too much inside the CEP Engine where the complexity of the streaming rules affect performance and maintainability.

What are the differences between a complex event and regular event?

Quoting CEP provider Esper:

“Regular events normally represents a concrete state, a complex event is normally an aggregation of multiple events (not necessarily of the same type) that identify a meaningful event.”

What are the difference between standard queries and continues queries?

A standard query, associated with a database operates over the tables in the database and returns a result data set. The tables stand still and the query traverses the data set (tables in the database). They run once and exit.

Continuous Queries are always “on”. However the query itself is stationary and the data (events) streams through the query, when the conditions of the query are met a resulting event is generated into the event engine for further rule processing and/or output to subscribers to that event. Visualize the query as a 2 dimensional plane and the various data streams orthogonally passing through the query plane for evaluation. The resulting events emerging on the other side of the query plane are new events resulting from the query conditions evaluating as true.

What are the difference between CEP and other MOM such as JMS?

MOM and JMS are transports moving data across a network. CEP is a process that may uses these middleware tools as a mechanism to consume and publish event streams.

What does Integrasoft add to the current CEP engine?

CEP engines have been leveraged in many business applications as siloed solutions. With a great push into the Cloud infrastructure for High Performance Computing, many firms are finding the need to run across multiple CEP engines for Services regardless of the nature of the Cloud.

Integrasoft CEP Cloud Services product offers the infrastructure required to run CEP Cloud(s) to meet the business application demands and required services via CEP engines “networked” together running transparently for virtualized services. The solution offers holistic and intelligent CEP Services within the Cloud.

The CEP Cloud Services form “Clouds of CEP” to be leveraged as needed by applications within a heterogeneous environment that is typically found in any infrastructure. Integrasoft CEP Cloud Services allows for the processing of complex events, scalable with the cloud, and transparent to the applications (see diagram below).

clip_image002



As illustrated in the diagram below, the solution offers:

Transparency – The applications need not know about the underlying architecture.

Scalability – “Sub-Clouds of CEP Services” within the cloud if needed.

Holistic & Intelligent Services – Intelligent processing of all events against defined rules. Holistic view of what is going on in the cloud.

Inter-CEP Engine communication – CEP engines framework to establish the needed communication for total transparency, scalability, and manageability.

clip_image004

What is the value for GigaSpaces users?

Leveraging the CEP engines in a public or private cloud for services comes another degree of challenges. The business applications development and deployment are concerned about:

· Underlying middleware technology

· Underlying Database technology

· Underlying business event processing technology

These challenges impact development time and cost as well as time to market. With the integration of Integrasoft CEP Cloud Services and Gigaspaces XAP, the solution offers a complete services infrastructure solution to handle all business events with:

· Integrated solution for event processing

· Virtualized CEP Cloud(s) Services

· Distributed cache for HPC

· Underlying distribution messaging

· Underlying database persistence

· Easier deployment and management

The solution offers high performance CEP + Messaging + Distributed Cache which is typically need by HPC applications. The tight binding of these three functionalities offers a new class of applications that are high event rates, data intensive, complex relations and correlations of events.

Now, the business SLAs demands can be met with a linear performance scale of a Cloud, constant latency as event rates increase, where event sources and processing of events are physically close together to maximize efficiency and performance. In summary, the solution is geared to control costs and minimize operational risks which are inherent with Cloud infrastructure.


My take

Event Driven Architecture (EDA) is becoming more popular in recent years due to the demand for greater scalability. CEP provides a way to extend the use of EDA into the way we process and access our data. Having said that most of the existing CEP solutions relies on a centralized event coordinator and could become a scalability bottleneck as a result of that. What’s interesting in the approach taken by integrasoft  is that it brings the scalability of GigaSpaces to CEP through the integration of Esper and GigaSpaces. 

I would be very interested to learn about your specific requirements in that area and work collaboratively with Integresoft to make sure that the integrated solution can address your needs.  Feel free to contact me or Michael directly on that regard or simply post a comment on this post.


Read more...

Next Page »