Scale-out vs Scale-up
Posted 1 September 2010 @ 7:32 am by Nati Shalom
In my previous post, Concurrency 101, I touched on some of the key terms that often come up when dealing with multi-core concurrency.
In this post I'll cover the difference between multi-core concurrency, which is often referred to as the Scale-Up model, and distributed computing, which is often referred to as the Scale-Out model.
The Difference Between Scale-Up and Scale-Out
One of the common ways to best utilize multi-core architecture in the context of a single application is through concurrent programming. Concurrent programming on multi-core machines (scale-up) is often done through multi-threading and in-process message passing, also known as the Actor model.
Distributed programming does something similar by distributing jobs across machines over the network. There are different patterns associated with this model such as Master/Worker, Tuple Spaces, BlackBoard, and MapReduce. This type of pattern is often referred to as scale-out (distributed).
Conceptually, the two models are almost identical as in both cases we break a sequential piece of logic into smaller pieces that can be executed in parallel. Practically, however, the two models are fairly different from an implementation and performance perspective. The root of the difference is the existence (or lack) of a shared address space. In a multi-threaded scenario you can assume the existence of a shared address space, and therefore data sharing and message passing can be done simply by passing a reference. In distributed computing, the lack of a shared address space makes this type of operation significantly more complex. Once you cross the boundaries of a single process you need to deal with partial failure and consistency. Also, the fact that you can’t simply pass an object by reference makes the process of sharing, passing or updating data significantly more costly (compared with in-process reference passing), as you have to deal with passing of copies of the data which involves additional network and serialization and de-serialization overhead.
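To make the cost difference concrete, here is a minimal sketch in plain Java. The Order class and the sendOverWire helper are hypothetical names used only for illustration; Java serialization stands in for whatever wire format a real distributed system would use.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical payload type, used only for this illustration.
class Order implements Serializable {
    private static final long serialVersionUID = 1L;
    int quantity;

    Order(int quantity) {
        this.quantity = quantity;
    }
}

class ReferenceVsCopy {
    // Simulates crossing a process boundary: the receiver gets a
    // deserialized copy, never the sender's own object.
    static Order sendOverWire(Order order) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(order);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Order) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("serialization failed", e);
        }
    }

    public static void main(String[] args) {
        Order original = new Order(1);

        // Scale-up: both names point at the same object in the shared address space.
        Order sameObject = original;
        sameObject.quantity = 5;
        System.out.println(original.quantity); // prints 5

        // Scale-out: the receiver works on a copy, so updates do not propagate back.
        Order remoteCopy = sendOverWire(original);
        remoteCopy.quantity = 9;
        System.out.println(original.quantity); // still prints 5
    }
}
```

The second case is where consistency questions start: someone has to decide how the change made to the copy gets back to the original, and what happens if that step fails.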
Choosing Between Scale-Up and Scale-Out
The most obvious reason for choosing between the scale-up and scale-out approaches is scalability/performance. Scale-out allows you to combine the power of multiple machines into a virtual single machine with the combined power of all of them together. So in principle, you are not limited to the capacity of a single unit. In a scale-up scenario, however, you have a hard limit: the scale of the hardware on which you are currently running. Clearly, then, one factor in choosing between scaling out or up is whether or not you have enough resources within a single machine to meet your scalability requirements.
Reasons for Choosing Scale-Out Even If a Single Machine Meets Your Scaling/Performance Requirements
Today, with the availability of large multi-core and large memory systems, there are more cases where you might have a single machine that can cover your scalability and performance goals. And yet, there are several other factors to consider when choosing between the two options:
1. Continuous Availability/Redundancy: You should assume that failure is inevitable, and therefore having one big system is going to lead to a single point of failure. In addition, the recovery process is going to be fairly long, which could lead to extended downtime.
2. Cost/Performance Flexibility: As hardware costs and capacity tend to vary quickly over time, you want to have the flexibility to choose the optimal configuration setup at any given time or opportunity to optimize cost/performance. If your system is designed for scale-up only, then you are pretty much locked into a certain minimum price driven by the hardware that you are using. This could be even more relevant if you are an ISV or SaaS provider, where the cost margin of your application is critical to your business. In a competitive situation, the lack of flexibility could actually kill your business.
3. Continuous Upgrades: Building an application as one big unit is going to make it harder or even impossible to add or change pieces of code individually, without bringing the entire system down. In these cases it is probably better to decouple your application into concrete sets of services that can be maintained independently.
4. Geographical Distribution: There are cases where an application needs to be spread across data centers or geographical locations to handle disaster recovery scenarios or to reduce geographical latency. In these cases you are forced to distribute your application and the option of putting it in a single box doesn’t exist.
Can We Really Choose Between Scale-Up and Scale-Out?
Choosing between scale-out and scale-up based on the criteria that I outlined above sounds pretty straightforward, right? If our machine is not big enough we need to couple a few machines together to get what we're looking for, and we're done. The thing is that, with the speed at which network, CPU power and memory advance, the answer to the question of what we require at a given time could be very different from the answer a month later.
To make things even more complex, the gain between scale-up and scale-out is not linear. In other words, when we switch from scale-up to scale-out we're going to see a significant drop in what a single unit can do, as all of a sudden we have to add network overhead, transactions, and replication to operations that were previously done just by passing object references. In addition, we will probably be forced to rewrite our entire application, as the programming model is going to shift quite dramatically between the two models. All this makes it fairly difficult to answer the question of which model is best for us.
Beyond a few obvious cases, choosing between the two options is fairly hard, and maybe even almost impossible.
Which brings me to the next point: What if the process of moving between scale-up and scale-out were seamless -- not involving any changes to our code?
I often use storage as an example of this. In storage, when we switch between a single local disk to a distributed storage system, we don’t need to rewrite our application. Why can’t we make the same seamless transition for other layers of our application?
Designing for Seamless Scale-Up/Scale-Out
To get to a point of seamless transition between the two models, there are several design principles that are common to both the scale-out and scale-up approaches.
Parallelize Your Application
1. Decouple: Design your application as a decoupled set of services. “All problems in computer science can be solved by another level of indirection” is a famous quote often attributed to Butler Lampson (who himself credits David Wheeler). In this specific context: if your code sets have loose ties to one another, the code is easier to move, and you can add more resources when needed without breaking those ties. In our case, designing an application as a set of services that don’t assume the locality of other services enables us to handle a scale-up scenario by routing requests to the most available instance.
2. Partition: To parallelize an application, it is often not enough to spawn multiple threads, because at some point they are going to hit a shared point of contention. To parallelize a stateful application we need to find a way to partition our application and data model so that our parallel units share nothing with one another.
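The partitioning principle can be sketched with a small, self-contained Java example. ShareNothingSum and parallelSum are hypothetical names; the point is only that each worker owns its own slice of the data and its own accumulator, so no lock is ever contended.

```java
class ShareNothingSum {
    // Splits the input into `partitions` interleaved slices. Each worker thread
    // reads only its own slice and writes only its own slot in `partial`.
    static long parallelSum(long[] values, int partitions) {
        long[] partial = new long[partitions];
        Thread[] workers = new Thread[partitions];
        for (int p = 0; p < partitions; p++) {
            final int id = p;
            workers[p] = new Thread(() -> {
                for (int i = id; i < values.length; i += partitions) {
                    partial[id] += values[i];
                }
            });
            workers[p].start();
        }
        try {
            // join() makes each worker's writes visible to this thread.
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        // The only shared step is combining the partial results at the end.
        long total = 0;
        for (long value : partial) {
            total += value;
        }
        return total;
    }

    public static void main(String[] args) {
        long[] values = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        System.out.println(parallelSum(values, 3)); // prints 55
    }
}
```

Because the workers never touch each other's state, the same decomposition would still hold if each slice lived on a different machine; only the final combine step would have to cross the network.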
Enabling Seamless Transitions Between Remote and Local Services
First, I'd like to clarify that the pattern I outline in this section is intended to enable a seamless transition between distributed and local services. It is not intended to make the performance overhead between the two models go away.
The core principle is to decouple our services from things that assume locality of either services or data. Thus, we can switch between local and remote services without breaking the ties between them. The decoupling should happen in the following areas:
1. Decouple the communication: When a service invokes an operation on another service we can determine whether that other service is local or remote. The communication layer can be smart enough to go through more efficient communication if the service happens to be local or go through the network if the service is remote. The important thing is that our application code is not going to be changed as a result.
2. Decouple the data access: Similarly, we need to abstract our data access to our data service. A simple abstraction would be a distributed hash table, where we could use the same code to point to a local in-memory hash table or to a distributed version of that table. A more sophisticated version would be to point to an SQL data store where we could have the same SQL interface to point to an in-memory data store or to a distributed data store.
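A minimal sketch of the data-access decoupling, in plain Java. All names here (DataStore, LocalStore, RemoteStoreStub, OrderService) are hypothetical, and the "remote" store is only a stub that counts the round trips a real network client would make; it is not any product's API.

```java
import java.util.HashMap;
import java.util.Map;

// The abstraction the application codes against.
interface DataStore {
    void put(String key, String value);
    String get(String key);
}

// Local, in-memory implementation: a plain hash table.
class LocalStore implements DataStore {
    private final Map<String, String> table = new HashMap<>();

    public void put(String key, String value) {
        table.put(key, value);
    }

    public String get(String key) {
        return table.get(key);
    }
}

// Stand-in for a distributed implementation. A real one would serialize the
// request and send it over the network; this stub only counts the calls.
class RemoteStoreStub implements DataStore {
    private final Map<String, String> remoteTable = new HashMap<>();
    int roundTrips = 0;

    public void put(String key, String value) {
        roundTrips++;
        remoteTable.put(key, value);
    }

    public String get(String key) {
        roundTrips++;
        return remoteTable.get(key);
    }
}

// Application code: identical whichever store it is handed.
class OrderService {
    private final DataStore store;

    OrderService(DataStore store) {
        this.store = store;
    }

    void record(String orderId, String status) {
        store.put(orderId, status);
    }

    String status(String orderId) {
        return store.get(orderId);
    }
}
```

Switching OrderService from new LocalStore() to new RemoteStoreStub() changes the cost of each call but not a line of the service, which is the property this section is after.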
Packaging Our Services for Best Performance and Scalability
Having an abstraction layer for our services and data brings us to the point where we could use the same code whether our data happens to be local or distributed. Through decoupling, the decision about where our services should live becomes more of a deployment question, and can be changed over time without changing our code.
In the two extreme scenarios, this means that we could use the same code to do only scale-up by having all the data and services collocated, or scale-out by distributing them over the network.
In most cases, it wouldn't make sense to go to either of the extreme scenarios, but rather to combine the two. The question then becomes at what point should we package our services to run locally and at what point should we start to distribute them to achieve the scale-out model.
To illustrate, let’s consider a simple order processing scenario where we need to go through the following steps for the transaction flow:
1. Send the transaction
2. Validate and enrich the transaction data
3. Execute it
4. Propagate the result
Each transaction process belongs to a specific user. Transactions of two separate users are assumed to share nothing between them (beyond reference data which is a different topic).
In this case, the right way to assemble the application in order to achieve the optimal scale-out and scale-up ratio would be to have all the services that are needed for steps 1-4 collocated, and therefore set up for scale-up. We would scale-out simply by adding more of these units and splitting both the data and transactions between them based on user IDs. We often refer to this unit-of-scale as a processing unit.
To sum up, choosing the optimal packaging requires:
1. Packaging our services into bundles based on their runtime dependencies to reduce network chattiness and number of moving parts.
2. Scaling-out by spreading our application bundles across the set of available machines.
3. Scaling-up by running multiple threads in each bundle.
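The packaging described above can be sketched as follows. ProcessingUnit and OrderRouter are hypothetical names for illustration, the "transaction" is reduced to adding an amount to a per-user total, and the units live in one JVM here; in a real deployment each unit would be its own process or machine.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// One unit of scale: validation, execution and result handling are collocated,
// together with the slice of user data they operate on.
class ProcessingUnit {
    // Thread-safe map, so several threads can serve this unit (scale-up).
    private final Map<Long, Long> totalByUser = new ConcurrentHashMap<>();

    long process(long userId, long amount) {
        if (amount <= 0) { // step 2: validate
            throw new IllegalArgumentException("amount must be positive");
        }
        // steps 3-4: execute atomically and hand back the result
        return totalByUser.merge(userId, amount, Long::sum);
    }
}

// Scale-out: add units and split users between them by user ID.
class OrderRouter {
    private final ProcessingUnit[] units;

    OrderRouter(int unitCount) {
        units = new ProcessingUnit[unitCount];
        for (int i = 0; i < unitCount; i++) {
            units[i] = new ProcessingUnit();
        }
    }

    int unitIndexFor(long userId) {
        return (int) Math.floorMod(userId, (long) units.length);
    }

    // Step 1: send the transaction to the unit that owns this user.
    long submit(long userId, long amount) {
        return units[unitIndexFor(userId)].process(userId, amount);
    }
}
```

All transactions for a given user land on the same unit, so steps 2 to 4 never need to cross a unit boundary.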
The entire pattern outlined in this post is also referred to as Space Based Architecture. A code example illustrating this model is available here.
Final Words
Today, with the availability of large multi-core machines at a significantly lower price, the question of scale-up vs. scale-out becomes more common than in earlier years.
There are more cases in which we could now package our application in a single box to meet our performance and scalability goals.
A good analogy that I have found useful for understanding where the industry is going with this trend is to compare disk drives with storage virtualization. Disk drives are a good analogy to the scale-up approach, and storage virtualization is a good analogy to the scale-out approach. Similar to the advance in multi-core technology today, disk capacity has increased significantly in recent years. Today, we have xTB data capacity on a single disk.
[Figure: PC hard disk capacity (in GB). The plot is logarithmic, so the fitted line corresponds to exponential growth.]
Interestingly enough, the increase in capacity of local disks didn’t replace the demand for storage; quite the contrary. A possible explanation is that while single-disk capacity doubled every year, the demand for more data grew at a much higher rate, as indicated in the following IDC report:
Market research firm IDC projects a 61.7% compound annual growth rate (CAGR) for unstructured data in traditional data centers from 2008 to 2012 vs. a CAGR of 21.8% for transactional data.
Another explanation is that storage provides functions such as redundancy, flexibility and sharing/collaboration, properties that a single disk drive cannot address regardless of its capacity.
The advances with the new multi-core machines will follow similar trends, as there is often a direct correlation between the advance in the capacity of data and the demand for more compute power to manage it, as indicated here:
The current rate of increase in hard drive capacity is roughly similar to the rate of increase in transistor count.
The increased hardware capacity will enable us to manage more data in a shorter amount of time. In addition, the demand for more reliability through redundancy, as well as the need for better utilization through the sharing of resources driven by SaaS/Cloud environments, will force us even more than before towards scale-out and distributed architecture.
So, in essence, what we can expect to see is an evolution where the predominant architecture will be scale-out, but the resources in that architecture will get bigger and bigger, thus making it simpler to manage more data without increasing the complexity of managing it. To maximize the utilization of these bigger resources, we will have to combine a scale-up approach as well.
Which brings me to my final point: we can’t think of scale-out and scale-up as two distinct approaches that contradict one another, but rather must view them as two complementary paradigms.
The challenge is to make the combination of scale-up/out native to the way we develop and build applications. The Space Based Architecture pattern that I outlined here should serve as an example of how to achieve this goal.
References
- Parallel processing patterns
- Space Based Architecture (GigaSpaces code example)
GigaSpaces Cloud Enabled Application Platform for Rackspace
Posted 27 August 2010 @ 7:27 am by Nati Shalom
Concurrency 101
Posted 18 August 2010 @ 10:25 am by Nati Shalom
Last week Guy Korland (our VP of R&D) and I met with a prospect where we discussed patterns that would enable them to take advantage of new multicore hardware.
Early in the discussion it was apparent that there was a terminology gap between multi-core concurrency and its close relative from the distributed programming world.
After that meeting, Guy put together the following vocabulary. I thought it was worth sharing with everyone who deals with multicore concurrency and distributed computing. Enjoy.
Concurrency 101:
- Serializability – When a concurrent algorithm can promise that the concurrent events can be ordered in a "serial" correct order.
- Linearizability – When a concurrent algorithm can promise that the concurrent events can be ordered in a "serial" correct order, where the guiding line is that each method invocation can be considered to happen at a single point in time during the method invocation.
- Lock-Free – An algorithm that promises that at least a single thread can progress all the time (meaning the system can progress) while other threads might starve.
- Wait-Free – An algorithm that promises that all the threads can progress all the time (meaning no starvation).
- Obstruction-Free – An algorithm that promises that any thread can progress if executed in isolation (meaning no progress is promised under contention).
- Consensus – An algorithm that helps different processes/threads reach a single decision. It has been proven impossible to guarantee in an asynchronous distributed system in which even one process may fail; on a shared-memory multi-core machine, wait-free consensus among any number of threads requires a primitive such as CompareAndSet.
- Amdahl’s Law – A model of the expected speedup of a parallelized implementation of an algorithm relative to the serial algorithm; the speedup is bounded by the fraction of the work that must remain serial.
- Moore's Law – A popular observation that the number of transistors on a chip doubles approximately every two years. (It is often restated as CPU speed doubling every two years, which hasn't actually proven to be true, although with careful attention to concurrency it can still seem to be true.)
- NUMA - Non-Uniform Memory Access or Non-Uniform Memory Architecture is a computer memory design used in multiprocessors, where the memory access time depends on the memory location relative to a processor. Under NUMA, a processor can access its own local memory faster than non-local memory, that is, memory local to another processor or memory shared between processors.
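As a worked example of one of the terms above, Amdahl's Law is usually written as speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can be parallelized and n is the number of processors. The small Java sketch below (the class and method names are my own) evaluates that formula.

```java
class Amdahl {
    // parallelFraction: share of the work that can run in parallel (0.0 to 1.0)
    // processors: number of processors the parallel part is spread over
    static double speedup(double parallelFraction, int processors) {
        double serialFraction = 1.0 - parallelFraction;
        return 1.0 / (serialFraction + parallelFraction / processors);
    }

    public static void main(String[] args) {
        // With 90% of the work parallelizable, the speedup can never exceed 10x,
        // however many cores are added.
        for (int cores : new int[] {2, 8, 64, 1024}) {
            System.out.printf("%d cores -> %.2fx%n", cores, speedup(0.9, cores));
        }
    }
}
```

The serial fraction is what makes scale-up returns flatten out, which is one reason the partitioning discussed in the scale-out post matters even on a single large machine.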
In the next post I’ll write about Concurrency (Scale-up) vs Distributed Computing (Scale-out).
YeSQL (part II) - Putting NoSQL, SQL and Document Model Together
Posted 25 July 2010 @ 7:16 am by Nati Shalom
In my previous post (YeSQL: An Overview of the Various Query Semantics in the Post Only-SQL World ) I introduced the common query semantics in the post Only-SQL world and made an argument that the right approach is to decouple the query semantics from the underlying NoSQL implementation. This would allow us to combine SQL semantics with NoSQL backend to achieve the best of both worlds -- standard query and scalability.
In this post I want to illustrate this idea through some code examples, using GigaSpaces as the underlying implementation of the concept. The examples are based on the same code examples from another earlier post, WTF is Elastic Data Grid? (By Example), where I covered some of the simple models for writing and reading objects to a distributed data grid. Toward the end I will reference other examples such as Datanucleus and HBase, as well as cover patterns for supporting NoSQL semantics such as the document model with an existing RDBMS such as MySQL.
SQL/NoSQL Code Example (Using GigaSpaces)
The Data Model
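import java.io.Serializable;
import java.util.Map;
import com.gigaspaces.annotation.pojo.SpaceClass;
import com.gigaspaces.annotation.pojo.SpaceId;
import com.gigaspaces.annotation.pojo.SpaceRouting;

@SpaceClass
public class Data implements Serializable {
    private Long id;
    private String data;
    private Map info;

    public Data(long id, String data, Map info) {
        this.id = id;
        this.data = data;
        this.info = info;
    }

    // id is both the unique key and the routing (partitioning) index
    @SpaceRouting
    @SpaceId
    public Long getId() {
        return id;
    }
    // remaining getters/setters for id, data, info attributes omitted
}
As with the previous example, we use the @SpaceRouting annotation to set the routing index. This index determines which target partition this instance should belong to. @SpaceId determines the unique identifier (key) for this instance.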
Adding Document Model Semantics
Most of the common examples for document-based APIs use a JSON data model as the document. A document in JSON world would look something like this:
Source: Wikipedia
From a NoSQL implementation perspective, a document often translates to a Map of Maps where each attribute is mapped to a key,value representation and nested value to a key whose value is a Map and so forth.
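The Map-of-Maps representation can be sketched in plain Java. The field names below are made up for illustration and are not taken from the JSON sample referenced above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class DocumentAsMap {
    // Builds a small document: flat attributes map to key/value pairs,
    // and a nested object maps to a key whose value is another Map.
    static Map<String, Object> person() {
        Map<String, Object> address = new LinkedHashMap<>();
        address.put("city", "New York");
        address.put("postalCode", "10021");

        Map<String, Object> person = new LinkedHashMap<>();
        person.put("firstName", "John");
        person.put("age", 25);
        person.put("address", address); // nested document
        return person;
    }

    public static void main(String[] args) {
        System.out.println(person());
        // prints {firstName=John, age=25, address={city=New York, postalCode=10021}}
    }
}
```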
Document Model in GigaSpaces
In the current version of GigaSpaces a document is basically a Map attribute whose values are indexed. That gives the flexibility to add or remove attributes on an existing object without changing the object schema, as with any schemaless API. A nested object would be mapped to a key whose value is a complex object that is stored just like any regular POJO.
In this example I added an info attribute that is basically a Map of Key/Values.
private Map info;
One of the main benefits of this approach is the combination of a strongly typed POJO/Table model with the flexibility of the document model. In other words, I could match objects by type, and each object can have a “fixed” structure that must conform to a certain type, and a variable part that can vary on a per-instance basis. In our specific example, when I look for a Data object I'm guaranteed to get the Data.id and Data.data, but Data.info could include a variable list of attributes that can vary on a per-instance basis. I can still “pin” some attributes by forcing an indexing on the relevant keys as illustrated below:
// this defines several indexes on the same info property
@SpaceIndexes({
    @SpaceIndex(path = "info.address", type = SpaceIndexType.BASIC),
    @SpaceIndex(path = "info.socialSecurity", type = SpaceIndexType.BASIC)
})
public Map getInfo() {
    return info;
}
Note: As of the GigaSpaces 8.0 release the document model semantics will be extended to support a full hierarchy of Maps of Maps and dynamic indexes.
Combining SQL Query with (GigaSpaces) NoSQL Data Store
Now that we’ve gone through the API example let’s see how we could insert this data into a NoSQL data store (GigaSpaces in this specific case) and query it through SQL.
Inserting the Data Object
for (long i = 0; i < 1000; i++) {
    gigaSpace.write(new Data(i, "message" + i, createInfo(i)));
}
The createInfo() method generates a new Map and fills it with values, as can be seen below:
public static Map createInfo(long i) {
    Map info = new HashMap();
    info.put("address", i + " Broadway");
    info.put("socialSecurity", 1232287642L + i);
    info.put("salary", 10000 + i);
    return info;
}
Querying the NoSQL document data using SQL
There are basically two models for querying data using SQL in GigaSpaces: (1) adding SQL-like queries using the GigaSpaces SQLQuery API, and (2) using a fully standard SQL JDBC driver.
Querying the data using a SQL-like model
Data[] d = gigaSpace.readMultiple(
new SQLQuery(Data.class, "info.salary = 10000"),
Integer.MAX_VALUE);
The gigaSpace.readMultiple(..) operation is equivalent to a “select” statement. It takes the class (the equivalent of the table name in the from clause in SQL) and a query clause, here “info.salary = 10000” (the equivalent of the where clause in SQL); a range such as “info.salary < 11000 and info.salary >= 10000” is expressed the same way.
As we can see, the syntax of the SQL query borrows the syntax of an object-oriented model, where I can refer to the associated attributes within the document attribute just like any nested attribute. In our example info.salary would point to the info Map and pull the attribute whose key is ”salary”.
This API is useful if you're already working in POJO as your domain model, as it works with objects natively and therefore can bypass the need for any O/R Mapping.
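To illustrate what a dotted path such as info.salary means, here is a small stand-alone sketch that walks nested Maps one key at a time. This is only an illustration of the path semantics and an assumption on my part, not the GigaSpaces query engine; PathResolver and resolve are hypothetical names, and the top-level object is modeled as a Map rather than a POJO to keep the sketch short.

```java
import java.util.HashMap;
import java.util.Map;

class PathResolver {
    // Follows a dotted path through nested Maps; returns null when any
    // segment is missing or the current value is not a Map.
    static Object resolve(Map<String, Object> root, String path) {
        Object current = root;
        for (String key : path.split("\\.")) {
            if (!(current instanceof Map)) {
                return null;
            }
            current = ((Map<?, ?>) current).get(key);
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, Object> info = new HashMap<>();
        info.put("salary", 10000L);

        Map<String, Object> data = new HashMap<>();
        data.put("id", 0L);
        data.put("info", info);

        System.out.println(resolve(data, "info.salary")); // prints 10000
    }
}
```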
Querying the data using standard JDBC
In this example we will illustrate how we can plug different query engines to the same data instance. In this case we will use a standard JDBC connection to connect to the data in the following way:
// Load the GigaSpaces JDBC driver
Class.forName("com.j_spaces.jdbc.driver.GDriver").newInstance();
// "*" in place of a host name triggers multicast discovery of the cluster
String url = "jdbc:gigaspaces:url:jini://*/*/myElasticDataGrid";
Connection conn = DriverManager.getConnection(url);
Statement st = conn.createStatement();
String query = "SELECT * FROM com.gigaspaces.examples.Data";
ResultSet rs = st.executeQuery(query);
// Iterate through the result set
int i = 0;
while (rs.next()) {
    System.out.println("Data [" + (i++) + "] " + rs.getString("data"));
}
The Class.forName call sets up the GigaSpaces JDBC driver. The GigaSpaces JDBC driver is responsible for mapping the SQL query language into the underlying GigaSpaces methods. That means that from a GigaSpaces data store perspective JDBC calls look pretty much the same as any other call. The URL uses the following pattern <gigaspaces jdbc prefix>:<gigaspaces space url>, where the gigaspaces-jdbc-prefix is always set to jdbc:gigaspaces. The space URL points to the relevant data grid cluster.
Note that one of the interesting concepts that comes with this is that you don’t need to point to a specific host as you would in most of today's databases; rather, we use “*” as the host name, which initiates a network discovery using multicast to find the relevant instances of the cluster.
The rest of the code looks just like any other SQL call. We use a fairly basic mapping where every class is mapped to a table and an object attribute is mapped to a column. Note that unlike complex O/R mapping we leverage the fact that a space can store objects in their native format. That means that we don’t need to break nested objects into different tables; instead we store them as a single embedded Java object where the relationships are kept consistent with their original Java representation.
Since standard SQL doesn’t support nested object queries, our current version of the JDBC driver provides access to the top-level attributes only. Any nested object is treated as a POJO and is matched according to the POJO-based template matching semantics. Currently, this limits the type of queries that one can perform on nested objects compared with the SQL-like API, because we can’t perform matching on individual fields within the document or any nested object. As the SQLQuery and JDBC layers use the same underlying query engine, this limitation is only a semantic limitation and not a technical limitation, and will be resolved in the next release of GigaSpaces.
What About Performance?
Alex Popescu, co-founder and CTO of InfoQ.com, made a comment in his post NoSQL Databases Should Support SQL Queries questioning the performance overhead associated with adding another layer of indirection, as I pointed out in my original post. He rightfully quoted Jeff Kesselman:
The two software problems that can never be solved by adding another layer of indirection are that of providing adequate performance or minimal resource usage.
Indeed, this is one of the main challenges in this entire discussion. It is relatively simple to add another level of indirection but it's almost impossible to make it perform well.
The key to addressing the performance challenge lies in the implementation of the underlying data store and how well it is suited to support the functionality required by the higher-level abstraction. In our specific NoSQL discussion the performance of the SQL abstraction would be greatly influenced by the ability of the underlying NoSQL data store to support complex queries at the core level. In this case, adding a different set of query semantics would be a matter of simple syntax mapping, which should yield negligible overhead and in some cases could turn out to be more efficient, as we could support algorithms that are not simple to implement at a lower level. A good example is found in Kevin Weil's presentation Hadoop, Pig, and Twitter (slides 17-18). Kevin provides an example of how a simple Pig query could map to a fairly complex Hadoop task in Java. So even if a lower-level API could be more efficient at a micro level, it might turn out to be less efficient at the macro level as a result of the associated complexity.
In our example the abstraction provided by putting Datanucleus on top of Google BigTable is probably going to yield fairly high overhead when it comes to complex queries, because most of the query semantics are implemented at the mapping layer. With GigaSpaces, we chose to support all the query semantics at the core layer and make the SQL abstraction a thin semantics mapping layer. A recent benchmark test showed only 2% difference between JDBC queries and native queries.
[Figure: GigaSpaces native query engine makes the overhead of the SQL abstraction negligible]
Other SQL/NoSQL References
Datanucleus provides a JPA/JDO mapping layer on top of various datastore implementations.
As such, it is well positioned to provide a common SQL abstraction layer. It is also well advanced in the area of becoming a common mapping layer for multiple NoSQL datastores. Currently it supports HBase, BigTable, and Amazon S3. MongoDB and Cassandra are currently works in progress.
As there is still a big difference between the various datasource implementations (each supports only a subset of query semantics), each datasource plugin covers a subset of Datanucleus features. There are various documented limitations per datastore. You can find the list of supported features in the relevant datastore plugin documentation (a reference to googlestorage is provided here ).
You can also find a useful code example for using the Datanucleus JPA plugin for Hadoop/HBase in the following post: Apache Hadoop HBase plays nice with JPA by Matthias Wessendorf.
RDBMS Support for NoSQL Semantics
As I was working on this write-up, which centers around supporting SQL on top of a NoSQL datastore, I came across an interesting post: Schema-Free MySQL vs NoSQL by Ilya Grigorik, CTO/Founder at PostRank.
In his post, Ilya provides an interesting pattern that shows how you can store a schemaless document model on a MySQL datastore.
Instead of defining columns on a table, each attribute has its own table (new tables are created on the fly), which means that we can add and remove attributes at will. In turn, performing a select simply means joining all of the tables on that individual key
Ilya ends with a comment that I thought was in line with my thoughts outlined in this post:
..there is absolutely no reason why we can’t have many of the benefits of “NoSQL” within MySQL itself.
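The attribute-per-table idea quoted above can be mimicked in a few lines of plain Java, with in-memory maps standing in for MySQL tables. This is my own rough analogy under that assumption, not Ilya's implementation, and AttributeTables is a hypothetical name.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class AttributeTables {
    // One "table" per attribute, each keyed by the record id.
    private final Map<String, Map<Long, Object>> tables = new TreeMap<>();

    // Setting a new attribute creates its table on the fly.
    void set(long id, String attribute, Object value) {
        tables.computeIfAbsent(attribute, name -> new HashMap<>()).put(id, value);
    }

    // Removing an attribute from one record touches only that attribute's table.
    void remove(long id, String attribute) {
        Map<Long, Object> table = tables.get(attribute);
        if (table != null) {
            table.remove(id);
        }
    }

    // A "select" joins every attribute table on the record id.
    Map<String, Object> select(long id) {
        Map<String, Object> row = new TreeMap<>();
        for (Map.Entry<String, Map<Long, Object>> table : tables.entrySet()) {
            Object value = table.getValue().get(id);
            if (value != null) {
                row.put(table.getKey(), value);
            }
        }
        return row;
    }

    public static void main(String[] args) {
        AttributeTables store = new AttributeTables();
        store.set(1L, "name", "Alice");
        store.set(1L, "city", "Paris");
        store.set(2L, "name", "Bob"); // record 2 simply has no "city"
        System.out.println(store.select(1L)); // prints {city=Paris, name=Alice}
        System.out.println(store.select(2L)); // prints {name=Bob}
    }
}
```

Records with different attribute sets coexist without any schema change, which is the schemaless property the pattern is after; the price is that every select touches every attribute table.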
Final Words
In this post I've tried to illustrate, mainly through the GigaSpaces example, how the SQL and NoSQL models can be brought together to achieve the best of both worlds, i.e., the scalability of the NoSQL model and the rich and standard query semantics of SQL. At the same time, I believe that additional semantics that are not currently supported by SQL such as the document model and object relationship can also be brought together as an extension to current SQL semantics as illustrated in this post.
Having said that, there will be times when an object- or document-centric language is more intuitive or simpler to use than SQL. The decoupling of the datastore and the query language should enable us to write objects in an object/document-centric model and still query them using a SQL engine, and vice versa. This will give us more flexibility to choose the language that best fits the context in which it is being used, and mix and match among them just as we do today in any web application where we mix together different languages such as HTML, CSS, JavaScript and others.
Finally, I believe that a large part of the discussion in the NoSQL community took a wrong turn by putting too much emphasis on the query language rather than the scalability patterns, which IMO should remain the only motivation for switching from one datastore to another.
References
- YeSQL: An Overview of the Various Query Semantics in the Post Only-SQL World
- WTF is Elastic Data Grid? (By Example)
- Apache Hadoop HBase plays nice with JPA by Matthias Wessendorf
- Hadoop, Pig, and Twitter presentation by Kevin Weil
- Schema-Free MySQL vs NoSQL by Ilya Grigorik
- NoSQL databases Should Support SQL Queries by Alex Popescu
YeSQL: An Overview of the Various Query Semantics in the Post Only-SQL World
Posted 15 July 2010 @ 7:42 am by Nati Shalom
The NoSQL movement faults the SQL query language as the source of many of the scalability issues that we face today with the traditional database approach.
I think that the main reason so many people have come to see SQL as the source of all evil is the fact that, traditionally, the query language was burned into the database implementation. So by saying NoSQL you basically say "No" to the traditional non-scalable RDBMS implementations.
This view has brought on a flood of alternative query languages, each aiming to solve a different aspect that is missing in the traditional SQL query approach, such as a document model, or that provides a simpler approach, such as Key/Value query.

Most of the people I speak with seem fairly confused on this subject, and tend to use query semantics and architecture interchangeably. So I thought that a good start would be to provide a quick overview of what each query term stands for in the context of the NoSQL world. Then, I'll try to break some common misconceptions -- which led me to come up with the YeSQL term.
Common Query Semantics in the Post Only-SQL world
The following are some of the common query semantics in the NoSQL world. For those that are interested in code examples i’ve linked each category with the relevant GigaSpaces reference API.
- Key/Value query: Key/Value query, as the name suggests, is probably the most basic form of query. Each data item is associated with a unique identifier (key). In the NoSQL world, memcache is one of the most common implementations of such an interface. A common pattern for performing complex queries with memcache is to defer them to an underlying database, which is used as a search engine. The result of these queries is a key or set of keys that is then used to fetch the values from the memcache data store. Key/Value stores gained new momentum in the post-SQL world because they lend themselves fairly naturally to partitioning and distribution, which is a key piece in making a data store scalable. In other words, people were willing to trade the rich query functionality provided by most traditional RDBMSs for scalability with only basic query support, if that was their only choice.
- Document-based query: The roots of this model are in the search engine world, where it is very common to store different types of documents, even if each one represents a completely different object. In the NoSQL world, a document is not the typical Word or PowerPoint file that you would see in a search engine, but rather an object in the form of JSON or XML, or a binary object associated with a set of key/values, as in the case of Cassandra. In SQL terms, a document can be seen as a blob that is associated with a set of keys, each indexed independently and maintaining a reference to this blob. Each blob can be of a different type (table), and each blob can have a different set of associated indexes (keys). Matching is done through the associated indexes. The result-set often includes multiple types, each containing a different set of data. Because the indexes and the blob don’t need to conform to a strict structure of rows and tables, the model is referred to as “schemaless” -- i.e., it can hold different versions of the same type, and fields can be added to new types without having to modify any table or update older versions of the data. Examples that support the document model are CouchDB and MongoDB.
- Template query: Template queries were common in JavaSpaces and even in later versions of Hibernate. With template-based matching, you can fetch an object based on class type or inheritance hierarchy, as well as values of the attributes of that object. In more object-oriented versions of template matching you can also perform matching based on specific items within a graph attribute. GigaSpaces is one of the better-known implementations that support the JavaSpaces template query model.
- Map/Reduce: Map/Reduce is often used to perform aggregated queries on a distributed data store. A simple scenario would be Max or Sum. In such a scenario the query request needs to be executed independently in each partition (Map) and the partial results then aggregated back to the client (Reduce). An implicit Map/Reduce takes a query request and spreads its execution across the partitions transparently; the client gets the aggregated result as if it had run a single query. The explicit model allows you to execute code in free form, where you control the mapping model -- which call goes to which data, the code to run in each node (aka tasks), and how the results are combined. In a typical Hadoop implementation, Map/Reduce is often done through the explicit model. Frameworks like Hive and Pig provide an abstraction model that can handle that process implicitly.
- SQL query: If you think about it, SQL is yet another form of dynamic language that was specifically designed for complex data management. With SQL, data is often ordered in tables and rows. Some of the query semantics in SQL, such as Joins, distributed transactions, and others, are known to be anti-patterns for scalability. It is mostly this aspect of SQL semantics that associates SQL with scalability limitations. Examples of NoSQL implementations that support SQL are Google Bigtable using JPA, Hive/Hadoop, MongoDB and GigaSpaces. I will discuss in further detail below what that actually means.
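To make the Key/Value and Map/Reduce semantics above more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the `PartitionedStore` class and its method names are hypothetical, and the "partitions" are plain in-process dictionaries standing in for separate nodes. It shows how a key routes to exactly one partition, and how an aggregate such as Sum or Max is computed per partition (Map) and then combined (Reduce).

```python
import zlib
from functools import reduce


class PartitionedStore:
    """Hypothetical key/value store; each dict stands in for one node."""

    def __init__(self, num_partitions=4):
        self.partitions = [{} for _ in range(num_partitions)]

    def _partition_for(self, key):
        # A stable hash makes a given key always route to the same partition.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        return self.partitions[index]

    def put(self, key, value):
        self._partition_for(key)[key] = value

    def get(self, key):
        # A key lookup touches exactly one partition.
        return self._partition_for(key).get(key)

    def map_reduce(self, map_fn, reduce_fn):
        # Map: run map_fn independently against each partition's values.
        partials = [map_fn(list(p.values())) for p in self.partitions]
        # Reduce: combine the per-partition results into a single answer.
        return reduce(reduce_fn, partials)


store = PartitionedStore(num_partitions=3)
for i, price in enumerate([12, 7, 30, 18, 5]):
    store.put(f"order:{i}", price)

# Sum: partial sums per partition, then added together.
total = store.map_reduce(sum, lambda a, b: a + b)

# Max: a partition may be empty, so the map step supplies a default.
highest = store.map_reduce(lambda vs: max(vs, default=float("-inf")), max)

print(store.get("order:2"), total, highest)  # 30 72 30
```

The caller of `map_reduce` sees one aggregated result, which roughly corresponds to the implicit model described above; passing arbitrary `map_fn` and `reduce_fn` functions is closer in spirit to the explicit model.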
YeSQL -- There's Nothing Wrong with SQL!
Now that we've covered some of the concepts behind query formats, it becomes more apparent that there is nothing really wrong with SQL. Like many languages, SQL gives you a fairly long rope with which to hang yourself if you choose to, but that is true of almost any language. If you design your data model to fit into a distributed model, you may find that SQL can be a fairly useful format for managing your data. Good examples are Hive/Pig/HBase and Google JPA/Bigtable. In both cases the underlying data store is based on a scalable Key/Value store, but the front-end query language happens to be SQL-based. MongoDB aims toward a similar goal, with the main difference being that it provides SQL-like support and doesn’t fully comply with any of the existing standards.
It's About the Architecture, Stupid!
NoSQL implementations such as Hive/HBase as well as JPA/Bigtable can be a good example of how next-generation databases can support both linear scaling and a SQL API.
The key is the decoupling of the query semantics from the underlying data-store as illustrated in the diagram below:
Supporting a SQL API on top of a NoSQL data store in Google Bigtable
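The decoupling described above can be sketched in a few lines of Python. This is a toy illustration, not how Hive, JPA or Bigtable actually work: the `DocumentStore` class, the `sql_query` function, and the single supported statement shape (`SELECT * FROM <type> WHERE <field> = '<value>'`) are all assumptions made for the example. The point is that the same stored JSON documents can be reached through a Key/Value call, a template match, or a SQL-like front-end without changing the underlying store.

```python
import json
import re


class DocumentStore:
    """Hypothetical store: JSON documents kept in a plain key/value dict."""

    def __init__(self):
        self._data = {}  # key -> serialized JSON string

    # Key/Value API
    def put(self, key, doc):
        self._data[key] = json.dumps(doc)

    def get(self, key):
        raw = self._data.get(key)
        return None if raw is None else json.loads(raw)

    # Template (query-by-example) API: every template field must match.
    def match(self, template):
        docs = (json.loads(raw) for raw in self._data.values())
        return [d for d in docs
                if all(d.get(k) == v for k, v in template.items())]


# Assumed, deliberately tiny SQL subset: one equality predicate on a string.
_SELECT = re.compile(
    r"^\s*SELECT\s+\*\s+FROM\s+(\w+)\s+WHERE\s+(\w+)\s*=\s*'([^']*)'\s*$",
    re.IGNORECASE,
)


def sql_query(store, statement):
    """SQL-like front-end that translates a statement into a template match."""
    m = _SELECT.match(statement)
    if m is None:
        raise ValueError("unsupported statement: " + statement)
    doc_type, field, value = m.groups()
    # The table name maps onto the (assumed) "type" field of each document.
    return store.match({"type": doc_type, field: value})


store = DocumentStore()
store.put("u1", {"type": "user", "name": "alice", "city": "London"})
store.put("u2", {"type": "user", "name": "bob", "city": "Paris"})
store.put("o1", {"type": "order", "city": "London", "total": 30})

by_key = store.get("u2")
by_sql = sql_query(store, "SELECT * FROM user WHERE city = 'London'")
print(by_key["name"], [d["name"] for d in by_sql])  # bob ['alice']
```

Here the SQL layer holds no data of its own; it only rewrites the statement into a call the store already supports, which is the design choice the decoupling argument relies on.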
Convergence is Underway
Last week I spent some time at the Hadoop summit. Hadoop created a fairly generic substrate that led to an innovative ecosystem behind it. There are already many new frameworks today that provide different levels of abstraction to the way Hadoop manages data in both query and processing, such as Hive, Cascading, and Pig. Many of them provide tools that the original creator of Hadoop never even thought of.
Which brings me to the point that we can apply the same decoupling pattern that I mentioned above to support a document model in connection with SQL. In other words, I believe that going forward, most of the leading databases will support all of the semantics listed above, and we won't have to choose a database implementation just because it supports a certain query language.
We've already seen a similar trend with dynamic languages. In the past, a language had to come with a full stack of tools behind it (compiler, libraries, and development tools), making the selection of a particular language quite strategic. Today, the JVM in Java or the CLR in .Net provides a common substrate that can support a large variety of dynamic languages on top of the same runtime. Good examples are Groovy and JRuby, which run alongside Java on the JVM.
Final Words
As I pointed out throughout this post, SQL is actually a fairly good query language and will continue to serve a major role in the post only-SQL world. However, the concept of one size fits all doesn't hold up. The data management world is going to be built from a variety of tools and data management languages, each serving a particular purpose. Ideally, we should be able to access any data using any of several query languages, regardless of how it was stored. For example, I should be able to store a JSON object using a document model and then, at any time, query that JSON object using SQL query semantics or a simple Key/Value API.








