Why browse when you can simply search?

Posted 23 April 2007 @ 5:06 am by Nati Shalom

I had interesting discussions in the past few weeks about the use of Object Graphs and how that approach maps into a space model. It reminds me of the discussions I use to have when I used Versant as an object database and object store in the late 90s.  Object Graphs assume a certain hierarchy in which the data is ordered. The model also assumes that the navigation path to that data takes that specific order.

One of the promises of the Object Graph model is that through navigation you are able to point directly to the data and therefore be very efficient in how fast you can retrieve it. The fact that it could be just a pointer makes the navigation task very simple and native to the programmer. For example — all you do is Customer.getAccounts().. and you will get a direct reference to the relevant account list. 

So the idea behind object graphs is that by ordering the data in a certain hierarchy you can speed up significantly the time it takes to retrieve large sets of data simply by pointing directly to it.

Another reason is usability, because browsing through the data model in an object-oriented fashion can be more intuitive. In this post I'll focus mostly on the first aspect - the performance. I'll address the usability in a separate post.

So how do Object Graphs relate to the space model?

The space stores tuples as objects, or entries, as they are called in JavaSpaces terminology.  A space maintains entries as objects which can contain graphs as attributes. The way a space handles the relationship between different entries is closer to a relational model than it is to an object-oriented model. In other words, you can view each entry-class as a table and each entry-instance as a row within that table. That means that two separate entries CAN'T maintain a direct Java reference to the same attribute. As with a relational model, it is possible to use logical references through object IDs and even create a hierarchical view out of the individual entries using a parent-child pattern.

The following link provides a reference to how you can build Object Graphs with GigaSpaces. Hibernate is also a good example of how you could create Object Graphs out of the relational model using O/R mapping techniques, and I can imagine you can build equivalent mapping tools over a space model to address such requirements.

Seems straightforward, right? Well, the underlying implementation can be complex, especially when we need to deal with large data structures. In such cases we need to provide the ability to get specific objects, smart indexing, explicit locking of related objects, cascading deletes, etc. Due to the relationship between the different objects this approach tends to be relatively costly on inserts and updates. It also tends to be limited in terms of concurrency, due to the complex locking structure.

While all these are technically doable, I ask: do we really need all that? What value do I get if I order my data in such a hierarchical model? If performance is the objective, isn't there a different way to achieve it?

What if I could retrieve that data just at the same speed or even faster through a search?

Google Desktop as a search mechanism for emails is a good analogy. Even though there are lots of technical differences on the implementation side, it provides a very good equivalent to the issue above.

Let me explain:

Before desktop search technologies such as Google Desktop, the way we arranged our emails and documents was in hierarchical folders. Some people use the sender name as the hierarchy model, some use date — but we all ordered our information in some sort of hierarchy, and when we looked for a specific item, we simply navigated to the appropriate folder and then browsed to the appropriate item. As long as the way we looked for that data-item was consistent, the model worked fine, but even then, if we ordered the data based on name and then wanted to look an item up by date, the model broke.

So the main reason we organized our emails in folders was to speed up the way we could find data items (archived emails in this specific case). The main reason we all went through that process was because the original search provided by Microsoft was too slow. Now, when things such as Google Desktop emerged, and search was as fast as navigation — if not faster — that changed the entire method in which I organized my data. I stopped placing items in folders. It doesn't make sense anymore. With this approach I don't have to organize the information in a certain hierarchical order I simply select the right keywords and search for the relevant item from the search results.

Now back to my original point:

A space, and an In-Memory-Data-Grid, provides the opportunity to run search in-memory, which can be as fast as navigation because the index in a space points directly to the reference of the object in-memory.  There are also open source projects such Compass,  led by a GigaSpaces colleague of mine, Shay Banon, that provides Google-like search for POJOs. Compass has the promise of having an effect similar to that of Google Desktop on the way we order our in-memory data. That could change the whole motivation for using object graphs in the first place.

I'm over-simplifying things here for the sake of the discussion, however, unless I'm missing something fundamental, I think that we should really ask ourselves the question of: why browse (or navigate) when you can simply search?

We need to ask that question before we start thinking about how we build complex distributed object graphs, how we deal with referential integrity, locking of hierarchical trees, managing lifecycles of our objects based on reference counts — all those "nice" things that need to be addressed as soon as you start dealing with graphs.

Anyway, those are just my initial thoughts and I hope that I'm making sense here - I would be very interested in hearing what others think about this topic.

Read more...

2 Responses to “Why browse when you can simply search?”

  1. Shay Hassidim Says:

    An important note here is would be the roots of constructing complex object graphs - If we will go back into the 90s of the previous century, where ODBMS/RDBMS used to store the business logic state, the expensive disk devices access time led the software engineers to break the application in-memory data with its OO model and pure object references into separate entities in the database. The ODBMS done this in implicit manner , but still , each object introduced into the system as a separate entity. With RDBMS and the different ORM mapping tools (Toplink , JDO , Hibernate , etc) this phase done in relatively simple work , but imposed serious scalability , latency and performance problems.

    With GigaSpaces and in-memory-data-grid technologies it becomes almost obsolete to deal with breaking object graphs into different entries when storing these into memory. The reason is simple - the shared data repository (IMDG) is an exact image of the application business logic state. In many cases the interaction between the business logic and the IMDG done without any remote calls – i.e. collocated deployment topology , so the effort to store the application business logic state is close to ZERO (no serialization involved).
    To allow the business logic to retrieve the data from the IMDG , it need only to query the graph root object. Later it can fetch it and access the relevant low level objects using regular object reference. You might need to locate the root object using its simple fields (native types) or its complex fields – both options are supported with GigaSpaces (effectively join). Locating objects can be done using the regular template matching , SQL Query or regular expression.

    We should remember that when taking the application object graph data and breaking it into different entities at the database/memory we would need to face with cascading delete issues , recursive data retrieval issues , deadlocks issues (when parent and child are locked ). In addition - when breaking the in-memory application business logic graph into separate entities , each entity should have unique identity. The cost of generating such identity and maintaining it would have impact on the performance. This makes the application business logic to be fully aware the actual data storage architecture used under the hood.

    So “Why browse when you can simply search?” – here are some important facts:
    - The Speed of searching data with GigaSpaces – GigaSpaces search mechanism can scan the index data in a rate of 500,000 scans per sec and return results in the speed of 200,000 objects per second in embedded mode (yes – 200,000 queries per second). The number above is test result done with Intel Woodcrest microprocessor.
    - Indexes management – GigaSpaces maintain the indexed data in different way than the way RDBMS or ODBMS maintaining these. GigaSpaces maintain pure in memory index data with pointers directly to the entry data - without any disk access as done with databases. Indexes are optimized for exact equality matching or bigger/less than queries.

    Best Regards,
    Shay

    —————————————————-
    Shay Hassidim
    Deputy CTO
    GigaSpaces Technologies
    Write Once. Scale Anywhere.

  2. insurance health alva florida Says:

    quote florida health insurance insurance quote health florida

Leave a Reply