MOnagals

Month

April 2012

3 posts

Part 4: The NoSQL hoopla … What is NonsenSQL about it?
A Request and an Offer to NoSQL Techies

In spite of my 3 blog posts on NoSQL (http://bit.ly/NoSQLt http://bit.ly/NoSQL2 http://bit.ly/NoSQL3), where I had clearly stated what my concerns were with the whole NoSQL movement and what I strongly feel are the problems with the modus operandi of the architects and designers of the NoSQL systems, many people are still misinterpreting my comments/suggestions. Let me elaborate more in this post on what I strongly believe needs to be done and what I am willing to do to help in this regard. 

I fully realize that there are many different types of NoSQL systems and that there are many differences between them with respect to the functionality they provide and the technologies that were invented/leveraged/implemented to realize that functionality. While not all the points I have made in my previous posts would necessarily apply to every one of the systems, every point I have made would apply to at least a reasonable subset of the systems. 

Some people are expecting me to provide detailed review/criticism of each of the NoSQL systems along different dimensions (replication, data model, locking, etc). They seem to have missed my points about the descriptions of the internals of these systems being vaguely documented, and the choice of technologies/algorithms not being well specified and justified. This doesn’t mean that such details aren’t available for any aspect of any of the systems. I am sure there are a few systems for which some amount of detail is documented somewhere for some aspects of the implementation/design of those systems.  

My major point is that the designers/architects/implementors (“the techies”) of the NoSQL systems have to more carefully document the design of their systems so that the above points are dealt with methodically. Then, we can have more meaningful discussions about the merits and demerits of each of those systems, and the correctness/appropriateness of the chosen approaches to solving specific technical issues. The NoSQL techies, as responsible citizens of the land of data management, owe that level of rigor to the community. It would also help them in achieving better clarity in their thinking and increase the likelihood of catching logical errors in their algorithms. The whole ecosystem would gain from this exercise. Hopefully, the NoSQL techies would realize that many of the algorithms invented for RDBMSs would be applicable even to their systems and they would learn how to design their systems so that they are extensible to accommodate new requirements. As I have said before, many features of RDBMSs which were initially considered unnecessary in the NoSQL context are now creeping back in.

I am not expecting the NoSQL techies to necessarily write research papers which are subject to the rigorous refereeing processes of conferences like VLDB, ACM SIGMOD, IEEE ICDE, EDBT, etc., even though such things would be desirable in the long run. As I detailed in my Part 2 blog post (http://bit.ly/NoSQL2), I personally had to go through a tremendous amount of evangelization of my ideas (the ARIES family of locking and recovery algorithms - http://bit.ly/RepHis http://bit.ly/ARIESi) before they became widely accepted and got adopted/adapted for implementation in many different types of systems (not just RDBMSs). I of course know that many other people working in the traditional DBMS community have also had to go through similar pains to get their ideas rationalized and accepted. I have put references to my story only because I know it very well and because, for one reason or the other, my experiences have been well documented in various ways (papers, presentations, interviews and videos). I sincerely hoped that the readers of my blog post would take the trouble to follow up on them to get a far better feel for what some people have to go through to make a long-lasting impact on the wider technical community. Impact that goes beyond money making, flashy marketing collateral, elevator pitches or industry watcher/analyst pronouncements.

Maybe I am being idealistic but I feel I should appeal to the NoSQL community about this with the hope that it gains traction! I do realize that in addition to the open source community that is a big part of the NoSQL movement, there are a number of startups and big Web 2.0 companies who have sizable internal development groups that are engaged in NoSQL work. For different reasons, this set of people might choose not to act like the traditional DBMS community in following the kinds of suggestions I am making in my blog posts to bring more order to the current chaotic situation. 

The onus is on the NoSQL techies to do the needed documentation of their work and rationalization of their design choices rather than people like me having to play with their systems or dig through any available open source code to figure out such technical details and rationale. As the references I have given in the Part 2 blog post make clear, I chose to do such things in the past as part of the due diligence background work in relating my ARIES algorithms to what had been done before by others. It is worth pointing out that a typical researcher doesn’t take the trouble to do as much digging into real systems to compare with the prior art.

So, here is my humble request to the NoSQL techies: For each of your systems, please send me or point me to detailed technical information on each of the important aspects of your system. This should be documentation in the form of papers or presentations, and not pointers to source code comments and such! If some significant aspects of a system aren’t documented reasonably, I am urging the appropriate people to produce such documentation. Of course, for legal reasons, you should NOT send me any confidential or proprietary information. 

Here is my offer in return for the above: Once I get hold of such documentation, I am willing to maintain a page for each significant NoSQL system where I will consolidate all the information on that system. Once I get hold of all that information, I will be able to do the comparisons between systems and make suggestions for improvements, etc. for each of the systems. I am planning a tutorial on NoSQL systems and it would be in the best interest of the techies of the different systems to get their systems featured in such a tutorial by providing accurate and complete information on their systems. 

I would like to hear the readers’ reactions to my humble request and my offer in return. 

Apr 4, 2012
Part 3: The NoSQL hoopla … What is NonsenSQL about it?
The Myths about Transactions (ACID) and NoSQL

There has been widespread characterization of one of the major distinctions between NoSQL and traditional DBMSs by saying that the former don’t care for ACID semantics or that transactions aren’t needed. This is an oversimplification to say the least. As long as the NoSQL system supports incremental updates by a concurrent set of users (as opposed to only single-threaded bulk or batch updates), even if multi-API-call transactions are not supported, at least within the internals of such a system some notion of transaction is essential to retain a certain level of sanity of the internal design and keep things consistent. This is even more important if the system supports replication and/or the updating of multiple data structures within the system even in a single API call (e.g., if there are multiple access paths which have to be updated). Similar points apply to locking and recovery semantics and functionality.
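To make the point about a single API call touching multiple access paths concrete, here is a minimal sketch. It is not taken from any real NoSQL system: the `TinyStore` class, its `by_value` secondary index, and the `fail_after_primary` switch are all hypothetical names invented for illustration, and values are assumed to be hashable. The idea is only that one `put` call updates two internal structures, so it needs an internal undo log to keep them consistent if something fails halfway.

```python
class TinyStore:
    """Toy key-value store with one secondary index (hypothetical design)."""

    def __init__(self):
        self.primary = {}    # key -> value
        self.by_value = {}   # value -> set of keys (a second access path)

    def _index_add(self, value, key):
        self.by_value.setdefault(value, set()).add(key)

    def _index_remove(self, value, key):
        keys = self.by_value.get(value)
        if keys is not None:
            keys.discard(key)
            if not keys:
                del self.by_value[value]

    def _restore_primary(self, key, had_key, old):
        if had_key:
            self.primary[key] = old
        else:
            self.primary.pop(key, None)

    def put(self, key, value, fail_after_primary=False):
        """Update both structures, or leave both exactly as they were."""
        undo = []  # in-memory undo log: callables that reverse each step
        try:
            had_key = key in self.primary
            old = self.primary.get(key)
            self.primary[key] = value
            undo.append(lambda: self._restore_primary(key, had_key, old))

            if fail_after_primary:
                # Simulated failure between the two structure updates.
                raise RuntimeError("failure between primary and index update")

            if had_key:
                self._index_remove(old, key)
                undo.append(lambda: self._index_add(old, key))
            self._index_add(value, key)
            undo.append(lambda: self._index_remove(value, key))
        except Exception:
            # Roll back in reverse order so both structures agree again.
            for step in reversed(undo):
                step()
            raise
```

Even though the caller never sees a "transaction", the `put` call behaves like one internally. A real system would also need latching for concurrent callers and a persistent log for crash recovery, which this sketch leaves out.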

The above sorts of issues are real and were quite tricky to handle in Lotus Notes, which used very ad hoc ways of dealing with the associated complications, until log-based recovery and transaction support were added in R5 (http://bit.ly/LNotes). From Day 1 in 1989, Notes has supported replication and disconnected operations with the consequent issues of potentially conflicting parallel updates having to be dealt with. Even RDBMSs were late in dealing with that kind of functionality. 
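As an illustration of what detecting "potentially conflicting parallel updates" between replicas involves, here is a minimal sketch using version vectors. This is a commonly described technique and is swapped in purely for illustration; it is not a description of how Notes detects conflicts. The `compare` function and the replica names are hypothetical.

```python
def compare(vv_a, vv_b):
    """Compare two version vectors (dict: replica id -> update counter).

    Returns 'equal', 'a_newer', 'b_newer', or 'conflict'.
    A missing replica id is treated as counter 0.
    """
    replicas = set(vv_a) | set(vv_b)
    a_ahead = any(vv_a.get(r, 0) > vv_b.get(r, 0) for r in replicas)
    b_ahead = any(vv_b.get(r, 0) > vv_a.get(r, 0) for r in replicas)
    if a_ahead and b_ahead:
        return "conflict"   # each copy saw an update the other did not
    if a_ahead:
        return "a_newer"
    if b_ahead:
        return "b_newer"
    return "equal"


# Two replicas update the same document while disconnected:
laptop = {"server": 3, "laptop": 1}
server = {"server": 4, "laptop": 0}
result = compare(laptop, server)
```

Here `result` is `"conflict"`, since neither copy dominates the other. Detecting the conflict is the easy part; deciding what to do about it (merge, pick a winner, keep both copies) is where the semantics have to be spelled out.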

Even if at the individual object level, high concurrency isn’t important given the nature of a NoSQL application, it might still be important from the viewpoint of the internal data structures of the NoSQL system to support high concurrency or fine granularity locking/latching (e.g., for dealing with concurrent accesses to the space management related data structures - see http://bit.ly/CMSpMg).

Vague discussions about NoSQL systems and ACID semantics make many people think that RDBMSs enforce strong ACID semantics all the time. This is completely wrong if by that people imply serializability as the correctness property for handling concurrent execution of transactions. Even from the very beginning, RDBMSs (System R and products that came from it) have supported different degrees of isolation, in some cases even the option of being able to read uncommitted data, and different granularities of locking (http://bit.ly/CMQuCC). Even with respect to durability, in-memory RDBMSs like TimesTen and solidDB, which came much later, allowed soft commits, etc., trading off durability guarantees for improved performance.
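The durability trade-off behind soft commits can be sketched in a few lines. This is an assumption-laden toy, not the mechanism of TimesTen, solidDB, or any other product: the `CommitLog` class and its `durable` flag are hypothetical. A durable commit forces the commit record to stable storage before returning; a soft commit returns as soon as the record is buffered, so a crash shortly afterwards may lose it.

```python
import os


class CommitLog:
    """Toy append-only commit log (hypothetical, for illustration only)."""

    def __init__(self, path):
        self._f = open(path, "ab")

    def commit(self, record: bytes, durable: bool = True) -> None:
        self._f.write(record + b"\n")
        if durable:
            # Hard commit: push the record through the process buffer
            # and ask the OS to force it to the storage device.
            self._f.flush()
            os.fsync(self._f.fileno())
        # Soft commit: return immediately; the record reaches disk
        # whenever the buffers are flushed later.

    def close(self) -> None:
        self._f.flush()
        os.fsync(self._f.fileno())
        self._f.close()
```

The soft path skips a synchronous device write per transaction, which is where the performance gain comes from, at the cost of a window in which acknowledged work is not yet durable.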

In my last 2 posts on NoSQL (http://bit.ly/NoSQLt http://bit.ly/NoSQL2), I gave a lot of information on my background to make it clear to the readers that this whole space of data management is a tricky business. The devil is in the details and it isn’t for the faint hearted :-) I wanted to make it clear that I don’t believe in quick and dirty approaches to handling intrinsically complicated issues and that I am not somebody who takes frequent elevator rides with VCs :-) At the same time, I am not an ivory tower researcher either! When I hear many presentations on “my kind of topics” at various conferences and meetings like the Hadoop User Group (HUG), I have a tough time making sense of what is going on given the high level nature of what is being presented with no serious attempts being made to compare what is proposed with what has been done before and about which more is known. 

Of course, NoSQL systems aren’t the only context in which such things have happened in the past. A great number of people have talked about optimistic concurrency control and recovery without much of the details really being worked out (see my discussions on this topic in http://bit.ly/CMOpCC). Even now some of the NewSQL people make some tall claims about how traditional recovery isn’t needed and that they can get away without logging while still supporting SQL, etc. One has to quiz them quite a bit to discover that they do in fact do some bookkeeping that they choose not to describe as logging and/or that they don’t support statement-level atomicity even though they support SQL and SQL requires it! 
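Statement-level atomicity is easy to demonstrate with a system that does support it. The sketch below uses Python's standard `sqlite3` module and SQLite's default conflict handling; the `acct` table and the function name are made up for the example. A multi-row `UPDATE` that fails on one row must leave no trace of the rows it had already changed, while earlier statements in the same transaction survive.

```python
import sqlite3


def statement_atomicity_demo():
    # isolation_level=None: we issue BEGIN/COMMIT ourselves.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE acct (id INTEGER PRIMARY KEY, "
        "bal INTEGER CHECK (bal < 10))"
    )
    conn.executemany(
        "INSERT INTO acct VALUES (?, ?)", [(1, 1), (2, 9), (3, 1)]
    )

    conn.execute("BEGIN")
    conn.execute("INSERT INTO acct VALUES (4, 5)")  # earlier statement
    failed = False
    try:
        # Row 2 would become 10 and violate the CHECK constraint, so
        # the whole statement must be undone, not just that one row.
        conn.execute("UPDATE acct SET bal = bal + 1")
    except sqlite3.IntegrityError:
        failed = True
    rows = conn.execute("SELECT id, bal FROM acct ORDER BY id").fetchall()
    conn.execute("COMMIT")
    conn.close()
    return failed, rows
```

After the failed `UPDATE`, no balance has changed, yet the row inserted earlier in the same transaction is still there. Providing this requires some record of what the statement changed, whatever name the bookkeeping goes by.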

For some people, it might be very tempting to think that the NoSQL applications are so much different from traditional database applications that simple things are sufficient (“good enough” being the often used phrase to describe such things) and that overnight mastery of the relevant material is possible. Even in the Web 2.0 space, if the application programmers are not to go crazy, more of the burden has to be taken up by the designers of the NoSQL systems. A case in point is how the Facebook messaging system designers decided eventual consistency semantics is too painful to deal with. To begin with, if the NoSQL systems have vague semantics of what they support and subsequently, as they evolve, if such things keep changing, users will be in big trouble! Also, with no standards in place for these systems, if users want to change systems for any number of reasons, applications might require significant rewriting to keep end user semantics consistent over time. 

Apr 1, 2012