Follow up to my #EDBT2013 keynote on #NoSQL and #NonsenSQL (http://bit.ly/NoSQLp) - An In-Depth Look at Modern Database Systems slides (revised from my tutorial at VLDB2013) at http://bit.ly/CMnMDS #NewSQL #DBMS #RDBMS #MongoDB #DB2 #SQLServer #Netezza #Hana
A preliminary bibliography at http://bit.ly/MDSbib It will evolve as time goes on!
In spite of my 3 blog posts on NoSQL (http://bit.ly/NoSQLt http://bit.ly/NoSQL2 http://bit.ly/NoSQL3), where I had clearly stated what my concerns were with the whole NoSQL movement and what I strongly feel are the problems with the modus operandi of the architects and designers of the NoSQL systems, many people are still misinterpreting my comments/suggestions. Let me elaborate more in this post on what I strongly believe needs to be done and what I am willing to do to help in this regard.
I fully realize that there are many different types of NoSQL systems and that there are many differences between them with respect to the functionality they provide and the technologies that were invented/leveraged/implemented to realize that functionality. While not all the points I have made in my previous posts would necessarily apply to every one of the systems, every point I have made would apply to at least a reasonable subset of the systems.
Some people are expecting me to provide detailed review/criticism of each of the NoSQL systems along different dimensions (replication, data model, locking, etc). They seem to have missed my points about the descriptions of the internals of these systems being vaguely documented, and the choice of technologies/algorithms not being well specified and justified. This doesn’t mean that such details aren’t available for any aspect of any of the systems. I am sure there are a few systems for which some amount of detail is documented somewhere for some aspects of the implementation/design of those systems.
My major point is that the designers/architects/implementors (“the techies”) of the NoSQL systems have to more carefully document the design of their systems so that the above points are dealt with methodically. Then, we can have more meaningful discussions about the merits and demerits of each of those systems, and the correctness/appropriateness of the chosen approaches to solving specific technical issues. The NoSQL techies, as responsible citizens of the land of data management, owe that level of rigor to the community. It would also help them in achieving better clarity in their thinking and increase the likelihood of catching logical errors in their algorithms. The whole ecosystem would gain from this exercise. Hopefully, the NoSQL techies would realize that many of the algorithms invented for RDBMSs would be applicable to even their systems and they would learn how to design their systems so that they are extensible to accommodate new requirements. As I have said before, many features of RDBMSs which were initially considered unnecessary in the NoSQL context are now creeping back in.
I am not expecting the NoSQL techies to necessarily write research papers which are subject to the rigorous refereeing processes of conferences like VLDB, ACM SIGMOD, IEEE ICDE, EDBT, etc., even though such things would be desirable in the long run. As I detailed in my Part 2 blog post (http://bit.ly/NoSQL2), I personally had to go through a tremendous amount of evangelization of my ideas (the ARIES family of locking and recovery algorithms - http://bit.ly/RepHis http://bit.ly/ARIESi) before they became widely accepted and got adopted/adapted for implementation in many different types of systems (not just RDBMSs). I of course know that many other people working in the traditional DBMS community have also had to go through similar pains to get their ideas rationalized and accepted. I have put references to my story only because I know it very well and because, for one reason or the other, my experiences have been well documented in various ways (papers, presentations, interviews and videos). I sincerely hoped that the readers of my blog post would take the trouble to follow up on them to get a far better feel for what some people have to go through to make a long-lasting impact on the wider technical community. Impact that goes beyond money making, flashy marketing collateral, elevator pitches or industry watcher/analyst pronouncements.
Maybe I am being idealistic but I feel I should appeal to the NoSQL community about this with the hope that it gains traction! I do realize that in addition to the open source community that is a big part of the NoSQL movement, there are a number of startups and big Web 2.0 companies who have sizable internal development groups that are engaged in NoSQL work. For different reasons, this set of people might choose not to act like the traditional DBMS community in following the kinds of suggestions I am making in my blog posts to bring more order to the current chaotic situation.
The onus is on the NoSQL techies to do the needed documentation of their work and rationalization of their design choices rather than people like me having to play with their systems or dig through any available open source code to figure out such technical details and rationale. As the references I have given in the Part 2 blog post make it clear, I chose to do such things in the past as part of the due diligence background work in relating my ARIES algorithms to what had been done before by others. It is worth pointing out that a typical researcher doesn’t take the trouble to do as much digging into real systems to compare with the prior art.
So, here is my humble request to the NoSQL techies: For each of your systems, please send me or point me to detailed technical information on each of the important aspects of your system. This should be documentation in the form of papers or presentations, and not pointers to source code comments and such! If some significant aspects of a system aren’t documented reasonably, I am urging the appropriate people to produce such documentation. Of course, for legal reasons, you should NOT send me any confidential or proprietary information.
Here is my offer in return for the above: Once I get hold of such documentation, I am willing to maintain a page for each significant NoSQL system where I will consolidate all the information on that system. Once I get hold of all that information, I will be able to do the comparisons between systems and make suggestions for improvements, etc. for each of the systems. I am planning a tutorial on NoSQL systems and it would be in the best interest of the techies of the different systems to get their systems featured in such a tutorial by providing accurate and complete information on their systems.
I would like to hear the readers’ reactions to my humble request and my offer in return.
There has been widespread characterization of one of the major distinctions between NoSQL and traditional DBMSs by saying that the former don’t care for ACID semantics or that transactions aren’t needed. This is an oversimplification to say the least. As long as the NoSQL system supports incremental updates by a concurrent set of users (as opposed to only single-threaded bulk or batch updates), even if multi-API-call transactions are not supported, at least within the internals of such a system some notion of transaction is essential to retain a certain level of sanity of the internal design and keep things consistent. This is even more important if the system supports replication and/or the updating of multiple data structures within the system even in a single API call (e.g., if there are multiple access paths which have to be updated). Similar points apply to locking and recovery semantics and functionality.
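To make the point about a single API call touching multiple internal data structures concrete, here is a minimal sketch in Python. It is purely illustrative and is not how any particular NoSQL system works: the class TinyStore, the by_city index and the fail_after_primary switch are all hypothetical names I made up. It shows one put call that must update both a primary map and a secondary access path, with a small in-memory undo list so that a failure between the two updates does not leave them inconsistent.

```python
class TinyStore:
    """Toy key-value store with one secondary index (hypothetical, for illustration)."""

    def __init__(self):
        self.primary = {}   # key -> record (a dict with a "city" field)
        self.by_city = {}   # city -> set of keys (the secondary access path)

    def _restore_primary(self, key, old):
        # Put back the before-image of the primary entry.
        if old is None:
            self.primary.pop(key, None)
        else:
            self.primary[key] = old

    def put(self, key, record, fail_after_primary=False):
        undo = []  # undo actions for this one API call, newest last
        try:
            old = self.primary.get(key)
            self.primary[key] = record
            undo.append(lambda: self._restore_primary(key, old))

            if fail_after_primary:
                # Assumption: stands in for any failure between the two updates.
                raise RuntimeError("simulated failure between the two updates")

            if old is not None:
                self.by_city[old["city"]].discard(key)
                undo.append(lambda: self.by_city[old["city"]].add(key))
            self.by_city.setdefault(record["city"], set()).add(key)
            undo.append(lambda: self.by_city[record["city"]].discard(key))
        except Exception:
            # Internal "mini transaction": roll back in reverse order.
            for action in reversed(undo):
                action()
            raise
```

Without the undo list, the simulated failure would leave the primary map pointing at the new record while the index still describes the old one. A real system would also have to make this survive crashes and concurrent callers, which is exactly where logging and latching come in.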
The above sorts of issues are real and were quite tricky to handle in Lotus Notes, which used very ad hoc ways of dealing with the associated complications, until log-based recovery and transaction support were added in R5 (http://bit.ly/LNotes). From Day 1 in 1989, Notes has supported replication and disconnected operations with the consequent issues of potentially conflicting parallel updates having to be dealt with. Even RDBMSs were late in dealing with that kind of functionality.
Even if at the individual object level, high concurrency isn’t important given the nature of a NoSQL application, it might still be important from the viewpoint of the internal data structures of the NoSQL system to support high concurrency or fine granularity locking/latching (e.g., for dealing with concurrent accesses to the space management related data structures - see http://bit.ly/CMSpMg).
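As a rough illustration of what finer granularity latching of an internal data structure can look like, here is a small Python sketch. It is my own toy example under stated assumptions, not taken from any real system: StripedTable and its methods are hypothetical names. Instead of one latch protecting the whole structure, the structure is split into stripes, each with its own latch, so that callers working on different stripes don’t block one another.

```python
import threading


class StripedTable:
    """Toy in-memory counter table with one latch per stripe (hypothetical)."""

    def __init__(self, stripes=8):
        self.latches = [threading.Lock() for _ in range(stripes)]
        self.buckets = [{} for _ in range(stripes)]

    def _stripe(self, key):
        # Map a key to the stripe (and hence the latch) that covers it.
        return hash(key) % len(self.latches)

    def increment(self, key, delta=1):
        i = self._stripe(key)
        with self.latches[i]:  # only this stripe is latched, not the whole table
            self.buckets[i][key] = self.buckets[i].get(key, 0) + delta

    def get(self, key):
        i = self._stripe(key)
        with self.latches[i]:
            return self.buckets[i].get(key, 0)
```

With stripes=1 this degenerates to a single table-level latch. The hard part in practice is not writing the striped version from scratch but retrofitting it into code that has quietly relied on the coarse latch for years.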
Vague discussions about NoSQL systems and ACID semantics make many people think that RDBMSs enforce strong ACID semantics all the time. This is completely wrong if by that people imply serializability as the correctness property for handling concurrent execution of transactions. Even from the very beginning, RDBMSs (System R and products that came from it) have supported different degrees of isolation, in some cases even the option of being able to read uncommitted data, and different granularities of locking (http://bit.ly/CMQuCC). Even with respect to durability, in-memory RDBMSs like TimesTen and solidDB, which came much later, allowed soft commits, etc., trading off durability guarantees for improved performance.
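For readers who haven’t come across degrees of isolation before, the following Python toy shows the difference between reading only committed data and being allowed to read uncommitted data. It is a deliberately simplified sketch under my own assumptions, not how System R or any product implements isolation: the class ToyDB and the isolation labels passed to read are hypothetical, and there is no locking at all.

```python
class ToyDB:
    """Toy store that keeps uncommitted writes apart from committed data."""

    def __init__(self):
        self.committed = {}  # key -> last committed value
        self.pending = {}    # txn id -> {key: value} of uncommitted writes

    def write(self, txn, key, value):
        self.pending.setdefault(txn, {})[key] = value

    def commit(self, txn):
        # Make the transaction's writes visible to everybody.
        self.committed.update(self.pending.pop(txn, {}))

    def abort(self, txn):
        # Throw away the transaction's writes.
        self.pending.pop(txn, None)

    def read(self, key, isolation="read_committed"):
        if isolation == "read_uncommitted":
            # A dirty read: may return a value that is later aborted.
            for writes in self.pending.values():
                if key in writes:
                    return writes[key]
        return self.committed.get(key)
```

A reader at the weaker level can see a value that never officially existed if the writer aborts. That is the trade-off RDBMSs have let applications choose for a long time.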
In my last 2 posts on NoSQL (http://bit.ly/NoSQLt http://bit.ly/NoSQL2), I gave a lot of information on my background to make it clear to the readers that this whole space of data management is a tricky business. The devil is in the details and it isn’t for the faint hearted :-) I wanted to make it clear that I don’t believe in quick and dirty approaches to handling intrinsically complicated issues and that I am not somebody who takes frequent elevator rides with VCs :-) At the same time, I am not an ivory tower researcher either! When I hear many presentations on “my kind of topics” at various conferences and meetings like the Hadoop User Group (HUG), I have a tough time making sense of what is going on given the high level nature of what is being presented with no serious attempts being made to compare what is proposed with what has been done before and about which more is known.
Of course, NoSQL systems aren’t the only context in which such things have happened in the past. A great number of people have talked about optimistic concurrency control and recovery without much of the details really being worked out (see my discussions on this topic in http://bit.ly/CMOpCC). Even now some of the NewSQL people make some tall claims about how traditional recovery isn’t needed and that they can get away without logging while still supporting SQL, etc. One has to quiz them quite a bit to discover that they do in fact do some bookkeeping that they choose not to describe as logging and/or that they don’t support statement-level atomicity even though they support SQL and SQL requires it!
For some people, it might be very tempting to think that the NoSQL applications are so much different from traditional database applications that simple things are sufficient (“good enough” being the often used phrase to describe such things) and that overnight mastery of the relevant material is possible. Even in the Web 2.0 space, if the application programmers are not to go crazy, more of the burden has to be taken up by the designers of the NoSQL systems. A case in point is how the Facebook messaging system designers decided eventual consistency semantics is too painful to deal with. To begin with, if the NoSQL systems have vague semantics of what they support and subsequently, as they evolve, if such things keep changing, users will be in big trouble! Also, with no standards in place for these systems, if users want to change systems for any number of reasons, applications might require significant rewriting to keep end user semantics consistent over time.
After my first blog post made 2 days ago on the topic of NoSQL (see http://bit.ly/NoSQLt), which has been widely read (according to Google Analytics, 1100+ visits from 58 countries), I have been surprised to see some people’s knee-jerk reactions :-) That prompted me to post variations of the following on Facebook, Twitter, LinkedIn and Google+:
I wish people would read exactly what I wrote in http://bit.ly/NoSQLt and stop imagining stuff that I didn’t write! I didn’t say NoSQL is not needed or that everything has been invented before. It is interesting how the blogosphere and twitterati are having a field day putting words in my mouth :-) E.g., search for @seemohan on Twitter
Since most people who read my earlier post probably don’t have a clue about who I am and what my philosophy with respect to technologies and inventions over my entire career has been, I would like them to be aware of the following collateral so that my comments could be interpreted with the right perspective and frame of mind. Readers who take the trouble to follow up on the references listed below will hopefully realize that I am not an RDBMS bigot, that I am open-minded about new ways of addressing problems and their solutions, and that I have invented and transferred technologies to also non-relational systems like MQSeries messaging system, Lotus Notes groupware/document system, WebSphere Application Server, FlowMark workflow management system, Parallel Sysplex Coupling Facility, etc. in the mainframe and non-mainframe environments (http://bit.ly/ARIESi).
I am not trying to claim that I know everything about our industry/technologies or what matters when, or that I have definitive ideas about what the right evolutionary path for data management systems is. I am merely trying to temper some of the marketing and technical hype associated with NoSQL and related areas, and to pass on some caveats and warnings based on my 30+ years of experience in the data management field. I am a bits and bytes (or nuts and bolts) kind of guy who has worked mostly on technologies relating to the bowels of different systems which manage persistent data of different kinds in distributed and clustered environments (http://bit.ly/CMpapp). In my writings and while designing my algorithms, I have tried hard to dig into what has been done in the past and document as much of my learning about the prior art and related work in my papers, crediting the people who did the prior work.
My comments aren’t targeted merely at one NoSQL system or one set of people. I would like all sorts of people to give some attention to what I have to convey regarding NoSQL systems: entrepreneurs, end users, IT management, systems architects, designers, marketers, students, industrial researchers, academicians (pure and those who moonlight on the side as entrepreneurs and consultants), established little/big industry people, …
Now, coming back to the topic of NoSQL and some of the concerns that I have about what is being done in that context, watching a video of the panel discussion in which I took part in Sri Lanka in September 2011 would be a way for readers to hear me express some of my thoughts and reactions. In due course of time, I plan to document more of my views in the written form.
I have closely observed or taken part in the evolution of many systems the designers of which initially designed their systems thinking in a simple way but later on had to add more sophisticated functionality which they found out was very hard to do. Examples are System/38’s database functionality which was embedded in the horizontal and vertical microcode of the system, Lotus Notes which from its beginnings in 1989 (http://bit.ly/LNhist) has looked in many ways like the NoSQL systems of today, and RDBMSs like the mainframe version of DB2, Sybase and SQLServer, and OODBMSs like ObjectStore which started out with page level locking as the smallest granularity of locking.
S/38 had a single level store and it relied on the virtual memory paging subsystem and the file system for accessing and caching data in memory. There was no buffer manager as in other RDBMSs. The granularity of latching during a call to the data manager was an entire table (locking was at record level). As the systems became more powerful and SMPs came into existence, latch conflicts became severe and the myriad things that took advantage of the table level latch became very painful to deal with.
Lotus Notes until R5 had very ad hoc ways of handling recovery, no notion of transactions and many non-scalable features. Changing that system and adding log-based recovery and transaction semantics was painful (http://bit.ly/LNotes).
Reducing the smallest granularity of locking from page size to something smaller was quite painful in RDBMSs/OODBMSs like DB2, Sybase, SQLServer and ObjectStore. The original lock granularity had been taken advantage of in many places in unobvious/subtle ways and those were very tricky to identify and fix.
I am really concerned about some of the design choices made in the case of NoSQL systems. As they mature and what were initially considered as unnecessary features start creeping in (due to the slippery slope that these systems are on when they deviated significantly from the feature set of RDBMSs), they are going to suffer a lot with growing pains along the above lines. I am unsure of the extent to which the designers of such systems are conscious of these sorts of consequences of what they have chosen to do initially.
I tried to demonstrate in our original ARIES paper (http://bit.ly/CMaries) the benefits to be had and the need for concurrently thinking about storage management, locking and recovery, unlike some layered approaches advocated in some earlier work. I also discussed numerous approaches to locking and recovery implemented in relational and non-relational systems which would be worth paying attention to as NoSQL systems evolve.
While there is a lot of talk about scalability, elasticity, etc., such design criteria seem to be applied in a spotty way in the design of these systems. Even the designers of systems which support incremental updates don’t seem to think of having to scale along the concurrency dimension by supporting finer granularity of locking/latching.
Way too much burden is being placed on the laps of the application writers or database administrators since even statement level atomicity isn’t guaranteed when a single statement which updates more than one object encounters a failure of some sort or the other. Of course, only some NoSQL systems support the functionality of multiple object updates in a single statement.
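Here is a tiny Python sketch of what statement level atomicity means for a single statement that updates more than one object. It is only an illustration under my own assumptions: the function update_all is hypothetical, the store is a plain dict with immutable values, and nothing here deals with crashes or concurrency. The function keeps before-images of the objects it has touched and puts them back if the statement fails part of the way through.

```python
def update_all(store, keys, change):
    """Apply change(value) to every listed key, or to none of them."""
    before = {}  # before-images of the objects touched so far
    try:
        for k in keys:
            before[k] = store[k]
            store[k] = change(store[k])
    except Exception:
        # Undo the partial work so the caller never sees a half-done statement.
        store.update(before)
        raise
```

For example, with balances = {"a": 10, "b": 0, "c": 5}, calling update_all(balances, ["a", "b", "c"], lambda v: 60 // v) fails on "b" with a division by zero and leaves balances exactly as it was. Without the before-images, "a" would already have been changed, and it would be up to the application writer to notice and clean up.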
The lack of standards with each NoSQL system cooking up its own APIs is also going to be a nightmare in due course of time. Whether it is an open source system or a proprietary one, users will feel locked in.
To be continued … in future posts.
After not looking at it seriously for a long time, during the last few months, I have been paying closer attention to the NoSQL phenomenon. I have been amazed at the amount of hoopla associated with it and the “anything goes” attitude of a significant fraction of the people using and/or working on such systems. Of late, it has become fashionable to diss RDBMSs, and a significant chunk of the technologies that have been laboriously thought about and worked out over the last few decades. Some inconvenient/inadequate features of RDBMSs in certain contexts have been used as arguments to throw the baby out with the bath water while coming up with alternatives. As some of us anticipated, many features which were initially considered unnecessary/undesirable are now being retrofitted to the NoSQL systems, in many cases in ad hoc and simple-minded ways.
Having worked in the database field for more than 3 decades with a fair amount of impact on the research and commercial sides of this field (see http://bit.ly/cmohan), it pains me to see the casual way in which some designs have been done and some supposedly new ideas get proposed/implemented. Not enough efforts are being made to relate these proposals to what has been done in the past and benefit from the lessons learnt in the context of RDBMSs. Not everything needs to be done differently just because it is supposedly a very different world now!
As a senior citizen of the database community, I feel I need to say something on this and related topics. For a while, I have been irregular in expressing my opinions very vocally in public fora. Now, I have decided to use this blog to become somewhat more active :-)
Of course, I have to state the obvious: what I say in this blog are all my personal opinions and they don’t necessarily reflect the opinions of my employer of the last 30 years!
I did raise some hackles and asked some uncomfortable questions when my academic sibling Raghu Ramakrishnan gave what I strongly felt was a very one-sided keynote (“Cloud Data Serving: Key-Value Stores to DBMSs”) at VLDB 2009 in Lyon where he extolled the benefits of such systems without enough caveats around what he was saying. I felt the impressionable minds, who constituted a significant fraction of the huge audience, deserved to be exposed to the latter.
As the General Chair, I listened to a number of related presentations at the HPTS 2011 workshop in Asilomar (http://bit.ly/HPTSpr). More recently, I attended the Silicon Valley edition of the MongoDB annual conference (MongoSV - http://bit.ly/MDsv11) along with, believe it or not, 1200 other people. After listening to some of the detailed presentations, I decided to tweet my reactions. What follows is a subset of the tweets I authored during that event. They may be of interest to people who didn’t see them before or to those who didn’t jot them down for later use.
In future posts, I hope to elaborate more on some of the points I have made above.
My daughter Pavithra Mohan, on turning 22 today, has made a resolution to get back to writing regularly through traditional media like magazines as well as, for the first time for her, Web 2.0 or social media like tumblr (Weaves by pavsmo) and twitter (@PavithraSMohan). She even credits me in this post :-)
Catch-22 is likely the most fitting descriptor for this day: having just turned 22, I do believe I’m past my prime as I have officially crossed the threshold into the land of the pruned and ripened.