My Thoughts on NoSQL (eflorenzano.com)
107 points by twampss on July 21, 2009 | hide | past | favorite | 42 comments



Some shaky statements in here. CouchDB should not be compared to these. Description of Cassandra is misleading.

He disses Tokyo Cabinet's table feature, which is in fact _not_ like typical tabular data; it behaves more like a second-level hash (a hash of hashes), the very thing he raves about as an awesome feature of other stores.
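A minimal sketch of that shape in Python (plain dicts standing in for the store; none of this is Tokyo Cabinet's actual API):

```python
# A table-style store modeled as a hash of hashes: key -> {column: value}.
# This only illustrates the data shape, not Tokyo Cabinet's API.
table = {
    "user:1": {"name": "alice", "age": "31", "lang": "en"},
    "user:2": {"name": "bob", "age": "24", "lang": "ja"},
}

# Primary-key lookup, like any key-value store:
record = table["user:1"]

# But you can also query on components of the value, not just the key:
english_speakers = [k for k, v in table.items() if v.get("lang") == "en"]
```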

"None of these really set CouchDB apart, because you could just encode JSON data and store it in Tokyo Cabinet, you can maintain your own indexes of data fairly easily, and you can build a simple REST API in a matter of days, if not hours."

Except you'd be missing the querying features in Tokyo Cabinet. He also misses the fact that Cabinet has real master-to-master replication.

Redis' batch-disk-dumping-from-memory strategy is interesting and theoretically allows it memcached-like speed with persistence. I'll be looking out for this one.

Finally, the statement on using each one for a different use case is crazy. The feature overlap between Redis, Cassandra, and Tokyo C/T is too big.


> CouchDB should not be compared to these.

Does CouchDB use SQL? No. So it may be mentioned in a post about "NoSQL".

> Description of Cassandra is misleading.

Defend your claim.

> He also misses the fact that Cabinet has real master-to-master replication.

When you start to need real master-to-master replication, the kind that Tokyo Tyrant offers is simply insufficient. It has no versioning, no way to resolve inconsistencies, no quorum write/read, etc. By the time you're at that scale, you had better be looking at Cassandra, HBase, or Hypertable IMO.

> Feature overlap between Redis, Cassandra and Tokyo C/T is too big.

Maybe I can buy that Redis and Tokyo Cabinet/Tyrant have too great of feature overlap (and note that I'm only interested in Redis as a high performance count server), but Cassandra is in a completely different realm with completely different sets of tradeoffs. To suggest that it is similar to the others is simply naive.


CouchDB is a document-centric system. It's really a different beast altogether. It's closer to a CMS than a DB.


No. While you could build a CMS with CouchDB, it definitely isn't one out of the box.

All "document-oriented" means in the Couch context is that the db's data model is "you have a key and a collection of attributes associated with it."



You're confusing "Notes Domino" with "a CMS" which it is not. From your link:

> CouchDB shares a basic concept of a document oriented database with views that Notes Domino has.

> In this model "documents" are just arbitrary collections of values that are stored some how. In CouchDB the documents are JSON objects of arbitrary complexity. In Notes the values are simple name value pairs, where the values can be strings, numbers, dates or arrays of those.


I didn't say it was a CMS: "It's closer to a CMS than a DB."


I guess the name CouchCMS was already taken?


I really would have liked some more background to his statement on Tokyo Cabinet's tables other than saying that "they basically suck." I've never heard this before and couldn't find anything on Google to support that.


It is a completely unsupported assumption made by someone who was probably expecting the TC tables to act like standard RDBMS tables instead of looking at them as a way to structure the values in a key-value store so that you can query on components of the value in addition to the key.


He mentions Tokyo as a successor to BDB, but I'm not sure in what respects that is true. I've used BDB in Keyspace and looked at Tokyo as an alternative, and they're not at all equivalent. BDB has tons more features than Tokyo. For example, concurrency in Tokyo is limited to one writer and many readers, which is equivalent to the so-called Concurrent Data Store (CDS) in BDB, but BDB also has a Transactional Data Store. Roughly, Tokyo and BDB's CDS use a global lock per db, where the TDS is more like a real database with finer-grained locks (and it's ACID). Also, the level of documentation is ~100:1 in favor of BDB. Nevertheless, I have many gripes with BDB which I'll write down some day soon.

He mentions Redis, but fails to mention the biggest limitation, which is that the entire dataset must fit into memory. To be fair, Redis seems to hit a sweet spot in terms of user requirements: lots of cases where data fits into memory. But I still wouldn't recommend it to a client, because what if his data storage needs grow? A few GBs are easily filled up, esp. if you're a web programmer and liberal about what you store.

CouchDB is mentioned. Having seen a recent video of Damien Katz where he describes how CouchDB came along, I feel it's more of an "MS Access for Web2.0", which is no surprise given DK worked on Lotus Notes. It's a lightweight integrated app+db server for Javascript programmers. It's not meant to be used at a Facebook scale operation.

This article, mentioning all these KV-stores together without differentiating them, is really confusing and adds to the general melting pot of "NoSQL", which is unfortunate.


> Nevertheless, I have many gripes with BDB which I'll write down some day soon

I'm considering BDB as the data storage layer for a new project, so I'd be very curious to hear your experiences with it.


He missed my favorite: http://www.persvr.org/

CouchDB requires custom views for each field you want to optimally query. Persevere (using the JavaScriptDB) does that natively and smartly for all fields: http://www.sitepen.com/blog/2009/04/20/javascriptdb-persever... (see section: Adaptive On-Demand Concurrent Indexing)

The biggest benefit of NoSQL dbs is the ability to add properties (i.e. columns) on the fly. Querying these properties should not be an O(n) operation. Unless I am completely mistaken, it is in CouchDB unless you write a custom view. Persevere doesn't require that and supports most common methods of querying ( http://docs.persvr.org/documentation/jsonquery ) If you query a field, it will index it and if you query it often it will maintain the index and update it regularly.

I'm just starting on a project that uses Persevere and already I'm quite impressed by it. It may not be as famous as CouchDB/Cassandra yet but it offers a lot of very interesting features including transaction support and nested objects with referencing.


CouchDB requires explicit indexes, but they can be arbitrarily complex, which is the right approach for a database focused on scalability. They are also fun to write (JavaScript + map/reduce).
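To illustrate the view model (sketched in Python rather than CouchDB's JavaScript, and without Couch's persistence or incremental index updates): a map function emits key/value pairs per document, and a reduce function folds the values for each key.

```python
from collections import defaultdict

docs = [
    {"_id": "a", "text": "to be or not to be"},
    {"_id": "b", "text": "be here now"},
]

def map_fn(doc):
    # Emit (word, 1) for every word in the document.
    for word in doc["text"].split():
        yield word, 1

def reduce_fn(values):
    return sum(values)

# Group emitted pairs by key, then reduce each group: the same shape
# CouchDB applies, minus storage and incremental view maintenance.
groups = defaultdict(list)
for doc in docs:
    for key, value in map_fn(doc):
        groups[key].append(value)

word_counts = {key: reduce_fn(vals) for key, vals in groups.items()}
```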

It appears that JavaScriptDB just maintains indexes of everything, until you have over 2000 objects of a class and then it only updates the indexes of fields when you search on them. This saves you the trouble of initially specifying indexes, but by the time your DB gets big enough that they become important, you'll have to manage them explicitly anyway, and you don't get anything besides plain old lexical sort. Persevere looks great for rapidly developing small projects, but not particularly scalable. The automatic index thing might be handy for a multi-faceted search, which is a pain to do in CouchDB.


> The biggest benefit of NoSQL dbs is the ability to add properties (i.e. columns) on the fly

Are you claiming that you can't add a column on the fly to a table in a relational database? Because that my friend is complete gibberish.


You can't let your users do things that add arbitrary 'columns' whenever they like, or grow to 10,000+ columns on every 'table' to model attributes.

That kind of usage works really well when the DB is an Entity-Attribute-Value tuplestore underneath. It's not so hot with tables, much less relational ones.


The fact that a system requiring ad-hoc attributes wouldn't work so "hot" in a relation isn't a problem with the relational model. This difficulty comes from the application of the relational model bringing to light that such a situation is necessarily more complex than one without ad-hoc attributes, and you cannot treat such data the same way as you can treat data in a relation. The important lesson is that you cannot abstract this added complexity away without losing something else of value (data integrity and simplicity of reasoning).

This gives you more power when your situation isn't ad-hoc attributes (by providing more guarantees) and lets you model the ad-hoc situation with an EAV model if you must. SQL maybe should provide some more sugar for accessing that EAV, but it's not a fundamental relational model issue.
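A minimal EAV sketch via Python's sqlite3 (table and attribute names are made up for illustration): one row per (entity, attribute) pair, so each entity can carry an arbitrary set of attributes, at the cost of a query instead of a plain row fetch to reassemble it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Entity-Attribute-Value: one row per (entity, attribute) pair.
conn.execute("""
    create table eav (
        entity    text,
        attribute text,
        value     text,
        primary key (entity, attribute)
    )
""")
conn.executemany(
    "insert into eav values (?, ?, ?)",
    [
        ("user:1", "name", "Popeye"),
        ("user:1", "cellphone", "555-0100"),  # only this entity has one
        ("user:2", "name", "Olive"),
    ],
)

# Reassembling an entity takes a query, not a single-row fetch; that is
# the extra complexity the comment above is pointing at.
rows = conn.execute(
    "select attribute, value from eav where entity = ?", ("user:1",)
).fetchall()
user1 = dict(rows)
```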


Sorry for the confusion. I only mentioned properties=>columns to differentiate from the concept of rows/tables/values etc.

What I meant was that different properties can be added to each object in the store without having to worry about the effect of the same properties on other objects in the same store. If you add "cellphone#" property to an object, it doesn't mean all objects in the store will now have a "cellphone#" property (defaulted to NULL). It means other objects will not even have a property "cellphone#" unless you go back to those objects and edit them.

This has nothing to do with the length of time it takes to add a column in whichever SQL database you use.


I think he's saying that if you have a large database in an rdb, adding the new column 'live' can take a prohibitively long time and kill access to the server, whereas a NoSQL db can add the new column without any large initial expensive operations on the db.

(Disclaimer: not a db guy.)


Once again, it is shown that the NoSQL crowd's complaints about relational databases in general are in fact only complaints about MySQL in particular.


Generalise much? I agree that what the parent complains about is a MySQL problem but it's pretty disingenuous to just dismiss the whole "NoSQL" idea based on one misinformed comment on HN.


Nah, this keeps coming up on HN, and 99% of the time, the thing that "SQL databases" just can't do is something that DB2, Oracle, Sybase, Informix et al have done for 15 years or more.


Well, I don't know what you've been reading here, but there are plenty of things you can do with one of the systems listed in that article that you can't do, or can't do well/easily/cheaply, with any conventional RDBMS. The listed systems are very disparate, but here's a couple of examples:

In Redis you can read or write ~100k operations a second. You can do set operations, add, rem, member, list .. way cool if you can stomach the memory use. Lazy sync to disk. Hard to think of anything that could beat it for collecting stats or totals etc for high volume data streams.
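The stats-collecting pattern looks roughly like this; the function names mirror real Redis commands (INCR, SADD, SCARD), but a plain dict stands in for the server so the sketch is self-contained:

```python
# Sketch of the "high-performance count server" pattern. Command names
# mirror Redis (INCR, SADD, SCARD); a dict stands in for the server so
# the example runs without one.
store = {}

def incr(key):
    store[key] = store.get(key, 0) + 1
    return store[key]

def sadd(key, member):
    store.setdefault(key, set()).add(member)

def scard(key):
    return len(store.get(key, set()))

# Count page hits and unique visitors per page:
for visitor in ["1.2.3.4", "5.6.7.8", "1.2.3.4"]:
    incr("hits:/index")
    sadd("visitors:/index", visitor)
```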

In CouchDB you can do M/R style indexed views, like word counts. You can access the DB directly from HTTP, no special drivers - the DB is implemented as a web server. You can authorise via OAuth (soon), secure via TLS and proxy through squid. You can access, and replicate, from the other side of the world.

Tokyo Cabinet is over 10 times faster than MySQL as a simple hashtable, like, say, a session store, which is the first place most sites start having trouble with their DB. Tiny, bare bones, drivers for all major languages, network accessible. A no-brainer for simple storage IMO.

And all of these are completely free, unlike the cripplingly expensive commercial alternatives you mentioned - especially important for startups.

So .. basically I completely disagree, there's a lot of good things about these alternative DBs. Don't discount them just because you heard a few dumb ill-informed comments on sites like this, most people involved in non-RDBMS DB implementations know exactly what they're doing.


This should be a constant time operation on most databases, because existing rows are not touched. However, if you have lots of properties that are not indexed or directly queried, I suggest having a single JSON field for them.
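A sketch of that suggestion using Python's sqlite3 and json modules (column and property names are illustrative): frequently queried fields stay first-class columns, and the long tail of rarely queried properties lives in one JSON blob.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Frequently queried fields get real columns; everything else is packed
# into a single JSON text column.
conn.execute(
    "create table users (id integer primary key, name text, extra text)"
)

extra = {"cellphone": "555-0100", "timezone": "UTC-5"}
conn.execute(
    "insert into users (name, extra) values (?, ?)",
    ("Popeye", json.dumps(extra)),
)

# Indexed columns are queried normally; the blob is decoded on read.
name, raw = conn.execute("select name, extra from users").fetchone()
props = json.loads(raw)
```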


Oops, that's what I get for repeating a single complaint I'd seen voiced in some other thread...

Apologies.


I'm going in the opposite direction from NoSQL. I'd like as much SQL as possible. I'd enjoy using a language with an in-memory database, one where you could create and discard databases as easily as a set, list, or map. To me, some of the issue here is that people are rebelling against a language (SQL) related to a branch of math (relational algebra) because they're dissatisfied with the way SQL persists and scales (which is debatable, but seems to have some merit).

So in this case, I think "nosql" is a misnomer.


You can do that pretty easily with SQLite:

        import sqlite3

        conn = sqlite3.connect(':memory:')
        conn.execute('''
        create table visits (plano text,
                             "column" text,
                             "row" text,
                             route text)
        ''')
        conn.execute('''
        create table square (plano text,
                             "column" text,
                             "row" text,
                             x int,
                             y int)
        ''')

        # infile and square_coords are assumed to be defined elsewhere
        for line in infile:
            if line.startswith('#'):
                continue
            plano, column, row, routes = line.split(',')
            column = column.lower()
            x, y = square_coords(plano, column, row)
            conn.execute('''insert into square (plano, "column", "row", x, y)
                            values (?, ?, ?, ?, ?)''',
                         (plano, column, row, x, y))
However, the trouble is that SQL is still kind of verbose, not type-safe, etc. I've been writing up some thoughts on how to do better at http://canonical.org/~kragen/binary-relations.html but I have a long way to go still.


That's pretty cool. The thing is, I'm actually not looking for a "conn" (connection) of any sort. I'm thinking of a database as a container class. Kind of like...

db = New db

db.addTable(colname, list[names], primaryKey)

db.addTable(joinTable, foreignKey, colname, list[names])

Map a = db.(select col a, col b from ...)

db = nil

and so forth... in other words, we're going the opposite direction from database as storage. Instead, we're going to database as something that is as in-memory as a hash (probably more likely to be persisted at some point, but not necessarily).

Sometimes, I see classes created that are bending over backward to recreate a bit of SQL, and the model and logic really only exist for one purpose: SQL-like methods on the data. People rarely write their own list, set or map classes, but often they will essentially recreate a bug-ridden half of common SQL in their app...

Anyway, looks like you've been thinking about this for a while, I'll definitely take a look at the link. Thanks for posting it.


It's not connected to anything. ":memory:" means "something that is as in-memory as a hash".
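For example, with Python's sqlite3 the whole database lives and dies with the connection object, which is about as disposable as a list or map:

```python
import sqlite3

# A ":memory:" database is created on connect and freed on close,
# disposable like any in-memory container.
db = sqlite3.connect(":memory:")
db.execute("create table t (x int)")
db.executemany("insert into t values (?)", [(1,), (2,), (3,)])
total = db.execute("select sum(x) from t").fetchone()[0]
db.close()  # the whole database is discarded here
```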


That's interesting, looks like I have something I need to read about.


  (connect '("localhost" "" "username" "passwd") :database-type :mysql)

  (create-database :users)

  (define-view-class user ()
    ((id     :initarg :id     :accessor user-id     :type integer
             :db-constraints (:key :not-null :auto-increment))
     (name   :initarg :name   :accessor user-name   :type string)
     (passwd :initarg :passwd :accessor user-passwd :type string)))

   (create-view-from-class 'user)
Then in the REPL

  CL-USER 1> (list-tables)
  ("users")
Now for some interaction:

  CL-USER 2> (make-instance 'user :name "Popeye" :passwd "sailorM0n")

  #<user>     ; <- instantiating an object creates a users.user record
Some time later, perhaps after a machine restart

  CL-USER 1> (select 'user)
  ((#<user>))
  CL-USER 2> (select 'user :where [= :name "Popeye"])
   (#<user>)
Say, you wanna change Popeye's password.

  CL-USER 3> (setf (user-passwd
                      (caar (select 'user :where [= :name "Popeye"])))
                   "new-password")
But typically you would just define a function, say, FIND-USER which will get you an object corresponding with a user record.

  (defun find-user (by val)
    (caar (select 'user :where (sql-= by val))))
That way you can find an object by ANY slot (or in SQL, find a record by the value of a column).

  CL-USER> (user-name (find-user 'id 0))
  "Popeye"
  CL-USER> (user-id   (find-user 'name "Popeye"))
  0
  CL-USER> (user-passwd (find-user 'name "Popeye"))
  "new-password"
That last value is due to our last update.

So you're bored with this class, and the auto-incremented primary key ID is MySQL only. Fine, redefine the ID slot mapping its value to a sequence

  CL-USER 3> (create-sequence 'user)
  CL-USER 4> (defclass user ()
                ((id  :initform (sequence-next 'user))))
The rest of the class definition stays the same.

  mysql > show tables;
  user  _clsql_sequence_user
:-) So you're bored with the database? Drop the classes/tables one by one, or just drop the whole database.

   CL-USER 5> (drop-sequence 'user)
   NIL
   CL-USER 6> (drop-view-from-class 'user)
   NIL
   CL-USER 7> (drop-database 'user)
   NIL

You can use the same code in MySQL, Postgres, DB2, Oracle, ODBC, SQLite, and an in-memory SQLite3 implementation. This shit is industrial-strength eye candy. Scan the manual:

http://clsql.b9.com/manual/

Tutorial:

http://www.ravenbrook.com/doc/2002/09/13/common-sql/

It comes in fully open source and corporate supported versions: the OSS one is called CLSQL and it's compatible with LispWorks' CommonSQL.


This is a common practice in Sybase-derived RDBMSs.


I don't agree with the 99.9%, in the slightest.

I don't dislike RDBMSes because of the problems scaling them; I dislike them because of the impedance mismatch, because I don't like the SQL language, and because I am forced into creating a data model as the starting point of my application when I might not have a clue about what model I actually want.

I am much happier now that I have dropped MySQL and am using Mnesia or CouchDB most of the time. It's not about needing more than MySQL/Postgres, it's just about striving for something that's a better fit.


I don't use databases myself, but the discussions over the last few weeks about "no-SQL" have gotten me curious and I have been nibbling at Codd's "The Relational Model for Database Management" since I had it laying around, and it is mostly theory I am interested in right now. He spent a great deal of time complaining that no vendor actually produced a database that was at all close to the relational model.

My question is whether any DB products, especially any free software, matches the relational model any better than the ones Codd was complaining about nearly two decades ago? Which databases, if any, are the closest to being really relational, not just sort of?


I find the new datastores interesting but I have my doubts.

Has anyone ever worked on a long project (1 year or more) using an XML storage or an Object datastore. They can be a real nightmare.

One reason: Users will always add new requirements:

* We want another application, with different programmers, to access this data. When you have problems, good luck defending your choice of datastore. And although the problems aren't related to that datastore, the programmers on the other team may place the blame on you.

* We want to query tables that didn't have relationships in the beginning of this project, but do now.

* We want to generate reports: Max, Min, Count, etc.


I'm surprised not to see MongoDB mentioned; it's the document-oriented database I find most intriguing:

http://www.mongodb.org

I don't know enough about its alternatives to offer meaningful comparisons, but things I really like about MongoDB in the brief time I've played with it:

- It's trivial to get started with.

- Everything just makes sense.

- No schemata to deal with.

- Excellent Ruby support right out of the box.

For the very simple use cases I tend to deal with, I've always defaulted to throwing my data into MySQL or SQLite, and I think MongoDB will replace them for me.


MongoDB is very nice. It's basically a bunch of tables with a binary form of JSON they call BSON instead of "rows". You can add an index to an arbitrary JSON "path" and it literally just works. I just wish they had broken out the layers a bit better. You are stuck with their default persistence for example which is very fast but not particularly safe. I'd sacrifice some speed for safer writes on certain collections.

And I am going to repeat the parent's comment that in MongoDB everything "just makes sense". It's just a big bucket with indexes. Everything you need, nothing you don't. So refreshing.


I mentioned MongoDB...briefly.


I take a different stance as well. I want a non-SQL RDBMS. I want transactions and declarative queries, etc, just without the brain-dead SQL syntax and string manipulation to write a query.


See my long winded reply to someone else in this thread. We have this, it's called CLSQL (the OSS version) or CommonSQL (the commercial version; each more identical than the other, btw ;-)

http://clsql.b9.com/manual/

Transactions are in chapter 5; unbrain-dead syntax, in both functional and OO versions, is there. A killer query syntax, in Lispy prefix, is there. It uses your database backend of choice (name any 8 major RDBMSes and it has support for them; it also has ODBC for the not-so-major ones).

It's a clean beautiful API, learn it, use it, clone it for your favorite language[1], etc.

--

[1] Clojure guys, you need this; throw a Lispy API over Java's orgy of database interfaces at once. Get your major API defined and "standardized" ASAP, otherwise clojure will be assimilated into Java, culturally, and a generation of misguided programmers will be writing Java code with parenthesis.


99.9% of web sites are CRUD, which don't really need the full features of an RDBMS, which is both overkill and a bottleneck for scalability.

NoSQL is an ironic word in that both Hive and Pig are gonna introduce an ANSI SQL dialect for scalable data queries :)

The NoSQL movement is really a NoBlindJoinsViaRDBMS movement, but the latter is less catchy.


Why can't everyone just use the best tool for the job and be done with it?





