Hacker News new | comments | show | ask | jobs | submit login
MongoDB vs. Clustrix Benchmark (sergeitsar.blogspot.com)
76 points by sergei 2137 days ago | hide | past | web | 65 comments | favorite

In this article MongoDB == NoSQL, that is not the case. Different NoSQL solutions have different use cases. Also IMHO MongoDB is pretty SQLish in the data model, so you are actually comparing two implementations of a similar data model here, and one may be superior to the other one or the other way around I guess. No surprise.

A more interesting attempt is IMHO to check how the difference in the data model of some NoSQL solution can lead to very different performances.

For instance Clustrix VS Redis can be interesting. Examples:

1) A lot of writes against a table where you require then to get things ordered by insertion time. With Redis is is just LPUSH + LRANGE. Try to do a read/write test where many clients are writing and reading at the same time (real world), against a table (or Redis list) with millions of elements.

2) Range queries when there are a lot of writes against this indexes. For instance a table with a score (we are modeling an online game leaders board), a lot of inserts of new scores. Get ranges between random intervals at the same time. Again, many clients writing, many reading.

So a guy who is an expert in Clustrix (and knows how to setup, tune, etc) compares it against some other technology that he does not know (and does not know how to setup, tune, etc) and comes to the surprising realization that his technology is better?

Where have I seen this before? Oh right.. every time I see "Technology A vs Technology B" comparisons.

Naturally his results are in his favor, otherwise he would not have posted them.

In fairness, you don't have to tune MongDB poorly (deliberately or otherwise) to get poor performance with a workload involving substantial numbers of writes; it's well-documented that there is a global lock (http://www.mongodb.org/display/DOCS/How+does+concurrency+wor...) that prevents reads during write operations.

That said, there are certainly nosql systems with better scaling and concurrency stories* than mongodb out there that he could have benchmarked against. :)

*I'm a cassandra committer

Just a point of clarification, but it is a per-server lock, not a global lock across the whole database.

The attacks on NoSQL seem a little harsh to me. People needed scalable systems. They needed them ASAP. Nobody in the RDBMS camp was even hinting at products targeting this market. Then, the NoSQL camp built systems that scale (albeit with compromises).

The fact that this has spurred the others to start building scalable RDBMSs is great. But let's not pretend these new RDBMSs won't have compromises, they'll just have different ones.

The important thing is for developers to make smart decisions about what tradeoffs make the most sense for scaling their application. Different applications will require different tradeoffs.

Disclaimer: I work for VoltDB.

First off, how do you download Clustrix? With MongoDB, simple as pie: http://www.mongodb.org/downloads

Now, where are the docs? Again, Mongo has great docs http://wiki.mongodb.org/display/DOCS/Home

How can I verify your claims?

Oh that's right, you call a salesperson first....

Well, there are quite a few commercial database vendors that offer parallel and clustered RDBMS products, and many of them appear to be quite good.

Unfortunately, they've got a terrible marketing problem.

Before 1998 or so, a relational database was an expensive product that you got from a vendor like Oracle. Since then, a generation of people have grown up that think about using a commercial RDBMS the same way most of us think about putting our hands in a toilet.

@va_coder hits the nail right on the head, it's not just the cost of the product, it's the cost of the buying process.

If I want to trial a product that is open source or has an OS or free edition (that could be mysql, mongodb, postgres or even Virtoso OpenLink or SQL Server Express) I can download it, read the docs and play around with it and learn a lot in a few hours. I might learn that the product is not for me, or I might get a positive impression and feel ready to commit coding time to it.

If I want to trial Oracle or Clustrix, well, I'm going to have to start a contact with a sales organization and then they need to pre-qualify me, and then they need to qualify me and then I'll spend a few hours on the phone talking to people (which might take a few weeks in wall-clock time.)

Even if they give me a 30 day free trial, I could easily spend $500+ of my time just getting the trial... And once I've gotten to the point where I'm negotiating with one vendor I'm going to feel a lot of pressure (internally or from my superiors) to talk with some competitive vendors too to make sure I'm making the right decision.

It's a shame because, certainly, a company like Clustrix could use the revenue they get from product sales to support an awesome development team and really deliver a better product. On the other hand, they have marketing channels that are aimed at large organizations that can afford an expensive buying experience... and it's an expensive selling process for them when they've got to do "complex sales" that require approvals from a large number of stakeholders. They've got to pass those costs onto you.

The trouble with this model is that tomorrow's large organizations are today's small organizations. Today, Facebook could afford just about any commercial software that's out there. However, they made critical technology decisions (that are difficult to reverse) back when they were a little company that could only afford MySQL.

On the other hand, allowing potential customers to easily try your product can destroy your chances at gaining them.

Here's a quick example:

A few years back, I was writing Smalltalk web systems and evaluating which commercial implementation to use. The two big ones that bubbled to the top of the list then were Gemstone and Cincom.

Cincom offered a relatively familiar set up which would require using sticky sessions for the web server, and another backend database like PostgreSQL.

Gemstone offered pearls on a platter: shared session state across all the backends, and a build in distributed object database. I thought it would be brilliant, and reading their blogs and documentations gave no hint of any pitfalls.

Cimcom was a download away---as were any of the free smalltalks---but Gemstone then required you to contact their sales staff before playing with it. So I made a mistake and chose Gemstone, and the system was built to target their as of yet unseen architecture. When I finally got to use Gemstone, it was a mess!

If I had that experience to start with rather than just their glowing self-produced reports, I'd have never picked Gemstone and instead gone with the familiar "kludgy" system that was "good enough".

If I want to trial Oracle or Clustrix, well, I'm going to have to start a contact with a sales organization and then they need to pre-qualify me, and then they need to qualify me and then I'll spend a few hours on the phone talking to people (which might take a few weeks in wall-clock time.)

If Clustrix offered cloud-based deployment of their database, they could have a small free tier for people to play around with. Add a console applicaiton for accessing the API. Make it really developer friendly.

Imagine if you could just type:

    $ sudo gem install clustrix
    $ clustrix create my-test-app
    Created cloud.clustrix.com/my-test-app
    username: my-test-app
    password: xd634shx
If they had that, I'd be trying it out in a flash. Then what if they had a per-usage pricing system like AWS? Something where you pay per gigabyte. Something that allows you to test stuff out for a few dollars a mount, and then when you're ready, scale it right up to a few thousand or more.

Presumably at some point they will offer a cloud-based service. It particularly makes sense given the fundamental design of their product.

Since they only launched their product quite recently, it is understandable that they would want to test it with several customers before offering it as a service.

In fact, if the product scales well enough, it would be perfect for deploying as a cloud-based service and could be very successful as such.

Ah, I didn't realise their product was only recently released.

Small nit: You can download Oracle 11g (or 10g) strait off their website for free — you just need to sign up for a free developer account.

Here's a link: http://www.oracle.com/technetwork/database/enterprise-editio...

Another point to notice: He seems to have run his benchmarks under very trivial data-schemas with a single/simple table. A big (speed, simplicity) advantage of No-SQL is the ability to embed lots of data within the parent model and manage a single table where you would have to manage many in a SQL database.

I would be very interested to see a comparison where a large and "real" data model (that contains 6-7 "joined" tables for the SQL setup and a single table with the embedded document model for the No-SQL setup) is injected into each of the technologies.

Also, it is just horrible style not to include the benchmark code for peer review. Delivers near-0 credibility.

Excellent point. MongoDB doesn't claim to be any faster than other dbs at the simple stuff (in fact, it's often slower because we haven't had years to optimize everything). The speed gains that people usually see are because they can just fetch one document, instead of doing complex joins or aggregations.

Indeed. A lot of the reason I like document databases so much is that you can solve problems differently (and many times, much more easily) than you could in a relational database. Compare:

    db.posts.find({tags: {$in: ["foo", "bar"]}})

    SELECT * from posts JOIN taggings ON taggings.post_id = posts.id JOIN tags ON tags.tagging_id = taggings.id WHERE tags.tag IN ('foo', 'bar');
(Single query tag lookup; naively joins the entire taggings and tags tables before limiting with WHERE)

Or a "better" query with two subselects (yikes!)

    SELECT * FROM posts where post_id IN (SELECT taggings.post_id FROM taggings WHERE taggings.tag_id IN (SELECT id FROM tags WHERE tags.name IN ('foo', 'bar')))
And that's the "find where any tag matches" case. Try the "when all tags match" case ($all in MongoDB), and you'll go grey a few years earlier.

> (Single query tag lookup; naively joins the entire taggings and tags tables before limiting with WHERE)

Really? Which modern RDBMS database actually works like this when indexes are available?

Index selection and optimization is a standard feature of even the simplest relational database system. JOINs on keys with high selectivity (especially unique keys) are extremely efficient in modern database systems - in some common cases just as efficient as pulling two columns out of the same table. Subselects are something that several database systems (MySQL, for example) really suck at - but they are easily avoided in most schemas.

> Try the "when all tags match" case ($all in MongoDB), and you'll go grey a few years earlier.

Depending on your database engine, even very large AND chains can be extremely efficiently computed on an indexed column. For larger use-cases, there are good ways to avoid this problem using careful schema design.

There are legitimate reasons why relational databases aren't the solution to every database problem. Claiming issues which clearly do not exist is not a good way to get points for the other side.

Looking at your first SELECT, there's very few RDBMSs on the market that won't evaluate that WHERE clause prior to the JOIN.

(disclaimer: I work for Clustrix)

relational folks can denormalize when they want to

They get strangled by their DBA before they're finished with it.

And: the point of document stores is that you can denormalize (i.e. put lots of stuff into one data item) and _still_ use indices into the lots of stuff. Most commercial RDBMSes have means and ways to do that (e.g. storage of XML content), but afaik it's not standardized and would tie you closely to a single ($$$$$$) database vendor who will not hesitate to make you bleed whenever they can.

the things like olap and intentional denormalization is not that dba are against of.

by denormalization i didnt mean keeping xml in a clob, but keeping the values in a row as if several tables are already joined into one wide table, thus eliminating the need in joins.

Serious question, because I don't know: what's the advantage over a denormalized SQL table?

because document stores store documents ready to be consumed by an app. these documents are tree-like, not tabular: jsons or xmls

Price is the elephant in the room here; NoSQL exists because people aren't willing to pay for real databases. Building yet another expensive (i.e. > $0) database that doesn't even work in the cloud won't help the Web 2.0 crowd.

Well, that's partly right. Yes, clustrix is more expensive than the open-source databases.

But, companies like Twitter and Netflix certainly have the budget for Exadata (which is what clustrix wants to be when it grows up); they are using Cassandra instead not just because it scales on commodity hardware (the other price factor besides licensing) but also because it works across multiple datacenters which two-phase commit systems can't do no matter how big your budget is, and because its availability (failure tolerance) model is much more robust.

(Many NoSQL systems don't provide these advantages either, which is why lumping all non-relational systems together is usually not helpful.)

These folks disagree with you. There are many more behind them.


And plenty of folks use MySQL (and PogreSQL to a much lesser extent). You just can't scale those.

I can scale PostgresSQL pretty well, thanks. PostgreSQL is used on some of the biggest databases in the world, but no one's going to tell you which. :)

Your benchmark is pretty much irrelevant because Clustrix sells hardware appliances, not the software standalone.

Edit: That's not to say MongoDB didn't fail your benchmark. But the issues you highlight (single mutex) are known. Test against a real database and your results don't look that impressive for a 10 node setup.

All that said, Clustrix looks great. Just make it available sans-hardware.

Last time I heard, twitter still uses MySQL for the statuses (tweets) table. They did have a plan to migrate to Cassandra, but didn't go all the way through with it. So I find it hard to agree with "you just can't scale those". It may be a lot of work, but for some applications you can scale them.

Edit: twitter cassandra link (don't know if this is the latest): http://engineering.twitter.com/2010/07/cassandra-at-twitter-...

It's worth noting that Twitter has built a lot on top of MySQL to get to the scale they're at. Take FlockDB for example, https://github.com/twitter/flockdb

So I guess that statement should be written as "you can't scale with just those". :)

You can't scale MySQL/PostgreSQL? Care to expand? (Although Facebook's schema is likely an abomination flying in the face of every normalized form, it does run on MySQL. And it's not like throwing PostgreSQL on a big box doesn't go far.)

I was going to make the same comment. Even very profitable companies are usually price sensitive so +1 for open source.

Profitable companies have usually figured out that it's better to buy a solution that solves their problem rather than: (a) pay for engineer time to integrate a solution that isn't quite right, (b) take the opportunity cost hit while waiting for (a) to be done. These costs can pretty quickly dwarf the purchase costs of licenses and hardware. Of course if there's a free solution that just slots right in, double bonus, but for the most part the cost of hardware/software is noise next to the cost of people to maintain it.

Winning a benchmark against MongoDB on a non-trivial workload is a little bit like winning the special olympics.

I'd be more curious to see how Clustrix performs against Cassandra, Riak or HBase in their respective domains. Those seem to be the more serious contenders when it comes to "Big Data".

Would you care to elaborate? Seriously, do you have any pointers or benchmarks for a Mongo vs Cassandra vs X comparison that isn't purely anecdotal, and which takes the relative strengths of each into account in a robust way? A lot of blog posts with performance numbers seem anecdotal or loaded towards one option over the other.

I'm excited by some of the underlying technology in Mongo and its ilk (gossiping protocols, etc.), but there's no doubt that they require different programming techniques than traditional RDBMS. Bit like apples and oranges, isn't it?

EDIT: I noticed you changed RDBMS to "big data". I'm still curious if you have any pointers to fair benchmarks though.

Well, don't trust a benchmark that you haven't faked yourself they say.

I was just trying to point out that MongoDB is too easy a target here. The problems it has under high load are fairly well-known, at least to anyone who tried to benchmark it outside of their MacBooks. Just bulk-load a couple million records and watch it tip over if you don't believe me - I'm not making it up and neither is Sergei.

However if Clustrix wants to impress with benchmarks then they should pick an equal opponent. MongoDB is not exactly relevant for companies that consider the calibre (and cost!) of Clustrix.

I am a Mongo fan in some respects, but I agree that every benchmarking I do with any significant data involves the painfully slow bulk writes to get it started -- not a good first impression. Can some Mongo people help explain this?

I chose Mongo because it gets a lot more attention on HN than any other database. I don't remember the last time I saw a post on Cassandra on here...

I think I see Redis more than Mongo.

I get the impression that Mongo is simple to use and simple to program for; Cassandra is quite good, but also more complicated.

i guess rdbms will never show "older tweets are temporary unavailable". because we don't need that, its our special domain! :)

//ps. i am a huge nosql fan dont downvote! :)

Pet peeve: eventual consistency isn't for scalability and performance, it's for availability. In a well designed system, the whole debate only matters during in a failure condition: a strongly consistent system gives up availability upon a certain kind of failure, an eventually consistent system gives up consistency upon a certain kind of failure.

There are strongly consistent scalable "NoSQL" systems e.g., BigTable. Megastore even provides complex distributed cross row transactions.

In a well tuned system, loss of availability (in a failure scenario) could be minimize to seconds. What this means for performance is that you have systems which encounter second-long latency spikes upon failures.

It also isn't a binary switch:

1) Quorums can be used to achieve read-your-write consistency in the case of simple failures (loss of 1 node out of 3).

2) There's multiple kinds of relaxed consistency models. One of them is serializable consistency: you may get a stale read, but the the order of reads is the same as the order of writes. This is used by PNUTs and can be achieved by serializing the writes through a single master. This means that there's (again, for a short time period) loss of write availability in a failure, but there's no loss of read availability.

3) Paxos/multi-Paxos can be used to achieve atomic writes (all available nodes receive the write) while withstanding simple failures (similar to quorum protocols... and I believe multi-Paxos uses quorum protocols under the cover to improve liveness over "raw" Paxos). This is at the cost of higher latency (and complexity). [Edit: In this case, you're dealing with full-blown strong consistency, but with -- at the cost of latency -- the ability to tolerate certain kinds of simple failures/trivial partitions]

Clustrix looks interesting, but it looks like it addresses the scalability and performance issues with RDBMS, but not the availability feature. If an RDBMS were to drop the "A" and "I" in ACID (C in "ACID" means serializable view of the execution, which is not the same as the C in "CAP": the latter means all nodes in a cluster agreeing upon what the data is which isn't required for the former), it would be possible to build a highly available, low latency RDBMS; but it would also not be as useful without atomic and isolated cross row transactions.

[Disclaimer: I work on a Dynamo-style database, but fond of PNUTs as an architecture (more difficult to implement, but IMO a better fit for plurality of web applications) and generally fascinated and interested in distributed systems, databases and systems programming in general]

I like the content of this post, but I take issue with the idea that the debate only matters in a failure condition. That's academically true, but practically misleading. When developing my app, I have to assume any part of the system could fail at any time. Thus my whole app needs to be written with failure in mind.

In an ACID system, that means my transactions either happen or don't, and it's on me to figure out whether they did or not. Sometimes I'll need to do some investigating after a failure to find out what succeeded and what rolled back.

In an eventually consistent system, I need a contingency plan for every operation on the system. Now that contingency plan might be very simple, such as "Whatever happened, happened and last writer wins." The CALM conjecture paper does a decent job of describing when this might be a workable contingency plan. More complicated operations require increasingly complex contingency plans. This is why Cassandra is now offering counters as an API feature, because something as simple as counters is really hard to get right in all failure scenarios.

If you fit the CALM description and require multi-data center availability, then EC is probably a good bet. It may still be a good bet otherwise, but the availability comes along with a much more difficult app development picture, failures or no.

> This is why Cassandra is now offering counters as an API feature

This is specific to Cassandra, where the development team chose not to implement optimistic locking via version vectors. That said the counters patch essentially implements a vector of counters, very similar conceptually to a vector clock.

That said, there are some scenarios where quorums (in absence of an agreed upon view of the cluster e.g., a zookeeper-based failure detectors) could mean concurrent vector clocks, which means it's up to the application to reconcile it.

If your application can not reconcile a split brain scenario on its own and depends on a total order then yes, it's not a good fit for EC.

As I've said, I also think that serializable consistency is easier for applications to deal with (they can assume a total order) than eventual consistency but provides the strong benefit of read availability (something, quite frankly, most Internet applications require).

Most complex applications have points where consistency can be relaxed (either somewhat or fully) and points where consistency is needed.

In "we chose consistency" and "we chose availability" cases, the applications will have to deal with the consequences of those choices: much as developers may handwave the implications of an eventually consistent system (since it returns right results most of the time, not thinking whether or not their application is sensitive to a non-serializable order) they might also implement ad-hoc, poorly thought out HA solutions on top of strong consistent systems -- often times losing both availability and consistency.

The typical usage scenario of masking MySQL failures and latency by using memcache comes to mind. VoltDB is very correct to remove the need to use memcached for read latency (a caching layer in front of a database which already a cache which sits on top of a file system with a cache is a caching layer too many), but in many cases I've seen memcached used to accept reads (and sometimes writes) while the underlying database is unavailable.

Finally, the whole debate about atomicity is quite meaningless if (like many applications) you're fronting your data access layer with an RPC framework (your RPC call may go through to a database and perform a write even it times out to the application i.e., implying non-atomicity).

The CALM paper is a big move into the right direction: being able to determine synchronization points at which ACID transactions (or weaker forms of synchronizations if permissible e.g., quorums and version vectors) can be used. Incidentally, I also think a declarative language (whether like Bloom/Bud or closer to SQL) is a good way to express the constraints and let a system (which may not be all that different from a query planner) determine whether those points occur.

Overall, another solid comment.

As for RPC, whether the client is informed of success or failure has nothing to do with atomicity. The guarantee is that the transaction either happens completely or not at all, and nothing more.

Agreed, it can be very frustrating that under certain failure scenarios, it's unclear whether the transaction completed or rolled back. Still, given atomicity, a correctly designed system can't be left in an inconsistent state. At times, it's up to the application to discover (or re-discover) that state.

Systems that offer atomicity for single record operations can actually make the same assurances, but may require complicated escrow systems and compensating transactions.

Eventual consistency often gives up the ability to develop applications against the system, because more reasoning about the state of the system has to be embedded in the application logic. For a better description of this problem, see this post from an AWS employee:


Note that his "right answer" for performance (not availability, so it's a bit of an aside from your comment) is to dynamically repartition, which is what Clustrix does. Note that we can actually do this type of repartitioning on just the "hot" data, using MVCC to avoid any downtime (unlike Mongo, which blocks all writes).

By the way, the Google Megastore project I believe you're referring to is not actually eventually consistent in the same way that, say, Cassandra is. Megastore is fully ACID within an entity group, meaning that lowering eventually consistency down to the per-node level was not the solution that Google went with. BigTable is also not eventually consistent.

*Clustrix employee

Entirely agree with you, eventual consistency and quorums protocols all have their drawbacks. Difficult of developing against that (and the natural reaction of just "handwaving" the consistency model rather than using optimistic locking with a vector clock at the identified synchronization points) is the main one. I have, however, also seen the situation where developers would fail to understand how to correctly deal with availability and build ad-hoc, poorly thought-out HA layers on top of non-HA systems. The typical memcache + MySQL deployment is the prime example of that: there's neither the high availability, nor an understandable consistency/cache coherence model.

> Note that his "right answer" for performance (not availability, so it's a bit of an aside from your comment) is to dynamically repartition, which is what Clustrix does. Note that we can actually do this type of repartitioning on just the "hot" data, using MVCC to avoid any downtime (unlike Mongo, which blocks all writes).

Dynamic repartitioning is also what HBase/BigTable do ("tablet splitting"). The MVCC approach is interesting way around the unavailability window, by the way.

> By the way, the Google Megastore project I believe you're referring to is not actually eventually consistent in the same way that, say, Cassandra is. Megastore is fully ACID within an entity group, meaning that lowering eventually consistency down to the per-node level was not the solution that Google went with. BigTable is also not eventually consistent.

Yes, I did not mean that Megastore is eventually consistent. It is not, it's ACID compliant and uses Paxos. Real question: was this not stated clearly in my comment (I'd like to edit to clarify if it is)?

Megastore, however, is an example of a "NoSQL" system and I meant to use it as proof that not all "NoSQL" systems give up consistency: ACID is orthogonal to the query language.

Quite frankly I find the whole query language debate (and "NoSQL" marketing gimmick) to be quite pointless, the distributed systems and systems programming aspects are a lot more interesting to me.

> Yes, I did not mean that Megastore is eventually consistent. It is not, it's ACID compliant and uses Paxos. Real question: was this not stated clearly in my comment (I'd like to edit to clarify if it is)?

I was interpreting the nod to Megastore in the context of the first sentence, which seemed pretty strongly about eventual consistency. Nothing to worry about, though - your followup comments are very clear!

> Megastore, however, is an example of a "NoSQL" system

Looking at Google's recent papers, they actually seem to be moving more in a SQL-ish direction. Dremel, in particular, appears to be a variant of SQL that allows easier access to nested and/or repeated columns. I thought Megastore was in a similar vein, but at that moment I can't seem to find the link that made me believe that (other than seeing that it uses strongly typed schemas).

Regardless of the benefits of a SQL-like language as an interface, however, I think you have a good point: the underlying architectural decisions about things like consistency or stability are very interesting, but often obscured.

A large part of the message I'm trying to convey is:

1. A DBMS is much more than just the interface. I'm going to write more on the subject. Whether you're using SQL, datalog, BSON, etc. -- there's a broader set of desirable features that's more important than the query language.

2. I'm not saying I agree with what the NoSQL folks are saying, or their justifications. Let's be honest: there's a prevalent sentiment that relational SQL based databases do not scale, and even further, that they somehow cannot scale. That's just not true.

I saw a video the other day of some guy at Google giving a talk about the database behind the app engine. At one point someone in the audience asked about scale and SQL. His response was "Well, how well does SQL scale?" Everyone in the room laughed.

> I saw a video the other day of some guy at Google giving a talk about the database behind the app engine. At one point someone in the audience asked about scale and SQL. His response was "Well, how well does SQL scale?" Everyone in the room laughed.

Sell your strength, don't attack others weakness. Demonstrate how your product is actually very similar to megastore (strong consistency, use of distributed transactions) and differs only in the query language ("GQL" itself is very similar to SQL to the point where an application developer could care less about which one is using). It looks like this is case the case (except I believe you guys are using 2PC instead of Paxos, in which case you want to clearly state the availability scenarios which Paxos addresses and at what costs).

Of course the problem with systems like megastore is that the overhead they introduce is seldom worth the compromises the application developers have already made. Incidentally, form what I hear, Megastore is not used very frequently outide of the app engine: application developers at Google itself are comfortable with using Spanner/BigTable directly but aren't comfortable with the additional overhead of distributed transactions as used by Megastore. App engine, customers, on the other hand aren't particularly interested in low a in the first place but aren't used to design patterns used on top of systems like BigTable.

By the way, I think Clustrix is a great product. I remember reading the white papers/specs when it came out and I'll make sure to scan it over again. Just why in the (imo, useless) SQL/NoSQL marketing circus rather than focus on the product's strength?

I personally think MongoDB is a poor comparison choice: it has the buzz, but there are some serious flaws in its distribution architecture (namely ill defined consistency and partitioning model) and the outdated disk persistence story (Mendel Rosenblum wrote his dissertation in 1992; building a log structured database is no less risky than cloning circa-1995 MyISAM). Comparison with another strongly consistent system like HBase or HyperTable would be more fit.

He did go into more on the on fault-tolerance and availability of Clustrix in another post:


(nb - I also work at Clustrix)

A blog post by the founder, without posting the benchmarks themselves? How can anyone expect to take this seriously?

OK. Fair enough. I'll post the benchmarks.

You can't randomly throw numbers around without listing your benchmark code, the config files, etc.

Secondly: even though it becomes clear eventually, you should mention up front your relationship with Clustrix. Just because you are a founder of Clustrix doesn't necessarily invalidate your findings, but full disclosure is always a good idea.

Along with licensing costs?

He talks about populating MongoDB with "rows". No one should be using (or benchmarking) a technology unless they understand what it actually is, and what problems it aims to solve.

The term "rows" could certainly be replaced with "tuples", "objects", or whatever else you prefer without changing any meaning of the post. If you look at the benchmarks posted on Mongo's site


you can see that most of them compare Mongo against MySQL. Certainly "rows" are being inserted into the latter.

has anyone out there actually used or evaluated clustrix? i haven't been able to find anything on the internet about this thing that hasn't come straight from the company.

Spending millions on engineer time to avoid buying a real database is not good finance either. At least with this approach you can start with the (free) MySQL and only pay once you're sure your idea has traction.

> Interestingly enough, we never heard that SQL or the relational model was the root of all their problems.

i really think that sometimes SQL and the relational model is a problem. at least, the relational database design is a course in a university, along with (object oriented) programming. so, to properly use postgre or mysql in your sophomore web startup, you should know two things well, or have a clever db guy...

i have seen brilliant web programmers that design nightmare database schemas.

So relational data is hard but implementing appropriate consistency checks in your application is easy? Or do you just skip that second part and hope for the best?

imho, the second, except that some consistency is present ib document dbs: like in the order example, all order items will be within the document and wont be lost without losing the whole order.

no-one would ever use 'eventually consistent' databases for financial or trading data, i think. they're for higly scalable consumer web projects

I updated the post with the benchmark source.

you updated your post and i am waiting people here at hacker news to criticize your benchmark setup.

maybe i missed smth but i still saw only the suggestion for you now send your benchmark to mongodb guys.

edit: please discard, now i see there they say you should include joins to your clustrix benchmark. waiting for the reply


I think it was a mistake to put this post on a blog with no other content. I think it leaves the reader with the impression that this is the only thing you want to contribute to the community... and as other commenters have pointed out that contribution could be perceived negatively.

Something to think about next time.

Every blog has a first post. What's so bad about that?

How long does it take to add a column to that table with 300MM rows? Does Clustrix require schema modifications?

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact