Cassandra vs MongoDB For Time Series Data (relistan.com)
100 points by jbellis on Aug 21, 2013 | 81 comments



One thing I always find interesting about these kinds of problems is that most DBs don't describe how they're implemented. It's easy to use the wrong tool, and then once you learn how e.g. Mongo is implemented, it's obvious, "oh, that's why things are slow".

I'd love to see http://eagain.net/articles/git-for-computer-scientists/, but for every DB technology.


Agreed. Seeing a nice, clean user-facing API tells you practically nothing about the performance characteristics and failure modes you can expect to see.

The best database engine I've encountered in this respect is SQLite. It has plenty of information about what design tradeoffs it makes, and why, e.g.:

http://sqlite.org/lockingv3.html

http://sqlite.org/fileformat2.html

http://sqlite.org/atomiccommit.html


You nailed it. I think the issue is that people don't know what questions to ask when gathering the requirements.

I'd like to know more about how this part of the article came to be:

"This choice was made early on and it was supposed to be a temporary one."

HOW was that choice made? What requirements were out there? I think too many people choose Mongo because they believe it's "schemaless"[1] and faster for development, but don't look at the requirements for their actual use case.

[1] - There's always a schema. Either it's informally defined by your code or represented formally somewhere else.


It was one of those, "we need to do something now and this will work" solutions. We had a really talented consultant working for us, writing some of the early code. He was familiar with Mongo and wanted to go that route. Early on I said we should use Cassandra for this, but it took us quite some time to get to the point of being able to migrate. A testament to his code and the "this will be temporary" foreknowledge is the fact that we could swap data stores in a pretty massive sense without much in the way of outward-facing changes to our processing engine.

I think the hard thing for us up front was that trying to explain Cassandra data models to someone (and this guy is really, really good) and then handing the rest of the work over to them to implement right away, at a contracting rate, is not a trivial problem. And we needed to hurry because of both deadlines and burn rate.


Cool, thanks for the info!


A pattern I've seen on a lot of the negative Mongo articles has been people using it for things they probably shouldn't.

I've yet to use it under any load, and am struggling to triangulate from all the articles I read whether it does/doesn't scale efficiently.

All the issues I have hit so far have been self-inflicted; it is still one of the best new technologies I've used in years - but it has taken a while to stop thinking in SQL equivalents and start thinking natively.


> whether [Mongo] does/doesn't scale efficiently

It doesn't. Three words: "global write lock". Writes block reads, reads block writes. Implications: if you run a query in production that doesn't hit an index, all traffic stops. The notablescan setting is a very, very good idea. This also means all queries must have an index, so Mongo ends up with more indexes than, say, Postgres would.
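
To make that concrete, here's roughly what guarding against unindexed queries looks like with pymongo (the database, collection, and field names are just placeholders, not anything from the article):

    from pymongo import MongoClient, ASCENDING

    client = MongoClient('localhost', 27017)

    # Refuse queries that would do a full collection scan; this is the
    # same as starting mongod with --setParameter notablescan=1.
    client.admin.command('setParameter', 1, notablescan=1)

    db = client.mydb  # placeholder database name

    # Every query path needs an index, otherwise the scan holds the
    # lock long enough to stall everything else on the server.
    db.events.create_index([('user_id', ASCENDING), ('ts', ASCENDING)])

    # Uses the index above; with notablescan on, an unindexed query
    # would fail loudly instead of quietly blocking traffic.
    recent = db.events.find({'user_id': 42}).sort('ts', -1).limit(100)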

It's impossible to configure a clustered mongo environment to not lose data: http://aphyr.com/posts/284-call-me-maybe-mongodb

Sharding configuration is baroque, and limited.


> Implications: if you run a query in production that doesn't hit an index, all traffic stops

Even if Mongo did have a global write lock (which it doesn't, as has already been covered), it yields on page faults, which means that other queries are minimally impacted. See: http://docs.mongodb.org/manual/faq/concurrency/#does-a-read-...


Mongo does have a global write lock. A per-db lock does me very little good when I only have one DB.

As to your linked doc, emphasis added:

> In some situations, read and write operations can yield their locks.

> Long running read and write operations, such as queries, updates, and deletes, yield under many conditions.

In practice, I've been bitten hard by this. A new feature rolls out, and users can't log in anymore, because a query is taking 2 minutes to run.


That solves the contention problem, but just by virtue of yielding on page faults being a big feature, you can see where Mongo does have some scaling issues.


The global lock was removed in 2.2

https://blog.serverdensity.com/goodbye-global-lock-mongodb-2...

Now locking is on the database level.


That sorta makes it sound worse, honestly, if databases in MongoDB are anything like you'd use the word for in any other datastore. A global lock across all databases is mind-bogglingly confusing; what's one DB got to do with another?

Couldn't you just work around that before by running a separate Mongo process per database?

Saying it's now on the database still means any single-database app is globally locked. Or does using Mongo imply you're going to be making lots of databases so this actually means anything?


That's what we do - dozens of databases, containing one collection.

It's perverse.


TokuMX. They ripped out the native MongoDB storage engine and replaced it with the TokuDB engine.

http://www.xaprb.com/blog/2013/04/29/what-tokudb-might-mean-...


TokuDB doesn't currently have a drop-in replacement strategy, so you'd need to migrate your whole cluster.

A daunting task with hundreds of shards!


Such a thing could never exist, because the point of tokumx is to change the storage system, so at some point you have to change the storage over and that's just going to be a rewrite of all your data. It sucks but that's the way it is.


Yes, but if I could replace the replicas and let them refill one at a time, then replace the masters, I could do a rolling conversion.

But replication from MongoDB to TokuDB does not work.


Actually, you can do that. I wrote a tool to allow replication from MongoDB to TokuMX: http://www.tokutek.com/2013/07/tokumx-1-0-3-seamless-migrati.... It doesn't allow TokuMX instances to satisfy write concern or to participate in elections, but you can slave a TokuMX replica set off a MongoDB replica set just fine, and this allows you to do a rolling migration where you replace one secondary at a time.


You could say the same about multi-master environments. Those are prime opportunities for losing data (eventually consistent does not mean consistent, it means data loss)


Hopefully this article wasn't seen as bashing MongoDB. I think I mention it was pretty reliable for us even under load. We were just solving a problem with it for which it wasn't the best solution.


It doesn't. MongoDB is a dog, and I know this because I run hundreds of instances of it in a production environment. I need to use 10x as many hosts as I should to support less than 30K queries per second.

The second I can migrate to another data store, I will. Unfortunately that kind of refactor isn't possible right now, but all new projects are using different tools.


For what kind of application or workload is MongoDB superior to the alternatives?


The kind that lives on a single VPS and requires a rich query layer to bootstrap quickly.


> I'd love to see [git-for-computer-scientists], but for every DB technology.

This is a fantastic idea. If someone gets this going I'll enthusiastically contribute.


+1

It would be useful to treat databases as big data structures, knowing the best/average/worst case and CPU vs. memory trade-offs for search, etc.


I work on SCADA systems, which usually come with a time series database built in so that operators can do some basic plotting. Often these products have a 10- or 15-year legacy.

The SCADA vendors seem careful to avoid making the time series database and plotting tools which come with the HMI packages too powerful, as this might cut into their sales of Historian-type products.

If it wasn't a commodity already, the rash of startups which all seem to write their own tools for storing and plotting metrics from the operations of their servers and software services has certainly made the guts of a capable historian readily available for free.

For storing data:

- OpenTSDB, based on HBase+Hadoop: http://opentsdb.net/
- KairosDB, based on Cassandra: https://code.google.com/p/kairosdb/
- TimeSeriesFramework: http://timeseriesframework.codeplex.com/

For plotting data I am hoping to find a library that allows for real-time plotting and zooming/scrolling with the mouse wheel. So far I have found:

- openHistorian: http://openhistorian.codeplex.com/
- Kst: http://sourceforge.net/projects/kst/
- Veusz: http://home.gna.org/veusz/
- Chaco: http://docs.enthought.com/chaco/
- guiqwt: https://code.google.com/p/guiqwt/
- pyqtgraph: http://www.pyqtgraph.org/
- lots of D3-based libraries: http://selection.datavisualization.ch/

So there are lots of tools out there if you've got the patience to figure out which one is right for your application and glue it together.


I had to look up "scada".

http://en.wikipedia.org/wiki/SCADA for those of you in the same boat


What about Graphite (http://graphite.wikidot.com/faq)? I've never tried it, but it is described as "Scalable Realtime Graphing". It seems to use an internal database, so maybe it's not OK for you.

EDIT: I read another comment from you; you said it's RRD-like so it gets rid of old data - not what you're looking for.


The standard 1s time resolution on Graphite/Whisper also seemed to be a limiting factor for use with some of these systems, where you want to observe things on the order of milliseconds (or beyond).


You might be interested in Cube, which Square released a few years ago. http://corner.squareup.com/2011/09/cube.html

edit: cube, not cuba :)


With something like Kairos, how do you segment the time series data? Like if I wanted to use it to keep track of stats for 2 different customers, how would that work?


KairosDB (https://github.com/proofpoint/kairosdb) is a time-series database built on Cassandra. It's an OpenTSDB (http://opentsdb.net) rewrite that supports sub-second precision.
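
On the question above about segmenting per customer: datapoints are tagged, and you'd typically put the customer (or host, etc.) in a tag and then filter or group by it at query time. A rough sketch of pushing a sample over its REST API (the metric and tag names are made up, and I'm going from memory on the endpoint):

    import json
    import time
    import requests

    # One sample, tagged by customer so each customer's series can be
    # queried (or grouped) separately later.
    payload = [{
        "name": "api.response_time",
        "datapoints": [[int(time.time() * 1000), 123.4]],  # [epoch ms, value]
        "tags": {"customer": "acme"},
    }]

    requests.post("http://localhost:8080/api/v1/datapoints",
                  data=json.dumps(payload),
                  headers={"Content-Type": "application/json"})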


Have you any experience using it?


My company made the same switch from MongoDB to Cassandra, a little over a year ago. We were storing analytics counters in Mongo and wanted better consistency guarantees.

What we found when we switched was that Cassandra had better consistency with similar performance to MongoDB. Then a few months later, as we accumulated more data, performance started to take a nosedive. Counter increments began impacting other database operations and the nodes would become unresponsive. Eventually we moved all of the counters to an in-memory aggregator that flushed to Postgres a couple of times a second.
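
The aggregator itself is nothing fancy; conceptually it's along these lines (a minimal sketch with psycopg2, made-up table and column names, and it assumes the counter rows already exist):

    import threading
    from collections import defaultdict

    import psycopg2

    counts = defaultdict(int)   # in-memory counter buffer
    lock = threading.Lock()

    def increment(key, n=1):
        with lock:
            counts[key] += n

    def flush(conn):
        """Push the buffered increments to Postgres and reset the buffer."""
        with lock:
            batch = list(counts.items())
            counts.clear()
        if not batch:
            return
        cur = conn.cursor()
        cur.executemany(
            "UPDATE counters SET value = value + %s WHERE key = %s",
            [(n, key) for key, n in batch])
        conn.commit()

    # A background loop then calls flush(conn) a couple of times a second:
    #   conn = psycopg2.connect("dbname=metrics")
    #   while True: flush(conn); time.sleep(0.5)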

Counters were introduced in 0.8 (when we started using them) and are pretty half-baked. There has been some good discussion about overhauling counters, though I'm not sure when they're scheduled to land.


The company I work for had a similar experience with Cassandra. We were running it back in the 0.6 days. Once any of our nodes got to around 500GB, the performance would tank.

It seems that they later fixed the performance in 1.2 (http://www.datastax.com/dev/blog/performance-improvements-in...) but by that time we moved our data over to HBase and haven't had any regrets.


It seems a bit out of the ordinary, but we've found ElasticSearch + facets to be a wonderful way to consume time series data. You can dump in a bunch of information about an event (for example, unix timestamp or even response time in ms) and then return a facet over the range and get bucket counts.

ES also has quite nice clustering abilities that make it pretty painless to scale out. If you are clever about your routing keys you can even go crazy and pre-shard hundreds of shards very early on, taking no performance hit on map-reductions but keeping the capacity to scale out with another node without reindexing.

We've been surprised at the swiss-army-knife-like versatility of ES.
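
For anyone curious what that looks like in practice, a date_histogram facet over a time range is roughly this (facet-era ES via the Python client; the index and field names are just examples):

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["localhost:9200"])

    # Bucket event counts per minute over the last hour.
    body = {
        "query": {
            "range": {"timestamp": {"gte": "now-1h", "lt": "now"}}
        },
        "facets": {
            "events_per_minute": {
                "date_histogram": {"field": "timestamp", "interval": "minute"}
            }
        },
        "size": 0,  # we only want the buckets, not the documents themselves
    }

    result = es.search(index="events", body=body)
    # Each entry comes back as {"time": <epoch ms>, "count": <n>}.
    buckets = result["facets"]["events_per_minute"]["entries"]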


We had some major issues using MongoDB for time series data due to the write volume (real time sensor data). The solution for us was Riak, mainly because we never need to update a vector clock, leaving us without the need for conflict resolution (and last write wins is fine for sensor data).


Why wouldn't generating some hashed _id work? It would then scale for writes easily (or, as in current mongodb, you could use hash-based indexes).

I'm just asking since Riak seemed much slower for me when I tried it.


Hashed _id's aren't the bottleneck. Disks are the bottleneck. MongoDB is seriously lacking in the compression department and bogs down our disks constantly.

Riak is not slow as long as you're not running your cluster on VPSs. At our scale Riak's performance has been significantly better than MongoDB's due to our heavy write load, and there are fewer issues with using big disks with somewhat larger seek times.


I'm a little confused by the schema he describes. He says represent "periods" in columns and "records" in rows. Are these two different units of time?

For say stock data that is sampled every second, is he saying there'd be one row per symbol per minute (a "record"), with 60 columns holding the value for each second (a "period")?

If so, does that mean the data is buffered in memory for 1 minute before getting written to the DB?

"Cassandra is really good for time-series data because you can write one column for each period in your series and then query across a range of time using sub-string matching. This is best done using columns for each period rather than rows, as you get huge IO efficiency wins from loading only a single row per query. Cassandra then has to at worst do one seek and then read for all the remaining data as it’s written in sequence to disk.

We designed our schema to use IDs that begin with timestamps so that they can be range queried over arbitrary periods like this, with each row representing one record and each column representing one period in the series. All data is then available to be queried on a row key and start and end times."


No, OP's wording can be confusing if you have never used Cassandra before.

You are right. There is one row per symbol. However with Cassandra any given row can have any number of columns, so when you want to write a new value, you just create a new column for that second (/period).

The writes are not buffered in memory.
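
A rough sketch of that pattern with a Thrift client like pycassa (the names here are made up, and it assumes a column family whose comparator sorts the timestamp-prefixed column names):

    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    pool = ConnectionPool('metrics', ['localhost:9160'])
    ticks = ColumnFamily(pool, 'stock_ticks')

    # One row per symbol; every new sample is just another column whose
    # name begins with the timestamp of that period.
    ticks.insert('GOOG', {'20130821T153000.123': '887.25'})

    # Reading a time range is a column slice on a single row key:
    # one seek, then a sequential read of the columns in that range.
    hour = ticks.get('GOOG',
                     column_start='20130821T150000',
                     column_finish='20130821T160000')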


I see. From what I know about Cassandra, this is a much more expensive write than doing it as a new row.

To do this he has to be using dynamic columns, and those are stored as one serialized blob per row. So the more data you have in the row, the more expensive the deserialization/reserialization is with each column you add. For very large series this could be an issue.

But it sounds like this is tolerable for his app because the writes are distributed over time in a predictable fashion.

I am a little surprised though at the author's claim that fetching a single big row results in "huge IO efficiency" over a range of small rows. I'd expect a small amount of overhead, but isn't it more or less the same amount of data being retrieved? What am I missing?

EDIT: I see the author mentioned that it reduces disk seeks because it's all serialized together already. Sort of like you're defragging the series data on every write. I guess that makes sense.

Personally I would probably look at using SSDs and keep the schema more "sane" and have more scalable writes, but that's just me.


1.) You do not have to use dynamic columns for this. Unfortunately I've found in my own experience that, as Cassandra has matured over the last year, a lot of terminology has fallen in and out of fashion and it's hard to recognize what is actually current. Dynamic columns in CQL3 have nothing to do with the behavior OP is talking about, and dynamic columns are sort of a deprecated feature in Cassandra 1.2. In CQL3, OP's use pattern is actually hidden if you didn't know any better.

In short, there is no deserialization/reserialization. OP's writes are append-only. I have a similar use pattern to OP, and I haven't seen any performance issues with 100,000s of columns (on SSDs).

2.) The "huge IO efficiency" is similar to what you would see in any columnar data store. Wikipedia has a good walkthrough of it (http://en.wikipedia.org/wiki/Column-oriented_DBMS). The short story is now there is fewer meta data between his values.

--

In any case, it works out because Cassandra is far better suited for this type of use pattern than Mongo is. We migrated from MongoDB (on SSDs) to Cassandra for similar reasons. The perf-killer on Mongo in this scenario is the write lock.


Thanks. I definitely need to read up on Cassandra wide rows now.


nemothekid is correct. We are not using CQL, we are using Thrift and are running with real columns.


My dim understanding is that the term "column" in Cassandra doesn't mean what we think it means.


It would be surprising if, for data with a consistent, predictable structure, MongoDB had the best, most consistent, and most predictable performance. MongoDB's raison d'etre is that data doesn't always have consistent, predictable structures.

That said, there's cool stuff out there in the Mongo ecosystem. E.g., TokuMX is a whole new Mongo storage engine.


For what it's worth, we have been using Cassandra for storing time series for about 2 years now at ~2k writes a sec. I would say every issue was self-induced, and Cassandra has been amazingly patient with us. It works amazingly well for this scenario. We experimented with MongoDB a lot initially (along with Riak, HBase, etc.) and found about the same thing. It turns out that using a database in the way it's designed to work turns out in your favor. That said, HBase did well too, but it scared our ops team.

All the new changes in 1.2 and 2.0 with CQL really make it seem like DataStax is focused on being MySQL and ignoring the time series use case, though, which makes me nervous.


I sympathize with that perspective. I was a huge Thrift fan five years ago. But it became clear early on that, despite your experience and mine, most people had a really hard time wrapping their heads around Cassandra's Thrift API. Teaching CQL for the last 8 months has been a night and day difference. Ease of use does matter.

It doesn't hurt either that CQL is substantially more performant [1]. Perhaps that will sweeten the pill for you. :)

That said, while CQL may get the most publicity, we certainly haven't been neglecting the rest of the stack, e.g. [2], [3], [4], ...

[1] https://twitter.com/karbonized1/status/369472202015399936
[2] http://www.datastax.com/dev/blog/performance-improvements-in...
[3] http://www.slideshare.net/jbellis/cassandra-summit-2013-keyn...
[4] http://www.datastax.com/dev/blog/lightweight-transactions-in...


I understand the nervousness, but I was able to convert a thrift/hector time series model to exactly equivalent CQL3 without too much trouble. The (perhaps non-obvious) options involved were "WITH COMPACT STORAGE" for wide rows and "WITH CLUSTERING ORDER BY (blah DESC)" for a reversed comparator.

http://www.datastax.com/documentation/cassandra/1.2/webhelp/...
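
For a concrete starting point, the CQL3 shape of that looks roughly like the following (issued here through the DataStax Python driver; the keyspace, table, and column names are placeholders, not anyone's actual schema):

    from datetime import datetime
    from cassandra.cluster import Cluster

    session = Cluster(['127.0.0.1']).connect('metrics')

    # A wide row in disguise: one partition per series, one clustering
    # column per period, newest first via the reversed comparator.
    session.execute("""
        CREATE TABLE ticks (
            series text,
            ts timestamp,
            value double,
            PRIMARY KEY (series, ts)
        ) WITH COMPACT STORAGE
          AND CLUSTERING ORDER BY (ts DESC)
    """)

    session.execute(
        "INSERT INTO ticks (series, ts, value) VALUES (%s, %s, %s)",
        ('GOOG', datetime(2013, 8, 21, 15, 30), 887.25))

    # Range queries stay on a single partition:
    rows = session.execute(
        "SELECT ts, value FROM ticks "
        "WHERE series = %s AND ts >= %s AND ts < %s",
        ('GOOG', datetime(2013, 8, 21, 15, 0), datetime(2013, 8, 21, 16, 0)))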


Thanks for the pointers. We are also looking at that as a future plan. We are currently using Thrift which has been rightly pointed out and which I should have mentioned.


I'm assuming this website is being served from S3 based on the HTTP headers. As I'm typing this, I get an empty Content-type: header. Is that a configuration oversight, or is this par for the course when serving web pages from S3?


An odd oversight. Thanks for pointing it out. I'll see what's up with that.


I didn't see the size of the cluster described, but watch out for having timestamps as your row keys, for you are going to have hot spots (all timestamps in one token range).

What is your replication factor and the size of the cluster?

This might be improved with vNodes, though I'm not sure how granular and automatic the subnodes are. If they are just an even range (e.g. 256 vnodes across the same 00-ff range), then you will have the same problem.

This is the major reason why DataStax pushes random partitioning so much: it's easy to get into hot water with byte-ordered keys.


You're right. As I mention in the article, the row keys are not timestamps, the columns are timestamps. We use the RandomPartitioner for rows.


Sorry, I misread!


We use KDB for time series data but it is expensive.

Has anyone checked this project out? http://www.monetdb.org/Home


The column store research that started it all. Definitely read the paper if you haven't already:

"The Researcher’s Guide to the Data Deluge: Querying a Scientific Database in Just a Few Seconds" (http://oai.cwi.nl/oai/asset/18546/18546B.pdf)


My go-to for any time-series data is a column-oriented database.

fyi, Cassandra is row-oriented.


Well, but in a column-oriented database you would store your time series data in columns. Cassandra is "row-oriented" and you store the time series in rows. In both cases the data ends up sequentially on the disk, which is what you want.


I think MonetDB is not supposed to be used in high-write (or update) scenarios.


I am not sure this is really an apples-to-apples comparison, because from the article it seems that the data schema in Cassandra was carefully designed, while it's not obvious whether that's the case for the MongoDB one ("it was supposed to be a temporary one").

MongoDB has many, many limitations. Data schemas have to be carefully designed with them in mind, otherwise they are going to have a huge impact on performance.


When it comes to time series data, there aren't a lot of ways to design your schema. In my experience when working with a write-heavy load such as time series data, you are always fighting the write lock with MongoDB. AFAIK, there isn't much you can do about your schema to combat this.


I'm considering using Cassandra for time series data. What exactly did you mean by the columns being for the "period"? What is the period here?


I find it strange that there is no discussion of the performance of writing data.

Cassandra and Mongo differ hugely in this respect, and I expect you will see huge gains in write performance. Mongo's write locking will mean that reads are blocked while you are inserting data. Reads in Cassandra may trigger compaction and/or require sequential IO if the table has not yet been compacted, so the tradeoff is interesting.


Reads in Cassandra will never trigger a compaction. Compaction is only triggered by writes. If the compaction queue gets behind due to overloaded CPUs/disks, reads will begin to slow down.


It would be really useful to get more background before/after, and a lot more detail on the structures, the clustering, and the queries required - and any perf improvements that were attempted previously, successful or not - and any lessons learned from making the change.


Good ideas, I'll see if I can generate some time to post more about what we did.


We (Cloud Monitoring @ Rackspace) have built a time series data store for persisting metrics on top of Cassandra. Check it out: https://news.ycombinator.com/item?id=6257401


Makes one wonder what the performance of MySQL/Memcached would have been.

From my work experience, time-series data is quite stable in definition. I would see this more as a business case for a relational database than a NoSQL database.


Disk usage patterns are the problem with most relational stores (and were for Mongo). When you do a query on time series data, you want to look up a start, read some data, and stop when you hit the end point. Getting it laid out largely sequentially on disk is a big win there.


What GB/TB volume are we talking about here? I ask since Memcached would be able to store some parts in RAM.


Isn't graphite designed specifically for time series data? Or is there another use case to use Cassandra over Graphite?


Graphite's back end is called Whisper, and it is an RRD-type datastore which throws away data as it gets older. You can see a plot of your time series from three years ago with a data point per day, but not the one data point every 0.5 seconds that you initially recorded.


It is interesting to see that most NoSQL databases implemented in Java/Erlang scale better than a C++ one.


Does anyone use Cassandra with the OrderedPartitioner, and is it good? (I know random is best.)


OrderedPartitioner is definitely an expert-mode feature. You really have to have a good feel for the key proximity of reads and writes, or otherwise you can create some really nasty hotspots on individual nodes. We stick with Random.


But is Cassandra web scale?


Some vague graph from this year's Cassandra DataStax conf: https://pbs.twimg.com/media/BMfdOagCcAAxMVU.jpg


For time series data there is also TempoDB.


How's that? Has anybody tried using it?




