I'd love to see http://eagain.net/articles/git-for-computer-scientists/, but for every DB technology.
The best database engine I've encountered in this respect is SQLite. It has plenty of information about what design tradeoffs it makes, and why, e.g.:
I'd like to know more about how this part of the article came to be:
"This choice was made early on and it was supposed to be a temporary one."
HOW was that choice made? What requirements were in play? I think too many people choose Mongo because they believe it's "schemaless" and faster for development, but never examine the requirements of their actual use case.
 - There's always a schema. Either it's informally defined by your code or represented formally somewhere else.
I think the hard thing for us up front was that explaining Cassandra data models to someone (and this guy is really, really good) and then handing the rest of the work over to them to implement, on a contracting rate, is not a trivial problem. And we needed to hurry, for both deadlines and burn rate.
I've yet to use it under any real load, and am struggling to triangulate, from all the articles I've read, whether or not it scales efficiently.
All the issues I have hit so far have been self-inflicted; it is still one of the best new technologies I've used in years, but it has taken a while to stop thinking in SQL equivalents and start thinking natively.
It doesn't. Three words: "global write lock". Writes block reads, reads block writes. Implications: if you run a query in production that doesn't hit an index, all traffic stops. The notablescan setting is a very, very good idea. This also means all queries must have an index, so Mongo ends up with more indexes than, say, Postgres would.
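For reference, here's roughly how that setting is applied (the notablescan parameter is real; exact deployment details will vary with your setup):

```shell
# Refuse any query that would require a full collection scan (no usable index).
# Unindexed queries then fail fast instead of locking out production traffic.
mongod --setParameter notablescan=1

# Or toggle it at runtime from the mongo shell:
#   db.adminCommand({ setParameter: 1, notablescan: 1 })
```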
It's impossible to configure a clustered mongo environment to not lose data: http://aphyr.com/posts/284-call-me-maybe-mongodb
Sharding configuration is baroque, and limited.
Even if Mongo did have a global write lock, which it doesn't as has already been covered, it yields on page faults which means that other queries are minimally impacted. See: http://docs.mongodb.org/manual/faq/concurrency/#does-a-read-...
As to your linked doc, emphasis added:
> In some situations, read and write operations can yield their locks.
> Long running read and write operations, such as queries, updates, and deletes, yield under many conditions.
In practice, I've been bitten hard by this. A new feature rolls out, and users can't log in anymore, because a query is taking 2 minutes to run.
Now locking is on the database level.
Couldn't you just work around that before by running a separate Mongo process per database?
Saying it's now at the database level still means any single-database app is globally locked. Or does using Mongo imply you're going to create lots of databases, so this actually means something?
A daunting task with hundreds of shards!
But replication from MongoDB to TokuDB does not work.
The second I can migrate to another data store, I will. Unfortunately that kind of refactor isn't possible right now, but all new projects are using different tools.
This is a fantastic idea. If someone gets this going I'll enthusiastically contribute.
It would be useful to treat databases as those big data structures, knowing the best/average/worst case, cpu vs. memory trade-offs for search, etc.
The SCADA vendors seem careful to avoid making the time-series database and plotting tools that come with the HMI packages too powerful, as this might cut into their sales of Historian-type products.
If it wasn't a commodity already, the rash of startups that all seem to write their own tools for storing and plotting metrics from their servers and services has certainly made the guts of a capable historian readily available for free.
for storing data:
 - OpenTSDB, based on HBase+Hadoop: http://opentsdb.net/
 - KairosDB, based on Cassandra: https://code.google.com/p/kairosdb/
for plotting data I am hoping to find a library that allows real-time plotting, with zooming and scrolling via the mouse wheel.
so far I have found
lots of D3 based libraries: http://selection.datavisualization.ch/
so there are lots of tools out there if you've got the patience to figure out which one is right for your application and glue it together
http://en.wikipedia.org/wiki/SCADA for those of you in the same boat
EDIT: I read another comment of yours; you said it's RRD-like, so it discards old data. Not what you're looking for..
edit: cube, not cuba :)
What we found when we switched was that Cassandra had better consistency with similar performance to MongoDB. Then a few months later, as we accumulated more data, performance started to take a nosedive. Counter increments began impacting other database operations and the nodes would become unresponsive. Eventually we moved all of the counters to an in-memory aggregator that flushed to Postgres a couple of times a second.
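For illustration, a minimal sketch of that kind of in-memory counter aggregator (names and the flush interval are made up; the real Postgres writer is replaced here by a plain callback):

```python
import threading
from collections import Counter

class CounterAggregator:
    """Buffer counter increments in memory and flush them in batches.

    flush_fn stands in for the Postgres writer (hypothetical); it receives
    a dict of {counter_name: delta}. In production a timer thread would
    call flush() a couple of times a second.
    """

    def __init__(self, flush_fn):
        self._counts = Counter()
        self._lock = threading.Lock()
        self._flush_fn = flush_fn

    def incr(self, name, delta=1):
        with self._lock:
            self._counts[name] += delta

    def flush(self):
        with self._lock:
            batch, self._counts = self._counts, Counter()
        if batch:
            self._flush_fn(dict(batch))

# Usage: collect increments, then flush one batch.
flushed = []
agg = CounterAggregator(flushed.append)
for _ in range(3):
    agg.incr("page_views")
agg.incr("signups")
agg.flush()
# flushed now holds one batch: {"page_views": 3, "signups": 1}
```

The win is that thousands of hot-key increments per second collapse into a handful of UPDATE statements, which a relational store handles easily.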
Counters were introduced in 0.8 (when we started using them) and are pretty half-baked. There has been some good discussion about overhauling counters, though I'm not sure when they're scheduled to land.
It seems that they later fixed the performance in 1.2 (http://www.datastax.com/dev/blog/performance-improvements-in...) but by that time we moved our data over to HBase and haven't had any regrets.
ES also has quite nice clustering abilities that make it pretty painless to scale out. If you are clever about your routing keys you can even pre-shard into hundreds of shards very early on and pay no performance penalty on map-reductions, while keeping the capacity to scale out onto another node without reindexing.
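To make the routing-key idea concrete, here's a rough Python sketch (the md5 hash and shard count are illustrative, not ES's actual routing formula):

```python
import hashlib

NUM_SHARDS = 128  # fixed at index creation; chosen generously up front

def shard_for(routing_key: str) -> int:
    """Deterministically map a routing key to a shard (sketch of ES routing)."""
    digest = hashlib.md5(routing_key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def assign_shards(num_nodes: int):
    """Round-robin shard placement across nodes."""
    return {s: s % num_nodes for s in range(NUM_SHARDS)}

# Pre-sharding: a document lands on the same shard forever, so adding a
# node only means moving whole shards -- the key->shard mapping never changes.
two_nodes = assign_shards(2)
three_nodes = assign_shards(3)

doc_shard = shard_for("user:42")
# The document's shard is identical in both layouts; only its host changed.
assert shard_for("user:42") == doc_shard
```

The point of over-sharding early is that the expensive operation (reindexing every document) is replaced by a cheap one (relocating whole shards).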
We've been surprised at the Swiss-Army-knife versatility of ES.
I'm just asking since Riak seemed much slower for me when I tried it.
Riak is not slow as long as you're not running your cluster on VPSs. At our scale Riak's performance has been significantly better than MongoDB's due to our heavy write load, and there are fewer issues with using big disks with somewhat larger seek times.
For say stock data that is sampled every second, is he saying there'd be one row per symbol per minute (a "record"), with 60 columns holding the value for each second (a "period")?
If so, does that mean the data is buffered in memory for 1 minute before getting written to the DB?
"Cassandra is really good for time-series data because you can write one column for each period in your series and then query across a range of time using sub-string matching. This is best done using columns for each period rather than rows, as you get huge IO efficiency wins from loading only a single row per query. Cassandra then has to at worst do one seek and then read for all the remaining data as it’s written in sequence to disk.
We designed our schema to use IDs that begin with timestamps so that they can be range queried over arbitrary periods like this, with each row representing one record and each column representing one period in the series. All data is then available to be queried on a row key and start and end times."
You are right. There is one row per symbol. However with Cassandra any given row can have any number of columns, so when you want to write a new value, you just create a new column for that second (/period).
The writes are not buffered in memory.
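A toy model of that wide-row layout, with a sorted list of (timestamp, value) pairs standing in for Cassandra's dynamic columns (purely illustrative, not Cassandra's storage format):

```python
import bisect

# One row per symbol, one column per period; column names are timestamps,
# kept sorted so a time range is just a contiguous slice of the row.
rows = {}  # row key (symbol) -> sorted list of (timestamp, value) columns

def write(symbol, ts, value):
    cols = rows.setdefault(symbol, [])
    bisect.insort(cols, (ts, value))  # Cassandra keeps columns sorted by name

def read_range(symbol, start, end):
    cols = rows.get(symbol, [])
    lo = bisect.bisect_left(cols, (start,))
    hi = bisect.bisect_right(cols, (end, float("inf")))
    return cols[lo:hi]

# One new column per second for one minute of "stock" samples.
for t in range(60):
    write("AAPL", 1000 + t, 150.0 + t * 0.01)

# A single row read covers the whole range -- no per-row seeks.
assert len(read_range("AAPL", 1010, 1019)) == 10
```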
To do this he has to be using dynamic columns, and those are stored as one serialized blob per row. So the more data you have in the row, the more expensive the deserialization/reserialization is with each column you add. For very large series this could be an issue.
But it sounds like this is tolerable for his app because the writes are distributed over time in a predictable fashion.
I am a little surprised though at the author's claim that fetching a single big row results in "huge IO efficiency" over a range of small rows. I'd expect a small amount of overhead, but isn't it more or less the same amount of data being retrieved? What am I missing?
EDIT: I see the author mentioned that it reduces disk seeks because it's all serialized together already. Sort of like you're defragging the series data on every write. I guess that makes sense.
Personally I would probably look at using SSDs and keep the schema more "sane" and have more scalable writes, but that's just me.
1.) In short, there is no deserialization/reserialization. OP's writes are append-only. I have a similar use pattern to OP, and I haven't seen any performance issues with 100,000s of columns (on SSDs).
2.) The "huge IO efficiency" is similar to what you would see in any columnar data store. Wikipedia has a good walkthrough (http://en.wikipedia.org/wiki/Column-oriented_DBMS). The short story is that there is less metadata interleaved between his values.
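A quick way to see the metadata saving (illustrative JSON sizes, not Cassandra's actual on-disk encoding):

```python
import json

samples = [{"ts": 1000 + i, "sym": "AAPL", "px": 150 + i} for i in range(100)]

# Row-oriented: field names ("ts", "sym", "px") repeat in every record.
row_oriented = json.dumps(samples)

# Column-oriented: each field name appears once; values are packed together.
col_oriented = json.dumps({
    "ts": [s["ts"] for s in samples],
    "sym": [s["sym"] for s in samples],
    "px": [s["px"] for s in samples],
})

# Same data, noticeably fewer bytes of per-value metadata.
assert len(col_oriented) < len(row_oriented)
```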
In any case, it works out because Cassandra is far more well suited for this type use pattern than Mongo is. We migrated from MongoDB (on SSDs) to Cassandra for similar reasons. The perf-killer on Mongo in this scenario is the write lock.
That said, there's cool stuff out there in the Mongo ecosystem. E.g., TokuMX is a whole new Mongo storage engine.
All the new changes in 1.2 and 2.0 around CQL really make it seem like DataStax is focused on being MySQL and ignoring the time-series use case, though, which makes me nervous.
It doesn't hurt either that CQL is substantially more performant. Perhaps that will sweeten the pill for you. :)
That said, while CQL may get the most publicity, we certainly haven't been neglecting the rest of the stack, e.g. ...
What is your replication factor and the size of the cluster?
This might be improved with vNodes, though I'm not sure how granular and automatic the subnodes are. If they are just an even range (e.g. 256 vnodes across the same 00-ff range), then you will have the same problem.
This is the major reason why Datastax pushes random ordered partitioning so much, it's easy to get into hot water with byte-ordered keys.
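A small sketch of why byte-ordered partitioning runs hot on time-ordered keys (node counts and both partitioning functions are illustrative, not Cassandra's actual partitioners):

```python
import hashlib

NODES = 4

def byte_ordered_node(key: str) -> int:
    # Byte-ordered partitioning: each node owns a contiguous slice of the
    # first-byte range 0x00-0xff, analogous to evenly spaced token ranges.
    return min(key.encode()[0] // (256 // NODES), NODES - 1)

def random_partition_node(key: str) -> int:
    # Random (hash) partitioning: keys scatter regardless of their prefix.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NODES

keys = [f"2013-07-{day:02d}" for day in range(1, 29)]  # time-ordered keys

hot = {byte_ordered_node(k) for k in keys}
spread = {random_partition_node(k) for k in keys}

# All time-ordered keys start with "2", so they land on a single node...
assert len(hot) == 1
# ...while hashing spreads them across the cluster.
assert len(spread) > 1
```

This is the hot-spot: with byte-ordered keys, "now" always lives on one node, so that node takes all the writes.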
Has anyone checked this project out? http://www.monetdb.org/Home
"The Researcher’s Guide to the Data Deluge:
Querying a Scientific Database in Just a Few Seconds"
fyi, Cassandra is row-oriented.
MongoDB has many, many limitations. Data schemas have to be carefully designed with them in mind; otherwise they will have a huge impact on performance.
Cassandra and Mongo differ hugely in this respect, and I expect you will see huge gains in write performance. Mongo's write locking means that reads are blocked while you are inserting data. Reads in Cassandra may trigger compaction and/or require sequential IO if the table has not yet been compacted, so the tradeoff is interesting.
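For intuition, here's a toy model of the log-structured write path that gives Cassandra its fast writes (the structures and limits are made up for illustration):

```python
# Writes append to an in-memory memtable, which flushes to immutable
# SSTables; reads may touch several SSTables until compaction merges them.

memtable = {}
sstables = []  # immutable flushed tables, newest last
MEMTABLE_LIMIT = 2  # tiny, to force flushes in this demo

def write(key, value):
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:
        sstables.append(dict(memtable))  # sequential flush, no random IO
        memtable.clear()

def read(key):
    if key in memtable:
        return memtable[key]
    for table in reversed(sstables):  # may need one lookup per SSTable
        if key in table:
            return table[key]
    return None

def compact():
    """Merge all SSTables so reads touch at most one of them."""
    merged = {}
    for table in sstables:
        merged.update(table)  # later tables win for duplicate keys
    sstables[:] = [merged]

write("a", 1); write("b", 2)   # triggers flush 1
write("a", 3); write("c", 4)   # triggers flush 2 (overwrites "a")
assert read("a") == 3 and len(sstables) == 2
compact()
assert read("a") == 3 and len(sstables) == 1
```

Writes never block on reads here; the cost is deferred into compaction and into reads that land before it runs, which is the tradeoff described above.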
From my work experience, time-series data is quite stable in definition. I would see this more as a business case for a relational database than a NoSQL database.