
Ask HN: What database should we use for our “big data” problem? - iLoch
I originally posted this question to Stack Exchange, and am copying it below so I can get some insight from HN too :) The original (extended) post is here: http://dba.stackexchange.com/questions/162008/deciding-on-the-right-database-for-big-data

We've identified the following requirements of a database:

- _Thousands of inserts/updates per second_

- _Has a solid aggregation strategy_: Aggregating data in N different ways is really important to us; any hindrance in our ability to aggregate data is going to slow us down.

- _Store + query related data_ (hierarchical/recursive JSON) -- ideally efficiently, but not required: getting the data out in a useful format is very important to us.

- _Partition tolerant_ (easily clustered, automated replication)

- _Availability_ (heavy reads + writes): same story as the partition tolerance. Availability requirements scale with the popularity of the video games we're storing data for.

With those requirements in mind, we've been mulling over a few different choices.

- _Cassandra_: Nails the partition tolerance and availability requirements, but it's very limited in querying capability. There's also the intriguing ScyllaDB, which could provide us with more capacity if needed. With the addition of Spark to our infrastructure, Cassandra may be a good fit for our aggregation needs as well. (And we realize Riak and others are compatible with Spark too.)

- _MongoDB_: Native support for JSON could be a big plus. Has a built-in aggregation pipeline, but we're unsure of its capability. Not too enthused with the master-slave replication -- again, we're not overly concerned with consistency.

- _Postgres_: Maybe we don't need NoSQL yet? With the right partitioning strategy, an RDBMS could prove to work for our use case. Postgres has (from what I can tell) excellent support for JSON. Indexing on JSON properties could prove to be very useful for aggregation queries.
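To make the Postgres option concrete, here is a minimal sketch of indexing and aggregating on a JSON property, using Python's stdlib sqlite3 (whose JSON1 functions stand in for Postgres's `jsonb` operators); the table and column names (`matches`, `payload`) are made up for illustration:

```python
import sqlite3

# In-memory stand-in for a Postgres table with a JSON column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE matches (id INTEGER PRIMARY KEY, payload TEXT)")

# Expression index on a JSON property -- the rough analogue of a
# Postgres index on (payload->>'weapon').
conn.execute(
    "CREATE INDEX idx_weapon ON matches (json_extract(payload, '$.weapon'))"
)

rows = [
    ('{"player": "a", "weapon": "rifle", "kills": 3}',),
    ('{"player": "b", "weapon": "rifle", "kills": 5}',),
    ('{"player": "a", "weapon": "smg",   "kills": 2}',),
]
conn.executemany("INSERT INTO matches (payload) VALUES (?)", rows)

# Aggregate directly on the JSON property.
totals = conn.execute(
    "SELECT json_extract(payload, '$.weapon') AS weapon, "
    "       SUM(json_extract(payload, '$.kills')) "
    "FROM matches GROUP BY weapon ORDER BY weapon"
).fetchall()
print(totals)  # [('rifle', 8), ('smg', 2)]
```

The same query in Postgres would use `payload->>'weapon'` and a B-tree expression index; the point is only that JSON properties can participate in ordinary GROUP BY aggregation.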
======
joeclark77
I have a couple of questions for clarification:

1\. When you say "thousands of inserts/updates per second" do you mean
thousands of both? Or just thousands of inserts and maybe a smaller number of
updates? It makes a difference because updates require _finding_ the record
first, and some data stores (like Hadoop) are basically write-only.

2\. Are these "aggregations" going to be run on whole database tables at once
(like the top score of each game), or on small chunks of them (like an
individual player's scores)? For small chunks, strategies like indexing that
speed up queries are essential, but for big batch processes, the MapReduce
algorithm is the way to go.

I'm going to give my two cents based on a couple of alternative scenarios. In
the first scenario, you're capturing gamers' scores/achievements/whatever and
it's really only inserts, no updates or deletes except to correct errors. In
that case you might go with a basic star schema implemented in a relational
database, or something like BigTable if you need it faster. Read the Kimball
book ( _The Data Warehouse Toolkit_ ) if you don't know the model. Basically
it's optimized for aggregation -- sums/averages/counts sliced and diced and
pivoted any way you want to see them. At the middle of the star schema is a
big "fact" table that basically records all events/transactions in a write-
once-then-read-only manner.
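A toy version of that star schema, again with stdlib sqlite3 as the relational stand-in -- the table names (`fact_scores`, `dim_game`) and data are illustrative, not from the thread:

```python
import sqlite3

# Minimal star schema: one write-once "fact" table of score events
# plus a small dimension table describing each game.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_game (game_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE fact_scores (
        game_id INTEGER REFERENCES dim_game(game_id),
        player  TEXT,
        score   INTEGER
    );
""")
conn.executemany("INSERT INTO dim_game VALUES (?, ?)",
                 [(1, "Game A"), (2, "Game B")])
conn.executemany("INSERT INTO fact_scores VALUES (?, ?, ?)",
                 [(1, "p1", 100), (1, "p2", 250), (2, "p1", 40)])

# "Top score of each game" is a single slice of the fact table.
top = conn.execute("""
    SELECT g.title, MAX(f.score)
    FROM fact_scores f JOIN dim_game g USING (game_id)
    GROUP BY g.title ORDER BY g.title
""").fetchall()
print(top)  # [('Game A', 250), ('Game B', 40)]
```

Swapping MAX for SUM, AVG, or COUNT, and GROUP BY for any combination of dimension attributes, gives the "sliced and diced" aggregations described above.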

In the second scenario, you're storing gamers' saved games or something, and
really need to do updates and deletes as well as inserts. Yet you want to run
aggregations on data at the level of individual gamers and on individual
games. In this case I think I'd use some NoSQL database with very fast access
and redundancy, based on Amazon's Dynamo. Riak is based on that model. I can't
remember if Cassandra is. You could have two "tables", one for players and one
for games, and each data point would be inserted twice. So you could get all
the data for a game with one quick hash lookup, or all the data for a player,
and then run your aggregation at that level of detail. Really big system-wide
analyses could run on the whole database using MapReduce (you mentioned Spark,
that'd do it).
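The double-write idea can be sketched in a few lines of plain Python; the dicts stand in for the two Dynamo-style "tables", and the names (`by_player`, `by_game`, `record`) are made up, not a real Riak or Cassandra API:

```python
from collections import defaultdict

# Each event is written twice: once under its player key and once
# under its game key, so either lookup is a single hash access.
by_player = defaultdict(list)
by_game = defaultdict(list)

def record(player, game, stats):
    event = {"player": player, "game": game, **stats}
    by_player[player].append(event)  # copy 1: player "table"
    by_game[game].append(event)      # copy 2: game "table"

record("p1", "game_a", {"kills": 3})
record("p1", "game_b", {"kills": 1})
record("p2", "game_a", {"kills": 5})

# One hash lookup, then aggregate at that level of detail.
p1_total = sum(e["kills"] for e in by_player["p1"])
game_a_total = sum(e["kills"] for e in by_game["game_a"])
print(p1_total, game_a_total)  # 4 8
```

The storage cost doubles, but per-player and per-game aggregations each become a single-key read followed by an in-memory fold.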

~~~
iLoch
1.) Over time, the number of new players playing a game decreases, to the
point where most writes will be updates. That being said, in some cases we
have access to totals via the APIs, so we just do a replace/upsert.

2.) We tend to do a bit of both. One of the main uses we have for aggregation
right now is global-level analytics (i.e., how certain weapons in the game are
being utilized over time).

I think the second approach sounds more closely aligned to what we're
interested in doing.

The trouble with any of these solutions (and this is where I sort of neglected
to make my description of the problem more complex) is that they're fairly one-
dimensional. The reality is that one player may have multiple characters, each
with their own stats, and we may want to track the statistics of each weapon
of each character for each player as well. We may want to track all of this
per playlist too. The relationships start to get _very_ complex, as does the
data.

~~~
joeclark77
Well, one of the reasons they call it Big Data is that it can get big. You can
store the same data point multiple times - once in the "player" record, once
in the "game" record, once in the "weapon" record, etc. Whatever makes it
faster/easier to process.

You have to forget everything you learned about "normalization" if you've ever
studied relational databases and SQL. If redundant copies or even multiple
platforms and data models allow you to make faster queries, use them!

