
Analytics 101: Choosing the right database - turoczy
https://reflect.io/blog/analytics-101-choosing-the-right-database/
======
greggyb
Analytics 101: Choosing the right database is the wrong first step.

I was excited when I saw 'choosing the right data model' as one of the rules,
but they are talking about the data models the DB uses internally. The
important data model decision is how you model the data you have to analyze.
I have my biases, but I'd argue that a dimensional model would be a good
starting point if we're really at a 101 level, with extensions to the model
left for future classes/development.

Starting with the end in mind is very important. When I look at this, I think
in terms of data culture. Who needs to be able to do what in your
organization? What types of questions will you need to answer most often? What
types of questions will you need to support in an ad-hoc manner?

To many organizations, "analytics" means arithmetic with complex filtering and
business logic: traditional BI, essentially. To others, "analytics" means R
code monkeys. To others it may mean specifically visualizations and the
presentation layer. There are many interpretations of the word. Regardless of
the interpretation, process and culture are more important to understand than
the technology.

For a rough analogy, it's like saying "Software development 101: Choosing the
right programming language". Sure, it matters, but understanding what your
software needs to support and what the primary use cases are matters more.

Ninja edit: Grammar.

~~~
threeseed
Modelling your data dimensionally is becoming less and less popular in the
analytics space.

It used to be the standard in data warehouses, but now the trend is to leave
the data unstructured and use query tools (e.g. Drill), or to do multiple ETLs
into structured versions. But even the structured versions would not be
relationally modelled to any great extent. Data scientists typically want
access to data right now, not in a few months when your database guy has
finished modelling it.

~~~
greggyb
Dimensional modeling, though often associated with a traditional waterfall
methodology, is not tied to this delivery methodology.

It remains the most understandable model to the largest population of end
users. Analytics is a very broad term, as I mentioned in my original post, and
the audience is huge. If "analytics" to you implies an audience of primarily
data savvy end users, then dimensional modeling may not hold as much value.

I tend to find that the data scientists at our clients still do a lot more
data wrangling than data science when they don't have clean models to work
with as a baseline.

Additionally, there's a big difference between exploratory analysis of new
data sources where access and low latency are key, and well-known domains that
have fairly predictable needs. The former have a habit of transforming into
the latter. Dimensional models remain one of the most efficient physical
structures of data for a read-dominant workload.

Long story short, I think it's worthwhile to both of us to look outside our
bubbles. I work for a BI and data science consultancy. My focus specifically
is in core BI workloads, and so I'm definitely overexposed to more traditional
modeling techniques. I can guarantee you, though, that the cycle time for a
usable pilot that includes a fully realized dimensional model is more on the
order of a handful of weeks than months in a typical delivery.

What's your bubble?

------
sandstrom
Surprised that RethinkDB[1] isn't mentioned. It has support for replication
and sharding, plus a query language well suited for analytics.

(I'm not affiliated with them, just think they get proportionally little
coverage given their interesting product)

[1] [http://rethinkdb.com/](http://rethinkdb.com/)

~~~
lobster_johnson
Last I checked, RethinkDB had poor aggregation performance — which is key to
doing analytics, unless you want to use it purely as a master data store and
do aggregations elsewhere.

Some simple "group by" selects that would take ~3 seconds with Postgres or
Elasticsearch would take several minutes with RethinkDB, unless it died of RAM
starvation first. It looks to me like RethinkDB is not optimized for sequential
read access, nor is its caching algorithm tuned to such workloads.
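
For reference, the sort of "group by" select I mean, run from Python with
psycopg2 (the table and column names are made up):

    import psycopg2

    conn = psycopg2.connect("dbname=analytics")
    cur = conn.cursor()

    # One daily count per user -- cheap in Postgres at this scale.
    cur.execute("""
        SELECT user_id, date_trunc('day', created_at) AS day, count(*)
        FROM events
        GROUP BY user_id, day
    """)
    for user_id, day, n in cur:
        print(user_id, day, n)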

I believe it also lacks many of the aggregations that you'll want to use, like
multi-level bucketing on different dimensions.

This was ~1 year ago, though, so it may have massively improved since then;
who knows.

~~~
danielmewes
Daniel @ RethinkDB here.

We're constantly improving performance and a lot has happened within the past
year. I think that at this point RethinkDB is as good a database for analytics
as many of the other general-purpose databases when it comes to features and
performance.

From what I can tell, there are still two main limitations that apply in some,
but not all scenarios:

* Grouping big results without an associated aggregation requires the full result to fit into RAM. I believe this was the limitation that you ran into a year ago, which led to RAM exhaustion. This limitation is still there ([https://github.com/rethinkdb/rethinkdb/issues/2719](https://github.com/rethinkdb/rethinkdb/issues/2719) in our issue tracker). However, we're shipping a new command `fold` with the upcoming 2.3 release of RethinkDB, which can be used in the vast majority of cases to perform streaming grouped operations (in conjunction with a matching index); there's a minimal sketch after this list. See [https://github.com/rethinkdb/rethinkdb/issues/3736](https://github.com/rethinkdb/rethinkdb/issues/3736) for details.

* Scanning data sets that don't fit into memory on _rotational_ disks is still inefficient. Most SQL databases deploy sophisticated optimizations to structure their disk layout in order to minimize the effects of high seek times. RethinkDB's disk layout is built with a stronger focus on SSDs. This limitation hence doesn't apply if the data is stored on SSDs.
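
To make the `fold` point concrete, a minimal sketch with the Python driver of
that era (table and field names are hypothetical):

    import rethinkdb as r

    conn = r.connect('localhost', 28015)

    # group() must materialize the whole grouped result in RAM:
    r.table('events').group('user_id').count().run(conn)

    # fold() streams an accumulator over the sequence instead; here, a
    # simple streaming count that never holds the data set in memory.
    # (Streaming *grouped* aggregations use fold's `emit` optarg over a
    # sequence ordered by a matching index -- see issue #3736 above.)
    r.table('events').fold(0, lambda acc, row: acc + 1).run(conn)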

~~~
sandstrom
Focusing on SSDs seems reasonable.

Rotational disks are on their way out. For example, Samsung recently
introduced a 15TB SSD[1], able to compete even with the largest rotational
disks.

[1] [https://news.samsung.com/global/samsung-now-introducing-worl...](https://news.samsung.com/global/samsung-now-introducing-worlds-largest-capacity-15-36tb-ssd-for-enterprise-storage-systems)

------
gtrubetskoy
Actually, you might want to not choose any database at all, but instead focus
on deciding on the data format, such as Parquet
([http://parquet.io](http://parquet.io)) or Avro
([https://avro.apache.org/](https://avro.apache.org/)), etc. Many of the tools
such as Hive, Impala, Spark, etc. support these formats natively.

You will also need to think about the schema, partitioning, compression and
other parameters, and those are not trivial decisions.
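
To make that concrete, a minimal PySpark sketch (the paths, schema, and
partitioning choice are all hypothetical) that touches each of those
decisions:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, to_date
    from pyspark.sql.types import (StructType, StructField, StringType,
                                   TimestampType, LongType)

    spark = SparkSession.builder.appName("parquet-sketch").getOrCreate()

    # Declare the schema explicitly rather than inferring it.
    schema = StructType([
        StructField("user_id", StringType()),
        StructField("event_time", TimestampType()),
        StructField("bytes", LongType()),
    ])

    events = spark.read.json("/data/events_json", schema=schema)

    # Partition by day so engines like Hive, Impala, and Spark can
    # prune whole directories; snappy trades a little size for speed.
    (events
        .withColumn("event_date", to_date(col("event_time")))
        .write
        .partitionBy("event_date")
        .option("compression", "snappy")
        .parquet("/data/events_parquet"))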

~~~
threeseed
The data format is important, with ORC/Parquet being substantially faster than
Text or Sequence files.

But the query engines are far more important in terms of performance. Just
spend any time with SparkSQL and then Hive and you'll know what I mean.

------
pookeh
Surprised [http://druid.io/](http://druid.io/) wasn't mentioned. This DB was
made specifically for both real-time and batch analytics. It even has a nice
front end: [http://imply.io/](http://imply.io/)

------
pella
NEW: Scalable & open-source PostgreSQL extension:
[https://www.citusdata.com](https://www.citusdata.com) (based on PG9.4 /
PG9.5)

Github:
[https://github.com/citusdata/citus](https://github.com/citusdata/citus)

HN:
[https://news.ycombinator.com/item?id=11353322](https://news.ycombinator.com/item?id=11353322)
"Citus Unforks from PostgreSQL, Goes Open Source (citusdata.com)" (24 March
2016)

 _" What is Citus?

\- Open-source PostgreSQL extension (not a fork)

\- Scalable across multiple hosts through sharding and replication

\- Distributed engine for query parallelization

\- Highly available in the face of host failures "_

 _" Citus provides users real-time responsiveness over large datasets, most
commonly seen in rapidly growing event systems or with time series data .
Common uses include powering real-time analytic dashboards, exploratory
queries on events as they happen, session analytics, and large data set
archival and reporting."_ [https://www.citusdata.com/blog/17-ozgun-
erdogan/403-citus-un...](https://www.citusdata.com/blog/17-ozgun-
erdogan/403-citus-unforks-postgresql-goes-open-source)
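
For a sense of the model, a minimal sketch run from Python via psycopg2 (the
table is hypothetical; the create_distributed_table() UDF is the one from the
current Citus documentation):

    import psycopg2

    conn = psycopg2.connect("dbname=analytics host=localhost")
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE events (
            user_id    bigint,
            event_time timestamptz,
            payload    jsonb
        )
    """)
    # Hash-shard by user_id across the worker nodes; Citus then
    # parallelizes queries across the shards.
    cur.execute("SELECT create_distributed_table('events', 'user_id')")
    conn.commit()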

------
hbcondo714
This is by no means a comprehensive list of databases, and that's not the
intent of the article. The real intent is to be a simple read for the many
companies still running solely on 'general purpose databases', showing where
newer database technologies can fit in based on their data needs. Upvote.

------
cdeshpande
What about Elasticsearch? Even though it's a search engine, it's growing in
popularity as a schemaless JSON data store.

~~~
lobster_johnson
Surprised Elasticsearch isn't mentioned in the article.

Unlike several of the databases mentioned, it has a data model particularly
appropriate for analytics: while it is only apparently schemaless, its schema
is extensible (no need to pre-declare it), and by default every column is
indexed. This means there's no extra work on the client to ensure that
indexes exist for new fields.

More importantly, it does complex, nested, distributed aggregations (top-K,
date histograms, etc.) out of the box, and is incredibly fast at it, owing to
the columnar-store-like Lucene index model. You can do complicated
aggregations across millions of values over several dimensions in
milliseconds.
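
As a sketch of what that looks like through the Python client (the index and
field names are made up), a date histogram with a nested top-K terms bucket:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    resp = es.search(index="events", body={
        "size": 0,  # skip the hits; we only want the aggregations
        "aggs": {
            "per_day": {
                "date_histogram": {"field": "timestamp", "interval": "day"},
                "aggs": {
                    # top-K pages within each daily bucket
                    "top_pages": {"terms": {"field": "page", "size": 5}}
                }
            }
        }
    })

    for day in resp["aggregations"]["per_day"]["buckets"]:
        pages = [b["key"] for b in day["top_pages"]["buckets"]]
        print(day["key_as_string"], pages)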

Elasticsearch has consistency issues, though, and even with 2.x and the recent
translog support you should probably never use it as a primary data store.

Some of the other databases mentioned (Cassandra, Riak and so on) are useful
mostly as primary datastores that get processed into something that can do
aggregations. For example, Cassandra -> Elasticsearch is probably a great
combo.

------
Xeoncross
I've always wanted a super-compact database for storing integers on smaller
setups where I don't have the resources to run a dedicated logging server.

I can represent almost everything as an int: time, CPU usage, line number,
etc. Even a single byte is enough for most things, like the server number or
which custom error was thrown.
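
A tiny sketch of the idea with Python's struct module (the field layout here
is made up): each record packs into a few fixed-width integers.

    import struct
    import time

    # uint32 timestamp, uint8 server number, uint8 error code,
    # uint16 line number -- 8 bytes per record.
    RECORD = struct.Struct("<IBBH")

    rec = RECORD.pack(int(time.time()), 3, 42, 1337)
    with open("metrics.bin", "ab") as f:
        f.write(rec)

    ts, server, err, line = RECORD.unpack(rec)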

------
fweespee_ch
Please, please don't set your text color to #888 or #999 on a white
background.

I don't want to have to edit your CSS just so I can read the text.

~~~
dizzystar
I use the blacken plugin on my browser, which automatically converts all text
to black. I know this isn't your point, but it is much better than altering
CSS to fix the web.

Otherwise, I'd just assume the designers disrespect all of us with poor vision
and take it for granted that the writing was done with equal care. Rather than
ranting about it, I'd rather not think about it.

With that said, for _some_ reason, blacken doesn't automatically convert this
site to black, meaning I have to convert the text manually. Even after the
author fixed the color, it is still impossible for me to read. Please don't
override user-assisted plugins. Being on the front page of HN, I'm assuming
this is a decent article, but I have no desire to read it now. Thanks.

~~~
bilmeswe
We're going to darken the text on our next push. Thanks for pointing it out.

------
paoloiam
Turoczy, thank you for sharing! Great insights.

