
Practical advice for analysis of large, complex data sets - yarapavan
http://www.unofficialgoogledatascience.com/2016/10/practical-advice-for-analysis-of-large.html
======
maxxxxx
I hope some experienced data analytics people will read this thread, so here is
a slightly unrelated question: we have a 1 TB data set, growing at 1 TB/year,
that we need to analyze. Our IT department is pushing for Hadoop, but this
involves a lot of integration work because they have no plumbing ready. The
whole thing just feels way too complex for our use case.

The data is reasonably structured so I think we can easily use a SQL database
with possibly some XML or JSON columns. This would be much easier and quicker
to set up.
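
To make the "SQL database with some JSON columns" idea concrete, here is a minimal sketch. It uses Python's built-in sqlite3 purely as a stand-in for whatever SQL database would actually be chosen, and the table name, column names, and sample rows are all hypothetical, not taken from the real data set.

```python
import json
import sqlite3

# sqlite3 is only a stand-in here; the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE measurements (
        id INTEGER PRIMARY KEY,
        patient_id TEXT NOT NULL,
        recorded_at TEXT NOT NULL,   -- ISO 8601 timestamp
        value REAL NOT NULL,
        extra TEXT                   -- semi-structured fields stored as JSON
    )
    """
)

# Made-up sample rows: fixed fields go in columns, the rest in the JSON blob.
rows = [
    ("p1", "2016-10-01T08:00:00", 5.4, json.dumps({"device": "A", "notes": "fasting"})),
    ("p1", "2016-10-02T08:00:00", 6.1, json.dumps({"device": "A"})),
    ("p2", "2016-10-01T09:30:00", 4.9, json.dumps({"device": "B"})),
]
conn.executemany(
    "INSERT INTO measurements (patient_id, recorded_at, value, extra) VALUES (?, ?, ?, ?)",
    rows,
)

# Structured columns are aggregated in plain SQL.
averages = dict(
    conn.execute(
        "SELECT patient_id, AVG(value) FROM measurements "
        "GROUP BY patient_id ORDER BY patient_id"
    ).fetchall()
)

# The JSON column is decoded in application code only when it is needed.
devices = {
    json.loads(extra)["device"]
    for (extra,) in conn.execute("SELECT extra FROM measurements")
}

print(averages)
print(sorted(devices))
```

The point of the split is that the well-understood fields stay queryable and indexable as ordinary columns, while the irregular ones ride along in the JSON column until their shape settles.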

Is 1TB a size that makes sense for Hadoop? Are there any alternatives like
Google BigQuery, MongoDB or others? Sorry, I am not up to date with the latest
cloud offerings. Also, we are in the medical field so this raises some
security questions.

~~~
jakub_g
I only know Hadoop from reading about it on HN, and the blog post I remember
best is "Don't use Hadoop - your data isn't that big":

[https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html](https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html)

TL;DR

> But my data is 100GB/500GB/1TB!

> A 2 terabyte hard drive costs $94.99, 4 terabytes is $169.99. Buy one and
> stick it in a desktop computer or server. Then install Postgres on it.
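
As a rough back-of-the-envelope check of that advice against the numbers in the question (1 TB today, growing at 1 TB/year), the sketch below estimates how long a single drive would last. It assumes linear growth and ignores indexes, filesystem overhead, and backups, so treat the result as an upper bound; the function name is hypothetical.

```python
def years_until_full(drive_tb, current_tb, growth_tb_per_year):
    """Years before the data set outgrows the drive, assuming linear growth."""
    return max(drive_tb - current_tb, 0) / growth_tb_per_year


# Drive sizes from the quoted post; data size and growth rate from the question.
for drive_tb in (2, 4):
    years = years_until_full(drive_tb, current_tb=1, growth_tb_per_year=1)
    print(f"{drive_tb} TB drive: about {years:.0f} year(s) of headroom")
```

Under these assumptions the 2 TB drive fills up after about one year and the 4 TB drive after about three.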

~~~
sn9
There's also this [0] on how to outperform Hadoop with command line tools.

Obviously, you should look at the role I/O costs can play. A lot of RAM and
some SSDs might be a better idea than Hadoop at this scale.

[0] [http://aadrake.com/command-line-tools-can-be-235x-faster-tha...](http://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html)
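
In the same spirit as the command-line approach, a single-pass streaming aggregation on one machine might look like the sketch below. This is not the linked article's pipeline, just an illustration of the idea: read line by line, keep only running counts in memory, and never load the whole file. The tab-separated layout and the sample lines are assumptions made up for the example.

```python
import io
from collections import Counter


def count_by_field(lines, field_index, sep="\t"):
    """Count occurrences of one field while streaming over lines.

    Memory use grows with the number of distinct values, not with file size.
    """
    counts = Counter()
    for line in lines:
        parts = line.rstrip("\n").split(sep)
        if len(parts) > field_index:  # skip short or malformed lines
            counts[parts[field_index]] += 1
    return counts


# Stand-in for a large file; in practice this would be open("data.tsv").
sample = io.StringIO("2016-10-01\tA\n2016-10-01\tB\n2016-10-02\tA\n")
result = count_by_field(sample, field_index=1)
print(result.most_common())
```

Because any iterable of lines works, the same function can be fed from an open file object or from `sys.stdin` at the end of a shell pipeline.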

------
mikecb
I love this blog. They had a great post describing a privacy preserving query
proxy:
[http://www.unofficialgoogledatascience.com/2015/12/replacing...](http://www.unofficialgoogledatascience.com/2015/12/replacing-sawzall-case-study-in-domain.html)

