
Ask HN: What are some ways to work with large amounts of data quickly? - nstart
I've been wondering about companies that work with large amounts of data. Analytics, reporting and that kind of thing. How do they select such large amounts of data, especially when it is time based, and do it so quickly? From my understanding there are two main challenges.

A) Selecting the data from the DB very quickly. My guess is that there is quite a bit of tuning of the DB done here. Would love to know how to make that possible.

B) Working with all that data. If I do get all that data I can't possibly store hundreds of thousands of records in memory till I process it. What might be the recommended way to deal with it?

Any reading resources I can be pointed to would also be super helpful :).

Thanks!!
======
eb0la
I use sampling a lot for this kind of work because it usually involves
prototyping, so I start by taking a statistically significant sample of the
data. I also use the sample as test vectors.
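
A minimal sketch of that sampling step, assuming pandas and a CSV dump of the
raw events (the file and column names here are just placeholders):

```python
# Take a reproducible 1% sample of a raw event dump for prototyping.
import pandas as pd

events = pd.read_csv("events.csv")

# frac=0.01 gives a 1% random sample; fixing the seed makes the sample
# reproducible so it can double as a test vector.
sample = events.sample(frac=0.01, random_state=42)
sample.to_csv("events_sample.csv", index=False)
```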

At that scale you can use whatever you want
(IPython + pandas + NumPy, R, Matlab, Mathematica...). For visualization I like
to use QlikView, but you can also use Tableau or D3.js.

Once I have the prototype / proof of concept, the next step is productionizing
it.

You'll need:

- Somewhere to drop that data (EC2 / Hadoop HDFS / NFS...).

- Something to get the data from its origin and put it into the storage, aka
ETL (extract, transform, load). Usual suspects here: Informatica, Talend,
Sqoop, Pig, Hadoop, Spark...

- Some kind of "database". If you're using structured data you can use a
columnar database like Sybase IQ, SAP HANA, Teradata, Netezza, or MS SQL
Server 2012/14/16 with a columnstore index, Hadoop HBase, etc. If you're
comfortable with it, you can also dump the data into files with columnar
formats (like Parquet).

- Something that puts the data graphically in front of the user and is easy
to work with. Some examples: QlikView, Tableau, SAP Lumira, or D3.js. If you
want to code your own stuff, take a look at the tools data journalists use.

You'll need some engineering to tune the data architecture to follow your
users' natural workflow. For instance, if they only need month-to-month
reports, you can partition your data to reflect that.
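
As a rough illustration of that month-to-month partitioning idea, here's a
sketch using pandas + pyarrow to write one Parquet partition per month (the
file and column names are invented):

```python
# Partition raw transactions by month so month-scoped queries only
# touch one directory instead of scanning the whole dataset.
import pandas as pd

df = pd.read_csv("transactions.csv", parse_dates=["ts"])
df["month"] = df["ts"].dt.to_period("M").astype(str)

# Writes transactions_parquet/month=2015-01/, month=2015-02/, ...
df.to_parquet("transactions_parquet", partition_cols=["month"])

# A month-to-month report then only reads the relevant partition.
march = pd.read_parquet("transactions_parquet/month=2015-03")
```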

------
ahoka
It usually boils down to the following simple principles and techniques:

A: Probabilistic data structures and algorithms. For example, you can quickly
check whether a very large set contains a certain key with a Bloom filter, or
estimate cardinality with HyperLogLog.
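
A toy Bloom filter sketch to show the idea of approximate membership (the
sizes here are arbitrary and not tuned for a real workload):

```python
# Minimal Bloom filter: "no" answers are exact, "yes" answers are
# probabilistic (small chance of a false positive).
import hashlib

class BloomFilter:
    def __init__(self, size=10_000, num_hashes=5):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive several bit positions from salted hashes of the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means definitely absent; True means "probably present".
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("user:12345")
print(bf.might_contain("user:12345"))  # True
print(bf.might_contain("user:99999"))  # almost certainly False
```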

B: Divide and conquer. Just divide your data into workable pieces and combine
the partial results at the end. Hadoop is a popular infrastructure for doing
this. Another example is doing pieces of the work in memory, saving the
partial results to disk in chunks, and reducing them later to get the final
result.
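
As a small sketch of that chunking idea outside Hadoop, assuming pandas and a
big CSV of page views (file and column names are made up):

```python
# Stream a large CSV in fixed-size chunks, compute a partial count per
# chunk, and reduce the partials into a final result. Memory stays
# bounded by the chunk size, not the file size.
import pandas as pd
from collections import Counter

total = Counter()
for chunk in pd.read_csv("pageviews.csv", chunksize=100_000):
    total.update(chunk["url"].value_counts().to_dict())

print(total.most_common(10))
```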

------
karterk
1\. People don't store such large amounts of data in a conventional database
(especially a relational DB). Depending on the exact requirements, the popular
choices are HBase, Cassandra or even flat files in HDFS.

2\. To analyze the data, Hadoop and Spark are great. Spark is an especially
good fit for iterative algorithms, and it also has a great Python interface.
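
A rough PySpark sketch of what that can look like, counting URL hits from log
files in HDFS (the path and log format are invented for illustration):

```python
# Count URL hits across many log files with Spark's Python interface.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("url-counts").getOrCreate()

# Each line is assumed to be "timestamp<TAB>url".
lines = spark.sparkContext.textFile("hdfs:///logs/2015/06/*")
counts = (lines
          .map(lambda line: (line.split("\t")[1], 1))
          .reduceByKey(lambda a, b: a + b))

# Pull back only the top 10 URLs, not the whole result set.
for url, n in counts.takeOrdered(10, key=lambda kv: -kv[1]):
    print(url, n)
```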

------
realtarget
What do you mean by "large amounts of data"?

A friend of mine is a data analyst at a large market research company and
analyzes more than 10 billion cookies a month using Apache Hadoop:
[http://hadoop.apache.org/](http://hadoop.apache.org/)

~~~
nstart
Well, let's say you want to create a minute-by-minute report on URLs visited
on an e-commerce site that gets a large amount of usage. Every minute I want
to process all those URLs and decide whether I want to create a flash sale on
the site. Since this is an example I just thought of, let's imagine that there
are 400k URLs visited every minute.

That would be a decent definition of what I meant by large amounts of data.
Sorry for not making that clearer before.

Your friend's example is pretty good. Curious how they work on the data in
memory.

~~~
ahoka
It depends on how many unique URLs you have. If you have, say, a few hundred
or even a few thousand, I would just use a simple hash table with counters to
see what's a hot sale right now (I guess that's what you want in your
example). If you have several million, then maybe I would put it into Redis or
something similar where I can use ready-made HyperLogLog cardinality
estimation.
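
Something like this sketch, assuming a Redis server on localhost and redis-py
(the key and function names are placeholders):

```python
# Exact hit counts in an in-process hash table, plus Redis HyperLogLog
# for approximate unique-visitor counts per URL.
from collections import Counter
import redis  # assumes a Redis server is reachable on localhost

hits = Counter()
r = redis.Redis()

def record_hit(url, visitor_id):
    hits[url] += 1                         # exact hit count per URL
    r.pfadd(f"uniques:{url}", visitor_id)  # approximate unique visitors

def report():
    # Top 10 hottest URLs with estimated unique-visitor counts.
    for url, n in hits.most_common(10):
        print(url, n, r.pfcount(f"uniques:{url}"))
```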

------
tixocloud
It depends on what the use case is and the tradeoffs between budget, space and time.

For real-time reporting and analytics, we've gone with in-memory databases.

However, for more detailed analytics, we work with Hadoop and we also build
our own data warehouse to summarize transactional data.

------
DrNuke
CRUDding data in small chunks locally, pre-working 1% of the data locally to
assess the extent of your domain, or uploading to the cloud in parallel are
the starting techniques.

