
Ask HN: Software for exploring billion row sized datasets - tomx
Given billions of rows of data across a few tables, how do you best make sense of it?

My current method involves thinking up interesting questions and writing database queries. I then plug the resulting data into gnuplot to examine it.

Is there generally a better way? I am kinda hoping a Mathematica/Matlab type shell or similar for databases or other data sources exists. Just type a query and view a graph. Even better, type queries, output graphs into a web page.

Or is the method to hire a data scientist to build specialised reports?

The data format is agnostic, interested in how this works across all ecosystems.
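The query-then-gnuplot loop described above can be scripted end to end. Here is a minimal sketch of the idea — the table, columns, and the use of SQLite are all invented for illustration, standing in for whatever database actually holds the data:

```python
import csv
import sqlite3

# Hypothetical setup: in practice you would connect to your real database
# instead of building a toy in-memory one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (day TEXT, value REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("2013-01-01", 10.0), ("2013-01-01", 12.0), ("2013-01-02", 7.5)],
)

# Type a query, get a file gnuplot can plot directly, e.g.:
#   gnuplot> plot "out.tsv" using 2 with lines
query = "SELECT day, value FROM events ORDER BY day, value"
with open("out.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    for row in conn.execute(query):
        writer.writerow(row)
```

Wrapping this in a small loop (read a query from stdin, rewrite the TSV, replot) gets close to the "type a query and view a graph" shell being asked about.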
======
lutusp
> Given billions of rows of data across a few tables, how do you best make
> sense of it?

Your inquiry won't go anywhere until you describe the problem you're trying to
solve. Be specific, if only for a single example problem.

I say this because there's no generic solution to accessing a large database
-- the solution depends on the goal.

------
lukev
Using a language that supports interactive development might speed up the
process for you.

I use Clojure, and like Incanter for this kind of work. I also use Datomic as
my data store, when I can, which makes it quite easy to perform ad-hoc
queries.

Of course, the fact that your data is too large to fit in memory means that,
whatever you're graphing, you're going to have to aggregate it first before
you can visualize it. That's really the hardest part of what you're asking,
and how you do that efficiently depends entirely on what your query is and
what kind of data store you're using.
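The aggregate-first point can be shown concretely with SQL: push a GROUP BY down into the database so that only a handful of summary rows, not billions of raw ones, ever reach the visualization layer. A small sketch — the schema and the use of SQLite are made up for illustration:

```python
import sqlite3

# Invented schema; in practice this would be your real data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (path TEXT, latency_ms REAL)")
conn.executemany(
    "INSERT INTO requests VALUES (?, ?)",
    [("/home", 12.0), ("/home", 18.0), ("/api", 40.0)],
)

# The database scans the large table; only the summary comes back,
# which is what you then hand to a plotting tool.
summary = conn.execute(
    "SELECT path, COUNT(*), AVG(latency_ms) "
    "FROM requests GROUP BY path ORDER BY path"
).fetchall()
print(summary)  # a few rows, however large the underlying table is
```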

I'm not aware of any off-the-shelf software that does what you're talking
about, unless it fits into an OLAP-type schema
(<http://en.wikipedia.org/wiki/OLAP_cube>) for which there are several
products available.

------
runarb
When you mention billion-row datasets, MapReduce and Apache Hadoop come to
mind, but that requires that you are able to do some programming.
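For a sense of what that programming looks like, the MapReduce pattern is just a mapper emitting key/value pairs and a reducer folding each key's group. A toy sketch, simulated in-process rather than on a real Hadoop cluster (the record format is invented):

```python
from itertools import groupby

def mapper(lines):
    # Emit (key, 1) per record; here the key is the first comma field.
    for line in lines:
        yield line.split(",")[0], 1

def reducer(pairs):
    # Hadoop delivers pairs grouped by key; we sort to simulate that,
    # then sum the counts within each group.
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)

records = ["GET,/home", "GET,/api", "POST,/api"]  # hypothetical log lines
counts = dict(reducer(mapper(records)))
print(counts)  # {'GET': 2, 'POST': 1}
```

On a cluster, the same two functions would run as separate processes over splits of the data, with Hadoop doing the shuffle/sort between them.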

There may also be a lot of existing solutions to present/summarize/graph your
data, depending on what it contains and which program created it. Can you give
us some more insight into what kind of data you have?

------
teyc
SQL Server and Excel PivotTables use VertiPaq. The main idea is that data
along columns tends not to change very much, so the data can be compressed in
memory column by column, achieving a very high degree of compression.

Perhaps you can roll something like this as well.
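The columnar-compression idea can be seen in miniature with run-length encoding — a deliberately simplified stand-in for what VertiPaq actually does, just to show why columns with few distinct values compress so well:

```python
from itertools import groupby

def rle_encode(column):
    # A column with long runs of repeated values collapses to a few pairs.
    return [(value, len(list(run))) for value, run in groupby(column)]

def rle_decode(runs):
    return [value for value, count in runs for _ in range(count)]

# Hypothetical column: 10 rows shrink to 3 (value, count) pairs.
column = ["US"] * 5 + ["UK"] * 3 + ["US"] * 2
runs = rle_encode(column)
print(runs)  # [('US', 5), ('UK', 3), ('US', 2)]
assert rle_decode(runs) == column
```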

------
jamessb
There are GUI analysis tools that produce graphs directly from databases, e.g.
Tableau: <http://www.tableausoftware.com/solutions/big-data-analysis>

