

Ask HN: Who is actually doing something with "Big Data"? - cbo

I'm not doubting the validity of the term. Obviously it exists. The big companies -- Google, Facebook, Twitter, etc. -- all have enormous datasets to work with, so of course it's real.

But I read this term so much that it seems as though EVERYONE is working with it, and that just doesn't seem feasible to me.

Do you work with "Big Data" at your (preferably not aforementioned) company? What do you do with it? Where do you get it from? Are there any significant pain points?
======
apaprocki
The term, at least to me, does not necessarily refer to storage of large
amounts of data. At Bloomberg, we process a large amount of data in realtime.
When focusing on the processing of realtime market data, the hardware topology
looks different than when you are, say, doing a map reduce across a huge
number of nodes. You often have a single point of entry for a particular
stream (with redundancy, of course), but then the trick is to efficiently
distribute the data to the many places that require it with as near zero
latency as possible. Obviously there is a need for longer term storage and
instantaneous retrieval of billions of data points, but the largest focus is
on near-zero latency from point of consumption, distributing to all internal
nodes/apps and to many thousands of sites in 180+ countries with a large
number of "nines".
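
The fan-out pattern described above -- a single point of entry replicating a stream to many consumers -- can be sketched in a few lines. This is a toy illustration, not Bloomberg's actual stack; all names here are invented:

```python
import queue
import threading

class FanOut:
    """Single ingest point that replicates each message to every subscriber.

    Each consumer gets its own queue, so a slow consumer does not hold up
    delivery to the others.
    """

    def __init__(self):
        self._subscribers = []
        self._lock = threading.Lock()

    def subscribe(self):
        q = queue.Queue()
        with self._lock:
            self._subscribers.append(q)
        return q

    def publish(self, tick):
        # Snapshot the subscriber list, then replicate the tick to everyone.
        with self._lock:
            subs = list(self._subscribers)
        for q in subs:
            q.put(tick)

feed = FanOut()
consumers = [feed.subscribe() for _ in range(3)]
feed.publish({"symbol": "ABC", "price": 101.5})
print([q.get() for q in consumers])
```

In a real deployment the per-consumer queues would be network links and the hard part is exactly what the comment says: keeping that replication step near zero latency across thousands of sites.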

The biggest challenge is that data feeds originate in nearly all of those
countries and also need to be distributed efficiently to every other country.
(e.g., NASDAQ originates in the US and reaches around the globe, and the same
is true for realtime feeds on the opposite side of the globe in the Middle
East, India, Singapore, Hong Kong, Tokyo, etc.) The Internet is not reliable
from a latency point of view so coupled with the required hardware is the
required network. We operate one of the largest private networks in the world.

edit: Also, from a processing point of view we have had great success with
speeding up complex algorithms that would normally take minutes to run across
huge compute clusters, bringing them down to seconds by porting them to run on
large GPU clusters. Certain things are definitely suited for running on GPUs,
but I feel it is still pretty foreign to most programmers and hard for
companies to decide to jump into that kind of project. You're starting to see
more specialized use of GPU or slower-clock-but-massively-parallel compute
devices for a wider variety of tasks. (e.g.,
<http://gigaom2.files.wordpress.com/2011/07/facebook-tilera-whitepaper.pdf>)

~~~
ifearthenight
Thanks for the long but interesting comment. Are you able to share a little
about the db setup you are using?

------
quadlock
I've been using Big Data techniques (NoSQL storage and processes such as map-
reduce) to do data analysis and to produce useful, actionable information. My
data isn't super huge right now -- my mongodb data directory is at about
33GB -- but it's a good start and useful for working out techniques that can
be applied to much larger datasets.
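
The map-reduce technique mentioned above can be sketched in plain Python. This is a toy word-count, not the poster's actual mongodb jobs; the data and function names are made up:

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    # Map step: emit (key, 1) pairs -- here, one pair per word.
    return [(word, 1) for word in doc.split()]

def reduce_phase(pairs):
    # Reduce step: group by key and sum the counts.
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

docs = ["big data big deal", "data analysis"]
counts = reduce_phase(chain.from_iterable(map_phase(d) for d in docs))
print(counts)  # {'big': 2, 'data': 2, 'deal': 1, 'analysis': 1}
```

The same shape scales because the map step is embarrassingly parallel and the reduce step only needs pairs sharing a key to land on the same node -- which is what mongodb's mapReduce or a Hadoop job arranges for you.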

~~~
vaib
Great! I've been looking for that sort of comment. Can you help me get started
on an idea for a data analysis tool that produces useful results from large
data that would normally take data analysts a lot of time? I just want to
build a good hadoop POC. The technology is no problem; it's the idea side I
need help with.

------
gumbo
big data is not really a technology; it is just a term for the fact that when
you have "too much data" to deal with, you can't do it the usual way.

Now to come back to the point of your question: often when people say "Big
data", they mean that they use "some" technologies suited for "big data",
like a NoSQL database.

------
SanjeevSharma
check out <http://www.storediq.com/>. They seem to be doing some interesting
stuff with 'Big Data'

