
Ask HN: How to analyse the real-time Web? - dvd03
This seems to be one of today's extremely interesting problems: how can we process an enormously high rate of incoming data such that processing latency is minimal?

What are your thoughts?

Some points to consider:

1. High rate of inserts

If the incoming rate of new data is very high, and we store this data for any length of time, then databases tend to struggle. Why? Because inserting individual datapoints becomes a significant bottleneck when the database is large and the rate of inserts is high.

2. Computation across the full dataset

Any cleverness on our part in discerning patterns in the data may require regularly crunching sequentially through all (or at least a significant portion of) our data. Databases are optimised for random access; the overheads that make this possible mean they are not optimised for sequential reads.

It is because of points 1 and 2 that industry folk loudly sing the praises of Hadoop for high rates of incoming data that require persistent storage: HDFS, with its low-overhead chunk archiving, and its corresponding MapReduce computation framework.

So, problem solved? Not so fast. Let's consider:

A. Inherent latency in HDFS and MapReduce

We also know that HDFS and MapReduce carry a latency which prevents real-time analysis. This is for four reasons:
i) Data is inserted into the HDFS in chunks. Until a chunk is formed, it cannot be inserted - and, accordingly, it cannot be analysed.
ii) MapReduce has a synchronisation lock after the map function, so no reduce outputs can be given until maps are all complete.
iii) Reduce pulls output data from Map, so pipelining is made inherently difficult - even without the synchronisation lock.
iv) Hadoop processing only occurs after data has been written to disk.

B. Order

We also know that whilst HDFS with MapReduce is great for unstructured data, many data streams are ordered - at least chronologically. This order can (and often should?) be exploited during processing. Exploiting it across HDFS is not trivially done.

It is because of points A and B that we should probably be thinking of in-memory stream processing in addition to Hadoop - if, indeed, the community is correct that Hadoop is the right fit for archived data which arrives at high volume and also needs processing.

What are your thoughts?
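As a concrete illustration of what the in-memory, order-exploiting side might look like, here is a minimal sliding-window aggregator over a timestamped stream. This is a sketch only - the class and its names are made up for illustration, not the API of any particular stream-processing product:

```python
from collections import deque

class SlidingWindowAverage:
    """Maintain a running average over the last `window_seconds` of a
    chronologically ordered, timestamped stream - entirely in memory,
    with no per-datapoint disk write."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()   # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def add(self, timestamp, value):
        # Datapoints arrive in time order, so eviction from the left
        # of the deque is O(1) amortised - this is where the stream's
        # chronological ordering is being exploited.
        self.events.append((timestamp, value))
        self.total += value
        while self.events and self.events[0][0] < timestamp - self.window:
            _, old_value = self.events.popleft()
            self.total -= old_value

    def average(self):
        return self.total / len(self.events) if self.events else None
```

The point of the sketch is that a result is available immediately after each `add`, with no chunk-formation or map/reduce synchronisation delay - the trade-off being that only the window's worth of data is queryable.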
======
smiler
Can I ask a non-technical question about analysing the real-time web...

What's the business benefit? I don't think enough happens 'real time' for it
to make much of a difference - reading a few blogs, news websites, digg /
reddit & trending topics on twitter is enough to catch up on what happened in
the world in a day.

Has anyone actually thought of anything that could make this useful?

~~~
jgrahamc
_Has anyone actually thought of anything that could make this useful?_

Yes. This is part of what we are doing at <http://causata.com/>. Understanding
the actions of people in near-real time is important if you want to target
them with the right information, ads, offers, etc.

~~~
FreeRadical
But why does this have to be real time and not, say, 4 hours delayed? If a
person is interested in soccer at 8am, then they probably still are interested
in soccer later in the day.

I'd be interested to hear any counter examples.

~~~
leej
It means you can display a soccer-related ad at 8am and know that he will be
watching it.

~~~
FreeRadical
But this already happens with ad networks, and has been going on for years:
they track your previous behaviours and deliver 'relevant' ads.

------
fleitz
Hadoop is designed as a batch-processing framework, not a real-time analysis
framework. If your analysis functions are idempotent with regard to future
data in the time domain, simply compute summaries for each "block", down to
the resolution supported for that age of data, such that your summaries fit in
memory. Save ALL the data to Hadoop in case you need to replay it later. The
answer to your question depends very much on whether you can summarize your
data in the time domain. E.g. if computing an average, store block summaries
as the average AND the number of items, so that future summaries can easily be
integrated. There is no one answer that will solve every possible analysis
function; you'll need to optimize your system around the analysis function you
want to perform, and perhaps have a few purpose-built systems for different
types of analysis.
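The "store the average AND the number of items" idea can be sketched as follows. This is a minimal illustration with made-up names, not code from any particular system - the key property is that summaries of old blocks can be merged without revisiting the raw data:

```python
class BlockSummary:
    """Summary of one block of a stream. Storing (sum, count) rather than
    the average alone makes summaries mergeable: the average of two merged
    blocks is recoverable exactly, which a bare average would not allow."""

    def __init__(self, values=()):
        self.count = len(values)
        self.total = float(sum(values))

    def merge(self, other):
        # Combine two block summaries without touching the raw datapoints.
        merged = BlockSummary()
        merged.count = self.count + other.count
        merged.total = self.total + other.total
        return merged

    @property
    def average(self):
        return self.total / self.count if self.count else None
```

Older blocks can then be merged into progressively coarser summaries as they age, keeping the in-memory working set bounded while the raw data sits in Hadoop for replay.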

~~~
dvd03
Let's consider a particular problem domain: analysis of global financial data
- fixed income, stocks, derivatives, etc.

Agreed, Hadoop is a batch processing framework across a chunked archive. Work
has been done recently to bring Hadoop "out of the past" -
[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-13...](http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-136.html).
However, even with these latest amendments, the latency can prove more than a
little troublesome for trading strategies which require rapid execution.

It is for this reason that companies like Truviso and StreamBase - both born
out of highbrow academic research - have built in-memory stream-processing
frameworks in addition to persistent data stores.

If we assume, for the sake of argument, that analysis of historical data is
important and that Hadoop is fit for that purpose, and if we assume also that
a distributed in-memory processing facility is important, then which in-memory
solution ought we to employ, and how ought we to relate it to the Hadoop
solution we'll also be using?

~~~
japherwocky
You should try a few of them and see which one actually works better. Probably
more importantly, you should figure out more specifically what you're trying
to accomplish.

------
ig1
Have a look at what the financial industry uses; they've been tackling
high-volume, low-latency, real-time data-feed analysis problems for the last
couple of decades. Time-series databases (such as kx/kdb) are popular, as are
proprietary in-memory databases (often based on something like BerkeleyDB).

------
jey
What's the objective? We need to decide on what we want "analyze" to mean
before we can evaluate/debate the merits of different techniques.

------
physcab
Although I haven't really delved into it that much, HBase (and other "NoSQL"
databases) supposedly address the latency and data-ordering points above. I
brought up HBase as opposed to Cassandra or Redis because HBase sits on top of
HDFS and has a Hadoop/MapReduce API.

