Ask HN: To everybody who uses MapReduce: what problems do you solve? - valevk
======
alecco
A large telco has a 600-node cluster of powerful hardware. They barely use it.
Moving Big Data around is hard. Managing it is harder.

A lot of people fail to understand the overheads and limitations of this kind
of architecture, or how hard it is to program, especially considering that
salaries for this have skyrocketed. More often than not, a couple of large 1TB
PCIe SSDs and a lot of RAM can handle your "big" data problem.

Before doing any Map/Reduce (or equivalent), please, I beg you, check out
Introduction to Data Science at Coursera:
[https://www.coursera.org/course/datasci](https://www.coursera.org/course/datasci)

~~~
AsymetricCom
> Moving Big Data around is hard.

I never had any issues with Hadoop. It took me about 2 days to familiarize
myself with it, hack together an ad-hoc script to do the staging, and set up
the local functions processing the data.

I really would like to understand what you consider "hard" about Hadoop or
managing a cluster. It's a pretty straightforward idea, the architecture is
dead simple, and it requires no specialized hardware at any level. Anyone who
is familiar with the Linux CLI and running a dynamic website should be able to
grok it easily, imho.

Then again, I come from the /. crowd, so YC isn't really my kind of people,
generally.

~~~
skrebbel
> _Then again, I come from the /. crowd, so YC isn't really my kind of people,
> generally._

You sound like a snob.

~~~
AsymetricCom
See what I mean?

------
willvarfar
We had been using hadoop+hive+mr to run targeting expressions over billions
of time-series events from users.

But we have recently moved a lot back to mysql+tokudb+sql, which can compress
the data well and keep it to just a few terabytes.

Seems we weren't big data enough and we were tired of the execution times,
although impala and fb's newly released presto might also have fitted.

Add: could the downvoters explain their problem with this data point?

~~~
srl
This doesn't answer the question - you described /how/ you solved some
problem, not what problem you're actually solving.

(Mind you, at my writing this is the top comment, so I don't think you're
getting many downvotes. But your comment irked me, so there you go.)

~~~
res0nat0r
> not what problem you're actually solving.

> targeting expressions over billions of time-series events from users.

~~~
collyw
You moved back to MySQL.

What was the motivation to move to Map Reduce in the first place if a well
understood technology, MySQL, works fine?

(Sorry if I am posting a lot on this topic. I am really interested in finding
answers, rather than trying to prove a point that relational databases are
better, in case anyone thinks otherwise.)

~~~
willvarfar
(Poster) We moved from mysql+innodb/myisam to hadoop because of performance
problems. We did test and evaluate hadoop, and then jumped. Then tokudb came
along (technically we moved back to mariadb) and put the performance advantage
firmly back on mysql's side. I imagine impala, presto, and other column-based,
non-map-reduce engines would give tokudb a fairer fight, though.

------
lclarkmichalek
This system isn't in production just yet, but should be shortly. We're parsing
Dota2 replays and generating statistics and visualisation data from them,
which can then be used by casters and analysts for tournaments, and by
players. The replay file format breaks the game down into 1-minute chunks,
which are the natural unit to iterate over.

Before someone comes along and says "this isn't big data!", I know. It's
medium data at best. However, we are bound by CPU in a big way, so between
throwing more cores at the problem and rewriting everything we can in C, we
think we can reduce processing times to an acceptable point (currently ~4
mins, hoping to hit <30s).
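The shape of the job is basically a map over replay chunks followed by a
combine; a rough single-machine sketch of that structure (the chunk data and
stat fields are stand-ins, not the real parser):

    from multiprocessing import Pool

    def parse_chunk(raw):
        # CPU-heavy step: decode one ~1-minute slice of the replay and
        # emit stats for it (real decoding would go here).
        return {"bytes_seen": len(raw), "events": raw.count(b"\x01")}

    def combine(per_chunk):
        # Merge the per-chunk stat dicts into one summary for the match.
        summary = {}
        for stats in per_chunk:
            for key, value in stats.items():
                summary[key] = summary.get(key, 0) + value
        return summary

    if __name__ == "__main__":
        # Stand-ins for the 1-minute chunks the replay format provides.
        chunks = [b"\x00\x01" * 512, b"\x01\x02" * 1024]
        with Pool() as pool:                  # more cores -> shorter wall time
            print(combine(pool.map(parse_chunk, chunks)))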

~~~
baudehlo
This sounds like something you could just do in SQL and have it all done in
milliseconds.

~~~
lclarkmichalek
For our highly unstructured, untyped, non-relational artefacts? It'd be
practically impossible to use it as a datastore in this particular case, and
regardless, it would provide little to no speed increase over our current
application, as the limiting factor is the CPU cost of the map function.

~~~
alecco
Maybe you should consider transforming the data to structure it a bit. Have a
unique identifier per object and an array for each seen characteristic holding
(value, id) pairs, plus its reverse index. Then decompose the processing of
each object into sub-problems matching the characteristics n-way. It doesn't
have to be in SQL, though. I'd try a columnar RDBMS. YMMV
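A toy sketch of that layout (the characteristics are invented):

    from collections import defaultdict

    objects = [
        {"hero": "axe", "item": "blink"},   # invented characteristics
        {"hero": "axe", "item": "dagon"},
    ]

    # Unique identifier per object.
    ids = {oid: obj for oid, obj in enumerate(objects)}

    # One array per characteristic, holding (value, id) pairs...
    columns = defaultdict(list)
    for oid, obj in ids.items():
        for char, value in obj.items():
            columns[char].append((value, oid))

    # ...and its reverse index: value -> set of ids having that value.
    reverse = {char: defaultdict(set) for char in columns}
    for char, pairs in columns.items():
        for value, oid in pairs:
            reverse[char][value].add(oid)

    # Sub-problems then become n-way matches against the indexes,
    # e.g. "which objects share hero=axe":
    print(reverse["hero"]["axe"])   # {0, 1}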

------
alexatkeplar
We use Elastic MapReduce at Snowplow to validate and enrich raw user events
(collected in S3 from web, Lua, Arduino etc clients) into "full fat" schema'ed
Snowplow events containing geo-IP information, referer attribution etc. We
then load those events from S3 into Redshift and Postgres.

So we are solving the problem of processing raw user behavioural data at scale
using MapReduce.

All of our MapReduce code is written in Scalding, which is a Scala DSL on top
of Cascading, which is an ETL/query framework for Hadoop. You can check out
our MapReduce code here:

[https://github.com/snowplow/snowplow/tree/master/3-enrich/ha...](https://github.com/snowplow/snowplow/tree/master/3-enrich/hadoop-etl)
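
Conceptually, the enrichment step is a map over raw collector lines: validate,
enrich, emit. A Python sketch of that idea (not the actual Scalding code; the
lookups and field names are invented):

    import json
    import sys

    def enrich(raw_line):
        # Validate: malformed rows are dropped here (the real pipeline
        # routes them to a separate "bad rows" output instead).
        try:
            event = json.loads(raw_line)
        except ValueError:
            return None
        # Enrich: stand-ins for geo-IP lookup and referer attribution.
        event["geo_country"] = "US" if event.get("ip", "").startswith("8.") else "unknown"
        event["referer_medium"] = "search" if "google." in event.get("referer", "") else "other"
        return event

    if __name__ == "__main__":
        # Streaming style: raw events in on stdin, enriched events out on stdout.
        for line in sys.stdin:
            enriched = enrich(line)
            if enriched is not None:
                print(json.dumps(enriched))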

~~~
sandGorgon
Thanks for this. Snowplow has been an amazing source of learning. I'm more
interested in the ETL process than in the actual MapReduce.

Have you seen your ETL used to pull data from Twitter or Facebook? I am
wondering what the state of the art is there, considering throttling, etc.

~~~
alexatkeplar
Hi sandGorgon! Thanks for the encouraging words. We haven't yet seen people
use the existing Scalding ETL to pull data from Twitter or Facebook. As you
suggest, there are some considerations around using Hadoop to access web APIs
without getting throttled/banned. Here's a couple of links which might be
helpful:

\- [http://stackoverflow.com/questions/6206105/running-web-fetch...](http://stackoverflow.com/questions/6206105/running-web-fetches-from-within-a-hadoop-cluster)

\- [http://petewarden.com/2011/05/02/using-hadoop-with-external-...](http://petewarden.com/2011/05/02/using-hadoop-with-external-api-calls/)

I think a gatekeeper service could make sense; or alternatively you could
write something which runs prior to your MapReduce job and e.g. just loads
your results into HDFS/HBase, for the MapReduce to then look up. Akka or
maybe Storm could be choices here.

We have done a small prototype project to pull data out of Twitter & Facebook
- that was only a Python/RDS pilot, but it gave us some ideas for how a proper
social feed into Snowplow could work.

~~~
sandGorgon
Is your Python code part of the Snowplow repository, or a gist?

It would be very interesting to take a look at it.

------
DenisM
It's worth noting that CouchDB is using map-reduce to define materialized
views. Whereas normally MR parallelization is used to scale out, in this case
it's used instead to allow incremental updates for the materialized views,
which is to say _incremental updates for arbitrarily defined indexes_! By
contrast SQL databases allow incremental updates only for indexes whose
definition is well understood by the database engine. I found this to be
pretty clever.
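
A toy Python analogue of the idea (real CouchDB views are JavaScript
map/reduce functions and keep partial reductions in a B-tree; this only shows
why just the changed documents need re-mapping):

    # Arbitrary user-defined "index": one (key, value) row per tag.
    def map_fn(doc):
        return [(tag, 1) for tag in doc.get("tags", [])]

    def reduce_fn(values):
        return sum(values)

    class View:
        def __init__(self):
            self.rows_by_doc = {}   # doc_id -> rows that doc contributed

        def update(self, doc_id, doc):
            # Incremental update: replace only this document's contribution.
            self.rows_by_doc[doc_id] = map_fn(doc)

        def query(self, key):
            values = [v for rows in self.rows_by_doc.values()
                        for k, v in rows if k == key]
            return reduce_fn(values)

    view = View()
    view.update("a", {"tags": ["couchdb", "mapreduce"]})
    view.update("b", {"tags": ["couchdb"]})
    view.update("a", {"tags": ["mapreduce"]})   # only doc "a" is re-mapped
    print(view.query("couchdb"))                # 1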

~~~
smoyer
I've been using CouchDB (and now BigCouch) for about four years and it's both
clever and useful. We're storing engineering documents (as attachments) and
using map/reduce (CouchDB views) to segment the documents by the metadata
stored in the fields. The only downside is that adding a view with trillions
of rows can take quite a while.

~~~
DenisM
Strangely, there aren't many discussions of the Couch* family on Hacker News.
Do you know why that would be?

I'm thinking about basing a new product around couchbase lite, but lack of
popular acceptance is one of the things holding me back.

~~~
smoyer
I'd be happy to take the discussion off-line if you want to get into more
depth but I have to wonder if the merger of Memcached and CouchDB happened
right as CouchDB would have made a name for itself. Right around that time, I
think CouchDB and Mongo were equally well-known.

------
batbomb
I'll tell you where it's not used: High Energy Physics. We use a workflow
engine/scheduler to run jobs over a few thousand nodes at several different
locations/batch systems in the world.

If processing latencies don't matter much, it's an easier, more flexible
system to use.

~~~
mileswu
At least on the experimental LHC side, we process/analyse each event
independently from every other event, so it's an embarrassingly parallel
workload. All we do is split our input dataset up into N files, run N jobs,
combine the N outputs.
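
The pattern itself is tiny; a toy single-machine sketch of split / run N
jobs / combine (the "analysis" is a stand-in):

    from multiprocessing import Pool

    def run_job(events):
        # Each job processes its events independently of every other event.
        return [e for e in events if e % 7 == 0]   # stand-in "analysis"

    if __name__ == "__main__":
        # Split the input dataset into N chunks (real jobs read N files).
        dataset = list(range(1000))
        n = 4
        chunks = [dataset[i::n] for i in range(n)]

        with Pool(n) as pool:          # on the GRID these are N batch jobs
            outputs = pool.map(run_job, chunks)

        combined = [e for out in outputs for e in out]   # merge the N outputs
        print(len(combined))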

Because we have so much data (of the order of 25+ PB of raw data per year; it
actually balloons to much more than this due to copies in many slightly
different formats) and so many users (several thousand physicists on LHC
experiments), we have hundreds of GRID sites across the world. The scheduler
sends your jobs to sites where the data is located. The output can then be
transferred back via various academic/research internet networks.

HEP also tends to invent many of its own 'large-scale computing' solutions.
For example most sites tend to use Condor[1] as the batch system, dcache[2] as
the distributed storage system, XRootD[3] as the file access protocol,
GridFTP[4] as the file transfer protocol. I know there are some sites that use
Lustre but it's pretty uncommon.

[1] [http://research.cs.wisc.edu/htcondor/](http://research.cs.wisc.edu/htcondor/)

[2] [http://www.dcache.org/](http://www.dcache.org/)

[3] [http://xrootd.slac.stanford.edu/](http://xrootd.slac.stanford.edu/)

[4] [http://www.globus.org/toolkit/docs/latest-stable/gridftp/](http://www.globus.org/toolkit/docs/latest-stable/gridftp/)

~~~
memracom
I remember when HEP invented that WWW technology stuff which turned out to be
rather popular outside of HEP as well.

------
btown
This is a system in early development, but my research group is planning on
using MapReduce for each iteration of a MCMC algorithm to infer latent
characteristics for 70TB of astronomical images. Far too much to store on one
node. Planning on using something like PySpark as the MapReduce framework.
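
A rough PySpark sketch of what one iteration could look like, assuming each
partition can compute its contribution to the likelihood locally (the tile
contents, likelihood and update below are all stand-ins):

    from pyspark import SparkContext

    sc = SparkContext(appName="mcmc-sketch")

    # Stand-in for image tiles spread across the cluster; the real data
    # would be read from distributed storage, far too big for one node.
    tiles = sc.parallelize(range(10000), numSlices=100)

    params = 0.0                     # current state of the chain
    for step in range(5):            # one MapReduce pass per MCMC iteration
        current = sc.broadcast(params)
        # Map: each tile's contribution to the log-likelihood given params.
        # Reduce: sum those contributions across the cluster.
        loglike = tiles.map(lambda t: -((t % 97) / 97.0 - current.value) ** 2).sum()
        params += 0.01               # stand-in for the accept/reject update
        print(step, loglike)

    sc.stop()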

------
alexrson
How strongly does a RNA binding protein bind to each possible sequence of RNA?

~~~
collyw
How big is your data?

Yours is the first example where I have a decent knowledge of the field, so I
can understand the needs accurately. In most cases I see people using NoSQL in
places where MySQL could handle it easily.

Maybe I am just too set in my relational-database way of thinking (having
used them for 13 years), but there are few cases where I see NoSQL solutions
being beneficial, other than for ease of use (most people are not running
Facebook or Google).

I am a bit sceptical in most cases (though I would certainly like to know
where they are appropriate).

~~~
rch
First, I should note that my needs are fairly specific, and not typical of the
rest of the NGS world. The datasets are essentially the same though.

The rate at which we are acquiring new data has been accelerating, but each of
our Illumina datasets is only 30GB or so. The total accumulated data is still
just a few TB. The real imperative for using MR is more about the processing
of that data. Integrating HMMER, for instance, into Postgres wouldn't be
impossible, but I don't know of anything that's available now.

Edit: An FDW for PostgreSQL around HMMER just made my to-do list.

~~~
collyw
So is it fair to say it is an "ease of use" use case?

~~~
rch
Is that the same as an 'impossible to do otherwise' case?

Edit: I should say 'currently impossible' since as I noted, I can imagine
being able to build SQL queries around PSSM comparisons and the like. I just
can't build a system to last 5+ years around something that _might_ be
available at some point.

Since I can't reply directly- agreed :)

~~~
collyw
That comment was based on your "into Postgres wouldn't be impossible" phrase.

No, fair enough, if it's the only way you can get things to work. I still see
lots of people jumping on the "big data" bandwagon with very moderately sized
data.

------
sidcool
A host of batch jobs that source data from up to 28 different systems and
then apply business rules to extract a subset of useful data.

------
zengr
We run MR via Pig (data in cassandra/CFS) on a 6-node hadoop cluster to
process time-series data. The events contain user metrics like which view was
tapped, user behavior, search results, clicks, etc.

We process these events for downstream use: search relevancy, internal
metrics, and seeing top products.

We did this on mysql for a long time but things got really slow. We could
have optimized mysql for performance, but cassandra was an easier way to go
about it and it works for us for now.

~~~
oacgnol
Can you estimate how much faster your processing is now vs. before on MySQL?
I find it interesting that your cluster is only 6 nodes - relatively small
compared to what I've seen and read about. It'd be interesting to know the
benefits of small-scale usage of big data tech.

------
clubhi
"How do I make all my projects take 10x longer"

------
nl
While the push-back against Map/Reduce & "Big Data" in general is totally
valid, it's important to put it into context.

In 2007-2010, when Hadoop first started to gain momentum it was very useful
because disk sizes were smaller, 64 bit machines weren't ubiquitous and
(perhaps most importantly) SSDs were insanely expensive for anything more than
tiny amounts of data.

That meant if you had more than a couple of terabytes of data you either
invested in a SAN, or you started looking at ways to split your data across
multiple machines.

HDFS grew out of those constraints, and once you have data distributed like
that, with each machine having a decently powerful CPU as well, Map/Reduce is
a sensible way of dealing with it.

~~~
jfxberns
It still is a valid pattern. The bigger disks get, the harder it is to move
the data and the more efficient it is to move the compute to the data.

------
jread
Generating web traffic summaries from nginx logs for a CDN with 150 servers,
10-15 billion hits/day. Summaries then stored in MySQL/TokuDB.
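
The core of that kind of job is a word-count-shaped map/reduce over log lines;
a rough streaming-style sketch (the log field positions are illustrative, not
the actual format):

    import sys

    def mapper(lines):
        # Emit ((server, hour), 1) for each nginx access-log line.
        for line in lines:
            parts = line.split()
            if len(parts) < 4:
                continue
            server, timestamp = parts[0], parts[3]   # illustrative positions
            yield (server, timestamp[:14]), 1        # e.g. "[18/Nov/2013:10"

    def reducer(pairs):
        # Sum hits per (server, hour); these rows go into MySQL/TokuDB.
        totals = {}
        for key, count in pairs:
            totals[key] = totals.get(key, 0) + count
        return totals

    if __name__ == "__main__":
        for key, hits in sorted(reducer(mapper(sys.stdin)).items()):
            print(key, hits)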

------
PaulHoule
MapReduce is great for ETL problems where there is a large mass of data and
you want to filter and summarize it.

~~~
darkxanthos
Yup. This. Once I filter down to the data I actually care about, I typically
find I'm no longer anywhere near "Big Data" size.

------
cstigler
MongoDB: where you need Map/Reduce to do any aggregation at all.

~~~
tzury
You might consider using Mongo's aggregation framework[1] instead, which is a
zillion times faster[2].

[1] [http://docs.mongodb.org/manual/aggregation/](http://docs.mongodb.org/manual/aggregation/)

[2] [http://stackoverflow.com/questions/13908438/is-mongodb-aggre...](http://stackoverflow.com/questions/13908438/is-mongodb-aggregation-framework-faster-than-map-reduce)
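
For example, a group-and-count that you might otherwise write as map/reduce
(a pymongo sketch; the database, collection and field names are invented):

    from pymongo import MongoClient

    # Assumes a local mongod; "demo", "events" and the fields are invented.
    events = MongoClient()["demo"]["events"]

    pipeline = [
        {"$match": {"type": "click"}},                        # filter
        {"$group": {"_id": "$page", "hits": {"$sum": 1}}},    # group + count
        {"$sort": {"hits": -1}},
        {"$limit": 10},
    ]

    for row in events.aggregate(pipeline):
        print(row["_id"], row["hits"])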

------
bcoughlan
A lot of people in this thread are saying that most data is not big enough for
MapReduce. I use Hadoop on a single node for ~20GB of data because it is an
excellent utility for sorting and grouping data, not because of its size.

What should I be using instead?

~~~
grogenaut
Obviously you should throw out the solution that worked for you and start
over. 20GB just isn't cool enough to use M/R.

------
Aqueous
To most people who use MapReduce in a cluster: You probably don't need to use
MapReduce. You are either vastly overstating the amount of data you are
dealing with and the complexity of what you need to do with that data, or you
are vastly understating the amount of computational power a single node
actually has. Either way, see how fast you can do it on a single machine
before trying to run it on a cluster.

------
Tossrock
We use our own highly customized fork of Hadoop to generate traffic graphs [1]
and demographic information for hundreds of thousands of sites from petabytes
of data, as well as building predictive models that power targeted display
advertising.

[1]:
[https://www.quantcast.com/tumblr.com](https://www.quantcast.com/tumblr.com)

------
rubyfan
We are using Map/Reduce to analyze raw XML as well as event activity streams,
for example analyzing a collection of events and metadata to understand how
discrete events relate to each other, as well as patterns leading to certain
outcomes. I am primarily using Ruby+Wukong via the Hadoop Streaming interface,
as well as Hive to analyze output and for more normalized data problems.

The company is a large Fortune 500 P&C insurer and has a small (30 node)
Cloudera 4 based cluster in heavy use by different R&D, analytic and
technology groups within the company. Those other groups use a variety of
toolsets in the environment, I know of Python, R, Java, Pig, Hive, Ruby in use
as well as more traditional tools on the periphery in the BI and R&D spaces
such as Microstrategy, Ab Initio, SAS, etc.

------
joeblau
I was using it on a D3.js chart to aggregate data flow through our custom
real-time analytics pipeline.

~~~
shamsulbuddy
Would you mind sharing some more info on this real-time analytics using d3? I
am also planning to do something like that.

~~~
jackgolding
look up crossfilter.js

------
gregoryw
Validating performance testing simulations. Tie the inputs of the load
generators to the outputs from the application server logs and verify the
system works as designed at scale.
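
In map/reduce terms that is essentially a reduce-side join keyed on request
id; a toy sketch of the check (the record formats are invented):

    from collections import defaultdict

    # Match what the load generators sent against what the app servers
    # logged, keyed by request id (made-up records on both sides).
    sent   = [("req-1", "GET /a"), ("req-2", "GET /b")]   # generator inputs
    logged = [("req-1", 200), ("req-2", 500)]             # server log outputs

    def map_both(sent, logged):
        for rid, payload in sent:
            yield rid, ("sent", payload)
        for rid, status in logged:
            yield rid, ("logged", status)

    def reduce_join(pairs):
        grouped = defaultdict(list)
        for rid, record in pairs:
            grouped[rid].append(record)
        # Verify every request was both sent and answered.
        for rid, records in grouped.items():
            tags = {tag for tag, _ in records}
            yield rid, ("ok" if tags == {"sent", "logged"} else "missing", records)

    for rid, result in reduce_join(map_both(sent, logged)):
        print(rid, result)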

------
bijanv
Taking dumps of analytics logs and pulling out relevant info for our customers
on app usage

~~~
sitkack
This is the `grep/awk` use case. The nice thing about the streaming MR
interface to Hadoop (calling external programs) is that you can literally take
your grep/awk workflow and move it to the cluster. Retaining line-oriented
records is a huge step toward having a portable data-processing workflow.

------
redwood
Obviously, in today's theme, Facebook is _not_ using mapreduce effectively to
figure out which of our "friends" we actually care about :)

~~~
KyeRussell
Could you be any more edgy?

------
espennilsen
Saw this tweet today
[https://twitter.com/Andst7/status/399333514803294208/photo/1](https://twitter.com/Andst7/status/399333514803294208/photo/1)

------
mfeldman
"Things that could be trivially done in SQL :(" We use HIVE over HDFS. Sure
the type of things we are doing could have been done in SQL, if we had
carefully curated our data set and chosen which parts to keep and which to
throw away. Unfortunately, we are greedy in the storage phase. Hive allows us
to save much more than we reasonably would need, which actually is great when
we need to get it long after the insert was made.

------
conradev
Airbnb uses distributed computing to analyze the huge amount of data that it
generates every day[1]. The results are used for all sorts of things:
assessing how Airbnb affects the local economy (for government relations),
optimizing user growth, etc.

[1] [http://nerds.airbnb.com/distributed-computing-at-airbnb/](http://nerds.airbnb.com/distributed-computing-at-airbnb/)

------
hurrycane
Billions of metric values.

------
sitkack
Image restoration, OCR, face detection, full text indexing. Mostly just a
parallel job scheduler.

------
pjbrunet
It would be nice if someone collected/counted the actual answers. I read the
whole thread and "analyzing server logs" was the only answer. If you don't
have funding or have dreams/plans, you're not really _using_ the technology.

------
kirang1989
Just FYI: even small problems can be solved using the concept of MapReduce.
It is a concept, not an implementation tied to Big Data and NoSQL. A simple
example is merge sort, which uses the MR idea to sort data.
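
A tiny illustration in plain Python (nothing to do with Hadoop): the "map"
sorts each half independently, the "reduce" merges the sorted halves.

    import heapq

    def merge_sort(xs):
        if len(xs) <= 1:
            return list(xs)
        mid = len(xs) // 2
        # "Map": sort each half independently (could run in parallel).
        left, right = merge_sort(xs[:mid]), merge_sort(xs[mid:])
        # "Reduce": merge the two sorted halves into one sorted list.
        return list(heapq.merge(left, right))

    print(merge_sort([5, 3, 8, 1, 9, 2]))   # [1, 2, 3, 5, 8, 9]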

------
mrgriscom
Things that could be trivially done in SQL :(

------
estebanz01
Processing a huge amount of Spanish text from paper surveys.

------
roschdal
"How can I make Google shareholders richer?"

------
clockwork_189
Data aggregation to calculate analytics.

------
platz
Billions

------
benihana
How can I make myself feel better about what I do by trying to diminish the
work other people do using the same technologies?

------
iurisilvio
How much can I save in my mobile phone and plan?

