A large telco has a 600 node cluster of powerful hardware. They barely use it. Moving Big Data around is hard. Managing is harder.
A lot of people fail to understand the overheads and limitations of this kind of architecture, or how hard it is to program, especially considering that salaries for this have skyrocketed. More often than not a couple of large 1TB PCIe SSDs and a lot of RAM can handle your "big" data problem.
I strongly agree. Although there are clearly uses for map/reduce at large scale, there is also a tendency to use it for small problems where the overhead is objectionable. At work I've taken multiple map/reduce systems and converted them to run on my desktop, in one case taking a job that used to take 5 minutes just to start up down to a few seconds total.
Right tool for the job and all that. If you need to process a 50PB input though, map/reduce is the way to go.
> Does anyone have a use case where data is on a single machine and map reduce is still relevant?
What matters is the running data structure. For example, you can have Petabytes of logs but you need a map/table of some kind to do aggregations/transformations. Or a sparse-matrix based model. There are types of problems that can partition the data structure and work in parallel in the RAM of many servers.
Related: it's extremely common to mistake for a CPU bottleneck what's actually a problem of TLB misses, cache misses, or bandwidth limits (RAM, disk, network). I'd rather invest time in improving the basic DS/algorithm than in porting to Map/Reduce.
Indexes do not solve the locality problem (see Non-clustered indexes). Even for in-memory databases, it is non-trivial to minimize cache misses in irregular data structures like B-trees.
Now, why MapReduce "might" be a better fit even for a problem where the data fits on one disk: consider a program that is embarrassingly parallel. It just reads a tuple and writes a new tuple back to disk. The parallel IO provided by map/reduce can offer a significant benefit even in this simple case.
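As a sketch of that map-only pattern (the file names and the field-swapping transform are made up, and threads stand in for the framework's parallel workers):

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def transform(line):
    # Hypothetical per-tuple transformation: swap two tab-separated fields.
    a, b = line.rstrip("\n").split("\t")
    return b + "\t" + a + "\n"

def process_shard(paths):
    # One worker streams one input shard into one output shard, so the
    # IO for different shards proceeds in parallel.
    in_path, out_path = paths
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(transform(line))

def run_shards(shards, workers=4):
    # Process every (input, output) shard pair concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(process_shard, shards))

if __name__ == "__main__":
    d = tempfile.mkdtemp()
    shards = []
    for i in range(4):
        in_path = os.path.join(d, f"in-{i}.tsv")
        with open(in_path, "w") as f:
            f.write(f"key{i}\tval{i}\n")
        shards.append((in_path, os.path.join(d, f"out-{i}.tsv")))
    run_shards(shards)
    print(open(shards[0][1]).read())
```

The point is that no shard depends on any other, so striping shards across disks or nodes scales the IO almost linearly.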
I'm not sure that the two concepts are related at all. Obviously Mongo has map/reduce baked in, but that's not that relevant. Map/reduce is a reasonable paradigm for crunching information. I have a heavily CPU-bound system that I parallelise by running on different machines and aggregating the results. I probably wouldn't call it map reduce, but really it's the same thing.
How do you parallelise your long running tasks otherwise?
I completely agree with the parents about Map Reduce. However I would justify using MongoDB for totally different reasons, not scalability. It is easy to set up, easy to manage and above all easy to use, which are all important factors if you are developing something new. However it does have a few "less than perfect" solutions to some problems (removing data does not always free disk space, no support for big decimals,...) and it definitely takes some getting used to. But it is a quite acceptable solution for "small data" too.
Edit: ...and I wouldn't dream of using MongoDB's implementation of MapReduce.
A single machine with many cores does not give the same performance as multiple machines. For example, consider disk throughput. If the data is striped across multiple nodes, then read requests can be executed in parallel, resulting in linear speed-up! On a single machine you have issues of cache misses, inefficient scatter-gather operations in main memory, etc.
And it is much easier to let the MapReduce framework handle parallelism than to write error-prone code with locks/threads/MPI/architecture-dependent parallelism etc.
That sounds more like parallelisation rather than a use case for NoSQL.
Why is that any better than all of your data on one database server, and each cluster node querying for part of the data to process it? Obviously there will be a bottleneck if all nodes try to access the database at the same time, but I see no benefit otherwise, and depending on the data organisation, I don't even see NoSQL solving that problem (you are going to have to separate the data to different servers for the NoSQL solution, why is that any better than cached query from a central server?).
"Does anyone have a use case where data is on a single machine and map reduce is still relevant?"
No! MapReduce is a programming pattern for massively parallelizing computational tasks. If you are doing it on one machine, you are not massively parallelizing your compute and you don't need MapReduce.
You can imagine cases where map-reduce is useful without any starting data. If you are analyzing combinations or permutations, you can create a massive amount of data in an intermediate step, even if the initial and final data sets are small.
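A toy illustration of that blow-up (the sizes and the "sums to 100" predicate are arbitrary):

```python
from itertools import combinations

# Small input: 40 items.
items = list(range(40))

# "Map": the intermediate step materializes C(40, 3) = 9880 candidate
# triples, far larger than either the input or the final answer.
candidates = combinations(items, 3)

# "Reduce": filter back down to the (small) set satisfying a predicate.
hits = [c for c in candidates if sum(c) == 100]
print(len(hits))
```

With larger inputs or wider combinations, the intermediate set is what needs to be partitioned across workers, even though the input and output both fit in memory.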
Have you got any links on how to do that? It sounds very much like a problem I am trying to solve just now: combinations of DNA sequences that work together on a sequencing machine.
At the moment I am self joining a table of the sequences to itself in MySQL, but after a certain number of self joins the table gets massive. Time to compute is more the problem rather than storage space though, as I am only storing the ones that work (> 2 mismatches in the sequence). Would Map Reduce help in this scenario?
If I had your problem, the first thing that I would do is try PostgreSQL to see if it does the joins fast enough. Second thing that I would try is to put the data in a SOLR db and translate the queries to a SOLR base query (q=) plus filter queries (fq=) on top.
Only if both of these fail to provide sufficient performance would I look at a map reduce solution based on the Hadoop ecosystem. Actually, I wouldn't necessarily use the Hadoop ecosystem. It has a lot of parts/layers, and the newer and generally better parts are not as well known, so it is a bit more leading edge than lots of folks like. I'd also look at something like Riak http://docs.basho.com/riak/latest/dev/using/mapreduce/ because then you have your data storage and clustering issues solved in a bulletproof way (unlike Mongo) but you can do MapReduce as well.
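For the DNA question above, a minimal map/reduce-shaped sketch (the ">2 mismatches" rule comes from the earlier post; the sequences and function names are made up):

```python
from itertools import combinations

def mismatches(a, b):
    # Count positions at which two equal-length sequences differ.
    return sum(x != y for x, y in zip(a, b))

def compatible_pairs(seqs, min_mismatches=3):
    # "Map": enumerate candidate pairs (the part that explodes with n);
    # "reduce": keep only pairs distinct enough to use together
    # (> 2 mismatches, per the constraint above).
    return [(a, b) for a, b in combinations(seqs, 2)
            if mismatches(a, b) >= min_mismatches]

print(compatible_pairs(["ACGT", "ACGA", "TTTT"]))
```

Because each pair is checked independently, the candidate set can be partitioned across workers if the self-join really does become too large for one machine.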
"A lot of people fail to understand the overheads and limitations of this kind of architecture. Or how hard it is to program, especially considering salaries for this skyrocketed. More often than not a couple of large 1TB SSD PCIe and a lot of RAM can handle your "big" data problem."
It's not that hard to program... it does take a shift in how you attack problems.
If your data set fits on a few SSDs, then you probably don't have a real big data problem.
"Moving Big Data around is hard. Managing is harder."
Moving big data around is hard; that's why you have Hadoop: you send the compute to where the data is, which requires a new way of thinking about how you do computations.
Data science does not solve the big data problem. Here's my favorite definition of a big data problem: "a big data problem is when the size of the data becomes part of the problem." You can't use traditional serial programming models to handle a true big data problem; you have to have some strategy to parallelize the compute. Hadoop is great for that.
"A large telco has a 600 node cluster of powerful hardware. They barely use it."
Sounds more like organizational issues, poor planning and execution than a criticism of Hadoop!
It was expressly designed to avoid non-essential technical problems that get in the way of cloud application developers when they try to orchestrate algorithms across multiple machines.
Identical HBase SQL queries and their respective counterparts implemented in the Circuit Language are approximately equally long in code, and orders of magnitude faster than HBase. In both cases, the data is pulled out of the same HBase cluster for a fair comparison.
I never had any issues with Hadoop. It took me about two days to familiarize myself with it and hack together an ad-hoc script to do the staging and set up the local functions processing the data.
I really would like to understand what you consider "hard" about Hadoop or managing a cluster. It's a pretty straightforward idea, the architecture is dead simple, and it requires no specialized hardware at any level. Anyone who is familiar with the Linux CLI and running a dynamic website should be able to grok it easily, imho.
Then again, I come from the /. crowd, so YC isn't really my kind of people, generally.
Is this serious? Have you ported a program to Hadoop? Unless you use Pig or one of those helper layers it is quite hard for non-trivial problems. And those helper layers usually come with some overhead cost for non-trivial cases, too.
Big Data should be on the Peta+ level. Even with 10G Ethernet it takes a lot of bandwidth and time to move things around (and it's very hard to keep a 10G Ethernet link saturated at a constant rate from storage). This is hard even for telcos. Note that terabyte-level data today fits on an SSD.
Not really, "Big Data" has nothing to do with how many bytes you're pushing around.
Some types of data analytics are CPU heavy and require distributed resources. Your comment about 10G isn't true. You can move a terabyte around every 15 minutes or so. SSDs or a medium-sized SAN could easily keep up with the bandwidth.
If your data isn't latency sensitive and is run in batches, building a Hadoop cluster is a great solution to a lot of problems.
Of course big data is about number of bytes. That's what something like map reduce helps with. It depends on breaking down your input into smaller chunks, and the number of chunks is certainly related to the number of bytes.
(Poster) we moved from MySQL+InnoDB/MyISAM to Hadoop because of performance problems. We did test and evaluate Hadoop, and then jumped. Then TokuDB came along (technically we moved back to MariaDB) and put the performance advantage firmly back on MySQL's side. I imagine Impala and Presto and any other column-based, non-map-reduce engines would give TokuDB a fairer fight though.
sure - but that's not the original question. Maybe the OP is just curious to understand some in production use cases (and the approach to their implementation) - and isn't that interested to know the edge cases where it doesn't work.
This system isn't in production just yet, but should be shortly. We're parsing Dota2 replays and generating statistics and visualisation data from them, which can then be used by casters and analysts for tournaments, and players. The replay file format breaks the game down into 1 min chunks, which are the natural thing to iterate over.
Before someone comes along and says "this isn't big data!", I know. It's medium data at best. However, we are bound by CPU in a big way, so between throwing more cores at the problem and rewriting everything we can in C, we think we can reduce processing times to an acceptable point (currently about ~4 mins, hoping to hit <30s).
For our highly unstructured, untyped, non-relational artefacts? It'd be practically impossible to use it as a datastore in this particular case, and regardless, it would provide little to no speed increase over our current application, as the limiting factor is the CPU cost of the map function.
Maybe you should consider transforming the data to structure it a bit. Have a unique identifier per object and arrays for each seen characteristic with (value, id), plus its reverse index. Then decompose the processing of each object into sub-problems matching the characteristics n-way. It doesn't have to be in SQL, though. I'd try a columnar RDBMS. YMMV
Match stats atm, though we can get pretty much everything. We can go far beyond the combat log, reproducing an entire minimap, along with things like hero position trails, map control, etc etc. Here's a screenshot of a minimap recreation we produced for the recent Fragbite Masters tournament: http://i.imgur.com/kuOvUZX.jpg. We have more complex tools such as XP and Gold source breakdowns over time (i.e. jungle creeps, lane creeps, kills, assists etc), along with other fun stuff still in the pipeline.
We use Elastic MapReduce at Snowplow to validate and enrich raw user events (collected in S3 from web, Lua, Arduino etc clients) into "full fat" schema'ed Snowplow events containing geo-IP information, referer attribution etc. We then load those events from S3 into Redshift and Postgres.
So we are solving the problem of processing raw user behavioural data at scale using MapReduce.
All of our MapReduce code is written in Scalding, which is a Scala DSL on top of Cascading, which is an ETL/query framework for Hadoop. You can check out our MapReduce code here:
Hi sandGorgon! Thanks for the encouraging words. We haven't yet seen people use the existing Scalding ETL to pull data from Twitter or Facebook. As you suggest, there are some considerations around using Hadoop to access web APIs without getting throttled/banned. Here's a couple of links which might be helpful:
I think a gatekeeper service could make sense; or alternatively you could write something which runs prior to your MapReduce job and e.g. just loads your results into HDFS/HBase, for the MapReduce to then lookup into. Akka or maybe Storm could be choices here.
We have done a small prototype project to pull data out of Twitter & Facebook - that was only a Python/RDS pilot, but it gave us some ideas for how a proper social feed into Snowplow could work.
It's worth noting that CouchDB is using map-reduce to define materialized views. Whereas normally MR parallelization is used to scale out, in this case it's used instead to allow incremental updates for the materialized views, which is to say incremental updates for arbitrarily defined indexes! By contrast SQL databases allow incremental updates only for indexes whose definition is well understood by the database engine. I found this to be pretty clever.
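A toy sketch of the incremental idea (this is not CouchDB's actual API, just the shape of it): partial reduce values are cached per block of documents, so updating one document only re-reduces that block plus a cheap final rereduce over the cached partials.

```python
BLOCK = 2  # documents per cached block; tiny on purpose

def map_fn(doc):
    return doc["n"]          # emit one value per document

def reduce_fn(values):
    return sum(values)       # must be associative for rereduce to work

def build_view(docs):
    # Split documents into blocks and cache a partial reduce per block.
    blocks = [docs[i:i + BLOCK] for i in range(0, len(docs), BLOCK)]
    partials = [reduce_fn([map_fn(d) for d in b]) for b in blocks]
    return blocks, partials

def update_doc(blocks, partials, block_idx, doc_idx, new_doc):
    blocks[block_idx][doc_idx] = new_doc
    # Only the touched block is re-mapped and re-reduced...
    partials[block_idx] = reduce_fn([map_fn(d) for d in blocks[block_idx]])
    # ...then the final rereduce combines the cached partials.
    return reduce_fn(partials)

docs = [{"n": i} for i in range(6)]
blocks, partials = build_view(docs)
print(reduce_fn(partials))                       # total over all docs
print(update_doc(blocks, partials, 0, 0, {"n": 100}))
```

This is why the reduce function has to be associative: the engine is free to rereduce any grouping of cached partial values rather than rescanning every document.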
I've been using CouchDB (and now BigCouch) for about four years and it's both clever and useful. We're storing engineering documents (as attachments) and using map/reduce (CouchDB views) to segment the documents by the metadata stored in the fields. The only downside is that adding a view with trillions of rows can take quite a while.
I'd be happy to take the discussion offline if you want to get into more depth, but I have to wonder if the merger of Memcached and CouchDB happened right as CouchDB was about to make a name for itself. Right around that time, I think CouchDB and Mongo were equally well-known.
At least on the experimental LHC side, we process/analyse each event independently from every other event, so it's an embarrassingly parallel workload. All we do is split our input dataset up into N files, run N jobs, combine the N outputs.
Because we have so much data (of the order of 25+ PB of raw data per year; it actually balloons to much more than this due to copies in many slightly different formats) and so many users (several thousand physicists on LHC experiments) that's why we have hundreds of GRID sites across the world. The scheduler sends your jobs to sites where the data is located. The output can then be transferred back via various academic/research internet networks.
HEP also tends to invent many of its own 'large-scale computing' solutions. For example most sites tend to use Condor as the batch system, dcache as the distributed storage system, XRootD as the file access protocol, GridFTP as the file transfer protocol. I know there are some sites that use Lustre but it's pretty uncommon.
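The split/run/combine pattern described above, in miniature (the per-event "analysis" and the field names are stand-ins):

```python
def analyze(event):
    # Stand-in per-event analysis; each event is independent of every other.
    return event["energy"] > 50

def run_job(events):
    # One "job": process its share of events, return a partial summary.
    hits = sum(analyze(e) for e in events)
    return {"events": len(events), "hits": hits}

def combine(outputs):
    # Merge the N partial outputs into one result.
    return {k: sum(o[k] for o in outputs) for k in outputs[0]}

events = [{"energy": e} for e in (10, 60, 80, 20, 90, 55)]
n = 3
splits = [events[i::n] for i in range(n)]   # split the input into N parts
outputs = [run_job(s) for s in splits]      # run N independent jobs
print(combine(outputs))                     # {'events': 6, 'hits': 4}
```

In the real setting each `run_job` is a batch job scheduled at whichever GRID site holds that slice of the data, and only the small partial outputs travel over the network.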
Yes, people in sciences write distributed code (e.g., using MPI) and employ job queueing systems. I think they work well for embarrassingly parallel tasks. But what if you need to do a group-by operation? MapReduce makes these operations easy. Also the Hadoop system takes care of failing nodes, data partitioning, communication, etc.
I do hope physics people consider MR systems in addition to the existing ones they use.
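A minimal sketch of a group-by in the map/shuffle/reduce shape (toy data; the shuffle step is exactly what the framework normally does for you):

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit (key, value) pairs.
    for country, amount in records:
        yield country, amount

def shuffle(pairs):
    # Shuffle: group values by key (handled by the framework in Hadoop).
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each group independently (here, a per-key sum).
    return {k: sum(vs) for k, vs in groups.items()}

sales = [("uk", 10), ("us", 20), ("uk", 5), ("de", 7)]
print(reduce_phase(shuffle(map_phase(sales))))  # {'uk': 15, 'us': 20, 'de': 7}
```

Writing the same group-by over MPI means hand-coding the key partitioning and the all-to-all exchange that `shuffle` hides here.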
MPI has a reduce operator. On some supercomputers, the operator is implemented in an ASIC.
It is quite easy to implement map/reduce style programs in MPI. It is quite difficult or impossible to implement the parallel algorithms that win the Gordon Bell prize, almost all of which are implemented in MPI, using Hadoop.
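For intuition, MPI_Reduce-style combining can be sketched as a serial binomial-tree reduction (plain Python, not mpi4py; real MPI performs each round as message exchanges between ranks, finishing in about log2(n) steps):

```python
def tree_reduce(values, op):
    # Combine one value per "rank" pairwise, round by round, the way
    # MPI_Reduce typically folds values up a binomial tree.
    vals = list(values)
    while len(vals) > 1:
        nxt = [op(vals[i], vals[i + 1]) for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:            # an odd rank passes through this round
            nxt.append(vals[-1])
        vals = nxt
    return vals[0]

print(tree_reduce([1, 2, 3, 4, 5], lambda a, b: a + b))  # 15
```

The operator must be associative, which is the same requirement MapReduce places on combiners and reducers.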
MPI can't schedule to optimize IO/data shuffling like MapReduce; this is the price for being more general. The biggest competition to MPI in the HPC space is actually CUDA and GPUs. In either case, the data pipeline is exposed to the compiler or runtime for optimization, whereas with MPI everything is just messaging.
This is a system in early development, but my research group is planning on using MapReduce for each iteration of a MCMC algorithm to infer latent characteristics for 70TB of astronomical images. Far too much to store on one node. Planning on using something like PySpark as the MapReduce framework.
Yours is the first example where I have a decent knowledge of the field, so I can understand the needs accurately. In most cases I see people using NoSQL in places where MySQL could handle it easily.
Maybe I am just too set in my relational database ways of thinking (having used them for 13 years), but there are few cases where I see NoSQL solutions being beneficial, other than for ease of use (most people are not running Facebook or Google).
I'm a bit sceptical in most cases (though I would certainly like to know where they are appropriate).
First, I should note that my needs are fairly specific, and not typical of the rest of the NGS world. The datasets are essentially the same though.
The rate at which we are acquiring new data has been accelerating, but each of our Illumina datasets is only 30GB or so. The total accumulated data is still just a few TB. The real imperative for using MR is more about the processing of that data. Integrating HMMER, for instance, into Postgres wouldn't be impossible, but I don't know of anything that's available now.
Edit: An FDW for PostgreSQL around HMMER just made my to-do list.
Is that the same as an 'impossible to do otherwise' case?
Edit: I should say 'currently impossible' since as I noted, I can imagine being able to build SQL queries around PSSM comparisons and the like. I just can't build a system to last 5+ years around something that might be available at some point.
We use MR using Pig (data in cassandra/CFS) with a 6 node hadoop cluster to process timeseries data. The events contain user metrics like which view was tapped, user behavior, search result, clicks etc.
We process these events to use it downstream for our search relevancy, internal metrics, see top products.
We did this on mysql for a long time but things went really slow. We could have optimized mysql for performance but cassandra was an easier way to go about it and it works for us for now.
Can you estimate how much faster your processing is now vs. before on MySQL? I find it interesting that your cluster is only 6 nodes - relatively small compared to what I've seen and from what I've read about. It'd be interesting to know the benefits of small-scale usage of big data tech.
While the push-back against Map/Reduce & "Big Data" in general is totally valid, it's important to put it into context.
In 2007-2010, when Hadoop first started to gain momentum it was very useful because disk sizes were smaller, 64 bit machines weren't ubiquitous and (perhaps most importantly) SSDs were insanely expensive for anything more than tiny amounts of data.
That meant if you had more than a couple of terabytes of data you either invested in a SAN, or you started looking at ways to split your data across multiple machines.
HDFS grew out of those constraints, and once you have data distributed like that, with each machine having a decently powerful CPU as well, Map/Reduce is a sensible way of dealing with it.
I came here to say the same thing. We rip through anywhere from ~100M-3B rows of data every day. Updates to this dataset are time sensitive, so we use it to finish the job as quickly as possible.
A lot of people in this thread are saying that most data is not big enough for MapReduce. I use Hadoop on a single node for ~20GB of data because it is an excellent utility for sorting and grouping data, not because of its size.
To most people who use MapReduce in a cluster: You probably don't need to use MapReduce. You are either vastly overstating the amount of data you are dealing with and the complexity of what you need to do with that data, or you are vastly understating the amount of computational power a single node actually has. Either way, see how fast you can do it on a single machine before trying to run it on a cluster.
We use our own highly customized fork of Hadoop to generate traffic graphs and demographic information for hundreds of thousands of sites from petabytes of data, as well as building predictive models that power targeted display advertising.
We are using Map/Reduce to analyze raw XML as well as event activity streams, for example analyzing a collection of events and metadata to understand how discrete events relate to each other, as well as patterns leading to certain outcomes. I am primarily using Ruby+Wukong via the Hadoop Streaming interface, as well as Hive to analyze output and for more normalized data problems.
The company is a large Fortune 500 P&C insurer and has a small (30 node) Cloudera 4 based cluster in heavy use by different R&D, analytic and technology groups within the company. Those other groups use a variety of toolsets in the environment, I know of Python, R, Java, Pig, Hive, Ruby in use as well as more traditional tools on the periphery in the BI and R&D spaces such as Microstrategy, Ab Initio, SAS, etc.
This is the `grep/awk` use case. The nice thing about the streaming MR interface to Hadoop (calling external programs) is that you can literally take your grep/awk workflow and move it to the cluster. Retaining line-oriented records is a huge step towards a portable data processing workflow.
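As an illustration of that workflow, a minimal Streaming-style mapper/reducer pair (the log lines and key choice are hypothetical; `sorted` stands in for the shuffle/sort Hadoop runs between the two stages):

```python
def mapper(lines):
    # Mapper: like `awk '{print $1 "\t" 1}'`, emit a tab-separated
    # (key, count) line per input record.
    for line in lines:
        key = line.split()[0]
        yield f"{key}\t1"

def reducer(lines):
    # Reducer: input arrives sorted by key (Hadoop's shuffle guarantees
    # this), so counts can be summed in one streaming pass, uniq -c style.
    current, total = None, 0
    for line in lines:
        key, count = line.split("\t")
        if key != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = key, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"

if __name__ == "__main__":
    logs = ["GET /a", "GET /b", "POST /a", "GET /c"]
    for out in reducer(sorted(mapper(logs))):
        print(out)
```

With Hadoop Streaming, each stage would be its own script reading stdin and writing stdout, which is exactly why an existing grep/awk pipeline ports over so directly.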
"Things that could be trivially done in SQL :("
We use Hive over HDFS. Sure the type of things we are doing could have been done in SQL, if we had carefully curated our data set and chosen which parts to keep and which to throw away. Unfortunately, we are greedy in the storage phase. Hive allows us to save much more than we reasonably would need, which actually is great when we need to get it long after the insert was made.
Airbnb uses distributed computing to analyze the huge amount of data that it generates every day. The results are used for all sorts of things: assessing how Airbnb affects the local economy (for government relations), optimizing user growth, etc.
It would be nice if someone collected/counted the actual answers. I read the whole thread and "analyzing server logs" was the only answer. If all you have is funding or dreams/plans, you're not really _using_ the technology.
Just FYI: even small problems can be solved using the concept of MapReduce. It is a concept, not an implementation tied to Big Data and NoSQL. A simple example would be merge sort, which uses the concept of MR to sort data.
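To make that concrete, merge sort can be viewed as mapping the input into trivially sorted runs and then reducing pairs of runs by merging (a sketch of the idea, not any particular framework's API):

```python
def merge(a, b):
    # "Reduce": merge two sorted runs into one sorted run.
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]

def merge_sort(xs):
    # "Map": split the input into trivially sorted single-element runs...
    runs = [[x] for x in xs]
    # ...then "reduce" pairs of runs until one sorted run remains.
    while len(runs) > 1:
        nxt = [merge(runs[i], runs[i + 1]) for i in range(0, len(runs) - 1, 2)]
        if len(runs) % 2:
            nxt.append(runs[-1])
        runs = nxt
    return runs[0] if runs else []

print(merge_sort([5, 2, 8, 1, 9, 3]))  # [1, 2, 3, 5, 8, 9]
```

Because the merges within each round are independent, the same structure parallelizes naturally, which is exactly the property the distributed frameworks exploit.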