
I strongly agree. Although there are clearly uses for map/reduce at large scale, there is also a tendency to use it for small problems where the overhead is objectionable. At work I've taken multiple Mao/reduce systems and converted them to run on my desktop, in one case taking a job that used to take 5 minutes just to start up down to a few seconds total.

Right tool for the job and all that. If you need to process a 50PB input though, map/reduce is the way to go.
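For concreteness, a minimal sketch of what "just run it on one desktop" can look like, assuming a simple word-count style aggregation over a local file (the data format and job are hypothetical, not the poster's actual system):

    # Local "map/reduce" with only the standard library: map over chunks in
    # worker processes, then reduce by merging the partial counters.
    import collections
    import multiprocessing

    def map_chunk(lines):
        # Map phase: partial word counts for one chunk of lines
        counts = collections.Counter()
        for line in lines:
            counts.update(line.split())
        return counts

    def local_mapreduce(path, workers=4, chunk_size=100000):
        with open(path) as f:
            lines = f.readlines()
        chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
        with multiprocessing.Pool(workers) as pool:
            partials = pool.map(map_chunk, chunks)
        # Reduce phase: merge the partial counts
        total = collections.Counter()
        for partial in partials:
            total.update(partial)
        return total

No cluster, no scheduler, no multi-minute startup; for data that fits on one machine this is often all the parallelism you need.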




I completely agree as well, but I don't consider myself much of an expert in NoSQL technologies (which is why I read up on threads like this to find out).

Does anyone have a use case where data is on a single machine and map reduce is still relevant?

(I am involved in a project at work where the other guys seem to have enthusiastically jumped on MongoDB without great reasons in my opinion).


> Does anyone have a use case where data is on a single machine and map reduce is still relevant?

What matters is the running data structure. For example, you can have petabytes of logs, but you need a map/table of some kind to do aggregations/transformations. Or a sparse-matrix-based model. There are types of problems where you can partition the data structure and work in parallel in the RAM of many servers.

Related: it's extremely common to mistake for a CPU bottleneck what is actually a problem of TLB misses, cache misses, or bandwidth limits (RAM, disk, network). I'd rather invest time in improving the basic DS/algorithm than porting to Map/Reduce.
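To make the "improve the basic DS/algorithm first" point concrete, a hedged sketch: a single pass over a log stream with an in-memory hash map is often the whole "map/table" you need, assuming the per-key state fits in RAM (the log format here is hypothetical):

    # One streaming pass with a plain dict: no intermediate sort, no shuffle,
    # and far fewer cache misses than repeatedly scanning or joining raw logs.
    from collections import defaultdict

    def aggregate(log_lines):
        totals = defaultdict(int)
        for line in log_lines:
            key, value = line.split()   # assumed "<key> <value>" per line
            totals[key] += int(value)
        return totals

If that single pass turns out to be bandwidth-bound rather than CPU-bound, a MapReduce port mostly adds overhead.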


OK, maybe I am not understanding you correctly, but what you describe seems to be: if the data is on one machine, connect to a cluster of machines and run the processing in parallel across it.

That doesn't imply a NoSQL solution to me. Just parallel processing on different parts of the data. If I am wrong can you point me to a clearer example?


It sounds to me like the poster above restructured the input data to exploit locality of reference better.

http://en.wikipedia.org/wiki/Locality_of_reference


So assuming the data is on one machine (as I asked), why would an index not solve this problem? And why does Map Reduce solve it?


Indexes do not solve the locality problem (see Non-clustered indexes). Even for in-memory databases, it is non-trivial to minimize cache misses in irregular data structures like B-trees.

Now, as to why MapReduce might be a better fit even for a problem where the data fits on one disk: consider a program which is embarrassingly parallel. It just reads a tuple and writes a new tuple back to disk. The parallel I/O provided by map/reduce can offer a significant benefit in this simple case as well.
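As a toy illustration of that "read a tuple, write a tuple" case (the newline-delimited records and the transform are assumptions, not anything from a specific system):

    # Embarrassingly parallel record transform: each worker reads its own input
    # shard and writes its own output file, so the I/O itself runs in parallel.
    import multiprocessing

    def transform(record):
        # Hypothetical per-record transformation
        return record.upper()

    def process_shard(paths):
        in_path, out_path = paths
        with open(in_path) as src, open(out_path, "w") as dst:
            for line in src:
                dst.write(transform(line))

    def run(shards):
        # shards: list of (input_path, output_path) pairs, one per worker
        with multiprocessing.Pool() as pool:
            pool.map(process_shard, shards)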

Also NoSQL != parallel processing.


It's one of the issues, yes. But I wanted to be more general on purpose.


Maybe I misunderstood what you were asking.

Note both MapReduce and NoSQL are overhyped solutions. They are useful in a handful of cases, but are often applied to problems for which they are not a good fit.


I'm not sure that the two concepts are related at all. Obviously Mongo has map/reduce baked in - but that's not that relevant. Map/reduce is a reasonable paradigm for crunching information. I have a heavily CPU-bound system that I parallelise by running it on different machines and aggregating the results. I probably wouldn't call it map/reduce - but really it's the same thing.

How do you parallelise your long running tasks otherwise?


I can't say without more information on the problem to solve. As I said above, there are cases where MapReduce is a good tool.

And even if you improve the DS/algorithm first, that work usually carries over to the MapReduce port, so you save a lot of time and cost.


I completely agree with the parents about MapReduce. However, I would justify using MongoDB for totally different reasons, not scalability. It is easy to set up, easy to manage and above all easy to use, which are all important factors if you are developing something new. However, it does have a few "less than perfect" solutions to some problems (removing data does not always free disk space, no support for big decimals, ...) and it definitely takes some getting used to. But it is a quite acceptable solution for "small data" too.

Edit: ...and I wouldn't dream of using MongoDB's implementation of MapReduce.


On modern hardware with many CPU cores you can use a similar process of fork and join to maximise throughput on large datasets.
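A minimal fork/join sketch on a single many-core box, where the per-chunk work and the chunk size are placeholders:

    # Fork N worker processes over chunks of the dataset, then join the
    # partial results back together.
    from concurrent.futures import ProcessPoolExecutor

    def work(chunk):
        # Hypothetical per-chunk computation
        return sum(chunk)

    def fork_join(data, n_workers=12, chunk_size=10000):
        chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
        with ProcessPoolExecutor(max_workers=n_workers) as pool:
            return sum(pool.map(work, chunks))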


A single machine with many cores does not give the same performance as multiple machines. For example, consider disk throughput: if the data is striped across multiple nodes, then reads can be executed in parallel, resulting in linear speed-up! On a single machine you have issues of cache misses, inefficient scatter-gather operations in main memory, etc.

And it is much easier to let the MapReduce framework handle parallelism than to write error-prone code with locks/threads/MPI/architecture-dependent parallelism, etc.


That sounds more like parallelisation rather than a use case for NoSQL.

Why is that any better than keeping all of your data on one database server and having each cluster node query for its part of the data to process? Obviously there will be a bottleneck if all the nodes hit the database at the same time, but I see no benefit otherwise, and depending on the data organisation I don't even see NoSQL solving that problem (you are going to have to split the data across different servers for the NoSQL solution; why is that any better than a cached query from a central server?).
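For what it's worth, the alternative described above (each node querying its own slice of a central database) is easy to sketch, with sqlite3 standing in for the database server and the items table, id ranges, and handle() all hypothetical:

    # Each worker queries only its own id range from the central database and
    # processes that slice independently; the "sharding" is just a WHERE clause.
    import sqlite3

    def handle(payload):
        # Hypothetical per-row processing
        pass

    def process_range(db_path, lo, hi):
        conn = sqlite3.connect(db_path)
        rows = conn.execute(
            "SELECT id, payload FROM items WHERE id >= ? AND id < ?", (lo, hi)
        )
        for _id, payload in rows:
            handle(payload)
        conn.close()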


Forks/threads on (e.g.) 12-core CPUs work up to a point. But that point is probably enough to solve many problems without further complication :-)


"Does anyone have a use case where data is on a single machine and map reduce is still relevant?"

No! MapReduce is a programming pattern for massively parallelizing computational tasks. If you are doing it on one machine, you are not massively parallelizing your compute and you don't need MapReduce.


You can imagine cases where map-reduce is useful without any starting data. If you are analyzing combinations or permutations, you can create a massive amount of data in an intermediate step, even if the initial and final data sets are small.
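A concrete (and entirely hypothetical) shape of that: small input, small output, but a combinatorial intermediate step in between.

    # n items produce n*(n-1)/2 candidate pairs before filtering; the pair set
    # is the part that explodes, not the input or the result.
    import heapq
    import itertools

    def score(a, b):
        # Hypothetical scoring function for a pair
        return abs(a - b)

    def best_pairs(items, keep=10):
        candidates = itertools.combinations(items, 2)   # the huge intermediate step
        return heapq.nlargest(keep, candidates, key=lambda pair: score(*pair))

Distributing that intermediate step is where a map/reduce style split can earn its keep even when the stored data is tiny.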


Have you got any links on how to do that? It sounds very like a problem I am trying to solve just now - combinations of DNA sequences that work together on a sequencing machine.

At the moment I am self-joining a table of the sequences in MySQL, but after a certain number of self-joins the table gets massive. Time to compute is more the problem than storage space though, as I am only storing the combinations that work (> 2 mismatches in the sequence). Would MapReduce help in this scenario?
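For concreteness, the core comparison here is just a pairwise mismatch count. A minimal sketch in plain code, assuming equal-length sequences and the "> 2 mismatches" rule above (the MySQL schema is left out entirely):

    # Pairwise Hamming-distance filter: keep only pairs that differ in more
    # than 2 positions, without materialising a self-joined table.
    import itertools

    def mismatches(a, b):
        return sum(1 for x, y in zip(a, b) if x != y)

    def compatible_pairs(seqs, min_mismatches=3):   # 3 or more == "> 2 mismatches"
        for a, b in itertools.combinations(seqs, 2):
            if mismatches(a, b) >= min_mismatches:
                yield a, b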


If I had your problem, the first thing I would do is try PostgreSQL to see if it does the joins fast enough. The second thing I would try is to put the data in a SOLR db and translate the queries to a SOLR base query (q=) plus filter queries (fq=) on top.

Only if both of these fail to provide sufficient performance would I look at a map/reduce solution based on the Hadoop ecosystem. Actually, I wouldn't necessarily use the Hadoop ecosystem: it has a lot of parts/layers, and the newer and generally better parts are not as well known, so it is a bit more leading edge than lots of folks like. I'd also look at something like Riak (http://docs.basho.com/riak/latest/dev/using/mapreduce/), because then you have your data storage and clustering issues solved in a bulletproof way (unlike Mongo) but you can do MapReduce as well.


> Mao/reduce systems

Well, that certainly sounds like ...

puts on sunglasses

... a Great Leap Forward.



