A lot of people fail to understand the overheads and limitations of this kind of architecture, or how hard it is to program, especially now that salaries for this kind of work have skyrocketed. More often than not, a couple of large 1TB PCIe SSDs and a lot of RAM can handle your "big" data problem.
Before doing any Map/Reduce (or equivalent), please I beg you to check out Introduction to Data Science at Coursera https://www.coursera.org/course/datasci
Right tool for the job and all that. If you need to process a 50PB input though, map/reduce is the way to go.
Does anyone have a use case where data is on a single machine and map reduce is still relevant?
(I am involved in a project at work where the other guys seem to have enthusiastically jumped on MongoDB without great reasons in my opinion).
What matters is the running data structure. For example, you can have petabytes of logs, but you need a map/table of some kind to do aggregations/transformations, or a sparse-matrix based model. There are types of problems where you can partition the data structure and work in parallel in the RAM of many servers.
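To make that concrete, here's a minimal single-machine sketch of the idea, assuming the running data structure is a counting table: each worker builds a partial table over its partition of the logs, and the partials are merged at the end (the file layout and the parse step are hypothetical):

    # partitioned in-RAM aggregation: each process owns one partition
    from collections import Counter
    from concurrent.futures import ProcessPoolExecutor

    def aggregate_partition(path):
        counts = Counter()
        with open(path) as f:
            for line in f:
                counts[line.split()[0]] += 1  # hypothetical: first field is the key
        return counts

    def aggregate(paths):
        total = Counter()
        with ProcessPoolExecutor() as pool:
            for partial in pool.map(aggregate_partition, paths):
                total.update(partial)  # merge partial tables into the global one
        return total

The same shape scales out to many servers: each node keeps its partition's table in RAM, and only the (much smaller) partial tables travel over the network.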
Related: it's extremely common to mistake for a CPU bottleneck what is actually a TLB-miss, cache-miss, or bandwidth problem (RAM, disk, network). I'd rather invest time in improving the basic data structures/algorithms than in porting to Map/Reduce.
That doesn't imply a NoSQL solution to me. Just parallel processing on different parts of the data. If I am wrong can you point me to a clearer example?
Now, here's why MapReduce might be a better fit even for a problem where the data fits on one disk. Consider a program which is embarrassingly parallel: it just reads a tuple and writes a new tuple back to disk. The parallel IO provided by map/reduce can offer a significant benefit even in this simple case.
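For concreteness, a tuple-in/tuple-out job like that is a map-only job; here's a minimal sketch as a Hadoop Streaming mapper, assuming tab-separated tuples (the transform itself is hypothetical):

    #!/usr/bin/env python
    # mapper.py: reads one tuple per line from stdin, writes a new tuple;
    # Hadoop Streaming runs many copies in parallel, one per input split
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        fields[-1] = fields[-1].upper()  # hypothetical per-tuple transform
        print("\t".join(fields))

Run it with zero reducers (e.g. -numReduceTasks 0 with the streaming jar) and the framework gives you the parallel reads and writes for free.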
Also NoSQL != parallel processing.
Note that both MapReduce and NoSQL are overhyped solutions. They are useful in a handful of cases, but often applied to problems they are not well suited for.
How do you parallelise your long running tasks otherwise?
And even if you improve the DS/algorithm first, that improvement usually carries over to the MapReduce port, so you save a lot of time/cost.
Edit: ...and I wouldn't dream of using MongoDB's implementation of MapReduce.
And it is much easier to let the MapReduce framework handle parallelism than to write error-prone code with locks/threads/MPI/architecture-dependent parallelism, etc.
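As an illustration of how little of that the framework leaves to you, here's the canonical word count as a sketch with the mrjob library (one Python MapReduce wrapper among several; assumed installed). Note there's not a lock or thread in sight:

    # wordcount.py: the framework splits the input, runs mappers and reducers
    # in parallel, and shuffles by key; none of that appears in user code
    from mrjob.job import MRJob

    class WordCount(MRJob):
        def mapper(self, _, line):
            for word in line.split():
                yield word, 1

        def reducer(self, word, counts):
            yield word, sum(counts)

    if __name__ == "__main__":
        WordCount.run()

python wordcount.py input.txt runs it locally; the same code runs on a cluster with -r hadoop.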
Why is that any better than keeping all of your data on one database server, with each cluster node querying for its part of the data to process? Obviously there will be a bottleneck if all nodes hit the database at the same time, but I see no benefit otherwise. And depending on how the data is organised, I don't even see NoSQL solving that problem: you are going to have to split the data across different servers for the NoSQL solution anyway, so why is that any better than cached queries against a central server?
No! MapReduce is a programming pattern for massively parallelizing computational tasks. If you are doing it on one machine, you are not massively parallelizing your compute and you don't need MapReduce.
At the moment I am self-joining a table of the sequences to itself in MySQL, but after a certain number of self-joins the table gets massive. Time to compute is more of a problem than storage space, though, as I am only storing the ones that work (> 2 mismatches in the sequence). Would MapReduce help in this scenario?
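It might be worth checking how far single-machine parallelism gets you before reaching for MapReduce. A hedged sketch, assuming fixed-length sequences and keeping pairs with more than 2 mismatches as described (the threshold comes from your comment; everything else is hypothetical):

    # replaces the MySQL self-join with pairwise comparison across worker processes
    from itertools import combinations
    from multiprocessing import Pool

    def mismatches(pair):
        a, b = pair
        return a, b, sum(x != y for x, y in zip(a, b))

    def pairs_over_threshold(seqs, threshold=2):
        with Pool() as pool:
            results = pool.imap_unordered(mismatches, combinations(seqs, 2),
                                          chunksize=10000)
            for a, b, d in results:
                if d > threshold:  # "> 2 mismatches", per the stated criterion
                    yield a, b

It's still O(n^2) in the number of sequences, so past some size you'd want either a smarter bucketing scheme or a real cluster, but it spreads the CPU-bound part across all your cores.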
Only if both of these failed to provide sufficient performance would I look at a map reduce solution based on the Hadoop ecosystem. Actually, I wouldn't necessarily use the Hadoop ecosystem: it has a lot of parts/layers, and the newer and generally better parts are not as well known, so it is a bit more leading-edge than lots of folks like. I'd also look at something like Riak (http://docs.basho.com/riak/latest/dev/using/mapreduce/), because then you have your data storage and clustering issues solved in a bulletproof way (unlike Mongo) but you can do MapReduce as well.
Well, that certainly sounds like ...
puts on sunglasses
... a Great Leap Forward.
It's not that hard to program... it does take a shift in how you attack problems.
If your data set fits on a few SSDs, then you probably don't have a real big data problem.
"Moving Big Data around is hard. Managing is harder."
Moving big data around is hard; that's why you have Hadoop: you send the compute to where the data is, which requires a new way of thinking about how you do computations.
"Before doing any Map/Reduce (or equivalent), please I beg you to check out Introduction to Data Science at Coursera https://www.coursera.org/course/datasci"
Data science does not solve the big data problem. Here's my favorite definition of a big data problem: "a big data problem is when the size of the data becomes part of the problem." You can't use a traditional, linear (single-machine) programming model to handle a true big data problem; you have to have some strategy to parallelize the compute. Hadoop is great for that.
"A large telco has a 600 node cluster of powerful hardware. They barely use it."
Sounds more like organizational issues and poor planning and execution than a criticism of Hadoop!
It was expressly designed to avoid the non-essential technical problems that get in the way of cloud application developers when they try to orchestrate algorithms across multiple machines.
Identical HBase SQL queries and their counterparts implemented in the Circuit Language are approximately equally long in code, and the Circuit versions are orders of magnitude faster than HBase. In both cases, the data is pulled out of the same HBase cluster, for a fair comparison.
I never had any issues with Hadoop. It took about two days for me to familiarize myself with it and hack together an ad-hoc script to do the staging and set up the local functions processing the data.
I really would like to understand what you consider "hard" about Hadoop or managing a cluster. It's a pretty straightforward idea, the architecture is dead simple, and it requires no specialized hardware at any level. Anyone who is familiar with the Linux CLI and running a dynamic website should be able to grok it easily, imho.
Then again, I come from the /. crowd, so YC isn't really my kind of people, generally.
You sound like a snob.
Edit: no downvote from me.
I'm starting to wonder if this is really "Hacker News" or if it's "we want free advice and comments from engineers on our startups, so let's start a forum with technical articles".
Some types of data analytics are CPU-heavy and require distributed resources. But your comment about 10G isn't true: at 10 Gbit/s (~1.25 GB/s) you can move a TB every 13 minutes or so, and SSDs or a medium-sized SAN could easily keep up with that bandwidth.
If your data isn't latency-sensitive and runs in batches, building a Hadoop cluster is a great solution to a lot of problems.
WTF does this even mean? I genuinely do not understand what point you are trying to make.
I have used Slashdot longer than you have (ok, possibly not.. but username registered in 1998 here...).
I find HN has generally much more experienced people on it, who understand more about the tradeoffs different solutions provide.
The old Slashdot hidden forums like wahiscool etc were good like this too, but I don't think they exist anymore do they?
But we have recently moved a lot back to MySQL+TokuDB+SQL, which compresses the data well and keeps it to just a few terabytes.
It seems we weren't big data enough and we were tired of the execution times, although Impala and FB's newly released Presto might also have fit.
Edit: can downvoters explain their problem with this data point?
(Mind you, at my writing this is the top comment, so I don't think you're getting many downvotes. But your comment irked me, so there you go.)
> targetting expressions over billions of time series events from users.
What was the motivation to move to Map Reduce in the first place if a well understood technology, MySQL, works fine?
(Sorry if I am posting a lot on this topic. I am really interested in finding answers, not in proving a point that relational databases are better, in case anyone thinks otherwise.)
Before someone comes along and says "this isn't big data!": I know. It's medium data at best. However, we are bound by CPU in a big way, so between throwing more cores at the problem and rewriting everything we can in C, we think we can reduce processing times to an acceptable point (currently about 4 minutes, hoping to hit <30s).
I have asked a lot of questions on this topic, and no one has yet convinced me (please do if you have a legitimate NoSQL case).
Also there's the scale issue. Append-only is helpful. Why have SQL if you can't use all the features?
So we are solving the problem of processing raw user behavioural data at scale using MapReduce.
All of our MapReduce code is written in Scalding, a Scala DSL on top of Cascading, an ETL/query framework for Hadoop. You can check out our MapReduce code here:
Have you seen your ETL used to pull data from Twitter or Facebook? I am wondering what the state of the art is there, considering throttling, etc.
I think a gatekeeper service could make sense; alternatively, you could write something which runs prior to your MapReduce job and e.g. just loads your results into HDFS/HBase, for the MapReduce job to then look up. Akka or maybe Storm could be choices here.
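A rough sketch of the "runs prior to your MapReduce job" option, assuming a hypothetical rate-limited HTTP API and writing newline-delimited JSON to a file you'd then push into HDFS with hadoop fs -put:

    # fetch_feed.py: pull paged results from a throttled API before the job runs
    import json
    import time
    import urllib.error
    import urllib.request

    BASE = "https://api.example.com/events?page={}"  # hypothetical endpoint

    def fetch(url, max_retries=5):
        delay = 1.0
        for _ in range(max_retries):
            try:
                with urllib.request.urlopen(url) as resp:
                    return json.load(resp)
            except urllib.error.HTTPError as e:
                if e.code != 429:  # only retry when we're being throttled
                    raise
                time.sleep(delay)
                delay *= 2  # exponential backoff
        raise RuntimeError("gave up after repeated throttling")

    with open("events.ndjson", "w") as out:
        page = 0
        while True:
            batch = fetch(BASE.format(page))
            if not batch:
                break
            for event in batch:
                out.write(json.dumps(event) + "\n")
            page += 1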
We have done a small prototype project to pull data out of Twitter & Facebook - that was only a Python/RDS pilot, but it gave us some ideas for how a proper social feed into Snowplow could work.
It would be very interesting to take a look at it.
I'm thinking about basing a new product around Couchbase Lite, but lack of popular acceptance is one of the things holding me back.
If processing latencies don't matter much, it's an easier, more flexible system to use.
Because we have so much data (on the order of 25+ PB of raw data per year; it actually balloons to much more than this due to copies in many slightly different formats) and so many users (several thousand physicists on the LHC experiments), we have hundreds of GRID sites across the world. The scheduler sends your jobs to sites where the data is located. The output can then be transferred back via various academic/research networks.
HEP also tends to invent many of its own large-scale computing solutions. For example, most sites tend to use Condor as the batch system, dCache as the distributed storage system, XRootD as the file access protocol, and GridFTP as the file transfer protocol. I know there are some sites that use Lustre, but it's pretty uncommon.
Not for gauge configuration generation, which has to be done sequentially, but inversions and measurements can be mapped / reduced.
I do hope physics people consider MR systems in addition to the existing ones they use.
It is quite easy to implement map/reduce-style programs in MPI. It is quite difficult or impossible to implement in Hadoop the parallel algorithms that win the Gordon Bell prize, almost all of which are implemented in MPI.
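For illustration, map/reduce style in MPI really is just a scatter, a local computation, and a reduce; a minimal sketch with mpi4py (assumed available), launched via mpiexec:

    # mr_mpi.py: run with e.g. "mpiexec -n 4 python mr_mpi.py"
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # root splits the input into one chunk per rank
    chunks = None
    if rank == 0:
        data = list(range(1000000))
        chunks = [data[i::size] for i in range(size)]
    chunk = comm.scatter(chunks, root=0)

    # "map" phase: each rank processes its chunk locally
    local = sum(x * x for x in chunk)

    # "reduce" phase: combine the partial results on the root
    total = comm.reduce(local, op=MPI.SUM, root=0)
    if rank == 0:
        print(total)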
Yours is the first example where I have decent knowledge of the field, so I can understand the needs accurately. In most cases I see people using NoSQL in places where MySQL could handle it easily.
Maybe I am just too set in my relational-database ways of thinking (having used them for 13 years), but there are few cases where I see NoSQL solutions being beneficial other than ease of use (most people are not running Facebook or Google).
I am a bit sceptical in most cases (though I would certainly like to know where they are appropriate).
The rate at which we are acquiring new data has been accelerating, but each of our Illumina datasets is only 30GB or so. The total accumulated data is still just a few TB. The real imperative for using MR is more about the processing of that data. Integrating HMMER, for instance, into Postgres wouldn't be impossible, but I don't know of anything that's available now.
Edit: A FDW for PostgreSQL around HMMER just made my to do list.
Edit: I should say 'currently impossible' since as I noted, I can imagine being able to build SQL queries around PSSM comparisons and the like. I just can't build a system to last 5+ years around something that might be available at some point.
Since I can't reply directly- agreed :)
No, fair enough, if it's the only way you can get things to work. I still see lots of people jumping on the "big data" bandwagon with very moderately sized data.
We process these events for downstream use: search relevancy, internal metrics, and seeing top products.
We did this on MySQL for a long time, but things got really slow. We could have optimized MySQL for performance, but Cassandra was an easier way to go about it, and it works for us for now.
In 2007-2010, when Hadoop first started to gain momentum, it was very useful because disk sizes were smaller, 64-bit machines weren't ubiquitous, and (perhaps most importantly) SSDs were insanely expensive for anything more than tiny amounts of data.
That meant if you had more than a couple of terabytes of data you either invested in a SAN, or you started looking at ways to split your data across multiple machines.
HDFS grew out of those constraints, and once you have data distributed like that, with each machine having a decently powerful CPU as well, Map/Reduce is a sensible way of dealing with it.
What should I be using instead?
The company is a large Fortune 500 P&C insurer with a small (30-node) Cloudera 4-based cluster in heavy use by different R&D, analytics, and technology groups within the company. Those groups use a variety of toolsets in the environment; I know of Python, R, Java, Pig, Hive, and Ruby in use, as well as more traditional tools on the periphery in the BI and R&D spaces such as MicroStrategy, Ab Initio, SAS, etc.