Command-line tools can be faster than a Hadoop cluster (2014) (aadrake.com)
362 points by 0xmohit on Sept 11, 2016 | 171 comments



I feel like every time something like this comes up people completely skip over the benefit of having as much of your data processing jobs in one ecosystem as possible.

Many of our jobs operate on low TBs and growing, but even if the data for a job is super small I'll write it in Hadoop (Spark these days) so that the build, deployment, and scheduling of the job are handled for free by our current system.

Sure, spending more time listing files on S3 at startup than running the job is a waste, but it's far less of a waste than the man-hours to build and maintain a custom data transformation.

The main benefit of these tools is not the scale or processing speed though. The main benefits are the fault tolerance, failure recovery, elasticity, and the massive ecosystem of aggregations, data types and external integrations provided by the community.


Having worked at a large company and extensively used their Hadoop Cluster, I could not agree more with you.

The author of the blog post/article completely misses the point. The goal with Hadoop is not minimizing the lower bound on the time taken to finish the job, but rather maximizing disk read throughput while supporting fault tolerance, failure recovery, elasticity, and the massive ecosystem of aggregations, data types and external integrations, as you noted. Hadoop has enabled Hive, Presto and Spark.

The author completely forgets that the data needs to be transferred in from some network storage and the results need to be written back! For any non-trivial organization (> 5 users), you cannot expect all of them to SSH into a single machine. It would be an instant nightmare. This article is essentially saying "I can directly write to a file in a local file system faster than to a database cluster", hence the entire DB ecosystem is hyped!

Finally, Hadoop is not a monolithic piece of software but an ecosystem of tools and a storage engine. Consider Presto, for example: software developers at Facebook ran into the exact problem outlined in the blog post, but instead of hacking together bash scripts and command-line tools, they built Presto, which performs essentially similar functions on top of HDFS. Because of the way it works, Presto is actually faster than the "command line" tools suggested in this post.

https://prestodb.io/
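To make that concrete, the kind of aggregation the blog post does with grep/sort/uniq looks roughly like this through the Presto CLI (illustrative only; the server address, catalog, and table/column names here are made up):

    # hypothetical Hive table of chess games already sitting in HDFS;
    # same result-count aggregation as the article, but run across the cluster
    presto --server presto-coordinator:8080 \
           --catalog hive --schema chess \
           --execute "SELECT result, count(*) AS games
                      FROM game_results
                      GROUP BY result"

You get the distributed scan, the scheduling and the fault handling without writing any of it yourself.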


> you cannot expect all of them to SSH into a single machine

Why not?

I can do exactly this (KDB warning). Building one or two very beefy machines is 1000x faster, and a lot cheaper than a Hadoop setup.

> The author of the blogpost/article, completely … The author completely … [Here's my opinions about what Hadoop really is]

This is a very real data-volume with two realistic solutions, and thinking that every problem is a nail because you're so invested in your hammer is one of the things that causes people to wait for 26 minutes instead of expecting the problem to take 12 seconds.

And it gets worse: terabyte-RAM machines are accessible to the kinds of companies that have petabytes of data and the business case for it, and managing the infrastructure and network bandwidth is a bigger time sink than you think (or are willing to admit).

If I see value in Hadoop, you might think I'm splitting hairs, so let me be clear: I think Hadoop is a cancerous tumour that has led many smart people to do very very stupid things. It's slow, it's expensive, it's difficult to use, and investing in tooling is just throwing good money after bad.


> Building one or two very beefy machines is 1000x faster, and a lot cheaper than a Hadoop setup.

You have a few petabytes of data and your working set is 50 TB. You put it on two machines. All your data is now on these SGI UV 3000s or whatever. You now need a bunch of experts because any machine failure is a critical data loss situation, and a throughput cliff situation. You've taken a fairly low-stakes situation (disk failure, let's say) and transformed it into the Mother of All Failures.

And then you're betting that next year your working set won't be over the max for the particular machine type you've decided on. What will you even do when it is? Sell this one and get the newer, bigger machine? Hook them both up and run things in a distributed fashion?

And then there's the logistics of it all. You're going to use tooling to submit jobs on this machine, and there's got to be configurable process limits, job queues, easy scheduling and rescheduling, etc. etc.

I mean, I'm firmly in the camp that lots of problems are better solved on the giant 64 TB, 256 core machines, but you're selling an idea that has a lot of drawbacks.


And people with 64TB, 256 core machines don't have RAID arrays attached to their machine for this exact reason?

If it's "machines" plural, than you can do replication between the two. There's your fallover in case of complete failure.


> If it's "machines" plural, than you can do replication between the two.

This is the start of a scaling path that winds down Distributed Systems Avenue, and eventually leads to a place called Hadoop.

(Replication and consensus are remarkably difficult problems that Hadoop solves).


Fair 'nuff, but if you don't distribute compute, and you store the dataset locally on all the systems (not necessarily the results of mutations, just the datasets and the work that's being done), you'll still possibly reap massive perf gains over Hadoop in certain contexts.


> you'll still possibly reap massive perf gains over Hadoop in certain contexts.

Certainly, and unfortunately, the exact point at which Hadoop becomes the better option over big iron is generally an ongoing debate and shifting target. But there's no doubt that such a point actually exists.


...I'm not sure if it does. Bain's Law still stands.

But if it does, then it's a pretty big chunk of data, and a very fast network.


Couldn't find anything for Bain's Law, do you have a reference I could follow?


I believe it's a reference to this old (but insightful) comment: https://news.ycombinator.com/item?id=8902739

As such, it is actually "Bane's Rule" which states, "you don't understand a distributed computing problem until you can get it to fit on a single machine first."

(Thanks to nekopa, who also referenced it further down in this thread.)


Both disk and CPU failures are recoverable on expensive hardware.


"You have a few petabytes of data and your working set is 50 TB. You put it on two machines. All your data is now on these SGI UV 3000s or whatever. "

There's usually a combination of apps that work within the memory of the systems, plus a huge amount of external storage with a clustered filesystem, RAID, etc. An example supercomputer from SGI is below (since you brought them up) that illustrates how they separate compute, storage, management and so on. Management software is available for most clusters to automate, or make easy, a lot of what you described in your later paragraph. They use one. It was mostly a solved problem over a decade ago, with sometimes only one or two people running supercomputer centers at various universities.

http://www.nas.nasa.gov/hecc/resources/pleiades.html


Yes, but old-school MPI style supercomputer clusters are closer to Hadoop style clusters than standalone machines for the purpose of this discussion.

Both have mechanisms for doing distributed processing on data that is too big for a single machine.

The original argument was that command line tools etc are sufficient. In both these cases they aren't.


Well, this is actually covered in the accompanying blogpost (link in comments below), and he makes a salient point:

"At the same time, it is worth understanding which of these features are boons, and which are the tail wagging the dog. We go to EC2 because it is too expensive to meet the hardware requirements of these systems locally, and fault-tolerance is only important because we have involved so many machines."

Implicitly: the features you mention are only fixes introduced to solve problems that were caused by the chosen approach in the first place.


"The features you mention are only fixes introduced to solve problems that were caused by the chosen approach in the first place."

The chosen approach is the only choice! There is a reason why smart people at thousands of companies use Hadoop. Fault tolerance and multi-user support are not mere externalities of the chosen approach but fundamental to performing data science in any organization.

Before you comment further, I highly, highly encourage you to get some "real world" experience in data science by working at a large or even medium-sized company. You will realize that outside of trading engines, "faster" is typically the third or fourth most important concern. For data and computed results to be used across an organization, they need to be stored centrally; similarly, Hadoop allows you to centralize not only data but also computations. When you take this into account, it does not matter how "fast" command-line tools are on your own laptop, since your speed is now determined by the slowest link, which is data transfer over the network.


"Gartner, Inc.'s 2015 Hadoop Adoption Study has found that investment remains tentative in the face of sizable challenges around business value and skills.

Despite considerable hype and reported successes for early adopters, 54 percent of survey respondents report no plans to invest at this time, while only 18 percent have plans to invest in Hadoop over the next two years. Furthermore, the early adopters don't appear to be championing for substantial Hadoop adoption over the next 24 months; in fact, there are fewer who plan to begin in the next two years than already have."

So lots of big businesses are doing just fine without Hadoop and have no plans for beginning to use it. This seems very much at odds with your statement that "The chosen approach is the only choice!"

In fact I would hazard a guess that for businesses that aren't primarily driven by internet pages, big data is generally not a good value proposition, simply because their "big data sets" are very diverse, specialised and mainly used by certain non-overlapping subgroups of the company. Take a car manufacturer, for instance. They will have really big data sets coming out of CFD and FEA analysis by the engineers. Then they will have a lot of complex data for assembly line documentation. Other data sets from quality assurance and testing. Then they will have data sets created by the sales people, other data sets created by accountants, etc. In all of these cases they will have bespoke data management and analysis tools, and the engineers won't want to look at the raw data from the sales team, etc.


My experience echoes the OP of this thread; having data in one place backed by a compute engine that can be scaled is a huge boon. Enterprise structure, challenges and opportunities change really fast now; we have mergers, new businesses, new products, and the requirement to create evidence to support the business conversations these generate is intense. A single data infrastructure cuts the time required to do this kind of thing from weeks to hours - I've had several engagements where the Hadoop team produced answers in an afternoon that were only "confirmed" by the proprietary data warehouses days or weeks later, after query testing and firewall hell.

For us Hadoop "done right" is the only game in town for this use case, because it's dirt cheap per TB and has a mass of tooling. It's true that we've underinvested, but mostly because we've been able to get away with it; we are running thousands of enterprise jobs a day through it, and without it we would sink like a stone.

Or spend £50m.


Is there anything in your Hadoop that's not "business evidence", financials, user acquisition etc?

My point is that there are many many business decisions driven by analysing non-financial big data sets that physically cannot be done with data crunched out in five hours. These may even require physical testing or new data collection to validate your data analysis.

Like I mentioned, anyone doing proper Engineering (as in, professional liability) will have the same level of confidence in a number coming out of your Hadoop system as they would in a number their colleague Charlie calculated on a napkin at the bar after two beers. Same goes for people in the pharma/biomolecular/chemical industries, oil and gas, mining etc etc.


> Like I mentioned, anyone doing proper Engineering (as in, professional liability) will have the same level of confidence in a number coming out of your Hadoop system as they would in a number their colleague Charlie calculated on a napkin at the bar after two beers. Same goes for people in the pharma/biomolecular/chemical industries, oil and gas, mining etc etc.

What are you talking about?

I personally know people working in mining, oil/gas as well as automotive engineering (which you mentioned previously). All rely on Hadoop. I'm sure I could find you some in the other fields too.

Are you seriously thinking Hadoop isn't used outside web companies or something?

Teradata sells Hadoop now, because people are migrating data warehouses off their older systems. This isn't web stuff, it is everything the business owns.


One of the developments that we're after is radical improvements in data quality and standards of belief (provenance, verification, completeness).

A huge malady that has sometimes affected business is decisions made on the basis of spreadsheets of data that are from unknown sources, contradicted left, right and centre, and full of holes.

A single infrastructure helps us do this because we can establish KPIs on the data and control them (as the data is coming to the centre rather than a unit providing summaries or updates with delays), so we know when data has gone missing and have often been able to do something about it. In the past it was simply gone, and by the time that was known there was no chance of recovery.

Additionally we are able to cross reference data sources and do our own sanity checks. We have found several huge issues by doing this, systems reporting garbage, systems introducing systematic errors.

I totally agree, if you need to take new readings then you have to wait for the readings to come in before making a decision. This is the same no matter what data infrastructure you are using.

On the other hand there is no reason to view data coming out of Hadoop as any less good than data coming from any other system, apart from the assertion that Hadoop system X is not being well run, which is more of a diagnosis of something that needs fixing than anything else I think.

There are several reasons (outlined above) to believe that a well run data lake can produce high quality data. If an Engineer ignored (for example) a calculation that showed that a bridge was going to fail, because the data that hydrated it came out of my system, and instead waited a couple of days for data to arrive from the stress analysis group, metallurgy group and traffic analysis group, would they be acting professionally?

Having said all that I do believe that there are issues with running Hadoop data lakes that are not well addressed and stand in the way of delivering value in many domains. Data audit, the ethical challenges of recombination and inference and security challenges generated by super empowered analysts all need to be sorted. Additionally we are only scratching the surface of processes and approaches to managing data quality and detecting data issues.


Yeah, that sounds fun, dozens of undocumented data silos without supervision that some poor bastard will have to troubleshoot as soon as the inevitable showstopper bug crops up.


Most medium and big enterprises have a working set of data around 1-2TB. Enough to fit in memory on a single machine these days.


> .. cannot expect all of them to SSH into a single machine ..

That's pretty much how the Cray supercomputer worked at my old university. SSH to a single server containing compilers and tooling. Make sure any data you need is on the cluster's SAN. Run a few cli tools via SSH to schedule job, and bam - a few moments later your program is crunching data on several tens of thousands of cores.
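For anyone who hasn't seen that workflow, it's roughly the following (a sketch assuming a SLURM-style scheduler; the script name, node count and paths are made up, and other sites use PBS/qsub instead):

    $ cat crunch.sbatch
    #!/bin/bash
    #SBATCH --nodes=64                 # ask the scheduler for 64 nodes
    #SBATCH --time=02:00:00            # wall-clock limit
    srun ./crunch /san/project/input   # input already staged on the cluster's SAN

    $ sbatch crunch.sbatch             # queued, then fanned out over thousands of cores

One login host, many users, and nobody fights over the compute nodes because the scheduler arbitrates.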


But, as I pointed out in another comment, what about systems like Manta, which make transitioning from this sort of script to a full-on mapreduce cluster trivial?

Mind, I don't know the performance metrics for Manta vs Hadoop, but it's something to consider...


Totally agree. It'd be relatively trivial to automate converting this script into a distributed application. Haven't checked Manta out, but I will. For ultimate performance, though, right now you could go for something like OpenMP + MPI, which gets you scalability/fault tolerance. In a few months you'll also be able to use something like RaftLib as a dataflow/stream processing API for distributed computation (the distributed back-end is almost ready to roll out). MPI, though, has decades of HPC research behind it, making it the most robust distributed compute platform in existence (though not the easiest to use). You think your big data problems are big... nah, supercomputers were doing today's big data back in the late 90s. Just a totally different crowd with slightly different solutions. MPI is hard to use; Spark/Storm is much easier... but much slower.
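To give a flavour of that trade-off: launching an MPI job is the easy part; it's writing the communication code that hurts (the binary name, input path and process count below are just placeholders):

    mpicc -O3 -o wordcount wordcount.c        # compile against the MPI library
    mpirun -np 256 ./wordcount /data/part-*   # run 256 cooperating processes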


From my experience, organizations have adopted Hive/Presto/Spark on top of Hadoop, which actually solves a whole bunch of problems that the "script" approach would not, and with several added benefits. Executing scripts (cat, grep, uniq, sort) does not provide similar benefits, even if it might be faster. A dedicated solution such as Presto by Facebook will provide similar if not faster results.

https://prestodb.io/


Ah, so it doesn't solve data storage, and it runs SQL queries, which are less capable than UNIX commands. If your data's stuck inside 15 SQL DBs, then that'd make sense, but a lot of data is just stored in flat files. And you know what's really good at analyzing flat files? Unix commands.


Did you even read it? Presto reads directly from HDFS, which is as close to distributed "flat files" as you can get. As for "SQL being less capable than UNIX commands", you have got to be kidding me. SQL allows type checking, conversion, and joins, all of which are difficult if not impossible with grep | uniq | sort etc.


I read it.

>Presto allows querying data where it lives, including Hive, Cassandra, relational databases or even proprietary data stores. A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization.

That doesn't sound like HDFS to me. I mean, I assume it can read from HDFS, but Presto is backend agnostic. You could probably write code to run it on Manta. That would be neat for people who like Presto, I guess.

Type checking and conversions, no, and table joins only matter when you're handling relational data.

Also, how many formats can Presto handle? Unix utilities can handle just about any tabular data, and you can run them against non-tabular data in a pinch (although nobody recommends it). I doubt Presto is that versatile.


Hive operates on top of HDFS.

Presto absolutely runs directly on HDFS.


Huh. Well then, I don't understand HDFS, or Facebook needs to fix Presto's front page. Both are reasonably likely.


> For any non-trivial organization ( > 5 users), you cannot expect all of them to SSH into a single machine.

That's exactly the use case we built Userify for (https://userify.com)


This is a very good point. I think many people are so caught up in bashing the hype train around big data and data science that they just casually dismiss these incredibly valid points. It's not necessarily about how big your data is right now, but how big your data will be in the very near future. Retooling then is often a lot more difficult than just tooling properly up front, so even if some Spark operations might seem to add unnecessary overhead right now the lower transition cost down the road is often worth it.


I think the point is if you really have big data then it makes sense, but many shops add huge cost and complexity to projects where simpler tools would be more than adequate.


It's the tooling, not the size of the data. Using "big data ecosystem" tools allows you to use all kinds of useful things like Airflow for pipeline processing, Presto to query the data, Spark for enrichment and machine learning, etc... all of that without moving the data, which greatly simplifies the metadata management that has to be done if you're serious about things like data provenance and quality.


A SQL DB + Tableau are vastly more powerful and mature than those tools; they just can't do "big data", that's all.


The data preparation work involved in doing anything trivial with that setup is mind-boggling. I'm going to assume that this very cavalier assumption comes from a place of relatively little knowledge. Do some of this and you'll realize very soon you'd rather shoot yourself than do complex data prep on a constantly evolving schema using schema-on-write tools.


...and how often do you need to do all that to a dataset?


That's definitely true too. Being able to accurately assess whether that need will ever exist (or not) early on is invaluable.


These are valid points, and I agree many underestimate the cost of retooling and infrastructure. However, I am working on a team of smart engineers, but shell scripting is new to them, much less learning a full Hadoop / spark setup and associated tools. Luckily, you can often have your cake and eat it too: https://apidocs.joyent.com/manta/example-line-count-by-exten... Super useful system so far, and my goal is to allow our team to learn some basic scripting techniques and then run them on our internal cloud using almost identical tooling. Plus things like simple Python scripts are really easy to teach, and with this infrastructure it can scale quickly!


Day 2 problems don't come to mind during days 0 and 1.


Sometimes they don't come at all.


That was a quote from Austin Powers. And a reference to the fact that Day 2 requirements almost never come.


> [...] the benefit of having as much of your data processing jobs in one ecosystem as possible. [...] The main benefits are the fault tolerance, failure recovery, elasticity, and the massive ecosystem of aggregations, data types and external integrations provided by the community.

Yep! Elasticity is a pretty nice benefit.

Sure, if you're processing a few gigabytes of data, then you could do that with shell scripts on a single machine. However, if you want to build a system that you can "set and forget", that will continue to run over time as data sizes grow arbitrarily, and that -- as you say -- can be fault tolerant, then distributed systems are nice for that purpose. The same job and the same techniques that handle a few gigabytes of data can scale to petabytes if needed.

A job running on a single machine with shell scripts will eventually reach a limit where the data size exceeds what it can handle reasonably. I've seen this happen repeatedly first hand, to the extent that I'd be reluctant to use this approach in production unless I needed something really quick-n-dirty where scaling isn't a concern at all. Another problem with these single-machine solutions is their reliability. If it's for production use, you really want seamless, no-humans-involved failover, which isn't as straightforward to achieve with the single-machine approach unless you deploy somewhat specialized technology (it ends up being something like primary/standby with network attached storage).

Plus, in an environment where you have one job processing GiBs of data, you tend to have more. While any single solo job handling GiBs of data could be done locally, once you have a lot of them, accessed by many different people at a company and under different workflows, the value of distributed data infrastructure starts to make more sense.

Neat article though. Always good to have multiple techniques up your sleeve, to use the right one for the problem at hand.


What happens if a consultant approaches a company that uses Hadoop and offers them "custom data transformation" solutions for their most frequent processing jobs at lower cost and that beat Hadoop's processing times 100-fold?

The company saves money by choosing the lower cost and time by not having to wait for slow Hadoop processing.

It seems like there is competitive advantage to be gained and money to be made by taking advantage of Hadoop's inefficiencies.

But then things are not always what they seem.


This assumes that the company will be making a rational choice, which is rarely the case. The choice is usually made based on "no one ever got fired for using X".


Or/and "my resume will look good because big data and Hadoop and Spark and ..."


Most "big data" problems are really "small data" by the standards of modern hardware:

* Desktop PCs with up to 6TB of RAM and many dozens of cores have been available for over a year.[1]

* Hard drives with 100TB capacity in a 3.5-inch form factor were recently announced.[2]

CORRECTION: THE FIGURE IS 60TB, NOT 100TB. See MagnumOpus's comment below. In haste, I searched Google and mistakenly linked to an April Fools' story. Now I feel like a fool, of course. Still, the point is valid.

* Four Nvidia Titan X GPUs can give you up to 44 Teraflops of 32-bit FP computing power in a single desktop.[3]

Despite this, the number of people who have unnecessarily spent money and/or complicated their lives with tools like Hadoop is pretty large, particularly in "enterprise" environments. A lot of "big data" problems can be handled by a single souped-up machine that fits under your desk.

[1] http://www.alphr.com/news/enterprise/387196/intel-xeon-e7-v2...

[2] http://www.storagenewsletter.com/rubriques/hard-disk-drives/...

[3] https://news.ycombinator.com/item?id=12141334


Well basically as soon as big data hit the news everyone stopped doing data and started doing big data.

Same thing with data science. Opened a Jupyter notebook, loaded 100 rows, and displayed a graph - "I am a data scientist".


> Hard drives with 100TB capacity in a 3.5-inch form factor were recently announced

That is an April Fools story.

(Of course you can still get a Synology DS2411+/DX1211 24-bay NAS combo for a few thousand bucks, but it will take up a lot of space under your desk and keep your legs toasty...)


Earlier this year, Seagate were showing off their 60 TB SSD[1] for release next year.

So 100 TB in a single drive isn't too far off.

EDIT: Toshiba is teasing a 100 TB SSD concept, potentially for 2018 [2]

[1] 11th August - http://arstechnica.co.uk/gadgets/2016/08/seagate-unveils-60t...

[2] 10th August - http://www.theregister.co.uk/2016/08/10/toshiba_100tb_qlc_ss...


60TB at 500MB/s transfer will take over a day just to read the data. This is the problem of drinking the ocean through a straw. Even with SSD transfer rates, it is still a problem at scale. Clusters give you not only capacity, but also a multiplication factor for transfer rates.


Just use 24 of them interleaved/striped and it will take only a little over an hour to load the data.


But then you need small disks (e.g. 2TB). My point is that huge-capacity drives are not appropriate in compute environments such as Hadoop; they're more for cold storage.


Putting a 100TB of storage on your network isn't hard though. There are off-the-shelf NAS servers that big - eg http://www.ebuyer.com/671448-qnap-ts-ec2480u-rp-144tb-24-x-6...


Right, "data scientists" with experience call themselves statisticians or analysts or whatever. The "data science" or "big data" industry is comprised of people who just think a million rows of data sounds impressively big because they never experienced data warehouses in the 1990s where a million rows was not even anything special...


The first paying job we ran through our Hadoop cluster in 2011 had 12 billion rows, and they were fairly big rows. This was beyond the limit of what our proprietary MPP database cluster could handle in the processing window it had (to be fair, the poor thing was/is loaded 90%+, which is not a great thing, but a true thing for many enterprises). We couldn't get budget for the scaling bump we hit with the evolution of that machine, but we could pull together a six-node Hadoop cluster and, lo and behold, for a pittance we got a little co-processor that could. One other motivation was/is that the use case accumulates 600m rows a day, and we were then able to engineer (cheaply) a solution that can hold 6 months of that data vs 20 days. After 6 months our current view is that it's not worth keeping the data, but we are beginning to get cases of regret that we've ditched the longer-window stuff.

There are queries and treatments that process hundreds of billions of substantial database rows on other cheap open source infrastructures, and you can buy proprietary data systems that do it as well (and they are good), but if you want to do it cheaply and flexibly then so far I think Hadoop wins.

I think that Hadoop won 4 years ago and has been the centre of development ever since (in fact even before, when MS cancelled Dryad). I think it will continue to be the weapon of choice for at least 6 more years and will be around and important for 20 more after that. My only strategic concern is the filesystem splintering that is going on with HDFS/Kudu.


So you have large data storage, and processing that can handle large data (assuming for convenience that you have a conventional x86 processor with that throughput). The only problem that remains is moving things from the former to the latter, and then back again once you're done calculating.

That's (100 * 1024 GB) / (20 GB/s) = 85 minutes just to move your 100 TB to the processor, assuming your storage can operate at the same speed as DDR4 RAM. A 100-node Hadoop cluster hits the same (100 * 1024 GB) / (100 * 0.2 GB/s) ≈ 85 minutes with nothing but commodity disks.

Back-of-the-envelope stuff, obviously, with caveats everywhere.


Problem with that kind of setup is that if you unexpectedly need to scale out of that, you haven't done any of the work required to do that, and you're stuck.


How often do you "unexpectedly need to scale out"? By an order of magnitude at least that is, because under that you could add a few more of those beefed-up machines.

I wonder what happened to the YAGNI principle. It has arguable uses in some places, but it seems to fit this one perfectly.


I've had such a situation, and we were lucky that we had written software that makes dealing with embarrassingly parallel problems embarrassingly scalable.


Yes, there are desktops with huge amounts of RAM, but buying a machine like that would probably cost more than setting up a Hadoop cluster on commodity hardware. And for embarrassingly parallel problems, Hadoop can scale semi-seamlessly.

In reality, it still takes work... but can be done.


This idea was the subject of a paper at a major systems conference. The paper is called "Scalability! But at what COST?" It goes well beyond this simple example to explore how most major systems papers produce results that can be beaten by a single laptop. Here are the paper and the blog post describing it.

http://www.frankmcsherry.org/assets/COST.pdf

http://www.frankmcsherry.org/graph/scalability/cost/2015/01/...


I love this. I hope the 'COST' measurement takes off.

I'm not going to go as far as some in condemning the latest frameworks, but I do agree that they are often chosen with no concept of the overhead imposed by being distributed.

Is there anything similar comparing 'old school' distributed frameworks like MPI to the new ones like Spark? I'm curious how much of the overhead is due to being distributed (network latency and Amdahl's law) versus the overhead from the much higher-level, and more productive, framework itself.


They use Rust, fantastic! Will put this on my must-read-list. Based on their graphs, it makes one wonder how much literal energy has been wasted using 'scalable' but suboptimal solutions... Of course if you're wishing to start a company competing on data processing (e.g. small IoT startups), being a bit cleverer could let you have the same performance or feature set with 1/10th the overhead costs. So maybe don't let too many people know? ;)


The paper is frankly stupid and a great example of the difference between practice and academia. It looks good because they are using a snapshot of the Twitter network from 2010. In reality the workflow is complex, e.g. the follower graph gets updated every hour. 10 different teams have their own requirements as to how to set up the graph and computations. These computations need to be run at different (hourly, weekly, daily) granularities. 100 downstream jobs are also dependent on them and need to start as soon as the previous job finishes. The output of the jobs gets imported/indexed in a database which is then pushed to production systems and/or used by analysts who might update and retry/rerun computations. Unlike for a bunch of out-of-touch researchers, the key concern isn't how "fast" calculations finish, but several others such as the ability to reuse, fault tolerance, multi-user support etc.

I can outrun a Boeing 777 on my bike in a 3 meter race but no would care. The single laptop example is essentially that.


> The paper is frankly stupid and a great example of difference between practice and academia. it looks good because they are using a snapshot of Twitter network from 2010.

We used these data and workloads because that was what GraphX used. If you take the graphs any bigger, Spark and GraphX at least couldn't handle it and just failed. They've probably gotten better in the meantime, so take that with a grain of salt.

> Unlike a bunch of out of touch researchers the key concern isn't how "fast" calculations finish, but several others such as ability to reuse, fault tolerance, multi user support etc.

The paper says these exact things. You have to keep reading, and it's hard I know, but for example the last paragraph of section 5 says pretty much exactly this.

And, if you read the paper even more carefully, it is pretty clearly not about whether you should use these systems or not, but how you should not evaluate them (i.e. only on tasks at a scale that a laptop could do better).


"The paper says these exact things. You have to keep reading, and it's hard I know, but for example the last paragraph of section 5 says pretty much exactly this."

Thanks, that addresses my concern. I take back my comment.

But why stop at a Rust implementation? There are vendors optimizing it down to FPGAs. This sort of comparison is hardly meaningful.


The only point of the paper is that the previous publications sold their systems primarily on performance, but their performance arguments had gaping holes.

The C# and Rust implementations have the property that they are easy and you don't need to have any specific skills to write a for-loop the way we did (the only "tricks" we used were large pages and unbuffered io in C#, and mmap in Rust).

The point is absolutely not that these are the final (or any) word in these sorts of computations; if you really care about performance, use FPGAs, ASICs, whatever. There will always be someone else doing it better than you, but we thought it would be nice if that person wasn't a CS 101 undergraduate typing in what would literally be the very first thing they thought of.


It's a great paper. I really enjoyed it. Keep hitting them with reality checks they need! :)


How many companies out there playing with big data are at least half of the size of Twitter?


You don't need to be "half the size of Twitter". What does that even mean, in headcount, in TB stored, half of the snapshot they used?

The benefits of using a distributed/Hadoop-style approach to managing your data assets become evident as soon as you have more than 5 employees who access such systems, unless your workload is highly specific, e.g. Deep Learning, where it makes total sense to use a single machine with as many GPUs as possible.

Let me clarify that I used the exact snapshot, in 2012 (here is a post that was even cited by a few papers [0]). However, I knew that the reality of using this data was far more complex, and even though you can write "faster" programs on your laptop (I used GraphLab) than on a cluster (I had access to a 50-node Cornell cluster), it didn't mean much.

[0] https://scholar.google.com/citations?view_op=view_citation&h...


Back when I was working in telecommunications (a long time ago), operators had GBs of data coming out of network elements and flowing back into the network management systems.

That data was handled pretty well with Oracle OLAP in HP-UX servers.

I don't work with big data, but I get to see some of the RFPs we get, and most of them are in the scenario of a 2016 laptop being able to process the data.


> 1.75GB

> 3.46GB

These will fit in memory on modest hardware. No reason to use Hadoop.

The title could be: "Using tools suitable for a problem can be 235x faster than tools unsuitable for the problem"


This is exactly the point he was making.

People have a desire to use the 'big' tools instead of trying to solve the real problem.

People both underestimate the power of their desktop machine and the 'old' tools and overestimate the size of their task.


Occasionally designers seem to seek credit merely for possessing a new technology, rather than using it to make better designs. Computers and their affiliated apparatus can do powerful things graphically, in part by turning out the hundreds of plots necessary for good data analysis. But at least a few computer graphics only evoke the response "Isn’t it remarkable that the computer can be programmed to draw like that?" instead of "My, what interesting data".

- Edward Tufte

Applies to more than just design.


>People have a desire to use the 'big' tools

Not only that, people seem to love claiming that they're "big data", perhaps because it makes them sound impressive and bigger than they are.

Very few of us will ever do projects that justify using tools like Hadoop, and too few of us are willing to accept that our data fits in SQLite.


Yeah, someone was telling me they need big data for a million rows. I laughed and said SQLite handles that...


I would not want to be the one on-call for a million row SQLite database!


I would. That's a pager that's never going off.


Yeah. No significant software is bug-free, but if there was one, SQLite would be a good candidate.


Especially having looked at how thorough their test suite is.


That's what I'm talking about. What other database tests against power outages and hardware failure?


I've seen MySQL tested against power outages and hardware failure on a dozen occasions. If I got paid overtime for dealing with the results I'd probably own half of Vancouver by now.


I'm sure many do...


Do they do it for every release? What about OOM error testing? IO error tests? Fuzz tests? UB tests? Malformed DB tests? Valgrind analysis? Memory leak checking? Regression tests?

SQLite does all of that and more with every release (https://sqlite.org/testing.html). There's a reason I consider their test coverage impressive...


I don't know. I remember reading that other DBs use SQLite's test suite


Huh. Well, I learned something new today.


I'm curious why you think SQLite is a database technology that somehow can't handle a million rows.


Thousands of customers, millions of rows, 0 DB issues. All bugs are mine. And the damn thing even does JSON querying now.


I love it when clients think they need a server workstation

Specs be damned!

I need to start selling boxes


Last paragraph of the article: "Hopefully this has illustrated some points about using and abusing tools like Hadoop for data processing tasks that can better be accomplished on a single machine with simple shell commands and tools"

That was exactly his point.


Not necessarily true. Depending on your use cases, it often still makes sense to use Hadoop. A really common scenario is that you'll implement your 3.5 GB job on one box, then you'll need to schedule it to run hourly. Then you'll need to schedule 3 or 4 to run hourly. Then your boss will want you to join it with some other dataset, and run that too with the other jobs. You'll eventually implement retries, timeouts, caching, partitioning, replication, resource management, etc.

You'll end up with a half assed, homegrown Hadoop implementation, wishing you had just used Hadoop from the beginning.



I'd rather pay for 7 c4.large* instances ($.70/hour for all of them) compared to an x1 ($13.38/hour).

*The original article is from 2014 and references c1.medium which are no longer available. c4.large is the closest you can get without dipping down into the t2 class.


...And when you do have 140 TB of chess data, you can move to Manta, and you get to keep your data processing pipeline almost exactly the same. Upwards scalability!

I don't know how the performance would stack against Hadoop, but it'd work.


Manta storage service teaser: https://www.youtube.com/watch?v=d2KQ2SQLQgg


In about 20,000 years the chess DB will get that big. Until then grep should be fine.


Actually, I just posted essentially the same thing before reading your comment. I'm wondering as well how the performance would/will scale. It likely depends on how the data is scattered/replicated, but presumably they've worked out decent schedulers for the system. If not, it is open source! Lovin it.


Well, all the code running against the data would already have the parallelization advantages of a shell script, as described in this article. It would additionally probably be running across multiple nodes, meaning that the IO speeds increase the number of records that can be processed simultaneously. The disadvantage is that the data has to be streamed over a network to the reducer node, which could add a good chunk of latency, depending on how fast that is (if you can do some reduction during the map, it would help, but it's possible that Manta spawns one process and virtualized node per object (and indeed, this is likely), meaning this is impossible), and on how many virtual nodes are running on the same physical hardware (but then you're running into the same boundaries you hit on a laptop, just on a much beefier system), as the network latency is near zero if the reducer and the mapper nodes are on the same physical system.

But if you're processing terabytes, the network latency is probably barely factoring into your considerations, given how much time you're saving by processing data in parallel in the first place.


That's pretty similar to my thinking on the performance. Though your point about the combination of shell script streaming and parallelization is a good way to express it.

The real benefit of this system would show up compared to "traditional" (modern?) big data tools like Spark, where the network latency cost of the reduce phases should be comparable. Though since Manta localizes the compute to the data, there should be an overall order of magnitude less network transfer, which should significantly reduce the cost of Manta-based solutions compared to Spark/S3 solutions.

In theory at least. It'd be great to test this on equivalent hardware, or at least equivalently priced hardware. But that would require a nice test data set, which I don't have the resources to set up. Any suggestions on data or code that could test the above assumptions would be handy (ahem, HN peeps, got anything?).

_Edits: grammar_


I've got nothing on data or code. You could try running a comparable S3/EC2 setup against Manta on Joyent, but that would be relatively expensive, and I have no idea of the differences between Amazon's and Joyent's datacenter layouts, so such a test would not be optimal, although it would test each in its most common use case.

It's also worth mentioning, for performance analysis, that Manta is backed by ZFS and Zones, so it has the performance characteristics of those.


Reminds me of one of my all time favorite comments about Bane's Rule:

https://news.ycombinator.com/item?id=8902739


And yet a fourth dimension to the problem: time/difficulty multiplier.

    Going   Multiplier
    -----   ---------
    High    1x
    Wide    4x
    Deep    8x
Make the best decision given your current circumstances and engineering resources.


Bane's "deep" approach saved tens of millions of dollars in one example, months of time in another. In all cases it was many times easier for the team to keep up, day after day, year after year.

Bane advocates good old-fashioned refactoring. With humble tools like Perl, SQL, and an old desktop, he bested new, fancy, expensive products. Bane deserves the salary of a CEO, or at least a vice president, for the good he has brought to his company.

I think ego leads us to make choices that appeal in the short run but are bad in the long run. Which is more impressive sounding: that you bought a distributed network of a thousand of the newest, shiniest machines, running the latest version of DataLaserSharkAttack; or that you cobbled together some Perl, SQL, and shell one-liners on a four-year-old PC?

Also, good old-fashioned hard work is painful. It is a good kind of pain, like working out your body, rather than a bad kind of pain, like accidentally cutting yourself. But it is just good old-fashioned, humble, hard work to sit down, work through the details, and come up with a better plan.

Before that, it is even more humble, hard work, to learn the things that Bane had learned. Not just anybody could have done what he did. First you have to learn the ins and outs of Perl, SQL, and all the little shell commands, and all their little options. He knows about a lot of different programming problems, like what a "semantic graph" is (I can't say I do), what an "adjacency matrix" is (nope), whether something is an O(N^2) problem or an O(k+n^2) problem (I know I've seen that notation before).


thanks for sharing, this is exactly what i've been wondering about


Arguably, if you can keep everything on one box it will almost always be faster (and cheaper!) than any sort of distributed system. That said, scalability is generally more important than speed, because once a task can be distributed you can add performance by adding hardware. As well, depending on your use case you can often get fault tolerance "for free".


No.

Show that you need scalability first. Chances are you don't.

When you do, scale the smallest possible part of your system that is the bottleneck, not the whole thing.


An established, standardised, existing platform is often more maintainable than a custom solution, even if that platform includes a bit more scalability than you actually need.


But straightforward use of cat, grep, xargs and gawk is much less "custom" than depending on specific versions of exotic tools and committing servers for use as part of the current "platform". If I want to run a simple pipeline of UNIX tools on a random system, the worst case prerequisite scenario is reclaiming disk space and installing less common tools (e.g. mawk).
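Concretely, the sort of pipeline at issue is nothing more exotic than this (a sketch along the lines of the article's chess-result counting; the file layout is assumed):

    # count game results across all PGN files, using every core via xargs -P
    find . -name '*.pgn' -print0 \
      | xargs -0 -n 4 -P "$(nproc)" grep -h '^\[Result' \
      | sort | uniq -c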


> But straightforward use of cat, grep, xargs and gawk is much less "custom" than depending on specific versions of exotic tools and committing servers for use as part of the current "platform".

There are often subtle incompatibilities between the versions of those tools found on different systems (e.g. GNU vs BSD). Worse, there's no test suite or way to declare the dependency, and you may not even notice until the job fails halfway through. Whereas on a Spark job built with maven or similar the dependencies are at least explicit and very easy to reproduce.

"Exotic" is relative. At my current job I would expect a greater proportion of my colleagues to know Spark than gawk. Unix is a platform too - and an underspecified one with poor release/version discipline at that.

In an organization where Unix is standard and widely understood and Spark is not, use Unix; where the reverse applies, it's often more maintainable to use Spark, even if you don't need distribution.

But my original point was: if you need a little more scalability than Unix offers then it may well be worth going all the way to Spark (even though it's far more scalability than you likely need) rather than hacking up some custom job-specific solution that scales just as far as you need, just because Spark is standard, documented and all that.


You can look for the GNU utils, you know. And you can do explicit dependency declaration by grepping the version output to see whether it's GNU or BSD, and which version number.

It's not convenient, and it could be done better, but it is by no means impossible.
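Something like this is what I mean: crude, but it's an explicit dependency check you can put at the top of a script (illustrative only):

    # fail fast unless the GNU versions of the tools we rely on are present
    for tool in gawk grep sort; do
        "$tool" --version 2>/dev/null | head -n 1 | grep -q 'GNU' \
            || { echo "need the GNU version of $tool" >&2; exit 1; }
    done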

Also, how do your colleagues not know AWK? In a primarily UNIX world, that's something that everybody should know. Besides, you can learn the basics in about 15 minutes.


> You can look for the GNU utils, you know. And you can do explicit dependency declaration by grepping the version output to see whether it's GNU or BSD, and which version number.

> It's not convenient, and it could be done better, but it is by no means impossible.

In principle it may be possible, but in practice it's vaporware at best. There's no standard, established way to do this, which means there's no way that other people could maintain - the closest is probably autoconf, but I never saw anyone ship an aclocal.m4 with their shell script. Since unix utilities tend to be installed system-wide, it's not really practical to develop or test (not that the unix platform is at all test-friendly to start with) against old target versions (docker may eventually get there, but it's currently immature) - if your development machine has version 2.6 of some utility, you'll probably end up accidentally using 2.6-only features without noticing.

> Also, how do your colleagues not know AWK? In a primarily UNIX world, that's something that everybody should know. Besides, you can learn the basics in about 15 minutes.

My organization isn't primarily-unix, and people don't use awk often enough to learn it. I'm sure one could pick up the basics fairly quickly, but that's true of Spark too.


Okay, I guess.

For utilities, it's usually safe to assume that whoever is running it is running a comparable version - the most commonly used options are decades old at this point, and it's relatively unlikely you'll use the new stuff.

As for checking for GNU tooling, a grep against <tool> -v does the trick. This can also get you the version number. You can probably even write a command to do it.

It's nonstandard and suboptimal, but it is, once again, possible.


If it's theoretically possible but no-one actually does it then I call it vaporware.


That's... definitely not the right use of that term. It'd be vaporware if it were software that's promised and never appears. This works right now; it's just kind of fragile, hacky, and a PITA, so nobody does it.


You lost me when you put "built with Maven" and "easy to reproduce" in the same paragraph.

Thinking of it, having to "build" (and presumably deploy on multiple application servers) a simple data processing job is a sign of enterprisey overcomplication regardless of the quality of the underlying technology.


Unless it's a bootable image (unikernel), your "simple" job has dependencies. Better to have them visible, explicit and resolved as part of a release process rather than invisible, implicit, and manually updated by the sysadmin whenever they feel like it.


My Hadoop experience is dated (circa 2011); do the worker nodes still poll the scheduler to see if they have work to do? If so, that's still a giant impediment to speed for smaller tasks, especially if poll times are in the range of minutes.

If Hadoop put effort into making small tasks time-efficient, I think your argument would have merit, provided there's a reasonable chance of actually needing to scale, or of picking up ancillary benefits (fault tolerance, access to other data that needs to be processed with Hadoop, etc.).


There is nothing preventing distributed systems from being faster than one box for this kind of thing. But they don't always bother to pursue efficiency at that level, because things are very different once you have a lot of boxes, and something that used to look important for a couple of boxes doesn't anymore.


Yes, there is: you have a lot of overhead in any case for the same tools.


You don't have the same tools. You are probably thinking about emulating POSIX filesystem API and things like that and using those command-line tools on top of that in a single-box kind of way. That's not how you treat your distributed system.

EDIT: For something that beats a single box easily I envision an interpreter with JIT running on each node in a distributed system and on the same process that stores data, having pretty much no overhead to access and process it.


>You are probably thinking about emulating POSIX filesystem API and things like that and using those command-line tools on top of that in a single-box kind of way. That's not how you treat your distributed system.

Yeah, but Manta's mapreduce does something close, and it seems to work okay.


Fancy highly-scalable distributed algorithms have that annoying tendency of starting at 10x slower than the most naïve single-machine algorithm.


This seems to be the Manta [0] way. Letting you run your beloved Unix command pipeline on your Object Store files.

[0] https://www.joyent.com/manta (the YouTube videos with Bryan Cantrill are even better at explaining it).


If you like to do data analyses in bash, you might also enjoy bigbash[1]. This tool generates quite performant bash one-liners from SQL SELECT statements that easily crunch GBs of CSV data.

Disclaimer: I am the main author.

[1] http://bigbash.it
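To give a feel for the idea, a simple SELECT with a GROUP BY comes out as an ordinary pipeline, roughly like this (an illustrative input/output pair, not necessarily bigbash's exact output):

    # SELECT result, COUNT(*) FROM games GROUP BY result;
    # assuming games.csv has the result in column 3
    awk -F',' '{count[$3]++} END {for (r in count) print r, count[r]}' games.csv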


That's pretty cool.

Do you think you can get it to support Manta? I think a lot of people in that ecosystem could benefit from it if you could. I'd help, but I don't really know Java all that well :-(.


It's all about picking the right tool for the job. I think shell scripting is a great prototyping tool and often a good place to start. As the problem gets more complex and bigger, eventually it will warrant a full scale development.


I think people overlook the fact that the author made an even stronger point by using shell scripting, which is relatively inefficient compared to using a compiled language. I guess it would hit the I/O cap even without going parallel.


Date on article: Sat 25 January 2014

I am not a Big Data expert, but does that change any of the comments below with reference to large datasets and memory available?

I use J and Jd for fun with great speed on my meager datasets, but others have used it on billion-row queries [1]. Along with q/kdb+, it was faster than Spark/Shark last I checked; however, I see Spark has made some advances recently that I have not checked into.

J is interpreted and can be run from the console, from a Qt interface/IDE, or in a browser with JHS.

[1] http://www.jsoftware.com/jdhelp/overview.html


There isn't exactly a direct relationship between the size of the data set and the amount of memory required to process it. It depends on the specific reporting you are doing.

In the case of this article, the output is 4 numbers:

  games, white, black, draw
Processing 10 items takes the same amount of memory as processing 10 billion items.

If the data set in this case was 50TB instead of a few GB, it would benefit from running the processing pipeline across many machines to increase the IO performance. You could still process everything on a single machine, it would just take longer.

Some other examples of large data sets+reports that don't require a large amount of memory to process:

  * input: petabytes of web log data. output: count by day by country code
  * input: petabytes of web crawl data. output: html tags and their frequencies
  * input: petabytes of network flow data. output: inbound connection attempts by port
Reports that require no grouping (like this chess example) or group things into buckets with a defined size (ports that are in a range of 1-65535) are easy to process on a single machine with simple data structures.

Now, as soon as you start reporting over more dimensions things become harder to process on a single machine, or at least, harder to process using simple data structures.

  * input: petabytes of web crawl data. output: page rank
  * input: petabytes of network flow data. output: count of connection attempts by source address and destination port
I kinda forget what point I was trying to make.. I guess.. Big data != Big report.

I generated a report the other day from a few TB of log data, but the report was basically

  for day in /data/* ; do   # directories are named YYYY-MM-DD
    echo "$day" "$(zcat "$day"/*.gz | fgrep -c interestingValue)"
  done


There are a lot of operational benefits to running on Hadoop/YARN as well. You get node resiliency (host went down? run the application over there). You also get the Hadoop filesystem, which conveniently stores your data in S3 or in distributed HDFS.

These systems were designed by people who probably managed difficult ETL pipelines that were nothing but what the author suggests: simplified shell scripts using UNIX pipes.

Besides, going up against Hadoop MR is easy. I'd like to see you compete against something like Facebook's Presto or Spark, which are optimized for network and memory.


What is the point? Who would want to use Hadoop for something below 10GB? Hadoop is not good at doing what it is not designed for? How useful.


Kind of depends on what the 10 GB is. For example, on my project, we started with files that were about 10 GB a day. The old system took 9 hours to enhance the data (add columns from other sources based on simple joins). So we did it with Hadoop on two Solaris boxes (18 virtual cores between them). Same data; 45 minutes. But wait, there's more.

We then created two fraud models that took that 10+ GB file (enhancement added about 10%) and executed in about 30 minutes apiece, and concurrently. All on Hadoop. All on arguably terrible hardware. Folks at Twitter and Facebook had never thought about using Solaris.

We've continued this pattern. We've switched tooling from Pig to Cascading because Cascading works in the small (your PC, without a cluster) and in the large. It's testable with JUnit in a timely manner (looking at you, PigUnit). Now we have some 70 fraud models chewing over anywhere from that 10+ GB daily file set to 3 TB. All this on our little 50-node cluster. All within about 14 hours. Total processed data is about 50 TB a day.

As pointed out earlier, Hadoop provides an efficient, scalable, easy-to-use platform for distributed application development. Cascading makes life very Unix-like (pipes and filters and aggregators). This, coupled with a fully async eventing pipeline for workflows built on RabbitMQ, makes for an infinitely expandable fraud detection system.

Since all processors communicate only through events and HDFS, we add new fraud models without necessarily dropping the entire system. New models may arrive daily, conform to a set structure, and are literally self-installed from a zip file within about 1 minute.

We used the same event + Hadoop architecture to add claim line edits. These are different from fraud models in that fraud models calculate multiple complex attributes and then apply a heuristic to the claim lines, while edits look at a smaller operational scope. But in Cascading this is just: pipe from HDFS -> filter for interesting claim lines -> buffer for denials -> pipe to HDFS output location.

Simple, scalable, logical, improvable, testable. I've seen all of these. As the community comes out with new tools, we get more options. My team is working on machine learning and graphs. Mahout and Giraph. Hard to do all of this easily with a home grown data processing system.

As always, research your needs. Don't get caught up in the hype of a new trend. Don't be closed minded either.


I agree that scalable infrastructure is needed to manage a production pipeline, as others have explained well.

I found this article a useful reminder, because sometimes a job doesn't require fully grown infrastructure. I commonly get requests that don't overlap with existing infrastructure and won't need any follow-up. In those cases a Hadoop cluster, heck, even loading into a pg db, would be wasted effort.

But I wouldn't want to manage our clickstream analytics pipeline with shell scripts and cron jobs.

Is there any lightweight tooling out there that can schedule/run basic pipeline jobs in a shell environment?


Airflow? It might not be what you consider lightweight, though.


Manta? Definitely not lightweight.


In my experience you can classify people into two herds: those who, when faced with a problem, solve it directly; and those who, faced with the same problem, try to fit it to the tools they want to use. I like to think this is a maturity question, but I can't say I've actually seen someone make the transition from the latter type to the former.


> those who, when faced with a problem, solve it directly

I think you mean "use the tools they already know".

> those who, faced with the same problem, try to fit it to the tools they want to use

I think you mean "use the correct tools for the job".

> I like to think this is a maturity question

It is, but the direction of maturity is from the first case to the second.


This is just plain clickbait.

Obviously if a dataset is small enough to possibly fit in memory, it will be much faster to run on a single computer.


I think that's his point. Companies are chasing "big data" because it's a great buzzword without considering whether it's something they actually need or not.

A well-rounded, hype-resistant developer would look at the same problem and say, "wha? Nah, I'll just write a dozen lines of PowerShell and have the answer for you before you can even figure out AWS' terms of use..."

I don't think the article talks about this specifically, but there's also a tendency to say "big data" when all you need is "statistically significant data". If you're Netflix and you just want to figure out how many users watch westerns for marketing purposes, you don't need to check literally every user, just a large enough sample to get 95% confidence or so. But I've seen a lot of companies use their "big data" tools to get answers to questions like that, even though it takes longer than just sampling the data in the first place.

(Now Netflix recommendations, that's a big data problem because each user on the platform needs individualized recommendations. But a lot of problems aren't. And it takes that well-rounded hype-resistant guy to know which are and which aren't.)
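
To make the sampling idea concrete, something like the sketch below is usually enough for a marketing-style question. The file layout, field numbers, and 1% rate are invented for illustration:

  # sample ~1% of per-user history files instead of scanning the full dataset
  files=$(ls users/*.tsv.gz | awk 'BEGIN { srand() } rand() < 0.01')
  total=$(zcat $files | cut -f1 | sort -u | wc -l)
  western=$(zcat $files | awk -F'\t' '$3 == "western"' | cut -f1 | sort -u | wc -l)
  echo "$western of $total sampled users watched at least one western"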


I guess the author should have called it out more explicitly for some, but I think that's the point.

I've seen the testimony dozens of times on HN, and I've heard it from a friend who manages Hadoop at a bank, and I've seen it with people building scaled ELK stacks for log analysis: People are too eager to scale out when things can be done locally, given moderate datasets.


Though sometimes hadoop makes sense even if local computation is faster. For example you might just be using hadoop for data replication.


> For example you might just be using hadoop for data replication.

Good point. The reason someone who holds data for 7+ years uses hadoop is not because it is faster.

The processing aspect of the system is only tangential to the failure tolerance when you consider the age of the data set.

HDFS does waste a significant amount of IO merely reading through cold data and validating the checksums, so that it is safe against HDD bit-rot (dfs.datanode.scan.period.hours).

The usual argument for failure tolerance is off-site backups, but backups tend to have availability problems (i.e., the machine failed and the Postgres backup restore takes 7 hours).

The system is built for constant failure of hardware, connectivity and in some parts, the software itself (java over C++) - because those are unavoidable concerns for a distributed multi-machine system.

The requirement that it be fast takes a backseat to availability, reliability and scalability - an unreliable but fast system is only useful for a researcher digging through data at random, not for a daily data pipeline where failures cascade up.


The Hadoop article linked is available at https://web.archive.org/web/20140119221101/http://tomhayden3...?


Ok, now do it for >2tb.

Our prod hadoop dataset is now > 130tb, try that!


> Hopefully this has illustrated some points about using and abusing tools like Hadoop for data processing tasks that can better be accomplished on a single machine with simple shell commands and tools. If you have a huge amount of data or really need distributed processing, then tools like Hadoop may be required, but more often than not these days I see Hadoop used where a traditional relational database or other solutions would be far better in terms of performance, cost of implementation, and ongoing maintenance.


You can get 128GB DIMMs these days, so 2tb is easy to fit in memory. 130tb, yes that's a different story.


I am a bioinformatician. 130tb of raw reads or processed data? Are you trying to build a general purpose platform for all *-seq or focusing on something specific (genotyping)?


I think you might be replying to my comment. We just took delivery of a 20K WGS callset that is 30TB gzip compressed (about 240TB uncompressed) and expect something twice as big by the end of the year. We're trying to build something pretty general for variant level data (post calling, no reads), annotation and phenotype data. Currently we focus on QC, rare and common variant association and tools for studying rare disease. Everything is open source, we develop in the open and we're trying hard to make it easy for others to develop methods on our infrastructure. Feel free to email if you'd like to know more.


Note the magic words were "can be faster", not "are faster".

If you'd read the entire article you'd even have picked up that he's explicitly calling out use of hadoop for data that easily fits in memory, not large data sets.


What kind of data do you have? Is this mostly text, or more like compressed time series?


We're analyzing genome sequence data on that scale: https://github.com/hail-is/hail


It's long rows of lots of numbers that have to be crunched with each other in quite straightforward ways.


Some back-of-the-envelope calculations: a 2 Gbit pipe moves about 250 MB/s, so just streaming 130 TB through the article's method would take roughly 130e12 / 250e6 ≈ 520,000 seconds, or about six days.

Out of curiosity, how long does it take you to process 130tb on Hadoop, and where/how is the data stored?


It's about four hours on an on-prem commodity cluster with ~PB raw storage across 22 nodes. Each node has 12 4TB disks (fat twins) and two Xeons with (I think) 8 cores, and 256GB RAM. It's got a dedicated 10GbE network with its own switches.

The processing is a record addition per item (basically there is a new row in a matrix for every item we are examining, and an old row has disappeared) and a recalculation of aggregate statistics based on the old data and the new data; the aggregate numbers are then offloaded to a front-end database for online use. The calculations are not rocket science, so this is not compute-bound in any way.

I think we can do it because of data parallelism: the right data is available to each node, each core, and each disk, so each pipeline just thumps away at ~50 MB/s, and there are about 300 of them, so that's a lot of crunching (300 x ~50 MB/s is roughly 15 GB/s aggregate, which squares with chewing through 130 TB in a few hours). At the moment I can't see why we couldn't scale more, although I believe the max size of the app will be no more than 2x where we are now.


But what would happen if those exact same command-line tools were used inside a Hadoop node? What would be the optimum number of processors then?


That depends on the tradeoff between management/transfer overhead and actually doing work.

It's always the case in the "word count" style examples, and quite often in real life too: getting the data into the process takes more time than actually processing it.

When you need to distribute, you need to distribute. However, the point where "you need to distribute" kicks in is at about 100x more data than where most Hadoop users actually reach for it, and the overhead costs are far from negligible - in fact, they dominate everything until you get to that scale.
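
And when you do genuinely need it, you can ship the very same command-line tools out to the cluster with Hadoop Streaming. A rough sketch; the jar location, HDFS paths, and the pattern are illustrative:

  # run shell tools as the map and reduce steps via Hadoop Streaming
  # (streaming tokenizes and exec's the commands directly, so keep them simple)
  hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /data/pgn \
    -output /results/result-counts \
    -mapper "awk /Result/" \
    -reducer "uniq -c" \
    -numReduceTasks 1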


You would just be adding management overhead.

More software != more efficient software.


But faster because parallel.


Yes, you do not need a 100 node cluster to crunch 1.75GB of data. I can do that on my phone. What's the author's point?!


That Hadoop's reputation for being worth the hassle and complexity that managing it entails is undeserved.


Is there a place to download the database that isn't a self-extracting executable? (Seriously?)


Not just faster to run but also much faster to write


>cat | grep

why
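
grep takes filenames directly, so the extra cat process and pipe buy nothing; roughly (pattern and glob are illustrative):

  # the quoted pipeline spawns a cat process just to feed grep:
  cat *.pgn | grep "Result"
  # grep can read the files itself:
  grep "Result" *.pgn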


(2014)

Here's the previous comments from the submission a couple years ago: https://news.ycombinator.com/item?id=8908462


This article is a great litmus test for checking whether someone has experience working at scale (multi-terabyte data, multiple analysts, multiple job types) or not. Anyone who has had that experience will instantly describe why this article is wrong. It's akin to saying a Tesla is faster than a Boeing 777 on a 100-meter track.


I'd hope people who have worked at scale are still capable of recognizing when the tools they used there are total overkill. I'd suspect they would be, since they'd also be more aware of those tools' limitations (vs somebody without experience, who has to believe the "you need big data and everything is easy" marketing).

That you wouldn't use a Boeing 777 if your problem is just a 100m track is the entire point of the article. It's explicitly not saying that you should never use the big tools.


They are not overkill at all; rather, they are tuned towards a different set of performance characteristics - in the Boeing 777 example above, a transatlantic journey.

In the article, the data and results stay on local disk, but in any organization they need to be stored in a distributed manner, available to multiple users with varying levels of technical expertise - typically in NFS or HDFS, and preferably as records stored/indexed via Hive/Presto. At that point the real issue is how to reduce the delay from transferring data over the network, which is the original idea behind Hadoop/MapReduce: moving computation closer to the data.


rolls eyes

The point is that if you've got such tiny quantities of data, why are you storing it in a distributed manner, and why are you breaking out the 777 for a trip around the racetrack? Grab the 777 when you need it, and take the Tesla when you need the performance characteristics of a Tesla.



