Your big data toolchain is a big security risk (vitavonni.de)
71 points by BCM43 on April 26, 2015 | hide | past | favorite | 29 comments



There are also no PGP-signed Hadoop packages last time I looked...

The other often misunderstood problem with Hadoop/Spark/... is that the security model is basically the same as NFS.

If you don't use Kerberos, any user with access to the Hadoop cluster has at least full read access, and likely even write access if he has superuser rights* on the client machine. => http://www.openwall.com/lists/oss-security/2015/04/16/21

If you have important data on your HDFS and you did not put everything behind a thick firewall and you are not using Kerberos you have a problem. You'll likely still have a problem...

* This should also work without superuser rights; Hadoop just takes the username from the client.
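To illustrate how thin "simple" authentication is: WebHDFS, the REST interface to HDFS, takes the caller's identity from a query-string parameter when Kerberos is off. A minimal sketch (the namenode hostname below is a hypothetical placeholder):

```python
# Sketch: under Hadoop's "simple" authentication, WebHDFS accepts the
# caller's identity from the user.name query parameter -- no credential
# is ever checked.
def webhdfs_url(namenode, path, user, op="LISTSTATUS"):
    """Build a WebHDFS request URL that claims the given user identity."""
    return (f"http://{namenode}:50070/webhdfs/v1{path}"
            f"?op={op}&user.name={user}")

# Any client on the network can simply claim to be the HDFS superuser:
url = webhdfs_url("namenode.example.com", "/user/hdfs", "hdfs")
```

The same applies to the native client, where the username is read from the client-side environment.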


I'm not sure the state of Hadoop is a surprise to anyone who works with the project. I have never heard good things about running Hadoop ops (Zookeeper, which is even more widely deployed, manages to have a bad reputation as well), and these are tools whose users I'd guess skew far away from the "20-somethings on MacBooks writing JS" crowd.

The article lays out its points clearly, but I'm unsure whether "don't use Hadoop" is a viable conclusion, and I'm sure the Hadoop community knows its faults better than anyone.


> For example, they only support Ubuntu 12.04 - a three year old Ubuntu is the latest version they support...

A "three year old" long term support release. Isn't that what you wanted?


It would appear that nobody can build Hadoop packages from source. Otherwise, one of these large Hadoop consultancies would provide packages for Ubuntu 14.04, a long-term support release which will be supported years longer than 14.04.


(too late to edit, but obviously "years longer than 12.04")


Too many companies do 'big data' because it's the current hype.

In most cases Hadoop is total overkill: developers don't understand how to reduce a problem, may never have heard of sampling or good data structures, don't know their problem domain, etc.

For sure, there are valid use cases for MapReduce/Hadoop, etc., but in many cases it's a big waste of money.


Do you work in big data? I do, and I have never actually seen a company do "big data" without needing to. At the enterprise level it is often a pretty sizeable investment, e.g. buying dedicated appliances and hiring expensive teams, so generally quite a number of architects and engineers are involved in the decision.

And I would disagree that Hadoop is a big waste of money. There isn't really anything else that comes close to its cost effectiveness once you reach a certain data size.


Exactly.

"We don't do big data, we do little data" -- said no dev team ever, since Big Data has appeared.

Loading a 100MB Excel file -- big data. Dumping a few GB to SQLite -- big data. And programmers are now of course "Data Scientists".

Same with services. When microservices became cool, all the other services have disappeared and everyone is doing microservices.


I don't mean to be negative, but how is this news? If you are doing real big data, you keep your cluster well secured and firewalled away from your other networks, let alone the internet.

You should also probably have a model cluster to let you experiment with upgrades.


Ding ding - the part about my cluster being able to phone home made me both laugh and question the author's intent. Did the author really mean to target people like me, pulling Hadoop & Spark toolchain source, building them, and running them; or, was it an opportunity to hop on a soap box? Some decent points were made, but the tone of the article is really off-putting.


> If you are doing real big data, you keep your cluster well secured and firewalled away from your other networks, let alone the internet.

Exactly.


Did anyone else notice that they both complain about "iFanboys" upgrading every six months and Ubuntu 12.04, an LTS release, being three years old?


Yup, it made the piece seem either a bit incoherent or just "ugh, everything is wrong!"


There's a difference between customers and vendors here. It would be a waste of resources for a customer to upgrade to every new Ubuntu release. We're at the stage in 12.04's lifecycle at which customers should ideally have their plans to upgrade from 12.04 to the new 14.04 LTS solidified. If you use one of the packages the author mentioned, you can't upgrade because they don't yet offer support for the new LTS.


The problem I imagine OP sees with using a three-year-old Ubuntu LTS release is that you will only get security updates for another two years.


Also, it's not the most recent LTS - you might already have an environment standardised on 14.04.


I like this guy. Somehow he delivers a message that is difficult to send (mainly because it is negative) but true.


What a terrible opening to the article. It's just a load of unsupportable rants against things the author doesn't personally like. He even tries to disclaim that in his prologue, but provides nothing beyond the assertions.

There may be more substance later, but once I hit the bullshit about "iFanboys" I decided not to bother. At best I'll just shake my head, wondering why people think their personal anger is a compelling argument.


This post makes a number of very good points.

However, calling out iFanboys for particular scorn detracts a bit from the article's value.


It's a bit incoherent that way. "Macs are bad, because they change every six months" (what?) and also "ooh, using this version of Ubuntu is bad somehow, because it's three years into its five years of support."

They might have a point about using packages instead of "curl | sh", but I'm not sure I take them totally seriously.

This might start conversation, at least?


And the author seems to be a student; I'd take that criticism more seriously if it came from someone with experience of big data.

BTW, I used MR back in the '80s on what was then the largest Prime computer cluster, for BT.


I tend to agree. None of these criticisms are from real-world experience. I'm not convinced the author can grow a neck beard yet :P


Yeah I hate articles like this where good points get all tied up with boogey-man stereotyping of some group of people the author has some personal beef with. There are clearly problems with how these things are built and deployed, but all the iFanboys stuff is a distracting non-sequitur.


> Make sure that everybody can build everything from scratch, without having to rely on Maven or Ivy or SBT downloading something automagically in the background.

> Sign. Code needs to be signed, end-of-story.

Use Sonatype's repo then? http://central.sonatype.org/pages/requirements.html#sign-fil...
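For what it's worth, Central publishes checksum sidecar files next to each artifact, and checking one is trivial. A minimal sketch (file names are hypothetical):

```python
import hashlib

def sha1_matches(jar_path, sha1_path):
    """Compare a jar's SHA-1 digest against its published .sha1 sidecar file."""
    with open(jar_path, "rb") as f:
        digest = hashlib.sha1(f.read()).hexdigest()
    with open(sha1_path) as f:
        # Sidecar files sometimes contain "digest  filename"; keep the digest.
        expected = f.read().split()[0].strip().lower()
    return digest == expected
```

Note the limitation: a checksum only proves the download wasn't corrupted. Provenance comes from the detached .asc PGP signature, which ties the artifact to a signing key.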


What we need is a repository that requires sources and a build script, and where the artifact can be repeatably built.

I want to be able to build my entire project hierarchy from source.

Sonatype prefers that you provide sources, but they don't require you to do so; also, there doesn't seem to be any way of actually verifying that the source matches the binaries.
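Absent that, the closest check today is rebuilding from source yourself and comparing the result against the published binary. Comparing whole jars byte-for-byte usually fails because archives embed entry timestamps, so a rough sketch would compare the class files inside instead (jar names here are hypothetical, and this still assumes the compiler output itself is deterministic):

```python
import hashlib
import zipfile

def class_file_digests(jar):
    """Map each .class entry in a jar (path or file-like object) to the
    SHA-256 of its bytes, ignoring archive metadata such as timestamps."""
    with zipfile.ZipFile(jar) as z:
        return {name: hashlib.sha256(z.read(name)).hexdigest()
                for name in z.namelist() if name.endswith(".class")}

# A rebuilt jar "matches" the published one if every class file is identical:
# class_file_digests("rebuilt.jar") == class_file_digests("published.jar")
```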


I can't say I've looked at Hadoop, but this fear of jars seems... overkill? You can decompile jars. Look at any Java project which leverages third-party libraries: it's not uncommon to have dozens of libraries sourced from the community. They could all have malicious code buried in them. Not sure I understand the Hadoop hate?


To me it seems his complaint is not that Hadoop pulls in binaries. The complaint is that it pulls in parts of unknown origin. He wants to be able to trace the components backwards and know the names of the people responsible for building them, ideally reputable people.


Many libraries can at least be recompiled from scratch. From what I hear, hadoop is much worse off.


This is a big eye-opener.



