

Your big data toolchain is a big security risk - BCM43
http://www.vitavonni.de/blog/201504/2015042601-big-data-toolchains-are-a-security-risk.html

======
nisa
There are also no PGP-signed Hadoop packages last time I looked...

The other often misunderstood problem with Hadoop/Spark/... is that the
security model is basically the same as NFS.

If you don't use Kerberos, any user with access to the Hadoop cluster has at
least full read access, and likely even write access if he has superuser
rights* on the client machine. =>
[http://www.openwall.com/lists/oss-security/2015/04/16/21](http://www.openwall.com/lists/oss-security/2015/04/16/21)

If you have important data on your HDFS and you did not put everything behind
a thick firewall and you are not using Kerberos you have a problem. You'll
likely still have a problem...

* This should also work without superuser rights, Hadoop just takes the username from the client.
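
To make the footnote concrete, here is a toy sketch of what "takes the username from the client" means. It is not Hadoop code: the path, the owner table, and the `hdfs` superuser name are assumptions made up for illustration, and the only point is that the server-side check trusts a string the client chose.

```python
# Toy model only: this is NOT Hadoop code. It mimics a service that, like a
# cluster without Kerberos, believes whatever username the client sends.

FILE_OWNERS = {"/data/payroll.csv": "alice"}  # hypothetical path and owner
SUPERUSER = "hdfs"  # assumed superuser name, for illustration only


def can_read(claimed_user: str, path: str) -> bool:
    """Authorize purely on the client-supplied name; nothing verifies it."""
    return claimed_user == SUPERUSER or FILE_OWNERS.get(path) == claimed_user


def client_request(path: str, claimed_user: str) -> str:
    """The client picks its own identity; the 'server' never checks it."""
    if can_read(claimed_user, path):
        return f"contents of {path}"
    return "permission denied"
```

An honest client sending its real name is refused, but the same client simply claiming to be the superuser gets the file. That is the NFS-style trust model described above.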

------
nemothekid
I'm not sure the state of Hadoop is a surprise for anyone who works with the
project. I have _never_ heard good things about running Hadoop ops (Zookeeper,
which is even more widely deployed, manages to have a bad reputation as well) -
and these are tools whose users, I'd guess, skew far away from the
"20-somethings on macbooks writing JS" crowd.

The article lays out its points clearly, but I'm unsure if "don't use hadoop" is a
viable conclusion - and I'm sure the Hadoop community knows about its faults
better than anyone.

------
veeti
> For example, they only support Ubuntu 12.04 - a three year old Ubuntu is the
> latest version they support...

A "three year old" _long term support_ release. Isn't that what you wanted?

~~~
ploxiln
It would appear that nobody can build hadoop packages from source. Otherwise,
one of these large hadoop consultancies would provide packages for ubuntu
14.04, a long term support release which will be supported years longer than
14.04.

~~~
ploxiln
(too late to edit, but obviously "years longer than 12.04")

------
lessthunk
Too many companies do 'big data', as it's currently a hype.

In most cases hadoop is total overkill -- developers don't understand how to
reduce problems, might have never heard of sampling, good data-structures,
know your problem, etc.

For sure, there are valid use scenarios for map reduce/hadoop, etc. but in
many cases it's a big waste of money.

~~~
threeseed
Do you work in big data? I do and have never actually seen a company do "big
data" without needing to. At enterprise level it is often a pretty sizeable
investment e.g. buying dedicated appliances and hiring expensive teams and so
generally quite a number of architects and engineers are involved in the
decision.

And I would disagree that Hadoop is a big waste of money. There isn't anything
else really that comes close to its cost effectiveness once you reach a
certain data size.

------
walshemj
I don't mean to be negative, but how is this news? If you are doing real big
data, you keep your cluster well secured and firewalled away from your other
networks, let alone the internet.

You should also probably have a model cluster to allow you to experiment with
upgrades.

~~~
MoOmer
Ding ding - the part about my cluster being able to phone home made me both
laugh and question the author's intent. Did the author really mean to target
people like me, pulling Hadoop & Spark toolchain source, building them, and
running them; or, was it an opportunity to hop on a soap box? Some decent
points were made, but the tone of the article is really off-putting.

------
icehawk
Did anyone else notice that they both complain about "iFanboys" upgrading
every six months and Ubuntu 12.04, an LTS release, being three years old?

~~~
lsc
The problem I imagine OP sees with using a three-year-old ubuntu LTS release
is that you will only get security updates for another two years.

~~~
NeutronBoy
Also, it's not the most recent LTS - you might already have an environment
standardised on 14.04.

------
zobzu
I like this guy. Somehow he delivers a message that is difficult to send -
mainly because it is negative - but true.

------
oldmanjay
what a terrible opening to the article. it's just a load of unsupportable
rants against things the author doesn't personally like. he even tries to
disclaim that in his prologue but provides nothing beyond the assertions.

there may be more substance later but once I hit the bullshit about "iFanboys"
I decided not to bother. At best I'll just shake my head wondering why people
think their personal anger is a compelling argument.

------
wglb
This post makes a number of very good points.

However, calling out iFanboys for particular scorn detracts a bit from the
article's value.

~~~
pronoiac
It's a bit incoherent that way. "Macs are bad, because they change every six
months" (what?) and also "ooh, using _this_ version of Ubuntu is bad somehow,
because it's three years into its five years of support."

They might have a point about using packages instead of "curl | sh", but I'm
not sure I take them totally seriously.

This might start conversation, at least?

~~~
walshemj
and the author seems to be a student - I'd take that criticism more seriously
if it was from someone with experience of big data.

BTW I used MR back in the 80's on the then largest prime computer cluster for
BT

~~~
retr0h
I tend to agree. None of these criticisms are from real-world experience. I'm
not convinced the author can grow a neck beard yet :P

------
EdwardDiego
> Make sure that everybody can build everything from scratch, without having
> to rely on Maven or Ivy or SBT downloading something automagically in the
> background.

> Sign. Code needs to be signed, end-of-story.

Use Sonatype's repo then?
[http://central.sonatype.org/pages/requirements.html#sign-fil...](http://central.sonatype.org/pages/requirements.html#sign-files-with-gpgpgp)

~~~
WatchDog
What we need is a repository that requires sources and a build script and that
the artifact can be repeatably built.

I want to be able to build my entire project hierarchy from source.

Sonatype prefer that you provide sources, but they don't require you to do so.
Also, there doesn't seem to be any way of actually verifying that the source
matches the binaries.
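
As a rough sketch of the kind of check I mean (not a feature of any existing repository): rebuild the artifact from source yourself, then compare its digest with the published binary. The function names here are made up for illustration, and the comparison only works if the build is actually repeatable, since jars normally embed timestamps that differ from build to build.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_local_build(published: Path, rebuilt: Path) -> bool:
    """True only if the published artifact is byte-identical to our rebuild."""
    return sha256_of(published) == sha256_of(rebuilt)
```

If `matches_local_build` returns False, you can't tell from the hash alone whether the binary was tampered with or the build just isn't deterministic, which is exactly why the build script has to be part of what the repository requires.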

------
spydum
I can't say I've looked at hadoop, but this fear of jars seems... overkill? You
can decompile jars. Look at any Java project which leverages 3rd party
libraries.. It's not uncommon to have dozens of libraries sourced from the
community. They could all have malicious code buried in them. Not sure I
understand why the hadoop hate?

~~~
im3w1l
To me it seems his complaint is not that hadoop pulls in binaries. The
complaint is that it pulls in parts with unknown origin. He wants to be able
to trace the components backwards, and know the names of the people
responsible for building them, ideally reputable people.

------
xai3luGi
This is a big eye-opener.

