There were also no PGP-signed Hadoop packages last time I looked...
The other often misunderstood problem with Hadoop/Spark/... is that the security model is basically the same as NFS.
If you don't use Kerberos, any user with access to the Hadoop cluster has at least full read access, and likely even write access if he has superuser rights* on the client machine. => http://www.openwall.com/lists/oss-security/2015/04/16/21
If you have important data in HDFS, you did not put everything behind a thick firewall, and you are not using Kerberos, you have a problem. You'll likely still have a problem...
* This should also work without superuser rights; Hadoop just takes the username from the client.
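A minimal sketch of what that looks like with the plain Hadoop client API (assuming a cluster left on the default "simple" authentication; the NameNode address and the choice of "hdfs" as the superuser name are placeholders):

    import java.security.PrivilegedExceptionAction;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.security.UserGroupInformation;

    // With "simple" auth (no Kerberos), the NameNode trusts whatever username
    // the client presents. Nothing stops a client from claiming to be the
    // HDFS superuser; the HADOOP_USER_NAME env var does the same thing.
    public class WhoamiDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // placeholder host

            UserGroupInformation fakeSuperuser =
                    UserGroupInformation.createRemoteUser("hdfs"); // no password, no ticket

            fakeSuperuser.doAs((PrivilegedExceptionAction<Void>) () -> {
                FileSystem fs = FileSystem.get(conf);
                for (FileStatus status : fs.listStatus(new Path("/"))) {
                    System.out.println(status.getPath() + " " + status.getPermission());
                }
                return null;
            });
        }
    }

That is the whole "authentication" story without Kerberos: the cluster believes whatever name the client sends.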
I'm not sure the state of Hadoop is a surprise for anyone who works with the project. I have never heard good things about running Hadoop ops (ZooKeeper, which is even more widely deployed, manages to have a bad reputation as well) - and these are tools that I'd guess skew far away from the "20-somethings on MacBooks writing JS" crowd.
The article lays out its points clearly, but I'm unsure whether "don't use Hadoop" is a viable conclusion - and I'm sure the Hadoop community knows about its faults better than anyone.
It would appear that nobody can build Hadoop packages from source. Otherwise, one of these large Hadoop consultancies would provide packages for Ubuntu 14.04, a long-term support release which will be supported for years longer than 12.04.
Too many companies do 'big data' just because it's the current hype.
In most cases Hadoop is total overkill -- developers don't understand how to reduce a problem, may never have heard of sampling or good data structures, don't know their problem, etc.
For sure, there are valid use cases for MapReduce/Hadoop, etc., but in many cases it's a big waste of money.
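To make the sampling point concrete, here is a rough sketch (nothing Hadoop-specific; the input file is illustrative): reservoir sampling takes a fixed-size uniform sample from an arbitrarily large file in one pass on one machine, which is often enough to answer the question before anyone provisions a cluster.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    // Algorithm R: keep a uniform random sample of k lines from a stream of
    // unknown length, using O(k) memory and a single pass.
    public class ReservoirSample {
        static List<String> sample(BufferedReader reader, int k) throws IOException {
            List<String> reservoir = new ArrayList<>(k);
            Random rng = new Random();
            String line;
            long seen = 0;
            while ((line = reader.readLine()) != null) {
                seen++;
                if (reservoir.size() < k) {
                    reservoir.add(line);
                } else {
                    long j = (long) (rng.nextDouble() * seen); // uniform in [0, seen)
                    if (j < k) {
                        reservoir.set((int) j, line); // keep new line with probability k/seen
                    }
                }
            }
            return reservoir;
        }

        public static void main(String[] args) throws IOException {
            // Illustrative input: any large log or CSV works the same way.
            try (BufferedReader reader = Files.newBufferedReader(Paths.get(args[0]))) {
                sample(reader, 100_000).forEach(System.out::println);
            }
        }
    }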
Do you work in big data? I do, and I have never actually seen a company do "big data" without needing to. At enterprise level it is often a pretty sizeable investment, e.g. buying dedicated appliances and hiring expensive teams, so generally quite a number of architects and engineers are involved in the decision.
And I would disagree that Hadoop is a big waste of money. There isn't really anything else that comes close to its cost-effectiveness once you reach a certain data size.
I don't mean to be negative, but how is this news? If you are doing real big data, you keep your cluster well secured and firewalled away from your other networks, let alone the internet.
You should also probably have a model cluster to allow you to experiment with upgrades.
Ding ding - the part about my cluster being able to phone home made me both laugh and question the author's intent. Did the author really mean to target people like me, pulling Hadoop & Spark toolchain source, building them, and running them; or, was it an opportunity to hop on a soap box? Some decent points were made, but the tone of the article is really off-putting.
There's a difference between customers and vendors here. It would be a waste of resources for a customer to upgrade to every new Ubuntu release. We're at the stage in 12.04's lifecycle at which customers should ideally have their plans to upgrade from 12.04 to the new 14.04 LTS solidified. If you use one of the packages the author mentioned, you can't upgrade because they don't yet offer support for the new LTS.
What a terrible opening to the article. It's just a load of unsupportable rants against things the author doesn't personally like. He even tries to disclaim that in his prologue but provides nothing beyond the assertions.
There may be more substance later, but once I hit the bullshit about "iFanboys" I decided not to bother. At best I'll just shake my head wondering why people think their personal anger is a compelling argument.
It's a bit incoherent that way. "Macs are bad, because they change every six months" (what?) and also "ooh, using this version of Ubuntu is bad somehow, because it's three years into its five years of support."
They might have a point about using packages instead of "curl | sh", but I'm not sure I take them totally seriously.
Yeah I hate articles like this where good points get all tied up with boogey-man stereotyping of some group of people the author has some personal beef with. There are clearly problems with how these things are built and deployed, but all the iFanboys stuff is a distracting non-sequitur.
> Make sure that everybody can build everything from scratch, without having to rely on Maven or Ivy or SBT downloading something automagically in the background.
> Sign. Code needs to be signed, end-of-story.
What we need is a repository that requires sources and a build script, and requires that the artifact can be built repeatably from them.
I want to be able to build my entire project hierarchy from source.
Sonatype prefers that you provide sources, but they don't require you to do so; also, there doesn't seem to be any way of actually verifying that the source matches the binaries.
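Short of a repository that enforces this, about the best you can do today is rebuild the artifact from the tagged source yourself and compare hashes: if the build is reproducible they match, and if it isn't you still have no proof the source corresponds to the binary. A minimal sketch (file paths are placeholders):

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.security.MessageDigest;

    // Compare the SHA-256 of the jar pulled from the repository with the
    // SHA-256 of a jar rebuilt locally from the corresponding source tag.
    public class VerifyArtifact {
        static String sha256(Path file) throws Exception {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(Files.readAllBytes(file));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        }

        public static void main(String[] args) throws Exception {
            Path downloaded = Paths.get("downloaded/library-1.0.jar"); // placeholder
            Path rebuilt = Paths.get("rebuilt/library-1.0.jar");       // placeholder

            String a = sha256(downloaded);
            String b = sha256(rebuilt);
            System.out.println("downloaded: " + a);
            System.out.println("rebuilt:    " + b);
            System.out.println(a.equals(b) ? "MATCH" : "MISMATCH");
        }
    }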
I can't say I've looked at Hadoop, but this fear of jars seems... overkill? You can decompile jars.
Look at any Java project which leverages third-party libraries. It's not uncommon to have dozens of libraries sourced from the community. They could all have malicious code buried in them. Not sure I understand the Hadoop hate?
To me it seems his complaint is not that Hadoop pulls in binaries. The complaint is that it pulls in parts of unknown origin.
He wants to be able to trace the components backwards, and know the names of the people responsible for building them, ideally reputable people.