
The HDFS Juggernaut - josephscott
https://blog.shodan.io/the-hdfs-juggernaut/
======
pweissbrod
Hadoop (and hence HDFS) is a stack of services designed to work together to
serve a file system and manage jobs. The Hadoop stack has pluggable
authentication/authorization by design. And yes, the default is "no security".

Given its distributed nature, HDFS runs on multiple machines. On Linux,
securing a distributed service fits well with Kerberos. Normally, if you want
a "secure" HDFS you must "kerberize" the services such that any Hadoop
operation requires a valid/authorized TGT.
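
Concretely, here's a rough sketch of what a kerberized cluster demands from a
client, using Python's requests with the requests-kerberos add-on against
WebHDFS (the hostname is made up, and you'd need a TGT from kinit first):

    import requests
    from requests_kerberos import HTTPKerberosAuth, OPTIONAL

    # Hypothetical namenode; 50070 is the default Hadoop 2.x WebHDFS port.
    url = "http://namenode.example.com:50070/webhdfs/v1/user?op=LISTSTATUS"

    # On a kerberized cluster this only succeeds if `kinit` has already
    # obtained a valid TGT; without one, the namenode answers 401.
    resp = requests.get(url, auth=HTTPKerberosAuth(mutual_authentication=OPTIONAL))
    print(resp.status_code)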

To most people, kerberizing a Hadoop cluster is a major barrier to getting
Hadoop running. I don't see this changing, but certain vendor Hadoop distros
break down some of the barriers.

Sometimes it is OK to run a cluster insecure. Please don't do it if you're
handling my financial or medical records, though. As Mr. T once said, "don't
write checks that yo ass can't cash".

~~~
jpk
> Sometimes it is OK if you run a cluster insecure.

Sure, but put it behind a VPN or _something_. This article is literally about
clusters accessible via the public Internet. If there's a legitimate, non-
trivial use case in which that's ever okay, I'm curious to hear it.

~~~
pweissbrod
I don't think this article is about clusters accessible via the public
Internet. I think it's about clusters accessible after breaking through an
easily defeated ssh barrier and gaining full access.

~~~
jpk
Linked from the article:

[https://www.shodan.io/search?query=NODATA4U_SECUREYOURSHIT](https://www.shodan.io/search?query=NODATA4U_SECUREYOURSHIT)

That's a list of Hadoop nodes you can access via the public Internet.
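
And "access" means unauthenticated reads: when security is off, the namenode
answers plain WebHDFS REST calls with no credentials at all. A quick Python
sketch against a hypothetical unsecured namenode (50070 is the default 2.x
WebHDFS port):

    import requests

    namenode = "http://namenode.example.com:50070"  # made-up host

    # List the filesystem root; no credentials required when security is off.
    resp = requests.get(namenode + "/webhdfs/v1/?op=LISTSTATUS")
    for entry in resp.json()["FileStatuses"]["FileStatus"]:
        print(entry["type"], entry["pathSuffix"])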

------
sbarre
Why do all these products have "insecure by default" configurations, anyways?

Didn't we learn anything from register_globals?

~~~
jandrese
The alternative is "asks you lots of questions at install, and/or makes you
generate keys or something like that", which makes people say it's too
complicated to set up.

Most people want the simple, insecure one for their lab, to see if it will
work for them before they deploy the locked-down version on the Internet. Of
course, securing the production one is exactly the part people forget.

This isn't crazy. If someone asks me to do some benchmarks on an app to see if
it would fit our problem, the last thing I want is to spend a week working up
a federation of keyservers and a fake domain and everything else that the
deployed product will require. If the product doesn't fit our needs, that's
just wasted time.

~~~
brianwawok
Or there's a middle ground:

"Allow easy access with no password or username, but only from localhost"

Not foolproof, but as far as I know both this and Mongo did the "bind to all
interfaces AND allow access with no password" thing. It was the combo that
really killed it.
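
The difference is literally one line in whatever does the listening. A toy
sketch:

    import socket

    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Loopback only: reachable from the same machine, password or not.
    srv.bind(("127.0.0.1", 8020))
    # The dangerous default: srv.bind(("0.0.0.0", 8020)) listens on every
    # interface, which is what turns "no password" into an incident.
    srv.listen()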

~~~
makmanalp
It's easier than that: just offer two setup options that generate different
config files, "development (insecure)" and "production". The first is quick,
easy, and relatively unsafe; the second is fully secured based on whatever
best practices apply. "People will find and set all the configuration options
correctly" is what got us into this mess.
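
Something like this sketch (the option names are made up, not real Hadoop
keys):

    # The installer writes a config for the chosen profile, so the insecure
    # knobs only ever exist in the development profile.
    PROFILES = {
        "development": {"authentication": "simple", "bind_host": "127.0.0.1"},
        "production": {"authentication": "kerberos", "authorization": "true",
                       "bind_host": "0.0.0.0", "tls": "true"},
    }

    def write_config(profile, path="site.conf"):
        with open(path, "w") as f:
            for key, value in sorted(PROFILES[profile].items()):
                f.write("%s = %s\n" % (key, value))

    write_config("production")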

~~~
brazzledazzle
I believe this is how Hashicorp's Vault does it.

------
iamjochem
Even if node-to-node communication in a cluster (Hadoop or otherwise) itself
is not secured, is it not reasonable to secure external access to the cluster
itself (i.e. with a firewall)?

From an outsider's perspective (I've never used/run Hadoop) I cannot see much
reason for exposing the cluster to the outside world: either a web app acts
as an intermediary, or access can be provided via VPN/ssh-tunnel/etc.

... just curious why a fully/publicly exposed cluster would be a
"requirement"? Or does it come down to the fact that firewalling an AWS
environment is as painful (if not more so) than "kerberizing" a [Hadoop]
cluster? (I kind of assumed AWS has firewalling functionality that is fairly
plug'n'play ... a quick search does seem to back that up, though.)

~~~
joefkelley
I used to work at a big-data consulting company and dealt with Hadoop
clusters at a bunch of different companies. What you described was absolutely
the norm: the entire cluster closed to the outside world except for one
gateway machine that allows ssh access, with everything within the cluster
totally open. Sometimes some web services were open to the company VPN.

Kerberizing is a pain but not usually needed. You're correct that AWS
firewall rules are very easy.
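
For example, the usual pattern is a security group that admits ssh to the
gateway and nothing else. A boto3 sketch with made-up names and CIDRs:

    import boto3

    ec2 = boto3.client("ec2")
    sg = ec2.create_security_group(
        GroupName="hadoop-gateway",
        Description="ssh-only access to the cluster gateway",
        VpcId="vpc-0123456789abcdef0",  # hypothetical VPC
    )
    # Open port 22 to the office/VPN range; everything else stays closed.
    ec2.authorize_security_group_ingress(
        GroupId=sg["GroupId"],
        IpPermissions=[{
            "IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
            "IpRanges": [{"CidrIp": "203.0.113.0/24"}],
        }],
    )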

What you're seeing in this article is the exception: people doing it totally
wrong.

------
jarym
I knew it was a bad idea to post 'getting started' tutorials that skipped all
the security steps and replaced them with a 'you probably don't wanna do it
this way in production' (and usually no documentation on how one should do
it)...

Not levelling this comment at HDFS solely, but it's about time people stopped
with the 'hello world' style examples.

~~~
dspillett
_> it's about time people stopped with the 'hello world' style examples_

Unfortunately, without the initial "this is the core, look how simple it can
be" example, a lot of people will just turn away.

 _> (and usually no documentation on how one should do it)_

That is the issue. Good documentation is often the issue for many other
reasons too. For commercial projects, time is money and so often not
available in sufficient quantity (at least for externally facing
documentation, which may be held to standards that internal materials are
not, and so needs extra resources for review and rework). For unfunded work
(for instance, things that start out as personal projects and/or quick proofs
of concept, or many community-driven works), documentation often gets left
behind because it isn't the interesting/sexy part.

------
Danihan
This was back in May; I wonder how it has changed, and if anyone has parsed
some of this data...

~~~
jlgaddis
A blog comment from two months ago said it was up to 6 PB at that time.

