
How to Build an Optimal Hadoop Cluster - komljen
http://www.atlantbh.com/how-to-build-optimal-hadoop-cluster/
======
lflux
"What is DISADVANTAGE about Rack Awareness at this point is the manual work
required to define it the first time, continually update it, and keep the
information accurate. If the rack switch could auto-magically provide the Name
Node with the list of Data Nodes it has, that would be great."

This is where LLDP comes in handy. Run an LLDP agent on each node and enable
LLDP on the switch access ports. Then it's just a matter of the NameNode
fetching LLDP neighbor information from the switch (usually by using SNMP) and
updating its Rack Awareness.
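Whatever collects the LLDP/SNMP data, it only has to feed Hadoop's rack awareness mechanism, which is driven by a topology script: Hadoop calls the script named in `topology.script.file.name` (`net.topology.script.file.name` on newer versions) with one or more hostnames/IPs and reads one rack path per argument from stdout. A minimal static sketch (the subnets and rack labels here are made up):

```shell
#!/bin/sh
# Sketch of a Hadoop rack-awareness topology script; the subnets and
# rack labels are assumptions. An LLDP-driven version would regenerate
# the case table below from switch neighbor data instead.
resolve_rack() {
  case "$1" in
    10.1.1.*) echo "/dc1/rack1" ;;
    10.1.2.*) echo "/dc1/rack2" ;;
    *)        echo "/default-rack" ;;   # unknown hosts fall back here
  esac
}

# Hadoop may pass several nodes in one invocation
for node in "$@"; do
  resolve_rack "$node"
done
```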

(Disclaimer: I know nothing about Hadoop...)

------
oellegaard
I recently tried to set up a Hadoop cluster, just for the fun of it - but I
found it slightly painful. Does anyone know if there are good Puppet modules
available for configuring everything? I wasn't able to find any.

~~~
photorized
Both Cloudera (CDH) and Hortonworks (HDP) distros make it easy to configure
small clusters.

You just need to make sure all nodes have their clocks properly set and
synchronized, hostnames and DNS names are set for each machine, SELinux is
turned off, and firewall settings (on each node) allow the machines to talk to
each other.
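Those prerequisites can be sketched as a per-node preflight check (command names assume a RHEL/CentOS-style node; adapt to your distro):

```shell
#!/bin/sh
# Hedged preflight sketch for the prerequisites above; treat it as a
# checklist to review, not a turnkey script.

echo "FQDN:     $(hostname -f 2>/dev/null || hostname)"            # must match DNS
echo "SELinux:  $(getenforce 2>/dev/null || echo 'not installed')" # want Disabled
echo "NTP sync: $(ntpq -pn 2>/dev/null | wc -l) peer lines"        # want more than 0

# Firewall: either open Hadoop's ports between all nodes, or (while
# testing only, as root) disable it entirely:
#   service iptables stop && chkconfig iptables off
```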

Also, Dell offers a downloadable Crowbar ISO for Hadoop - and that's Puppet-
based, if you are familiar with Puppet.

------
mantunovic
Agreed regarding cloud awareness, but in my case it was an in-house cluster
where the number of nodes does not change frequently. When you add new nodes
you have to specify them in the configuration anyway, so rack awareness can be
updated during that same maintenance. Regarding which distribution to use -
there are a lot of them now, such as MapR, Hortonworks, Apache and Cloudera.
All of them have their advantages and disadvantages, but if you take support,
as we have with Cloudera, then even special requests to incorporate fixes from
the Apache side are possible. As for the OS, it doesn't really matter, but I
would prefer CentOS because it is almost identical to Red Hat (you don't need
anything from Red Hat that CentOS cannot give you for a Hadoop cluster).
CentOS is also supported across all of these vendors, since builds are
usually made for Red Hat and then work perfectly on CentOS. There are also a
lot of things that can improve Hadoop, such as:

- Turn off swapping
- ulimit has to be at least 64K
- It is better to have more disks, both for resilience when one fails and for
  disk performance; with too few disks, reading data becomes the bottleneck
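The swap and ulimit items can be sketched as config fragments; the values follow the advice above, while the file names and target paths are the conventional RHEL/CentOS ones, so confirm them for your distro:

```shell
#!/bin/sh
# Generate the tuning fragments locally so they can be reviewed before
# installing as root. swappiness 0 and a 64K file limit match the
# common Hadoop recommendations discussed in this thread.

cat > 90-hadoop.sysctl.conf <<'EOF'
# Keep the kernel from swapping out long-lived JVM heaps
vm.swappiness = 0
EOF

cat > hadoop.limits.conf <<'EOF'
# Hadoop daemons hold many open files and sockets
hdfs   -  nofile  65536
mapred -  nofile  65536
EOF

# To apply (as root):
#   cp 90-hadoop.sysctl.conf /etc/sysctl.d/ && sysctl -p /etc/sysctl.d/90-hadoop.sysctl.conf
#   cp hadoop.limits.conf /etc/security/limits.d/
```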

------
nisa
It's probably a better idea to use Hadoop 1.0.4 or even 2.0. Lots of little
annoyances are solved there.

Monitoring is also completely missing: we use Icinga/Nagios and Ganglia.
Ganglia in particular has been invaluable, in my experience, for adjusting the
configuration for optimal machine usage.

Another point worth considering is security. By default, Hadoop is secured
like NFS: any user who can create an _hdfs_ or _hadoop_ user on a machine
that has access to the NameNode can delete your HDFS. Hadoop can use Kerberos
for real authentication.
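For reference, turning on Kerberos starts with flipping the authentication mode in core-site.xml (a fragment sketch only; the full setup also requires principals, keytabs and per-service settings, so follow your distro's security guide):

```xml
<!-- core-site.xml fragment: switch from "simple" (trust the client's
     claimed identity) to Kerberos, and enforce service-level ACLs -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
```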

Also consider adding Snappy Compression to your setup, it speeds up the
shuffle phase.
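Enabling Snappy for intermediate map output is a two-property change (the MRv1 property names, matching the Hadoop 1.x versions discussed in this thread; the newer MRv2 names differ, so check your distro's docs):

```xml
<!-- mapred-site.xml fragment: compress map output with Snappy so the
     shuffle phase moves less data over the network -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```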

Last but not least - I've found these slides about Hadoop Tuning invaluable:
<http://www.slideshare.net/cloudera/mr-perf>

@meinuelzen: We use Oracle JDK 7 with en_US.UTF-8 and ntpd on all machines.
Ubuntu 10.04 / Ubuntu 12.04, but the OS should not matter. Lots of RAM and
lots of disks are more important.

------
mantunovic
@oellegaard: If you had issues with the manual setup, then Puppet modules +
setup will be much the same. If you just want to try it out, you can use
Cloudera Manager to set up a cluster without any knowledge of Hadoop. You
just need key-based root SSH to all hosts from the host where Cloudera
Manager is installed, and it will just happen in seconds.

[https://ccp.cloudera.com/display/DOC/Documentation#Documenta...](https://ccp.cloudera.com/display/DOC/Documentation#Documentation-ClouderaManager4.5FreeEditionBetaDocumentation)

------
mantunovic
Thank you! For the locale, yes, as you said, UTF-8 is the standard. And for
date and time, it is better to synchronize via NTP, because having different
times across servers can cause so much trouble.

Regards,

------
meinuelzen
Nice one!

What would you recommend as default locales settings for the systems? I guess
_LC_ALL=en_US.UTF-8_ , right?

And what about server date and time? Using NTP or not?

------
photorized
I've seen these graphics before in Dell's presentations. Do you work for Dell?

