
Scheduling Algorithms in Big Data: A Survey [pdf] - Katydid
http://www.ijecs.in/issue/v5-i8/53%20ijecs.pdf
======
thinkmassive
This paper might seem short, but it's a great introduction to various
scheduler types available for Hadoop clusters. Fair Scheduler & Capacity
Scheduler are the only ones I've witnessed in production use, but I see a huge
potential for improvement in the longer term by using some of the adaptive
types.

It's amazing how many enterprise customers encounter severe issues with their
clusters simply because of poor scheduler configuration. Either they never
changed the defaults, or they arbitrarily twisted knobs without understanding
how the various configuration values depend on each other.

Finding a healthy balance between having humans describe what they think they
want and letting the system adjust itself will be huge.

~~~
nl
Hadoop scheduling is _hard_ , and the one thing I'd love to give someone some
money to come in and fix.

------
dkuder
It's dated. Uses references to Hadoop 1.2 documentation. Hadoop is at 2.7
currently with 2.8 and 3.0 coming soon.

In particular, it refers to slots. YARN no longer uses slots.

No mention of Tez, Storm or Spark?

------
mattkrause
Is anyone actually sitting on zettabyte-scale data sets, as the introduction
claims?

There's a three-year-old xkcd "what-if" that guesstimates Google has about 15
exabytes of data, and that presumably includes lots of different types of data
(e.g., you're not going to run any analysis across YouTube videos, Gmail, and
self-driving car telemetry).

------
bitmadness
The authors really need to clean this up; the English is awful...

