Scheduling Algorithms in Big Data: A Survey [pdf] (ijecs.in)
49 points by Katydid on Nov 20, 2016 | 5 comments

This paper might seem short, but it's a great introduction to various scheduler types available for Hadoop clusters. Fair Scheduler & Capacity Scheduler are the only ones I've witnessed in production use, but I see a huge potential for improvement in the longer term by using some of the adaptive types.

It's amazing how many enterprise customers encounter severe issues with their clusters simply because of poor scheduler configuration. Either they never changed the defaults, or they arbitrarily twisted knobs without understanding how the various configuration values depend on each other.
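To illustrate the kind of dependency meant here, below is a rough sketch of how three YARN memory settings interact. It assumes Capacity Scheduler style rounding, where a container request is rounded up to a multiple of `yarn.scheduler.minimum-allocation-mb` and rejected above `yarn.scheduler.maximum-allocation-mb`. The function names and the numbers are hypothetical, and this is a simplified model, not YARN's actual code.

```python
import math


def normalize_request(requested_mb, min_alloc_mb, max_alloc_mb):
    """Round a memory request up to a multiple of the minimum allocation.

    Simplified model (assumption): mirrors Capacity Scheduler style rounding
    and treats a request above the maximum allocation as an error.
    """
    rounded = math.ceil(requested_mb / min_alloc_mb) * min_alloc_mb
    granted = max(rounded, min_alloc_mb)  # never grant less than the minimum
    if granted > max_alloc_mb:
        raise ValueError("request exceeds the maximum allocation")
    return granted


def containers_per_node(node_mb, requested_mb, min_alloc_mb, max_alloc_mb):
    """How many such containers fit in one node's memory budget.

    node_mb stands in for yarn.nodemanager.resource.memory-mb.
    """
    granted = normalize_request(requested_mb, min_alloc_mb, max_alloc_mb)
    return node_mb // granted


# Hypothetical node: 8 GB for containers, jobs asking for 1.5 GB each.
with_1024_min = containers_per_node(8192, 1536, 1024, 8192)
with_512_min = containers_per_node(8192, 1536, 512, 8192)
print(with_1024_min, with_512_min)
```

In this model, changing only the minimum allocation from 1024 MB to 512 MB moves the same node from 4 containers to 5 for the same job, which is the sort of knock-on effect that is easy to miss when tuning one value in isolation.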

Finding a healthy balance between having humans describe what they think they want and letting the system adjust itself will be huge.

Hadoop scheduling is hard, and it's the one thing I'd love to pay someone to come in and fix.

It's dated. It references the Hadoop 1.2 documentation. Hadoop is currently at 2.7, with 2.8 and 3.0 coming soon.

In particular, it refers to slots. YARN no longer uses slots.

No mention of Tez, Storm or Spark?

Is anyone actually sitting on zettabyte-scale data sets, as the introduction claims?

There's a three-year-old xkcd "what-if" that guesstimates Google has about 15 exabytes of data, and that presumably includes lots of different types of data that would never be analyzed together (e.g., you're not going to run any analysis across YouTube videos, Gmail messages, and self-driving car telemetry).
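For a sense of the gap, here is a quick back-of-the-envelope check. It assumes decimal (SI) units, and the 15 exabyte figure is only the guesstimate quoted above; the function name is made up for illustration.

```python
EXABYTE = 10**18    # bytes, SI definition
ZETTABYTE = 10**21  # bytes, SI definition


def fraction_of_zettabyte(exabytes):
    """Express a size given in exabytes as a fraction of one zettabyte."""
    return exabytes * EXABYTE / ZETTABYTE


# The what-if guesstimate for Google, taken from the comment above.
share = fraction_of_zettabyte(15)
print(f"{share:.1%} of a zettabyte")
```

So even that whole estimate comes to about 1.5% of a single zettabyte.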

The authors really need to clean this up, the English is awful...
