It's amazing how many enterprise customers encounter severe issues with their clusters simply because of poor scheduler configuration. Either they never changed the defaults, or they arbitrarily twisted knobs without understanding how the various configuration values depend on each other.
Finding a healthy balance between having humans describe what they think they want and letting the system adjust itself will be huge.
In particular it refers to slots. YARN no longer uses slots.
No mention of Tez, Storm or Spark?
There's a three-year-old xkcd "what-if" that guesstimates Google has about 15 exabytes of data, and that presumably includes lots of different types of data (e.g., you're not going to run any analysis across YouTube videos, Gmail messages, and self-driving car telemetry).