A "data science" track should hence be 90% about algorithms, data structures, linear algebra and computer architecture. People need to know how to compute with matrices, how to use B-trees, what R-trees are, what SIMD is, what cache locality is, what package to use for practical linear algebra, what the GPU is good for and when and how to use it. Teach databases, but also include some of the theoretical database stuff, and teach how to correctly use relational databases in the first place, since this is most commonly useful and people don't really understand it. PostgreSQL is a real pearl, and few people know how much capability it has, how to use the geospatial indexing, the full-text search, how to do basic optimization and profiling.
Even for people who really do large scale computations, learning Hadoop or MongoDB (is anyone legitimately doing big data really using MongoDB?) is just an afterthought, considering how much you have to learn first about mathematics, algorithms and computing to do anything sensible at that scale at all. If you bubble sort, MongoDB won't save you. If you know the fundamentals already, you likely don't need a separate course in MongoDB or Hadoop.
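To put a rough number on the bubble sort remark, here is a minimal Python sketch that counts comparisons instead of measuring wall-clock time. The names `bubble_sort`, `CountingKey` and `builtin_sort` are made up for this illustration, and the input size of 1000 is an arbitrary assumption; the point is only that the gap is algorithmic and no choice of storage engine closes it.

```python
import random


def bubble_sort(values):
    """Return a sorted copy of values and the number of comparisons made."""
    items = list(values)
    comparisons = 0
    for end in range(len(items) - 1, 0, -1):
        swapped = False
        for i in range(end):
            comparisons += 1
            if items[i] > items[i + 1]:
                items[i], items[i + 1] = items[i + 1], items[i]
                swapped = True
        if not swapped:  # already sorted, stop early
            break
    return items, comparisons


class CountingKey:
    """Wraps a value and counts every '<' comparison the built-in sort makes."""

    comparisons = 0

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        CountingKey.comparisons += 1
        return self.value < other.value


def builtin_sort(values):
    """Return a sorted copy via sorted() and the number of comparisons made."""
    CountingKey.comparisons = 0
    result = sorted(values, key=CountingKey)
    return result, CountingKey.comparisons


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs see the same input
    data = [rng.random() for _ in range(1000)]
    _, slow = bubble_sort(data)
    _, fast = builtin_sort(data)
    print(f"bubble sort comparisons:   {slow}")
    print(f"built-in sort comparisons: {fast}")
```

On a reverse-sorted input of n elements the bubble sort above performs n(n-1)/2 comparisons, so growing the data by 10x grows the work by roughly 100x.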
For people looking to learn something more genuine, I would recommend, for example, this book:
I've walked out of job interviews over this sort of thing, where the role was described as a "performance programming" position for big data and/or machine learning, except that everything had to be in Java. And sure, Java performance programming is a thing, but compared to what one can achieve with SIMD, attention to cache locality and/or running on the GPU for the 20% of the code that eats 80-99% of the cycles, it's laughably uninteresting to me, big bucks aside.
And this is where I get told by data scientists that they don't wish to support such code. And IMO that's fine for piddling around and experimentation. But for production, on thousands to hundreds of thousands of servers, running 24/7, at companies with billions and billions of dollars in the bank, that's leaving way too many transistors and electrons on the table for me to stomach. In contrast, here's what you can achieve when you do pay attention to these things:
I have a pretty good background in a lot of that (can always learn more of course), but I don't know anything about Hadoop and MapReduce (which is conveniently not mentioned in your critique, probably because it does fall under your list of acceptable topics), so I find this course interesting. I find the claim of "if you know the fundamentals, you don't need a course in that" to be dubious. Essentially you are saying that any learning material specifically targeting Hadoop is unnecessary?
Don't worry though, I'm not looking for some quick fix to my business needs, I'm not going to go out and spin up a Hadoop cluster on my 500GB of production data, I just want to learn. You're arguing more against your perceived motivations of the course-takers than the validity of the course itself.
“What is Big Data?” They will teach you fundamental principles of Hadoop, MapReduce, and how to make sense of big data. Developers will learn skills that provide fundamental building blocks towards deriving maximum value from the world's data. Technologists and business managers will gain the knowledge to build a big data strategy around Hadoop.
In my experience, to be successful in engineering in general, one has to learn whole design spaces instead of just individual technologies. This means taking a programming languages course instead of another C++ course, taking a distributed systems course instead of taking a Hadoop course, taking a databases course instead of a MySQL course and so forth. You of course have to fiddle around with the various tools as well, but you don't need a course or an instructor for that, otherwise you often end up just following written or spoken instructions on which configuration file to edit, what command or query to type, etc., which is actually much _worse_ than self-directed learning. Once you have this theoretical background and bits of varied practical experience, you can make mature decisions about which tool to pick for a particular job.
So, I would really, really like to avoid anyone I might have a chance of working with learning about "how to build a big data strategy around XXX", whatever product XXX is. You don't build anything "around" up-front assumed technologies. This is just pumping the "big data" bubble, which is certainly good for Cloudera, which makes a living based on that, but doesn't exactly sound like teaching people to make informed technical judgements. I am also not too partial to the stance that learning Hadoop and MapReduce makes you a big data expert (they offer a certificate in big data after completing the course).
And no, there are nowhere near enough algorithms courses. The things taught in most undergraduate algorithms courses are often not really the things you need for practical large scale data processing. I posted the Jeff Ullman book precisely as an example of what good courses in handling data might look like. This material is taught very rarely.
I'm frankly humbled at how little I know. Thanks for posting it. I'm going to be spending a lot of evenings on it.
My only complaint is with the awful, awful cover. No link, but if you've ever seen it, you'll know exactly what I mean.
Edit: I also don't want to sound closed-minded or rule out an era where every company, large or small, will have TBs of data on their hands. I just haven't seen any indications that we're going in that direction.
I think that one nice thing about the idea of "big data" is being able to parallelize the problem and just throw more cores at it.
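As a rough illustration of what "parallelize the problem" means in practice, here is a small Python sketch of the split / map / reduce shape applied to a word count. All function names (`split_into_chunks`, `count_words`, `merge_counts`, `word_count`) and the `map_fn` parameter are hypothetical, chosen for this example; it runs sequentially by default and only shows that each chunk can be processed without looking at any other chunk.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(items, n_chunks):
    """Split a list into n_chunks contiguous pieces of nearly equal size."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def count_words(lines):
    """Map step: count words in one chunk, independently of all other chunks."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


def word_count(lines, n_chunks=4, map_fn=map):
    """Count words by splitting the input, mapping over chunks, then merging.

    map_fn is an assumption of this sketch: any callable with the same
    calling convention as the built-in map() can be substituted.
    """
    chunks = split_into_chunks(lines, n_chunks)
    partials = map_fn(count_words, chunks)
    return reduce(merge_counts, partials, Counter())


if __name__ == "__main__":
    text = ["big data big", "data is just statistics", "Big deal"]
    print(word_count(text, n_chunks=2))
```

Because `count_words` never touches shared state, it should be possible to pass something like `multiprocessing.Pool(4).map` as `map_fn` (under the usual `if __name__ == "__main__":` guard) to spread the chunks over several cores. Whether that pays off depends on the chunks being large enough to outweigh the cost of shipping data between processes.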
But on the other hand, I do think that when people think of "big data" they have in mind some magic solution that doesn't really exist. At the end of the day big data is just statistics.
Today there are mass spamvertising campaigns on Big Data, but there are also applications in financial services (making sure our pension funds take the right risk), engineering, telecom and elsewhere that help improve our lives.
(as a comic aside to the buzzwordiness of "Big data", I saw a tweet yesterday that lamented rising use of the term "Hyper data". Which was followed by a reply about approaching "Ludicrous data", with a link to Spaceballs' ludicrous speed scene)
From the post it looks like Udacity too is working on courses that address this.