
99% of people looking for information about big data, and 99% of people looking to do data science, have nowhere near big data and don't need to be taught Hadoop. Those people are often lacking fundamental knowledge and are looking for a quick technological fix instead of reviewing their basics.

A "data science" track should hence be 90% about algorithms, data structures, linear algebra, and computer architecture. People need to know how to compute with matrices, how to use B-trees, what R-trees are, what SIMD is, what cache locality is, which package to use for practical linear algebra, and what the GPU is good for and when and how to use it. Teach databases, but also include some of the theoretical database material, and teach how to use relational databases correctly in the first place, since that is what is most commonly useful and what people don't really understand. PostgreSQL is a real pearl, and few people know how much capability it has: geospatial indexing, full-text search, basic optimization and profiling.
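
To make the cache-locality point concrete, here's a toy NumPy sketch (my illustration, not anything from the course): a NumPy matrix is row-major by default, so traversing it row by row touches memory sequentially, while traversing it column by column strides through memory and defeats the cache.

```python
import time
import numpy as np

n = 2000
a = np.arange(n * n, dtype=np.float64).reshape(n, n)  # C (row-major) order

# Sequential access: each row is contiguous in memory.
t0 = time.perf_counter()
row_sum = sum(a[i, :].sum() for i in range(n))
t_rows = time.perf_counter() - t0

# Strided access: consecutive column elements are n*8 bytes apart.
t0 = time.perf_counter()
col_sum = sum(a[:, j].sum() for j in range(n))
t_cols = time.perf_counter() - t0

print(f"row traversal: {t_rows:.4f}s  column traversal: {t_cols:.4f}s")
# Same result either way; only the memory access pattern differs.
assert abs(row_sum - col_sum) < 1e-6 * row_sum
```

The exact ratio depends on the machine and cache sizes, but the strided traversal is typically noticeably slower for matrices that don't fit in cache.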

Even for people who really do large scale computations, learning hadoop or mongodb (is anyone legitimately doing big data really using mongodb?) is just an afterthought, considering how much you have to learn first about mathematics, algorithms and computing to do anything sensible at that scale at all. If you bubble sort, MongoDB won't save you. If you know the fundamentals already, you likely don't need a separate course in mongo or hadoop.
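
The bubble sort jab is easy to make concrete; a quick sketch of why no cluster rescues an O(n^2) algorithm:

```python
import random

def bubble_sort(xs):
    """Classic bubble sort: the comparison count is exactly n(n-1)/2."""
    xs = list(xs)
    comparisons = 0
    for i in range(len(xs)):
        for j in range(len(xs) - 1 - i):
            comparisons += 1
            if xs[j] > xs[j + 1]:
                xs[j], xs[j + 1] = xs[j + 1], xs[j]
    return xs, comparisons

data = [random.randrange(10_000) for _ in range(2_000)]
result, comps = bubble_sort(data)

assert result == sorted(data)
# n = 2,000 already costs 1,999,000 comparisons; an n log n sort needs
# roughly 22,000. The gap grows linearly with n -- no amount of
# MongoDB or extra machines changes that exponent.
print(comps)  # 1999000
```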

For people looking to learn something more genuine, I would recommend, for example, this book:


100% agree, except that I think courses like this are great for people who want to bluff their way through a job interview to get one of those $150K-$250K big data jobs that are all the rage right now (watching Andrew Ng's machine learning lectures beforehand would be the pro move in my book). In my experience so far, most of these positions appear to be Java programming gigs where, sadly, issues like SIMD, cache locality, and the GPU are actively ignored or dismissed. The autoboxing required to use Java generics alone pretty much destroys any hope of efficient cache use.
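
A rough Python analogue of the autoboxing problem, for the curious (illustrative only; a NumPy array stands in for a Java primitive array, and a list of Python float objects stands in for a List of boxed Doubles):

```python
import sys
import numpy as np

# Boxed: a million separate heap objects reached through pointers,
# scattered across memory -- exactly what kills cache locality.
# Flat: a million float64s packed contiguously, 8 bytes each.
n = 1_000_000
boxed = [float(i) for i in range(n)]
flat = np.arange(n, dtype=np.float64)

boxed_bytes = sys.getsizeof(boxed) + sum(sys.getsizeof(x) for x in boxed)
flat_bytes = flat.nbytes

print(boxed_bytes, flat_bytes)  # roughly 32 bytes per value vs. exactly 8
assert flat_bytes == 8 * n
assert boxed_bytes > 3 * flat_bytes
```

On CPython, each boxed float costs about 24 bytes plus an 8-byte pointer in the list, and the traversal order of those objects has nothing to do with their layout in memory.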

I've walked out of job interviews over this sort of thing, where a position is described as "performance programming" for big data and/or machine learning, except that everything has to be in Java. And sure, Java performance programming is a thing, but compared to what one can achieve with SIMD, attention to cache locality, and/or running the 20% of the code that eats 80-99% of the cycles on the GPU, it's laughably uninteresting to me, big bucks aside.

It's faster than python and R, dogg.

You can do extreme performance coding in Java, JavaScript, R, Python, Haskell, or the language of your choice, as long as the number-crunching is done by calling low-level, heavily optimized code. For example, PyCUDA:


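A minimal sketch of the pattern (NumPy rather than PyCUDA here, just to keep it self-contained): the orchestration stays in the high-level language, and the loop that matters is dispatched to optimized native BLAS code.

```python
import numpy as np

def dot_pure_python(xs, ys):
    """The naive version: one interpreter dispatch per element."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc += x * y
    return acc

a = np.linspace(0.0, 1.0, 100_000)
b = np.linspace(1.0, 0.0, 100_000)

slow = dot_pure_python(a.tolist(), b.tolist())
fast = float(np.dot(a, b))  # same arithmetic, handed off to native BLAS

# Identical result; the native path is typically orders of magnitude
# faster because the loop runs in compiled, vectorized code.
assert abs(slow - fast) < 1e-6 * abs(fast)
```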
And this is where I get told by data scientists that they don't wish to support such code. And IMO that's fine for piddling around and experimentation. But for production, on thousands to hundreds of thousands of servers, running 24/7, at companies with billions and billions of dollars in the bank, that's leaving way too many transistors and electrons on the table for me to stomach. In contrast, here's what you can achieve when you do pay attention to these things:


You're in a different part of the problem space.

Aren't there already plenty of courses and books out there about algorithms, data structures, RDBMS, etc?

I have a pretty good background in a lot of that (one can always learn more, of course), but I don't know anything about Hadoop and MapReduce (conveniently not mentioned in your critique, probably because they fall under your list of acceptable topics), so I find this course interesting. I find the claim that "if you know the fundamentals, you don't need a course in that" dubious. Essentially you are saying that any learning material specifically targeting Hadoop is unnecessary?

Don't worry though, I'm not looking for some quick fix to my business needs, I'm not going to go out and spin up a Hadoop cluster on my 500GB of production data, I just want to learn. You're arguing more against your perceived motivations of the course-takers than the validity of the course itself.

What really annoys me is this particular blog post, not just the existence of the course, for example this thing:

“What is Big Data?” They will teach you fundamental principles of Hadoop, MapReduce, and how to make sense of big data. Developers will learn skills that provide fundamental building blocks towards deriving maximum value from the world's data. Technologists and business managers will gain the knowledge to build a big data strategy around Hadoop.
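
For what it's worth, the MapReduce model itself fits in a few lines of plain Python. This is only a single-machine sketch of the idea (Hadoop's actual contribution is distributing these phases across machines, spilling the shuffle to disk, and surviving failures):

```python
from collections import defaultdict
from itertools import chain

def map_reduce(records, mapper, reducer):
    """Map each record to (key, value) pairs, group by key (the
    "shuffle"), then reduce each key's values to a final result."""
    grouped = defaultdict(list)
    for key, value in chain.from_iterable(mapper(r) for r in records):
        grouped[key].append(value)  # shuffle phase
    return {key: reducer(key, values) for key, values in grouped.items()}

# The classic word count.
lines = ["big data big hype", "data beats hype"]
counts = map_reduce(
    lines,
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=lambda word, ones: sum(ones),
)
print(counts)  # {'big': 2, 'data': 2, 'hype': 2, 'beats': 1}
```

Once you see the model this way, a distributed systems course tells you far more about why Hadoop is built the way it is than a Hadoop course does.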

In my experience, to be successful in engineering in general, one has to learn whole design spaces instead of just individual technologies. This means taking a programming languages course instead of another C++ course, a distributed systems course instead of a Hadoop course, a databases course instead of a MySQL course, and so forth. You of course have to fiddle around with the various tools as well, but you don't need a course or an instructor for that; otherwise you often end up just following written or spoken instructions about which configuration file to edit or what command or query to type, which is actually much _worse_ than self-directed learning. Once you have this theoretical background and bits of varied practical experience, you can make mature decisions about which tool to pick for a particular job.

So, I would really, really like to avoid anyone I might have a chance of working with learning "how to build a big data strategy around XXX", whatever product XXX is. You don't build anything "around" up-front assumed technologies. This is just pumping the "big data" bubble, which is certainly good for Cloudera, which makes a living from it, but doesn't exactly sound like teaching people to make informed technical judgements. I am also not too partial to the stance that learning Hadoop and MapReduce makes you a big data expert (they offer a certificate in big data after completing the course).

And no, there are nowhere near enough algorithms courses. The things taught in most undergraduate algorithms courses are often not the things you need for practical large-scale data processing. I posted the Jeff Ullman book precisely as an example of what good courses in handling data might look like. This material is taught very rarely.

A lot of universities are definitely trying to push more and more of these sorts of courses though; check out, for example, cs229r at Harvard: http://people.seas.harvard.edu/~minilek/cs229r/index.html , along with some other course examples at the bottom of that page. Do you think things are changing for the better in this respect?

The Ullman book is intimidating to say the least.

I'm frankly humbled at how little I know. Thanks for posting it. I'm going to be spending a lot of evenings on it.

If you have any suggestions for online courses or other self-learning resources for distributed systems, I would be much obliged. The Ullman book is already in my queue.

Any recommendations for learning more about PostgreSQL's features? I use Postgres for basic web app data store kind of applications, but haven't gone much beyond that. I've skimmed the manual and it doesn't look that dissimilar to MySQL's, but I see a lot of comments heaping praise on Postgres and/or criticizing MySQL. Is there a good book on Postgres?

Man, I love that book. I loved it so much, as a matter of fact, that I went and bought it from Amazon. Even though I haven't directly used a large amount of it (though the locality-sensitive hashing and communication complexity material alone is easily worth the price), it has definitely made me think about large datasets in a much more productive fashion.

My only complaint is with the awful, awful cover. No link, but if you've ever seen it, you'll know exactly what I mean.

True. You're much more likely to apply data management fundamentals in a project than to optimize Impala queries on petabytes of user data. If you're a 10-50 person startup, or maybe even a 200 person startup, what core/critical internal problems can you think of that would require such large-scale computing? Would you allocate precious resources to a 'big data' team to monitor your logs or user activity? You most likely wouldn't need to. For the most part, only the big companies deal with that much data, and only a handful of people are in charge of managing it.

Edit: I also don't want to sound close-minded or rule out an era where every company, large or small, has terabytes of data on its hands. I just haven't seen any indications that we're heading in that direction.

Your list of stuff to learn is rather heavy on implementation. Wouldn't it make more sense to run a poor implementation of an algorithm on 10 machines with Hadoop than to squeeze every ounce of performance out of a single machine with GPUs, cache locality, etc., at great expense in programmer time?

I think that one nice thing about the idea of "big data" is being able to parallelize the problem and just throw more cores at it.

But on the other hand, I do think that when people think of "big data" they have in mind some magic solution that doesn't really exist. At the end of the day, big data is just statistics.

This is a fair counterpoint, but if your lack of fundamentals led you to write an O(n^k), k > 1, algorithm, you're not going to be able to pay your way to a solution with more cores if you truly have "big data". Even the constant multipliers of a poor O(n) algorithm will cost you serious bucks if your default optimization strategy is "buy more computers" rather than a few afternoons of quiet thought.
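
Some back-of-the-envelope arithmetic on that trade-off (illustrative numbers only, assuming ten million records):

```python
import math

# Doubling cores at best halves wall-clock time, but for an O(n^2)
# algorithm doubling the data quadruples the work. Compare the total
# operation counts of a quadratic algorithm and an n log n one.
def work_quadratic(n):
    return n * n

def work_nlogn(n):
    return n * math.log2(n)

n = 10_000_000  # ten million records
ratio = work_quadratic(n) / work_nlogn(n)

# The quadratic version needs this many times more machines just to
# break even with one machine running the better algorithm.
print(f"cores needed to match the better algorithm: ~{ratio:,.0f}")
assert ratio > 400_000
```

At ten million records the quadratic algorithm needs on the order of 430,000 times the compute; no realistic cluster papers over that gap.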
