

Launching our Data Science and Big Data Track - Alex3917
http://blog.udacity.com/2013/11/sebastian-thrun-launching-our-data.html?m=1

======
stiff
99% of people looking for information about big data, and 99% of people
looking to do data science, have nowhere near big data and don't need to be
taught Hadoop. Those people often lack fundamental knowledge and are looking
for a technological trick instead of reviewing their basics.

A "data science" track should hence be 90% about algorithms, data structures,
linear algebra and computer architecture. People need to know how to compute
with matrices, how to use B-trees, what R-trees are, what SIMD is, what cache
locality is, what package to use for practical linear algebra, what the GPU is
good for and when and how to use it. Teach databases, but also include some of
the theoretical database stuff, and teach how to correctly use relational
databases in the first place, since this is most commonly useful and people
don't really understand it. PostgreSQL is a real pearl, and few people know
how much capability it has, how to use the geospatial indexing, the full-text
search, how to do basic optimization and profiling.

Even for people who really do large-scale computations, learning Hadoop or
MongoDB (is anyone legitimately doing big data really using MongoDB?) is just
an afterthought, considering how much you have to learn first about
mathematics, algorithms and computing to do anything sensible at that scale at
all. If you bubble sort, MongoDB won't save you. If you know the fundamentals
already, you likely don't need a separate course in MongoDB or Hadoop.
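
To make the bubble sort point concrete, here is a minimal Python sketch (not
part of the original comment; `bubble_sort` and `builtin_sort` are names made
up for illustration) that counts comparisons for bubble sort and for Python's
built-in `sorted` on the same input:

```python
import random
from functools import cmp_to_key


def bubble_sort(items):
    """Return a sorted copy of items and the number of comparisons made."""
    data = list(items)
    comparisons = 0
    n = len(data)
    for i in range(n - 1):
        swapped = False
        # After pass i, the last i elements are already in their final place.
        for j in range(n - 1 - i):
            comparisons += 1
            if data[j] > data[j + 1]:
                data[j], data[j + 1] = data[j + 1], data[j]
                swapped = True
        if not swapped:
            break  # no swaps in a full pass means the list is sorted
    return data, comparisons


def builtin_sort(items):
    """Sort with the built-in sorted(), counting comparisons it performs."""
    comparisons = 0

    def compare(a, b):
        nonlocal comparisons
        comparisons += 1
        return (a > b) - (a < b)

    return sorted(items, key=cmp_to_key(compare)), comparisons


if __name__ == "__main__":
    random.seed(0)
    values = [random.random() for _ in range(1000)]
    _, bubble_count = bubble_sort(values)
    _, builtin_count = builtin_sort(values)
    print("bubble sort comparisons: ", bubble_count)
    print("built-in sort comparisons:", builtin_count)
```

On random input, bubble sort needs on the order of n^2/2 comparisons while the
built-in sort needs on the order of n log n, and no choice of datastore closes
that gap.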

For people looking to learn something more genuine, I would recommend, for
example, this book:

[http://infolab.stanford.edu/~ullman/mmds.html](http://infolab.stanford.edu/~ullman/mmds.html)

~~~
varelse
100% agree except that I think courses like this are great for people who want
to bluff their way through a job interview to get one of those $150K-$250K big
data jobs that are the rage right now (watching Andrew Ng's machine learning
lectures beforehand as well would be the pro move in my book). In my
experience so far, most of these positions appear to be Java programming gigs
where, sadly, issues like SIMD, cache locality, and the GPU are actively
ignored or dismissed. Autoboxing primitives just to put them in generic
collections pretty much destroys any hope of efficient cache use.

I've walked out of job interviews over this sort of thing wherein it's been
described as a "performance programming" position for big data and/or machine
learning except that everything has to be in Java. And sure, Java performance
programming is a thing, but compared to what one can achieve with SIMD,
attention to cache-locality and/or running on the GPU for the 20% of the code
that eats 80-99% of the cycles, it's laughably uninteresting to me, big bucks
aside.

~~~
sanskritabelt
It's faster than python and R, dogg.

~~~
varelse
You can do extreme performance coding in Java, JavaScript, R, Python, Haskell
or the language of your choice as long as the number-crunching is done by
calling low-level, heavily optimized code. For example, PyCUDA:

[http://mathema.tician.de/software/pycuda/](http://mathema.tician.de/software/pycuda/)

And this is where I get told by data scientists that they don't wish to
support such code. And IMO that's fine for piddling around and
experimentation. But for production, on thousands to hundreds of thousands of
servers, running 24/7, at companies with billions and billions of dollars in
the bank, that's leaving way too many transistors and electrons on the table
for me to stomach. In contrast, here's what you can achieve when you do pay
attention to these things:

[http://istc-bigdata.org/index.php/mapd-a-way-to-map-big-data...](http://istc-bigdata.org/index.php/mapd-a-way-to-map-big-data-faster/)

~~~
sanskritabelt
You're in a different part of the problem space.

------
jsaxton86
It seems that the majority of Big Data/Data Science applications are designed
to give advertisers insight into things I don't really want them to have
insight into. That really sucks, because the technology is cool, but I don't
want to help build that kind of future. It's kind of analogous to how I feel
about Computer Vision: there are a handful of legitimate purposes for it, but
most applications of the technology fall somewhere between "I don't like that
idea" and "that's totally unethical".

~~~
mathattack
There are plenty of Big Data applications that aren't unethical. Back in the
90s (before the term Big Data was conceived) two of the biggest users of
Teradata were P&G and Wal*mart. It was more about supply chain and retail
store efficiency than anything nefarious. Big data helped make sure that store
shelves had what people wanted.

Today there are mass spamvertising campaigns built on Big Data, but there are
also applications in financial services (making sure our pension funds take
the right risk), engineering, telecom and elsewhere that help improve our
lives.

~~~
sanskritabelt
Helping Walmart is, at best, morally suspect.

~~~
Houshalter
Reducing retail waste is better for everyone. Though some of the stuff they do
is questionable, like how carefully items in the store are placed to maximize
the amount of unnecessary crap they sell to impulsive people.

~~~
sanskritabelt
How does 'exploitative labor practices' float your boat?

~~~
Houshalter
Well hopefully those jobs will be automated soon enough as machine vision and
robotics rapidly pick up pace.

------
marrone12
This is a cool idea, but I wish everything wasn't so 'big data' oriented. Most
people will never work with big data. Instead of teaching me map/reduce, how
about teaching me how to model with a mixture distribution? Teach me how to
master small data and then scale that up to big data when and if need be.
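
For anyone wondering what "model with a mixture distribution" looks like on
small data, here is a rough sketch in plain Python: a two-component 1-D
Gaussian mixture fitted with EM. Everything in it (the function names, the
quartile-based initialization, the synthetic sample) is an assumption made for
illustration, not something taken from any of the courses.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def fit_two_gaussians(data, iterations=100):
    """Fit a two-component 1-D Gaussian mixture with EM.

    Returns (weights, means, stds), each a list of two floats.
    """
    lo, hi = min(data), max(data)
    overall_mean = sum(data) / len(data)
    overall_std = math.sqrt(sum((x - overall_mean) ** 2 for x in data) / len(data))
    # Assumed initialization: means at 25% and 75% of the data range.
    means = [lo + 0.25 * (hi - lo), lo + 0.75 * (hi - lo)]
    stds = [overall_std, overall_std]
    weights = [0.5, 0.5]

    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            p = [weights[k] * normal_pdf(x, means[k], stds[k]) for k in range(2)]
            total = (p[0] + p[1]) or 1e-300  # guard against underflow to zero
            resp.append((p[0] / total, p[1] / total))
        # M-step: re-estimate weight, mean and std of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            stds[k] = max(math.sqrt(var), 1e-6)  # keep std strictly positive
    return weights, means, stds


if __name__ == "__main__":
    random.seed(1)
    # Synthetic sample: 60% from N(0, 1) and 40% from N(5, 1).
    sample = [random.gauss(0.0, 1.0) for _ in range(600)]
    sample += [random.gauss(5.0, 1.0) for _ in range(400)]
    weights, means, stds = fit_two_gaussians(sample)
    print("weights:", weights)
    print("means:  ", means)
    print("stds:   ", stds)
```

A thousand points fit comfortably in memory on a laptop, which is rather the
point: nothing here needs map/reduce.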

~~~
joshz
You might like Data Analysis from Coursera.

[https://www.coursera.org/course/dataanalysis](https://www.coursera.org/course/dataanalysis)

From the post it looks like Udacity too is working on courses that address
this.

~~~
thaddeusmt
I'm working through this course (and took the earlier sister course
Computation for Data Analysis with R). It's quite good so far. They stay away
from the "big" part and focus on the core of data analysis: how to find data,
clean it up, explore it, find relationships and present your findings. We are
using R, which is suitable for most data sizes. It's offered by Johns Hopkins,
and has more of an academic bent than an industry one. Great general purpose
knowledge that I think you would want before you start messing around with
Hadoop.

------
hoprocker
So looking through this 'track', I see one course which seems like it might be
more central to the discipline, "Intro to Data Science"[0]. Has anybody had a
chance to compare this one against Bill Howe's "Introduction to Data
Science"[1] on Coursera?

[0]
[https://www.udacity.com/course/ud359](https://www.udacity.com/course/ud359)
[1]
[https://www.coursera.org/course/datasci](https://www.coursera.org/course/datasci)

------
suhair
The Introduction to Hadoop and MapReduce course seems to have the right amount
of content. It can be completed in one sitting, and the content is polished,
well presented, and easy to grasp. Respect to the Cloudera faculty. As an added
bonus, it uses Python instead of Java for examples.

~~~
Littleme
(Course author here) Thanks. We chose Python because it's a little more
approachable for many people than Java, and is the language used in Udacity's
Comp Sci 101 course. Also, using Hadoop Streaming saved us from having to
explain a bunch of concepts such as WritableComparables, InputFormats, etc.
that would just have gotten in the way of the basic MapReduce principles.
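
For readers who haven't seen Hadoop Streaming: the mapper and reducer are
ordinary programs that read lines on stdin and write tab-separated key/value
lines on stdout, with the framework sorting by key in between. A rough sketch
of that shape as a word count (not the course's actual code; `mapper`,
`reducer` and `run_local` are illustrative names, and `run_local` merely
imitates the shuffle step with an in-memory sort):

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive records that share the same key.

    Assumes the input is already sorted by key, which is what the
    framework guarantees to a streaming reducer.
    """
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_local(lines):
    """Imitate map -> sort -> reduce in memory, for small test inputs only."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    sample = ["big data is big", "data is data"]
    for record in run_local(sample):
        print(record)
```

In a real job the two functions would live in separate scripts wired to
`sys.stdin` and `sys.stdout`; keeping them as generators just makes them easy
to test without a cluster.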

------
waitingkuo
Most Udacity courses use Python, but it seems this data science series will
use R. Python also has lots of data analysis tools. I just wonder why they are
choosing R.

~~~
minimaxir
R's data science tools are mostly native.

------
noahmarc
This seems like a great bundle of courses. The big data topics caught my
attention but I'm actually looking forward to exploratory data analysis.
Especially with the Tukey mention:
[https://www.udacity.com/course/ud651](https://www.udacity.com/course/ud651)

------
sfeats
This looks great. Does anyone know how much the program will cost?

~~~
jackgolding
It says on the site, mate: about $210 per unit.

------
Rickasaurus
Ugh, Hadoop.

