
Mining of Massive Datasets - luu
http://mmds.org/
======
amitkgupta84
I'm taking this course right now, and I'm a little ambivalent about it. They
cover various machine learning algorithms, which one can learn anywhere, but
they also talk about how to deal with these things in a massive-data context.
The pragmatic tools needed to wrangle large amounts of data so that you can
apply your usual ML algorithms to it are very nice to see.

That said, I don't feel like I'm learning concepts. So far, the techniques
have felt like: break up the data into chunks this way, apply a bunch of hash
functions that way, this is what ended up working for this particular problem.
I guess if you work in the field, the tools you're exposed to will inspire
things in your own work, and you'll feel more like you're building a general
framework.

The homeworks are terrible. There are no mandatory programming assignments,
and the one optional one does nothing to gradually work up to applying the
stuff they teach you. It's just: here's a massive zip that won't fit on your
hard drive, here's an uninteresting computational question to answer about it,
go for it.

The remaining (basic) homeworks are quizzes, and they're incredibly tedious.
(There are advanced homeworks as well, but they haven't been that inspiring.)
One of the recent homeworks was really just a rehash of some high school
linear algebra; another involved doing some computations with a bunch of
different points. The points weren't provided in a list; they were drawn onto
a JPEG, so you had to manually copy all of them down. That's the kind of
course it's been.

It's a very lightweight course, which is nice if you're working. If your
basic math skills are good and you already have some familiarity with ML and
distributed computing, 5 hrs/wk is enough to watch the videos (at 2x speed,
plus occasionally hitting the 10-second fast-forward) and do the basic
homeworks.

~~~
mathattack
Forgive my ignorance, but are you taking this as a MOOC? If so, what were you
expecting?

~~~
amitkgupta84
A good course. There are MOOCs that are good courses. This one is so-so. I
don't understand your question.

~~~
mathattack
I guess I consider it harsh to judge a book based on the MOOC that uses it.

~~~
amitkgupta84
That would be harsh. Who's doing that?

------
koffiekop
We are currently using this book for a course at the Leiden Institute of
Advanced Computer Science. It's pretty up to date.

It covers LSH, cosine similarity, and Jaccard similarity, as well as
recommender systems of the kind applicable to the Netflix challenge, and so
forth.
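
To make those concrete, here's a quick sketch of the two similarity measures
(my own illustration, not code from the book):

```python
# Rough sketch of the two similarity measures mentioned above
# (my own illustration, not code from the book).
import math

def jaccard(a, b):
    """Jaccard similarity of two sets: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def cosine(u, v):
    """Cosine similarity of two feature/rating vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm
```

LSH then lets you find the high-similarity pairs without computing these for
every pair.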

~~~
aquance
I am a student highly interested in data mining. Do you think this book would
be a good start? What prerequisites do you think it needs?

~~~
koffiekop
It's a good book, and the entry level is not that high. However, you probably
want to have some kind of basis in maths (linear algebra and the like). You
also want to know some of the data mining terminology. But it's free and open,
so check it out. Also, the slides are very helpful.

------
polskibus
There's a related course on Coursera taught by the book's authors.
[https://class.coursera.org/mmds-001/lecture](https://class.coursera.org/mmds-001/lecture)

------
amelius
I guess the focus of this book is on computing rather than on the underlying
math (statistics). Is the math of this book still up to date? I.e., are these
the methods that are still used in practice?

------
incunix
Pretty good resource, but I'm not sure where the large-scale part is other
than Chapter 12.

~~~
thecopy
Chapters 2 and 3 go through LSH and MapReduce, which are used for large data
sets where comparing all-with-all is impossible. Chapter 4 goes through
streams, where you take one item at a time and fit your model to it (so
instead of optimizing, for example, an SVM with the whole data set, you stream
it one item after another). Chapter 9 also includes "online" recommendation
algorithms, and Chapter 11 is dimensionality reduction.
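
The Chapter 4 idea can be sketched roughly like this (my own toy example, not
the book's code): a linear model fit by stochastic gradient descent, updating
on each streamed item once and then discarding it.

```python
# Toy sketch of stream-fitting: online SGD for a linear model,
# updating on one (features, label) pair at a time.
def sgd_stream(stream, lr=0.01, dim=2):
    w = [0.0] * dim
    for x, y in stream:
        pred = sum(wi * xi for wi, xi in zip(w, x))   # current prediction
        err = pred - y
        # gradient step on this single item, which is then discarded
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w
```

The same one-pass pattern is what makes an SVM (or anything else trained by
SGD) workable when the data set doesn't fit in memory.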

Sidenote: a nice way to reduce data-set size for clustering is to construct
coresets from the original dataset [0]; it is possible to create a coreset in
parallel using MapReduce. After this, k-means will produce a very good
approximation.

[0]
[http://las.ethz.ch/files/feldman11scalable.pdf](http://las.ethz.ch/files/feldman11scalable.pdf)
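
A heavily simplified sketch of that idea (1-D toy k-means, and NOT the
linked paper's actual coreset construction): each chunk is summarized by a
few weighted representatives in the "map" step, and only the representatives
are clustered in the "reduce" step.

```python
# Heavily simplified illustration of the map-reduce coreset idea
# (1-D, toy k-means; NOT the construction from the linked paper).
import random

def kmeans1d(points, weights, k, iters=20):
    centers = random.sample(sorted(set(points)), k)   # k distinct seeds
    for _ in range(iters):
        sums, wts = [0.0] * k, [0.0] * k
        for p, w in zip(points, weights):
            j = min(range(k), key=lambda j: abs(p - centers[j]))
            sums[j] += w * p
            wts[j] += w
        centers = [sums[j] / wts[j] if wts[j] else centers[j]
                   for j in range(k)]
    return sorted(centers)

def coreset_kmeans(chunks, k):
    reps, wts = [], []
    for chunk in chunks:                          # "map": summarize each chunk
        cs = kmeans1d(chunk, [1.0] * len(chunk), k)
        counts = [0] * k
        for p in chunk:
            counts[min(range(k), key=lambda j: abs(p - cs[j]))] += 1
        reps += cs
        wts += [float(c) for c in counts]
    return kmeans1d(reps, wts, k)                 # "reduce": cluster summaries
```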

------
krat0sprakhar
I'm currently taking this course on Coursera. We're only halfway through, but
I think I can share a few thoughts.

 _Pros_

\- Faculty: Like most MOOCs, MMDS is taught by some of the best faculty in
the field. I was an avid follower of Anand Rajaraman's blog [0] even before I
joined this course, and I have to say the enthusiasm of the faculty is
infectious and their expertise with the material is markedly evident.

\- Difficulty: MMDS is a CS graduate-level course (CS246) from Stanford. That
means the topics are not trivial, the lectures are dense, and you as a student
are expected to invest significant time in understanding the material. Since
this is hard, grasping the concepts and getting the quizzes right is quite
gratifying. A few lectures each week are tagged as advanced, and students who
view them and answer all the advanced questions get a certificate of
distinction [not quite relevant, but it might provide the necessary incentive
/ motivation to a few students].

\- Material: The syllabus and the topics covered in this course are extremely
relevant for anyone aspiring to work in the data mining / machine learning
field. Having done Andrew Ng's ML course, I find this course acts as a perfect
supplement and covers a lot of practical aspects of implementing the
algorithms when applied to _massive_ data sets. For example, a recent lecture
talked about how the BFR algorithm [1] for finding clusters works better than
k-means for a very large dataset.

\- Book: The accompanying MMDS book is just awesome, and the lectures build
upon the content and examples from it. For someone who finds the book a bit
too challenging (probably because their math is a bit rusty), the lectures
make the material quite approachable.
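
The BFR trick mentioned under Material boils down to replacing a cluster's
points with a few sufficient statistics; here is a bare-bones sketch of that
summary (my own illustration, not from the course):

```python
# Bare-bones sketch of a BFR-style cluster summary: keep only
# N, SUM, and SUMSQ per dimension; centroid and variance are
# recoverable from these, so the raw points can be discarded.
class ClusterSummary:
    def __init__(self, dim):
        self.n = 0
        self.sum = [0.0] * dim
        self.sumsq = [0.0] * dim

    def add(self, point):
        """Fold one point into the summary, then forget the point."""
        self.n += 1
        for i, x in enumerate(point):
            self.sum[i] += x
            self.sumsq[i] += x * x

    def centroid(self):
        return [s / self.n for s in self.sum]

    def variance(self):
        return [sq / self.n - (s / self.n) ** 2
                for s, sq in zip(self.sum, self.sumsq)]
```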

 _Cons_

\- Theoretical: The course is primarily theoretical in both its presentation
and exercises. This is not to say that algorithms are presented without
examples, but the examples are trivial and do not illustrate the issues of
implementing or applying the various algorithms on real-life datasets.

\- Programming Assignments: In sharp contrast to Andrew Ng's course, there are
no compulsory programming assignments. The exercises are all quizzes which
check how well you have understood the concepts. There is just one programming
assignment, which is also optional.

Overall, I'm really liking the course. The professors emphasize citing
industry examples wherever necessary (the PageRank algorithm and Google's
accompanying implementation were covered over 3 lectures), which is a welcome
change from other CS courses. Along with the book, I believe the course is a
wonderful primer to the field of data mining.

[0] -
[http://anand.typepad.com/datawocky/](http://anand.typepad.com/datawocky/)

[1] -
[http://www.dmi.unict.it/~apulvirenti/agd/BFR98.pdf](http://www.dmi.unict.it/~apulvirenti/agd/BFR98.pdf)
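
For what it's worth, the core of those PageRank lectures fits in a few lines.
Here's a rough power-iteration sketch with teleport on a made-up toy graph
(my own illustration, not the course's code):

```python
# Rough sketch of PageRank by power iteration with teleport
# (taxation); toy graph only, and it assumes no dead ends.
def pagerank(links, beta=0.85, iters=50):
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - beta) / n for v in nodes}   # teleport share
        for v, outs in links.items():
            share = beta * rank[v] / len(outs)       # split rank over out-links
            for w in outs:
                new[w] += share
        rank = new
    return rank

# hypothetical toy graph: A links to B and C, which both link back to A
toy = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
```

Since B and C both funnel their rank to A, A ends up with the highest rank.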

~~~
garyrob
Thanks for your review. One question. You say: "Difficulty: MMDS is a CS
graduate level course (CS246) from Stanford."

But the post itself says: "The book, like the course, is designed at the
undergraduate computer science level with no formal prerequisites. "

Any thoughts about the discrepancy?

~~~
krat0sprakhar
In my opinion, MOOCs tend to underplay prerequisites. The content in the
initial classes led to a bit of a furore among the students on exactly this
topic. Evidently, a lot of students found the notation and mathematics (e.g.
linear algebra, matrices, eigenvalues, calculus) very hard to understand.

In response to this, one of the faculty, Jeff Ullman, categorically stated in
the forums that this course is taught to graduate students at Stanford (CS 2xx
is grad level) and that an undergraduate background in mathematics is a
prerequisite. Although most of the mathematics covered in the course is
covered in a typical undergrad class, IMO the overall content is quite
advanced and worthy of a graduate class.

That being said, the forums (and the book) are quite helpful, and provided
you put in enough time, you will sail through.

Hope this answers your question!

