
CS246: Mining Massive Data Sets - Anon84
http://web.stanford.edu/class/cs246/
======
moultano
I wrote a guide to the Minhashing family of algorithms that goes into a little
more depth than is covered in the LSH section of this course.

[https://moultano.wordpress.com/2018/11/08/minhashing-3kbzhsx...](https://moultano.wordpress.com/2018/11/08/minhashing-3kbzhsxyg4467-6/)

You might find it useful if you've ever thought about using MinHash and
wondered how to incorporate weights rather than treating everything as a set.
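
For anyone who hasn't seen the basic idea, here is a minimal sketch of plain MinHash for sets, plus one simple way to fold in integer weights by expanding each item into that many copies. This is an illustrative assumption, not the method from the linked guide: the helper names (`minhash_signature`, `estimate_jaccard`, `expand_weights`) are hypothetical, and the expansion trick only handles small non-negative integer weights.

```python
import hashlib


def _hash(seed: int, item: str) -> int:
    # Deterministic 64-bit hash of (seed, item) using the standard library.
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items, num_hashes=128):
    # Keep the minimum hash value under each of num_hashes seeded hash functions.
    # Assumes items is non-empty.
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    # The fraction of positions where the minima agree estimates |A & B| / |A | B|.
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def expand_weights(weights):
    # Hypothetical weighting trick: an item with integer weight w becomes w
    # distinct pseudo-items, so heavier items are more likely to be the minimum.
    return {f"{item}#{i}" for item, w in weights.items() for i in range(w)}


if __name__ == "__main__":
    doc_a = expand_weights({"stanford": 3, "minhash": 2, "lsh": 1})
    doc_b = expand_weights({"stanford": 1, "minhash": 2, "hadoop": 2})
    sig_a = minhash_signature(doc_a)
    sig_b = minhash_signature(doc_b)
    print(estimate_jaccard(sig_a, sig_b))
```

With the expansion, the estimate approximates the weighted Jaccard similarity (sum of per-item minimum weights over sum of per-item maximum weights), at the cost of hashing every copy.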

~~~
moultano
I periodically submit this to HN, but it never seems to get traction.
[https://news.ycombinator.com/item?id=22268810](https://news.ycombinator.com/item?id=22268810)

------
senderista
I've done the class and can't recommend it highly enough (except for Ullman's
lectures). I recommend the advanced track if it's still offered.

~~~
fizwhiz
What's wrong with Ullman's lectures?

~~~
deepGem
Man, the first edition of this course on Coursera had Ullman's lectures. I am
not sure if it's the online delivery mechanism or his tone - I couldn't
sit through a single lecture.

One of the toughest courses out there.

------
DrNuke
New telescopes with crowdsourced astronomy will need good algorithms to
process a gazillion images at yet-to-be-seen detail.

------
tasubotadas
I've taken the MOOC version of this course and it was very poorly explained. I
hope the content and lecturing have changed since 2013 because the content
itself is really interesting.

I wasn't impressed with the quality of the book either. I did learn quite a
few methods there (minhash) that I got to use later, so thanks for that, but
compared to the MLPR, Learning from Data, or TESL books, its quality
pales.

~~~
greymalik
Can you clarify what the MLPR and TESL books refer to?

~~~
tasubotadas
Machine Learning and Pattern Recognition

The Elements of Statistical Learning

~~~
0x31a
The title of Bishop's book is Pattern Recognition and Machine Learning.

~~~
tasubotadas
Yep.

------
commandlinefan
I haven’t done the course, but +1 for the textbook. It’s freely downloadable
and very readable. I learned a lot from it and the exercises are just the
right level of difficulty to help you really internalize the material. I wish
there were self-check answers in the “back” of the book, though.

~~~
rahimnathwani
This book?

[http://www.mmds.org/](http://www.mmds.org/)

~~~
datlife
Yes

------
datlife
Love this course. Pretty rigorous, but the professor explains very well.

------
diehunde
Looks very interesting. Too bad they don't have video lectures.

~~~
mrlatinos
There's a link to lecture videos towards the top of the page -
[https://www.youtube.com/channel/UC_Oao2FYkLAUlUVkBfze4jg/vid...](https://www.youtube.com/channel/UC_Oao2FYkLAUlUVkBfze4jg/videos)

Here's a different YouTube channel I found with the full course:
[https://www.youtube.com/playlist?list=PLLssT5z_DsK9JDLcT8T62...](https://www.youtube.com/playlist?list=PLLssT5z_DsK9JDLcT8T62VtzwyW9LNepV)

------
master_yoda_1
What, no deep learning? And nobody is claiming to solve all of your ranking
problems in 5 lines of code ;) Do we really need this course if we have
fast.ai?

~~~
benrbray
I think the ML hype is actually slowing down. I've been interviewing for ML
positions as a new grad and a number of companies have told me they have an
excess of data scientists who can train the models, but a dearth of engineers
who can actually scale the models up to production. Friends at FAANGs have
similar stories.

~~~
NerdyDrone
Thanks for sharing! As a new grad also looking for work, maybe I'll apply to
more SE jobs, fewer data science-focused jobs lol.

TBH applying for jobs is scary asf.

~~~
pc86
Maybe it's just because I've been in industry for 10+ years now, but how are
you qualified as a new grad to apply for both?

~~~
tudelo
Double major in statistics + CS? Master's in statistics? Seems those would
potentially qualify you if you are a solid candidate.

------
iagovar
One day someone has to make a dataset of how much talent we waste in the EU.
The BSc gets paid for by EU taxes, and the guy ends up teaching at a US
university or working for a US company.

It has become the cycle of life.

~~~
wait_a_minute
If highly productive people can't keep the fruit of their labor, they'll leave
for a place where they can! We have fewer social programs in the US but the
majority of productive people will probably prefer working here since they can
take home a significantly higher salary even after taxes.

~~~
Anon84
And after you add the expenses needed to make up for the social programs you
don't have, you might even realize that what you're left with is not that
significantly higher.

~~~
wait_a_minute
I don't think that's going to apply to the majority of high-income earners,
because the difference is greater than the cost of not having free college or
socialized healthcare. A college education is a one-time cost. You might pay
$30,000 over time but when you're earning $130,000 or more, you can crush that
really quickly and not be burdened by the higher taxation and reduced
opportunity you'd be facing each year otherwise.

The same is true with healthcare in America. It's actually not that bad if you
have a decent insurance plan and some cash. For high-income earners, who are
the people we are talking about leaving the UK, it's most likely going to
be a net gain over time to have the higher salaries and lower taxes.

There's a reason the best engineers are going to want to leave the UK, Canada,
India, China, etc, and come to the USA. It's worth it. I personally could work
from anywhere including the UK, but why would I subject myself to such lower
pay for little or no real gain?

~~~
zozbot234
Healthcare is "not that bad" if you have some cash, because then you can
afford to go abroad, e.g. India for your care. Other than that, it _is_
terrible.

~~~
ashtonbaker
...do you live in the US? I have lots of problems with our health system, even
as someone with health insurance, but I very rarely find myself traveling to
India for healthcare.

------
fnord77
Wish Stanford would put the videos online. But I guess then a lot fewer people
would spend the $6000 for the class.

~~~
tmpz22
Didn't Berkeley recently take down all their educational videos because they
were sued for not providing them in a fully accessible format?

------
streetcat1
The problem with such courses is that what counts as "massive" tends to change
exponentially every year.

For example, if I can have a single machine with 32 cores and 1TB of memory,
what is massive in this context?

~~~
moultano
The datasets grow to meet the computing available to them. The things
gathering the data themselves become more powerful, and so more of that data
makes it downstream.

I'd define "massive" data as anything where n^2 is too big, where "too big" is
bigger than either my ram or my patience.
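
As a rough illustration of why n^2 bites so quickly, here is a back-of-the-envelope sketch of the memory needed for a dense matrix of all pairwise scores. The 4 bytes per entry (a float32 score) and the function name `pairwise_bytes` are assumptions for the example.

```python
def pairwise_bytes(n: int, bytes_per_entry: int = 4) -> int:
    # Memory for a dense n x n matrix of pairwise scores (no symmetry savings).
    return n * n * bytes_per_entry


if __name__ == "__main__":
    for n in (10_000, 100_000, 1_000_000):
        gib = pairwise_bytes(n) / 2**30
        print(f"n={n:>9,}: {gib:,.1f} GiB")
```

Under those assumptions, 100,000 items already need about 37 GiB, and a million items need several terabytes, which is where techniques like LSH come in to avoid comparing every pair.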

~~~
JimmyAustin
I've heard everything from "it doesn't fit in Excel" to "it doesn't fit in ram
for a standard dev's laptop (~10gb)" to "it doesn't fit in ram in a decent
sized EC2 instance (~250gb)".

~~~
moultano
I started worrying at one point that all the techniques I learned when I
started my career for working with big data were becoming obsolete, but they
aren't. What you needed to do before to make things possible is now needed to
make it fast.

~~~
Qu3tzal
Isn't it the same as before? If 4gb of data was too big because you had 2gb of
RAM, then the methods used at that time are the same you would apply for a
500gb dataset that can't fit in a 250gb RAM machine, right?

New issues appear when you have to analyze 2TB with a 32gb RAM machine, but
when the order of difference is the same, the issues and thus the answers are
the same as before?

~~~
streetcat1
No. Because the number of use cases where you have 1TB or 2TB of data is
smaller in comparison.

Also, the rest of the use cases (which now fit into a single machine's memory)
can be handled much more efficiently with memory-based algorithms instead of
I/O-based algorithms.

The goal of Hadoop, as well as most of the theory on disk-based indices (e.g.
B-trees), was to overcome I/O bottlenecks. But as memory is getting bigger
and cheaper there is a trend to drop Hadoop in favor of reading data directly
from the cloud and into memory.

