
Yahoo Releases the Largest-Ever Machine Learning Dataset for Researchers - denzil_correa
http://yahoolabs.tumblr.com/post/137281912191/yahoo-releases-the-largest-ever-machine-learning
======
rodionos
Note that they're sharing this dataset only with *.edu, which is unfortunate
for the rest of us. I wish they would allow access to a fraction of the
dataset, e.g. 5% of records, for the rest of the community.

~~~
boomzilla
Would an .edu email suffice, or does one need a formal document
from university officials? I tried to click through from the sandbox link, but
a Yahoo account is required.

~~~
Namrog84
Most places only care that you were a student with an active .edu address. I
graduated years ago and I still use mine without problems. Most things say you
need to be a student from a university, which I am/was. Very few places say
actively enrolled student. It's a legal gray area in my opinion.

~~~
dexterdog
Many schools will give alumni .edu addresses as well.

------
frik
Sad the future of "Yahoo!" (the tech company, not the Alibaba stock) is
uncertain. They were always very open with their research. Thinking back to
2008/09 they had the biggest Hadoop clusters, etc. Even the first edition of
O'Reilly's Hadoop book says "Yahoo press".

~~~
alceufc
Flickr was -- and maybe still is -- very useful for the computer vision
research community.

~~~
TeMPOraL
How do they use it? Does it have good programmatic access? Because UX-side,
I'm surprised they still exist. I don't know of a single photo-related web
site that has worse UI and is more annoying to use than Flickr.

~~~
stass
This is extremely biased. A lot of people, myself included, find it easy and
pleasant to use. Honestly, I'm not aware of a single alternative with a better
UI/UX -- 500px maybe?

As for programmatic access -- Flickr has a good API.

~~~
danieldk
_I'm not aware of a single alternative with a better UI/UX_

I don't use it anymore, but SmugMug.

~~~
stass
SmugMug is great! It's a little bit different from Flickr though -- SmugMug is
more of a personal photo hosting/portfolio website, while Flickr is more of a
photographers' community. I believe the social aspects of Flickr are much more
important than its actual photo storage capabilities!

------
BetaCygni
I'm torn. I love open data, but I fully expect that someone will (partially)
deanonymize this.

~~~
rectang
I share your concern.

Once data like this is deanonymized, it's out there forever -- there's no
going back to fix it like you would a software bug. So you need perfect
understanding and provable security at release time to guarantee safety into
the indefinite future. That's not an easy constraint to satisfy.

~~~
joatmon-snoo
The thing that terrifies me specifically is that there's been work done - I
believe it was a branch of the US military studying network traffic patterns -
showing that you can reconstruct profiles based on behavior patterns and link
them back to the original user with high success rates.
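
A toy sketch of the kind of linkage being described, not the method from the work mentioned above. It assumes an attacker holds a little outside knowledge about what a few known people read, and matches that against pseudonymous histories by Jaccard similarity. The pseudonyms, item IDs, and names are all invented for illustration and say nothing about the actual format of the Yahoo dataset.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link_profiles(released, auxiliary):
    """For each known person, return the best-matching pseudonym and its score."""
    links = {}
    for person, seen in auxiliary.items():
        best = max(released, key=lambda pid: jaccard(released[pid], seen))
        links[person] = (best, jaccard(released[best], seen))
    return links


# Hypothetical "anonymized" release: pseudonym -> set of article IDs read.
released = {
    "user_a91": {"n1", "n2", "n3", "n7"},
    "user_c07": {"n4", "n5", "n6"},
    "user_f33": {"n2", "n8", "n9"},
}

# Hypothetical side knowledge: a few articles each real person is known to have read.
auxiliary = {
    "alice": {"n1", "n3", "n7"},
    "bob": {"n5", "n6"},
}

links = link_profiles(released, auxiliary)
for person, (pseudonym, score) in links.items():
    print(f"{person} -> {pseudonym} (similarity {score:.2f})")
```

Even with three or fewer known articles per person, each toy history is distinctive enough to single out one pseudonym. Real attacks and real defenses are far more involved than this.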

------
mooreds
It is actually 1.5TB compressed. Direct link to the dataset:

[http://webscope.sandbox.yahoo.com/catalog.php?datatype=r&did...](http://webscope.sandbox.yahoo.com/catalog.php?datatype=r&did=75)

"The dataset may be used by researchers to validate recommender systems,
collaborative filtering methods, context-aware learning, large-scale learning
algorithms, transfer learning, user behavior modeling, content enrichment and
unsupervised learning methods."

Edit: added quote

------
glxc
13.5 TB - that is pretty huge!

Great to get some truly "Big Data" sets out there. I consider "Big Data" to be
data that can't be conventionally processed on a commodity machine, else it's
just analytics.

Yahoo must be applauded for supplying various data sets and helping progress
machine learning research

~~~
collyw
I saw a course advertised in my email yesterday. Big data with MySQL. The
description talked about queries and aggregate functions. That isn't big data
- that's just "using a database", as it was known before the term "big data"
appeared in the mainstream.

~~~
sbarre
Ugh.. I had a boss years back who insisted on using "big data" to refer to our
analytics and reporting work (which was nowhere near big data in terms of data
size - we had _maybe_ a million rows in our database across all our tables),
and I fruitlessly tried for months to explain to him that anyone who really
knows what "big data" means would immediately see through his bullshit..

~~~
collyw
I really wish we could get rid of the hipster / buzzword / fashionista aspect
of our industry. Way too much churn as a result. I would far rather spend time
honing SQL skills to perfection rather than having to learn another NoSQL
database. Unfortunately job descriptions prefer the latter.

------
wahsd
Can someone please explain to me why this dataset needs to be one big file?
They couldn't have broken it down? I need to download the full 1.5TB? Also,
they couldn't have simply made the data available on one of the "big-data"
services? Seems so redundant and inefficient.

------
boltzmannbrain
It's unfortunate Yahoo assumes only those with .edu email addresses make up
"the research community".

~~~
rovr138
No they don't.

[http://webscope.sandbox.yahoo.com](http://webscope.sandbox.yahoo.com)

~~~
boltzmannbrain
You're kidding... "TO BE ELIGIBLE TO RECEIVE WEBSCOPE DATA, UNLESS SPECIFIED
IN A PARTICULAR DATASET, YOU MUST:

- Be a faculty member, research employee or student from an accredited
  university
- Send the data request from an accredited university .edu or domain name (for
  international universities) email address"

------
wdr1
Is it possible to get the readme w/o downloading the entire thing?

They state "The readme file for this dataset is located in part 1 of the
download. Please refer to the readme file for a detailed overview of the
dataset.", but I only see an option to get the full 1.5T.

------
jedberg
It's too bad they aren't publishing this as an EBS snapshot. That would
probably be the most useful way their intended audience could consume it given
that most universities get a ton of free Amazon credits for exactly this type
of research.

~~~
lqdc13
My university had no Amazon credits (2 years ago). I did have access to
several supercomputers though, which would work out much better for this type
of data.

Yahoo is also somewhat closer to Microsoft than to Amazon.

------
satyajeet23
Released Publicly?

You need an .edu email address and a Yahoo account with verified SMS to
download this!

Very unfortunate.

~~~
zo1
It is unfortunate. But who knows what sort of restrictions have to be imposed
by the various sources of the data and other contractual obligations?
I'd imagine most of us would feel quite differently if we knew that we were
sources for certain parts of the dataset.

------
inglor
People who downloaded this - does this contain any form of tagging of the
data? For example, do news articles contain visit counts? Article sentiment?
Any form of structured information?

Otherwise, what benefit does this have over scraping news sites?

~~~
GrantS
The interesting data here aren't the news articles themselves, but the
news-browsing history of 20 million people over a 4 month period.

To answer your first question, though, according to the official description
of the dataset [1], "On the item side, we are releasing the title, summary,
and key-phrases of the pertinent news article."

[1]
[http://webscope.sandbox.yahoo.com/catalog.php?datatype=r&did...](http://webscope.sandbox.yahoo.com/catalog.php?datatype=r&did=75)

------
fsaintjacques
Do I really need a Yahoo account with verified SMS to download this?

------
jonesb6
1) Begin registration at a community college.

2) Get .edu email address

3) Profit

------
IshKebab
Not the most interesting dataset though.

------
astazangasta
I am so sick of the implication that all data is equivalent, and there is some
generic notion of "big data" that we generic "data scientists" can learn how
to "mine" using some generic technique called "deep learning" that will give
us all the answers we need like some kind of oracle.

I study biology. The shape of the data, the way it is structured, the problems
we face in analyzing it, are quite different than the ones faced in user-news
interaction data. Techniques that are useful for reshaping and summarizing one
dataset are not necessarily applicable to another.

~~~
blazespin
Word2Vec!

