Yahoo Releases the Largest-Ever Machine Learning Dataset for Researchers (yahoolabs.tumblr.com)
397 points by denzil_correa on Jan 14, 2016

Note that they're sharing this dataset only with *.edu, which is unfortunate for the rest of us. I wish they would allow access to a fraction of the dataset, e.g. 5% of records, for the rest of the community.

To clarify from source[1]

TO BE ELIGIBLE TO RECEIVE WEBSCOPE DATA, UNLESS SPECIFIED IN A PARTICULAR DATASET, YOU MUST:
- Be a faculty member, research employee or student from an accredited university
- Send the data request from an accredited university .edu or domain name (for international universities) email address

UNLESS SPECIFIED IN A PARTICULAR DATASET, WE ARE NOT ABLE TO SHARE DATA WITH:
- Commercial entities
- Employees of commercial entities with university appointment
- Research institutions not affiliated with a research university


[1] http://webscope.sandbox.yahoo.com/

Since nowadays universities are run as start-up accelerators with closed source software, the .edu restriction is antiquated and obnoxious

That really sucks. I hate the whole attitude that you need to be a Ph.D. to do research.

I mean that's what Ph.D's do in most fields, the whole point of getting a Ph.D is so that you can do research in that field. Learning how to write well researched white papers that can withstand peer review.

Prior to that, you can do some research, perhaps with a Master's or as a graduate student, but usually you're assisting on a Ph.D's project.

The process of getting a Ph.D in many fields is basically the process of learning how to research and report your research.

So the idea of limiting high level research tools to those who've proven to be capable of high level research isn't that ridiculous to me.

If you want to research for a living and report it, is it so ridiculous to ask that you train traditionally in researching and reporting?

I will only speak for myself and people like me.

Many of us have deliberately chosen not to go the Masters/PhD/Academic route because of how broken/backward we believe the academic system to be, and that we can have better access to both research, learnings, materials, methods, and just a general quality of life, both materially and intellectually, outside of it.

That doesn't mean we're not interested in, or aren't doing, research. Indeed, we believe we're often doing it better.

These edu/PhD requirements are incredibly frustrating, and there's no technical or intellectual reason they should exist: Only social, political and economic reasons.

I know you did not mean any offence by posting your comment, and I do not mean any by mine, but to answer your question:

"If you want to research for a living and report it, is it so ridiculous to ask that you train traditionally in researching and reporting?"

Yes, yes it is. Research should not be for a closed shop of inward looking, often archaic and relatively backward academic institutions.

>" and there's no technical or intellectual reason they should exist"

I disagree with this completely, and I described the objective and intellectual reasons behind it in my other post!

Very, very few people are capable of taking a post-graduate course-load and producing journal-quality research without any formal training.

It's become a trope to dislike academia and to make the assumption that it is incapable of producing a result that is worthwhile (all while most of us in programming make intense, constant fundamental use of functions, ideas, algorithms, systems, etc that are the results of PhD in Computer Science or academia in general)

The objective reason is that academia produces better researchers who are far better prepared to publish their results. The intellectual reason is that a system of learning statistics, experimental design, and a subject matter under the tutelage of a proven expert which requires the person to literally succeed in their job of researching and publishing before even being given the title PhD produces a very high rate of high level success.

>Yes, yes it is. Research should not be for a closed shop of inward looking, often archaic and relatively backward academic institutions.

IT ISN'T! Just because you're not a part of centralized, systemized education and can't access this tool doesn't mean you can't research.

You also don't have access to a NMR, or an fMRI, or any number of hundreds of others of tools that cost $100k - 1m+, but you can still as a layperson educate yourself and attempt to contribute.

But I think to assert that randos-at-home are the literal equivalent of the academic process and PhD's is beyond absurd, and that every man should have identical access to what earning a PhD earns you is also absurd to me.

But when it comes to the best tools that cost the most and the best opportunities, those who are a part of the system will get those options. If you want to study medicine you can do that anywhere, but if you're not a part of a med school, getting access to medical cadavers will be difficult.

There is so much disrespect for institutions and what we as a people are capable of when we systemize. I understand the appreciation for decentralization and the FOSS model behind technology but there is so often a complete unwillingness to show any respect to any centralization at all, it's very strange.

Your PhD is obviously not in history. Academia itself would not exist if what you said was fundamentally true. The assertion to placate the layman with the response "still enough to educate yourself and attempt to contribute" is simply laughable. If you fear getting shitty research from people -- that's fine -- but I believe that the current journal system addresses that fear, or at least attempts to.

There was not really an assertion that "randos-at-home" are the equivalent of academics - only an assertion that they could and do provide great value again and again and again. Every goddamn century people come out of the fucking woodwork and shock the world; Ramanujan anyone?

With all that being said - I believe in the right of the data owners/distributors in choosing whatever 'licensing' they want.

If I'm not mistaken Ramanujan received a PhD at the age of 29. Sure, he came from a non-traditional background, and did some amazing math before his PhD. However, it was only after he studied at University that his writing was clear enough to be published. That of course makes the parent post's point fairly well.

A better example might be Subrahmanyan Chandrasekhar, who won the 1983 Nobel Prize for work he did at 19 (prior to the commencement of his PhD program). Ironically, he was apparently quite upset about this.

No. It doesn't. It was my error. I just need to switch out Ramanujan so you don't trip up on the small stuff. The point stands.

What if Ramanujan were a CS scholar? You would refuse Ramanujan access to this data until after he got into a school program, which he would not be able to do without access to this data.

I will not try to debate the relative merits of academia, because good or not, the question is whether we should allow data to be available to people outside of those institutions or place restrictions on the data we are releasing that limit it specifically to academic institutions even when we've already deemed it suitable for release.

We are also not talking about whether the average joe is equal to the average PhD, but whether it is right to actively exclude those who are willing and able to contribute to research because they are not part of the academic institutions.

Let us be clear in this case that we are talking of several TB's of data, well within the remit of a setup an intelligent employed professional on a decent salary is able to set up in his own home with a bit of money, time and work. Odds are if someone is able to obtain this data and store it properly, they have already passed a basic competency barrier that suggests they are able to contribute.

Now, it is a common academic myth that universities have access to the best resources and abilities, and that those outside do not. But I, for example, have now been employed with several bodies...arguably/theoretically with more/as much money, data and resources as many universities.

I'm not going to comment on fMRI or NMR, that's not my field, but I'm not going to discount intellectually curious people's ability to gain access to tools/machines outside the academy, and I'm not even touching upon grey/black methods of doing so, which in computing often means that those outside academia can bring more power/insight to problems than those inside.

You aren't providing any reasons why anyone not blessed with the appropriate three letters should be barred from accessing this data.

The dislike of academia - specifically of higher education post B.S - partly stems from the archaic admission process that strictly limits the opportunity for the majority of people to participate in it. You have to be doing the correct things no later than, let's say, your junior year in college to have a chance at a PhD at a top tier research university (less than top tier PhD and you hit the massive road block for tenure after your PhD, so it's almost the same thing).

There are people who did the correct things early in their life. The majority of us make mistakes. How would even a mid-late 20s (young, by any account) good software engineer get through the gatekeepers and get his PhD? There was even a thread on HN a few days ago asking this exact question.

To put it another way, it's not that we don't want the training. We don't have the option to do so (at least in the traditional way).

As someone without a degree who gets contracted to work with/help build (hardware/software) tools like EEG devices and fMRIs, to make them usable for people with PhDs (most of whom don't really understand the physics behind such systems), and having seen first hand how bastardized the academic process™ has become, it's laughable to think that just because things may have been one way in the past, they will continue to be that way in the future. So much waste, which means so much opportunity…

>You also don't have access to a NMR, or an fMRI, or any number of hundreds of others of tools that cost $100k - 1m+

I call bullshit, because if you think outside of the box and have a friend who has access (and many people on HN probably do know someone who does), you can always jump in on their pilots in exchange for whatever works for you both. But outside the box thinking is definitely not a part of the curriculum for most researchers… unless such was prescribed in the literature lol

>There is so much disrespect for institutions

Damn right, but doesn't stop them from trying to co-opt a generation after having loaded them up on debt to feed the gravy train, and I don't see why individuals can't do the same to institutions.

>…and what we as a people are capable of when we systemize.…

People are not limited to existing institutions, we can seek beyond.

>I understand the appreciation for decentralization and the FOSS model behind technology but there is so often a complete unwillingness to show any respect to any centralization at all, it's very strange.

Here's a hardware/software FOSS project[0] for electrophysiology research, built by a bunch of different people from many orgs and some from none, and this stuff is just the beginning once all the contributors (and others like them) realize they can pursue their other interests in life without constraining themselves to the academic process™.

If anyone in the Boston/Prov area is interested in potentially doing some software/hardware work, or wants to connect more to stuff like this, has a software + mathematics + physics background or side interests, and is not involved now, you can contact me at my username at NSA's favorite email provider (gmail) :P


But if you understand how the system works, it seems not surprising that you will face tougher challenges. I am not sure how far this differs from rejecting the notion that a driving license is necessary to operate a car. You can certainly learn how to operate one yourself and be really good at it. But if you are a car rental company and want to reduce your liabilities, it seems reasonable (but probably not fair) to let only people with a driving license rent one.

I second that.

>I mean that's what Ph.D's do in most fields, the whole point of getting a Ph.D is so that you can do research in that field.

Um. Are you arguing that one can't do research without a PhD? I'm fairly sure that isn't true[1].

The famous paleoanthropologist Richard Leakey doesn't even have a college degree.

I believe that Fabrice Bellard doesn't have a PhD either.

[1] http://lemire.me/blog/2009/02/05/skip-the-phd-go-straight-to...

Yes, because only people who want to do research for a living should be allowed to do any research at all.

Yes but then we only find what we've been taught to see for the sake of institutional structure and wonder why life isn't more inspiring.

To Yahoo: that seems like an unbelievably lame restriction. Even aside from commercial entities, there are many, many people working to improve machine learning with no university affiliation.

You're basically taking something really cool here and shooting yourself in the knee with a shotgun.

I hope other companies don't think open data initiatives count if they're not actually open. If you want to keep your data internal and top secret, totally fine, but open data should be available to anyone or it doesn't count.

Almost, Yahoo, almost...


>> I didn't see the word "open" mentioned once in this article...

Touché sir. Sentiment still stands: "released to researchers" and "released to the public" should not be different things.

I didn't see the word "open" mentioned once in this article...

I imagine you'll be able to torrent it fairly soon.

Would an .edu email suffice, or does one need a formal document from university officials? I tried to click through from the sandbox link, but a Yahoo account is required.

Most places only care that you were a student with active edu. I graduated years ago and I still use the edu without problem. Most things say you need to be a student from university. Which I am/was. Very few places say actively enrolled student. It's a legal gray area in my opinion.

Many schools will give alumni edu addresses as well

Do you mean to technically get access, or to be legally allowed to use it? Because for the latter, merely having a .edu isn't enough.

It's probably more complicated than that, because I assume there are places in the world where you can legally ignore someone trying to enforce such a claim, since they are outside that jurisdiction.

I have a lifelong edu address but it's not what they mean of course. My reading is that you have to be an enrolled student or researcher.

Sad that the future of "Yahoo!" (the tech company, not the Alibaba stock) is uncertain. They were always very open with their research. Thinking back to 2008/09, they had the biggest Hadoop clusters, etc. Even the first edition of O'Reilly's Hadoop book says "Yahoo press".

That's really irrelevant here. I think we should focus on what an incredible contribution this is. Perhaps a sign of good things to come from Yahoo.

I hope they revive Yahoo tubes, if not as an actual tool, maybe as a teaching device. I was sad to see it go even though it was seriously falling apart at the end.

If you're referring to Yahoo Pipes, I wholeheartedly agree!

The owners of Yahoo don't care if there are more good things to come or even if Yahoo is gutted and killed as long as they don't have to pay taxes on their windfall Alibaba gains.

Flickr was -- and maybe still is -- very useful for the computer vision research community.

This is old news but we put together a 100M item dataset from Flickr as well, all Creative Commons licensed, including lots of metadata from users as well as some pre-computed features. It's called the YFCC100M.


Is a Creative Commons license for research only or can it be used for commercial purposes if you attribute the original author?

Flickr is great for finding Creative Commons-licensed images as well.

How do they use it? Does it have a good programmatic access? Because UX-side, I'm surprised they still exist. I don't know of a single photo-related web site that has worse UI and is more annoying to use than Flickr.

This is extremely biased. A lot of people, myself included, find it easy and pleasant to use. Honestly, I'm not aware of a single alternative with a better UI/UX -- 500px maybe?

As for programmatic access -- Flickr has a good API interface.

I'm not aware of a single alternative with a better UI/UX

I don't use it anymore, but SmugMug.

SmugMug is great! It's a little bit different from Flickr though -- SmugMug is more of a personal photo hosting/portfolio website, while Flickr is more about the photographer community. I believe the social aspects of Flickr are much more important than its actual photo storage capabilities!

It's funny, since Flickr was at the beginning the best of them. Google's PicasaWeb was comparatively bad.

Yahoo really let themselves get leapfrogged on that one.

My trouble with the UX is that it is slow, slow, slow. I live in Frontier country and I have DSL, but there are many photo web sites that are faster than Flickr.

I hardly upload anything to Flickr anymore because the interface for that is so slow.

Amazing...did not know that. Thanks.

As a brand, I'm honestly impressed that they've lasted through so many sea changes.

I'm torn. I love open data, but I fully expect that someone will (partially) deanonymize this.

I share your concern.

Once data like this is deanonymized, it's out there forever -- there's no going back to fix it like you would a software bug. So you need perfect understanding and provable security at release time to guarantee safety into the indefinite future. That's not an easy constraint to satisfy.

The thing that terrifies me specifically is that there's been work done - I believe it was a branch of the US military studying network traffic patterns - showing that you can reconstruct profiles based on behavior patterns and link them back to the original user with high success rates.

Probably. Hopefully they have learned from the fallout of the AOL search log case ( https://en.wikipedia.org/wiki/AOL_search_data_leak ). That case was certainly a big mess.

I'm not sure why you're downvoted, the AOL search log case was a huge mistake on the part of AOL and I'm quite surprised that Yahoo! would take a risk like this. The real risk is never in just this data but in combining it with other public datasets. I haven't looked at what is in this particular dump in detail but if there is data that had to be anonymized (as they claim) then you can bet that there will be people already busy trying to reverse that.
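To make the "combining it with other public datasets" risk concrete, here is a toy sketch of a linkage attack in Python. Everything in it is an assumption for illustration: the field names (`token`, `city`, `age_band`, `device`), the records, and the `link_records` helper are invented and do not reflect the actual schema of the Yahoo dump. The idea is only that if an "anonymized" record keeps a few quasi-identifiers, and some other public source has the same attributes next to a name, a plain join can re-identify anyone whose combination is unique.

```python
from collections import defaultdict

# Toy "anonymized" release: real IDs replaced by opaque tokens, but a few
# quasi-identifiers are kept. All records are invented for illustration.
anonymized_release = [
    {"token": "u_7f3a", "city": "Springfield", "age_band": "30-39", "device": "tablet", "clicked": "article_123"},
    {"token": "u_91bc", "city": "Springfield", "age_band": "20-29", "device": "phone", "clicked": "article_456"},
    {"token": "u_c02d", "city": "Shelbyville", "age_band": "30-39", "device": "phone", "clicked": "article_789"},
]

# Hypothetical second dataset that is public and carries names.
public_profiles = [
    {"name": "Alice", "city": "Springfield", "age_band": "30-39", "device": "tablet"},
    {"name": "Bob", "city": "Springfield", "age_band": "20-29", "device": "phone"},
    {"name": "Carol", "city": "Springfield", "age_band": "20-29", "device": "phone"},
]

QUASI_IDENTIFIERS = ("city", "age_band", "device")


def link_records(release, profiles, keys=QUASI_IDENTIFIERS):
    """Return {token: name} for tokens that match exactly one public profile."""
    # Index the public profiles by their quasi-identifier combination.
    index = defaultdict(list)
    for profile in profiles:
        index[tuple(profile[k] for k in keys)].append(profile["name"])

    linked = {}
    for record in release:
        candidates = index.get(tuple(record[k] for k in keys), [])
        # Only a unique match is a confident re-identification.
        if len(candidates) == 1:
            linked[record["token"]] = candidates[0]
    return linked


print(link_records(anonymized_release, public_profiles))
# prints {'u_7f3a': 'Alice'}
```

In this toy data only one token is re-identified, because Bob and Carol share the same attribute combination and the Shelbyville record has no public counterpart. The more attributes a release keeps per record, the fewer people share any given combination, which is why coarsening or dropping fields matters more than hiding the ID column.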

Yep, and this is why stuff like this should have a formal opt-in process.

It is actually 1.5TB compressed. Direct link to the dataset:


"The dataset may be used by researchers to validate recommender systems, collaborative filtering methods, context-aware learning, large-scale learning algorithms, transfer learning, user behavior modeling, content enrichment and unsupervised learning methods."

Edit: added quote

13.5 TB - that is pretty huge!

Great to get some truly "Big Data" sets out there. I consider "Big Data" to be data that can't be conventionally processed on a commodity machine; anything else is just analytics.

Yahoo must be applauded for supplying various data sets and helping progress machine learning research

I saw a course advertised in my email yesterday. Big data with MySQL. The description talked about queries and aggregate functions. That isn't big data - that's just "using a database" before the term "big data" appeared in the mainstream.


Ok, they are comparing MySQL to Excel, so in relative terms the data could be bigger...


Ugh.. I had a boss years back who insisted on using "big data" to refer to our analytics and reporting work (which was nowhere near big data in terms of data size - we had maybe a million rows in our database across all our tables), and I fruitlessly tried for months to explain to him that anyone who really knows what "big data" means would immediately see through his bullshit..

I really wish we could get rid of the hipster / buzzword / fashionista aspect of our industry. Way too much churn as a result. I would far rather spend time honing SQL skills to perfection rather than having to learn another NoSQL database. Unfortunately job descriptions prefer the latter.

Can someone please explain to me why this dataset needs to be one big file? They couldn't have broken it down? I need to download the full 1.5TB? Also, they couldn't have simply made the data available on one of the "big-data" services? Seems redundant and inefficient.

It's unfortunate Yahoo assumes only those with .edu email addresses make up "the research community".

You're kidding... "TO BE ELIGIBLE TO RECEIVE WEBSCOPE DATA, UNLESS SPECIFIED IN A PARTICULAR DATASET, YOU MUST: - Be a faculty member, research employee or student from an accredited university - Send the data request from an accredited university .edu or domain name (for international universities) email address"

Is it possible to get the readme w/o downloading the entire thing?

They state "The readme file for this dataset is located in part 1 of the download. Please refer to the readme file for a detailed overview of the dataset.", but I only see an option to get the full 1.5T.

It's too bad they aren't publishing this as an EBS snapshot. That would probably be the most useful way their intended audience could consume it given that most universities get a ton of free Amazon credits for exactly this type of research.

My university had no Amazon credits (2 years ago). I did have access to several supercomputers though, which would work out much better for this type of data.

Yahoo is also somewhat closer to Microsoft than to Amazon.

Released Publicly?

You need an .edu mail address, a yahoo account with verified sms to download this!

Very unfortunate.

It is unfortunate. But who knows what sort of restrictions have to be imposed by the various sources of the data and other various contractual obligations? I'd imagine most of us would feel quite differently if we knew that we were sources for certain parts of the dataset.

People who downloaded this - does this contain any form of tagging of the data? For example, do news articles contain visit counts? Article sentiment? Any form of structured information?

Otherwise, what benefit does this have over scraping news sites?

The interesting data here aren't the news articles themselves, but the news-browsing history of 20 million people over a 4-month period.

To answer your first question, though, according to the official description of the dataset [1], "On the item side, we are releasing the title, summary, and key-phrases of the pertinent news article."

[1] http://webscope.sandbox.yahoo.com/catalog.php?datatype=r&did...

Did you read the article? It seems like they are providing data on the stories people clicked, and at what time, so you can draw temporal and recommendation hypotheses. Some device and location specifics are provided. Scraping can only tell the scraper's story. This data tells millions of people's stories.

I really need a yahoo account with verified sms to download this?

1) Begin registration to a community college.

2) Get .edu email address

3) Profit

Not the most interesting dataset though.

I am so sick of the implication that all data is equivalent, and there is some generic notion of "big data" that we generic "data scientists" can learn how to "mine" using some generic technique called "deep learning" that will give us all the answers we need like some kind of oracle.

I study biology. The shape of the data, the way it is structured, the problems we face in analyzing it, are quite different than the ones faced in user-news interaction data. Techniques that are useful for reshaping and summarizing one dataset are not necessarily applicable to another.


