TO BE ELIGIBLE TO RECEIVE WEBSCOPE DATA, UNLESS SPECIFIED IN A PARTICULAR DATASET, YOU MUST:
Be a faculty member, research employee or student from an accredited university
Send the data request from an accredited university .edu or domain name (for international universities) email address
UNLESS SPECIFIED IN A PARTICULAR DATASET, WE ARE NOT ABLE TO SHARE DATA WITH:
Employees of commercial entities with university appointment
Research institutions not affiliated with a research university
Prior to that, you can do some research, perhaps with a Master's or as a graduate student, but usually you're assisting on a Ph.D.'s project.
The process of getting a Ph.D in many fields is basically the process of learning how to research and report your research.
So the idea of limiting high level research tools to those who've proven to be capable of high level research isn't that ridiculous to me.
If you want to research for a living and report it, is it so ridiculous to ask that you train traditionally in researching and reporting?
Many of us have deliberately chosen not to go the Masters/PhD/Academic route because of how broken/backward we believe the academic system to be, and because we believe we can have better access to research, learnings, materials, and methods, and just a better general quality of life, both materially and intellectually, outside of it.
That doesn't mean we're not interested in, or aren't doing, research. Indeed, we believe we're often doing it better.
These edu/PhD requirements are incredibly frustrating, and there's no technical or intellectual reason they should exist: Only social, political and economic reasons.
I know you did not mean any offence by posting your comment, and I do not mean any by mine, but to answer your question:
"If you want to research for a living and report it, is it so ridiculous to ask that you train traditionally in researching and reporting?"
Yes, yes it is. Research should not be for a closed shop of inward looking, often archaic and relatively backward academic institutions.
I disagree with this completely, and I described the objective and intellectual reasons behind it in my other post!
Very, very few people are capable of taking a post-graduate course-load and producing journal-quality research without any formal training.
It's become a trope to dislike academia and to assume that it is incapable of producing a worthwhile result (all while most of us in programming make intense, constant, fundamental use of functions, ideas, algorithms, systems, etc. that are the results of PhDs in Computer Science or academia in general).
The objective reason is that academia produces better researchers who are far better prepared to publish their results. The intellectual reason is that a system of learning statistics, experimental design, and a subject matter under the tutelage of a proven expert which requires the person to literally succeed in their job of researching and publishing before even being given the title PhD produces a very high rate of high level success.
>Yes, yes it is. Research should not be for a closed shop of inward looking, often archaic and relatively backward academic institutions.
IT ISN'T! Just because you're not a part of centralized, systemized education and can't access this tool doesn't mean you can't research.
You also don't have access to an NMR, or an fMRI, or any number of hundreds of other tools that cost $100k - 1m+, but you can still as a layperson educate yourself and attempt to contribute.
But I think asserting that randos-at-home are the literal equivalent of the academic process and PhDs is beyond absurd, and the idea that every man should have identical access to what earning a PhD earns you is also absurd to me.
But when it comes to the best tools that cost the most and the best opportunities, those who are a part of the system will get those options. If you want to study medicine you can do that anywhere, but if you're not a part of a med school, getting access to medical cadavers will be difficult.
There is so much disrespect for institutions and what we as a people are capable of when we systemize. I understand the appreciation for decentralization and the FOSS model behind technology but there is so often a complete unwillingness to show any respect to any centralization at all, it's very strange.
There was not really an assertion that "randos-at-home" are the equivalent of academics - only an assertion that they could and do provide great value again and again and again. Every goddamn century people come out of the fucking woodwork and shock the world; Ramanujan anyone?
With all that being said - I believe in the right of the data owners/distributors in choosing whatever 'licensing' they want.
We are also not talking about whether the average joe is equal to the average PhD, but whether it is right to actively exclude those who are willing and able to contribute to research because they are not part of the academic institutions.
Let us be clear in this case that we are talking of several TBs of data, well within the remit of a setup an intelligent employed professional on a decent salary is able to set up in his own home with a bit of money, time and work. Odds are if someone is able to obtain this data and store it properly, they have already passed a basic competency barrier that suggests they are able to contribute.
Now, it is a common academic myth that universities have access to the best resources and abilities, and that those outside do not. But I, for example, have now been employed with several bodies...arguably/theoretically with more/as much money, data and resources as many universities.
I'm not going to comment on fMRI or NMR, that's not my field, but I'm not going to discount intellectually curious people's ability to gain access to tools/machines outside the academy, and I'm not even touching upon grey/black methods of doing so, which in computing can mean that those outside academia can often bring more power/insight to problems than those inside.
There are people who did the correct things early in their life. The majority of us make mistakes. How would even a mid-late 20s (young, by any account) good software engineer get through the gatekeepers and get his PhD? There was even a thread on HN a few days ago asking about that exact question.
To put it another way, it's not that we don't want the training. We don't have the option to do so (at least in the traditional way).
>You also don't have access to an NMR, or an fMRI, or any number of hundreds of other tools that cost $100k - 1m+
I call bullshit, because if you think outside of the box and have a friend who has access (and many people on HN probably do know someone who does), you can always jump in on their pilots in exchange for whatever works for you both. But outside the box thinking is definitely not a part of the curriculum for most researchers… unless such was prescribed in the literature lol
>There is so much disrespect for institutions
Damn right, but doesn't stop them from trying to co-opt a generation after having loaded them up on debt to feed the gravy train, and I don't see why individuals can't do the same to institutions.
>…and what we as a people are capable of when we systemize.…
People are not limited to existing institutions, we can seek beyond.
>I understand the appreciation for decentralization and the FOSS model behind technology but there is so often a complete unwillingness to show any respect to any centralization at all, it's very strange.
Here's a hardware/software FOSS project for electrophysiology research by a bunch of different people from many orgs, and some without, and this stuff is just the beginning once all the contributors (and others like them) realize they can pursue their other interests in life without constraining themselves to the academic process™.
If anyone in the Boston/Prov area is interested in potentially doing some software/hardware work, or wants to connect more to stuff like this, and has a software + mathematics + physics background or side interests and is not involved now, you can contact me at my username at NSA's favorite email provider (gmail) :P
Um. Are you arguing that one can't do research without a PhD? I'm fairly sure that isn't true.
The famous Paleoanthropologist Richard Leakey doesn't even have a college degree.
I believe that Fabrice Bellard doesn't have a PhD either.
You're basically taking something really cool here and shooting yourself in the knee with a shotgun.
I hope other companies don't think open data initiatives count if they're not actually open. If you want to keep your data internal and top secret, totally fine, but open data should be available to anyone or it doesn't count.
Almost yahoo, almost...
>> I didn't see the word "open" mentioned once in this article...
Touché sir. Sentiment still stands: "released to researchers" and "released to the public" should not be different things.
As for programmatic access -- Flickr has a good API interface.
I don't use it anymore, but SmugMug.
Yahoo really let themselves get leapfrogged on that one.
I hardly upload anything to flickr anymore because the interface for that is so slow.
Once data like this is deanonymized, it's out there forever -- there's no going back to fix it like you would a software bug. So you need perfect understanding and provable security at release time to guarantee safety into the indefinite future. That's not an easy constraint to satisfy.
"The dataset may be used by researchers to validate recommender systems, collaborative filtering methods, context-aware learning, large-scale learning algorithms, transfer learning, user behavior modeling, content enrichment and unsupervised learning methods."
Edit: added quote
Great to get some truly "Big Data" sets out there. I consider "Big Data" to be data that can't be conventionally processed on a commodity machine; otherwise it's just analytics.
Yahoo must be applauded for supplying various data sets and helping progress machine learning research.
Ok, they are comparing MySQL to Excel, so in relative terms the data could be bigger...
They state "The readme file for this dataset is located in part 1 of the download. Please refer to the readme file for a detailed overview of the dataset.", but I only see an option to get the full 1.5T.
Yahoo is also somewhat closer to Microsoft than to Amazon.
You need an .edu mail address, a yahoo account with verified sms to download this!
Otherwise, what benefit does this have over scraping news sites?
To answer your first question, though, according to the official description of the dataset, "On the item side, we are releasing the title, summary, and key-phrases of the pertinent news article."
2) Get .edu email address
I study biology. The shape of the data, the way it is structured, the problems we face in analyzing it, are quite different than the ones faced in user-news interaction data. Techniques that are useful for reshaping and summarizing one dataset are not necessarily applicable to another.