

Yelp extends academic dataset - Zephyr314
http://engineeringblog.yelp.com/2014/08/the-yelp-dataset-challenge-goes-international-new-data-new-cities-open-to-students-worldwide.html

======
minimaxir
Note the restrictions in the terms of use, which prevent redistribution of the
data or any summaries "on any website or other electronic media not covered by
this agreement."

Would this prohibit making a blog post on an analysis of the data?

~~~
Zephyr314
A blog post analyzing the dataset is fine, and many have been published. Our
limitation is on redistribution of Yelp's dataset itself.

~~~
izyda
What about visualizations of analysis of the data (say using an R Shiny app),
ensuring of course, that the raw data could not be downloaded?

~~~
Zephyr314
Visualizations and analysis of the data are allowed and encouraged. If you are
a student you should submit to the challenge as well!

------
rpedela
Maybe I am reading it wrong, but the license seems to say that Yelp
exclusively owns IP rights to all derivative works. Hopefully I am reading it
wrong, but if not then definitely not cool in my opinion.

~~~
Zephyr314
Yelp retains rights to all derivative works of the dataset itself. This would
not apply to papers, articles, or algorithms analyzing the dataset.

~~~
brendano
What does derivative work mean? Why is a graph or data summary not a
derivative work?

~~~
Zephyr314
[http://en.wikipedia.org/wiki/Derivative_work](http://en.wikipedia.org/wiki/Derivative_work)

------
danso
This is great...I'm teaching a couple of classes next year on data analysis
and it's incredibly helpful to have real-world data that is _fun_...things
like Census/NOAA data are great, but too abstract (initially) for the average
novice to really ask interesting questions of.

But everyone knows what it's like to eat at a crappy/great restaurant, or
where such places might be located, or how people might review as a
cluster...and so everyone comes in with testable assumptions and hypotheses
that are fun to explore.

This item brought to mind yesterday's front page post on "Seven habits of
highly fraudulent users"
([https://news.ycombinator.com/item?id=8116047](https://news.ycombinator.com/item?id=8116047)).
Does Yelp do any _extra_ processing of this academic set (beyond whatever
regular cleaning they do of spam accounts)? It'd be interesting to test
hypotheses on signals of spammy/fake accounts (OTOH, I imagine Yelp would
probably prefer such trends not to be so apparent in bulk data).

~~~
Zephyr314
You can imagine this as a scrape of our website for the specific businesses
listed (and their associated recommended reviews and tips, and the associated
users for that content) on the date the dataset was generated (which you can
assume is the last timestamp).
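
As a rough sketch of that last point, the snapshot date could be estimated by
scanning the records for the most recent date. This assumes the data is stored
as one JSON object per line and that dated records carry a `date` field in
`YYYY-MM-DD` form; both the field name and the format are assumptions here, and
`latest_date` is a hypothetical helper, not part of any Yelp tooling.

```python
import json
from datetime import date, datetime


def latest_date(lines, field="date"):
    """Return the most recent date found in an iterable of JSON-lines records.

    Assumes each non-empty line is a JSON object and that dated records store
    the date under `field` as "YYYY-MM-DD" (an assumption, not a documented
    schema). Records without the field are skipped.
    """
    latest = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        value = record.get(field)
        if value is None:
            continue
        parsed = datetime.strptime(value, "%Y-%m-%d").date()
        if latest is None or parsed > latest:
            latest = parsed
    return latest


# Made-up sample records, for illustration only.
sample = [
    '{"type": "review", "date": "2014-01-15"}',
    '{"type": "tip", "date": "2014-07-02"}',
    '{"type": "user"}',
]
snapshot = latest_date(sample)
```

With a real file, the same function could be called on an open file handle,
since iterating a text file yields one line at a time.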

For more information about recommended reviews please check out:
[http://officialblog.yelp.com/2013/11/yelp-recommended-reviews.html](http://officialblog.yelp.com/2013/11/yelp-recommended-reviews.html)

Let me know if you have any other questions!

------
fiatjaf
Less research is needed:
[http://blogs.plos.org/speakingofmedicine/2012/06/25/less-research-is-needed/](http://blogs.plos.org/speakingofmedicine/2012/06/25/less-research-is-needed/)

~~~
izyda
I think this competition is a good example of what the author advocates for -
asking the right questions, not the tireless ones. There's no specific prompt
and no cliche single purpose to the exercise. Further, the competition is
aimed at students, which means that in all likelihood most projects will be
relatively short, succinct examples of cool things that can be done with the
data (the door is not open to more data collection or to studies that require
a lot of time, and the dataset is small enough that computing time and
infrastructure should not be an issue).

The real challenge is having the creativity to come up with a question worth
asking. From Yelp's perspective, I imagine the most value they get from
hosting this competition is the (potentially monetizable) ideas for what types
of business problems could be solved using their treasure trove of data.

I'm sure they could have their data science team build an incredibly
sophisticated model for some specific prediction task (probably more
sophisticated than what any student could reasonably submit) - but if that
specific prediction task turns out to be difficult or not monetizable, that's
a huge investment lost. Having many students look at many different ideas is
more likely to result in finding the right question.

