

Data from Yelp's Dataset Challenge - glaugh
https://www.statwing.com/demos/yelp-dataset-challenge#workspaces/21456

======
glaugh
Some notes about the data, and in particular differences between how it's
presented here and its raw form via Yelp:

1\. Businesses can be in multiple neighborhoods in the original dataset. In
this version businesses can only be in one (the more common of the
neighborhoods the business was listed in). There are some nice presentation
and analysis advantages to this.

2\. We dropped categories with fewer than 50 businesses in them because of some
limitations of Statwing (it slowed us down a lot without much benefit, for
reasons I'm happy to explain but that are pretty boring).

3\. Instead of taking the number of stars typically presented on a business
(1.0, 1.5, 2.0, etc.), we grabbed an average from Yelp's dataset of reviews
for each of these businesses, so you end up having businesses with ratings
like 1.37 or 3.22. There are spikes at 1, 1.5, 2, etc. because of businesses
with very few reviews, so filtering to only include businesses with more than
25 reviews is pretty handy.

4\. This is only one of several datasets Yelp provides (one for businesses,
one for reviews, one for users, etc.):
[http://www.yelp.com/dataset_challenge](http://www.yelp.com/dataset_challenge)
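
For anyone who wants to reproduce note 3, here is a minimal sketch of the per-business averaging and the "more than 25 reviews" filter. It assumes each review is a dict with `business_id` and `stars` keys; the function name `average_stars` and the sample rows are made up for illustration and are not the code behind this demo.

```python
from collections import defaultdict


def average_stars(reviews, min_reviews=0):
    """Return {business_id: mean stars} for businesses with more than min_reviews reviews."""
    # business_id -> [sum of stars, number of reviews]
    totals = defaultdict(lambda: [0, 0])
    for review in reviews:
        entry = totals[review["business_id"]]
        entry[0] += review["stars"]
        entry[1] += 1
    return {
        business_id: star_sum / count
        for business_id, (star_sum, count) in totals.items()
        if count > min_reviews
    }


# Hypothetical sample rows; field names are an assumption about the review records.
sample = [
    {"business_id": "a", "stars": 1},
    {"business_id": "a", "stars": 2},
    {"business_id": "a", "stars": 1},
    {"business_id": "b", "stars": 5},
]
averages = average_stars(sample)                 # every business
filtered = average_stars(sample, min_reviews=2)  # only businesses with more than 2 reviews
```

On the real data you would pass `min_reviews=25` to drop the low-volume businesses that cause the spikes at 1, 1.5, 2, etc.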

Final note is that we're of course always interested in feedback, so have at
it.

~~~
minimaxir
> _3\. Instead of taking the number of stars typically presented on a business
> (1.0, 1.5, 2.0, etc.), we grabbed an average from Yelp's dataset of reviews
> for each of these businesses, so you end up having businesses with ratings
> like 1.37 or 3.22._

I don't believe that derivation is equivalent.

From my own tabulation of the data:

# of reviews in Yelp's reviews dataset: 1,125,458 reviews

# sum of review counts across all businesses in Yelp's business dataset:
1,236,445 reviews

So the aggregate will fail to account for about 9% of the rating data
(110,987 reviews).
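
The gap between the two totals can be checked directly. This is just the arithmetic on the two counts quoted above; the variable names are mine.

```python
reviews_in_review_dataset = 1_125_458
review_counts_in_business_dataset = 1_236_445

# Reviews counted on businesses but absent from the reviews dataset.
missing = review_counts_in_business_dataset - reviews_in_review_dataset
missing_share = missing / review_counts_in_business_dataset

print(f"{missing:,} reviews unaccounted for ({missing_share:.1%} of the business-level total)")
```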

~~~
glaugh
There's definitely some inconsistency here.

An even larger issue is probably that the way Yelp calculates ratings for a
business isn't a straight average; it involves a notion of a prior
expectation. I'd go into more detail here but I'm struggling to find the (I
think official?) URL talking about this.
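
To illustrate what a "prior expectation" can mean in general, here is a sketch of a prior-weighted (Bayesian-style) average. This is not Yelp's formula, which I can't cite; `prior_mean` and `prior_weight` are hypothetical values picked only for the example.

```python
def prior_weighted_average(stars, prior_mean=3.5, prior_weight=5):
    """Blend a business's reviews with prior_weight imaginary reviews at prior_mean.

    NOTE: illustrative only; the defaults are assumptions, not Yelp's parameters.
    """
    total = sum(stars) + prior_mean * prior_weight
    return total / (len(stars) + prior_weight)


one_review = prior_weighted_average([5])         # pulled strongly toward the prior
many_reviews = prior_weighted_average([5] * 95)  # dominated by the actual reviews
no_reviews = prior_weighted_average([])          # falls back to the prior itself
```

With few reviews the result sits near the prior, and with many reviews it approaches the straight average, which is why a displayed rating and a raw mean of the reviews could disagree.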

------
thalesfc
Wow, what a fantastic tool. I liked it.

------
minimaxir
This is _explicitly_ against Yelp's Terms of Use for the challenge dataset.
Any redistribution of the raw data is disallowed.

Source:
[https://news.ycombinator.com/item?id=8121730](https://news.ycombinator.com/item?id=8121730)

~~~
glaugh
We have authorization from Yelp representatives to show their data in this
fashion.

