Using Yelp Data to Predict Restaurant Closure (towardsdatascience.com)
105 points by yarapavan 45 days ago | 21 comments

I wonder how accurate this is in areas where tourism is a large contributor to the local economy. You don't actually have to be much good at running a business if you've got an endless stream of new people for several months out of the year and don't need to rely on repeat business. You just rename the place and hire a new GM when the negative reviews start overwhelming you (this applies more generally than restaurants, btw).

I would probably try new restaurants more frequently if I could be more sure I wasn't gonna pay $10 for a $5 burger and help buy some sleazy J1-slave-driver (owner is too nice of a word) a new Land Rover in the process.

Definitely more general than restaurants. We have a tuxedo shop in our area that marks up cheap imports you can buy for about half the price straight off Amazon. They hold a going-out-of-business sale, a closing sale, etc., which really means they're just moving next door in the same plaza. They've been bouncing around for years and racked up a few lawsuits at some point. I guess it works, because they're still doing it.

What does the J1 in J1-slave-driver mean?

J-1 is a category of exchange visa issued by the US.


How did he get the data? It's pretty hard to pull the reviews and the data from Yelp. I tried to do that for some querying, but their search isn't so great and they pull a lot of stunts to prevent you from scraping.


Oh, I see he's using the Kaggle data. That's not guaranteed to be reliable.

Eh it's possible, the reviews are harder though.

I wrote a scraper which pulled address info / phone number / star rating / review count for pretty much every restaurant in the US.

It was "easy" because all of that data is available within the search page, and you just need to correctly parse it out.

The hardest part was getting around their really crazy rate limiting and IP blocking.
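For anyone hitting the same wall, the usual polite approach is to slow down and back off when throttled rather than hammer the site. A minimal sketch, assuming the server signals throttling with HTTP 429 or 503 (an assumption, not something confirmed for Yelp); `fetch` is a hypothetical callable you supply that returns a `(status_code, body)` pair:

```python
import time


def fetch_with_backoff(fetch, url, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) and retry with exponential backoff while throttled.

    `fetch` is a hypothetical callable returning (status_code, body).
    Treating 429 and 503 as "throttled" is an assumption for this sketch.
    """
    for attempt in range(max_attempts):
        status, body = fetch(url)
        if status not in (429, 503):
            return status, body
        # Wait base_delay, then 2x, 4x, ... before the next attempt.
        sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"still throttled after {max_attempts} attempts: {url}")
```

The `sleep` parameter is injectable only so the function can be exercised without actually waiting.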

I managed to get myself IP banned from Yelp before ever trying to scrape, just by doing a bunch of manual searches pretty quickly over about 20 minutes. Next thing I knew, I could no longer access anything on Yelp.

That's not surprising. You can get yourself IP blocked just by opening things in other tabs to queue them up to read. (If you notice yourself getting random 404s, that's when you're being watched.)

FWIW, I just tried gathering some current [1] user-generated Yelp data from the Internet Archive, and it was very easy to gather a list of all restaurants for a city [2] and then all reviews for each restaurant.

[1]: https://web.archive.org/save/[url] and most recent crawl: http://web.archive.org/web/20180109124942/https://yelp.com/

[2]: By incremental searches that each return under 1000 results.
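One way to read "incremental searches that each return under 1000 results" is to keep splitting the search space until every piece falls under the cap. A rough sketch, assuming the space can be keyed by an integer range (postal codes, say) and that `count(lo, hi)` is a hypothetical function you write to report how many results a search over that range returns; neither comes from the comment above:

```python
def split_until_under_cap(lo, hi, count, cap=1000):
    """Split the half-open integer range [lo, hi) until each piece has fewer than `cap` results.

    `count(lo, hi)` is a hypothetical callable returning the number of
    results a search restricted to [lo, hi) would produce.
    """
    # Stop when the piece is small enough, or cannot be split further.
    if count(lo, hi) < cap or hi - lo <= 1:
        return [(lo, hi)]
    mid = (lo + hi) // 2
    return (split_until_under_cap(lo, mid, count, cap)
            + split_until_under_cap(mid, hi, count, cap))


# Toy count function: pretend every unit of the range holds 10 results.
pieces = split_until_under_cap(0, 400, lambda lo, hi: 10 * (hi - lo))
print(len(pieces), pieces[0], pieces[-1])
```

A single-unit range that still exceeds the cap is returned as-is, so a real crawl would need a second key to split on.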

That’s a great tip, thanks!

This [1] also contains a superset of the data used in this post and is direct from Yelp.

[1]: https://www.yelp.com/dataset/challenge

This dataset contains sparse data from businesses in different cities around the world. It has a very small overlap with the dataset used in the study. Focusing on a particular city helps to understand the underlying trends better.

The Kaggle data was matched with recent information about the restaurants which was pulled from the Yelp API.

Yelp publishes reviews in JSON-LD format so the parsing is trivial.

They don't publish all of their reviews. They're limited to 3 reviews.

Their API only exposes three, but in their HTML they embed JSON-LD metadata for all of them (but sadly not the "not recommended" reviews).
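To illustrate why the parsing is close to trivial: JSON-LD sits in `<script type="application/ld+json">` tags, so the standard library is enough. A minimal sketch, assuming the reviews appear as schema.org `Review` objects under a `review` key; the sample markup below is made up for illustration and is not real Yelp HTML:

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.documents.append(json.loads("".join(self._chunks)))


def extract_reviews(html):
    """Return (rating, text) pairs from Review objects found in the page's JSON-LD."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    reviews = []
    for doc in parser.documents:
        if not isinstance(doc, dict):
            continue  # this sketch ignores top-level JSON-LD arrays
        for review in doc.get("review", []):
            rating = review.get("reviewRating", {}).get("ratingValue")
            reviews.append((rating, review.get("description")))
    return reviews


# Hypothetical page fragment, not real Yelp markup.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@type": "Restaurant", "name": "Example Diner",
 "review": [
   {"@type": "Review", "reviewRating": {"ratingValue": 4}, "description": "Good burger."},
   {"@type": "Review", "reviewRating": {"ratingValue": 2}, "description": "Slow service."}
 ]}
</script>
</head><body></body></html>
"""

print(extract_reviews(SAMPLE_HTML))
```

The field names (`review`, `reviewRating`, `ratingValue`, `description`) follow schema.org; whether a given page uses exactly these is something to check against the actual markup.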

As the author mentioned, changes in rent are a huge factor. Did the closure dates coincide with a new lease, whose term can range from 1 to 10 years? A distribution of restaurant age at closure could show that.

The other huge factor is cost of labor. Maybe looking at the minimum wage could be another feature. The news usually has those articles about how restaurants are struggling and the incremental minimum wage increase will hurt their business. It'd be interesting to see how strong of a factor that is in restaurant closures.

Also, factors that could be tough to get but are important:
* Cost of ingredients like meat, vegetables, etc.
* General economic conditions: are consumers going out to eat?
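The age-at-closure distribution is cheap to compute if opening and closing dates are available. A small sketch with made-up dates (purely illustrative, not taken from the article's data); clusters at common lease lengths would be consistent with the rent hypothesis, though they would not prove it:

```python
from collections import Counter
from datetime import date


def age_at_closure_years(opened, closed):
    """Whole years a restaurant was open before it closed."""
    years = closed.year - opened.year
    # Subtract one if the closing date falls before the opening anniversary.
    if (closed.month, closed.day) < (opened.month, opened.day):
        years -= 1
    return years


def age_histogram(records):
    """Count restaurants by whole-year age at closure."""
    return Counter(age_at_closure_years(opened, closed) for opened, closed in records)


# Made-up (opened, closed) pairs for illustration only.
TOY_RECORDS = [
    (date(2010, 3, 1), date(2015, 3, 1)),
    (date(2012, 6, 15), date(2017, 5, 30)),
    (date(2011, 1, 10), date(2016, 2, 1)),
    (date(2014, 9, 1), date(2015, 8, 31)),
]

hist = age_histogram(TOY_RECORDS)
for age in sorted(hist):
    print(f"{age:2d} years: {'#' * hist[age]}")
```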

It sounds like they un-anonymized the data, which strikes me as slightly unethical. (I mean it's not medical data or anything, but I don't think that was the intended use of the anonymized data.)

Further, it seems like the results of this will be used to deny loans to restaurants that are not doing so great, thus ensuring that they fail because they can't get funding for renovations and improvements.

The original dataset already contained the names, addresses and coordinates of each restaurant. Finding the restaurant IDs does not reveal any additional information. It just makes it easier to look up recent information from Yelp, which is available through their API anyway.

I don't think this model uncovered anything new for lenders. Chains with little competition are exactly the places people already expect to do well.

Very nice! I like how you used multiple data sources to enable a study that couldn't be done with just one.
