Playing with 80M Amazon Product Review Ratings Using Apache Spark (minimaxir.com)
189 points by minimaxir on Jan 18, 2017 | 34 comments

If you want more insight into the data processing (and code which didn't work out), I strongly recommend looking at the R Notebook for the post: http://minimaxir.com/notebooks/amazon-spark/

R Notebooks have been a tremendous help for my workflows. (I do have a post planned to illustrate their many advantages over Jupyter Notebooks)

This is really cool. Looking forward to the R Notebook vs. Jupyter shootout.

Question: under "Distribution of average scores", I notice both distributions have a trend of oscillating up/down on every other bar. Is that a binning artifact, or somehow inherent in the Amazon rating system? With counts of O(1e5) I was expecting much smoother histograms.

Keep in mind the cumulative distribution of reviews. With 5 reviews minimum, the average has few sig-figs of precision, which is why the binning is also set to 1 sig-fig. It also makes the chart more readable. (2 sig-figs would add up to 10x as many columns, with potential for gaps with missing values)
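To illustrate how low review counts could contribute to an every-other-bar pattern, here is a minimal sketch. It is not the author's analysis; the helper name `reachable_tenths` is hypothetical, and it assumes ratings are whole stars from 1 to 5 and that averages are rounded to one decimal place. With exactly five reviews the average is a multiple of 0.2, so only the even tenths can be reached.

```python
def reachable_tenths(n_reviews):
    """Bins (in tenths of a star) reachable by averaging n_reviews ratings of 1..5."""
    bins = set()
    for total in range(n_reviews, 5 * n_reviews + 1):
        # Round total / n_reviews to one decimal (half up), using integers only.
        bins.add((20 * total + n_reviews) // (2 * n_reviews))
    return bins

for n in (5, 6, 10):
    tenths = sorted(reachable_tenths(n))
    odd = [t for t in tenths if t % 2 == 1]
    print(n, "reviews:", len(tenths), "reachable bins,", len(odd), "on odd tenths")
```

Products with only a handful of reviews can land on just a subset of the 0.1-wide bins, which would make some bars systematically taller than their neighbors.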

Average Amazon product review:

    I haven't used it yet but I'm sure it's great!! (Five stars)

    Looks neat! (Four stars)


Should be interesting.

Not to mention all the fake reviews and reviews that hide the text 'I received this product at a discount/free for my honest unbiased review'.

Sidenote: I refuse to buy products that have reviews like that. Instantly lost my trust.

It's difficult to sell an item when you have no ratings at all or very few ratings and a competitor has hundreds, so I can understand why they give products away for ratings. I think Amazon are looking into marking those reviews more clearly.

When I am looking for something, I go by the number of reviews, too, so I really can't blame them.

Personally, I just ignore those reviews. Reviews in the 2-4 star range are more useful anyway.

Try fakespot.com before you buy.

Why in the world would you use Spark for such a tiny data set?

I imagine the author used this as an excuse to play around with Spark. If it were me doing this for work, yeah I'd drop this in Postgres. Most of these analyses would be short SQL queries.
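As a rough sketch of what "short SQL queries" might look like here, the example below uses Python's built-in `sqlite3` as a stand-in for Postgres. The table name `ratings`, its columns, and the toy rows are all assumptions made for illustration, not the dataset's actual schema.

```python
import sqlite3

# In-memory SQLite database standing in for Postgres (assumption for the sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (user_id TEXT, item_id TEXT, rating REAL)")

# Toy rows; the real dataset has about 80M of these.
rows = [
    ("u1", "A", 5), ("u2", "A", 4),
    ("u1", "B", 1), ("u3", "B", 2),
    ("u2", "C", 5),
]
conn.executemany("INSERT INTO ratings VALUES (?, ?, ?)", rows)

MIN_REVIEWS = 2  # the post uses a 5-review minimum; 2 keeps the toy data non-empty

query = """
SELECT item_id, COUNT(*) AS n_reviews, AVG(rating) AS avg_rating
FROM ratings
GROUP BY item_id
HAVING COUNT(*) >= ?
ORDER BY avg_rating DESC
"""
result = conn.execute(query, (MIN_REVIEWS,)).fetchall()
for item_id, n_reviews, avg_rating in result:
    print(item_id, n_reviews, avg_rating)
```

The same `GROUP BY` / `HAVING` shape should carry over to Postgres largely unchanged, apart from the placeholder syntax used by the driver.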

I believe that was the author's intention: to write a DIY post demonstrating the usefulness of Spark, not to gain insights from the Amazon reviews.

A bit of both. I wanted an excuse to test out Spark to find the kinks which were omitted from the documentation (and boy did I find kinks), and also provide a practical demo.

Thanks for writing this! I'm thinking about using Spark for a little 2M-data-point project that I'm working on, just for the learning experience.

Out of curiosity, what kinks did you find?

Essentially the usual Spark caveats of lazy evaluation and immutability of caches: neither is a big deal on small datasets, but a mistake with either on a large dataset can result in a lot of lost time or confusion.
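The lazy-evaluation caveat can be sketched in plain Python. This is an analogy using the built-in `map`, not Spark's API, and the function `expensive` is a hypothetical stand-in for a costly transformation. Nothing runs until a result is requested, and rebuilding the pipeline repeats all the work unless the results were materialized first.

```python
calls = 0

def expensive(x):
    """Stand-in for a costly transformation; counts how often it runs."""
    global calls
    calls += 1
    return x * x

data = range(5)

lazy = map(expensive, data)      # defines the pipeline; nothing is computed yet
calls_after_define = calls

total = sum(lazy)                # consuming the pipeline forces evaluation
calls_after_first = calls

# Rebuilding and consuming the same lazy pipeline repeats all the work.
total_again = sum(map(expensive, data))
calls_after_second = calls

# Materializing once (a rough analogue of caching) lets later steps reuse results.
cached = list(map(expensive, data))
largest = max(cached)
smallest = min(cached)

print(calls_after_define, calls_after_first, calls_after_second, calls)
```

On five elements the repeated work is invisible; on tens of millions of rows the same mistake is where the lost time comes from.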

Then there are the massive shuffle reads/writes that result in 50GB of I/O, which is not great for SSDs.

Let's say I tried to load up Postgres and a data set on my fairly powerful laptop to run queries. How many records could I get up to? 100M? 1B? Say 16GB RAM.

One would need to know the size of the record. This is an exercise you'll often do if you're doing capacity analysis / growth analysis for planning (or in conjunction with FP&A).

1,000,000 4KB records take up, as you'd guess, 4GB of RAM. You can obviously go well beyond your allocation of RAM and still have the database perform, but you'll find you're now bottlenecked on the speed of I/O from your SSD / HDD. Throughput will quickly decrease, queries will run slower, etc.
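The back-of-the-envelope arithmetic can be written out as a small sketch. The helper names are hypothetical, decimal units (1 KB = 1000 bytes) are assumed, and the 100-byte record size is an illustrative assumption. Database overhead such as indexes, row headers, and the operating system's own memory use is ignored.

```python
KB = 1000
GB = 1000 ** 3

def dataset_bytes(n_records, record_bytes):
    """Raw size of n_records fixed-size records, ignoring database overhead."""
    return n_records * record_bytes

def records_that_fit(ram_bytes, record_bytes):
    """How many fixed-size records fit in the given amount of RAM."""
    return ram_bytes // record_bytes

# The example from the comment: one million 4 KB records.
size = dataset_bytes(1_000_000, 4 * KB)
print(size / GB, "GB")

# The laptop from the question: 16 GB of RAM.
fit_4kb = records_that_fit(16 * GB, 4 * KB)
fit_100b = records_that_fit(16 * GB, 100)   # assumed 100-byte records
print(fit_4kb, "records of 4 KB;", fit_100b, "records of 100 bytes")
```

Record size dominates the answer: under these assumptions the same 16 GB holds a few million wide records or well over a hundred million narrow ones.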

This is why you'll often find that DB benchmarks that never exceed RAM can be "false" comparisons if the expected workload will always exceed available memory, and data not resident in memory will need to be loaded from disk.

So, to truly answer your question, it's actually less a question of RAM, and more a question of HDD. The 2015 ACS (American Community Survey) plus some geographic data is around 100GB, and I comfortably run analysis against it on my wimpy 2015 MacBook (8GB RAM, 1.3GHz Core M).

You can load up a billion records with half that RAM, but the most important part is the type of queries you want to run. In most cases even the most complex select queries are okay (as long as you're not running 100+ in parallel). If you are running a lot of inserts/updates, you'll probably run into issues. But selects are fairly trivial as long as you have the space.

Spark has a number of features and constructs that can make it very powerful to work with, even on "small" data sets. Big data isn't just measured by size, it's also measured by computational complexity. 80,000,000 rows is massive if the operation you're performing against it is O(N^2), as an example.
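To put a rough number on the O(N^2) point, here is a small worked calculation. The throughput figure of one billion simple operations per second is purely an assumption for illustration; real workloads will differ.

```python
N = 80_000_000                           # rows in the dataset

ops = N * N                              # an O(N^2) pass touches every pair of rows
ASSUMED_OPS_PER_SECOND = 1_000_000_000   # illustrative single-machine throughput

seconds = ops / ASSUMED_OPS_PER_SECOND
days = seconds / 86_400                  # 86,400 seconds per day

print(f"{ops:.1e} operations, about {days:.0f} days at the assumed rate")
```

Under that assumption a single quadratic pass would take on the order of months on one core, which is the sense in which 80 million rows can count as "big".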

Very cool to see R playing nicely with Spark via the sparklyr package. The new flex dashboarding feature out of knitr is awesome. The R/RStudio team definitely knows what they're doing, and I'm very excited to see what's next for the data science community.

Shiny is slowly expanding its dashboarding capabilities. I have rolled out dashboards with hundreds of thousands of rows which don't need any explicit pagination and can be used simultaneously by dozens of internal customers. All this spun up in a matter of hours.

Can one use this data set for commercial purposes? May sound like a silly question, and the answer may be no, but this sort of data would be very useful to build something cool.

"And this post doesn’t even look at the text of the Amazon product reviews or the metadata associated with the products!"

time to bust out the sentiment analysis

Interesting write up. Where did the data come from? Does Amazon publish the raw data in the format that is loaded or was it scraped?

From the article the data is available at: http://jmcauley.ucsd.edu/data/amazon/

"I wrote a simple Python script to combine the per-category ratings-only data from the Amazon product reviews dataset curated by Julian McAuley, Rahul Pandey, and Jure Leskovec for their 2015 paper Inferring Networks of Substitutable and Complementary Products"

Anyone have a torrent?

On the page it says:

Amazon review data will be made available (for research purposes) on request

Just make a request. When I did so, I heard back within a few hours and the only requirement was that I cite their work, which is entirely reasonable. Similar datasets are more likely to be created if people are at least mentioned for their efforts.

Ah thanks! There were so many links back to back that I didn't see it there.

We've used this dataset to build a product review classification pipeline as an example application that can be developed using our project, KeystoneML (which runs on Spark). The code is here: https://github.com/amplab/keystone/blob/master/src/main/scal...

It would be really interesting to see the same analysis for verified reviews only, and contrast it with the overall numbers and non-verified reviews. I would actually want to read that more than this (which was still interesting).

Do you mean reviews done by verified purchasers? I believe I have read that firms buying reviews get around that qualifier by giving the reviewer money/gift cards to purchase the item, after which the reviewer writes a positive review. They could easily set it up so that the reviewer has to send back the sample item, too, to reduce the cost.

Exactly this. Amazon claims they're fixing it but I don't think that'll stop the problem. If anything it'll cause them to do reviews without posting a notice about it being sponsored or potentially biased.

Fiverr is full of people offering to do reviews on this basis, for example.

Does make you wonder what value the ratings have. Given that most of the ratings are 4-5, you would think most products on Amazon are wonderful. Also makes you wonder how many are real users and how many are paid.

I think that has to do more with bias...

I've noticed on a few applications/sites I've made...

Star ratings tend to bias high, but ratings attached to reviews tend to bias low. If you're happy with the product, you just give it a 5... if you're unhappy, you give it a 1 and a review... but that takes more effort... so the happy outweighs the unhappy it seems... just my own theory, but it's played out in a few places...

What kind of API did you use to pull all the reviews out of Amazon? As far as I can see, they no longer return reviews directly through the API; nowadays the API only provides them via an iframe. How did you curate the list of every product on Amazon?

P.S. Awesome analysis.
