
Playing with 80M Amazon Product Review Ratings Using Apache Spark - minimaxir
http://minimaxir.com/2017/01/amazon-spark/
======
minimaxir
If you want more insight into the data processing (and code which didn't work
out), I strongly recommend looking at the R Notebook for the post:
[http://minimaxir.com/notebooks/amazon-spark/](http://minimaxir.com/notebooks/amazon-spark/)

R Notebooks have been a tremendous help for my workflows. (I do have a post
planned to illustrate their many advantages over Jupyter Notebooks.)

~~~
semi-extrinsic
This is really cool. Looking forward to the R Notebook vs. Jupyter shootout.

Question: under "Distribution of average scores", I notice both distributions
have a trend of oscillating up/down on every other bar. Is that a binning
artifact, or somehow inherent in the Amazon rating system? With counts of
O(1e5) I was expecting much smoother histograms.

~~~
minimaxir
Keep in mind the cumulative distribution of reviews. With 5 reviews minimum,
the average has few sig-figs of precision, which is why the binning is also
set to 1 sig-fig. It also makes the chart more readable. (2 sig-figs would add
up to 10x as many columns, with potential for gaps with missing values)
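One possible contributor to the every-other-bar pattern can be sketched in a few lines. This is an illustration only, not the post's code: it enumerates every combination of 1-5 star ratings for a product with exactly 5 reviews (the hypothetical helper `average_histogram` is not from the post) and bins the averages to one decimal place. With 5 reviews the average is always a multiple of 0.2, so such products can only ever land in every other 0.1-wide bin.

```python
from collections import Counter
from itertools import product


def average_histogram(n_reviews, decimals=1):
    """Count how many rating combinations fall into each rounded-average bin.

    Assumption: every combination of 1-5 star ratings is enumerated once;
    real review data is far from uniform, so only the *positions* of the
    reachable bins are meaningful here, not their heights.
    """
    counts = Counter()
    for ratings in product(range(1, 6), repeat=n_reviews):
        avg = sum(ratings) / n_reviews
        counts[round(avg, decimals)] += 1
    return counts


hist = average_histogram(5)
for bin_value in sorted(hist):
    print(f"{bin_value:.1f}  {hist[bin_value]}")
```

Products with more reviews fill the in-between bins, which would soften the effect without removing it.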

------
orthecreedence
Average Amazon product review:

    I haven't used it yet but I'm sure it's great!! (Five stars)

or

    Looks neat! (Four stars)

or

    GOT THIS LAPTOP FOR MY GIRLFRIEND AND SHE COULDNT FIGURE OUT HOW TO USE IT. I CANNOT BELIEVE THIS. FOR $200 I SHOULD GET A TOP-END MACHINE NOT THIS TRASH. I AM RETURNING. (ONE STAR)

Should be interesting.

~~~
dawnerd
Not to mention all the fake reviews and reviews that hide the text 'I received
this product at a discount/free for my honest unbiased review'.

Sidenote: I refuse to buy products that have reviews like that. Instantly lost
my trust.

~~~
yAnonymous
It's difficult to sell an item when you have no ratings at all or very few
ratings and a competitor has hundreds, so I can understand why they give
products away for ratings. I think Amazon are looking into marking those
reviews more clearly.

When I am looking for something, I go by the number of reviews, too, so I
really can't blame them.

Personally, I just ignore those reviews. Reviews in the 2-4 star range are
more useful anyway.

------
spullara
Why in the world would you use Spark for such a tiny data set?

~~~
teej
I imagine the author used this as an excuse to play around with Spark. If it
were me doing this for work, yeah I'd drop this in Postgres. Most of these
analyses would be short SQL queries.
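As a rough sketch of what such a short query could look like, here is an average-score-per-product aggregation with a 5-review minimum. Assumptions: SQLite (via Python's built-in `sqlite3`) stands in for Postgres so the example is self-contained, and the table name `ratings`, its columns, and the helper `average_scores` are hypothetical rather than taken from the post.

```python
import sqlite3


def average_scores(rows, min_reviews=5):
    """Return (item_id, n_reviews, avg_rating) for items with enough reviews.

    `rows` is an iterable of (user_id, item_id, rating) tuples.
    The schema is an assumption made for this illustration.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ratings (user_id TEXT, item_id TEXT, rating REAL)"
    )
    conn.executemany("INSERT INTO ratings VALUES (?, ?, ?)", rows)
    query = """
        SELECT item_id, COUNT(*) AS n_reviews, AVG(rating) AS avg_rating
        FROM ratings
        GROUP BY item_id
        HAVING COUNT(*) >= ?
        ORDER BY item_id
    """
    result = conn.execute(query, (min_reviews,)).fetchall()
    conn.close()
    return result


# Item "A" has five 5-star ratings; item "B" has a single rating and is filtered out.
sample = [(f"u{i}", "A", 5.0) for i in range(5)] + [("u9", "B", 1.0)]
print(average_scores(sample))
```

The same `GROUP BY ... HAVING` statement should carry over to Postgres essentially unchanged.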

~~~
dpweb
Let's say I tried to load up Postgres and a data set on my fairly powerful
laptop to run queries. How many records could I get up to? 100M? 1B? Say, with
16GB of RAM.

~~~
wjossey
One would need to know the size of the record. This is an exercise you'll
often do if you're doing capacity analysis / growth analysis for planning (or
in conjunction with FP&A).

1,000,000 4kb records take up, as you'd guess, 4GB of RAM. You can obviously
go well beyond your allocation of RAM and still have the database perform, but
you'll find you're now bottlenecked on the speed of IO from your SSD / HDD.
Throughput will quickly decrease, queries will run slower, etc.
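The back-of-the-envelope arithmetic above can be written down as a tiny sketch. Assumptions: decimal units (1 KB = 1,000 bytes, 1 GB = 10^9 bytes), and no allowance for indexes, row overhead, or the operating system's own memory use; the function names are hypothetical.

```python
def dataset_size_gb(n_records, bytes_per_record):
    """Raw data size in GB, ignoring indexes and per-row overhead."""
    return n_records * bytes_per_record / 1e9


def max_records_in_ram(ram_gb, bytes_per_record):
    """How many records of a given size fit in RAM under the same assumptions."""
    return int(ram_gb * 1e9) // bytes_per_record


# 1,000,000 records of 4 KB each
print(dataset_size_gb(1_000_000, 4_000))

# Records of 4 KB that fit in a 16 GB laptop before spilling to disk
print(max_records_in_ram(16, 4_000))
```

Smaller records change the picture a lot, which is why the record size has to be known before the question can be answered.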

This is why you'll often find that DB benchmarks that never exceed RAM can be
"false" comparisons if the expected workload will always exceed available
memory, and data not resident to memory will need to be loaded.

So, to truly answer your question, it's actually less a question of RAM, and
more a question of HDD. The 2015 ACS (American Community Survey) plus some
geographic data is around 100GB, and I comfortably run analysis against it on
my wimpy 2015 MacBook (8GB RAM, 1.3GHz Core M).

------
uptownfunk
Very cool to see R playing nicely with Spark via the sparklyr package. The new
flex dashboarding feature out of knitr is awesome. The R/RStudio team
definitely knows what they're doing, and I'm very excited to see what's next
for the data science community.

~~~
IndianAstronaut
Shiny is slowly expanding its dashboarding capabilities. I have rolled out
dashboards with hundreds of thousands of rows which don't need any explicit
pagination and can be used simultaneously by dozens of internal customers. All
this spun up in a matter of hours.

------
fowlerpower
Can one use this data set for commercial purposes? May sound like a silly
question, and the answer may be no, but this sort of data would be very useful
to build something cool.

------
EternalData
"And this post doesn’t even look at the text of the Amazon product reviews or
the metadata associated with the products!"

time to bust out the sentiment analysis

------
koolba
Interesting write up. Where did the data come from? Does Amazon publish the
raw data in the format that is loaded or was it scraped?

~~~
praveenster
From the article the data is available at:
[http://jmcauley.ucsd.edu/data/amazon/](http://jmcauley.ucsd.edu/data/amazon/)

"I wrote a simple Python script to combine the per-category ratings-only data
from the Amazon product reviews dataset curated by Julian McAuley, Rahul
Pandey, and Jure Leskovec for their 2015 paper Inferring Networks of
Substitutable and Complementary Products"
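For readers wondering what such a combining script might involve, here is a minimal sketch. It is not the author's script. Assumptions: each category is a headerless CSV named `ratings_<Category>.csv` with rows of user, item, rating, timestamp; the output column names and the function `combine_ratings` are hypothetical.

```python
import csv
from pathlib import Path


def combine_ratings(input_dir, output_path):
    """Concatenate per-category rating CSVs into one file with a category column.

    Assumes headerless input files named ratings_<Category>.csv; the output
    file should not match that pattern if it is written into input_dir.
    Returns the number of rating rows written.
    """
    n_rows = 0
    with open(output_path, "w", newline="") as out_file:
        writer = csv.writer(out_file)
        writer.writerow(["user_id", "item_id", "rating", "timestamp", "category"])
        for csv_path in sorted(Path(input_dir).glob("ratings_*.csv")):
            # Derive the category from the file name, e.g. ratings_Books.csv -> Books
            category = csv_path.stem[len("ratings_"):]
            with open(csv_path, newline="") as in_file:
                for row in csv.reader(in_file):
                    writer.writerow(row + [category])
                    n_rows += 1
    return n_rows
```

A call such as `combine_ratings("data", "combined.csv")` would then yield a single file that Spark, or any SQL database, can load in one step.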

~~~
samfisher83
Anyone have a torrent?

On the page it says:

Amazon review data will be made available (for research purposes) on request

~~~
vinay427
Just make a request. When I did so, I heard back within a few hours and the
only requirement was that I cite their work, which is entirely reasonable.
Similar datasets are more likely to be created if people are at least
mentioned for their efforts.

------
etrain
We've used this dataset to build a product review classification pipeline as
an example application that can be developed using our project, KeystoneML
(which runs on spark) - code is here:
[https://github.com/amplab/keystone/blob/master/src/main/scal...](https://github.com/amplab/keystone/blob/master/src/main/scala/pipelines/text/AmazonReviewsPipeline.scala#L25-L39)

------
arafa
It would be really interesting to see the same analysis for verified reviews
only, and contrast it with the overall numbers and non-verified reviews. I
would actually want to read that more than this (which was still interesting).

~~~
will_pseudonym
Do you mean reviews done by verified purchasers? I believe I have read that
firms buying reviews get around that qualifier by giving the reviewer
money/gift cards to purchase the item, then write a positive review. They
could easily set it up so that the reviewer has to send back the sample item,
too, to reduce the cost.

~~~
dawnerd
Exactly this. Amazon claims they're fixing it but I don't think that'll stop
the problem. If anything it'll cause them to do reviews without posting a
notice about it being sponsored or potentially biased.

------
coldcode
Does make you wonder what value the ratings have. Given that most of the
ratings are 4-5, you would think most products on Amazon are wonderful. Also
makes you wonder how many are from real users and how many are paid.

~~~
jmcdiesel
I think that has to do more with bias...

I've noticed on a few applications/sites I've made...

Star ratings tend to bias high, but ratings attached to reviews tend to bias
low. If you're happy with the product, you just give it a 5... if you're
unhappy, you give it a 1 and a review... but that takes more effort... so the
happy outweighs the unhappy it seems... just my own theory, but it's played out
in a few places...

------
techaddict009
What kind of API did you use to pull all the reviews out of Amazon? As far as
I can see, they have blocked returning reviews via the API; nowadays the API
only provides them via an iframe. How did you curate the list of each and
every product present on Amazon?

P.S. Awesome analysis.

