R Notebooks have been a tremendous help for my workflows. (I do have a post planned to illustrate their many advantages over Jupyter Notebooks)
Question: under "Distribution of average scores", I notice both distributions oscillate up and down on alternating bars. Is that a binning artifact, or somehow inherent in the Amazon rating system? With counts of O(1e5) I was expecting much smoother histograms.
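Not the author, but one plausible mechanism: per-product averages of integer ratings only take rational values k/n, so they cluster on a sparse lattice, and fixed-width bins sample that lattice unevenly. A minimal numpy sketch with simulated data (not the post's actual dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate products whose reviews are integer star ratings (1-5).
# Each product's average is a rational number k/n, so averages cluster
# on a sparse lattice instead of filling the axis uniformly.
n_products = 20_000
sizes = rng.integers(1, 50, size=n_products)          # reviews per product
avgs = np.array([rng.integers(1, 6, size=n).mean() for n in sizes])

# With a bin width that doesn't align to that lattice, adjacent bins
# capture different numbers of lattice points -- a comb-like pattern
# that large counts alone won't smooth out.
counts, edges = np.histogram(avgs, bins=50, range=(1, 5))
```

If aligning the bin edges to the common fractions (or simply widening the bins) kills the oscillation, it's a binning artifact rather than anything inherent to Amazon's ratings.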
I haven't used it yet but I'm sure it's great!! (Five stars)
Looks neat! (Four stars)
GOT THIS LAPTOP FOR MY GIRLFRIEND AND SHE COULDNT FIGURE OUT HOW TO USE IT. I CANNOT BELIEVE THIS. FOR $200 I SHOULD GET A TOP-END MACHINE NOT THIS TRASH. I AM RETURNING. (ONE STAR)
Sidenote: I refuse to buy products that have reviews like that. Instantly lost my trust.
When I'm looking for something, I go by the number of reviews too, so I really can't blame them.
Personally, I just ignore those reviews. Reviews in the 2-4 star range are more useful anyway.
Out of curiosity, what kinks did you find?
Then there are the massive shuffle reads/writes that result in 50GB of I/O, which is not great for SSDs.
1,000,000 4KB records take up, as you'd guess, about 4GB of RAM. You can obviously go well beyond your allocation of RAM and still have the database perform, but you'll find you're now bottlenecked on the speed of I/O from your SSD/HDD. Throughput will quickly decrease, queries will run slower, etc.
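The arithmetic, as a quick sketch (raw record bytes only; a real database adds index and page overhead on top):

```python
# Back-of-envelope: RAM needed for a set of fixed-size records.
# Counts raw record bytes only -- real databases add index/page overhead.
RECORD_SIZE_BYTES = 4 * 1024        # 4KB per record
N_RECORDS = 1_000_000

total_bytes = RECORD_SIZE_BYTES * N_RECORDS
total_gb = total_bytes / 1024**3    # binary gigabytes (GiB)

print(f"{total_gb:.2f} GiB")        # ~3.81 GiB, i.e. roughly the 4GB quoted
```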
This is why you'll often find that DB benchmarks that never exceed RAM can be "false" comparisons: if the expected workload will always exceed available memory, data not resident in memory will need to be loaded from disk.
So, to truly answer your question, it's actually less a question of RAM and more a question of HDD. The 2015 ACS (American Community Survey) plus some geographic data is around 100GB, and I comfortably run analysis against it on my wimpy 2015 MacBook (8GB RAM, 1.3GHz Core M).
time to bust out the sentiment analysis
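For anyone who wants to try it, a toy lexicon-based scorer shows the idea; the word lists below are made up, and real work would use something like VADER or a trained model:

```python
# Minimal lexicon-based sentiment score for review text -- purely
# illustrative; the word lists are invented for this example.
POSITIVE = {"great", "awesome", "love", "neat", "happy"}
NEGATIVE = {"trash", "returning", "unhappy", "refuse", "worst"}

def sentiment(text: str) -> int:
    """Positive-word count minus negative-word count."""
    words = [w.strip(".,!()") for w in text.lower().split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment("Looks neat!"))                      # 1
print(sentiment("NOT THIS TRASH. I AM RETURNING."))  # -2
```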
"I wrote a simple Python script to combine the per-category ratings-only data from the Amazon product reviews dataset curated by Julian McAuley, Rahul Pandey, and Jure Leskovec for their 2015 paper Inferring Networks of Substitutable and Complementary Products"
On the page it says:
Amazon review data will be made available (for research purposes) on request
I've noticed on a few applications/sites I've made...
Star ratings tend to bias high, but ratings attached to reviews tend to bias low. If you're happy with the product, you just give it a 5... if you're unhappy, you give it a 1 and a review... but that takes more effort... so the happy outweighs the unhappy, it seems... just my own theory, but it's played out in a few places...
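That theory is easy to sanity-check with a toy simulation (all probabilities below are made-up assumptions, not fitted to anything):

```python
import random

random.seed(0)

# Toy model of the theory: happy buyers leave a quick 5-star rating,
# while unhappy buyers are far more likely to also write a review.
# The 0.8 / 0.1 / 0.6 probabilities are arbitrary assumptions.
ratings_all, ratings_with_review = [], []
for _ in range(100_000):
    happy = random.random() < 0.8            # assume 80% of buyers are happy
    stars = 5 if happy else 1
    writes_review = random.random() < (0.1 if happy else 0.6)
    ratings_all.append(stars)
    if writes_review:
        ratings_with_review.append(stars)

avg_all = sum(ratings_all) / len(ratings_all)
avg_rev = sum(ratings_with_review) / len(ratings_with_review)
# avg_all sits high (~4.2); avg_rev is pulled low by unhappy reviewers (~2.6)
```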
P.S. Awesome analysis.