
Amazon2csv: Amazon products scraper to CSV (no API token required) - tducret
https://github.com/tducret/amazon-scraper-python
======
yoaviram
Why not use the API? Disclaimer: I'm the author of python-amazon-simple-
product-api [1]

[1] [https://github.com/yoavaviram/python-amazon-simple-
product-a...](https://github.com/yoavaviram/python-amazon-simple-product-api)

~~~
AznHisoka
Are you referring to the Product Advertising API?

Doesn't that require you to have a quota of affiliate sales to keep using it? I
can't find where they state this requirement, but I remember they were very
sneaky about disclosing it. If you don't have any affiliate sales after X
months, your API key will stop working.

~~~
ZoomStop
Currently you have to be a member of their affiliate program to get API
access. To become a full "member" you have to be a prospect who generates
three referral sales (iirc) within a 30-day period. So once you're in you have
the API, but getting in isn't as easy as filling out a form. From there you
can get your API rate limit increased from the default of one call per second
up to ten, based on your prior 30 days of affiliate sales.

~~~
greglindahl
Last I looked that limit was per IP address.

------
amingilani
Scraping Amazon is fun and all, but when you start overdoing it they rate-
limit your IP and show you my worst nightmare: the Dogs of Amazon (a 500
error page with pictures).

Why do I know this? Because I'm the CTO at Nazdeeq.com where we let users buy
Amazon products from countries where they don't ship easily, like Pakistan.

Edit: totally open to partnerships in more countries

~~~
yasoob
Hi Amin, your platform seems nice. Just wanted to give you a heads-up that
your website is being classified as ["phishing" by
Avast](https://i.imgur.com/SmuuRfD.png). I think if you replace "Amazon" in
the URL with something else it should work fine. Best of luck!

~~~
always_good
Reminds me of how nobody could see one of my users' avatars because the URL
(a hash) started with an "ad" segment (for bucketing), as in
"/avatars/ad/ad3adb33f". So adblockers blocked it.

My protest against such a ridiculous heuristic was to not fix it.

~~~
amingilani
It makes sense why you would choose to do that, and I can certainly
empathize, but I try to fix these problems anyway: my customers deserve a
good user experience.

------
Jdam
The issue with those tools is that Amazon changes the product layout very
often and conducts heavy A/B testing. I once even heard that computer vision
is the most stable way to scrape Amazon. I'd guess this library will stop
working rather soon.

~~~
RhodesianHunter
>I’ve once even heard that computer vision is the most stable way to scrape
Amazon

At a former employer we scraped Amazon many millions of times per day with
very simple old tools that rarely needed updating.

~~~
mxvzr
Are you able to share some details? How often did you have to get new IP
addresses? What about user agents? Were the scrapers "straight to the point"
like amazon2csv (i.e. make a request directly to the search page), or did
they have randomized behavior (e.g. re-use sessions from time to time, click
a random link on the page, start from the homepage...)? Did the scrapers ever
go against Amazon's robots.txt directives (e.g. interacting with the cart
page)? Did you ever hear from Amazon itself about your employer's activities
on their site?

~~~
lapnitnelav
There are services dedicated to scraping which can take care of proxying
your requests so you don't have to worry about IP bans.

For example, Scrapinghub's Crawlera (from the people behind the Scrapy
Python library).
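For a feel of what such a service does, here's a minimal round-robin rotation sketch. The proxy URLs are made up for illustration; a hosted service like Crawlera replaces the whole pool with a single gateway that rotates IPs server-side:

```python
import itertools

# Hypothetical proxy endpoints -- substitute your own pool. A hosted
# service instead exposes one gateway that handles rotation for you.
PROXIES = [
    "http://proxy1.example.com:8000",
    "http://proxy2.example.com:8000",
    "http://proxy3.example.com:8000",
]
_rotation = itertools.cycle(PROXIES)

def next_proxy() -> str:
    """Round-robin over the pool so no single IP carries every request."""
    return next(_rotation)

# Each outgoing request would then be routed through next_proxy(),
# e.g. via the `proxies` argument of an HTTP client.
print(next_proxy())  # http://proxy1.example.com:8000
print(next_proxy())  # http://proxy2.example.com:8000
```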

------
bufferoverflow
I remember trying to build a scraper for Amazon. I quickly discovered that
there are many types of item pages, and they change over time too, probably
due to A/B testing. Just getting the price of a product out of their HTML
markup reliably was a nightmare; I had to build a huge tree of
if-this-then-maybe-that logic.
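One common shape for that "tree" is a fallback chain: try each known page layout in order and return the first hit. A minimal sketch, where the element ids are illustrative rather than Amazon's actual current markup:

```python
import re

# Amazon serves several page variants (A/B tests), so the price can live
# under different ids/classes. These patterns are hypothetical examples.
PRICE_PATTERNS = [
    r'id="priceblock_ourprice"[^>]*>\s*\$?([\d,]+\.\d{2})',
    r'id="priceblock_dealprice"[^>]*>\s*\$?([\d,]+\.\d{2})',
    r'class="a-offscreen"[^>]*>\s*\$?([\d,]+\.\d{2})',
]

def extract_price(html: str):
    """Try each known layout in order; return a float, or None for an
    unrecognized layout (log those and update the scraper later)."""
    for pattern in PRICE_PATTERNS:
        m = re.search(pattern, html)
        if m:
            return float(m.group(1).replace(",", ""))
    return None

# Example against a made-up page fragment:
page = '<span id="priceblock_dealprice">$1,299.99</span>'
print(extract_price(page))  # 1299.99
```

The key point is that an unknown layout degrades to `None` instead of crashing mid-crawl.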

------
AdamRoberts
The company I work for (zinc.io) has this:
[https://zincapi.com/](https://zincapi.com/)

We brand it as an ordering API, but we also offer product data retrieval
(item details/pricing). We put a LOT of engineering resources into data
quality and maintenance, as the API is core to our flagship product, PriceYak.
If you have questions or want a token, email adam@zinc.io and mention this
post.

------
ikeboy
If you're using this for anything serious, it's probably better to sign up
for the Keepa API at about $50/month and let them scrape Amazon for you.
Worth it to not have to deal with the complexities.

------
AdamM12
Nice. In my experience I've found Parsel [1] (used by Scrapy) to be an
easier-to-use HTML parsing library than Beautiful Soup. That's just imo.

[1] [https://github.com/scrapy/parsel](https://github.com/scrapy/parsel)

------
microdrum
Hm, another no-API option (at least if you are on WordPress) is:
[https://wpcommission.com](https://wpcommission.com)

------
alex_sp
So how many calls is one allowed before getting banned? Any guidelines on how
to use this without breaching T&Cs?

------
staticautomatic
Am I the only one who thinks this is rather weird, or at least
unconventional, code for a Python scraper?

~~~
dec0dedab0de
I just took a glance, but nothing seemed too off. Do you care to elaborate?

~~~
staticautomatic
Sure. I'm not really trying to criticize the code, it's just that a lot of
this looks foreign and unconventional to me.

1. requests.Session() is a class. I don't know what requests.session()
invokes (see [https://github.com/tducret/amazon-scraper-
python/blob/master...](https://github.com/tducret/amazon-scraper-
python/blob/master/amazonscraper/client.py#L39)).

2. Isn't one of the points of using Session() that it persists things like
cookies and headers? So why does it re-define the headers multiple times
(e.g. both GET and POST in the same session have their own respective but
identical headers)?

3. Is the use of `arg=""` idiomatic? For example in
[https://github.com/tducret/amazon-scraper-
python/blob/master...](https://github.com/tducret/amazon-scraper-
python/blob/master/amazonscraper/client.py#L70)

4. Using raw list indices without some kind of helper function to catch
index and other errors is not really a good idea when parsing scraped pages
(e.g. `selection[0].text.strip()`).
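On point 4, the usual fix is a small helper that turns a layout change into a default value instead of an IndexError deep in the parse code. A sketch, where `FakeElement` is just a stand-in for a parsed tag (real code would pass in selector results):

```python
def first_text(selection, default=""):
    """Stripped text of the first element in `selection`, or `default`
    when the selector matched nothing or matched a text-less node."""
    try:
        return selection[0].text.strip()
    except (IndexError, AttributeError):
        return default

# Stand-in for a parsed tag, so the sketch is self-contained.
class FakeElement:
    def __init__(self, text):
        self.text = text

print(first_text([FakeElement("  $9.99 ")]))  # $9.99
print(first_text([]))                         # prints the empty default
```

(For what it's worth on point 1, `requests.session()` is a thin function that returns a `Session` instance, kept around for backwards compatibility.)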

~~~
rckclmbr
It's a good thing it's open source: submit a PR!

~~~
staticautomatic
I know this is HN and all but I am not even entirely confident about my own
remarks. I asked "Am I the only one..." earnestly, not as a way of softening
criticism. I'm a self-taught amateur and have never submitted a PR before.

------
kull
It is also illegal to scrape Amazon, since if you scrape it, that means you
don't own this content and you are just stealing product data added to the
site by the products' proper owners.

~~~
zeusk
Why aren't Larry and Sergey behind bars, then? Scraping publicly available
information is far from illegal.

Also, interestingly, only Alibaba's bots are completely blocked from
crawling: [https://www.amazon.com/robots.txt](https://www.amazon.com/robots.txt)

~~~
kull
Check the Amazon API T&Cs; also try doing the same with Craigslist and see
how long they will let you do it. Scraping data is always a shady business if
you do it without the permission of the content owner.

~~~
zeusk
It is anything but shady. They can send you a C&D, or file a suit and seek an
injunction, but there is no way they can get you in trouble with the law for
scraping publicly available data.

