I've had success with a headless Chrome instance in a virtual display (xvfb) driven with Selenium, backed by Postgres. It's as close as you can get to scripting a real browser.
var webPage = require('webpage');
var page = webPage.create();
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36';
You could take the time to build in spoofs for these issues. But for testing (and scraping), you're going to be better off if your headless browser is the same as your GUI browser.
We used to use Selenium + Firefox for the checkout processes (run manually), but it was too much maintenance overhead so we switched to requests+lxml.
We generally find the more "single page app" a website is, the easier it is to scrape, because we can just use the API that's backing the SPA directly, rather than parsing data out of the HTML.
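To make the "use the backing API" point concrete, here is a rough sketch in Python 3 using only the standard library. The endpoint URL and the JSON shape (an `items` list with `name` and `price` keys) are made up for illustration; in practice you'd find the real ones in the browser's network tab. `fetch_items` and its `opener` parameter are hypothetical names, not part of any library.

```python
import json
import urllib.request


def fetch_items(api_url, opener=urllib.request.urlopen):
    """Read (name, price) pairs from a JSON endpoint instead of parsing HTML.

    Assumes a payload like {"items": [{"name": ..., "price": ...}, ...]};
    that shape is invented for this example.
    """
    req = urllib.request.Request(api_url, headers={"Accept": "application/json"})
    with opener(req) as resp:
        payload = json.loads(resp.read().decode("utf-8"))
    return [(item["name"], item["price"]) for item in payload["items"]]
```

Something like `fetch_items("https://shop.example/api/products?page=1")` would then replace a whole set of XPath or CSS selectors, assuming the site really exposes such an endpoint.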
How does it approach throttling or rate limiting? I didn't see this mentioned in the readme examples. Would be nice if there were some simple config to kick requests back into a queue to be re-run once limits aren't exhausted.
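For what it's worth, the kind of thing I have in mind could look roughly like this. It's a Python 3 sketch, not anything from the project: `ThrottledQueue` and all of its parameters are hypothetical, and it assumes your fetch function returns a `(status, body)` tuple and that the server signals rate limiting with HTTP 429.

```python
import time
from collections import deque


class ThrottledQueue:
    """Hypothetical helper: allow at most `max_calls` fetches per `period` seconds."""

    def __init__(self, max_calls, period, max_retries=3,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.max_retries = max_retries
        self.clock = clock      # injectable so the logic can be tested without waiting
        self.sleep = sleep
        self.pending = deque()  # (url, attempts so far)
        self.stamps = deque()   # start times of recent fetches

    def add(self, url):
        self.pending.append((url, 0))

    def _wait_for_slot(self):
        now = self.clock()
        # Forget fetches that have left the sliding window.
        while self.stamps and now - self.stamps[0] >= self.period:
            self.stamps.popleft()
        if len(self.stamps) >= self.max_calls:
            # Window is full: sleep until the oldest fetch expires.
            self.sleep(self.period - (now - self.stamps[0]))
            self.stamps.popleft()

    def run(self, fetch):
        """`fetch(url)` is assumed to return a (status, body) tuple."""
        results, failed = {}, []
        while self.pending:
            url, attempts = self.pending.popleft()
            self._wait_for_slot()
            self.stamps.append(self.clock())
            status, body = fetch(url)
            if status == 429 and attempts < self.max_retries:
                self.pending.append((url, attempts + 1))  # kick back into the queue
            elif status == 429:
                failed.append(url)  # give up after max_retries re-queues
            else:
                results[url] = body
        return results, failed
```

Usage would be something like `q = ThrottledQueue(max_calls=5, period=1.0)`, a few `q.add(url)` calls, then `results, failed = q.run(my_fetch)`. The retry cap is there so a URL that is permanently rate limited can't loop forever.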
Minimal support for caching / ETag / etc would be a nice addition.
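A minimal version of ETag support is just a conditional GET. Here's a sketch in Python 3 with the standard library only; `fetch_cached`, the in-memory dict cache and the `opener` parameter are my own illustrative names, not the project's API. It relies on `urllib.request.urlopen` raising `HTTPError` for a 304 response.

```python
import urllib.error
import urllib.request

_cache = {}  # url -> (etag, body); a real scraper would persist this somewhere


def fetch_cached(url, cache=_cache, opener=urllib.request.urlopen):
    """Conditional GET: send If-None-Match and reuse the cached body on 304."""
    req = urllib.request.Request(url)
    cached = cache.get(url)
    if cached:
        req.add_header("If-None-Match", cached[0])
    try:
        with opener(req) as resp:
            body = resp.read()
            etag = resp.headers.get("ETag")
            if etag:
                cache[url] = (etag, body)  # remember the validator for next time
            return body
    except urllib.error.HTTPError as err:
        if err.code == 304 and cached:
            return cached[1]  # not modified: the server sent no body
        raise
```

The second call to `fetch_cached` for the same URL costs the server almost nothing if the page hasn't changed, which also helps with the load concerns raised elsewhere in the thread.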
Since it is py2, I suppose asyncio is out of the picture.
Is there anything missing that prompted you to reimplement?
I don't believe BS is a full scraping solution; it only does the HTML parsing/querying, doesn't it? In that case, this project actually uses lxml for that part, a relatively well-known alternative to BS.
I highly recommend lxml. The API isn't perfect, but in my experience it's much more powerful than BS, and significantly faster as well. We run custom scrapers for a large number of websites, and apart from a few where we use JSON feeds, the majority use lxml. It has been very useful.
and then scrape whatever is missing or not fresh enough. The scraping process can be quite intense on servers.
Also, generator expressions would make the examples more readable IMO.
self.extend((tag, QuoteMiner(self.geturl(href))) for tag, href in self.acc)