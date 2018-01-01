I've looked around a lot but nearly all resources I find are the same. A short description, a small code snippet and that's it.
I'm really looking for more.
I also have an advanced scraper that can harvest AJAX-heavy sites like http://venture-capital-firms.findthecompany.com/. I scraped their whole site using a Chrome plugin that exports results through a web server. It's kind of a complex procedure, as we have to be inside a live browser to hijack their results. The VC site even rejects headless browsers, so it was tricky.
I can share the code, in case you are interested.
And scaling scraping is an interesting process.
You can start by reading this article about the BFS algorithm: https://fr.khanacademy.org/computing/computer-science/algori...
I built a personal web crawler using PHP, Redis and Gearman on a single (personal) computer, with many VMs to emulate AWS instances, and it works great! You could surely improve on this by swapping PHP for another language (Python, C, Node.js) and Gearman for something like Kafka or RabbitMQ.
Hope this helps
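To make the BFS idea a bit more concrete, here is a minimal sketch in Python of a breadth-first crawl frontier. It is not the PHP/Redis/Gearman crawler described above, just an illustration of the algorithm. The names bfs_crawl, fetch_links and FAKE_SITE are made up for this example, and the "site" is an in-memory dict so the sketch runs without touching the network. In a real crawler fetch_links would download the page and extract its links, and the queue and the seen set would likely live in something shared like Redis.

```python
from collections import deque
from urllib.parse import urljoin, urldefrag


def bfs_crawl(start_url, fetch_links, max_pages=100):
    """Visit pages breadth-first and return the URLs in visit order.

    fetch_links(url) is a hypothetical callback that returns the hrefs
    found on a page; it is an assumption of this sketch.
    """
    seen = {start_url}          # URLs already queued, to avoid revisiting
    queue = deque([start_url])  # FIFO queue is what makes this breadth-first
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for href in fetch_links(url):
            # Resolve relative links and drop any #fragment part
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return order


# Toy link graph standing in for a real website (hypothetical data)
FAKE_SITE = {
    "http://example.test/": ["/a", "/b"],
    "http://example.test/a": ["/c", "/"],
    "http://example.test/b": ["/c"],
    "http://example.test/c": [],
}

if __name__ == "__main__":
    for page in bfs_crawl("http://example.test/", lambda u: FAKE_SITE.get(u, [])):
        print(page)
```

Pages closest to the start URL are visited first, and max_pages keeps the crawl bounded.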
Saving everything in a way that lets you use it later is much harder (and more expensive), IMHO.
If you have a good data model, then categorizing, storing and searching the final result isn't a big problem, and the scraping is the complicated part. If you aren't scraping a specific kind of resource and just dump everything into some storage solution with no structure, that's going to be the hard part, while scraping is the easy part.
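As a rough illustration of the "good data model" case, here is a small Python sketch using only the standard library (sqlite3 and dataclasses). The Firm record, its fields and the firms table are all hypothetical, picked just for the example. The point is that once each scraped item has a known shape and a natural key, storing and searching it is short work.

```python
import sqlite3
from dataclasses import dataclass, astuple


@dataclass
class Firm:
    # Hypothetical fields for one scraped record
    url: str
    name: str
    city: str


def open_store(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS firms ("
        "url TEXT PRIMARY KEY, name TEXT NOT NULL, city TEXT)"
    )
    return conn


def save(conn, firm):
    # Keyed by URL, so re-scraping a page replaces the old row
    # instead of creating a duplicate
    conn.execute(
        "INSERT OR REPLACE INTO firms (url, name, city) VALUES (?, ?, ?)",
        astuple(firm),
    )
    conn.commit()


def find_by_city(conn, city):
    rows = conn.execute(
        "SELECT url, name, city FROM firms WHERE city = ? ORDER BY name",
        (city,),
    )
    return [Firm(*row) for row in rows]


if __name__ == "__main__":
    conn = open_store()
    save(conn, Firm("http://example.test/acme", "Acme Capital", "Paris"))
    save(conn, Firm("http://example.test/acme", "Acme Capital Partners", "Paris"))
    print(find_by_city(conn, "Paris"))
```

Saving the same URL twice leaves a single, updated row, which is the kind of thing you get for free with structure and have to rebuild by hand after dumping raw pages into unstructured storage.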
Anyway, I'm a noob, but from reading here and there, this is what I decided to use.
For scraping I'm using Scrapy + Selenium and a modified JS script that uses Chrome (webscraper.io).