
Ask HN: Advanced web crawling resources? - throwawayasdasd
Does anyone know any good resources for advanced web scraping (scraping at scale, getting around various tricks to prevent crawling, etc.)?

I've looked around a lot, but nearly all the resources I find are the same: a short description, a small code snippet, and that's it.

I'm really looking for more.
======
pesfandiar
Scrapinghub writes some useful blog posts at
[https://blog.scrapinghub.com/](https://blog.scrapinghub.com/). They obviously
revolve around Scrapinghub's own frameworks and services, so they may not be very
useful in your case.

------
bootcat
Here's a sample crawler I wrote to harvest Yelp results; feel free to study how
it was written. It might not work anymore, since Yelp may have made cosmetic
changes, but the general approach should help you write one of your own:
[https://github.com/deepanprabhu/yelp-crawler](https://github.com/deepanprabhu/yelp-crawler)

I also have a more advanced scraper that can harvest AJAX-heavy sites like
[http://venture-capital-firms.findthecompany.com/](http://venture-capital-firms.findthecompany.com/).
I scraped their entire site using a Chrome plugin that exported results through
a web server. It's a fairly complex procedure, since you have to be inside a
live browser to hijack the results; that VC site even blocks headless browsers,
so it was tricky.

I can share the code in case you're interested. Scaling up scraping is an
interesting process in itself.
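For reference, here's a minimal sketch of the receiving end of that kind of setup: a tiny local collector that a browser plugin could POST scraped records to as JSON. The port, path, and output file are made up for illustration and aren't taken from my actual code.

```python
# Minimal collector: a plugin running in a live browser session POSTs each
# scraped record here as JSON, and we append the records to a file on disk.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class CollectHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        record = json.loads(self.rfile.read(length))
        with open("scraped.jsonl", "a") as f:
            f.write(json.dumps(record) + "\n")
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    # The browser extension would be pointed at http://localhost:8000/
    HTTPServer(("localhost", 8000), CollectHandler).serve_forever()
```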

------
rguillaume
Hi there,

You can start by reading this article about the BFS (breadth-first search) algorithm:
[https://fr.khanacademy.org/computing/computer-science/algori...](https://fr.khanacademy.org/computing/computer-science/algorithms/breadth-first-search/a/the-breadth-first-search-algorithm)

I built a personal web crawler using PHP, Redis, and Gearman on a single (personal)
computer with many VMs to emulate AWS instances, and it works great! You could
surely improve on this by using technologies other than PHP (Python, C, Node.js)
and Gearman (Kafka, RabbitMQ).
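For what it's worth, here's a minimal in-memory BFS crawl sketch in Python (not PHP) to illustrate the idea; in a setup like the one above, the frontier queue and the visited set would live in Redis/Gearman rather than in process memory, and the link extraction would use a real HTML parser.

```python
# Breadth-first crawl: pages are visited level by level from a seed URL,
# because the frontier is a FIFO queue.
import re
from collections import deque

import requests

def bfs_crawl(seed, max_pages=100):
    frontier = deque([seed])   # FIFO queue -> breadth-first order
    seen = {seed}
    crawled = 0
    while frontier and crawled < max_pages:
        url = frontier.popleft()
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue
        crawled += 1
        # crude link extraction, just for the sake of the sketch
        for link in re.findall(r'href="(https?://[^"]+)"', html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
        yield url, html

for url, html in bfs_crawl("https://example.com"):
    print(url, len(html))
```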

Hope this helps

------
cond289123
I wrote this for sites with paging:
[https://github.com/indatawetrust/reporter](https://github.com/indatawetrust/reporter)
It pulls the data according to the properties you ask for and saves it to a JSON
file. It's not very polished, but it can be improved if you wish.
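In the same spirit, here's a rough sketch of paging through a site and saving only the fields you care about; the URL pattern and CSS selectors below are invented for illustration and aren't taken from the repo above.

```python
# Walk numbered pages, extract the desired properties, dump everything to JSON.
import json

import requests
from bs4 import BeautifulSoup

def scrape_pages(base_url, pages, item_selector, field_selectors):
    records = []
    for page in range(1, pages + 1):
        html = requests.get(f"{base_url}?page={page}", timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        for item in soup.select(item_selector):
            record = {}
            for field, sel in field_selectors.items():
                el = item.select_one(sel)
                record[field] = el.get_text(strip=True) if el else None
            records.append(record)
    return records

# Hypothetical target and selectors -- adjust them to the site you're scraping.
data = scrape_pages("https://example.com/listings", pages=5,
                    item_selector=".listing",
                    field_selectors={"title": "h2", "price": ".price"})
with open("results.json", "w") as f:
    json.dump(data, f, indent=2)
```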

------
z3t4
Scraping is only one part. How are you going to categorize, store, and search
the data?

~~~
bruno2223
Yes, indeed, scraping is the easiest part.

Saving everything in a way that lets you use it later is much harder (and more
expensive), IMHO.

~~~
dewey
I'd argue that this depends heavily on the type of data you scrape and what you
want to do with it.

If you have a good data model, then categorizing, storing, and searching the
final result isn't a big problem, and scraping is the complicated part. If you
don't have a specific kind of resource you're scraping and just dump everything
into some storage solution with no structure, that's going to be the hard part,
while scraping is the easy part.

~~~
z3t4
In theory, say you want to index one billion (10^9) web sites. With modern
hardware you should be able to crawl about 10,000 web pages per second, which
would take roughly 30 hours, and if you save 1 kB of text from each site, that
would be about 1 TB of data. Doing a text search over 1 TB of text would still
take some time, maybe minutes, though you could partition the data across
servers.
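A quick back-of-the-envelope check of those numbers:

```python
pages = 10**9        # one billion pages
rate = 10_000        # pages crawled per second
per_page = 1_000     # ~1 kB of text kept per page

hours = pages / rate / 3600
terabytes = pages * per_page / 10**12
print(f"{hours:.1f} hours, {terabytes:.1f} TB")   # ~27.8 hours, 1.0 TB
```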

