

Caterpillar - A PHP Web Crawler using parallel requests - jqueryin
http://www.jqueryin.com/caterpillar-the-php-site-crawler/

======
jqueryin
I created this library a while back to be called from the CLI as a cron job. Its
sole purpose is to crawl your entire domain and build a database of all
pages, inbound link counts, last modified times, etc. That data can then be
used by a separate script for statistics or for generating a sitemap XML
file. The inbound link counts let you assign priorities in your sitemap
file, and the last modified times give a fairly accurate picture of when each
page's content was last updated. This is a huge step up for dynamic sites
that don't otherwise track modification timestamps.
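To illustrate the idea (this is a hedged sketch, not Caterpillar's actual code), here is roughly how crawl results could be turned into a sitemap, with priorities scaled by inbound link count. The `buildSitemap` function and the `$pages` array shape are assumptions for the example; in the real library this data would come from the crawler's database.

```php
<?php
// Sketch only: assumes crawl results are already loaded as rows of
// ['url', 'inbound', 'modified'] instead of being read from the database.
function buildSitemap(array $pages): string
{
    // Scale priorities by inbound links: the most-linked page gets 1.0.
    $maxInbound = max(array_column($pages, 'inbound'));

    $doc = new DOMDocument('1.0', 'UTF-8');
    $doc->formatOutput = true;
    $urlset = $doc->createElement('urlset');
    $urlset->setAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');
    $doc->appendChild($urlset);

    foreach ($pages as $page) {
        $url = $doc->createElement('url');
        // Note: URLs containing '&' etc. would need escaping in real code.
        $url->appendChild($doc->createElement('loc', $page['url']));
        $url->appendChild($doc->createElement('lastmod', gmdate('Y-m-d', $page['modified'])));
        $priority = $maxInbound > 0 ? $page['inbound'] / $maxInbound : 0.5;
        $url->appendChild($doc->createElement('priority', sprintf('%.1f', $priority)));
        $urlset->appendChild($url);
    }
    return $doc->saveXML();
}

$pages = [
    ['url' => 'http://example.com/',      'inbound' => 40, 'modified' => 1262304000],
    ['url' => 'http://example.com/about', 'inbound' => 10, 'modified' => 1262304000],
];
$xml = buildSitemap($pages);
echo $xml;
```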

The largest site I tested this on had ~500 pages. I recommend setting
memory_limit to at least 32MB in php.ini, as the crawler can be fairly
memory intensive when it spawns 5 parallel processes for crawling. I did some
fairly extensive optimizations to keep memory usage down; if you spot
anything that could be improved, please let me know.
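For anyone curious how the parallel-request part of such a crawler typically works, here is a minimal sketch using PHP's curl_multi API. This is a generic illustration of the technique, not Caterpillar's implementation; the `fetchParallel` helper is a name invented for the example, and the demo deliberately fetches local file:// URLs so it runs without network access (assuming your curl build allows the file protocol).

```php
<?php
// Sketch: fetch a batch of URLs concurrently and return bodies keyed by URL.
function fetchParallel(array $urls): array
{
    $multi = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until every handle has finished.
    do {
        $status = curl_multi_exec($multi, $active);
        if ($active && curl_multi_select($multi) === -1) {
            usleep(1000); // avoid busy-looping if select() is unsupported
        }
    } while ($active && $status === CURLM_OK);

    $results = [];
    foreach ($handles as $url => $ch) {
        $results[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($multi, $ch);
        curl_close($ch);
    }
    curl_multi_close($multi);
    return $results;
}

// Demo without the network: two temp files served over file:// URLs.
$a = tempnam(sys_get_temp_dir(), 'cat');
$b = tempnam(sys_get_temp_dir(), 'cat');
file_put_contents($a, 'page A');
file_put_contents($b, 'page B');
$results = fetchParallel(['file://' . $a, 'file://' . $b]);
```

A real crawler would layer URL deduplication, same-domain filtering, and link extraction on top of this loop, releasing each handle as soon as its body has been parsed to keep memory usage down.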

