

Tiny, dirty, iffy, good enough, basic multi-threaded web crawler in Python - rangeva
http://blog.webhose.io/2015/08/12/tiny-basic-multi-threaded-web-crawler-in-python/

======
tokenizerrr
The regex will break on

    <a href='actualLink' _href='spoofedLink'

And will return spoofedLink instead of actualLink, while browsers will follow
actualLink. This is why you shouldn't be trying to parse xml/html with
regexes.
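
A minimal sketch of the failure mode (the markup and the greedy regex below
are illustrative, not the article's exact code): a "find the href" regex can
latch onto the spoofed attribute, while a real HTML parser only sees the
genuine one.

    import re
    from html.parser import HTMLParser

    html = "<a href='actualLink' _href='spoofedLink'>click</a>"

    # A greedy regex backtracks to the *last* href-looking attribute,
    # which here is the spoofed one.
    linkregex = re.compile(r"""<a\s.*href=["']?([^"' >]+)""", re.IGNORECASE)
    print(linkregex.findall(html))   # ['spoofedLink']

    # An HTML parser attributes 'href' and '_href' correctly.
    class LinkExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links.extend(v for k, v in attrs if k == "href")

    extractor = LinkExtractor()
    extractor.feed(html)
    print(extractor.links)           # ['actualLink']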

~~~
adwf
Whilst it's true in principle that you should always parse xml/html with a
proper parser, in practice it can be a little different. On the web you'll
find just as much malformed html that will throw an error in any parser as
you will bad html that will break a regex. As long as you have proper
verification of the link at the end, you'll be safe enough.

Then when you factor in that constructing and extracting links from the DOM
might take 500ms vs. the 5ms of a regex - a 100x increase in crawl speed in
exchange for a little potential dirtiness is more than acceptable;
particularly if you have a million pages to crawl. That's the difference
between a crawl taking 1 week, vs. 2 years...
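
For what it's worth, here is roughly how "regex first, verify at the end"
could look (the pattern and the checks are illustrative, not a specific
recommendation):

    import re
    from urllib.parse import urljoin, urlparse

    HREF_RE = re.compile(r"""href\s*=\s*["']([^"']+)["']""", re.IGNORECASE)

    def extract_links(base_url, html):
        """Fast-and-dirty extraction: regex first, sanity checks afterwards."""
        links = set()
        for candidate in HREF_RE.findall(html):
            url = urljoin(base_url, candidate)
            parsed = urlparse(url)
            # The verification step: only keep things that look crawlable.
            if parsed.scheme in ("http", "https") and parsed.netloc:
                links.add(url)
        return links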

And anyway, all web crawls are inevitably going to end up dirty and scrappy,
that's just the nature of the web. I once crawled a website that returned
"_ontent-Type" instead of "Content-Type", making it impossible to tell what
kind of file was about to be downloaded by my polite, "text only" crawler.

------
sebcat
Crawling can be broken down into:

    1) fetching resources
    2) finding out what new resources to fetch

1) is a network-bound problem, 2) is mostly disk/CPU bound. Realizing the
difference between these two things and separating them is the key to
building a good crawler.
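
A structural sketch of that separation, assuming a simple two-queue design
with stdlib pieces (the worker counts and the regex are placeholders, and
there's no dedup or politeness here, just the two-stage shape):

    import queue
    import re
    import threading
    import urllib.request

    HREF_RE = re.compile(r"""href\s*=\s*["']([^"']+)["']""", re.IGNORECASE)

    to_fetch = queue.Queue()   # 1) URLs waiting to be downloaded (network bound)
    to_parse = queue.Queue()   # 2) fetched bodies awaiting extraction (CPU/disk bound)

    def fetcher():
        # Spends nearly all of its time waiting on the network.
        while True:
            url = to_fetch.get()
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    to_parse.put(resp.read().decode("utf-8", "replace"))
            except Exception:
                pass
            to_fetch.task_done()

    def parser():
        # CPU/disk bound; can be scaled independently of the fetchers.
        while True:
            body = to_parse.get()
            for link in HREF_RE.findall(body):
                to_fetch.put(link)
            to_parse.task_done()

    # Many fetchers, few parsers - illustrative numbers only.
    for _ in range(20):
        threading.Thread(target=fetcher, daemon=True).start()
    for _ in range(2):
        threading.Thread(target=parser, daemon=True).start()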

Depending on how you find out what resources to fetch (parsing static
documents vs. dynamic JS analysis with multiple dependencies on other
resources, included JS &c), "good-enough" crawlers are mostly bound by the
network.

I've seen people running 1 crawl/process on their back-end and some management
guy saying "we need to crawl faster, add more threads per crawl" when one
crawl cycle spends 10x more time waiting on the network than it does parsing
a document.

~~~
tokenizerrr
If you're crawling different origins in parallel then you could gain a boost
in speed by threading. It's usually bound by the network/processing on the
receiving side. Your server can probably handle several outgoing connections
at once.
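
Something like this is the gain in practice - parallel downloads across
origins with a plain thread pool (the seed URLs are hypothetical):

    import concurrent.futures
    import urllib.request

    # Hypothetical seeds, one per origin, fetched in parallel threads.
    seeds = [
        "http://example.com/",
        "http://example.org/",
        "http://example.net/",
    ]

    def fetch(url):
        with urllib.request.urlopen(url, timeout=10) as resp:
            return url, resp.read()

    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        for url, body in pool.map(fetch, seeds):
            print(url, len(body), "bytes")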

~~~
sebcat
You can have several outgoing connections open in a single thread at the
same time.

Threading is for CPU-bound problems that require a shared memory space.

I/O multiplexing and/or cooperative multitasking means you can have
thousands of concurrent connections per thread.
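
For example, with asyncio the concurrency lives in coroutines rather than
threads; everything below runs on a single thread (the hosts and the plain
HTTP/1.0 request over port 80 are purely illustrative):

    import asyncio

    async def fetch(host, path="/"):
        # One coroutine per connection; all of them share one thread.
        reader, writer = await asyncio.open_connection(host, 80)
        writer.write(f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode())
        await writer.drain()
        body = await reader.read()
        writer.close()
        return host, len(body)

    async def main():
        hosts = ["example.com", "example.org", "example.net"]
        results = await asyncio.gather(*(fetch(h) for h in hosts))
        for host, size in results:
            print(host, size, "bytes")

    asyncio.run(main())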

~~~
tokenizerrr
Sure. The parent comment referred to one download per thread so I did as well.

------
anc84
I highly recommend you check out
[https://github.com/chfoo/wpull](https://github.com/chfoo/wpull)

------
roma1n
Nice to see a tiny, useful code example.

------
emilssolmanis
Also happens to parse XML with regexes. Lovely.

