

Crawling Billions of Pages: Building Large Scale Crawling Cluster, part 1 - warrenmar
http://engineering.bloomreach.com/crawling-billions-of-pages-building-large-scale-crawling-cluster-part-1/

======
rb2k_
I guess this fits in here:

Once upon a time I wrote my thesis on building a web crawler. The (tiny) blog
post with an embedded preview:

http://blog.marc-seeger.de/2010/12/09/my-thesis-building-blocks-of-a-scalable-webcrawler/

The PDF itself:

http://blog.marc-seeger.de/assets/papers/thesis_seeger-building_blocks_of_a_scalable_webcrawler.pdf

It's mostly a "this is what I learned and the things I had to take into
consideration" piece, with a few "this is how you identify a CMS" bits
sprinkled in. These days I would probably change a thing or two, but people
tell me it's still an entertaining read. (Not a native speaker though, so the
English might have some stylistic kinks.)

------
jordiburgos
Part 2 is already up:
http://engineering.bloomreach.com/crawling-billions-of-pages-building-large-scale-crawling-cluster-part-2/

------
viraptor
> The Windows operating system can dispatch different events to different
> window handlers so you can handle all asynchronous HTTP calls efficiently.
> For a very long time, people weren’t able to do this on Linux-based
> operating systems since the underlying socket library contained a potential
> bottleneck.

What? select()'s biggest issue is with lots of idle connections, which
shouldn't be a problem when crawling (you can send more requests while waiting
for responses). And epoll() has been available since 2003. What bottlenecks?
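For anyone unfamiliar with the distinction: readiness-based multiplexing is exactly what Python's selectors module exposes (it picks epoll on Linux, kqueue on BSD, falling back to poll/select). A minimal sketch, using a socketpair so it runs without network access; a crawler would register hundreds of non-blocking sockets instead:

```python
import selectors
import socket

# DefaultSelector resolves to EpollSelector on Linux, so waking up
# costs O(ready fds), not O(registered fds) as with select().
sel = selectors.DefaultSelector()

# Stand-in for a crawler socket and the remote server it talks to.
a, b = socket.socketpair()
a.setblocking(False)
b.setblocking(False)

sel.register(a, selectors.EVENT_READ, data="crawler-socket")
b.send(b"HTTP/1.1 200 OK\r\n\r\n")  # pretend the server responded

events = sel.select(timeout=1)  # returns only the fds that are ready
for key, mask in events:
    payload = key.fileobj.recv(4096)
    print(key.data, payload[:15])  # crawler-socket b'HTTP/1.1 200 OK'

sel.close()
a.close()
b.close()
```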

~~~
enigmo
Turns out crawlers spend a lot more (wall clock) time waiting for a complete
response than they do requesting it. However, scheduling is a much (much,
much, much) harder problem to deal with than async I/O, but it's not something
most people here need to worry about.

------
krokoo
The challenges of crawling at large scale still persist, as is evident from
BloomReach and many other companies building custom solutions because the
available open source tools cannot handle the scale such products require.
SQLBot aims to solve this problem. The product is a few weeks from launch. If
anyone is interested:
[http://www.amisalabs.com/AmisaSQLBot.html](http://www.amisalabs.com/AmisaSQLBot.html)

------
exacube
From part 2 of their article:

> Currently, more than 60 percent of global internet traffic consists of
> requests from crawlers or some type of automated Web discovery system.

Where does this number come from, and how accurate is it?

~~~
Axsuul
It currently feels like 60 percent of global internet traffic is actually
crawling their blog right now. Still waiting for it to load.

------
kaivi
I wish there were more articles about determining the frequency at which one
page should be crawled. Some pages never change, some change multiple times
per minute, and we do not want to crawl them all equally often.

~~~
petewailes
This is a problem I've researched fairly extensively the last few months. My
ideal solution looks something like:

* Initial pull

* Secondary pulls x time later, where x doubles each time, up to a maximum value, y

y is the one that's tricky to define. For us, it's a value computed based on
the frequency of update of similar URLs for that domain, the domain as a
whole, similar content, and a few other bits and pieces. Essentially, our
thinking is that if we can understand how alike any page is to another cluster
of pages, we can use their average frequency of update to give reasonably
likely initial values for x, and sensible thresholds for y. We also temper
this with how much change there is, to determine whether the differences are
something we care about.

Obviously, should the system notice that a page's change timings fall well
outside what it'd expect given its assigned cohort, it can then start
adjusting its comparison set. Examples would be a blog category page which
updates so infrequently that it's unusual for its cohort, or a page with a lot
of social feeds on it where there's constant flux.

Works pretty well, but if anyone's got a better solution I'd love to hear of
it.
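The doubling schedule above can be sketched in a few lines. To be clear, the names here (Page, next_interval) and the reset-on-change rule are my own assumptions, not the commenter's actual system; x is the current recrawl interval and y the per-page cap they derive from cohort statistics:

```python
from dataclasses import dataclass

@dataclass
class Page:
    interval: float      # x: current recrawl interval, in seconds
    max_interval: float  # y: cap, e.g. learned from similar pages

def next_interval(page: Page, changed: bool) -> float:
    """Double the interval after an unchanged fetch, capped at y.

    Halving on change (down to a one-minute floor) is an assumed
    policy for pages that turn out to update more often than expected.
    """
    if changed:
        page.interval = max(60.0, page.interval / 2)
    else:
        page.interval = min(page.interval * 2, page.max_interval)
    return page.interval

# Start at one hour, cap at one day.
p = Page(interval=3600, max_interval=86400)
print(next_interval(p, changed=False))  # 7200.0
print(next_interval(p, changed=False))  # 14400.0
print(next_interval(p, changed=True))   # 7200.0 -- back off less
```

The cap y is doing the real work: with a good cohort-derived y, a static page converges to infrequent crawls quickly instead of being polled hourly forever.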

