

Building blocks of a scalable webcrawler. - 0x44
http://blog.marc-seeger.de/2010/12/09/my-thesis-building-blocks-of-a-scalable-webcrawler

======
shrikant
IIRC, sriramk from around here (<http://news.ycombinator.com/user?id=sriramk>)
had also 'rolled his own' web-crawler as a project in college about 5-6 (?)
years back. He blogged about it fairly actively back then, and I really
enjoyed following his journey (esp. when after months of dev and testing, he
finally 'slipped it into the wild'). Tried to dredge up those posts, but he
seems to have taken them down :( A shame really - they were quite a
fascinating look at the early-stage evolution of a programmer!

Sriram, you around? ;)

~~~
sriramk
Thanks but mine was definitely a toy. I think I got it to around 100K pages or
so but that's about it (seemed like a big deal back then).

You can see some of those posts here
(<http://web.archive.org/web/20041206230457/www.dotnetjunkies.com/weblog/sriram/>).
Quite embarrassing to see the quality of my output from back then.

Basically, I did the following:

- Pull down dmoz.org's datasets (not sure whether I crawled it or whether
they had a dump - I think the latter)

- Spin up crawlers (implemented in C# at the time) on various machines,
writing to a central repo. The actual design of the crawler was based on
Mercator (check out the paper on citeseer)

- Use Lucene to construct TF.IDF indices on top of the repository (there's a
toy sketch of the weighting after this list)

- Throw up a nice UI (with the search engine name spelled out in a
Google-like font). The funny part is that this probably impressed the people
evaluating the project more than anything else.
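
Lucene does all of this internally; purely for intuition, here's a toy Python
sketch of the TF.IDF weighting such an index is built on. The
tokenized-pages input format is just an illustration, not how the repository
actually stored things:

    import math
    from collections import Counter

    def tf_idf(pages):
        # pages: dict mapping url -> list of tokens from that page
        n = len(pages)
        # Document frequency: how many pages does each term appear in?
        df = Counter()
        for tokens in pages.values():
            df.update(set(tokens))
        # Weight each term in each page by term frequency * inverse doc frequency.
        index = {}
        for url, tokens in pages.items():
            tf = Counter(tokens)
            index[url] = {term: (count / len(tokens)) * math.log(n / df[term])
                          for term, count in tf.items()}
        return index

    index = tf_idf({"a.html": "web crawler web".split(),
                    "b.html": "web search".split()})

Terms that appear on every page (like "web" above) get weight zero, which is
exactly why TF.IDF works for ranking.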

I did do some cool hacks around showing a better snippet than Google did at
the time but I just didn't have the networking bandwidth to do anything
serious. Fun for a college project.

The funny thing is that a startup involved in search contacted me a few
weeks ago precisely because of this project. I had to tell that person how
much of a toy it was :)

~~~
rb2k_
Do you remember how fast the "toy" was? (pages/second, domains/s, ...) :)

~~~
sriramk
Not really, but given the terrible hardware/network connectivity, the numbers
wouldn't mean much now.

Because of this thread, I looked through my old backups and I actually still
have the code. Should get it working again sometime.

~~~
sandGorgon
Are you gonna put up your code?

It would be interesting to see how to think through building a crawler (as
opposed to downloading Nutch and trying to grok it)

------
rb2k_
Uh, look what the cat dragged in: my thesis :)

Hope some of you enjoy the read. I'm open to comments and criticism.

~~~
arkitaip
Very timely and interesting. I am currently looking for a crawler that
integrates tightly with Drupal and that can be easily managed through Drupal
nodes.
Any suggestions on a solution for a small site that only needs to handle
thousands of pages/urls?

~~~
toumhi
scrapy (<http://scrapy.org/>) is a well-documented and open source python
scraping framework that I've used in a couple of projects.
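
A minimal spider with a recent Scrapy version looks something like this (the
spider name and start URL are placeholders):

    import scrapy

    class SiteSpider(scrapy.Spider):
        name = "site"
        start_urls = ["http://example.com/"]

        def parse(self, response):
            # Emit the page itself, then queue every outgoing link.
            yield {"url": response.url, "title": response.css("title::text").get()}
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)

You can run it without even creating a project: `scrapy runspider
site_spider.py -o pages.json`. Scrapy handles deduplication, throttling and
robots.txt for you.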

~~~
rb2k_
Indeed, seems like a great framework.

Given the project's time frame, I had to rely on something I'm pretty
OK at (Ruby), but I remember hitting a lot of posts about Scrapy along the way.

------
yesno
I like Ted Dziuba's solution:

<http://teddziuba.com/2010/10/taco-bell-programming.html>

Full-stack programmer at work!

~~~
rb2k_
I loved his approach too, but if you want to end up with something you can
freely query for arbitrary properties, it gets a little tiresome with just
bash and *nix tools :)

------
inovica
A good read and very timely from my perspective. We created a crawler in
Python a couple of years ago for RSS feeds, but we ran into a number of issues
with it, so we put it on hold while we concentrated on work that made money :)
We started looking at the project again last week and have been weighing
rolling our own against frameworks like Scrapy. The main thing for us is being
able to scale. If anyone has experience building a distributed crawler in
Python, I'd welcome some advice.

Thanks again. Really good post

~~~
rb2k_
After having written the thesis and thought about this stuff for another few
weeks, my summary would be:

- Use asynchronous I/O to maximize single-node speed (twisted should be a
good choice for Python; there's a small sketch after this list). It might
feel strange in the beginning, but it usually pays off, especially with
languages that aren't good at threading (Ruby, Python, ...).

- Redis is awesome! Fast, functional, beautiful :)

- Riak seems to be a great distributed datastore if you really have to scale
over multiple nodes.

- Solr and Sphinx are just better optimized for full-text search than most
datastores.

- Take a day to look at graph databases (I'm still not 100% sure if I could
have used one for my use cases).
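
To make the async-I/O point concrete, here's a rough sketch using getPage
from the Twisted of that era (newer Twisted replaced it with Agent/treq). The
URL list and the stop-when-done logic are placeholders, not a real frontier:

    from twisted.internet import defer, reactor
    from twisted.web.client import getPage

    def handle_page(body, url):
        # Parse/store the page here; for now, just report its size.
        print(url, len(body))

    def handle_error(err, url):
        print("failed:", url, err.getErrorMessage())

    def crawl(urls):
        # Fire off all requests at once; the reactor multiplexes them on a
        # single thread instead of blocking on each response in turn.
        ds = []
        for url in urls:
            d = getPage(url)
            d.addCallback(handle_page, url)
            d.addErrback(handle_error, url)
            ds.append(d)
        # Stop the reactor once every request has succeeded or failed.
        defer.DeferredList(ds).addCallback(lambda _: reactor.stop())

    crawl(["http://example.com/", "http://example.org/"])
    reactor.run()

The win over a thread-per-request design is that a few hundred in-flight
requests cost almost nothing, which is exactly what a crawler spends its
time on.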

~~~
inovica
Thanks for the tips! I really appreciate it. I'll check these out. All getting
very exciting for my Christmas project!

------
richcollins
I'm having good luck using node.js's httpClient and vertex.js for crawl state
/ persistence.

~~~
rb2k_
Oh, node.js is definitely a great direction to go!

One of my problems was that a lot of the "usual" libraries are written in a
synchronous/blocking manner behind the scenes. This is something that the
node.js ecosystem would probably solve right from the start.

The downside of a relatively new library like httpClient is that it's missing
things like automatically following redirects. While this can be implemented
in the crawler code, it complicates things.
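
For what it's worth, the manual version is only a few lines; a rough Python
sketch (plain HTTP only, a hop limit instead of real loop detection):

    import http.client
    from urllib.parse import urljoin, urlparse

    def fetch(url, max_hops=5):
        # Follow 3xx responses by hand, the way a crawler has to when its
        # HTTP client doesn't do it automatically.
        for _ in range(max_hops):
            parts = urlparse(url)
            conn = http.client.HTTPConnection(parts.netloc)
            path = parts.path or "/"
            if parts.query:
                path += "?" + parts.query
            conn.request("GET", path)
            resp = conn.getresponse()
            if resp.status in (301, 302, 303, 307, 308):
                # Location may be relative; resolve it against the current URL.
                url = urljoin(url, resp.getheader("Location"))
                conn.close()
                continue
            return url, resp.read()
        raise RuntimeError("too many redirects: " + url)

Returning the final URL alongside the body matters for a crawler, since
that's the canonical address you want to deduplicate on.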

How big are the datasets that vertex.js/tokyo cabinet is able to handle for
you?

Node.js is on the list of things I'd like to play with a bit more (just like
Scala, Erlang, graph databases, mirah, ...). Is your crawler's source code
available by any chance?

~~~
richcollins
My dataset is still small, but you can scale a single TC db to nearly
arbitrary size (8EB). It can also write millions of kv pairs / second.

Vertex.js can't quite keep up with TC since it's written in JavaScript.
However, it does let you batch writes into logical transactions, which you
can use to get fairly high throughput.
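
Vertex.js's actual API aside, the batching idea is generic; here's the same
pattern with Python's sqlite3 as a stand-in (one transaction per batch
instead of one per write):

    import sqlite3

    conn = sqlite3.connect("pages.db")
    conn.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")

    def store_batch(pairs):
        # One transaction per batch: the durable commit is paid once per
        # batch instead of once per key/value pair.
        with conn:
            conn.executemany("INSERT OR REPLACE INTO kv VALUES (?, ?)", pairs)

    store_batch([("http://example.com/", "<html>...</html>"),
                 ("http://example.org/", "<html>...</html>")])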

The source isn't open as it's fairly specific to my app,
<http://luciebot.com/>. I'd be happy to chat about the details without
releasing the source. richcollins@gmail.com / richcollins on freenode.

~~~
rb2k_
Be sure to check out this post:
<http://stackoverflow.com/questions/1051847/why-does-tokyo-tyrant-slow-down-exponentially-even-after-adjusting-bnum>

I did some experimentation with tokyo* and experienced that slowdown myself. I
just didn't want to disable journaling in the end...

~~~
richcollins
Thanks -- I've seen that. I'm just going to make frequent backups and hope
that lack of journaling doesn't bite me in the ass o_O

------
nl
Can someone please explain what FPGA-aware garbage collection is?

~~~
rb2k_
Straight from the man himself: <http://buytaert.net/files/fpl05-paper.pdf>

Abstract:

During codesign of a system, one still runs into the impedance mismatch
between the software and hardware worlds. This paper identifies the different
levels of abstraction of hardware and software as a major culprit of this
mismatch. For example, when programming in high-level object-oriented
languages like Java, one has objects, methods, and memory management at one's
disposal, which facilitates development, but these have to be largely
abandoned when moving the same functionality into hardware. As a solution,
this paper presents a virtual machine, based on the Jikes Research Virtual
Machine, that is able to bridge the gap by providing the same capabilities to
hardware components as to software components. This seamless integration is
achieved by introducing an architecture and protocol that allow reconfigurable
hardware and software to communicate with each other in a transparent manner,
i.e. no component of the design needs to be aware whether other components are
implemented in hardware or in software. Further, in this paper we present a
novel technique that allows reconfigurable hardware to manage dynamically
allocated memory. This is achieved by allowing the hardware to hold references
to objects and by modifying the garbage collector of the virtual machine to be
aware of these references in hardware. We present benchmark results that show,
for four different, well-known garbage collectors and for a wide range of
applications, that a hardware-aware garbage collector results in a marginal
overhead and is therefore a worthwhile addition to the developer's toolbox.

