

Ask YC: Help with how to save data from crawler - groovyone

Hi all!

We are creating a web spider/crawler for experimenting with classification of sites. We are at a point where we can't make a decision, and I was hoping someone out there who has done this kind of thing before might help us make it.

Here's what we are doing as an experiment:

1. Taking 1-2 million domain names and crawling the index page plus approx 10 internal pages, based on whatever links we get.

2. The above will be done in a polite way so as not to overload the target servers, and the spider will have a link back to us.

3. We want to run 3-4 downloaders on individual machines, using either Twisted or Pyro to do this.

The above bit we're OK with, and it's done. We think we have two options for the next stage. Either:

- push all downloaded data into mysql for our parser machine to access and parse/classify, or

- each downloader saves the data as files on its own HD, and our parsing machine pulls that data across the network.

I can't find easy information about how search engines save their data (into a database or just locally to the filesystem), and we feel that picking the correct path is fundamental at this point.

Any help or advice appreciated. Even criticism :)

John
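To make the second option concrete, here's roughly the sketch we have in mind for each downloader: write every fetched page to local disk, grouped by domain, with the file name derived from a hash of the URL. (The directory name and helper are invented for illustration.)

```python
# Rough sketch of the "save to local HD" option. PAGES_DIR and
# save_page are made-up names for this example.
import hashlib
import os
from urllib.parse import urlparse

PAGES_DIR = "pages"  # local spool directory on the downloader box

def save_page(url, html):
    """Write one fetched page to the local spool, grouped by domain."""
    subdir = os.path.join(PAGES_DIR, urlparse(url).netloc)
    os.makedirs(subdir, exist_ok=True)
    # hash the URL so the file name is always filesystem-safe
    name = hashlib.md5(url.encode()).hexdigest() + ".html"
    path = os.path.join(subdir, name)
    with open(path, "w") as f:
        f.write(html)
    return path

path = save_page("http://example.com/index.html", "<html>hello</html>")
```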
======
aristus
When you are working on big problems, it's sometimes easy to let yourself get
stuck on some unimportant decision. Usually it's a sign that you are unsure of
something more important but you don't want to think about it.

If you just want to run an experiment on 10M pages, then use whatever you feel
comfortable with. The important thing is NOT files vs sql but whether your
classification idea is worth spending time on. Who cares if it's inefficient?
That's not what your experiment is about.

------
jws
Smells like 500GB of data. I'd keep the crawled data in filesystems on the
crawling boxes. Then you can load your mysql database, and when the load fails
because <<insert-unforeseeable-circumstance>> you can take another shot at it
from your raw data.

After you resign yourself to working with a subset of the data in mysql, you
will learn how to compute what you really want to know. Then you can write a
fast processor that just scans the spooled data on your crawl machines and
puts the computed results into the database instead of the raw pages.

[[edit: maybe 500GB instead of 5TB, got a little crazy on my zero key in bc]]
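Something like this toy "fast processor" is what I mean: walk the spooled files, compute only the fields you care about, and load those rows instead of raw HTML. (sqlite stands in for mysql here; the schema, paths, and the crude title grab are all invented for the example.)

```python
# Scan spooled pages on disk and load computed summaries, not raw HTML.
import os
import sqlite3

def load_summaries(spool_dir, db_path):
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (path TEXT, nbytes INTEGER, title TEXT)")
    for root, _dirs, files in os.walk(spool_dir):
        for name in files:
            path = os.path.join(root, name)
            html = open(path).read()
            # stand-in "processing": page size plus a crude <title> grab
            title = ""
            if "<title>" in html:
                title = html.split("<title>", 1)[1].split("</title>", 1)[0]
            conn.execute("INSERT INTO pages VALUES (?, ?, ?)",
                         (path, len(html), title))
    conn.commit()
    return conn

# tiny demo spool with one page
os.makedirs("spool/example.com", exist_ok=True)
with open("spool/example.com/index.html", "w") as f:
    f.write("<html><title>Example</title></html>")
conn = load_summaries("spool", ":memory:")
rows = conn.execute("SELECT title FROM pages").fetchall()
```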

~~~
pz
I agree that initially you should dump it to a local filesystem. Since this is
an experiment you don't want to get bogged down in DB performance details.

Also, if HD space is a concern, occasionally tar/zip up a bunch of the data.
HTML is very redundant and I'd bet you could squeeze 500GB of HTML down to <
50GB, even more if you have a lot of pages from the same site.
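A quick zlib check shows how well boilerplate-heavy markup compresses (the sample string here is made up, but real HTML with repeated table/markup structure behaves similarly):

```python
# Compress a highly redundant chunk of HTML-ish text and compare sizes.
import zlib

html = ("<tr><td class='cell'>row</td></tr>\n" * 1000).encode()
packed = zlib.compress(html)
ratio = len(html) / len(packed)  # well over 10x on repetitive markup
```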

Really, a lot of this depends on what resources you have available and how you
want to process the data later on. If you are classifying pages independently
of one another then why bother pooling them to a centralized DB? Just run your
classifier on each node and pool those results instead.

An alternative solution is S3, which I've used for crawling storage before.
It's not ideal for data processing, since you have to constantly pull data
over the network, but it's an easy way to get centralized storage.

------
yourabi
Take a look at what is out there.

If you run a simple crawl with Heritrix (as an example) you'll notice it
stores everything in 'ARC' files, which are basically concatenated
gzip-compressed records with an index to access individual records (via
offsets).
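A toy version of that idea (class name and layout invented; real ARC files also persist the index and keep per-record headers): append compressed records to one big file and keep (offset, length) pairs so any record can be read back with a seek instead of a scan.

```python
# Minimal ARC-style record store: one append-only archive file plus an
# in-memory index of where each compressed record lives.
import zlib

class RecordStore:
    def __init__(self, path):
        self.path = path
        self.index = {}  # url -> (offset, length) of the compressed blob
        open(path, "wb").close()  # start a fresh archive

    def append(self, url, data):
        blob = zlib.compress(data)
        with open(self.path, "ab") as f:
            offset = f.tell()
            f.write(blob)
        self.index[url] = (offset, len(blob))

    def get(self, url):
        offset, length = self.index[url]
        with open(self.path, "rb") as f:
            f.seek(offset)
            return zlib.decompress(f.read(length))

store = RecordStore("crawl.dat")
store.append("http://a.example/", b"<html>a</html>")
store.append("http://b.example/", b"<html>b</html>" * 50)
```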

I would avoid sticking everything in a database, although you could probably
get away with it -- but I agree with aristus that it probably doesn't matter
at this point.

Another idea: you could look at a static html dump of Wikipedia and see how
they structure their tree (three-letter prefixes).

On the flip side, having it in the DB will probably be easier in terms of
managing it (one place to back up) and possibly make it easier to split the
workload across multiple boxes -- ex: three boxes could each query the db,
suck down all pages for a couple hundred domains, do the processing, and
insert when done.
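That split-the-workload scheme could look roughly like this (sqlite stands in for mysql, the jobs schema is invented, and a real multi-box setup would need proper locking around the claim):

```python
# Each worker box claims a batch of unclaimed domains, then processes them.
import sqlite3

def claim_batch(conn, worker, n=200):
    """Claim up to n unclaimed domains for this worker (no real locking here)."""
    cur = conn.execute(
        "SELECT domain FROM jobs WHERE worker IS NULL LIMIT ?", (n,))
    domains = [row[0] for row in cur]
    conn.executemany(
        "UPDATE jobs SET worker = ? WHERE domain = ?",
        [(worker, d) for d in domains])
    conn.commit()
    return domains

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (domain TEXT, worker TEXT)")
conn.executemany("INSERT INTO jobs VALUES (?, NULL)",
                 [("a.com",), ("b.com",), ("c.com",)])
first = claim_batch(conn, "box1", n=2)
second = claim_batch(conn, "box2", n=5)
```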

------
anamax
> I can't find easy information about how search engines save their data (into
> a database or just locally to the filesystem) and we kind of feel at this
> end that it is fundamental at this point to decide on the correct path.

There are other possibilities, including both local and remote datastores that
aren't really databases.

However, their approach doesn't matter because your problem is significantly
(>1000x) smaller and different (for one, you're not running continuously).

------
ks
I don't know what the big search engines do, but storing the data in a
database for the single purpose of parsing it later sounds a bit unnecessary.
If you are only using the database as storage, the file system will do a
better job.

