
Ask YC: Any ideas about intelligent crawlers :) - franklymydear
Hi everyone. Not sure if I should be posting this to a forum or not. Curious to see what people's answers are here, as I read this site a lot.

I'm thinking of creating an intelligent crawler in Python. I have a project with a friend where we'd like to crawl a few specific car-related websites, grab some of the info and look for new entries. I am wondering if there is any existing technology out there where a crawler is sent to a site and either trained (visually?) or can understand repeating information like tables, which we could use to create a proof of concept. I'd appreciate any critique of my idea, which is:

1. Create a visual tool - probably Windows/Mac based - which uses the browser to navigate a site and to highlight elements that we would like to capture, such as car name, description and price. This would also have to be able to automatically/manually work out repeating elements.

2. This tool would create some kind of file (XML?) which would then be used by the main crawler to understand how to navigate the site.

3. The crawler, which we'd write in Python, would visit the site every week to look for new information.

Am I going about this the right way, or does anyone have any ideas?

One point: we would seek permission from the sites before crawling - it would be to their benefit, as we're looking to push people their way.

Appreciate any thoughts anyone might have.

All the best

John
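To make step 3 concrete, the weekly new-entry check could be as simple as diffing each run's scraped listing IDs against the IDs recorded on earlier runs. A rough sketch - how an ID is derived (ad number, URL) would be site-specific:

```python
import json
import pathlib

# Step 3's weekly "look for new information" pass: compare the listing
# IDs scraped this run against every ID seen on previous runs.
SEEN_FILE = pathlib.Path("seen_ids.json")

def find_new(scraped_ids):
    """Return the IDs not seen on any previous run, then record them."""
    seen = set(json.loads(SEEN_FILE.read_text())) if SEEN_FILE.exists() else set()
    new = set(scraped_ids) - seen
    SEEN_FILE.write_text(json.dumps(sorted(seen | set(scraped_ids))))
    return new
```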
======
adrianh
I wanted to do this same thing a while ago and have done a lot of research and
reading in this area. Here are some search terms that will likely help you:

* automatic wrapper generation

* information extraction

* removing noisy information from Web pages

* template detection

* wrapper induction

"Wrapper" is a fancy computer-science term for "scraper."

I wrote some Python code that does this -- given X sample documents, detect
the differences between them and automatically create a scraper tailored to
those documents. I released the first version open source -- it's called
templatemaker: <http://code.google.com/p/templatemaker/> .

But that version of templatemaker is quite brittle, because it was designed to
work on plain text as much as on HTML. I've since written an HTML-aware
version of templatemaker that is really frikkin' awesome (if I may say!) and
beats the pants off the old one. I don't know if I'm going to open-source it,
as it's quite valuable to my own startup.
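For a flavor of how that kind of induction can work, here is a toy sketch (nowhere near as robust as templatemaker itself): diff two sample documents, keep the runs common to both as literal template text, and turn the differing runs into capture holes.

```python
import difflib
import re

def learn_template(a, b, min_match=3):
    """Build a regex template from two sample documents: runs common to
    both become literal text, differing runs become capture holes.
    Short accidental matches are ignored via min_match."""
    sm = difflib.SequenceMatcher(None, a, b, autojunk=False)
    pattern, pos = "", 0
    for blk in sm.get_matching_blocks():
        if blk.size < min_match:
            continue
        if blk.a > pos:                      # a differing region: a hole
            pattern += "(.*?)"
        pattern += re.escape(a[blk.a:blk.a + blk.size])
        pos = blk.a + blk.size
    if pos < len(a):                         # trailing hole
        pattern += "(.*)"
    return pattern

def extract(template, doc):
    """Apply a learned template to a new document."""
    m = re.match(template, doc, re.DOTALL)
    return list(m.groups()) if m else None
```

Feeding it two listings like `<b>Toyota</b> price $5000` and `<b>Ford</b> price $4100` yields a template that then pulls the name and price out of a third listing with the same layout.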

Hope this helps!

------
mig
Don't write your own crawler. Use nutch.

It is designed to scale and to do MapReduce-style parallel processing. I would
strongly recommend you take a look before writing your own.

<http://lucene.apache.org/nutch/>

~~~
imsteve
MapReduce? Just how many requests will you be making to third-party sites at
once? Sounds like a good way to get blocked fast.

------
aristus
I think you are in over your head, but it's a great way to learn about the
plumbing and underbelly of the Web.

This visual tool is basically what a company called onDisplay was doing back
in 1999, before they were bought by consulting firm Vignette for an obscene
amount of money. But scraping against the html structure is a losing battle.

A better approach is to use clues in the information itself to guess its
content: something with a "$" is a price, something containing "toyota" is
probably a name, "blue" a color, more than 20 words containing "good" or "v8"
is a description, etc. That way your scraper is resistant to structure changes.
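As a sketch of that kind of content-based classifier (the keyword lists are purely illustrative; real listings would need much richer ones):

```python
import re

# Illustrative keyword lists -- a real system would need far more.
MAKES = {"toyota", "honda", "ford", "bmw"}
COLORS = {"blue", "red", "black", "silver"}

def classify(fragment):
    """Guess what a scraped fragment is from its content alone,
    so the scraper survives changes to the page structure."""
    words = fragment.strip().lower().split()
    if re.search(r"\$\s*\d", fragment):
        return "price"
    if len(words) > 20:
        return "description"
    if any(w in COLORS for w in words):
        return "color"
    if any(w in MAKES for w in words):
        return "name"
    return "unknown"
```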

All that is separate from the problem of a crawler. It takes a long time and a
lot of effort to convince content sites that what you are doing is a) helpful
to them and b) something they should not be doing themselves.

It's like jumping on stage with the band and starting to play. You better be
really good and friendly and prepared to get the crap beaten out of you.

------
franklymydear
Not sure if this is appropriate - as not responding to one particular person -
but I'd like to send a BIG Thank You to everyone out there for the advice,
encouragement and sometimes the reality "keep your feet on the ground" type
stuff. This has inspired me to move forward with this. To those who have done
stuff like this before, thanks for the links and I'm grateful for you sharing
your experience.

If it's OK, I'd like to let people know about my experiences. Oh, and if
anyone is interested in collaborating or just sharing ideas, then I'd be happy
to do likewise.

All the best

~~~
akkartik
Good idea. Add an email or website at
<http://news.ycombinator.com/user?id=franklymydear> and then you can exchange
private messages by multicast rather than broadcast.

Feel free to ask me more questions by email. I spend a fair bit of time
thinking about html parsers.

~~~
franklymydear
Thanks! I've just done this. Going to do some research into this area and then
make a plan to start in the next week.

------
akkartik
For 1, you mean you want to build a _parser_ for arbitrary HTML that your
crawler returns. Hard problem, as others have said. My advice:

1. Use an HTML parsing library. Beautiful Soup (Python) or Hpricot (Ruby) are
good building blocks.

2. Practice manually building parsers for a few sites, then see if it leads
you to any insights about how to generalize the process.

3. Ignore everything else until you do 2. Just use wget as your crawler. Skip
the visual interface for now; just parsing arbitrary pages is a hard enough
problem to bite off.
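For 2, Beautiful Soup is the comfortable option, but even the standard library's html.parser is enough to prototype a hand-built, per-site parser. A sketch against a made-up listing snippet (the `car-name` class is hypothetical):

```python
from html.parser import HTMLParser

class CarNameParser(HTMLParser):
    """Hand-built parser for one hypothetical listing layout: collect
    the text of every <td class="car-name"> cell."""
    def __init__(self):
        super().__init__()
        self.names = []
        self._grab = False

    def handle_starttag(self, tag, attrs):
        if tag == "td" and ("class", "car-name") in attrs:
            self._grab = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._grab = False

    def handle_data(self, data):
        if self._grab and data.strip():
            self.names.append(data.strip())

page = ('<table><tr><td class="car-name">Toyota Corolla</td>'
        '<td class="price">$5000</td></tr></table>')
p = CarNameParser()
p.feed(page)
print(p.names)   # ['Toyota Corolla']
```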

------
ntoshev
Someone mentioned dapper.net and I upmodded it, but I think it will get lost
in the noise.

As far as I understand, they are very close to what you are trying to do, so
study them carefully as a competitor.

~~~
franklymydear
Yes, this is close to what I want to do in terms of functionality. My idea
was to use a wizard-like approach to record the elements of a page that we
need to capture and how to navigate through a specific site. They appear to be
doing this, though the system has failed a few times on me, and they're doing
it in a browser, whereas I'd planned to create an app. Very interesting
though. Anyone who knows of anything similar, or who is interested in building
something like this, get in touch.

~~~
ntoshev
Actually I might be interested - please leave an email.

------
showerst
I know in PHP it's possible to load an HTML document and parse the DOM tree
using XPath expressions; presumably that capability exists in Python.

So I guess in theory you could write a frontend (Firefox extension?) where you
could highlight / select a screen area (the Web Developer extension already
does this), then pass its DOM information (i.e. #body table tr td#username) to
your backend, which would then scrape those fields from any applicable site
pages.

This of course assumes that 1) the website(s) are well-formed enough for your
parser and 2) well-programmed enough that the same info is in the same place
in the DOM tree, and preferably ID'd - which are pretty HUGE assumptions, but
could be worked around if you were determined enough.

Not sure if this is what you're looking for, and it seems a bit circuitous,
but it's a plausible idea anyway.
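Granting assumption 1, that pipeline can be sketched with just the standard library - ElementTree only speaks a small XPath subset, and lxml would handle full XPath and messier HTML. The page and selector here are made up:

```python
import xml.etree.ElementTree as ET

# A recorded selector applied to a fetched page. This only works if
# the page parses as XML -- the big "well-formed" assumption above.
page = """<body><table><tr>
  <td id="username">jsmith</td><td id="price">$5000</td>
</tr></table></body>"""

root = ET.fromstring(page)
selector = ".//td[@id='username']"   # what the visual tool would record
print(root.find(selector).text)      # jsmith
```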

------
spoonyg
I have built something similar to what you are describing, and it was a fun
project. My first reaction to #1 is that if the information you want is
reliably in the same place, you are probably better off just doing things
manually and not going to the trouble of building a visual tool. In my
experience the interesting pieces of data tend to move around, and something
like a regex is the best way to handle this. I used wget to grab data because
it was quick and easy. I then did the post-processing in the background,
separating the grabbing of data from the interpretation of data.
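A minimal sketch of that split, assuming pages have already been saved to disk by wget (e.g. `wget -r -l1 -P pages/ http://example.com/listings`); the price pattern is illustrative:

```python
import pathlib
import re

PRICE_RE = re.compile(r"\$\s?([\d,]+)")   # illustrative; per-site in practice

def interpret(directory):
    """Offline pass: read pages wget already saved and pull out prices,
    keeping the grabbing of data separate from its interpretation."""
    results = {}
    for path in sorted(pathlib.Path(directory).glob("*.html")):
        m = PRICE_RE.search(path.read_text(errors="ignore"))
        if m:
            results[path.name] = m.group(1)
    return results
```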

------
akkartik
I remember seeing a screencast of some startup that did something like this.
It was maybe a year ago. You click on elements, it shows you other hypotheses,
you correct if necessary, and you get an RSS feed compiled from the page
structure.

Anybody remember this?

~~~
yubrew
dapper.net

~~~
akkartik
Yes!

------
inovica
What you are talking about is the deep web, I think, and I don't think anyone
has managed this yet! Essentially you want a system that can fill in forms and
pull back results. I think it needs to be done on a per-site basis.

------
popephatt
John,

One thing that I would point out about the script you intend to write is that
it requires an awful lot of maintenance (when sites change layout) and is
frequently not very reusable. One solution that I tried is Mozenda
(<http://www.mozenda.com>). They have all the stuff you're looking for (i.e. a
visual, browser-based tool, writing to XML) but also have error handling and
notifications, so that if an agent breaks you'll know and be able to fix
things inside the visual tool.

------
imsteve
It's called "scraping", and I've done it lots of times with Python; it's very
easy. Don't bother with the other specialized, non-Python frameworks that
people are suggesting.

<http://wwwsearch.sourceforge.net/mechanize/>

And if you need to do complicated html parsing in combination with that:

<http://www.crummy.com/software/BeautifulSoup/>

From there, it's cake.

------
DarrenStuart
I know this is ruby but might be worth a look. Will help you no end and no
need for a web browser.

<http://mechanize.rubyforge.org/mechanize/>

might be worth a look <http://www.crummy.com/software/BeautifulSoup/>

------
akkartik
Another talk I saw at SHDH in October
(<http://superhappydevhouse.org/SuperHappyDevHouse20>):

<http://tagtheplanet.net>

They seem to be attempting an intelligent crawler as well.

------
toddcw
You ought to check out screen-scraper (<http://www.screen-scraper.com/>). It's
a commercial app, but the best I've used for this kind of thing. They also
offer a freeware version.

------
bluelu
What do you want to do?

At least in Germany, there exist a few solutions which do exactly that. If a
person puts his car on sale (a bargain) on one of the car-related websites, he
gets the first call about 20 seconds later from someone using these programs.

------
ivan
And one more thing: if you want to ask the site owners for permission, why
not ask them to produce a specific XML file for you?

~~~
akkartik
Because granting permission is easy. Why would they go to more effort than
that for random people?

~~~
ivan
Why do thousands of job sites produce custom XML output for SimplyHired or
Indeed?

~~~
imsteve
They like buzzwords?

------
sonink
Web-Harvest?

------
benn
Does anyone want to write a search engine? Python and C++. I thought we could
analyze the links between pages and come up with some kind of ranking
algorithm. We'd have to make the system widely parallel, but I think we could
be breaking some new ground here.

Any takers?

