

Ask YC: Any crawler experts out there? - justtease

Hi. Wondering if there are any crawler experts on here who can help me. We want to create a crawler to visit some sites that have forms, lists of items, and item detail pages. They're all in the real estate market, and we want to capture the properties and pull out the latest ones. I'm being told that we need to create a specific crawler for each site, but I was wondering if we could create a generic crawler that has some kind of plug-in or pattern-matching file (that we build manually) for each site. Anyone who is super-skilled in this area - I'd appreciate some advice. We're using Python. One caveat - I'm not the tech guy, as I tried to program and failed, but I do understand what we need and have a very good understanding of technology; I'm just inept at taking my ideas and doing anything with them :)
======
danohuiginn
Use Beautiful Soup (best Python scraping library I know of). Maybe combine it
with mechanize for navigating between the pages. Don't try to create your own
pattern-matching file. Write a generic crawler class, then subclass it for each
site. In the end you should just need to write a couple of short site-specific
functions for each site.

It doesn't take long once you get going - that is, until you run into sites
that are unnavigable piles of javascript and unstructured html.
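In Python terms, the subclass approach might look something like this stdlib-only sketch (the class names, the hypothetical site markup, and the regex are all invented for illustration; in practice Beautiful Soup would replace the hand-rolled regex parsing):

```python
import re
import urllib.request

class BaseCrawler:
    """Generic crawler: shared fetching logic, site-specific parsing in subclasses."""

    def fetch(self, url):
        # Shared HTTP logic lives here (headers, retries, throttling, ...).
        with urllib.request.urlopen(url) as resp:
            return resp.read().decode("utf-8", errors="replace")

    def extract_listings(self, html):
        # Each site subclass overrides this with its own short parsing function.
        raise NotImplementedError

class ExampleSiteCrawler(BaseCrawler):
    # Hypothetical site whose listings look like <div class="listing">...</div>.
    def extract_listings(self, html):
        return re.findall(r'<div class="listing">(.*?)</div>', html)

sample = '<div class="listing">123 Main St</div><div class="listing">9 Elm Ave</div>'
print(ExampleSiteCrawler().extract_listings(sample))
# ['123 Main St', '9 Elm Ave']
```

The base class keeps all the crawl plumbing in one place, so each new site only costs you a small `extract_listings` override.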

~~~
justtease
Thank you. I'll look into these. I'm constantly amazed by the power of Python
- especially in comparison to PHP, which is my background.

------
wehriam
I've developed a Twisted Python crawler that does something very similar to
that. The possibility that it would work well seemed dubious at first, but
I've been pleasantly surprised with the results.

Email me at johnwehr@gmail.com - I'd be happy to discuss the technology and
progress I've made so far.

------
skmurphy
You might talk to the guys at New Idea Engineering about their xpump
technology <http://www.ideaeng.com/ds/xpump.html>. I used it a few years ago to
process all of the hardware data sheets on the Cisco website and extract 13
parameters such as height, width, depth, weight, power consumption (AC and
DC), etc. Because Cisco's products come from many different acquisitions, the
datasheets were in many different formats, which sounds similar to your
problem.

One point I would make is that a fast crawler is not always the best for this
type of application: crawling at about the speed a user would click on pages
is more friendly to a site and less likely to have them take steps to block
your access.

------
conorh
I work for <http://streeteasy.com> . We are experts in this area - especially
in real estate ;) Feel free to contact me at ch AT streeteasy.com. It is _not_
an easy task to build what you are looking for. We've been building and
improving our system for two years now (Ruby on Rails). Scraping the data is
just one part of the problem. Validating the data is also a big issue. These
sites often have incorrect or stale information. MLS's are good, but they may
have restrictions on what you can do with the data, or (as in NYC) they may
not even exist.

------
scumola
You can build a generic crawler that pulls pages from sites quickly and then
process the pages offline with whatever language you'd like. It's better to
have a distributed way of doing things. Plus, there are standards that you
need to comply with when crawling someone's website, like not crawling them too
fast, or checking their robots.txt file to make sure that you're crawling
"allowable" pages. Then once you've pulled their data off, you process it
offline and do whatever you need to do with the data. It's not a simple
procedure, but it's do-able if you want to spend some time doing it properly.
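The robots.txt and crawl-rate checks are cheap to get right in Python's stdlib. A small sketch (the robots.txt content and the bot name are made up; normally you'd fetch the file from the site with `read()`):

```python
import urllib.robotparser

# Parse a robots.txt (fed inline here; normally fetched from the target site).
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Crawl-delay: 5",
    "Disallow: /private/",
])

print(rp.can_fetch("mybot", "http://example.com/listings"))   # True
print(rp.can_fetch("mybot", "http://example.com/private/x"))  # False

# Honor the site's requested delay between requests, with a fallback.
delay = rp.crawl_delay("mybot") or 1
# Sleeping `delay` seconds between fetches keeps the crawl at a polite pace.
```

Wiring a `time.sleep(delay)` between fetches also addresses the point above about crawling at roughly the speed a human would click.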

------
yters
I've been thinking that there must be an easy way to tie the Emacs web browser,
macros, and regexes together to make powerfully customizable crawlers, but I
haven't really investigated. Anyone know of this being done?

------
vikram
I've been working on this for a little while now. You can definitely write a
plug-in or pattern-matching file for each site. Building a specific crawler
for each website doesn't make sense.

The custom bits you need are the ones that fill in the form and then extract the
results. For scraping results, Ruby's Scrubyt is best, as you can write
templates for each type of page.
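One way to sketch the pattern-file idea in the OP's Python: a single generic engine driven by a per-site table of patterns. Everything here (the site name, the markup, the regexes) is invented for illustration, and in practice the patterns would more likely be CSS or XPath selectors than regexes:

```python
import re

# Hypothetical per-site "plug-in" definitions: one generic engine, many configs.
SITE_PATTERNS = {
    "example-realty": {
        "listing": r'<li class="prop">(.*?)</li>',
        "price": r'\$([\d,]+)',
    },
}

def scrape(site, html):
    """Apply one site's pattern file to a page; no site-specific code needed."""
    patterns = SITE_PATTERNS[site]
    listings = []
    for block in re.findall(patterns["listing"], html, re.S):
        price = re.search(patterns["price"], block)
        listings.append({"raw": block, "price": price.group(1) if price else None})
    return listings

sample = '<li class="prop">2BR flat $1,200</li><li class="prop">House $350,000</li>'
print(scrape("example-realty", sample))
```

Adding a new site then means adding an entry to `SITE_PATTERNS` rather than writing a new crawler.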

------
bprater
If you need to do form stepping, you need to look at something like Perl's
Mechanize package. (Ruby has one too.)

Spend time reading articles related to Mechanize. Your resulting code is going
to be fairly terse, so you don't really need to spend much time worrying about
making a generic crawler.

------
sonink
If you are into Java, then I'd suggest webharvest. If you want something
broader (read: generic) and are brave enough, you can even try Nutch.

------
toddcw
I'm with screen-scraper (<http://www.screen-scraper.com/>), and we've dealt a
lot with scraping real estate data. Building a generic crawler for this kind
of thing is quite a bit more complicated than it might seem. You might give
our software and services a look, though. Our app integrates quite nicely with
Python.

------
pchristensen
See this post from a couple weeks ago:
<http://news.ycombinator.com/item?id=96057>

Business idea? Selling smart crawlers to YCNews readers? There seems to be
recurring interest :)

------
scumola
Also, there should be some way of getting MSLP (?) data from a service via RSS
or something somewhere - which is a TON nicer than crawling several people's
webpages.

------
nickmerwin
If your environment supports Ruby, I have lots of experience with a great
library called Scrubyt... yickster at gmail

------
latone
Please contact latone at gmail.com and we can discuss.

------
scumola
MLS. :)

