

Ask YC: Grabbing content - question

First of all, I would like to thank those of you who offered some life-saving advice on my previous question.

Now that we have pulled up our socks and are making some headway, we have run into our first content problem.

We are building a business listings directory for a niche market. We have identified a bunch of ad hoc sources for content. We have manually gathered content for our mockup, but to take it to the next level we need a lot more content than we currently have.

What are the different techniques used to gather content? It would be great to hear from anyone with specific experience gathering content for a business listings directory.

Looking forward to some valuable insight!

Thanks again!
======
spage
My favorite technique is:

wget URL > HTML
tidy HTML > XHTML
xslt [identity transform based content extraction] XHTML > XML
XML > DB

The whole process is glued together with Perl or shell scripts. Depending on how
you construct your content extraction, this technique can weather lots of the
inevitable content style changes and easily adjusts when changes need to be
made.
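
For illustration, here is a rough sketch of the same fetch, extract, and store idea in Python. It is not the wget/tidy/XSLT toolchain itself: the standard library's `html.parser` stands in for the tidy and XSLT steps, and SQLite stands in for the database. The sample markup, the `listing`/`name`/`phone` class names, and the `listings` table are all made up for the example; a real source would need its own extraction rules, and the HTML would come from a file saved by wget rather than a string.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical sample page; in practice this would be HTML saved by wget.
SAMPLE_HTML = """
<html><body>
<div class="listing"><span class="name">Acme Plumbing</span>
<span class="phone">555-0100</span></div>
<div class="listing"><span class="name">Bolt Electric</span>
<span class="phone">555-0199</span></div>
</body></html>
"""


class ListingParser(HTMLParser):
    """Collects text from <span class="name"> and <span class="phone">
    found after each <div class="listing"> (assumed markup)."""

    FIELDS = ("name", "phone")

    def __init__(self):
        super().__init__()
        self.listings = []  # one dict per listing div seen so far
        self._field = None  # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "div" and cls == "listing":
            self.listings.append({})
        elif tag == "span" and cls in self.FIELDS and self.listings:
            self._field = cls

    def handle_data(self, data):
        if self._field:
            record = self.listings[-1]
            record[self._field] = record.get(self._field, "") + data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def extract(html):
    """Content extraction step: HTML text in, list of dicts out."""
    parser = ListingParser()
    parser.feed(html)
    parser.close()
    return parser.listings


def store(conn, listings):
    """Load step: write the extracted records into a SQLite table."""
    conn.execute("CREATE TABLE IF NOT EXISTS listings (name TEXT, phone TEXT)")
    conn.executemany(
        "INSERT INTO listings (name, phone) VALUES (?, ?)",
        [(r.get("name"), r.get("phone")) for r in listings],
    )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    store(conn, extract(SAMPLE_HTML))
    for row in conn.execute("SELECT name, phone FROM listings"):
        print(row)
```

Keeping extraction in one small function is what makes style changes on the source site cheap to absorb: only the parser rules change, while the fetch and load steps stay the same.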

------
thorax
Hmm, I really don't know how to help, but I did just look into something that
isn't directly related and might still be useful.

There's a company called DeepData @ <http://www.deepdata.com> which might be
of some use? They have APIs for business listing queries, and may be able to
help on some categorizations.

------
lux
If you mean screen-scraping, isn't this roughly what you're looking for?

<http://www.crummy.com/software/BeautifulSoup/>

------
davidw
Just be very very careful if you grab any application forms;-)

------
FiReaNG3L
Dapper.net is your friend

------
kleevr
Lixto

