

Ask HN: MP3 Crawler - obaid

I have been looking into a good way to implement this. I am working on a simple website crawler that will go around a specific set of websites and crawl all the MP3 links into a database.

I don't want to download the files, just crawl the links, index them, and be able to search them. So far I have been successful with some of the sites, but others use URL redirects and similar tricks that confuse the crawler.

Any ideas? How does beemp3.com index all these links?

Thanks
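For illustration, the link-extraction step of such a crawler might look something like this in Python, using only the standard library (the URLs and HTML here are made-up examples, not any particular site):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class Mp3LinkParser(HTMLParser):
    """Collect absolute URLs of .mp3 links found on one page (sketch)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.mp3_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Strip any query string before checking the extension.
        if href and href.lower().split("?")[0].endswith(".mp3"):
            # Resolve relative links against the page URL.
            self.mp3_links.append(urljoin(self.base_url, href))


html = '<a href="/songs/track1.mp3">one</a> <a href="http://x.com/a.html">no</a>'
parser = Mp3LinkParser("http://example.com/music/")
parser.feed(html)
print(parser.mp3_links)  # ['http://example.com/songs/track1.mp3']
```

In a real crawler you would feed this parser the body of each fetched page and write `mp3_links` to your database instead of printing them.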
======
jm4
They probably use a better crawler than the one you've put together. Reliable
crawling is not the easiest problem to solve. There are a lot of crappy sites
out there. When you're Google you can tell them to screw off. When you're
small and you need to crawl the content you have to figure out a way to make
things work.

To accurately collect links you've got to be able to follow redirects (this is
really a no-brainer), interpret JavaScript, handle DOM events, have AJAX
support, possibly parse Flash files for content or links, etc. There are still
plenty of sites out there that use Flash for navigation and don't provide a
fallback. I recently saw a site that used the window.onload event to call a
function that wrote out the HTML for the entire page using document.write.
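Redirect handling, at least, can be sketched independently of any particular HTTP library. In this illustrative Python sketch, `fetch_headers` is a hypothetical stand-in for a real HTTP HEAD request; the point is the loop detection and hop limit:

```python
from urllib.parse import urljoin


def resolve_redirects(start_url, fetch_headers, max_hops=10):
    """Follow an HTTP redirect chain to its final URL (sketch).

    fetch_headers(url) is assumed to return (status_code, headers_dict).
    """
    url, seen = start_url, set()
    for _ in range(max_hops):
        if url in seen:
            raise ValueError("redirect loop at %s" % url)
        seen.add(url)
        status, headers = fetch_headers(url)
        if status not in (301, 302, 303, 307, 308):
            return url
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, headers["Location"])
    raise ValueError("too many redirects")


# Simulated responses standing in for real HTTP HEAD requests:
responses = {
    "http://short.example/x": (302, {"Location": "/files/track.mp3"}),
    "http://short.example/files/track.mp3": (200, {}),
}
final = resolve_redirects("http://short.example/x", lambda u: responses[u])
print(final)  # http://short.example/files/track.mp3
```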

Depending on what your needs are you could end up with anything from a small
script to a full fledged browser. You could either develop something yourself,
use an open source crawler or script Mozilla or IE. With a couple Perl modules
you could have your own headless Mozilla.

Once you have a good crawler it's still going to be tricky to use. There are
all sorts of spider traps out there: circular navigation, unique URLs that
produce duplicate content, etc. Sometimes it's deliberate; most of the time
it's not. People just don't usually design sites with web crawlers in mind. It
may take a little prodding (site-specific configuration) to make it work.
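A simple defense against the duplicate-content traps described above is to normalize URLs and hash page bodies before indexing anything. This is only a sketch, assuming Python; the query parameters dropped here are illustrative examples of session/tracking noise:

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def normalize_url(url, drop_params=("utm_source", "sessionid", "sid")):
    """Canonicalize a URL so trivially-different variants dedupe (sketch)."""
    parts = urlsplit(url)
    # Drop noisy parameters and sort the rest for a stable key.
    query = sorted((k, v) for k, v in parse_qsl(parts.query)
                   if k not in drop_params)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", urlencode(query), ""))


seen_urls, seen_content = set(), set()


def should_index(url, body):
    """Skip pages we've already seen, by URL or by identical content."""
    key = normalize_url(url)
    digest = hashlib.sha1(body.encode()).hexdigest()
    if key in seen_urls or digest in seen_content:
        return False
    seen_urls.add(key)
    seen_content.add(digest)
    return True


print(should_index("http://Example.com/page?sid=1", "hello"))  # True
print(should_index("http://example.com/page", "hello"))        # False
```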

------
westside1506
Our service, 80legs, will let you easily do this. We let you specify seed
links, how deep you want to crawl, and control many other aspects of the
crawl. By default, we control the hard bits, like redirects and spider traps,
but if you want to override our default functionality you can easily insert
your own code to do it.

Our default functionality will let you identify mp3 files by regex or keyword,
but if you need something more sophisticated you can override that too. I'm
Based on what you've said, I'm pretty sure you could plug in a few parameters
and, within a few minutes of getting started with 80legs, run jobs that do
exactly what you want. If not, adding custom code to 80legs is pretty simple
too.
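For example, a regex-based filter along these lines could pick mp3 hrefs out of raw HTML (the pattern here is an illustrative sketch, not 80legs's actual default):

```python
import re

# Hypothetical pattern: capture href values ending in .mp3, with an
# optional query string, case-insensitively.
MP3_URL = re.compile(r"""href=["']([^"']+\.mp3(?:\?[^"']*)?)["']""", re.I)

page = '<a href="/dl/song.MP3?id=9">dl</a> <a href="/a.ogg">x</a>'
print(MP3_URL.findall(page))  # ['/dl/song.MP3?id=9']
```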

Just send us your contact info on our website (<http://www.80legs.com>) and
mention HN and I'll make sure you get a beta invite. BTW - we're still in
private beta and the service is still free right now.

------
deutronium
You may be interested in Nutch (<http://lucene.apache.org/nutch/>), an Open
Source crawler, which handles indexing etc. for you. It's also based on
Hadoop, so it should scale nicely just by throwing more machines at the job.

------
ScottWhigham
How is this different from doing a google search for filetype:mp3? I don't
remember...

