

Focused Crawling - bravura
http://bixolabs.com/about/focused-crawler/

======
pedalpete
Are these guys just offering to write 80legs crawlers for customers?

Maybe I'm missing it, but with the ease of web crawling, and the capabilities
of open source data mining tools, I think these guys are serving a VERY small
market.

~~~
kkrugler
Hi Pedalpete - I'm the founder of Bixo Labs, so this comment caught my eye. My
first thought is "We need to add some additional context to that page", for
cases like this where it's the first time somebody is reading about what we
do.

As to your question, we don't use 80legs. We build workflows on top of
Hadoop/Cascading/Bixo, and run them in EC2.

As for the size of the market, I agree that just writing a web crawler isn't
that interesting; you can easily use Nutch, Heritrix, or (for small stuff)
roll your own, though the ins and outs of large-scale web crawling are
definitely non-trivial.

We've found our sweet spot to be customers who use the raw web crawl results
as the starting point for a data processing workflow. Sometimes what happens
next is simple (extract particular types of data off pages, turn into XML,
push to the next step) and other times it's more complicated, where natural
language processing and machine learning are key steps.

You mentioned open source data mining tools - which ones are your favorites?
We use a lot of open source, but after my Krugle startup days I know there are
always 10 more alternatives to every project that I haven't yet heard about.

Thanks,

-- Ken

