
Yahoo open sources Anthelion web crawler for parsing structured data - fangwang
https://github.com/yahoo/anthelion
======
loopbit
Just a couple of minor clarifications:

\- Anthelion is a plugin for Apache Nutch, which is the web crawler part and
has been open source for a long time.

\- As far as I can tell, Anthelion parses structured data (microformats,
microdata, RDFa...) but that's not the most interesting bit. The online
classifier of pages and the scoring of newly discovered links look like the
really important piece.

\- Actually, the other two parts of the plugin are modifications of existing
Nutch plugins.
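
To make the classifier-plus-link-scoring idea concrete, here is a minimal sketch in Python. It is not Anthelion's code: the class name, the choice of URL path tokens as features, and the logistic-regression update are all assumptions made for illustration. The idea is that each fetched page gives feedback (did it contain structured data or not), and that feedback adjusts how unvisited links are ranked.

```python
import math
import re
from urllib.parse import urlparse


class OnlineLinkScorer:
    """Toy online logistic-regression scorer (hypothetical, not Anthelion's API)."""

    def __init__(self, learning_rate=0.5):
        self.learning_rate = learning_rate
        self.weights = {}  # token -> weight, learned one page at a time

    @staticmethod
    def features(url):
        # Assumption: tokens of the URL path are the only features.
        path = urlparse(url).path.lower()
        return {token for token in re.split(r"[^a-z0-9]+", path) if token}

    def score(self, url):
        # Estimated probability that the page behind `url` has structured data.
        z = sum(self.weights.get(token, 0.0) for token in self.features(url))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, url, has_structured_data):
        # One gradient step using the label observed after fetching the page.
        error = (1.0 if has_structured_data else 0.0) - self.score(url)
        for token in self.features(url):
            self.weights[token] = self.weights.get(token, 0.0) + self.learning_rate * error


def rank_frontier(scorer, urls):
    # Most promising unvisited links first.
    return sorted(urls, key=scorer.score, reverse=True)


scorer = OnlineLinkScorer()
scorer.update("http://example.com/product/1", True)   # page had markup
scorer.update("http://example.com/about/us", False)   # page had none
frontier = rank_frontier(scorer, [
    "http://example.com/about/team",
    "http://example.com/product/3",
])
```

After those two updates the product link ranks ahead of the about link. A real crawler would presumably use richer features and feed the scores into its fetch queue.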

All in all, I can't wait to have some time to see it working.

------
praveenster
Just curious. Wasn't Yahoo using Google and Bing for search? Is this crawler
being used at Yahoo internally?

~~~
peterhadlaw
If I recall correctly, Yahoo had a couple of other offerings / search tools.
They had something called YQL, which enabled you to treat specific, public
(supported) websites, like Craigslist for example, as a SQL table. SELECT *
from <house_for_rent> etc, etc, etc.

I have no idea if this is what was used here, but I know this is an example
where I'm sure they didn't outsource the search work to Google.

------
sanxiyn
I note that the package namespace is under com.yahoo.research. So this is
probably from Yahoo Research.

------
tigrank
I want to build a specialized search site only for cars. Think of it as
indeed.com for cars. Is this the magic sauce I've been waiting for?

