

Diffbot launches a web page classifier API: analyzes a day of Twitter - miket
http://betakit.com/2012/08/16/diffbot-adds-page-classifier-api-to-help-developers-categorize-the-web

======
miket
Wanted to thank the HN community for all your encouragement. I first released
the Diffbot API as a "Show HN:" post last year
(<http://news.ycombinator.com/item?id=2310852>). $2M+ and lots of hard work
later, we're powering some of the largest destination sites out there like
StumbleUpon and the new Digg.

~~~
thomasfl
The visual analysis method looks very interesting, Mike! Will you identify or
extract information from pages with product info, events, food recipes or
product reviews?

~~~
miket
Yes, it's on the roadmap. A full photo API is next.

------
ig1
Conceptually I like the product; it's something I would consider paying for.
But in practice it doesn't seem to perform that well. It misclassifies things
it should get right (an article hosted on Posterous; a YouTube page; Hacker
News), and for some queries it just returns results for a completely different
webpage.

The page tagging technology looks good though.

~~~
miket
Thanks. Yes, the page classifier still misses at times, which is why it's in
beta rather than fully in production like our other APIs. Would love to
hear about your use case.

~~~
ig1
B2B customer segmentation in CRM systems based on customer homepage.

------
dave_sullivan
Really like the vision approach to classifying web pages. I've been thinking
for a while that Google should add this to their algo (if they haven't
already).

Classifying individual parts of pages (as Diffbot seems to be doing) is
difficult, but I suspect Google could take screenshots of pages reported as
spam or whatever as one class and compare those to screenshots of pages with
high PageRank to get a pretty interesting classifier they could use as an
extra datapoint. Could be an interesting experiment anyway, using data they've
got lying around.

------
jdangu
I see some potential in ad tech.

How does caching work? Is there any focus on security? Multiple geolocations?

I liked the TOS :) "Diffbot.com is made available for personal,
non-commercial, and commercial purposes. Services are provided as-is, and we
do not make any guarantees on the quality or performance."

------
jdhuang
I thought this was super-clever when I first came across Diffbot last year.
Can't wait to see what they come out with next.

Keep it up!

------
laserDinosaur
Wow, pretty cool. I wonder, though, whether there's much use for it outside of
aggregator sites like Digg. Even for a site like Reddit, all the content is
already split up into categories by users. While this is really cool, I'm not
really seeing much use for it. What are some problems that this will solve?

~~~
miket
The most pressing problem that the page classifier solves is for the current
community of developers using Diffbot. We only offer APIs at the moment for
extracting information out of frontpages and article pages (see
<http://www.diffbot.com/our-apis/>). While we'll be releasing more soon for
the other page types that you see in the infographic chart, this limitation is
a problem if, for example, you're passing a photo page or recipe into the
article API--you're not going to get an article back, since it's not
appropriate. Having the technology to visually understand what the types of
pages are allows us to route those requests.
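
To make the routing idea concrete, here is a minimal sketch, not Diffbot's actual implementation. Everything in it is hypothetical: the page-type labels, the extractor functions, `route`, and the stub `fake_classify` (a real classifier would analyze the rendered page, not the URL string). It only shows the classify-then-dispatch pattern.

```python
from typing import Callable, Dict

# An extractor takes a URL and returns extracted fields as a dict.
Extractor = Callable[[str], dict]


def extract_article(url: str) -> dict:
    # Placeholder standing in for a call to an article-extraction API.
    return {"api": "article", "url": url}


def extract_frontpage(url: str) -> dict:
    # Placeholder standing in for a call to a frontpage-extraction API.
    return {"api": "frontpage", "url": url}


# Hypothetical labels; only page types that have an extractor are listed.
EXTRACTORS: Dict[str, Extractor] = {
    "article": extract_article,
    "frontpage": extract_frontpage,
}


def route(url: str, classify: Callable[[str], str]) -> dict:
    """Classify the page first, then dispatch to the matching extractor."""
    page_type = classify(url)
    extractor = EXTRACTORS.get(page_type)
    if extractor is None:
        # No extractor for this page type yet (e.g. a photo or recipe page),
        # so report that instead of forcing it through the article extractor.
        return {"api": None, "url": url, "page_type": page_type}
    result = extractor(url)
    result["page_type"] = page_type
    return result


def fake_classify(url: str) -> str:
    # Stub classifier for demonstration only: guesses the type from the URL.
    if "recipe" in url:
        return "recipe"
    if url.rstrip("/").count("/") <= 2:
        return "frontpage"
    return "article"
```

Calling `route(url, fake_classify)` on a recipe URL returns a result with `api` set to `None` rather than a bogus article, which is the failure mode described above.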

Beyond that, it's useful for companies that perform analytics or use the
information to do better search or auto-categorization.

~~~
hansef
Are you going to be releasing an events API in the near future? I've had an
app idea banging around for a few months that this API would be perfect for.

~~~
miket
The next API we'll train is likely for image/photo pages. Event pages
are definitely on our roadmap, but we've still got work to do there. Could
definitely use your help with training data and learning more about your use
case.

------
melipone
What are the different page types the classifier returns?

