
Diffbot Aims to Build the Intel of Data for Artificial Intelligence - dwynings
http://techcrunch.com/2016/02/11/diffbot-aims-to-build-the-intel-of-data-for-artificial-intelligence/
======
mrdrozdov
Serious question. If all websites decided to provide the structured data that
Diffbot has to scrape, would Diffbot still have a business? If someone is
making money off your data, shouldn't it be you?

~~~
mwilcox
They have a pretty good shot at owning the customer relationships and becoming
a channel for that structured data (a market sites can sell data through) if
that does become the case.

~~~
jorgecurio
Wouldn't it be better for them to just turn it into their own API and control
access to it?

And couldn't they just expose their database, or whatever, as an API rather
than rely on screen scraping?

------
raldi
In what way is their business model a parallel to Intel's?

~~~
miket
Founder here.

It is an imperfect analogy, but let me try to elucidate what our model
actually is.

We currently power over 250 companies, from some of the most popular
applications like Instapaper and Shopspring, to search engines like Bing,
DuckDuckGo, Yandex, and eBay, to business intelligence tools like Crunchbase.

Our thesis is that access to large volumes of structured data will be just as
critical a resource as the learning algorithms or the computational hardware
for developers building the next generation of smarter applications, and we
offer our knowledge-as-a-service as a platform/component to build end-user
applications around.

Part of the new funding is going to our effort to build a comprehensive
database of knowledge synthesized and fact-checked from the sum total of all
of the information on the web. There is some parallel to Google's knowledge
graph project, except that our method is fully autonomous (no human curation)
and we are building a self-sustaining business around it with a fully-featured
API that everyone (outside of Google) can use.
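To make the shape of that concrete, here is a minimal sketch of calling an article-extraction API of this kind and parsing its JSON response. The endpoint and field names are modeled on the public v3 Article API, but treat them as illustrative assumptions rather than the documented contract:

```python
import json
from urllib.parse import urlencode

# Illustrative endpoint, modeled on the v3 Article API; treat as an assumption.
API_BASE = "https://api.diffbot.com/v3/article"

def article_request_url(token: str, page_url: str) -> str:
    """Build the GET URL for extracting one article's structured data."""
    return API_BASE + "?" + urlencode({"token": token, "url": page_url})

def parse_article(payload: dict) -> dict:
    """Pull the structured fields out of an extraction response."""
    obj = payload["objects"][0]  # assumed response shape
    return {"title": obj.get("title"),
            "author": obj.get("author"),
            "text": obj.get("text")}

# A response of the assumed shape, for illustration:
sample = json.loads(
    '{"objects": [{"title": "Hello", "author": "A. Writer", "text": "Body."}]}'
)
article = parse_article(sample)
```

The point of the model is that an application builds on `article["title"]`, `article["text"]`, and so on, instead of writing and maintaining per-site scrapers.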

~~~
raldi
Ah, you want to be the Intel _Inside_ of that market.

~~~
miket
Exactly, you got it.

------
jorgecurio
Here's a good answer about Diffbot that I found on Quora:

[https://www.quora.com/What-is-the-best-solution-for-an-
autom...](https://www.quora.com/What-is-the-best-solution-for-an-automated-
web-scraping-solution)

However, I tried it out and found Diffbot failed for a lot of websites I
wanted to crawl. It seems to work well enough for pulling metadata like the
page title, blog post titles, etc., but anything beyond that it struggled
with. And I don't know what the algorithm is doing, or have any insight into
how it works.

I'd much rather have a tool that gives me accuracy and control without having
to write a full web scraper, and one running on an actual browser, not
PhantomJS or some WebKit derivative where I have no visual feedback on what
the scraper is doing. Often I want a live video of what the crawler is doing:
What is it clicking on? Where is the website slowing down? A log file feels
limiting.

Also, it's more than just crawling the website now. There are a lot of single-
page apps and edge cases where I find the majority of tools fail: clicking on
Angular.js links, rendering JavaScript, websites that require an extension to
be installed, logging in to a website that redirects across domains multiple
times, trying every single permutation of dropdowns, checkboxes, and keywords
in an input form and then crawling the search results with infinite
scroll... some websites even give you bogus data because they can detect you
are not using a real browser.

90-95% accuracy means jack all if the 5% represents 99% of the value to you.
It's actually not that hard to achieve an automated solution; it's when it
fails for that one website you rely on for most of your data extraction needs
that it becomes a question of how fast you can debug and fix the
situation... tough to do when you don't have the means to do it yourself.

[https://www.cs.uic.edu/~liub/publications/kdd2003-dataRecord...](https://www.cs.uic.edu/~liub/publications/kdd2003-dataRecord.pdf)

on the computer vision related front:

[http://ir.lib.uwo.ca/cgi/viewcontent.cgi?article=3296&contex...](http://ir.lib.uwo.ca/cgi/viewcontent.cgi?article=3296&context=etd)

[http://repository.cmu.edu/cgi/viewcontent.cgi?article=1045&c...](http://repository.cmu.edu/cgi/viewcontent.cgi?article=1045&context=architecture)


And surprise, surprise: guess who has the best (public) research on this
front? It's Microsoft!

[http://research.microsoft.com/en-
us/um/people/sumitg/pubs/pl...](http://research.microsoft.com/en-
us/um/people/sumitg/pubs/pldi14-flashextract.pdf)

Here's their computer vision approach, called ViDE, from a while ago, but it
seems to have been abandoned (?), no news since then:

[http://research.microsoft.com/en-
us/people/jrwen/vips_techni...](http://research.microsoft.com/en-
us/people/jrwen/vips_technical_report.pdf)

Most scraping/crawling tools, like Mozenda, are super easy to use, and you can
outsource the work on Freelancer to get people to create the crawlers for you.
However, I ended up giving up because Mozenda charges something like $0.10 CAD
_for each page downloaded_ (intermediate pages too) due to using their
anonymous proxies. It would be great if somebody like Mozenda offered an
unmetered plan, preferably with some type of rotating proxies so I won't get
throttled. I imagine Diffbot is probably not using anything like that and is
running straight from AWS, and anyone who wants to block scrapers can just add
AWS's public IP range to ufw. I'm finding it more and more difficult to scrape
from AWS and other free scraping tools, as website owners seem mostly to block
the AWS IP range.

~~~
dwynings
Dru from Diffbot here.

Sorry to hear about the issues you ran into when trying out Diffbot! If you
have some examples, I'd like to look into whether we can improve.

In cases where our automatic extraction using computer vision isn't 100%
accurate, we offer a visual interface for actually overriding our default
extraction. This input is then used as additional training data for the ML
models.

Mozenda et al. are great if you only need data from a couple of sites and you
don't mind spending time manually specifying and maintaining CSS selectors for
each website and page layout you need data from.
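For contrast, here is roughly what that hand-maintained approach looks like, using only the Python standard library; the `h1.post-title` rule is made up, and it is exactly this kind of per-site rule that you must rewrite whenever a layout changes:

```python
from html.parser import HTMLParser

# Hand-written extraction rule for one hypothetical site's layout:
# grab the text inside <h1 class="post-title">.
class PostTitleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "h1" and ("class", "post-title") in attrs:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title and data.strip():
            self.titles.append(data.strip())

parser = PostTitleParser()
parser.feed('<html><h1 class="post-title">My Post</h1></html>')
print(parser.titles)  # ['My Post']
```

Multiply this by every site and every page layout you need data from, and the maintenance tradeoff mentioned above becomes clear.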

Our crawling, and proxy support, is fairly robust thanks to our hiring the
creator of Gigablast [[https://gigaom.com/2013/09/10/diffbot-brings-big-time-
search...](https://gigaom.com/2013/09/10/diffbot-brings-big-time-search-
poobah-aboard-to-help-it-scale/)].

If you'd like to give Diffbot another go or you have some examples where the
extraction could be improved, please let me know!

~~~
jorgecurio
What proxies do you use and is it possible for us to use our own?

I've never seen that ML visual interface, where can I find it?

I'll give it another go.

