
Google's fact-checking bots build vast knowledge bank - spountzy
http://www.newscientist.com/article/mg22329832.700-googles-factchecking-bots-build-vast-knowledge-bank.html?full=true
======
bra-ket
Kevin Murphy ([https://github.com/murphyk](https://github.com/murphyk)) is the
lead developer of Bayes Net toolbox
([https://code.google.com/p/bnt/](https://code.google.com/p/bnt/)) and PMTK:
[https://github.com/probml/pmtk3](https://github.com/probml/pmtk3)

This knowledge graph is probably the largest Bayesian network out there

------
sixQuarks
This is going to set the stage for the next battle between spammers and
Google.

Spammers will be populating the web with "facts" that suit themselves.

~~~
cletus
Like many here I'm a huge fan of Neal Stephenson. A lot of people around
weren't big fans of Anathem. I actually really liked it.

One of the ideas that came up in that book was that the Reticulum (Internet) was
populated by "botnet ecologies" that subtly manipulated facts, streams and the
like such that filtering this out became another industry (of course).

I've seen the idea that this lies in our future raised here and it seems to
get mocked. I think the idea has a lot of merit.

~~~
mentat
This is immediately what came to mind for me too. The level of confidence for
facts as referenced. I'm wondering how bogons might work into this.

------
dm2
>> "Behind the scenes, Google doesn't only have public data," says Suchanek.
It can also pull in information from Gmail, Google+ and Youtube. "You and I are
stored in the Knowledge Vault in the same way as Elvis Presley," Suchanek
says.

I really hope Google does not use Gmail data for projects other than ads. They
really need to ask users to opt-in to this kind of data sharing. I'm ok with
gmail being read for ads, but almost anything else is unethical, especially
some experimental knowledge base.

~~~
jacquesm
Why should google care what you are ok with after they already have all your
data? If you don't want them to be able to engage in activities like this then
_don't give them your data in the first place_.

~~~
nhaehnle
It's still _my_ data. Post office employees are not allowed to read my
letters, even though I have given them into their care.

There are very good reasons why we, as a society, have agreed to disallow many
activities that are physically possible. There's a good case to be made that
such a rule should be explicitly added where organizations are entrusted with
private data.

~~~
icehawk219
It most certainly is not your data. It's on their servers, in their apps, and
running through their network. They decide what they do with it, how long they
keep it, and whether or not you even have access to it. Comparing them to the
post office doesn't really make sense either considering that's a public
service, and one a depressingly large number of people want to get rid of.
Google is a for profit company and their data is how they make money.

If you don't like that reality then don't use their service. It really is that
simple.

~~~
waterlesscloud
If I leave some loose hairs on an airline seat, does the airline now own my
dna?

~~~
TeMPOraL
The very concept of "owning data" is nonsense, as your example clearly shows.

------
discardorama
How does this compare with NELL[0] from CMU? I'm assuming it's something like
NELL, but scaled up 1000x because Google is not limited to how often it can
search its own index, whereas NELL is limited to 10K queries/day?

[0] [http://rtw.ml.cmu.edu/rtw/](http://rtw.ml.cmu.edu/rtw/)

------
murphyk
Hi, I’m Kevin Murphy, one of the researchers at Google who worked on this
project. Just to be clear, KV did NOT involve any private data sources -- it
just analyzed public text on the web. (And yes, we do try to estimate
reliability of the facts before incorporating them into KV.) Also, KV is not a
launched product, and is not replacing Knowledge Graph.

Unfortunately, I cannot do a more detailed Q&A here, but if you want more
details, please read the original paper here:
[http://www.cs.cmu.edu/~nlao/publication/2014.kdd.pdf](http://www.cs.cmu.edu/~nlao/publication/2014.kdd.pdf).
(Note that an earlier version of the work was presented at a CIKM workshop in
Oct 2013 (see [http://www.akbc.ws/2013/](http://www.akbc.ws/2013/) and
[http://cikm2013.org/industry.php#kevin](http://cikm2013.org/industry.php#kevin)).)
We have also published tons of great related research at
[http://research.google.com/pubs/papers.html](http://research.google.com/pubs/papers.html)

------
turbolent
Paper:
[http://www.cs.cmu.edu/~nlao/publication/2014.kdd.pdf](http://www.cs.cmu.edu/~nlao/publication/2014.kdd.pdf)

------
dctoedt
Sounds a bit like Douglas Lenat's CYC project from the 1980s [1], but done by
machine.

[1] [http://en.wikipedia.org/wiki/Cyc](http://en.wikipedia.org/wiki/Cyc)

------
batbomb
HNers interested in this might also be interested in Deep Dive from Stanford
CS Professor Chris Ré.

[http://deepdive.stanford.edu/](http://deepdive.stanford.edu/)

~~~
turbolent
The paper
([http://www.cs.cmu.edu/~nlao/publication/2014.kdd.pdf](http://www.cs.cmu.edu/~nlao/publication/2014.kdd.pdf))
mentions the extracted knowledge base is about 38 times larger than
DeepDive's, the largest previous comparable system.

------
panarky
_It might even be possible to use a knowledge base as detailed and broad as
Google's to start making accurate predictions about the future based on
analysis and forward projection of the past._

Hello Hari Seldon, psychohistory and mathematical sociology!

[http://en.wikipedia.org/wiki/Foundation_series](http://en.wikipedia.org/wiki/Foundation_series)

[http://en.wikipedia.org/wiki/Mathematical_sociology](http://en.wikipedia.org/wiki/Mathematical_sociology)

------
walterbell
Is any subset of the "derived knowledge" from public websites and data
contributed back to a public dataset like DBpedia?

There are bots [1] making Wikipedia contributions, Google could also make
automated contributions to Wikipedia/Wikidata.

[1] [http://wikipedia-edits.herokuapp.com/](http://wikipedia-edits.herokuapp.com/)

~~~
rryan
A subset of the knowledge graph is available via freebase RDF dumps:
[https://developers.google.com/freebase/data](https://developers.google.com/freebase/data)

I don't believe this is what the article is talking about (knowledge vault)
though. This is just the human and lightly machine curated graph (knowledge
graph).

~~~
__Joker
You are right. This paper proposes to use Freebase (it can be any other source)
as prior knowledge.

------
jnbiche
I see a lot of downvoting here of posts that express very reasonable concerns
about privacy _if_ Google is actually using private emails for this AI.

That Google is engaging in this behavior is indeed speculation, as far as I
know. However, Google employees/allies have to realize that attempts to
suppress debate on this issue can only backfire on them. Indeed, the fact that
they don't have an explicit policy on this (correct me if I'm wrong) is one of
the reasons researchers are speculating.

It may well be that most people would agree with and/or permit Google to use
their data in this way, but people should be given the opportunity to debate
it in a reasonable fashion, else it looks like it was forced down their
throats. And that's no good for anyone.

------
dave_sullivan
>> "Behind the scenes, Google doesn't only have public data," says Suchanek.
It can also pull in information from Gmail, Google+ and Youtube. "You and I are
stored in the Knowledge Vault in the same way as Elvis Presley," Suchanek
says.

Ugh... that's a bit much... because now any employee at google could
potentially get access to random facts about me gleaned from my personal and
business emails? Good luck keeping different levels of confidential
information segregated correctly. That's awesome.

~~~
api
[https://www.youtube.com/watch?v=upu0gwGi4FE](https://www.youtube.com/watch?v=upu0gwGi4FE)

~~~
dave_sullivan
Sure, I'm aware, but this is different.

Collecting anonymous statistics about its users does not include automatically
generating a database indexed by individual based on their private data. One
is par for the course when selling bundles of users according to demographic to
advertisers while the other is fucking crazy.

Mining public web data for building a database like that is one thing, but
mining individual private data like this is crossing a line.

------
illumen
Knowing the people who have left Google, who collected a lot of that data, who
we trusted, who are now gone, I wonder what other non-public data is being
used, and how is it being used, and for only good purposes, or for nefarious
purposes?

------
holri
Facts are not knowledge. Read Socrates / Plato.

~~~
adventured
Knowledge is the grasp of the facts of reality.

Most of the ideas produced by Socrates / Plato / Aristotle were in fact wrong.
They are not a good primer on epistemology, concepts, percepts, metaphysics or
anything else. They're a good primer on the history of philosophy.

They inspired incredible progress on thinking and understanding, but they were
wrong more often than they were right, and are a poor reference to
understanding what knowledge is.

~~~
holri
This is a contradiction:

"Knowledge is the grasp of the facts of reality." is what Socrates in
essence said about knowledge.

Then you say Socrates was wrong.

------
ck2
Isn't it nice that millions of people made web pages that Google decided to
scrape to harvest the work of others and run ads next to it for themselves?

Now try scraping Google and see what they do to you.

~~~
dm2
You can ban GoogleBot easily, just put a line in your robots.txt file, but
then people won't be able to easily find your site using Google services.

If you provide value to Google they will make an API to allow accessing that
data easier.

By scraping do you mean scraping their search results? They offer this, which
is nice: [https://developers.google.com/custom-search/](https://developers.google.com/custom-search/)

Many large sites don't allow scraping because of unnecessary server load
(denial of service sometimes) so they'll offer an API where you can download
content in a controlled (and monitorable) manner.
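
As a rough illustration of the robots.txt point above, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content and the `example.com` URL are made up for the example; the sketch only shows how a compliant crawler would interpret such rules, not how Googlebot itself is implemented.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt (an assumption for illustration): block Googlebot
# everywhere, allow every other crawler.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Allow: /
"""


def can_fetch(agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a compliant crawler named `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


googlebot_allowed = can_fetch("Googlebot", "https://example.com/page.html")
other_allowed = can_fetch("SomeOtherBot", "https://example.com/page.html")
print(googlebot_allowed, other_allowed)
```

Note that robots.txt is advisory: it only works for crawlers that choose to honor it.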

~~~
frik
We need a robots.txt and "noindex" metatag standard for emails.

If one sends an email to a GMail/Outlook.com/Yahoo email address, one should
be able to opt-out of their email crawler, advertisement analysis, artificial
intelligence analysis, etc.

~~~
dm2
The user who receives it can:
[https://support.google.com/ads/answer/2662922?hl=en](https://support.google.com/ads/answer/2662922?hl=en)

I don't think they do too much storing of email details, they know that it's a
sensitive area and that an employee will eventually blow the whistle and it
will hurt user trust, which is a big part of their business.

~~~
frik
I meant it the other direction.

1) Alice sends an email to Bob (GMail user).

2) Bob receives the email, meanwhile Google scrapes the content and extracts
its meaning to show Bob some ads and use the facts to improve Google's A.I.

Alice wants a way to mark her email content as "no index".

So that email service providers don't crawl through the content. Exactly like
the robots.txt for domains or the "noindex" metatag in HTML head element!
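
A minimal sketch of what such an opt-out could look like, assuming a sender-set header. The header name `X-No-Index` and both helper functions are hypothetical, invented here for illustration; no such email standard exists, and a provider would have to choose to honor it.

```python
from email.message import EmailMessage

# Hypothetical header name; not part of any existing email standard.
NOINDEX_HEADER = "X-No-Index"


def build_message(sender: str, recipient: str, body: str,
                  noindex: bool = True) -> EmailMessage:
    """Alice's side: build an email, optionally marked as 'no index'."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Hello"
    if noindex:
        msg[NOINDEX_HEADER] = "true"
    msg.set_content(body)
    return msg


def provider_may_analyze(msg: EmailMessage) -> bool:
    """Provider's side: a cooperating provider skips marked messages."""
    return msg.get(NOINDEX_HEADER, "").strip().lower() != "true"


marked = build_message("alice@example.org", "bob@example.com", "Hi Bob")
unmarked = build_message("alice@example.org", "bob@example.com", "Hi Bob",
                         noindex=False)
print(provider_may_analyze(marked), provider_may_analyze(unmarked))
```

Like robots.txt, this would be purely advisory: it expresses the sender's wish and relies on the provider to respect it.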

~~~
lern_too_spel
Why should Alice get to restrict what Bob can do with his inbox?

~~~
frik
It's not about Bob, he can do what he wants.

It's about the email service provider, which should stop analyzing the email text
to extract its meaning. Gmail uses it to display ads to Bob, builds a shadow
profile for Alice (like Facebook) and trains an artificial intelligence (see
headline link).

------
plicense
"Knowledge Vault has pulled in 1.6 billion facts to date", does this fact also
include the fact that I am adding more facts right now? What fact metric is
this fact?

------
hanula
Are there any open source efforts like this?

