Google's fact-checking bots build vast knowledge bank (newscientist.com)
136 points by spountzy on Aug 22, 2014 | hide | past | web | favorite | 83 comments



Kevin Murphy (https://github.com/murphyk) is the lead developer of Bayes Net toolbox (https://code.google.com/p/bnt/) and PMTK: https://github.com/probml/pmtk3

This knowledge graph is probably the largest Bayesian network out there


This is going to set the stage for the next battle between spammers and Google.

Spammers will be populating the web with "facts" that suit themselves.


Like many here I'm a huge fan of Neal Stephenson. A lot of people around weren't big fans of Anathem. I actually really liked it.

One of the ideas that came up in that book was the Reticulum (Internet) was populated by "botnet ecologies" that subtly manipulated facts, streams and the like such that filtering this out became another industry (of course).

I've seen the idea that this lies in our future raised here and it seems to get mocked. I think the idea has a lot of merit.


This is immediately what came to mind for me too. The level of confidence for facts as referenced. I'm wondering how bogons might work into this.


This makes sense. For me, the main problem of Google is trying to retrieve a treasure out of garbage. While the Internet has a lot of good information, much (most) of it is incorrect -- sometimes on purpose as you suggest. I would be much more interested in a learning system that is able to retrieve information from authoritative sources such as books, for example.


Part of the problem with that would be telling which books are authoritative for which topics. Or, more interestingly, which authors.


And of course authors and publishers will try to game this. Spam is an AI-complete problem: you'll need a system grokking human values and making judgement calls to filter out spam perfectly.


Good to know that the Elephant Population is thriving.

http://spring.newsvine.com/_news/2006/08/01/307864-stephen-c...


>> "Behind the scenes, Google doesn't only have public data," says Suchanek. It can also pull in information from Gmail, Google+ and Youtube. "You and I are stored in the Knowledge Vault in the same way as Elvis Presley," Suchanek says.

I really hope Google does not use Gmail data for projects other than ads. They really need to ask users to opt in to this kind of data sharing. I'm ok with Gmail being read for ads, but almost anything else is unethical, especially some experimental knowledge base.


> I really hope Google does not use Gmail data for projects other than ads.

It's already used by the Google Now cards on Android, and it's a fantastic feature. If I book a flight, I automatically get a card that reminds me to leave for the airport at the correct time (taking traffic into account), without any interaction on my part.


If any flight itinerary hits gmail at all, in fact, it ends up in Now - as I've found out from itineraries forwarded by family and friends. Has been borderline annoying on occasion, since I don't generally care much if someone else's flight has been delayed.


Last week I searched google for more information about a specific compiler error code. Later that day google now showed me flight info for some flight that happened to have the same code.


I find that fantastically useful when I'm traveling to meet family somewhere or when family is traveling to meet me. You can handily swipe them away if they are not useful though.


You really shouldn't treat your family that way.


It doesn't make any sense to do that, for exactly the reasons you mention. You'd gain little value and basically ruin all public trust in you.

Luckily the guy who said that is from Télécom ParisTech, i.e. he was completely speculating.

Public posts from google+ and youtube are fine, though.


Maciej Cegłowski: The Internet with a Human Face http://idlewords.com/bt14.htm

One of the best discussions bar none of this issue I've seen.


Why should google care what you are ok with after they already have all your data? If you don't want them to be able to engage in activities like this then don't give them your data in the first place.


It's still my data. Post office employees are not allowed to read my letters, even though I have given them into their care.

There are very good reasons why we, as a society, have agreed to disallow many activities that are physically possible. There's a good case to be made that such a rule should be explicitly added where organizations are entrusted with private data.


I'd love a world in which email providers would be held to the same standards as the post office.

But that's not the world we currently live in.

The only thing that sets a limit on what google can do with your data is the amount of data you give them. They also have terms of service and privacy policies but these can change over time and/or be re-interpreted in creative new ways to enable whatever it is they want to do next.


Well yes. The main part of your comment is descriptive: you are describing the state of the world as it is today.

However, there is a normative side to the debate as well. This is what I (and you in your first line) explicitly referred to. This side is about asking what state of the world is desirable. It is perfectly legitimate and good to ask this question, so that we might hopefully act upon the answer once it has been found. That is how progress is made in the world.


Legal restrictions do apply - for EU citizens, quite a few rights cannot be taken away by 'terms&conditions' of online companies.


What standard? The US Post office scans the mail it handles.


The concept of data ownership doesn't make much sense. Data is infinitely copyable and infinitely inferrable, thanks to the magic of causality (at Google scale, if I couldn't read something from your mail, I could probably correlate it out of your search queries, web browsing patterns and location history). The discussion should be about the ways to obtain a particular piece of information and the ways to use it.

The perfect example to illustrate this is actually what waterlesscloud wrote downthread:

> If I leave some loose hairs on an airline seat, does the airline now own my dna?

Do you own your DNA? What the hell would that even mean?


> Do you own your DNA?

Yes. Intellectual property, clean and simple. If someone can make a buck off my DNA, then I get my cut. Prevents exploitation such as this:

http://en.wikipedia.org/wiki/Henrietta_Lacks

"Neither Lacks nor her family gave her physician permission to harvest the cells. At that time, permission was neither required nor customarily sought. The cells were later commercialized. In the 1980s, family medical records were published without family consent. This issue and Mrs. Lacks' situation was brought up in the Supreme Court of California case of Moore v. Regents of the University of California. On July 9, 1990, the court ruled that a person's discarded tissue and cells are not their property and can be commercialized."


No, the "intellectual property" term is absurd, not clean and simple. You cannot own a piece of data like you'd own a physical object. I'll pass the mike to RMS here.

http://www.gnu.org/philosophy/not-ipr.en.html

Also you call developing a vaccine to cure Polio an exploitation? As far as I can tell from cursory reading of that article, this "exploitation" was hugely beneficial to society.


> this "exploitation" was hugely beneficial to society.

Which explains why so many people were reluctant to acknowledge the source.


It most certainly is not your data. It's on their servers, in their apps, and running through their network. They decide what they do with it, how long they keep it, and whether or not you even have access to it. Comparing them to the post office doesn't really make sense either, considering that's a public service, and one a depressingly large number of people want to get rid of. Google is a for-profit company and their data is how they make money.

If you don't like that reality then don't use their service. It really is that simple.


Nope. My data is me. The totality of my data is literally my identity. If you know everything about me, you can steal my identity and assume my living role.

I don't like the reality of the US War Machine killing innocents simply to enrich crony war profiteers. By your reasoning, I should stop paying taxes too.


> By your reasoning, I should stop paying taxes too.

You can, depending on how principled you are about things like this. You'd still need to give up your American citizenship; otherwise it doesn't matter where you live on the planet.


I'm unconvinced: what about UPS? They're not a public service and are a for-profit company.


You can phrase that even stronger. In many countries, the post office is privatized. Examples:

http://en.m.wikipedia.org/wiki/Deutsche_Post

http://en.m.wikipedia.org/wiki/KPN


[deleted]


No it doesn't. That privacy policy is pretty clear that they won't share personal information without my consent, and I'm quite certain a court would agree.


Sure it makes sense. The post office could do the same nasty things with your mail. But they don't, even though they could. Just because you can doesn't mean you should.


If I leave some loose hairs on an airline seat, does the airline now own my dna?


The very concept of "owning data" is nonsense, as your example clearly shows.


They certainly can scan it if you leave it behind.


No, but they have access to it.


The USPS is only partially public. It is mostly a private company with a government influenced charter, which is the same basic structure as every corporation.


Gmail users permit google to analyze their email for their own purposes.


Emails are all about sender and receiver. Often only one participant is using a GMail address. Using email text for ad purposes is one thing; analyzing email text where Google only acts as a carrier and using it for A.I. purposes is a whole different thing.


None of that makes any difference to Google. Their view is that people benefit from what they do.


Email providers and all other providers should eventually, hopefully soon, have access only to encrypted data (to remove the temptation to use this valuable data...)


I should be able to use services from companies based on some terms and expect those terms to be respected.

You're basically saying I shouldn't expect any sort of fair treatment or rights from any service provider on the Internet. I don't want to play on your Internet.


Jacques is describing the Internet as it currently is. If you don't want to play on it, you need to do something about it or stop playing.


Google does respect its terms. Their terms of service let them do whatever they want with your information, and they can update their terms at any time.


Instead of downvoting... maybe someone could point out where and how exactly Google violated their own terms?


Google can (and does) change its terms without notice.

It's modestly better about this than many other SaaS / PaaS providers, but not by much.

I'm having a conversation at this moment with the chief architect of G+ over the G+/YouTube Anschluss in which the two services were integrated. I had separate accounts on each prior to this, repeatedly refused to combine accounts, and yet found them combined as of last November.

Worse: individual users have little or no recourse against such actions.

As for Gmail, as has been pointed out, parties not using Google directly have their private correspondence entered into Google's systems. And not just when emailing Gmail addresses, but many domains for which email is handled via Gmail.

Similar arguments could be made for many other online service providers as well. I don't consider Google to be significantly different from many of these, either for better or worse. But they're certainly a massive and major part of the problem, particularly for their size and scope.

Bruce Schneier and Eben Moglen have made this point quite well, particularly in their December, 2013 Columbia Law School talk, and Schneier's April, 2014, Stanford Law School lecture.

Maciej Cegłowski, "The Internet with a Human Face", makes the case far better yet. http://idlewords.com/bt14.htm


The actual corpus that is worth using is the book corpus. While Google can't provide public access to all of the books it has scanned there is no restriction on them using the data in the books to feed this project. Given the amount of information they have scanned from libraries and elsewhere that is a much better source.


Is anyone doing the same for the books scanned by Archive.org?


Reminds me of Doctorow's Scroogled. http://craphound.com/scroogled.html

The funny thing is Doctorow makes references to "just metadata" years before it became a public issue. However, this goes beyond metadata, and will eventually contain facts about people, not just tangential stuff.

"This isn't P.I.I."—Personally Identifying Information, the toxic smog of the information age—"It's just metadata. So it's only slightly evil."


One quibble: Metadata is facts about people. Logging meta-data isn't a slippery-slope toward also catching facts. It's a problem from the jump.

"Joe goes to the gym three times a week" is a fact.

"Joe's network activity originates from a gym on the following schedule" is not only at least an equivalent fact, in practice it's far superior to the simple case. It can give you subtleties [1], it's less susceptible to subterfuge [2], it gives you actionable evidence of specific occurrences, etc.

Consider the CIA doesn't use meta-data to target hellfire missiles because it's less identifying than actual data. They use it because it's far better.

[1] Joe never goes to the gym on Saturday. Joe goes to the gym more during the spring than the winter. Joe almost never misses a day when Sally is at the gym. Joe and Sally nearly always leave at the same time.

[2] It's trivial for someone to say they go to the gym on a schedule they don't. It's not even too difficult to get a second or third party to fudge, embellish or outright lie on their behalf. It's much more difficult to get a second or third party to help you make your device convincingly take the claimed routine, without you creating any conflicting meta-data that gives up the ruse.
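
To make the "metadata is facts" point concrete, here is a minimal sketch of how a schedule falls out of nothing but connection timestamps. Everything in it is an assumption for illustration: the log entries, the network labels and the helper name `visits_by_weekday` are all made up, and a real system would be far more elaborate.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical connection log: (ISO timestamp, network label). All values are made up.
LOG = [
    ("2014-08-04T18:05:00", "gym-wifi"),
    ("2014-08-06T18:10:00", "gym-wifi"),
    ("2014-08-08T18:02:00", "gym-wifi"),
    ("2014-08-11T18:07:00", "gym-wifi"),
    ("2014-08-13T18:01:00", "gym-wifi"),
    ("2014-08-14T09:00:00", "office-wifi"),
]

def visits_by_weekday(log, network):
    """Count how many log entries for `network` fall on each weekday."""
    counts = Counter()
    for stamp, seen_network in log:
        if seen_network == network:
            day_index = datetime.fromisoformat(stamp).weekday()  # 0 = Monday
            counts[WEEKDAYS[day_index]] += 1
    return counts

gym_days = visits_by_weekday(LOG, "gym-wifi")
print(gym_days)
```

No message content is read at any point; the weekday pattern (and the absence of Saturdays) comes purely from when and where the device showed up.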


This is important and should be a top-level post. I think many people miss this in discussions about privacy, and this is the reason I believe privacy is dead. Everything you do is metadata about everything else you do, and about everything else people around you do. You can infer any piece of "data" you want given enough (meta)data sources and computing power.

Thus the only way we can keep privacy would be to roll back the last 50 years of technological progress, and that's why I'm starting to entertain a thought that we (as a society) should drop the concept entirely and tackle the change head-on, instead of being dragged there by force by the ongoing progress of technology.


How does this compare with NELL[0] from CMU? I'm assuming it's something like NELL, but scaled up 1000x because Google is not limited to how often it can search its own index, whereas NELL is limited to 10K queries/day?

[0] http://rtw.ml.cmu.edu/rtw/


Hi, I’m Kevin Murphy, one of the researchers at Google who worked on this project. Just to be clear, KV did NOT involve any private data sources -- it just analyzed public text on the web. (And yes, we do try to estimate reliability of the facts before incorporating them into KV.) Also, KV is not a launched product, and is not replacing Knowledge Graph.

Unfortunately, I cannot do a more detailed Q&A here, but if you want more details, please read the original paper here: http://www.cs.cmu.edu/~nlao/publication/2014.kdd.pdf. (Note that an earlier version of the work was presented at a CIKM workshop in Oct 2013; see http://www.akbc.ws/2013/ and http://cikm2013.org/industry.php#kevin.) We have also published tons of great related research at http://research.google.com/pubs/papers.html



Sounds a bit like Douglas Lenat's CYC project from the 1980s [1], but done by machine.

[1] http://en.wikipedia.org/wiki/Cyc


HNers interested in this might also be interested in Deep Dive from Stanford CS Professor Chris Ré.

http://deepdive.stanford.edu/


The paper (http://www.cs.cmu.edu/~nlao/publication/2014.kdd.pdf) mentions the extracted knowledge base is about 38 times larger than DeepDive's, the largest previous comparable system.


It might even be possible to use a knowledge base as detailed and broad as Google's to start making accurate predictions about the future based on analysis and forward projection of the past.

Hello Hari Seldon, psychohistory and mathematical sociology!

http://en.wikipedia.org/wiki/Foundation_series

http://en.wikipedia.org/wiki/Mathematical_sociology


Is any subset of the "derived knowledge" from public websites and data contributed back to a public dataset like Dbpedia?

There are bots [1] making Wikipedia contributions, Google could also make automated contributions to Wikipedia/Wikidata.

[1] http://wikipedia-edits.herokuapp.com/


A subset of the knowledge graph is available via freebase RDF dumps: https://developers.google.com/freebase/data

I don't believe this is what the article is talking about (knowledge vault) though. This is just the human and lightly machine curated graph (knowledge graph).


You are right. This paper proposes to use Freebase (it can be any other source) as prior knowledge.


I see a lot of downvoting here of posts that express very reasonable concerns about privacy if Google is actually using private emails for this AI.

That Google is engaging in this behavior is indeed speculation, as far as I know. However, Google employees/allies have to realize that attempts to suppress debate on this issue can only backfire on them. Indeed, the fact that they don't have explicit policy on this (correct me if I'm wrong) is one of the reasons researchers are speculating.

It may well be that most people would agree with and/or permit Google to use their data in this way, but people should be given the opportunity to debate it in a reasonable fashion, else it looks like it was forced down their throats. And that's no good for anyone.


>> "Behind the scenes, Google doesn't only have public data," says Suchanek. It can also pull in information from Gmail, Google+ and Youtube. "You and I are stored in the Knowledge Vault in the same way as Elvis Presley," Suchanek says.

Ugh... that's a bit much... because now any employee at google could potentially get access to random facts about me gleaned from my personal and business emails? Good luck keeping different levels of confidential information segregated correctly. That's awesome.



Sure, I'm aware, but this is different.

Collecting anonymous statistics about its users does not include automatically generating a database indexed by individual based on their private data. One is par for the course when selling bundles of users according to demographic to advertisers, while the other is fucking crazy.

Mining public web data for building a database like that is one thing, but mining individual private data like this is crossing a line.


Knowing that people have left Google who collected a lot of that data, people we trusted and who are now gone, I wonder what other non-public data is being used, how it is being used, and whether it is for good purposes only or for nefarious ones.


Facts are not knowledge. Read Socrates / Plato.


Knowledge is the grasp of the facts of reality.

Most of the ideas produced by Socrates / Plato / Aristotle were in fact wrong. They are not a good primer on epistemology, concepts, percepts, metaphysics or anything else. They're a good primer on the history of philosophy.

They inspired incredible progress on thinking and understanding, but they were wrong more often than they were right, and are a poor reference to understanding what knowledge is.


This is a contradiction:

"Knowledge is the grasp of the facts of reality." is what Socrates in essence said about knowledge.

Then you say Socrates was wrong.


Isn't it nice that millions of people made web pages that Google decided to scrape to harvest the work of others and run ads next to it for themselves?

Now try scraping Google and see what they do to you.


You can ban GoogleBot easily, just put a line in your robots.txt file, but then people won't be able to easily find your site using Google services.

If you provide value to Google they will make an API to allow accessing that data easier.

By scraping do you mean scraping their search results? They offer this, which is nice: https://developers.google.com/custom-search/

Many large sites don't allow scraping because of unnecessary server load (denial of service sometimes) so they'll offer an API where you can download content in a controlled (and monitorable) manner.
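
For anyone curious what "a line in your robots.txt" looks like in practice, here is a minimal sketch that checks a made-up robots.txt with Python's standard `urllib.robotparser`. The file contents, the `SomeOtherBot` user agent and the example.com URL are assumptions for illustration; this only shows how a well-behaved crawler would read the rules, not how Googlebot itself is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that bans Googlebot but leaves other crawlers alone.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse from memory instead of fetching a URL

# Ask the parser what each (hypothetical) crawler may fetch.
googlebot_allowed = parser.can_fetch("Googlebot", "http://example.com/page.html")
other_allowed = parser.can_fetch("SomeOtherBot", "http://example.com/page.html")

print("Googlebot allowed:", googlebot_allowed)
print("SomeOtherBot allowed:", other_allowed)
```

Note that robots.txt is advisory: it only keeps out crawlers that choose to honor it.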


We need a robots.txt and "noindex" metatag standard for emails.

If one sends an email to a GMail/Outlook.com/Yahoo email address, one should be able to opt out of their email crawler, advertisement analysis, artificial intelligence analysis, etc.


The user who receives it can: https://support.google.com/ads/answer/2662922?hl=en

I don't think they do too much storing of email details, they know that it's a sensitive area and that an employee will eventually blow the whistle and it will hurt user trust, which is a big part of their business.


I meant it the other direction.

1) Alice sends an email to Bob (GMail user).

2) Bob receives the email, meanwhile Google scrapes the content and extracts its meaning to show Bob some ads and use the facts to improve Google's A.I.

Alice wants a way to mark her email content as "no index".

So that email service providers don't crawl through the content. Exactly like the robots.txt for domains or the "noindex" metatag in the HTML head element!


Why should Alice get to restrict what Bob can do with his inbox?


It's not about Bob, he can do what he wants.

It's about the email service provider, which should stop analyzing the email text to extract its meaning. Gmail uses it to display ads to Bob, builds a shadow profile for Alice (like Facebook) and trains an artificial intelligence (see headline link).


It would be nice if there were a way to provide Google, Bing, et al. the data so they weren't constantly crawling sites. A not-insignificant amount of our bandwidth is used up daily by search engines spidering sites.


Those millions of people want google to scrape and harvest, in the hope that they will rank higher etc etc.

If an unknown person tries to scrape, he/she will promptly get banned by those very same people (Google wouldn't like someone scraping their stuff either).

Different players different rules, I guess.


Which large website did you get banned from for scraping? I did some scraping and never got banned... perhaps my scraping rate was not too fast.


If you try to mass scrape almost any major site (millions of pages of content) they'll block you.

For example, if you went one by one through Stack Overflow and sucked out every question and answer, your scraper bot would get banned (unless you're doing one request per minute, in which case you'll never finish).

Or if you tried to scrape Twitter.
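
To put a rough number on "you'll never finish", here is a back-of-the-envelope sketch. The page count of 10 million is a made-up figure for illustration, not a measured size of any site, and `crawl_days` is a hypothetical helper.

```python
def crawl_days(pages, requests_per_minute):
    """Days needed to fetch `pages` pages at a steady request rate."""
    minutes = pages / requests_per_minute
    return minutes / (60 * 24)

# Assumed site size: 10 million pages (hypothetical), fetched at one request per minute.
days = crawl_days(10_000_000, 1)
print(f"{days:.0f} days, about {days / 365:.0f} years")
```

Under those assumptions a polite one-request-per-minute crawl takes on the order of two decades, which is why rate limits amount to a ban on mass scraping.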


"Knowledge Vault has pulled in 1.6 billion facts to date", does this fact also include the fact that I am adding more facts right now? What fact metric is this fact?


Are there any open source efforts like this?



