Wikipedia is now drawing facts from the Wikidata repository (gigaom.com)
118 points by eksith 1730 days ago | 79 comments

OT: I was using Wikipedia the other day and it occurred to me how primitive it is to have all the inner links to other Wikipedia articles defined manually. Surely this should have been automated by now (i.e., marking a word or two would link you to the relevant article).

There's a lot of research dating back to the early days of hypertext on automatic link insertion, but afaict it hasn't really caught on in any system, whether hypertext or wiki-based, with the exception of those somewhat spammy content farms that auto-cross-link their articles. Wikipedia has indeed spawned a new wave of such research, e.g. (random example pulled out of Google Scholar): http://www.cs.waikato.ac.nz/~ihw/papers/08-DNM-IHW-LearningT...

I'd be curious how good the results are. I've found a bunch of articles, but no live demo. If someone set up a small-scale version where you compare the auto-linked version of a few hundred Wikipedia articles with the existing manually-linked version, I think that could convince people it was worth adopting (if the results looked good).

There's a certain confusion, or duality, in the way Wikipedia links work. If you access a director's page for example, and a sentence states he has made "20 movies", clicking on "movies" can either take you to his separate filmography page or to the general article "Movie". I believe only the first option should be manually defined.

The Free Dictionary uses something like that - try double clicking words within the definition: http://www.thefreedictionary.com/link

> There's a lot of research dating back to the early days of hypertext on automatic link insertion, but afaict it hasn't really caught on in any system

Except, of course, any mention of the name of a listed company in a financial/business publication.

Plenty of webforums use text ad-links interspersed with content for unlogged-in guest users.

The disambiguation could be challenging. (Does "The Sun" link to our closest star or the tabloid in the UK.)

It would be a fun project to try and determine the correct link based on the context.
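A toy version of that context idea: compare the words around the ambiguous mention against a short snippet of each candidate article and pick the best overlap. Everything below is invented for illustration; the candidate snippets stand in for real article text, and a serious system would use something stronger than bag-of-words overlap.

```python
# Toy disambiguator: pick the candidate article whose text shares the most
# vocabulary with the sentence surrounding the ambiguous mention.
import re

def tokens(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def disambiguate(context, candidates):
    """candidates: dict mapping article title -> representative snippet."""
    ctx = tokens(context)
    return max(candidates, key=lambda title: len(ctx & tokens(candidates[title])))

candidates = {
    "Sun": "the star at the center of the solar system, plasma, light, heat",
    "The Sun (newspaper)": "british tabloid newspaper published in london",
}

sentence = "The Sun rose over the horizon, bathing the planet in light."
print(disambiguate(sentence, candidates))  # → Sun
```

In practice you would want TF-IDF weighting or an actual classifier rather than raw set intersection, but even this crude overlap separates the star from the tabloid in most sentences.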

This reminds me of an article I read about IBM's Watson. As a demo in front of a lot of people, a researcher was going through a stack of journals and feeding in data about anthrax. Most of the data was about animals, but the researcher was asking Watson to extrapolate possible effects on people. Watson responded "I assume by people you mean humans, and not People magazine."

Edit: found the link (PDF) http://www.cs.mtu.edu/~nilufer/classes/cs5811/2003-fall/hilt... Here's the actual quote:

  “Do you mean Anthrax (the heavy-metal band),
  anthrax (the bacterium) or anthrax (the disease)?”

  “The bacterium,” was the typed answer, followed by 
  the instruction, “Comment on its toxicity to people.”

  “I assume you mean people (homo sapiens),” the
  system responded, reasoning, as it informed its
  programmer, that asking about People magazine
  “would not make sense.”

Late edit: funny that I remembered this being Watson when it's actually Cyc, a longtime rival of the Watson project.

Just make the link to the disambiguation page, if there is one? Otherwise, make it a special link that doesn't go anywhere directly, but uses some javascript/CSS to raise a dialog when clicked, that gives you the different choices?

Auto-generate disambiguation pages?

When a search term matches a tag, and none of the tagged pages have a clear "majority probability" of being correct, it would display a list of all pages with the tag, in order of popularity.
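A sketch of that "majority probability" rule, with the threshold and view counts invented for illustration: if one tagged page dominates historical traffic for the term, link straight to it; otherwise return all tagged pages ordered by popularity.

```python
MAJORITY = 0.6  # arbitrary cutoff for "clear majority"

def resolve(term, pages):
    """pages: dict mapping page title -> view count for this term."""
    total = sum(pages.values())
    ranked = sorted(pages, key=pages.get, reverse=True)
    top = ranked[0]
    if pages[top] / total > MAJORITY:
        return top      # clear winner: link directly
    return ranked       # ambiguous: show the ordered list

print(resolve("mercury", {"Mercury (planet)": 700,
                          "Mercury (element)": 200,
                          "Freddie Mercury": 100}))   # → Mercury (planet)
print(resolve("sun", {"Sun": 500, "The Sun (newspaper)": 450}))
```

The second call falls below the threshold (500 of 950 views), so the reader would see the ordered list instead of a direct link.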

I don't think it would be a good idea if Wikipedia required you to run JavaScript to navigate the site, and it would make for pretty bad SEO.

Wikipedia doesn't need to care too much about SEO.

> Just make the link to the disambiguation page, if there is one? Otherwise, make it a special link that doesn't go anywhere directly, but uses some javascript/CSS to raise a dialog when clicked, that gives you the different choices?

Both of these things would be amazingly annoying to the majority of Wikipedia users.

I'm only talking about auto-generated links that can't be clearly disambiguated by the system. At worst, the experience wouldn't be any worse than it is today.

> I'm only talking about auto-generated links that can't be clearly disambiguated by the system.

But they could be disambiguated by humans, which is my point. Humans understand context.

Sure, and when humans create links, they should continue to create them just like they do now. I'm picturing an "auto linkifier" that creates links that no human has gotten around to creating yet.

Whether or not something like that would be a net win for Wikipedia is up for debate I guess. That said, I think they already do have a bot that can do at least a limited amount of auto-linkification, but I can't swear to it.

I guess an added bonus of what you're suggesting is that the correct link could be crowdsourced; if the system kept track of which of the options users clicked on, it could figure out pretty quickly which one is correct.
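A minimal sketch of that crowdsourcing idea, with the class name, thresholds, and storage all placeholder choices: record which target readers pick from the choice dialog, and promote a direct link once one option clearly dominates.

```python
from collections import Counter

class LinkVotes:
    """Tally reader choices for one ambiguous link."""

    def __init__(self, min_clicks=20, min_share=0.8):
        self.clicks = Counter()
        self.min_clicks = min_clicks  # don't decide on tiny samples
        self.min_share = min_share    # required fraction for the winner

    def record(self, target):
        self.clicks[target] += 1

    def resolved_target(self):
        total = sum(self.clicks.values())
        if total < self.min_clicks:
            return None
        target, count = self.clicks.most_common(1)[0]
        return target if count / total >= self.min_share else None

votes = LinkVotes()
for _ in range(19):
    votes.record("Sun")
votes.record("The Sun (newspaper)")
print(votes.resolved_target())  # → Sun (19 of 20 clicks, 95%)
```

Once `resolved_target()` returns something, the system could quietly convert the dialog link into a direct one, exactly as suggested above.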

Exactly this.

It would take some AI to work it out, unless it used the context around the link. "Sun" could refer to many things; it's not really possible for Wikipedia to know which one you're on about. So links are still done manually.

I don't think it is primitive. Wikipedia is edited by humans, I don't think an automatic link algorithm could do a better job at this time.

I would venture there's a way to make "concepts" and "entities" (e.g. a name, a product, or an academic field) become linkable automatically based on existing articles, but that would take a bit of engineering. But then there's going to be a high number of links to articles that haven't been created yet, or that were deleted/merged for lack of notability.

I don't know if it has that exact feature or not, but Semantic Mediawiki has a lot of extensions to base Mediawiki that involve working with data at a semantic level.


If not, doing that should be possible using an NLP library that does NER. Along with heuristics, one could use the list of currently existing articles as a seed.

Edit: Of course, if all you're trying to do is link to existing pages, then you don't use the set of existing pages as a seed, you just use them as the list. But if you're trying to extract "entities" that don't have WP pages yet, then you'd still want to fall back to other NER techniques, which include various heuristics and what-not. Whether or not there would be any value in that is an open question, I suppose.
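For the "use existing titles as the list" version, here's a minimal sketch with a toy title set; a real linker would also need to respect existing wiki markup, redirects, and per-article linking policy. Longest titles win so "New York City" beats "New York", and only the first occurrence of each title is linked (which matches Wikipedia's usual convention anyway).

```python
import re

def auto_link(text, titles):
    # Find the first occurrence of each title, longest titles first,
    # skipping any span already claimed by a longer match.
    covered = []
    for title in sorted(titles, key=len, reverse=True):
        m = re.search(r"\b" + re.escape(title) + r"\b", text)
        if m and not any(s < m.end() and m.start() < e for s, e in covered):
            covered.append((m.start(), m.end()))
    # Apply replacements right-to-left so earlier offsets stay valid.
    out = text
    for s, e in sorted(covered, reverse=True):
        out = out[:s] + "[[" + text[s:e] + "]]" + out[e:]
    return out

titles = {"New York City", "New York", "hypertext"}
print(auto_link("She moved to New York City to study hypertext.", titles))
# → She moved to [[New York City]] to study [[hypertext]].
```

For millions of titles you'd swap the per-title regex scan for a single-pass matcher like Aho-Corasick, but the overlap-resolution logic stays the same.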

Good resource! And I think your idea of linking to the disambiguation page makes sense, but there may be a way to infer the correct article from the list of links based on the context of the text in the linking article.

Yeah, there's that as well.

Also, if you're interested in that sort of thing, two other projects you might find interesting are:




Both involve extracting semantic meaning from unstructured data. It's pretty cool stuff.

Here is a quick demo of Stanbol-provided Wikipedia annotations and disambiguation in a WYSIWYG editor:


Here is one solution they could use: http://bergie.iki.fi/blog/automated-linking/

I'm surprised no one (AFAICT) has attempted a family tree of all humans. There is an obvious demand for this information because many commercial services exist. Users are paying to upload their personal genealogical data to proprietary for-profit silos. Yet this data would be much more productive in an open system with user data from all services.

You could seed the database with famous people's family trees from Wikidata. The Mormon church also has lots of genealogy data that (perhaps :) they might share for not-for-profit use.

The biggest challenge would be preventing trolls and spammers from uploading false data. I've sketched out some rough ideas where family links can be "thumbs up'd" bidirectionally by people on both sides of the connection, but not necessarily the immediate people.
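A sketch of that bidirectional thumbs-up idea, with the data model and the "one endorsement per side" rule invented for illustration: a parent/child edge only counts as verified once users from both sides of the connection have endorsed it.

```python
from collections import defaultdict

class FamilyGraph:
    """Family links verified by endorsements from both sides of the edge."""

    def __init__(self):
        # edge (parent, child) -> endorsing user ids, grouped by side
        self.endorsements = defaultdict(
            lambda: {"parent_side": set(), "child_side": set()})

    def endorse(self, parent, child, user, side):
        self.endorsements[(parent, child)][side].add(user)

    def is_verified(self, parent, child):
        e = self.endorsements[(parent, child)]
        return bool(e["parent_side"]) and bool(e["child_side"])

g = FamilyGraph()
g.endorse("Ada", "Byron Jr", "cousin_on_adas_side", "parent_side")
print(g.is_verified("Ada", "Byron Jr"))  # → False (only one side so far)
g.endorse("Ada", "Byron Jr", "grandchild", "child_side")
print(g.is_verified("Ada", "Byron Jr"))  # → True
```

A production system would obviously need identity verification and higher thresholds, but the key anti-troll property is already here: a spammer controlling only one side of a link can never get it verified alone.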

The genealogists are always happy to share their data. It's part of the culture.

There are no open-source genealogy programmers because the young nerds who do open source don't care about genealogy.



There are others. I guess there aren't many, but there are some.

I care about genealogy! Of course, it's debatable if I'm considered "young" since I'm 30. I think maybe there would be privacy issues and personality rights to photographs that would get in the way of such a big family tree.

Not to mention a security issue as well since most financial institutions ask for your mother's maiden name as a security question.

Can someone point to an example wikipedia page that's actually drawing from wikidata? When I hit wikidata and look up random items (say, item 1000, the country of Gabon: http://www.wikidata.org/wiki/Q1000 ) and then look at the source for the English wikipedia page for Gabon, I don't see anything that suggests any of the facts are coming from wikidata.

This seems to be more "all wikipedias can now draw facts from wikidata", and certainly isn't "all facts in wikipedia come from wikidata." The former is cool, the latter would be mind-blowing - but I'm not sure how far along we are on the path to "{all|most|some|a few} facts come from wikidata"

Poking around a bit, I found this press release: https://blog.wikimedia.de/2013/04/24/wikidata-all-around-the...

It looks like the ability to pull Wikidata into infoboxes was just rolled out a week ago, so I guess it wouldn't actually be used yet in a lot of places.

It's too bad that all of these grand visions and truly positive developments for humanity are locked inside of an exclusionary, elitist organization that is governed by deletionism and will take any contribution you might possibly make and CRUSH IT LIKE A BUG.

This is a good example for Paul Graham's "middlebrow dismissal" data set. Generic negative comment about the subject of the article that could be copy-pasted into any article matching a "Wikipedia" string search by a bot.

I see the brow a great deal lower than you see it. Middlebrow is unduly dignified.

Your comment is itself a negative, middlebrow dismissal. The OP's negativity is middlebrow precisely because it's accurate, but it's beating a dead horse. This means it gets upvoted repeatedly, whereas a genuinely lowbrow comment like "FUK U GAY WIKIPEDIA" would be quickly downvoted.

I think by this point you've robbed the word "middlebrow" of any meaning at all.

> deletionism

you might enjoy Deletionpedia: http://www.deletionpedia.dbatley.com/w/index.php?title=Main_...

Very interesting :)

I can see why most were deleted, I chose to view a few pages at random and they were:

- a really non-notable musician (probably self-promotion)

- a hoax ("The Independent City-State of Sonora")

- a game guide to Super Smash Brawl, this is probably something that should have gone on a blog / GameFAQs

- a witty bio about a non-notable person (probably self-bio, or a friend's)

I believe the four of them were deservedly well deleted. Two of the cases, the musician and the game guide, should be hosted elsewhere (personal blog or website).

"deservedly well deleted"

Under a criterion of "notability" that works great for dead trees but is largely irrelevant to an online, primarily text-based reference.

Does anybody know if there's a project out there that's the union of Wikipedia & Deletionpedia? Or, IOW, "Wikipedia with deleted pages preserved"? If not, that would be an interesting thing to create...

That's a bit harsh.

It's true the admins take notability pretty seriously, but to be fair, so does Encyclopaedia Britannica. Wikipedia folks freely admit that they're elitist in that regard, but I don't think that's necessarily a bad thing. They set it to that level for the sake of maintaining quality and relevancy (which is a measure of notability) to an acceptable degree.

The foundation's ultimate goal is to make a reliable resource for knowledge after all.

What we need to "balance it out", if you will, is maybe another entity (or entities) with a broader scope and maybe a looser threshold for notability, but hopefully with more reliability through expert verification (Citizendium comes to mind: http://en.citizendium.org , but it's nowhere near as comprehensive). Still, it's not easy to run something like Wikipedia, so whoever does it would have to have deep pockets and be seriously dedicated.

Britannica actually accepts user generated content. They are superior to wikipedia, however, in that they have credible experts vet the content.

The complaint about wikipedia is that it fails to live up to its ideals. If you had ever been a wikipedia editor you wouldn't talk the way you do. Even as a domain expert it's intolerable to deal with the wikipedia amateur hour culture. It's far more rewarding to submit your articles to Encyclopedia Britannica.

Just so everyone knows, this person is a troll.

[citation needed]


The database is available under creative commons, and they're working on an API, so I'm confused what you're talking about--whether Wikipedia is an elitist organization or not, I don't see how that would matter.

Cite some examples.

gwern talks about it a bunch on his site [1].

[1]: http://www.gwern.net/In%20Defense%20Of%20Inclusionism#the-ed...

> deletionism

Deletionism is the reason it's worth using to begin with. It prevents it from being flooded by crap and allows it to stay on mission. Not everything has to be all things to all people.

The presence of an article on obsolete BBS software (for instance), somewhere in the Wikipedia encyclopedia, does not intersect with the use-case of anyone who isn't already searching for it.

Wikipedia isn't one big giant article, where the existence of marginally noteworthy elements distracts you or diverts your attention. The articles that get deleted by the little notability hitlers are contributions with no cost incurred by anyone, save the trivial storage space.

It's like saying some guys tripod page that you never visit is somehow cluttering your experience of web browsing.

I've had my run-ins with Wikipedia admins (on the Spanish Wikipedia they were convinced I was promoting myself by creating my country's Finance Minister's Wikipedia page), but most of the deletions are no-brainers (looking at Deletionpedia makes it very clear).

I wish they'd err on the side of leaving stuff alone in the 1% of cases where it's not black and white (like the obsolete BBS software example)

> The articles that get deleted by the little notability hitlers are contributions with no cost incurred by anyone, save the trivial storage space.

Godwin, and wrong on the merits, as well.

The stuff that gets deleted is the stuff that can't be verified, which means there's no way to fact-check it. It isn't about storage space or clutter: It's about not having stuff in there that can't be verified.

It's inaccurate to say that wikipedia is fact checked or verified. The existence of a citation does not imply the existence of a fact check or a verification, or verifiability. Even when citations are high quality, the info they're supporting is still usually unverifiable due to wikipedia's disconnection from the community of experts.

Wikipedia is nothing more than the biggest plagiarist / content farm on the Internet. It isn't scrutinized because it has been grandfathered in.

Wikipedia is the ebaumsworld of information. Completely unreliable. Steals credit, traffic, royalties from the content creators. Policies focused on self preservation rather than serving a public good or respecting creators.

Absolutely none of this is true and reflects your bias more than anything else. Anyone with a passing familiarity with Wikipedia would know it to be false. Your lack of knowledge is apparent and your opinions are of no value.

Absolutely none of your comment is true; it reflects an unexamined bias and a complete unfamiliarity with the depths of wikipedia, how wikipedia stands up to alternatives, and the nature of the shitty content on wikipedia. Your opinion is completely worthless and it's safe to say you're an uneducated ideologue.

The fact you're the only one here making your extreme claims is strong evidence in my favor unless you think you're that much better than everyone else here.

Really. You're really going to use that logical fallacy to support the religion for which you are a true believer.

The delicious irony of your hatred is that your point is so poorly argued, you must be a wikipedia editor.

I'm actually a pretty accomplished wikipedia editor with several original article credits. The articles are standing today. I have barnstars and everything. But I actually submit my articles to Encyclopedia Britannica now, because I realized the truth about wikipedia. It's just a really low quality content farm that can't be trusted on anything.

What is the point of getting your knowledge from unreliable losers? All the biggest wikipedia editors are no-life losers with zero respect in any real intellectual community. They are divorced from the community of experts and receive nothing but scorn from them.

What's the point of reading an encyclopedia written by people who you can't rely on? Getting 80% of the content right is not an achievement--every content farm on the internet does the same--from eHow to expertsexchange.

Wikipedia is just an ideology and volunteer driven low quality content farm. Due to Wikipedia's overpowering marketing/SEO, when Wikipedia writes an article on a topic, that Wikipedia article will now have higher visibility than the original information source that it scraped and now cites. The Wikipedia article will steal traffic from the original source.

The internet would be so much better without content farms. And Wikipedia is the worst of the worst.

When I feel like writing an encyclopedia article, I send it now to Encyclopedia Britannica. They've edited my writing and incorporated parts into their high quality encyclopedia.

By the way, that study that said that wikipedia was just as good as Britannica was complete bullshit--as flawed as your informal fallacy that you just shat out right above this comment.

So you expect us to believe you without citation? And you haven't given any specific examples for anything in all your verbiage. Frankly you sound like you're trolling.

Haha, I was thinking the same about you. If you're a troll I give you credit: you know exactly how to impersonate the stupidity of a Wikipedia true believer. Asking for citations when they're not relevant. Ironic use of logical fallacies. Substance-free ideological bandwagoning.

To map all human knowledge.

Isn't non-notable knowledge still knowledge? They are only interested in a small slice of human knowledge by their own stated goals.

To include verifiable human knowledge.

The semantic web has gotten one step closer!

Good. For quite some time, census data was incorporated as a one-time substitution into geographical articles by a bot. With this new development, I suppose that census data can be incorporated as a "transclusion" that is updated automatically either on schedule or on demand.

But you will need a bot to populate and update the WikiData source. :)

I wish they would do something like this with Wiktionary. A dictionary doesn't need all the flexibility of wiki-syntax, and having the data in a stricter structure makes it much more useable.

There's a bit of discussion loosely collected here: http://meta.wikimedia.org/wiki/Wiktionary_future

Wikidata is interesting ... it seems to be a normal MediaWiki install with the custom-developed "WikiBase" extension: https://www.mediawiki.org/wiki/Extension:Wikibase

It'd be nice if the Wikimedia projects had a proper GitHub presence - it's hard to get a sense of how plausible a self-hosted version of this is.

There's gerrit.wikimedia.org, and it is mirrored to github.com/wikimedia. Not sure how you can get a sense of how plausible a self-hosted version of this is from that though :) The mediawiki.org page you linked is the best resource for that.

Well, this discussion is quickly moving toward the shortcomings of Wikipedia.

There's plenty of opinion of the deletion-happy (if that's the operative term) policy of the admins, especially as it pertains to notability, and I think this is a common complaint : http://www.highprogrammer.com/alan/rants/wikipedia-delete.ht...

I do something that may seem like trying to fix a leaky dam with chewing gum, but every time I see an article nominated for deletion, I copy the source to a private wiki I'm running. Sometimes, the deletion nomination goes away, other times it does get deleted, but at least then, I have a copy.

But we also have to keep in mind that there are plenty of other resources on the web specifically aimed at niche topics, e.g. Wikia. I've lost count of how many comic book related articles were deleted only to show up on the DC or Marvel wikis. ( http://dc.wikia.com/ http://marvel.wikia.com/ )

Likewise, it's not unreasonable that a lot of content gets missed by the editors, since it's a big place and most editors have one or two areas that they focus on. As the editing guidelines state, when in doubt engage in dispute resolution, not edit wars. If you can make a good case for why an article should be there in the first place, be persuasive in the talk pages.

So a few pointers for people getting angry at Wikipedia:

First, ask whether the article is a good fit for the wiki. Can it go on a blog or a niche wiki (like a dedicated Wikia) instead? The Pokémon example is a bit extreme, but I think many of those pages may get deleted or merged. There's always the Pokémon Wikia : http://pokemon.wikia.com

Second, notability is a very tricky thing. Reputable sources may be even trickier. Rather than debating notability, focus on reputable sources (since that's the biggest hiccup for references). If you can link NY Times articles or BBC or some other news source, rather than just community sites, other blogs (depending on popularity) etc... you'll have a better chance of getting the article/section through and staying there.

Third, well-written articles have a better chance of surviving than those that give off the 2-3 paragraph stub vibe. The more reputable citations and the better the content structure you can give, the better the chances an article survives (this may partly explain the Pokémon pages too). If you have trivia, try to merge it into the content body rather than listing it at the end; that feels tacked on and superfluous if it doesn't directly support the main content.

Fourth, try to be a bit more empathetic to the goals of the Wiki while being objective to the subject when asserting your views (especially on controversial articles). How you word things is a big hint as to whether that content will remain or get scrubbed the next day/hour/minute.

Now, I hope we can go back to discussing Wikidata and how Wikipedia and everyone else will benefit.

I consider TV Tropes to be the spiritual successor of wikipedia, and what wikipedia should have been, and would encourage people to contribute there instead - even for more "encyclopaedic" topics, its "Useful Notes" articles are often more informative than their wikipedia equivalents. It has an explicit policy of "no such thing as notability".

If you're happy relying on a 3rd party site... I recommend archive.is for keeping records of pages.

http://archive.is/EOZqQ - The Imaginary Theatre ( post deletion obviously :( )

They could have made this and many other advances years ago, had they adopted contextual advertising. I don't think anyone would argue that the advertiser-financed services provided by Google are a bad thing.


Insignificant revenue. The click-through rate might very well be so low that only a small amount of revenue would be brought in by the ads. It would not be worth barraging thousands of readers with ads for only a few pennies of revenue.

Ads cheapen the encyclopedia. By their very nature, ads are biased content intended to influence people. They are thus diametrically opposed to the goals of a neutral encyclopedia intended to inform people. They would cheapen the encyclopedia in the eyes of many readers, as evidenced by the numerous anti-ad comments received during every donation drive.

Contributors may leave. Many contributors vigorously oppose ads (see the forking of the Spanish Wikipedia), and in 2009 the Wikimedia Foundation promised to keep "Wikipedia. Ad-free forever." Since about 2002, Jimbo Wales has repeatedly stated that he opposes all advertising on Wikipedia as well. Based on these statements, some editors have probably contributed with the understanding that their content would not be diluted with ads. Changing the long-standing no-ads policy now could reasonably be perceived as a bait and switch tactic. Numerous contributors are likely to leave as a result and new ones are less likely to start. Contributor goodwill is Wikipedia's main asset and should not be gambled with.

Annoying and distracting. Readers come to us for encyclopedic information, not for ads. Ads have to be processed by the brain (if only subconsciously) and therefore distract and annoy. "The free encyclopedia" also means: free from distractions and annoyances.

Privacy violation. If an ad consolidator such as Google AdSense is used, the privacy of our readers is compromised. The consolidator will invariably learn which Wikipedia articles a given IP address reads or searches for; they can then correlate that information with other data they may have about that IP address (e.g. Gmail account).

et cetera et cetera et cetera.


Your arguments are very persuasive, although they could use a bit of elaboration.

He copied and pasted from a Wikipedia page, so I spent as much time crafting my response as he did. More importantly, those arguments are so blatantly wrong on their surface that I don't believe they justify more than a one word response.
