Wikidata, the free knowledge base that anyone can edit (wikidata.org)
153 points by a3_nm on March 6, 2013 | hide | past | favorite | 42 comments

First of all, having worked with wikis for about ten years, finally seeing this live in Wikipedia is huge -- but it'll take a long time for it to have full impact.

The first baby step is interwiki links. Previously, every article in every version of Wikipedia had a long list of links to every other version of that article, maintained by a small army of flaky bots that clogged up edit histories and often stepped on each other's toes. Now, there's a single Wikidata node that holds the one mapping of which article corresponds to which, reflected across all Wikipedia versions. Here's "Tokyo" now:


And here's what it replaces, repeated dozens of times in every language:


The next step will be infoboxes. Instead of every Wikipedia having a separate copy of the population of Tokyo or the GDP of Canada, updated ad hoc by different people whenever they get around to it, there will be a single place storing that data automatically reflected into all Wikipedias.

And it keeps going. Taxonomies of plants and animals change all the time, and Wikidata can become their single repository. Wikivoyage currently has to store the phone numbers of each hotel separately in each language version; Wikidata will allow centralizing them. Here's the master plan:


That page for Tokyo looks suspiciously like the internals of a natural language processing system. I can't wait until someone hooks up a probabilistic speech parser to Wikidata. A talking computer with the knowledge of Wikipedia could be incredible.

I'm pretty excited about this, if part of the plan is to encourage Wikipedia contributors to add their data as data instead of as finished output.

I've spent some time lately building election results maps for Google, and while Wikipedia has been a useful resource, it's also been very frustrating.

There is a wealth of geographic data on Wikipedia, but hardly any of it is in a usable form. Look at all the interesting maps you see there. Somebody generated each of those maps from actual usable data such as shapefiles [1] or KML files [2]. The generated maps are nice to look at, but I can't do anything else with them. I need the shapefiles or equivalent to build other kinds of maps with the data.

For the Brazil election, we had shapefiles for the municipalities which had been mangled by a process that converted all the municipality names to uppercase with no accent marks. I wanted to display the correctly capitalized and accented names and found this list:


This was great! It had all the data I needed. But it's in a jumbled format that looks nice in the wiki but doesn't lend itself to machine use. It took me a few hours to write some Python code to parse the page and get it into a CSV format that I could import into my database. This is a frustrating kind of work, because I was pretty sure that somebody had started with nice tabular data and generated the initial version of this Wikipedia page from that.
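A rough sketch of the matching step described above, assuming the mangling was exactly "strip accent marks, then uppercase". The sample names and the `mangle`/`restore` helpers are made up for illustration; this is not the code the commenter actually wrote.

```python
import unicodedata


def mangle(name: str) -> str:
    """Reproduce the assumed mangling: drop accent marks, then uppercase."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.upper()


# Hypothetical sample of correctly spelled names, e.g. parsed from a wiki list.
correct_names = ["São Paulo", "Belém", "Goiânia", "Maceió"]

# Index the correct spellings by their mangled form.
by_mangled = {mangle(name): name for name in correct_names}


def restore(shapefile_name: str) -> str:
    """Return the accented spelling, or the input unchanged if it is unknown."""
    return by_mangled.get(shapefile_name, shapefile_name)


print(restore("SAO PAULO"))
```

Two different names can mangle to the same key, so a real import would probably also need to match on state or some other identifier.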

If we're eventually able to get this kind of data as real usable data, that will save a lot of people a lot of work.

[1] http://en.wikipedia.org/wiki/Shapefile

[2] http://en.wikipedia.org/wiki/Keyhole_Markup_Language

I just read through the entire front page and I have no idea:

1) what need wikidata will fill

2) what they want me to do

3) how I would go about doing (2) (other than "starting the wikidata community", mentioned at the bottom of the page, which sounds like a lot of work).

Maybe I'm not their target audience, but this sure as hell isn't a good elevator pitch.

What...is...this? Is there any organization of these datapoints? All I see on the front page is links to single datapoints, such as this:


Which corresponds to "victory", defined as "term that applies to success" , and also known as "win" and "success".

Sorry, but what's the need here? This just seems to dilute whatever's going on at Wikipedia.

It's a triple store (sort of): "things" and the verbs associated with them. Right now every article in every language has its own copy of a data point, e.g. the population of San Francisco, and every copy must be maintained individually. In the future you'll be able to reference a single data point, and an update to it will show up in every article.
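To make the "reference a data point" idea concrete, here is a toy sketch. The store layout, the templates, and the population figure are all placeholders invented for illustration; none of it is Wikidata's actual data model or API.

```python
# One shared store keyed by (subject, property). The number is a placeholder,
# not a real population figure.
store = {("San Francisco", "population"): 800_000}

# Hypothetical per-language article snippets that reference the shared value
# instead of carrying their own copy.
templates = {
    "en": "San Francisco has a population of {population}.",
    "fr": "San Francisco compte {population} habitants.",
    "de": "San Francisco hat {population} Einwohner.",
}


def render(lang: str) -> str:
    """Fill a language's template from the single shared data point."""
    value = store[("San Francisco", "population")]
    return templates[lang].format(population=value)


# One edit to the store changes what every language version displays.
store[("San Francisco", "population")] = 805_000
for lang in templates:
    print(render(lang))
```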

So are you saying that in the future you could have a system where you type a statement as a description and then, conceivably, hover over items in the description for data? E.g., the statement "San Francisco's population is largely made up of immigrants to the Bay Area, where an estimated 20% of the inhabitants are actually native San Franciscan born"

Where hovering over "San Francisco's population" would give you the data-point from wikidata?

Is this not the level of parsing that Wolfram Alpha is trying to do?

Also, it would be really interesting to see the following:

Assume you create new google doc, and as you begin typing, in a small window/pane/pop-up, relevant information is displayed based on the context of what you are typing with subtle highlighting. As you typed out "San Francisco's population" that phrase would highlight and the context indicator would display that number.

What would be interesting about this is that if children were using a system like this from their early school days, would they passively absorb such information? Would it be annoying or useful?

> Assume you create new google doc, and as you begin typing, in a small window/pane/pop-up, relevant information is displayed based on the context of what you are typing with subtle highlighting. As you typed out "San Francisco's population" that phrase would highlight and the context indicator would display that number.

I've been thinking along the same lines, but for a study prosthetic. Imagine a head-mounted camera that OCRs as you read. So, when you read "Navier-Stokes" there's a sidebar with everything that you know about Navier-Stokes: equations, code samples, etc.

This sounds like it'd be an awesome app for Google Glass.

It's an application of the "semantic web". The hope is that one day it will be possible to give a computer the link (or a successor) that you cited, and the computer will in some way be able to "understand" what victory means.

Languages don't have bijective mappings of concepts, so this is a hard problem. Do you have an ontology you'd like to propose?

Property P107 (http://www.wikidata.org/wiki/Property:P107) has emerged as Wikidata's de facto upper ontology. It currently consists of six main types: person, organization, event, creative work, term, and geographical feature. It's essentially a clean port of the high-level entities from the GND Ontology -- a controlled vocabulary developed by the German National Library and released last summer (http://d-nb.info/standards/elementset/gnd).

There's a fair amount of debate over that property. Are those current high level types (person, place, work, event, organization, term) a good fit for a knowledgebase that aims to structure all knowledge and not just library holdings? Does classifying subjects like inertia, DNA, Alzheimer's disease, dog, etc. as simply "terms" make sense?

More reading related to Wikidata, ontology and types: https://blog.wikimedia.de/2013/02/22/restricting-the-world/.

No, I mean I was really confused. The term "WikiData" seems to connote data of the tabular type, like a central repository for public data. Though I'm also confused at how the mapping for this particular term (for "victory") can't be done in Wikipedia or the Wiki dictionary.

It is, but the problem is whether you are talking about an individual language version of Wikipedia or the Wikipedia project as a whole. In the article they talk about the problem of maintaining Interwiki links on each individual language version, rather than centrally.

This is also just one aspect of Wikidata. The centrality of shared table content is important, too. Why have data in a specific language version Wikipedia and point to it from other versions when you can have a central repository that is pointed to using templates from each language version?

"The term "WikiData" seems to connote data of the tabular type"

There's a strong correspondence between "tabular data" (you probably mean relational) and triples (<predicate,X,Y>). Both are based on first-order predicate logic, so there's actually a natural mapping.
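A small sketch of that mapping, using the <predicate,X,Y> ordering from the comment above. The sample rows and both function names are invented for illustration: each non-key cell of a row becomes one triple, and grouping triples by subject rebuilds the rows.

```python
# Hypothetical relational rows; "id" plays the role of the primary key.
rows = [
    {"id": "city1", "name": "Tokyo", "country": "Japan"},
    {"id": "city2", "name": "Ottawa", "country": "Canada"},
]


def rows_to_triples(rows, key="id"):
    """Turn every non-key cell into a (predicate, subject, value) triple."""
    return [
        (column, row[key], value)
        for row in rows
        for column, value in row.items()
        if column != key
    ]


def triples_to_rows(triples, key="id"):
    """Group triples by subject to rebuild one row per subject."""
    rebuilt = {}
    for predicate, subject, value in triples:
        rebuilt.setdefault(subject, {key: subject})[predicate] = value
    return list(rebuilt.values())


triples = rows_to_triples(rows)
print(triples)
print(triples_to_rows(triples) == rows)
```

The round trip only works this cleanly because every row here has at least one non-key column; a row with nothing but a key would produce no triples.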

Cool to see my jQuery tag widget[1] in use here :)

[1] http://aehlke.github.com/tag-it/

I'm not big into the Wiki world, but it's also struck me as odd how different pages refer to the same facts and yet are totally disparate. If one page updates, does somebody manually have to go update the other page? Does a bot do it?

This looks like a great response to that. I just hope they've made it easy to interface with.

Yup, manual updates and the occasional bot is how it works (or, more often, doesn't) in the pre-Wikidata world.

Note that the four-part barcode logo is “WIKI” in Morse code.

So... it's like a Freebase clone?

If I understand it correctly, it has more modest goals. Freebase was trying to make the uber-map of all data entities. Wikidata is just trying to make data reuse easier on Wikipedia.

For instance, imagine a table of all the populations of the countries of the world. Today, someone might make a really good one for the French Wikipedia. But then someone has to make it from scratch, all over again, for the Greek Wikipedia. And when someone updates the French one, the Greek one doesn't update, and vice versa.

With Wikidata you can define the data once, and then transclude it to different pages, with translated labels if necessary.

The first release attacks the problem of "inter-wiki links". On the left hand side of some Wikipedia pages, there are the links to equivalent pages in different languages. Check out the one for http://en.wikipedia.org/wiki/Jimmy_Wales, for instance. Right now these are updated with a system that looks at every possible connection (scaling at O(n^2)), and with Wikidata it will be more manageable.
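One way to read the quadratic-scaling claim, as back-of-the-envelope arithmetic: if a topic exists in n language editions and each edition's article links to all the others, that is n(n-1) stored links, versus n entries when every edition points at one central item. The function names are made up, and this counts stored links per topic rather than modeling what the bots actually did.

```python
def pairwise_links(n: int) -> int:
    """Links stored when each of n articles links to the other n - 1."""
    return n * (n - 1)


def hub_links(n: int) -> int:
    """Entries stored when each of n articles maps to one central item."""
    return n


for n in (10, 100, 250):
    print(n, pairwise_links(n), hub_links(n))
```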

Interestingly, interwiki links sometimes have some semantic drifting effect.

from Omnipedia http://brenthecht.com/papers/bhecht_CHI2012_omnipedia.pdf [pdf]:

""" One major source of ambiguities in the ILL graph is conceptual drift across language editions. Conceptual drift stems from the well-known finding in cognitive science that the boundaries of concepts vary across language-defined communities [13]. For instance, the English articles “High school” and “Secondary school” are grouped into a single connected concept. While placing these two articles in the same multilingual article may be reasonable given their overlapping definitions around the world, excessive conceptual drift can result in a semantic equivalent of what happens in the children’s game known as “telephone”. For instance, chains of conceptual drift expand the aforementioned connected concept to include the English articles “Primary school”, “Etiquette”, “Manners”, and even “Protocol (diplomacy)”. Omnipedia users would be confused to see “Kyoto Protocol” as a linked topic when they looked up “High school”. A similar situation occurs in the large connected concept that spans the semantic range from “River” to “Canal” to “Trench warfare”, and in another which contains “Woman” and “Marriage” (although, interestingly, not “Man”). """

Wikidata is trying to make re-use possible beyond Wikipedia, as well. In fact there are already a couple of apps built with it. Here's a trivial genealogy visualization using the API:


It's also worth noting that Freebase itself heavily relied on parsing Wikipedia database dumps to build its ontology -- to a large extent Wikidata is giving structure to data that's been in Wikipedia all along.

Thanks, it makes more sense now.

For one thing, Wikidata data has a different intellectual property regime.

Wikidata data is dedicated to the public domain, using http://creativecommons.org/publicdomain/zero/1.0/

Most Freebase data is licensed under a CC-BY license. Details are here: http://www.freebase.com/policies/attribution

A CC-BY license can be a burden, if you really want to fulfill all the terms of the license, namely: "You must attribute the work in the manner specified by the author or licensor." If there are 20,00 authors, are you really going to find out how each one wants you to give them attribution? It's impractical, so what you end up doing is giving what you think is reasonable attribution. But you never really know for sure.

Even worse, some of the material in Freebase is under other licenses, such as CC-BY-SA or GFDL.

Similar, but not exactly. To my understanding, Wikibase was introduced to avoid the duplication of language-neutral information (like infoboxes) and to improve the coherence of the same subject across multiple projects (like cross-language interwikis). Both are similar goals (and can be realized with the very same tools, like RDF) but the latter is quite specific to Wikimedia projects.

Atlantic.com (Apr2012) "The Problem With Wikidata"


Interesting to see if this will open up the web to more linked data based trends.

This could be a godsend for data exchange and knowledge structures. If data subjects are labeled with Wikidata nouns, data could be much more interchangeable. RDF should have gone that route a long time ago.

Populating this by hand seems like an enormous task that will never end. How hard could it be to automatically populate the data from publicly available data sources (e.g. SEC filings)?

That's actually already happening and will happen more in the future.

There's already http://dbpedia.org , but it's nice to see the Wikimedia Foundation finally take matters into its own hands.

Let's see, is it similar to WordNet but with open access (i.e., anyone can edit)?

WordNet is more of a linguistic resource that focuses on word senses. You would not find things like POPULATION OF CITY-X there.

Assurances of accuracy?

Each statement will have the ability to have sources. This is not currently supported by the UI (hence everyone is being a bit tentative and only putting in really obvious, uncontentious things) but when it does, it'll basically contain expressions of the form "X has property Y with a value of Z (type T), according to sources A, B and C".
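A minimal sketch of what such a sourced statement could look like as a data structure. The `Statement` class, its field names, and the sample values are assumptions made for illustration; they are not Wikidata's real schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Statement:
    """Hypothetical shape: X has property Y with value Z (type T), per sources."""
    subject: str
    prop: str
    value: object
    value_type: str
    sources: List[str] = field(default_factory=list)

    def describe(self) -> str:
        text = (
            f"{self.subject} has property {self.prop} "
            f"with a value of {self.value} (type {self.value_type})"
        )
        if not self.sources:
            return text + " (unsourced)"
        return text + ", according to " + ", ".join(self.sources)


# Placeholder values and source names, not real data.
sourced = Statement("X", "Y", "Z", "T", ["source A", "source B"])
unsourced = Statement("X", "Y", "Z", "T")
print(sourced.describe())
print(unsourced.describe())
```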

Similar to Wikipedia. That is, every triple ("statement") present in Wikidata may have an associated source. This is a mechanized version of [citation needed]...

That's the same question you can ask of Wikipedia as well!


Where does the fact numbering come from?

Q1 - universe

Q2 - Earth

Q3 - life


Q24 - Jack Bauer


Q76 - Barack Obama

It's sequential.

Right. With possibly some easter eggs ;-)
