First of all, having worked with wikis for about ten years, finally seeing this live in Wikipedia is huge -- but it'll take a long time for it to have full impact.
The first baby step is interwiki links. Previously, every article in every version of Wikipedia had a long list of links to every other version of that article, maintained by a small army of flaky bots that clogged up edit histories and often stepped on each other's toes. Now, there's a single Wikidata node that holds the single mapping of which article corresponds to which, reflected across all Wikipedia versions. Here's "Tokyo" now:
The next step will be infoboxes. Instead of every Wikipedia having a separate copy of the population of Tokyo or the GDP of Canada, updated ad hoc by different people whenever they get around to it, there will be a single place storing that data automatically reflected into all Wikipedias.
And it keeps going. Taxonomies of plants and animals change all the time; Wikidata can become their single repository. Wikivoyage currently has to store the phone numbers of each hotel separately in each language version; Wikidata will allow centralizing them. Here's the master plan:
That page for Tokyo looks suspiciously like the internals of a natural language processing system. I can't wait until someone hooks up a probabilistic speech parser to Wikidata. A talking computer with the knowledge of Wikipedia could be incredible.
I'm pretty excited about this, if part of the plan is to encourage Wikipedia contributors to add their data as data instead of as finished output.
I've spent some time lately building election results maps for Google, and while Wikipedia has been a useful resource, it's also been very frustrating.
There is a wealth of geographic data on Wikipedia, but hardly any of it is in a usable form. Look at all the interesting maps you see there. Somebody generated each of those maps from actual usable data such as shapefiles or KML files. The generated maps are nice to look at, but I can't do anything else with them. I need the shapefiles or equivalent to build other kinds of maps with the data.
For the Brazil election, we had shapefiles for the municipalities which had been mangled by a process that converted all the municipality names to uppercase with no accent marks. I wanted to display the correctly capitalized and accented names and found this list:
This was great! It had all the data I needed. But it's in a jumbled format that looks nice in the wiki but doesn't lend itself to machine use. It took me a few hours to write some Python code to parse the page and get it into a CSV format that I could import into my database. This is a frustrating kind of work, because I was pretty sure that somebody had started with nice tabular data and generated the initial version of this Wikipedia page from that.
If we're eventually able to get this kind of data as real usable data, that will save a lot of people a lot of work.
I'm not big into the Wiki world, but it's also struck me as odd how different pages refer to the same facts and yet are totally disparate. If one page updates, does somebody manually have to go update the other page? Does a bot do it?
This looks like a great response to that. I just hope they've made it easy to interface with.
If I understand it correctly, it has more modest goals. Freebase was trying to make the uber-map of all data entities. Wikidata is just trying to make data reuse easier on Wikipedia.
For instance, imagine a table of all the populations of the countries of the world. Today, someone might make a really good one for the French Wikipedia. But then someone has to make it from scratch, all over again, for the Greek Wikipedia. And when someone updates the French one, the Greek one doesn't update, and vice versa.
With Wikidata you can define the data once, and then transclude it to different pages, with translated labels if necessary.
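Roughly what I have in mind, as a toy Python sketch. The names (`FACTS`, `LABELS`, `render_row`) and the item ID are made up for illustration, and the numbers are placeholders rather than real figures; this is not how Wikidata actually stores or serves anything.

```python
# Hypothetical central store: one value per (item, property), shared by all wikis.
FACTS = {("Q_france", "population"): 1000}  # placeholder number, not a real figure

# Hypothetical per-language labels for items and properties.
LABELS = {
    "fr": {"Q_france": "France", "population": "Population"},
    "el": {"Q_france": "Γαλλία", "population": "Πληθυσμός"},
}


def render_row(lang, item, prop):
    """Render one table row for a language edition from the shared fact."""
    labels = LABELS[lang]
    return f"{labels[item]} | {labels[prop]}: {FACTS[(item, prop)]}"


# Updating the central value once changes what every edition renders.
FACTS[("Q_france", "population")] = 2000
print(render_row("fr", "Q_france", "population"))
print(render_row("el", "Q_france", "population"))
```

The point is only that the value lives in one place and the per-language part is reduced to labels.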
The first release attacks the problem of "inter-wiki links". On the left hand side of some Wikipedia pages, there are the links to equivalent pages in different languages. Check out the one for http://en.wikipedia.org/wiki/Jimmy_Wales, for instance. Right now these are updated with a system that looks at every possible connection (scaling at O(n^2)), and with Wikidata it will be more manageable.
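To put rough numbers on the scaling, here's a back-of-the-envelope Python sketch. It assumes the simplest possible model: every one of n language editions lists a link to every other edition, versus each edition being recorded once on a shared item. The function names are mine, and real link counts will differ since not every article exists in every language.

```python
def per_wiki_links(n):
    """Links stored when each of n language editions lists every other edition."""
    return n * (n - 1)


def central_links(n):
    """Links stored when each edition is recorded once on a shared item."""
    return n


for n in (10, 100, 250):
    print(n, per_wiki_links(n), central_links(n))
```

Under that model the per-wiki approach grows quadratically while the central one grows linearly, which is the whole argument for moving the mapping into one place.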
One major source of ambiguities in the ILL graph is conceptual drift
across language editions. Conceptual drift stems from the well-known
finding in cognitive science that the boundaries of concepts vary
across language-defined communities. For instance, the English
articles “High school” and “Secondary school” are grouped into a
single connected concept. While placing these two articles in the
same multilingual article may be reasonable given their overlapping
definitions around the world, excessive conceptual drift can result
in a semantic equivalent of what happens in the children’s game
known as “telephone”. For instance, chains of conceptual drift
expand the aforementioned connected concept to include the English
articles “Primary school”, “Etiquette”, “Manners”, and even
“Protocol (diplomacy)”. Omnipedia users would be confused to see
“Kyoto Protocol” as a linked topic when they looked up “High
school”. A similar situation occurs in the large connected concept
that spans the semantic range from “River” to “Canal” to “Trench
warfare”, and in another which contains “Woman” and “Marriage”
(although, interestingly, not “Man”).
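The "telephone" effect in that passage is easy to reproduce on a toy graph. Here's a small Python sketch that treats interlanguage links as undirected edges and collects everything reachable from one article. The `xx`/`yy` language codes and their article titles are invented placeholders, not real links, and the function name is mine.

```python
from collections import defaultdict, deque


def connected_concept(edges, start):
    """Return every (lang, title) node reachable from start via interlanguage links."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    seen = {start}
    queue = deque([start])
    while queue:  # plain breadth-first search
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Invented links: each hop looks reasonable on its own, but they chain together.
edges = [
    (("en", "High school"), ("xx", "School A")),
    (("xx", "School A"), ("en", "Secondary school")),
    (("en", "Secondary school"), ("yy", "School B")),
    (("yy", "School B"), ("en", "Primary school")),
    (("en", "River"), ("xx", "Waterway")),
]

concept = connected_concept(edges, ("en", "High school"))
print(sorted(title for lang, title in concept if lang == "en"))
```

No single link says "High school" equals "Primary school", yet both end up in the same connected concept, which is the drift the quote describes.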
It's also worth noting that Freebase itself heavily relied on parsing Wikipedia database dumps to build its ontology -- to a large extent Wikidata is giving structure to data that's been in Wikipedia all along.
A CC-BY license can be a burden, if you really want to fulfill all the terms of the license, namely: "You must attribute the work in the manner specified by the author or licensor." If there are 20,00 authors, are you really going to find out how each one wants you to give them attribution? It's impractical, so what you end up doing is giving what you think is reasonable attribution. But you never really know for sure.
Even worse, some of the material in Freebase is under other licenses, such as CC-BY-SA or GFDL.
Similar but not exactly. To my understanding, Wikibase was introduced to avoid the duplication of language-neutral information (like infoboxes) and to improve the coherence of the same subject across multiple projects (like cross-language interwikis). Both are similar goals (and can be realized with the very same tools like RDF) but the latter is quite specific to Wikimedia projects.
This could be a godsend for data exchange and knowledge structures. If data subjects are labeled with Wikidata nouns, the data could be much more interchangeable. RDF should have gone that route a long time ago.
It's a triple store (sort of): "things" with verbs associated with them. Right now every article in every language has its own copy of a data point, e.g. the population of San Francisco, and every article must be maintained individually. In the future you'll be able to reference a single data point and it will be updated in every article.
So are you saying that in the future you could have a system where you type a statement as a description and then, conceivably, hover over items in the description for data? E.g., the statement "San Francisco's population is largely made up of immigrants to the Bay Area, where an estimated 20% of the inhabitants are actually native San Franciscan born"
Where hovering over "San Francisco's population" would give you the data-point from wikidata?
Is this not the level of parsing that Wolfram Alpha is trying to do?
Also, it would be really interesting to see the following:
Assume you create a new Google Doc, and as you begin typing, relevant information is displayed in a small window/pane/pop-up based on the context of what you are typing, with subtle highlighting. As you typed out "San Francisco's population", that phrase would highlight and the context indicator would display that number.
What would be interesting about this is that if children were using a system like this from their early school days, would they passively absorb such information? Would it be annoying or useful?
> Assume you create new google doc, and as you begin typing, in a small window/pane/pop-up, relevant information is displayed based on the context of what you are typing with subtle highlighting. As you typed out "San Francisco's population" that phrase would highlight and the context indicator would display that number.
I've been thinking along the same lines, but for a study prosthetic. Imagine a head-mounted camera that OCRs as you read. So, when you read "Navier-Stokes" there's a sidebar with everything that you know about Navier-Stokes: equations, code samples, etc.
It's an application of the "semantic web". The hope is that one day it will be possible to give a computer the link (or a successor) that you cited, and the computer will in some way be able to "understand" what victory means.
Property P107 (http://www.wikidata.org/wiki/Property:P107) has emerged as Wikidata's de facto upper ontology. It currently consists of six main types: person, organization, event, creative work, term, and geographical feature. It's essentially a clean port of the high-level entities from the GND Ontology -- a controlled vocabulary developed by the German National Library and released last summer (http://d-nb.info/standards/elementset/gnd).
There's a fair amount of debate over that property. Are those current high level types (person, place, work, event, organization, term) a good fit for a knowledgebase that aims to structure all knowledge and not just library holdings? Does classifying subjects like inertia, DNA, Alzheimer's disease, dog, etc. as simply "terms" make sense?
No, I mean I was really confused. The term "WikiData" seems to connote data of the tabular type, like a central repository for public data. Though I'm also confused at how the mapping for this particular term (for "victory") can't be done in Wikipedia or the Wiki dictionary.
It is, but the problem is whether you are talking about an individual language version of Wikipedia or the Wikipedia project as a whole. In the article they talk about the problem of maintaining Interwiki links on each individual language version, rather than centrally.
This is also just one aspect of Wikidata. The centrality of shared table content is important, too. Why have data in a specific language version Wikipedia and point to it from other versions when you can have a central repository that is pointed to using templates from each language version?
"The term "WikiData" seems to connote data of the tabular type"
There's a strong correspondence between "tabular data" (you probably mean relational) and triples (<predicate,X,Y>). Both are based on first-order predicate logic, so there's actually a natural mapping.
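A quick way to see the mapping, as a Python sketch: each non-key cell of a row becomes one <predicate,X,Y> triple, and grouping triples by subject gives the rows back. The sample rows, column names, and function names are just illustrative assumptions.

```python
rows = [
    {"id": "tokyo", "country": "Japan", "kind": "city"},
    {"id": "canada", "kind": "country"},
]


def rows_to_triples(rows, key="id"):
    """Turn each non-key cell into a (predicate, subject, value) triple."""
    return [
        (col, row[key], val)
        for row in rows
        for col, val in row.items()
        if col != key
    ]


def triples_to_rows(triples, key="id"):
    """Group triples by subject to rebuild one row (dict) per subject."""
    out = {}
    for pred, subj, val in triples:
        out.setdefault(subj, {key: subj})[pred] = val
    return list(out.values())


triples = rows_to_triples(rows)
print(triples)
print(triples_to_rows(triples) == rows)
```

Note that the second row simply has no "country" triple; a missing cell in a table corresponds to an absent triple rather than a null.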
Each statement will have the ability to have sources. This is not currently supported by the UI (hence everyone is being a bit tentative and only putting in really obvious, uncontentious things) but when it is, a statement will basically be an expression of the form "X has property Y with a value of Z (type T), according to sources A, B and C".
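As a rough illustration of that shape, here's a Python sketch of a statement record with attached sources. The class and field names are my own guesses for illustration, not Wikidata's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    """Hypothetical record: subject X, property Y, value Z of type T, plus sources."""
    subject: str
    prop: str
    value: object
    value_type: str
    sources: list = field(default_factory=list)

    def describe(self):
        claim = (
            f"{self.subject} has property {self.prop} "
            f"with a value of {self.value} (type {self.value_type})"
        )
        if not self.sources:
            return claim + ", unsourced"
        return claim + ", according to " + ", ".join(self.sources)


print(Statement("X", "Y", "Z", "T", ["A", "B", "C"]).describe())
print(Statement("X", "Y", "Z", "T").describe())
```

Keeping the sources on the statement itself, rather than on the page, is what would let a reader or a bot tell contested values apart from well-supported ones.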