
Wikidata: The first new project from Wikimedia Foundation since 2006 - mattrichardson
http://meta.wikimedia.org/wiki/Wikidata
======
femto
Like others here, it's something I've been thinking about for a number of
years.

This is an important project, with the potential to eclipse Wikipedia, maybe
even growing to be the saviour of free software? My reasoning follows.

Currently we program computers by giving them a set of instructions on how to
achieve a goal. As computers grow more powerful, we will stop giving detailed
instructions. Instead, we will write a general purpose deduction/inference
engine, feed in a volume of raw data and let the computer derive the
instructions it must follow to achieve the given goal.

There are two parts to such a system: the engine and the data. The engine is
something that free software is capable of producing. The missing component is
the data. The wikidata project is this missing component.

I'm convinced that Wolfram Alpha is a glimpse of this future: an engine
coupled to a growing body of structured data. Wolfram's end game isn't taking
over search, but taking over computer programming and ultimately reasoning.
It's just that search is currently a tractable problem for Alpha, one that can
pay the bills until it becomes more capable. There will come a day when Alpha
is powerful enough to automatically translate natural language into structured
data, at which point it will spider the Internet, and its database and
capabilities will grow explosively.

Free software needs Wikidata, to arrive at this endpoint first and avoid being
made largely irrelevant by Alpha (or Google?)

~~~
Alex3917
"Free software needs Wikidata, to [] avoid being made largely irrelevant by
Alpha"

Wolfram Alpha is already completely worthless because it doesn't cite the
sources for any of its results. It's basically just a fancy search engine
built on top of a garbage dump.

~~~
mkr-hn
<http://www.wolframalpha.com/input/?i=how+heavy+is+earth>

Click "Source information."

~~~
Alex3917
That isn't a list of references, that's just a list of suggested reading. In
fact, it's not even guaranteed that any of the facts on that page come from
any of those sources. It's basically just showing a list of books that come up
when you Google for the question.

~~~
jasonkolb
Interesting... so they're making the calculations internally but not telling
you how they got there, right? So you really can't use Wolfram Alpha as a
reliable source for anything?

~~~
Alex3917
Correct. It's conceivable that you could find a secondary source in their
reading list that links to a primary source, but in practice going through
their list of sources would be much slower than just doing the search
yourself, meaning the site has zero utility in practice. (Assuming you care
whether the information you're getting is true, that is; if you're writing a
middle school paper about penguins, it probably gives you enough plausible
deniability for having done the work, but for anything else there isn't much
point.)

------
sjaakkkkk
For people interested in this subject, you might want to check out the DBPedia
project: <http://dbpedia.org/About>. They have been extracting structured data
from Wikipedia for quite some time already and allow you to query their
database with SPARQL.

From their site: The DBpedia knowledge base currently describes more than 3.64
million things, out of which 1.83 million are classified in a consistent
Ontology, including 416,000 persons, 526,000 places, 106,000 music albums,
60,000 films, 17,500 video games, 169,000 organisations, 183,000 species and
5,400 diseases.
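
For anyone who wants to poke at it, here's a minimal sketch of querying
DBpedia's public endpoint from Python with the SPARQLWrapper library (the
dbo:birthPlace property is just an example from their ontology, not anything
special):

    # Minimal sketch: ask DBpedia's public SPARQL endpoint for a few
    # people and their birthplaces. Property names are examples from the
    # DBpedia ontology; adjust them to whatever you're after.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        SELECT ?person ?place WHERE {
            ?person a dbo:Person ;
                    dbo:birthPlace ?place .
        } LIMIT 10
    """)
    sparql.setReturnFormat(JSON)

    results = sparql.query().convert()
    for row in results["results"]["bindings"]:
        print(row["person"]["value"], "born in", row["place"]["value"])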

------
halo
tl;dr: spin off Wikipedia infoboxes into a separate project with an API, and
then use that data to bootstrap an open data project with broader goals.

In theory, it's a good idea. It takes an existing useful data source and puts
it in a form that encourages reuse, and since it solves the bootstrapping
problem it's not obviously doomed to failure like the Semantic Web.

I see two potential downsides.

My first concern is that, in practice, it will make editing Wikipedia more
complex. There's no inherent reason why this should be the case, but there's
no inherent reason why Wikimedia Commons should make editing Wikipedia more
complex either, yet it undeniably does.

Secondly, it will prevent a similar source of data from appearing with broader
terms of use. For example, OpenLibrary is public domain.

~~~
ZeroGravitas
Is it even possible to have a database of factual content under CC-BY-SA? This
is part of the reason OpenStreetMap is moving to ODbL.

Somewhat ironically, since part of the reason is that you can't copyright
facts, they didn't just take the existing data under the same theory, but
asked everyone to accept the new licence. I wonder what Wikipedia plans to do?

~~~
roc
I don't see why you couldn't have a database of facts under CC-BY-SA. You
can't copyright _individual_ facts, but you absolutely can copyright a
_collection_ of facts as a collection. [1]

I would think the more-pressing problem would be the 'viral' nature of the
'share alike' restriction when it came to API use.

Attribution would also seem to be thorny and difficult to police, but not
intractable.

[1] e.g. I can make a phone directory and copyright it. You could take all the
data out of my phone directory to make your _own_ directory and that would be
fine. But you could not simply make copies of my directory and sell those as
your own.

~~~
ZeroGravitas
But being able to legally take all the data out and make your own database
(or other thing) with it (which you state is fine) is exactly what makes
CC-BY-SA pointless/inapplicable to databases of open data.

See this discussion of why CC-BY-SA is unsuitable for OpenStreetMap (which
mentions the case law on phone books you refer to):

[http://www.osmfoundation.org/wiki/License/Why_CC_BY-SA_is_Unsuitable](http://www.osmfoundation.org/wiki/License/Why_CC_BY-SA_is_Unsuitable)

Wikipedia says this on Feist v. Rural and collections of facts:

"In regard to collections of facts, O'Connor states that copyright can only
apply to the creative aspects of collection: the creative choice of what data
to include or exclude, the order and style in which the information is
presented, etc., but not on the information itself. If Feist were to take the
directory and rearrange them it would destroy the copyright owned in the data.

The court ruled that Rural's directory was nothing more than an alphabetic
list of all subscribers to its service, which it was required to compile under
law, and that no creative expression was involved. The fact that Rural spent
considerable time and money collecting the data was irrelevant to copyright
law, and Rural's copyright claim was dismissed."

<http://en.wikipedia.org/wiki/Feist_v._Rural>

~~~
roc
It seems to me the confusion is over what OpenStreetMap wants to control and
what copyright allows them to control.

The 'shortcomings' of CC-BY-SA noted in your first link seem to boil down to
use cases involving chunks of data that simply do not qualify for copyright.
Thus, by definition, no copyright license could behave any differently than
any other in determining what can and can't be done with those chunks of data.

A Terms of Use agreement (and enforcement) could do more, but the particular
copyright license is simply moot.

~~~
ZeroGravitas
The ODbL isn't (just) a copyright licence, for exactly those reasons.

------
Alex3917
This is actually a startup idea I've had for a while now. It's a great idea in
theory, but it's very tricky in practice. Facts have a mysterious way of
vanishing if you look closely enough at them, and the raw numbers themselves
don't actually tell you anything.

The part that's actually interesting is:

- The methodology behind the numbers

- What we think is most likely the case based on the evidence available

- How each fact connects with other facts

- What we think we should do based on the evidence available

Being able to embed facts is definitely a cool use case, but unless you have
all the other stuff backing it up when you click the link back to the database
then it's pretty much worthless. And curating these sorts of epistemological
discussions and third party analyses isn't something that really fits within
the Wikimedia mission, so I doubt they will even try.

Because of this I doubt their implementation of the project will be
successful, although I do think it's a space that ultimately has potential.

~~~
david927
You couldn't be more right, and I think the key here is: _How each fact
connects with other facts_

If there were no operations, math would just be numbers on their own -- and
what fun is that?

The problem is that the relations turn it into the Semantic Web, and after
trying and failing to crack that nut for so long, everyone is turned off of
it. Which is too bad, because what was failing was the approach. Trying
several shipping routes to the New World and failing each time doesn't mean
that the New World doesn't exist.

~~~
Alex3917
"The problem is that the relations turn it into the Semantic Web"

Not really. Assuming there are only four or five simple relationships like
"Knowing fact X is necessary to understand fact Y", then the whole system
isn't much more complicated than trackbacks for blog posts.
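
Something like this (relation names made up on the spot) is roughly all I have
in mind, not a full ontology:

    # Made-up sketch: facts linked by a handful of fixed relation types,
    # much like trackbacks between blog posts.
    RELATIONS = {"prerequisite_of", "supports", "contradicts", "refines"}

    facts = {
        "F1": "Water boils at 100 C at sea level",
        "F2": "Boiling point drops as altitude increases",
    }

    links = []  # (source_fact, relation, target_fact)

    def link(src, rel, dst):
        assert rel in RELATIONS, "only a few simple relation types allowed"
        links.append((src, rel, dst))

    link("F1", "prerequisite_of", "F2")

    # What do I need to know before F2? Follow the links back, trackback-style.
    prereqs = [s for (s, r, d) in links if d == "F2" and r == "prerequisite_of"]
    print(prereqs)  # ['F1']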

~~~
david927
If it were that simple, it would already have been solved. The problem is that
relations apply to _any_ data point; they can be one-to-one, one-to-many, or
many-to-many; and they mix metadata with data seamlessly. It's a hard problem,
make no mistake, but completely solvable. I have an approach I'm working on
that I'll email you, if you're interested.

~~~
Alex3917
Sure, send me an email.

------
jasonkolb
Nice to see they're going to support SPARQL:

"O3.1. Develop and prepare a SPARQL endpoint to the data. Even though a full-
fledged SPARQL endpoint to the data will likely be impossible, we can provide
a SPARQL endpoints that allows certain patterns of queries depending on the
expressivity supported by the back end."

I see the semantic web slowly realizing its actual purpose (which is not
related to semantic natural language processing but rather linking data).

------
judofyr
Missing from the FAQ: What's the difference between Freebase and Wikidata?

~~~
_delirium
It looks like the main difference is two-way integration: instead of just
scraping data from Wikipedia dumps to produce a structured database (like
Freebase and DBpedia do), it's going to store the _canonical_ version of some
of the information there, and pull _from_ it to populate the infoboxes. One of
the motivations seems to be to keep the data in sync across Wikipedia
languages, so an addition or fix propagates to them all, which is currently
done somewhat awkwardly by a mix of manual and bot measures.

~~~
huherto
So they are adding an extra layer?... Who said that CS is the science where
everything is solved with an extra level of indirection?

~~~
mindcrime
_Who said that CS is the science where everything is solved with an extra
level of indirection?_

Looks like David Wheeler made the statement that I think you're referring to:

[http://stackoverflow.com/questions/2057503/does-anybody-know...](http://stackoverflow.com/questions/2057503/does-anybody-know-from-where-the-layer-of-abstraction-layer-of-indirection-q)

[http://en.wikipedia.org/wiki/David_Wheeler_%28computer_scien...](http://en.wikipedia.org/wiki/David_Wheeler_%28computer_scientist%29)

------
nsns
Hats off to Wikimedia, a beacon of the true ideals of the free Internet;
they've never tried to monetize their substantial achievements, they've really
made a difference, and they've actually realized what for other companies has
been merely lip service (i.e. freeing up information).

------
jasonkolb
Now this is interesting (from the page):

"Wikidata is a secondary database. Wikidata will not simply record statements,
but __it will also record their sources, thus also allowing to reflect the
diversity of knowledge available in reality __."

That sounds pretty cool to me, because you could potentially upload
probabilistic data from statistical analysis. If they make this so that you
can tell how reliable the source is, you could upload information that's
accurate to a given degree of probability.

It would be very interesting if you could version data by reliability, so that
less-reliable data could eventually be replaced by definitive data. This is an
Achilles heel of current data modeling systems.
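
If the data model ends up looking anything like that, each statement would
carry its source plus some reliability estimate, so a better-sourced value can
later supersede a weaker one. A rough sketch (field names are my own
invention, not anything from the proposal):

    # Hypothetical sketch: a statement carries its source and a
    # reliability score, so less-reliable values can be superseded later.
    from dataclasses import dataclass

    @dataclass
    class Statement:
        subject: str
        prop: str
        value: float
        source: str
        reliability: float  # 0.0 (rough guess) .. 1.0 (definitive)

    history = [
        Statement("Earth", "mass_kg", 6.0e24, "back-of-envelope estimate", 0.5),
        Statement("Earth", "mass_kg", 5.972e24, "NASA fact sheet", 0.95),
    ]

    # The "current" value is simply the most reliable statement on record.
    current = max(history, key=lambda s: s.reliability)
    print(current.value, current.source)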

------
debacle
My concern about the potential for abuse in this project is much greater than
it is for Wikipedia. How is Wikimedia going to ensure that there are no
malicious edits to this data? Any changes will almost certainly need stringent
peer review.

Edit: As an afterthought, it would make a lot of sense to manage it like a git
repository, where someone could submit a pull request for data changes, and
then some subgroup or a trusted percentage of the population approves the
request and it gets merged into the master dataset.
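
A toy version of that flow (the threshold and names are invented purely for
illustration):

    # Toy sketch of a pull-request-style workflow for data edits: a change
    # is merged only once enough trusted reviewers have approved it.
    APPROVAL_THRESHOLD = 3

    dataset = {"Earth/mass_kg": 6.0e24}

    class ChangeRequest:
        def __init__(self, key, new_value, author):
            self.key, self.new_value, self.author = key, new_value, author
            self.approvals = set()

        def approve(self, reviewer):
            self.approvals.add(reviewer)

        def try_merge(self, target):
            if len(self.approvals) >= APPROVAL_THRESHOLD:
                target[self.key] = self.new_value
                return True
            return False

    cr = ChangeRequest("Earth/mass_kg", 5.972e24, "alice")
    for reviewer in ("bob", "carol", "dave"):
        cr.approve(reviewer)
    print(cr.try_merge(dataset), dataset)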

~~~
femto
Given that the data is structured, to some extent it should be possible to
automatically check its consistency.
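
For example, a trivial (made-up) sanity check over person records:

    # Trivial, made-up example of a consistency check on structured records.
    def check_person(record):
        errors = []
        death = record.get("death_year")
        if death is not None:
            if death < record["birth_year"]:
                errors.append("death before birth")
            elif death - record["birth_year"] > 130:
                errors.append("implausible lifespan")
        return errors

    print(check_person({"name": "Example", "birth_year": 1900, "death_year": 1850}))
    # ['death before birth']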

------
tomkin
One area I really want to see this take off in is Medicine.

As someone who had suffered from an unknown illness (no doctor could figure it
out), I can rationalize how such a system would have been helpful. You see a
bit of this with WebMD's Symptom Checker, but I feel tools like that aren't
comprehensive enough and we end up with a lot of cyberchondria. You can't rely
on correlation to find absolute answers, but a tool that helps map out
symptoms and lifestyle choices may lead to solutions faster.

It took about a year to resolve my illness. Going to the doctor 2-4 times a
week for 10-20 minutes isn't enough to work with when you have no clear-cut
diagnosis.

Now, to be clear, I am not talking about replacing doctors or devaluing
doctors by allowing everyone to _be an expert_.

~~~
chintan
In the medical domain, there do exist large structured knowledge bases and
"expert systems" for diagnosis. Read up on DXplain[1], MYCIN[2] and the
UMLS[3]. Even in biology, there seems to be significant activity in
formalizing knowledge. It literally took decades to develop and refine these
knowledge bases.

Creating something general purpose like Cyc or the Semantic Web is very
challenging, especially because different people have different notions of
"meaning". Just look at the back-and-forth arguments over some controversial
Wikipedia page. This is a hundred times more conceptually challenging.

1. <http://dxplain.org/dxp/dxp.pl>

2. <http://en.wikipedia.org/wiki/Mycin>

3. <http://www.nlm.nih.gov/research/umls/>

------
Monotoko
It's hardly new... it's been a non-starter for about 5 years:
<http://lists.wikimedia.org/pipermail/wikidata-l/>

------
manuletroll
This might be very interesting if it's implemented in a sane way.
Unfortunately there doesn't seem to be a widely adopted standard in the world
of open data for now.

~~~
icebraining
RDF has been adopted by some pretty big data websites, and apparently that's
one of the formats they plan to support:

    The data will be exported in different formats, especially RDF, SKOS, and JSON.

<http://meta.wikimedia.org/wiki/Wikidata/Technical_proposal>
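
If the RDF export materialises, consuming it from free software should be
straightforward with an off-the-shelf library. A sketch using Python's rdflib
(the export file name here is invented):

    # Sketch: load a hypothetical Wikidata RDF export with rdflib and
    # walk its triples. The file name is invented for illustration.
    from rdflib import Graph

    g = Graph()
    g.parse("wikidata-export.rdf")  # rdflib guesses the format from the name

    for subject, predicate, obj in g:
        print(subject, predicate, obj)

    print(len(g), "triples loaded")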

~~~
gioele
Technically unsound: RDF is a relationship model and a meta-model (think XML
Infoset), SKOS is a vocabulary (think XHTML) and JSON is a serialization
format (think XML or RDF/N3).

The question is which schema, ontology or vocabulary will they use to express
their data? Who will develop it? Or will they reuse other vocabularies? How do
they intend to extend them? If they are RDF based, how will they project to
JSON given that there are a dozen different conversion methods?

How can that document not cite DBpedia, a project that is extracting
structured data from Wikipedia infoboxes and has years of experience in doing
that?

The fact that their technical proposal document is quite confused about these
underlying technologies makes me fear that there is more wishful thinking here
than past experience.

~~~
icebraining
I found the mix of different kinds of technologies odd too, but I assumed it's
just a draft, not the final spec.

------
nathell
This kind of reminds me of
<http://dabanese.blogspot.com/2009/09/introduction.html>

