Hacker News
Wikidata: The first new project from Wikimedia Foundation since 2006 (wikimedia.org)
142 points by mattrichardson on March 30, 2012 | 49 comments



Like others here, it's something I've been thinking about for a number of years.

This is an important project, with the potential to eclipse Wikipedia, maybe even to grow into the saviour of free software. My reasoning follows.

Currently we program computers by giving them a set of instructions on how to achieve a goal. As computers grow more powerful, we will stop giving detailed instructions. Instead, we will write a general purpose deduction/inference engine, feed in a volume of raw data and let the computer derive the instructions it must follow to achieve the given goal.

There are two parts to such a system: the engine and the data. The engine is something that free software is capable of producing. The missing component is the data. The wikidata project is this missing component.
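
A toy illustration of the "engine plus data" split (nothing to do with any actual Wikidata or Alpha internals, just a sketch): a minimal forward-chaining engine that derives new facts from structured data plus a rule.

    # A toy forward-chaining inference engine: facts are (subject, relation, object)
    # triples, and rules derive new triples from existing ones. Purely illustrative.
    facts = {
        ("Berlin", "capital_of", "Germany"),
        ("Germany", "member_of", "EU"),
    }

    def rule_capital_in_member_state(facts):
        # If X is the capital of Y and Y is a member of Z, infer X is located in Z.
        derived = set()
        for (x, r1, y) in facts:
            if r1 != "capital_of":
                continue
            for (y2, r2, z) in facts:
                if y2 == y and r2 == "member_of":
                    derived.add((x, "located_in", z))
        return derived

    # Apply the rule until no new facts appear (a fixed point).
    changed = True
    while changed:
        new = rule_capital_in_member_state(facts) - facts
        changed = bool(new)
        facts |= new

    print(facts)  # now includes ("Berlin", "located_in", "EU")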

I'm convinced that Wolfram Alpha is a glimpse of this future: an engine coupled to a growing body of structured data. Wolfram's end game isn't taking over search, but taking over computer programming and ultimately reasoning. It's just that search is currently a tractable problem for Alpha, one that can pay the bills until it becomes more capable. There will come a day when Alpha is powerful enough to automatically translate natural language into structured data, at which point it will spider the Internet and its database and capabilities will grow explosively.

Free software needs Wikidata, to arrive at this endpoint first and avoid being made largely irrelevant by Alpha (or Google?).


I think one problem is that it's really hard to do structured data in general. Projects that pick a specific domain tend to do it much better, because they have a more tractable problem, can build a community with domain expertise, etc., in ways that Wikipedia will have trouble matching unless they plan to collaborate with those projects and/or pull data from them. For example, I think a structured-data version of Wikipedia artist/album infoboxes is going to have a long way to go to catch up to http://musicbrainz.org/, which has a carefully thought out ontology and years of iteration on that specific problem. Alternatively you can try to do a carefully thought out, consistent schema for all metadata, but the Cyc project shows how hard that is.

I do think that by virtue of breadth Wikipedia's version may become the best data resource in niches that have no specialized structured-data project for them, and it may give other informal-schema, broad-coverage projects like ConceptNet a competitor.


"Free software needs Wikidata, to [] avoid being made largely irrelevant by Alpha"

Wolfram Alpha is already completely worthless because it doesn't cite the sources for any of its results. It's basically just a fancy search engine built on top of a garbage dump.



That isn't a list of references, that's just a list of suggested reading. In fact it's not even guaranteed that any of the facts on that page come from any of those sources. It's basically just showing a list of books that come up when you Google for the question.


Interesting... so they're making the calculations internally but not telling you how they got there, right? So you really can't use Wolfram Alpha as a reliable source for anything?


Correct. It's conceivable that you could find a secondary source in their reading list that links to a primary source, but going through their list of sources would be much slower than just doing the research yourself, which means the site has zero utility in practice. (Assuming you care about whether the information you're getting is true; if you're writing a middle school paper about penguins it probably gives you enough plausible deniability for having done the work, but for anything else there isn't much point.)


This link here lists exactly what their sources for AstronomicalData are: http://reference.wolfram.com/mathematica/note/AstronomicalDa...


But it doesn't tell you which source a specific fact comes from. You can't verify it or check if there are more recent sources with better values--relevant if you need good accuracy for a specific value.


The way I read this, it sounds like you consider the data to be the more difficult part? And yet the system you describe sounds AI-complete. Figuring out how the human mind manages to navigate combinatorial explosions in interesting search spaces is a very hard problem.

The field is very exciting but not without grave risks. I am of the opinion that the final key breakthrough(s) in Artificial Intelligence will be raced towards, not collaborated on. The advantages the possessor of such a system would have would be enough to test the purest of saints. Also, computational ethics lags far behind even current primitive attempts at AGI. Furthermore, there are incentives to leave off the moral brakes, since the consequences seem ephemeral; burdening your system with ethics would further increase the search space, as doing the right thing is computationally harder than doing what is best for just yourself. The future: step carefully.

For what it is worth, you can merge large data sources with automatic program construction today. I recently started a project in this area. ConceptNet has an excellent API. Then look at Genetic Programming, Markov Logic Networks, and inductive logic programming, each with its own strengths and weaknesses. Program transformation is a related area, where programs are derived from formal specifications that are unoptimized or non-polynomial in time or space. The most interesting take on this I have seen: http://www.cas.mcmaster.ca/~kahl/HOPS/ANIM/index.html.
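
To give a flavour of the ConceptNet piece, here is a rough sketch of pulling a concept's edges out of its REST API (the endpoint URL and JSON field names are assumptions based on the public ConceptNet 5 API and may differ between versions):

    # Pull a few ConceptNet edges for a concept and print them as relations.
    # Endpoint and field names are assumptions; check the ConceptNet docs.
    import json
    import urllib.request

    url = "http://api.conceptnet.io/c/en/penguin"
    with urllib.request.urlopen(url) as resp:
        data = json.loads(resp.read().decode("utf-8"))

    for edge in data.get("edges", []):
        start = edge.get("start", {}).get("label", "?")
        rel = edge.get("rel", {}).get("label", "?")
        end = edge.get("end", {}).get("label", "?")
        print(start, "--", rel, "->", end)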


I'm with you to an extent, but why do you think Wikidata in particular will be the missing component and not some other service like Freebase or DBPedia?


I don't really. Substitute any free body of structured data for Wikidata, or even view them as one body of data, which happens to be spread across multiple servers (and maybe requiring some translation for unification).


For people interested in this subject, you might want to check out the DBPedia project: http://dbpedia.org/About. They have been extracting structured data from Wikipedia for quite some time already and allow you to query their database with SPARQL.

From their site: The DBpedia knowledge base currently describes more than 3.64 million things, out of which 1.83 million are classified in a consistent Ontology, including 416,000 persons, 526,000 places, 106,000 music albums, 60,000 films, 17,500 video games, 169,000 organisations, 183,000 species and 5,400 diseases.
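
For example, a minimal query against their public SPARQL endpoint (a sketch only; the property names are taken from the DBpedia ontology and may have changed):

    # Ask the public DBpedia SPARQL endpoint for a few people born in Berlin.
    import json
    import urllib.parse
    import urllib.request

    query = """
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    SELECT ?person ?name WHERE {
      ?person dbo:birthPlace dbr:Berlin ;
              rdfs:label ?name .
      FILTER (lang(?name) = "en")
    } LIMIT 5
    """

    params = urllib.parse.urlencode({
        "query": query,
        "format": "application/sparql-results+json",
    })
    with urllib.request.urlopen("http://dbpedia.org/sparql?" + params) as resp:
        results = json.loads(resp.read().decode("utf-8"))

    for row in results["results"]["bindings"]:
        print(row["name"]["value"], "-", row["person"]["value"])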


tl;dr: spin off Wikipedia infoboxes into a separate project with an API, and then use that data to bootstrap an open data project with broader goals.

In theory, it's a good idea. It takes an existing useful data source and puts it in a form that encourages reuse, and since it solves the bootstrapping problem it's not obviously doomed to failure like the Semantic Web.

I see two potential downsides.

My first concern is that, in practice, it will make editing Wikipedia more complex. There's no inherent reason why this should be the case, but there's no inherent reason why Wikimedia Commons should make editing Wikipedia more complex either, yet it undeniably does.

Secondly, it will prevent a similar source of data from appearing with broader terms of use. For example, OpenLibrary is public domain.


Is it even possible to have a database of factual content under CC-BY-SA? This is part of the reason OpenStreetMap is moving to ODbL.

Somewhat ironically, since part of the reason is that you can't copyright facts, they didn't just take the existing data under the same theory, but asked everyone to accept the new licence. I wonder what Wikipedia plans to do?


I don't see why you couldn't have a database of facts under CC-BY-SA. You can't copyright individual facts, but you absolutely can copyright a collection of facts as a collection. [1]

I would think the more-pressing problem would be the 'viral' nature of the 'share alike' restriction when it came to API use.

Attribution would also seem to be thorny and difficult to police, but not intractable.

[1] e.g. I can make a phone directory and copyright it. You could take all the data out of my phone directory to make your own directory and that would be fine. But you could not simply make copies of my directory and sell those as your own.


But being able to legally take all the data out and make your own database (or other thing) with it (which you state is fine) is exactly what makes CC-BY-SA pointless/inapplicable to databases of open data.

See this discussion of why CC-BY-SA is unsuitable for OpenStreetMap (which mentions the case law on phone books you refer to):

http://www.osmfoundation.org/wiki/License/Why_CC_BY-SA_is_Un...

Wikipedia says this on Feist v. Rural and collections of facts:

"In regard to collections of facts, O'Connor states that copyright can only apply to the creative aspects of collection: the creative choice of what data to include or exclude, the order and style in which the information is presented, etc., but not on the information itself. If Feist were to take the directory and rearrange them it would destroy the copyright owned in the data.

The court ruled that Rural's directory was nothing more than an alphabetic list of all subscribers to its service, which it was required to compile under law, and that no creative expression was involved. The fact that Rural spent considerable time and money collecting the data was irrelevant to copyright law, and Rural's copyright claim was dismissed."

http://en.wikipedia.org/wiki/Feist_v._Rural


It seems to me the confusion is over what OpenStreetMap wants to control and what copyright allows them to control.

The 'shortcomings' of CC-BY-SA noted in your first link seem to boil down to use cases involving chunks of data that simply do not qualify for copyright. Thus, by definition, no copyright license could behave any differently than any other in determining what can and can't be done with those chunks of data.

A Terms of Use agreement (and enforcement) could do more, but the particular copyright license is simply moot.


The ODbL isn't (just) a copyright licence, for exactly those reasons.


What editing interface could possibly be more complex than the current system of Infobox "markup"? If Wikidata does nothing besides make it easier to edit those infoboxen, it will be a success.
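
For anyone who hasn't edited one: roughly the markup an editor faces today versus the kind of structured record a data back end could expose (both blobs below are simplified illustrations, not exact copies of any real template or schema):

    # A simplified illustration of today's infobox template markup (as a string)
    # versus the structured record a data back end could expose instead.
    infobox_wikitext = """
    {{Infobox settlement
    | name       = Berlin
    | population = 3501872
    | area_km2   = 891.85
    }}
    """

    infobox_data = {
        "name": "Berlin",
        "population": 3501872,
        "area_km2": 891.85,
    }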


This is actually a startup idea I've had for a while now. It's a great idea in theory, but it's very tricky in practice. Facts have a mysterious way of vanishing if you look closely enough at them, and the raw numbers themselves don't actually tell you anything.

The part that's actually interesting is:

- The methodology behind the numbers

- What we think is most likely the case based on the evidence available

- How each fact connects with other facts

- What we think we should do based on the evidence available

Being able to embed facts is definitely a cool use case, but unless you have all the other stuff backing it up when you click the link back to the database then it's pretty much worthless. And curating these sorts of epistemological discussions and third party analyses isn't something that really fits within the Wikimedia mission, so I doubt they will even try.

Because of this I doubt their implementation of the project will be successful, although I do think it's a space that ultimately has potential.


You couldn't be more right, and I think the key here is: How each fact connects with other facts

If there were no operations, math would just be numbers on their own -- and what fun is that?

The problem is that the relations turn it into the Semantic Web, and after trying and failing to crack that nut for so long, everyone is turned off of it. Which is too bad, because what was failing was the approach. Trying several shipping routes to the New World and failing each time doesn't mean that the New World doesn't exist.


"The problem is that the relations turn it into the Semantic Web"

Not really. Assuming there are only four or five simple relationships like "Knowing fact X is necessary to understand fact Y", then the whole system isn't much more complicated than trackbacks for blog posts.
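
Something like this, where a handful of relation types link fact records together (a toy sketch, not any proposed Wikidata schema):

    # Toy model: facts plus a small fixed set of relation types between them.
    facts = {
        "F1": "Water boils at 100 C at sea level.",
        "F2": "Atmospheric pressure drops with altitude.",
        "F3": "Water boils below 100 C on a mountain.",
    }

    # Only a few relation kinds, as suggested above.
    relations = [
        ("F1", "prerequisite_of", "F3"),  # knowing F1 is needed to understand F3
        ("F2", "prerequisite_of", "F3"),
        ("F3", "refines", "F1"),
    ]

    def prerequisites(fact_id):
        return [a for (a, rel, b) in relations
                if rel == "prerequisite_of" and b == fact_id]

    print(prerequisites("F3"))  # ['F1', 'F2']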


If it were that simple, it would already have been solved. The problem is that relations can hold between any data points, can be one-to-one, one-to-many, or many-to-many, and mix metadata with data seamlessly. It's a hard problem, make no mistake, but completely solvable. I have an approach I'm working on that I'll email you, if you're interested.


Sure, send me an email.


Nice to see they're going to support SPARQL:

"O3.1. Develop and prepare a SPARQL endpoint to the data. Even though a full-fledged SPARQL endpoint to the data will likely be impossible, we can provide a SPARQL endpoints that allows certain patterns of queries depending on the expressivity supported by the back end."

I see the semantic web slowly realizing its actual purpose (which is not related to semantic natural language processing but rather linking data).


Missing from the FAQ: What's the difference between Freebase and Wikidata?


It looks like the main difference is two-way integration: instead of just scraping data from Wikipedia dumps to produce a structured database (like Freebase and dbpedia do), it's going to store the canonical version of some of the information there, and pull from it to populate the infoboxes. One of the motivations seems to be to keep the data in sync across Wikipedia languages, so an addition or fix propagates to them all, which is currently done somewhat awkwardly by a mix of manual and bot measures.
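
A rough sketch of that flow (the names and structure below are made up for illustration, not the actual design):

    # Illustrative only: one canonical store of claims, with per-language
    # infoboxes rendered from it so an update propagates everywhere at once.
    canonical_store = {
        "Q64": {  # hypothetical item id for Berlin
            "population": 3501872,
            "country": "Q183",  # hypothetical item id for Germany
        }
    }

    labels = {
        "en": {"Q64": "Berlin", "Q183": "Germany",
               "population": "Population", "country": "Country"},
        "de": {"Q64": "Berlin", "Q183": "Deutschland",
               "population": "Einwohner", "country": "Staat"},
    }

    def render_infobox(item_id, lang):
        item = canonical_store[item_id]
        l = labels[lang]
        lines = [
            "{}: {}".format(l["population"], item["population"]),
            "{}: {}".format(l["country"], l[item["country"]]),
        ]
        return "\n".join(lines)

    # One fix in the canonical store shows up in every language's infobox.
    canonical_store["Q64"]["population"] = 3520031
    print(render_infobox("Q64", "en"))
    print(render_infobox("Q64", "de"))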


For the interested reader, here is a cool paper on Information Arbitrage Across Multi-lingual Wikipedia: http://www.cond.org/paper_202.pdf


So they are adding an extra layer? ... Who said that CS is the science where everything is solved with an extra level of indirection?


Who said that CS is the science where everything is solved with an extra level of indirection?

Looks like David Wheeler made the statement that I think you're referring to:

http://stackoverflow.com/questions/2057503/does-anybody-know...

http://en.wikipedia.org/wiki/David_Wheeler_%28computer_scien...


And dbpedia?


Also related: Factual.com.


Hats off to Wikimedia, a beacon of the true ideals of the free Internet; they've never tried to monetize their substantial achievements, they've really made a difference, and they've actually delivered on what for other companies has been merely lip service (i.e. freeing up information).


Now this is interesting (from the page):

"Wikidata is a secondary database. Wikidata will not simply record statements, but it will also record their sources, thus also allowing to reflect the diversity of knowledge available in reality."

That sounds pretty cool to me, because you could potentially upload probabilistic data from statistical analysis. If they make it so that you can tell how reliable the source is, you could upload information that's accurate to a given degree of probability.

It would be very interesting if you could version data by reliability, so that less-reliable data could eventually be replaced by definitive data. This is an Achilles heel of current data modeling systems.
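
Something along these lines, where each value carries its source and a rough reliability estimate, and a better-sourced value can supersede a weaker one (purely a sketch; the field names are invented):

    # Illustrative statement records: each value keeps its source and a rough
    # reliability score, so a more reliable value can later supersede it.
    statements = [
        {"item": "Berlin", "property": "population",
         "value": 3500000, "source": "newspaper estimate", "reliability": 0.6},
        {"item": "Berlin", "property": "population",
         "value": 3501872, "source": "official census", "reliability": 0.95},
    ]

    def best_value(item, prop):
        candidates = [s for s in statements
                      if s["item"] == item and s["property"] == prop]
        return max(candidates, key=lambda s: s["reliability"])

    best = best_value("Berlin", "population")
    print(best["value"], "from", best["source"])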


My concern about the potential for abuse in this project is much greater than for Wikipedia. How is Wikimedia going to ensure that there are no malicious edits to this data? Any changes will almost certainly need stringent peer review.

Edit: As an afterthought, it would make a lot of sense to manage it like a git repository, where someone could submit a pull request for data changes, and then some subgroup or a trusted percentage of the population approves the request and it gets merged into the master dataset.


Given that the data is structured, to some extent it should be possible to automatically check its consistency.
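
For instance, simple sanity rules can flag obviously bad edits automatically (a toy sketch, not an actual Wikidata mechanism):

    # Toy consistency checks over structured records; real checks would be
    # driven by the schema, but the idea is the same.
    def check(record):
        problems = []
        if record.get("population", 0) < 0:
            problems.append("population cannot be negative")
        born, died = record.get("birth_year"), record.get("death_year")
        if born is not None and died is not None and died < born:
            problems.append("death year precedes birth year")
        return problems

    print(check({"population": -5}))
    print(check({"birth_year": 1952, "death_year": 1910}))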


OpenStreetMap has the same problem and it handles it well.


One area I really want to see this take off in is Medicine.

As someone who suffered from an unknown illness (no doctor could figure it out), I can see how such a system would have been helpful. You see a bit of this with WebMD's Symptom Checker, but I feel tools like that aren't comprehensive enough and we end up with a lot of cyberchondria. You can't rely on correlation to find absolute answers, but helping map out symptoms and lifestyle choices may be a tool for finding solutions faster.

It took about a year to resolve my illness. Going to the doctor 2-4 times a week for 10-20 minutes isn't enough to work with when you have no clear-cut diagnosis.

Now, to be clear, I am not talking about replacing doctors or devaluing doctors by allowing everyone to be an expert.


In the medical domain, there do exist large structured knowledge bases and "expert systems" for diagnosis. Read up on DXplain[1], MYCIN[2] and the UMLS[3]. Even in biology, there seems to be significant activity in formalizing knowledge. It literally took decades to develop and refine these knowledge bases.

Creating something general purpose like Cyc or the Semantic Web is very challenging, especially because different people have different notions of "meaning". Just look at the back-and-forth arguments over some controversial Wikipedia pages. This is a hundred times more conceptually challenging.

1. http://dxplain.org/dxp/dxp.pl

2. http://en.wikipedia.org/wiki/Mycin

3. http://www.nlm.nih.gov/research/umls/


It's hardly new... it's been a non-starter for about 5 years: http://lists.wikimedia.org/pipermail/wikidata-l/


This might be very interesting if it's implemented in a sane way. Unfortunately, there doesn't seem to be a very widely adopted standard in the world of open data for now.


What does "a very widely adopted standard in the world of open data" mean? A "standard" for what?

There are meta-format standards: XML, RDF, HTML and lately JSON. With these four you are probably covering 80% of the world's published open data; the rest is PDF, MS DOC and MS XLS.

What is missing, and good luck filling this void, is a single format that you can use to describe everything. Personally, I think that such a single format will never exist and looking for one is pointless. Geographical data requires attention to certain details, music data to others; this means two different formats must be used (serialized through XML, RDF, HTML, whatever). If you are thinking about "bridging" different formats and data models, then welcome to the world of RDF/S, OWL, and Topic Maps ontologies (or ontologY); I'm not sure you want to live there :)

This new Wikidata, just like Freebase, is trying to collect structured or semi-structured data instead of unstructured data such as that present in Wikipedia. I am happy about the aim (completely unstructured data is basically useless for any serious data reuse and extraction), but my fear is that they will not succeed as well as they did with Wikipedia.

Wikipedia founded its success on the fact that anybody could edit it. To edit a Wikipedia page you need only very low technical skills and basic writing skills (plus knowledge of the topic, obviously). Adding and manipulating structured data requires people to conform to a certain mental grid, to a formalized model, to a schema developed by someone else and put in place to be respected strictly. The vast majority of people are easily demotivated when they are required to learn something substantial beforehand, and most edits by unskilled users end up removed by watchdogs (something seen often in high-quality Wikipedia articles: edits made by new users are quickly reverted on the grounds that they did not follow some of the many guidelines that must be followed).

My idea is that many problems found in structured-data projects (Freebase, MusicBrainz...) could be alleviated by better interfaces and a wide use of automation, both things that Wikipedia projects do not seem to excel at.


RDF has been adopted by some pretty big data websites, and apparently that's one of the formats they plan to support:

    The data will be exported in different formats, especially RDF, SKOS, and JSON.
http://meta.wikimedia.org/wiki/Wikidata/Technical_proposal


Technically unsound: RDF is a relationship model and a meta-model (think XML Infoset), SKOS is a vocabulary (think XHTML) and JSON is a serialization format (think XML or RDF/N3).

The question is: which schema, ontology, or vocabulary will they use to express their data? Who will develop it? Or will they reuse other vocabularies? How do they intend to extend them? If they are RDF-based, how will they project to JSON, given that there are a dozen different conversion methods?
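
To illustrate the projection problem: the same statement can be written as an RDF triple and then mapped to JSON in several equally defensible ways (the URIs and JSON shapes below are made up):

    # The same fact in Turtle-style RDF and two of the many plausible JSON shapes.
    # URIs and key names are invented for illustration.
    turtle = """
    @prefix ex: <http://example.org/> .
    ex:Berlin ex:population 3501872 .
    """

    # Projection 1: subject-centric nesting.
    json_a = {"http://example.org/Berlin":
              {"http://example.org/population": 3501872}}

    # Projection 2: a flat list of triples.
    json_b = [{"s": "ex:Berlin", "p": "ex:population", "o": 3501872}]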

How can that document not cite DBpedia, a project that is extracting structured data from Wikipedia infoboxes and has years of experience in doing that?

The fact that their technical proposal document is quite confused about these ground technologies makes me fear that there is more wishful thinking than past experience behind it.


I found the mix of different kinds of technologies odd too, but I assumed it's just a draft, not the final spec.


I think whatever they choose to implement it in has a good chance at becoming the next de facto standard.


Does the standard really matter? If it's machine-understandable, it should be possible to translate it automatically into any other format in the future.

The important thing is to jump in and make a start. The right way of doing things will become evident as the project evolves.




