OT: I was using Wikipedia the other day and it occurred to me how primitive it is that all the internal links to other Wikipedia articles are defined manually. Surely these should have been automated by now (i.e., marking a word or two would link you to the relevant article).
There's a lot of research dating back to the early days of hypertext on automatic link insertion, but afaict it hasn't really caught on in any system, whether hypertext or wiki-based, with the exception of those somewhat spammy content farms that auto-cross-link their articles. Wikipedia has indeed spawned a new wave of such research, e.g. (random example pulled out of Google Scholar): http://www.cs.waikato.ac.nz/~ihw/papers/08-DNM-IHW-LearningT...
I'd be curious how good the results are. I've found a bunch of articles, but no live demo. If someone set up a small-scale version where you compare the auto-linked version of a few hundred Wikipedia articles with the existing manually-linked version, I think that could convince people it was worth adopting (if the results looked good).
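For what it's worth, the comparison itself is cheap to score once you have both versions. Here's a minimal sketch, assuming each article's links have already been extracted as (anchor text, target title) pairs; the function name and the sample data are made up for illustration, and it treats the manual links as ground truth, which is itself debatable.

```python
def link_scores(manual_links, auto_links):
    """Score auto-generated links against the manually created ones.

    Both arguments are iterables of (anchor_text, target_title) pairs.
    Returns (precision, recall), treating the manual links as correct.
    """
    manual, auto = set(manual_links), set(auto_links)
    agreed = manual & auto
    precision = len(agreed) / len(auto) if auto else 0.0
    recall = len(agreed) / len(manual) if manual else 0.0
    return precision, recall


# Hypothetical links for one article.
manual = {("movies", "Filmography"), ("Paris", "Paris"), ("Sun", "Sun")}
auto = {
    ("movies", "Filmography"),
    ("Paris", "Paris"),
    ("Sun", "Sun (newspaper)"),      # wrong target
    ("director", "Film director"),   # link no human bothered to add
}
precision, recall = link_scores(manual, auto)
```

Note that a link the humans never added counts against precision here even if it's a perfectly good link, so a real demo would probably want people eyeballing the disagreements too.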
There's a certain confusion, or duality, in the way Wikipedia links work. If you access a director's page for example, and a sentence states he has made "20 movies", clicking on "movies" can either take you to his separate filmography page or to the general article "Movie".
I believe only the first option should be manually defined.
This reminds me of an article I read about IBM's Watson. As a demo in front of a lot of people, a researcher was going through a stack of journals and feeding in data about anthrax. Most of the data was about animals, but the researcher was asking Watson to extrapolate possible effects on people. Watson responded "I assume by people you mean humans, and not People magazine."
“Do you mean Anthrax (the heavy-metal band),
anthrax (the bacterium) or anthrax (the disease)?”
“The bacterium,” was the typed answer, followed by
the instruction, “Comment on its toxicity to people.”
“I assume you mean people (homo sapiens),” the
system responded, reasoning, as it informed its
programmer, that asking about People magazine
“would not make sense.”
It would take some AI to work out which one is meant, unless it used the context around the link.
"Sun" could refer to many things; it's not really possible for Wikipedia to know which one you're on about. So links are still done manually.
Both of these things would be amazingly annoying to the majority of Wikipedia users.
Sure, and when humans create links, they should continue to create them just like they do now. I'm picturing an "auto linkifier" that creates links that no human has gotten around to creating yet.
Whether or not something like that would be a net win for Wikipedia is up for debate I guess. That said, I think they already do have a bot that can do at least a limited amount of auto-linkification, but I can't swear to it.
I guess an added bonus of what you're suggesting is that the correct link could be crowdsourced; if the system kept track of which of the options users clicked on, it could figure out pretty quickly which one is correct.
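A rough sketch of that click-tracking idea, assuming the site can log which candidate a reader picked for an ambiguous phrase. The class name, the (page, phrase) key and the min_clicks threshold are all my own invention, not anything Wikipedia actually does.

```python
from collections import Counter, defaultdict


class ClickTracker:
    """Counts which target readers choose for each ambiguous (page, phrase)."""

    def __init__(self, min_clicks=5):
        # Don't trust the crowd until at least this many clicks are recorded.
        self.min_clicks = min_clicks
        self.clicks = defaultdict(Counter)

    def record(self, key, target):
        """Record one reader choosing `target` for the phrase identified by `key`."""
        self.clicks[key][target] += 1

    def best_target(self, key):
        """Return the most-clicked target, or None if there is too little data."""
        counts = self.clicks.get(key)
        if not counts or sum(counts.values()) < self.min_clicks:
            return None
        target, _ = counts.most_common(1)[0]
        return target


tracker = ClickTracker(min_clicks=3)
key = ("Solar System", "Sun")  # (page the phrase appears on, the phrase)
for choice in ["Sun", "Sun", "Sun (newspaper)"]:
    tracker.record(key, choice)
best = tracker.best_target(key)
```

A plain majority vote like this is easy to game, so you'd likely want it to suggest links for an editor to confirm rather than apply them blindly.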
I would venture there's a way to make "concepts" and "entities" (e.g. a name, product or academic field) linkable automatically based on existing articles, but that would take a bit of engineering. There would also be a high number of links to articles that haven't been created yet, or that were deleted or merged for lack of notability.
If not, doing that should be possible using an NLP library that does NER. Along with heuristics, one could use the list of currently existing articles as a seed.
Edit: Of course, if all you're trying to do is link to existing pages, then you don't use the set of existing pages as a seed, you just use them as the list. But if you're trying to extract "entities" that don't have WP pages yet, then you'd still want to fall back to other NER techniques, which include various heuristics and what-not. Whether or not there would be any value in that is an open question, I suppose.
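To make the "just use them as the list" case concrete, here's a minimal sketch of a greedy longest-match linker that emits wiki-style [[Title]] markup. It assumes you have the article titles in memory; the function name, the max_words cutoff and the link-only-the-first-occurrence rule are my own choices, and it does no NER or disambiguation at all.

```python
import re

WORD = re.compile(r"\w+")


def auto_link(text, titles, max_words=4):
    """Wrap the first occurrence of each known title in [[...]] markup.

    Matching is case-insensitive and prefers the longest phrase, so
    "New York City" wins over "New York" when both are titles.
    """
    lookup = {title.lower(): title for title in titles}
    # Alternating runs of word characters and separators; nothing is dropped.
    tokens = re.findall(r"\w+|\W+", text)
    out, i, already_linked = [], 0, set()
    while i < len(tokens):
        matched = False
        if WORD.fullmatch(tokens[i]):
            for n in range(max_words, 0, -1):
                # n words plus the n - 1 separators between them.
                span = tokens[i:i + 2 * n - 1]
                if len(span) < 2 * n - 1:
                    continue
                phrase = "".join(span)
                title = lookup.get(phrase.lower())
                if title and title not in already_linked:
                    if phrase == title:
                        out.append(f"[[{title}]]")
                    else:
                        out.append(f"[[{title}|{phrase}]]")
                    already_linked.add(title)
                    i += len(span)
                    matched = True
                    break
        if not matched:
            out.append(tokens[i])
            i += 1
    return "".join(out)


titles = ["New York City", "New York", "Sun"]
linked = auto_link("The sun rises over New York City, and New York is a state.", titles)
```

This happily links the lowercase "sun" to the Sun article, which is exactly the ambiguity problem raised upthread: plain title matching gets you candidates, not correct targets.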
Good resource! And I think your idea of linking to the disambiguation page makes sense, but there may be a way to infer the correct article from the list of links based on the context of the text in the linking article.
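One crude way to do that inference is to score each entry on the disambiguation page by how many words its summary shares with the text around the link, and fall back to the disambiguation page when nothing overlaps. This is only a sketch under those assumptions: the stopword list, the function names and the candidate summaries are made up, and the published approaches use much richer features than bag-of-words overlap.

```python
import re

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "for", "at", "by", "as", "it", "that", "is", "to"}


def content_words(text):
    """Lowercased alphabetic words in `text`, minus stopwords."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def pick_article(context, candidates):
    """Pick the candidate whose summary shares the most words with `context`.

    `candidates` maps article title -> short summary text.
    Returns None when no candidate overlaps at all, in which case the
    caller can link to the disambiguation page instead.
    """
    ctx = content_words(context)
    best_title, best_score = None, 0
    for title, summary in candidates.items():
        score = len(ctx & content_words(summary))
        if score > best_score:
            best_title, best_score = title, score
    return best_title


# Hypothetical disambiguation entries for "Sun".
candidates = {
    "Sun": "star at the centre of the Solar System, source of light and heat for Earth",
    "The Sun (United Kingdom)": "British tabloid newspaper published in London",
    "Sun Microsystems": "American company that sold computers, software and the Java platform",
}
context = "The probe measured light and heat radiated by the Sun as it orbited Earth."
choice = pick_article(context, candidates)
```

Ties go to whichever candidate comes first, so in practice you'd want a margin between the top two scores before trusting the pick.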