I'd be curious how good the results are. I've found a bunch of articles, but no live demo. If someone set up a small-scale version where you compare the auto-linked version of a few hundred Wikipedia articles with the existing manually-linked version, I think that could convince people it was worth adopting (if the results looked good).
The Free Dictionary uses something like that - try double clicking words within the definition: http://www.thefreedictionary.com/link
Except, of course, any mention of the name of a listed company in a financial/business publication.
It would be a fun project to try and determine the correct link based on the context.
Edit: found the link (PDF) http://www.cs.mtu.edu/~nilufer/classes/cs5811/2003-fall/hilt... Here's the actual quote:
“Do you mean Anthrax (the heavy-metal band),
anthrax (the bacterium) or anthrax (the disease)?”
“The bacterium,” was the typed answer, followed by
the instruction, “Comment on its toxicity to people.”
“I assume you mean people (homo sapiens),” the
system responded, reasoning, as it informed its
programmer, that asking about People magazine
“would not make sense.”
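The kind of disambiguation in that quote can be approximated crudely with Lesk-style word overlap: score each candidate sense by how many words it shares with the surrounding context. This is just a sketch — the glosses and sense names below are made up for illustration, not taken from the paper:

```python
# Lesk-style disambiguation sketch. GLOSSES is a hypothetical stand-in for
# short descriptions of each candidate sense (e.g. Wikipedia first sentences).
GLOSSES = {
    "Anthrax (band)": "american heavy metal band thrash music albums",
    "Bacillus anthracis": "bacterium spore gram-positive soil organism",
    "Anthrax (disease)": "infection disease toxicity people skin inhalation",
}

def disambiguate(context, glosses):
    # pick the sense whose gloss shares the most words with the context
    ctx = set(context.lower().split())
    scores = {sense: len(ctx & set(gloss.split()))
              for sense, gloss in glosses.items()}
    return max(scores, key=scores.get)

print(disambiguate("comment on its toxicity to people", GLOSSES))
# -> "Anthrax (disease)" ("toxicity" and "people" overlap its gloss)
```

Real systems use much richer context models, but even this toy version separates the band from the bacterium for most sentences.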
When a search term matches a tag, and none of the tagged pages have a clear "majority probability" of being correct, it would display a list of all pages with the tag, in order of popularity.
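That fallback logic is simple enough to sketch. The threshold and the view counts here are invented just to show the shape of it:

```python
# Sketch of the fallback described above: if no tagged page clearly dominates,
# return all pages with that tag, ordered by popularity.
MAJORITY_THRESHOLD = 0.6  # assumed cutoff for a "clear majority"

def resolve(term, tagged_pages):
    """tagged_pages: list of (page_title, view_count) sharing the tag `term`."""
    total = sum(views for _, views in tagged_pages)
    ranked = sorted(tagged_pages, key=lambda p: p[1], reverse=True)
    top_title, top_views = ranked[0]
    if top_views / total >= MAJORITY_THRESHOLD:
        return top_title                   # confident: link directly
    return [title for title, _ in ranked]  # ambiguous: show ordered list

pages = [("Sun", 9000), ("Sun Microsystems", 5000), ("The Sun (newspaper)", 4000)]
print(resolve("sun", pages))  # no clear majority -> full list by popularity
```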
Both of these things would be amazingly annoying to the majority of Wikipedia users.
But they could be disambiguated by humans, which is my point. Humans understand context.
Whether or not something like that would be a net win for Wikipedia is up for debate, I guess. That said, I think they already have a bot that can do at least a limited amount of auto-linkification, but I can't swear to it.
It would take some AI to work it out, unless it used the context around the link.
"Sun" could refer to many things; it's not really possible for Wikipedia to know which one you mean. So links are still done manually.
If not, doing that should be possible using an NLP library that does NER. Along with heuristics, one could use the list of currently existing articles as a seed.
Edit: Of course, if all you're trying to do is link to existing pages, then you don't use the set of existing pages as a seed, you just use them as the list. But if you're trying to extract "entities" that don't have WP pages yet, then you'd still want to fall back to other NER techniques, which include various heuristics and what-not. Whether or not there would be any value in that is an open question, I suppose.
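The "use existing pages as the list" case doesn't even need an NLP library — it's basically gazetteer matching, preferring the longest title so "Sun Microsystems" wins over "Sun". A toy sketch (the title set is a made-up stand-in for the real article list):

```python
# Minimal gazetteer linker: match phrases in the text against a set of known
# article titles, longest match first, and wrap matches in wiki-link brackets.
TITLES = {"Sun", "Sun Microsystems", "Anthrax", "People"}  # stand-in for WP titles

def auto_link(text, titles):
    words = text.split()
    max_len = max(len(t.split()) for t in titles)
    out, i = [], 0
    while i < len(words):
        for n in range(max_len, 0, -1):  # try longer phrases first
            phrase = " ".join(words[i:i + n])
            if phrase in titles:
                out.append("[[" + phrase + "]]")
                i += n
                break
        else:
            out.append(words[i])  # no title matched; keep the word as-is
            i += 1
    return " ".join(out)

print(auto_link("Sun Microsystems built workstations.", TITLES))
# -> "[[Sun Microsystems]] built workstations."
```

Real NER would still be needed for entities without articles yet, plus disambiguation when a title like "Sun" is ambiguous.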
Also, if you're interested in that sort of thing, two other projects you might find interesting are:
Both involve extracting semantic meaning from unstructured data. It's pretty cool stuff.