
OT: I was using Wikipedia the other day and it occurred to me how primitive it is to have all the internal links to other Wikipedia articles defined manually. Surely these should have been automated by now (i.e., marking a word or two would link you to the relevant article).



There's a lot of research dating back to the early days of hypertext on automatic link insertion, but afaict it hasn't really caught on in any system, whether hypertext or wiki-based, with the exception of those somewhat spammy content farms that auto-cross-link their articles. Wikipedia has indeed spawned a new wave of such research, e.g. (random example pulled out of Google Scholar): http://www.cs.waikato.ac.nz/~ihw/papers/08-DNM-IHW-LearningT...

I'd be curious how good the results are. I've found a bunch of articles, but no live demo. If someone set up a small-scale version where you could compare the auto-linked versions of a few hundred Wikipedia articles with the existing manually linked versions, I think that could convince people it was worth adopting (if the results looked good).


There's a certain confusion, or duality, in the way Wikipedia links work. If you visit a director's page, for example, and a sentence states that he has made "20 movies", clicking on "movies" can take you either to his separate filmography page or to the general article "Movie". I believe only the first option should be manually defined.

The Free Dictionary uses something like that - try double clicking words within the definition: http://www.thefreedictionary.com/link


> There's a lot of research dating back to the early days of hypertext on automatic link insertion, but afaict it hasn't really caught on in any system

Except, of course, any mention of the name of a listed company in a financial/business publication.


Plenty of web forums use text ad-links interspersed with the content for guests who aren't logged in.


The disambiguation could be challenging. (Does "The Sun" link to our closest star or to the tabloid in the UK?)

It would be a fun project to try and determine the correct link based on the context.
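
Even something naive might get you part of the way: score each candidate article by how many words its lead text shares with the sentence around the ambiguous term. A toy sketch in Python (the candidate titles and summaries here are made up, and a real system would want stopword handling, stemming, and a much richer model than this):

  import re
  from collections import Counter

  # Hypothetical candidates for the anchor text "The Sun" and their lead text.
  CANDIDATES = {
      "Sun": "The Sun is the star at the center of the Solar System.",
      "The Sun (United Kingdom)": "The Sun is a tabloid newspaper published in the United Kingdom.",
  }

  def bag(text):
      # Lowercased bag-of-words for a piece of text.
      return Counter(re.findall(r"[a-z']+", text.lower()))

  def disambiguate(context, candidates=CANDIDATES):
      """Pick the candidate whose lead text shares the most words with the context."""
      ctx = bag(context)
      return max(candidates, key=lambda t: sum((ctx & bag(candidates[t])).values()))

  print(disambiguate("The tabloid newspaper The Sun reported on the scandal."))
  # likely -> "The Sun (United Kingdom)" with these toy summaries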


This reminds me of an article I read about IBM's Watson. As a demo in front of a lot of people, a researcher was going through a stack of journals and feeding in data about anthrax. Most of the data was about animals, but the researcher was asking Watson to extrapolate possible effects on people. Watson responded "I assume by people you mean humans, and not People magazine."

Edit: found the link (PDF) http://www.cs.mtu.edu/~nilufer/classes/cs5811/2003-fall/hilt... Here's the actual quote:

  “Do you mean Anthrax (the heavy-metal band),
  anthrax (the bacterium) or anthrax (the disease)?”

  “The bacterium,” was the typed answer, followed by 
  the instruction, “Comment on its toxicity to people.”

  “I assume you mean people (homo sapiens),” the
  system responded, reasoning, as it informed its
  programmer, that asking about People magazine
  “would not make sense.”


Late edit: funny that I remembered this being Watson when it's actually Cyc, a longtime rival of the Watson project.


Just make the link to the disambiguation page, if there is one? Otherwise, make it a special link that doesn't go anywhere directly, but uses some JavaScript/CSS to raise a dialog when clicked that gives you the different choices?


Auto-generate disambiguation pages?

When a search term matches a tag, and none of the tagged pages have a clear "majority probability" of being correct, it would display a list of all pages with the tag, in order of popularity.
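
Roughly what I have in mind, as a sketch (the link counts and the 70% threshold are invented for illustration):

  def resolve(anchor_text, link_counts, threshold=0.7):
      """Link directly if one target clearly dominates; otherwise return the
      full candidate list ordered by popularity (an ad-hoc disambiguation page)."""
      total = sum(link_counts.values())
      ranked = sorted(link_counts, key=link_counts.get, reverse=True)
      if link_counts[ranked[0]] / total >= threshold:
          return ranked[0]
      return ranked

  # No clear majority for "Mercury", so show the list; "Paris" resolves directly.
  print(resolve("Mercury", {"Mercury (planet)": 450, "Mercury (element)": 380, "Freddie Mercury": 200}))
  print(resolve("Paris", {"Paris": 9000, "Paris, Texas": 300}))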


I don't think it would be a good idea if Wikipedia required you to run JavaScript to navigate the site, and it would make for pretty bad SEO.


Wikipedia doesn't need to care too much about SEO.


> Just make the link to the disambiguation page, if there is one? Otherwise, make it a special link that doesn't go anywhere directly, but uses some JavaScript/CSS to raise a dialog when clicked that gives you the different choices?

Both of these things would be amazingly annoying to the majority of Wikipedia users.


I'm only talking about auto-generated links that can't be clearly disambiguated by the system. At worst, the experience wouldn't be any worse than it is today.


> I'm only talking about auto-generated links that can't be clearly disambiguated by the system.

But they could be disambiguated by humans, which is my point. Humans understand context.


Sure, and when humans create links, they should continue to create them just like they do now. I'm picturing an "auto-linkifier" that creates links that no human has gotten around to creating yet.

Whether or not something like that would be a net win for Wikipedia is up for debate I guess. That said, I think they already do have a bot that can do at least a limited amount of auto-linkification, but I can't swear to it.


I guess an added bonus of what you're suggesting is that the correct link could be crowdsourced; if the system kept track of which of the options users clicked on, it could figure out pretty quickly which one is correct.
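
A minimal sketch of that click-tracking idea, keyed on the source article and anchor text (the names and the 50-click cutoff are just placeholders):

  from collections import defaultdict

  clicks = defaultdict(int)  # (source_article, anchor_text, chosen_target) -> count

  def record_click(source_article, anchor, chosen_target):
      clicks[(source_article, anchor, chosen_target)] += 1

  def best_target(source_article, anchor, candidates, min_clicks=50):
      """Once enough readers have 'voted' with their clicks, promote the most
      popular candidate to a direct link; until then keep offering the choices."""
      counts = {c: clicks[(source_article, anchor, c)] for c in candidates}
      winner = max(counts, key=counts.get)
      return winner if counts[winner] >= min_clicks else None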


Exactly this.

It would take some AI to work it out, unless it used the context around the link. "Sun" could refer to many things; it's not really possible for Wikipedia to know which one you're on about. So links are still done manually.


I don't think it is primitive. Wikipedia is edited by humans, and I don't think an automatic linking algorithm could do a better job at this time.


I would venture there's a way to make "concepts" and "entities" (e.g., a name, a product, or an academic field) become linkable automatically based on existing articles, but that would mean a bit of engineering. Then again, there's going to be a high number of links to articles that haven't been created yet, or that have been deleted/merged for lack of notability.


I don't know if it has that exact feature or not, but Semantic MediaWiki has a lot of extensions to base MediaWiki that involve working with data at a semantic level.

http://semantic-mediawiki.org/wiki/Help:Introduction_to_Sema...

If not, doing that should be possible using an NLP library that does NER. Along with heuristics, one could use the list of currently existing articles as a seed.

Edit: Of course, if all you're trying to do is link to existing pages, then you don't use the set of existing pages as a seed; you just use them as the list. But if you're trying to extract "entities" that don't have WP pages yet, then you'd still want to fall back to other NER techniques, which include various heuristics and what-not. Whether or not there would be any value in that is an open question, I suppose.
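
A rough sketch of that split, using spaCy as just one example of an NER library and a toy stand-in for the list of existing article titles:

  import spacy

  nlp = spacy.load("en_core_web_sm")
  existing_titles = {"Alan Turing", "Bletchley Park", "Enigma machine"}  # stand-in for the real article list

  def candidate_links(text):
      """Spans matching an existing title become link candidates; other detected
      entities are the ones with no article yet."""
      doc = nlp(text)
      linked, unlinked = [], []
      for ent in doc.ents:
          (linked if ent.text in existing_titles else unlinked).append(ent.text)
      return linked, unlinked

  print(candidate_links("Alan Turing worked at Bletchley Park with Joan Clarke."))
  # likely -> (['Alan Turing', 'Bletchley Park'], ['Joan Clarke'])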


Good resource! And I think your idea of linking to the disambiguation page makes sense, but there may be a way to infer the correct article from the list of links based on the context of the text in the linking article.


Yeah, there's that as well.

Also, if you're interested in that sort of thing, two other projects you might find interesting are:

http://stanbol.apache.org

and

http://uima.apache.org

Both involve extracting semantic meaning from unstructured data. It's pretty cool stuff.


Here is a quick demo of Stanbol-provided Wikipedia annotations and disambiguation in a WYSIWYG editor:

https://www.youtube.com/watch?v=zAMUpd6rb9k&feature=yout...


Here is one solution they could use: http://bergie.iki.fi/blog/automated-linking/



