
Not sure if you deliberately designed it this way, but I noticed when spot checking some results that it includes the bibliography section links as connections. This seems like it may not be desirable. For example, I did a search that went from the Crusades to Buzz Aldrin and noticed that Routledge was the first hop from the Crusades. It struck me as odd that Routledge (a publishing company) would be mentioned in the Wiki article on the Crusades, so I went to look and saw that the link it took was the citation for a book published by that company. I wouldn't really count that as a legitimate hop since it's a citation, not a content link.

EDIT - I noticed this also applies to the Notes section.




A few years ago I made something similar (http://ratewith.science/) that only uses bi-directional links, that is, pages that both link to each other, and this gives much more interesting results.

When two Wikipedia pages both link to each other they are usually related in some reasonable way, but unidirectional links give you things like Wikipedia -> California, which only exists because Wikipedia is headquartered in California, a pretty weak connection.
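The filtering itself is the simple part; it's something like this (a Python sketch of the idea, not the actual code, assuming you already have each page's outgoing links loaded):

    def mutual_edges(outlinks):
        # outlinks maps a page title to the set of titles it links to.
        # Keep an edge only if the target page also links back to the source.
        return {
            page: {t for t in targets if page in outlinks.get(t, set())}
            for page, targets in outlinks.items()
        }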

Other than the fact that I have it running on an overburdened tiny VPS, my app is also really fast even though I only do a unidirectional BFS, because I use a custom in-memory binary format that's mmapped directly from a file that's only 700MB, plus a tight search loop written in D.
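The core idea is something like this, sketched in Python rather than D and simplified (the real format packs more in, but this is the shape): an offsets array indexed by page id pointing into one flat array of neighbor ids, both readable straight out of the mmapped file, with a plain BFS over them.

    from collections import deque

    # Assumed layout: offsets[i]..offsets[i+1] index into neigh, a flat array
    # of neighbor page ids; both can be memoryviews over the mmapped file.
    def bfs(neigh, offsets, start, goal):
        prev = {start: None}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if node == goal:
                # Walk the predecessor chain back to the start.
                path = []
                while node is not None:
                    path.append(node)
                    node = prev[node]
                return path[::-1]
            for i in range(offsets[node], offsets[node + 1]):
                nxt = neigh[i]
                if nxt not in prev:
                    prev[nxt] = node
                    queue.append(nxt)
        return None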


Maybe I made a mistake somewhere, but I'm not sure it uses bi-directional links. For a really odd search like GoldenEye -> Abbotsford [1] it uses multiple "special" Wikipedia links that I'm pretty sure wouldn't be bidirectional.

[1]: http://ratewith.science/#start=Goldeneye&stop=abbotsford


Yeah, so what it does is try to find a path with bidirectional links, and then if it can't, it tries to find a unidirectional path. You can tell which one it did by whether the numbers in the path have a pulsing glow animation; that particular path does not.


On a scale of The Monty Hall Problem to Manifest Destiny this is a 5!


Perhaps you could refine that by also allowing double backward hops, triple backward hops etc.


Pretty cool!


Yeah, unfortunately I don't know of any way to differentiate the different types of links. Wikipedia's pagelinks database doesn't differentiate them. I agree it's undesirable, but I just cannot figure out how to cull them.


I'm not sure how the backend is structured, but it seems that you must parse the individual pages at some point or another. I took a quick look at the Wikipedia HTML for a few pages, and I would suggest stripping out anything within (or nested inside) classes like "mw-cite-backlink", "reference-text", "citation book", "citation journal", etc. You can probably also strip out anything inside a <cite> HTML tag.

I'm sure there are more classes and tags, but that hopefully should give you a solid place to start.

EDIT - You can also strip out or ignore anything inside of the ordered list for references - <ol class="references">...</ol>

Also, some pages aren't marked up the same way. For example, this page - https://en.wikipedia.org/wiki/X_Window_System - doesn't have any classes or other easy hooks for parsing its References section, even though its Notes section is set up in a more organized way. However, you could take note that the <span> tag contains class="mw-headline" id="References" (and that its text value is also "References") and then ignore everything until the next <span> begins.
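Something like this could be a starting point for the stripping step (a rough BeautifulSoup sketch; the class list above is surely incomplete):

    from bs4 import BeautifulSoup

    def strip_citation_links(html):
        soup = BeautifulSoup(html, "html.parser")
        # Drop <cite> tags and the ordered list of references.
        for tag in soup.find_all("cite"):
            tag.decompose()
        for tag in soup.find_all("ol", class_="references"):
            tag.decompose()
        # class_="citation" also matches "citation book", "citation journal", etc.
        for cls in ("mw-cite-backlink", "reference-text", "citation"):
            for tag in soup.find_all(class_=cls):
                tag.decompose()
        return soup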


I never actually touch any of the source HTML. I think that would simply take way too long and would probably result in some very high bandwidth charges. I use three tables from a public dump of Wikipedia's database, which unfortunately don't differentiate between where the links occur on the page. Check out the first section of my README[1] for more information.
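For reference, a pagelinks row is roughly just this (column names from memory; the README and the dump docs have the details):

    # Columns in a pagelinks row:
    PAGELINKS_COLUMNS = {
        "pl_from": "page id of the page containing the link",
        "pl_namespace": "namespace of the page being linked to",
        "pl_title": "title of the page being linked to",
    }
    # Nothing records where on the page a link appears, so a citation link
    # looks exactly like a body-text link.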

[1] https://github.com/jwngr/sdow#data-source


Ohhhhh, yeah, that changes things. I didn't realize the pagelinks database you mentioned wasn't something you generated yourself from the master Wiki dump they keep available for download. In that case, this seems unsolvable without 1) petitioning them to update the code on their side to exclude these erroneous links, or 2) building your own pagelinks database from the master Wiki dump and updating it periodically (I forget how often Wiki refreshes the master downloads). While 1) is clearly the preferable option, it's unlikely to happen, unless maybe they're willing to add some columns giving more detail about each link: what section, tags, classes, etc. it belongs to. That might be more palatable for Wiki to provide, since it's just a code change on their end to supply more information, rather than something that could affect many users by excluding the links altogether.


If it were technically possible, it would probably also be worth culling anything linked from within a template, like disambiguation headers at the top of a page, or semi-related lists grouped in blocks at the bottom (things like "Cities in Australia").


That might be a problem with Wikipedia itself; this project can hardly fix it. I mean, that link has hardly any business being there. "Oxford University Press" isn't linked either, even though the page exists, and why would it be?


I think you'd have to parse the raw wikitext dump, keeping track of which section each link belongs to. Since the raw dump is something like 50 GB of text, this sort of thing takes a while.
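Something like this, roughly (a plain-regex sketch over one page's wikitext; a real wikitext parser like mwparserfromhell would be more robust, and you'd still need to stream pages out of the XML dump):

    import re

    HEADING = re.compile(r"^(={2,})\s*(.*?)\s*\1\s*$")
    WIKILINK = re.compile(r"\[\[([^|\]#]+)")

    def links_by_section(wikitext):
        # Yield (section, link target) for every internal link on one page.
        section = "(lead)"
        for line in wikitext.splitlines():
            m = HEADING.match(line)
            if m:
                section = m.group(2)
                continue
            for target in WIKILINK.findall(line):
                yield section, target

From there it would just be a matter of dropping links whose section is References, Notes, Bibliography, and so on.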



