
Not sure if you deliberately designed it this way, but I noticed when spot checking some results that it includes the bibliography section links as connections. This seems like it may not be desirable. For example, I did a search that went from the Crusades to Buzz Aldrin and noticed that Routledge was the first hop from the Crusades. It struck me as odd that Routledge (a publishing company) would be mentioned in the Wiki article on the Crusades, so I went to look and saw that the link it took was the citation for a book published by that company. I wouldn't really count that as a legitimate hop since it's a citation, not a content link.

EDIT - I noticed this also applies to the Notes section.




A few years ago I made something similar (http://ratewith.science/) that only uses bi-directional links, that is, pages that both link to each other, and this gives much more interesting results.

When two Wikipedia pages both link to each other they are usually related in some reasonable way, but unidirectional links give you things like Wikipedia -> California, which only exists because Wikipedia is headquartered in California, a pretty weak connection.
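The filtering itself is the simple part; it's something like this (a Python sketch of the idea, not the actual code, assuming you already have each page's outgoing links loaded):

    def mutual_edges(outlinks):
        # outlinks maps a page title to the set of titles it links to.
        # Keep an edge only if the target page also links back to the source.
        return {
            page: {t for t in targets if page in outlinks.get(t, set())}
            for page, targets in outlinks.items()
        }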

Other than the fact that I have it running on an overburdened tiny VPS, my app is also really fast even though I only do a unidirectional BFS, because I use a custom in-memory binary format that's mmapped directly from a file that's only 700MB, plus a tight search loop written in D.
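The core idea is something like this, sketched in Python rather than D and simplified (the real format packs more in, but this is the shape): an offsets array indexed by page id pointing into one flat array of neighbor ids, both readable straight out of the mmapped file, with a plain BFS over them.

    from collections import deque

    # Assumed layout: offsets[i]..offsets[i+1] index into neigh, a flat array
    # of neighbor page ids; both can be memoryviews over the mmapped file.
    def bfs(neigh, offsets, start, goal):
        prev = {start: None}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if node == goal:
                # Walk the predecessor chain back to the start.
                path = []
                while node is not None:
                    path.append(node)
                    node = prev[node]
                return path[::-1]
            for i in range(offsets[node], offsets[node + 1]):
                nxt = neigh[i]
                if nxt not in prev:
                    prev[nxt] = node
                    queue.append(nxt)
        return None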


Maybe I made a mistake somewhere, but I'm not sure it uses bi-directional links. For a really odd search like GoldenEye -> Abbotsford [1] it uses multiple "special" Wikipedia links that I'm pretty sure wouldn't be bidirectional.

[1]: http://ratewith.science/#start=Goldeneye&stop=abbotsford


Yeah, so what it does is try to find a path with bidirectional links, and then if it can't, it tries to find a unidirectional path. You can tell which one it did by whether the numbers in the path have a pulsing glow animation; that particular path does not.


On a scale of The Monty Hall Problem to Manifest Destiny this is a 5!


Perhaps you could refine that by also allowing double backward hops, triple backward hops etc.


Pretty cool!


Yeah, unfortunately I don't know of any way to differentiate the different types of links. Wikipedia's pagelinks database doesn't differentiate them. I agree it's undesirable, but I just cannot figure out how to cull them.


I'm not sure how the backend is structured, but it seems that you must parse the individual pages at some point or another. I took a quick look at the Wikipedia HTML for a few pages, and I would suggest stripping out anything within (or nested inside) classes like "mw-cite-backlink", "reference-text", "citation book", "citation journal", etc. You can probably also strip out anything inside a <cite> HTML tag.

I'm sure there are more classes and tags, but that hopefully should give you a solid place to start.

EDIT - You can also strip out or ignore anything inside of the ordered list for references - <ol class="references">...</ol>

Also, some pages aren't marked up the same way. For example, this page - https://en.wikipedia.org/wiki/X_Window_System - doesn't have any classes or other easy hooks for parsing its References section, even though its Notes section is set up in a more organized way. However, you could take note that the <span> tag contains class="mw-headline" id="References" (and that its text value is also "References") and then ignore everything until the next <span> begins.
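Something like this could be a starting point for the stripping step (a rough BeautifulSoup sketch; the class list above is surely incomplete):

    from bs4 import BeautifulSoup

    def strip_citation_links(html):
        soup = BeautifulSoup(html, "html.parser")
        # Drop <cite> tags and the ordered list of references.
        for tag in soup.find_all("cite"):
            tag.decompose()
        for tag in soup.find_all("ol", class_="references"):
            tag.decompose()
        # class_="citation" also matches "citation book", "citation journal", etc.
        for cls in ("mw-cite-backlink", "reference-text", "citation"):
            for tag in soup.find_all(class_=cls):
                tag.decompose()
        return soup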


I never actually touch any of the source HTML. I think that would simply take way too long and would probably result in some very high bandwidth charges. I use three tables from a public dump of Wikipedia's database, which unfortunately don't differentiate between where the links occur on the page. Check out the first section of my README[1] for more information.
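For reference, a pagelinks row is roughly just this (column names from memory; the README and the dump docs have the details):

    # Columns in a pagelinks row:
    PAGELINKS_COLUMNS = {
        "pl_from": "page id of the page containing the link",
        "pl_namespace": "namespace of the page being linked to",
        "pl_title": "title of the page being linked to",
    }
    # Nothing records where on the page a link appears, so a citation link
    # looks exactly like a body-text link.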

[1] https://github.com/jwngr/sdow#data-source


Ohhhhh, yeah, that changes things. I didn't realize the pagelinks database you mentioned wasn't something you generated yourself from the master Wiki dump they keep available for download. In that case, this seems unsolvable without 1) petitioning them to update the code on their side to exclude these erroneous links, or 2) building your own pagelinks database from the master Wiki dump and updating it periodically (I forget how often Wiki refreshes the master downloads). While 1) is clearly the preferable option, it's unlikely to happen, unless maybe they're willing to add some columns giving more detail about each link: what section, tags, classes, etc. it belongs to. That might be more palatable for Wiki to provide, since it's just a code change on their end to supply more information, rather than something that could affect many users by excluding the links altogether.


If it were technically possible, it would probably also be worth culling anything linked from within a template, like disambiguation headers at the top of a page, or semi-related lists grouped in blocks at the bottom (things like "Cities in Australia").


That might be a problem with Wikipedia itself; this project can hardly fix it. I mean, that link has hardly any business being there. "Oxford University Press" isn't linked either, even though the page exists, and why would it be?


I think you'd have to parse the raw wikitext dump, keeping track of which section each link belongs to. Since the raw dump is something like 50 GB of text, this sort of thing takes a while.
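Something like this, roughly (a plain-regex sketch over one page's wikitext; a real wikitext parser like mwparserfromhell would be more robust, and you'd still need to stream pages out of the XML dump):

    import re

    HEADING = re.compile(r"^(={2,})\s*(.*?)\s*\1\s*$")
    WIKILINK = re.compile(r"\[\[([^|\]#]+)")

    def links_by_section(wikitext):
        # Yield (section, link target) for every internal link on one page.
        section = "(lead)"
        for line in wikitext.splitlines():
            m = HEADING.match(line)
            if m:
                section = m.group(2)
                continue
            for target in WIKILINK.findall(line):
                yield section, target

From there it would just be a matter of dropping links whose section is References, Notes, Bibliography, and so on.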



