Creator here. Six Degrees of Wikipedia is a side project I've been sporadically hacking on over the past few years. It was an interesting technical challenge and it's fun to play with the end result. Here's the tech stack:
* Frontend: React (Create React App)
* Backend: Python Flask
* Database: SQLite
* Web (frontend) hosting: Firebase Hosting
* Server (backend) hosting: Google Compute Engine (it runs fine on a tiny f1-micro instance)
All the code is open source[1] and I'm happy to answer any questions about building or maintaining it!
Make it so that if I visit the page and just click the "Go" button, it uses the placeholder examples as the start and end points. I did this and got an error message stating "You'll probably want to choose the start and end pages before you hit that." That was annoying. The placeholders that were auto-chosen were actually really interesting.
There's a certain irony to the lack of tact in this post suggesting better manners. I suspect the point would be better received were it more politely made.
That post seems like quintessential trolling. It was my expectation as well when going to the page to just click the "Go" button and see what the results would be, and I think the suggestion is great feedback. I think it is generally in the spirit of these "Show HN" posts to give and get feedback and discuss with the creator. The response post to that user, to me, is patronizing and is undoubtedly going to ruffle some feathers. But, now we aren't even talking about the OP's cool project...
I've seen something similar when offering items on freecycle: people seem to think that the more they insert "pls" and "thnx" into a message the more likely it is that they'll be first in the queue.
I swear at least once there were more instances of "pls" and "thnx" (there seems to be a significant overlap between people who overuse the words and people who don't bother to type them properly, though that may be the linguistic snob in me talking) than all other words combined. It backfired by making the message difficult to read, so I binned it and read the next.
Come on, this is Show HN! Not some GitHub issue tracker. Try to insert please and thank you in that post ... it will make it sound rude! It will turn it from Show HN feedback to sounding like a feature request, which it is clearly not...
If you can automatically detect sentences where adding "Could you" would make the author sound more tactful, then why not just mentally add them and assume that the author intended to sound that tactful?
This is a slight modification of the principle of charity, something I think is incredibly important and underemphasized, especially for online text communication.
Based on the thread here, many people interpreted the comment as a well-intended feature request, but instead of requesting the commenter to change their comment, it seems like it would’ve been fine to just use that charitable interpretation.
I’m all for being charitable in our interpretations of the motives of others, but I think a total lack of the most basic courtesy should be seen for what it is: the lower end of the spectrum being lazy, and the higher end being rude. Even charitably, no positive interpretation exists IMO.
Please and thank you are such basic concepts that honestly, I judge the hell out of people who don’t seem capable of it regardless of platform. If that minimal degree of politeness isn’t reflexive for you, again, the world at large judges you. Not to put too fine a point on it, but maybe “why not add it in your head” is the kind of thing that gives some in tech a bad name?
This is a bit off topic, but just to blow one's mind: Finnish, which I natively speak, completely lacks a direct word for "please", and I have to admit that, especially orally, I often forget it. (We make things sound nice with the conditional: "could you", "would you be so kind", etc.) I have no immediate reaction whatsoever to how it would feel if someone left out "please". I understand the concept, but it feels horrible how wrong I can go with native English speakers without meaning it. Thanks for the reminder! (Thank you we do have ;)
That’s not off-topic at all! Honestly, I’m ashamed that I hadn’t considered the possibility of a non-native speaker’s linguistic quirks, and what this tells me is that I need to continue to work on my snap-judgements. Thank you for reminding me that languages don’t handle courtesy the same way.
This was really nice to hear! Don't worry, one can only think through one's own mind =) I think it already helps if everybody is just a bit more aware that others might think differently and are not being deliberately mean. I have learned several languages, since Finnish is pretty useless on a large scale, and it always amazes me how differently things are thought in different languages. I can only recommend it!
You're right that please and thank you could come off as sarcastic or a bit odd in a comment section like this.
The best route is probably to clarify that it seems like a good concept, and you think that adding x and y would be the best features to focus on that would make it even better.
There’s a human being at the other end of the line. They have feelings. This is their baby, which they’ve gestated for multiple years.
Rational fundamentalism is not the one ring to rule them all. There is room for love and compassion in the world. I know (painfully well) that Internet forums (and this one in particular) do anything but empower empathy, but that doesn’t make it any less true or valuable.
There is room for love and empathy. There really is. Logical, rationalistic supremacy notwithstanding.
When your baby grows up, and you put it out in the world, you no longer need to--and no longer should--continue to treat it like a baby. The author is not her project, and statements about the project are not personal attacks on the author. And if she were going to take them as such, she probably wouldn't be sharing that project on HN.
This is not how humans work. Humans are not rational, emotionless robots. Humans are complex beings; they have feelings, biases and emotions which they do not control. It’s humane, considerate and kind to keep that in mind when communicating with each other.
Nobody is perfect. We shouldn’t expect anyone to be. Being kind and considerate is not a waste of anyone’s time, it’s not disrespectful nor patronising.
Maybe all of this is not true for OP; maybe they’re so zen they immediately see past their human instinct to attach themselves to their work (a well-known and documented human trait). But it can’t hurt to be tactful, just in case they’re not.
I know I personally would’ve felt hurt if that first comment were the reply to something I did. I might, hopefully, have realised that it was, indeed, a useful comment and that I should swallow my pride. And I hope I would have the presence of mind to reply with a smile and a thank you. But it would be a pretense.
Why put each other through that? We can be truthful and tactful. There is no reason to be blunt.
I can understand the desire to unshackle ourselves from the low entropy morass of politeness. It’s fake and not what you really feel, right? The truth shines through anyway, so do them the honour of not sugar coating it, and be straight? I do understand, but hear this: you lack the nonverbal cues to make that incredibly delicate call. Face to face you would see the twitch in someone’s eye, the soft breeze of disappointment whisking along their face. A tear of pain streaking their human heart while they smile and nod and say thank you. You have an incredibly attuned radar for empathy, and you would adjust. But when you lose all that, and the medium for your truth is this harsh knife forged from the soulless steel of words, it cuts more sharply than you can imagine.
In a text only medium, I implore everyone on HN: take a moment to consider our humanity. Not just here, but in all threads. Not just of the person we reply to, but of those we speak of. They might read it one day. And they, too, are human.
(I should say the painful irony in this rant is palpable to anyone who knows me :( but I am not proud of it. Perhaps it’s precisely why I feel so strongly. It does, after all, take one to know one.)
No, of course the word you're after is "tact". (And even as a foreigner, it annoys the heck out of me when people mix them up [which you didn't!], usually as in "take another tact" when they mean "take another tack" [where 'tack' is originally a sailing term for flipping the sail to the other side of the boat when going against the wind, IIUC; so the expression means "go in a slightly different direction"].)
I'm just riffing on Heikki's remark about foreign languages: In my primary language, Swedish, "tack" means "thank you" but is also often used for Eng. "please" -- so "Kan du ... , tack?" means "Could you ... , please?".
I second this, I did the exact same thing and it made me a bit mad -- why include interesting placeholders if you can't just "go" with them. So, in its current state, it might be a bit counterintuitive and could have a better UX.
I've poked at trying to build this multiple times for the last 10 years, but always ended up looking at the wikipedia XML and just balking.
The one difference that mine theoretically would have had is that I think it would be rad to include, along each edge in the graph, the paragraph where the link was found, so you could see a little story.
Also, excluding years and places to make the routes more "interesting" (e.g. longest path under a limit without cycles).
The UX design is very well executed. This feels so polished. The floating graphs in the background, the individual paths under the chart, etc. It's all very well done with lots of little flourishes. Makes me want to up my game. Thanks for sharing.
[copy of answer from below] The Wikipedia database doesn't differentiate links which appear in the main article versus in the sources or categories sections. It's possible one of the intermediate links is in there. You sometimes need to do a CTRL+f in "View Source" to find the link. Also, the latest Wikipedia dump is from February 2nd, so it's possible the link has been deleted since that date. I'll regenerate my database when the new dump lands in early March.
I do a bi-directional BFS, but the search from the target node traverses incoming links as opposed to outgoing links. That's why I have to store both in the `links` table[1].
Your notification for having JS disabled made me chuckle.
One suggestion I have is that the graph view seems to clutter up pretty fast. Maybe have a slider that increases the length of the lines between the vertices. Also, an SVG export would be cool for visualizing related concepts.
The graph visualization / performance is definitely not ideal. I spent a ton of time trying to make d3 more performant and layout the graph more nicely, but ultimately I just had to cut my losses and go with what I had. I do think there is room for improvement and I'll look into your suggestion, which is something I didn't consider. SVG export is also a great idea!
How about representing the destination page as a circle around the whole graph, instead of a node, so that all the paths can be drawn in different directions but still reach the same page? I somehow feel that it might look more beautiful.
Ooh cool idea! That certainly would improve the information density issue. I honestly never considered that at all and have no idea how I'd do it in d3, but I may try to hack it out. Thanks!
Maybe link all the outside nodes and then tell them to repel each other. If there are only two, it will be weird, but with three or more it should work out?
Thanks! I'm glad you asked. I actually do what I call a bi-directional breadth first search[1]. The gist of it is that instead of just doing a BFS from the source node until I reach the target node, I do a reverse BFS from the target node as well and wait until the two searches overlap. That helps with the exploding path problem, although that still becomes an issue for longer paths (>= 5 degrees generally). I also pre-compute all the incoming and outgoing links for each page when I create the database[2] so I don't need to do that upon every search, which resulted in a huge performance boost.
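For those curious what that looks like in code, here's a minimal sketch of the idea in Python. This is not the actual sdow implementation: it assumes the links are already in memory as dicts mapping page IDs to sets of neighbors, whereas the real thing fetches them from SQLite in batches.

```python
def bidirectional_bfs(source, target, outgoing_links, incoming_links):
    """Find the degrees of separation between two pages with a
    bi-directional BFS. The graph is directed, so the forward search
    follows outgoing links while the backward search follows incoming
    links. Returns the shortest path length, or None if unreachable."""
    if source == target:
        return 0

    forward = {source: 0}    # page -> distance from source
    backward = {target: 0}   # page -> distance to target
    forward_frontier, backward_frontier = {source}, {target}

    while forward_frontier and backward_frontier:
        # Always expand the smaller frontier to keep the search balanced.
        expand_forward = len(forward_frontier) <= len(backward_frontier)
        frontier = forward_frontier if expand_forward else backward_frontier
        visited = forward if expand_forward else backward
        links = outgoing_links if expand_forward else incoming_links

        next_frontier = set()
        for page in frontier:
            for neighbor in links.get(page, ()):
                if neighbor not in visited:
                    visited[neighbor] = visited[page] + 1
                    next_frontier.add(neighbor)
        if expand_forward:
            forward_frontier = next_frontier
        else:
            backward_frontier = next_frontier

        # Once the searches meet, the shortest path goes through the
        # meeting page with the smallest combined distance.
        meeting = forward.keys() & backward.keys()
        if meeting:
            return min(forward[page] + backward[page] for page in meeting)

    return None  # no path exists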
When I was first playing with this, I actually really expected you to be using an expensive performant solution like Neo4j. When I read that you were using SQLite, I didn't believe it at first and thought it was a typo from the dev until I looked over the source.
That's an impressive and well thought out performance enhancement, and it's remarkable that the app runs so blazingly fast on SQLite.
My thoughts exactly. It's impressive that despite relying on BFS, the tool manages to be so fast on a tiny node on Google Cloud, even with a depth of >= 5!
The downside of bidirectional seems to be that things like place names are linked via short paths through boring articles like "Census designated place" or "City"
Apparently an idiot here. What is the difference between web hosting and server hosting?
The way my mind interprets that stack is the database is hosted on firebase while the page is hosted on the server?
Edit: Thank you all for the explanation. I used to think firebase was used as a database. I didn't know one could host front end files there. It seems I still have a long ways to go :)
No, not an idiot. I didn't use the best terms. I updated them to say "Web (frontend) hosting" which are my static files (the HTML, JS, CSS) which is deployed to Firebase Hosting and "Server (backend) hosting" which is my backend Python Flask web server which is deployed to Google Compute Engine (GCE).
So, the website files are hosted on Firebase while the backend is hosted on GCE. The database is actually not hosted; it's just a SQLite file stored on my GCE instance.
Firebase Hosting is a simple way to host frontend code, similar to (but a little easier than) using S3 to serve the frontend. Since the database is SQLite, it seems like the backend and DB are hosted on GCE.
I assumed the other way around. Firebase gives you the webpage, but when you send a query, it is sent to Compute Engine to calculate all the paths and sent back to the frontend to render.
I'm interested in building a fact checker from a wikipedia graph, and your SDOW seems like a great place to start (I'm intending to use an algorithm inspired from researchers at Indiana University http://journals.plos.org/plosone/article?id=10.1371/journal....). I was wondering if your database has a non-GUI API. Is there a URL or something I can hit to get back JSON or XML as a response?
I'd prefer you not send any additional load to my server (this is just a side project I'm paying out of pocket for), but you are welcome to download the data yourself. There are instructions in the project README[1] to download the SQLite files I use in the project and I should have documented enough about the schema for you to know what queries to make. I am happy to answer questions via GitHub issues if you have them.
Could you please add a mode that does the opposite?
I've always enjoyed playing 6 degrees myself, so if it gives a link to the first page and names the second page, then only shows the available routes when I'm done, that would be a lot of fun.
I have a couple of "hub" articles that I like to use, but I'd like to see how much more effective I could have been with a tool like this. And if it randomizes my start and end like the placeholder text shows, that makes it even easier.
Love the idea, and it’s brilliantly executed! Well done.
Perhaps I misinterpreted the concept of “degrees of separation”, but I was expecting the site to tell me how to start at page X and get to page Y with the min number of clicks. If you wanted to achieve this, it doesn’t strike me as appropriate to use Bidirectional BFS but IANAL.
I did notice that someone pointed out that they get different results by swapping the order of X and Y. This seems pretty surprising?
> I was expecting the site to tell me how to start at page X and get to page Y with the min number of clicks.
Yup, this is exactly what the site does, and a bi-directional BFS is an efficient way to do it. The special thing about my bi-directional BFS is that I follow outgoing links when searching from the source page while following incoming links when searching from the target page[1].
> I did notice that someone pointed out that they get different results by swapping the order of X and Y. This seems pretty surprising?
This is expected, because it is a directed graph, with the links on Wikipedia being in one direction. Just because page A links to page B doesn't mean page B links to page A.
I was looking at the results of going from CMU to my little secondary school in Dublin, Ireland. I saw the results and saw that the last page before my Irish school was "College" and assumed it must be wrong, because how could my tiny secondary school be on the Wikipedia page for "College"? But alas, I was wrong!! I just checked and turns out it IS on the college wiki page!
I also assumed you were looking at outgoing links for both X and Y - that explains a lot.
I am super interested in this, but I have never done any graph theory or searching/planning (I'm EE) - how did you build up all of the incoming links for each wiki page? Are you storing all of this? How much data is that? Thanks for the reply!
I'm definitely not the first to think of it or build a tool for it (lots of similar projects gave me inspiration), but I think I'm the first to make it really fast and with a nice usable UI. And to actually open source the code so others can build it themselves.
Can you tell me more about what you mean by filtering by mutual paths?
> Can you tell me more about what you mean by filtering by mutual paths?
I should have said mutual connections, my apologies. So if article A connects to B and article B also connects to A. I suppose you could do this mostly client-side, all you need is two searches (one the reverse of the other) and an intersection of the resulting graphs, no?
I implemented this pretty naively a while back [1]. I was interested in how yours was so fast. I expected some sort of complex heuristic; cool to see that your solution is straightforward!
The database creation script[1] has a lot of Unix junk in it, but reading through the comments and echo statements should give you an idea of what it does. The end result is a SQLite database with a size of about 9 GB which has four tables, the schema of which are described in the README[2]. The big things that are precomputed: redirects are "auto-followed" to reduce the total graph size, and all incoming and outgoing links are stored in a |-separated string for each page (in the `links` table).
Every time a query is made, a bi-directional breadth-first search[3] is run which uses the |-separated incoming and outgoing links and runs a fairly standard BFS algorithm. A lot of the hard work was precomputed, which minimizes the number of required database queries and makes each search respond fairly quickly.
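To give a flavor of the batched lookups, here's a rough sketch of one expansion step against a schema like that. The table and column names follow the description above but are illustrative; this is not the actual sdow code.

```python
import sqlite3

def fetch_outgoing_links(conn, page_ids):
    """Fetch and parse the precomputed |-separated outgoing links for a
    whole batch of pages in a single query. (SQLite caps the number of
    bound parameters, so very large batches would need chunking.)"""
    placeholders = ",".join("?" for _ in page_ids)
    rows = conn.execute(
        f"SELECT id, outgoing_links FROM links WHERE id IN ({placeholders})",
        list(page_ids),
    )
    # Each row packs all of a page's outgoing links into one string, so
    # one fetch here replaces what would otherwise be a scan over
    # millions of (source_id, target_id) rows.
    return {
        page_id: [int(x) for x in links.split("|")] if links else []
        for page_id, links in rows
    }

# Usage sketch: expand one BFS level in a single round trip.
conn = sqlite3.connect("sdow.sqlite")  # hypothetical filename
frontier_links = fetch_outgoing_links(conn, [1, 2, 3])
```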
I've not yet had a chance to look over the code (in case it's already there or infeasible due to architecture), but you may wish to consider caching prior queries and their results - this seems like the type of service that would be likely to have certain paths shared widely, such as the first few top-level comments on this post.
I'm not sure caching would help a ton given how I structure the data and do my searches in batches of pages, not for individual pages. I already do some "caching" by precomputing all incoming and outgoing links for each page when I create the database, which, as you would expect, yields a huge performance improvement. A cache certainly would help, but I would expect the hit rate on it to be extremely low, making it not worth the effort. I may have a different opinion after analyzing some of today's results though. Thanks for the suggestion!
Just a heads up - some of the node colors can be difficult to differentiate for people who are red/green colorblind. Very minor, just wanted to mention it though.
I actually had a friend suggest it to me and the Neo4j docs happen to be one of the many tabs I currently have open. I was already so far into using SQLite for this project and I wanted to ship it, so I decided to stick with what I had. I would be interested to see how Neo4j performs with such a big dataset (the resulting SQLite file is around 9 GB with nearly 6 million nodes and 500 million links). I was a bit worried that Neo4j wouldn't be able to scale to a graph of that size, but that is a completely untested and ignorant opinion. If you have any experience with Neo4j, I'd love to hear your thoughts.
I've only played around with Neo4j, but it looks like Neo4j 3.0 could handle your node and edge count (although not earlier versions).
Your app likely would have been easier to write with python/Neo4j than python/SQLite due to Neo4j's query language (incidentally, Neo4j also prefers to use bidirectional BFS for shortest path, and executing the search is a very simple query similar to an SQL query). Neo4j would possibly perform better as well (I'm not sure about that, given SQLite is pretty optimized for reads).
However, as the neighboring comment states, Neo4j is quite resource-intensive. It wouldn't work on any kind of micro instance, I'm pretty sure.
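For a taste of how compact that query language is, here's a sketch using the official Neo4j Python driver. The `Page` label and `LINKS_TO` relationship type are invented for illustration; nothing like this exists in the actual project.

```python
from neo4j import GraphDatabase

# Assumes pages were imported as (:Page {title}) nodes connected by
# directed [:LINKS_TO] relationships -- an invented schema for this sketch.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

query = """
MATCH (a:Page {title: $source}), (b:Page {title: $target})
MATCH p = shortestPath((a)-[:LINKS_TO*..10]->(b))
RETURN [n IN nodes(p) | n.title] AS path
"""

with driver.session() as session:
    record = session.run(query, source="Hacker News", target="Kevin Bacon").single()
    print(record["path"] if record else "No path found")
```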
Awesome job! A great learning tool for someone coming from a non-CS background to actually dabble with BFS graph algorithms. After a quick search, I managed to ferret out a few implementations, one of which is this. Uses Neo4j.
https://github.com/erabug/wikigraph
The sheer scale of Wikipedia (5 million pages, half a billion links) made it difficult to make the searches fast. Simply downloading the Wikipedia database dumps and parsing them into my own database took over a day on my first successful attempt. The site returns most results in just a few seconds despite the giant graph size.
Were either of your creations around since ~2012? I recall someone sharing this very concept in a room on turntable.fm, except it would list its discovery in real-time as an ordered list. I've been racking my memory for 2 years trying to find a link again, but here we are!
Thank you! The graph is built using vanilla d3, no library on top of it. The code for it lives all in one file, ResultsGraph.js [1]. I pieced together the code from a handful of other attempts online. I am still not 100% pleased with the performance of it with a larger number of nodes (250+), but that seems to be a common complaint with the d3 force simulation layouts.
The resulting SQLite database file is currently 8.3 GB, most of which is taken up by the `links` table. The big performance wins are having a handful of indexes (see the .sql files[1] for the database's schema) and preprocessing a lot of data so I don't have to do duplicate work every time a query occurs. For example, instead of the `links` table going from `source_id` to `target_id` and having a ton of rows which have the same `source_id`, I go from `id` to `outgoing_links` (which is a |-separate string of all source page IDs). Computing each page's incoming and outgoing links is the really heavy work and I only do that at database creation time, using a beefy GCP machine with 8 vCPUs, 52 GB RAM, and a 256 GB SSD. It still takes about an hour, but it's a one time cost and means I can run the actual service on a much smaller machine which won't cost me a fortune to maintain. Also, SQLite is just very fast and performant out of the box, so as usual, it's a matter of choosing the right tools for the job.
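To make that layout tradeoff concrete, here's a toy comparison of the two schemas (column names are illustrative; the project's .sql files have the real ones):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Naive layout: one row per link. That's roughly half a billion rows
# for Wikipedia, and every BFS step becomes an indexed scan over them.
conn.execute("CREATE TABLE links_naive (source_id INTEGER, target_id INTEGER)")

# Aggregated layout, as described above: one row per page with all
# outgoing links packed into a single |-separated string, so each BFS
# step is a handful of primary-key lookups instead.
conn.execute("CREATE TABLE links (id INTEGER PRIMARY KEY, outgoing_links TEXT)")
conn.execute("INSERT INTO links VALUES (1, '2|3|5')")

row = conn.execute("SELECT outgoing_links FROM links WHERE id = 1").fetchone()
print([int(x) for x in row[0].split("|")])  # [2, 3, 5]
```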
Not sure if you deliberately designed it this way, but I noticed when spot checking some results that it includes the bibliography section links as connections. This seems like it may not be desirable. Example, I did a search that went from the Crusades to Buzz Aldrin and I noticed that Routledge was the first hop from the Crusades. It strikes me as odd that Routledge (a publishing company) would be mentioned on the Wiki article for the Crusades. So I went to look and noticed the link it took was the citation for a book published by this company. I wouldn't really count that as a legitimate hop since it's a citation, not a content link.
EDIT - I noticed this also applies to the Notes section.
A few years ago I made something similar (http://ratewith.science/) that only uses bi-directional links, that is pages that both link to each other, and this gives much more interesting results.
When two Wikipedia pages both link to each other they are usually related in some reasonable way, but unidirectional links give you things like Wikipedia -> California, which only exists because Wikipedia is headquartered in California, a pretty weak connection.
Other than the fact I have it running on an overburdened tiny VPS, my app is also really fast even though I only do a unidirectional BFS because I use a custom in-memory binary format that's mmapped directly from a file that's only 700MB, and a tight search loop written in D.
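The format itself is nothing fancy. Here's a Python sketch of the same idea (the real thing is in D and the details differ; the (offset, count) index layout below is hypothetical):

```python
import mmap
import struct

def open_graph(index_path, data_path):
    """Map both files into memory; the OS only pages in bytes we touch."""
    with open(index_path, "rb") as f:
        index = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    with open(data_path, "rb") as f:
        data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    return index, data

def neighbors(index, data, page_id):
    """Hypothetical format: the index holds one (offset, count) pair of
    uint32s per page ID; the data file is all neighbor IDs, concatenated."""
    offset, count = struct.unpack_from("<II", index, page_id * 8)
    return struct.unpack_from(f"<{count}I", data, offset * 4)
```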
Maybe I made a mistake somewhere, but I'm not sure it uses bi-directional links. For a really odd search like GoldenEye -> Abbotsford [1] it uses multiple "special" Wikipedia links that I'm pretty sure wouldn't be bidirectional.
Yah so what it does is try to find a path with bidirectional links, and then if it can't it tries to find a unidirectional path. You can tell which one it did by whether the numbers in the path have a pulsing glow animation, that particular path does not.
Yeah, unfortunately I don't know of any way to differentiate the different types of links. Wikipedia's pagelinks database doesn't differentiate them. I agree it's undesirable, but I just cannot figure out how to cull them.
I'm not sure how the backend is structured, but it seems that you must parse the individual pages at some point or another. I took a quick look at the Wikipedia HTML for a few pages and I would suggest stripping out anything within (or nested inside) of classes like "mw-cite-backlink", "reference-text", "citation book", "citation journal", etc. Also, you can probably strip out anything inside of a <cite> HTML tag.
I'm sure there are more classes and tags, but that hopefully should give you a solid place to start.
EDIT - You can also strip out or ignore anything inside of the ordered list for references - <ol class="references">...</ol>
Also, some pages aren't documented in the same way, so something like this page - https://en.wikipedia.org/wiki/X_Window_System - doesn't have any classes or easy way to parse it for the References section even though the Notes section was set up in a more organized way. However, you could take note that the <span> tag contains class="mw-headline" id="References" and the text value is also References and then ignore everything until the next <span> begins.
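To sketch what that stripping step could look like with BeautifulSoup (the selectors are just the classes and tags mentioned above, and the list is surely incomplete):

```python
from bs4 import BeautifulSoup

# Markup that indicates a citation/reference link rather than a content link.
CITATION_SELECTORS = [".mw-cite-backlink", ".reference-text", ".citation",
                      "ol.references", "cite"]

def content_links(html):
    """Return hrefs from a page with citation/reference markup removed first."""
    soup = BeautifulSoup(html, "html.parser")
    for selector in CITATION_SELECTORS:
        for element in soup.select(selector):
            element.decompose()  # drop the node and everything inside it
    return [a["href"] for a in soup.find_all("a", href=True)]
```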
I never actually touch any of the source HTML. I think that would simply take way too long and would probably result in some very high bandwidth charges. I use three tables from a public dump of Wikipedia's database, which unfortunately don't differentiate between where the links occur on the page. Check out the first section of my README[1] for more information.
Ohhhhh, yeah that changes things. I didn't realize the pagelinks database you mentioned was not generated by you based on the master Wiki dump that they keep available for download. In that case, yes this seems unsolvable without 1) petitioning them to update the code on their side to adjust for this and exclude these erroneous links or 2) deciding to create your own pagelink database based on the master wiki download and updating this database periodically (I forget how often Wiki updates the master downloads). While 1) is clearly the preferable option, it is unlikely to occur, unless maybe they are willing to add some columns to provide more details about the links and what sections, tags, classes, etc. they are a part of. That might be more palatable for Wiki to provide as it is just a code change on their end to supply more information rather than something possibly affecting many users by excluding the links altogether.
If it was technically possible, it would probably also be worth culling anything linked within a template. Like disambiguation headers at the top of a page, or semi-related lists grouped in blocks at the bottom (things like "Cities in Australia").
That might be a problem of Wikipedia itself. This project can hardly fix that. I mean, that link has hardly any business being there; "Oxford University Press" isn't linked either, though the page exists, and why would it be?
I think you'd have to parse the raw wikitext dump, keeping track of which section each link belongs in. Since the raw dumps is something like 50 GB of text, this sort of thing takes a while.
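For what it's worth, a library like mwparserfromhell can do the section bookkeeping. A rough sketch, assuming you've already streamed each page's wikitext out of the dump:

```python
import mwparserfromhell

def links_by_section(wikitext):
    """Map each section heading to the wikilink targets it contains, so
    sections like References or See also can be filtered out afterwards."""
    parsed = mwparserfromhell.parse(wikitext)
    sections = {}
    for section in parsed.get_sections(include_lead=True, flat=True):
        headings = section.filter_headings()
        title = str(headings[0].title).strip() if headings else "(lead)"
        sections[title] = [str(link.title) for link in section.filter_wikilinks()]
    return sections
```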
I was getting interesting results until I noticed a pattern involving the presence of "Wayback machine" as the only connecting dot between really different things.
That adds noise, since articles now automatically use the wayback machine for "archived" links, thus generating many paths that do not really connect topics, just because the text "wayback machine" is part of the link text.
It may be an interesting exercise to find outliers like that and compute paths without those nodes.
I considered this and may eventually add an option to ignore those kinds of pages, but I ultimately felt like the current mode remains more true to my goal for the project which is to traverse the links as any human would be able to. By the way, the two pages with the most incoming links are "Geographic coordinate system" (1,047,096 incoming links) and "International Standard Book Number" (955,957 incoming links).
Wikipedia is highly connected. Even pretty darn different things often seem to have only three degrees of separation. For example, here's Ramesses II to Ankylosaurus in three:
I don't know if this is legit. The Lion Express node only connects to a couple of Wikipedia's generic help pages, which I'm pretty sure don't link back to Lion Express.
This makes the "How many clicks to Hitler" game much faster.
For the uninitiated, the game was to click the "Random Article" link in the sidebar and count how many links it took to get to Hitler. It is really interesting to see just how big of an event WWII was. Every country article has a section on their involvement or why they were not involved.
After playing with it more, this is pretty fun. I vote that a "degrees from Hitler" score be added to the top of every article. I think it might be an interesting proxy for how esoteric a particular page is.
This reminds me of the wikipedia rule I learned a while back: If you click the first link in an article (besides the pronunciation guide), you will always end up on philosophy.
According to the wikipedia page about this rule/game, over 97% of pages have this property. Interestingly, "Mathematics" seems to be a rare exception. You get stuck in a loop:
There are several easy to reach pages without categories.
Almost anyone alive during WW2 will have a WW2 link. Almost every country is 1 degree from Hitler. Most companies/publications around then will link to Hitler.
My colleague saw me looking at this site, and I went through exactly the same thought process as you! His feeble attempts at non-Hitler linked articles failed miserably.
Kind of reminds me of PageRank in some way; the number of "backedges" is not guaranteed to equal the number of "forwardedges", hence the difference in the number of paths.
Actually, interestingly, I've been introducing this concept as a "party game" with other nerds at RL gatherings for some years now. The goal is to start on a random page and find the shortest path to another random page by only clicking links in the articles. It can be quite a lot of fun, despite what you're thinking! And anybody can understand the challenge and compete and have fun. It's not just something for geeks.
Sputnik-1_EMC/EMI_lab_model
Sputnik 1
Soviet space program
United Nations Committee on the Peaceful Uses of Outer Space
United Nations
Model United Nations
Dwight Schrute
Spud gun
Also I'm realizing after finding that that I could have used the site in this post...
I think it would be interesting to have a "swap" button between the start and end points, perhaps just under where it says "to". (similar to the swap buttons in translator apps and GPS apps to quickly swap the start and end points) As some other comments have mentioned, the paths are not necessarily the same route or length, and it is fun to see how they might be different.
Good suggestion! I intended to have that exact button, but couldn't find a way to put it in the UI without making things more confusing. I expect I'll add it in the future. Thanks for the suggestion!
I'd prefer to have a brief technical explanation why Javascript is needed instead of this condescending labeling.
Edit: NoScript is not a luxury for technically minded geeks anymore, it's a necessary protection against tracking, CPU-consuming advertisement and attacks like Spectre/Meltdown and who knows what else Intel has for us.
I found one with 813 paths with 4 degrees, but yours has 1,645 paths with 5 degrees. I suppose there are end points with many thousands of links. It's a bit of trivia, not really a measure of anything.
It's a bit surprising, since you'd think that with 1,645 paths with 5 degrees, there'd be at least one path with 4 degrees, but it doesn't work that way.
Incidentally, it's a good way to find incorrect ambiguous wikilinks. [[Comparison of web browsers]] shouldn't link directly to [[Gnome]], which is the article about small mythological humanoids, not about the desktop environment.
Wikipedia dumps raw database tables in a gzipped SQL format for the English language Wikipedia (enwiki) approximately once a month (e.g. dump from February 1, 2018)...By default, the script grabs the latest dump (available at https://dumps.wikimedia.your.org/enwiki/latest/), but you can also call the database creation script with a download date in the format YYYYMMDD as the first argument.
I'd guess the tool is working off an out of date article.
At the bottom of the page for each of the countries "Articles related to Italy" is a huge expandable section, with even more expandable sections inside. You can actually find both New Zealand and Faroe Islands in there.
If you view-source the Italy page, you can find a link to both of those countries.
Sexism links to both philosophy and Ethiopia. It is treating links as an undirected graph (despite the visual indication showing a directional vector), which seems entirely fair.
It seems somewhat unfair. If it's not a directed graph, it means you can't actually navigate from the source to the target page by following these links...
edit: in fact, the author describes the goal of his project in another comment here: "my goal for the project which is to traverse the links as any human would be able to".
Every single example I can find makes it painfully clear it is an undirected graph. Sets may differ going from one direction to the other simply because it is only trying to find a sample set of correlations, not an exhaustive set.
The graph is most definitely directed. One small example is Facebook -> Narcissism (1 path of 1 degree)[1] compared to Narcissism -> Facebook (8 paths of 2 degrees)[2].
Philosophy links to neither sexism or "The Demographics of Africa" (and while some older version might link to sexism, it seems unlikely that it ever linked to the Demographics of Africa"). Both of those do link to Philosophy, however.
Alternately it is directed, except when it doesn't find an easy route and uses an undirected result.
About 12 years ago, some colleagues (Yehuda Koren and Chris Volinsky) and I proposed a technique for measuring proximity in quasi-random (social) networks that can handle out-of-memory databases and more than two query nodes, and that also finds a visualizable subgraph representing as much of the relationship as possible, using dynamic programming. It is described here (includes some figures): http://web2.research.att.com/export/sites/att_labs/groups/in... The proposed heuristic approximates the "cycle-free effective [electrical] conductance" between the query nodes. In this paper, we were able to use anonymized phone calls to confirm 6 degrees of separation.
There used to be a live demo on an AT&T Labs website but it is not available now. There are published algorithms for all the phases of the proposed heuristic, but my recollection is that Yehuda found an efficient, robust implementation of k-disjoint-shortest-paths was not easy.
This is an interesting problem, thank you for making your work available. (I do agree the HTML form placeholders that change rapidly but are ignored when you press the GO button are a little confusing; it took me a minute or two to figure out what was going on.)
I clearly should've done some more user testing ;) I think that's a great suggestion and would be much better than showing an error message. I'll update that once I have some time when the server isn't crashing!
Took some tinkering to hit five degrees, from 'Hemisphere Dancer' (Jimmy Buffet's old seaplane) to the VC (Vapnik–Chervonenkis) dimension. Rewarding for the site to say 'wipes brow I really had to work for this one.'
I would love to see a list compiled somewhere of two articles with exactly 6 degrees of separation. This is proving to be extremely difficult. In the entire HN thread so far, I only see one so far by dkuder https://www.sixdegreesofwikipedia.com/?source=Six%20Degrees%...
I'm storing all the search results and will do some analysis on the data. Maybe I'll even get around to adding a new page with some of the interesting stuff I find.
This thing is pretty dang good, "Golden Ratio" to "Pope Pius X", what could possibly go wrong? Nailed it with "1 path with 2 degrees of separation". "Infinite Series" to "Declaration of Independence", nailed it in "4 paths with 3 degrees of separation". This is a lot of fun
This is brilliant. For a long time I have had a 'motorway game' I play with my brother-in-law where we identify two cars on the road and then have to come up with the connection. So often you see badge-engineered vans where one partner, e.g. Nissan rather than Renault, makes the van as seen, so you can match the other partner, e.g. Renault, since they are the same company. Then for the next vehicle, e.g. some Mercedes car, how do you get the link, and prove it at motorway speed?
Clearly this app makes it effortless, you could find some supplier like Recaro actually make the seats in both, even the van. Or you could find some joint venture in Brazil that both companies share, who knows... So I look forward to using this app on the M4 some time soon!
Well, that question only makes sense for connected graphs, and we don't really know whether this one is connected. So a version of this question that makes sense in general is: among all the connected components of the Wikipedia graph, what is the largest diameter?
This is an interesting question that I'd like to answer now that I have all the data. I am curious to see how long it will take to find a solution as I believe even the most efficient algorithms for this have a high runtime complexity.
And yes, the graph is not connected (there are both nodes with no outgoing links and with no incoming links), but over 99% of the pages are connected, so the answer would still be interesting and worthwhile.
Another interesting question would be what the largest graphs are that are disconnected from the "main" one. Are there any larger than 1 or 2 nodes? Is there some set of pages on some obscure topic that only link within themselves?
Floyd-Warshall solves the all-pairs shortest path in time O(V^3). Running a BFS search rooted from every node would find the shortest path in an unweighted graph in O(V*(V + E)).
Hence, since there's a whole lot of Vs in the wikipedia graph, he's probably going to be satisfied with an approximate solution, unless he has a lot of CPU hours to spare.
There's 5,579,252 articles in English Wikipedia right now, which means the largest component has 5 million and change or so vertices. Each individual BFS should take a few CPU-seconds at most, and you can trivially parallelize the BFS for each node across as many CPUs as you have.
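A sketch of that brute-force approach (pure Python for clarity; for the real graph you'd want a compiled language and one worker per chunk of source nodes, but the structure is just a BFS per node):

```python
from collections import deque

def eccentricity(adjacency, source):
    """Distance to the node farthest from `source` (BFS, unweighted).
    `adjacency` is assumed to map each node to a list of its neighbors."""
    distances = {source: 0}
    queue = deque([source])
    farthest = 0
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                farthest = max(farthest, distances[neighbor])
                queue.append(neighbor)
    return farthest

def diameter(adjacency):
    # O(V * (V + E)) overall, embarrassingly parallel across sources.
    return max(eccentricity(adjacency, node) for node in adjacency)
```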
> Several people were asking about what's known as the "diameter" of Wikipedia, that is, the distance between the two articles furthest apart (the longest shortest path if that makes any sense to you). This was in fact the original goal of the project but it turned out not to be very interesting. Wikipedia has giant "tails", almost linear linked lists of articles that stretch out for 70 links. The worst offenders were the subpages of List of named asteroids as each is only linked from the previous one, and it takes about 70 links to get from anywhere to the last one. So, you find that the two articles furthest from each other are List of asteroids/145701-145800, linked to by List of asteroids/145601-145700, linked to by List of asteroids/145501-145600, and so on for 70 links until you get back to some vaguely normal article. This is far less interesting than I was hoping. Even when I special-cased out that string of 70 boring articles, a new one appeared (I think it was linked pages about administrations of Myanmar or something). Rather than special-casing out reams of articles, I decided to pick a different metric, one that better reflects the structure of wikipedia without arbitrarily removing parts of it.
It appears that the particular tail has been "cut" however.
Would be cool to see the paths ranked in order of “narrowness”, e.g. ranking a path with shorter, more specific articles more highly than a path that goes through a mega-page like “France”.
This reminds me of that thing back around 2010 where if you clicked the first* link on any wikipedia page you'd eventually get to the article on philosophy or something. I forget which it was exactly, and I have no idea if it still works, but it was pretty fun back then trying to find a starting article that couldn't find the "philosophy" article.
*Non-italicised because of the disambiguation suggestions
Another similar project from years ago came to the conclusion that, disregarding "list" Wikipedia entries, the centre was the article on the United Kingdom. Did you reach a similar conclusion?
I like the little trivia that show up, although sometimes they disappear too quickly.
My favorite mentioned which Wikipedia article had the longest article name:
Suzukake no Ki no Michi de "Kimi no Hohoemi o Yume ni Miru" to Itte Shimattara Bokutachi no Kankei wa Dō Kawatte Shimau no ka, Bokunari ni Nannichi ka Kangaeta Ue de no Yaya Kihazukashii Ketsuron no Yō na Mono
(https://en.wikipedia.org/wiki/Suzukake_no_Ki_no_Michi_de_%22...)
The full fact list is on GitHub[1]. The fact list was a lot more interesting and important when searches took longer to run. One of the bad things about improving the performance so much was the fact that the facts don't have as long to display :P
It would be a bit more interesting if it could distinguish between the "See Also" and other tables at the bottom of the pages, and the main content. E.g. I connected Donald Knuth to a Geneticist simply because they both won awards (in different categories).
Unfortunately I'm not aware of a way to distinguish between the two. Wikipedia stores both types of links in the same database. I would love to cull out all the links in category boxes and sources. If anyone has any ideas, let me know!
Very cool. And far fewer degrees of separation between my tests than I expected - so far everything has a legitimate 3 degrees when I expected much more. Higgs Boson to Taylor Swift in 3 steps, amazing. Not only that, but 91 ways to get there in just 3 steps.
This basically quantifies what my wife and I jokingly refer to as "rabbit-holing" online. She'll ask me what I'm reading and it will be something totally unrelated to what I said I was coming to look up. And she's always like "how did that happen?" And I never have a good answer. But now I do (if it involves Wikipedia)!
I love all the little touches, like the fun facts (2.51% of Wikipedia articles don't have incoming links) and UI touches ("wipes brow, I really had to work hard for this one").
I noticed that pages about countries are good hubs for this kind of connection. The page cites someone, their page has a link to their country of origin, and these country pages are really extensive and have links to a variety of topics.
You should track the "longest shortest paths" of things and display them - has anyone managed to hit 10 degrees of separation, for example? If someone finds one, let us know!
Those graph node colors are not very color blind friendly. They are easy to differentiate in the list view, but in the graph with black borders, they are just too low contrast.
This is incorrect, that's not the reason they are linked. It is using the category box at the bottom of the page, and both Hitler and Gates were voted 'Time Person of the Year' at some point.
It looks like it uses links in the 'category' box at the bottom of the page. Both MLK and Zuckerberg appear in the 'Time Person of the Year' category.
Please make it Ctrl+Scroll to zoom (a UI standard obeyed in umpteen applications!), and leave Scroll alone for scrolling through the page.
Nobody wants to have to click outside the graph view to change the state of the UI so that Scroll scrolls the page.
(Of course, Ctrl+Scroll is already taken for browser zoom; but hijacking browser zoom in this situation is more acceptable than hijacking vertical scroll).
So the reason is that "Theodore Roosevelt" links to "Freddy Fazbear" (ctrl+f for "Articles related to Theodore Roosevelt" and then click it and then click "Teddy Bears") which redirects to "Five Nights at Freddy's". So, technically, there is a direct link there, albeit through the categories dropdowns. Unfortunately, Wikipedia doesn't have anything in their database dumps identifying if a link is in the article itself versus the categories dropdowns or sources, so I include them all.
I have built a very similar project some time ago and although it's not as beautiful and organized as yours, it's pretty fast! It's in Portuguese, but if any of you guys want to check it out: http://wikigraph.russoft.tech/
Pretty nifty! Nice visualization and execution. It reminded me of a game a friend of mine wrote as a Greasemonkey script back in the day, where you do the work: http://www.playwikipaths.com
One challenge with Wikipedia data I've struggled with over the years is that knowledge graph and interest graph don't directly overlap. Cancer is about as closely related to Healthcare as Horse is to Astronaut (stronger ties, but same distance).
Very cool! Your UI is great. I like all the animations and the graph is super smooth. All my code is open source[1] and it is decently documented. I'm happy to answer any questions you have and you're more than welcome to use any of the code I wrote for your project.
Created an account just to say this: First time in a long time I've seen something this fun on Hacker News. Just spent the last 10 minutes trying to see the hidden connections between Bagels and the Atlantic Puffin. Great work!
Really, really cool implementation, and I like the UI. This inspires me to make a kind of six degrees of separation of companies via their board directors and the political climate in my country.
I think this is going to be extremely computationally intensive. One of the big performance wins I got when designing the search algorithm[1] was visiting as few nodes in the graph as possible (which I did via a bi-directional breadth-first search). To find the most distant node, I'd need to traverse the entire graph, which consists of almost 6 million nodes. It can be done, but it would take minutes, hours, days, ...
Thanks! It's weird, but as soon as I tried upgrading to an identical GCP machine running Debian 9 (stretch) and run my database creation process[1], it took many, many times longer to even download the several GB dumps from Wikipedia via wget than it did on Debian 8 (jessie). As the beefy machine I use for the database creation process costs ~40 cents per hour, I figured it wasn't worth it to upgrade to Debian 9 since the Debian 8 machine finished in one hour and worked fine. After trying it a second time a month later and realizing the same thing happened, I added a note for myself in the README about it. I did find some other reports about it online[2] and just decided it wasn't that critical to upgrade it at the moment.
Yes, Stephen Dolan's project (along with others) was definitely an inspiration for me! Although, I like to think I made a lot of performance and design improvements over his work. I used to have an acknowledgements section in my README with his name in it, but I seem to have destroyed that. I'll make sure to give credit where credit is due and list all my inspirations.
All I get for every search (no matter what terms) is: "Whoops... something is broken and has been reported. In the mean time, please try a different search."
This looks great! Nice work.
I have a suggestion: could you highlight a path when an edge is clicked on? I guess clicking on the nodes can link to their wiki pages.
Yes, definitely a feature I'd like to add. I can change opening the page on Wikipedia to happen on double-click instead of single click. I am still a d3 noob and need to figure out how to implement the highlight-path thing.
The Wikipedia database doesn't differentiate links which appear in the main article versus in the sources or categories sections. It's possible one of the intermediate links is in there. You sometimes need to do a CTRL+f in "View Source" to find the link. Also, the latest Wikipedia dump is from February 2nd, so it's possible the link has been deleted since that date. I'll regenerate my database when the new dump lands in early March.
5 degrees if you pick Vim (disambiguation page). If you pick Vim (text editor) you get 1 degree. Which is interesting, since Vim (disambiguation page) links to Vim (text editor).
I know you're getting swamped by feature requests, but a button to swap the endpoints would be great. In searching for ever bigger degrees of separation I found myself manually swapping them a lot to see the difference. It could also lessen some of the confusion expressed in this thread about whether these should be different at all.
As for how far I've gotten: I've found finite degrees all the way up to seven [1] and also pages with no path from one to the other [2]. I have yet to find a doubly-untraversable path or anything with a degree between seven and infinity, and I suspect the latter to be impossible, especially considering Goldbach's Extremely Strong Conjecture [3].
[1] https://github.com/jwngr/sdow