Show HN: Six Degrees of Wikipedia (sixdegreesofwikipedia.com)
1176 points by jwngr 7 months ago | 323 comments



Creator here. Six Degrees of Wikipedia is a side project I've been sporadically hacking on over the past few years. It was an interesting technical challenge and it's fun to play with the end result. Here's the tech stack:

  * Frontend: React (Create React App)
  * Backend: Python Flask
  * Database: SQLite
  * Web (frontend) hosting: Firebase Hosting
  * Server (backend) hosting: Google Compute Engine (it runs fine on a tiny f1-micro instance)
All the code is open source[1] and I'm happy to answer any questions about building or maintaining it!

[1] https://github.com/jwngr/sdow


Make it so that if I visit the page and just click the "go" button, it will use the placeholder examples as the start and end points. I did this and got an error message stating "You'll probably want to choose the start and end pages before you hit that." That was annoying. The placeholders that were auto-chosen were actually really interesting.


'Please' and 'thank you' go a long way when requesting additional features for an OSS project.


There's a certain irony to the lack of tact in this post suggesting better manners. I suspect the point would be better received were it more politely made.

I, of course, am merely propagating the cycle.


That post seems like quintessential trolling. It was my expectation as well when going to the page to just click the "Go" button and see what the results would be, and I think the suggestion is great feedback. I think it is generally in the spirit of these "Show HN" posts to give and get feedback and discuss with the creator. The response post to that user, to me, is patronizing and is undoubtedly going to ruffle some feathers. But, now we aren't even talking about the OP's cool project...


When I was a kid my mom told me "always say 'please' and 'thank you'."

So from then on, when I wanted something, I said "please and thank you."

I think this was my first social hack. I just wish it had worked better!


I've seen something similar when offering items on Freecycle: people seem to think that the more they insert "pls" and "thnx" into a message, the more likely it is that they'll be first in the queue.

I swear at least once there were more instances of "pls" and "thnx" (there seems to be a significant overlap between people who overuse the words and people who don't bother to type them properly, though that may be the linguistic snob in me talking) than all other words combined. It backfired by making the message difficult to read, so I binned it and read the next.


Come on, this is Show HN, not some GitHub issue tracker. Try to insert please and thank you into that post ... it will make it sound rude! It will turn it from Show HN feedback into sounding like a feature request, which it is clearly not...


Just adding "Could you" to the start (and changing the first sentence's . to a ?) would make a world of difference.


If you can automatically detect sentences where adding "Could you" would make the author sound more tactful, then why not just mentally add them and assume that the author intended to sound that tactful?

This is a slight modification of the principle of charity, something I think is incredibly important and underemphasized, especially for online text communication.


> why not just mentally add them and assume that the author intended to sound that tactful?

Because that could be the wrong action to take. Could you, Would you, Will you, I want you to, You should, The project is stupid unless you, etc.

Therefore, the author should communicate well enough to be understood.


Based on the thread here, many people interpreted the comment as a well-intended feature request, but instead of requesting the commenter to change their comment, it seems like it would’ve been fine to just use that charitable interpretation.


I'm all for being charitable in our interpretations of the motives of others, but I think a total lack of the most basic courtesy should be seen for what it is. The low end of the spectrum is lazy and the high end is rude; even charitably, no positive interpretation exists, IMO.


That's a really cool concept. Write a medium article about it.


I can't tell if this is disingenuous or not...


I think that's the point.


The link in your profile is broken, FYI.


Please and thank you are such basic concepts that, honestly, I judge the hell out of people who don't seem capable of them, regardless of platform. If that minimal degree of politeness isn't reflexive for you, again, the world at large judges you. Not to put too fine a point on it, but maybe "why not add it in your head" is the kind of thing that gives some in tech a bad name?

It costs nothing to be polite.


This is a bit off topic, but just to blow one's mind: Finnish, which I speak natively, completely lacks a direct word for "please", and I have to admit that, especially orally, I often forget it. (We make things sound nice with the conditional: "could you", "would you be so kind", etc.) I have no immediate reaction whatsoever when someone leaves out "please". I understand the concept, but it feels horrible how wrong I can go with native English speakers without meaning it. Thanks for the reminder! ("Thank you" we do have ;)


That’s not off-topic at all! Honestly, I’m ashamed that I hadn’t considered the possibility of a non-native speaker’s linguistic quirks, and what this tells me is that I need to continue to work on my snap-judgements. Thank you for reminding me that languages don’t handle courtesy the same way.


This was really nice to hear! Don't worry, one can only think through one's own mind =) I think it already helps if everybody is just a bit more aware that others might think differently and are not being deliberately mean. I have learned several languages, since Finnish is not very useful on a large scale, and it always amazes me how differently things are thought in different languages. I can only recommend it!


You're right that please and thank you could come off as sarcastic or a bit odd in a comment section like this.

The best route is probably to clarify that it seems like a good concept, and you think that adding x and y would be the best features to focus on that would make it even better.


Please use "Please" when you are requesting more work from OSS, or even better, make the change yourself and show all of us how it should be done.


It's a suggestion, not a request, and a good one at that. So who should be saying 'thank you' here (if anyone)?


There’s a human being at the other end of the line. They have feelings. This is their baby, which they’ve gestated for multiple years.

Rational fundamentalism is not the one ring to rule them all. There is room for love and compassion in the world. I know (painfully well) that Internet forums (and this one in particular) do anything but empower empathy, but that doesn't make it any less true or valuable.

There is room for love and empathy. There really is. Logical, rationalistic supremacy notwithstanding.


When your baby grows up, and you put it out in the world, you no longer need to--and no longer should--continue to treat it like a baby. The author is not her project, and statements about the project are not personal attacks on the author. And if she were going to take them as such, she probably wouldn't be sharing that project on HN.


This is not how humans work. Humans are not rational, emotionless robots. Humans are complex beings: they have feelings, biases and emotions which they do not control. It's humane, considerate and kind to keep that in mind when communicating with each other.

Nobody is perfect. We shouldn’t expect anyone to be. Being kind and considerate is not a waste of anyone’s time, it’s not disrespectful nor patronising.

Maybe all of this is not true for OP; maybe they're so zen they immediately see past the human instinct to attach themselves to their work (a well-known and documented human trait). But it can't hurt to be tactful, just in case they're not.

I know I personally would’ve felt hurt if that first comment were the reply to something I did. I might, hopefully, have realised that it was, indeed, a useful comment and that I should swallow my pride. And I hope I would have the presence of mind to reply with a smile and a thank you. But it would be a pretense.

Why put each other through that? We can be truthful and tactful. There is no reason to be blunt.

I can understand the desire to unshackle ourselves from the low-entropy morass of politeness. It's fake and not what you really feel, right? The truth shines through anyway, so do them the honour of not sugar coating it, and be straight? I do understand, but hear this: you lack the nonverbal cues to make that incredibly delicate call. Face to face you would see the twitch in someone's eye, the soft breeze of disappointment whisking along their face. A tear of pain streaking their human heart while they smile and nod and say thank you. You have an incredibly attuned radar for empathy, and you would adjust. But when you lose all that, and the medium for your truth is this harsh knife forged from the soulless steel of words, it cuts more sharply than you can imagine.

In a text only medium, I implore everyone on HN: take a moment to consider our humanity. Not just here, but in all threads. Not just of the person we reply to, but of those we speak of. They might read it one day. And they, too, are human.

(I should say the painful irony in this rant is palpable to anyone who knows me :( but I am not proud of it. Perhaps it’s precisely why I feel so strongly. It does, after all, take one to know one.)


Tact.


Stack overflow has conditioned me to not use manners. My personal preference is to use them.


He is not making any feature request. He is only pointing out a serious UI bug. Free work.


This has been fixed[1] and should behave in a more intuitive way now. Thanks for the suggestion!

[1] https://github.com/jwngr/sdow/commit/6e42e06488a592784e5d3d2...


awesome!!! you rock!!!


I second this; I did the exact same thing and it made me a bit mad -- why include interesting placeholders if you can't just "go" with them? So, in its current state, it might be a bit counterintuitive and could have a better UX.


I've poked at trying to build this multiple times over the last 10 years, but always ended up looking at the Wikipedia XML and just balking.

The one difference that mine theoretically would have had is that I think it would be rad to include, along each edge in the graph, the paragraph where you found the link, so you could see a little story.

Also, excluding years and places to make the routes more "interesting" (e.g. longest path under a limit without cycles).


The UX design is very well executed. This feels so polished. The floating graphs in the background, the individual paths under the chart, etc. It's all very well done with lots of little flourishes. Makes me want to up my game. Thanks for sharing.


Am I not understanding something? When I linked Obama to Uruguay, Myanmar came up (the only surprising one).

https://www.sixdegreesofwikipedia.com/?source=Uruguay&target...

I went to the Uruguayan wiki page and found nothing on Myanmar and nothing relating to Uruguay on the Myanmar page.

Awesome project btw, great idea.


[copy of answer from below] The Wikipedia database doesn't differentiate links which appear in the main article versus in the sources or categories sections. It's possible one of the intermediate links is in there. You sometimes need to do a CTRL+f in "View Source" to find the link. Also, the latest Wikipedia dump is from February 2nd, so it's possible the link has been deleted since that date. I'll regenerate my database when the new dump lands in early March.


It's a directed graph. Your source is Uruguay and your destination is Obama. You'll find Obama on the midpoints.

Opposite direction: no Myanmar https://www.sixdegreesofwikipedia.com/?source=Uruguay&target...


Which is kind of weird, since he is using bi-directional BFS.


I do a bi-directional BFS, but the search from the target node traverses incoming links as opposed to outgoing links. That's why I have to store both in the `links` table[1].

[1] https://github.com/jwngr/sdow/blob/f0b5a9ebe47ea0eca49d8220a...


Your notification for having JS disabled made me chuckle.

One suggestion I have is that the graph view seems to clutter up pretty fast. Maybe have a slider that increases the length of the lines between the vertices. Also, a SVG export would be cool for visualizing related concepts.


Glad you found one of the Easter eggs :D

The graph visualization / performance is definitely not ideal. I spent a ton of time trying to make d3 more performant and lay out the graph more nicely, but ultimately I just had to cut my losses and go with what I had. I do think there is room for improvement and I'll look into your suggestion, which is something I didn't consider. SVG export is also a great idea!


Interesting you used d3. I used sigma.js on a very similar project in the past and only had performance issues with giant graphs. http://sigmajs.org/


> Glad you found one of the Easter eggs :D

Easter egg? I never have JS enabled on random websites.


Hi jwngr,

Thanks for the cool hack. It's nice.

How about representing the destination page as a circle around the whole graph, instead of a node? That way all the paths can be drawn in different directions while still reaching the same page. I somehow feel that it might look more beautiful.


Ooh cool idea! That certainly would improve the information density issue. I honestly never considered that at all and have no idea how I'd do it in d3, but I may try to hack it out. Thanks!


Maybe link all the outside nodes and then tell them to repel each other. If there are only two it will be weird, but with three or more it should work out?


Great work!

Do you simply do a BFS to find the shortest paths? If so, are you doing any tricks to avoid the path explosion problem?


Thanks! I'm glad you asked. I actually do what I call a bi-directional breadth first search[1]. The gist of it is that instead of just doing a BFS from the source node until I reach the target node, I do a reverse BFS from the target node as well and wait until the two searches overlap. That helps with the exploding path problem, although that still becomes an issue for longer paths (>= 5 degrees generally). I also pre-compute all the incoming and outgoing links for each page when I create the database[2] so I don't need to do that upon every search, which resulted in a huge performance boost.

[1] https://github.com/jwngr/sdow/blob/a2699dc95d884ec64a4641630... [2] https://github.com/jwngr/sdow/blob/a2699dc95d884ec64a4641630...
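For readers curious what the meet-in-the-middle idea looks like in code, here is a minimal, illustrative Python sketch of a level-at-a-time bi-directional BFS over a directed graph (toy data structures and names, not the project's actual implementation):

```python
def bidirectional_bfs(outgoing, incoming, source, target):
    """Return the shortest path length from source to target, or None.

    `outgoing`/`incoming` map each page to the set of pages it links
    to / is linked from. Illustrative only; see the linked repo for
    the real implementation.
    """
    if source == target:
        return 0
    forward, backward = {source: 0}, {target: 0}
    f_frontier, b_frontier = {source}, {target}

    def expand(frontier, dist, other_dist, graph):
        # Grow one side by a full BFS level; report the best meeting point.
        best, new = None, set()
        for node in frontier:
            for nxt in graph.get(node, ()):
                if nxt in other_dist:  # the two searches have met
                    length = dist[node] + 1 + other_dist[nxt]
                    best = length if best is None else min(best, length)
                elif nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    new.add(nxt)
        return new, best

    while f_frontier and b_frontier:
        # Always expand the smaller frontier. The backward search walks
        # *incoming* links, which is why both directions must be stored.
        if len(f_frontier) <= len(b_frontier):
            f_frontier, best = expand(f_frontier, forward, backward, outgoing)
        else:
            b_frontier, best = expand(b_frontier, backward, forward, incoming)
        if best is not None:
            return best
    return None
```

Because each side only explores roughly half the search depth, the number of visited pages grows far more slowly than with a one-sided BFS.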


When I was first playing with this, I really expected you to be using an expensive, performant solution like Neo4j. When I read that you were using SQLite, I didn't believe it at first and thought it was a typo until I looked over the source.

That's an impressive and well-thought-out performance enhancement, and it's remarkable that the app runs so blazingly fast on SQLite.


My thoughts exactly. It's impressive that, despite relying on BFS, the tool manages to be so fast on a tiny node on Google Cloud. Even when the depth is >= 5!


SQLite in particular is often overlooked: a marvelous library that should be part of every personal toolkit.


Neo4j is not necessarily more performant, especially for a particular use case like this.


The downside of the bidirectional approach seems to be that things like place names are linked via short paths through boring articles like "Census-designated place" or "City".


The bi-directional nature of the search does not change the end result. It is simply a performance improvement.


Still, it can be annoying. Maybe there could be an option to exclude articles with specific text/link ratios?


It is bidirectional BFS: https://github.com/jwngr/sdow/blob/master/sdow/breadth_first...

A* can't be used given that path cost or expected remaining distance is unknown.

Any ideas on how such an algorithm could be used without precomputing the entire graph?


Apparently an idiot here. What is the difference between web hosting and server hosting?

The way my mind interprets that stack is the database is hosted on firebase while the page is hosted on the server?

Edit: Thank you all for the explanation. I used to think Firebase was just a database; I didn't know one could host frontend files there. It seems I still have a long way to go :)


No, not an idiot. I didn't use the best terms. I updated them to say "Web (frontend) hosting" which are my static files (the HTML, JS, CSS) which is deployed to Firebase Hosting and "Server (backend) hosting" which is my backend Python Flask web server which is deployed to Google Compute Engine (GCE).

So, the website files are hosted on Firebase while the backend is hosted on GCE. The database is actually not hosted; it's just a SQLite file stored on my GCE instance.


Firebase Hosting is a simple way to host frontend code, similar to (but a little easier than) using S3 to serve the frontend. Since the database is SQLite, it seems like the backend and DB are hosted on GCE.


I assumed the other way around. Firebase gives you the webpage, but when you send a query, it is sent to Compute Engine to calculate all the paths and sent back to the frontend to render.


This is an awesome tool!

I'm interested in building a fact checker from a wikipedia graph, and your SDOW seems like a great place to start (I'm intending to use an algorithm inspired from researchers at Indiana University http://journals.plos.org/plosone/article?id=10.1371/journal....). I was wondering if your database has a non-GUI API. Is there a URL or something I can hit to get back JSON or XML as a response?


I'd prefer you not send any additional load to my server (this is just a side project I'm paying out of pocket for), but you are welcome to download the data yourself. There are instructions in the project README[1] to download the SQLite files I use in the project and I should have documented enough about the schema for you to know what queries to make. I am happy to answer questions via GitHub issues if you have them.

[1] https://github.com/jwngr/sdow#get-the-data-yourself


That makes sense; thanks for your response!


Some years ago, I owned sixipedia.com. I never knew what to do with it.... If I still had it, I would have given it to you.


Could you please add a mode that does the opposite?

I've always enjoyed playing 6 degrees myself, so if it gives a link to the first page and names the second page, then only shows the available routes when I'm done, that would be a lot of fun.

I have a couple of "hub" articles that I like to use, but I'd like to see how much more effective I could have been with a tool like this. And if it randomizes my start and end like the placeholder text shows, that makes it even easier.



Love the idea, and it’s brilliantly executed! Well done.

Perhaps I misinterpreted the concept of “degrees of separation”, but I was expecting the site to tell me how to start at page X and get to page Y with the min number of clicks. If you wanted to achieve this, it doesn’t strike me as appropriate to use Bidirectional BFS but IANAL.

I did notice that someone pointed out that they get different results by swapping the order of X and Y. This seems pretty surprising?

Well done again!


Thanks, glad you enjoyed it!

> I was expecting the site to tell me how to start at page X and get to page Y with the min number of clicks.

Yup, this is exactly what the site does, and a bi-directional BFS is an efficient way to do it. The special thing about my bi-directional BFS is that I follow outgoing links when searching from the source page while following incoming links when searching from the target page[1].

> I did notice that someone pointed out that they get different results by swapping the order of X and Y. This seems pretty surprising?

This is expected, because it is a directed graph, with the links on Wikipedia being in one direction. Just because page A links to page B doesn't mean page B links to page A.

[1] https://github.com/jwngr/sdow/blob/master/sdow/breadth_first...
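A tiny, self-contained illustration of that asymmetry, using a plain one-sided BFS over hypothetical link data (the article names and links below are invented for the example):

```python
from collections import deque

def bfs_distance(adj, source, target):
    """Hop count from source to target over directed links, or None."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return dist[node]
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None

# Hypothetical data: the "Apple Inc." article links to "California",
# but "California" never links back to "Apple Inc."
links = {"Apple Inc.": ["California"], "California": ["United States"]}
```

Here `bfs_distance(links, "Apple Inc.", "California")` is 1, while the reverse direction is unreachable.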


I just realized what my confusion was over!

https://www.sixdegreesofwikipedia.com/?source=Carnegie%20Mel...

I was looking at the results of going from CMU to my little secondary school in Dublin, Ireland. I saw the results and saw that the last page before my Irish school was "College" and assumed it must be wrong, because how could my tiny secondary school be on the Wikipedia page for "College"? But alas, I was wrong!! I just checked and turns out it IS on the college wiki page!

I also assumed you were looking at outgoing links for both X and Y - that explains a lot.

I am super interested in this, but I have never done any graph theory or searching/planning (I'm EE) - how did you build up all of the incoming links for each wiki page? Are you storing all of this? How much data is that? Thanks for the reply!


Check out an earlier comment I made[1] which has some information about this. It also includes links to the relevant code, which is all open source.

[1] https://news.ycombinator.com/item?id=16469260


I had a conversation with a friend a few weeks ago about how surely this already exists, and if not, someone should make it.

Any plans to filter by mutual paths?


I'm definitely not the first to think of it or build a tool for it (lots of similar projects gave me inspiration), but I think I'm the first to make it really fast and with a nice usable UI. And to actually open source the code so others can build it themselves.

Can you tell me more about what you mean by filtering by mutual paths?


> Can you tell me more about what you mean by filtering by mutual paths?

I should have said mutual connections, my apologies. So if article A connects to B and article B also connects to A. I suppose you could do this mostly client-side, all you need is two searches (one the reverse of the other) and an intersection of the resulting graphs, no?
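With plain adjacency sets, that mutual-connection filter is only a few lines (hypothetical data shapes, not the site's API):

```python
def mutual_links(outgoing, page):
    """Pages that `page` links to which also link back to it."""
    return {other for other in outgoing.get(page, set())
            if page in outgoing.get(other, set())}

# Toy link data: A <-> B is mutual, A -> C is one-way.
outgoing = {"A": {"B", "C"}, "B": {"A"}, "C": {"D"}}
```

Running `mutual_links(outgoing, "A")` keeps only `B`, since `C` never links back.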


I implemented this pretty naively a while back [1]. I was interested in how yours was so fast. I expected some sort of complex heuristic; cool to see that your solution is straightforward!

[1] https://github.com/wwalexander/wikipath


I think GP wanted to find paths in both directions: X -> A -> B -> Y and Y -> C -> D -> X. Possibly where A = D and B = C.


I would like to hear a little more about how you organized the search, what you pre-process, and what you calculate on the fly. Thanks.


The database creation script[1] has a lot of Unix junk in it, but reading through the comments and echo statements should give you an idea of what it does. The end result is a SQLite database with a size of about 9 GB which has four tables, the schema of which are described in the README[2]. The big precomputed pieces: redirects are "auto-followed" to reduce the total graph size, and all incoming and outgoing links are stored in a |-separated string for each page (in the `links` table).

Every time a query is made, a bi-directional breadth-first search[3] is run which uses the |-separated incoming and outgoing links and runs a fairly standard BFS algorithm. A lot of the hard work was precomputed, which minimizes the number of required database queries and makes each search respond fairly quickly.

[1] https://github.com/jwngr/sdow/blob/master/database/buildData... [2] https://github.com/jwngr/sdow#database-creation-process [3] https://github.com/jwngr/sdow/blob/master/sdow/breadth_first...
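As a rough sketch of how a query against that packed representation might look, here is a miniature in-memory version (column names follow the README's description, but the exact schema is a guess here):

```python
import sqlite3

# Miniature of the packed `links` layout described above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE links ("
    "id INTEGER PRIMARY KEY, outgoing_links TEXT, incoming_links TEXT)"
)
conn.executemany(
    "INSERT INTO links VALUES (?, ?, ?)",
    [(1, "2|3", ""), (2, "3", "1"), (3, "", "1|2")],
)

def neighbors(page_ids, column):
    """Fetch a whole BFS level in one query, then unpack the strings."""
    # `column` is supplied by trusted code, never by user input.
    placeholders = ",".join("?" * len(page_ids))
    rows = conn.execute(
        f"SELECT id, {column} FROM links WHERE id IN ({placeholders})",
        list(page_ids),
    )
    return {pid: [int(x) for x in packed.split("|") if x]
            for pid, packed in rows}

assert neighbors([1, 2], "outgoing_links") == {1: [2, 3], 2: [3]}
```

One round trip to the database per BFS level, rather than one query per page, is what keeps the per-search query count low.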


Thanks!


I've not yet had a chance to look over the code (in case it's already there or infeasible due to architecture), but you may wish to consider caching prior queries and their results - this seems like the type of service that would be likely to have certain paths shared widely, such as the first few top-level comments on this post.


I'm not sure caching would help a ton given how I structure the data and do my searches in batches of pages, not for individual pages. I already do some "caching" by precomputing all incoming and outgoing links for each page when I create the database, which, as you would expect, yields a huge performance improvement. A cache certainly would help, but I would expect the hit rate on it to be extremely low, making it not worth the effort. I may have a different opinion after analyzing some of today's results though. Thanks for the suggestion!


This is awesome.

Just a heads up - some of the node colors can be difficult to differentiate for people who are red/green colorblind. Very minor, just wanted to mention it though.


Looks pretty cool.

Simple question I have is: are you hitting the Wikipedia API live, or do you have a dump of Wikipedia that you run through?

If running through a dump, do you update it regularly, and how?

Thanks in advance.


The autocomplete suggestions hit the live Wikipedia API[1]. The actual search algorithm is on a dump of Wikipedia[2], which I plan to update monthly.

[1] https://github.com/jwngr/sdow/blob/f39398d112fecf7b993c64bd4... [2] https://github.com/jwngr/sdow#data-source
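For anyone wanting to reproduce the autocomplete behavior, the public MediaWiki `opensearch` endpoint returns title suggestions; this sketch just builds the request URL (my guess at reasonable parameters, not necessarily the site's exact request):

```python
from urllib.parse import urlencode

WIKI_API = "https://en.wikipedia.org/w/api.php"

def autocomplete_url(prefix, limit=10):
    """Build a title-suggestion request against MediaWiki's public
    `opensearch` endpoint (a stand-in for whatever the site calls)."""
    params = {
        "action": "opensearch",
        "search": prefix,
        "limit": limit,
        "format": "json",
    }
    return f"{WIKI_API}?{urlencode(params)}"

# Fetch it with e.g. urllib.request.urlopen(autocomplete_url("Six Degrees"))
```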


Hey man, just a nitpick. The input fields fudge up when using dark GTK themes, as in text isn't legible unless I select it. Might wanna look into it.


Great project! I wonder if you might be willing to go into more details about what made it an "interesting technical challenge"?


The sheer scale of Wikipedia (5 million pages, half a billion links) made it difficult to make the searches fast. Simply downloading the Wikipedia database dumps and parsing them into my own database took over a day on my first successful attempt. The site returns most results in just a few seconds despite the giant graph size.


Why not use a graph database like Neo4j instead of SQLite? This seems like the perfect use case.

Is it because of the resources required to run one versus SQLite?


I actually had a friend suggest it to me and the Neo4j docs happen to be one of the many tabs I currently have open. I was already so far into using SQLite for this project and I wanted to ship it, so I decided to stick with what I had. I would be interested to see how Neo4j performs with such a big dataset (the resulting SQLite file is around 9 GB with nearly 6 million nodes and 500 million links). I was a bit worried that Neo4j wouldn't be able to scale to a graph of that size, but that is a completely untested and ignorant opinion. If you have any experience with Neo4j, I'd love to hear your thoughts.


I've only played around with Neo4j, but it looks like Neo4j 3.0 could handle your node and edge count (although not earlier versions).

Your app likely would have been easier to write with python/Neo4j than python/SQLite due to Neo4j's query language (incidentally, Neo4j also prefers to use bidirectional BFS for shortest path, and executing the search is a very simple query similar to an SQL query). Neo4j would possibly perform better as well (I'm not sure about that, given SQLite is pretty optimized for reads).

However, as the neighboring comment states, Neo4j is quite resource-intensive. It wouldn't work on any kind of micro instance, I'm pretty sure.


Awesome job! A great learning tool for someone coming from a non-CS background to actually dabble with BFS graph algorithms. After a quick search, I managed to ferret out a few other implementations; this one uses Neo4j: https://github.com/erabug/wikigraph


Maybe the Neo4j guys would be into helping as they are quite into marketing. This could be a neat showcase speaking to both business and tech types.


I've used neo4j before and it likes to consume a lot of memory.


This is pretty cool. It would be great if you wrote a blog post about your technical decision-making for this project. Thanks!


Damn, you beat me to it! I've been hacking on something similar for the longest time. Thanks for sharing your code!


Was either of your creations around circa 2012? I recall someone sharing this very concept in a room on turntable.fm, except it would list its discoveries in real time as an ordered list. I've been racking my memory for two years trying to find a link again, but here we are!


How big does the SQLite database get? How do you maintain such great performance? (Other than indexes?)


The resulting SQLite database file is currently 8.3 GB, most of which is taken up by the `links` table. The big performance wins are having a handful of indexes (see the .sql files[1] for the database's schema) and preprocessing a lot of data so I don't have to do duplicate work every time a query occurs. For example, instead of the `links` table going from `source_id` to `target_id` and having a ton of rows which share the same `source_id`, I go from `id` to `outgoing_links` (which is a |-separated string of all target page IDs). Computing each page's incoming and outgoing links is the really heavy work and I only do that at database creation time, using a beefy GCP machine with 8 vCPUs, 52 GB RAM, and a 256 GB SSD. It still takes about an hour, but it's a one-time cost and means I can run the actual service on a much smaller machine which won't cost me a fortune to maintain. Also, SQLite is just very fast and performant out of the box, so as usual, it's a matter of choosing the right tools for the job.

[1] https://github.com/jwngr/sdow/tree/master/database
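The row-collapsing step described there can be mimicked in miniature with SQLite's `group_concat` (table and column names mirror the comment, but this is a sketch, not the actual build script):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per link, as a raw dump-style table would have it.
conn.execute("CREATE TABLE raw_links (source_id INTEGER, target_id INTEGER)")
conn.executemany("INSERT INTO raw_links VALUES (?, ?)",
                 [(1, 2), (1, 3), (2, 3)])

# Collapse to one row per page, packing targets into a |-separated string.
conn.execute("CREATE TABLE links (id INTEGER PRIMARY KEY, outgoing_links TEXT)")
conn.execute(
    "INSERT INTO links "
    "SELECT source_id, GROUP_CONCAT(target_id, '|') "
    "FROM raw_links GROUP BY source_id"
)

packed = conn.execute(
    "SELECT outgoing_links FROM links WHERE id = 1").fetchone()[0]
assert sorted(packed.split("|")) == ["2", "3"]
```

One lookup now returns a page's whole neighbor list instead of one row per link, which is the trade that makes per-query work cheap.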


Really cool. I'm interested to know how the graph is built. Did you use any third-party components?


Thank you! The graph is built using vanilla d3, no library on top of it. The code for it lives all in one file, ResultsGraph.js [1]. I pieced together the code from a handful of other attempts online. I am still not 100% pleased with the performance of it with a larger number of nodes (250+), but that seems to be a common complaint with the d3 force simulation layouts.

[1] https://github.com/jwngr/sdow/blob/master/website/src/compon...


Thanks for your reply.


Looks pretty cool - do you have any plans to support subsites other than enwiki?


Possibly... follow this GitHub issue[1] if you want to be notified about it.

[1] https://github.com/jwngr/sdow/issues/11


Not sure if you deliberately designed it this way, but I noticed when spot-checking some results that it includes the bibliography section links as connections. This seems like it may not be desirable. For example, I did a search that went from the Crusades to Buzz Aldrin and noticed that Routledge was the first hop from the Crusades. It struck me as odd that Routledge (a publishing company) would be mentioned on the Wiki article for the Crusades. So I went to look and noticed the link it took was the citation for a book published by this company. I wouldn't really count that as a legitimate hop since it's a citation, not a content link.

EDIT - I noticed this also applies to the Notes section.


A few years ago I made something similar (http://ratewith.science/) that only uses bi-directional links, that is, pages that both link to each other, and this gives much more interesting results.

When two Wikipedia pages both link to each other they are usually related in some reasonable way, but unidirectional links give you things like Wikipedia -> California, which only exists because Wikipedia is headquartered in California, a pretty weak connection.

Despite running on an overburdened tiny VPS, my app is also really fast even though I only do a unidirectional BFS, because I use a custom in-memory binary format that's mmapped directly from a file that's only 700 MB, and a tight search loop written in D.
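For the curious, the mmapped-adjacency idea can be sketched in Python with `struct` and `mmap` (an invented file layout for illustration; the actual D format described above is surely different):

```python
import mmap
import os
import struct
import tempfile

def write_graph(path, adjacency):
    """Pack {node_id: [neighbor_ids]} into a flat binary file: a node
    count, then an (offset, length) table, then all neighbor lists
    concatenated as little-endian uint32s."""
    n = max(adjacency) + 1
    lists = [adjacency.get(i, []) for i in range(n)]
    offsets, pos = [], 0
    for lst in lists:
        offsets.append(pos)
        pos += len(lst)
    with open(path, "wb") as f:
        f.write(struct.pack("<I", n))
        for off, lst in zip(offsets, lists):
            f.write(struct.pack("<II", off, len(lst)))
        for lst in lists:
            f.write(struct.pack(f"<{len(lst)}I", *lst))

def read_neighbors(mm, node):
    """Read one adjacency list straight out of the mmapped bytes."""
    n = struct.unpack_from("<I", mm, 0)[0]
    off, cnt = struct.unpack_from("<II", mm, 4 + 8 * node)
    base = 4 + 8 * n  # neighbor data starts after the offset table
    return list(struct.unpack_from(f"<{cnt}I", mm, base + 4 * off))

# Round-trip demo on a throwaway file.
fd, path = tempfile.mkstemp()
os.close(fd)
write_graph(path, {0: [1, 2], 1: [2], 2: [1]})
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
assert read_neighbors(mm, 0) == [1, 2]
assert read_neighbors(mm, 2) == [1]
mm.close()
os.unlink(path)
```

The appeal is that the OS pages the file in lazily and the search loop does no parsing at all, just fixed-offset reads.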


Maybe I made a mistake somewhere, but I'm not sure it uses bi-directional links. For a really odd search like GoldenEye -> Abbotsford [1] it uses multiple "special" Wikipedia links that I'm pretty sure wouldn't be bidirectional.

[1]: http://ratewith.science/#start=Goldeneye&stop=abbotsford


Yah so what it does is try to find a path with bidirectional links, and then if it can't it tries to find a unidirectional path. You can tell which one it did by whether the numbers in the path have a pulsing glow animation, that particular path does not.


On a scale of The Monty Hall Problem to Manifest Destiny this is a 5!


Perhaps you could refine that by also allowing double backward hops, triple backward hops etc.


Pretty cool!


Yeah, unfortunately I don't know of any way to differentiate the different types of links. Wikipedia's pagelinks database doesn't distinguish them. I agree it's undesirable, but I just cannot figure out how to cull them.


I'm not sure how the backend is structured, but it seems that you must parse the individual pages at some point or another. I took a quick look at the Wikipedia HTML for a few pages and I would suggest stripping out anything within (or nested inside) of classes like "mw-cite-backlink", "reference-text", "citation book", "citation journal", etc. Also, you can probably strip out anything inside of a <cite> HTML tag.

I'm sure there are more classes and tags, but that hopefully should give you a solid place to start.

EDIT - You can also strip out or ignore anything inside of the ordered list for references - <ol class="references">...</ol>

Also, some pages aren't documented in the same way, so something like this page - https://en.wikipedia.org/wiki/X_Window_System - doesn't have any classes or easy way to parse it for the References section even though the Notes section was set up in a more organized way. However, you could take note that the <span> tag contains class="mw-headline" id="References" and the text value is also References and then ignore everything until the next <span> begins.
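For what it's worth, here's a minimal sketch of that filtering using only the standard library's html.parser. The class names are the ones suggested above, and `ContentLinkExtractor` is a hypothetical helper, not anything from the project:

```python
from html.parser import HTMLParser

# Class tokens to cull, per the suggestion above ("citation" covers
# "citation book", "citation journal", etc. as a class token).
SKIP_CLASSES = {"mw-cite-backlink", "reference-text", "references", "citation"}

class ContentLinkExtractor(HTMLParser):
    """Collect /wiki/ hrefs while skipping citation/reference subtrees.

    Assumes well-formed markup; real pages would also need handling
    for void tags like <br> that never close.
    """
    def __init__(self):
        super().__init__()
        self.links = []
        self.skip_depth = 0  # > 0 while inside an ignored subtree

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        if self.skip_depth or tag == "cite" or classes & SKIP_CLASSES:
            self.skip_depth += 1  # descend (deeper) into an ignored subtree
        elif tag == "a" and (attrs.get("href") or "").startswith("/wiki/"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1
```

Feeding it a snippet with links inside `<ol class="references">` or `<cite>` keeps only the content links.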


I never actually touch any of the source HTML. I think that would simply take way too long and would probably result in some very high bandwidth charges. I use three tables from a public dump of Wikipedia's database, which unfortunately don't differentiate between where the links occur on the page. Check out the first section of my README[1] for more information.

[1] https://github.com/jwngr/sdow#data-source


Ohhhhh, yeah that changes things. I didn't realize the pagelinks database you mentioned was not generated by you based on the master Wiki dump that they keep available for download. In that case, yes this seems unsolvable without 1) petitioning them to update the code on their side to adjust for this and exclude these erroneous links or 2) deciding to create your own pagelink database based on the master wiki download and updating this database periodically (I forget how often Wiki updates the master downloads). While 1) is clearly the preferable option, it is unlikely to occur, unless maybe they are willing to add some columns to provide more details about the links and what sections, tags, classes, etc. they are a part of. That might be more palatable for Wiki to provide as it is just a code change on their end to supply more information rather than something possibly affecting many users by excluding the links altogether.


If it was technically possible, it would probably also be worth culling anything linked within a template. Like disambiguation headers at the top of a page, or semi-related lists grouped in blocks at the bottom (things like "Cities in Australia").


That might be a problem of Wikipedia itself. This project can hardly fix that. I mean, that link has hardly any business of being there; "Oxford University Press" isn't linked either, though the page exists, and why would it?


I think you'd have to parse the raw wikitext dump, keeping track of which section each link belongs in. Since the raw dumps are something like 50 GB of text, this sort of thing takes a while.
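A crude sketch of what that section-aware pass could look like on the wikitext itself (regex-based and hypothetical, not sdow's code; a real run would sit behind a streaming parser over the multi-gigabyte dump):

```python
import re

# Headings whose sections should be ignored when collecting links.
STOP_SECTIONS = {"references", "notes", "bibliography",
                 "further reading", "external links"}

def content_links(wikitext):
    """Yield [[wikilink]] targets, skipping any that appear under a
    References/Notes/etc. heading. A crude filter: collection resumes
    at the next heading that isn't in STOP_SECTIONS."""
    skipping = False
    for line in wikitext.splitlines():
        heading = re.match(r"^(={2,})\s*(.*?)\s*\1\s*$", line)
        if heading:
            skipping = heading.group(2).lower() in STOP_SECTIONS
            continue
        if skipping:
            continue
        for target in re.findall(r"\[\[([^\]|#]+)", line):
            yield target.strip()
```

So `[[Routledge]]` sitting under a `== References ==` heading would be dropped, while links in the body sections survive.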


Anime --> Obesity

https://www.sixdegreesofwikipedia.com/?source=Anime&target=O...

Somehow, I didn't expect a one-stop layover in Dubai.


I went and made it political: Anime -> Alt-right.

The result was a little more predictable... I guess I shouldn't have had to look it up.

https://www.sixdegreesofwikipedia.com/?source=Anime&target=A...


But going Alt-right -> Anime has a second path through Vaporwave.


Now try Donald Trump -> Nazism


I found Alt-Right -> Ganguro to have interesting connections.


That's hilarious. We should make this a contest, the 2 length paths with the most unlikely layover


I was getting interesting results until I noticed a pattern involving the presence of "Wayback machine" as the only connecting dot between really different things.

That adds noise, since articles now automatically use the wayback machine for "archived" links, thus generating many paths that do not really connect topics, just because the text "wayback machine" is part of the link text.

It may be an interesting exercise to find outliers like that and compute paths without those nodes.


I considered this and may eventually add an option to ignore those kinds of pages, but I ultimately felt like the current mode remains more true to my goal for the project which is to traverse the links as any human would be able to. By the way, the two pages with the most incoming links are "Geographic coordinate system" (1,047,096 incoming links) and "International Standard Book Number" (955,957 incoming links).
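If you had the links as a plain edge table, a ranking like that is a one-line aggregation. A toy illustration in SQLite (this is not sdow's actual schema, which stores links in a more compact form):

```python
import sqlite3

# Toy edge table: one (source_id, target_id) row per link.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE links (source_id INTEGER, target_id INTEGER);
    INSERT INTO links VALUES (1, 2), (3, 2), (4, 2), (1, 3), (2, 3);
""")

# Pages ranked by incoming-link count, most-linked first.
rows = conn.execute("""
    SELECT target_id, COUNT(*) AS incoming
    FROM links
    GROUP BY target_id
    ORDER BY incoming DESC
""").fetchall()
```

Here `rows[0]` comes out as `(2, 3)`: page 2 has the most incoming links, which is exactly the shape of the "Geographic coordinate system" statistic above.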


Indeed, the Wikipedia pages "Geographic coordinate system" and "International Standard Book Number" have the highest PageRank. See: https://www.nayuki.io/page/computing-wikipedias-internal-pag...


After spending way too much time I got 9 degrees of separation. I did piggyback off of someone else's work with "Phinney". https://www.sixdegreesofwikipedia.com/?source=Lion%20Express...

I found a ton of 5 degree paths and only a couple of 6 degree ones. Then I pulled out the "big guns" (the dead-end pages category). https://en.wikipedia.org/wiki/Category:Dead-end_pages_from_F...


That's impressive.

Wikipedia is highly connected. Even pretty darn different things often seem to have only three degrees of separation. For example, here's Ramesses II to Ankylosaurus in three:

https://www.sixdegreesofwikipedia.com/?source=Ramesses%20II&...

After trying a bunch of things, I finally found a fourth-degree separation: William the Conqueror to Ankylosaurus.

https://www.sixdegreesofwikipedia.com/?source=William%20the%...


I don't know if this is legit. The Lion Express node only connects to a couple of Wikipedia's generic help pages, which I'm pretty sure don't link back to Lion Express.

https://en.wikipedia.org/wiki/Help:Link https://en.wikipedia.org/wiki/Help:Searching


Here's one with 7 hops and not using The Lion Express. https://www.sixdegreesofwikipedia.com/?source=Zevenhoven&tar...

Like I said I spent too much time on this yesterday.


Damn, I had trouble finding more than 3 degrees


yeah I just cheated by using orphaned articles and managed to 'win'


This makes the "How many clicks to Hitler" game much faster.

For those uninitiated, the game was to click the "Random Article" link in the sidebar and count how many links it took to get to Hitler. It is really interesting to see just how big of an event WWII was. Every country article has a section on their involvement or why they were not involved.

After playing with it more, this is pretty fun. I vote that a "degrees from Hitler" score be added to the top of every article. I think it might be an interesting proxy for how esoteric a particular page is.


This reminds me of the wikipedia rule I learned a while back: If you click the first link in an article (besides the pronunciation guide), you will always end up on philosophy.


According to the wikipedia page about this rule/game, over 97% of pages have this property. Interestingly, "Mathematics" seems to be a rare exception. You get stuck in a loop:

Mathematics -> Quantity -> Multitude -> Counting -> Elements -> Mathematics
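The game is easy to simulate once you've precomputed each page's first link (a toy mapping here; the real thing would resolve first links from live pages):

```python
def first_link_trail(first_link, start, limit=100):
    """Follow each page's first link until a page repeats or `limit` is hit."""
    trail, seen = [start], {start}
    while start in first_link and len(trail) < limit:
        start = first_link[start]
        trail.append(start)
        if start in seen:
            break  # closed a loop (e.g. landed back on Philosophy)
        seen.add(start)
    return trail
```

Running it on the Mathematics loop above reproduces the cycle, ending the moment "Mathematics" comes around again.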


There's even a great page with a small graph and a rundown of some more resources: https://en.wikipedia.org/wiki/Wikipedia:Getting_to_Philosoph...


It even works for that page!


Finding ones that don't go through "science" is interesting.


Woah. This works. Neat.


Indeed. First thing I tried as well.

Found 1 path with 2 degrees of separation from Bitcoin to Adolf Hitler in 1.33 seconds!

Bitcoin -> Austria -> Adolf Hitler


Game is much more challenging if you ban pages that also have big categories.

(Look of Category: X exists and add some cutoff based on number of links there.)

Obviously also lists.


There are several easy to reach pages without categories.

Almost anyone alive during WW2 will have a WW2 link. Almost every country is 1 degree from Hitler. Most companies/publications around then will link to Hitler.


If you skip lists, it gets quite a bit harder. Countries and places would need special handling, yes; they create way too many links.

Probably trimming the top 5% of most-linked pages would make for a fun game.


My colleague saw me looking at this site, and I went through exactly the same thought process as you! His feeble attempts at non-Hitler linked articles failed miserably.


Hilarious, my first search was "OFDM -> Hitler" for some completely unknown reason. (2 steps for the curious)


Age of Enlightenment -> Consumption of Tide Pods[0]

[0] https://www.sixdegreesofwikipedia.com/?source=Age%20of%20Enl...


Interestingly, if you go the other way it blows up:

[0] https://www.sixdegreesofwikipedia.com/?source=Consumption%20...


I'd assume very few articles lead in to "Consumption of Tide Pods", while many do to "Age of Enlightenment".


Kind of reminds me of PageRank in someway; the number of "backedges" is not guaranteed to equal number of "forwardedges", hence the difference in number of paths.


Mensa International -> Consumption of Tide Pods:

https://www.sixdegreesofwikipedia.com/?source=Mensa%20Intern...


Meg Whitman -> Consumption of Tide Pods

https://www.sixdegreesofwikipedia.com/?source=Meg%20Whitman&...

ಠ_ಠ


It shows 8 3-click paths for me. https://imgur.com/a/uo9oi


Actually, interestingly, I've been introducing this concept as a "party game" with other nerds at RL gatherings for some years now. The goal is to start on a random page and find the shortest path to another random page by only clicking links in the articles. It can be quite a lot of fun, despite what you're thinking! And anybody can understand the challenge and compete and have fun. It's not just something for geeks.


I had run into it with specifically how many links to get to the page for Hitler.


What is an RL gathering?


RL == "Real Life"


likely stands for "real life"


Real Life



One of your pages has nothing at all linking to it, so finding a route to that link is impossible.

https://en.wikipedia.org/wiki/Special:WhatLinksHere/Sputnik-...

However, you can go the other way:

  Sputnik-1_EMC/EMI_lab_model
  
  Sputnik 1

  Soviet space program

  United Nations Committee on the Peaceful Uses of Outer Space

  United Nations

  Model United Nations

  Dwight Schrute

  Spud gun

Also I'm realizing after finding that that I could have used the site in this post...


Spud gun links to "Sputnik 1", so if anything it's just weird that "Sputnik-1 EMC/EMI lab model" doesn't appear to link to "Sputnik 1"

Well, rather, it DOES. Maybe the unencoded '/' in the URL (https://en.wikipedia.org/wiki/Sputnik-1_EMC/EMI_lab_model) breaks Six Degrees of Wikipedia?


I think that's because nothing on wikipedia links to that page.

I managed to go from spud gun -> great depression -> dissolution of the soviet union -> soviet union -> sputnik, but I couldn't get the last step!


I think it would be interesting to have a "swap" button between the start and end points, perhaps just under where it says "to". (similar to the swap buttons in translator apps and GPS apps to quickly swap the start and end points) As some other comments have mentioned, the paths are not necessarily the same route or length, and it is fun to see how they might be different.


Good suggestion! I intended to have that exact button, but couldn't find a way to put it in the UI without making things more confusing. I expect I'll add it in the future. Thanks for the suggestion!


Parquetry -> Romeo and Juliet, 46 paths (https://www.sixdegreesofwikipedia.com/?source=Parquetry&targ...). None through Shakespeare.

Parquetry -> Tromeo and Juliet, 508 paths (https://www.sixdegreesofwikipedia.com/?source=Parquetry&targ...). All through Shakespeare.


508 paths, nice. I was about to post Wake in Fright -> Las Vegas, which has an amazing 414 paths, but yours is a winner. https://www.sixdegreesofwikipedia.com/?source=Wake%20in%20Fr...


https://www.sixdegreesofwikipedia.com/?source=Frank%E2%80%93...

This isn't really the best measure though, because it only counts # of paths at the minimum depth level.

edit: although I found some deep searches with very few links: https://www.sixdegreesofwikipedia.com/?source=Frank%E2%80%93...


I found one with 813 paths with 4 degrees, but yours has 1,645 paths with 5 degrees. I suppose there are end points with many thousands of links. It's a bit of trivia, not really a measure of anything.


It's a bit surprising, since you'd think that with 1,645 paths with 5 degrees, there'd be at least one path with 4 degrees, but it doesn't work that way.


In case you receive the following message

"Sorry internet hipster, this little side project requires JavaScript."

here is a quick example of how to get the pages the "traditional way"[FN1]:

     #!/bin/sh
     test $# -eq 2||exec echo usage: $0 source target;

     exec curl -H"Content-type: application/json" \
     -d '{"source":"'$1'","target":"'$2'"}' \
     https://api.sixdegreesofwikipedia.com/paths \
     |exec sed '
     s/\",/\"\
     /g;s/,\"/\
     \"/g;s/:{/:\
     {/g;s/}/&\
     /g;s/\"pages\":/&\
     /'
It appears the author is using the Wikipedia API. I did not add any HTML tags, etc. to the output, although this is very easy to do.

FN1. The original "web browsers" needed no GUI, no Javascript.


Now it makes more sense.

I'd prefer to have a brief technical explanation why Javascript is needed instead of this condescending labeling.

Edit: NoScript is not a luxury for technically minded geeks anymore, it's a necessary protection against tracking, CPU-consuming advertisement and attacks like Spectre/Meltdown and who knows what else Intel has for us.


Example:

    sdow true false
Output:

     {"isSourceRedirected":false
     "isTargetRedirected":false
     "pages":
     
     {"161711":
     {"description":"Value indicating the relation of a proposition to truth"
     "title":"Truth value"
     "url":"https://en.wikipedia.org/wiki/Truth_value"}
     
     "228748":
     {"description":"Wikimedia disambiguation page"
     "title":"True"
     "url":"https://en.wikipedia.org/wiki/True"}
     
     "40805040":
     {"description":"Wikimedia disambiguation page"
     "title":"False"
     "url":"https://en.wikipedia.org/wiki/False"}
     }
     
     "paths":[[228748,161711,40805040]]
     "sourcePageTitle":"True"
     "targetPageTitle":"False"}
Edit sed commands to make first line indented:

     |exec sed '
     s/{/\
     &/;
     s/\",/\"\
     /g;s/,\"/\
     \"/g;s/:{/:\
     {/g;s/}/&\
     /g;s/\"pages\":/&\
     /'



This is awesome! I looked up "New Zealand" and "Yellow". There was only one degree of separation, via Ochre: https://en.wikipedia.org/wiki/Ochre#In_Australia_and_New_Zea...

I learned that the Maori people of New Zealand used ochre mixed with fish oil to paint wakas (war canoes), and also as an insect repellent.


Incidentally, it's a good way to find incorrect ambiguous wikilinks. [[Comparison of web browsers]] shouldn't link directly to [[Gnome]], which is the article about small mythological humanoids, not about the desktop environment.


What determines a "path"?

"Philosophy" is connected to "Ethiopia" through "Sexism", but I couldn't find that phrase on either page.

https://www.sixdegreesofwikipedia.com/?source=Philosophy&tar...


From the project description [1]:

Wikipedia dumps raw database tables in a gzipped SQL format for the English language Wikipedia (enwiki) approximately once a month (e.g. dump from February 1, 2018)...By default, the script grabs the latest dump (available at https://dumps.wikimedia.your.org/enwiki/latest/), but you can also call the database creation script with a download date in the format YYYYMMDD as the first argument.

I'd guess the tool is working off an out of date article.

1) https://github.com/jwngr/sdow


Same, New Zealand to Faroe Islands through Italy.

Maybe both New Zealand and the Faroes reference Italy?


At the bottom of the page for each of the countries "Articles related to Italy" is a huge expandable section, with even more expandable sections inside. You can actually find both New Zealand and Faroe Islands in there.

If you view-source the Italy page, you can find a link to both of those countries.


Sexism links to both philosophy and Ethiopia. It is treating links as an undirected graph (despite the visual indication showing a directional vector), which seems entirely fair.


It seems somewhat unfair. If it's not a directed graph, it means you can't actually navigate from the source to the target page by following these links...

edit: in fact, the author describes the goal of his project in another comment here: "my goal for the project which is to traverse the links as any human would be able to".


Parent was incorrect - the graphs are directed.


This is incorrect; links are considered a directed graph, as can be seen by reversing the start and end points of most paths.


Every single example I can find makes it painfully clear it is an undirected graph. Sets may differ going from one direction to the other simply because it is only trying to find a sample set of correlations, not an exhaustive set.


The graph is most definitely directed. One small example is Facebook -> Narcissism (1 path of 1 degree)[1] compared to Narcissism -> Facebook (8 paths of 2 degrees)[2].

[1] https://www.sixdegreesofwikipedia.com/?source=Facebook&targe... [2] https://www.sixdegreesofwikipedia.com/?source=Narcissism&tar...


Philosophy links to neither sexism or "The Demographics of Africa" (and while some older version might link to sexism, it seems unlikely that it ever linked to the Demographics of Africa"). Both of those do link to Philosophy, however.

Alternately it is directed, except when it doesn't find an easy route and uses an undirected result.


Click on "African People" in https://en.wikipedia.org/wiki/Philosophy#African_philosophy, you'll arrive at Demographics of Africa.

Likewise "Gender Bias" redirects to Sexism.


About 12 years ago, some colleagues (Yehuda Koren and Chris Volinsky) and I proposed a technique for measuring proximity in quasi-random (social) networks that can handle out-of-memory databases and more than two query nodes, and that uses dynamic programming to find a visualizable subgraph representing as much of the relationship as possible. It is described here (includes some figures): http://web2.research.att.com/export/sites/att_labs/groups/in... The proposed heuristic approximates the "cycle-free effective [electrical] conductance" between the query nodes. In this paper, we were able to use anonymized phone calls to confirm six degrees of separation.

There used to be a live demo on an AT&T Labs website but it is not available now. There are published algorithms for all the phases of the proposed heuristic, but my recollection is that Yehuda found an efficient, robust implementation of k-disjoint-shortest-paths was not easy.

This is an interesting problem, thank you for making your work available. (I do agree the HTML form placeholders that change rapidly but are ignored when you press the GO button are a little confusing; it took me a minute or two to figure out what was going on.)


Thanks a lot for sharing! BTW, I just fixed the confusing interaction with the text input placeholders[1].

[1] https://github.com/jwngr/sdow/commit/6e42e06488a592784e5d3d2...


Could you make it possible to just test with the fun suggestions that scroll through?


Yeah, that really confused me. Distance from "M.C. Escher" to "Lucky Charms" sounds interesting only to get:

    You'll probably want to choose the start and end pages before you hit that.
Took me a while to realize it didn't "lock in" the suggestions and I had to manually enter it.


Which is a shame, because that specific example ends up having some interestingly diverse paths between the two: https://www.sixdegreesofwikipedia.com/?source=M.%20C.%20Esch...


Just implemented this with a quick hot fix[1]. Hopefully I didn't break anything else in my hurry to get it out.

[1] https://github.com/jwngr/sdow/commit/6e42e06488a592784e5d3d2...


Thanks! Cool project :)


I clearly should've done some more user testing ;) I think that's a great suggestion and would be much better than showing an error message. I'll update that once I have some time when the server isn't crashing!


Took some tinkering to hit five degrees, from 'Hemisphere Dancer' (Jimmy Buffet's old seaplane) to the VC (Vapnik–Chervonenkis) dimension. Rewarding for the site to say 'wipes brow I really had to work for this one.'

[0] https://www.sixdegreesofwikipedia.com/?source=Hemisphere%20D...


Same with "Yilan Creole Japanese" to "Gustav Eberlein" with a nice 537 paths.


I would love to see a list compiled somewhere of two articles with exactly 6 degrees of separation. This is proving to be extremely difficult. In the entire HN thread so far, I only see one so far by dkuder https://www.sixdegreesofwikipedia.com/?source=Six%20Degrees%...


I'm storing all the search results and will do some analysis on the data. Maybe I'll even get around to adding a new page with some of the interesting stuff I find.


I finally found my first 6-degree path: https://www.sixdegreesofwikipedia.com/?source=Ellipsis%20%28...

Purell_hack has managed a whopping 9°, based off dkuder's Phinney - https://news.ycombinator.com/item?id=16469620


This thing is pretty dang good, "Golden Ratio" to "Pope Pius X", what could possibly go wrong? Nailed it with "1 path with 2 degrees of separation". "Infinite Series" to "Declaration of Independence", nailed it in "4 paths with 3 degrees of separation". This is a lot of fun


This is brilliant. For a long time I have had a 'motorway game' I play with my brother-in-law: we identify two cars on the road and then have to come up with the connection. You often see badge-engineered vans where one partner, e.g. Nissan rather than Renault, made the van as seen, so you can match it to a vehicle from the other partner, since they come from the same company. Then for the next vehicle, e.g. some Mercedes car, how do you get the link, and prove it at motorway speed?

Clearly this app makes it effortless, you could find some supplier like Recaro actually make the seats in both, even the van. Or you could find some joint venture in Brazil that both companies share, who knows... So I look forward to using this app on the M4 some time soon!


Does anyone know what the "diameter" of the wikipedia graph is? In other words, what is the longest shortest path between two wikipedia articles?


Well, that question only makes sense for connected graphs, and we don't really know whether this one is connected. So a version of the question that makes sense in general is: among all the connected components of the Wikipedia graph, what is the largest diameter?


This is an interesting question that I'd like to answer now that I have all the data. I am curious to see how long it will take to find a solution as I believe even the most efficient algorithms for this have a high runtime complexity.

And yes, the graph is not connected (there are both nodes with no outgoing links and with no incoming links), but over 99% of the pages are connected, so the answer would still be interesting and worthwhile.


Another interesting question would be what the largest graphs are that are disconnected from the "main" one. Are there any larger than 1 or 2 nodes? Is there some set of pages on some obscure topic that only link within themselves?


Floyd-Warshall solves all-pairs shortest paths in O(V^3) time. Running a BFS rooted at every node finds all shortest paths in an unweighted graph in O(V*(V + E)).


Hence, since there's a whole lot of Vs in the wikipedia graph, he's probably going to be satisfied with an approximate solution, unless he has a lot of CPU hours to spare.


There's 5,579,252 articles in English Wikipedia right now, which means the largest component has 5 million and change or so vertices. Each individual BFS should take a few CPU-seconds at most, and you can trivially parallelize the BFS for each node across as many CPUs as you have.
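A sketch of that BFS-per-node approach on a toy adjacency map (assumes a connected graph; on Wikipedia's scale you'd parallelize the outer loop as described above):

```python
from collections import deque

def bfs_eccentricity(adj, start):
    """Longest shortest-path distance reachable from `start` (unweighted BFS)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adj.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return max(dist.values())

def diameter(adj):
    """Max eccentricity over all nodes: one BFS per node, O(V * (V + E)) total."""
    return max(bfs_eccentricity(adj, v) for v in adj)
```

On a path graph 0 - 1 - 2 - 3 this reports a diameter of 3, the distance between the two endpoints.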


Yes, check this out http://mu.netsoc.ie/wiki/

> Several people were asking about what's known as the "diameter" of Wikipedia, that is, the distance between the two articles furthest apart (the longest shortest path if that makes any sense to you). This was in fact the original goal of the project but it turned out not to be very interesting. Wikipedia has giant "tails", almost linear linked lists of articles that stretch out for 70 links. The worst offenders were the subpages of List of named asteroids as each is only linked from the previous one, and it takes about 70 links to get from anywhere to the last one. So, you find that the two articles furthest from each other are List of asteroids/145701-145800, linked to by List of asteroids/145601-145700, linked to by List of asteroids/145501-145600, and so on for 70 links until you get back to some vaguely normal article. This is far less interesting than I was hoping. Even when I special-cased out that string of 70 boring articles, a new one appeared (I think it was linked pages about administrations of Myanmar or something). Rather than special-casing out reams of articles, I decided to pick a different metric, one that better reflects the structure of wikipedia without arbitrarily removing parts of it.

It appears that the particular tail has been "cut" however.


"You'll probably want to choose the start and end pages before you hit that."

Why? There were two random examples filled in the boxes. I have to retype them manually?

Really interesting and fun other than that! I'm trying something with 7 degrees, but so far most I could get are 4


Would be cool to see the paths ranked in order of “narrowness”, e.g. ranking a path with shorter, more specific articles more highly than a path that goes through a mega-page like “France”.


This reminds me of that thing back around 2010 where if you clicked the first* link on any wikipedia page you'd eventually get to the article on philosophy or something. I forget which it was exactly, and I have no idea if it still works, but it was pretty fun back then trying to find a starting article that couldn't find the "philosophy" article.

*Non-italicised because of the disambiguation suggestions


It still works. Technically it can be any of the following 25 articles because once you hit Philosophy, it loops through:

  Philosophy > Knowledge > Fact > Education > Learning > Evidence > Logic 
  > Ancient Greek > Greek language > Modern Greek > Colloquialism 
  > Vernacular > Human > Neontology > Biology > Natural science > Science 
  > Latin > Classical language > Language > Communication > Subject (philosophy) 
  > Subjective consciousness > Consciousness > Quality (philosophy) > Philosophy


It's from the alt text of xkcd: https://xkcd.com/903/



There's only one path from Mersenne Prime to Bolivia, if anybody was wondering. Mersenne Prime -> French People -> Bolivia.


Only one path of three. There are probably hundreds or thousands in four steps.


This is fabulous! Got from cheese to postmodernism in two hops [1]. Really excellent work, and thanks for making it open source!

[1] https://www.sixdegreesofwikipedia.com/?source=Cheese&target=...


I got 6 with "dig" to "dug" (though Dig ended up redirecting to "Double J Radio"):

https://www.sixdegreesofwikipedia.com/?source=Double%20J%20%...


Another similar project from years ago came to the conclusion that, disregarding "list" Wikipedia entries, the centre was the article on the United Kingdom. Did you reach a similar conclusion?

http://mu.netsoc.ie/wiki/


I like the little trivia that show up, although sometimes they disappear too quickly.

My favorite mentioned which Wikipedia article had the longest article name:

Suzukake no Ki no Michi de "Kimi no Hohoemi o Yume ni Miru" to Itte Shimattara Bokutachi no Kankei wa Dō Kawatte Shimau no ka, Bokunari ni Nannichi ka Kangaeta Ue de no Yaya Kihazukashii Ketsuron no Yō na Mono (https://en.wikipedia.org/wiki/Suzukake_no_Ki_no_Michi_de_%22...)


The full fact list is on GitHub[1]. The fact list was a lot more interesting and important when searches took longer to run. One of the bad things about improving the performance so much was the fact that the facts don't have as long to display :P

[1] https://github.com/jwngr/sdow/blob/master/website/src/resour...


I had some fun doing something similar with Neo4j a few years ago: https://github.com/mirkonasato/graphipedia

(The code may need some tweaks to work with the latest Neo4j version.)


It would be a bit more interesting if it could distinguish between the "See Also" and other tables at the bottom of the pages, and the main content. E.g. I connected Donald Knuth to a Geneticist simply because they both won awards (in different categories).


Unfortunately I'm not aware of a way to distinguish between the two. Wikipedia stores both types of links in the same database. I would love to cull out all the links in category boxes and sources. If anyone has any ideas, let me know!


Very cool. And a lot less degrees of freedom between my tests than I expected - so far everything has a legitimate 3 degrees when I expected much more. Higgs Boson to Taylor Swift in 3 steps, amazing. Not only that, but 91 ways to get there in just 3 steps.

This basically quantifies what my wife and I jokingly refer to "rabbit-holing" online. She'll ask me what I'm reading and it will be something totally unrelated to what I said I was coming to look up. And she's always like "how did that happen?" And I never have a good answer. But now I do (if it involves Wikipedia)!


You need to go more obscure -- Fiber Bundle to Lars Ulrich was 4 steps. Calabi-Yau Manifold to Donnie Wahlberg is also 4.

I haven't gotten 5 yet.


I found a 5 by picking two Random article pages:

Xyloplax turnerae to Pagolo Arsago -- https://www.sixdegreesofwikipedia.com/?source=Xyloplax%20tur...

I love all the little touches, like the fun facts (2.51% of Wikipedia articles don't have incoming links) and UI touches ("wipes brow, I really had to work hard for this one").


I noticed that pages about countries are good hubs for this kind of connection. A page cites someone, their page links to their country of origin, and country pages are really extensive and link out to a huge variety of topics.

Edit: I finally managed to find a 5 degree route! https://www.sixdegreesofwikipedia.com/?source=Taquaral&targe...


I got 9 steps with "Here (company)" to "There (wikipedia disambiguation page)".


I got a 9 step one without a disambiguation page. https://www.sixdegreesofwikipedia.com/?source=Lion%20Express... Took a long time though.


Disambiguation pages are not supposed to be linked from within Wikipedia, so I'm not sure this one really counts.



You should track the "longest shortest paths" of things and display them - has anyone managed to hit 10 degrees of separation, for example? If someone finds one, let us know!


Those graph node colors are not very color blind friendly. They are easy to differentiate in the list view, but in the graph with black borders, they are just too low contrast.


I'll look into it. Interestingly, I'm using[1] one of the default d3 color scales, which I assumed would be color blind friendly out of the box.

[1] https://github.com/jwngr/sdow/blob/a2699dc95d884ec64a4641630...


I made something similar using Wikipedia data, specifically for Politicians http://www.poligraph.io


Did I find a bug?

https://www.sixdegreesofwikipedia.com/?source=Adolf%20Hitler...

It's showing "Bill Gates" and "Mark Zuckerberg" as the hops, but on the start page I don't see links to those.

(Apologies for the subject matter. It was the first thing I thought of, because of a Wikipedia-path-finding game I had heard of before.)


It uses a pre-downloaded database of links [1], so presumably certain pages that were once linked have been updated.

[1] https://news.ycombinator.com/item?id=16469427
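(In other words, the path search runs against a frozen snapshot, so it can return edges that no longer exist on the live page. A minimal sketch of the idea - the schema here is hypothetical, not the real sdow one - BFS over an SQLite links table:)

```python
import sqlite3
from collections import deque

# Hypothetical schema: edges are frozen at download time, which is why
# results can lag behind live Wikipedia.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE links (source TEXT, target TEXT)")
con.executemany("INSERT INTO links VALUES (?, ?)", [
    ("Adolf Hitler", "Time Person of the Year"),
    ("Time Person of the Year", "Bill Gates"),
    ("Bill Gates", "Mark Zuckerberg"),
])

def neighbors(page):
    rows = con.execute("SELECT target FROM links WHERE source = ?", (page,))
    return [r[0] for r in rows]

def bfs_path(start, goal):
    """Return the shortest path of page titles, or None if unreachable."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if page == goal:
            path = []
            while page is not None:
                path.append(page)
                page = parents[page]
            return path[::-1]
        for nxt in neighbors(page):
            if nxt not in parents:
                parents[nxt] = page
                queue.append(nxt)
    return None

print(bfs_path("Adolf Hitler", "Mark Zuckerberg"))
```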


This is incorrect; that's not the reason they're linked. It's using the category box at the bottom of the page: both Hitler and Gates were named 'Time Person of the Year' at some point.


Thanks! Makes me wonder why those links were in there in the first place...


That's an excellent question - I'd speculate that they were perhaps mentioned in the passage beginning,

  ...Further, Haffner claims that other than Alexander the Great, Hitler had a more significant impact than any other comparable historical figure...

However, I wasn't able to find out for sure with a quick browse through the page history.


Really cool! Possible bug: I tried Martin Luther King -> Elon Musk and it showed me a path MLK -> Mark Zuckerberg -> Musk, but I couldn't find a Zuckerberg link on the MLK page. https://www.sixdegreesofwikipedia.com/?source=Martin%20Luthe...


It uses a pre-downloaded database of links [1], so presumably certain pages that were once linked have been updated. [1] https://news.ycombinator.com/item?id=16469427


I get a similar case with Donald Trump - Banjo-Kazooie https://www.sixdegreesofwikipedia.com/?source=Donald%20Trump...


It looks like it uses links in the 'category' box at the bottom of the page. Both MLK and Zuckerberg appear in the 'Time Person of the Year' category.


Please make it Ctrl+Scroll to zoom (a UI standard obeyed in umpteen applications!), and leave Scroll alone for scrolling through the page.

Nobody wants to have to click outside the graph view to change the state of the UI so that Scroll scrolls the page.

(Of course, Ctrl+Scroll is already taken for browser zoom; but hijacking browser zoom in this situation is more acceptable than hijacking vertical scroll).


I'm confused and can't find the final link in this chain.

https://www.sixdegreesofwikipedia.com/?source=Late%20capital...

I'm guessing it's part of an older copy of the page? Is there an easy way to search revisions?


So the reason is that "Theodore Roosevelt" links to "Freddy Fazbear" (Ctrl+F for "Articles related to Theodore Roosevelt", then click it, then click "Teddy Bears"), which redirects to "Five Nights at Freddy's". So, technically, there is a direct link there, albeit through the categories dropdowns. Unfortunately, Wikipedia doesn't provide anything in their database dumps identifying whether a link appears in the article body versus the categories dropdowns or sources, so I include them all.
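(Redirects add a wrinkle here: Wikipedia dumps ship them separately, and they have to be collapsed to canonical titles before building the link graph. A toy sketch - the redirect map and page titles are just illustrative:)

```python
# Hypothetical redirect map; real dumps provide these as a separate table.
redirects = {"Freddy Fazbear": "Five Nights at Freddy's"}

def resolve(page):
    """Follow redirect chains to the canonical title, guarding against cycles."""
    seen = set()
    while page in redirects and page not in seen:
        seen.add(page)
        page = redirects[page]
    return page

raw_links = [("Teddy bear", "Freddy Fazbear")]
canonical = [(resolve(s), resolve(t)) for s, t in raw_links]
print(canonical)  # [('Teddy bear', "Five Nights at Freddy's")]
```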

