Somehow torvalds/linux is in Fronterra, next to JS projects, awesome-X lists, and frontend checklists.
Either kernel hackers unexpectedly love frontend, or more likely the people who write the code don't overlap much with the people who star GitHub projects!
I wonder if code embeddings might have been a better way to organize the projects, although probably infeasible given the amount of resources required to download and compute embeddings for each file.
People have been critiquing the collaborative filtering aspect of this work vs content analysis ("[why use stars instead of code similarity]") but there's something elegant about the simplicity of using fewer priors here.
A tf*idf weighting could be applied to the star-feature matrix too. Document = GitHub repo. Term = name of a user who starred it.
Thus, users who overstar simply count for less when computing similarities.
This would mitigate the phenomenon of massively popular GitHub repos being clustered together because of folks who blithely star the most well-known stuff.
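For what it's worth, here is a minimal Python sketch of that tf*idf idea. The toy star data and the function names (`idf_weights`, `weighted_cosine`) are made up for illustration, and cosine similarity is just one reasonable choice here; this is not how the map itself was built.

```python
import math
from collections import defaultdict

def idf_weights(stars):
    """stars maps repo -> set of users who starred it (the 'terms')."""
    n_repos = len(stars)
    repos_per_user = defaultdict(int)
    for users in stars.values():
        for user in users:
            repos_per_user[user] += 1
    # Users who star many repos get a low weight; a user who starred
    # every repo gets log(1) = 0 and stops contributing at all.
    return {u: math.log(n_repos / c) for u, c in repos_per_user.items()}

def weighted_cosine(a, b, idf):
    """Cosine similarity of two repos whose binary star vectors are scaled by IDF."""
    dot = sum(idf[u] ** 2 for u in a & b)
    norm_a = math.sqrt(sum(idf[u] ** 2 for u in a))
    norm_b = math.sqrt(sum(idf[u] ** 2 for u in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy data (hypothetical): "omni" stars everything, the others are selective.
stars = {
    "linux": {"omni", "kernel_dev"},
    "react": {"omni", "web_dev"},
    "vue": {"omni", "web_dev"},
}
idf = idf_weights(stars)
linux_vs_react = weighted_cosine(stars["linux"], stars["react"], idf)
react_vs_vue = weighted_cosine(stars["react"], stars["vue"], idf)
```

In this toy case the only thing linux and react share is the user who stars everything, so their weighted similarity drops to zero, while react and vue stay similar through the selective `web_dev` user.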
Surprised at how small Rustland is. Barely a province in Clouderra.
Also, interesting how both Bevy and Veloren are in Rustland. Probably, the stars come more from the Rust community than the game dev community. Which I guess makes sense: the Rust ecosystem is still relatively small and feels like a lot of people doing X but in Rust.
As a fan of Julia, I'm surprised to see that julialang/julia has so few links. It's a niche language; how isolated it is on this map is maybe not so unrepresentative of the user or developer experience.
There's a JuliaLand to the west of the island where julialang/julia is.
The fact that julialang/julia ended up near tensorflow and opencv, and actual Julia packages ended up elsewhere, probably reflects a difference between aspirational users and real users: a lot of people who starred the Julia project itself were numeric Python users who were looking for a new Python, but then mostly stuck to Python itself, so their other stars are in the numeric Python land. Those who starred the JuliaLand packages are the actual Julia users who aptly enough ended up near Moleculandia and AstroSpace and Quantumia.
Very neat and creative approach, but I'm honestly conflicted about whether the country/map metaphor is the best choice. In many cases the names are not that clear, so one has to zoom in to understand what they represent. It would perhaps be more interesting to do hierarchical clustering and show something like average connectivity between the (super)clusters with lines, possibly with more descriptive/faithful LLM-generated labels for each cluster.
I was pleasantly surprised that it wasn’t a heavy line drawing creation. As someone who first did those in the 90’s and almost immediately learned their limits, I think this is nice because it doesn’t overclaim. It’s just a view, not a thesis.
I like diagrams where the axes mean something. Lines, shape, boxes/groups, distance, X vs Y, colour, thickness, texture, background, foreground. I also like simple. So often it’s lines to be fancy with no meaning. This one is just a pic, with some grouping, and it has personality. Yay?
I couldn't find a universal clustering algorithm yet: frequently there is more than one way to group data that still makes sense, so whichever final clustering we choose will not be perfect.
Hm... unless maybe we do some sort of quantum clustering, which could be a fun project to explore!
It's a bit hazy now, but I remember trying the HDBSCAN algorithm (hierarchical clustering), and on a graph the size of GitHub I just couldn't fit it in memory.
I did end up using something similar to hierarchical clustering (a mix of Louvain/Leiden/my own), and that's what we see in the final map.
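For readers who haven't met Louvain or Leiden: both search for a partition of the graph that scores well on modularity. The sketch below only computes that score for a given partition, on a made-up toy graph; it is not the search itself and certainly not the custom mix described above. The function name `modularity` and the node names are my own.

```python
from collections import defaultdict

def modularity(edges, community):
    """Newman modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs; community: dict node -> community label.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    internal = defaultdict(int)  # edges with both ends inside the community
    degree = defaultdict(int)    # sum of node degrees per community
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Toy graph (hypothetical): two triangles joined by a single bridge edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_blob = {node: 0 for node in two_groups}

q_split = modularity(edges, two_groups)  # 5/14, about 0.357
q_blob = modularity(edges, one_blob)     # 0.0
```

Splitting along the bridge scores higher than lumping everything together, which is the kind of signal these algorithms climb toward. It also hints at why no single answer is perfect: different partitions can score almost the same.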
Basically what others are guessing, lines represent the highest similarity scores based on "stargazers", which also forms the entire map. To anyone confused, the lines only appear once you click into a specific country.
I think it's the other way around. The similarity metric determines which repos have edges (possibly weighted?).
And then some clustering algorithm makes sense of this giant graph by placing sets of nodes that have a lot of edges to each other close to each other.
The closeness is just layout; the edges are the data that determines closeness.
Jaccard similarity returns a value between 0 and 1 (in this case the vast majority of the values being close to 0). I suspect there's a hard-coded threshold value to determine an edge, e.g. if Jaccard similarity between A and B is > 0.2, create an edge.
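To make that guess concrete, here is a small Python sketch of Jaccard similarity over stargazer sets with a cutoff. The 0.2 threshold is just the example number from the comment above, the stargazer data is invented, and `build_edges` is a hypothetical helper; whether the real map uses a fixed threshold at all is not confirmed.

```python
from itertools import combinations

def jaccard(a, b):
    """|A intersect B| / |A union B|, always between 0 and 1."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def build_edges(stargazers, threshold=0.2):
    """Return (repo1, repo2, similarity) for every pair above the threshold.

    stargazers maps repo name -> set of users who starred it.
    Comparing all pairs is quadratic, so this only works for toy inputs.
    """
    edges = []
    for r1, r2 in combinations(sorted(stargazers), 2):
        sim = jaccard(stargazers[r1], stargazers[r2])
        if sim > threshold:
            edges.append((r1, r2, sim))
    return edges

# Made-up stargazer sets, for illustration only.
stargazers = {
    "django": {"u1", "u2", "u3", "u4"},
    "htmx": {"u2", "u3", "u4", "u5"},
    "linux": {"u1", "u8", "u9"},
}
edges = build_edges(stargazers)
```

Here django and htmx share three of five distinct stargazers (0.6), so they get an edge, while django and linux share one of six and fall below the cutoff.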
Despite HTMX being backend-agnostic, I heard it pairs extremely well with Django, so that's probably why! Maybe the two are particularly well fitting pieces of the web dev puzzle.
I love Django and it's my primary weapon of choice. But due to its structure and philosophy it doesn't move as fast as e.g. Rails (it's impressive how quickly they picked up SQLite for their Solid* stuff). I guess HTMX was a very welcome solution from the "outside" to allow more interactive frontends.
I do have a theory that the more untyped the language is, the larger the islands are: Fronterra (JavaScript), Cloudderra (YAML), AILandia (Python) are way bigger than Java, Swift, DotNet, etc., even though the stereotype is that the problem of software engineering is stale old enterprise code in Java/DotNet.
That might be the case, but the libraries seem to be more reusable!
JavaScript made the barrier to entry for creating a package nearly zero. In contrast, it's fairly difficult to publish something on Maven Central (the main Java repository). You need to prove you own a domain, set up a GPG key for signing, and manually register with Sonatype, which is more than many people are willing to do. I think that explains it much better.
The only problem I see is that projects don't fit so nicely in the division between languages (Pythonia, Javaland, Clojuria, etc) and applications (Gamedonia, AILandia, etc). There's a lot of intersection between them.
But the visualization is super-cool nonetheless. :)
I don't think anyone has ever seriously suggested PHP is dead. People may want it to be dead, but it's probably still the most-used language on the web!
Sorry, not sorry: PHP is alive and thriving! The language runtime is getting ever faster, the Packagist ecosystem in combination with Composer (PHP's package manager) is rock-solid, there are event loops and application servers by now, serverless deployments are the default operation mode, and with Laravel or Symfony, there are trusted and extremely versatile frameworks available that do stuff out of the box that requires a lot of manual effort in other languages.
Add to that the support for type annotations that can go all the way from fully untyped and dynamic, to runtime-enforced primitive constraints and object types, and you'll end up with a very good choice for web applications that evolve quickly.