99% chance that it's not possible and it won't be possible soon.
You can take a look at the table at https://bigquery.cloud.google.com/table/the-psf:pypi.downloa... and check out the "Schema" and "Preview" tabs.
A mix of country_code, details* and tls* could give you a "rough" grouping, but there would probably be hundreds of users in the same bucket.
It is probably designed so that you cannot target a single user that way. It's likely that there is a pipeline scrubbing small buckets to avoid privacy leaks. I had to write pipelines like this myself on different datasets.
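To make the "scrubbing small buckets" idea concrete, here is a minimal sketch in Python. It is only an illustration of the general technique, not how the PyPI dataset is actually processed. The `installer` and `tls_protocol` field names, the sample rows and the `min_size` threshold are all hypothetical; only `country_code` is taken from the schema mentioned above.

```python
from collections import Counter


def scrub_small_buckets(rows, keys, min_size):
    """Keep only rows whose combination of `keys` is shared by at least `min_size` rows."""
    def bucket_of(row):
        return tuple(row.get(k) for k in keys)

    # Count how many rows fall into each bucket.
    sizes = Counter(bucket_of(row) for row in rows)
    # Drop every row that sits in a bucket smaller than the threshold.
    return [row for row in rows if sizes[bucket_of(row)] >= min_size]


# Hypothetical rows; field names other than country_code are assumptions.
rows = [
    {"country_code": "US", "installer": "pip", "tls_protocol": "TLSv1.2"},
    {"country_code": "US", "installer": "pip", "tls_protocol": "TLSv1.2"},
    {"country_code": "US", "installer": "pip", "tls_protocol": "TLSv1.2"},
    {"country_code": "IS", "installer": "conda", "tls_protocol": "TLSv1.0"},
]

kept = scrub_small_buckets(
    rows, ["country_code", "installer", "tls_protocol"], min_size=3
)
print(len(kept))
```

The single row in its own bucket is dropped, which is exactly the kind of row that could otherwise identify one user.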
Also, PMI shows up in many contexts, including word2vec. Your measure is at best hacky (no direct interpretation, leaves many small nodes without links).
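For reference, PMI for a pair is log(p(a, b) / (p(a) * p(b))). A minimal sketch of computing it from co-occurrence data, assuming the input is a list of "baskets" (for example, the set of packages used together in one project); the `baskets` sample and the `pmi_scores` name are hypothetical:

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(baskets):
    """Return {(a, b): PMI} for every pair that co-occurs in at least one basket."""
    n = len(baskets)
    single = Counter()  # number of baskets containing each item
    pair = Counter()    # number of baskets containing each pair
    for basket in baskets:
        items = sorted(set(basket))
        single.update(items)
        pair.update(combinations(items, 2))
    return {
        (a, b): math.log((count / n) / ((single[a] / n) * (single[b] / n)))
        for (a, b), count in pair.items()
    }


# Hypothetical co-occurrence data.
baskets = [
    ["numpy", "scipy"],
    ["numpy", "scipy"],
    ["django", "requests"],
    ["django", "numpy"],
]

scores = pmi_scores(baskets)
for names, value in sorted(scores.items()):
    print(names, round(value, 3))
```

Positive values mean the pair shows up together more often than chance would suggest, negative values mean less often, so the number has a direct interpretation.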
Otherwise, nice work! But it would be great if it were visualized more interactively and aesthetically (yeah, I may be biased here).
For interactive visualizations I use D3.js, on SVG, and I can add all the functionality I like. :) In particular, I am thinking about creating a general-purpose co-occurrence graph (after some success of TagOverflow), see e.g. http://p.migdal.pl/delab-matury/koincydencje/. If you like it, you can motivate me to make it a library :).
But: there should be exports from igraph to D3.js, and from Gephi to http://sigmajs.org/ (<- this is interactive, with mouseover and zoom, and higher level than D3.js; I haven't used it, though).
d3 tends to be very low level and there are a few successful libraries built on top of it.
A co-occurrence graph library would certainly be popular.
In my visualisation I've spent the most time fiddling with JavaScript, and currently it's the bottleneck for the quality of the visualisation.
Another problem was that, due to the graph size, it took ~1 minute to stabilize[1] in my browser, and it was very wobbly and janky. I had to pre-load node positions and disable animation to make it usable for viewers. Fully automated analysis of big graphs would require some server-side component pre-computing the graph.
1. It stabilized more quickly with a smaller collision parameter (newly introduced in the d3 v4 force layout), but nodes were more likely to overlap.
Some incarnation of word2vec, as these are well suited to finding replacements (you can try googling node2vec, though you need to double-check whether it is the right thing; if it is a word2vec-like compression of the incidence matrix, it is what you are looking for). As in:
"a small, fluffy rissun run on a tree" -> "rissun" is something like a squirrel.
If you want something more straightforward, look at the adjacency matrix squared ("friends of friends") or some other measure of two nodes having many common neighbours.
In a week I will have more time (and I will be tutoring students on word2vec/GloVe visualizations for 2 weeks), so I would be happy to talk about it.
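A minimal sketch of the "adjacency matrix squared" idea, in plain Python so it runs without extra libraries. For an unweighted, undirected graph, entry (i, j) of the squared matrix counts the neighbours that nodes i and j share. The four-node graph and the `common_neighbours` name are made up for illustration:

```python
def common_neighbours(adj):
    """Square a 0/1 adjacency matrix; entry [i][j] counts shared neighbours of i and j."""
    n = len(adj)
    return [
        [sum(adj[i][k] * adj[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical graph with edges 0-2, 0-3, 1-2, 1-3.
# Nodes 0 and 1 are not linked to each other but share both neighbours.
adj = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]

squared = common_neighbours(adj)
print(squared[0][1])  # shared neighbours of nodes 0 and 1
```

Nodes that never co-occur directly but have many common neighbours are natural candidates for replacements. The diagonal simply holds each node's degree, so it should be ignored when ranking pairs.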
Honest question: Is anybody actually able to extract valuable information out of that type of visualization? They always seem like a bag of wool that my cat would want to play with. Maybe some better plotting algorithm would help?
This. Unfortunately there is nothing actionable from this visualisation and, at the risk of sounding like an asshole, it's virtually meaningless. The only piece of information this visualisation does provide is what the largest Python packages are.
What's missing is the ability for a user to "separate" clusters into spaces that they care about. Ideally though, the graphic should be able to display the relationships in a way that surfaces visually at a glance.
I agree. I was planning to add more things, but the weekend ran out. By using graphistry.com I will add:
- Zooming in and out.
- Mouse over the package to highlight nodes with an edge.
- Cluster the packages and let users remove clusters they are not interested in.
I will try out http://graphistry.com/ today after work. Zooming and mouse over will help extract information. In the second section of the post I posted some information I extracted.
It was hard to balance the number of nodes well:
- Too few: Interesting clusters like robotics, numpy and openstack start to disappear, as they get dominated by massive clusters like django
If you pick a fringe package it just points way off center in empty space. It's not immediately clear what's happening. An extra dot for the center might be helpful.
You might also be interested to know that PyPI download stats are available in BigQuery:
https://mail.python.org/pipermail/distutils-sig/2016-May/028...
(project recently unveiled by Donald Stufft)
I also added your article to the compilation at https://medium.com/@hoffa/github-on-bigquery-analyze-all-the..., thanks for your prolific series of analyses!