
Visualizing relationships between Python packages - kozikow
https://kozikow.com/2016/07/10/visualizing-relationships-between-python-packages-2/
======
fhoffa
Nice!

You might also be interested to know that PyPI download stats are available
in BigQuery:

[https://mail.python.org/pipermail/distutils-sig/2016-May/028...](https://mail.python.org/pipermail/distutils-sig/2016-May/028986.html)

(project recently unveiled by Donald Stufft)

I also added your article to the compilation at
[https://medium.com/@hoffa/github-on-bigquery-analyze-all-the...](https://medium.com/@hoffa/github-on-bigquery-analyze-all-the-code-b3576fd2b150),
thanks for your prolific series of analyses!

~~~
stared
Is there any way to combine downloads for a single user? I.e. to know that the
same pip client downloaded both package X and package Y?

~~~
kozikow
99% chance that it's not possible and it won't be possible soon.

You can take a look at the table at
[https://bigquery.cloud.google.com/table/the-psf:pypi.downloa...](https://bigquery.cloud.google.com/table/the-psf:pypi.downloads20160711)
and check out the "Schema" and "Preview" tabs. A mix of country_code, details*
and tls* could give you a "rough" grouping, but there would probably be
hundreds of users in the same bucket.

It is probably designed so that you cannot target a single user that way.
It's likely that there is a pipeline scrubbing small buckets to avoid privacy
leaks. I had to write pipelines like this myself on different datasets.

------
stared
For strength of co-occurrence please use pointwise mutual information (or a
function of it), see
[https://github.com/stared/tagoverflow#methods-and-tricks](https://github.com/stared/tagoverflow#methods-and-tricks).
I explored quite a few measures, and only this one gave reasonable results,
including when there was a big difference between the counts of A and B (see:
[http://p.migdal.pl/tagoverflow/](http://p.migdal.pl/tagoverflow/)).

Also, PMI shows up in many contexts, including word2vec. Your measure is at
best hacky (no direct interpretation, leaves many small nodes without links).

Otherwise, nice work! But it would be great if it were visualized more
interactively and aesthetically (yeah, I can be biased here).
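
For readers who have not met PMI before, here is a minimal sketch of how it
could be computed for package pairs. It is not the code behind TagOverflow or
the article; the function name `pmi_scores`, the `min_pair_count` parameter
and the toy data are all made up for illustration. It uses the standard
definition PMI(a, b) = log(p(a, b) / (p(a) p(b))), with probabilities
estimated as counts divided by the number of repositories.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(baskets, min_pair_count=1):
    """Return {(a, b): PMI} for pairs seen together in >= min_pair_count baskets."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # de-duplicate; sorting gives stable pair keys
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    scores = {}
    for (a, b), n_ab in pair_counts.items():
        if n_ab < min_pair_count:
            continue
        # log( (n_ab / n) / ((n_a / n) * (n_b / n)) ) simplifies to the line below
        scores[(a, b)] = math.log(n_ab * n / (item_counts[a] * item_counts[b]))
    return scores


# Hypothetical data: each inner list is the set of packages one repository imports.
repos = [
    ["numpy", "scipy"],
    ["numpy", "scipy", "pandas"],
    ["numpy", "pandas"],
    ["django", "requests"],
]
scores = pmi_scores(repos)
```

A positive score means the pair co-occurs more often than independence would
predict, zero means exactly as often. Raw PMI tends to favor rare pairs, which
is one reason to filter with something like `min_pair_count` or to use a
function of PMI instead of the raw value.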

~~~
kozikow
Looks great! Today after work I plan to try out a version with PMI and
[https://github.com/graphistry/pygraphistry](https://github.com/graphistry/pygraphistry).

~~~
stared
I don't know this one... when not using D3 I was using Gephi or igraph (see
[http://kateto.net/network-visualization](http://kateto.net/network-visualization);
it also has Python bindings).

~~~
kozikow
Would any of those libraries support:

\- Mouse over to highlight neighbours

\- Zoom in and out

\- Browser based

\- Ideally "hide this cluster" functionality

I wanted to add some of those features using d3, but the weekend ran out. Also
"mouse over" would require me to move the vis from canvas to svg.

~~~
stared
For interactive work I use D3.js, on SVG. And I can add all the functionality
I like. :) In particular, I am thinking about creating a general-purpose
co-occurrence graph (after some success of TagOverflow), see e.g.
[http://p.migdal.pl/delab-matury/koincydencje/](http://p.migdal.pl/delab-matury/koincydencje/).
If you like it, you can motivate me to make it into a library :).

But: there should be exports from igraph to D3.js, and from Gephi to
[http://sigmajs.org/](http://sigmajs.org/) (<\- this is interactive, with
mouseover and zoom, and higher level than D3.js; I haven't used it, though).

~~~
kozikow
d3 tends to be very low level and there are a few successful libraries built
on top of it. A co-occurrence graph library would certainly be popular. In my
visualisation I've spent the most time fiddling with JavaScript, and currently
it's the bottleneck for the quality of the visualisation.

Another problem was that, due to the graph size, it took ~1 minute to
stabilize[1] in my browser and it was very wobbly and janky. I had to pre-load
node positions and disable animation to make it usable for viewers. Fully
automated analysis of big graphs would require some server-side component
pre-computing the graph.

1\. It stabilized more quickly with a smaller collision parameter (collision
is newly introduced in the d3 v4 force layout), but nodes were more likely to
overlap.

~~~
stared
For graphs larger than ~100 nodes it's better to precompute the layout. Even
if you want a fully live vis, I would still use the precomputed positions as a
seed.
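
As a rough illustration of the precomputing idea discussed above, here is a
sketch of a server-side step that settles node positions once and writes them
out as JSON for the browser to load. It is a deliberately basic
force-directed loop (all-pairs repulsion, attraction along edges, a cooling
step size), not d3's force layout and not what the article uses; the function
name `precompute_layout`, its parameters and the toy graph are assumptions.
It costs O(n^2) per iteration, so for a real graph a tool such as igraph or
Gephi (both mentioned above) would be the more realistic choice.

```python
import json
import math
import random


def precompute_layout(nodes, edges, iterations=200, seed=0):
    """Return {node: (x, y)} from a basic force-directed simulation."""
    rng = random.Random(seed)  # fixed seed so the layout is reproducible
    pos = {v: [rng.random(), rng.random()] for v in nodes}
    k = math.sqrt(1.0 / len(nodes))  # preferred distance between nodes
    temperature = 0.1  # maximum movement per iteration, cooled over time

    for _ in range(iterations):
        disp = {v: [0.0, 0.0] for v in nodes}

        # Every pair of nodes repels each other.
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                dx = pos[u][0] - pos[v][0]
                dy = pos[u][1] - pos[v][1]
                dist = max(math.hypot(dx, dy), 1e-9)
                push = k * k / dist
                disp[u][0] += dx / dist * push
                disp[u][1] += dy / dist * push
                disp[v][0] -= dx / dist * push
                disp[v][1] -= dy / dist * push

        # Linked nodes attract each other.
        for u, v in edges:
            dx = pos[u][0] - pos[v][0]
            dy = pos[u][1] - pos[v][1]
            dist = max(math.hypot(dx, dy), 1e-9)
            pull = dist * dist / k
            disp[u][0] -= dx / dist * pull
            disp[u][1] -= dy / dist * pull
            disp[v][0] += dx / dist * pull
            disp[v][1] += dy / dist * pull

        # Move each node along its net force, capped by the temperature.
        for v in nodes:
            dx, dy = disp[v]
            length = max(math.hypot(dx, dy), 1e-9)
            step = min(length, temperature)
            pos[v][0] += dx / length * step
            pos[v][1] += dy / length * step

        temperature *= 0.98

    return {v: (x, y) for v, (x, y) in pos.items()}


# Hypothetical toy graph; a real one would come from the co-occurrence data.
nodes = ["numpy", "scipy", "pandas", "django", "requests"]
edges = [("numpy", "scipy"), ("numpy", "pandas"), ("django", "requests")]

layout = precompute_layout(nodes, edges)
payload = json.dumps(
    [{"id": name, "x": x, "y": y} for name, (x, y) in layout.items()]
)
```

The browser can then either draw the nodes at these coordinates with
animation disabled, or use them as the starting positions of a live
simulation so it has far less settling to do.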

------
elcapitan
Honest question: Is anybody actually able to extract valuable information out
of that type of visualization? They always seem like a bag of wool that my cat
would want to play with. Maybe some better plotting algorithm would help?

~~~
michaelscott
This. Unfortunately there is nothing actionable from this visualisation and,
at the risk of sounding like an asshole, it's virtually meaningless. The only
piece of information this visualisation does provide is what the largest
Python packages are.

What's missing is the ability for a user to "separate" clusters into spaces
that they care about. Ideally though, the graphic should be able to display
the relationships in a way that surfaces them visually at a glance.

~~~
kozikow
I agree. I was planning to add more things, but the weekend ran out. By using
graphistry.com I will add:

\- Zooming in and out.

\- Mouse over the package to highlight nodes with an edge.

\- Cluster the packages and let users remove clusters they are not interested
in.

~~~
michaelscott
Sounds awesome man, keen to see it!

~~~
kozikow
I added some graphistry and clustering analysis to the link.

------
infocollector
This might be of interest as well:
[http://pypi.compgeom.com](http://pypi.compgeom.com)

------
Feneric
If you pick a fringe package it just points way off center in empty space.
It's not immediately clear what's happening. An extra dot for the center might
be helpful.

------
ameliaquining
I'd be interested to see what this looks like for npm.

~~~
kozikow
I plan to do the analysis for other languages soon.

------
toolslive
Does it handle runtime Python path trickery?

