Visualizing relationships between Python packages (kozikow.com)
63 points by kozikow on July 10, 2016 | 24 comments



Nice!

You might also be interested to know that PyPI download stats are available in BigQuery:

https://mail.python.org/pipermail/distutils-sig/2016-May/028...

(project recently unveiled by Donald Stufft)

I also added your article to the compilation at https://medium.com/@hoffa/github-on-bigquery-analyze-all-the..., thanks for your prolific series of analyses!


Is there any way to combine downloads from a single user? I.e. to know that the same pip installation downloaded both package X and package Y?


99% chance that it's not possible and it won't be possible anytime soon.

You can take a look at the table at https://bigquery.cloud.google.com/table/the-psf:pypi.downloa... and check out the "Schema" and "Preview" tabs. A mix of country_code, details* and tls* could give you a "rough" grouping, but there would probably be hundreds of users in the same bucket.

It is probably designed in a way that doesn't let you target a single user like that. It's likely that there is a pipeline scrubbing small buckets to avoid privacy leaks; I had to write pipelines like this myself on different datasets.
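For illustration, a minimal sketch of that kind of rough grouping via the BigQuery Python client. The table wildcard and the tls_protocol column are assumptions, so verify both against the "Schema" tab linked above before running anything:

    # Rough bucketing of PyPI downloads by country and TLS protocol.
    # Table path and tls_protocol column are assumed -- check the Schema tab.
    from google.cloud import bigquery

    client = bigquery.Client()

    QUERY = """
    SELECT
      country_code,
      tls_protocol,                   -- assumed name for one of the tls* fields
      COUNT(*) AS downloads
    FROM `the-psf.pypi.downloads*`    -- assumed table wildcard
    GROUP BY country_code, tls_protocol
    HAVING downloads > 1000           -- drop tiny buckets, as per the privacy point above
    ORDER BY downloads DESC
    LIMIT 20
    """

    for row in client.query(QUERY).result():
        print(row.country_code, row.tls_protocol, row.downloads)

Even with more of the details* fields mixed in, each bucket would still contain many users, which is the point of the scrubbing described above.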


Thanks! I will join this data with the PyPI stats.


For the strength of co-occurrence, please use pointwise mutual information (or a function of it), see https://github.com/stared/tagoverflow#methods-and-tricks. I explored quite a few measures, and only this one gave reasonable results, including when there was a big difference between the counts of A and B (see: http://p.migdal.pl/tagoverflow/).

Also, PMI shows up in many contexts, including word2vec. Your measure is at best hacky (it has no direct interpretation and leaves many small nodes without links).
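For reference, a minimal sketch of PMI over co-occurrence counts; the per-project dependency sets here are hypothetical, and pairs that co-occur more often than chance would predict get a positive score:

    # PMI(a, b) = log( p(a, b) / (p(a) * p(b)) ), estimated from co-occurrence counts.
    import math
    from collections import Counter
    from itertools import combinations

    # Hypothetical dependency sets, one per project.
    projects = [
        {"numpy", "scipy", "matplotlib"},
        {"numpy", "pandas"},
        {"django", "celery"},
        {"numpy", "matplotlib", "seaborn"},
    ]

    n = len(projects)
    single = Counter(pkg for deps in projects for pkg in deps)
    pair = Counter(frozenset(p) for deps in projects
                   for p in combinations(sorted(deps), 2))

    def pmi(a, b):
        """Pointwise mutual information of packages a and b co-occurring."""
        p_ab = pair[frozenset((a, b))] / n
        if p_ab == 0:
            return float("-inf")
        return math.log(p_ab / ((single[a] / n) * (single[b] / n)))

    print(pmi("numpy", "matplotlib"))  # > 0: they co-occur more often than by chance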

Otherwise, nice work! But it would be great if it were visualized more interactively and aesthetically (yeah, I can be biased here).


Looks great! Today after work I plan to try out a version with PMI and https://github.com/graphistry/pygraphistry .


I don't know that one... When not using D3 I was using Gephi or igraph (see http://kateto.net/network-visualization; it also has Python bindings).


Would any of those libraries support:

- Mouse over to highlight neighbours

- Zoom in and out

- Browser based

- Ideally "hide this cluster" functionality

I wanted to add some of those features using d3, but the weekend ran out. Also, "mouse over" would require me to move the vis from canvas to SVG.


For interactive visualizations I use D3.js, on SVG, and I can add all the functionality I like. :) In particular, I am thinking about creating a general-purpose co-occurrence graph (after some success with TagOverflow), see e.g. http://p.migdal.pl/delab-matury/koincydencje/. If you like it, you can motivate me to make it a library :).

But: there should be exports from igraph to D3.js, and from Gephi to http://sigmajs.org/ (<- this is interactive and higher level than D3.js; I haven't used it, though; it has mouseover and zoom).


d3 tends to be very low level and there are a few successful libraries built on top of it. A co-occurrence graph library would certainly be popular. In my visualisation I've spent the most time fiddling with JavaScript, and it's currently the bottleneck for the quality of the visualisation.

Another problem was that, due to the graph size, it took ~1 minute to stabilize[1] in my browser and it was very wobbly and janky. I had to pre-load node positions and disable animation to make it usable for viewers. Fully automated analysis of big graphs would require some server-side component pre-computing the layout.

1. It was stabilizing quicker with a smaller collision parameter (newly introduced in the d3 v4 force layout), but nodes were more likely to overlap.


For graphs larger than ~100 nodes it's better to precompute the layout. Even if you want to keep the whole vis animated, I would still use the precomputed positions as a seed.
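A rough sketch of the precompute-as-a-seed idea (not the original pipeline): run the force layout server-side with networkx and ship the positions to the browser. The edges here are hypothetical:

    # Precompute a force-directed layout server-side and export positions as
    # JSON, so the browser-side code can start from a stable seed instead of
    # animating the whole simulation.
    import json
    import networkx as nx

    G = nx.Graph()
    # Hypothetical weighted co-occurrence edges.
    G.add_weighted_edges_from([
        ("numpy", "scipy", 0.9),
        ("numpy", "pandas", 0.8),
        ("django", "celery", 0.5),
    ])

    # spring_layout is networkx's force-directed (Fruchterman-Reingold) layout.
    pos = nx.spring_layout(G, weight="weight", iterations=200, seed=42)

    with open("positions.json", "w") as f:
        json.dump({node: [float(x), float(y)] for node, (x, y) in pos.items()}, f)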


Do you have any idea for an algorithm that would find "alternative packages to X", e.g. matplotlib would return seaborn/bokeh/ggplot/etc.?

A rough idea would be a high correlation of their neighbour weights combined with a low direct edge weight, but that breaks down in some corner cases. More in https://kozikow.com/2016/07/10/visualizing-relationships-bet... .
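As an illustration of that rough idea (hypothetical co-occurrence weights; the corner cases mentioned in the post still apply): compare neighbourhood profiles and require a weak direct edge:

    # "Alternative packages": similar neighbour profile, weak direct edge.
    import numpy as np

    packages = ["matplotlib", "seaborn", "numpy", "pandas"]
    # Hypothetical symmetric co-occurrence weights.
    A = np.array([
        [0.0, 0.1, 0.9, 0.6],   # matplotlib
        [0.1, 0.0, 0.8, 0.7],   # seaborn
        [0.9, 0.8, 0.0, 0.5],   # numpy
        [0.6, 0.7, 0.5, 0.0],   # pandas
    ])

    def alternatives(i, direct_threshold=0.2):
        """Packages whose neighbour profile resembles package i's
        but which rarely co-occur with i directly."""
        scores = []
        for j in range(len(packages)):
            if j == i or A[i, j] > direct_threshold:
                continue
            # Cosine similarity of the two neighbourhood rows.
            sim = A[i] @ A[j] / (np.linalg.norm(A[i]) * np.linalg.norm(A[j]) + 1e-12)
            scores.append((packages[j], sim))
        return sorted(scores, key=lambda s: -s[1])

    print(alternatives(packages.index("matplotlib")))  # seaborn ranks high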


Some incarnation of word2vec, as such embeddings are suited for finding replacements (you can try googling node2vec, though you need to double-check that it is the right thing; if it is a word2vec-like compression of the incidence matrix, it is what you are looking for). As in:

"a small, fluffy rissun ran up a tree" -> a "rissun" is something like a squirrel.

If you want something more straightforward, look at the adjacency matrix squared ("friends of friends") or some other measure of two nodes having a lot of common neighbours.
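A tiny illustration of the "friends of friends" idea with a hypothetical 0/1 adjacency matrix; entry (i, j) of A @ A counts the neighbours that i and j share:

    # "Friends of friends": (A @ A)[i, j] counts common neighbours of i and j.
    import numpy as np

    # Hypothetical adjacency: matplotlib and seaborn both link to numpy and
    # pandas, but not to each other.
    #              mpl  sbn  np  pd
    A = np.array([[0,   0,   1,  1],   # matplotlib
                  [0,   0,   1,  1],   # seaborn
                  [1,   1,   0,  0],   # numpy
                  [1,   1,   0,  0]])  # pandas

    A2 = A @ A
    print(A2[0, 1])  # 2: matplotlib and seaborn share two neighbours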

In a week I will have more time (and I will be tutoring students on word2vec/GloVe visualizations for 2 weeks), so I would be happy to talk about it.


Honest question: is anybody actually able to extract valuable information from this type of visualization? They always seem like a ball of wool that my cat would want to play with. Maybe a better plotting algorithm would help?


This. Unfortunately there is nothing actionable from this visualisation and, at the risk of sounding like an asshole, it's virtually meaningless. The only piece of information this visualisation does provide is what the largest Python packages are.

What's missing is the ability for a user to "separate" clusters into spaces that they care about. Ideally, though, the graphic should display the relationships in a way that surfaces them visually at a glance.


I agree. I was planning to add more things, but the weekend ran out. By using graphistry.com I will add:

- Zooming in and out

- Mouse over the package to highlight nodes with an edge

- Cluster the packages and let users remove clusters they are not interested in


Sounds awesome man, keen to see it!


I added some graphistry and clustering analysis to the linked post.


I will try out http://graphistry.com/ today after work. Zooming and mouse-over will help with extracting information. In the second section of the post I listed some information I extracted.

It was hard to balance the number of nodes well:

- Too few: Interesting clusters like robotics, numpy and openstack start to disappear, as they get dominated by massive clusters like django

- Too many: It's hard to see things.


This might be of interest as well: http://pypi.compgeom.com


If you pick a fringe package, it just points way off-center into empty space. It's not immediately clear what's happening. An extra dot for the center might be helpful.


I'd be interested to see what this looks like for npm.


I plan to do the analysis for other languages soon.


Does it handle runtime Python path trickery?



