Thought I'd share some recent work I did after being generally frustrated with the available analytics for Python modules. Most seem to be either limited in time range or incur charges to query.
Code is Apache 2.0 licensed and available at https://github.com/clickHouse/clickpy
The downloads for Python modules are actually available in BigQuery: one row for every package download in the world. Last time I checked, it was the largest BigQuery public dataset at almost 700 billion rows. Wanting to do some serious analytics led to a few frustrations though:
- speed for queries - BigQuery is great for complex SQL, less so for fast analytics.
- cost :) especially as I wanted to offer this for free.
Knowing that ClickHouse excels at this sort of problem (full disclosure: I'm a ClickHouse employee), I set about exporting the data to GCS and importing it into ClickHouse. A weekend learning Next.js + React and some help from a designer friend (thanks Daniel!) and ClickPy was born!
For now the analytics are quite simple, but I plan to enrich them over time. I've also made the cluster public and read-only so users can run the app themselves, and hopefully contribute back.
Finally, I'm planning to keep the dataset up to date daily (maybe more often in the future). It's proven a useful test case for the ClickHouse dev team and has already uncovered a few issues in the core database.
Contributions are welcome, and let me know if you find it useful for digging into the downloads of your own package.
Cheers,
GingerWizard
This would be a really nice tool to reference when people share their findings about malicious Python packages being uploaded to PyPI... take this one as an example: https://clickpy.clickhouse.com/dashboard/cobo-python-api
After looking through some of these malicious packages using your tool, I noticed a trend: the malicious packages usually target packages with high download counts and use a similar name, so the moment they are uploaded they start getting a high number of downloads... perhaps that can be used as an indicator.
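To make the "similar name" idea concrete, here is a minimal sketch of how it might be checked. It is not something ClickPy does as far as I know; the function names, the example package lists, and the 0.85 threshold are all hypothetical, and a real check would combine name similarity with actual download counts.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase the name and treat '-', '_' and '.' as the same separator."""
    return name.lower().replace("_", "-").replace(".", "-")


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two normalized package names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def flag_lookalikes(new_packages, popular_packages, threshold=0.85):
    """Return (new, popular, score) tuples for new names that closely
    resemble, but are not identical to, a popular package name.

    The threshold is an arbitrary assumption and would need tuning.
    """
    flagged = []
    for new in new_packages:
        for popular in popular_packages:
            if normalize(new) == normalize(popular):
                continue  # same package, not a lookalike
            score = similarity(new, popular)
            if score >= threshold:
                flagged.append((new, popular, round(score, 2)))
    # Most similar pairs first
    return sorted(flagged, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    # Hypothetical inputs: recently uploaded names vs. high-download names
    recent = ["reqeusts", "numpy", "totally-unrelated"]
    popular = ["requests", "numpy", "pandas"]
    for new, target, score in flag_lookalikes(recent, popular):
        print(f"{new} looks like {target} (similarity {score})")
```

Anything flagged this way is only a candidate for a closer look. Plenty of legitimate packages have names close to popular ones, so a sudden spike in downloads right after upload would be the second signal to check.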