Show HN: ClickPy – OSS project and service for Python analytics using ClickHouse (clickhouse.com)
17 points by gingerwizard on Nov 3, 2023 | 2 comments
Thought I'd share some recent work I've done after being generally frustrated with the available analytics for Python modules. Most seem to be either limited in time range or incur charges to query.

Code is Apache 2 at https://github.com/clickHouse/clickpy

The downloads for Python modules are actually available in BigQuery - a row for every package download in the world. Last time I checked, it was the largest BigQuery public dataset at almost 700 billion rows. Wanting to do some serious analytics led to a few frustrations though:

- speed for queries - BigQuery is great for complex SQL, less so for fast analytics.
- cost :) especially as I wanted to offer this for free.

Knowing that ClickHouse excels at this sort of problem (full disclosure: I'm a ClickHouse employee), I set about exporting the data to GCS and importing it into ClickHouse. A weekend learning NextJS+React and some help from a designer friend (thanks Daniel!) and ClickPy was born!

For now the analytics are quite simple, but I plan to enrich them over time. I've also made the cluster public and read-only so users can run the app themselves - and hopefully contribute back.
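
As a rough illustration of what querying a public, read-only cluster like this could look like, here is a minimal sketch that builds a request URL for ClickHouse's HTTP interface. The host name, the table name and its columns (date, project, count) are assumptions made up for the example and are not taken from the post; check the repo for the real connection details and schema. The request is only built here, not sent.

```python
from urllib.parse import urlencode

# Hypothetical values: the post does not name the host or the schema.
HOST = "clickhouse.example.com"
TABLE = "pypi.pypi_downloads_per_day"  # assumed columns: date, project, count


def build_request_url(project: str, days: int = 30) -> str:
    """Build a URL for ClickHouse's HTTP interface (the query is not sent)."""
    # {name:Type} placeholders are ClickHouse query parameters; their values
    # are passed separately as param_<name>, so they are never spliced into SQL.
    sql = (
        f"SELECT date, sum(count) AS downloads FROM {TABLE} "
        "WHERE project = {project:String} "
        "AND date >= today() - {days:UInt32} "
        "GROUP BY date ORDER BY date"
    )
    params = {
        "query": sql,
        "param_project": project,
        "param_days": days,
        "default_format": "JSONEachRow",
    }
    return f"https://{HOST}:8443/?{urlencode(params)}"


if __name__ == "__main__":
    print(build_request_url("requests", days=7))
```

Passing the package name as a query parameter rather than formatting it into the SQL string keeps the example safe to reuse with user-supplied names.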

Finally, I'm planning to keep the dataset up-to-date daily (maybe more often in the future). It's proven a useful test case for the ClickHouse dev team and has already uncovered a few issues in the core db.

Contributions welcome and let me know if you find it useful for digging into the downloads of your own package.

Cheers, GingerWizard




Curious - how many rows are you ingesting daily?

This would be a really nice tool to reference when people share their findings about malicious Python packages being uploaded to PyPI... take this one as an example: https://clickpy.clickhouse.com/dashboard/cobo-python-api

After looking through some of these malicious packages using your tool, I noticed a trend: a malicious package usually targets a package with a high download count and uses a similar name, so it starts getting a high number of downloads from the moment it is uploaded... perhaps that can be used as an indicator.
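
To make that indicator concrete, here is a minimal sketch of the heuristic: flag a newly uploaded package when its name is very close to a popular package's name and it already has a lot of early downloads. The thresholds, the function name and the sample data are all hypothetical, and name similarity via `difflib` is just one simple choice, not how ClickPy does anything.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Similarity ratio between two package names, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def flag_suspects(
    new_packages: dict[str, int],   # name -> downloads shortly after upload
    popular_packages: list[str],    # names of high-download packages
    min_similarity: float = 0.8,    # hypothetical threshold
    min_downloads: int = 1000,      # hypothetical threshold
) -> list[tuple[str, str, float]]:
    """Return (new_name, lookalike_of, similarity) for suspicious uploads."""
    suspects = []
    for name, downloads in new_packages.items():
        if downloads < min_downloads:
            continue  # no early download spike, skip
        for popular in popular_packages:
            if name.lower() == popular.lower():
                continue  # the popular package itself, not a lookalike
            score = name_similarity(name, popular)
            if score >= min_similarity:
                suspects.append((name, popular, round(score, 3)))
    return suspects


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    uploads = {"reqeusts": 5000, "reqeust": 10, "totally-new": 9000}
    for suspect in flag_suspects(uploads, ["requests", "numpy"]):
        print(suspect)
```

A real check would need more than this (legitimate forks and plugins often have similar names), so treat a hit as a prompt for manual review rather than a verdict.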


It's about a billion rows a day. Nice idea, we could probably add a visual for possible malicious packages.



