Hi all, we recently open sourced the first version of Hashquery, a Python library which lets you model analytics, DRY up common logic, and execute it against a database or warehouse.
We were originally rendering SQL directly for all our queries, but that spiraled out of control as our analyses grew more complex: the SQL had to be tweaked separately for each new database dialect (BigQuery, Redshift, Athena, Postgres, etc.), and the SQL fragments were very hard to reuse, so they ended up copy-pasted all over.
~~~
Advantages we think it has over writing SQL by hand:
- Queries are fully composable, so any analysis can be chained into further analysis without refactoring.
- All analysis executes within the data warehouse itself, so queries stay fast even on large datasets.
- It's just Python, so you can extend or parameterize query logic with a plain Python function (see the sketch after this list).
- You can run it anywhere, like inside of unit tests, ETL nodes, or Jupyter notebooks.
- Developer experience is pretty good since editors already know how to autocomplete Python. ChatGPT knows Python already too :)
- The library handles normalizing between database dialects. Write a query once and run it against any supported database.
- Hashquery content is fully serializable, which makes it a good fit for exposing a flexible API that lets consumers efficiently query data marts or internal analytics. We use it as the endpoint for headless BI, instead of having to define and teach a custom DSL or GraphQL specification.
~~~
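To make the composability point concrete, here's a minimal self-contained sketch of the core idea (illustrative only, not Hashquery's actual API): queries are immutable Python values, so shared logic lives in ordinary functions and any query can be extended into a new one.

~~~
from dataclasses import dataclass, replace

# Illustrative only -- NOT the real Hashquery API. A query is an immutable
# Python value, so composing and parameterizing it is ordinary Python.
@dataclass(frozen=True)
class Query:
    table: str
    filters: tuple = ()
    group_by: tuple = ()

    def filter(self, condition: str) -> "Query":
        return replace(self, filters=self.filters + (condition,))

    def aggregate(self, *columns: str) -> "Query":
        return replace(self, group_by=self.group_by + columns)

    def to_sql(self) -> str:
        # A real implementation would normalize this per database dialect.
        sql = f"SELECT {', '.join(self.group_by) or '*'} FROM {self.table}"
        if self.filters:
            sql += " WHERE " + " AND ".join(self.filters)
        if self.group_by:
            sql += " GROUP BY " + ", ".join(self.group_by)
        return sql

# Shared logic is "just Python": a plain function parameterizes any query.
def active_since(query: Query, date: str) -> Query:
    return query.filter(f"last_seen_at >= '{date}'")

users = Query(table="users")
# Any analysis can be chained into further analysis without refactoring.
print(active_since(users, "2024-01-01").aggregate("plan").to_sql())
# SELECT plan FROM users WHERE last_seen_at >= '2024-01-01' GROUP BY plan
~~~

Because each method returns a new query instead of mutating state, chaining never requires refactoring the upstream definition.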
We've built native funnel analysis on top of Hashquery and have been thrilled with it so far, and we thought others might want to use it too. It's pretty early days, so we're still working out how best to explain it and the docs aren't perfectly clear yet, but the examples on the dev site are editable and you can install the pip package to play around with it!
- Break the dependency on your product. I need to be able to use the library even if your company goes under.
- Add a dbt library that makes it easy to use hashquery within dbt models. It gets you materialization for free and will answer a lot of questions you will get about dbt integration.
To comment more broadly: the going trend in data integration seems to be integrating at the warehouse level, so if you want to be a broad solution you need to have SQL answers.
A bunch of tools all consume from our warehouse (existing BI, reverse ETL, data science systems). A BI definition tool won't work for us if I can't define segments in a way that all of those tools can access, even if only as tables or views.
Our programmers and data scientists know Python and are often very good at SQL, but their time is scarce, and BI projects that depend on them have been delayed. Our analysts know SQL, and they have the dedicated time to make these projects happen.
This kind of code snippet isn't crazy to put into dbt, and if someone wants to do Python magic in the background they can:
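Something along these lines, as a hypothetical dbt Python model (assumes dbt 1.3+ with a Python-capable adapter; the model and column names here are made up):

~~~
# models/active_users.py
def model(dbt, session):
    users = dbt.ref("users")  # an upstream dbt model, loaded as a DataFrame

    # The "Python magic" happens here: segment logic can come from any
    # Python library, and dbt still materializes the result as a table or
    # view that every downstream warehouse consumer can read.
    return users.filter(users["last_seen_at"] >= "2024-01-01")
~~~

dbt handles materialization, so the segment lands in the warehouse where existing BI, reverse ETL, and data science tools can all consume it.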