We went with that infrastructure from the get-go for several reasons:
* Having a durable buffer in front means big spikes get absorbed by the buffer rather than by the OLAP database, which you want to keep responsive when it is powering your online dashboards. ClickHouse Cloud now has compute/compute separation that addresses this, but open-source users don't have it.
* When we first shipped this, ClickHouse did not yet have async inserts in place, so skipping some kind of buffered insert was frowned upon.
* As oatsandsugar mentioned, since then we have also shipped direct inserts, so you don't need a Kafka buffer if you don't want one.
* From an architecture standpoint, that design lets you have multiple consumers.
* Finally, having Kafka lets you write streaming functions in your favorite language instead of SQL. The performance-to-task ratio will definitely be lower, but depending on the task it can be faster to set up, and you can do things you couldn't do directly in the database.
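To make the buffering point concrete, here is a minimal, hypothetical sketch in Python of a size- and age-bounded insert buffer. This is not Moose's implementation; `flush_fn` stands in for the actual batched insert into the OLAP database:

```python
import time
from typing import Any, Callable, Optional

class InsertBuffer:
    """Accumulate rows and flush them in batches, so ingest spikes
    hit the buffer instead of the OLAP database. Illustrative only."""

    def __init__(self, flush_fn: Callable[[list], None],
                 max_rows: int = 1000, max_age_s: float = 1.0):
        self.flush_fn = flush_fn
        self.max_rows = max_rows
        self.max_age_s = max_age_s
        self.rows: list = []
        self.first_row_at: Optional[float] = None

    def add(self, row: Any) -> None:
        if not self.rows:
            self.first_row_at = time.monotonic()
        self.rows.append(row)
        # Flush when the batch is full or too old.
        if (len(self.rows) >= self.max_rows
                or time.monotonic() - self.first_row_at >= self.max_age_s):
            self.flush()

    def flush(self) -> None:
        if self.rows:
            self.flush_fn(self.rows)
            self.rows = []
            self.first_row_at = None

# Usage: collect flushed batches in a list instead of inserting them.
batches: list = []
buf = InsertBuffer(batches.append, max_rows=3)
for i in range(7):
    buf.add(i)
buf.flush()  # drain the partial tail batch
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

A real version would also flush on a timer even when no new rows arrive, but the size/age trade-off above is the core idea.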
> ClickHouse Cloud now has compute/compute separation that addresses this, but open-source users don't have it.
Altinity is addressing this with Project Antalya builds. We have extended open source ClickHouse with stateless swarm clusters to scale queries on shared Iceberg tables.
The durability and transformation reasons are definitely more compelling, but the article doesn’t mention those reasons.
It’s mainly focused on the insert batching, which is why I was drawing attention to async_insert.
I think it’s worth highlighting the incremental transformations ClickHouse can do via materialised views too. Those can often replace the need for a full-blown streaming transformation pipeline.
IMO you can get a surprising distance with “just” a ClickHouse instance these days. I’d definitely be interested in articles that talk about where that threshold is no longer met!
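For anyone following along, both ideas can be sketched in plain ClickHouse SQL. The table, column, and view names here are made up for illustration:

```sql
-- Server-side batching instead of an external buffer: ClickHouse
-- accumulates small inserts and flushes them in batches.
INSERT INTO events
SETTINGS async_insert = 1, wait_for_async_insert = 1
VALUES (now(), 'page_view');

-- Incremental transformation: rows inserted into `events` are
-- rolled up into `events_per_minute` as they arrive.
CREATE MATERIALIZED VIEW events_per_minute_mv
TO events_per_minute AS
SELECT toStartOfMinute(ts) AS minute, event, count() AS c
FROM events
GROUP BY minute, event;
```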
MooseStack maintainer here. I helped author the post. Happy to answer any questions, but very curious to get feedback. We’ve been thinking a lot about developer experience for the OLAP stack.
What I'm saying is that child processes can write to stdout while the main process is shutting down. Also, if the child processes are not shut down properly and are left dangling, and they were set up with 'inherit' so they can write directly to stdout/stderr, then yes.
Not sure if this is what you are asking about, so feel free to correct me if I misread. You don’t have to install Moose first on the deployment machine; in the tutorial I go through that step to generate a dummy Moose application to be deployed.
It is the same idea as a Next.js application you deploy through Docker: you have your application, you build a Docker container that contains your code, and then you can deploy that.
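As a sketch, that pattern looks like the usual Node.js Dockerfile. The base image, paths, and commands here are illustrative, not the exact ones from the tutorial:

```dockerfile
# Illustrative only: generic "build the app into a container" pattern
FROM node:20-slim
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build        # whatever your framework's build step is
CMD ["npm", "run", "start"]
```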
I tried to limit the port bindings. We usually expose Moose itself, since one of the use cases is collecting data for product analytics from a web front end, which pushes data to Moose. And then usually people want to expose REST APIs on top of the data they have collected. The ClickHouse ports could be fully closed; this was an example of what to open if you want to connect Power BI to it.
We are built on top of them. Right now the techs above are what’s backing the implementation, but we want to add different compatibilities, so that you can eventually have, for example, Airflow backing your orchestration instead of Temporal.
You can think of Moose as the pre-built glue between those components, with the equivalent UX of a web framework (i.e. you get hot reloading, instant feedback, etc.).
I put this Docker Compose recipe together to make kicking the tires on Moose, our open-source data-backend framework, almost frictionless.
What you get:
• A single docker compose up that spins up ClickHouse, Redpanda, Redis and Temporal with health checks and log rotation already wired.
• Runs comfortably on an 8 GB / 4-core VPS; scale-out pointers are in the doc if you outgrow single-node.
• No root Docker needed; the stack follows the hardening tips ClickHouse & Temporal recommend.
Why bother?
Moose lets you model data pipelines in TypeScript/Python and auto-provisions the OLAP tables, streams and APIs, which cuts a lot of boilerplate. Happy to trade notes on the approach or hear where the defaults feel off.
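For reference, the health-check and log-rotation wiring can look roughly like this in Compose; the image tag and probe command below are assumptions, not the exact recipe:

```yaml
services:
  clickhouse:
    image: clickhouse/clickhouse-server:latest  # tag is an assumption
    healthcheck:
      # ClickHouse answers "Ok." on its HTTP /ping endpoint (port 8123)
      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider",
             "http://localhost:8123/ping"]
      interval: 5s
      timeout: 3s
      retries: 10
    logging:
      driver: json-file
      options:
        max-size: "10m"   # rotate each log file at 10 MB
        max-file: "3"     # keep at most 3 rotated files
```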
I have a small open-source project that uses Docker Compose behind the scenes to help start up any service. You could look at adding Moose to it (or I am happy to add it), and then users are one command away from running it (insta moose). I recently added Lakekeeper and various data annotation tools.
Interesting. How do you handle dependencies between those pieces of infrastructure, if there are any? For example, in our Docker Compose file, Temporal depends on Postgres, and then Moose depends on Temporal. How is that expressed in Insta-Infra?
It leverages Docker Compose's 'depends_on' for dependencies (https://docs.docker.com/compose/how-tos/startup-order/). For example, airflow depends on the airflow-init container completing successfully, which in turn depends on postgres.
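Sketched in Compose terms, that chain uses the long form of depends_on with conditions (service names follow the airflow example above; image tags are assumptions):

```yaml
services:
  postgres:
    image: postgres:16              # version is an assumption
  airflow-init:
    image: apache/airflow:2.9.0     # version is an assumption
    depends_on:
      postgres:
        condition: service_started
  airflow:
    image: apache/airflow:2.9.0
    depends_on:
      airflow-init:
        # wait for the one-shot init container to exit with code 0
        condition: service_completed_successfully
```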
Founder here. Thanks for the interest! We built Moose because we were tired of the complexity involved in setting up and maintaining data pipelines.
What makes Moose different is how it simplifies the entire workflow - from ingestion to processing to serving data through APIs. We've found teams spend too much time wiring together different tools rather than focusing on the actual data insights.
The local development experience was a big focus for us. You can instantly test your changes with real data without waiting for deployments. And we've made sure the same code runs identically in production to eliminate those frustrating "works on my machine" moments.
Happy to answer any questions about our technical approach or how we're handling specific use cases. We're particularly interested in hearing about pain points you've experienced with existing data systems or any feedback you might have on Moose.
We are heading toward 1.0 from an API perspective; we just landed what we internally call DMV2, the latest iteration of the abstraction level for the API. Think SST / Terraform CDK, vertically integrated for data.
If you are looking to work with Moose in production we would love to chat with you :)
Hi Zephyr! I'm the Head of Engineering at F45 Training. We had early access to moose, and we've been using it in production since last year with thousands of our members. We use moose to manage the backend for LionHeart - our heart rate tracking system in studio. We also use Moose's paid hosting service called Boreal. It's a new product so still a bit rough around the edges - but it has scaled really well for us and the 514 Team has been terrific.
Disclaimer: I am the CTO at Fiveonefour.