What's New in ClickHouse 21.12 (clickhouse.com)
133 points by nnx 37 days ago | 55 comments

I can't speak highly enough of Clickhouse. Since 20.x, which was already one of (the?) fastest column store databases around, it's gotten even better:

- Built-in replication without the need to run Zookeeper, with multi-tier sharding support. They rewrote the ZK protocol using multi-paxos within CH itself. It's great.

- Built-in batch inserts via the HTTP protocol... sorry, TCP :(. Previously you'd have to batch using buffer tables, proxies, or in-memory buffering within your client apps. This is no longer needed!

- Better support for external data formats (avro, parquet)

It's just... so good.

> Built-in replication without the need to run Zookeeper, with multi-tier sharding support. They rewrote the ZK protocol using multi-paxos within CH itself. It's great.

AFAIK this description is kind of misleading. When they say they got rid of Zookeeper, people expect that they can just connect ClickHouse nodes to each other and replication will work. But that's not how it works - you still have to run an external service called clickhouse-keeper. Basically, what they did was rewrite Zookeeper in C++.

Fwiw, ClickHouse Keeper _can_ be run as an external daemon if you wish. But it's packaged with the server binary itself, so once you add the keeper config to all your server nodes, you're good to go, without running anything else.

You can run clickhouse-keeper embedded in the server, though. That way, each "primary" handles incoming SQL connections _and_ paxos communication without additional infra.
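For reference, a rough sketch of what the embedded Keeper section might look like in each server's config file (tag names follow the Keeper docs; hostnames, ports, and IDs here are illustrative - the `server_id` must be unique per node):

```xml
<clickhouse>
    <keeper_server>
        <tcp_port>9181</tcp_port>
        <server_id>1</server_id>
        <raft_configuration>
            <server><id>1</id><hostname>ch1</hostname><port>9234</port></server>
            <server><id>2</id><hostname>ch2</hostname><port>9234</port></server>
            <server><id>3</id><hostname>ch3</hostname><port>9234</port></server>
        </raft_configuration>
    </keeper_server>
</clickhouse>
```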

> Built-in batch inserts via the HTTP protocol

Very welcome. We used to do that with a dedicated app.

> It's just... so good.

What Postgres is for RDBMSs and SQLite for embedded, ClickHouse is for time series. Tastefully designed and driven by engineering excellence. I wish them all the best.

> They rewrote the ZK protocol using multi-paxos within CH itself.

It appears to be implemented with Raft, not Paxos (per https://presentations.clickhouse.com/meetup54/keeper.pdf, slide 21).

Please also note that there are two concepts here that can easily get mixed up. For clarification:

* Zookeeper's wire protocol is emulated in CH-Keeper. Nice! So all clients are compatible, etc.

* Zookeeper uses a distributed consensus algorithm called ZAB, which is not Paxos - though many believe it is. CH-Keeper uses Raft, and it can do so because the consensus algorithm is not exposed directly: it is an internal property hidden behind the API and, obviously, the wire protocol.

Do you have to do anything special to opt in to the built-in batch inserts? Earlier, I was forced to use the buffer-tables approach: how would one ditch that now?

If you're using the HTTP protocol, add `async_insert=1` to your connection string. You can tune the batching here: https://clickhouse.com/docs/en/operations/settings/settings/...
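A minimal sketch of what opting in might look like (setting names per the linked docs; the table and values are illustrative, and the settings can equally be passed as URL parameters on the HTTP interface, e.g. `?async_insert=1`):

```sql
-- Enable server-side batching for this session
SET async_insert = 1;           -- buffer incoming rows server-side
SET wait_for_async_insert = 1;  -- ack only once the batch is flushed

-- Small single-row inserts are now collected into batches by the server
INSERT INTO events (ts, payload) VALUES (now(), 'hello');
```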

I'm a little surprised that Amazon hasn't created ClickHouse as a Service. They've done it for Elasticsearch, Presto (Athena), Hadoop, Kafka, MySQL, and PostgreSQL. ClickHouse seems like it would fit their strategy of offering a datastore as a service. While AWS does have things like Redshift and Timestream, it seems like ClickHouse offers a lot of potential untapped value that Amazon could capture.

They may be working on it already. Such a product would take a while to create. It was only in September of this year that ClickHouse was spun out of Yandex and quickly raised a Series A + B.

Well, ClickHouse has been open source for quite some time. The reason I could see for Amazon not doing it is that demand is low. ClickHouse still isn't mainstream.

At this point, product strategy for these companies must be built with Amazon in mind, with the goal of outpacing them past the point where AWS could easily take a bite of the market share.

As much as it sucks to have been an OSS-ish product Amazon has taken a chunk out of, the game is now known and can be proactively neutered with good planning.

I am wondering whether it would be a rival to Redshift?

You mean ParAccel? :P

But Clickhouse spinning off from Yandex is not a requirement for an AWS offering, right?

But with CH they don't get that sweet sweet vendor lock in like with Dynamo or Kinesis.

Many of those were introduced sometime after they became mainstream.

That'd be awesome actually.

We're big fans of ClickHouse here at GraphCDN, our entire analytics stack is based on it[0] and it's been scaling well from tens of thousands of events to now billions of events!

[0]: https://altinity.com/blog/delivering-insight-on-graphql-apis...

Not to be a stick in the mud here, but we recently moved from ClickHouse to Druid due to issues we were having when scaling and rebalancing the cluster. How does removing ZK help?

Druid has quite a bit of intelligence baked in to handle scaling by default. I am curious how ClickHouse does in all those aspects.

When we did a PoC, the operational side and performance of ClickHouse were severely lacking compared to Druid - and ClickHouse had bigger resources at its disposal than Druid during that PoC.

If they could improve the operational side and introduce sensible defaults, so that users don't have to go through 10,000 configuration options to work with data in ClickHouse, I am sure I would give it a go for some other use case. It is simple on the surface, but the devil is in the details. Druid is much simpler and saner at the scale I need to operate.

Because ZK is garbage and complicates every clustered application that relies on it? Kafka is ditching ZK too.

A ClickHouse cluster quite simply doesn't support elastic rebalancing. Avoid CH if that is a hard requirement for your setup.

I used to have a "wipe this ZK node clean and rejoin the cluster" script to deal with ZK node outages that nobody could explain.

Interesting. We moved from Druid to Clickhouse for exactly the same reason :-)


Clickhouse is significantly easier to operate than Druid in my experience.

How does Clickhouse stack up against Druid? We're trying to make a decision on the two technologies, and found this recent article that shows Druid 8x faster than Clickhouse - https://imply.io/post/druid-nails-cost-efficiency-challenge-...

Happy to help you suss that out, at least from the Druid side. You can post on druid-user@googlegroups.com or on druidforum.org. I monitor those sites, and if you give me some idea of what you want to do, I can help you figure out if Druid is the right fit.

In my 4 years working with Druid, I can tell you that the posters here are right about Druid vs. CH. We find that Druid is more stable and easier to manage for larger clusters, but it can be complex to get up and running. There is a new project in the Druid community to address these issues, and hopefully there will be a version early next year to play with.

In terms of the benchmarks, the post was meant to be a little tongue-in-cheek... different queries in different environments with different data will just perform... differently. I did some of the benchmarks against CH with SSD data and the results were mixed. We tuned some stuff, added things to the code, and things were still mixed. Both databases are very, very fast and very well suited to real-time workloads. It all comes down to the use case, deployment environment, etc.

Your statement is misleading, at best.

Nowhere in that short article does it say that Druid is 8 times faster than ClickHouse. They claimed: "Druid is simply 2 times faster than ClickHouse" (and by total runtime it's actually only 1.5 times faster).

There is also a newer ClickHouse benchmark with a total runtime of 0.779s. This is almost the same number as in Imply's statement.


The introduction of parallelized Parquet reads coupled with s3Cluster is really awesome. I feel ClickHouse is one step closer to unlocking the ephemeral SQL compute cluster (e.g. Presto, Hive) use case. I could imagine it one day having a read-only HiveMetastore option for querying companies' existing data... very fast, I might add.
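A sketch of the kind of query this enables (the bucket, cluster name, and schema here are illustrative; the cluster name must exist in the server's remote_servers config, and check the docs for the exact s3Cluster signature in your version):

```sql
-- Fan a Parquet scan out across every node of a named cluster;
-- each node reads a share of the matched S3 objects in parallel.
SELECT count(), avg(price)
FROM s3Cluster(
    'my_cluster',
    'https://my-bucket.s3.amazonaws.com/data/*.parquet',
    'Parquet',
    'price Float64, ts DateTime'
);
```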

>>> SELECT * FROM system.contributors


I made a PoC to replace our current ElasticSearch cluster with ClickHouse 20.10: 5x gains in space and 3x gains in QPS, without changing the schema model too much. It's still in a PoC repo; priorities, I guess..

Very interesting! Having worked for both Elastic and now ClickHouse, I'd love to hear more about the use case or learnings you have from the POC. If you have any questions or issues that pop up, I'd be happy to try and get an answer for you from the team.

Coolest part of ClickHouse is its ability to do ETL automagically.

It really is a super power.

What do you mean? Does it have an ETL engine like MS SSIS or a scheduler like Airflow built in?

I essentially load in two columns: one called timestamp and another with a JSON blob.

I then have inserts kick off materialized views that automagically pluck relevant JSON fields out into views.
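A minimal sketch of this pattern (table and field names are illustrative):

```sql
-- Raw ingest table: just a timestamp and a JSON blob per event
CREATE TABLE events_raw
(
    timestamp DateTime,
    payload   String
)
ENGINE = MergeTree
ORDER BY timestamp;

-- Materialized view that plucks typed fields out of the blob at insert time
CREATE MATERIALIZED VIEW events_extracted
ENGINE = MergeTree
ORDER BY (user_id, timestamp)
AS SELECT
    timestamp,
    JSONExtractString(payload, 'user_id')    AS user_id,
    JSONExtractFloat(payload, 'duration_ms') AS duration_ms
FROM events_raw;
```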

Similar to this


That's an interesting way to load data and then use materialized views. However, I am curious: how do you make efficient use of the compression codecs[0] that ClickHouse provides, or of neat features like TTL policies[1], with this method?

[0]: https://clickhouse.com/docs/en/sql-reference/data-types/lowc...

[1]: https://clickhouse.com/docs/en/sql-reference/statements/alte...

Materialized views have a backing table where you can use the codecs. You can add a TTL to the ingestion table.
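For instance, a sketch of giving the MV an explicit backing table via the TO clause, so codecs and a TTL can be declared on the extracted columns (all names and the 90-day TTL are illustrative):

```sql
-- Explicit backing table with per-column codecs and a retention TTL
CREATE TABLE events_typed
(
    timestamp   DateTime CODEC(Delta, ZSTD),
    user_id     LowCardinality(String),
    duration_ms Float64 CODEC(Gorilla)
)
ENGINE = MergeTree
ORDER BY (user_id, timestamp)
TTL timestamp + INTERVAL 90 DAY;

-- The MV writes its extraction query's output into that table
CREATE MATERIALIZED VIEW events_typed_mv TO events_typed
AS SELECT
    timestamp,
    JSONExtractString(payload, 'user_id')    AS user_id,
    JSONExtractFloat(payload, 'duration_ms') AS duration_ms
FROM events_raw;
```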

The backing table in OP's case is just two columns: timestamp and a `String` column holding the JSON blob. Can't do much there.

He said the ingest table is defined that way, not the MV backing table.

Great minds think alike :)

We do the exact same thing at GraphJSON https://www.graphjson.com/guides/about

How does Clickhouse compare to Snowflake?

A few differences:

- Snowflake is SaaS; ClickHouse isn't yet

- ClickHouse is open source; Snowflake is proprietary

- Snowflake has the virtual-warehouse concept and the ability to scale compute up and down with a single SQL statement; ClickHouse is a bit more traditional in architecture

- Snowflake is hella expensive

- Snowflake is a bit more of a traditional data warehouse, whereas ClickHouse is philosophically about powering through big datasets such as denormalised click streams or logs

Both great products for their respective use cases

There are at least 6 or 7 SaaS implementations of ClickHouse: 3+ in China, Altinity.Cloud (AWS/GCP), Yandex (Yandex.Cloud), TinyBird (AWS)... and more on the way.

Disclaimer: I work on Altinity.Cloud.

Altinity have a Clickhouse cloud

What's the K8s story? I doubt that where I am now I can request physical servers and dedicated fast disks.

ClickHouse works great on Kubernetes. Check out the ClickHouse Operator for Kubernetes. [0] We just added a UI to it, blog article out shortly.

[0] https://github.com/Altinity/clickhouse-operator

Disclaimer: I work at Altinity.

Ooh! I like the idea of this, going to check it out for sure! The UI is a nice touch.

Has anyone here used MonetDB? I wonder how it holds up against other column-oriented databases.

I installed, configured, and used it as part of a private PoC (I evaluated ~20 different DBs).

Just forget it - primitive, performance wasn't great even compared to classic RDBMSs, and so on => it can't be used in a real-world scenario, but it's interesting as an experiment.

Does anyone know if ClickHouse Keeper can replace ZooKeeper in non-ClickHouse scenarios?

According to: https://clickhouse.com/docs/en/operations/clickhouse-keeper/

> By default, ClickHouse Keeper provides the same guarantees as ZooKeeper (linearizable writes, non-linearizable reads). It has a compatible client-server protocol, so any standard ZooKeeper client can be used to interact with ClickHouse Keeper. Snapshots and logs have an incompatible format with ZooKeeper, but clickhouse-keeper-converter tool allows to convert ZooKeeper data to ClickHouse Keeper snapshot. Interserver protocol in ClickHouse Keeper is also incompatible with ZooKeeper so mixed ZooKeeper / ClickHouse Keeper cluster is impossible.

So I guess yes?

Valued at $2B. What's their business model? Corporate support?

Company valuations are Monopoly money.

I guess.

They’re going to build a cloud and rival Snowflake. Plus support/service open source users.
