I can't speak highly enough of Clickhouse. Since 20.x, which was already one of (the?) fastest column store databases around, it's gotten even better:
- Built-in replication without the need to run Zookeeper, with multi-tier sharding support. They rewrote the ZK protocol using multi-paxos within CH itself. It's great.
- Built-in batch inserts via the HTTP protocol... sorry, TCP :(. Previously you'd have to batch using buffer tables, proxies, or in-memory buffering within your client apps. This is no longer needed! (rough sketch of both at the end of this comment)
- Better support for external data formats (Avro, Parquet)

It's just... so good.
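In case a concrete example helps, here's roughly what the first two bullets look like in practice. This is only a sketch: the cluster name, table, and schema are made up, argument-less ReplicatedMergeTree assumes the default replica path/name macros are in place, and async_insert is the setting behind the new server-side batching.

    -- replicated table without hand-written ZooKeeper paths
    CREATE TABLE events ON CLUSTER my_cluster
    (
        ts      DateTime,
        user_id UInt64,
        payload String
    )
    ENGINE = ReplicatedMergeTree   -- path/replica filled in from default macros
    ORDER BY (user_id, ts);

    -- small, frequent inserts get buffered and batched on the server
    INSERT INTO events
    SETTINGS async_insert = 1, wait_for_async_insert = 0
    VALUES (now(), 42, 'hello');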
> Built-in replication without the need to run Zookeeper, with multi-tier sharding support. They rewrote the ZK protocol using multi-paxos within CH itself. It's great.
AFAIK this description is kind of misleading. When they say they got rid of Zookeeper, people expect that they can just connect ClickHouse nodes to each other and replication will work. But that's not how it works - you still have to run an external service called clickhouse-keeper. Basically, what they did is rewrite Zookeeper in C++.
Fwiw, ClickHouse Keeper _can_ be run as an external daemon if you so wish. But it's packaged with the server binary itself, so once you add <keeper_config> to all your server nodes, you're good to go, without running anything else.
You can run clickhouse-keeper embedded in the server, though. That way, each "primary" handles incoming SQL connections _and_ paxos communication without additional infra.
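Not from the thread, but a quick sanity check once keeper (embedded or standalone) is configured: querying the coordination tree through SQL should list the root znodes rather than error out.

    SELECT name
    FROM system.zookeeper
    WHERE path = '/'   -- system.zookeeper requires a path filter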
Very welcome. We used to do that with a dedicated app.
> It's just... so good.
What Postgres is for RDBMSs and SQLite for embedded, ClickHouse is for time series. Tastefully designed and driven by engineering excellence. I wish them all the best.
Please also note that there are two concepts here that can easily get mixed up. For clarification:
* Zookeeper's wire protocol is emulated in CH-Keeper. Nice! So all clients are compatible, etc.
* Zookeeper uses a distributed consensus algorithm called ZAB, which is not Paxos -- though many believe it is. CH-Keeper uses Raft, and it can do so because the consensus algorithm is not exposed directly: it is an internal property hidden behind the API and, obviously, the wire protocol.
Do you have to do anything special to opt in to the built-in batch inserts? Earlier, I was forced to use the buffer tables approach: how would one ditch that now?
I'm a little surprised that Amazon hasn't created ClickHouse as a Service. They've done it for Elasticsearch, Presto (Athena), Hadoop, Kafka, MySQL, PostgreSQL. ClickHouse seems like it would fit with their strategy of offering a datastore as a service. While AWS does have things like Redshift and Timestream, it seems like ClickHouse offers a lot of potential untapped value that Amazon could capture.
They may be working on it already. Such a product would take a while to create. It was only in September of this year that Clickhouse was spun out of Yandex and quickly raised a Series A + B.
Well, Clickhouse has been open source for quite some time. The reason I can think of for Amazon not doing it is that demand is low. Clickhouse still isn't mainstream yet.
At this point, product strategy for these companies must be built with Amazon in mind, with the goal of outpacing them before AWS is able to easily take a bite of the market share.
As much as it sucks to have been an OSS-ish product Amazon has taken a chunk out of, the game is now known and can be proactively neutered with good planning.
We're big fans of ClickHouse here at GraphCDN, our entire analytics stack is based on it[0] and it's been scaling well from tens of thousands of events to now billions of events!
Not to be the stick in the mud here, but we recently moved from Clickhouse to Druid due to issues we were having when scaling and rebalancing the cluster. How does removing ZK help?
Druid has quite a lot of intelligence baked in to handle scaling by default. I am curious how ClickHouse does in all those aspects.
When we did a PoC, the operational side of Clickhouse and its performance were severely lacking compared to Druid, even though Clickhouse had bigger resources at its disposal than Druid during the PoC.
If they could improve the operational aspects and introduce sensible defaults so that users don't have to go through 10,000 configuration options to work with data in Clickhouse, I'm sure I'll give it a go for some other use case. It is simple on the surface, but the devil is in the details. Druid is much simpler and saner at the scale I need to operate at.
Happy to help you suss that out, at least from the Druid side. You can post on druid-user@googlegroups.com or on druidforum.org. I monitor those sites, and if you give me some idea of what you want to do, I can help you figure out if Druid is the right fit.

In my 4 years working with Druid, I can tell you that the posters here are right about Druid vs. CH. We find that Druid is more stable and easier to manage for larger clusters, but it can be complex to get up and running. There is a new project in the Druid community to address these issues, and hopefully there will be a version early next year to play with.

In terms of the benchmarks, the post was meant to be a little tongue in cheek... different queries in different environments with different data will just perform... differently. I did some of the benchmarks against CH with SSD data and the results were mixed. We tuned some stuff, added things to the code, and things were still mixed. Both databases are very, very fast and very well suited to real-time workloads. It all comes down to the use case, deployment environment, etc.
Nowhere in this short article does it say that Druid is 8 times faster than ClickHouse. They claimed: "Druid is simply 2 times faster than ClickHouse" (actually, by total runtime it's only 1.5 times faster).
There is also a newer ClickHouse benchmark whose total runtime is 0.779s.
This is almost the same number as in Imply's statement.
The introduction of parallelized Parquet reads coupled with s3Cluster is really awesome. I feel ClickHouse is one step closer to unlocking the ephemeral SQL compute cluster (e.g. Presto, Hive) use case. I could imagine one day it having a HiveMetaStore read-only database option for querying existing data in companies...very fast, I might add.
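For anyone who hasn't tried it, such a query looks roughly like this. The cluster name, bucket path, and columns below are placeholders: the cluster has to be defined in remote_servers, and the listed schema has to match the Parquet files.

    SELECT toDate(event_time) AS day, count() AS events
    FROM s3Cluster(
        'my_cluster',                                           -- cluster from remote_servers
        'https://my-bucket.s3.amazonaws.com/events/*.parquet',  -- placeholder path
        'Parquet',
        'event_time DateTime, user_id UInt64'                   -- schema of the files
    )
    GROUP BY day
    ORDER BY day;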
I made a PoC to replace our current ElasticSearch cluster with Clickhouse 20.10.
5x gains in space and 3x gains in QPS, without changing the schema model too much.
It's still in a PoC repo, priorities I guess...
Very interesting! Having worked for both Elastic and now ClickHouse, I'd love to hear more about the use case or learnings you have from the POC. If you have any questions or issues that pop up, I'd be happy to try and get an answer for you from the team.
That's an interesting way to load data and then use it with Materialized Views. However, I am curious how you make efficient use of the compression codecs[0] that Clickhouse provides, or some neat features like TTL policies[1], using this method?
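For reference (not tied to the method above - just illustrative DDL with made-up columns), codecs and TTL are declared on the target table itself, so they apply however the data gets loaded:

    CREATE TABLE metrics
    (
        ts    DateTime CODEC(Delta, ZSTD),    -- delta-encode timestamps, then compress
        name  LowCardinality(String),
        value Float64 CODEC(Gorilla, ZSTD)
    )
    ENGINE = MergeTree
    ORDER BY (name, ts)
    TTL ts + INTERVAL 90 DAY;                 -- drop rows older than 90 days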
- Snowflake is SaaS, Clickhouse isn’t yet
- Clickhouse is open source, Snowflake is proprietary
- Snowflake has the virtual warehouse concept and the ability to scale compute up and down with a single SQL statement (see the sketch just below this list). Clickhouse is a bit more traditional in architecture.
- Snowflake is hella expensive
- Snowflake is a bit more of a traditional data warehouse, whereas Clickhouse is philosophically about powering through big datasets such as denormalised click stream or logs
Both great products for their respective use cases
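To make the "single SQL statement" point concrete, resizing a Snowflake virtual warehouse is roughly this (the warehouse name is a placeholder):

    -- scale compute up for a heavy job, then drop it back down
    ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'XLARGE';
    -- ... run the expensive queries ...
    ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'SMALL';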
There are at least 6 or 7 SaaS implementations of ClickHouse: 3+ in China, Altinity.Cloud (AWS/GCP), Yandex (Yandex.Cloud), TinyBird (AWS)...And more on the way.
I installed, configured, and used it as part of my private PoC (I evaluated ~20 different DBs).
Just forget it - primitive, performance wasn't great even compared to classic RDBMSs, and so on => it cannot be used in a real-world scenario, but it's interesting as an experiment.
> By default, ClickHouse Keeper provides the same guarantees as ZooKeeper (linearizable writes, non-linearizable reads). It has a compatible client-server protocol, so any standard ZooKeeper client can be used to interact with ClickHouse Keeper. Snapshots and logs have an incompatible format with ZooKeeper, but clickhouse-keeper-converter tool allows to convert ZooKeeper data to ClickHouse Keeper snapshot. Interserver protocol in ClickHouse Keeper is also incompatible with ZooKeeper so mixed ZooKeeper / ClickHouse Keeper cluster is impossible.