Lesser known features of ClickHouse (pradeepchhetri.xyz)
158 points by pradeepchhetri on May 31, 2022 | 33 comments



Here's one not mentioned in the article (currently an experimental feature gated by a toggle): ClickHouse can consume a Postgres logical replication stream. Instead of paying for Stitch/Fivetran/etc, you can get OLTP->OLAP replication for free, with no additional software. It's not as flexible as the commercial stuff, and it has several sharp edges, but it's still very useful and an amazing thing to have built in.
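For anyone curious, a minimal sketch of what that looks like (host, database, and credentials are placeholders, and the Postgres side needs wal_level=logical):

    -- enable the experimental feature
    SET allow_experimental_database_materialized_postgresql = 1;

    -- mirror selected tables from Postgres via logical replication
    CREATE DATABASE pg_mirror
    ENGINE = MaterializedPostgreSQL('pg-host:5432', 'shopdb', 'replicator', 'secret')
    SETTINGS materialized_postgresql_tables_list = 'orders,customers';

    -- the replicated tables are then queryable like ordinary ClickHouse tables
    SELECT count() FROM pg_mirror.orders;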


In the clickhouse-local example, you can omit the --structure argument and it will still work, thanks to automatic schema inference.
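For example, something like this should work inside clickhouse-local with no --structure flag at all (the file name is a placeholder); both the format and the column types are inferred:

    -- show the inferred schema of the file
    DESCRIBE file('events.csv');

    -- query it directly; column names come from the CSV header
    SELECT count()
    FROM file('events.csv', CSVWithNames);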


I recommend writing a follow-up on these topics:

1. The clickhouse-git-import tool: what it is and how to use it.
2. ClickHouse as a monitoring agent with the system.asynchronous_metric_log table.
3. Hosting static MergeTree tables on any web server.


Thank you for your response and for going through the blog. I am curious about how to implement

2. ClickHouse as a monitoring agent with the system.asynchronous_metric_log table.

Can you share some links about it?


About the system.asynchronous_metric_log table.

It's very simple. Every ClickHouse instance, even an empty one, collects system-wide monitoring information. You can query this table directly from a Grafana dashboard (for example).
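A minimal sketch of the kind of query a Grafana panel could run against it (the metric filter is just an example; the available metrics vary by version):

    SELECT
        event_time,
        metric,
        value
    FROM system.asynchronous_metric_log
    WHERE metric LIKE 'LoadAverage%'
      AND event_time >= now() - INTERVAL 1 HOUR
    ORDER BY event_time;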


Good idea, makes sense. Thank you!


>3. Hosting static MergeTree tables on any web server.

Can you share some docs on that? AFAIK CH can use web-server-hosted data sources, but I don't recall CH being able to host MergeTree tables on HTTP servers.



We have a use case for wide columnar data, used mostly for performance analytics. There are many types of events that share the same columns, but the total number of unique columns is roughly 20x that of a typical event. My use case is filtering with some boolean logic and aggregating the results. I use Elasticsearch with a mapping that does not tokenize text fields (all strings are the keyword type) and it works very well. I can add nodes as I want, adjust shards with ease, and use spot instances to make it cost effective.

I have not found a way/tool to replace this. Many of the tools fail at dynamic, high-cardinality data. I wanted to use ClickHouse like this, adding columns as they are discovered, but it did not go well (though it has been a while). Also, adding replicas is not as easy as in Elasticsearch.

Does anyone have a similar use case implemented with ClickHouse, where the data is not known beforehand and unfolds as time goes by?


Thanks for sharing this. It is a very interesting problem that highlights some of the technical challenges of working with modern event data, which happens to 'prefer' being semi-structured (i.e. JSON is the most natural serialization format when creating events).

It's also something we're working on! Shameless plug - I happen to work at Sneller (sneller.io, open source at https://github.com/SnellerInc/sneller) that might be interesting to you.

A couple of key ideas - first, we bypass the need for any sort of 'semi-structured to relational' ETL/ELT overhead by running vectorized SQL on a (compressed) binary form of the JSON data which preserves its original structure. So we're schema-on-read first and foremost - you don't need to worry about adding new fields in the source JSON as long as your queries know of these new fields.

Second, we completely separate storage from compute. Unlike CH we don't use local disk as any sort of storage tier, and use cloud object stores as our _primary_ storage tier. So all your data (including the compressed binary version of your source JSON) lives in s3 buckets in your control.

Feel free to check us out and let us know what you think!

1. Github - https://github.com/SnellerInc/sneller

2. Intro blog - https://github.com/SnellerInc/blogs/blob/main/introducing-sn...


I think you should give PostgreSQL's intarray extension a try, especially the "@@" query_int boolean-logic queries over GIN-indexed columns.
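Roughly like this, assuming the attributes are dictionary-encoded into integer ids (table and column names are made up):

    CREATE EXTENSION IF NOT EXISTS intarray;

    CREATE TABLE events (
        id      bigserial PRIMARY KEY,
        tag_ids int[]               -- attribute values encoded as integer ids
    );

    CREATE INDEX events_tag_ids_gin ON events USING gin (tag_ids gin__int_ops);

    -- boolean logic over the array: (1 AND 2) OR 3
    SELECT count(*)
    FROM events
    WHERE tag_ids @@ '(1 & 2) | 3'::query_int;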


We at Dassana[1] use ClickHouse underneath and solve the exact same problem. Have a look, and feel free to reach out to me if you have any questions. You can find more details in my Show HN post[2].

[1] https://lake.dassana.io

[2] https://news.ycombinator.com/item?id=31111432


CH approach: Load the dynamic data into a JSON field and use materialized columns when needed for performance beyond calling JSONExtract().
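A small sketch of that pattern (the table layout and field names are hypothetical):

    CREATE TABLE events
    (
        ts      DateTime,
        raw     String,   -- the original JSON payload
        -- hot field extracted once at insert time instead of on every query
        user_id UInt64 MATERIALIZED JSONExtractUInt(raw, 'user_id')
    )
    ENGINE = MergeTree
    ORDER BY ts;

    -- colder, ad-hoc fields stay reachable without any schema change
    SELECT JSONExtractString(raw, 'country') AS country, count()
    FROM events
    GROUP BY country;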


The "Object" data type works very well and is quite performant. But do note that if you have lots of "schemas" you will need to create multiple such columns, e.g. one column each for GitHub, GitLab, etc.
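For example, something along these lines (the setting is required while the type is experimental; the field names are illustrative):

    SET allow_experimental_object_type = 1;

    CREATE TABLE events
    (
        ts     DateTime,
        source LowCardinality(String),  -- 'github', 'gitlab', ...
        github JSON,                    -- one Object('json') column per event family
        gitlab JSON
    )
    ENGINE = MergeTree
    ORDER BY ts;

    -- subcolumns are addressed with dot notation
    SELECT github.repo.name AS repo, count()
    FROM events
    WHERE source = 'github'
    GROUP BY repo;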


> And your materialised view/JsonExtract can be dispatched by schema name for a row.

What would that look like? MatViews in CH are like insert triggers, so does that mean one MatView per schema?

> The only suboptimal part I see is if different schemas have different types for the same field and it has to convert them to String.

Agreed, this is what I was actually after, and it does happen with system data sometimes: you can have a "user" field as an int in one schema and a string in another.


Is it really the case? Assume you have a column with the schema name and one with a JSON object. And your materialised view/JsonExtract can be dispatched by schema name for a row.

The only suboptimal part I see is if different schemas have different types for the same field and it has to convert them to String.
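A hedged sketch of that dispatch, with one materialized view per schema (table and field names are made up):

    CREATE TABLE raw_events
    (
        schema_name LowCardinality(String),
        body        String
    )
    ENGINE = MergeTree
    ORDER BY tuple();

    CREATE TABLE github_events
    (
        username String,
        repo     String
    )
    ENGINE = MergeTree
    ORDER BY username;

    -- only rows tagged with this schema are routed into the flattened table
    CREATE MATERIALIZED VIEW github_events_mv TO github_events AS
    SELECT
        JSONExtractString(body, 'user') AS username,
        JSONExtractString(body, 'repo') AS repo
    FROM raw_events
    WHERE schema_name = 'github';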


Uber has a blog post about using a hybrid static column/array approach for log files that might be similar to your use case (they replaced Elasticsearch installations) https://eng.uber.com/logging/


You may check https://github.com/jitsucom/jitsu: "Jitsu is an open-source Segment alternative. Fully-scriptable data ingestion engine for modern data teams. Set-up a real-time data pipeline in minutes, not days"

You can create an API endpoint and send those JSON events to it. In the "destination" part, it can sync to ClickHouse (one of many choices, alongside Redshift, Snowflake, etc.) very quickly and flatten the JSON into columns. If a new key is found in the JSON, it will create a new column in ClickHouse.


A very good article. Can anyone recommend more like this to learn more about ClickHouse?



https://altinity.com/blog

(Disclaimer: I work at Altinity.)


Kind of a related question: we really like ClickHouse but want a Grafana alternative, or at least an alternative to the Vertamedia adapter. The adapter burns way more resources than ClickHouse itself.



We maintain the Vertamedia adapter at Altinity and have hundreds of dashboards ourselves, plus OSS users have countless thousands more. Can you log an issue so we can look at it? I've never heard of this problem but would love to learn more about it.

https://github.com/Altinity/clickhouse-grafana/issues


The Vertamedia ClickHouse plugin works perfectly for me. Have you tried opening an issue upstream?


I don't know what ClickHouse is, but that personal site is really awesome.


The best feature of Clickhouse is replacing entire data engineering and analytics teams and their “modern data platform” with one person and one application.


How’s that? How does ClickHouse replace ETLs? How does it alleviate data quality problems? It is just a data warehouse: very fast, but with limited data modelling and permissioning compared to MS OLAP or tabular models.


I'm not really disagreeing with him, but you can use dbt to materialise 10 daisy-chained views, or hit ClickHouse with a query and let it crunch the data.

I am a proponent of the modern data stack, but it’s always at the back of my mind that an OLAP database on a big box negates the need for a lot of it.


Sure it does, but you have to feed it somehow and maintain it so that the data there is of good quality and makes business sense. AFAIK, ClickHouse is not great at the multi-table joins required by good data warehousing practices like star and snowflake schemas.


CH is actually very good if you can model the dimensional data as a dictionary. CH can also use an external store like PostgreSQL as a dictionary source. But yes, if your dimensional data is as big as your fact table, CH or any other such system will be quite slow.
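A rough sketch of the second point, with a dimension kept in Postgres and exposed to CH as a dictionary (connection details, the dimension table, and the fact_sales table are all placeholders):

    CREATE DICTIONARY dim_customer
    (
        customer_id   UInt64,
        customer_name String
    )
    PRIMARY KEY customer_id
    SOURCE(POSTGRESQL(
        host 'pg-host' port 5432 user 'reader' password 'secret'
        db 'dwh' table 'dim_customer'
    ))
    LAYOUT(HASHED())
    LIFETIME(MIN 300 MAX 600);

    -- star-schema style lookup against a (hypothetical) fact table
    SELECT
        dictGet('dim_customer', 'customer_name', customer_id) AS customer,
        sum(amount) AS total
    FROM fact_sales
    GROUP BY customer;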


ETL: Clickhouse can probably directly read any format or data source you have.

Quality: a diligent human, like anywhere else.

Data Modeling: I disagree; what do you think is missing? I’ve never encountered anything I couldn’t achieve, and the ClickHouse-specific aggregates, lambda support, and array functionality really simplify many problems.

Permissioning: It has very granular permissions.
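On the ETL point, a few hedged examples of reading external data in place (URLs, bucket paths, and credentials are placeholders):

    -- read a remote file over HTTP
    SELECT count()
    FROM url('https://example.com/exports/events.json', 'JSONEachRow');

    -- read Parquet files straight from object storage
    SELECT *
    FROM s3('https://my-bucket.s3.amazonaws.com/data/*.parquet', 'Parquet')
    LIMIT 10;

    -- query a live Postgres table without copying it first
    SELECT *
    FROM postgresql('pg-host:5432', 'appdb', 'users', 'reader', 'secret')
    LIMIT 10;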


If you use the "Object" data type (JSON), you can have a schema-less datastore, which alleviates the need for complex ETL pipelines. But yes, if you need to decorate the data before ingestion, you still need some ETL pipeline. In other words, if the ETL pipeline is just storing the data in columnar format by reading some JSON fields, you don't need it anymore with the new CH JSON data type.
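A minimal sketch of that, assuming the experimental Object type is enabled (everything here is illustrative):

    SET allow_experimental_object_type = 1;

    CREATE TABLE raw_events (event JSON) ENGINE = MergeTree ORDER BY tuple();

    -- rows are ingested as-is; no up-front column definitions
    INSERT INTO raw_events VALUES ('{"type": "page_view", "user": {"id": 42}}');

    -- a later event with a brand-new key needs no ALTER TABLE
    INSERT INTO raw_events VALUES ('{"type": "click", "page": "/pricing"}');

    SELECT event.type, event.user.id, event.page
    FROM raw_events;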



