
I think it's an unfair comparison, notably because:

1) Clickhouse is rigid-schema + append-only - you can't simply dump semi-structured data (csv/json/documents) into it and worry about schema (index definition) + querying later. The only clickhouse integration I've seen up close had a lot of "json" blobs in it as a workaround, which cannot be queried with the same ease as in ES.
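For reference, the "json blob" workaround usually looks something like this sketch (table and field names are made up for illustration): a raw `String` column holding JSON, with fields extracted per query using ClickHouse's JSON functions.

```sql
-- Hypothetical table: semi-structured events stored as a raw JSON string
CREATE TABLE events
(
    ts      DateTime,
    message String  -- the "json blob" workaround
)
ENGINE = MergeTree()
ORDER BY ts;

-- Fields must be extracted at query time, unlike ES where they are
-- parsed and indexed on ingest
SELECT
    JSONExtractString(message, 'user_id')  AS user_id,
    JSONExtractInt(message, 'status_code') AS status
FROM events
WHERE JSONExtractString(message, 'path') LIKE '/api/%';
```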

2) Clickhouse scalability is not as simple/documented as elasticsearch. You can set up a 200-node ES cluster with a relatively simple helm config or readily-available cloudformation recipe.

3) Elastic is more than elasticsearch - kibana and the "on top of elasticsearch" featureset is pretty substantial.

4) Every language/platform under the sun (except powerbi... god damnit) has native + mature client drivers for elasticsearch, and you can fall back to bog-standard http calls for querying if you need/want. ClickHouse supports some very elementary SQL primitives ("ANSI") and even those have some gotchas and are far from drop-in.

In this manner, I think that clickhouse is better compared as a self-hosted alternative to Aurora and other cloud-native scalable SQL databases, and less a replacement for elasticsearch. If you're using Elasticsearch for OLAP, you're probably better off ETLing the semi-structured/raw data you specifically want out of ES into a more suitable database which is meant for that.



I address your concern from #1 in "2. Flexible schema - but strict when you need it" section - take a look at https://www.youtube.com/watch?v=pZkKsfr8n3M&feature=emb_titl...

Regarding #2: Clickhouse scalability is not simple, but Elasticsearch scalability is not that simple either - they just ship it out of the box, while in Clickhouse you have to use Zookeeper for it. I agree that for 200 nodes ES may be a better choice, especially for full-text search. For 5 nodes of 10 TB of logs data I would choose Clickhouse.

#3 is totally true. I mention it in "Cons" section - Kibana and ecosystem may be a deal breaker for a lot of people.

#4. Clickhouse in 2021 has pretty good support in all major languages. And it can talk HTTP, too.


Hi! Assuming you are the author of the PixelJets article, would you consider submitting a talk to the Percona Live Online 2021 Conference? It's all about open source databases. We're doing an analytics track and welcome submissions on any and all solutions based on open source analytic databases. CFP runs through 14 March.

p.s., Everyone is welcome! If you see this and have a story please consider submitting. No marketing please. We are DBMS geeks.

https://altinity.com/blog/call-for-papers-on-analytics-at-pe...


Thank you for the invitation! I will definitely consider submitting my story.


You might be able to just put whatever you want into an Elasticsearch index, but I wouldn't recommend doing that. It could severely limit how you can query your data later; see: https://www.elastic.co/guide/en/elasticsearch/reference/curr...

Also, it can cause performance problems if you have really heterogeneous data with lots of different fields: https://www.elastic.co/guide/en/elasticsearch/reference/curr...
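One common mitigation is to stop relying on fully dynamic mappings. A sketch of an index template that rejects documents with unmapped fields (the template name, index pattern, and field names here are illustrative; assumes Elasticsearch 7.8+ composable index templates):

```json
PUT _index_template/logs-template
{
  "index_patterns": ["logs-*"],
  "template": {
    "mappings": {
      "dynamic": "strict",
      "properties": {
        "timestamp": { "type": "date" },
        "app":       { "type": "keyword" },
        "message":   { "type": "text" }
      }
    }
  }
}
```

With `"dynamic": "strict"`, a document containing an unexpected field is rejected at index time instead of silently creating a new mapping.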


Yup, reading that comment, all I thought was exactly what I said in another comment here: it'll work great until it doesn't, and by then you'll suffer a lot working around it.

Same with scaling: scaling ES is super easy until you realize your index sizes aren't playing nicely with sharding or something, and you have to start working around that.

ClickHouse feels like it's targeting what most people end up using ES for. Comparing it to ES and talking about what's missing is kind of missing the point, imo.


I manage a fairly small ES cluster of 20 i3en.2xlarge instances that ingests data from 300+ apps. Yes, the only problem I see is field type collisions, and they happen occasionally.

Otherwise elastic doesn't require much operational time, maybe an hour a week.

You pretty much want to keep your indices around 50gb, and ILM works well to manage that.
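The ILM setup being referred to is roughly a rollover policy like this (policy name, sizes, and retention are examples, not the commenter's actual values):

```json
PUT _ilm/policy/logs-50gb-rollover
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "50gb", "max_age": "7d" }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}
```

Each write index rolls over to a fresh one once it hits ~50gb, keeping shard sizes predictable without manual intervention.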


What about thread write rejections and the max number of shards? If you don't take into consideration how much data you ingest and what format it should be in afterwards, and don't monitor that constantly:

“You gonna have a bad time”. You can automate a lot of stuff around elasticsearch, but when you provide/source it within a company, other teams may not be as knowledgeable and can shoot themselves in the foot very easily.

I've seen it multiple times by now. People have no idea how to manage the size of their clusters.


Anyone accepting freeform objects into a single ES index knows the pain of field type collisions.

But, most of the time, it "just works".

Hearing these arguments about rigid schemas saving time tells me that nobody here has had to support teams with 200+ apps.

^ This guy actually manages infra


I think the author addresses your point one in the article:

> SQL is a perfect language for analytics. I love SQL query language and SQL schema is a perfect example of boring tech that I recommend to use as a source of truth for all the data in 99% of projects: if the project code is not perfect, you can improve it relatively easily if your database state is strongly structured. If your database state is a huge JSON blob (NoSQL) and no-one can fully grasp the structure of this data, this refactoring usually gets much more problematic.

> I saw this happening, especially in older projects with MongoDB, where every new analytics report and every new refactoring involving data migration is a big pain.

They're arguing that using non-structured, or variable structured data is actually a developmental burden and the flexibility it provides actually makes log analysis harder.

It seems that the "json" blobs are a symptom of the problem, not the cause of it.


I disagree with the author on that.

Yes, SQL is nicer for structured queries, sure (“KQL” in Kibana is sort of a baby step into querying data stored in Elastic).

But in Kibana, I can just type in (for example) a filename, and it will return any result row where that filename is part of any column of data.

Also, if I need more structured results (for example, HTTP responses by an API grouped per hour per count), I can pretty easily do a visualization in Kibana.

So yes, for 5% of use cases regarding exposing logging data, an SQL database of structured log events is preferred or necessary. For the other 95%, the convenience of just dumping files into Elastic makes it totally worth it.


Agreed here. More and more data is semi structured and can benefit from ES (or mongo) making it easily exploitable. It's a big part of why logstash and elastic came to be.

One of the most beautiful use cases I've ever seen for elasticsearch was a custom nginx access log format in json (with nearly every possible field you could want), logged directly over the network (syslog in nginx over udp) to a fluentd server set up to parse that json + host and timestamp details before bulk inserting into elastic.

You could spin up any nginx container or vm with those two config lines and every request would flow over the network (no disk writes needed!) and get logged centrally with the hostname automatically tagged. It was doing 40k req/s on a single fluentd instance when I saw it last and you could query/filter every http request in the last day (3+bn records...) in realtime.
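The "two config lines" on the nginx side would be roughly the following sketch (the log format is heavily abbreviated - the setup described logged far more fields - and the fluentd hostname/port are made up for illustration):

```nginx
# Define a JSON access-log format (abbreviated; the real one had many more fields)
log_format json_log escape=json '{"time":"$time_iso8601","host":"$hostname",'
                                '"request":"$request","status":"$status"}';

# Ship every request over UDP syslog to fluentd - no local disk writes
access_log syslog:server=fluentd.internal:5140 json_log;
```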

Reach out to datadog and ask how much they would charge for 100bn log requests per month.


That argument would apply to production backend databases, but I don't see how it really applies to logs. It's like they just copied and pasted a generic argument about structured data without taking the context into account.

Logs tend to be rarely read but often written. They also age very quickly, and old logs are very rarely read. So putting effort into unifying the schemas on write seems very wasteful versus doing so on read. Most of the queries are also text search rather than structured requests, so the chance of missing something on read due to bad unification is very low.


I'm the author of at least one of the ClickHouse video presentations referenced in the article as well as here on HN. ElasticSearch is a great product, but three of your points undersell ClickHouse capabilities considerably.

1.) ClickHouse JSON blobs are queryable and can be turned into columns as needed. The Uber engineering team posted a great write-up on their new log management platform, which uses these capabilities at large scale. One of the enabling ClickHouse features is ALTER TABLE commands that just change metadata, so you can extend schema very efficiently. [1]
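The pattern described - extending the schema cheaply and materializing fields out of a JSON blob - looks roughly like this sketch (table and column names are hypothetical):

```sql
-- Adding a column with a DEFAULT expression only changes table metadata;
-- existing data parts are not rewritten - the value is computed on read
-- for old rows until parts are merged
ALTER TABLE logs
    ADD COLUMN status_code UInt16
    DEFAULT JSONExtractUInt(message, 'status_code');
```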

2.) With reference to scalability, the question is not what it takes to get 200 nodes up and running but what you get from them. ClickHouse typically gets better query results on log management using far fewer resources than ElasticSearch. ContentSquare did a great talk on the performance gains including 10x speed-up in queries and 11x reduction in cost. [2]

3.) Kibana is excellent and well-liked by users. Elastic has done a great job on it. This is an area where the ClickHouse ecosystem needs to grow.

4.) This is just flat-out wrong. ClickHouse has a very powerful SQL implementation that is particularly strong at helping to reduce I/O, compute aggregations efficiently, and solve specific use cases like funnel analysis. It has the best implementation of arrays of any DBMS I know of. [3] Drivers are maturing rapidly, but to be honest it's so easy to submit queries via HTTP that you don't need a driver for many use cases. My own team does that for PHP.
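As one example, the funnel analysis mentioned above can be expressed with ClickHouse's `windowFunnel` aggregate function (the table schema and event names here are illustrative):

```sql
-- How far each user gets through a view -> cart -> purchase funnel
-- within a 1-hour window (3600 seconds)
SELECT
    user_id,
    windowFunnel(3600)(
        event_time,
        event = 'view_product',
        event = 'add_to_cart',
        event = 'purchase'
    ) AS funnel_level
FROM events
GROUP BY user_id;
```

`funnel_level` returns the deepest consecutive step each user reached, which would otherwise take multiple self-joins in standard SQL.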

I don't want to take away anything from Elastic's work--ElasticSearch and the ecosystem products are great, as shown by their wide adoption. At the same time ClickHouse is advancing very quickly and has much better capabilities than many people know.

p.s., As far as ANSI capability, we're working on TPC-DS and have ClickHouse running at full steam on over 60% of the cases. That's up from 15% a year ago. We'll have more to say on that publicly later this year.

[1] https://eng.uber.com/logging/

[2] https://www.slideshare.net/VianneyFOUCAULT/meetup-a-successf...

[3] https://altinity.com/blog/harnessing-the-power-of-clickhouse...

p.s., I'm CEO of Altinity and work on ClickHouse, so usual disclaimers.


Thank you for what you guys do. Altinity blog and videos are an outstanding source of practical in-depth knowledge on the subject, so much needed for Clickhouse recognition.


You are most welcome. The webinars and blog articles are incredibly fun to work on.


> you can't simply dump semi-structured data (csv/json/documents) into it and worry about schema (index definition) + querying later

Unless you love rewrites, you can't simply dump semi-structured data into ElasticSearch either. Seen multiple apps with 5x or worse ES storage usage tied to 'data model' or lack thereof, and fixing it inevitably means revisiting every piece of code pushing stuff into and out of ES.

I love ES but this notion of schema free is dumb, in practice it's a nightmare.


Imagine trying to make the argument that forcing your developers/clients to send all their telemetry on fixed/rigid schemas to make it immediately queryable is quicker than updating 1 line of an ETL script on the data warehouse side. That adding a new queryable field to your event now requires creating migration scripts for the databases and API versioning for the services so things don't break and old clients can continue using the old schema. Imagine making a central telemetry receiver that needs to support 200+ different external apps/clients, most of them under active development - adding new events and extending existing ones - and being released several times per day. What's the alternative you're proposing? Just put it in a json column and write extractors in the databases every time you want to analyze a field? I've seen this design pattern often enough in MSSQL servers with stored procedures... Talk to me about painful rewrites.

I'll take semi-structured events parsed and indexed by default during ingestion over flat logs + rigid-schema events any day. When you force developers to log into a rigid schema, you get json blob fields, or "extra2" db fields, or perhaps worst of all, no data at all, since it's such a pain in the ass to instrument new events.

We're talking about sending, logging and accessing telemetry. The goal is to "see it" and make it accessible for simple querying and analysis - in realtime ideally, and without a ticket to data engineering.

ES type inference and wide/open schema with blind json input is second to none as far as simplicity of getting data indexed goes. There are tradeoffs with the defaults, such as putting in lots of text that you don't need to full-text search - you might want to tell ES that it doesn't need to parse every word into an index if you don't want to burn extra cpu and storage for nothing. This is one line of config at the cluster level and can be changed seamlessly while running and ingesting data.
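That "one line of config" is essentially a dynamic template mapping incoming strings to `keyword` (exact-match only) instead of analyzed `text` - a sketch with illustrative names:

```json
PUT logs-000001
{
  "mappings": {
    "dynamic_templates": [
      {
        "strings_as_keywords": {
          "match_mapping_type": "string",
          "mapping": { "type": "keyword" }
        }
      }
    ]
  }
}
```

New string fields still get indexed and are filterable/aggregatable, but ES skips the per-word analysis that full-text search requires, saving CPU and storage.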

I guarantee you there is more semi-structured data in the world than rigid-schema data, and for one simple reason: it's quicker to generate. The only argument against it has thus far been "yeah, but then it's difficult to parse and make it queryable again" - and suddenly you've come full circle to the reason elasticsearch exists and shines (extended further on both ends by logstash and kibana).

I'm not saying it makes sense to do away with schemas everywhere but for logging and telemetry - of any that you actually care to analyze anyway - there is rarely a reason to go rigid schema on the accepting or processing side since you'll be working with, in the vast majority of cases, semi-structured data.

Changing ES index mappings on the fly is trivial; you can do it with as much ease as an alter table on clickhouse, and you have the luxury of doing it optimistically and after the fact, once your data/schema has stabilized.

Rewriting the app to accommodate this should never be required unless you really don't know how to use indexing and index mapping in ES. You would, however, have to make changes to your app and/or database and/or ETL every time you wanted to add a new queryable field to your rigid-schema masterpiece.

Ultimately, applications have always and will always generate more data than will be analyzed, so saving development time on the generating end (by accepting any semi-structured data without first having to define a rigid schema) is more valuable than saving it on the end that parses a subset of that data. Having to involve a data team to deploy an alter table so you can query a field from your json doesn't sound like the hallmark of agile self-serve. I also believe strongly and fundamentally that encouraging product teams and their developers to send and analyze as much telemetry as their hearts desire and their DPOs agree to, without worrying about the relatively trivial cost of parsing and storing it, will always come out on top vs creating operational complexity around the same. Maybe if you have a small team logging billions of heavy, never-changing events that will seldom get queried, it would tip the scales in favor of a rigid schema. I counter: you don't need telemetry, you need archiving.

On the subject of pure compute and storage/transfer efficiency: yes, both rigid schemas and processing-by-exception will win here every time as far as cycles and bits go. But rarely is the inefficiency of semi-structured data so high that it merits handicapping an entire engineering org into dancing around rigid schemas to get their telemetry accepted and into a dashboard.

I hear you, platform ops teams... "But the developers will send big events! There will be lots of data that we're parsing and indexing for nothing!" Ok - so add a provision to selectively ignore those? Maybe tell the offender to stop doing it? On the rare occasion that this happens (I've seen 1 or 2 events out of 100s in my anecdotal experience) you may require some human intervention. Compare that labor requirement to the proposed system where human intervention is required every time somebody wants to look at their fancy new field.

In practice, I've not seen it be a nightmare unless you've got some very bad best practices on the ingestion or indexing side - both of which are easily remedied without changing much if anything outside of ES.

I think clickhouse is pretty cool, but it's not handling anywhere near the constraints that ES does, even without logstash and kibana. ES is also getting faster and more efficient at ingestion/parsing with every release - releases that seem to be coming faster and faster these days.


> In this manner, I think that clickhouse is better compared as a self-hosted alternative to Aurora and other cloud-native scalable SQL databases, and less a replacement for elasticsearch.

Neither of which is normally used for logging.

I am glad there are some alternatives to ELK. Elasticsearch is great, but it's not as great when you have to ingest terabytes of logs daily. You can do it, but at a very large resource cost (both computing and human). Managing shards is a headache with the logging use-case.

Most logs don't have that much structure. A few fields, sure. For this, Elasticsearch is not only overkill, but also not very well suited. This is the reason why placing Kafka in front of Elasticsearch for ingestion is rather popular.


> Elastic is more than elasticsearch...

Grafana Labs' sponsored FOSS projects are probably an adequate replacement for Elasticsearch? https://grafana.com/oss/

> ...clickhouse is better compared as a self-hosted alternative to Aurora and other cloud-native scalable SQL databases

Aurora would likely be less suited to this than RedShift or Snowflake.


> 3) Elastic is more than elasticsearch - kibana and the "on top of elasticsearch" featureset is pretty substantial.

Kibana is just messy. Their demos don't show any actionable intelligence but just dump data in various ways, and the interface doesn't look focused. It feels painful to deal with daily.



