ClickHouse as an alternative to Elasticsearch for log storage and analysis (pixeljets.com)
383 points by jetter on March 2, 2021 | 134 comments



Also wanted to share my overall positive experience with Clickhouse.

UPSIDES

* started a 3-node cluster using the official Docker images super quickly

* ingested billions of rows super fast

* great compression (of course, depends on your data's characteristics)

* features like https://clickhouse.tech/docs/en/engines/table-engines/merget... are amazing to see

* ODBC support. I initially said "Who uses that??", but we used it to connect to PostgreSQL, so we can keep the non-timeseries data in PostgreSQL and still access those PostgreSQL tables from Clickhouse (!) (rough sketch after this list)

* you can go the other way too: read Clickhouse from PostgreSQL (see https://github.com/Percona-Lab/clickhousedb_fdw, although we didn't try this)

* PRs welcome, and quickly reviewed. (We improved the ODBC UUID support)

* code quality is pretty high.
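
To make the ODBC point above concrete, here is roughly what that looks like (the DSN, schema and table names are placeholders, and it's worth double-checking the exact function signature against the docs):

    -- read a PostgreSQL table from inside Clickhouse through the ODBC bridge
    SELECT c.id, c.name
    FROM odbc('DSN=postgres_dsn', 'public', 'customers') AS c
    WHERE c.id IN (1, 2, 3)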

DOWNSIDES

* limited JOIN capabilities, which is expected from a timeseries-oriented database like Clickhouse. It's almost impossible to implement JOINs at this kind of scale. The philosophy is "If it won't be fast at scale, we don't support it"

* not-quite-standard SQL syntax, but they've been improving it

* limited DELETE support, which is also expected from this kind of database; DELETEs are rarely needed in the kinds of environments that CH usually runs in anyway (how often do people delete data from ElasticSearch?)

It's really an impressive piece of engineering. Hats off to the Yandex crew.


I'd like to add an upside which is:

Totally great and simple on a single node.

I looked at a bunch of analytical databases, and a lot of them started with "so here's a basic 10 node cluster". Clickhouse installed and worked instantly, with no hassle, on decent but not "big" data. A hundred million rows with lots of heavy text blobs and a lot of columns, that kind of thing. It happily dealt with triple-nested joins over that, and didn't bat an eye at billions of entries in arrays on those columns.
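
Not my real schema, but to give a flavour of the array handling (table and column names are made up):

    -- arrays are first-class column types: has() filters on membership,
    -- arrayJoin() expands an array into one row per element
    SELECT id, arrayJoin(tags) AS tag
    FROM events
    WHERE has(tags, 'error')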

I'm sure I could do some great magic in postgres but naive work didn't give anywhere near the same results as clickhouse (obvious caveat for my workload).

Pretty good with JSON data too; my only issue there at the time (may have improved) was that you had to format the JSON quite strictly.


Thanks, this is just what I'm looking for.


Most minor of nitpicks:

> timeseries-oriented database

Technically it’s a column oriented database that is good at time series stuff. I only say that because I know there are some databases that are even more specialised towards timeseries and ClickHouse can do way more.


> It's really an impressive piece of engineering. Hats off to the Yandex crew.

And thousands of contributors! Toward the end of 2020 over 680 unique users had submitted PRs and close to 2000 had opened issues. It's becoming a very large community.


Could you share more details about the limited JOIN capabilities? AFAIK, Clickhouse has multiple join algorithms and supports on-disk joins to avoid running out of memory (relevant settings sketched below the links):

https://github.com/ClickHouse/ClickHouse/issues/10830

https://github.com/ClickHouse/ClickHouse/issues/9702#issueco...
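
For reference, the settings those issues revolve around look roughly like this (from memory, so worth double-checking against current docs):

    SET join_algorithm = 'partial_merge';  -- merge join instead of the default in-memory hash join
    SET max_bytes_in_join = 10000000000;   -- cap the memory a single join may use (0 = unlimited)
    SET join_overflow_mode = 'break';      -- or 'throw' when the limit is exceeded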


Maybe I’m not doing anything particularly challenging with it, but I’ve not found anything lacking with the join functionality.


One issue I've come across is that the query optimizer in Clickhouse does not propagate `where` clauses through a join. My terminology might be wrong, so consider this example:

    select * from a inner join b using (id) where b.foo = 'bar'

Clickhouse will not evaluate `foo = 'bar'` before performing the join, so you might wind up with a join that produces a large intermediate result before the filtering happens. Postgres (probably other databases) will optimize this for you. To force Clickhouse to filter first, you would need to write something like

    select * from a inner join ( select * from b where foo = 'bar' ) b using (id)

Maybe not a strict limitation, but the workaround is a bit janky.


Interesting, I swear I've done that before and had it respect the clause.

You can also use PREWHERE, I believe, which has the benefit of increasing performance.
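
For reference, a PREWHERE sketch (made-up names): Clickhouse reads just the PREWHERE column(s) first and only fetches the remaining columns for rows that pass the filter.

    SELECT *
    FROM b
    PREWHERE foo = 'bar'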


Can clickhouse deal with medium-large blob data? Say the size of a normal email?

We are using Postgres to store email at my app: https://hanami.run The log is append-only and gets scrubbed daily.

Can clickhouse deal with that? The queries are very simple: just an exact match on a single column (domain), plus pagination.


Don’t know how you feel about vendor lock in, but I use Dynamo to store many millions of email logs and it works great. Got tired of having that data in MySQL.


I think it's an unfair comparison, notably because:

1) Clickhouse is rigid-schema + append-only - you can't simply dump semi-structured data (csv/json/documents) into it and worry about schema (index definition) + querying later. The only clickhouse integration I've seen up close had a lot of "json" blobs in it as a workaround, which cannot be queried with the same ease as in ES.

2) Clickhouse scalability is not as simple/documented as elasticsearch. You can set up a 200-node ES cluster with a relatively simple helm config or readily-available cloudformation recipe.

3) Elastic is more than elasticsearch - kibana and the "on top of elasticsearch" featureset is pretty substantial.

4) Every language/platform under the sun (except powerbi... god damnit) has native + mature client drivers for elasticsearch, and you can fall back to bog-standard http calls for querying if you need/want. ClickHouse supports some very elementary SQL primitives ("ANSI") and even those have some gotchas and are far from drop-in.

In this manner, I think that clickhouse is better compared as a self-hosted alternative to Aurora and other cloud-native scalable SQL databases, and less a replacement for elasticsearch. If you're using Elasticsearch for OLAP, you're probably better off ETLing the specific semi-structured/raw data you want out of ES into a more suitable database which is meant for that.


I address your concern from #1 in the "2. Flexible schema - but strict when you need it" section - take a look at https://www.youtube.com/watch?v=pZkKsfr8n3M&feature=emb_titl...

Regarding #2: Clickhouse scalability is not simple, but I think Elasticsearch scalability is not that simple either; they just have it out of the box, while in Clickhouse you have to set up Zookeeper for it. I agree that for 200 nodes ES may be a better choice, especially for full text search. For 5 nodes holding 10 TB of log data I would choose Clickhouse.

#3 is totally true. I mention it in the "Cons" section - Kibana and the ecosystem may be a deal breaker for a lot of people.

#4. Clickhouse in 2021 has pretty good support in all major languages. And it can talk HTTP, too.


Hi! Assuming you are author of the PixelJets article would you consider submitting a talk to the Percona Live Online 2021 Conference? It's all about open source databases. We're doing an analytics track and welcome submissions on any and all solutions based on open source analytic databases. CFP runs through 14 March.

p.s., Everyone is welcome! If you see this and have a story please consider submitting. No marketing please. We are DBMS geeks.

https://altinity.com/blog/call-for-papers-on-analytics-at-pe...


Thank you for the invitation! Will definitely consider submitting my story.


You might be able to just put whatever you want into an Elasticsearch index, but I wouldn't recommend doing that. It could severely limit how you can query your data later, see: https://www.elastic.co/guide/en/elasticsearch/reference/curr...

Also it can cause performance problems if you have really heterogeneous data with lots of different fields https://www.elastic.co/guide/en/elasticsearch/reference/curr...


Yup, reading that comment, all I thought was exactly what I said in another comment here: it'll work great until it doesn't, and by then you'll suffer a lot to work around it.

Same with scaling, scaling ES is super easy until you realize your index sizes aren't playing nicely with sharding or something and have to start working around that.

Clickhouse feels like it's targeting what most people end up using ES for. Comparing it to ES and talking about what's missing is kind of missing the point imo.


I manage a fairly small ES cluster of 20 i3en.2xlarge instances that ingest data from 300+ apps. Yes, the only problem I see is the field type collision and it happens occasionally.

Otherwise elastic doesn't require much operational time, maybe an hour a week.

You pretty much want to keep your indices around 50 GB, and ILM works well to manage that.


What about thread pool write rejections and the max number of shards? If you don't take into consideration how much data you ingest and what format it should be in afterwards, and monitor that constantly:

“You gonna have a bad time”. You can automate a lot of stuff around elasticsearch, but when you provide/source it within a company, other teams may not be as knowledgeable and can shoot themselves in the foot very easily.

I've seen it multiple times by now. People have no idea how to manage the size of their clusters.


Anyone accepting freeform objects into a single ES index knows the pain of field type collisions

But, most of the time, it "just works".

Hearing these arguments about rigid schemas saving time tells me that nobody has had to support teams with 200+ apps.

^ This guy actually manages infra


I think the author addresses your point one in the article:

> SQL is a perfect language for analytics. I love SQL query language and SQL schema is a perfect example of boring tech that I recommend to use as a source of truth for all the data in 99% of projects: if the project code is not perfect, you can improve it relatively easily if your database state is strongly structured. If your database state is a huge JSON blob (NoSQL) and no-one can fully grasp the structure of this data, this refactoring usually gets much more problematic.

> I saw this happening, especially in older projects with MongoDB, where every new analytics report and every new refactoring involving data migration is a big pain.

They're arguing that using non-structured or variably structured data is actually a developmental burden, and that the flexibility it provides actually makes log analysis harder.

It seems that the "json" blobs are a symptom of the problem, not the cause of it.


I disagree with the author on that.

Yes, SQL is nicer for structured queries, sure (“KQL” in Kibana is sort of a baby step into querying data stored in Elastic).

But in Kibana, I can just type in (for example) a filename, and it will return any result row where that filename is part of any column of data.

Also, if I need more structured results (for example, HTTP responses by an API grouped per hour per count), I can pretty easily do a visualization in Kibana.

So yes, for 5% of use cases regarding exposing logging data, an SQL database of structured log events is preferred or necessary. For the other 95%, the convenience of just dumping files into Elastic makes it totally worth it.


Agreed here. More and more data is semi structured and can benefit from ES (or mongo) making it easily exploitable. It's a big part of why logstash and elastic came to be.

One of the most beautiful use cases I've ever seen for elasticsearch was custom nginx access log format in json (with nearly every possible field you could want), logged directly over the network (syslogd in nginx over udp) to a fluentd server setup to parse that json + host and timestamp details before bulk inserting to elastic.

You could spin up any nginx container or vm with those two config lines and every request would flow over the network (no disk writes needed!) and get logged centrally with the hostname automatically tagged. It was doing 40k req/s on a single fluentd instance when I saw it last and you could query/filter every http request in the last day (3+bn records...) in realtime.

Reach out to datadog and ask how much they would charge for 100bn log requests per month.


That argument would apply to production backend databases, but I don't see how it really applies to logs. It's like they just copy and pasted a generic argument about structured data without taking the context into account.

Logs tend to be rarely read but often written. They also age very quickly and old logs are very rarely read. So putting effort to unify the schemas on write seems very wasteful versus doing so on read. Most of the queries are also text search rather than structured requests so the chance of missing something on read due to bad unification is very low.


I'm the author of at least one of the ClickHouse video presentations referenced in the article as well as here on HN. ElasticSearch is a great product, but three of your points undersell ClickHouse capabilities considerably.

1.) ClickHouse JSON blobs are queryable and can be turned into columns as needed. The Uber engineering team posted a great write-up on their new log management platform, which uses these capabilities at large scale. One of the enabling ClickHouse features is ALTER TABLE commands that just change metadata, so you can extend schema very efficiently. [1]

2.) With reference to scalability, the question is not what it takes to get 200 nodes up and running but what you get from them. ClickHouse typically gets better query results on log management using far fewer resources than ElasticSearch. ContentSquare did a great talk on the performance gains including 10x speed-up in queries and 11x reduction in cost. [2]

3.) Kibana is excellent and well-liked by users. Elastic has done a great job on it. This is an area where the ClickHouse ecosystem needs to grow.

4.) This is just flat-out wrong. ClickHouse has a very powerful SQL implementation that is particularly strong at helping to reduce I/O, compute aggregations efficiently and solve specific use cases like funnel analysis. It has the best implementation of arrays of any DBMS I know of. [3] Drivers are maturing rapidly, but to be honest it's so easy to submit queries via HTTP that you don't need a driver for many use cases. My own team does that for PHP.
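
To make the HTTP point concrete, a minimal sketch (the query is just an example): the statement below can be POSTed as-is as the request body to http://your-clickhouse-host:8123/ (8123 is the default HTTP port), and the result comes back in whatever FORMAT you ask for (JSON, CSV, TabSeparated, ...).

    SELECT count()
    FROM system.tables
    FORMAT JSON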

I don't want to take away anything from Elastic's work--ElasticSearch and the ecosystem products are great, as shown by their wide adoption. At the same time ClickHouse is advancing very quickly and has much better capabilities than many people know.

p.s., As far as ANSI capability, we're working on TPC-DS and have ClickHouse running at full steam on over 60% of the cases. That's up from 15% a year ago. We'll have more to say on that publicly later this year.

[1] https://eng.uber.com/logging/

[2] https://www.slideshare.net/VianneyFOUCAULT/meetup-a-successf...

[3] https://altinity.com/blog/harnessing-the-power-of-clickhouse...

p.s., I'm CEO of Altinity and work on ClickHouse, so usual disclaimers.


Thank you for what you guys do. Altinity blog and videos are an outstanding source of practical in-depth knowledge on the subject, so much needed for Clickhouse recognition.


You are most welcome. The webinars and blog articles are incredibly fun to work on.


> you can't simply dump semi-structured data (csv/json/documents) into it and worry about schema (index definition) + querying later

Unless you love rewrites, you can't simply dump semi-structured data into ElasticSearch either. Seen multiple apps with 5x or worse ES storage usage tied to 'data model' or lack thereof, and fixing it inevitably means revisiting every piece of code pushing stuff into and out of ES.

I love ES but this notion of schema free is dumb, in practice it's a nightmare.


Imagine trying to make the argument that forcing your developers/clients to send all their telemetry on fixed/rigid schemas to make it immediately queryable is quicker than updating 1 line of an etl script on the data warehouse side. That adding a new queryable field to your event now requires creating migration scripts for the databases and api versioning for the services so things don't break and old clients can continue using the old schema. Imagine making a central telemetry receiver that needs to support 200+ different external apps/clients, with most under active development - adding new events and extending existing ones - being released several times per day. What's the alternative you're proposing? Just put it in a json column and make extractors in the databases every time you want to analyze a field? I've seen this design pattern often enough in MSSQL servers with stored procedures... Talk to me about painful rewrites.

I'll take semi-structured events parsed and indexed by default during ingestion over flat logs + rigid schema events any day. When you force developers to log into a rigid schema you get json blob fields or "extra2" db fields, or perhaps the worst of all, no data at all since it's such a pain in the ass to instrument new events.

We're talking about sending, logging and accessing telemetry. The goal is to "see it" and make it accessible for simple querying and analysis - in realtime ideally, and without a ticket to data engineering.

ES type-inferrence and wide/open schema with blind json input is second to none as far as simplicity of getting data indexed goes. There are tradeoffs with the defaults such as putting lots of text that you don't need to fulltext search - you might want to tell ES that it doesn't need to parse every word into an index if you don't want to burn extra cpu and storage for nothing. This is one line of config at the cluster level and can be changed seamlessly while running and ingesting data.

I guarantee you there is more semi-structured data in the world than rigid schema, and for one simple reason: It's quicker to generate. The only argument against it has thus far been "yeah but then it's difficult to parse and make it queryable again" and suddenly you've come full circle and you have the reason elasticsearch exists and shines (extended further on both ends by logstash and kibana).

I'm not saying it makes sense to do away with schemas everywhere but for logging and telemetry - of any that you actually care to analyze anyway - there is rarely a reason to go rigid schema on the accepting or processing side since you'll be working with, in the vast majority of cases, semi-structured data.

Changing ES index mappings on the fly is trivial, you can do it with as much ease as ALTER TABLE on clickhouse, and you have the luxury of doing it optimistically and after the fact, once your data/schema has stabilized.

Rewriting the app to accommodate this should never be required unless you really don't know how to use indexing and index mapping in ES. You would, however, have to make changes to your app and/or database and/or ETL every time you wanted to add a new queryable field to your rigid-schema masterpiece.

Ultimately, applications have always and will always generate more data than will be analyzed, so saving development time on that generating end (by accepting any semi-structured data without first having to define a rigid schema) is more valuable than saving it on the end that is parsing a subset of that data. Having to involve a data team to deploy an alter table so you can query a field from your json doesn't sound like the hallmark of agile self-serve. I also believe strongly and fundamentally that encouraging product teams and their developers to send and analyze as much telemetry as both their hearts desire and DPOs agree to, without worrying about the relatively trivial cost of parsing and storing it, will always come out on top vs creating operational complexity over the same. Maybe if you have a small team logging billions of heavy, never-changing events that will seldom get queried, it would tip the scales in favor of using a rigid schema. I counter: you don't need telemetry, you need archiving.

On that subject of pure compute and storage/transfer efficiency: Yes both rigid schema and processing-by-exception will win here every time as far as cycles and bits go. Rarely is the inefficiency of semi-structured so high that it merits handicapping an entire engineering org into dancing around rigid schemas to get their telemetry accepted and into a dashboard.

I hear you, platform ops teams... "But the developers will send big events! ! There will be lots of data that we're parsing and indexing for nothing!" Ok - so add a provision to selectively ignore those? Maybe tell the offender to stop doing it? On the rare occasion that this happens (I've seen 1 or 2 events out of 100s in my anecdotal experience) you may require some human intervention. Compare this labor requirement to the proposed system where human intervention is required every time somebody wants to look at their fancy new field.

In practice, I've not seen it be a nightmare unless you've got some very bad best practices on the ingestion or indexing side - both of which are easily remedied without changing much if anything outside of ES.

I think clickhouse is pretty cool, but it's not handling anywhere near the constraints that ES does, even without logstash and kibana. ES is also getting faster and more efficient at ingestion/parsing with every release - releases that seem to be coming faster and faster these days.


> In this manner, I think that clickhouse is better compared as a self-hosted alternative to Aurora and other cloud-native scalable SQL databases, and less a replacement for elasticsearch.

Neither of which is normally used for logging.

I am glad there are some alternatives to ELK. Elasticsearch is great, but it's not as great when you have to ingest terabytes of logs daily. You can do it, but at a very large resource cost (both computing and human). Managing shards is a headache with the logging use-case.

Most logs don't have that much structure. A few fields, sure. For this, Elasticsearch is not only overkill, but also not very well suited. This is the reason why placing Kafka in front of Elasticsearch for ingestion is rather popular.


> Elastic is more than elasticsearch...

Grafana Labs sponsored FOSS projects are probably an adequate replacement for Elasticsearch? https://grafana.com/oss/

> ...clickhouse is better compared as a self-hosted alternative to Aurora and other cloud-native scalable SQL databases

Aurora would likely be worse at this than RedShift or Snowflake.


> 3) Elastic is more than elasticsearch - kibana and the "on top of elasticsearch" featureset is pretty substantial.

Kibana is just messy. Their demos don't show any actionable intelligence, they just dump data in various ways, and the interface doesn't look focused. It feels painful to deal with it daily.


> SQL is a perfect language for analytics.

Slightly off topic, but I strongly agree with this statement and wonder why the languages used for a lot of data science work (R, Python) don't have such a strong focus on SQL.

It might just be my brain, but SQL makes so much logical sense as a query language and, with small variances, is used to directly query so many databases.

In R, why learn the data.tables (OK, speed) or dplyr paradigms, when SQL can be easily applied directly to dataframes? There are libraries to support this like sqldf[1], tidyquery[2] and duckdf[3] (author). And I'm sure the situation is similar in Python.

This is not a post against great libraries like data.table and dplyr, which I do use from time to time. It's more of a question about why SQL is not more popular as the query language du jour for data science.

[1] https://cran.r-project.org/web/packages/sqldf/index.html

[2] https://github.com/ianmcook/tidyquery

[3] https://github.com/phillc73/duckdf


SQL doesn't compose all that well.

For example, imagine that you have a complex query that handles a report. If someone says "hey we need the same report but with another filter on X," your options are to copy paste the SQL query with the change, create a view that can optionally have the filter (assuming the field that you'd want to filter on actually is still visible at the view level), or parse the SQL query into its tree form, mutate the tree, then turn it back into SQL.

If you're using something like dplyr, then it's just an if statement when building your pipeline. Dbplyr also will generate SQL for you out of dplyr statements, it's pretty amazing IMHO.


> SQL doesn't compose all that well.

On that topic, I really enjoy working in Elixir because Ecto [1] lets you write "SQL" with Elixir's composable functional syntax. It sits somewhere between "the language is compiled to SQL" and ORM. The Ruby-esque syntax took some getting used to, but once I was past that hurdle my productivity skyrocketed. It's not 100% feature complete compatibility with all the different SQL dialects, but most of what you'll need is there.

[1] https://github.com/elixir-ecto/ecto


In a similar vein, I really like the LINQ style queries in the Julia package, Query.jl[1].

[1] https://www.queryverse.org/Query.jl/stable/linqquerycommands...


> your options are to copy paste the SQL query with the change

There are things like dbt and LookML that bring templating and macros on top of SQL. It's much better than copy/pasting strings but worse than a full-fledged programming language.


I agree that dbplyr is a nice way to query databases, if already familiar with dplyr (actually I think dtplyr is more interesting for operating on data.tables). However, I'm not sure I really understand your point about the "if" statement.

If the data is already in a dataframe, why not still use the "if" statement, but one of the packages I mentioned earlier to further modify the data?

E.g.

    if (x == 10) {
        duckdf("SELECT * FROM df WHERE y >= 20")
    }

I guess I was thinking about one query language to rule them all. If SQL was used everywhere, it's just about learning one paradigm and largely being able to use that across platform and language.


I read down the thread as it stands now, and the basic problem is this. Write the following function:

    Compose("SELECT a, b.revenue, b.name 
        FROM table 
        INNER JOIN  b ON whatever 
        WHERE name LIKE 'b%'",
    "revenue > 10000")
to yield a new valid SQL statement that composes these two fragments together in the way that it is obvious that I want. You basically can't. If you can in this small example, I can easily complexify the SQL until you can't anymore. Note how "revenue" in my supplemental WHERE clause may or may not be that "b.revenue", both in the sense that I may be asking for some other revenue entirely, and in the sense that that particular fragment may have come from somewhere that has no way to know about the "b" part of the name; one of the major issues is namespacing like this, though it is far from the only one!

It is almost certainly impossible to sanely implement this function literally in terms of strings. (If such a thing was possible, it would almost certainly have trash performance.) You need a richer set of data types and operations to permit this.

It has been done. You can hold on to a symbolic representation of the SQL in some internal library representation. However, to "bind" to SQL in this manner requires a binding to every single feature of the SQL you want to use at a pretty deep level (more than just strings, you need to understand the full syntax tree), and speaking from experience, no matter how cleverly you try to start writing it at first it gets more complicated than you think. It is certainly a thing that has been done, but it is a huuuuuuge project. Seriously. It would be a lot easier if we were using an underlying representation designed to do this sort of thing from the beginning.

I like what you can do with SQL, but I kinda hate the way it was juuust good enough to get ensconced and apparently permanently ensure that no improvement on it can ever get off the ground because no conceivable improvement can overcome the entrenched advantages SQL has.


SQL is a good language, but a terrible API.


I think the point from GP (which aligns with my thinking) is that more programmatic interfaces for querying data allow for control flow in constructing the query, not just in processing it. So you can conditionally add a predicate or whatever, and then run the whole query at once.

Whereas with the SQL packages you mentioned, you either have to conditionally construct your query string (ew) or pull results back and express your conditional as operations on the resulting data frame.

For pure dplyr-on-in-memory-data-frame there isn't much difference between the two, to be sure. For dplyr on dbplyr, or sparklyr, or for SparkR, or PySpark, or Dask, or pick-your-favorite-framework, the difference between operating on (effectively) query AST vs. result sets is huge.


Good points. That makes the larger than memory use case quite clear, and interesting.


It's for more complex queries, eg. active users per country:

  WITH active_users AS
    (SELECT DISTINCT user_id, user_country FROM ...
     WHERE last_login >= NOW() - INTERVAL '1 month')
  SELECT user_country, COUNT(user_id) AS user_count
    FROM active_users GROUP BY user_country
    ORDER BY user_count DESC
Now imagine someone says "what about users that have at least 5 friends?" If you're using dplyr and want to reuse most of your logic, it's just a matter of doing something like

  active_users_with_friends = active_users %>% filter(friend_count >= 5)
The SQL version is much hairier, since it's just code that's within a string.


I think we're talking at cross purposes, and I don't want to belabour the point too much, but there's no need to modify the original SQL query (I'm imagining this pulls data from a database). If the data is already in a dataframe, use one of the libraries which can apply SQL to in-memory dataframes.

e.g.

active_users_with_friends <- duckdf("SELECT * FROM active_users WHERE friend_count >= 5")


There's an aggregation part that comes after the CTE, at which point the "friend_count" or "last_login" fields are not available anymore. "active_users" isn't a table, it's the result of the CTE in the query that returns distinct users that have a last_login in the last 30 days.

Also, keep in mind that this is a pretty simple example, a more realistic one would probably have a half dozen to a dozen CTEs, some nested, at which point correctly mutating the SQL statement is not trivial.


> "active_users" isn't a table, it's the result of the CTE in the query

Maybe we really are at cross purposes. In the duckdf example above "active_users" is a dataframe. The duckdf library applies SQL queries on dataframes. It's not (necessarily) querying an on-disk database.

If you're querying data using dplyr then it's highly likely that data is already in a dataframe. By the same principle write the original SQL query to use SELECT with wildcard, to return everything. Then use a relevant library to apply SQL on dataframes and it's pretty much the same result as dplyr.

That's not to say complex SQL queries shouldn't be used to only return the minimally required information from the database. Clearly, sometimes there's too much data to fit in memory. My point is that if you're using dplyr on dataframes (or tibbles), then you could just as easily use an SQL based query on the same dataframe.


> Then your dplyr example above doesn't work, because active_users doesn't exist.

You can assume that it would contain the same data that would be in the CTE in the SQL query above. There's no such table in the database, it's a subquery.

If your argument is that chaining multiple SQL queries on data frames is needed, then we're both saying the same thing. SQL doesn't compose well, hence the need for multiple queries. At that point it's not just SQL though, it's querying of dataframes using SQL and composing the dataframes together using non-SQL code.

Luckily, dplyr will actually compose properly, and you can get the actual SQL statement that is equivalent to the dplyr expression, which would be rather annoying to obtain programmatically.


SQL tends to be non-composable which makes complicated scripts really messy to refactor and modify (or read even). CTEs make it more sensible but they're also a rather recent addition and don't fully solve the problem. Data Science tends to involve a lot of modifying of the same code rather than creating one off or static scripts.


I'm late to the party here, but as stated in other comments, dplyr / LINQ etc. compose better, so you can extend a query more easily.

Another advantage of composable syntaxes is that you can lazily evaluate parts of a query, and combine others, which allows the equivalent of a CTE on parts of a query you need to optimise.

It's also true that SQL allows pushdown of queries to the database much more easily than composable syntaxes, as SQL is usually the native language there. As such it can make for a more exploratory syntax than, say, GraphQL, where you need to define joins at design-time.


I think the answer to your question is that, until relatively recently it was not possible to run arbitrary SQL against data frames in either language. sqldf and duckdf pass data off to another program (albeit in-process), run the query there, and pull the result set back to R, paying data movement and transportation costs along the way. Tidyquery is neat and avoids that issue, but is also only 18 months old or so


I agree that Tidyquery and duckdf are quite new, but sqldf has been around for a long time. The first package upload to CRAN was 2007.

Also, duckdf uses duckdb[1] to register a virtual in-memory table, so there's very little overhead for data transport.

[1] https://duckdb.org/docs/api/r


While sqldf has been around for a while, it's not been a performant choice, since it requires transforming and moving data into a database, albeit one that's in process. So with something like

    df <- data.frame(a=rnorm(5e8), b=runif(5e8))
    sqldf::sqldf("select count(*) from df where a > 3")
works, but takes 200 seconds and doubles the memory footprint of the process whereas

    nrow(df[df$a > 3, ])
    sum(df$a > 3)
take ~1.5 seconds and ~1s respectively on my machine.

I appear to have been too pessimistic about duckdf/duckdb though. Its docs[1] claim that registering an R data.frame as a table

> does not actually transfer data into DuckDB yet.

Which implied to me that it just deferred transfer until the user runs a query. However, in a basic test just now:

    con = dbConnect(duckdb::duckdb(), ":memory:")
    duckdb::duckdb_register(con, "iris_view", iris)
    duckdb::duckdb_register(con, "df_view", df)
    dbGetQuery(con, "select count(*) from df_view where a > 0")

it appears to execute the query directly against the data frame. At least, it runs in ~2.2s, and doesn't allocate more memory. Cool! As you've noted though, it's very new - looks like they released 0.1.0 last May?

I think the point stands: until very recently, SQL-on-dataframes was not a viable choice for anyone working at even moderate scales in R or Python, so the preference has been for APIs exposed by libraries (pandas, data.table, dplyr, ...) that offered users reasonable performance, even if SQL would have been a more ergonomic interface.

[1] https://duckdb.org/docs/api/r


Yeah, I agree sqldf is quite slow. Fair point.

As you've seen, duckdb registers an "R data frame as a virtual table." I'm not sure what they mean by "yet" either.

Of course it is possible to write an R dataframe to an on-disk duckdb table, if that's what you want to do.

There are some simple benchmarks on the bottom of the duckdf README[1]. Essentially I found for basic SQL SELECT queries, dplyr is quicker, but for much more complex queries, the duckdf/duckdb combination performs better.

If you really want speed of course, just use data.table.

[1] https://github.com/phillc73/duckdf


Didn't realize duckdf was your package. Cool! Definitely something I'll consider for future work, though I spend more time on the Spark-and-Python side of the fence these days.

If you wanted to add corresponding memory benchmarks the value-prop of duckdf might be clearer to those of us that have been scarred by sqldf :).


Sounds like it could be an interesting comparison. I'll look into it.


ClickHouse is incredible. It has also replaced a large, expensive and slow Elasticsearch cluster at Contentsquare. We are actually starting an internal team to improve it and upstream patches, email me if interested!


Your 2019 talk on this was great. I cited it above. Here's a link to slides to supplement others elsewhere in the thread: https://www.slideshare.net/VianneyFOUCAULT/meetup-a-successf...


Yep, do you guys have a writeup on this? Altinity actually mentions the Contentsquare case in their video, here: https://www.youtube.com/watch?t=2479&v=pZkKsfr8n3M&feature=y...


Hi

I'm the guy who did the 2 presentations of Clickhouse at ContentSquare. There are no blog posts on the migration from ES to CH, but you can find the slides of the 2018 presentation here https://www.slideshare.net/VianneyFOUCAULT/clickhouse-meetup... and the slides of the 2019 presentation here https://www.slideshare.net/VianneyFOUCAULT/meetup-a-successf...

There is also a video recording of the 2019 presentation available here. https://www.youtube.com/watch?v=lwYSYMwpJOU nb: The video is not great because the camera is often losing focus but it's still understandable.


I'm not sure there is a public writeup. I know that the incredibly talented guy who created the first CH setup at CS planned to write a more global post about data analytics at scale, but after 2 years I'm still waiting for it


I stopped giving people a release date for my next blog post because I'm always postponing it ;-).

But don't worry Paul, the day I release it you'll be one of the first to be informed.


I'll remind him about the post ;)


Can you share some more details?

How many nodes on both? How much data ingested and stored? What’s the query load?


https://github.com/ClickHouse/clickhouse-presentations/blob/... and the presentation itself https://www.youtube.com/watch?v=lwYSYMwpJOU - 300 Elastic nodes vs 12 ClickHouse nodes / 260 TB / lots of queries


Sentry.io is using ClickHouse for search, with an API they built on top of it to make it easier to transition if need be. They blogged about it at the time they adopted it:

https://blog.sentry.io/2019/05/16/introducing-snuba-sentrys-...


I like sentry, but it's the only app I know of that uses 25 or so containers to stitch itself together to get running, which seemed insane, not to mention how much RAM it uses.


I am using Clickhouse at my workplace as a side project. I wrote a Rust app that dumps the daily traffic data collected from my company's products into a ClickHouse database.

That's 1-5 billion rows, per day, with 60 days of data, onto a single i5 3500 desktop I have laying around. It returns a complex query in less than 5 minutes.

I was gonna get a beefier server, but 5 minutes is fine for my task. I was flabbergasted.


5 billion rows per day? What does your product do?


Security products


Uber recently blogged that they rebuilt the log analytics platform based on ClickHouse, replacing the previous ELK based one. The table schema choices made it easy to handle JSON formatted logs with changing schemas. https://eng.uber.com/logging/
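
Not Uber's actual schema, but the general pattern reads roughly like this in ClickHouse terms (all names below are made up for illustration):

    -- keep the raw JSON line, and promote frequently queried fields to real columns
    CREATE TABLE logs
    (
        timestamp DateTime,
        raw String,
        service String MATERIALIZED JSONExtractString(raw, 'service')
    )
    ENGINE = MergeTree
    ORDER BY timestamp;

    -- promoting another field later is a cheap, metadata-only ALTER;
    -- old parts compute the value on the fly until merges rewrite them
    ALTER TABLE logs
        ADD COLUMN level String MATERIALIZED JSONExtractString(raw, 'level');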


Nice! Adding this to the post, thanks for the link!


I've been recording a podcast with Commercial Open Source company founders (Plug! https://www.flagsmith.com/podcast) and have been surprised how often Clickhouse has come up. It is always referred to with glowing praise/couldn't have built our business without it etc etc etc.


Thanks for sharing your podcast! Just subscribed.

To add yet another data point, we use Clickhouse as well for centralized logging for the SaaS version of our open source product, and can't imagine what we would have done without it.


My biggest problem with Elasticsearch is how easy it is to get data in there and think everything is just fine... until it falls flat on its face the moment you hit some random use case that, according to Murphy's law, will also be a very important one.

I wish Elasticsearch were maybe a little more opinionated in its defaults. In some ways Clickhouse feels like it filled the gap that the lack of opinionated defaults created. My usage is from a few years back, so maybe things have improved.


Would you care to elaborate on what happened in your case? My company is using ElasticSearch extensively, and it is mission critical for us. I fear something might happen one day.


It's been a few years so the details are fuzzy, but iirc it was just simple things like index sizing, managing shard count, some JVM tuning, certain aggregated fields blowing up once we had more data in our instances...

We also got our data very out of order. We had embedded devices logging analytics that would phone home very infrequently, think months between check-ins

I forget why but that became a big issue at some point, bringing the instance to its knees when a few devices started to phone-in covering large periods of time.

ES just has a ton of knobs. I imagine if it's been important to you, you have people specializing in keeping it running, which is great... but the amount of complexity there that is specific to ES is really high.

It's not like there's no such thing as a Postgres expert for example, but you don't need to hire a Postgres wizard until you're pretty far in the weeds. But I feel like you should have an ES wizard to use ES, which is a little unfortunate


I work for a company that uses it as a mission critical product. It provides the search in our SaaS. I'd say it mysteriously fails 3-4 times a year, but no one wants to invest or encourage anyone to look at it proactively or even reactively. We've had JVM issues and indexers spiral out of control using 99% CPU, only to stop at random hours later. It's definitely a product to learn before it fails.


How does clickhouse compare to druid, pinot, rockset (commercial), memsql (commercial)? I know clickhouse is easier to deploy.

But from user's perspective is clickhouse superior to the others?


Yeah, Druid got blown away by ClickHouse at eBay: 700+ Druid servers versus a 2-region, fully replicated ClickHouse system of 40 nodes. https://tech.ebayinc.com/engineering/ou-online-analytical-pr... and a webinar they did with us at Altinity https://www.youtube.com/watch?v=KI0AqpmcSOk&t=20s


FYI, we've been using clickhouse since 2018 at ContentSquare.

I did a few POCs to compare clickhouse vs other databases on ContentSquare's use case. One of them was memSQL. Although memSQL was very good, since we don't need to JOIN big datasets or need killer features like fulltext search, clickhouse gave a better perf/cost ratio for us (I don't remember exactly but it was at least twice as cheap).


When Cloudflare was considering clickhouse, we did estimates on just the hardware cost, and Druid's came out at well over 10x what clickhouse's was, based on Druid's published numbers on events processed per compute unit.


Are you saying druid hardware costs were coming out to be 10x of clickhouse hardware costs?

Caveat: English is not my first language so might have missed your point in translation. :)


That's right, yeah. We would have had to buy 10x the servers in order to support the same workload that clickhouse could.


I've mentioned Pinot and Druid briefly in 2018 writeup: https://pixeljets.com/blog/clickhouse-as-a-replacement-for-e... (see "Compete with Pinot and Druid" )


Almost nobody wants to use elasticsearch.

People want to use kibana and put up with elasticsearch.


> People want to use kibana and put up with elasticsearch.

I don't buy this.

It's just a mess of data dumps, and it's not exactly providing a focused experience. You need to take a full month to get it to show what you want comfortably.


+1. We actually looked at CH as our debug logs backend, and while it is great for the most part (it's also incredibly memory hungry), Kibana is really an ES killer feature.


Or some people are considering it for e.g. facets.


Does ClickHouse, or anything else out there, even remotely compete with Splunk for ad-hoc troubleshooting/forensics/threat hunting type work?

I started off with Splunk and every time I try Elasticsearch I feel like I'm stuck in a cage. Probably why they can charge so much for it.


why is splunk better than ES?


I really want to answer you, but I'm struggling a bit b/c I haven't worked with ES in a minute. Splunk just tends to be able to eat just about any kind of structured or unstructured content, operates on a pipeline concept similar to that of a unix shell or (gasp) powershell, and has a rich set of data manipulation and modification commands built in:

https://docs.splunk.com/Documentation/SplunkLight/7.3.6/Refe...

I primarily use it for security-related analysis, which is lowish on metrics and high on adhoc folding and mutilation of a very diverse set of data structures and types.


Sorry to hijack the thread but can anyone suggest alternatives to the 'search' side of Elasticsearch?

I haven't been following the topic and there's probably new and interesting developments like ClickHouse is for logging.


FWIW...

1. Toshi https://github.com/toshi-search/Toshi (Rust, 3.1k stars)

2. Tantivy https://github.com/tantivy-search/tantivy (Rust, 4.5k stars)

3. PISA https://github.com/pisa-engine/pisa (C++, 486 stars)

4. Bleve https://github.com/blevesearch/bleve (Go, 7.4k stars)

5. Sonic https://github.com/valeriansaliou/sonic (Rust, 10.9k stars)

6. Partial comparison https://mosuka.github.io/search-benchmark-game/ (tantivy Vs lucene Vs pisa Vs bleve)

7. Bayard https://github.com/bayard-search/bayard (Rust, on top of Tantivy, 1.4k stars)

8. Blast https://github.com/mosuka/blast (Go, on top of Bleve, 930 stars)

Algolia alternatives with some compatibility

1. MeiliSearch https://github.com/meilisearch/MeiliSearch (Rust, 12.4k stars)

2. typesense https://github.com/typesense/typesense (C++, 5.1k stars)


Wow! Thank you!


I'm personally very fond of sonic [0] for full text search.

> Sonic can be used as a simple alternative to super-heavy and full-featured search backends such as Elasticsearch in some use-cases. It is capable of normalizing natural language search queries, auto-completing a search query and providing the most relevant results for a query....

> When reviewing Elasticsearch (ELS) and others, we found those were full-featured heavyweight systems that did not scale well with Crisp's freemium-based cost structure.

> At the end, we decided to build our own search backend, designed to be simple and lightweight on resources

[0] - https://github.com/valeriansaliou/sonic


>It is used to index half a billion objects on a $5/mth 1-vCPU SSD cloud server (as of 2019).

Damn. Thank you! This looks very nice!


https://github.com/meilisearch/MeiliSearch gets a lot of traction recently. There are also Sphinx and its fork https://manticoresearch.com/ - very lightweight and fast.


I immediately thought of Sphinx when I saw MeiliSearch... it's uncanny how the use case and implementation semantics haven't changed much in 15 years.

The beauty of pointing it to your mysql tables and getting fulltext-via-api on the other side was quite nice.


Not sure if it's "new" but it's always interesting: Postgres offers a fine set of full-text search functionality, with the advantage of being able to also use the other ways in which Postgres shines: document-like data (storing JSON in columns), horizontal replication, PostGIS, and so on.
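
A minimal sketch of that built-in full-text search (table and column names are made up):

    -- a GIN index over the tsvector keeps the search fast
    CREATE INDEX articles_body_fts ON articles
        USING gin (to_tsvector('english', body));

    SELECT id, title
    FROM articles
    WHERE to_tsvector('english', body) @@ plainto_tsquery('english', 'clickhouse logging');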


https://typesense.org/ comes to mind. Has support for Algolia's instantsearch.js as well.


You can use Redis for full-text search and some more SQL-like queries, including aggregations, with RediSearch: https://oss.redislabs.com/redisearch/


Someone recommended Meilisearch the other day. I've been playing around with it and it's pretty great so far. Still early in the project right now.


I just use elasticsearch, nothing wrong with it so far.


A related database using ideas from Clickhouse:

https://github.com/VictoriaMetrics/VictoriaMetrics


Are you familiar with VictoriaMetrics?

Can you elaborate on how it is similar and dissimilar to Clickhouse?

What specific techniques are the same?


The core storage engine borrows heavily from it - I'll attempt to summarize and apologies for any errors, it's been a while since I worked with VictoriaMetrics or ClickHouse.

Basically data is stored in sorted "runs". Appending is cheap because you just create a new run. You have a background "merge" operation that coalesces runs into larger runs periodically, amortizing write costs. Reads are very efficient as long as you're doing range queries (very likely on a time-series database) as you need only linearly scan the portion of each run that contains your time range.


I just wish there was a FOSS Loki-like solution built on CH that was stable and used in production.

I know there are a few projects (see below), but I'm not aware of anything mature...

https://github.com/QXIP/cloki-go

https://github.com/lmangani/cloki


Regarding #1 in the article, Elastic does have SQL query support[1]. I can’t speak to performance or other comparative metrics, but it’s worked well for my purposes.

[1] https://www.elastic.co/guide/en/elasticsearch/reference/curr...
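
For anyone who hasn't tried it, the statements it accepts look like regular SQL submitted to the _sql endpoint (a rough sketch; the index pattern and field names are made up):

    SELECT host.name, COUNT(*) AS hits
    FROM "filebeat-*"
    GROUP BY host.name
    ORDER BY hits DESC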


I don't have any production experience running Clickhouse, but I have used it on a side project for an OLAP workload. Compared to Postgres Clickhouse was a couple orders of magnitude faster (for the query pattern), and it was pretty easy to setup a single node configuration compared to lots of the "big data" stuff. Clickhouse is really a game changer.


what happens when your data doesn't fit in a single node anymore?


There's replication in ClickHouse, and you can just shove reads off to a replica if you'd like. From a backup/safety standpoint that's important, but I think there are other options besides just replicas, of course.

From an operations standpoint, however, ClickHouse is ridiculously efficient at what it does. You can store tens of billions, probably trillions of records on a single node machine. You can query at tens of billions of rows a second, etc, all with SQL. (The only competitor I know of in the same class is MemSQL.) So another thing to keep in mind is you'll be able to go much further with a single node using ClickHouse than the alternatives. For OLAP style workloads, it's well worth investigating.


I don't know why people still call replication a backup.


It scales to crazy numbers as you add nodes. In 2018 Cloudflare was ingesting 11M rows per second into their CH cluster.


I've been using it successfully in production for a year and a half. I can think of no other database that would give me real-time aggregation over hundreds of millions of rows inserted every day for virtually zero cost. It's just marvelous work.


> ElasticSearch repo has jaw-dropping 1076 PRs merged for the same month

Code change frequency is not a measure of quality or development speed.

One organization can encourage bigger PRs while another encourages tiny, frequent changes.

One can care about quality and stability while another can care very little about bugs.


Anyone know a more lightweight alternative to the Elastic Stack (ELK)? I found https://vector.dev but it seems to be only the "L" part.


Promtail/Loki https://github.com/grafana/loki is an alternative to elk, but while it seems more lightweight, it definitely is less featureful. The integration with grafana/prometheus seems nice, but I've only toyed with it, not used in production.


I found Lucene's base library really easy to use without the configuration/infrastructure overhead of Elasticsearch, but haven't experienced it at scale:

https://lucene.apache.org/


Solr is the equivalent-ish of ES, if you are looking for a search server instead of a library that can be embedded: https://lucene.apache.org/solr/



I am looking into this. Do you have experience with it?


It is good. I can't find any CDC for Postgres for the incremental sync. And so I had to use the bulk update/sync and that causes performance issues occasionally. Also, some Algolia features are not available yet https://github.com/meilisearch/instant-meilisearch/issues/21...


ClickHouse will happily replace the Elasticsearch bit, and there are a few open source dashboards you could use as a Kibana stand-in: Metabase (with the ClickHouse plug-in), Superset, or Grafana.


Have you looked into Google Cloud Logging (Stackdriver)? It's the most affordable and decent-enough solution we've found. The only issue is querying can be slow on large volumes.


Have you tried creating a sink from stackdriver to bigquery?


Hi, not really, because my understanding is BigQuery requires structured data, but Stackdriver just expects an arbitrary JSON payload which can then be queried. Do you have any experience?


I wish they had a data store shoot-out like Techempower has for Web stacks.


If you are looking for an OSS ES replacement, CrateDB might also be worth a look :)

Basically a best of both worlds combination of ES and PostgreSQL, perfect for time-series and log analytics.


Does ClickHouse have integration with Superset and Grafana?


Yes to both, Altinity maintains the ClickHouse Grafana plugin https://altinity.com/blog/2019/12/28/creating-beautiful-graf...

And Superset has a recommendation of a ClickHouse connector https://superset.apache.org/docs/databases/clickhouse


I am curious how they deal with GDPR or PII when they do the logging? At first sight it looks like they are doing the logs themselves and not the API provider.


I'm happy that more people are "discovering" ClickHouse.

ClickHouse is an outstanding product, with great capabilities that serve a wide array of big data use cases.

It's simple to deploy, simple to operate, simple to ingest large amounts of data, simple to scale, and simple to query.

We've been using ClickHouse to handle 100's of TB of data for workloads that require ranking on multi-dimensional timeseries aggregations, and we can resolve most complex queries in less than 500ms under load.


[flagged]


Take my upvote, because the same happened to me.


Could not have been the only one :)



