
ClickHouse cost-efficiency in action: analyzing 500B rows on an Intel NUC - hodgesrm
https://www.altinity.com/blog/2020/1/1/clickhouse-cost-efficiency-in-action-analyzing-500-billion-rows-on-an-intel-nuc
======
ofek
Hey, Ofek from Datadog here!

I recently implemented our ClickHouse integration [1], so if any of you would
like to try it out we would appreciate feedback. I really enjoyed learning
about this database, and it has excellent docs :)

Oh, fun fact: speaking of docs, this was the first integration of ours for
which we scrape the docs as part of the test suite, so when a new built-in
metric is added, it will fail our CI until we support it [2]. We just did this
again for Apache Airflow [3].

[1]: https://github.com/DataDog/integrations-core/pull/4957

[2]: https://github.com/DataDog/integrations-core/pull/5233

[3]: https://github.com/DataDog/integrations-core/pull/5311

~~~
tgtweak
So at Datadog, after aggregation with Spark and storage into Parquet, what is
used to serve queries over all the aggregated telemetry data (logs, APM and
infra telemetry) to consumers?

(Interestingly, we have a nearly identical data ingestion/ETL stack running on
spot instances and saving to Parquet/S3.)

~~~
ofek
Hi there! We actually talked quite a bit about that in a recent podcast
https://www.dataengineeringpodcast.com/datadog-timeseries-data-episode-113/

------
jgrahamc
We use ClickHouse extensively and it's been great:
https://blog.cloudflare.com/http-analytics-for-6m-requests-per-second-using-clickhouse/

~~~
chupasaurus
You've lost a Russian IT meme in translation.

We use the verb " _to use brakes_ " to describe something that works slowly.
There was a long history of flame threads about Java performance, until the
meme was solidified by news from the 2005 DARPA Grand Challenge (a racing
competition for autonomous robotic self-driving cars): a car named Tommy, from
the Jefferson Team, which was running Java under Linux, didn't use its brakes
before a turn and crashed into a wall at 70 mph. Hence _Java runs fast_ and
_Java doesn't use brakes_ were both perfectly described by the same sentence.

Yandex used the meme to advertise Clickhouse for engineers.

edit: formatting

edit2: brakes instead of breaks, wtf

~~~
nkrisc
I think you mean "brakes" - I thought you meant "breaks", as in an iterator.
Not that familiar with Java though; just assuming based on the reference to a
vehicle failing to brake before a turn.

I thought maybe it meant using break statements in your loops slows things
down or something.

~~~
chupasaurus
Yeah, my bad.

------
patelh
Not exactly a good comparison if you don't generate the data the same way for
the test setup. Your generated data is more compressible by clickhouse, that
skews the comparison. Would have been better to not change the test data if
you wanted to do a comparison.

~~~
hodgesrm
The important difference is that we used a more realistic temperature profile,
which as you say does affect compression _for that column_. Schema design
(including sort order, compression, and codecs) for the remaining columns is
just good ClickHouse practice. Much of the storage and I/O savings is in the
date, time, and sensor_id columns.

It's also useful to note that the materialized view results would be
essentially the same no matter how you generate and store data because the
materialized view down-samples temperature max/min to daily aggregates. The
data are vastly smaller no matter how you generate them.

The article illustrates that if you really had such an IoT app and designed it
properly you could run analytics with surprisingly few resources. I think
that's a significant point.

~~~
delusional
That's what you wanted to show, but what you ended up showing is that if you
have different data, then the query performance can be quite good.

I get the desire to critique the temperature profile, but completely changing
it makes the comparison worthless. From a data perspective it's like saying
"if all the sensors just report 1 for temperature every reading, computing the
min, max, and average is super fast". No shit, that wasn't the task though.

~~~
jayleeg
But they didn't set the temperature reading to anything that would advantage
their tests. Without access to the original data, they simply generated a
dataset as close to the original in content and volume as possible. Spending a
few sentences on the temperature profile doesn't invalidate the test.

Looking at this your way - Scylla used an INT, Altinity used a Decimal type
with specialized compression (T64). I can tell you that this would have
hampered ClickHouse and advantaged Scylla. It's the opposite of what you're
saying. They actually performed this benchmark with one arm tied behind their
back.

It's a funny benchmark anyway because the two systems have very different use
cases but it doesn't invalidate the result.

------
avisk
At Sematext we replaced our HBase based metrics datastore with Clickhouse. We
are happy with the performance gain and flexibility. We also added support for
Clickhouse monitoring - https://sematext.com/blog/clickhouse-monitoring-sematext/

------
subhajeet2107
We use Clickhouse extensively at work, and boy is it better than anything I
have used among column-oriented databases so far. The documentation is good,
and the HTTP query interface and features such as built-in URL parsing are
amazing. We also tested Druid and found Clickhouse better; it is easier to set
up and maintain as well.

------
rossjudson
Curious about what Google does in this area?

[https://blog.acolyer.org/2019/09/11/procella/](https://blog.acolyer.org/2019/09/11/procella/)

------
valyala
VictoriaMetrics core developer here.

The performance numbers from ClickHouse running on Intel NUC are impressive!
We are going to publish VictoriaMetrics performance numbers for the original
Billy benchmark from ScyllaDB [1] running on the same hardware from
packet.com. Initial results are quite promising [2], [3].

[1] https://www.scylladb.com/2019/12/12/how-scylla-scaled-to-one-billion-rows-a-second/

[2]
[https://mobile.twitter.com/MetricsVictoria/status/1209116702...](https://mobile.twitter.com/MetricsVictoria/status/1209116702606278657)

[3]
[https://mobile.twitter.com/MetricsVictoria/status/1209186575...](https://mobile.twitter.com/MetricsVictoria/status/1209186575558565889)

------
johnchristopher
Note: not your standard $250 NUC; it's a Canyon-something with much more
oomph.

~~~
miffy900
It does have more oomph than most NUCs, but it's not a Canyon-something NUC;
as stated in the article, the CPU model is an Intel i7-6770HQ, which was
released in Q1 2016 on Intel's 14nm node, so it's Skylake.

https://ark.intel.com/content/www/us/en/ark/products/93341/intel-core-i7-6770hq-processor-6m-cache-up-to-3-50-ghz.html

~~~
johnchristopher
No, it is a Skull Canyon NUC from 2016 (with an i7-6770HQ); see
https://www.techradar.com/reviews/pc-mac/pc-mac-desktops/intel-skull-canyon-nuc-1322597/review/2
for a description (I think this is the Intel product page:
https://www.intel.com/content/www/us/en/products/boards-kits/nuc/kits/nuc6i7kyk.html).

Intel uses the Canyon name for its range of powerful NUCs (Skull Canyon, Hades
Canyon, etc., and the upcoming Ghost Canyon).

------
pachicodev
We have been using ClickHouse in production for some months already and we
find it a real game changer. We're running queries over 3B rows for business
intelligence purposes.

As a side project, a group of friends and I are working on a simple web
analytics project powered by ClickHouse (what it was originally built for).
If anyone wants to contribute, just let me know.

Cheers

------
atombender
Is ClickHouse good for event data when you want to do rollups? For example,
say all my events are of the form:

{event: "viewedArticle", article_id: 63534, user_id: 42, topic: "news", time:
"2020-01-06"}

I want to be able to build aggregations which show the number of
"viewedArticle" events grouped by hour and by topic, counting unique user_ids
within each bucket.

Or let's say I want the top K articles viewed each day, filtered by a topic.

That's something that's trivial with Elasticsearch, which has a hierarchical
aggregation DSL. Is ClickHouse good at this?

Whenever I see time-series databases such as InfluxDB mentioned, they look
like they're focused on measurements, not discrete rows. You can attach the
event data as "labels", but this isn't efficient when the cardinality of each
column is very high (e.g. article IDs or user IDs in the above example).

~~~
manigandham
Yes. Clickhouse is a column-oriented relational database among many others
like MemSQL, Vertica, Redshift, BigQuery, Snowflake, Greenplum, etc. They're
all focused on analytical queries over very large datasets using SQL.

An aggregation with several `group by` statements is no challenge and all of
these databases also support approximate counting via HyperLogLog for faster
results.

Clickhouse has some unique features where each table can have a separate
'engine', including some that automatically apply aggregations. Start with a
normal table, though, since it'll be plenty fast enough for most use cases.
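As a rough illustration in plain Python (with made-up events, and a set giving the exact distinct count that ClickHouse's approximate uniq() trades away for speed), the hour/topic rollup being asked about boils down to:

```python
from collections import defaultdict

# Toy event stream shaped like the example above (values are made up).
events = [
    {"event": "viewedArticle", "article_id": 63534, "user_id": 42, "topic": "news",   "hour": "2020-01-06T10"},
    {"event": "viewedArticle", "article_id": 63534, "user_id": 43, "topic": "news",   "hour": "2020-01-06T10"},
    {"event": "viewedArticle", "article_id": 99,    "user_id": 42, "topic": "sports", "hour": "2020-01-06T11"},
    {"event": "viewedArticle", "article_id": 63534, "user_id": 42, "topic": "news",   "hour": "2020-01-06T11"},
]

# GROUP BY (hour, topic), counting distinct users per bucket.
# A set gives exact counts; ClickHouse's uniq() approximates this
# (HyperLogLog-style) to stay fast at scale.
buckets = defaultdict(set)
for e in events:
    if e["event"] == "viewedArticle":
        buckets[(e["hour"], e["topic"])].add(e["user_id"])

unique_users = {bucket: len(users) for bucket, users in buckets.items()}
```

The database does exactly this, just columnar, vectorized, and across billions of rows.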

~~~
atombender
Thanks! Looks like the only downside is that, since it returns rows as
results, you end up getting a lot of duplicate column data back and need to
"nest" the buckets yourself.

For example, a result like:

    topic;time;count
    news;2020-01-01;44
    news;2020-01-02;31

Now you have "news" repeated, and to group this into buckets for rendering
summary tables and such (with sub totals at each level), you need to iterate
through the flattened results and generate nested structures. This is
something Elasticsearch gives you out of the box.
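Concretely, that regrouping step is just a small fold over the flat rows (made-up numbers):

```python
# Fold flat (topic, time, count) result rows into nested buckets
# with a per-topic subtotal - the client-side work described above.
rows = [
    ("news", "2020-01-01", 44),
    ("news", "2020-01-02", 31),
    ("sports", "2020-01-01", 7),
]

nested = {}
for topic, time, count in rows:
    bucket = nested.setdefault(topic, {"total": 0, "by_time": {}})
    bucket["by_time"][time] = count
    bucket["total"] += count
```

Trivial for one level, but it's boilerplate Elasticsearch's aggregation DSL spares you.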

Last I looked at Clickhouse, it had master/slave replication only, and if you
want shards of data distributed across a cluster it's something you need to
manually manage?

~~~
hodgesrm
> Now you have "news" repeated, and to group this into buckets for rendering
> summary tables and such (with sub totals at each level), you need to iterate
> through the flattened results and generate nested structures. This is
> something Elasticsearch gives you out of the box.

ClickHouse has a number of optimizations for the 'visitor' problems you
describe. Assuming you just want to group in different ways, an idiomatic
ClickHouse solution is to construct a materialized view that aggregates counts
(e.g., of unique users with uniq(user)). You can then select from the
materialized view and aggregate further into larger buckets. ClickHouse can
also compute single-level totals using the WITH TOTALS modifier.

If you need cascading sub-totals within the same listing, as far as I know
you'll have to compute the totals yourself. (That feature might actually make
an interesting pull request, since ClickHouse generates JSON output.)

> Last I looked at Clickhouse, it had master/slave replication only, and if
> you want shards of data distributed across a cluster it's something you need
> to manually manage?

ClickHouse replication is multi-master. The model is eventually consistent.
Also, ClickHouse can automatically shard INSERTs across a cluster using
distributed tables. That said, many users insert directly to local nodes
because it's faster and uses fewer resources.

~~~
atombender
Sounds great, thank you!

------
thekozmo
ClickHouse achieved good (great) results. However, it's a bad comparison:
ClickHouse is an analytics DB while Scylla is a realtime, random-access one.
/me am Scylla co-founder.

We could stack 100k rows in a single partition and be 1000x faster in this use
case than the performance we demonstrated, but we wanted to keep it real.
Actually, the use case we wanted to show was a single row per partition, which
would require more machines, but surprisingly we couldn't provision that many
on AWS.

The use case ClickHouse presented is 100x slower on writes (8M rows/s), as
they report. It doesn't matter, since it's just a completely different use
case. Use ClickHouse for analytics (I wonder why they stop at SSE rather than
going all the way to the GPU like SQream DB) and use Scylla for OLTP.

~~~
jayleeg
The ScyllaDB one is a bit funny anyway, as it doesn't really target analytical
workloads. On SSE/GPUs: the ClickHouse guys don't use GPUs today (GPUs are on
the roadmap for next year) as their workloads target volumes greater than GPU
memory. If your hot dataset fits entirely in GPU memory then it makes sense
for some things; otherwise they found the cost/performance ratio doesn't add
up once you paginate in and out. I don't doubt GPU-based DB perf numbers, but
cost is the main factor.

Now just to clarify: you're saying Scylla writes are 100x faster on the same
hardware as ClickHouse (so 800M rows/s on a NUC). Using the same code that
Altinity used, I manage around 25M rows/s on my home PC (8 cores/16HT), and
elsewhere in this thread the guys from VictoriaMetrics pulled in 53M rows/s on
a single node with 28 cores/56 threads (probably doable with ClickHouse on
similar hardware, I'd suspect).

I'm going to test this with Scylla on my home PC to validate your 800M rows/s
claim, and I'll post about it; I should be able to hit around 2.5 billion
rows/s with Scylla if what you've said is true. I've had CH write 300M rows/s
on my 8-core box using memory-buffered tables, but that was only in bursts.

------
codexon
One thing I haven't seen anyone note about ClickHouse, though, which would be
really important to many for data durability: it does not call fsync anywhere
at all.

~~~
FridgeSeal
I’m not understanding the implication of this, could you explain?

~~~
DasIch
When you write to a file, you generally don't write to physical storage.
Instead the writes get buffered in memory and written to physical storage in
batches. This substantially improves performance but creates a risk: If there
is some sort of outage before the data is flushed to disk, you might lose
data.

In order to address that risk, you can explicitly force data to be written to
disk by calling fsync. Databases generally do this to ensure durability and
only signal success after fsync succeeded and the data is safely stored.

So ClickHouse not calling fsync implies that it might lose data in case of a
power outage or a similar event.
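A minimal sketch of the pattern in Python, with os.fsync standing in for whatever call a given engine actually uses:

```python
import os

def durable_write(path: str, data: bytes) -> None:
    """Write data and force it to physical storage before returning."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)          # may only reach the OS page cache
        os.fsync(fd)                # block until the kernel flushes to the device
    finally:
        os.close(fd)

durable_write("events.log", b"sensor_id=1 temp=64.2\n")
```

Without the fsync line, a success return only means the data reached the page cache; a crash before the kernel flushes it loses the write.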

~~~
somurzakov
is this really that important, thought, since all servers feed power from
uninterruptible power supply and most data centers have multiple power
sources.

~~~
DasIch
It’s a significant deviation from what I would expect from a disk oriented
database. So I would definitely expect it to be well documented, along with
the reason for it, why the developers believe it is a reasonable (or even
safe) choice, what assumptions went into that (such as availability of
uninterruptible power supply) etc.

Additionally keep in mind that with EBS most people probably use network
attached storage and fsync involves the network. Outage doesn’t just mean
power outage, it could also be a network issue.

------
xs83
I love stuff like this, but I am massively put off by any database system that
I need to "load" into. There are so many amazing file formats now that can be
queried directly as files; a 17h load time simply isn't feasible for much of
this work.

Does anyone know what format it uses to store data internally? Is there a way
to simply have it query in place?

~~~
jayleeg
I've worked with MPP DBs, Hadoop, Spark, ElasticSearch, Druid, kdb, DolphinDB
and now ClickHouse, and performance-wise it's all true: in our case ClickHouse
was 10-20x faster than Spark and used 4x less memory. I've seen it outperform
the fastest commercial timeseries stores by 2x.

This will make me unpopular, but my conclusion is that the file-based data
lake, splitting data from compute, is not the right approach in many (not all)
cases, and that Spark was not really that revolutionary. I would go as far as
to say that the direction data has taken has been a failure, and that
ClickHouse and the like come closer to solving the real problem of 'Big Data'.

So two things here about 'loading'...

1) ClickHouse table/data files are completely portable (like Parquet) and can
be moved from one server to another, copied, cloned, etc. There is even a
mechanism to allow remote execution, or to pull just the files from a remote
server or an S3 store. Just because the CH native file format isn't spoken
about in the same circles as Parquet and ORC doesn't mean it can't be treated
the same way, if that's your thing. The CH native format is far more
performant/compressible than Parquet or ORC, and the specification is open
source. Someone could implement a CH native file format SerDe for Hive, for
example.

2) In this instance they were generating the data, so it's no different from
running Spark, writing to a Parquet file, and running analytics on it later.
Spark can't write / generate this amount of data in this amount of time on
these resources and compress it out to Parquet or whatever other preferred
format. I've tried.

ClickHouse isn't perfect and I'm not affiliated with the Altinity guys but I
can tell you this is the real deal.

~~~
xs83
I would like to see comparisons between CH files; I would specifically
challenge their compressibility vs ORC, which pretty much maxes out current
compression techniques.

As soon as I see the CH format become widespread enough to interact with the
multitude of other tools available, I would consider getting on board. For
now, a "loadable" data warehouse does little for the kind of workflows we deal
with, as the loading would take longer than the processing.

With regards to item two: we use a standard consumer GPU (a GTX 1060) to
handle the conversion from CSV to ORC / Parquet, and it is much, much faster
and cheaper than a 20+ node Spark cluster - hence the preference to work on
files.

As everything else runs off these files, it is kind of integral to our
workload.

------
tkyjonathan
This is going to be your next cloud agnostic data warehouse.

~~~
FridgeSeal
For what it’s worth, I’ve used Clickhouse and Snowflake and I strongly prefer
Clickhouse.

Performance was superior, the client libraries and built-in HTTP interface
were a godsend, and it supported geospatial queries. I had perpetual issues
with getting Snowflake to properly escape strings in CSV and handle JSON in
anything approaching a sensible way; there are claims that it integrates
properly with Kafka as a consumer, but it most certainly does not. The UX is
horrible to boot.

~~~
manigandham
What were the issues with JSON? Snowflake is the best cloud data warehouse
when it comes to support for unstructured data, and far better than Redshift,
BigQuery or the others. Snowflake also has geo support.

It doesn't support streaming data though so things like Kafka aren't a good
fit yet. They have a connector but it's basically a little app that automates
consuming from Kafka, writing files to S3, then loading them into your tables.

~~~
FridgeSeal
Getting it to export JSON was needlessly complicated; those Variant-type
columns are the worst.

Does it have geospatial support now? Because they definitely didn't have it
when I used it.

Ultimately though, I don’t see enough benefits of SF over CH: feature set
overlaps but with no real winner features. observed performance was worse, the
client libraries are few and far between and not great, and no HTTP interface,
and no real streaming support. On top of being hideously expensive, and closed
source.

~~~
manigandham
Yes, it has geospatial support. Variant columns are better than what every
other database offers so far; Redshift and BigQuery just have a text field and
require far more verbose commands to operate on and cast. It sounds like most
of your issues are with importing and exporting data rather than querying it?

Snowflake is basically EC2 servers reading files from S3, so you get more
bandwidth with a larger warehouse size, but it's a fundamental limit and will
have much higher latency compared to running on a local SSD with ClickHouse.
Lack of streaming is a known problem. They actually do have an HTTP interface;
you just don't see it, but that's how all the ODBC drivers are implemented
(with HTTP calls and JSON data in the background).

If your data fits or you don't mind the operational overhead of running your
own data warehouse then it's almost always a cheaper and faster option.

------
rossjudson
Push-down operator algebra, taken to an extreme...

[https://www.yellowbrick.com/](https://www.yellowbrick.com/)

------
kresten
What’s the tldr on why it is fast?

~~~
SloopJon
I'd say the materialized view is the main thing:

> Thus, we add the following materialized view ... At the end we should have
> 1440 times fewer rows in the aggregate than the source table.

The cost of populating that view is amortized over the 17.5 hours it took to
load the data.
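To make the 1440x figure concrete: one reading per minute is 60 × 24 = 1440 rows per sensor per day, and the view keeps a single daily min/max row per sensor. A toy version in Python (made-up readings):

```python
# One temperature reading per minute for one sensor on one day:
# 60 * 24 = 1440 source rows (the values are invented for illustration).
readings = [(minute, 60.0 + (minute % 17) * 0.1) for minute in range(1440)]

# The materialized view stores only the daily min/max aggregate per sensor,
# so these 1440 source rows collapse into a single aggregate row.
daily = {
    "min": min(temp for _, temp in readings),
    "max": max(temp for _, temp in readings),
}

reduction = len(readings)  # 1440 source rows -> 1 aggregate row
```

Queries against the aggregate then scan 1440x fewer rows, which is where the millisecond response times come from.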

~~~
hodgesrm
Mat views are great, as the article showed. I use them to get query response
down to milliseconds, as they vastly reduce the amount of data ClickHouse must
scan.

That said, there are a lot of other tools: column storage, vectorized query
execution, efficient compression including column codecs, and skip indexes, to
name a few. If you only have a few billion rows, it's still possible to get
sub-second query results using brute-force scans.

Disclaimer: I work for Altinity, who wrote this article.

~~~
hodgesrm
p.s. Loading the view is low-cost compared to loading the source data. On the
NUC it's 40-60 minutes, so worst case it's something like 1h / 17.5h = 5.71%.
Also, you can still query the source data. That is fast for individual sensors
as the examples showed.

