
Clickhouse Local - mooreds
https://clickhouse.tech/docs/en/operations/utils/clickhouse-local/
======
georgewfraser
Why do people on HN love Clickhouse so much? As far as I can tell, it’s an
ordinary column store, with a bunch of limitations around distributed joins
and a heuristic-based query planner. There are several good analytical
databases that will give you the same scan performance and a much better query
planner and executor.

This is not a rhetorical question, I would really like to know why it gets so
much attention here.

~~~
atombender
Because it's one of the few mature, fast, scalable analytics-oriented
databases that is also completely open source.

The closest open source thing that matches its feature set is Presto, but that
one is quite different.

Apache Druid is supposed to be very mature, but also very difficult to set up
and manage. I've not used it myself.

There's also Vespa, but I don't know how well it performs with large numbers
of columns.

A lot of people use Elasticsearch for analytics. Being based on Lucene, it's
_kind of_ columnar, and it can perform very well indeed on aggregations.

InfluxDB may be good, but it's not fully open source.

~~~
lflux
What about Greenplum or CitusDB?

~~~
atombender
CitusDB is not relevant here, I believe, as it still uses Postgres' table
storage, so it's not columnar. It might be good for analytical workloads, but
I very much doubt it will perform anywhere close to ClickHouse.

Greenplum: I've not used it, but it does support columnar tables, so maybe
it's comparable.

~~~
justinsaccount
[https://github.com/citusdata/cstore_fdw](https://github.com/citusdata/cstore_fdw)

------
wikibob
How Sentry.io uses Clickhouse

[https://blog.sentry.io/2019/05/16/introducing-snuba-sentrys-...](https://blog.sentry.io/2019/05/16/introducing-snuba-sentrys-new-search-infrastructure)

~~~
vladcretu
We do. We are also hiring. Reach out to vlad@sentry.io for more info.

------
1996
Clickhouse is one of the most underrated databases.

This basically replaces most of my usages of SQLite.

When its SQL "dialect" matures, Clickhouse will eat MySQL's lunch, then
PostgreSQL's.

~~~
oreoftw
For analytics - sure. Clickhouse was not designed to handle OLTP workload,
there's no transaction support.

~~~
1996
I cringe a bit inside at people using say nosql approaches when it makes
literally no sense to do so.

Therefore I think the lack of OLTP will not matter much and that clickhouse
will be widely used, but also misused when it becomes too fashionable.

~~~
atombender
This makes no sense.

For example, aside from the lack of transactions, Clickhouse is designed for
insertion. There's an INSERT statement, but no UPDATE or DELETE statements.
You _can_ rewrite tables (there's ALTER TABLE ... UPDATE and ALTER TABLE ...
DELETE), but these are intended for large batch operations, and they are
potentially asynchronous: the statement returns right away, but you only see
the results later.

ClickHouse has many other limitations. For example, there's no enforcement of
uniqueness: you can insert the same primary key multiple times. You can dedupe
the data, but only specific table engines (such as ReplacingMergeTree) support
this.
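
To illustrate the above (a sketch with made-up table and column names; exact behavior varies by version):

```sql
-- Hypothetical table. ReplacingMergeTree collapses rows sharing the same
-- sorting key, but only when parts are merged, not at insert time.
CREATE TABLE events
(
    id      UInt64,
    ts      DateTime,
    payload String
)
ENGINE = ReplacingMergeTree
ORDER BY id;

-- Inserting the same id twice succeeds: there is no uniqueness check.
INSERT INTO events VALUES (1, now(), 'a'), (1, now(), 'b');

-- "Delete" is a batch mutation: the statement returns immediately,
-- while the affected parts are rewritten in the background.
ALTER TABLE events DELETE WHERE ts < '2019-01-01 00:00:00';

-- Force a merge if you need duplicates collapsed right now.
OPTIMIZE TABLE events FINAL;
```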

There's absolutely no way anyone will _want_ to use ClickHouse as a general-
purpose database.

~~~
1996
I should have phrased that differently: if something is good enough on some
key metric, it gets stretched to other uses, even if it's a poor fit.

So I insist: everyone will WANT to use clickhouse as a general-purpose
database, and will create ways to make it so (e.g. copy the table with the
columns you don't want filtered out, drop the original, rename).
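
That copy-drop-rename trick might look something like this (a sketch; the table and column names are hypothetical):

```sql
-- Rebuild the table without the unwanted column.
CREATE TABLE events_new
ENGINE = MergeTree
ORDER BY ts
AS SELECT id, ts FROM events;   -- everything except the dropped column

DROP TABLE events;
RENAME TABLE events_new TO events;
```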

It is just too fast and too good for many other things, so it will expand from
these strongholds to the rest.

A personal example: I am migrating my cold storage to clickhouse, because I
can just copy the files in place and be up and running.

I know about insert and the likes, I have a great existing system - but this
lets me simplify the design, and deprecate many things. Fewer moving parts is
in general better.

After that is done, there is a database where I would benefit from things like
alter tables or advanced joins, but keeping PostgreSQL and ClickHouse side by
side, just for this? No. PostgreSQL will go. Dirty tricks will be deployed.
Data will be duplicated if necessary.

~~~
hodgesrm
Advanced joins (specifically merge joins) and object storage are on the way.
See the following PRs:

* [https://github.com/ClickHouse/ClickHouse/pulls?q=is%3Apr+mer...](https://github.com/ClickHouse/ClickHouse/pulls?q=is%3Apr+mergejoin) -- Recent work to enable merge joins

* [https://github.com/ClickHouse/ClickHouse/pulls?q=is%3Apr+s3](https://github.com/ClickHouse/ClickHouse/pulls?q=is%3Apr+s3) -- The same for managing data on S3-compatible object storage

There's been a lot of community interest in both topics. Merge join work is
largely driven by the ClickHouse team at Yandex. Object storage contributions
are from a wider range of teams.
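
As a rough sketch of what these features look like from SQL (availability depends on your ClickHouse version, and the table names and URL here are illustrative):

```sql
-- Opt in to the merge join implementation for this session.
SET join_algorithm = 'partial_merge';

SELECT l.id, r.value
FROM left_table AS l
INNER JOIN right_table AS r USING (id);

-- Reading straight from S3-compatible storage via the s3 table function
-- (the column structure must be supplied explicitly).
SELECT count()
FROM s3('https://bucket.example.com/data/*.csv', 'CSV',
        'id UInt64, value Float64');
```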

That said, I don't see ClickHouse replacing OLTP databases any time soon. It's
an analytic store, and many of its design choices favor fast, resource-
efficient scanning and aggregation over large datasets. ClickHouse is not the
right choice for high levels of concurrent users working on mutable point
data. For that, Redis, PostgreSQL, or MySQL are your friends.

------
dang
A related article from last year:
[https://news.ycombinator.com/item?id=20163017](https://news.ycombinator.com/item?id=20163017).
No comments though.

------
bdcravens
Clickhouse Local is also great for importing into your Clickhouse Server: you
can use it to validate and preprocess CSVs into Clickhouse's native table
format.
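
For example, a query like the following, passed to `clickhouse-local --query "..."`, reads a CSV through the file() table function and writes it out in Native format (the path and column structure here are made up):

```sql
-- Run inside clickhouse-local; the CSV is parsed and type-checked
-- against the declared structure on the way in.
SELECT *
FROM file('input.csv', 'CSV', 'id UInt64, ts DateTime, value Float64')
INTO OUTFILE 'output.native'
FORMAT Native;
```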

------
dzonga
Funny thing, I just learnt about clickhouse today. For experienced people who
use columnar stores and pandas for analytics, which tool do you usually prefer
for BI stuff? Do y'all load data into clickhouse and then analyse it using
pandas, or is all analysis done via the clickhouse SQL dialect? As I'm sure
things like pivot tables and rolling windows are a PITA in SQL.

~~~
meritt
Why would you bother using a database like clickhouse to store data if you're
just going to analyze it in pandas? Just store it in a csv, parquet, or orc.

> As I'm sure things like pivot tables and rolling windows are a PITA in SQL

I can't speak for clickhouse, but group-by and window functions are a very
standard part of any SQL analyst's toolbelt.
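
For what it's worth, a 7-day rolling average is short in standard SQL (window-function support varies by engine, so check your database's version; the table here is hypothetical):

```sql
-- Rolling average over the current row and the 6 preceding days.
SELECT
    day,
    value,
    avg(value) OVER (
        ORDER BY day
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS rolling_avg_7d
FROM daily_metrics
ORDER BY day;
```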

~~~
Shelnutt2
Another option to consider here is storing your data in TileDB[1], which lets
you access it via Python/Pandas, MariaDB for SQL (embeddable or standalone),
Spark, R, and more. With embedded MariaDB[2] you can query directly into
pandas with minimal overhead.

TileDB is similar to Parquet in that it can capture sparse dataframe usage,
but it is more general: TileDB supports multi-dimensional datasets and dense
arrays. Other major features built directly into TileDB include handling
updates, time traveling, and partitioning at the library level, removing the
need for extra services like Delta Lake to deal with the numerous Parquet
files you may create.

TileDB also has native support for S3, and in the next release we'll have
native Azure Blob Storage support.

[1] [https://github.com/TileDB-Inc/TileDB](https://github.com/TileDB-Inc/TileDB)

[2] [https://docs.tiledb.com/developer/api-usage/embedded-sql](https://docs.tiledb.com/developer/api-usage/embedded-sql)

Disclosure: I am a member of the TileDB, Inc. team

~~~
dzonga
thank you!

------
polskibus
I wonder how performant this is in comparison to other quick-and-dirty methods
for working on local files, like bash, python, etc.

~~~
hodgesrm
It's much faster in cases that involve significant processing. Where possible,
query execution is vectorized and runs in parallel on as many cores as you
make available.

There are some examples in the article cited by dang:
[https://news.ycombinator.com/item?id=20163017](https://news.ycombinator.com/item?id=20163017)

