Hacker News new | past | comments | ask | show | jobs | submit login
ClickHouse – high-performance open-source distributed column-oriented DBMS (clickhouse.yandex)
243 points by mechmind on June 15, 2016 | hide | past | favorite | 70 comments

This is huge. It seems to me that it's similar to BigQuery but has many other features that I didn't see in other databases.

AggregatingMergeTree is especially one of them and allows incremental aggregation which is a huge gain for analytics services.

Also it provides many table engines for different use-cases. You don't even need a commit-log such as Apache Kafka in front of ClickHouse, you can just push the data to TinyLog table and and move data in micro-batches to a more efficient column-oriented table that uses different table engine.

This looked interesting at first sight (considering I work on ORC & Hive) - the single machine configuration with TinyLog and the aggregate indexes look good[1].

But the more I look into the distributed storage section, the more corners I see cut in the data consistency section.

"There are no quorum writes. You can't write data with confirmation that it was received by more than one replica. If you write a batch of data to one replica and the server with this data ceases to exist before the data has time to get to the other replicas, this data will be lost."

"The reason for this is in case of network failures when the client application doesn't know if the data was written to the DB, so the INSERT query can simply be repeated. It doesn't matter which replica INSERTs were sent to with identical data - INSERTs are idempotent. This only works for the last 100 blocks inserted in a table."

"Failover is automatic (for small differences in data) or semi-automatic (when data differs too much, which may indicate a configuration error)."

"When the server starts (or establishes a new session with ZooKeeper), it only checks the quantity and sizes of all files. If the file sizes match but bytes have been changed somewhere in the middle, this is not detected immediately, but only when attempting to read the data for a SELECT query."

These sort of network blips and bad disk problems happen every week at least once on a big cluster and something like hadoop wastes a lot of IO doing bit-rot checks on cold data all the time.

The nature of clickstream data makes it somewhat okay to lose a few chunks in transit - I can imagine at least a few of the beacons will get dropped purely over the HTTP mechanism which pumps data into the system.

At some point, the data consistency costs money, slows down inserts and creates all sorts of limitations on how recovery of data would play out.

But as a general purpose replicated DBMS which serves as a system of record against fraud allegations (for instance), I can't see this comparing well.

[1] - 200 million rows/sec for join+aggregation is ~2x as good as Hive 2.0 single node LLAP on a hot run (https://people.apache.org/~gopalv/LLAP.gif)

For those who are wondering: Doesn't build under osx, tho the issues initially don't see insurmountable, more that it hasn't been tried under CLANG.

[ 3%] Building CXX object contrib/libpoco/Foundation/CMakeFiles/PocoFoundation.dir/src/AbstractObserver.cpp.o error: unknown warning option '-Wno-unused-local-typedef' [-Werror,-Wunknown-warning-option] error: unknown warning option '-Wno-for-loop-analysis'; did you mean '-Wno-loop-analysis'? [-Werror,-Wunknown-warning-option] make[2]: * [contrib/libpoco/Foundation/CMakeFiles/PocoFoundation.dir/src/AbstractObserver.cpp.o] Error 1 make[1]: * [contrib/libpoco/Foundation/CMakeFiles/PocoFoundation.dir/all] Error 2

A bit of googling shows this is likely because Clang -Wall overrides any other flags set earlier which is apparently different than GCC. This makes me think they aren't lying when they say it supports linux in that they likely haven't tried building it on mac much.

That said it doesn't see to be using any crazy deps that don't support multiple platforms. Poco above is fully cross platform.

Note per apparent comments below: Lots of people develop or use macs and so they'd be interested if they'd have to have a VM or other option to use this. Since the readme is super thin and it just says Only linux xxx I felt they didn't have much info. I'm used to the days where people built projects that compiled everywhere but didn't build packages for them for some reason.

It supports Linux x86-64 with gcc >= 5.0 .

I can read as well. Neither your comment nor their website say how far it might or might not be from running on another platform. My comment was merely a initial analysis for how far it might be from working on a mac which is important to a lot of people in here. Seeing that it's not working due to clang testing shows that they likely haven't tried on mac much.

Did you try to build it on Mac with GCC from Homebrew?

No just default build.

I found this [1] much more informative in terms of what this is good for.

[1] https://clickhouse.yandex/reference_en.html

Ok, we changed the URL to that from https://clickhouse.yandex/.

Reference is way more better than the actual page.

This looked really interesting and then saw this: "This is not a cross-platform system. It requires Linux Ubuntu Precise (12.04) or newer, x86_64 architecture with SSE 4.2 instruction set."

That pretty much rules out most enterprise deployments or big data appliances.

Anyone know the difficulty of getting something like this running on RedHat distros. Surely it's nothing major.

This page seems to cover most of the prerequisites, and some other advice: https://github.com/yandex/ClickHouse/blob/master/doc/build.m...

And it does say this: "With appropriate changes, build should work on any other Linux distribution...Only x86_64 with SSE 4.2 is supported"

Thanks so much for that. It definitely looks like the documentation might be out of sync as there are also some references to ODBC support in the code:


Going to look into building a Spark data source for this so we can see how it well it compares to databases like Cassandra.

Not sure, that it is a problem. If you consider to use such scalable system, you certanly will build a completely separate cluster for it. For such case, there is no real difference what distro to use, even if your ops team are redhat/whatever fans

There's definitely a difference from an operational support standpoint. All of your configuration management for security and third-party tooling has to be rebuilt, you have to handle an entirely different set of patches, potentially some of your monitoring software has to be customized to support another OS, etc.

I don't think it's 2x the work to support two different distros, but it's definitely additional overhead that people may not necessarily want to deal with in order to play with something shiny and new.

What's wrong with just running it in a container?

"Enterprise" is a code word for "we have knowledge inhouse and can do anything" and not "the products we buy restrict us to the limitations set by the company we made ourselves depend on". Free and universal thinking, however, is not found very often in companies, so not every company will grow into an "enterprise".

Interested in how this differs from kdb+, I see it isn't part of the performance comparison chart but seems to have some feature overlap.

Knowing the kdb community, someone will post a benchmark full of nonsensical values in a plain text file available on an ftp site that shows kdb winning by 100000000 of some unknown unit...

In my experience the kdb community does exactly the opposite, showing a real world use case (high frequency trading for example) using real world data and shows that their platform is significantly faster than the competitors. Is it worth the price? Not so sure, but it certain use cases it certainly is.

Yeah kdb is built so that the entire core of the db fits in L1 CPU cache. It really is an impressive feat of engineering and optimization.

I did hate it when we used it however. It required more or less a team of people doing querying on it fulltime. Have you seen the queries in Q? It looks like someone set the baud speed wrong on a serial connection:

From wikipedia's page on Q (the query language of kdb):

    The factorial function can be implemented directly in Q as
    {prd 1+til x}
    or recursively as

Makes sense to me, assuming 'til' is 'iota' and 'prd' is 'fold-with-multiplication'.

I don't know Q, but I'd guess you read it right-to-left like J/K/APL. `x` is the right argument (in J it's `y`), so if we were to call `factorial 5`:

1. Create an array of [0,x) | 0, 1, 2, 3, 4

2. Add 1 to each element of the array | 1, 2, 3, 4, 5

3. Fold the array with multiplication | 120

Array languages are super elegant and fun once you use them a bit.

It's not like that first factorial function is actually any worse than the usual Haskell one:

    fac = \n -> product [1..n]
The recursive one is also much the same as a Haskell one, but Q is hardly built around idiomatic recursion.

This[1] would be the one, right?


Looks really interesting and not another SQL on Hadoop solution. The benchmarks look impressive, but all of the queries were aggregations of a single table. I did not see any joins. I wonder how mature the optimizer is.

I think it's because the docs say "If there isn't enough memory, you can't run a JOIN." . While SQL on Hadoop solutions work also without enough RAM by spilling on disk. I don't think a comparison with JOINs would be fair in this case.

Yandex has their own TLD? :/

Are you really surprised, that largest European internet company has own TLD when even Suzuki, BMW and mormons have one? https://nic.yandex

Oh they are bigger than I realized. Thanks.

The googs have it: https://www.registry.google/ All you need is $$$

I guess the community will step in and for starters will: add support for other Linux distros, and add comments in English alongside Russian in the source code.

This looks impressive as hell.

Wow, just wow.

I guess it's time to rewrite some backends...

Actually, the benchmarks aren't against alternative columnstore DBs, which would really put proof on their claims. Would love to see a comparison with Greenplum or Redshift.

For those who haven't worked with them before, columnstore databases are clever, but also do as much processing as they can at ingestion time, so that they can have good SELECT benchmarks later.

Vertica is a pretty large commercial column store with a lot of customers. Although I'm not sure if the Yandex folks got permission to publish benchmarks about it...

(MonetDB is a classic research columnar database, and I believe MemSQL has a columnar mode. Greenplum is missing, for some reason.)

It looks like they benchmarked against MemSQL 3.2, which is nearly 2 years old and Hive .11, which is close to 3 years. Not sure how out of date the other comparisons are.

That's what I want to know, how fast is it agaisn't Vertica.

Well, they have benchmarks! Against Vertica! Which is surprising, because I imagine Vertica's EULA forbids that.


Wait, Vertica forbids publishing benchmarks? What exactly are they going to do?

Clicked one of the vote buttons while trying to hit reply, sorry if it was a down...

Most DB vendors forbid benchmarks, Oracle, IBM DB2, etc, they claim they can remove your rights to run the software if you publish them

sweet, now where can I see the projections :)

Isn't MonetDB a columnar store? They did benchmark against MonetDB.

Edit: MemSQL too is a columnar store

I stand corrected!

Edit: However, the reference states "At this time (May 2016), there aren't any available open-source and free systems that have all the features listed above". Greenplum does actually implement all of these features, and is open source. I guess I just take issue with all the hyperbole in use.

You managed to check all of their docs already? Because as far as I know greenplum doesn't have incremental aggregation.

"Incremental Aggregation" is just UPDATE (new value = old value + JOIN value), right? If so, yes :)

Yeah... do you have ANY experience with column stores?

Link to greenplum db src code: https://github.com/greenplum-db/gpdb

Actually not all column store databases do work on the ingestion path. Cassandra, HBase etc being notable exceptions. But then again not quite the same use case as the others you mentioned.

This really is begging for a better benchmark.

I thought Cassandra was all about doing work in the ingestion path. Tables are built to specify very narrow access patterns. Data is streamed to disk in a format that's very efficient to read back later (SSTable).

Cassandra is not a column store.

Another summary with examples: http://bit.ly/1Uc5uqb

Awesome, thanks for sharing!

Are they planning to release any libraries? HTTP interface is quite good but I think it'd be better not to have the HTTP overhead and have an direct interface.

Without supporting other operating systems it is hard to consider it as an alternative to anything. We have several clusters using another operating system than the one supported by ClickHouse. Unfortunately few customers are going to invest into a different platform to try out something new like this. The lower the entering bar for new tech is the better.

> ClickHouse manages extremely large volumes of data in a stable and sustainable manner. It currently powers Yandex.Metrica, world’s second largest web analytics platform, with over 13 trillion database records and over 20 billion events a day, generating customized reports on-the-fly, directly from non-aggregated data. This system was successfully implemented at CERN’s LHCb experiment to store and process metadata on 10bn events with over 1000 attributes per event registered in 2011.

Might want to rethink that consideration. :-)

I have systems with 100 trillion records and 100 billion events / day using a technology that runs on CentOS, Ubuntu, FreeBSD, Windows Server and Solaris. What am I doing wrong? Why should I vet a technology that does not run on the platform that our company has chosen, it is approved by security and compliant with all the corporate requirements while the only platform ClickHouse is using is not?

For understanding, for analysis, for education. It's typically a mistake to blow off systems based on compliance checkboxes. Policies can be changed, systems can come into compliance.

What open source columnar storage do you use for 100 trillion records? It's very impressive.

Hive/Tez + ORC on HDFS and on S3. I do not think it is very impressive just by the numbers mentioned so far. The performance of data access like how long an average query runs is much more interesting. You can easily store 100T lines but it is much harder to make it queryable. There is a lots of interesting work being done in that area, many in-memory engines are out there to provide access of a subset of the entire dataset at much faster speed.

Of course I'm asking with context of massive parallel realtime queries from customers as it's going in yandex metrika or google analytics. And yes 100T lines with Hive+ORC it's not an issue, but it's another league.

I assume that only Kudu is a competitor for ClickHouse now. Or may be Greenplum.

Thank you for your answer.

If it runs on Ubuntu it'll run on almost any Linux distribution ;) -- Though, not sure if sse4_2 is enabled on every virtual machine.

And it contains Docker files in the source tree so you can just run it in container on any distribution.

> Without supporting other operating systems it is hard to consider it as an alternative to anything.

They are just published source code to github. Yandex is a Russian search engine. This project is not commercial. You can clone source code from github for adding support of your operation system.

Will do whenever this new system offers any business value for us and it is worth the effort.

Mind Blown /puts hadoop down and reads this.

When to use such a DB?

Imagine the layout on disk for both a row based and a column based database...or see this link:


Now imagine which areas need read when you perform a query like "average price" for all dates. In row-oriented databases we have to read over large areas, in column-oriented databases the prices are stored as one sequential region and we can read just that region. Column-oriented databases are therefore extremely quick at aggregate queries (sum, average, min, max, etc.).

Why are most databases row-oriented? I hear you ask. Imagine we want to add one row somewhere in the middle of our data for 2011-02-26, on the row oriented database no problem, column oriented we will have to move almost all the data! Lucky since we mostly deal with time series new data only appends to the end of our table.

I see.

So if my data consists of time series I use column stores?

Is this the only use case? The page talks about games.

so this is basically Russian Cassandra? Written in C, not Java. Enough for me; will test it.

No, it's actually not.

Cassandra is not columnar db. It means if you need all user_id from your table Cassandra will scan all your data (e.g. 10PB) on disks. But ClickHouse will only scan 1 column file (e.g. 10GB).

ClickHouse is Russian Kudu.

Applications are open for YC Summer 2021

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact