
Migrating from Redshift to ClickHouse - hodgesrm
https://www.altinity.com/blog/migrating-from-redshift-to-clickhouse
======
mmcniece
Cool to see ClickHouse get some interest here. Been using it for about a year
now and am consistently impressed with the performance on analytical queries
across very large (1B+ row) tables.

Couple pain points though:

* Integration Maturity - Many tools/services either don't have integration with CH or are missing features

* User Management/Security - Have to configure users in their custom XML format, only applied at database level (no table level or row level option), and doesn't plug into SSO, LDAP, etc.

* Getting "current state" for a table - e.g. a table has users and some attributes; it's harder than it should be to get the "current attribute value" for all users to run analytics on

* Log Format - very challenging to pull into a log aggregation tool and extract helpful information from
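On the "current state" point, the usual workaround is an argMax aggregation over the change history. A minimal sketch, assuming a hypothetical user_attributes table that logs one row per attribute change:

```sql
-- Hypothetical change-log table: one row per attribute update.
CREATE TABLE user_attributes (
    user_id    UInt64,
    plan       String,
    updated_at DateTime
) ENGINE = MergeTree()
ORDER BY (user_id, updated_at);

-- argMax(plan, updated_at) returns the plan value from the row
-- with the highest updated_at, i.e. each user's current plan.
SELECT
    user_id,
    argMax(plan, updated_at) AS current_plan
FROM user_attributes
GROUP BY user_id;
```

The other common approach is a ReplacingMergeTree engine plus SELECT ... FINAL, though FINAL carries its own performance cost.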

------
wgjordan
> Reasons to move to ClickHouse

> the [VACUUM] process requires an outrageous amount of time

As of Dec 19 2018, Amazon Redshift now runs VACUUM DELETE automatically, and
has been made drastically more resource-efficient [1].

> There are no queries in Redshift that take less than a couple of seconds.

This is likely due to incorrectly-tuned Workload Management Queues. In
addition, as of Aug 8 2018, Redshift automatically enables short query
acceleration [2], which should speed up short queries by default without
additional tuning.

> According to our calculations, deploying ClickHouse on AWS instances with
> the same resources was exactly half as expensive.

You also need to factor in the cost of migration (two full-time specialists,
three months effort), and ongoing support effort needed to maintain the now-
custom system.

Plus, it sounds like some (possibly similar?) cost savings could have been
achieved simply by optimizing the existing Redshift cluster:

> We can’t make a straight comparison about query speed because the data
> schema changed so much. But many queries sped up simply because less data
> are read from the disk. Truth be told, we should have made this change in
> Redshift, but we decided to combine it with our migration to ClickHouse.

In short, continue testing assumptions and tracking updated feature-sets on a
regular basis: valid reasons to move to ClickHouse yesterday may no longer be
valid today, or at some point in the future. Also, don't double down on sunk
costs: it might make sense to migrate back to Redshift in the future, if your
issues with the system are eventually resolved. Or maybe not, depending on
your use-case.

[1] https://aws.amazon.com/about-aws/whats-new/2018/12/amazon-redshift-automatic-vacuum/

[2] https://aws.amazon.com/about-aws/whats-new/2018/08/amazon-redshift-automatically-enables-short-query-acceleration/

~~~
hodgesrm
I think the authors were up-front that some of the issues they encountered (for
example, table optimization) could have been addressed in Redshift.

What I found intriguing about the cost savings was that they were achieved on
Amazon, as opposed to going on-prem, where you can push costs down further,
especially if the workload is reasonably stable. It's therefore likely that
there are still substantial savings to be squeezed out. That gap gets bigger
as you scale to large cluster sizes.

In thinking about costs it's also important to remember you can run a data
warehouse perfectly well without a cloud management plane. It really depends
on the data warehouse complexity, how many other things the team has to
manage, and their comfort level with the technology. Redshift--to its credit--
has so changed the conversation about data warehouse operation that we tend to
forget this.

Disclaimer: I work for Altinity and the article came from our blog. We did not
write it and the authors are not customers.

------
manigandham
ClickHouse is a fantastic piece of engineering but really needs work on the
operational and maintenance side. Replication, backups, updates/deletes, etc.
are very rough and a big obstacle to greater usage.
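To make the updates/deletes point concrete: ClickHouse implements them as asynchronous mutations that rewrite whole data parts in the background, not as transactional row operations. A sketch, assuming a hypothetical events table:

```sql
-- Mutations: asynchronous, rewrite the affected parts in the background.
ALTER TABLE events DELETE WHERE user_id = 42;
ALTER TABLE events UPDATE status = 'churned' WHERE user_id = 42;

-- They return immediately; completion has to be polled separately.
SELECT command, is_done
FROM system.mutations
WHERE table = 'events';
```

Fine for occasional bulk corrections, rough if your workload expects frequent row-level changes.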

That being said, even a single server can be many times faster than other data
warehouse solutions if your data fits.

------
devereaux
I love ClickHouse: it's simple, yet flexible enough, and it's free software.

I'm migrating a lot of data to clickhouse on clusters of servers.

One of the unexpected gains is how much less disk space it uses, which means
the data can be put on a few NVMe drives in RAID 10, providing speed benefits
even for data that can't fit in memory.

I am now also considering ClickHouse for "cold storage": either by compressing
the directories or leaving them as-is on a smaller server.

Being able to standardize on one thing for both production and storage would
be nice.

~~~
PeterZaitsev
Can you share results? What is the space consumption difference for your data?

~~~
sethhochberg
Not the poster above, but using Clickhouse for similar purposes
(archival/analytics on huge data that started out in MySQL and gets imported
to Clickhouse for the long term). Some real-world numbers on our end for the
same time period, roughly 2.5 billion rows:

InnoDB Barracuda: 583GB logical (data + indexes), 386GB physical on ZFS LZ4
(~1.51x compression ratio)

ClickHouse MergeTree: 349GB logical, 279GB physical on ZFS LZ4 (~1.27x
compression ratio)

The MySQL version of the data has a few different compound indexes, which
certainly contribute to the size difference, but regardless, Clickhouse is
dozens of times faster on complex queries against this data.

We run both our MySQL and Clickhouse servers on ZFS (Clickhouse accesses its
datastore over NFS on a 10gbit link... in practical terms we have not seen a
severe performance penalty for doing this, even though we know it isn't
encouraged by the Clickhouse maintainers).

~~~
devereaux
[my numbers match yours]

It's very interesting that we made more or less the same choices: NFS on a LAN
for cold storage. In practice, I haven't seen performance penalties either. I
will certainly standardize on that.

If needed, 'reviving' the cold-stored data is as simple as copying it to other
machines-- which will also offer sharding.
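For what it's worth, that copy-and-revive flow maps onto ClickHouse's detached-parts mechanism. A sketch, assuming a table named events under the default data path:

```sql
-- On the target machine: recreate the table with the same schema,
-- copy the frozen parts into
--   /var/lib/clickhouse/data/default/events/detached/
-- then attach them, per partition (e.g. partition 201901):
ALTER TABLE events ATTACH PARTITION 201901;
```

The paths and partition ID here are illustrative; they depend on your database name, table name, and partitioning key.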

My only complaint is that servers have to be identical, as ClickHouse doesn't
seem to be able to account for speed differences between nodes. My backup
servers are quite different from the production server, but in an emergency
I'd like to be able to use them all.

