Show HN: ClickHouse-local – a small tool for serverless data analytics (clickhouse.com)
289 points by zX41ZdbW on Jan 5, 2023 | 70 comments
My friend Vitaly Ludvichenko and I ran an experiment to combine the ClickHouse server and client into a self-contained program that runs a database engine and processes data without a server: https://github.com/ClickHouse/ClickHouse/pull/150

Development has continued over the past 6 years, and clickhouse-local has become a Swiss Army knife for data processing. Think "ffmpeg" for datasets, and more.

It resembles textql, octosql, dsq, duckdb, trdsql, q, datafusion-cli, and spyql, but has better capabilities and performance.

Here is a tutorial: https://clickhouse.com/blog/extracting-converting-querying-l...

"serverless" in the same sense as here: https://www.sqlite.org/serverless.html and also in the more common sense - clickhouse-local can be packaged into AWS Lambda and serve queries on a per-request basis, as here: https://github.com/ClickHouse/ClickHouse/issues/43589




Congrats on the Show HN!

It's great to see more tools in this area (querying data from various sources in-place) and the Lambda use case is a really cool idea!

I've recently done a bunch of benchmarking including ClickHouse Local, and the usage was straightforward, with everything working as it's supposed to.

Just to comment on the performance avenue though, one area I think ClickHouse could still possibly improve on - vs OctoSQL[0] at least - is that it seems like the JSON datasource is slower, especially if only a small part of the JSON objects is used. If only a single field of many is used, OctoSQL lazily parses only that field, and skips the others, which yields non-trivial performance gains on big JSON files with small queries.

Basically, for a query like `SELECT COUNT(*), AVG(overall) FROM books.json` with the Amazon Review Dataset (10GB), OctoSQL is twice as fast (3s vs 6s). That's a minor thing though (OctoSQL will slow down for more complicated queries, while for ClickHouse decoding the input is and remains the bottleneck, with the processing itself being ridiculously fast).
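For reference, a rough clickhouse-local equivalent of that query (assuming the dataset sits locally as newline-delimited JSON in a file named books.json - adjust the path and format to your copy) would be something like:

    ./clickhouse local -q "SELECT count(), avg(overall) FROM file('books.json', JSONEachRow)"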

Godspeed with the future evolution of ClickHouse!

[0]: https://github.com/cube2222/octosql


There's newer JSON support in ClickHouse where you can load arbitrary JSON objects into a JSON column type:

https://clickhouse.com/blog/getting-data-into-clickhouse-par...

Behind the scenes it flattens things into real columns with auto-detected types.

I suppose if you planned on running many queries that would be faster, but I bet octosql could also convert it to something like parquet for similar gains.

There are also some tricks you can do with JSON filtering if strings are involved. If you have a query like "select path from http_logs where hostname = 'news.ycom...'", you can read a whole chunk of data, do a `memmem` for the string, and if there are no matches, not even bother decoding ANY of the JSON.
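A crude shell-level version of the same idea (just an illustration - this is not how ClickHouse implements it internally, and http_logs.json plus its columns are made up for the example) is to pre-filter with a fixed-string search so that lines which cannot possibly match never reach the JSON parser:

    grep -F 'news.ycombinator.com' http_logs.json \
      | ./clickhouse local \
          --input-format JSONEachRow \
          --structure "hostname String, path String" \
          -q "SELECT path FROM table WHERE hostname = 'news.ycombinator.com'"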


It is interesting to see how the ClickHouse webpage has changed lately. If I didn't know there was an open-source version, I would not find it.

You have to go to the Product menu to see Open source ClickHouse, or scroll several screens to find the low-contrast Open source quickstart button, right next to the high-contrast Try Cloud now button. That button takes you to a tutorial section of the documentation, where you have to scroll past an intro video on getting started with ClickHouse Cloud, and then switch from the ClickHouse Cloud tab to the Self-Managed tab to get to how to actually use it.

It's like the non-hosted version is a secret!

I understand everybody wants to make money, and people took money from VCs and they want results, etc. But I remember a time when the webpage was just the docs and that was it, and this leaves a bit of a sad taste in my mouth.


This is not what we want. I admit this problem. Changes will be made.

Just for amusement, here is the 2016 website version: https://presentations.clickhouse.com/original_website/


  % ./clickhouse local -q "SELECT toHour(time) h, count(*) t, bar(t, 0, 100, 25) as c FROM url('https://datasets-documentation.s3.amazonaws.com/hackernews/clickhouse_hacker_news.csv', CSVWithNames) GROUP BY h ORDER BY h"
This is awesome. A one-line graph from the internet.


This is, admittedly, the most unusual use of "serverless" I've ever heard of


To me it makes more sense to call something that doesn't need a separate server "serverless". It's quite weird that the usual use of "serverless" means something on a server that you don't manage.


Ha! I agree! That usage seems to presume YOU are "serverless", not the application or the system running it.


In the context of (cloud-native?) databases, serverless seems to really mean something else entirely: https://me.0xffff.me/build-database-in-2020s.html

But to me, serverless the way SQLite is also makes ample sense.


And yet, at the same time, it does make some sense, as they took the “server” component away and packaged it as a regular cli tool.

I’d argue that it’s actually a more accurate description than the more common use of the term.


Huh! I recently used this to play with CH, and had no idea it was so new! Also, I've been impressed with the range of formats/protocols/types of storage CH can use, and it's a good strategy by the developers because it makes it much more appealing to consider working with.

For what it's worth, my main uncertainty about CH at this stage is about just how we can/can't use joins while keeping memory use reasonable. (What I've gathered from playing around is CH may fail a query because it wants to create a hashtable that's too big for RAM, when other tools would spill temporary data to disk.) We can minimize joins by denormalizing, but can't completely get rid of them, and notably joins are really handy for the early steps of building "wide" denormalized tables from normalized source data.

I know CH has run the Star Schema Benchmark so it must support some join strategies; still, a lot of the examples focus on the single-table queries CH is best at, and so far I have less of an idea what join strategies CH can use and what the limitations are so I can work within them.


This inspired me to dig around a little more, and the details at https://clickhouse.com/docs/en/operations/settings/settings#... leave me more confident you can do large joins if your flow requires them. It still looks like CH's defaults mean extra work if you're used to writing queries and letting the query planner worry about join strategies.
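For example, here is a hedged sketch of opting into a disk-friendlier strategy per query (the files and column names are invented; `partial_merge` is one of the documented values of the `join_algorithm` setting):

    ./clickhouse local -q "
        SELECT count()
        FROM file('orders.parquet', Parquet) AS o
        INNER JOIN file('customers.parquet', Parquet) AS c ON o.customer_id = c.id
        SETTINGS join_algorithm = 'partial_merge'
    "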


Is there some way to use this on Windows / MinGW shells?

  $ curl https://clickhouse.com/ | sh
    % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  100  2473    0  2473    0     0   2063      0 --:--:--  0:00:01 --:--:--  2078
  Operating system 'MINGW64_NT-10.0-18362' / architecture 'x86_64' is unsupported.

I do not like Windows, I am forced to use it where I am right now.

It would be great if software producers would finally accept the reality that there are at least three mainstream operating systems (Windows, OSX, Linux) that should be supported.


On Windows, Linux is supported via WSL2 - and it's generally a better *nix/POSIX environment than MinGW/Cygwin, etc.

I'm not saying most software should just not support windows - but in this case you should sort of already have what you ask for?


This. My laptop runs Windows (for non-programming reasons). It would feel like a prison except that WSL is really good--far more convenient than messing around with MacOS workarounds like brew. If you can't run Linux directly, Windows is the way to go.


What happens if you download the install script and edit it?

I don't have a windows machine at hand right now so I can't check myself.

Something like this:

    curl https://clickhouse.com/ --output install.sh
Then open install.sh with your favorite text editor, and replace the first 58 lines with code setting the `OS`, `ARCH` and `DIR` variables as suitable (you might get away with OS=`linux` if the Windows Subsystem for Linux is installed?).
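Something along these lines, perhaps (the exact values are guesses - check what the remainder of install.sh actually expects before trusting them):

    # in install.sh, in place of the removed detection block:
    OS="linux"    # only sensible if you run the resulting binary under WSL
    ARCH="amd64"  # or "aarch64"; match whatever the download URL below uses
    DIR="."       # where the clickhouse binary will be written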


Looks like no Windows support for the whole ClickHouse product family!

This is actually great as it gives me one more reason to ask for a real operating system! Now with clickhouse-local only available on Linux (or OSX) I have a very good point to make, as everybody loves data tools like these! Finally!

Follow up blog post to be done: "How clickhouse-local helped me to finally introduce Linux as a client OS in our working group."


Will this be available with APIs so it can be used embedded, like DuckDB?

All the examples I see are using the CLI.


It is not available as a library, only CLI.


Ah, okay. Won't work for my use case. Still cool though!


As the author of textql ( https://github.com/dinedal/textql ) - thanks for the shoutout!

Looks great, I love more options in the space for CLI based data analysis tools! Fantastic work!


This is really beautiful! I can think of a ton of use cases; so often I'm writing scraps of code or setting up quick DBs just to parse some specific dataset. I have a couple (naive) questions:

1. How are column types inferred? Does it try to avoid FLOAT types? What about blobs or JSON? (It would be a pretty stunning trick if it could notice valid JSON in a CSV file)... [edit: I see now from other comments that JSON is supported; nifty =)]

2. Does it try to infer charsets or just go with UTF8-mb4 or something else?

and... any chance we'll see an NPM version?


> It resembles textql, octosql, dsq, duckdb, trdsql, q, datafusion-cli, and spyql, but has better capabilities and performance.

Higher performance than q? Is that one of the open-source implementations, or q on kdb? If the latter, I would be interested in seeing some performance comparisons.

Also, congrats on this. It is right up my alley. I like DuckDB for columnar data analytics, but have run into issues dealing with some larger datasets. I am interested in testing this.


I think they're talking about https://github.com/harelba/q, which is not very fast.


Any plans to be able to run this directly in the browser, similar to DuckDB? Nice to see more options in the space.


It’s an amazing tool. It really opens the world up to all the cool stuff clickhouse has to offer.


Can Clickhouse local query from files on AWS S3?

We recently had some issues querying parquet files on AWS S3. I tried benchmarking various tools and DuckDB seemed the most promising. Didn't find good documentation for clickhouse local with AWS.


Sure, you can read from S3.

You can define a table that will look at your files, like this: https://clickhouse.com/docs/en/integrations/s3/s3-table-engi...

Or just use s3 table functions https://clickhouse.com/docs/en/integrations/s3/s3-table-func...

You can use it as a command line tool running one query at a time or in interactive mode.
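With clickhouse-local specifically, the s3 table function makes it a one-liner (the bucket and path here are placeholders; for a private bucket you can pass the key and secret as extra arguments to s3()):

    ./clickhouse local -q "
        SELECT count()
        FROM s3('https://my-bucket.s3.amazonaws.com/logs/*.parquet', 'Parquet')
    "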


Do you know if it's possible to tell clickhouse-local (without needing a config file) to use environment variables for AWS authentication? I know there is the "use_environment_credentials" property in the config, but I really don't want to create a config for easy use cases like just querying some data in a private S3 bucket.


I recently had a few months of AWS Cloudfront CSV logs I needed to pull some analytics from, and I ended up using clickhouse-local to do the analysis and was just blown away by it. Super fast and easy to use.


Can you share some details on how you did this? I am interested too! Did you use S3 table functions or just downloaded the CSV locally and ran `clickhouse-local`?


I wonder how hard it would be to make this available in-process, to compete with DuckDB?


So cool. Similarly, can we also build clickhouse-remote?


If you downloaded `clickhouse-local`, you can simply rename it to `clickhouse-server` and run it. It will automatically transform itself into a ClickHouse server.
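In other words, something like:

    mv clickhouse-local clickhouse-server    # or: ln -s clickhouse-local clickhouse-server
    ./clickhouse-server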


> curl https://clickhouse.com/ | sh

It feels a bit weird to not use a subdomain (install.clickhouse.com) or a specific route for this, and instead mix this in with your marketing site.


Yes, and it's a bit annoying that the script only magically gets returned based on headers set (presumably user agent) - as that means not all cli/http download tools will work to get the script.

At least the script is simple - in fact so simple that just linking to https://builds.clickhouse.com/master/ with a directory listing would have been preferable. Unfortunately that doesn't work, and one has to link to an actual file, like https://builds.clickhouse.com/master/macos-aarch64/clickhous... (which of course will change depending on what master points to - so it's the worst of both a listing and a direct URL...).


Tangential, but I'm curious about the installation command:

    curl https://clickhouse.com/ | sh
It seems like it just detects curl('s user agent?) and serves up the script, serving HTML otherwise (ex. `wget -O- ...` would print HTML). I wonder what the design decisions here were.


Yes, it is detecting curl by the user agent - by using the "page rules" in Cloudflare. Cloudflare is good for traffic management rules.

I've created this distribution channel as an experiment and for my own needs - I use it to download and install ClickHouse without needing a package manager. So I don't need to run "apt update" to install ClickHouse. It works if the package manager is broken, and it works well on very old servers and rare Linux distributions.

It is using the fresh build from the master branch. I have also created a website for my team to monitor the status of the continuous integration system: https://aretestsgreenyet.com/

We also have deb, rpm, tgz and Docker as wrappers for the release builds.


> I've created this distribution channel as an experiment and for my own needs - I use it to download and install ClickHouse without needing a package manager. So I don't need to run "apt update" to install ClickHouse. It works if the package manager is broken, and it works well on very old servers and rare Linux distributions.

Makes perfect sense! Thanks for the explanation :D


Interesting - for me, a broken package manager is an indicator that the server needs to be fixed or reprovisioned. I'd be surprised to see some DB running on such a server - it's highly likely missing updates as well.


Totally agree. But it's usually more complicated.

For example, MongoDB supports Ubuntu 20.04, but not 22.04: https://www.mongodb.com/docs/manual/tutorial/install-mongodb...

Or another example - folks in a large company have been successfully running Ubuntu 12.04 in production for 10 years and cannot upgrade it because some Perl dependencies of their amalgamation of microservices would work slightly differently.


Hmm, how do you create this "page rule" in Cloudflare? I can't seem to figure it out.



Sketchy would be detecting bash and serving a different page depending on that; this is only a bit hackish.

Now that I’m thinking about it, normal browsers usually ask for text/html or text/* as a preference, don’t they? While curl and wget just say Accept: */*. Then content negotiation could achieve the same result per the actual HTTP semantics, without any hacks involving the User-Agent header.


    curl -vvv  https://clickhouse.com/
    ...
    > user-agent: curl/7.79.1


Yes, that's obvious. The question is more WHY -- why serve curl detection from the root domain? Why only detect curl? ...


easier to remember download link?


There’s nothing in that link that says "I’m a download link" and I don’t think you would remember that the site you have open in your browser also serves as the download link if you curl it with the right user-agent.


What's the evidence for superior performance? The post submitter, zX41ZdbW, has made uncharitable comments about DuckDB in previous threads.


Here is one example: https://colab.research.google.com/github/dcmoura/spyql/blob/... Note: DuckDB is not present in this comparison, because it does not support JSON processing (only CSV and Parquet). ClickHouse supports everything.

Here is another example: https://datastudio.google.com/u/0/reporting/6a2c38d4-3a22-41... - startup times for serverless engines. Note: DuckDB is not present in this comparison, I'm not sure why.

Both comparisons are independent.

I know a few cases when clickhouse-local is worse than DuckDB:

- it does not use the embedded metadata to filter while processing Parquet files;

- the syntax for accessing the files in s3 is clunkier in clickhouse-local;

- finally, there is no Python module and integration with dataframes.

I know many cases when clickhouse-local is better than DuckDB. Performance is mostly better, because ClickHouse has a more advanced query engine. DuckDB is mostly keeping up, and in some scenarios is already ahead. Query language, data type support, feature completeness, stability and testing - much better in ClickHouse.

I did not make uncharitable comments about DuckDB. We have recently met with Hannes Mühleisen and the team from DuckDB Labs, and I have a very good impression of the technology and the team. I see them as our friends. I'm also enthusiastic about all data processing technologies.


I'd love to see evidence for some of these claims, like stability and testing. Would you also be able to comment on correctness?

(co-founder at MotherDuck)


This summer I was preparing the ClickBench: https://benchmark.clickhouse.com/

When I tried to use DuckDB on the same dataset as ClickHouse, it simply did not work due to OOM: https://github.com/duckdb/duckdb/issues/3969

I also told them about our experience with various memory allocators, and why you should never use GLibC's malloc.

This issue was fixed.


The memory aspect is pretty interesting.

I tried to play with "serverless ELT" (the other kind of serverless), where I would define an AWS Lambda that turns incoming CSV files into Parquet for archiving and querying.

It seems that in DuckDB, the amount of memory it needs to do that is always at least a few gigs, and the memory goes up proportionally with the file size (which I think is due to data type inference?). That is both expensive and annoying, because you are either overpaying or you need to go and increase the size of your Lambda when things start crashing.

I wonder if ClickHouse-local or some other tool can do that with constant memory, no matter the file size. I know Spark can, but Spark is kind of a pain to work with.

(Yes, I do realize this is a bizarre use case.)
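For anyone curious about the ClickHouse side of this, the conversion itself looks roughly like the sketch below (file names invented; I have not measured how its memory usage scales with input size):

    ./clickhouse local -q "
        SELECT *
        FROM file('input.csv', CSVWithNames)
        INTO OUTFILE 'output.parquet'
        FORMAT Parquet
    "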


Thanks, sounds like relative immaturity.

I didn't catch you addressing correctness?

FWIW, I have some experience with ClickHouse, as I ran product at Firebolt and played a critical role in being more transparent about their foundations and giving credit where it is due. That first-hand experience with ClickHouse was jarring, considering my previous experience was with BigQuery.



Recently I tried to load another dataset to DuckDB: https://github.com/ClickHouse/ClickHouse/issues/42363 for my own entertainment.

But it did not work (the server became unresponsive after consuming all memory).


From my practical experience with ClickHouse, I saw OOMs as well, on load and on query. How confident are you that your chosen dataset is neutral?


ClickHouse will throw an exception in case of not enough memory and continue to serve other queries. Under certain configurations it can OOM as well.

> How confident are you that your chosen dataset is neutral?

I have no idea if it is "neutral", I picked it randomly.

I test ClickHouse on every interesting dataset, see here: https://github.com/ClickHouse/ClickHouse/issues?q=is%3Aissue...

The reason - I love working with data :) If I see a dataset, I load it into ClickHouse - this is the first thing I do. This is not a kind of marketing or promotion of ClickHouse - you know, if it were some directed task, it would be uninteresting for me.


wonderful to hear! So any word on correctness or nah?


I talked about

> Query language, data type support, feature completeness, stability and testing

but said nothing about correctness.

In terms of stability, I see a couple of pretty old and still unresolved issues about memory safety (data races, segmentation faults) in your repository, found by users.

In contrast, most of the memory safety issues in ClickHouse are found by continuous fuzzing before the release. And finding similar issues will give you a reward: https://github.com/ClickHouse/ClickHouse/issues/38986

Our testing system successfully finds issues in well-known and widely used libraries - jemalloc, RocksDB, gRPC, AWS, Arrow, Avro, ZooKeeper, the Linux kernel... It is kind of surprising, and it gives the impression that we are the only product that does testing for real.

I also remember an example of using SQLancer from 1.5 years ago. When SQLancer appeared, we started using it on ClickHouse, and it found a few issues and one crash. At the same time, it found a lot of crashes in DuckDB. But this example is very old, and DuckDB has evolved a lot since then - it is a much younger technology, after all.


> The post submitter, zX41ZdbW, has made uncharitable comments about DuckDB in previous threads.

I felt this too [0], but I guess it is mostly down to the fact that perhaps English may not be their first or even second language.

[0] https://news.ycombinator.com/item?id=34026661


hmmmm interesting. clickhouse becoming duckdb while duckdb becoming clickhouse.


I believe that's Newton's First Law of DBMSs


Next they become an email system.


For those that don't know, this is referring to MotherDuck which was recently announced: https://motherduck.com/


nb: The engineers and academics leading the development of DuckDB don't work for MotherDuck (yet).


It will be interesting to see what MotherDuck is. Is it going to be like Jupyter notebooks but for DuckDB? Is it going to be something that scales vertically in the cloud? Hard to tell from the marketing so far :p


my impression from a short chat with their cofounder is: think "Firebase for DuckDB". Pretty solid initial plan, at least


Off-topic question - what is the meaning of the "follow:" list in your HN bio? Is it ingested into some tool somewhere?


No tool; mostly a reminder-to-self to keep an eye on comments from those users. Also helps when researching keywords on hn.algolia: For ex, "gpu pcwalton", "trust mjg59", "advice nostrademons" etc

See also: tptacek's must-read list https://news.ycombinator.com/user?id=tptacek



