
Author here.

A few weeks ago, we wrote about how we implemented SIMD instructions to aggregate a billion rows in milliseconds [1], thanks in great part to Agner Fog’s VCL library [2]. Although the initial scope was limited to table-wide aggregates producing a single scalar value, it was a first step towards very promising results on more complex aggregations. With the latest release of QuestDB, we are extending this level of performance to key-based aggregations.

To do this, we implemented Google’s fast hash table aka “Swisstable” [3] which can be found in the Abseil library [4]. In all modesty, we also found room to slightly accelerate it for our use case. Our version of Swisstable is dubbed “rosti”, after the traditional Swiss dish [5]. There were also a number of improvements thanks to techniques suggested by the community such as prefetch (which interestingly turned out to have no effect in the map code itself) [6]. Besides C++, we used our very own queue system written in Java to parallelise the execution [7].
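For readers curious what Swiss-table probing looks like, here is a minimal, hypothetical sketch (not QuestDB's actual "rosti" code, and simplified from Abseil's): each slot carries a one-byte control tag derived from the hash, and a single SSE2 comparison checks 16 tags at once, so full keys are only compared at candidate slots.

```cpp
#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics

// Width of one control-byte group probed per SIMD comparison.
constexpr int kGroupWidth = 16;

// Compare 16 control bytes against a 7-bit hash tag in one shot.
// Bit i of the returned mask is set when ctrl[i] == tag, so the caller
// only inspects candidate slots instead of every slot in the group.
inline uint32_t match_group(const int8_t* ctrl, int8_t tag) {
    __m128i group = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ctrl));
    __m128i eq = _mm_cmpeq_epi8(group, _mm_set1_epi8(tag));
    return static_cast<uint32_t>(_mm_movemask_epi8(eq));
}
```

Empty and deleted slots use reserved control values outside the 7-bit tag range, so they never match a valid tag and probing can skip them for free.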

The results are remarkable: millisecond latency on keyed aggregations that span over billions of rows.

We thought it could be a good occasion to show our progress by making this latest release available to try online with a pre-loaded dataset. It runs on an AWS instance using 23 threads. The data is stored on disk and includes a 1.6 billion-row NYC taxi dataset, 10 years of weather data with around 30-minute resolution, and weekly gas prices over the last decade. The instance is located in London, so folks outside of Europe may experience different network latencies. The server-side time is reported as “Execute”.

We provide sample queries to get started, but you are encouraged to modify them. However, please be aware that not every type of query is fast yet. Some are still running under an old single-threaded model. If you find one of these, you’ll know: it will take minutes instead of milliseconds. But bear with us: it is just a matter of time before we make these instantaneous as well. Next in our crosshairs are time-bucket aggregations using the SAMPLE BY clause.

If you are interested in checking out how we did this, our code is available open-source [8]. We look forward to receiving your feedback on our work so far. Even better, we would love to hear more ideas to further improve performance. Even after decades in high performance computing, we are still learning something new every day.

[1] https://questdb.io/blog/2020/04/02/using-simd-to-aggregate-b...

[2] https://www.agner.org/optimize/vectorclass.pdf

[3] https://www.youtube.com/watch?v=ncHmEUmJZf4

[4] https://github.com/abseil/abseil-cpp

[5] https://github.com/questdb/questdb/blob/master/core/src/main...

[6] https://github.com/questdb/questdb/blob/master/core/src/main...

[7] https://questdb.io/blog/2020/03/15/interthread

[8] https://github.com/questdb/questdb






Cool project! You posted a comparison to ClickHouse on the front page. Do you have any insights on how QuestDB compares performance-wise to PrestoDB?

Also, PrestoDB is known to be manageable at scale, for example as part of AWS Athena. What are your thoughts on building out an Athena-like service using QuestDB?


We will definitely make QuestDB distributed in the near future. Today, our focus is to extract the maximum performance out of a single node. If we start scaling out too early, it may become harder to improve performance after the fact. So all improvements we get today are foundational: they will have multiplicative effects once you start to scale out.

Athena has been benchmarked here with a similar dataset (1.1 billion rows vs. our 1.6 billion):

https://tech.marksblogg.com/benchmarks.html


Hi Vlad,

What BI tools can currently connect to QuestDB for ad-hoc reporting (without the need to write SQL queries by hand)? I guess popular solutions like Tableau/Power BI don't support QuestDB.

Is there a demand for BI tools that can use QuestDB as a data source? I'm asking because we could add a connector for QuestDB to our BI tool (SeekTable) with little effort, but this only makes sense if someone will actually use it.


I cannot speak for the whole industry, but I know for sure that I (working in BigData / OnlineMarketing) was looking for exactly this recently: a BI tool that is agile and a database that is fast. I ended up with Tableau Server and ClickHouse (which is impressively fast). The problem was: Tableau does not really support CH, and Tableau is not that agile. So to answer your question: yes. If QuestDB is that fast, there is a demand.

JFYI, SeekTable already has a connector for ClickHouse :) It works over the native binary protocol (with connection-level compression). Since you have control over the SQL query and the SQL expressions used for dimensions/measures, the ClickHouse SQL dialect is not a problem.

That's a very cool demo, good job! I have a few questions if you don't mind.

- Could you share the specs of the hardware powering the demo?

- For IoT-type usage, one common access pattern is "get the readings of sensor X from device Y in the given time frame". But unfortunately, the dataset in the demo doesn't include something like a "cab_id" column that would allow me to simulate such a query. Is there a demo that has such a column?

- Does QuestDB require rows to be inserted strictly in order, or is there some leeway?

- Are you using any sort of compression? Timeseries columns are extremely compressible, and a good compression algorithm can greatly reduce IO wait times. We're using the Postgres-based TimescaleDb at work, and its compression capabilities have been a game changer [0].

- Do you plan on supporting GROUP BY clauses?

[0]: https://blog.timescale.com/blog/building-columnar-compressio...


Thanks for your feedback, I am pleased to know that you like it!

- This is a c5.metal instance (AWS) with 48 physical cores and 192 GB of memory. We use only 24 of those 48 cores.

- If I understand your example well, the query "select * from trips where trip_type = 'Dispatch' and pickup_datetime = '2018-03' and vendor_id = 'VTS';" should be similar. There is no limit on how many filters you can specify in the WHERE clause, and they have a very minor (if any) impact on performance.

- Currently yes, but this is something we are working on. Out-of-order insertion is one of our priorities, and you can expect it to be implemented in the very near future.

- Currently no. However, the way data is stored means that columns are extremely compressible, as you mentioned. Our internal API (how we store and access/retrieve columns) would make it easy for us to implement compression; we will do it if it is important to our users. In the meantime, a filesystem like Btrfs or ZFS would do a nice job (if disk usage is your concern).

- GROUP BY is automatically inferred, users don't need to specify it explicitly, you can find more details about this here: https://questdb.io/docs/sqlExtensions#important-differences-...

EDIT: fix typos


Great demo!

I'm curious why the following query takes tens of seconds to finish:

    select sum(passenger_count) from trips where passenger_count=1
The same query executes in around a hundred milliseconds on ClickHouse running on similar hardware (~20 vCPUs).

The use of a WHERE clause currently turns off vectorization completely and falls back on simple single-threaded execution. This is why the query is slow right now. We are going to vectorize WHERE really soon.
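For the curious, one common way to make such a filtered sum vector-friendly is a branchless formulation: the predicate becomes a 0/1 multiplier, so compilers can auto-vectorize the loop with SIMD compare and masked add instead of scalar branching. This is a generic sketch, not QuestDB's actual implementation:

```cpp
#include <cstdint>
#include <vector>

// Branchless SUM(col) WHERE col == key.
// (v == key) evaluates to 0 or 1, so there is no branch in the loop
// body and the compiler is free to vectorize it.
int64_t sum_where_equals(const std::vector<int32_t>& col, int32_t key) {
    int64_t total = 0;
    for (int32_t v : col) {
        total += static_cast<int64_t>(v) * (v == key);
    }
    return total;
}
```

For a query like `select sum(passenger_count) from trips where passenger_count=1`, this collapses to counting matching rows, which is exactly the kind of loop SIMD handles well.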


