A few weeks ago, we wrote about how we implemented SIMD instructions to aggregate a billion rows in milliseconds  thanks in great part to Agner Fog’s VCL library . Although the initial scope was limited to table-wide aggregates into a unique scalar value, this was a first step towards very promising results on more complex aggregations. With the latest release of QuestDB, we are extending this level of performance to key-based aggregations.
To do this, we implemented Google’s fast hash table aka “Swisstable”  which can be found in the Abseil library . In all modesty, we also found room to slightly accelerate it for our use case. Our version of Swisstable is dubbed “rosti”, after the traditional Swiss dish . There were also a number of improvements thanks to techniques suggested by the community such as prefetch (which interestingly turned out to have no effect in the map code itself) . Besides C++, we used our very own queue system written in Java to parallelise the execution .
The results are remarkable: millisecond latency on keyed aggregations that span over billions of rows.
We thought it could be a good occasion to show our progress by making this latest release available to try online with a pre-loaded dataset. It runs on an AWS instance using 23 threads. The data is stored on disk and includes a 1.6billion row NYC taxi dataset, 10 years of weather data with around 30-minute resolution and weekly gas prices over the last decade. The instance is located in London, so folks outside of Europe may experience different network latencies. The server-side time is reported as “Execute”.
We provide sample queries to get started, but you are encouraged to modify them. However, please be aware that not every type of query is fast yet. Some are still running under an old single-threaded model. If you find one of these, you’ll know: it will take minutes instead of milliseconds. But bear with us, this is just a matter of time before we make these instantaneous as well. Next in our crosshairs is time-bucket aggregations using the SAMPLE BY clause.
If you are interested in checking out how we did this, our code is available open-source . We look forward to receiving your feedback on our work so far. Even better, we would love to hear more ideas to further improve performance. Even after decades in high performance computing, we are still learning something new every day.
Also, PrestoDB is known to be manageable at scale, for example as part AWS Athena. What are you thoughts on building out an Athena-like service using QuestDB?
what BI tools currently can connect to QuestDB for ad-hoc reporting (without need to write SQL queries by hands)? I guess popular solutions like Tableau/PowerBI doesn't support QuestDB.
Is there a demand for BI tools that can use QuestDB as a data source? I'm asking about that because we can add a connector for QuestDB to our BI tool (SeekTable) with a little effort, but this has sense only if someone really will use it.
- Could you share the specs of the hardware powering the demo?
- For IoT-type usage, one common access pattern is "get the readings of sensor X from device Y in the given time frame". But unfortunately, the data set in the demo doesn't include something like "cab_id" column that would allow me to simulate such a query. Is there a demo that has such a column?
- Does QuestDB require rows to be inserted strictly in order, or is there some leeway?
- Are you using any sort of compression? Timeseries columns are extremely compressible, and a good compression algorithm can greatly reduce IO wait times. We're using the Postgres-based TimescaleDb at work, and its compression capabilities have been a game changer .
- Do you plan on supporting GROUP BY clauses?
- This is a c5.metal instance (AWS) with 48 physical cores and 192 GB of memory. We use only 24 of those 48 cores.
- If I understand your example well, the query "select * from trips where trip_type = 'Dispatch' and pickup_datetime = '2018-03' and vendor_id = 'VTS';" should be similar. There is no limit in how many filters you can specify in the where clause, it will also have a very minor (if any) impact on the performance.
- Currently yes but this is something we are working on. Out of order insertions is one of our priority and you can expect this to be implemented in a very near future.
- Currently no. However, the way data is stored means that columns are extremely compressible as you mentioned. Our internal API (how we store, access/retrieve columns) would make it easy for us to implement compression, we will do it if this is something important for our users. In the meantime, a filesystem like BTRFS or ZFS would do a nice job (if disk usage is your concern).
- GROUP BY is automatically inferred, users don't need to specify it explicitly, you can find more details about this here: https://questdb.io/docs/sqlExtensions#important-differences-...
EDIT: fix typos
I'm curios why the following query doesn't finish in tens of seconds:
select sum(passenger_count) from trips where passenger_count=1
- single node database, not [yet] distributed
- primary focus is time-series data, specifically in-order time series data (the `designated timestamp` extension)
- physical data layout is an append-only column store
- Implements a small subset of SQL with some affordances for time series (LATEST BY, SAMPLE BY).
- Doesn't support explicit GROUP BY or HAVING clauses.
Instead, questdb implicitly assumes GROUP BY or HAVING based on presence of aggregation functions in the select clause.
- Small standard library of functions: only 4 text functions.
Based on these characteristics it seems the quest db is well positioned against Influx. It's probably faster than Timescale DB but significantly less flexible given that Timescale has all of Postgres behind it. Quest DB might eventually compete with clickhouse but it's long ways out given that it's not distributed and implements a much smaller subset of SQL.
I'd love to get any insight into join performance. Quite a few column stores handle large joins poorly (clickhouse, druid).
Regarding the theme of joins, I would just add that QuestDB also has a concept called SYMBOL which allows to store repetitive strings into a table as an integer in the background, while letting the user manipulate strings. This means that instead of storing something like 'sensor_id' and then looking up the corresponding 'sensor_name' using a join, you can manipulate 'sensor_name' directly and skip using joins.
You can try asof join:
SELECT pickup_datetime, cab_type, trip_type, tempF, skyCover, windSpeed FROM trips ASOF JOIN weather;
This joins 1.6B to 130K roughly.
Generally we provide both row and column based access despite being column store. So anything a traditional database does we can do too.
Does quest DB use the same physical data layout for the row based access? My understanding was providing an entire row of data from a columnar file is expensive because you have to assemble the row from num_columns different slices of the columnar file.
Shame generate_series() isn't a valid function here - joining to an artificial table is a classic way to break the pre-aggregated values, and force a re-calc.
Redshift does something similar IIRC.
In the end, I'm probably going to give up on trying new db ideas unless their pitch absolutely blows me away. I've gotten much better ROI just doubling down on optimizing our code to use these popular offerings (which çontinue to add features every day) to our use cases and just throw a bit more money at it.
For example, we actually query a 115 billion row, 1.5 TB table in snowflake and barring some cluster warmup time, queries that return less than a million rows run in a couple hundred ms max. Until the day we need this to speed up to just a few ms, I'll probably stick here.
Couple comments / questions:
- Correct that there’s no GROUP BY support?
- EXPLAIN or similar would be nice, both to get a peek at how your engine works & to anticipate whether a query is using parallel / SIMD execution and will take milliseconds vs. minutes.
EXPLAIN would be quite boring in our case, we use a myriad of methods to make query fast. We would love to keep up the performance so that no one ever needed to be interested in query plan.
I can't tell from the output whether there is an implicit "group by" going on over the non-aggregated columns, but eg. this query produced some results without any errors:
select windDir, avg(windSpeed) from weather
SELECT * FROM trips
JOIN gas_prices ON gas_prices.timestamp = trips.pickup_datetime;
Something about how timestamp comparisons are handled?
This takes ~86 seconds. Ran it multiple times.
SIMD is one of the ingredients for better query performance, but NOT THE ONLY ONE. See this for more info: https://www.youtube.com/watch?v=xJd8M-fbMI0
select cab_type, payment_type, count() from trips;
The data itself is memory mapped. Columns are kept as primitives so that they take as much memory as their unit size times rows.
select count(*) from trips where fare_amount > 0
Presumably this is Hacker News load?
WHERE queries are slow indeed. This is not because of the load but rather because WHERE clauses are not yet implemented under the fast path. Currently they run under an old model which is single-threaded which is why they are painfully slow.
We will make these as fast as the rest soon.
But some of the more top level queries are quite fast.
select * from trips where pickup_latitude < -74 order by pickup_longitude desc limit 10;
Databases can be daunting but for a variety of reasons. In my experience, it's not so much because of the sheer amount of features and function that need to be implemented. It's because you have to make sure that the every component is fast and reliable. Ultimately, the software can only be as fast as the slowest part of the stack. So you are going to spend a lot of time hunting for the slow part and looking for ways to accelerate it.
More generally, the way I approached this was to break down the project in small pieces, making each as fast as can be. This makes the problem more approachable. Also, you have to know and love how hardware works.
Asked a different way: If I were a lead of IT, director-level executive, CTO, etc. how and why should I begin to evaluate the QuestDB service / solution?
Thank you for sharing, bluestreak and team, very interesting and exciting work!
The main value proposition is raw performance which translates into hardware efficiency and ultimately, costs at scale.
To get a rough idea of how we compare, you could try to running `select cab_type, count() from trips` and `select passenger_count, avg(total_amount) from trips`. They are the equivalent of queries (1) and (2) of this benchmark . In this benchmark, BigQuery took 2 seconds for each query.
Our server runs both in hundreds of milliseconds. And this is actually slow because this is the same server which is currently being hammered by HN users. Also, we are scanning more rows. Our dataset contains 1.6 billion rows while the benchmark attached runs on 1.1 billion. Lastly, we only use one 24-core CPU from one server while the top of this benchmark is clusters or GPUs.
Of course this is only approximation but I hope to be given the chance to feature in this benchmark when we are ready.
1. "Query 1.6B rows in milliseconds, live" is just like "sum 1.6B numbers from memory in ms".
In fact, if not full SQL functionalities supported, a naive SQL query is just some tight loop on top of arrays(as partitions for naive data parallelism) and multi-core processors.
So, this kind is just several-line benchmark(assumed to ignore the data preparing and threading wrapping) to see how much time the sum loop can finish.
In fact again, this is just a naive memory bandwidth bench code.
Let's count: now the 6-channel xeon-sp can provide ~120GB/s bandwidth. Then sum loop with 1.6B 4-byte ints without compression in such processors' memory could be finished about ~1.6*4/120 ~= 50ms.
Then, if you find that you get 200ms in xxx db, you in fact has wasted 75% time(150ms) in other things than your own brew a small c program for such toy analysis.
2. Some readers like to see comparisons to ClickHouse(referred as CH below).
The fact is that, CH is a little slow for such naive cases here(seen at web been pointed by guys).
This is because CH is a real world product. All optimizations here are ten- year research and usage in database industry and all included in CH and much much more.
Can you hold such statement in the title when you enable reading from persistent disk? or when doing a high-cardinality aggregation in the query(image that low-cardinality aggregation is like as a tight loop + hash table in L2)?
1. We do plan to support full SQL functionalities, we have a pretty good subset already . I think that what we do is more than "naive memory bandwidth bench code", however I am happy to listen and when the time is right implement functionalities/features that you think we are missing.
2. "Can you hold such statement in the title when you enable reading from persistent disk?" We already persist to disk, the numbers you see imply reading from disk. We do this by using memory mapped files .
Storage is an AWS EBS volume.
We do not implement 100% of the protocol yet, but the plan it to do so or at least implement all the features that people need.
I'm working on a...spreadsheet thing or something that requires PostgREST and it'd be really nice to simply direct the PSQL_URI configuration underneath the hood to whatever database has my data. Then there's no need for ETLing back and forth.
I was really liking the product up until that point.
As stated some comments below, the code literally says ‘disable back button’
select * from trips; /* completed quickly */
select * from trips where tip_amount = 0.56; /* stalled */