Gonna throw this out there: the main thing I'm looking for from an embedded DB is better on-disk compression. I've been toying with RocksDB, but it's hard to tune optimally & it's really too low-level for my needs.
 > ipython
import numpy as np
# query timings (seconds) for the same 13 queries on each engine
duckdb = [0.855, 0.179, 0.151, 0.197, 0.086, 0.319, 0.236, 0.351, 0.276, 0.194, 0.086, 0.137, 0.377]
sqlite = [8.441, 1.758, 0.384, 1.965, 1.294, 2.677, 4.126, 1.238, 1.080, 5.887, 1.194, 0.453, 1.272]
# how many times slower SQLite is at the 10th and 90th percentiles
print((np.quantile(sqlite, q=[0.1, 0.9]) / np.quantile(duckdb, q=[0.1, 0.9])).round())
DuckDB is a columnar database, so you're going to see a throughput increase for queries that only use a handful of columns.
I'm manually doing zlib compression for large text columns when there's an obvious opportunity, basically DIY TOAST. Doing that took one SQLite DB from about 205 GiB down to 35 GiB. I haven't really felt any performance impact when working with the data, but I definitely feel the coding overhead. And there's still so much missed opportunity for compression.
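For the curious, the DIY TOAST is just a pair of wrappers around writes and reads. A minimal sketch of the idea (table and helper names made up):

    import sqlite3
    import zlib

    con = sqlite3.connect("example.db")  # hypothetical DB
    con.execute("CREATE TABLE IF NOT EXISTS docs (id INTEGER PRIMARY KEY, body BLOB)")

    def put_doc(doc_id, text):
        # compress large text before storing; level 6 is zlib's default speed/ratio trade-off
        blob = zlib.compress(text.encode("utf-8"), 6)
        con.execute("INSERT OR REPLACE INTO docs VALUES (?, ?)", (doc_id, blob))

    def get_doc(doc_id):
        (blob,) = con.execute("SELECT body FROM docs WHERE id = ?", (doc_id,)).fetchone()
        return zlib.decompress(blob).decode("utf-8")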
My largest RocksDB grows by over 1 GiB/day (poorly tuned, with zlib compression). I just couldn't use SQLite for that one: lots of small rows, but they compress extremely well. I never wrapped up the compression experiments on it, but looking at some rough notes, snappy was 430G and lz4 level 6 was 86G. Unfortunately, using RocksDB has made coding more difficult.
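(For reference, swapping compressors is at least cheap to experiment with. A rough sketch with the python-rocksdb bindings; the exact option and enum names are from memory and may vary by version:)

    import rocksdb

    opts = rocksdb.Options()
    opts.create_if_missing = True
    # zlib was my poorly tuned default; lz4/zstd are worth benchmarking
    opts.compression = rocksdb.CompressionType.lz4_compression
    db = rocksdb.DB("events.db", opts)  # hypothetical path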
I think one day I'm just going to snap and build a ZipFS-like extension; until then I'm just trying to keep an eye out, and putting out this call for help. :3
As the implementation is block-based, it is also faster than the naive approach of just zipping your data files.
That said, for a specific case like this I think this could be really compelling, especially going from compressed source material to a Postgres instance running on a loopback block device with btrfs on top; that could be amazing for text analysis like you said. You could actually see performance improvements from this: PCIe 4.0 gives you about 4 GB/s, and with massive core counts and large L3 caches you could probably boost that to 20 GB/s after decompression.
If you can use compression algorithms that allow regex search in the compressed domain, you could see an effective search speed of tens to hundreds of GB/s.
I am going to have to play with btrfs again.
The blog post contains a few examples of how this can be used, for example to partially query Parquet files over the network.
E.g. just visit shell.duckdb.org and enter:
select * from 'https://shell.duckdb.org/data/tpch/0_01/parquet/orders.parqu...' limit 10;
If so, I have another cool community to follow!
Like, my workflow is either I query an existing remote corporate DB and do my initial data munging there, or I get given a data dump that I either work on directly in Pandas or add to a local DB and do a little more cleaning there. It's not at all clear how DuckDB would help.
Our tagline is “SQLite for analytics”, as DuckDB is an in-process database system similar to SQLite that is geared towards these types of workloads.
DuckDB has a flexible query engine, and also has support for directly running SQL queries (in parallel!) on top of Pandas and Parquet without requiring the data to be imported into the system.
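A minimal sketch of what that looks like from Python (the DataFrame here is made up):

    import duckdb
    import pandas as pd

    df = pd.DataFrame({"city": ["NYC", "NYC", "LA"], "sales": [10, 20, 5]})
    # DuckDB scans the local DataFrame by name; no copy or import step
    print(duckdb.query("SELECT city, sum(sales) AS total FROM df GROUP BY city").df())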
What's stopping the other leading brands from implementing columnar storage, queries, and such with a COLUMN MAJOR table attribute?
And we may not want to upload the data into Postgres or another database. We can just work with Parquet files and run in-process queries.
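For example, a sketch of an in-process aggregation over Parquet (file name is a placeholder):

    import duckdb
    # only the columns the query touches are read from the Parquet file
    rel = duckdb.query("SELECT user_id, count(*) AS n FROM 'events.parquet' GROUP BY user_id")
    print(rel.df())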
DuckDB is often faster than Pandas, and it can handle larger-than-memory data. Plus, if you already know SQL, you don't have to become a Pandas expert to be productive in Python data munging. Pandas is still good, but now you can mix and match with SQL!
DuckDB is to Clickhouse, TimescaleDB, Redshift, etc as SQLite is to Postgres, MySQL, SQL Server.
Columnar technology in SQL Server only became usable in SQL 2016 (it existed in 2012 but was too restrictive — I know because I tried to use it).
In 2016 and above, you can either create a columnar index (non clustered column store) or convert an entire table into a columnar table (clustered column store). The technology is actually pretty impressive and I’ve used it in production where I have row stores coexisting with column stores within the same database.
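Roughly, the two flavors look like this (sketched via pyodbc; the connection, table, and index names are made up):

    import pyodbc

    con = pyodbc.connect("DSN=mydsn")  # hypothetical connection
    cur = con.cursor()
    # columnar index coexisting with the row store:
    cur.execute("CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_sales ON dbo.Sales (Qty, Price, SoldAt)")
    # or convert the entire table to columnar storage:
    cur.execute("CREATE CLUSTERED COLUMNSTORE INDEX cci_history ON dbo.SalesHistory")
    con.commit()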
duckdb will be very useful here, instead of having to use Redshift Spectrum or whatever.
Now I've reduced it to a previously answered question: what's the use case for SQLite? ;)
It's targeting use cases where you want to push analytical computation away from servers into the client (browser).
Let me sketch two examples:
A) You have your data sitting in S3 and the user-specific data is in a browser-manageable area.
(E.g. this paper from Tableau research actually states dataset sizes that should fall into that category: https://research.tableau.com/sites/default/files/get_real.pd...)
In that scenario, you could eliminate your central server if your clients are smart enough.
B) You are talking about larger dataset sizes (GB) and want to explore them ad-hoc in your browser.
Uploading them is unrealistic and installing additional software is no longer ad-hoc enough.
Parquet files are intended to store data from a single table, and they are intended to be write-once: you write the file and then never change it again. If you want to change anything in a Parquet file, you rewrite the whole file.
DuckDB's storage format is intended to store an entire database (multiple tables, views, sequences, etc), and is intended to support ACID operations on those structures, such as insertions, updates, deletes, and even altering tables in the form of adding/removing columns or altering types of columns without rewriting the entire table or the entire file.
Tables are partitioned into row groups much like Parquet, but unlike Parquet the individual columns of those row groups are divided into fixed-size blocks so that individual columns can be fetched from disk. The fixed-size blocks ensure that the file will not suffer from fragmentation as the database is modified.
The storage is still a work in progress, and we are currently actively working on adding more support for compression and other goodies, as well as stabilizing the storage format so that we can maintain backwards compatibility between versions.
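In code, the practical difference is that a DuckDB database file supports in-place DML and DDL that would force a full rewrite of a Parquet file. A small sketch:

    import duckdb

    con = duckdb.connect("demo.duckdb")  # one file holds the whole database
    con.execute("CREATE TABLE t AS SELECT * FROM range(10) r(i)")
    con.execute("UPDATE t SET i = i + 1 WHERE i = 5")     # transactional, in place
    con.execute("ALTER TABLE t ADD COLUMN note VARCHAR")  # no full-table rewrite
    con.execute("DELETE FROM t WHERE i > 8")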
I’ve been thinking of using absurd-sql for it since I saw https://news.ycombinator.com/item?id=28156831 last week
Here's a 1,000,000-row example I just threw together for you, and a CSV example; you can try yours right now.
Querying CSV files is particularly painful over the network since we still have to read everything for a full scan.
With Parquet, you would at least only have to read the columns of group by keys and aggregate arguments.
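A sketch of the Parquet side from Python (placeholder URL; the httpfs extension does HTTP range reads under the hood):

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs;")
    con.execute("LOAD httpfs;")
    # only the group-by key and the aggregate argument get fetched, not the whole file
    print(con.execute("""
        SELECT l_returnflag, avg(l_extendedprice)
        FROM 'https://example.com/lineitem.parquet'
        GROUP BY l_returnflag
    """).fetchall())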
Try it out and share your experiences with us!
Otherwise, it's great, but obviously still a wip.
For those wondering: I have a helper function for SOQL queries for Salesforce that follows the object.field structure.
Referring to tablealias.[column.name], or using quotes instead of brackets, was a no-go.
I've got a dataset that's probably not suitable for loading into an in-memory table (close to 1,000M rows of CSV).
I split it into 20M-row chunks, read them one by one into a DuckDB temporary table, and exported each as Parquet.
SELECT using a glob (prefix.*.parquet) where mycolumn = foobar does work, but it could be a bit faster. Apart from sorting the input CSVs before converting to Parquet, what can be done?
The CSV chunks were already sorted.
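Roughly what I'm doing, for context (compressed into a direct CSV-to-Parquet COPY; paths and the column name are illustrative):

    import duckdb

    con = duckdb.connect()
    # one Parquet file per 20M-row CSV chunk, sorted so row-group min/max stats stay tight
    for i in range(50):
        con.execute(f"""
            COPY (SELECT * FROM read_csv_auto('chunk_{i:03d}.csv') ORDER BY mycolumn)
            TO 'chunk_{i:03d}.parquet' (FORMAT PARQUET)
        """)
    # the glob scan can then skip row groups whose min/max excludes the predicate
    print(con.execute(
        "SELECT count(*) FROM 'chunk_*.parquet' WHERE mycolumn = 'foobar'").fetchone())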
Regarding multi-tab usage: not today. The available filesystem APIs make it difficult to implement this right now. We're looking into ways to make DuckDB-Wasm persistent, but we can only read in this release.
The size probably makes it prohibitive to use DuckDB when your dataset is small and download size matters, but we hope that future improvements in WebAssembly can get the size down.
We faced many of them ourselves but paused any further work on an IndexedDB backend due to the lack of synchronous IndexedDB APIs (e.g. check the warning here: https://developer.mozilla.org/en-US/docs/Web/API/IDBDatabase...).
He bypasses this issue using SharedArrayBuffers, which would lock DuckDB-Wasm to cross-origin-isolated sites. (See the "Multithreading" section in our blog post.)
We might be able to lift this limitation in the future but this has some far-reaching implications affecting the query execution of DuckDB itself.
To the best of my knowledge, there's just no way right now to do synchronous persistence efficiently that won't lock you to a single browser or to cross-origin isolation. But this will be part of our ongoing research.