DuckDB-Wasm: Efficient analytical SQL in the browser (duckdb.org)
213 points by ankoh 82 days ago | 58 comments



Cool! This is the first time I'm hearing about DuckDB, which is exciting as I heavily use SQLite. These benchmarks show it's 6-15 times faster than sql.js (SQLite) [1], and another small benchmark I found points the same way [2]. I usually just slather on indexes in SQLite though, so indexed queries may not stand up as well; and it might not be as fast on storage, since this is comparing in-memory performance (I think?), but I'll give it a spin!

Gonna throw this out there: the main thing I'm looking for from an embedded DB is better on-disk compression; I've been toying with RocksDB, but it's hard to tune optimally and it's really too low-level for my needs.

[1] > ipython

    import numpy as np
    # Per-query runtimes in seconds for the same benchmark queries
    duckdb = [0.855, 0.179, 0.151, 0.197, 0.086, 0.319, 0.236, 0.351, 0.276, 0.194, 0.086, 0.137, 0.377]
    sqlite = [8.441, 1.758, 0.384, 1.965, 1.294, 2.677, 4.126, 1.238, 1.080, 5.887, 1.194, 0.453, 1.272]
    # Ratio of the 10th and 90th percentile runtimes -> roughly 6x and 15x
    print((np.quantile(sqlite, q=[0.1, 0.9]) / np.quantile(duckdb, q=[0.1, 0.9])).round())
[2] https://uwekorn.com/2019/10/19/taking-duckdb-for-a-spin.html


Do you need compression to get more bandwidth or are you trying to save money on storage costs?

DuckDB is a column-based DB, so you are going to see a throughput increase for queries that only touch a handful of columns.


Saving money, definitely, but also delaying having to scale the hardware as long as possible. I'm at 3.5 TiB NVMe (RAID 1), and it would cost another $52 USD/mo to add an additional 1.5 TiB NVMe RAID at Hetzner, not cool. Generally I'm seeing around 6:1 compression, so going from 3.5 TiB to an effective 21 TiB is a big deal for me.

I'm manually doing zlib compression for large text columns when there's an obvious opportunity, basically DIY TOAST [1]. Doing that allowed one SQLite DB to go from about 205 GiB to 35 GiB. And I haven't really felt any performance impact when working with the data, but I definitely feel the coding overhead. And there's still so much missed opportunity for compression.
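Roughly what that hack looks like, as a minimal sketch in Python's sqlite3 with a made-up table and column just for illustration:

    import sqlite3
    import zlib

    con = sqlite3.connect("docs.db")  # hypothetical database
    con.execute("CREATE TABLE IF NOT EXISTS docs (id INTEGER PRIMARY KEY, body BLOB)")

    def put(doc_id, text):
        # Compress the big text column by hand before inserting (DIY TOAST).
        blob = zlib.compress(text.encode("utf-8"), level=6)
        con.execute("INSERT OR REPLACE INTO docs VALUES (?, ?)", (doc_id, blob))

    def get(doc_id):
        # Decompress on the way out; callers never see the zlib blob.
        (blob,) = con.execute("SELECT body FROM docs WHERE id = ?", (doc_id,)).fetchone()
        return zlib.decompress(blob).decode("utf-8")

    put(1, "some large text payload " * 1000)
    print(len(get(1)))
    con.commit()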

My largest RocksDB grows at over 1 GiB/day (poorly tuned, with zlib compression). I just couldn't use SQLite for that one: lots of small rows, but they compress extremely well. I never wrapped up the compression experiments on that, but looking at some rough notes, snappy was 430G and lz4 level 6 was 86G. Unfortunately using RocksDB has made coding more difficult.

I think one day I'm just going to snap and build a ZipVFS-like extension [2]; until then I'm just trying to keep an eye out, and putting out this call for help. :3

[1] https://www.postgresql.org/docs/9.5/storage-toast.html

[2] https://www.sqlite.org/zipvfs/doc/trunk/www/index.wiki


For my bachelor thesis, I used Postgres on a compressed btrfs partition. For my text-heavy dataset, this gave excellent results without compromising on ergonomics.

As the implementation is block-based it is also faster than the naive approach of just zipping your data files.


That is really cool. I tried btrfs a couple of times and hit some weirdness (sorry for being unspecific, I'm not trying to do a drive-by hit job on btrfs), but I moved on.

That said, for a specific case like this I think it could be really compelling, especially going from compressed source material to a Postgres instance running on a loopback block device with btrfs on top; that could be amazing for text analysis like you said. You could actually see performance improvements from this: 4 GB/s for PCIe 4, and then, thanks to massive core counts and large L3 caches, you could probably boost that to 20 GB/s after decompression.

If you can use compression algorithms that allow for regex search in the compressed domain, it could give an effective search speed of tens to hundreds of GB/s.

I am going to have to play with btrfs again.


I believe compression is still a work in progress in DuckDB. My current use is with Parquet files, which compress well but don't lend themselves to updates like a DB does.


DuckDB is one of my favorite projects I've stumbled on recently. I've had multiple use cases pop up where I wanted to do some pandas-type work but something SQLite-like was a better fit, so it's really come in handy for me.


There was a neat post (https://news.ycombinator.com/item?id=27016630) a while ago about using SQLite on static pages with large datasets that wouldn't have to be loaded entirely. Does DuckDB do something similar with Arrow/Parquet files or its own format?


Yes we do! DuckDB-Wasm can read files using HTTP range requests, very similar to phiresky's sql.js-httpvfs.

The blog post contains a few examples of how this can be used, for example to partially query Parquet files over the network.

E.g. just visit shell.duckdb.org and enter:

select * from 'https://shell.duckdb.org/data/tpch/0_01/parquet/orders.parqu...' limit 10;


NIIIICE! Data twitter was pretty excited about that cool SQLite trick - now you can turn it up a notch!


Is data twitter == #datatwitter, like Econ Twitter is #econtwitter?

If so, I have another cool community to follow!


It would be really cool to load DuckDB files too. sql.js-httpvfs seems convenient because it works on everything in the database, so you don't have to create indexes or set up keys and constraints in the client.


I agree! DuckDB-Wasm can already open DuckDB database files in the browser the very same way.


That's really neat! Can you control the cache too?


DuckDB-Wasm uses a traditional buffer manager and evicts pages using a combination of FIFO + LRU (to distinguish sequential scans from hot pages like the Parquet metadata).
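To illustrate the general idea (this is just a toy two-queue policy, not the actual DuckDB-Wasm buffer manager): pages seen once sit in a FIFO probation queue and are evicted first, while pages touched again graduate to an LRU queue, so a one-off sequential scan cannot push out hot pages.

    from collections import OrderedDict

    class TwoQueuePool:
        def __init__(self, capacity):
            self.capacity = capacity
            self.fifo = OrderedDict()  # pages seen once, evicted first, in arrival order
            self.lru = OrderedDict()   # pages seen repeatedly, evicted least-recently-used

        def access(self, page_id):
            if page_id in self.lru:
                self.lru.move_to_end(page_id)   # refresh recency
            elif page_id in self.fifo:
                del self.fifo[page_id]          # second touch: promote to the hot queue
                self.lru[page_id] = True
            else:
                self.fifo[page_id] = True       # first touch: probation
            while len(self.fifo) + len(self.lru) > self.capacity:
                victim = self.fifo if self.fifo else self.lru
                victim.popitem(last=False)      # evict the oldest FIFO page first

    pool = TwoQueuePool(capacity=3)
    for page in [1, 2, 1, 3, 4, 5]:  # page 1 stays hot while the scan churns through 2..5
        pool.access(page)
    print(list(pool.fifo), list(pool.lru))  # -> [4, 5] [1]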


I'm still not sure I "get" the use case for DuckDB. From what I understand, it's a nifty in-memory SQL database, but why is that better than just running Postgres or Microsoft SQL Server locally, where your data structures and tables and stuff have a lot more permanence?

Like, my workflow is that I either query an existing remote corporate DB and do my initial data munging there, or get given a data dump that I either work on directly in Pandas or add to a local DB and do a little more cleaning there. It's not at all clear how DuckDB would help.


DuckDB developer here. DuckDB is a regular RDBMS that has persistent ACID storage, but is tuned towards analytical workloads, i.e. read-heavy workloads with aggregates that require full scans of the data. Any data you write to tables is stored persistently on disk, and not all your data needs to fit in memory either.

Our tagline is “SQLite for analytics”, as DuckDB is an in-process database system similar to SQLite that is geared towards these types of workloads.

DuckDB has a flexible query engine, and also has support for directly running SQL queries (in parallel!) on top of Pandas [1] and Parquet [2] without requiring the data to be imported into the system.

[1] https://duckdb.org/2021/05/14/sql-on-pandas.html

[2] https://duckdb.org/2021/06/25/querying-parquet.html
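A minimal sketch of that last point in the Python client (the DataFrame and the Parquet path below are just placeholders): SQL runs directly against a Pandas DataFrame by variable name, and Parquet files can be scanned in place without an import step.

    import duckdb
    import pandas as pd

    # A Pandas DataFrame in memory; DuckDB can query it by its variable name.
    sales = pd.DataFrame({"region": ["eu", "us", "eu"], "amount": [10.0, 20.0, 5.0]})
    print(duckdb.query("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").to_df())

    # Parquet files are scanned in place ('events.parquet' is a placeholder path).
    print(duckdb.query("SELECT COUNT(*) FROM 'events.parquet'").to_df())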


Maybe this is a silly question: Why is the A/B choice between a row-major database and a column-major database, instead of between row-major tables and column-major tables within a flexible database?

What's stopping the other leading brands from implementing columnar storage, queries, and such with a COLUMN MAJOR table attribute?


Some databases do offer both, but it is much more involved than just changing the storage model. The entire query execution model needs to adapt to columnar execution. You can simulate a column store model in a row-store database by splitting a table into a series of single-column tables, but the performance benefits you will capture are much smaller than a system that is designed and optimized for column store execution.


SQL calculations on columnar data are quite different from those in row-based databases, so it's effectively a different database engine. A columnar data store can be exploited in several ways because it usually employs a form of dictionary (vocabulary) compression. For instance, obtaining the distinct values of a field in a columnar DB is much faster because it's typically just the dictionary of the field, so it doesn't even require a full table scan. Many other columnar computations such as filtering or aggregation can be done on compressed data without decompression.
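A toy illustration of why that is (not any particular engine's format): with dictionary encoding, the values are stored once and each row is just a small integer code, so "distinct values" is simply the dictionary itself.

    # Dictionary-encoded column: values stored once, rows are just indexes into it.
    dictionary = ["red", "green", "blue"]
    codes = [0, 2, 2, 1, 0, 0, 2]  # one small integer per row

    distinct_values = dictionary  # no scan over the row codes needed
    blue_rows = [i for i, c in enumerate(codes) if c == dictionary.index("blue")]
    print(distinct_values, blue_rows)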


I work heavily with pandas and dask (when you want to use multiple cores), using Parquet files for storage. We see a lot of benefit in selectively bringing DuckDB into the mix. For instance, joins are extremely slow with both pandas and dask and require a lot of memory. That's a situation where using DuckDB reduces the memory needs and speeds things up a lot.

And we may not want to upload the data into postgres or another database. We can just work with parquet files and run in-process queries.
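A rough sketch of that pattern (the file names and columns here are made up): DuckDB joins and aggregates the Parquet files directly instead of materializing both tables in memory the way a pandas merge would.

    import duckdb

    con = duckdb.connect()  # in-memory connection; the Parquet files stay on disk
    result = con.execute("""
        SELECT o.customer_id, SUM(o.amount) AS total
        FROM 'orders.parquet' o
        JOIN 'customers.parquet' c ON o.customer_id = c.customer_id
        WHERE c.country = 'NL'
        GROUP BY o.customer_id
    """).fetchdf()
    print(result.head())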


Check out this post for some comparisons with Pandas.

https://duckdb.org/2021/05/14/sql-on-pandas.html

DuckDB is often faster than Pandas, and it can handle larger than memory data. Plus, if you already know SQL, you don't have to become a Pandas expert to be productive in Python data munging. Pandas is still good, but now you can mix and match with SQL!


DuckDB is columnar, so in theory a lot faster than Postgres or SQL Server for analytical workloads.

DuckDB is to Clickhouse, TimescaleDB, Redshift, etc as SQLite is to Postgres, MySQL, SQL Server.


Where do you get the idea that SQL Server does not support columnar storage? That's a wrong claim.


I don’t think the OP said that SQL Server doesn’t support columnar, only that by analogy SQL Server is primarily a row store (which for most of its history was true).

Columnar technology in SQL Server only became usable in SQL 2016 (it existed in 2012 but was too restrictive — I know because I tried to use it).

In 2016 and above, you can either create a columnar index (non clustered column store) or convert an entire table into a columnar table (clustered column store). The technology is actually pretty impressive and I’ve used it in production where I have row stores coexisting with column stores within the same database.


Quite an interesting read on how TimescaleDB turns a row store (Postgres) into a column store.


The simple answer no one else seems to have mentioned: SQLite has quite a low limit on the number of columns it supports, which is a problem for data analytics, which often prefers wide over long.


Not just in-memory. It's pretty convenient if you have a set of Parquet files with a common schema. Fairly snappy, and the data doesn't have to fit in memory.


I'm using duckdb for querying parquet files as well. It's an awesome tool, so nice to just "look into" parquet files with SQL.


Many enterprises are coming up with patterns where they replicate the data from the database (say Redshift) into Parquet files (a data lake?) and direct more traffic, including analytical workloads, onto the Parquet files.

DuckDB will be very useful here, instead of having to use Redshift Spectrum or whatever.


To me, the use case is really obvious: when you reached for SQLite but now want something with Moar Powah.

Now I've reduced it to a previously answered question: what's the use case for SQLite? ;) https://www.sqlite.org/whentouse.html

That being said, I don't see the point, and shudder at the idea of a web page's javascript doing anything which needs noticeable amounts of the CPU, but I'm a non-standard user...


I think local CPU is underutilized in these cloudy days. For many SaaS, the size of all your tenant data minus media is quite small. Most cloud apps have terrible experience with large tables, searching, and general perf. We trade this away for collaboration features.


I don't fully get the use case either, but it's in a different category than Postgres or Microsoft SQL because it runs in the browser and can be made part of your web app.


DuckDB-wasm is targeting the browser so it's not directly competing with Pandas (that's the job of native DuckDB).

It's targeting use cases where you want to push analytical computation away from servers into the client (browser).

Let me sketch two examples:

A) You have your data sitting in S3 and the user-specific data is in a browser-manageable area.

(E.g. this paper from Tableau research actually states dataset sizes that should fall into that category: https://research.tableau.com/sites/default/files/get_real.pd...)

In that scenario, you could eliminate your central server if your clients are smart enough.

B) You are talking about larger dataset sizes (GB) and want to explore them ad-hoc in your browser.

Uploading them is unrealistic and installing additional software is no longer ad-hoc enough.


Anyone have a good benchmark comparing DuckDB to Parquet/Avro/ORC, etc.? Super curious to see how some of those workflows might compare. Obviously at scale it's going to be different, but using a single Parquet file/dataset as a DB replacement isn't an uncommon thing in DS/ML work.


Why compare DuckDB to Parquet when you can use DuckDB and Parquet [1] :)

[1] https://duckdb.org/2021/06/25/querying-parquet.html


Does DuckDB also use a PAX-like format like Parquet? Without going into the code, the best I could find with a little google-fu is the HyPer/Data Blocks paper - is this a relevant read?


DuckDB's storage format has advantages similar to the Parquet storage format (e.g. individual columns can be read, partitions can be skipped, etc.), but it is different because DuckDB's format is designed to do more than Parquet files.

Parquet files are intended to store data from a single table and are intended to be write-once: you write the file and then never change it again. If you want to change anything in a Parquet file, you rewrite the file.

DuckDB's storage format is intended to store an entire database (multiple tables, views, sequences, etc), and is intended to support ACID operations on those structures, such as insertions, updates, deletes, and even altering tables in the form of adding/removing columns or altering types of columns without rewriting the entire table or the entire file.

Tables are partitioned into row groups much like Parquet, but unlike Parquet the individual columns of those row groups are divided into fixed-size blocks so that individual columns can be fetched from disk. The fixed-size blocks ensure that the file will not suffer from fragmentation as the database is modified.

The storage is still a work in progress, and we are currently actively working on adding more support for compression and other goodies, as well as stabilizing the storage format so that we can maintain backwards compatibility between versions.


I should have clarified that I meant table storage specifically, to gain intuition on what happens when data is copied between DuckDB and Arrow/Parquet. It was much faster to just look at the code bridging the two and backtrack from there. Thanks!


Interesting.. Would this be effective at loading a remote CSV file with a million rows, then performing basic GROUP BY COUNTs on it so I can render bar charts?

I’ve been thinking of using absurd-sql for it since I saw https://news.ycombinator.com/item?id=28156831 last week
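For reference, here's the kind of aggregation I mean, sketched with the native Python client (the URL and the category column are placeholders):

    import duckdb
    import urllib.request

    # Placeholder URL for a ~1M row CSV file.
    urllib.request.urlretrieve("https://example.com/data.csv", "data.csv")

    con = duckdb.connect()
    # GROUP BY counts straight off the CSV; these rows would feed the bar chart.
    rows = con.execute("""
        SELECT category, COUNT(*) AS n
        FROM read_csv_auto('data.csv')
        GROUP BY category
        ORDER BY n DESC
    """).fetchall()
    for category, n in rows:
        print(category, n)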


I contribute to https://perspective.finos.org/ , which supports all of this and quite a lot more.

Here's a 1,000,000-row example I just threw together for you:

https://bl.ocks.org/texodus/3802a8671fa77399c7842fd0deffe925

and a CSV example, so you can try yours right now:

https://bl.ocks.org/texodus/02d8fd10aef21b19d6165cf92e43e668


It depends.

Querying CSV files is particularly painful over the network since we still have to read everything for a full scan.

With Parquet, you would at least only have to read the columns of group by keys and aggregate arguments.

Try it out and share your experiences with us!


I got excited about DuckDB recently too. I used it yesterday for a new project at work and immediately ran into a not-implemented exception for my (awful) column naming structure, and discovered there is no pivot function.

Otherwise, it's great, but obviously still a wip.

For those wondering, I have a helper function for SOQL queries for Salesforce that follows the structure object.field.

Referring to a tablealias.[column.name], or using quotes instead of brackets, was a no-go.


Not really a DuckDB-Wasm question, but a DuckDB one:

I've got a dataset that's probably not suitable for loading into an in-memory table (a CSV with close to 1000M rows). I split it into 20M-row chunks, read them one by one into a DuckDB temporary table, and exported each as Parquet.

A SELECT using a glob (prefix.*.parquet) with WHERE mycolumn = foobar does work, but could be a bit faster. Apart from sorting the input CSVs before converting to Parquet, what can be done? The CSV chunks were already sorted.
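For context, roughly what I'm doing, sketched with the Python client (paths, chunk count, and the column name are placeholders; the temp-table step is simplified to a direct COPY):

    import duckdb

    con = duckdb.connect()

    # Each (already sorted) CSV chunk becomes its own Parquet file.
    for i in range(50):  # placeholder chunk count
        con.execute(f"""
            COPY (SELECT * FROM read_csv_auto('chunks/chunk_{i}.csv'))
            TO 'chunks/chunk_{i}.parquet' (FORMAT PARQUET)
        """)

    # The query over all chunks, using a glob and a filter on the sorted column.
    print(con.execute("""
        SELECT COUNT(*) FROM parquet_scan('chunks/chunk_*.parquet')
        WHERE mycolumn = 'foobar'
    """).fetchall())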


Similar(?): https://sql.js.org/ (SQLite in wasm)


Yes, but DuckDB is optimized for analytics (columnar data and vectorized computation). Take a look at the comparison at https://shell.duckdb.org/versus.


Does DuckDB support multi-tab usage? How big is the Wasm file that must be loaded?


The WebAssembly module is 1.6 - 1.8 MB brotli-compressed, depending on the Wasm feature set. We're currently investigating ways to reduce this to around 1 MB. We also use streaming instantiation, which means that the WebAssembly module is compiled while it is being downloaded. But still, it will hurt a bit more than a 40 KB library.

Regarding multi-tab usage: not today. The available filesystem APIs make it difficult to implement this right now. We're looking into ways to make DuckDB-Wasm persistent, but we can only read in this release.


On https://shell.duckdb.org/versus, we have a comparison with related libraries. The Wasm bundle is currently 1.8 MB, but it can be instantiated while it's streaming in.

The size probably makes it prohibitive to use DuckDB when your dataset is small and download size matters but we hope that future improvements in WebAssembly can get the size down.


How does this compare/relate to https://jlongster.com/future-sql-web (if at all)?


The author outlines many problems that you'll run into when implementing a persistent storage backend using the current browser APIs.

We faced many of them ourselves but paused any further work on an IndexedDB backend due to the lack of synchronous IndexedDB APIs (e.g. check the warning here: https://developer.mozilla.org/en-US/docs/Web/API/IDBDatabase...). He bypasses this issue using SharedArrayBuffers, which would lock DuckDB-Wasm to cross-origin-isolated sites. (See the "Multithreading" section in our blog post.)

We might be able to lift this limitation in the future but this has some far-reaching implications affecting the query execution of DuckDB itself.

To the best of my knowledge, there's just no way to do synchronous persistence efficiently right now that won't lock you to a single browser or to cross-origin isolation. But this will be part of our ongoing research.


This looks so cool. Is there a demo page pre-loaded with tables so people can try out queries right away?


If you head over to the shell demo, you can run a query like the one below!

https://shell.duckdb.org/

select * from 'https://shell.duckdb.org/data/tpch/0_01/parquet/orders.parqu...' limit 10;


Can I connect to DuckDB with a SQL IDE, e.g. DBeaver?


Yes! DuckDB (not WASM DuckDB) has a JDBC connector that works with DBeaver.


thank you


Awesome!


I wish it supported ORC.



