Hi, DuckDB devrel here. DuckDB is an analytical SQL database in the form factor of SQLite (i.e., in-process). This quadrant summarizes its space in the landscape:
It works as a replacement for, or a complement to, dataframe libraries thanks to its speed and (vertical) scalability. It's lightweight and dependency-free, so it also works well as part of data processing pipelines.
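As a rough illustration of the dataframe use case (the data here is made up), DuckDB's Python API can run SQL directly over an in-memory pandas DataFrame without copying it into a database first:

    import duckdb
    import pandas as pd

    # Hypothetical example data; DuckDB scans the DataFrame in place.
    df = pd.DataFrame({"city": ["Amsterdam", "Utrecht", "Amsterdam"],
                       "amount": [10, 20, 30]})

    result = duckdb.sql("SELECT city, SUM(amount) AS total FROM df GROUP BY city").df()
    print(result)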
Hello, I'd love to use this but I work with highly confidential data. How can we be sure our data isn't leaking with this new UI? What assurances are there on this, and can you comment on the scope of the MotherDuck server interactions?
I'm a co-author of the blog post. I agree that the wording was confusing – apologies for that. I added a note at the end:
> The repository does not contain the source code for the frontend, which is currently not available as open-source. Releasing it as open-source is under consideration.
I have observed the Makefile effect many times with LaTeX documents. Most researchers I worked with had a LaTeX file full of macros that they had been carrying from project to project for years. These were often inherited from more senior researchers and hammered into heavily modified forks of the article templates used in their field or the thesis templates used at their institution.
This is a great instance of the "Makefile effect", with a possible workaround: use Markdown and Pandoc where you can. It won't work in every situation, but you can often compose a basic Beamer presentation or LaTeX paper quickly using mostly plain TeX and the same Markdown syntax you already know from GitHub and Reddit.
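For instance (hypothetical file names; this just wraps the plain pandoc command-line call and needs a LaTeX installation), a Markdown file structured with headings can be turned into a Beamer PDF:

    import subprocess

    # Equivalent to running: pandoc slides.md -t beamer -o slides.pdf
    # slides.md is a hypothetical Markdown file; pandoc maps its heading
    # structure to Beamer sections and frames.
    subprocess.run(["pandoc", "slides.md", "-t", "beamer", "-o", "slides.pdf"], check=True)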
That won't solve any problem that LaTeX macros solve. Boilerplate in LaTeX has two purposes.
The first is to factor out frequently used, complex notation. To get that in Markdown, you'd need to bolt a macro preprocessor on top of it.
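For example, the kind of notation macros that get carried from project to project look something like this (a minimal sketch with made-up notation):

    \usepackage{amsmath,amssymb}
    % Notation macros carried across papers: change them once, update everywhere.
    \newcommand{\R}{\mathbb{R}}                          % the reals
    \newcommand{\norm}[1]{\left\lVert #1 \right\rVert}   % \norm{x} -> ||x||
    \newcommand{\E}[1]{\mathbb{E}\!\left[ #1 \right]}    % expectation
    % In the text: $f \colon \R^n \to \R$, $\E{\norm{X}^2}$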
The second is to fine-tune typography and layout details (tables are a big offender). That simply cannot be done in Markdown: a table is a table, and if you don't like the default style (which is inadequate most of the time), there is no recourse.
Yes, if you break the file into parts with GNU Parallel, you can easily beat DuckDB as I show in the blog post.
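For readers who'd rather not reach for GNU Parallel, here is a rough Python sketch of the same split-and-count idea (this is not the command from the post, and the file name is made up): chunk the file by byte ranges, count newlines per chunk on separate cores, and sum the results.

    import os
    from multiprocessing import Pool

    FILE = "measurements.txt"   # hypothetical input file
    CHUNK = 64 * 1024 * 1024    # 64 MiB per task

    def count_newlines(byte_range):
        start, length = byte_range
        with open(FILE, "rb") as f:
            f.seek(start)
            return f.read(length).count(b"\n")

    if __name__ == "__main__":
        size = os.path.getsize(FILE)
        ranges = [(off, min(CHUNK, size - off)) for off in range(0, size, CHUNK)]
        with Pool() as pool:
            total = sum(pool.map(count_newlines, ranges))
        print(total)  # matches `wc -l`, but uses all cores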
That said, I maintain that it's surprising that DuckDB outperforms wc (and grep) on many common setups, e.g., on a MacBook. This is not something many databases can do, and the ones that can typically don't run on a laptop.
Re the original analysis, my own opinion is that the outcome is only surprising when the critical detail that distinguishes the two is omitted. It seems quite unsurprising once it's rephrased to include that detail: "DuckDB, executed multi-threaded + parallelized, is 2.5x faster than wc, single-threaded, even though in doing so, DuckDB used 9.3x more CPU".
In fact, to me, the only surprising thing is how poorly DuckDB does compared to wc: 9.3x more CPU for only a 2.5x speedup.
But an interesting analysis regardless of the takeaways – thank you.
Hi – DuckDB Labs devrel here. It's great that you find DuckDB useful!
On the setup side, I agree that local (instance-attached) disks should be preferred, but does EBS incur an I/O fee? It certainly adds significant latency, but it doesn't have per-operation pricing:
> I/O is included in the price of the volumes, so you pay only for each GB of storage you provision.
Can't remember anymore, but it was either (a) the gp2 volumes were way too slow for the ops or (b) the IOPS charges made it too expensive. To be clear, I didn't do this with DuckDB; I was hosting a Postgres. I moved to Lightsail instead and was happy with it (you don't get instance-attached SSDs on EC2 until you go up to very large instance types).
DuckDB supports partial reading of Parquet files (also via HTTPS and S3) [1], so it can limit the scans to the required columns in the Parquet file. It can also perform filter pushdown, so querying data in S3 can be quite efficient.
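A minimal sketch of what that looks like from Python (the URL and column names are made up): only the byte ranges of the selected columns are fetched, and row groups whose statistics rule out the filter are skipped entirely.

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs; LOAD httpfs;")  # remote HTTP(S)/S3 access

    # Projection pushdown: only 'passenger_count' and 'tip_amount' are read.
    # Filter pushdown: row groups that cannot contain passenger_count >= 4 are skipped.
    rows = con.execute("""
        SELECT passenger_count, AVG(tip_amount) AS avg_tip
        FROM read_parquet('https://example.org/data/taxi.parquet')
        WHERE passenger_count >= 4
        GROUP BY passenger_count
    """).fetchall()
    print(rows)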
Rete is an elegant algorithm that is also quite flexible: one can make non-distributive aggregations, outer joins and recursive traversal work. However, it is quite memory-hungry due to the large state of certain nodes (e.g., join operators). I tried to work around this by using distributed execution, but that made the system very inefficient and complex to operate. I also looked into the TREAT algorithm but found that it is more limited in its functionality than Rete.
My research went in similar directions, although I was looking at Rete for rule evaluation as part of a general-purpose workflow engine, with rules as just one component.
Incremental view maintenance is different enough from rule composition and evaluation that the model diverges toward something more optimal: collections of tuples and, instead of DAGs, lots of cyclic loops to keep computing the diffs.
There are deep, intrinsic space/time trade-offs, so many modern approaches moved toward natural dataflow concurrency and streaming semantics, where the space/time trade-offs can be chosen at runtime through batching and data context, as opposed to early Rete variations, which were very OOP and eagerly evaluated rather than lazy (everything in memory, in the same place, instantiated and mutated).
TL;DR: if your workload is aggregation-heavy, Parquet is pretty good. If you have many joins, the lack of fine-grained statistics will likely result in bad join plans.
One typical way to work around this limitation is to create a new table, copy the data there, and drop the old table. Copying from table to table is in general very fast.
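A minimal sketch of that pattern (the table, column names, and database file are hypothetical):

    import duckdb

    con = duckdb.connect("example.duckdb")  # hypothetical database file
    # Build the new table with the desired shape, then swap it in for the old one.
    con.execute("CREATE TABLE trips_new AS SELECT id, fare FROM trips WHERE fare IS NOT NULL")
    con.execute("DROP TABLE trips")
    con.execute("ALTER TABLE trips_new RENAME TO trips")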
Primary keys and foreign keys are rarely used in data science workloads. In our performance guide, we recommend that users avoid primary keys unless they are absolutely necessary [1]. In workloads running inside infrastructure components, keys and constraints are more important, and DuckDB supports them (but currently some of them can only be added when the table is created).
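For the cases where keys do matter, a minimal sketch of declaring them at table-creation time (the schema is made up):

    import duckdb

    con = duckdb.connect()
    con.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name VARCHAR)")
    con.execute("""
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER,
            amount DECIMAL(10, 2),
            FOREIGN KEY (customer_id) REFERENCES customers (id)
        )
    """)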
DuckDB runs fine on laptops, including x86 and ARM architectures.
It runs on macOS, Windows and Linux.
It's not available in the default conda channels yet, but it can be installed from conda-forge (or from PyPI via pip). It also works in R (via CRAN) and has bindings for many other languages.
Honestly, DuckDB looks fine to me (I haven't played with it yet, though articles like this would put me off it and make me question how long it's going to be around); it's the tone of the article, which sets up what appears to me to be a strawman and then plows on.
Also, my naive and quick look at your docs gives me the impression that this isn't actually a storage engine (which didn't come across in the article) but rather an in-memory query engine? The FAQ doesn't really clarify this (e.g., is there an actual storage format, how does this compare to using one of the more traditional DBs, how stable is it (cf. SQLite being recommended by the Library of Congress), does this assume all the data can be loaded into memory, or can I stream the data in?).
https://blobs.duckdb.org/slides/goto-amsterdam-2024-duckdb-g...