szarnyasg's comments

Hi, DuckDB devrel here. DuckDB is an analytical SQL database in the form factor of SQLite (i.e., in-process). This quadrant summarizes its space in the landscape:

https://blobs.duckdb.org/slides/goto-amsterdam-2024-duckdb-g...

It works as a replacement for / complement to dataframe libraries due to its speed and (vertical) scalability. It's lightweight and dependency-free, so it also works well as part of data processing pipelines.
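For instance, a minimal sketch in Python (the DataFrame contents and column names are made up for illustration): DuckDB can query a pandas DataFrame in-process by referring to its variable name.

    import duckdb
    import pandas as pd

    # Illustrative data only.
    df = pd.DataFrame({"city": ["Amsterdam", "Delft"], "population": [921_402, 103_581]})

    # DuckDB scans the DataFrame directly; no copy into a separate server is needed.
    result = duckdb.sql("SELECT city FROM df WHERE population > 200000").df()
    print(result)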


Hello, I'd love to use this but I work with highly confidential data. How can we be sure our data isn't leaking with this new UI? What assurances are there on this, and can you comment on the scope of the MotherDuck server interactions?


I'm a co-author of the blog post. I agree that the wording was confusing – apologies. I added a note at the end:

> The repository does not contain the source code for the frontend, which is currently not available as open-source. Releasing it as open-source is under consideration.


Some people work in serious environments, on heavily regulated data. Thanks for another software landmine!

Make it opt-in, or not installed by default, please – it's so hazardous.


I have observed the Makefile effect many times for LaTeX documents. Most researchers I have worked with had a LaTeX file full of macros that they had been carrying from project to project for years. These were often inherited from more senior researchers and were hammered into heavily modified forks of the article templates used in their field or the thesis templates used at their institution.


This is a great instance of the "Makefile effect", with a possible solution: use Markdown and Pandoc where possible. This won't work in every situation, but sometimes one can quickly compose a basic Beamer presentation or LaTeX paper using largely simple TeX and the same Markdown syntax you already know from GitHub and Reddit.


> use Markdown and Pandoc where possible.

That won’t solve any problem that LaTeX macros solve. Boilerplate in LaTeX has 2 purposes.

The first is to factor out frequently used complex notations. To do this in Markdown, you'd need to bolt a macro preprocessor onto Markdown.
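For example (a hypothetical macro, not from any particular template), such a notation shorthand in LaTeX might look like this:

    % Shorthand for a frequently-used piece of notation (requires amsmath).
    \newcommand{\norm}[1]{\left\lVert #1 \right\rVert}
    % In math mode: $\norm{x - y}$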

The second is to fine-tune typography and layout details (tables are a big offender). This simply cannot be done in Markdown: a table is a table, and if you don't like the style (which most of the time is inadequate), there is no solution.


A much better solution would be to use Typst, but that still might not work in all situations.


I am the author of the original post and I also wrote a followup blog post on it yesterday: https://szarnyasg.org/posts/duckdb-vs-coreutils/

Yes, if you break the file into parts with GNU Parallel, you can easily beat DuckDB, as I show in the blog post.
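For readers who don't want to reach for GNU Parallel, here is a rough Python sketch of the same split-and-count idea (not the exact pipeline from the blog post, and the file name is hypothetical): each worker counts newlines in a disjoint byte range, so the partial counts simply add up.

    import os
    from multiprocessing import Pool

    FILE = "measurements.txt"   # hypothetical input file
    CHUNK = 256 * 1024 * 1024   # bytes per worker

    def count_newlines(args):
        path, start, length = args
        count = 0
        with open(path, "rb") as f:
            f.seek(start)
            remaining = length
            while remaining > 0:
                block = f.read(min(8 * 1024 * 1024, remaining))
                if not block:
                    break
                count += block.count(b"\n")
                remaining -= len(block)
        return count

    if __name__ == "__main__":
        size = os.path.getsize(FILE)
        ranges = [(FILE, offset, min(CHUNK, size - offset))
                  for offset in range(0, size, CHUNK)]
        with Pool() as pool:
            print(sum(pool.map(count_newlines, ranges)))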

That said, I maintain that it's surprising that DuckDB outperforms wc (and grep) on many common setups, e.g., on a MacBook. This is not something many databases can do, and the ones which can usually don't run on a laptop.


Your follow-up post is helpful and appreciated!

Re the original analysis, my own opinion is that the outcome is only surprising when the critical detail highlighting how the two differ is omitted. It seems very unsurprising when rephrased to include that detail: "DuckDB, executed multi-threaded and parallelized, is 2.5x faster than single-threaded wc, even though in doing so DuckDB used 9.3x more CPU".

In fact, to me, the only surprising thing is how poorly DuckDB does compared to wc: 9x more CPU for only a 2.5x improvement.

But an interesting analysis regardless of the takeaways – thank you.


Hi – DuckDB Labs devrel here. It's great that you find DuckDB useful!

On the setup side, I agree that local (instance-attached) disks should be preferred, but does EBS incur an I/O fee? It certainly adds significant latency, but it doesn't have per-operation pricing:

> I/O is included in the price of the volumes, so you pay only for each GB of storage you provision.

(https://aws.amazon.com/ebs/pricing/)


Can’t remember anymore, but it’s either that (a) the gp2 volumes were way too slow for the ops or (b) the IOPS charges made it bad. To be clear, I didn’t do this with DuckDB; I was hosting a Postgres. I moved to Lightsail instead and was happy with it (you don’t get attached SSDs in EC2 until you go to instances that are super large).


DuckDB supports partial reading of Parquet files (also via HTTPS and S3) [1], so it can limit the scans to the required columns in the Parquet file. It can also perform filter pushdown, so querying data in S3 can be quite efficient.

Disclaimer: I work at DuckDB Labs.

[1] https://duckdb.org/docs/data/parquet/overview#partial-readin...
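A minimal sketch of what this looks like from Python (the URL and column names below are hypothetical): only the projected columns are fetched, and the filter is pushed into the Parquet scan.

    import duckdb

    duckdb.sql("INSTALL httpfs")
    duckdb.sql("LOAD httpfs")

    rel = duckdb.sql("""
        SELECT l_orderkey, l_extendedprice
        FROM read_parquet('https://example.org/data/lineitem.parquet')
        WHERE l_shipdate >= DATE '1995-01-01'
    """)
    print(rel.df())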


Does DuckDB support partial reading of .duckdb files hosted externally?


oooh, cool


I used Rete in my PhD work, where I designed incremental view maintenance techniques for the openCypher graph query language (https://szarnyasg.github.io/phd/szarnyasg-phd-dissertation.p...).

Rete is an elegant algorithm that is also quite flexible: one can make non-distributive aggregations, outer joins and recursive traversal work. However, it is quite memory-hungry due to the large state of certain nodes (e.g., join operators). I tried to work around this by using distributed execution, but that made the system very inefficient and complex to operate. I also looked into the TREAT algorithm but found that it is more limited in its functionality than Rete.

For incremental view maintenance, the state of the art has now moved on to techniques such as differential dataflow or DBSP (https://www.vldb.org/pvldb/vol16/p1601-budiu.pdf).


My research went in a similar direction, although I was looking at Rete for rule evaluation as part of a general-purpose workflow engine, with rules as just one part.

Incremental view maintenance is different enough from rule composition and evaluation that the model diverges to be more optimal: collections of tuples and, instead of DAGs, lots of cyclical loops to continue computing the diffs.

There are deep and intrinsic space/time trade-offs, so many of the modern approaches moved toward natural dataflow concurrency and streaming semantics, where the space/time trade-offs can be chosen at runtime through batching and data context, as opposed to early Rete variations, which were very OOP and eagerly evaluated instead of lazy (all in memory, in the same place, instantiated and mutated).

It'll be interesting to see where these differential dataflow approaches go as we head into local-first constraints, where a central authority on data isn't possible and long periods of divergence between synchronizations occur. Lots of CRDTs in this future, for sure. E.g., https://github.com/RhizomeDB/rs-rhizome / https://fission.codes/blog/fission-reactor-dialog-first-look...


I wrote a page about this: https://duckdb.org/docs/guides/performance/file_formats

TL;DR: if your workload is aggregation-heavy, Parquet is pretty good. If you have many joins, the lack of fine-grained statistics will likely result in bad join plans.

(Disclaimer: I work at DuckDB Labs)


DuckDB also does not (yet) support adding foreign keys to a table, see: https://duckdb.org/docs/sql/statements/alter_table#add--drop...

One typical way to work around this limitation is to create a new table, copy the data there, and drop the old table. Copying from table to table is in general very fast.
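A rough sketch of that workaround from Python (table and column names are hypothetical):

    import duckdb

    con = duckdb.connect("example.duckdb")

    # Recreate the table with the foreign key, copy the rows, then swap the names.
    con.execute("""
        CREATE TABLE orders_new (
            order_id INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customers (customer_id),
            amount DECIMAL(18, 2)
        )
    """)
    con.execute("INSERT INTO orders_new SELECT * FROM orders")
    con.execute("DROP TABLE orders")
    con.execute("ALTER TABLE orders_new RENAME TO orders")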

(Disclaimer: I work at DuckDB Labs)


That's weird. I thought you would have to have FKs to be an RDBMS, otherwise there isn't much R-MS in it.


Primary keys and foreign keys are rarely used in data science workloads. In our performance guide, we recommend that users avoid primary keys unless they are absolutely necessary [1]. In workloads running in infrastructure components, keys and constraints are more important, and they are supported by DuckDB (but currently some of them can only be added when the table is created).

[1] https://duckdb.org/docs/guides/performance/schema#microbench...


DuckDB runs fine on laptops, on both x86 and ARM architectures.

It runs on macOS, Windows and Linux.

It's not yet available in the default conda channels, but it can be installed from conda-forge (or from PyPI via pip). It also works in R (via CRAN) and has bindings for many other languages.

(Disclaimer: I work at DuckDB Labs)


Honestly, DuckDB looks fine to me (I haven't played with it yet, though articles like this would put me off it and make me question how long it's going to be around); it's the tone of the article, which sets up what appears to me to be a strawman and then plows on.

Also, my naive and quick look at your docs gives me the impression that this isn't actually a storage engine (which didn't come across in the article), but instead an in-memory query engine? The FAQ doesn't really clarify this (e.g., is there an actual storage format, how does this compare to using one of the more traditional DBs, how stable is it (cf. SQLite being recommended by the Library of Congress), does this assume all the data can be loaded into memory, or can I stream the data in?).

