DuckDB is a more convenient and performant tool for analyzing large CSV/JSON files. Shameless plug: if you’re interested in combining DuckDB with Claude using the recently announced Model Context Protocol (MCP), check out the MyDuck Server project (https://github.com/apecloud/myduckserver) that I’m working on. You can follow this guide https://github.com/apecloud/myduckserver/blob/main/docs/tuto... to get started. The guide includes a conversation history showcasing how the free-tier Claude Haiku model successfully analyzed the Our World in Data Energy dataset.
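For a sense of what this looks like in practice, here is a minimal sketch of querying a large CSV in place with DuckDB's Python API (the file name and columns are placeholders):

    import duckdb

    # Query the file in place: no load step, schema and types are inferred automatically.
    # 'events.csv' and the aggregation are placeholders for illustration.
    print(duckdb.sql("""
        SELECT user_id, count(*) AS n
        FROM read_csv_auto('events.csv')
        GROUP BY user_id
        ORDER BY n DESC
        LIMIT 10
    """))

The same works for JSON files via read_json_auto.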
MyDuck Server adds a Postgres frontend for DuckDB, allowing seamless interaction with DuckDB via the official Postgres MCP server.
Thanks for your interest! TL;DR: pg_duckdb is an extension that runs inside the Postgres server itself, while MyDuck operates as a standalone replica that can run on a separate server.
Details:
pg_duckdb uses DuckDB as an execution engine while keeping the data stored in Postgres in its usual row-oriented format. It identifies queries that would benefit from DuckDB and executes them there. This works by registering a replacement scan in DuckDB, which lets DuckDB scan the Postgres data directly and convert it into an analytics-friendly columnar format on the fly.
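The replacement-scan mechanism itself is easiest to see in DuckDB's Python API, where the same hook lets a query reference an in-memory DataFrame by name (pg_duckdb registers its scan in C++ inside Postgres, so this is only an analogy):

    import duckdb
    import pandas as pd

    # DuckDB resolves the bare name 'df' at bind time through a replacement scan,
    # much like pg_duckdb resolves Postgres table names into on-the-fly scans of
    # Postgres heap data converted into columnar vectors.
    df = pd.DataFrame({"region": ["EU", "US", "EU"], "consumption": [120, 340, 95]})
    print(duckdb.sql("SELECT region, sum(consumption) AS total FROM df GROUP BY region"))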
MyDuck, on the other hand, functions as a standby server replicating changes from a primary Postgres server. The replicated data is stored in DuckDB’s native columnar format. This approach leverages both DuckDB’s optimized columnar storage and its execution engine, generally resulting in faster query performance compared to scanning Postgres data directly.
Additionally, pg_duckdb is currently limited to full table scans — it doesn’t yet support skipping unnecessary data even with restrictive WHERE conditions (though there are plans to add index-based scans: https://github.com/duckdb/pg_duckdb/issues/243). In contrast, DuckDB’s native storage, used by MyDuck, includes extensive support for data skipping (a.k.a. filter pushdown).
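You can see the effect of pushdown on DuckDB's native storage by inspecting a query plan (illustrative table and predicate):

    import duckdb

    con = duckdb.connect()
    con.sql("CREATE TABLE readings AS SELECT range AS id, range % 100 AS sensor FROM range(1000000)")
    # The WHERE predicate is pushed into the scan operator, and row groups whose
    # min/max statistics (zone maps) exclude the range are never read.
    print(con.sql("EXPLAIN SELECT count(*) FROM readings WHERE id BETWEEN 10 AND 20"))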
Hope this clarifies the differences! Feel free to reach out with any more questions.
We have successfully tested this project with the official Model Context Protocol (MCP) server for Postgres. For those interested in integrating this project with LLMs, you can find the guide here: https://github.com/apecloud/myduckserver/blob/main/docs/tuto...
We were pleasantly surprised that the integration worked seamlessly on our first attempt. Moreover, we observed some fascinating behavior: DuckDB's clear and suggestive error messages proved incredibly helpful for the model. During our testing, the free-tier Claude Haiku model initially hallucinated table and column names, but it impressively corrected them based on DuckDB's suggestions.
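To give a flavor of the errors it recovered from, here is a hypothetical column mix-up; the candidate-bindings hint in the output is genuine DuckDB behavior:

    import duckdb

    con = duckdb.connect()
    con.sql("CREATE TABLE energy (country VARCHAR, year INT, solar_consumption DOUBLE)")
    try:
        con.sql("SELECT solar FROM energy")  # a hallucinated column name
    except duckdb.BinderException as e:
        # The error lists candidate columns such as energy.solar_consumption,
        # which is the kind of hint the model used to correct its next query.
        print(e)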
This integration highlights the conveniences offered by this project:
- DuckDB’s native SQL support: We can create a table directly from a CSV URL without manually determining the schema (see the sketch after this list).
- Standard Postgres protocol support: The official Postgres MCP server works out-of-the-box.
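On the first point, a minimal sketch of what that looks like (the URL is a placeholder; DuckDB fetches remote files through its httpfs extension and infers the schema from the file itself):

    import duckdb

    con = duckdb.connect()
    # No explicit column list needed: names and types are detected automatically.
    con.sql("""
        CREATE TABLE energy AS
        SELECT * FROM read_csv_auto('https://example.com/owid-energy-data.csv')
    """)
    print(con.sql("DESCRIBE energy"))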
We’d love to hear your thoughts and questions about LLM + MyDuck integration!
Great question! Improving the speed of writing updates to DuckDB has been a significant focus for us. Early in the project, we identified that DuckDB is quite slow for single-row writes, as discussed in this issue: https://github.com/apecloud/myduckserver/issues/55
To address this, we implemented an Arrow-based columnar buffer that accumulates updates from the primary server. This buffer is flushed into DuckDB at fixed intervals (currently every 200ms) or when it exceeds a certain size threshold. This approach significantly reduces DuckDB's write overhead.
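MyDuck implements this in Go on top of an Arrow buffer; the sketch below only illustrates the flush policy, with made-up names (run_flusher, flush_to_duckdb) and a row-count threshold standing in for the real size check:

    import queue
    import time

    FLUSH_INTERVAL = 0.2      # seconds, mirroring the 200ms interval mentioned above
    FLUSH_THRESHOLD = 10_000  # rows; an illustrative stand-in for the size threshold

    def run_flusher(incoming: queue.Queue, flush_to_duckdb) -> None:
        """Accumulate replicated rows and hand them to DuckDB in batches."""
        buf = []
        deadline = time.monotonic() + FLUSH_INTERVAL
        while True:
            try:
                buf.append(incoming.get(timeout=max(0.0, deadline - time.monotonic())))
            except queue.Empty:
                pass
            # Flush on either trigger: the interval expired or the buffer grew too large.
            if len(buf) >= FLUSH_THRESHOLD or time.monotonic() >= deadline:
                if buf:
                    flush_to_duckdb(buf)  # one batched write instead of many single-row writes
                    buf.clear()
                deadline = time.monotonic() + FLUSH_INTERVAL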
Additionally, we developed dedicated replication message parsers that write directly to the Arrow buffer, minimizing allocations.
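Once a batch is assembled in Arrow, handing it to DuckDB is a single columnar append. The Go implementation uses different bindings, so this Python snippet is only illustrative of the data path:

    import duckdb
    import pyarrow as pa

    # A batch accumulated in Arrow (built directly here for brevity) is inserted
    # into DuckDB in one shot; DuckDB reads the Arrow table zero-copy where possible.
    batch = pa.table({"id": pa.array([1, 2, 3], type=pa.int64()),
                      "val": pa.array([10.0, 20.0, 30.0])})

    con = duckdb.connect()
    con.sql("CREATE TABLE t (id BIGINT, val DOUBLE)")
    con.sql("INSERT INTO t SELECT * FROM batch")  # 'batch' resolved via replacement scan
    print(con.sql("SELECT * FROM t"))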