Yesterday, I migrated my web worker code from Comlink to CapnWeb. I have extensive experience with Cloudflare Worker bindings, and as mentioned in the original post, the two are quite similar.
Everything appears to be functioning smoothly, but I do miss the ‘transfer’ feature in Comlink. Although it wasn’t a critical feature, it was a nice one.
The best aspect of CapnWeb is that we can reuse most of the code related to clients, servers, and web workers (including Cloudflare Workers).
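For anyone who hasn't used it, `transfer` lets you hand a transferable (e.g. an ArrayBuffer) to the worker instead of structured-cloning it. A minimal sketch of the Comlink side, with a hypothetical `processBuffer` method exposed by the worker:

```ts
// Main thread — minimal sketch of Comlink's transfer feature.
// `processBuffer` is a hypothetical method exposed via Comlink.expose() in the worker.
import * as Comlink from "comlink";

const worker = new Worker(new URL("./worker.ts", import.meta.url), { type: "module" });
const api = Comlink.wrap<{ processBuffer(buf: ArrayBuffer): Promise<number> }>(worker);

const buf = new ArrayBuffer(1024 * 1024);
// Transfer ownership of the buffer instead of copying it;
// `buf` is detached on this side after the call.
const result = await api.processBuffer(Comlink.transfer(buf, [buf]));
```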
DuckDB is an excellent choice for this task, and it’s incredibly fast!
We’ve also added vector search to our product, which is really useful.
OpenAI’s official examples of embedding search use cosine similarity. But here’s the cool part: since OpenAI embeddings are normalized to unit length, the dot product gives exactly the same ranking, so you can just use that instead!
DuckDB has a super fast dot product function that you can use with SQL.
In our product, we use duckdb-wasm to do vector searches on the client side.
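Roughly how that can look with duckdb-wasm, as a hedged sketch: the `docs` table and `embedding` column are made up, and I'm assuming a recent DuckDB build where `list_dot_product` (an alias of `list_inner_product`) is available.

```ts
import * as duckdb from "@duckdb/duckdb-wasm";

// Standard duckdb-wasm bootstrap: pick a bundle and start the worker.
const bundle = await duckdb.selectBundle(duckdb.getJsDelivrBundles());
const workerUrl = URL.createObjectURL(
  new Blob([`importScripts("${bundle.mainWorker!}");`], { type: "text/javascript" })
);
const db = new duckdb.AsyncDuckDB(new duckdb.ConsoleLogger(), new Worker(workerUrl));
await db.instantiate(bundle.mainModule, bundle.pthreadWorker);
const conn = await db.connect();

// Hypothetical `docs` table with an `embedding FLOAT[]` column.
// Because OpenAI embeddings are unit-length, ranking by dot product
// gives the same order as cosine similarity, with less work.
async function search(queryEmbedding: number[], k = 10) {
  const literal = `[${queryEmbedding.join(", ")}]::FLOAT[]`;
  const result = await conn.query(`
    SELECT id, list_dot_product(embedding, ${literal}) AS score
    FROM docs
    ORDER BY score DESC
    LIMIT ${k}
  `);
  return result.toArray().map((row) => row.toJSON());
}
```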
What library do you use to compute embeddings right now? I'm wondering if it's possible to do both embedding and vector search client-side for a local-first app.
This is exactly why we adopted DuckDB and duckdb-wasm in our service.
Our team is currently building a form builder SaaS. Most forms have fewer than 1,000 responses, but some have more than 50,000.
So when a user explores all responses in our “response sheet” feature, they are usually loaded via infinite scrolling (rows load as they scroll).
That can add up to 100 MB of network traffic in total if they have to fetch object arrays of 50,000 rows with 50 columns.
That’s where DuckDB kicked in: just store the responses in object storage as a Parquet file (S3, or in our case Cloudflare R2).
Then load the whole file into duckdb-wasm on the client. When you scroll through the sheet, instead of fetching rows from the server, you query rows from the local DB.
This made our sheet feature very efficient and consistent in terms of speed and memory usage.
If network speed and memory are your bottlenecks when loading “medium”-sized data into your client, you should definitely give it a try.
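Here's a rough sketch of that flow; `conn` is an open duckdb-wasm AsyncDuckDBConnection, and the bucket URL and `submitted_at` column are invented:

```ts
// One-time load: pull the whole Parquet file from R2 into a local table.
// duckdb-wasm can read Parquet over HTTP(S) as long as CORS allows it.
const url = "https://responses.example-bucket.r2.dev/form-123.parquet";
await conn.query(`
  CREATE TABLE responses AS
  SELECT * FROM read_parquet('${url}')
`);

// From then on, the sheet pages through the local table instead of
// hitting the server on every scroll step.
async function getRows(offset: number, limit = 100) {
  const result = await conn.query(`
    SELECT * FROM responses
    ORDER BY submitted_at
    LIMIT ${limit} OFFSET ${offset}
  `);
  return result.toArray().map((row) => row.toJSON());
}
```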
I'm currently rewriting https://github.com/ownstats/ownstats to this model, with the slight difference that I stream Arrow data from an AWS Lambda Function URL into DuckDB WASM in the frontend... Works great.
But the data is still remote (in object storage), right? If I understand correctly, this works better than the first solution because Parquet is a much more efficient format to query?
Long story short, you could either 1) query specific columns using an S3 + Parquet + DuckDB stack, or 2) load the Parquet file over the network and put it inside a local duckdb-wasm instance so that you can run queries client-side.
My comment was a bit ambiguous. For sheets where we have to load all the data, we load all columns at once as a Parquet file. (I’ll cover the advantages of this approach in the next comment.)
On the other hand, say we have to draw a chart from a single column. The chart type can be changed by the user: it could be a pie chart, a mean, a time series, a median, a table, or even dot products. To achieve this, we fetch just that column from S3 using DuckDB and run SQL queries client-side, rendering the appropriate UI.
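A sketch of that per-column path, with an invented URL and column: since Parquet is columnar and DuckDB pushes the projection down to HTTP range requests, only the bytes of that one column get downloaded.

```ts
// Fetch only the `submitted_at` column from the remote Parquet file and
// bucket it per day for a time-series chart; other columns are never downloaded.
const url = "https://responses.example-bucket.r2.dev/form-123.parquet";

async function responsesPerDay() {
  const result = await conn.query(`
    SELECT date_trunc('day', submitted_at) AS day, COUNT(*) AS responses
    FROM read_parquet('${url}')
    GROUP BY day
    ORDER BY day
  `);
  return result.toArray().map((row) => row.toJSON());
}
```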
The advantages of loading Parquet client-side are that 1) you only have to load data from the server once, and 2) Parquet files are surprisingly well compressed.
1) If you load once from the server, there are no more small network requests while you scroll through a table. Moreover, you can use the same DuckDB table to visualize data or show the raw data.
2) Sending the whole dataset as a Parquet file over the network is faster than receiving the data as JSON responses.
I wonder how much of the benefit comes from just not using JSON vs. reducing round trips. I guess if you had a large table you could stream (smaller-than-usual) row groups from the Parquet file? Not sure how much work that would be, though.
I'm not sure what the optimal size for an HTTP response is, but there are probably diminishing efficiency returns above a megabyte or two, and more of a latency hit from reading the whole file. So if you used row groups of a couple of MB and streamed them in, you'd probably get the best of both worlds.
We also tried Arrow JS and parquet-wasm, and they were much lighter than the duckdb-wasm worker.
DuckDB was still useful in our case, though: as a form builder service we had to provide statistics features, and it was cool to have an OLAP engine inside a web worker that could handle (as far as we checked) more than 100,000 rows with ease.
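For a flavor of what that enables, here's a hedged sketch of the kind of statistics query you can push into the worker (the `responses` table and the `answer`/`rating` columns are hypothetical):

```ts
// Grouped statistics computed entirely inside the duckdb-wasm worker;
// even with 100k+ rows this stays interactive.
const stats = await conn.query(`
  SELECT
    answer,
    COUNT(*)       AS responses,
    AVG(rating)    AS mean_rating,
    MEDIAN(rating) AS median_rating
  FROM responses
  GROUP BY answer
  ORDER BY responses DESC
`);
console.table(stats.toArray().map((row) => row.toJSON()));
```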
Great work. What are the best options for a data table with a large amount of data? Storing that much data in state can be inefficient, so our team uses duckdb-wasm and Virtuoso to render only the currently visible rows.
I tried Perspective tho.
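Roughly the shape of that duckdb-wasm + Virtuoso combination, as a sketch rather than our exact code: react-virtuoso's `rangeChanged`/`itemContent` props are real, but the table and columns are hypothetical, and a real version would debounce and deduplicate the queries.

```tsx
import { useCallback, useState } from "react";
import { Virtuoso } from "react-virtuoso";
import type { AsyncDuckDBConnection } from "@duckdb/duckdb-wasm";

type Props = { conn: AsyncDuckDBConnection; totalCount: number };

export function ResponseSheet({ conn, totalCount }: Props) {
  // Sparse cache of loaded rows, keyed by row index.
  const [rows, setRows] = useState<Record<number, Record<string, unknown>>>({});

  // When the visible range changes, fetch that window from the local DuckDB table.
  const loadRange = useCallback(
    async (startIndex: number, endIndex: number) => {
      const result = await conn.query(`
        SELECT * FROM responses
        ORDER BY submitted_at
        LIMIT ${endIndex - startIndex + 1} OFFSET ${startIndex}
      `);
      setRows((prev) => {
        const next = { ...prev };
        result.toArray().forEach((row, i) => {
          next[startIndex + i] = row.toJSON();
        });
        return next;
      });
    },
    [conn]
  );

  return (
    <Virtuoso
      totalCount={totalCount}
      rangeChanged={({ startIndex, endIndex }) => void loadRange(startIndex, endIndex)}
      itemContent={(index) => {
        const row = rows[index];
        return <div>{row ? JSON.stringify(row) : "Loading…"}</div>;
      }}
    />
  );
}
```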
We tunnel SSH into WSL on our Windows workstations, protected with email verification (restricted to our domain), and it’s been working perfectly.
I’m curious, though, about how we can expose Docker services. It would be fantastic to have a remote build server set up with Cloudflare Tunnel.