
Big fan of Cloudflare Tunnel here, too.

We use Cloudflare Tunnel to SSH into WSL on our Windows workstations, protected with email verification (restricted to our domain), and it’s been working perfectly.

I’m curious, though, about how we can expose Docker services. It would be fantastic to have a remote build server set up with Cloudflare Tunnel.
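My guess is that a Docker service is just another ingress rule pointing at whatever port the container publishes, but I haven't verified it. A rough cloudflared config sketch (the hostnames, ports, and paths below are made up):

    # ~/.cloudflared/config.yml -- untested sketch
    tunnel: <tunnel-id>
    credentials-file: /home/user/.cloudflared/<tunnel-id>.json
    ingress:
      # SSH into WSL on the workstation
      - hostname: ssh.example.com
        service: ssh://localhost:22
      # A Docker container published on a local port, e.g. a build server
      - hostname: build.example.com
        service: http://localhost:8080
      # cloudflared requires a catch-all rule at the end
      - service: http_status:404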


pi-thon


Yesterday, I migrated my web worker code from Comlink to CapnWeb. I’ve had extensive experience with Cloudflare Worker bindings, and as mentioned in the original post, the two are quite similar.

Everything appears to be functioning smoothly, but I do miss the ‘transfer’ feature in Comlink. Although it wasn’t a critical feature, it was a nice one.
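For anyone unfamiliar, the ‘transfer’ I mean looks roughly like this in Comlink (a sketch; the worker file and the processBuffer method are made-up examples):

    // main.ts -- sketch of Comlink.transfer(): the ArrayBuffer is moved to the
    // worker instead of being copied by structured clone.
    import * as Comlink from "comlink";

    const worker = new Worker(new URL("./worker.ts", import.meta.url), { type: "module" });
    const api = Comlink.wrap<{ processBuffer(buf: ArrayBuffer): number }>(worker);

    const buf = new ArrayBuffer(1024 * 1024);
    // Without transfer() the 1 MB buffer would be cloned; with it, ownership moves.
    const result = await api.processBuffer(Comlink.transfer(buf, [buf]));

That zero-copy hand-off is the part I miss.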

The best aspect of CapnWeb is that we can reuse most of the code related to clients, servers, and web workers (including Cloudflare Workers).


DuckDB is an excellent choice for this task, and it’s incredibly fast!

We’ve also added vector search to our product, which is really useful.

OpenAI’s official examples of embedding search use cosine similarity. But here’s the cool part: since OpenAI embeddings are unit vectors, the dot product equals the cosine similarity, so you can just run the dot product instead!

DuckDB has a super fast dot product function that you can use with SQL.

In our product, we use duckdb-wasm to do vector searches on the client side.
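For example, a top-k search can be a single SQL statement (a sketch; `conn` is a duckdb-wasm AsyncDuckDBConnection, `embed()` is a hypothetical helper that returns the query embedding, and the table/column names are made up):

    // Rank documents by dot product against the query embedding.
    // Assumes a table docs(id, text, embedding FLOAT[]) already loaded into DuckDB.
    const queryEmbedding: number[] = await embed("how do I export responses?");

    const result = await conn.query(`
      SELECT id, text,
             list_dot_product(embedding, [${queryEmbedding.join(",")}]::FLOAT[]) AS score
      FROM docs
      ORDER BY score DESC
      LIMIT 10;
    `);
    console.table(result.toArray());

(Interpolating the embedding into the SQL string is fine for a sketch; in production you’d want a prepared statement.)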


>> In our product, we use duckdb-wasm to do vector searches on the client side.

Curious, what is your product?

Edit: Nevermind, your recent post explained it quite well: https://news.ycombinator.com/item?id=40520073


What library do you use to compute embeddings right now? I'm wondering if it's possible to do both embedding and vector search client-side for a local-first app.


I think Transformers.js can do that client-side, e.g. see the documentation for calculating embeddings with Sentence Transformers models: https://huggingface.co/docs/transformers.js/api/pipelines#mo...
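Something along these lines (an untested sketch based on those docs; the model name is one of their examples):

    // Client-side sentence embeddings with Transformers.js
    import { pipeline } from "@xenova/transformers";

    // Downloads the model on first use and caches it in the browser
    const extractor = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2");

    // Mean-pooled, normalized embedding -- a 384-dimensional Float32Array
    const output = await extractor("This is a sentence to embed.", {
      pooling: "mean",
      normalize: true,
    });
    console.log(output.data.length); // 384

Since the output is normalized, the dot-product trick mentioned upthread applies here too.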


This is exactly why we adopted DuckDB and duckdb-wasm in our service.

Our team is currently building a form builder SaaS. Most forms have fewer than 1,000 responses, but some have more than 50,000.

So, when a user explores all responses in our “response sheet” feature, they are usually loaded via infinite scrolling (rows load as the user scrolls).

Fetching object arrays for 50,000 rows with 50 columns can use up to 100 MB of network transfer in total.

That’s where DuckDB kicked in: just store the responses in S3 as a Parquet file (in our case, Cloudflare R2).

Then, load the whole file into duckdb-wasm on the client. So when you scroll through the sheet, instead of fetching rows from the server, you query rows from the local database.

This made our sheet feature very efficient and consistent in terms of speed and memory usage.

If network speed and memory are your bottlenecks when loading “medium-sized” data into your client, you should definitely give it a try.
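Roughly, the client-side flow looks like this (a sketch with made-up URLs and names; `db` is an already-initialized duckdb-wasm AsyncDuckDB):

    // Load one form's responses from R2 into duckdb-wasm, then query locally.
    const resp = await fetch("https://example.r2.dev/forms/123/responses.parquet");
    const buf = new Uint8Array(await resp.arrayBuffer());

    // Register the Parquet bytes as a virtual file inside duckdb-wasm
    await db.registerFileBuffer("responses.parquet", buf);

    const conn = await db.connect();
    // Materialize once; every later scroll/filter/sort hits this local table
    await conn.query(`CREATE TABLE responses AS SELECT * FROM read_parquet('responses.parquet')`);

    // One "page" of the sheet, without touching the network
    const page = await conn.query(`SELECT * FROM responses ORDER BY submitted_at LIMIT 100 OFFSET 400`);
    console.log(page.numRows);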

PS. If you have any questions, feel free to ask!

PS. Our service is called Walla, check it out at https://home.walla.my/en


I'm currently rewriting https://github.com/ownstats/ownstats to this model, with a slight difference: I stream Arrow data from an AWS Lambda Function URL into DuckDB WASM in the frontend... Works great.

An improvement could be having pre-calculated DuckDB database files that are directly attached from the DuckDB WASM frontend, see https://duckdb.org/docs/guides/network_cloud_storage/duckdb_...
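i.e. something like this from the WASM side (a sketch; the URL and table name are placeholders, and `conn` is a duckdb-wasm connection):

    // Attach a pre-built, read-only DuckDB database file served over HTTPS
    await conn.query(`ATTACH 'https://example.com/stats/precomputed.duckdb' AS stats (READ_ONLY)`);
    const rows = await conn.query(`SELECT * FROM stats.daily_pageviews LIMIT 10`);

That would skip the Arrow streaming step entirely for anything that can be pre-aggregated.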


> you query rows from local db.

But the data is still remote (in object storage), right? If I understand correctly, this then works better than the first solution because Parquet is a much more efficient format to query?


Long story short, you can either 1) query specific columns directly with an S3 + Parquet + DuckDB stack, or 2) load the Parquet file over the network and put it into a local duckdb-wasm instance so that you can run queries on the client side.


My comment was a bit ambiguous. For sheets where we have to load all the data, we load all columns at once as a Parquet file. (I’ll describe the advantage of this approach in the next comment.)

On the other hand, say we have to draw a chart from a single column. The chart type can be changed by the user: pie charts, means, time-series charts, medians, tables, or even dot products. For that, we fetch just that column from S3 using DuckDB, apply SQL queries on the client side, and render the appropriate UI.
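For example (a sketch; the bucket URL and column name are made up, and it assumes the bucket allows range requests from the browser):

    // Read only the `answer` column straight from object storage -- DuckDB uses
    // HTTP range requests, so the rest of the Parquet file is never downloaded.
    const breakdown = await conn.query(`
      SELECT answer, COUNT(*) AS n
      FROM read_parquet('https://example.r2.dev/forms/123/responses.parquet')
      GROUP BY answer
      ORDER BY n DESC;
    `);

That result is enough to render a pie chart without loading the other 49 columns.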


It's probably part of it, but also overhead from small requests and latency from round trips.


Great point.

The advantages of loading Parquet on the client side are that 1) you only have to load the data once from the server, and 2) Parquet files are surprisingly well compressed.

1) If you load once from the server, there are no more small network requests while you scroll through a table. Moreover, you can reuse the same DuckDB table to visualize the data or to show the raw rows.

2) Sending the whole dataset as a Parquet file over the network is faster than receiving the data as JSON responses.


I wonder how much of the benefit is from just not using json vs reducing round trips. I guess if you had a large table you could stream (smaller than normal) row groups from parquet? Not sure how much work that would be though.

I'm not sure what the optimal size for an HTTP response is, but there are probably diminishing efficiency returns above a MB or two, and more of a latency hit for reading the whole file. So if you used row groups of a couple of MB and then streamed them in, you'd probably get the best of both worlds.


You wouldn't need DuckDB for this; you can simply store the Parquet file in S3 and read it using a Parquet NPM package.


Exactly.

We also tried Arrow JS and parquet-wasm, and they were much lighter than the duckdb-wasm worker.

DuckDB, however, was useful in our case: as a form builder service, we had to provide statistics features. It was cool to have an OLAP engine inside a web worker that could handle (as far as we checked) more than 100,000 rows with ease.


I'm still unconvinced.

A regular JavaScript array can also handle 100k object rows very fast.


this is legit awesome!


So you have duckdb running on the server (e.g. node.js) and duckdb-wasm running on the client? Or are you hitting S3 directly with duckdb-wasm?


Wow this is indeed cool. Figma has been tightening up our usage. What other options or alternatives would there be?


Great work. What are the best options for a data table with a large amount of data? Storing a mass of data in state can be inefficient, so our team uses duckdb-wasm and Virtuoso to render only the currently visible rows. I've tried Perspective, though.
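For context, the duckdb-wasm + Virtuoso combination I mean looks roughly like this (an untested sketch; the Row shape and component names are made up):

    // react-virtuoso only renders the rows in view; the data itself was loaded
    // earlier from a duckdb-wasm query (e.g. conn.query(...).toArray()).
    import { Virtuoso } from "react-virtuoso";

    type Row = { id: number; name: string };

    export function ResponseSheet({ rows }: { rows: Row[] }) {
      return (
        <Virtuoso
          totalCount={rows.length}
          itemContent={(index) => <div>{rows[index].name}</div>}
        />
      );
    }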


Cool

