vsroy's comments | Hacker News

Is the thing that's going on here that softmax can't push a value to 0, but by subtracting two softmax maps we can output 0s?

Follow-up question: isn't it extremely unlikely to output exactly 0?

Or negatives?
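My mental model of the question, sketched in plain Python (this is just the standard softmax definition, not code from any particular paper): a single softmax is strictly positive, so it can never emit an exact 0, but the difference of two softmax maps can be zero or negative.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

a = softmax([1.0, 2.0, 3.0])
b = softmax([3.0, 2.0, 1.0])

# Difference of two *different* softmax maps: some entries go negative.
diff = [x - y for x, y in zip(a, b)]
assert diff[0] < 0 and diff[2] > 0

# Difference of *identical* softmax maps: exact zeros, which a single
# softmax alone can never produce.
zeros = [x - y for x, y in zip(a, a)]
assert all(z == 0.0 for z in zeros)
```

So exact zeros are easy to hit when the two maps agree on a coordinate, and negatives appear whenever they disagree.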

Thanks for the help. After upgrading my instance, all the query times dropped dramatically. The thing is, I did not even upgrade the instance that much (micro -> medium).

I'm guessing some important bottleneck was being hit, but I have no idea what it was (maybe 1 GB of RAM was causing a query to spill over to disk, or something like that?).

It seems important to understand what bottleneck I hit -- but frankly, I have no idea.


It seems like the modern solution is to use something like temporal.io (related to windmill). Still, surely people were solving this problem for ages before temporal.io existed.


This has a context window of 65K for the storywriter version.


Server actions are amazing. I'll finally never need to write a single POST/GET request; it's the perfect RPC.


Thanks for the response.

One thing to clarify: It's a 300K x 300K similarity matrix, which means I have 300K embeddings. Each embedding itself only has dimension 512.

In other words, the similarity matrix is the similarity between each embedding & every other embedding in the 300K set of embeddings.

Regardless, I think Dask will be useful here.
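To make the blocking idea concrete, here's an illustrative sketch in plain Python (tiny hypothetical sizes standing in for 300K x 512, and no Dask dependency): compute the N x N similarity matrix in tiles, so the full matrix never has to live in memory at once. Dask's blocked arrays apply the same idea automatically.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

random.seed(0)
N, DIM, TILE = 8, 4, 3  # stand-ins for 300_000, 512, and a real block size
emb = [[random.random() for _ in range(DIM)] for _ in range(N)]

def similarity_tiles(emb, tile):
    """Yield ((row_start, col_start), block) tiles of the full N x N matrix."""
    n = len(emb)
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            block = [[cosine(u, v) for v in emb[j:j + tile]]
                     for u in emb[i:i + tile]]
            yield (i, j), block

# Diagonal entries of the full matrix are self-similarities, i.e. ~1.0.
for (i, j), block in similarity_tiles(emb, TILE):
    if i == j:
        for k in range(len(block)):
            assert abs(block[k][k] - 1.0) < 1e-9
```

Each tile is independent, which is exactly what lets Dask schedule them in parallel and spill finished tiles to disk.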


I imagine with these types of things the vast majority of the work is writing integrations. Could you explain how this makes writing integrations easier?


Right now, to be perfectly honest, it doesn't yet. This is by design: the purely integrations-oriented side is a crowded space, and I'm not ready to jump in there yet. Stay tuned :)

The focus instead has been in a place where it's easier to stand out: deeper query ability over the data providers than any competing solution, with stronger optimization guarantees. The goal is that any query optimization you could implement by hand, you should be able to implement within Trustfall -- while having the benefit of purely declarative type-checked queries, with integrated docs and autocomplete. If a query is too slow, you don't have to rewrite it -- you can just use Trustfall's APIs to tweak how it executes under the hood and speed it up by using caching, batching, prefetch, or an index.

For a real-world demo of that, I wrote a blog post about speeding up cargo-semver-checks (a Rust linter for semantic versioning) by 2272x without changing any query, just by making some indexes available to the execution of the existing queries. This is awesome because it empowers anyone to write queries, knowing that their query will either already be fast or can be made fast in the future in a way that is entirely invisible to the query itself. Link: https://predr.ag/blog/speeding-up-rust-semver-checking-by-ov...
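The general shape of that optimization, sketched in Python with hypothetical data (this is not cargo-semver-checks' actual code): replace a repeated linear scan with a prebuilt hashtable index, without touching the "query" code that consumes the results.

```python
from collections import defaultdict

items = [{"name": f"fn_{i}", "kind": "function" if i % 2 else "struct"}
         for i in range(1000)]

# Unoptimized resolution: every lookup is a full scan, O(n) per query.
def find_by_name_scan(name):
    return [it for it in items if it["name"] == name]

# Optimized resolution: build the index once, then O(1) lookups.
index = defaultdict(list)
for it in items:
    index[it["name"]].append(it)

def find_by_name_indexed(name):
    return index.get(name, [])

# The caller (the "query") is unchanged; only the resolution strategy differs.
assert find_by_name_scan("fn_7") == find_by_name_indexed("fn_7")
```

Because both functions return identical results, swapping one for the other is invisible to everything upstream, which is the property the blog post leans on.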


I read that yesterday and felt a bit tricked. You said there was a one-line diff, which to me suggested you had made a query optimiser and added that to trustfall so that consumer applications could transparently benefit without writing any new code. But really, as it turned out, you just added APIs for indexing a table and using those indexes, and then used those APIs to do select manual optimisations in a “fast path” in the trustfall-rustdoc-adapter crates.

What does it matter that some crates are called trustfall adapters and some are not? You still had to optimise the execution of the query manually. I can see how it’s cool you didn’t need to change the text of the query, but people like SQL because the execution engines are smart enough to optimise for you. They will build a little hash table index on the fly if it saves a lot of full table scanning. The expectations re smart optimisation in the market you’re competing in are very high. If you say it was a one line upgrade to the trustfall executor then people will believe you.

The net result is better than what most GraphQL infrastructure gives you. GraphQL doesn’t give you any tools to avoid full table scans, it just tells you “here’s the query someone wants, I will do the recursive queries with as many useless round trips as possible unless you implement all these things from scratch”. At least your API has the concept of an index now. But I think you’re trying to sell it as being as optimisable as SQL while trying to avoid telling users the bad news that it’s going to be them who has to write the optimiser, not you.


Trustfall isn't trying to compete with SQL; doing so would be suicidal and pointless. If the dataset being queried already has a statistics-gathering query optimizer, it's just the wrong call to not use that optimizer. If one wrote a Trustfall query that is partially or fully served by data in a SQL database, the correct answer is to use Trustfall to compile that piece of the query to SQL and run it as SQL (perhaps by adding some auto-pagination/parallelization before compiling to SQL, but that's beside the point).

Most uses of data don't have anything like SQL / any kind of a query language, let alone an optimizer. No tool I know of other than Trustfall can let one have optimization levers (automatic or human-in-the-loop) where one can optimize access to a REST API, a local JSON file, or a database -- all separately from how you write the query.

With Trustfall, I'm not promising "magical system that will optimize your queries for you without you having to lift a finger" -- at least not for a good long while. But I can promise, and deliver, "you can write queries over any combo of data sources, and if need be optimize them without rewriting them from scratch." This means that you can have product-facing engineers write queries, and infra-facing engineers optimize them, with both sides isolated from the other: product doesn't care if there's a cache or an index or a bulk API endpoint vs item-at-a-time endpoint, and infra has strong guarantees on execution performance and optimizability so they aren't that worried about a product query getting out of hand and wrecking the system. Trustfall buys operational freedom and leverage across your entire data footprint.

You can see this effect in play in cargo-semver-checks. We use lint-writing as an onboarding tool, because anyone can write a query, and we know we can optimize them later if need be. Both Trustfall and the adapters will get better over time, so queries get faster "naturally". We get efficient execution over many different rustdoc JSON formats simultaneously, without version-specific query logic. And while the hashtable indexing optimizations required some manual work that I didn't time exactly, it was limited to ~1-2h tops and made all queries in the repo faster automatically with no query changes. Rolling out the optimization would be operationally very simple: it's trivial to test, and thanks to the Trustfall engine, I wouldn't have to test it with every combination of filters and edge operations -- if the edge fetch logic is correct, the engine guarantees the rest. Put simply, nobody else needed to know that I made the optimization -- the only observable impact to any other dev on the project is that queries run faster now.


I know all that. I just thought you might like to do another pass editing your piece. It is your marketing material at this stage. It would be nice if it gave people a clearer impression of what the capabilities are and where trustfall is positioned in relation to SQL, GraphQL, and other stuff. I came away a bit suspicious of your claims because I didn’t understand them when I first read it.

My only question about the actual code is whether you can write these indices to do hash lookups across data sources. Can I avoid table scans when joining two data sets from different adapters?


I appreciate it! Writing is hard (especially not in my native language) and I'm always looking to improve, so feedback like this is valuable.

To be honest, that blog post was targeted at cargo-semver-checks users and r/rust readers, to give them a sense of how cargo-semver-checks is designed and why, with a motivating example of speeding up queries while supporting multiple rustdoc versions. It wasn't really meant to be "Trustfall's entrance on the world stage" even though it kind of ended up being that...

I plan to write more blog posts (and code!) about Trustfall's specific capabilities (and things it can't/shouldn't do) in the future, so hopefully those will come up first in people's searches and give folks the right impression.

Re: multiple adapters, yes, that's the plan. I have some prototype code for turning multiple adapters into a single adapter over the union of the datasets + any cross-dataset edges, and it supports the new optimizations API so the same kind of trick should work in the same way. In general, Trustfall is designed to be highly composable like this: you should be able to put Trustfall over any data source, including another instance of Trustfall, and have things keep working reasonably throughout.


That's cool! Does this API support index nested-loop joins and hash joins, or is it just for filtering?


Trustfall's Adapter API that data providers implement leaves the joining to the adapter implementation. It provides an iterator of rows and an edge (essentially a join predicate), and asks for an iterator of (row, iterator of rows to which it is joined) back.

The adapter is free to implement whatever join algorithm makes sense, together with any predicate pushdown, caching, batching, prefetching, or any other kind of optimization.

At a high level, Trustfall's job is to (1) perform any optimizations it can do by itself, so you don't have to do them by hand, and (2) empower the adapter implementation to add any other optimizations you wish to have, without getting in the way.
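A toy sketch of that contract in Python (hypothetical names, not Trustfall's real Rust API): the engine hands the adapter an iterator of rows plus an edge to resolve, and the adapter yields each row paired with an iterator of the rows it joins to. Here the adapter happens to choose a hash join internally.

```python
from collections import defaultdict

def resolve_edge_hash_join(rows, related, key, foreign_key):
    """For each row, yield (row, iterator of related rows joined on key)."""
    # The adapter's choice of algorithm: build a hash index over the
    # related side once, instead of re-scanning it per row the way a
    # naive nested-loop join would.
    by_key = defaultdict(list)
    for r in related:
        by_key[r[foreign_key]].append(r)
    for row in rows:
        yield row, iter(by_key.get(row[key], []))

users = [{"id": 1, "name": "ada"}, {"id": 2, "name": "bob"}]
orders = [{"user_id": 1, "total": 5}, {"user_id": 1, "total": 7}]

joined = {u["name"]: [o["total"] for o in os]
          for u, os in resolve_edge_hash_join(users, orders, "id", "user_id")}
assert joined == {"ada": [5, 7], "bob": []}
```

Swapping the body for a nested-loop or index-backed join wouldn't change the yielded shape at all, which is what keeps the engine and the query insulated from the optimization.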


I'm hosting an open-source S3-compatible store (SeaweedFS) on Hetzner, and I'm doing a direct download from Hetzner to the other cloud provider.

I'm not aware of speed-test sites that let you select the location (i.e., running a speed test from cloud provider A to Hetzner).

I'll look into what's going on by running some CLI speed tests on Hetzner.


Note: some providers require that the following tests/checks be handled by the provider's support center, and NOT by the customer/customer account.

iftop shows live per-connection bandwidth; traceroute / mtr can show where the slowdown is (it may not be because of Hetzner or the receiving cloud provider).

wget -O /dev/null 100mbsized.test.file

iperf and iperf3 allow active measurement of the maximum possible bandwidth, with support for tuning various parameters related to timing, buffers, and protocols. This lets you adjust sysctl.conf network settings as appropriate. Run the iperf/iperf3 server on Hetzner and the client on the other cloud provider.


Since no one seems to be commenting: this is very cool. Can you explain what you mean by TS-to-bytecode? Presumably you're not interpreting JS. So what are you interpreting, the types themselves? I guess this actually makes a lot of sense, since TypeScript's type system is its own programming language.


This is super nice! I tried to build something similar (PDF reader x Stack Overflow; as you read you can see highlights that link to Stack Overflow style posts): https://chimu.sh (site is broken right now), but it didn't work out.

The thesis was to make annotated textbooks, so if 100 people read a book then the 101st person could benefit from the knowledge of the previous 100 people.

I hope you guys consider the multiplayer aspect of note-taking!


Oh, one hundred percent! That's definitely part of our vision; couldn't have put it better. We already have a small pilot of that running with any documents you get from the Desklamp store: these have the multiplayer aspect in the form of public notes. The idea is that if you make useful notebooks, like chapter summaries or insights on a book, you can publish them for all readers of the book to see. We didn't promote that here because we don't have licensed content yet, but you can check it out on any of the consulting books.

