The math is quite difficult to do right, and there's a billion corner cases to m...

auggierose · on Oct 20, 2023

I've written some custom code for computational geometry (like computing the offset for a geometric object using Apple Metal). It was a lot of fun, but also quite hard, and a lot of edge cases I just didn't deal with because I had a particular use case and speed was paramount.

Maybe the idea of a "kernel" is the problem here. A kernel the size of a browser is not a kernel.

I think what's really needed is a full-blown integration with a theorem proving system (which has an easier to define kernel of its own).

pca006132 · on Oct 20, 2023

I doubt if it is possible to have a practical small kernel. Take mesh processing in https://github.com/elalish/manifold for example, we encountered a lot of problems when trying to deal with inexact floating point arithmetic. Using exact arithmetic can probably result in much simpler code (doesn't make sense in our case because the point of the library is about dealing with inexact arithmetic), but exact means slow. Also, a lot of code is about fast-paths that occurs frequently, data structures and algorithms that cut down complexity but are difficult to implement. Can we remove those fast-paths and complex algorithms? Probably yes, but this will slow things down a lot.

auggierose · on Oct 20, 2023

I completely agree. You want to be able to be fast and correct, where correct also depends on some assumptions you can make depending on what you apply your algorithm to. That's why I think a theorem prover kernel is needed, which connects the various algorithms, and the theory behind them that is needed to prove correctness. Until you get there, you will have various different algorithms that don't really fit together, and where it is not clear how to make them work together and still be confident about the result.

pca006132 · on Oct 20, 2023

While this would be nice, I don't think enough people working on these algorithms are trained to use formal theorem proving tools.

auggierose · on Oct 20, 2023

Yes, you are right. Time for better theorem proving tools! A theorem proving tool shouldn't add to your time spent designing and implementing, it should help you to get your stuff done faster and with (much) higher quality.

namibj · on Oct 21, 2023

I can't wait for Idris2 (or such? It's quite good for programming and having e.g. core data structure implementations verified to uphold their API invariants, without making you prove everything around it.) to get a good fine tuned LLM attached to replace the human in the interactive theorem proving steps/parts.

I'd expect this to be an LLM application that's unusually suited to automatic-feedback reinforcement learning.

Especially because much of the interactive proving task/process is quite straightforward nondeterministic (Turing) machine stuff: You try (with enough randomization/temperature to not get stuck in non-creative stupidity) a prove attempt/step/hint, and get feedback after a moment of calculation from the proof assistant. Then you try further, until eventually hopefully getting "success" as the assistant's feedback.

Once you got it to succeed once, or seeded with originally human-made step sequences to teach some basic sense into the model, you know an upper bound on the _required_ step count and assistant calculation time to prove the theorem at hand. Thus you can let the LLM auto-play/train with a step limit and computation timeout close to the known upper bound, rewarding for expected total runtime/effort to combine "spamming cheap proof tactics and hope something sticks" and "elaborate careful proof process likely to succeed but always expensive/(semi-exhaustive) to go through".

Perhaps even with a GPT-4 like multi-agent LLM to specialize into the various approaches and have some way of rating/predicting each agent's expected efficiency/cost each "chat message":

Turns out, interactive theorem proving is literally a (beyond-)NP heuristics-guiding sampler (traditionally a human with trained gut feeling based problem-solving brainstorming creativity) chatting with a non-creative algorithmic oracle. At the start, if initiated not by the "human", the oracle would info-dump the theorem and appropriate context along a description of what "today's" task is:

This may be: 1) "I expect the theorem to be False because: [reason for what caused the expectation]." 2) "I expect the theorem to be True because: [reason for what caused the expectation]."

3) "I expect a weaker (in it's implications, so less general) form of that theorem to be sufficient in the proof for this situation here. It could make proving them for data types we'd like to use with this implementation/specialization cheaper than the general API's demands, potentially (though not the reason for today's task!) allowing weaker data types to be used here than in the general API. In particular, there (maybe) are cheaper (to implement) yet weaker (in effects) variants of the functions we call on the supplied data type, that still suffice in our algorithm (like weaker demands on a comparison function when only stable sorting is used; dropping distinctiveness demands for either hash function or comparison function if only the other is relied on for Set/Map data structure efficiency). Confidence of it to be True is at X (confidence measure/score the LLM can feel natively) and here's what certain outcomes would be worth (either enumerated weaker forms or a scoring function (in a form the LLM agent comparer can utilize to plan what to aim for, when to pivot, and when to declare defeat))."

4) "I expect a weaker (in it's demands, so more general) form of that theorem to be sufficient in the proof for this situation here. It would allow us to use this implementation/specialization with more data types than the general API promises. Confidence of it to be True is at X (confidence measure/score the LLM can feel natively) and here's what certain outcomes would be worth (either enumerated weaker requirements for the theorem or a scoring function (in a form the LLM agent comparer can utilize to plan what to aim for, when to pivot, and when to declare defeat))."

5) "Find (and codify!) sufficient invariants demanded from (functions on) data types so we can uphold our API's promised invariants. [Optionally, aim for something that matches this natural language description of what (parts/aspects of) the data type's function's purpose[s]/semantics are supposed to be.]"

6) "Prove this is constant-time/constant-memory-access-pattern w.r.t. that part of data (even in ways where the data in question may be in chunks behind some memcpy-reducing indirection), or e.g. that this key here affects nothing persistent other than that ciphertext/plaintext/signature/hash/success-flag."

7) "Prove time/memory complexity of this implementation. Here's what certain bounds are worth: split between various kinda of bounds, e.g. lower bounds, upper bounds, average (w.r.t. that data) complexity, best/worst case (w.r.t. that data), handle those parameters by enumerating over those and proving with their values fixed (because a general equation may be too complicated/weak, or even just too hard to derive)."

8) A classic: "Prove these two implementations produce identical (under that comparison/test-sampling method) results."

9) "Find bounds on when (conditions) and/or how much (some appropriate supplied measure) these two implementations differ."

10) "Find a faster implementation along with proven limits on it's inaccuracy. Combining the two (+) dimensions of candidate quality from the obvious pareto frontier into a single number score is according to this: [formula in useful format for directing search/exploration]."

11) "Cough up an implementation limited to those numerical primitives there, along with proof of it complying with these accuracy requirements, for this implementation that uses (inherently computationally-unsuitable) real numbers. Speed/memory performance importance: [scoring function suitable for directing where to aim, when to stop, and when to give up]."

And afterwards, the "human" would ask/explore the oracle about context, suggest/try proof tactics, and in some cases write/transform code in both LLM-style and by commanding ("textbook"/library/archive, or even freshly written) rewrite rules.

That process could then be trained with reinforcement learning, even if intermediate states have no useful score function defined, as the presence of certain results after certain amounts of expended effort is directly useful as a score for/of the solver/agent itself. The multi-agent suggestions applicability/efficiency predictor/arbiter should be amenable to more normal (stochastic) backpropagation at a completed-chat granularity, as the final efficiency/score will be known, and if it were a perfect predictor, it'd have predicted that exact score for the entire time along the path that was taken. The intermediate predictions on how much effort the chosen agent's suggestion actually took to complete is also easily recorded for training the per-step cost predictions as a more fine-grained aspect of the final-score-when-taking-this-branching-path-now machinery.