aSanchezStern's comments

Uhh, endless recursion doesn't cause your typechecker to run indefinitely; all recursion is sort of "endless" from a type perspective, since the recursion only hits a base case based on values. The problem with non-well-founded recursion like `main = main` is that it prevents you from soundly using types as propositions, since you can trivially inhabit any type.

The infinite loop case is:

    loopType : Int -> Type
    loopType x = loopType x

    foo : List (loopType 3) -> Int
    foo _ = 42

    bar : List (loopType 4)
    bar = []

    baz : Int
    baz = foo bar

Determining if baz type-checks requires evaluating loopType 3 and loopType 4 to determine whether they're equal.

Given line "loopType : Int -> Type", how can line "loopType x = loopType x" mean anything useful? It should be rejected and ignored as a tautology, leaving loopType undefined or defined by default as a distinct unique value for each int.

What makes it ill-defined is that it computes infinitely -- that's why you need a totality checker (or a total language).
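
For contrast, here's what a totality checker is happy with, sketched in Lean 4 (the names here are mine, just for illustration):

    -- A structurally recursive type-level function: it terminates on every
    -- input, so the type checker can safely evaluate it while checking types.
    def vecType : Nat → Type
      | 0     => Unit
      | n + 1 => Int × vecType n

    -- By contrast, `def loopType (x : Int) : Type := loopType x` is rejected
    -- by the termination checker, which is exactly what rules out the
    -- infinite loop described above.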

Err, only if humans have to be Turing oracles to have that intuition, right? Plus, the halting problem only says that you can't have a procedure which will be 100% right on 100% of programs about whether or not they halt. Relax either of those 100%s, and it becomes possible again, just like how human intuition can be right about things 90% of the time, even if those things can't possibly be predicted 100% of the time.
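
To make the "relax one of the 100%s" point concrete, here's a toy halting heuristic in Python (every name here is made up for illustration); it drops the 100%-of-programs requirement by being allowed to answer "unknown":

    import ast

    def guess_halts(source: str) -> str:
        """Return 'halts', 'loops forever', or 'unknown' for a tiny Python snippet."""
        tree = ast.parse(source)
        for node in ast.walk(tree):
            if isinstance(node, ast.While):
                infinite = (isinstance(node.test, ast.Constant)
                            and node.test.value is True
                            and not any(isinstance(n, ast.Break) for n in ast.walk(node)))
                if infinite:
                    return "loops forever"   # `while True:` with no break
                return "unknown"             # a loop we don't try to reason about
            if isinstance(node, (ast.For, ast.Call, ast.FunctionDef,
                                 ast.Import, ast.ImportFrom)):
                return "unknown"             # anything that might hide a loop
        return "halts"                       # straight-line code always terminates

    print(guess_halts("x = 1 + 2"))                  # halts
    print(guess_halts("while True:\n    pass"))      # loops forever
    print(guess_halts("while x > 0:\n    x -= 1"))   # unknown

It's right whenever it commits to an answer, it just declines to answer on most programs; the theorem only forbids a checker that never declines and is never wrong.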


Whether people are Turing machines is an open question...

> Plus, the halting problem only says that you can't have a procedure which will be 100% right on 100% of programs about whether or not they halt.

You're missing it still...

"Given a computer program and an input, will the program terminate or will it run forever?"

We both know that the Collatz conjecture likely won't ever halt. We understand that. WE are reasonable oracles. Now write me an oracle that can detect this. Don't feed it a list of problems to avoid; write a program that detects these types of problems and skips them... Hint, you likely can't. You could simulate one, but that would be cheating...

A real AI needs this, or needs to be able to do it, or it risks working on the Collatz conjecture till the heat death of the universe, or till it consumes the universe in a paperclip-style problem...

The implication is that real AI might not come out of a Turing machine ever...


The implication of the article isn't that you can't produce a Turing oracle, it's that it's mathematically impossible to ever replace programmers with computers at all.

The latter claim doesn't follow: programmers aren't Turing oracles either. A programming machine that can emulate whatever humans are doing when they program would be able to program too.


FWIW the Collatz conjecture actually asserts the exact opposite, that the sequence always terminates.
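
Concretely, the conjecture is about whether this loop terminates for every positive starting value (a minimal sketch in Python; nothing here is from the thread):

    def collatz_steps(n: int) -> int:
        # The Collatz conjecture is exactly the claim that this loop reaches 1
        # (and so the function returns) for every positive integer n. That is
        # open, so nobody can currently prove this function total.
        steps = 0
        while n != 1:
            n = 3 * n + 1 if n % 2 else n // 2
            steps += 1
        return steps

    print(collatz_steps(27))   # 111 steps for n = 27, but no proof for all n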


> We both know that the Collatz conjecture likely won't ever halt.

Speak for yourself. I know that the Collatz Conjecture is an unsolved problem and so wouldn't make any strong statements about it.


The claim that "there can be no explainability of such models" is also completely unsupported. We know that some simple networks can be explained, and explanation is a human notion, not a formal one, so we can't know how much explainability is possible. There's active research in explainability of neural networks, and progress is being made. I don't know how far it'll go, but it hasn't hit any sort of theoretical boundary yet. I think the author is making a similar sort of mistake to the one they make in invoking the halting problem, confusing a "not forall" for a "forall not". There are certainly neural network models that can't be explained ("not every model can be explained" is true), but that doesn't mean that there aren't models that can be ("no model can be explained" is not true).

As part of the argument on explainability, the author says "we have known for some time these models cannot represent or model symbolic structures". There might be some particular weird way of defining things where this is true, but it's certainly not true in the most basic reading. In fact, we have very strong theoretical results that a sufficiently wide neural network can approximate any function arbitrarily well, including those that are "symbolic". Whether they can be trained using gradient optimization to compute every function is an open problem, but the idea that there are functions they can't represent is provably false.
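
As a tiny concrete instance, a two-layer ReLU network with hand-picked weights computes XOR, a "symbolic" boolean function, exactly. This only shows representability, not that training would find these weights (a sketch in Python/numpy):

    import numpy as np

    W1 = np.array([[1.0, 1.0],
                   [1.0, 1.0]])      # both hidden units sum the inputs
    b1 = np.array([0.0, -1.0])       # second unit fires only when both inputs are 1
    W2 = np.array([1.0, -2.0])       # XOR(a, b) = relu(a+b) - 2*relu(a+b-1)

    def xor_net(a, b):
        x = np.array([a, b], dtype=float)
        hidden = np.maximum(W1 @ x + b1, 0.0)   # ReLU layer
        return float(W2 @ hidden)

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, xor_net(a, b))   # prints 0, 1, 1, 0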

The third point is that LLMs aren't a step towards AGI. They again make the claim that there are functions a DNN can't compute (or approximate arbitrarily well); see https://en.wikipedia.org/wiki/Universal_approximation_theore... for the theorem showing this isn't true.

The rest of this point, and the fourth point, are basically just about how current LLMs are actually pretty dumb in a lot of situations. This is the only argument that's actually compelling to me; there's a lot of hype around LLMs and their abilities, but a lot of that might have to do with our brains being happy to paper over flaws and anthropomorphize things, and with the fact that we haven't yet learned what to look for in artificial output, the way we spent decades doing for other machine outputs before LLMs. Recall that when dumb pattern-matching conversation bots were invented in the 60s (https://en.wikipedia.org/wiki/ELIZA), people thought they couldn't possibly be artificial and must really be human, even though they are obviously artificial by today's standards.

So, I agree that we don't know if LLMs are the first step towards AGI, and they probably aren't, in any sense beyond the fact that inventions tend to build on each other. But we don't have enough information to say definitively that they aren't that first step.


This is a particular philosophical conjecture, not a proven scientific fact. We don't understand enough about the human brain to prove whether it is fundamentally different from a very complex computer.


My intuition is that we probably should assume they're different until proven otherwise, but also that brains are probably not Turing oracles either.


Uhhhh, I'm pretty sure Lean already allows for formal verification of code. In the metatheory that tools like Lean and Coq operate in, proofs and programs are very intertwined; it's not really possible to build a proving system in this metatheory without also allowing for program verification. Maybe you're talking about gaining support for formal verification of code not written in Lean? Like, being able to verify C? In that case, it's actually just a library issue. Nothing needs to be changed about the core Lean language; someone just needs to formalize the semantics of the target language in Lean, and then add a verified compiler and verified extraction if you want the guarantees to extend directly to the compiled assembly.

Fuzzers have actually combined well with theorem proving in these kinds of tools in the past: Coq, the proof assistant that heavily inspired Lean and has been in constant use since the mid '80s, has a library QuickChick (based on QuickCheck from the Haskell world) which takes an arbitrary property and generates test inputs to test it, and has very mature support for extending it with new datatypes and invariants.
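
The core QuickChick/QuickCheck idea is simple enough to sketch outside Coq. Here's a rough hand-rolled analogue in Python (not QuickChick's actual API, just the concept of generating random inputs and hunting for a counterexample to a stated property):

    import random

    def check_property(prop, gen, trials=1000):
        """Run `prop` on `trials` random inputs from `gen`; report any counterexample."""
        for _ in range(trials):
            x = gen()
            if not prop(x):
                return f"counterexample: {x!r}"
        return f"passed {trials} trials (not a proof!)"

    gen_list = lambda: [random.randint(-100, 100)
                        for _ in range(random.randint(0, 20))]

    # True property: reversing a list twice is the identity.
    print(check_property(lambda xs: list(reversed(list(reversed(xs)))) == xs, gen_list))

    # False property: reversing once is the identity. The tester finds a counterexample.
    print(check_property(lambda xs: list(reversed(xs)) == xs, gen_list))

QuickChick adds a lot on top of this (typeclass-driven generators, shrinking, integration with Coq proofs), but the property-plus-generator shape is the same.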


Lean allows formal verification, but it does not make it easy. As staunton remarked, the devs are focusing on classical mathematics, and the ergonomics for verifying code are currently horrible. I do suspect this will change with time, though.


Wow, didn't expect this tool to be on Hacker News six years after publication! Herbgrind author here, ask me any questions you like! Also, if you're interested in this stuff, the Herbie project for numerical program synthesis (which I also worked on) is really helpful for writing numerical code, and has had a lot of development over the years.


What do you think programming languages/libraries/tools could do to make floats less scary for the common people? I feel there is a lot of superstition around floats, some of it more deserved than the rest, but it leads people to approach fp with great distrust.

Of course Herbie is already a good step here, but it is still somewhat niche.


> to make floats less scary for the common people?

The obvious thing would be making sure that it's easy to get an unambiguous, natural representation of what a float value actually is. (Similar to what C printf %a specifiers produce, rather than something like Ryu float-to-decimal, which is useful for showing what a float value means in human-readable terms.)

   0X1.999999999999AP-4
  +0X1.999999999999AP-3
  =0X1.3333333333334P-2
  ≠0X1.3333333333333P-2
is ugly (and could be improved[0]), but far more understandable (and less superstition-inspiring) than:

  0.1 + 0.2 = 0.30000000000000004 ≠ 0.3
0: At the very least, please make it easy to get "0xA.BCDEFp+1", with capital "ABCDEF" and lowercase "xp", so the digits look like digits and the punctuation looks like punctuation.
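
For what it's worth, Python already exposes this representation via float.hex() (all lowercase, so it doesn't satisfy the capitalization wish in [0], but it is built in and round-trips exactly):

    >>> (0.1).hex()
    '0x1.999999999999ap-4'
    >>> (0.2).hex()
    '0x1.999999999999ap-3'
    >>> (0.1 + 0.2).hex()
    '0x1.3333333333334p-2'
    >>> (0.3).hex()
    '0x1.3333333333333p-2'
    >>> float.fromhex('0x1.3333333333334p-2') == 0.1 + 0.2
    True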


I agree, although strictly speaking I think conventional decimal representation should be unambiguous, i.e. there is always one float that is closest to the decimal value


Strictly speaking, unambiguous decimal representation - such as Ryu and its less-efficient predecessors - is unconventional, but I agree that it's correct and should be the default. (That's the "0.1 + 0.2 = 0.30000000000000004 ≠ 0.3" in my comment.) My point is that it should also be easy to get a natural representation that corresponds closely to the actual bits of the float, in roughly the same way that 0xAAAB corresponds closely to the actual bits of (int16_t)-21845, and so gives you a chance to notice that something weird is going on when, for example, you encounter a multiplication by -21845. (Something like hex(*(uint64_t*)&x) can work, but makes printf %A look beautifully readable by comparison. Also, most languages - especially interpreted ones - manage to be inferior to C in this respect.)
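
For illustration, the hex(*(uint64_t*)&x) trick mentioned above can be approximated in Python with the standard struct module (a sketch; not claiming this is ergonomic, which is rather the point):

    import struct

    def float_bits(x: float) -> str:
        # Reinterpret the IEEE-754 double as an unsigned 64-bit integer.
        (bits,) = struct.unpack("<Q", struct.pack("<d", x))
        return hex(bits)

    print(float_bits(0.1))   # 0x3fb999999999999a
    print(float_bits(0.3))   # 0x3fd3333333333333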


Huh, good question. I think having a herbie-style error visualization graph built into an IDE would go a long way. In practice I think dynamic sampling is pretty good at capturing error behavior, and immediate visual feedback on the floating point code you're writing would make it a lot easier to trust the working code you have, and understand the problems with the error-prone computations. Something like https://github.com/herbie-fp/odyssey, but analyzing the fp expressions in the code files you're editing.


Is it feasible to run this on something large like Solvespace[1] (CAD) which is ~5MB executable? Or would we just get an insanely long list of issues?

[1] https://solvespace.com/index.pl

There are hundreds of numerical algorithms in there, and we have some bugs that might be related to this kind of implementation error.


It should be possible. Herbgrind is designed under the philosophy that it's not bad numerics themselves that are significant, it's how they affect the outputs of your program. So the reports are organized around program outputs (either through print calls, or values you annotate with macros as ones you care about), and errors are reported in terms of how they affect your outputs. That should let you skip numerical errors that are insignificant in the final output, and prioritize the ones that most affect the outputs you care about.

But Herbgrind is a heavyweight dynamic binary analysis using the Valgrind framework, so the slowdown it imposes on the runtime might be prohibitive for debugging some programs which take a while to compute their results.
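
To give a concrete (if toy) picture of the class of error involved, here's the classic quadratic-formula cancellation example in plain Python; this is not Herbgrind output, just an illustration of the kind of error that silently propagates into an output, and the constants are made up to trigger the problem:

    import math

    a, b, c = 1.0, 1e8, 1.0
    disc = math.sqrt(b * b - 4 * a * c)

    naive = (-b + disc) / (2 * a)    # catastrophic cancellation in -b + disc
    stable = (2 * c) / (-b - disc)   # algebraically equal, no cancellation

    print(naive)    # ~ -7.45e-09: about 25% off from the true root
    print(stable)   # ~ -1.00e-08: essentially the true root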


I've tried out HerbGrind but it often gets stuck and reports an error E_STONED.


I believe that's a PEBCAK error


Sounds like you've stumbled into the wonderful world of machine-learning guided proof synthesis! While I don't think the full system you're describing has been built yet, many similar systems and pieces have.

In terms of the initial phase of supervised learning on existing proofs to prove new ones, there's TacticToe (https://arxiv.org/abs/1804.00596), Tactician (https://arxiv.org/pdf/2008.00120.pdf), CoqGym/ASTactic (https://arxiv.org/abs/1905.09381), Proverbot9001 (https://arxiv.org/abs/1907.07794), and Diva (https://dl.acm.org/doi/10.1145/3510003.3510138#sec-terms), among others. Most of these have some sort of language model within them, but if you're specifically looking for the LLMs that have been big recently, there's GPT-f (https://arxiv.org/abs/2009.03393), Baldur (https://arxiv.org/abs/2303.04910), and COPRA (https://arxiv.org/abs/2310.04353), though currently these models don't seem as effective as the specialized non-LLM tools.

In terms of using reinforcement learning to learn beyond human-written proofs, there's TacticZero (https://openreview.net/forum?id=edmYVRkYZv), this paper from OpenAI (https://arxiv.org/pdf/2202.01344.pdf), rlCoP (https://arxiv.org/abs/1805.07563), the HOList line of work (https://arxiv.org/pdf/1905.10006.pdf), and HyperTree Proof Search (https://arxiv.org/abs/2205.11491), as well as some in-progress work I'm doing with a team at the University of Massachusetts.


Wow, that's a great overview. While most of it goes over my head, it is interesting that quite a lot of research has been done in this direction.

The system I outlined above was inspired by this intriguing paper:

https://arxiv.org/abs/2207.14502

It is superficially about a different topic, namely generating and solving "programming puzzles", basically a type of unit test, which are then verified by the Python interpreter. This system seems quite analogous to one where the programming puzzles are replaced by conjectures in a language like Lean, the proposed puzzle solutions are instead proposed proofs, and the Python interpreter is replaced by the proof assistant.

I just discovered that there seems to be at least one guy working on applying a similar system to formal mathematics: https://www.cs.cmu.edu/~jlaurent/pdf/reports/thesis-proposal... He actually cites the paper above.


Additionally, formal verification usability is an area of constant research, and the set of software for which it is the best ROI increases over time.


Not directly, but COPRA, which they compare to (and prove about 3% more theorems than), is based on GPT-4 using an agent framework. And in the COPRA paper, they compare to using GPT-4 directly as a one-shot, and find it can only get 10% of theorems in miniF2F, as opposed to 23% with their agent-based approach. So if the evals line up correctly, we would see GPT-4 one-shot proving 10.6% of theorems in miniF2F, as opposed to Llemma-7b proving 26.23%, a pretty significant improvement. Still not as good as other specialized tools though, especially when you look at tools in other theorem-proving languages (see my other comment for more detail about a cross-language comparison).


Note that this still doesn't seem as good at formal theorem proving as some of the specialized prover models that aren't LLM-based. In particular, they show a 3% increase in proofs solved over COPRA on the MiniF2F Lean dataset, but in COPRA's own paper they show that they prove about 18.5% fewer theorems than Proverbot9001 on the CompCert dataset (their pitch is that they are faster, but not better at proving, than other tools). Assuming COPRA isn't just way better at Lean than Coq (there should be some amount of difference, but given that it's the same algorithm run on a uniform interface over Coq and Lean, it's unlikely to be large), Llemma should be proving 10-15% fewer proofs than Proverbot9001's algorithm. There's a bit of trickiness to trying to do this kind of cross-benchmark comparison, and COPRA picks a bit of a weird dataset to compare to Proverbot9001 on to save costs, but they also use a slightly older version which has worse results. So it probably still washes out to the best LLM-based provers being about 10-15% worse at proving theorems than the best specialized models.

EDIT: To be clear, that's 10-15% of the total theorems in a test set, not a relative 10-15% improvement. Given that solve rates for tools are in the 10-30% range, that's much more than a relative 10-15% improvement.


I don't think the right way to think about its utility is as a replacement. I see it as an NNUE-for-Stockfish kind of augmentation, where a small neural network supercharges search. A small network, because no LLM, not even GPT-4, is good enough that the time lost evaluating with it is won back by a disproportionate reduction in search.

Other uses: better autocomplete from comments for Coq and Lean VSCode environments than generalist tools provide.

Translation from a natural-language sketch to a formal proof. This is different from autocomplete in that it should generate long attempts as an automated spitballer; leave it running a long time to see if it finds anything when you're stuck. This works for formal and informal proofs, but the latter gets no feedback (this is imagining future fine-tunes able to make better use of interactive prover feedback).

Translating formal proofs to natural language.

Combine it with RAG and have it pull in and summarize references from your personal paper collection and the internet. Particularly useful in place of search when you don't have the vocabulary for a concept. As a basis for non-code based autocomplete of mathematical work.

I see unbounded potential for this combination. And it's a natural setting where, between the cost of search and the symbolic prover doing the heavy lifting, it's one of those areas that naturally lends itself to specialist open models over API models.


Oh, I think you might misunderstand what I'm comparing it to. The other tools, like Proverbot9001, are exactly the NNUE scenario you describe, where a small neural network guides a search procedure to find proofs; they are more effective at finding formal proofs than Llemma. For other tasks, like non-formal proof generation, Llemma has novel results as far as I know; it's just in terms of producing formal proofs that it currently seems to lag behind the state of the art.


Ah, you're right. That makes sense. The autocomplete, informal-proof, translation/autoformalization, reference-search, and interactive-feedback-assistant use cases do seem promising, though.


When do you think we'll have something where we can say "verify this proof of the ABC conjecture" and it would check the proof?


If the proof is written in a formal language, we have that now! Even several directly competing efforts (for better and worse).

https://www.cs.ru.nl/~freek/100/

Some people greatly hope that fully formal proofs become a routine part of math research and communication in the future.
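
For a sense of what "written in a formal language" means, here is about the smallest possible machine-checked proof, in Lean 4 syntax:

    -- The checker accepts this by computation; no human review of the
    -- proof step is needed.
    example : 2 + 2 = 4 := rfl

Proofs on the list linked above are the same thing scaled up to real theorems, often to tens of thousands of lines.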


