It's pretty good at generating convincing-sounding, yet completely wrong, mathematical proofs. Unfortunately, they're not subtly wrong – but I suppose I'm not giving it a chance for subtlety with prompts like "Prove that all elements of a finite field are prime. (Show your working.)".[0]
> 1. Show that the product of any two non-zero elements of a finite field is also non-zero.
> 2. Let α be an element of a finite field. Prove that α is prime if and only if α-1 is non-zero.
> 3. Show that if α is a prime element of a finite field then α-1 is also prime.
> 4. Deduce that every element of a finite field is prime.
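To spell out why no correct proof can exist (a standard ring-theory fact, not part of the quoted output): a prime element must be a non-zero non-unit, but in a field every non-zero element is invertible, so a field has no prime elements at all. A minimal sketch:

```latex
% Why "every element of a finite field is prime" cannot be proven:
% prime elements are non-zero non-units, yet every non-zero element
% of a field is a unit (it has a multiplicative inverse).
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
Let $F$ be a field and $a \in F$.
If $a = 0$, then $a$ is not prime by definition.
If $a \neq 0$, then $a^{-1} \in F$ exists with $a \cdot a^{-1} = 1$,
so $a$ is a unit and hence not prime.
For example, in $\mathbb{F}_5$ we have $2 \cdot 3 = 1$, so $2$ is a unit,
not a prime.
\end{document}
```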
That's more or less what I would expect from the best language models: things that look very close to real but fail in some way a smart human can tell.
You need a "knowledge" model to regurgitate facts and an "inference" model to evaluate the probability that statements are correct.
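A minimal sketch of how such a split could be wired together, purely hypothetical (the two models below are stand-in Python functions, not any real API): one model drafts candidate answers, the other assigns each a probability of being correct, and only the best-scoring draft is returned.

```python
# Hypothetical "knowledge" + "inference" split.
# Both models are stand-ins: a real system would plug an LLM into
# generate_candidates and a trained verifier into estimate_correctness.
from typing import List, Tuple


def generate_candidates(prompt: str, n: int = 4) -> List[str]:
    """Stand-in 'knowledge' model: drafts n candidate answers."""
    return [f"{prompt} -- candidate answer {i}" for i in range(n)]


def estimate_correctness(statement: str) -> float:
    """Stand-in 'inference' model: a pseudo-probability in (0, 1] that the
    statement is correct (dummy length-based heuristic for illustration)."""
    return 1.0 / (1.0 + len(statement) / 100.0)


def answer(prompt: str) -> Tuple[str, float]:
    """Draft candidates, score each, and return the most trusted one."""
    candidates = generate_candidates(prompt)
    scored = [(c, estimate_correctness(c)) for c in candidates]
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    best, score = answer("State the order of the multiplicative group of GF(q).")
    print(f"{score:.2f}  {best}")
```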
Yes - that's the whole point of language models ... to model the language, and not the content.
Similar for image generation - the model aims for what looks acceptable (i.e. like other images), not what makes sense. It's amazing that they get such interesting results, but it's really we humans who interpolate the meaning into the images.
I disagree. If you prompt an image generation model with a prompt like "an astronaut riding a horse," you get a picture of an astronaut riding a horse. If you ask this model for a mathematical proof, it does not give you a mathematical proof.
For "an astronaut riding a horse" the system is filtering/selecting but nowhere does it understand (or claim to understand) horses or astronauts. It's giving you an image that "syntactically" agrees with other images that have been tagged horse/riding/astronaut.
The amazing bit is that we are happy to accept the image. Look closely at such images - they're always "wrong" in subtle but important ways, but we're happy to ignore that when we interpret the image.
I suspect the issue arises from the difference in specificity of the desired result. When we say "astronaut riding a horse" we may have preconceptions, but any astronaut riding any horse will likely be acceptable, while asking for a specific proof of a result in mathematics has only a very few, very specific solutions. Effectively it is like the concept in math where the area of even a large set of points is zero, while even a small polygon or region has nonzero area. Specific things like proofs are point-like knowledge, while the picture of an astronaut riding a horse is a surface.
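To make that analogy a bit more precise (a rough measure-theoretic sketch): a finite set of points has area zero, so a random dart essentially never hits it, while even a tiny region has positive area.

```latex
% Point-like vs. surface-like targets: finite point sets have
% Lebesgue measure zero, while any small square has positive measure.
\documentclass{article}
\usepackage{amsmath}
\begin{document}
For the planar Lebesgue measure $\lambda^2$ and any $\varepsilon > 0$:
\[
  \lambda^2\bigl(\{p_1, \dots, p_n\}\bigr) = 0,
  \qquad
  \lambda^2\bigl([0,\varepsilon]^2\bigr) = \varepsilon^2 > 0,
\]
so a point drawn uniformly from a bounded region containing both targets
hits the square with positive probability, but hits the finite point set
with probability $0$.
\end{document}
```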
The situation you describe is exactly the "Chinese room" argument. I don't want to get too far into the weeds here, but the DALLE / Stable Diffusion models are cool because they do what you ask, even if they do so imperfectly. This model from Facebook cannot accurately answer a single thing I've asked it.
I often hear the claim "AI does not really understand", but when you can ask it to draw an armchair in the shape of an avocado or an astronaut riding a horse on the Moon, and it does it (!!?), it's not like the "Chinese room" had any specific rules on the books for these questions. What more do people need to be convinced?
My art/design students employ AI to generate ideas.
Possibly this project can be used in a similar manner... as a way to start a brainstorming session or suchlike. In my experience of working with research engineers, I see the reluctance to playfully ideate as one of their weaknesses.
I get why you think that would be vastly more useful than it actually is.
Suppose you did the million monkeys with typewriters thing and asked not for Shakespeare but for any good book. The overwhelming majority of the output would be trash, but chances are you're going to find many good books before the exact works of Shakespeare in the order they were written.
Basically, the more specific the thing you want, the less likely you are to get it. So those million monkeys would type out a decent haiku relatively quickly, and you would get someone's correctly done tax return more quickly than your own tax return.
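A back-of-the-envelope version, assuming the monkeys emit uniformly random strings of length n over an alphabet of k symbols: a pool of m equally acceptable texts gets hit m times more often than one specific text, so the expected waiting time shrinks by a factor of m.

```latex
% Waiting time for "any acceptable text" vs. "one specific text",
% for uniformly random length-n strings over a k-symbol alphabet.
\documentclass{article}
\usepackage{amsmath}
\begin{document}
\[
  P(\text{one specific text}) = k^{-n},
  \qquad
  P(\text{any of } m \text{ acceptable texts}) = m\,k^{-n},
\]
so the expected number of attempts drops from $k^{n}$ to $k^{n}/m$.
``Any good book'' corresponds to an astronomically large $m$;
``your exact tax return'' corresponds to $m = 1$.
\end{document}
```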
A math paper that actually solves some problem is much closer to your specific tax return than to just anything. Unlike art, what's valid is very constrained, so tossing out essentially random stuff and seeing what sticks just isn't that helpful.
These language models would at least be drawing on random mathematical or scientific ideas, so they're better than the monkeys, but not by enough to be useful.
[0]: https://galactica.org/?max_new_tokens=400&prompt=Prove+that+...