How much of this is just "AI is bad at everything", but in the math case, it's easier for the lay person to tell?
It's all just passable garbled nonsense that the reader goes to great lengths to interpret based on their prior knowledge, which is not expressed in the syntax of what these systems output.
In the case of mathematics, we're far less willing to "BS away" the interpretive failures. But if we were equally demanding, then likewise, none of the prose generated by these systems is AI "getting" anything either.
Pass a film reel through a shredder and an art student would still call it a film. Pass math through and a mathematician won't. This says more about our ability and inclination to make sense out of nonsense when in apparently communicative situations (since, when speaking to a person, this actually improves our mutual understanding).
So, how much of AI is just hacking people's cognitive failures: (1) people's willingness to attribute intention; (2) people's willingness to impart sense "at all costs" to apparent communication; and (3) "hopium"?
Have you ever used Github CoPilot? It does a lot of useful work, automating away rote typing in programming. Have you tried Dall-E or Stable Diffusion? They make good looking images. This comment seems completely unmoored from where the state of the art is right now.
Math works in a completely different way from how machine-learning AIs do their thing.
Reason derives its strength from having a few primitives and creating new assertions through the transformation of symbols by following precise rules (which is how algorithms work).
In ML-based AIs, everything is imprecise and probabilistic, and this kind of generation gets its strength from building recognizable output from utterly imprecise inputs and training - quite the opposite of how logic and reason evolve. Now, "classic" AI was a powerful way to derive new knowledge, and automatic theorem proving is a strong discipline; but the recent breakthroughs in AI are not directly applicable to classic techniques.
Do you know what machine-learning AIs could be good for? Generating "insight" in problem solvers for guiding theorem provers through the proof search space, trying to find the best sub-spaces to explore. If there's a way to create human-like general AI, it will likely combine both kinds of generation - the rational methods of symbolic logic and the "irrational" statistical methods of ML.
For a type theory we might want to reason about, there’s a diagram (in category theory) which represents the same semantic content. These diagrams turn out to have recurring and common structures.
You can represent those diagrams as adjacency matrices, where those structures have a particular "shape" in the entries. Which, if you squint, looks like an image-completion problem, i.e., finding the missing part of the matrix which represents a proof.
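A toy sketch of that framing, with made-up data (nothing here comes from the comment beyond the adjacency-matrix idea): encode a small diagram as a 0/1 matrix, hide some entries, and treat filling them in as a completion problem analogous to image inpainting.

```python
import numpy as np

# Toy "diagram": 4 objects; a 1 at (i, j) means there is an arrow i -> j.
adjacency = np.array([
    [0, 1, 1, 0],   # object 0 has arrows to objects 1 and 2
    [0, 0, 0, 1],   # object 1 has an arrow to object 3
    [0, 0, 0, 1],   # object 2 has an arrow to object 3
    [0, 0, 0, 0],
])

# Hide the column of arrows into object 3: these are the "missing pixels"
# a trained completion model would be asked to predict.
observed = adjacency.astype(float)
observed[:, 3] = np.nan

print(observed)
# A completion model would fill the NaN entries; here the ground truth
# is the commuting-square structure encoded above.
```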
Who's using that topos theory, and is it well known in automated theorem proving?
I hadn't heard about it before, although I have some notions of both category theory and theorem proving.
And is your description of “complete and label the diagram” using image generation a thing that actually exists or something that has potential to be created? That could be a breakthrough in applying formal methods to real-world problems.
Topos theory is used by people researching foundations, eg Michael Shulman.
> And is your description of “complete and label the diagram” using image generation a thing that actually exists or something that has potential to be created?
Somewhere between — it’s a topic being researched, but results are very early (basically, just shapes and groups).
7 kittens, none well defined. The loss function it is optimising taps out once it has got "close", and "close" for a multiple-object prompt is _a lot_ further away than for a single-subject prompt (here is the 1-kitten version https://labs.openai.com/s/1aCOUxNT19kbMZZtEBG7CoFY - it is so good it's basically witchcraft; the group shot is a joke).
I suppose you could massively reduce the loss amount you are willing to accept, but that doesn't guarantee Dall-E will optimise the correct part of the picture - maybe I'd have just ended up with really, really good floors.
The other thing Dall-E is bad at is backgrounds, and once again this is due to "optimising an error score". https://labs.openai.com/s/U1Vo2fxThuXmQZzIwLQ4g9Ai nothing about this is right. At a superficial glance it looks like the view over a city, but it's a random splattering of building cutouts, and when you look at the detail of the buildings they are a blur of pixels that kind of approximate doors and windows but are nothing of the sort. They are super fuzzy and dream-like. Because it's trying to generate an image that looks like a cityscape from its memory of Glasgow cityscapes. There's no coherence because it's trying to convert random pixels into a cityscape, not form buildings out of the components that humans know go into making up buildings.
I'm not an expert on AI, but your complaints sound like minor versions of the major problems that these image-generation AIs had a couple of years ago. It used to be that they could only create a mishmash of textures reminiscent of the subject and style, and struggled to create distinct objects at all.
Now, your examples simply show some slight artifacts and lack of details on specific things. You're presenting remaining shortcomings on these metrics as "fundamental to how it works not a flaw that can be iterated away", when in fact they have mostly been iterated away over the past few years.
I haven't used copilot because I'm not sure I'm allowed, but I'll try it on a personal project eventually.
I'm hoping it's not as bad as Dall-E and Stable Diffusion - I've tried to use those to generate some generic product-looking stock photos for a demo and they're spectacularly bad. The only context in which I see them praised is fantasy-style art - and that is visually appealing nonsense by definition.
If the code generated by copilot has the same "looks convincing but is fundamentally flawed" quality then it sounds like an insidious bug generator.
Sure, but Copilot is mostly just copying code (see, for example, the issue with it producing Quake source code).
If you think of AI as a dial from sample(data) to mean(data), then as the dial is turned towards the mean() you get more "generic" results, but also more garbled ones.
Copilot is more like a search engine, having turned the dial more towards sample().
The real invention of the NN is simply to provide that dial in a trainable way.
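A toy numeric illustration of that dial (my own reading of the metaphor, not anything from an actual model): interpolate between a verbatim sample from the data and the dataset mean, and outputs get more generic as the dial moves toward mean().

```python
import random

def generate(data, dial):
    """dial = 0.0 -> return a verbatim sample (search-engine-like);
    dial = 1.0 -> return the dataset mean (generic and 'garbled')."""
    sample = random.choice(data)
    mean = sum(data) / len(data)
    return (1 - dial) * sample + dial * mean

points = [1.0, 2.0, 4.0, 8.0, 16.0]
for dial in (0.0, 0.5, 1.0):
    print(dial, generate(points, dial))
```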
The only change to the "state of the art" is the size of the weights, and how long they take to train. This "advancement" is no more impressive than google indexing more webpages.
There has been no step-change advancement in AI in, perhaps, 50 years. All we see today is a product of hardware, in GPU/CPUs able to compress TBs of data into c. 300GB of weights. And likewise, the internet to provide it and SSDs to hold it.
The "magic" of AI is no more the magic of wikipida, here: copilot is good only because million+ programmers made github good.
> It's all just passable garbled nonsense that the reader goes to great lengths to interpret based on their prior knowledge, which is not expressed in the syntax of what these systems output.
> It's still little more than a fancy search.
I feel like the goalposts have been moved between your two comments. CoPilot is obviously not producing garbled nonsense, and it's also not just printing the top result from StackOverflow. It is producing code that references my variables, does the right thing 50% of the time, and usually compiles.
One of the nice little things is error messages: when I type `if (!foo) { throw ...` CoPilot is able to complete a nicely formatted and descriptive error message from its understanding of my code. It's not garbled nonsense, and it's not just a search engine.
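For illustration, the kind of completion being described looks something like this (a hypothetical example in Python rather than the commenter's actual code; only the guard clause is typed, the message is the suggestion):

```python
import json

def load_config(path: str) -> dict:
    with open(path) as f:
        config = json.load(f)
    if not config.get("api_key"):
        # A Copilot-style suggestion tends to name the missing field and
        # the file it came from, without being told to:
        raise ValueError(f"Missing 'api_key' in config file: {path}")
    return config
```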
Does AI deserve the hype it sometimes gets? Not yet. But I think you're going to have to start digging a little deeper for your commentary.
Even if AI got to the point of perfectly passing every expert-level Turing test your degree of rigor as to what "thinking" is would never truly permit any belief of AI having struck the golden nugget of intelligence.
Imagine if we were all self-replicating computers, and certain members of this silicon race began experimenting with making creatures with carbon macro-molecules to create organic intelligence, you could make the same claim in the other direction:
"There has been no step-change advancement in Organic Intelligence in, perhaps, 50 years. All we see today is a product of cell count, in neurotransmitter chemistry able to compress TBs of experiences into c. 300B neurons."
I think you are missing the conditional, contextual nature of language models. They mix things in coherent ways, they adapt to the request. Google doesn't create new things when they don't exist, and the pre-written code examples on the internet will never adapt to your needs.
But I agree with you that everything they do seems intelligent because 'intelligence' was in the training data. Not much different from us: if you raise a human removed from society (take their intelligence training data away), they will accomplish almost nothing on their own.
I agree. It's possible to point out the clear limitations of current AI without being oblivious to the huge, indisputable advances that have occurred.
People thought it might take centuries for a computer to defeat a top human in Go. Then deep learning showed up and a few years later it's the opposite.
A lot of the things deep learning methods are doing now are things no one had any idea how long research would take to achieve, or if they were even possible.
Personally, I think we are currently hitting some walls that might take a while to climb before we get to AGI, but I am very impressed at the recent progress.
> How much of this is just "AI is bad at everything"
"AI Language Models" are not touted as some general AI that is smart at everything, like a clever person with multiple intellectual skills integrated into one.
AI language models are for modeling language, not for math problem solving, or anything else. People good at language aren't always good at math.
DeepL produces very good, correct translations for "Alice has five more balls than Bob, who has two balls after he gives four to Charlie. How many balls does Alice have?" into numerous languages, even though it doesn't offer a solution.
I have little doubt that an AI system could be trained to translate word problems like this into systems of equations, which could be dumped into some decades-old CAS to obtain a solution, which the AI could map back into the verbal domain through the identities between the math variables like x and Alice's balls.
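As a minimal sketch of the back half of that pipeline, assuming the translation step has already produced the equations (SymPy standing in for the "decades-old CAS"):

```python
from sympy import Eq, solve, symbols

# Equations a translation step might emit for the word problem above:
# Bob ends up with 2 balls after giving 4 away; Alice has 5 more than Bob.
alice, bob = symbols("alice bob")
equations = [
    Eq(bob, 2),           # "who has two balls after he gives four to Charlie"
    Eq(alice, bob + 5),   # "Alice has five more balls than Bob"
]

solution = solve(equations, [alice, bob])
print(f"Alice has {solution[alice]} balls.")   # Alice has 7 balls.
```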
"Hey look, that human who is supposedly good at math can't produce a painting of the Grand Canyon in the style of Monet, even if given eight months to do it, and is easily defeated in chess."
> How much of this is just "AI is bad at everything", but in the math case, it's easier for the lay person to tell
Honestly, even as someone generally pretty dismissive of the AI hype, I'm not sure you can go that far. The whole reason we have specific mathematical notation is that human languages often are not super great at dealing with it, and English in particular is pretty abysmal for being both unambiguous and precise (and I'd be surprised if language models didn't end up suffering from biases analogous to how many image recognition AI models have been found to not deal well with a diverse set of human appearances). We don't teach math the same way we teach English, and we certainly don't expect people to be experts at teaching both, so why would we expect an AI model designed for language to be able to do math?
Because there is an algorithm for it. Convert the strings into floating-point numbers, add them, convert them back to strings. It's a LeetCode-medium question. It should be learnable.
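A sketch of both versions for integer strings: the naive parse-add-format one described above, and the digit-by-digit one the LeetCode problem actually wants (which also sidesteps float precision):

```python
def add_naive(a: str, b: str) -> str:
    # The version described above: parse as floats, add, format.
    # (Precision breaks down once the numbers exceed float precision.)
    return f"{float(a) + float(b):g}"

def add_digit_by_digit(a: str, b: str) -> str:
    # LeetCode-style: walk both strings from the right, carrying as you go.
    i, j, carry, out = len(a) - 1, len(b) - 1, 0, []
    while i >= 0 or j >= 0 or carry:
        total = carry
        if i >= 0:
            total += ord(a[i]) - ord("0")
            i -= 1
        if j >= 0:
            total += ord(b[j]) - ord("0")
            j -= 1
        out.append(str(total % 10))
        carry = total // 10
    return "".join(reversed(out))

print(add_naive("999", "27"), add_digit_by_digit("999", "27"))   # 1026 1026
```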
The article talks about abstract math questions, but even arithmetic is hard for language models.
Language models aren't built for math. Their improvement/training cycles aren't sensitive to the exactness and rule-based nature of mathematical language, plus there are probably a lot of bad/misleading examples of math in the source data.
You'd have to be unrealistically pessimistic to call what GPT-3 and other huge language models produce "nonsense".
It's not that they were not built for math, but more like verification is hard. But it's hard for humans as well. A large generative model + a fast verifier could do wonders.
AlphaGo was built on that - the model can propose moves, but you can verify who won in the end. There are some code generation models that write their own tests as well, or use externally provided tests to verify their solutions. The DeepMind matrix multiplication algorithm was also "learning from verification" of generated solutions, because it's trivial to do that. In general verification remains an open problem.
I disagree. It is that they were not built for math. While brain analogies are shittier than most people assume, this is like trying to do math in your head without being allowed to think through calculations.
Brains weren't built for math either, just for surviving. And "trying to do math in your head" is true if you use naive question answering, but asking for "step by step", "chain of thought", or supporting questions allows for a flexible number of reasoning steps. There are some solutions called "Language Model Cascades" that compose language model calls to simulate arbitrarily complex reasoning chains, including recursion. There is no reason to think language models are unfit for math; they are fit for generating possible solutions that need to be verified somehow.
> Brains weren't built for math either, just for surviving.
Brains were, however, built for language processing, in addition to many other tasks.
> There is no reason to think language models are unfit for math, they are fit for generating possible solutions that need to be verified somehow.
This is just a dumb idea though. Guessing and checking based on semantically well-positioned answers in the ambiguity of the embedding space until you find something that's not wrong is not the same thing as defining an algorithm and then executing it, which is how people do math.
Sure, you could probably get it to a pretty good working state, but it seems pretty dumb to me.
> There are some solutions called "Language Model Cascades" that compose language model calls to simulate arbitrarily complex reasoning chains, including recursion.
If you're creating the reasoning chains yourself, you're arguably doing the hard part for the model and giving credit to the language part. If you're able to get the model to define the chains, then you've already solved the hard part of the problem and could likely use something very different from language models altogether to greater effect.
True, but that's how "inspiration" works in humans as well: generate stupid ideas until you stumble upon a great one.
That's why I said we only need verification. It's the artist-critic model, we got the artist we need the critic. Sometimes it's easy (in games, code, math) and other times we don't have a good way to verify.
> True, but that's how "inspiration" works in humans as well: generate stupid ideas until you stumble upon a great one.
I don't agree with this, but even if it were true, "inspiration" is not how we do math.
> That's why I said we only need verification. It's the artist-critic model, we got the artist we need the critic. Sometimes it's easy (in games, code, math) and other times we don't have a good way to verify.
That's not even how the typical language model works.
I think if we replaced "AI" with "taking averages over subsets of historical examples", then there'd be no mystery about when "AI" will be good or bad at anything.
Would we expect a discrete melodic structure to be expressible as averages of prior music? No.
Pretty sure the first continuation is a famous piece with a few notes messed up. Can't remember the name. Honestly it only sounds marginally better than the old markov chain continuations.
Isn’t that as good as it gets? The whole point of the continuations is that given a short leading prompt from a real piece that it should continue it realistically.
It didn’t get to train on the test set, if that’s what you’re implying, and I find it hard to believe the assertion that continuations are copies of the train set (if that’s your claim).
Wow, good find! They definitely sound similar but it’s not a facsimile. I wonder if this holds for the other samples.
I guess in retrospect we asked it to continue the music in a likely way, not be novel. And it definitely convinced me enough to be impressive. An NN that composes completely fresh music, whatever that means (I’m sure most modern human music has a hefty dose of cross song sampling), would certainly be a good next goal post.
Indeed, there is lots of denial or ignorance in this thread (ignorance in the technical sense). AudioLM already produced impressive results and it's a tiny fraction of what is already possible because performance simply improves with scale. One can probably solve music generation today with a ~$1B budget for most purposes like film or game music, or personalized soundtracks. This is not science fiction.
What's more interesting and concerning - listen carefully to the first piano continuation example from AudioLM, notice the similarity of the last 7 seconds to Moonlight sonata: https://youtu.be/4Tr0otuiQuU?t=516
I'm afraid we will see a lot of this with music generation models in the near future.
There are quite simple tricks to avoid repetition/copying in NNs, e.g. by (1) training a model to predict the "popularity" of the main model's outputs and penalizing popular/copied productions by backpropping through that model so as to decrease the predicted popularity, or (2) by conditioning on random inputs (LLMs can be prompted with imaginary "ID XXX" prefixes before each example to mitigate repetitions), or (3) by increasing temperature or optimizing for higher entropy. LLM outputs are already extremely diverse and verbatim copying is not a huge issue at all. The point being, all evidence points to this not being a show stopper if you massage these evolutionary methods for long enough in one or more of the various right ways.
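A decode-time sketch in that spirit (this is the standard temperature-plus-repetition-penalty recipe used when sampling from language models, not the training-time penalty described in (1); toy logits, no real model):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_next(logits, generated, temperature=1.0, repeat_penalty=1.3):
    """One decoding step over a toy vocabulary. Higher temperature flattens
    the distribution (trick 3); already-generated tokens have their logits
    penalized, discouraging verbatim repetition."""
    logits = np.array(logits, dtype=float)
    for tok in set(generated):
        # CTRL-style repetition penalty: shrink positive logits,
        # push negative ones further down.
        if logits[tok] > 0:
            logits[tok] /= repeat_penalty
        else:
            logits[tok] *= repeat_penalty
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

toy_logits = [2.0, 1.5, 0.5, 0.1, -1.0]   # 5 fake "tokens"
generated = []
for _ in range(10):
    generated.append(sample_next(toy_logits, generated, temperature=0.8))
print(generated)
```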
I'm not sure what you mean by "backpropping through that model so as to decrease the predicted popularity". During training, we train a model to literally reproduce famous chunks of music exactly as they are in the training set. We can also learn to predict popularity at the same time, but we can't backpropagate anything that will reduce popularity, because this would directly contradict the main loss objective of exact reproduction.
Having said that, I think the idea of predicting popularity is good - we can use it for filtering already-generated chunks during the post-training evaluation phase.
I don't think the other two methods you suggest would help here, we want to generate while conditioning on famous pieces, and we don't want to increase temperature if we want to generate conservative, but still high quality pieces.
It's true that we (humans) are less sensitive to plagiarism in text output, but even for LLMs it is a problem when they try to generate something highly creative, such as poetry. I have personally noticed, multiple times, particularly beautiful poetry phrases generated by GPT-2, only to google them and find out they were copied verbatim from a human poem.
What I had in mind was kind of like a reward model that is trained on longer outputs that have a very high similarity to training examples. Something similar has been done to prevent LLMs from using toxic language. You'd simply backprop through that model like in GANs. And no, it does not completely contradict the overall training objective, because the criterion would be long verbatim copies, and it would not affect shorter copies of sound fragments and the like, which you would want a music model to produce in order for it to sound realistic and natural.
Oh OK, so you mean training the model after it has already been trained on the main task, right? Like finetuning. Yes, I think the GAN-like finetuning is a good idea. Though it's less clear where the labels would come from, it seems like some sort of fingerprint would need to be computed for each generated sequence, and this fingerprint would need to be compared against a database of fingerprints for every sequence in the training set. This could be a huge database.
It doesn't surprise me that an AI model for language can't grok maths or music. I can't see how a language model can map to maths. Hell, I don't even know how to describe music in words. It's possible to articulate some maths in words, but that often involves using words with unexpected definitions.
MIDI is extraordinarily expressive and is likely used to sequence a large majority of music produced within the last three decades. A lot of the instruments you hear are synthesizers or samplers running directly from MIDI. There is a lot more to what MIDI can do, and is used for, than the conception most people have from "canyon.mid" or old website background music. If an AI can do MIDI just fine then it's an extremely small leap to doing audio just fine.
> If an AI can do MIDI just fine then it's an extremely small leap to doing audio just fine.
Unfortunately this is not true. It takes a huge amount of human effort to make MIDI encoded music sound good. The difference between MIDI and raw audio music generation is the same as the difference between drawing a cartoon and producing a photograph.
To clarify, yes MIDI can be expressive, but what's being generated when people say "AI generates MIDI music" is basically a piano roll.
I'm not familiar enough with existing implementations of such systems to dispute it, but there's no fundamental reason algorithmic composition systems could not include modulation parameters of all kinds (pitch/breath/effects/synthesizer controls/etc) in their output. I am envisioning a DAW set up with several VST's and samplers with routing and effects in place, then using some combination of genetic algorithms and other methods to "tweak the knobs" in the search for something pleasing.
The search space is absolutely enormous, though, so I don't dispute that it's very difficult, but I wouldn't go so far as to say that it can't be done. In such a space there are "no wrong answers" so to speak. I have a python script which creates randomized sequences of notes/rhythm and gives each one a different combination of LP/HP filters and random envelopes - it's not music but it takes on a much less mechanical quality by emulating different attacks and timbres over time, even though it's completely random.
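A stripped-down sketch of the kind of script described (my own reconstruction with numpy/scipy, not the commenter's actual code): random pitches and rhythms, each note given its own random envelope and a random low- or high-pass filter.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, lfilter

SR = 22050
rng = np.random.default_rng(42)

def note(freq, dur):
    """A plain sine tone with a random attack/decay envelope."""
    t = np.linspace(0, dur, int(SR * dur), endpoint=False)
    tone = np.sin(2 * np.pi * freq * t)
    attack = rng.uniform(0.01, 0.3) * dur
    env = np.minimum(t / attack, 1.0) * np.exp(-rng.uniform(0.5, 4.0) * t)
    return tone * env

def random_filter(signal):
    """Randomly low-pass or high-pass the note (2nd-order Butterworth)."""
    kind = rng.choice(["low", "high"])
    cutoff = rng.uniform(300, 4000)
    b, a = butter(2, cutoff / (SR / 2), btype=kind)
    return lfilter(b, a, signal)

# Random pitches (MIDI 48-84) and rhythms, each with its own envelope/filter.
midi_notes = rng.integers(48, 84, size=16)
durations = rng.choice([0.125, 0.25, 0.5], size=16)
audio = np.concatenate([
    random_filter(note(440.0 * 2 ** ((m - 69) / 12), d))
    for m, d in zip(midi_notes, durations)
])
audio = (audio / np.max(np.abs(audio)) * 32767).astype(np.int16)
wavfile.write("random_sequence.wav", SR, audio)
```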
I would go so far as to say I'd be genuinely surprised if algorithmic composition and production hasn't been used to some extent significantly greater than "basically a piano roll" in at least some of the past decade's top 40 music on the radio.
> there's no fundamental reason algorithmic composition systems could not include modulation parameters of all kinds (pitch/breath/effects/synthesizer controls/etc) in their output
There is such a reason - lack of training data. Very few high quality detailed MIDI samples exist to train machine learning models like AudioLM.
For state of the art in MIDI generation, take a look at what https://aiva.ai/ produces (it's free for personal use). There you can compare raw MIDI output to an automatically generated mp3 output (using "VST's and samplers with routing and effects in place, then using some combination of genetic algorithms and other methods to "tweak the knobs" in the search for something pleasing.")
The mp3 version will sound much better than raw MIDI, but (usually) significantly worse than music recorded in a studio and arranged/processed by a human.
As a classically-trained pianist who then got into electronica and synthesis, it was mind-blowing to me that people could wrangle expression and phrasing from a MIDI sequencer.
That particular niche has had some pretty amazing successes already. It's coming.
We can't produce arbitrary media streams with many "stack layers" of meaning and detail yet, but we can do a lot of specific instrumental transformations...
That’s what a musician does. They make short loops and loop them.
This reads like someone who knows sheet music and theory but does not listen to music. It’s repetition of short phrases over and over.
I’m not really sure what people expect of general AI trained on human generated outputs. It can’t make up anything anything “net new” only compose based upon what we feed it.
I like to think AI is just showing us how simple minded we really are and how our habit of sharing vain fairy tales about history makes us believe we’re masters of the universe.
Those models are not trained on short loops. They are trained on whole songs just like image generation models are trained on whole images. And yet they struggle to repeat sections, modulate to a different key, create bridges, intros and outros. After a few seconds of hallucinating a melodic line they simply abandon the idea and migrate to another one. There is no global structure whatsoever.
We're trying to train a full composer AI without allowing it to learn about different instrument sections independently at first. A human composer will have a good idea of the different parts and know how to merge them in harmony.
I think we might get better results training separate AI systems on percussions, strings, vocals etc. then somehow create connections between them so they learn together. A band AI if you will.
We could try a BERT for each, with the generator learning to output logical sequences of sounds instead of words.
Musicians don’t spit out an album in one sitting and they’re highly trained in theory. They get bored and tired of a process and take breaks. They come up with an album of loops composed together over time.
AI's state will forever be constrained to the limits of human cognition and behavior, as that's what it's trained on.
I read published research all year. Circular reasoning. Tautology. It's all over PhD theses.
There’s no “global structure” to humanity. Relativity is a bitch.
Seeing the world through the vacuum of embedded inner monologue ignores the constraints of the physical one. It’s exhausting dealing with the mentality some clean room idea we imagine in a hammock can actually exist in a universe being ripped asunder by entropy.
It’s living in memory of what we were sold; some ideal state. Very akin to religious and nation state idealism.
I think it's deeply depressing that AI has been sold as something even capable of modelling anything humans do; and quite depressing that this comment exists.
"AI" is just taking `mean()` over our choice of encodings of our choice of measurements of our selection of things we've created.
There is as much "alike humans" in patterns in tree bark.
AI is an embarrassingly dumb procedure, incapable of the most basic homology with anything any animal has ever done; us especially.
We are embedded in our environments, on which we act, and which act on us. In doing so we physically grow, mould our structure and that of our environment, and develop sensory-motor conceptualisations of the world. Everything we do, every act of the imagination or of movement of our limbs, is preconditioned-on and symptomatic-of our profound understanding of the world and how we are in it.
The idea that `mean(424,34324,223123,3424,....)` even has any relevance to us at all is quite absurd. The idea that such a thing might sound pleasant through a speaker, irrelevant.
This is a product of I don't know what. On the optimist side, a cultish desire to see Science produce a new utopia. On the pessimist side, a likewise delusional desire to see Humans as dumb machines.
I lack your confidence, and find it a bit religious.
> The idea that `mean(424,34324,223123,3424,....)` even has any revelance to us at all is quite absurd.
Most of what I say to anyone is exactly this.
When I'm about to give anyone any information, I look back at all of the relevant past information that I can recall (through word and sensory association, not by logic, unless I have a recollection of an associated internal or external dialog that also used logical rules.) I multiply those by strength of recollection and similarity of situation (e.g. can I create a metaphor for the current situation from the recalled one?). I take the mean, then I share it, along with caveats about the aforementioned strength of recollection and similarity of situation.
This is what it feels like I actually do. Any of these steps can be either taken consciously or by reflex. It's not hidden.
> I think it's deeply depressing that AI has been sold as something even capable of modelling anything humans do
This is a bizarre position. All computers ever do is model things that humans do. All a computer consists of is a receptacle for placing human will that will continue to apply that will after the human is removed. They are a way of crystallizing will in a way that you can sustain it with things (like electricity) other than the particular combination of air, water, food, space, pressure, temperature, etc. that is a person. An overflow drain is a computer that models the human will. An automatic switch/regulator is the basic electrical model of human will, and a computer is just a bunch of those stitched together in a complementary way.
You're an animal. You've no idea what you do, and you're using machines as a model. Likewise, in the 16th C. it was brass cogs; and in ancient Greece, air/fire/etc.
You're no more made of clay & god's breath than you are of sand and electricity.
You're an oozing, growing, malleable organic organism being physiologically and dynamically shaped by your sensory-motor oozing. You're a mystery to yourself, and these self-reports, heavily coloured by the in-vogue tech, are not science; they're pseudoscience.
If you want to study how animals work, you'd need to study that. Not these impoverished metaphors that mystify both machines and men. No machine has ever acquired a concept through sensory-motor action, nor used one to imagine, nor thereby planned its actions. No machine is ever at play, nor has grown its muscles to be better at-play. No machine has, therefore, learned to play the piano. No machine has thought about food, because no machine has been hungry; no machine has cared, nor been motivated to care by a harsh environment.
An inorganic mechanism is nothing at all like an animal, and an algorithm over a discrete sequence of numbers with electronic semantics, is nothing like tissue development.
What you are doing is not something you can introspect. And you aren't really doing that. Rather, you've learned a "way of speaking" about machine action and are back-projecting that onto yourself. In this way, you're obliterating 95% of the things you are.
This isn't really responsive. Not only am I not using machines as any sort of model for human behavior, I'm trying to think about weird things you could do to a machine to make it ape a human.
> these self-reports, heavily coloured by the in-vogue tech are not science, they're pseudoscience.
I simply don't know what you're referring to. If you're referring to retrieving memories through associations, there's mountains of empirical evidence for that. If you're referring to wondering if I remember things, and being unsure of the information I'm recalling when I have less recall of that, or wondering if past situations compare well to current situations, well you got me. It's my personal belief that conscious thought is an epiphenomenon that is a rationalization of decisions already made.
But the rest of this is nonsense. Vivid imagery is not an argument for exceptionalism, no matter how much I say things drip or ooze. This is just association in action. You're trying to create a distinction for life (or rather, what you recognize as life): life oozes and has viscera, so using a bunch of words that feel wet and organ-y can substitute for reasoning contra the robots.
That solution has a compressed representation of half the internet.
NNs are "garabled nonesense" insofar as they try to generalise; insofar as they are search engines, they provide apparent sense by just repeating something in their database (= weights).
Oo this reminds me. One of my favourite sci-fi novels is The Moon is a Harsh Mistress.
In it, it depicts the growth of a nascent AI from its attempts at understanding humor. The AI befriends a technician and gets the human to rate its own crafted jokes.
Eventually the AI gets really good at telling jokes, and becomes sentient as a result.
It was a very fun take on AI gaining sentience, highly recommended!
This might make sense as a response solely to the title of the article, but I have to admit I find it puzzling as a reaction to its content. Notwithstanding the title, the article mentions a model called Minerva that scored fully 50% on the MATH dataset of high-school/undergrad mathematical problems. For comparison, a human computer science PhD student scored 40%. [1]
For context, Minerva came out this July. When it was tested on a national math exam, it scored higher than that year's class of graduating high school seniors. [2] A mere eight months (!) earlier, OpenAI had announced [3] they'd trained a language model that solved math word problems almost as well as an average middle-schooler. So even if you believe — rightly or wrongly — that current capabilities aren't very impressive, it's worth remembering that your understanding of current capabilities might not be entirely accurate, even if it's only a few months out of date.
Incidentally, it may be worth looking at some examples of these models' outputs before deciding what they can or can't do. Here's Minerva solving some math problems, for example:
I'll admit I find it challenging to interpret these results as "passable garbled nonsense", though perhaps I'm not being demanding enough. At any rate, when these models go from beating 10-year-olds at math to beating 18-year-olds at math in the span of 8 months, one does start to wonder how much of the hype is really due to over-interpretation - and what the next 8 months have in store.
Current sequence models don't have the right structures to represent math. Even if they use floating point internally, they can't really float the point because the nonlinearity in the model has a certain scale.
A system that processes language can take advantage of the human desire for closure
The problem is that human language is approximate and correct math is not, so pattern matching on prose text is doomed. AI trained on exact math does a lot better. But that's not fully generic so fails the weird GPT goal of modeling all of human intelligence through prose. That's not how people solve math at all.
GPT's "Superficially plausible but wrong" math is actually pretty good match for non-expert bad-at-math average human behavior.
Mwell, the article claims, and points to work that also claims, that large language models can actually be made to perform arithmetic well. They need fine-tuning, verification, chain of thought prompting and majority voting to be combined but the linked Google blog says that Minerva hit 78.5% accuracy (on the GSM8K benchmark).
For me the problem is that we can look at the output and say if it's right or wrong, but we know what language models do, internally: they predict the next token in a sequence. And we know that this is no way to do arithmetic, in the long run, even though it might well work over finite domains.
Which is to say, I'm just as skeptical as you are, and probably even more, but I think it's useful to separate the claim from what has actually been demonstrated. Google claims its Minerva model is "solving maths problems" but what it's really doing is predicting solutions to problems like the ones it's been fine-tuned on, and those problems are problems stated at least partly in natural language, not "naked" arithmetic operations. In the latter, language models are still crap because they can't use the context of the natural language problem statement to help them predict the solution.
Btw, "chain of thought prompting" if I remember correctly is a process by which an experimenter prompts the language model with a sequence of intermediary problems. So it's not so much the model's chain of thought, as the experimenter's chain of thought and the experimenter is asking the model to help him or her complete their chain of thought. I have a fuzzy recollection of that though.
That's interesting, I hadn't made the connection between executive function and intelligence.
I went through a burnout in 2019 that felt like having a stroke. My brain finally reached such a level of negative reinforcement after years of failure that it wouldn't let me work anymore. I'd go to do very simple tasks, everything from brushing my teeth to writing a TODO list, and it was like the part of my brain that performed those tasks wasn't there anymore. Or at least, it no longer obeyed if it perceived a potential reward involved. It was like my motivation got reversed. I had to relearn how to do everything, despite knowing that no reward might come for a very long time, which took at least 6 months before I began recovering. The closest answer I have is that my brain healed through faith.
I only bring it up because executive function may be associated with a subjective experience of meaning. If there's truly no point to anything, then it's hard to summon the motivation to string together a sequence of AI tasks into something more like AGI.
I guess that's another way of saying that nihilism could be the final hurdle for AGI to overcome. It's like the human philosophical question of why there's something instead of nothing. Or why angels would choose to be incarnate on Earth to experience a life of suffering when it's so much easier to remain dissociated.
Translate to what? The next likely string of characters? How would this executive even interact with it? A sibling comment of yours mentioned extracting low-level steps from high-level tasks, but it needed another language model (no kidding!) to map to the «most likely» of the admissible actions. I mean, this shit is half-baked even in theory.
that solves word problems using the methods of the old AI. The point is that it is efficient and effective to use real math operators and not expect to fit numbers through the mysterious bottleneck of neural encoding.
> language models just need to translate problems into code of some kind that can be run to get the answer
A huge "just"! Isn't this the magic step? Translating ambiguous symbols to meaning and combining them in meaningful ways is a big deal which, apparently, these AI models cannot do. They can just parrot things.
> Translating ambiguous symbols to meaning and combining them in meaningful ways is a big deal which, apparently, these AI models cannot do.
Plenty of AI models do exactly this. Very clear examples include question answering models and code generation. In both cases novel, meaningful responses are generated.
> They can just parrot things.
That isn't true. While language models can parrot things, it is generally special conditions that make them do it. Specifically, the conditional probability of the next character (or BPE or word, depending on the model) has to be much higher than anything else, which happens when the thing being parroted is unique text.
If you ask most Americans or a language model what word comes next in "Four score and seven years...", they'll give the same answer, for the same reason.
So is in your opinion General AI solved? Because reliably turning symbols into meaning, outside narrow or special cases, is General AI.
In my opinion, it's not solved. GPT-3 is not General AI, it's a more clever mechanism for parroting back text it cannot truly understand. Comparisons to ways humans confuse themselves are a red herring in my opinion: the old ELIZA program could reply like a very confused or trollish human would, but nobody would argue ELIZA was a general AI.
It's just that GPT is a fascinating and more convincing illusion than ELIZA. Unlike ELIZA, it can also be used for meaningful purposes.
I don't think "general intelligence" is a bright-line, but instead is a continuum, and I don't agree with your definition (although I appreciate you do at least give a definition).
I think that in general most human decision making is just pattern matching (there's plenty of evidence for this - read "Thinking Fast and Slow" for an overview).
I think the extrapolation that ML models can do is a form of intelligence. I also think that the compression and encoding of inputs into a lower dimensional space is exactly the "turning symbols into meaning" that you call for in your definition.
We obviously disagree. I don't think we are near general AI (and yes, I know the objection that everyone who says this is simply moving the goalposts-- regardless, I'm unconvinced). I think GPT et al are very interesting tricks, but still not general AI; and that the path to it doesn't lie in this direction.
I subscribe to the view that we think of the human mind as a pattern-matching computer simply because this is the current major tech, much like people in the past thought of "humors" or "steam machines". I think some of the analogies are useful, to a point, but I don't think there's hard evidence the mind is like a neural net (irony notwithstanding) or a pattern-matching GPT-like algorithm.
Re: Thinking Fast and Slow, I see there are serious doubts about the validity of the book's foundations and conclusions, and that it's been challenged.
Nor do I. But I don't agree with your definition of general AI at all.
I think by your definition we are well on the way towards it. I'd note that you didn't address the idea that compression of input concepts into a lower dimensional space is exactly the "reliably turning symbols into meaning" idea you suggest.
The fact the latent representations of concepts can be manipulated in ways that make logical sense is a good indication that the symbols have meaning. The classic Word2Vec experiments showing how the relationships Paris->France ~= London->England and King - Man + Woman = Queen show this well. Modern large language models are much more complicated of course, but the principles remain.
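For concreteness, the classic analogy tests can be reproduced with pretrained vectors (this sketch assumes gensim and its downloadable word2vec-google-news-300 model; the exact nearest neighbours depend on the vectors used):

```python
import gensim.downloader as api

# Pretrained Google News word2vec vectors (large download on first use).
vectors = api.load("word2vec-google-news-300")

# king - man + woman ~= queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Paris - France + England ~= London (i.e. Paris->France ~= London->England)
print(vectors.most_similar(positive=["Paris", "England"], negative=["France"], topn=3))
```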
> I subscribe to the view that we think of the human mind as a pattern-matching computer simply because this is the current major tech, much like people in the past thought of "humors" or "steam machines". I think some of the analogies are useful, to a point, but I don't think there's hard evidence the mind is like a neural net (irony notwithstanding) or a pattern-matching GPT-like algorithm.
I don't think the computational approach is an interesting question. No one thinks brains operate like software neural networks (not sure what "pattern-matching GPT-like algorithm" means - it is just a neural network). That doesn't matter, because all computational methods are ultimately equivalent.
I think outcomes on metrics like benchmarks is important, and while some benchmarks have issues that some approaches exploit I think things like Chollet's "On the Measure of Intelligence" (https://arxiv.org/abs/1911.01547) are reasonable frameworks for discussion of progress.
> Re: Thinking Fast and Slow, I see there are serious doubts about the validity of the book's foundations and conclusions, and that it's been challenged.
These issues don't detract from the overall theme of the book about the two systems of decision making (rational and reflex) and how often we use the reflex decision system but convince ourselves we are using the rational system.
If you can point at any additional doubts about the book's conclusions I'd appreciate a reference.
>> I don't think the computational approach is an interesting question.
> Great, so we agree then!
Maybe? I think all forms of computation are equivalent, so whether the brain is implemented in the same way as a neural network is uninteresting. They can do the same thing (which is interesting).
> yet we both agree we are not near general AI.
Sure. My disagreement is with your original statement:
"Translating ambiguous symbols to meaning and combining them in meaningful ways is a big deal which, apparently, these AI models cannot do. They can just parrot things."
Modern AI systems can do this, and do not just parrot things.
They are capable of novel outputs.
This is because the models have sufficient "understanding" to manipulate "things" (latent vector representations, which you call symbols) to output novel but meaningful outputs.
It's unclear what you mean by "solved". Even a human can't turn every arbitrary problem into code to solve, but we still consider humans "generally intelligent".
GPT3 can't turn as many problems into code as I can, but it can do some, and GPT4 (or whatever) will be able to do more, etc.
I'm not so sure about that. Of course computers can do arithmetic operations, but this is not the same as solving math problems, proving theorems, etc.
Even mathematical objects are approximated up to an approximation error in a computer (like a differentiable manifold or a real number).
> Of course computers can do arithmetic operations, but this is not the same as solving math problems, proving theorems, etc.
Computers can solve math problems and prove theorems; this remains a significant subfield of Computer Science with lots of industrial use cases. However, pure machine learning based approaches toward these problems remain subpar.
> Even mathematical objects are approximated up to an approximation error in a computer (like a differentiable manifold or a real number).
Only because it caught on (and in the case of non-computationally-intensive applications, for purely historical reasons). For example, Mathematica has Reals and even functionality for Reals that is literally impossible to implement for integers [1,2]. There are also precise characterizations of objects in differential geometry [3]. You could imagine applying LLMs to these types of programs a la Copilot, but when you do this you will find yourself agreeing with Paul Houle's observation that math is harder to fake than eg art, language, or even glue code for web apps.
> Computers can solve math problems and prove theorems
But the specification of the problem must be done by a human, translating to a formalized system that the software can understand. And if there's a problem in the formal specification, it's mostly up to the human to notice and fix; the computer will happily output garbage or crash or enter an infinite loop.
So it seems this translation, going from an exploration of the problem statement, usually in ambiguous terms, to a formal specification, and the awareness to possibly detect whether the answers make sense and the specs were right, is uniquely human.
You just don't hear about it much because the technology is not so fashionable today. Also, it is more clear what the limits are; I mean, Turing, Gödel, Tarski and all of those apply to neural networks as well as to any other formal system, but people mostly forget it.
Knuth wrote a really fun volume of The Art of Computer Programming about advances in SAT solvers which are the foundation for theorem provers
Everybody is aware that neural network techniques have improved drastically in performance, it's much more obscure that the toolbox of symbolic A.I. has improved greatly. Back in the 1980s production rules engines struggled to handle 10,000 rules, now Drools can handle 1,000,000+ rules with no problems.
The wiki article on automated theorem proving is quite bad as an overview of the active field; it's more a historical article about the mid-to-late 20th century. Most of the interesting things in automated reasoning have happened since the aughts, and that article kind of stops in the 90s.
SMT solvers have gotten quite good over the past couple decades, there are tons of domain-specific tools (eg in software and hardware verification), tons of niche applied decidable or semi-decidable theories (eg various modal and description logics), a lot of progress on the proof assistant ("non-fully-automated theorem proving") paradigm, and so on.
It's clear that commonsense reasoning needs to deal with modals, counterfactuals, defaults, temporal logic, etc.
It's not hard to add some extensions to logic for a particular application but a very hard problem to develop a general purpose extended logic.
I look at the logic-adjacent production rules systems which never really standardized some of the commonly necessary things such as agendas, priorities, defaults, etc.
Computers are much much better at all that stuff than almost everyone too. Try asking Wolfram Alpha to solve something. Computers have gotten really good at proving things in the last couple of decades and formal verification methods are becoming increasingly popular.
I think sharemywin is probably on to something. It's going to be really hard for an AI to prove that e.g. x>0 && x+y <= 1 && y>1 is unsatisfiable, but it's trivial for an SMT solver. On the other hand it probably isn't that much of a leap to make an AI that can feed that problem into an SMT solver.
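For what it's worth, that exact constraint handed to an SMT solver via Z3's Python bindings (assumes the z3-solver package):

```python
from z3 import Reals, Solver, unsat

x, y = Reals("x y")
s = Solver()
s.add(x > 0, x + y <= 1, y > 1)

# x > 0 and y > 1 force x + y > 1, contradicting x + y <= 1.
print(s.check() == unsat)   # True
```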
Well, you don't need anything else than basic arithmetic to encode the entirety of, say, ZFC, enumerate every proposition in it, and halt iff you find a proof of whatever theorem you're after. It just might take a while…
That's because they're not modelling anything. The shocking thing about current AI models is that just sort of repeating and copying from memory what you've heard and seen gets you 97% of the way to imitating a person.* They still need to generate actual models somewhere to create consistency; hence so many generated images with one eye completely different from the other, or three arms, or fingers that grow into their cellphones.
If you solve this, you've probably solved almost anything in the simulation field. I have no confidence that the solution will even be complicated. Information consumed needs to be used to add to some sort of model, and that model always needs to be used as part of input. The complicated part would be to make that base model able to modify itself reasonably based on input, to tolerate constant inconsistency, and to constantly refine itself towards consistency i.e. ruminate.
I think a huge difference (which I think was approached through theories of embodied cognition) is that people start with a model (or the ability to create a model) of themselves. We can apply that model to other things and use it both to change how we ourselves behave, and how we speculate about the invisible states of other things. It's not for nothing that we can (and must) anthropomorphize anything.
-----
* Which was huge towards the confirmation of my belief that this is all people do 97% of the time.
This is factually wrong, both in terms of quantity and quality.
Current AI models are not "just sort of repeating and copying from memory". This is just an incorrect characterization of how they work and how they perform.
AI skeptics often say things like this then backpedal with something like "Well they aren't really repeating what they heard, but their generative model is just a slightly more sophisticated version of repeating what they've heard." But this weaker claim is also true of humans. It's certainly the case that >97% percent of what humans say is "just repeating and copying" in the same sense.
> Current AI models are not "just sort of repeating and copying from memory". This is just an incorrect characterization of how they work and how they perform.
You say this, but don't explain how. Because this is exactly what they are doing.
> AI skeptics often say things like this
I'm not really an AI skeptic. I think that we're very close to AI being indistinguishable from people. There are clearly problems that need to be solved, but I think the hardest problem was accepting the fact that humans are largely just copying and realizing that would be enough to get you 97% of the way there, especially if you gave a machine far more to copy than a human could consume.
> then backpedal with something like "Well they aren't really repeating what they heard, but their generative model is just a slightly more sophisticated version of repeating what they've heard." But this weaker claim is also true of humans. It's certainly the case that >97% percent of what humans say is "just repeating and copying" in the same sense.
Maybe I'm not expressing myself clearly, but it seems that you're just repeating my comment with a sneer. Agreeing angrily?
I'm disagreeing with the language you are using to characterize models. "copying from memory" implies that there is something being copied, and a memory that you are copying it from. I am pointing out that LLMs do not do this. It's not how they work.
If you polled 1M English speakers at random and asked them whether or not a system that is "just sort of repeating and copying from memory" could produce completely novel answers in response to completely novel questions, I suspect that the overwhelming majority would respond by saying no.
Similarly if you asked 1000 people working on LLMs whether they work by "copying from memory", I suspect nearly all would say no. It would be accurate to say they are "generating text via a probabilistic model of language, which is encoded in the weights of a neural network", but there really is just no sense in which the models are "copying" anything.
That being said, these models do "copy" some text in the sense that they can reconstruct some strings from their training input. For example every LLM I have played with can recite the first few paragraphs of A Tale of Two Cities verbatim. But that's a capability they have _in spite of_ their actual design, not because of it.
> I'm disagreeing with the language you are using to characterize models. "copying from memory" implies that there is something being copied, and a memory that you are copying it from. I am pointing out that LLMs do not do this. It's not how they work.
Then we're arguing about the semantics of the word "copy." That is not an interesting argument when you know exactly what I mean and can express it clearly.
edit: If it helps, either substitute your description in whenever I say 'pretty much copy' or change the word "copy" to whatever word you want to use. But even though I can't reproduce the opening paragraph to A Tale of Two Cities verbatim, I can certainly write something that is "copying" it without doing that, and anyone who was familiar with the book and read my paragraph would agree with me.
It is semantics, but that was your whole point no?
> That's because they're not modelling anything
If we agree on "how LLMs work", then how can you claim that they aren't modeling anything? They are modeling language, and while it's unlikely current paradigms will be proving new mathematical truths, it's completely plausible to me that bigger models will be able to handle simple math word problems like those in the article, precisely because LLMs can model the "Alice", "Apple", and "Bob" entities.
I disagree that they are modeling language.* I think that not only bigger models but same-sized or much smaller models will be able to handle arbitrarily complicated word problems if they're eventually supplemented with some explicit model-building process.
-----
* ...and that would be a completely semantic argument to have. I don't care whether it's called modeling, other than the fact that when I'm talking about modeling, I'm not talking about language probability, I'm talking about categories. But discussing what current AI is (a language model, copying?) is a waste of time, because I absolutely agree with your description of how it works, so we're talking about exactly the same thing.
What is the difference between this and describing a human brain the same way? The brain is the model, you are "just" copying things from the memory of your brain to words that you speak or write?
I don't think it's pedantic to say that an argument is wrong because it's making an incorrect claim. The claim here is that there is something different or missing between a true "model" and LLMs, and that missing thing has something to do with "copying". But that's not true, the missing thing is the complexity of the table, or the size of the table. The fact that it's copying in some incredibly abstract sense doesn't matter.
I think humans are just copying every time we give an answer without much thought. We're either regurgitating something we previously thought/solved or something we heard somewhere.
When you have to put your head down and actually think for a while, then maybe you're doing something new, or at least something not within your brain's training data. I don't think AI can do this yet. It can only copy pieces of its training data out to look like something new, but it isn't really new. Like when I was describing a video game idea I had to a friend and he called me out on just stealing bits of other games and mashing them together. He was right, it wasn't original. And this is all the AIs can do right now.
Current LLMs are "modeling" something according to pretty much any sense of the word "model".
In the technical, computational linguistics sense, LLMs are language models that give a conditional posterior distribution over sentences. Given some (constrained) context, the model tells you the posterior distribution over sentences in or around that context.
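In symbols (this is just the standard autoregressive factorization, not anything specific to one system), the distribution such a model represents is

    P(w_1, \dots, w_T) \;=\; \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})

and sampling "sentences in or around a context" amounts to repeatedly drawing the next token from the conditional factor.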
In the nontechnical, layman sense of the word, they are a system that is used as an example of language. LLMs imitate language by generating new sentences. They are a "model" in the same way that an architectural model is a model, or in the same way that a statue is a model of a human.
The other point I disagreed with is the characterization that LLMs "just sort of repeat and copy from memory". I went into more detail about that in other replies.
A more layman-friendly way to describe it, one that avoids too much oversimplification, is that these learning models try to group things and apply probabilities to sequences of groupings.
E.g., a word is a grouping of letters; try to find the sequences of letters with the highest probabilities.
A phrase is a grouping of words, with punctuation marks.
Try to find the sequences of words with the highest probabilities.
A sentence is a grouping of phrases. Try to find the highest probability sequences.
A paragraph is a sequence of sentences. And so on and so on.
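As a rough sketch of "find the highest-probability next grouping", here is a deliberately tiny bigram-style toy in Python. Real models use neural networks over subword tokens rather than word-pair counts, and the corpus string is made up purely for illustration:

    from collections import Counter, defaultdict

    # Toy corpus standing in for training text (made up for illustration).
    corpus = "the cat sat on the mat . the cat ate the fish .".split()

    # Count which word follows which: a crude stand-in for learning
    # probabilities over sequences of groupings.
    following = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        following[prev][nxt] += 1

    def most_likely_next(word):
        # Pick the highest-probability continuation seen so far.
        counts = following[word]
        return counts.most_common(1)[0][0] if counts else None

    print(most_likely_next("the"))  # -> "cat" (seen twice, beats "mat" and "fish")

Scaling the same idea up from word pairs to long contexts, with learned rather than counted probabilities, is essentially what the large models do.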
Within very narrow domains (specific writing styles, say technical or legal writing), these models can be very accurate, since the sequencing of words into phrases, phrases into sentences, and sentences into paragraphs is very predictable. People call this kind of predictable sequencing a 'style', and it aids us in understanding text more quickly. Across all domains more generally, it's much harder to predict these sequences accurately, because the AI has to identify the 'style' of a text purely from the text itself. No context surrounding the text is given to the AI, so it guesses.
For example:
A political press release will be written in one style of writing, and a company marketing press release in a slightly different one. As humans, we can easily distinguish commercial marketing from political messaging because we are given that information up front; in Latin (a language of choice for some mathematicians and logicians, for historical reasons), we would say we have the information 'a priori'. A learning algorithm isn't given that information up front, and must determine from the text alone whether it is more likely a marketing release selling some product, and therefore should adopt a certain language style, or a political release selling an ideology, and therefore should adopt a slightly different language style.
When we don't know the right answer, and have no way to determine it, the solution that most computers are programmed to adopt is a minimax solution, i.e., minimise the maximum possible error. It does this by sort of mixing and matching both marketing and political styles.
When a human reads it, sometimes it looks very strange and funny. Usually this is because it has some distinguishing feature, that we can immediately recognise as placing it as either a political or marketing document, i.e., a company name, a political party, a corporate or political letterhead, a famous person's name etc. The computer naturally doesn't know who Donald Trump is, since we haven't taught it who or what a Trump is, so it doesn't give it any precedence over any other word on the page. Actually, in the case of Donald Trump, I bet if you took the dates off of all of his tweets, even humans would have a hard time distinguishing if they were political or commercial in nature.
> “I think there’s this notion that humans doing math have some rigid reasoning system—that there’s a sharp distinction between knowing something and not knowing something,” says Ethan Dyer, a machine-learning expert at Google. But humans give inconsistent answers, make errors, and fail to apply core concepts, too. The borders, at this frontier of machine learning, are blurred.
This part resonates with me. There was a time when I could solve congruence-modulo problems with exponents, but I couldn't do it step by step; I could only "hallucinate" my way to the solution in a fuzzy fashion, somewhat like recalling it from memory.
When we have to explain our reasoning we can’t think the same way. It’s like thinking with a debugger attached.
Language models can generate a Python function that does the math perfectly.
I bet you would get better results if you tweaked the prompt to say "Generate a Python program that solves X math problem" and then just ran the resulting Python script.
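A minimal sketch of that loop, assuming a hypothetical query_llm() helper in place of a real model API (it returns a canned completion here so the example runs end to end), and ignoring sandboxing concerns:

    import subprocess, sys, tempfile

    def query_llm(prompt: str) -> str:
        # Hypothetical stand-in for a real model call; canned output so the
        # sketch is self-contained.
        return "print(7 - 3)"

    problem = "Alice has 7 apples and gives Bob 3. How many apples does Alice have left?"
    prompt = "Generate a Python program that prints the answer to this problem:\n" + problem

    generated = query_llm(prompt)

    # Write the generated program to a file and run it in a separate process,
    # using the script's output as the answer.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated)
        path = f.name

    result = subprocess.run([sys.executable, path], capture_output=True, text=True, timeout=10)
    print(result.stdout.strip())  # -> 4

The model only has to get the translation into code right; the arithmetic itself is done by the interpreter.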
That could only generate constructive [0] proofs, and many things done in modern maths are not constructive. Maybe a better approach would be to use the Curry-Howard [1] correspondence to get proofs directly from generated programs.
Exactly, we need computer-equipped neural nets. Models need to use traditional UIs (including programming languages) and then we can talk about how to stop them. :)
It’s wishful thinking that I myself have once fallen for. I don’t trust our society to transition to a world with powerful machine intelligence safely, so would prefer a world in which ML progresses at a glacier’s pace.
Are there any general purpose models that are good at learning math? I mainly know basic feed-forward neural nets, but I don't think they do well outside their training region. Math, of course, has an infinite training region.
From my (limited) experience with the advanced ML models, they can "do basic math", but they make amateur mistakes with basic things - which indicates they don't actually know addition, but they are good at looking at patterns in existing language.
I would assume that state-of-the-art ML models could "convert a word problem into an equation", then feed that equation into a 30-year-old graphing calculator to "do the math".
The fact that no one has done this is an indicator that "there are more important things to work on", and it's just a matter of time before someone connects the two.
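One plausible shape for that pipeline, with a stubbed-out model step (the equation string is hard-coded here; in a real system it would come from the LLM) and SymPy standing in for the graphing calculator:

    from sympy import Eq, solve, symbols, sympify

    def word_problem_to_equation(problem: str) -> str:
        # Stand-in for the ML step: a real system would ask the model to
        # translate the problem into an equation. Hard-coded for the sketch.
        return "2*x + 3 = 11"

    problem = "Twice a number plus three is eleven. What is the number?"
    lhs, rhs = word_problem_to_equation(problem).split("=")

    x = symbols("x")
    print(solve(Eq(sympify(lhs), sympify(rhs)), x))  # -> [4]

The hard part, of course, is the translation step, which is exactly where current models are unreliable.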
This seems so much like humans that it makes me think lots of people are learning math with an ML-like approach instead of… whatever the heck people like engineers and mathematicians are doing.
Anyone can do higher level math, the problem is that math education is generally done by people who see math as a tool for computation, rather than a study of deep connections bordering on philosophy, and beautiful insights resembling poetry. I've been in arguments before where someone didn't believe me that the underpinnings of modern philosophy are essentially the same as math!
If the teachers don't love math, how can we expect students to?
I wonder how these language models would do if we tried to teach them maths the way schools do: Feed them explanations first, then endless sequences of toy problems, see which they got wrong and feed them corrected examples back in.
I'm not at all surprised they don't do well at maths, because while there are maths texts online, I doubt there is enough material to give these models the same experience of repetition and reinforcement to help sufficiently generalise an understanding of the underlying rules.
I don't think it's so much a refusal as that it hasn't been a sufficient priority for anyone before. As the article points out, there are now a few training sets which include math problems, and models which do well on them. But the remaining problems seem to be with basics which humans tend to learn to do consistently through a lot of repetition, and it'd be interesting to see those datasets extended to the very simple.
I attempted to create a general-purpose model for an exact version of the "what comes next" problem. It enumerated primitive recursive functions, trying them out as it went. The restriction to primitive recursive functions was convenient because they always terminate, so I didn't have to filter out functions that ran for too long. (Or did I?)
The enumeration inherently includes functions of several variables, so I wasn't restricted to examples such as 1->1, 2->4, 3->9, 4->16 etc.
I could try it out on examples such as (1,2)->3, (2,1)->3, (0,2)->2, etc. Perhaps with enough examples it would "learn to add", i.e. find a primitive recursive function that did addition.
I got as far as finding the first problem. The enumeration technique that I used was effectively doing a tree recursion, like that function for computing Fibonacci numbers that bogs down because Fib(10) is computing Fib(5) lots of times. I had a lot of numbers that coded for the identity function, lots of numbers that coded for the first few functions, making the whole thing bog down, trying the same few functions over and over under different numerical disguises.
I thought I could see my way to fixing this first problem: have some way of recognizing numbers that code for forms that give the same function. I guessed I could approximate this by saying that if two functions give the same value on a variety of arguments, they are probably the same function. Then I'd parameterise this criterion and tune it. That opens the way to creating a consolidated enumeration, analogous to fixing the tree-recursive Fibonacci function by memoization, except trickier.
But my health is poor and I ran out of energy.
Also, I have a guess about the second problem. What happens if I fix the first problem and my enumeration reaches decently complicated primitive recursive functions? While they will all terminate, some might run for far too long, causing the process to bog down. Rejecting them by limiting the run time might work well; we are happy to learn only reasonably efficient functions for doing maths.
It is a fun idea and I encourage others to have a go.
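For anyone who does have a go, here is a toy version of the consolidation idea: deduplicate candidates by their outputs on a few probe inputs, keeping one representative per behaviour. It uses small arithmetic lambdas rather than a real enumeration of primitive recursive functions, so it's only the shape of the fix, not the fix itself:

    import itertools

    # Tiny stand-in for an enumeration of candidate functions.
    candidates = [
        ("x + 0", lambda x, y: x + 0),
        ("x",     lambda x, y: x),          # same function, different form
        ("x + y", lambda x, y: x + y),
        ("y + x", lambda x, y: y + x),      # same function, different form
        ("x * y", lambda x, y: x * y),
    ]

    probes = list(itertools.product(range(4), repeat=2))

    # Fingerprint each candidate by its outputs on the probe inputs and keep
    # only the first representative of each fingerprint (memoization-flavoured).
    seen = {}
    for name, fn in candidates:
        fingerprint = tuple(fn(a, b) for a, b in probes)
        seen.setdefault(fingerprint, name)

    print(list(seen.values()))  # -> ['x + 0', 'x + y', 'x * y']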
There is "LODA", which uses genetic algorithms, that continuously mutates existing math programs until discovering something new. It uses OEIS as training data, around 350k known integer sequences, such as primes/fibonacci. Around 100k programs have been mined so far.
> “When multiplying really large numbers together … they’ll forget to carry somewhere and be off by one,” says Vineet Kosaraju, a machine learning expert at OpenAI. Other mistakes made by language models are less human, such as misinterpreting 10 as 1 and 0, not ten.
So the expert has never seen a seven-year-old struggling to add two single-digit numbers together? Did the expert learn that 10 is a 1 and a 0 first, and learn to speak second?
> The MATH group found just how challenging quantitative reasoning is for top-of-the-line language models, which scored less than 7 percent. (A human grad student scored 40 percent, while a math olympiad champ scored 90 percent.)
Is this that surprising? How would our IEEE editor score on the same problem set?
The situation is actually much worse for science, or any moving field. These models are, by design and by necessity, historical. So if, for example, the FDA issues a drug approval overnight, the model cannot follow sudden changes in a "reasoned" way.
This is incorrect, and it's unclear why people think this.
The whole point of a good ML system is that it doesn't parrot training data. A good system can extrapolate novel answers from things it has seen. That is very far from "parroting".
Why instead of expecting a language to get math, don't we use the language model to generate code, run the code and use the result?
If a language model is basically the equivalent of a dumb human, how can we expect it to be better than us at math? Most humans use calculators even for simple equations.
I've been thinking about giving access to a search engine and a command line to a GPT-3 based AI, so that it can choose to run code it wrote or to expand its knowledge, I think that's a good way to expand its capabilities, even if that's probably how we're going to get skynet in the end.
More generally, they struggle to get things right. They're great at grammatical confabulation, but when you need a correct answer, or a correct drug recommendation, ask an expert.
It is a great sign that we are building AI in the right direction. Before building artificial human intelligence, it makes sense to get to the intelligence level of a mosquito or fly, then go to more intelligent animals in later iterations.
As most human knowledge is encoded in video, getting better at understanding and generating videos will clearly get us closer to making computers understand the world.
I genuinely wonder if we will find there are some inherent tradeoffs to knowledge and understanding, such that if we ever have machines that can "think like humans", they would in practice run into human-like cognition limits: i.e., such machines would be "bad at math" in the same way humans are "bad at math" compared to conventional computers.
Indeed. I posit that as we get closer and closer to simulating how the human brain works in the pursuit of artificial intelligence, we're going to start seeing more and more of the same "bugs" that humans have (logical fallacies, susceptibility to illusions, mental illness, etc.)
You think your job sucks now, just wait until you're dealing with the general AI over on the UX team that's trying to get your ass fired because it's fostering a 3 year old grudge over that time you said Chappie was stupid.
At first, I thought it was surprising that a language model with a restricted vocabulary (e.g. banning the letter "E") acts significantly more "mentally ill", and then I thought about how I would come across if forced to use that constraint all the time, and I realized that maybe I'd appear mentally ill too!
That's an interesting thought. However it's not cognitive limits that make humans bad at math, it's just a "hardware" issue: a human with a piece of paper is much better at math.
Even if neural networks were fundamentally incompatible with conventional computation, I don't see why you couldn't augment a neural network with a conventional ALU to do the numerical computations. This is exactly what humans do with pencil and paper - it's just a bit too slow.
Either the language model would need to know what it's doing or the host program would have to know what the AI is doing. Both seem out of reach. The latter seems more doable since you could hack something up for simple scenarios, but you'd effectively have to match the capabilities of the neural network in a classical way to handle every case (which would render using a neural net moot).
Btw, here's an example of how even a very simple zero-shot/prompt-engineering attempt to introduce a bit of system 2 reasoning into a language model can improve results.
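The best-known version of this is simply appending a reasoning cue to the prompt. A sketch, again assuming a hypothetical query_llm() helper rather than any particular API (it returns a placeholder string so the snippet runs):

    def query_llm(prompt: str) -> str:
        # Hypothetical stand-in for a real model call.
        return "<model completion>"

    problem = ("Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
               "Each can has 3 tennis balls. How many tennis balls does he have now?")

    # Plain zero-shot prompt.
    baseline = query_llm("Q: " + problem + "\nA:")

    # Zero-shot chain-of-thought: the identical prompt plus a reasoning cue,
    # which reported experiments found noticeably improves arithmetic word problems.
    with_reasoning = query_llm("Q: " + problem + "\nA: Let's think step by step.")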
It's a language model; why would we expect it to do math, or try to somehow shoehorn math into the model? Do the language centers of our brain do math?
If something approximating AGI is going to happen, it's going to be a lot of models tied together with an executive function to recognize and send things to the area that's good at working with them.
> It's a language model; why would we expect it to do math or try to somehow shoehorn math into the model?
Language models can do math, or anyway arithmetic. That's because language models are trained to predict the next token in a sequence and an arithmetic operation can be represented as a sequence of tokens.
The only problem is that language models are crap at arithmetic because they can only predict the next token in a sequence. That's enough to guess at the answer of an arithmetic problem some of the time but not enough to solve any arithmetic problem all of the time.
More generally, the answer to your question is in the same Figure 3.10 I've referenced above. OpenAI (and others) have claimed that their large language models can do arithmetic. So then people tested the claim and found it to be a bag of old cobblers.
Hence the article above. Nobody's trying to "shoehorn" anything anywhere. It's just something that language models can do, albeit badly.
Right, but what you're describing is 'not being able to do math'. Like, if I've memorized a multiplication table and can give you any result that's on the table but can't multiply anything that wasn't on the table, I can't do multiplication.
It depends on how you see it. I agree with you, generally, but in the limit, if you memorised all possible instances of multiplication, then yes, you could certainly be said to know multiplication.
I've not just come up with that off the top of my head, either. In PAC-Learning (what we have in terms of theory in machine learning), a "concept" (e.g. multiplication) is a set of instances, and a learning system is said to learn a concept if it can correctly label each of a set of testing instances by membership in the target concept, with an arbitrarily small probability of error. Trivially, a learner that has memorised every instance of a target concept can be said to have learned the concept. All this is playing fast and loose with PAC-Learning terminology for the sake of simplification.
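For reference, the criterion being paraphrased is roughly this: for every target concept c, distribution D over instances, and parameters \varepsilon, \delta \in (0, 1), the learner, given enough labelled samples, must output a hypothesis h with

    \Pr\big[\, \mathrm{err}_D(h) \le \varepsilon \,\big] \;\ge\; 1 - \delta,
    \qquad\text{using } m \ge \mathrm{poly}(1/\varepsilon,\ 1/\delta,\ \mathrm{size}(c)) \text{ samples,}

and a lookup table that has memorised every instance satisfies this trivially whenever the instance space is finite.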
The problem of course is that some concepts have infinite sets of instances, and that is the case with arithmetic. On the other hand, it's maybe a little disingenuous to require a machine learning system to be able to represent infinite arithmetic since there is no physical computer that can do that, either.
Anyway that's how the debate goes on these things. I'm on the side that says that if you want to claim your system can do arithmetic, you have to demonstrate that it has something that we can all agree is a recognisable representation of the rules of arithmetic, as we understand them. For instance, the axioms of Peano arithmetic. Which though is a bit unfair for deep learning systems that can't "show their work" in this way.
What are some (non-nefarious) applications of generative language models that produce language which isn't constrained by some sort of rationality or directed by some sort of high-level goal?
The point isn't the math. The point is that, in math and similar disciplines, it's harder to get away with producing mostly undirected gibberish that happens to have some imputed meaning. The point is "use language to do something where it's easy to verify correctness and generating infinite amounts of synthetic data is trivial"
If a language model can't even do high school algebra, then I have a lot less confidence that it will ever be useful for customer service applications or any other number of potential applications outside of propaganda, advertising, and spam.
But if it's rational and has a sense of truth, then it's AGI. Which I don't think is impossible or even unattainable within a reasonable amount of time, but we're .001% of the way there, not 50% or 75%.
These models are fascinating, but the problem 'a lot of the things this model generates lack any semantic meaning' is inherent and likely insurmountable without connecting the model to other, far more complex models that haven't been built yet.
We are at the level where our models can consistently generate blocks of text with full sentences in them that make grammatical sense. Which is pretty cool.
But the next step is being able to consistently generate full sentences that make grammatical sense and usefully convey information. And while the current models do that a lot of the time, they don't do that all of the time because they don't and can't know the difference without essentially being a different thing. Because to do that consistently, we need an "understanding what things mean" model. Which is many orders of magnitude larger and more difficult than a text generator.
Language models aren't actually terrible at math: the Minerva paper provides a devastating counterexample to that claim. It will soon be replaced by more powerful linguistic-mathematical systems. Within the next twelve months we may well see 100% performance on all major benchmarks.