How so? Don Knuth wrote about his experience with ChatGPT. It was submitted to HN and made it to the front page. Someone saw this and decided to submit the same questions to GPT-4 and posted the results. This seems like a perfectly normal sequence of events.
Performance-wise, maybe, but mergesort is clearly the most elegant/beautiful sorting algorithm. Nothing tricky going on, just a couple sorted lists being merged. Plus everyone loves a stable sort.
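For what it's worth, the whole algorithm really is just that merge step applied recursively - a rough from-memory Python sketch, nothing optimized, just to show how little is going on:

    def merge(a, b):
        # Merge two already-sorted lists; taking from `a` on ties keeps the sort stable.
        out, i, j = [], 0, 0
        while i < len(a) and j < len(b):
            if a[i] <= b[j]:
                out.append(a[i]); i += 1
            else:
                out.append(b[j]); j += 1
        return out + a[i:] + b[j:]

    def mergesort(xs):
        if len(xs) <= 1:
            return xs
        mid = len(xs) // 2
        return merge(mergesort(xs[:mid]), mergesort(xs[mid:]))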
Beauty is in the eye of the beholder. I look no further than bubble sort -- it is simple enough I can recite it straight away should someone wake me up at midnight.
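For what it's worth, the midnight version fits in a few lines of Python (a from-memory sketch, nothing clever):

    def bubble_sort(xs):
        # Sweep the list, swapping adjacent out-of-order pairs,
        # until a full pass makes no swaps.
        swapped = True
        while swapped:
            swapped = False
            for i in range(len(xs) - 1):
                if xs[i] > xs[i + 1]:
                    xs[i], xs[i + 1] = xs[i + 1], xs[i]
                    swapped = True
        return xs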
Worth noting also that, while asking Bing chat to "Tell me what Donald Knuth says to Stephen Wolfram about chatGPT" doesn't (yet) produce exactly the right result, it produced the following answer when asked what Donald Knuth says about chatGPT:
> Donald Knuth, a computer scientist and mathematician known for his contributions to the field of computer programming, particularly in the area of algorithms and data structures, has expressed some skepticism about the potential of artificial intelligence to achieve true human-level intelligence and creativity[1]. He once conducted an experiment with chatGPT where he posed 20 questions to it and analyzed its responses[1]. Is there anything specific you would like to know about his views on GPT?
I'd be curious to know if someone could get a more "valiant effort" version of those first two questions with some prompt engineering. E.g. if it were asked to roleplay a conversation, with the proper disclaimers, to override its objection that it doesn't know what they actually think.
Did it know that before the last LLM failure was posted on Twitter or Hackernews? Trawling tech media for LLM failures can be assumed to be part of the "human feedback".
Yes, the models are not constantly learning. They only update their knowledge when they are retrained, which happens pretty infrequently (I think the base GPT models have not been retrained, but the chat layers on top might have been).
Dumb is the last adjective I would use to describe Knuth, even if you believe that becoming old makes you dumb, like you clearly do.
My advice to you is to never dismiss anyone's opinion just for being old. And I hope you lose your ingrained ageism before you become old yourself, otherwise you'll find old age intolerable.
How would you get the correct number? I just did two Google searches and can't find the correct answer anywhere in the first page of results ("Novel The Haj chapters" and "Novel The Haj chapter list"). Even looking in the "look inside" preview on the Penguin Randomhouse website doesn't help because it apparently doesn't have a table of contents. I'm not surprised ChatGPT doesn't know and to me the only bad thing is that it's hallucinating an answer instead of admitting it doesn't know.
> And one perfectly reasonable way of interpreting that bit of raw text is that the answer to "How many chapters are in The Haj by Leon Uris?" is "11".
Only if you can write a sonnet that is also a haiku!
Absolutely - you don't really need a chat agent to google things for you unless it's way better at googling than you are. And right now it grabs the first couple of results for the first search it thinks of and mindlessly summarizes them - I can do that myself, thanks.
When I try this in GPT-4 I don't get a hallucination: "I'm sorry, but as an AI with a knowledge cut-off in September 2021, I can't provide specific information about the number of chapters in "The Haj" by Leon Uris. This book, like many novels, is not primarily structured by chapters and its sections may vary based on the edition of the book. You can easily find this information by checking the table of contents in your copy of the book." (I'm aware that every time you use it the answer is different.)
Technically it's just a really good autocomplete, whose factual database is a side effect of stringing together contextually correct tokens. By itself it is entirely incapable of knowing when it is wrong, despite possibly generating sentences apologizing for being wrong when told it was wrong.
I don't think it's obviously solvable. All current approaches are plainly incapable of introspection. These GPTs don't understand their own "minds" half as well as we understand them, and we don't understand them very well.
On the left side, if you click on "Chapters Summary and Analysis", it gives a breakdown of the book into 5 parts with varying chapter counts:
Part 1 Chapters 1-20
Part 2 Chapters 1-16
Part 3 Chapters 1-10
Part 4 Chapters 1-17
Part 5 Chapters 1-14
Giving a total of 20+16+10+17+14 = 77 chapters
OTOH, I tried with Bing/Creative, telling it to use this link, and it still failed. Perhaps because you need to click on the "summary and analysis" section to expand it to show the info. It seems there is room for web retrieval-augmented LLMs like Bing to improve here and be a bit more agentic.
Interestingly, Knuth's own answer to the question has a typo and refers to the book as having "four" chapters, while going on to give the chapter counts as above for all five. Something to confuse future GPTs when the training set includes this, perhaps!
I would rate a person who provides no sentence at all as performing significantly better, and I suspect most people could pretty quickly come up with something.
> I would rate a person who provides no sentence at all as performing significantly better
Why?
> I suspect most people could pretty quickly come up with something
It only takes 60 seconds to test that on yourself. It's not that easy to come up with something of similar length to ChatGPT's answer that also sounds somewhat natural/sensible.
Then it seems we don't disagree on anything concrete. You're just using a different rating system than me when I judge it as impressive compared to what an average person would produce in 60 seconds.
Not sure if this is a general principle of yours. If ChatGPT were able to write a 1000-word essay using all 5-letter words except for a single mistake, would you still find it unimpressive? Do you think a tool or person who makes minor mistakes isn't useful? Or only when a tool/person makes major mistakes?
ChatGPT wasn't asked to be impressive, it was asked to write a single sentence containing only five-letter words. I think that a tool that is unreliable is significantly less useful than a tool that is reliable, and that, all other things being equal, a tool that fails in difficult-to-verify ways is less reliable than one that fails in easy-to-verify ways.
>I would rate a person who provides no sentence at all as performing significantly better
The logic failure in the above statement is probably worse than the logic failure of not being able to spontaneously compose a phrase with just 5-letter words - and slipping in one or two with a higher letter count.
>I suspect most people could pretty quickly come up with something
You'd be very surprised then. Most people fail at even more basic tasks.
Heck, most candidate programmers fail at fizz-buzz (which is not that much more difficult than the above).
The idea that making a mistake but otherwise fulfilling most of the task is worse than failing to perform any part of it.
Especially in the context of "evaluating the performance of something".
Let's expand this a little to make it even more evident: if the task was "make a paragraph of 100 words using only 5 letter words" and an AI couldn't produce anything at all, whereas another came up with a paragraph of 100 words, except a couple of them had 6 or 4 letters, it would make absolutely no sense to rate the first as "better" than the second in performing the task.
As for understanding the task, the latter exhibits an understanding of it (since it produced a paragraph, and most of the words it used fit the criteria, which wouldn't happen if it chose them randomly); it just made a couple of mistakes (the kind humans could easily make too in such a task). For the former we can't even be sure it understood the task at all.
We don't rate humans that way on performing tasks either (as if getting it less than perfect were worse than not doing it at all). Even math tests at the university level give credit for the approach and any partial results in the right direction; they don't just mark it 0 if there's an error, nor give a higher mark to students who produced nothing.
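And the partial credit here is easy to mechanize, if anyone wants to argue about how partial it is - a throwaway Python sketch of the kind of check I mean (my own helper, splitting on whitespace and stripping punctuation):

    import re

    def five_letter_score(text):
        # Fraction of words that are exactly five letters long
        # (punctuation stripped, so "great." still counts as "great").
        words = [re.sub(r"[^A-Za-z]", "", w) for w in text.split()]
        words = [w for w in words if w]
        return sum(len(w) == 5 for w in words) / len(words)

    print(five_letter_score("Fruit flies like sweet mango juice"))  # 5/6, one slip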
>The idea that making a mistake but otherwise fulfilling most of the task is worse than failing to perform any part of it.
There are many contexts in which correctness is important. In such contexts, an incorrect answer is often worse than an explicit non-answer.
>We don't rate humans that way on performing tasks either (if they got it less than perfect it's worse than not doing it at all). Even math tests at the university
Standardized tests often rate incorrect answers worse than non-answers, though yes a university maths test in particular isn't likely to be that sort of test.
I wasn't clear on how I was using "better". Your example is better in that it fulfills the requirement, but I don't think it's as impressive as ChatGPT's answer. How long would it take to make a sentence that is at least 7 words (that also makes sense, and ideally sounds good)?
This isn't something that can be usefully discussed. "Word" has a vague enough definition that a contraction can validly be considered one or two words. If you look to linguistics, you'll just find that it uses specialized terms with stricter definitions.
Regardless it's more reasonable for me to say "that's" is a five letter word than it is for the AI to say "spells" is a five letter word.
I tried, this is what I came up with under significant time pressure:
Happy books sound great.
It was very difficult to think of a plural verb with 5 letters, and once I realized that was an issue, I was worried that I wouldn't have enough time to come up with a singular noun that would fit any of the singular verbs that I was considering (reads, seems).
Interestingly, this is the exact same mistake that ChatGPT made! It has "spell" -> "spells", which is a plurality / sentence-correctness mistake.
My sentence is technically correct and could be used plausibly in conversation: "What kind of books do you want to read?" "Happy books sound great."
But it's a pretty weak sentence. Being restricted from articles makes it very difficult to get agreement.
It did get closer. For that type of query you can ask it to check its work, and you can usually triangulate on the correct answer within a single prompt, eventually.
I would be cautious of a Clever Hans effect there. If you repeat the question until you get the right answer you're providing the AI with significant extra information.
No, in a single prompt, you can instruct it to check its work and keep going until it’s right (or at least have it tell you which of the N answers were right or wrong). Essentially chain of thought reasoning.
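Something along these lines works for me on the five-letter-words task (just an illustrative prompt, not a magic formula):

    Write a sentence using only five-letter words. Then list each word
    with its letter count. If any word is not exactly five letters,
    revise the sentence and repeat the check until every word passes.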
What I find amazing about the original exchange is the profound lack of curiosity Knuth demonstrated. Because the model wasn't flawless in performance, he pinned it as a curiosity that was good at grammar and vacuous otherwise, and wasn't interested to hear how it improves. This reminds me of an awful lot of the computing field in this drama as it plays out. People who literally know how implausible any of these feats would have been using traditional approaches immediately discount the entire thing the moment it hallucinates - and it feels like the more deterministic the bent of the person, the more absolutely dismissive they are of what's transpiring in front of us.
These models are doing feats that are stupendous and impossible before their advent. Not just a little bit, but the capability differences are so vast that it’s perhaps not even recognizable by people as being as vast as it is. I am impressed that Wolfram seems to have immediately grasped its significance and is running with it.
This gist demonstrates that essentially every single flaw was addressed. But that Knuth apparently doesn't know / care, months after GPT-4's introduction, is demonstrative of a different type of personality.
What do you expect? He is perhaps the one person in the world who has most earned the right to take that attitude.
Both Knuth and GPTs are aggregators and presenters of knowledge; Knuth is, however, the antithesis of an LLM.
He has painstakingly spent years making sure that not a single mistake, not even a typo, is in the material he publishes, and he has devoted years to developing better typesetting so he can present his material accurately.
His obsession with accuracy is unparalleled, as are his dedication to and mastery of communication: he explains complex topics precisely and with an approachability that no one else comes close to.
He has strived for perfection all his life and has not been far off the mark. ChatGPT, for all its powers, will never share that ideology,
so I am more surprised that he was complimentary at all, and actually appreciated many of its skills.
That's actually not exactly my point - my point is his lack of curiosity … 3.5's answers were poor but sounded convincing. But his dismissiveness of the potential and future advances bothered me.
He is 85! I would hope to be that disciplined about what I can spend time on at that age.
He was curious enough to spend some time on it, and was worried it would sink more of his time with all the sub-problems it presented, so he specifically asks Stephen Wolfram to disengage on this.
He talks about his preference for working with what is authentic and trustworthy.
Maybe a younger Knuth would have spent more time on it, but I think that's not all that likely, really.
This is simply not an area of interest for him; he does truly understand the impact and potential - when he talks about novelists not capturing precursors to the singularity, and how millions of people have access to 0.01% intelligence for free.
I don't think he is dismissive of its potential and future; he is not working on everything that could change the world of computing, just his areas of interest.
Perhaps you are (I certainly am) disappointed that someone of Knuth's stature is not going to spend time on an emerging field, and that's what really bothers us.
I can't comprehend this comment. Knuth's commentary was glowing praise for the AI's thinking ability (and none of the "it's not AI" BS that is so popular), plus a statement that he believes accuracy is more important than raw power, so he wants "you" to work on that.
Knuth commented on GPT 4 at the start, and complimented its power and correctness at the end.
I much prefer the attitude of the chap that made the video "GPT 4 is smarter than you think" https://youtu.be/wVzuvf9D9BU
Instead of nit-picking flaws in what is a very early iteration of a revolutionary technology, he instead immediately started exploring ways of making it better and more useful.
Even with minimal effort that was essentially just copy-pasting some text around, he was able to show that the current way we use LLMs like GPT 4 is not the be-all and end-all of this type of technology.
Just in the last two weeks(!), I've read about the following still-experimental methods for enhancing LLMs:
1. Plugging in "calculators" like Wolfram Alpha.
2. Adding vision input so they can understand equations, graphs, etc...
3. Filtering the output probability vector for certain allowed terms only ("YES", "NO", "MAYBE"), making them more useful in programmatically-invoked scenarios.
4. Similarly, filtering the output token list for syntax validity, such as "valid JSON", "valid XML", etc... That is, instead of a purely random selection between the "top-n" output tokens, only valid tokens can be chosen, based on contextual syntax (see the sketch after this list).
5. Storing embeddings in a vector database, giving LLMs medium-term memory, and the ability to index and reference sources precisely.
6. Efficient fine-tuning through Low-Rank Adaptation (LoRA), which allows desktop GPUs to tune a model overnight! This overcomes the "stale long-term memory" issue of ChatGPT, which only knows things up to September 2021. It could now read the news daily and "keep up".
7. External script harnesses that run multiple LLMs in parallel, with different prompts and/or different system messages. Some optimised for "idea generation", some optimised for "task completion", and then finally models tuned for "review and verification". Almost like a human team, multiple ideas can be generated, merged, reviewed, planned out, and then actioned. Check out "smol developer", which utilises Anthropic's 100K context window for this: https://www.youtube.com/watch?v=UCo7YeTy-aE
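To make #3 and #4 concrete: the trick is just masking the logits before sampling so that only tokens from an allowed set can ever be picked. A toy numpy sketch (made-up vocab, not any particular library's API; for #4 the allowed set would be recomputed at each step from the grammar/parser state):

    import numpy as np

    def constrained_sample(logits, vocab, allowed={"YES", "NO", "MAYBE"}):
        # Knock out (-inf in log space) every vocab entry outside the allowed set,
        # then sample from the renormalized distribution over what remains.
        mask = np.array([0.0 if tok in allowed else -np.inf for tok in vocab])
        masked = logits + mask
        probs = np.exp(masked - masked[np.isfinite(masked)].max())
        probs /= probs.sum()
        return vocab[np.random.choice(len(vocab), p=probs)]

    vocab = ["YES", "NO", "MAYBE", "Paris", "42"]
    logits = np.array([1.0, 0.5, 0.1, 3.0, 2.0])
    print(constrained_sample(logits, vocab))  # never "Paris" or "42"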
This is just the beginning. Chat GPT 4 hasn't even been available for 3 months yet, and practically all of the above experimentation has been done with weaker models because GPT 4 still doesn't have generally-available API access! Similarly, the 32K context window version of the GPT 4 model isn't available to anyone except a lucky few.
What will 2024 bring!? Heck... what will H2 2023 bring?
100% agree - the magic comes when you constrain, inform, and integrate them in a feedback cycle with various multimodal inputs and classical optimization, solvers, agents, inference engines, etc. The criticism seems to be that this solution to a problem space doesn’t solve all problem spaces we’ve already done a good job solving and ignoring the fact it solves the spaces we have done a crap job solving. The fact it’s so powerful by itself is amazing. As we integrate it tightly with all the other techniques of the last 80 years of computing the emergent abilities will be mind-blowing. What baffles me is how few people seem to see it clearly.
And if you look a few years into the future: What will happen in five years from now? Isn't it plausible that we will have another revolution like LLMs? What will they be able to do? Or rather, what won't they be able to do?
What happens if we get strongly superhuman intelligence in just a few years? Is that really so implausible?
I don’t know Knuth. I understand LLMs for precisely what they are, how they’re built, the math behind them, the limits of what they’re doing, and I don’t overestimate the illusion. However, while I see people overestimating them, I think they’re extrapolating the current state to a state where its limits are restricted and augmented with other techniques and models that address their shortcomings. Lack of agency? We have agent techniques. Lack of consistency with reality? We have information retrieval and semantic inference systems. LLMs bring an unreasonably powerful ability to semantically interpret in a space of ambiguity and approximate enough reasoning and inference to tie together all the pieces we’ve built into an ensemble model that’s so close to AGI that it likely doesn’t matter. People look at LLMs and shake their head failing to realize it’s a single model and single technique that we haven’t even attempted to augment and fail to realize that it’s even possible to augment and constrain LLM with other techniques to address their non trivial failings.
Well you should before taking unwarranted potshots at the man. He's done more for humanity than you or I ever will, eh?
Anyway, you do sound like you know about LLMs, so apologies for that bit.
> People look at LLMs and shake their head failing to realize it’s a single model and single technique that we haven’t even attempted to augment and fail to realize that it’s even possible to augment and constrain LLM with other techniques to address their non trivial failings.
I doubt Knuth is doing that, rather I think the whole thing is orthogonal to his life's work. FWIW, I would love to know his thoughts after reading the GPT4 version of the answers to his questions, eh?
- - - - - -
> I think they’re extrapolating the current state to a state where its limits are restricted and [not] augmented with other techniques and models that address their shortcomings.
I think you might have dropped a negation in that sentence?
> Lack of agency? We have agent techniques. Lack of consistency with reality? We have information retrieval and semantic inference systems. LLMs bring an unreasonably powerful ability to semantically interpret in a space of ambiguity and approximate enough reasoning and inference to tie together all the pieces we’ve built into an ensemble model that’s so close to AGI that it likely doesn’t matter.
I agree! I've been saying for a few minutes now that we'll connect these LLMs to empirical feedback devices and they'll become scientists. Schmidhuber says his goal is "to create an automatic scientist and then retire.", eh?
(FWIW I think there are serious metaphysical ramifications of the pseudo- vs. real- AGI issue, but this isn't the forum for that.)
Thank you for specifying ChatGPT-4. So many commenters on the web say they used GPT4 without specifying if they're using the ChatGPT version. ChatGPT-4 is specifically aligned for answering questions better than the base GPT4 model.
It makes sense to call the foundation model GPT-4, like for the previous GPT versions. The fine-tunings are not where its core capabilities come from. Bing is also "a" GPT-4, just with different fine-tuning.
I would not be surprised if these questions become some form of canonical test for future language models.
Obviously, being the work of Knuth, they are extraordinarily insightful in peeling back the first layer of the answer and providing insight into the underlying properties of both the model itself and the dataset on which it was trained. It also tests the ability to compute (not recite) very specific facts (e.g. when the sun will be directly above Japan), so it checks whether subroutines and ephemerides specific to this type of data exist.
But beyond the obvious technical merit - there is an alluring appeal to basing our tests on those whom we respect. I used a similar - but far less sophisticated - set of questions when first exploring ChatGPT. But nobody will be drawn to Dotan Cohen's language model benchmarks - rightfully so. The name Knuth has such reverence in the field that I foresee this test, and variations on it to prevent rigging, becoming a canonical test of language models.
It's fed sub-word tokens, not letters (even though it can split a word into letters), and apparently struggles with counting in general. No doubt some of the things it struggles with could be improved with targeted training, but others may require architectural changes.
Imagine yourself trying to use only 5 letter words if you can't see how many letters are actually in each word, and had to rely on a hodgepodge of other means to try to figure it out!
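If you want to see what it actually "sees", OpenAI's tiktoken tokenizer makes it easy (the exact splits vary by word and encoding, but the point is the model gets opaque sub-word chunks, not letters):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # the encoding used by the GPT-4 family
    for word in ["quick", "brown", "spells", "mergesort"]:
        ids = enc.encode(word)
        print(word, "->", [enc.decode([i]) for i in ids])
    # Counting letters from these chunks is an extra inference step,
    # not something the model can just read off.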
Based on my experiments it usually does get it right (18 correct answers out of 20 attempts), and the failures I got were similar to this one: a single six-letter word in an otherwise correct sentence.
Sam and friends must be giggling all the way to the bank: they have a service that 'probably' gives the correct result and paying customers are happy to retry until it gets it right.
> Sam and friends must be giggling all the way to the bank
It's true, but for another reason: they yoinked it away from the nerds who were baited to work on OpenAI because those nerds thought the way the company's name was spelled meant something about how it would behave. It reminds me of how some people treat software labels like 'alpha' as if they had objective meaning with consequences in reality.
Both the first and last words have repeating letters, so they fail under that interpretation too. There would have to be a bizarre interpretation that consecutive-repeating letters are counted as one, but non-consecutive are counted separately, for its response to be considered correct.
An AI aware of how to optimally answer questions put to it would find the least objectionable interpretation when one is a subset of the other. It also failed by not constructing a simpler sentence, like subject-verb-object or subject-verb-adjective-object, since its limitations related to letters and tokens, and its failure to double check its answers before output, mean it can make errors. The more it writes, the more chance it has of making an error.
Got the 'five character word' question wrong. Admittedly I also thought it was correct at first glance but then went back when someone called it out in another comment.
I also counted 4 errors in the sentence, not 3. "no help" should be "any help". This might just be conventionally wrong, not technically wrong I suppose.