How so? Don Knuth wrote about his experience with ChatGPT. It was submitted to HN and made it to the front page. Someone saw this and decided to submit the same questions to GPT-4 and posted the results. This seems like a perfectly normal sequence of events.
Performance-wise, maybe, but mergesort is clearly the most elegant/beautiful sorting algorithm. Nothing tricky going on, just a couple sorted lists being merged. Plus everyone loves a stable sort.
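For what it's worth, the whole algorithm really is just that merge step applied recursively - a rough from-memory Python sketch, nothing optimized, just to show how little is going on:

    def merge(a, b):
        # Merge two already-sorted lists; taking from `a` on ties keeps the sort stable.
        out, i, j = [], 0, 0
        while i < len(a) and j < len(b):
            if a[i] <= b[j]:
                out.append(a[i]); i += 1
            else:
                out.append(b[j]); j += 1
        return out + a[i:] + b[j:]

    def mergesort(xs):
        if len(xs) <= 1:
            return xs
        mid = len(xs) // 2
        return merge(mergesort(xs[:mid]), mergesort(xs[mid:]))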
Beauty is in the eye of the beholder. I look no further than bubble sort -- it is simple enough I can recite it straight away should someone wake me up at midnight.
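For what it's worth, the midnight version fits in a few lines of Python (a from-memory sketch, nothing clever):

    def bubble_sort(xs):
        # Sweep the list, swapping adjacent out-of-order pairs,
        # until a full pass makes no swaps.
        swapped = True
        while swapped:
            swapped = False
            for i in range(len(xs) - 1):
                if xs[i] > xs[i + 1]:
                    xs[i], xs[i + 1] = xs[i + 1], xs[i]
                    swapped = True
        return xs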
Worth noting also that, while asking Bing chat to "Tell me what Donald Knuth says to Stephen Wolfram about chatGPT" doesn't (yet) produce exactly the right result, it produced the following answer when asked what Donald Knuth says about chatGPT:
> Donald Knuth, a computer scientist and mathematician known for his contributions to the field of computer programming, particularly in the area of algorithms and data structures, has expressed some skepticism about the potential of artificial intelligence to achieve true human-level intelligence and creativity[1]. He once conducted an experiment with chatGPT where he posed 20 questions to it and analyzed its responses[1]. Is there anything specific you would like to know about his views on GPT?
I'd be curious to know if someone could get a more "valiant effort" version of those first two questions with some prompt engineering. E.g. if it were asked to roleplay a conversation, with the proper disclaimers, to override its objection that it doesn't know what they actually think.
Did it know that before the last LLM failure was posted on Twitter or Hackernews? Trawling tech media for LLM failures can be assumed to be part of the "human feedback".
Yes, the models are not constantly learning. They only update their knowledge when they are retrained, which happens pretty infrequently (I think the base GPT models have not been retrained, but the chat layers on top might have been).
Dumb is the last adjective I would use to describe Knuth, even if you believe that becoming old makes you dumb, like you clearly do.
My advice to you is to never dismiss anyone's opinion just for being old. And I hope you lose your ingrained ageism before you become old yourself, otherwise you'll find old age intolerable.
How would you get the correct number? I just did two Google searches and can't find the correct answer anywhere in the first page of results ("Novel The Haj chapters" and "Novel The Haj chapter list"). Even looking in the "look inside" preview on the Penguin Randomhouse website doesn't help because it apparently doesn't have a table of contents. I'm not surprised ChatGPT doesn't know and to me the only bad thing is that it's hallucinating an answer instead of admitting it doesn't know.
> And one perfectly reasonable way of interpreting that bit of raw text is that the answer to "How many chapters are in The Haj by Leon Uris?" is "11".
Only if you can write a sonnet that is also a haiku!
Absolutely - you don't really need a chat agent to google things for you unless it's way better at googling than you are. And right now it grabs the first couple of results for the first search it thinks of and mindlessly summarizes them - I can do that myself, thanks.
When I try this in GPT-4 I don't get a hallucination: "I'm sorry, but as an AI with a knowledge cut-off in September 2021, I can't provide specific information about the number of chapters in "The Haj" by Leon Uris. This book, like many novels, is not primarily structured by chapters and its sections may vary based on the edition of the book. You can easily find this information by checking the table of contents in your copy of the book." (I'm aware that every time you use it the answer is different.)
Technically it's just a really good autocomplete, whose factual database is a side effect of stringing together contextually correct tokens. By itself it is entirely incapable of knowing when it is wrong, despite possibly generating sentences apologizing for being wrong when told it was wrong.
I don't think it's obviously solvable. All current approaches are plainly incapable of introspection. These GPTs don't understand their own "minds" half as well as we understand them, and we don't understand them very well.
On the left side, if you click on "Chapters Summary and Analysis", it gives a breakdown of the book into 5 parts with varying chapter counts:
Part 1 Chapters 1-20
Part 2 Chapters 1-16
Part 3 Chapters 1-10
Part 4 Chapters 1-17
Part 5 Chapters 1-14
Giving a total of 20+16+10+17+14 = 77 chapters
OTOH, I tried with Bing/Creative, telling it to use this link, and it still failed. Perhaps because you need to click on the "summary and analysis" section to expand it to show the info. It seems there is room for web retrieval-augmented LLMs like Bing to improve here and be a bit more agentic.
Interestingly, Knuth's own answer to the question has a typo and refers to the book as having "four" chapters, while going on to give the chapter counts as above for all five. Something to confuse future GPTs when the training set includes this, perhaps!
I would rate a person who provides no sentence at all as performing significantly better, and I suspect most people could pretty quickly come up with something.
> I would rate a person who provides no sentence at all as performing significantly better
Why?
> I suspect most people could pretty quickly come up with something
It only takes 60 seconds to test that on yourself. It's not that easy to come up with something of similar length to ChatGPT's answer that also sounds somewhat natural/sensible.
Then it seems we don't disagree on anything concrete. You're just using a different rating system than me when I judge it as impressive compared to what an average person would produce in 60 seconds.
Not sure if this is a general principle of yours. If ChatGPT were able to write a 1000-word essay using all 5-letter words except for a single mistake, would you still find it unimpressive? Do you think a tool or person who makes minor mistakes isn't useful? Or only when a tool/person makes major mistakes?
ChatGPT wasn't asked to be impressive, it was asked to write a single sentence containing only five-letter words. I think that a tool that is unreliable is significantly less useful than a tool that is reliable, and that, all other things being equal, a tool that fails in difficult-to-verify ways is less reliable than one that fails in easy-to-verify ways.
>I would rate a person who provides no sentence at all as performing significantly better
The logic failure in the above statement is probably worse than the logic failure of not being able to spontaneously compose a phrase with just 5-letter words - and slipping in one or two with a higher letter count.
>I suspect most people could pretty quickly come up with something
You'd be very surprised then. Most people fail at even more basic tasks.
Heck, most candidate programmers fail at fizz-buzz (which is not that much more difficult than the above).
The idea that making a mistake but otherwise fulfilling most of the task is worse than failing to perform any part of it.
Especially in the context of "evaluating the performance of something".
Let's expand this a little to make it even more evident: if the task was "make a paragraph of 100 words using only 5 letter words" and an AI couldn't produce anything at all, whereas another came up with a paragraph of 100 words, except a couple of them had 6 or 4 letters, it would make absolutely no sense to rate the first as "better" than the second in performing the task.
As for understanding the task, the latter exhibits an understanding of it (since it produced a paragraph, and most of the words it used fit the criteria, which wouldn't happen if it chose them randomly); it just made a couple of mistakes (the kind humans could easily make too in such a task). For the former we can't even be sure it understood the task at all.
We don't rate humans that way on performing tasks either (as if getting it less than perfect were worse than not doing it at all). Even math tests at the university level give credit for the approach and any partial results in the right direction; they don't just mark it 0 if there's an error, nor give a higher mark to students who produced nothing.
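And the partial credit here is easy to mechanize, if anyone wants to argue about how partial it is - a throwaway Python sketch of the kind of check I mean (my own helper, splitting on whitespace and stripping punctuation):

    import re

    def five_letter_score(text):
        # Fraction of words that are exactly five letters long
        # (punctuation stripped, so "great." still counts as "great").
        words = [re.sub(r"[^A-Za-z]", "", w) for w in text.split()]
        words = [w for w in words if w]
        return sum(len(w) == 5 for w in words) / len(words)

    print(five_letter_score("Fruit flies like sweet mango juice"))  # 5/6, one slip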
>The idea that making a mistake but otherwise fulfilling most of the task is worse than failing to perform any part of it.
There are many contexts in which correctness is important. In such contexts, an incorrect answer is often worse than an explicit non-answer.
>We don't rate humans that way on performing tasks either (if they got it less than perfect it's worse than not doing it at all). Even math tests at the university
Standardized tests often rate incorrect answers worse than non-answers, though yes a university maths test in particular isn't likely to be that sort of test.
I wasn't clear on how I was using "better". Your example is better in that it fulfills the requirement, but I don't think it's as impressive as ChatGPT's answer. How long would it take to make a sentence that is at least 7 words (that also makes sense, and ideally sounds good)?
This isn't something that can be usefully discussed. "Word" has a vague enough definition that a contraction can validly be considered one or two words. If you look to linguistics, you'll just find that it uses specialized terms with stricter definitions.
Regardless it's more reasonable for me to say "that's" is a five letter word than it is for the AI to say "spells" is a five letter word.
I tried, this is what I came up with under significant time pressure:
Happy books sound great.
It was very difficult to think of a plural verb with 5 letters, and once I realized that was an issue, I was worried that I wouldn't have enough time to come up with a singular noun that would fit any of the singular verbs that I was considering (reads, seems).
Interestingly, this is the exact same mistake that ChatGPT made! It has "spell" -> "spells", which is a plurality / sentence-correctness mistake.
My sentence is technically correct and could be used plausibly in conversation: "What kind of books do you want to read?" "Happy books sound great."
But it's a pretty weak sentence. Being restricted from articles makes it very difficult to get agreement.
It did get closer. For that type of query you can ask it to check its work, and you can usually triangulate on the correct answer within a single prompt, eventually.
I would be cautious of a Clever Hans effect there. If you repeat the question until you get the right answer you're providing the AI with significant extra information.
No, in a single prompt, you can instruct it to check its work and keep going until it’s right (or at least have it tell you which of the N answers were right or wrong). Essentially chain of thought reasoning.
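Something along these lines works for me on the five-letter-words task (just an illustrative prompt, not a magic formula):

    Write a sentence using only five-letter words. Then list each word
    with its letter count. If any word is not exactly five letters,
    revise the sentence and repeat the check until every word passes.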
What I find amazing about the original exchange is the profound lack of curiosity Knuth demonstrated. Because the model wasn't flawless in performance, he pinned it as a curiosity that was good at grammar and vacuous otherwise, and wasn't interested to hear how it improves. This reminds me of an awful lot of the computing field in this drama as it plays out. People who literally know how implausible any of these feats would have been using traditional approaches immediately discount the entire thing the moment it hallucinates - and it feels like the more deterministic the bent of the person, the more absolutely dismissive they are of what's transpiring in front of us.
These models are doing feats that are stupendous and impossible before their advent. Not just a little bit, but the capability differences are so vast that it’s perhaps not even recognizable by people as being as vast as it is. I am impressed that Wolfram seems to have immediately grasped its significance and is running with it.
This gist demonstrates that essentially every single flaw was addressed. But that Knuth apparently doesn't know / care, months after GPT-4's introduction, is demonstrative of a different type of personality.
What do you expect? He is perhaps the one person in the world who has most earned the right to take that attitude.
Both Knuth and GPTs are aggregators and presenters of knowledge; Knuth is, however, the antithesis of an LLM.
He has painstakingly spent years making sure that not a single mistake, not even a typo, is in the material he publishes, and he has devoted years to developing better typesetting so he can present his material accurately.
His obsession with accuracy is unparalleled, as are his dedication to and mastery of communication: he explains complex topics precisely and with an approachability that no one else comes close to.
He has strived for perfection all his life and has not been far off the mark. ChatGPT, for all its powers, will never share that ideology,
so I am more surprised that he was complimentary at all, and actually appreciated many of its skills.
That's actually not exactly my point - my point is his lack of curiosity … 3.5's answers were poor but sounded convincing. But his dismissiveness of the potential and future advances bothered me.
He is 85! I would hope to be that disciplined about what I can spend time on at that age.
He was curious enough to spend some time on it, and was worried it would sink more of his time with all the sub-problems it presented, so he specifically asks Stephen Wolfram to disengage on this.
He talks about his preference for working with what is authentic and trustworthy.
Maybe a younger Knuth would have spent more time on it, but I think that's not all that likely, really.
This is simply not an area of interest for him; he does truly understand the impact and potential - when he talks about novelists not capturing precursors to the singularity, and how millions of people have access to 0.01% intelligence for free.
I don't think he is dismissive of its potential and future; he is not working on everything that could change the world of computing, just his areas of interest.
Perhaps you are (I certainly am) disappointed that someone of Knuth's stature is not going to spend time on an emerging field, and that's what really bothers us.
I can't comprehend this comment. Knuth's commentary was glowing praise for the AI's thinking ability (and none of the "it's not AI" BS that is so popular), plus a statement that he believes accuracy is more important than raw power, so he wants "you" to work on that.
Knuth commented on GPT 4 at the start, and complimented its power and correctness at the end.
I much prefer the attitude of the chap that made the video "GPT 4 is smarter than you think" https://youtu.be/wVzuvf9D9BU
Instead of nit-picking flaws in what is a very early iteration of a revolutionary technology, he instead immediately started exploring ways of making it better and more useful.
Even with minimal effort that was essentially just copy-pasting some text around, he was able to show that the current way we use LLMs like GPT 4 is not the be-all and end-all of this type of technology.
Just in the last two weeks(!), I've read about the following still-experimental methods for enhancing LLMs:
1. Plugging in "calculators" like Wolfram Alpha.
2. Adding vision input so they can understand equations, graphs, etc...
3. Filtering the output probability vector for certain allowed terms only ("YES", "NO", "MAYBE"), making them more useful in programmatically-invoked scenarios.
4. Similarly, filtering the output token list for syntax validity, such as "valid JSON", "valid XML", etc... That is, instead of a purely random selection between the "top-n" output tokens, only valid tokens can be chosen, based on contextual syntax (see the sketch after this list).
5. Storing embeddings in a vector database, giving LLMs medium-term memory, and the ability to index and reference sources precisely.
6. Efficient fine-tuning through Low-Rank Adaptation (LoRA), which allows desktop GPUs to tune a model overnight! This overcomes the "stale long-term memory" issue of ChatGPT, which only knows things up to September 2021. It could now read the news daily and "keep up".
7. External script harnesses that run multiple LLMs in parallel, with different prompts and/or different system messages. Some optimised for "idea generation", some optimised for "task completion", and then finally models tuned for "review and verification". Almost like a human team, multiple ideas can be generated, merged, reviewed, planned out, and then actioned. Check out "smol developer", which utilises Anthropic's 100K context window for this: https://www.youtube.com/watch?v=UCo7YeTy-aE
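To make #3 and #4 concrete: the trick is just masking the logits before sampling so that only tokens from an allowed set can ever be picked. A toy numpy sketch (made-up vocab, not any particular library's API; for #4 the allowed set would be recomputed at each step from the grammar/parser state):

    import numpy as np

    def constrained_sample(logits, vocab, allowed={"YES", "NO", "MAYBE"}):
        # Knock out (-inf in log space) every vocab entry outside the allowed set,
        # then sample from the renormalized distribution over what remains.
        mask = np.array([0.0 if tok in allowed else -np.inf for tok in vocab])
        masked = logits + mask
        probs = np.exp(masked - masked[np.isfinite(masked)].max())
        probs /= probs.sum()
        return vocab[np.random.choice(len(vocab), p=probs)]

    vocab = ["YES", "NO", "MAYBE", "Paris", "42"]
    logits = np.array([1.0, 0.5, 0.1, 3.0, 2.0])
    print(constrained_sample(logits, vocab))  # never "Paris" or "42"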
This is just the beginning. Chat GPT 4 hasn't even been available for 3 months yet, and practically all of the above experimentation has been done with weaker models because GPT 4 still doesn't have generally-available API access! Similarly, the 32K context window version of the GPT 4 model isn't available to anyone except a lucky few.
What will 2024 bring!? Heck... what will H2 2023 bring?
100% agree - the magic comes when you constrain, inform, and integrate them in a feedback cycle with various multimodal inputs and classical optimization, solvers, agents, inference engines, etc. The criticism seems to be that this solution to a problem space doesn’t solve all problem spaces we’ve already done a good job solving and ignoring the fact it solves the spaces we have done a crap job solving. The fact it’s so powerful by itself is amazing. As we integrate it tightly with all the other techniques of the last 80 years of computing the emergent abilities will be mind-blowing. What baffles me is how few people seem to see it clearly.
And if you look a few years into the future: What will happen in five years from now? Isn't it plausible that we will have another revolution like LLMs? What will they be able to do? Or rather, what won't they be able to do?
What happens if we get strongly superhuman intelligence in just a few years? Is that really so implausible?
I don’t know Knuth. I understand LLMs for precisely what they are, how they’re built, the math behind them, the limits of what they’re doing, and I don’t overestimate the illusion. However, while I see people overestimating them, I think they’re extrapolating the current state to a state where its limits are restricted and augmented with other techniques and models that address their shortcomings. Lack of agency? We have agent techniques. Lack of consistency with reality? We have information retrieval and semantic inference systems. LLMs bring an unreasonably powerful ability to semantically interpret in a space of ambiguity and approximate enough reasoning and inference to tie together all the pieces we’ve built into an ensemble model that’s so close to AGI that it likely doesn’t matter. People look at LLMs and shake their head failing to realize it’s a single model and single technique that we haven’t even attempted to augment and fail to realize that it’s even possible to augment and constrain LLM with other techniques to address their non trivial failings.
Well you should before taking unwarranted potshots at the man. He's done more for humanity than you or I ever will, eh?
Anyway, you do sound like you know about LLMs, so apologies for that bit.
> People look at LLMs and shake their head failing to realize it’s a single model and single technique that we haven’t even attempted to augment and fail to realize that it’s even possible to augment and constrain LLM with other techniques to address their non trivial failings.
I doubt Knuth is doing that, rather I think the whole thing is orthogonal to his life's work. FWIW, I would love to know his thoughts after reading the GPT4 version of the answers to his questions, eh?
- - - - - -
> I think they’re extrapolating the current state to a state where its limits are restricted and [not] augmented with other techniques and models that address their shortcomings.
I think you might have dropped a negation in that sentence?
> Lack of agency? We have agent techniques. Lack of consistency with reality? We have information retrieval and semantic inference systems. LLMs bring an unreasonably powerful ability to semantically interpret in a space of ambiguity and approximate enough reasoning and inference to tie together all the pieces we’ve built into an ensemble model that’s so close to AGI that it likely doesn’t matter.
I agree! I've been saying for a few minutes now that we'll connect these LLMs to empirical feedback devices and they'll become scientists. Schmidhuber says his goal is "to create an automatic scientist and then retire.", eh?
(FWIW I think there are serious metaphysical ramifications of the pseudo- vs. real- AGI issue, but this isn't the forum for that.)
Thank you for specifying ChatGPT-4. So many commenters on the web say they used GPT4 without specifying if they're using the ChatGPT version. ChatGPT-4 is specifically aligned for answering questions better than the base GPT4 model.
It makes sense to call the foundation model GPT-4, like for the previous GPT versions. The fine-tunings are not where its core capabilities come from. Bing is also "a" GPT-4, just with different fine-tuning.
I would not be surprised if these questions become some form of canonical test for future language models.
Obviously, being the work of Knuth, they are extraordinarily insightful in peeling back the first layer of the answer and providing insight into the underlying properties of both the model itself and the dataset on which it was trained. It also tests the ability to compute (not recite) very specific facts (e.g. when the sun will be directly above Japan), so it checks whether subroutines and ephemerides specific to this type of data exist.
But beyond the obvious technical merit - there is an alluring appeal to basing our tests on those whom we respect. I used a similar - but far less sophisticated - set of questions when first exploring ChatGPT. But nobody will be drawn to Dotan Cohen's language model benchmarks - rightfully so. The name Knuth has such reverence in the field that I foresee this test, and variations on it to prevent rigging, becoming a canonical test of language models.
It's fed sub-word tokens, not letters (even though it can split a word into letters), and apparently struggles with counting in general. No doubt some of the things it struggles with could be improved with targeted training, but others may require architectural changes.
Imagine yourself trying to use only 5 letter words if you can't see how many letters are actually in each word, and had to rely on a hodgepodge of other means to try to figure it out!
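If you want to see what it actually "sees", OpenAI's tiktoken tokenizer makes it easy (the exact splits vary by word and encoding, but the point is the model gets opaque sub-word chunks, not letters):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # the encoding used by the GPT-4 family
    for word in ["quick", "brown", "spells", "mergesort"]:
        ids = enc.encode(word)
        print(word, "->", [enc.decode([i]) for i in ids])
    # Counting letters from these chunks is an extra inference step,
    # not something the model can just read off.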
Based on my experiments it usually does get it right (18 correct answers out of 20 attempts), and the failures I got were similar to this one: a single six-letter word in an otherwise correct sentence.
Sam and friends must be giggling all the way to the bank: they have a service that 'probably' gives the correct result and paying customers are happy to retry until it gets it right.
> Sam and friends must be giggling all the way to the bank
It's true, but for another reason: they yoinked it away from the nerds who were baited to work on OpenAI because those nerds thought the way the company's name was spelled meant something about how it would behave. It reminds me of how some people treat software labels like 'alpha' as if they had objective meaning with consequences in reality.
Both the first and last words have repeating letters, so they fail under that interpretation too. There would have to be a bizarre interpretation that consecutive-repeating letters are counted as one, but non-consecutive are counted separately, for its response to be considered correct.
An AI aware of how to optimally answer questions put to it would find the least objectionable interpretation when one is a subset of the other. It also failed by not constructing a simpler sentence, like subject-verb-object or subject-verb-adjective-object, since its limitations related to letters and tokens, and its failure to double check its answers before output, mean it can make errors. The more it writes, the more chance it has of making an error.
Got the 'five character word' question wrong. Admittedly I also thought it was correct at first glance but then went back when someone called it out in another comment.
I also counted 4 errors in the sentence, not 3. "no help" should be "any help". This might just be conventionally wrong, not technically wrong I suppose.