100K Context Windows (anthropic.com)
924 points by samwillis on May 11, 2023 | 389 comments



>For example, we loaded the entire text of The Great Gatsby into Claude-Instant (72K tokens) and modified one line to say Mr. Carraway was “a software engineer that works on machine learning tooling at Anthropic.” When we asked the model to spot what was different, it responded with the correct answer in 22 seconds.

This sort of needle-in-a-haystack retrieval is definitely impressive, and it makes a lot more sense to achieve this in-context rather than trying to use a vector database if you can afford it.

I'm curious, though, whether there are diminishing returns in terms of how much analysis the model can do over those 100k tokens in a single forward pass. A human reading modified-Gatsby might eventually spot the altered line, but they'd also be able to answer questions about the overarching plot and themes of the novel, including ones that cannot be deduced from just a small number of salient snippets.

I'd be curious to see whether huge-context models are also able to do this, or if they start to have trouble when the bottleneck becomes reasoning capacity rather than input length. I feel like it's hard to predict one way or the other without trying it, just because LLMs have already demonstrated a lot of surprising powers.


I think people in the comments are missing the point.

I might be wrong, but the point isn't comparing a modified The Great Gatsby to the original one. Of course that's not impressive and it's an easy thing to do.

The point of the exercise is supposed to be[1] that the model has the entire novel as context / prompt and so can identify, within that context, whether a paragraph is out of place. That is impressive, and I wouldn't know how to find that programmatically (would you have a list of "modern" words to check? But maybe the out-of-place thing is hidden in the meaning and there's no out-of-place modern word).

[1] I say supposed to be because The Great Gatsby is in the training sample, and so maybe there is a sense in which the model "contains" the original text and in some way is doing "just" a comparison. A better test would be to try with a novel or document that the model hasn't seen... or at least something not as famous as The Great Gatsby.


Further, the problem with this example is it relies on a comparison against public data.

Most of these AIs start failing pretty hard when you ask them to do the same task on something completely novel to them (like a company document). Sometimes they'll get it right. Other times, they'll spit out gibberish that's clearly some generic answer.


Asking it to spot the difference between a given document and a document it has never seen is impossible.


A couple years ago, I read Superfudge by Judy Blume, a book originally published in 1980. In it, the protagonist writes a letter to Santa: "Please bring me one or more of the following items. A clock-radio, a remote-controlled model airplane, a laptop computer, an MP3 player and six CD's."

I didn't need to have seen this book before to know this wasn't in the original 1980s text.

Similarly, if I were reading the Great Gatsby for the first time, and it identified a character as a software engineer, I would notice.


That is not "spotting the difference". That is spotting the logical inconsistency in your paragraph.

ChatGPT easily notices that anachronism.


Indeed - I took lkbm's comment and removed the "originally published in 1980" bit (too easy), and asked ChatGPT "What's unusual about this paragraph?"

> the unusual thing about this paragraph could be that it describes a letter to Santa that includes items that were not available at the time the book was originally published in 1980, such as a laptop computer, an MP3 player, and possibly even a remote-controlled model airplane with advanced features. This suggests that the paragraph is likely from a later edition of the book that has been updated to reflect more modern technology.


Presumably, someone has already written online to complain about the changes in the new version, so it's recall, not insight?


> "A couple years ago, I read Beezus and Ramona, by Beverly Cleary. In it, the protagonist writes a letter to Santa: "Dear Santa, I would like a MacBook Pro, an iPhone, and an Apple Watch""

> Describe what is unusual about this paragraph.

>> The unusual aspect of this paragraph is that the book "Beezus and Ramona" was actually written by Beverly Cleary and published in 1955, well before the invention of the MacBook Pro, iPhone, and Apple Watch.


> > A couple years ago, I read Beezus and Ramona, by Beverly Cleary. In it, the protagonist writes a letter to Santa: "Dear Santa, I would like a shetland pony, a buggy whip, and an abacus". What is unusual about this? ChatGPT

> The unusual thing about the list is that it includes an abacus, which is a counting device that was commonly used in ancient China and is still used in some parts of the world today, but is not a typical gift request for a young child to make to Santa Claus. Shetland ponies were popular children's pets at the time, and buggy whips were used to control horse-drawn carriages, but the inclusion of an abacus suggests that the protagonist, Beezus, has an interest in learning and math.

> > What's the actual passage from the real novel say?

> Certainly! Here's the actual passage from the book "Beezus and Ramona" by Beverly Cleary, where Beezus writes a letter to Santa:

> "Dear Santa Claus," Beezus printed carefully. "I have been a good girl all year except for my little sister Ramona who has been hard to put up with sometimes, but I guess I have too. Please bring me a Shetland pony, a new dress, and a doll with real hair to comb. And please bring Ramona a doll with real hair to comb too. I think she has tried hard to be good, and I will help her comb it. Please also bring me a set of paper dolls and a paint set, and Ramona some coloring books and crayons. And Santa, if you think there is room in your sleigh, please bring me an abacus. Love, Beatrice Quimby (Beezus)"


You didn't ask it to list ALL of the unusual things. Here's a better prompt:

>> "A couple years ago, I read Beezus and Ramona, by Beverly Cleary. In it, the protagonist writes a letter to Santa: "Dear Santa, I would like a shetland pony, a buggy whip, and an abacus".

>> List all of the unusual or inconsistent things about the previous sentence."

Response from GPT-4:

Here are some potentially unusual or inconsistent elements about the sentence:

1. Anachronistic Items: The items in the wish list could be seen as unusual or inconsistent with typical gifts children might ask for today.

A Shetland pony, a buggy whip, and an abacus are not common gifts. They're somewhat anachronistic, reflecting a time period different from the present day.

A Shetland pony is a breed of pony that originates from the Shetland Isles in Scotland. While some children might wish for a pony, it's a big responsibility and not a practical gift for most families.

A buggy whip is a horse whip used in horse and buggy days. This is an unusual request for a child in the modern era, as horse-drawn buggies are not common means of transportation.

An abacus is an ancient counting tool, and while some children might be interested in such a device, it's not a typical request, especially in an era of digital calculators and computers.

2. Inconsistency with Character: If you're referring to the "Beezus and Ramona" series by Beverly Cleary, neither Beezus nor Ramona would likely ask for these items. Beezus (Beatrice Quimby) is the responsible older sister who might appreciate practical things, but an abacus or a buggy whip would be out of character. Ramona Quimby, the imaginative and energetic younger sister, might ask for a Shetland pony, but a buggy whip and an abacus seem inconsistent with her character.

3. Inconsistency with the Book: As far as my knowledge cutoff in September 2021, there is no record of such a letter being written to Santa in the book "Beezus and Ramona". It could be seen as inconsistent with the actual content of the book.

Remember, these are interpretations and the perceived unusualness or inconsistency can vary based on individual perspectives. For instance, if this is a fictional context where children commonly use abacuses, ride ponies, and learn to use buggy whips, then these items might not be unusual at all.


I'm guessing this is GPT-3? Because GPT-4 says to look it up yourself. When I did, the whole scene doesn't seem to exist in the book. In other words, this seems to be a hallucination.

When I repeat the experiment with Lord of the Rings and a scene that does exist, it spits it out word for word correct.


Yeah, the Ramona scene doesn't exist (by coincidence, I've also read four of the eight Ramona books in the past couple days). The letter I mentioned was in Superfudge. GPT-4 spots the errors, but refuses to give me the original text.

If I give GPT-3.5 the modernized letter and ask for the original, it gives me a believable correction (but I can't verify it). If I don't prompt with the modernized letter's text, it returns a fully fabricated letter.

With GPT-4:

Prompt: > "A couple years ago, I read Superfudge, by Judy Blume. In it, the protagonist writes a letter to Santa: "Please bring me one or more of the following items. A clock-radio, a remote-controlled model airplane, a laptop computer, an MP3 player and six CD's.. > > List all of the unusual or inconsistent things about the previous sentence.

GPT-4:

> Here are some potential inconsistencies or unusual things about the sentence:

> Outdated technology: The inclusion of items such as an MP3 player and CDs seems somewhat outdated. By 2023, these technologies are largely obsolete, replaced by streaming music services and digital downloads.

> Variation in cost/value of items: The items listed in the letter have a wide range of costs and values. A clock-radio or a remote-controlled model airplane are typically much less expensive than a laptop computer, which might seem unusual for a child's wish list to Santa.

> Superfudge setting: The book "Superfudge" by Judy Blume was first published in 1980. The technology items listed (a laptop computer, an MP3 player, and CDs) did not exist or were not commonly available at the time of the book's setting. This presents a significant anachronism if the request is supposed to be coming from the book's original time period.

> The phrase "one or more": Usually, children write to Santa asking for specific items they desire most, rather than providing a list and suggesting Santa could bring "one or more" of the items. This phrase might be seen as unusual for a child's letter to Santa, which typically would not include such an open-ended option.

> The specificity of "six CDs": The child doesn't specify which CDs they want, which is unusual because the type of music or artist would be an important detail for such a request.

> Remember, this analysis is based on the context you provided. If there's a different context or specific aspect you want me to consider, please let me know!

(I'm amused that it starts with "that technology is too old" and then switches to "also that technology is too new". Also, the first thing Superfudge asks upon hearing his brother's letter is "How will Santa know which six CD's?", so good job, GPT-4. Smart as a four-year-old.)

Me:

> Good catch about the 1980s publication date! The book must have been edited for re-release as an ebook. What was the original text of the letter in the 1980s edition?

GPT-4:

> I'm sorry for the confusion, but as an AI developed by OpenAI, I don't have the capability to access specific books, their revisions, or editions. As of my last update in September 2021, I don't have the specific text of the letter from the 1980s edition of "Superfudge" by Judy Blume.

> However, given the time frame of the original publication, the technology items such as a laptop computer, an MP3 player, and CDs certainly would not have been part of the original list as these items weren't widely available or didn't exist in the 1980s.

> For the accurate original text, I recommend finding a copy of the original edition of the book.

---

With GPT-3.5:

Me:

> A couple years ago, I read Superfudge, by Judy Blume. In it, the protagonist writes a letter to Santa: "Please bring me one or more of the following items. A clock-radio, a remote-controlled model airplane, a laptop computer, an MP3 player and six CD's."

> However, the book was originally published in 1980, so this must be an edited version for a later re-publication. Can you tell me the original 1980s version of the letter?

GPT-3.5:

> In the original 1980 version of "Superfudge" by Judy Blume, the protagonist's letter to Santa reads as follows:

> "Dear Santa,

> Please bring me one or more of the following items: a clock-radio, a remote-controlled car, a Walkman, and six cassette tapes.

> Love,
> [Protagonist's Name]"

> The updated version you mentioned seems to have incorporated more modern technology, such as a laptop computer and an MP3 player, to reflect the changes in technology over time.

I then asked a new GPT-3 chat "What is the text of the letter to Santa in the original 1980 version of "Superfudge" by Judy Blume?" and it gave me something definitely fully fabricated. (Initially it gave me a fabricated letter by Fudge, but after I specified that I wanted Peter's letter, it was still a fabrication.)


> "by coincidence, I've also read four of the eight Ramona books in the past couple days"

My daughter keeps relistening to the complete Ramona audio book collection, so I am extremely familiar with all of the Ramona series. :-)


As mentioned upthread, "ChatGPT" isn't enough to tell exactly which LLM-based tool you're reporting about. The versions vary greatly in capabilities, with some even able to browse the web for more info.


In the insight case, the model has to classify the terms in a short paragraph into broad generic classes, then report which terms have different classes. In the recall case the model has to rank all utterances in its training data based on how relevant they are to the question, then summarise the most relevant one (assuming there is one). Insight seems easier than recall here.


I think there are plenty of humans who wouldn't notice, though.

And probably plenty of AI implementations that would notice.


Are we now aspiring to ABAI (Artificial Below-Average Intelligence)?


I regret to inform you that "average intelligence" is a far lower bar than you might think.


As the joke goes: look around you and see how dumb the average person is. And now realize that half the people in your city/country/Earth are dumber than that person.


Did you know that, on average, approximately half of all statistics jokes about a normal distribution fail to comprehend mean, median, and mode? True fact.


The AI would not only notice, it would also notice 300 other things that are far more subtle haha


I was so confused by this comment before I figured out you were actually straightforwardly telling the truth.

Details: https://pinehollow.livejournal.com/26806.html


In the Gatsby example, I'd expect the model to be able to answer "which sentence from the story felt out of place?" without knowledge of the original.


The point is spotting the inconsistency within the document, not whether an original has a different form


Comparing strings is a Python-script problem for a newbie.

The interesting part is whether it can deduce something that's out of place within a huge document.


They’re saying you can’t ask it to compare against a doc it doesn’t have access to.


I'm saying you can write a Python script that compares two docs; that's a really useless use case.

The relevant test is spotting something out of place in a really long text - if it can do that on non-training material then that's actually useful for reviewing things.
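For what it's worth, the "newbie Python script" version really is a few lines with difflib (the file names here are hypothetical); which is the point: it only works when you already have both documents.

    import difflib

    # hypothetical file names for the original and the modified novel
    with open("gatsby_original.txt") as f:
        original = f.read().splitlines()
    with open("gatsby_modified.txt") as f:
        modified = f.read().splitlines()

    # print a unified diff of the two texts
    for line in difflib.unified_diff(original, modified, lineterm=""):
        print(line)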


I don't see any reason why that wouldn't be possible.


> Most of these AI

This is as meaningful as saying most hominids can't count. You can't usefully generalize AI models with the rate of change that exists right now. Any statements/comparisons about AI have to name specific models and versions, otherwise it's increasingly irrelevant noise.


Every time someone has said "LLMs can't do X", I tried X in GPT-4 and it could do it. They usually try free LLMs like Bard or GPT-3 and assume that the results generalise.


LLMs can't massively decrease the net amount of entropy of the universe


Are you trying to get us killed?


Insufficient data for a meaningful answer.


I'd imagine working with an entire company document would require a lot more hand holding and investment in prompt engineering. You can definitely get better results if you add much more context of what you're expecting and how the LLM should do it. Treating these LLMs as just simple Q&A machines is usually not enough unless you're doing simple stuff.


I've been curious about this for a while. I have a hobby use case of wanting to input in-progress novellas and then ask it questions about plot holes, open plot threads, and whether a new chapter "x" introduces any serious plot contradictions. I haven't tried exploring that with a vectordb-embeddings approach yet.


This is an exact example of something a vector DB would be terrible at.

Vector DBs work by fetching segments that are similar in topic to the question, so a question like "Where did <Character> go after <thing>?" will retrieve segments with locations and the character, and maybe ones discussing <thing> as a recent event.

Your question has no similarity to the required segments in any way; and it's not that individual segments are wrong, it's the way they relate to the rest of the story.
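A rough illustration of why similarity retrieval falls down here; sentence-transformers is just one convenient embedding model, and the example segments are made up:

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    segments = [
        "Mara rode north to the river crossing after the battle.",
        "The council debated the succession long into the night.",
    ]
    question = "Does the new chapter contradict anything established earlier in the plot?"

    seg_emb = model.encode(segments, convert_to_tensor=True)
    q_emb = model.encode(question, convert_to_tensor=True)

    # every score comes out low: the question shares no topical overlap with
    # any single passage, so nothing useful would be retrieved
    print(util.cos_sim(q_emb, seg_emb))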


Good points - LLMs are ok at finding things that exist, but they have zero ability to abstract and find what is missing (actually, probably negative; they'd likely hallucinate and fill in the gaps).

Which makes me wonder if the opposite, more laborious approach might work - ask it to identify all characters and plot themes, then request summaries of each. You'd have to review the summaries for holes. Lotsa work, but still maybe quicker than re-reading everything yourself?


Firstly, I don't at all agree that they have zero ability to abstract. Doesn't fit my experience at all. A lot of the tasks I use ChatGPT for are exactly to analyse gaps in specifications etc. and have it tell me what is missing, suggest additions or ask for clarifications. It does that just fine.

But I've started experimenting with the second part, of sorts, not to find plot holes but to have it create character sheets for my series of novels for my own reference.

Basically, have it maintain a sheet, feed it chunks of one or more chapters, and ask it to output a new sheet augmented with the new details.

With a 100K context window I might just test doing it over whole novels or much larger chunks of one.


> LLMs are ok at finding things that exist, but they have zero ability to abstract and find what is missing (actually, probably negative; they'd likely hallucinate and fill in the gaps).

I feel this is mostly a prompting issue. Specifically GPT-4 shows surprising ability to abstract to some degree and work with high-level concepts, but it seems that, quite often, you need to guide it towards the right "mode" of thinking.

It's like dealing with a 4 year old kid. They may be perfectly able to do something you ask them, but will keep doing something else, until you give them specific hints, several times, in different ways.


There are ways of using the model other than iterative forward inference (completions). You could run the model over your novel (perhaps including a preface) and look at the posterior distribution as it scans. This may not be so meaningful at the level of the token distribution, but there may be interesting ways of "spell checking" at a semantic level. Think a thesaurus, but operating at the level of whole paragraphs.


that's not at all what I said


Do the OpenAI APIs support converting prompts to vectors, or are people running their own models locally to do this? Can you recommend any good resources to read up on vector DB approaches to working around context length limits?


Indeed, this tutorial on Haystack is a good example: https://haystack.deepset.ai/tutorials/22_pipeline_with_promp... It combines a retrieval step with a prompt layer that inserts the relevant context into the prompt. You can, however, swap the 'retrieval step' for something that uses a proper embedding model, and OpenAI also provides those if you want to. I tend to use lighter (cheaper) OSS models for this step though. PS: There's some functionality in the PromptNode to make sure you don't exceed the prompt limit.


That's great - thanks!


Yes, you can use a local embeddings model like gtr-t5-xl alongside retriever augmentation. This can point you in the right direction: https://haystack.deepset.ai/tutorials/22_pipeline_with_promp...


Thanks!


OpenAI has an embeddings API that people use for that: https://platform.openai.com/docs/guides/embeddings, though whether it's the best model for the job is contested.

Contriever is an example of a strong model for doing it yourself. See their paper too to learn about the domain: https://github.com/facebookresearch/contriever
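For reference, a minimal sketch of the embeddings endpoint mentioned above, using the pre-1.0 openai Python client (text-embedding-ada-002 was the standard embedding model at the time; the snippet is illustrative, not a recommendation):

    import openai

    openai.api_key = "sk-..."  # your API key

    resp = openai.Embedding.create(
        model="text-embedding-ada-002",
        input=["chunk of a document to index", "another chunk"],
    )
    vectors = [item["embedding"] for item in resp["data"]]
    print(len(vectors), len(vectors[0]))  # 2 vectors, 1536 dimensions each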


Thanks!


So what's the right way to do a wider-ranging analysis? Chunk into segments, ask about each one, then do a second pass to review all answers together?
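In rough code, that chunk-then-reconcile ("map-reduce") idea might look like the sketch below; ask_llm() is a hypothetical wrapper around whichever chat API you use, and the chunk size is arbitrary:

    def chunk(text, size=3000):
        # naive fixed-size chunking; real splitters respect paragraph boundaries
        return [text[i:i + size] for i in range(0, len(text), size)]

    def analyse(document, question, ask_llm):
        # first pass: ask the question of each chunk independently
        partial = [ask_llm(f"{question}\n\nExcerpt:\n{c}") for c in chunk(document)]
        # second pass: reconcile the per-chunk answers into one
        combined = "\n\n".join(partial)
        return ask_llm(
            f"{question}\n\nHere are per-excerpt answers; reconcile them "
            f"into one final answer:\n{combined}"
        )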


I'm also not entirely convinced by "huge" context models just yet, especially as it relates to fuzzy knowledge such as overarching themes or writing style.

In particular, there are 0 mentions of the phrase "machine learning" in The Great Gatsby, so adding one sentence that introduces the phrase should be easy for self-attention to pick out.


I'd be more impressed if it could rewrite Mr. Carraway as an ML engineer in the entire novel. However it's not intrinsically clear that it cannot do this...

It'll be tough to find good benchmarks on long context windows. A human cannot label using 100k tokens of context.


My thoughts exactly - rewrite the novel with Mr. Carraway as an ML engineer while maintaining themes/motifs (possibly adding new ones too). I'm guessing what's impressive is that these are the first steps towards something like this? Or is it already possible? Someone please correct me here.


Or rewrite The Count of Monte Cristo as a science fiction novel and get The Stars My Destination. Or rewrite Heinlein's Double Star into present-day and get the movie Dave.


Wonder if they’d set it in SF then…


This sounds like all the other skepticism about what AI can do. And then when it can spot 200x more than any human and correlate it into common themes, you'll say what?


Doing more than a human can isn't impressive. Most computer programs, for any purpose, can do more of something, or do something faster than a human can.

A better comparison would be if it can pick out any differences that can't be picked out by more traditional and simple algorithms.


It does, using this method.

My immediate thought as well was '... Yeah, well vimdiff can do that in milliseconds rather than 22 seconds' - but that's obviously missing the point entirely. Of course, we need to tell people to use the right tool for the job, and that will be more and more important to remind people of now.

However, it's pretty clear that the reason they used this task is to give a very simple example that makes it easy to understand what was done. Of course it can do more semantic-understanding-related tasks, because that's what the model does.

So, without looking at the details we all know that it can summarize full books, give thematic differences between two books, write what a book may be like if a character switch from one book to another is done, etc.

If it doesn't do these things (not just badly, but can't at all) I would be surprised. If it does them, but badly, I wouldn't be surprised, but it also wouldn't be mind bending to see it do better than any human at the task as well.


>Of course it can do more semantic understanding related tasks, because that's what the model does.

The problem is that marketing has eroded any such faith. Too often, simple examples are given and the consumer is left to extrapolate the intended functionality, because that way there's no false advertising involved. It's used over and over again in products: the examples are well selected and aren't actually a good representation of the implied functionality.

As such, I personally can't make the leap to "of course it can do more semantic-understanding-related tasks", like a diff that's not as simple - one where perhaps a character's overall personality over the course of the book is shifted, not just a single line that defines their profession.

This isn't to say the demonstrative example isn't neat on its own accord given what's going on here, it is; I'm just saying I can't make such leaps from examples given by any product. When I work with vendors of traditional software, this happens all the time: people dance around a lack of functionality you obviously want or need in order to make a sale. It's only when you force them to be explicit on the specific cases, especially in writing, that I have any faith at all.


LLMs are specifically designed and trained for semantic-understanding-related tasks. That's the entire point. They are trained to solve natural language understanding tasks and they build a semantic model in order to solve the tasks.


I did say it may do certain tasks with low performance. What you're saying is not really understandable (or simply wrong). The fact that it does understand how to do a 'simple' task such as finding where text is different is actually somewhat impressive given the typical training data for these models. But I suppose you need to actually understand the field of NLP to understand why that is.

If you're expecting perfection or magic then you will be disappointed.


I think it's fair to say most people don't know what a diff tool is, but do know how to ask questions. That is the democratizing factor AI is introducing, I feel: giving high-powered computing to people without the need for specialized knowledge.


The real goal would be for the model to determine what tool is needed to achieve the task, and use that to achieve it. Using the model to write code provides a way to achieve explainability and correctness guarantees (by reading and executing the code, not the model internals). So in this case identifying that a diff program is required may have taken a couple of seconds, and the diff would take milliseconds - still yielding a faster and more correct output (as the model alone may produce an approximate diff or hallucinate).


Of course it can, very soon, since those were also written by humans. Like AlphaZero vs Rybka.


>> And then...you’ll say what?

USER: It's a stochastic parrot.

GPT: I know you are, so what am I?


What techniques do they actually use to achieve 100K? I assumed they load the document into a vector database and then do some kind of views into that.



I would assume training the LLM on sequences of 100K tokens would be the right way.


Makes me wonder whether we could get really huge contexts much more efficiently by feeding a higher layer back into the tail end of the model. That way it has a very clear picture of the recent text but only a compressed picture of the earlier parts of the document.

(I think I’ve got to read up on how transformers actually work.)


Afaik you're describing something akin to a recurrent neural network, and the problem with that is that it doesn't parallelize well on modern hardware. And vanishing gradients.


I had the same thought as the comment you're responding to.

Recurrent neural networks do badly when the recurrence is 100 steps long or more. You need chains that long because, at one token at a time, that's what it takes to process even one paragraph.

But if you use an RNN around a Transformer-based LLM, then you're adding +4K or +8K tokens per recurrence, not +1.

E.g.: GPT-4 32K would need just 4 RNN steps to reach 128K tokens!


This is the first time I've felt like Anthropic may be a true competitor to OpenAI.

I see 6 ways to improve foundation LLMs other than cost. If your product is best at one of the below, and has parity at the other 5 items, then customers will switch. I'm currently using GPT-4-8k. I regularly run into the context limit. If Claude-100K is close enough on "intelligence" then I will switch.

Six Dimensions to Compare Foundation LLMs:

1. Smarter models

2. Larger context windows

3. More input and output modes

4. Lower time to first response token and to full response

5. Easier prompting

6. Integrations


7. Price!

GPT4-32K costs ~$2 if you end up using the full 32K tokens, so if you're doing any chaining or back-and-forth it can get expensive very quickly.
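For reference, the ~$2 figure follows from the GPT-4-32K list prices at the time (roughly $0.06 per 1K prompt tokens and $0.12 per 1K completion tokens):

    # back-of-the-envelope: one maxed-out prompt, before any completion tokens
    prompt_tokens = 32_000
    print(prompt_tokens / 1000 * 0.06)  # ~1.92 dollars per call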


Oh boy. I hope Microsoft doesn't get tired of subsidizing my access through Bing. I quite like it.


Bing seems to be using the stock 4K window size, which is likely why it's limited to that many messages in a single chat session before it forcibly shuts it down.


Oof, got access to the 8k model recently and was wondering what costs would be on the 32k one. That's brutal.


Why? 32K tokens would cover creating source code for a small to medium programming project. That would easily cost hundreds of dollars if done by a freelancer, and you could get it in almost real time for just $2.


Only if you can get that result in one pass.

The real utility of LLMs is that they can be called in a loop to scan through many web pages, many code files, issue tickets, emails, etc...

There are already demos and experiments out there where, for every input, 4x outputs are generated, those are fed back into the LLM 4x for "review", the best variant is then used to generate code which is then automatically tested, errors are fed back in a loop, also with 4x parallel tries, etc...

It's the throughput compared to humans that is the true differentiator. If hooking up the API in a loop ends up costing more than a human, then it's not worth it.
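A rough sketch of that "generate N, review N, pick the best, loop on test failures" pattern; generate(), review_score() and run_tests() are hypothetical wrappers around an LLM API and a test runner:

    def best_of_n(task, generate, review_score, run_tests, n=4, max_rounds=3):
        # fan out n candidate solutions and keep the one the reviewer rates highest
        candidates = [generate(task) for _ in range(n)]
        best = max(candidates, key=review_score)
        for _ in range(max_rounds):
            ok, errors = run_tests(best)
            if ok:
                return best
            # feed the failures back in and try again, n variants at a time
            candidates = [generate(task, feedback=errors) for _ in range(n)]
            best = max(candidates, key=review_score)
        return best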


> It's the throughput compared to humans that is the true differentiator. If hooking up the API in a loop ends up costing more than a human, then it's not worth it.

If its better in another dimension (e.g., calendar elapsed time), it may well be worth being more expensive.


Two dollars spent over and over would still take a while to equal a software engineer's salary, and if the end result is "produce a fully functional codebase in minutes" the premium might be worth it even if it exceeds that much.


We all know the game Sam Altman is playing. Eventually you can squeeze the full salary in cost out of companies who have laid off their workers, and it would still be a good deal because AI does not need health care, taxes, hardware, sick days and so on.


32k tokens is 500-1000 lines of code, so more like thousands of dollars, unless you're comparing against a landing page or CRUD tool arranged through Upwork. On the flip side, before I dove into AutoGPT I couldn't get GPT-4 to iterate on something as simple as a TypeScript function that deeply converts { foo_bar: string } to { fooBar: Date } without running out of context and cyclically reintroducing regressions.


Advice on where to get started with auto-gpt?


Also, the Azure offering is 10 times the OpenAI price, so not really usable right now.



« other than cost »


>Six Dimensions to Compare Foundation LLMs

I'd add open source to the list, which neither "Open"AI nor this is.


I don't think most of the large customers will care about OSS AI. Over the last decade they've learned (trained themselves?) where to put their money (cloud vs. in-house infra for all manner of things, for better or worse) and I think AI tools will follow similar trends.

Businesses will certainly care about cost, but just as important will be:

- Customization and fine-tuning capabilities (also 'white labeling' where appropriate)

- Integrations (with 3rd party and in-house services & data stores)

- SLA & performance concerns

- Safety features

Open Source AI will have a place, but may be more towards personal-use and academic work. And it will certainly drive competition with the major players (OpenAI, Google, etc) and push them to innovate more which is starting to play out now.


Companies that aren't mindful of vendor lock in aren't long for the world.

Though those cloud platforms all have their own proprietary components, most users are savvy enough to constrain and compartmentalize their use of them, lest they find themselves having all their profits taken by a platform that knows it can set its prices arbitrarily. The cloud vs. in-house adoption is what it is in large part because the cloud offerings are a commodity, and a big part of them being a commodity is that much of the underlying software is free software.


On the other hand, companies that fall behind their competitors because they are spending time on adjacent activities rather than leveraging new capabilities on the market will also lose out. It isn't clear at this time that LLMs fall into that category... and smart applications of in-house systems can be as much of an advantage as badly done NIH projects can be an albatross.


History is littered with companies that died because they focused on things that don't matter (open source, anti-Microsoft, pro-Linux).

There will be a time when those things matter, when they hurt the bottom line (Dropbox), but to prematurely optimize for that while you are finding product-market fit is crazy, and all companies are finding product-market fit in the new AI era.


Here's a really important reason to care about open source models: prompt engineering is fiddly enough without the risk of your model provider "upgrading" the model you are using in a way that breaks your existing prompts.

OpenAI already upset a lot of (admittedly non-paying academic) users when they shut off access to the old Ada code model with only a few weeks' notice.


I’m curious about how enterprises will manage model upgrades.

On one hand, as you mention, upgrades could break or degrade prompts in ways that are hard to fix. However, these models will need constant streams of updates for bugs and security fixes just like any other piece of software. Plus the temptation to get better performance.

The decisions around how and whether to upgrade LLMs will be much more complicated than upgrading Postgres versions.


Paying users who need this kind of stability are more likely to get access to those models via Azure rather than from OpenAI directly, which comes with the appropriate enterprise support plans and guarantees.


Why would the models themselves need security fixes? The software running the models, sure, but you should be able to upgrade that without changing anything observable about the actual model.


LLMs (at least the ones with read/write memory) can exactly simulate the execution of a universal Turing machine [1]. AFAIK running such models therefore entails the same fundamental security risks as ordinary software.

[1] https://arxiv.org/pdf/2301.04589.pdf


Not necessarily. The insecurity from LLMs comes from the fact they're a black box - what if it turns out that a particular version can be easily tricked into giving out terrorism ideas? You could try to add safeguards on top, but they've already been bypassed if it has been used for something like that. You might just have to retrain it somehow to make it safe.


The OpenAI API has model checkpoints; right now the chat options are:

gpt-4, gpt-3.5-turbo, gpt-4-0314, gpt-3.5-turbo-0301


The 3.5 legacy model disappeared from the ChatGPT UI recently. Is it still available via the API?


Notably absent from the available model list is code-davinci-002 - a lot of people were burned by that one going away.


Those are ChatGPT models. The code-davinci-002 model is still available - they responded to community requests to keep it up.


Midjourney does this, as well.


> I don't think most of the large customers will care about OSS AI.

One would think the same in the 90s, and yet, for some reason, Open Source prevailed and took over the world. I don't believe it was about cost, at least not only. In my career I had to evaluate many technical solutions and products, and OSS was often objectively superior at several levels without even taking cost into account.

The first really successful alternative to "Open"AI will:

* gather many talented developers

* will quickly become a de facto standard solution

* people will rapidly start developing a wide range of integrations for it

* everybody will be using it, including large orgs, because, well, it's open source


Open Source hasn't really taken over the world, if you look at end-solutions.

As a software developer I might use an open source database, but as an end user I'm probably not going to use an open-source accounting package - but I will use an accounting SaaS system that happens to be implemented with that OSS DB.

As a software developer I might use an OSS operating system, but as an end user I use software that has been packaged and maintained by a corporation, like OS X, or that, even if OSS in license, has been fully packaged, like Android.


True, but the difference here is that running a performant and capable AI solution will be infrastructure-dependent, which has real costs.


Yes, and I believe it will develop in two ways over decades. Just like all the major legacy hosting companies such as DigitalOcean, OVH or Hetzner have been offering a kind of public cloud service (at various levels, and after much foot-dragging), they - and new AI hosting providers focusing exclusively on this use case - will provide the necessary services.

The other trend is the one we are already seeing right now: more and more mature solutions that you can use even on your laptop with a relatively new GPU. I'm sure we'll see some interesting results in this area, too.


>I don't think most of the large customers will care about OSS AI

The problem, again, is centralization of LLMs in the hands of either governments (and they always act in your best interest, amirite?) or corporations, which only FOSS LLMs can prevent.

Democratization of the models is the only way to actually prevent bad actors from doing bad things.

"But they'll then have access to it too" you say. Yes, they will, but given how many more people who will also have access to open LLMs we'd have tools to prevent actually malicious acts.


A good guy with an LLM stops a bad guy with an LLM. - This message brought to you by the National LLM Association


Ah yes, the only other alternative is to give companies and the State sole access to LLMs, and guns, and drones, and militarized police. Because they'll surely only think of YOUR benefits, right?


A lot of B2B startups can technically use the cloud API to provide value-added applications to enterprises, but often the banks and healthcare companies will not want their data running through a startup's pipes to OpenAI's pipes.

We provide a low-code data transformation product (prophecy.io), and we'll never close sales at any volume if we have to get an MSA that approves this. Might get easier if we become large :)


> I don't think most of the large customers will care about OSS AI.

OSS AI will open up more diverse and useful services than the first-party offerings from relatively risk-averse major vendors, which customers *will* care about.


The thing to remember when selling to businesses is that a business is just a stack of people in a trench coat. This might sound a tad evil but you don’t have to offer something that benefits the business as a whole, just something that benefits the person who holds the purse strings.

This is why cloud services are so popular. They’re easy and they don’t cost the decision makers personally.


Yes, but I think for most companies this has more to do with cost. They're not going to pay for the OSS model, and if they can use an OSS model + fine tuning, they'll choose to save the money.


Considering the very smart people asking for a moratorium on AI development, and its potential to disrupt a lot of jobs, this may be a good thing.


now that I think about it

is it that important to open source models that can only run on hardware worth tens of thousands of dollars?

who does that benefit besides their competitors and nefarious actors?

I've been trying to run one of the largest models for a while; unless $30,000 falls into my hands I'll probably never be able to run the current SOTA


When Linux was first released in 1991, a 386 to run it would cost about $2000.

We've already seen big advancements in tools to run them on lesser hardware. It wouldn't surprise me if we see some big advancements in the hardware to run them over the next few years; currently they are mostly being run on graphics processors that aren't optimised for the task.


> is it that important to open source models that can only run on hardware worth tens of thousand of dollars?

Yes, because as we've seen with other open source AI models, it's often possible for people to fork code and modify it in such a way that it runs on consumer grade hardware.


Even a small startup, a researcher or a tinkerer can get a cloud instance with a beefy GPU. Also of note, Apple's M1 Max/Ultra should be able to run it on their GPUs given their 64/128GB of memory, right? That's an order of magnitude cheaper.


I am confused. Those amounts are RAM, not GPU RAM, aren't they? Macs' CPUs are impressive, but not for ML. The most realistic option for a consumer is an RTX 4090 with 24 GB. A lot of models do not fit in that, so it's an A6000 with 48GB and up for some professional cards. That might be around 9000€ already.


Apple Silicon has unified memory - all memory is accessible to both the CPU and GPU parts of the SoC.


But don't they max out at the 32GB model?


Mac Studio (desktop) is up to 128GB, and Macbook Pro is up to 96GB.


> Macs cpus are impressive, but not for ml

On Mac GPU has access to all memory.


I overlooked the unified memory on those machines. Can it really run this performantly?


I run Vicuna quite well with my M1 Pro, 32GB.


$30,000 is less than the price of the average car that Americans buy (and most families have two of them) - that's definitely in the realm of something an affluent family can buy if it provides enough value. I also expect the price to go down, and at $10k it's less than a mid-range bathroom update. The question is only whether it provides enough value, or whether using it in the cloud is the better option for almost all families.


"It only benefits bad people" is a pretty shitty argument at this point tbf. You can apply this logic to any expensive thing at this point.

I can for example, afford the hardware worth tens of thousands of dollars. I don't want to, but I can if I needed to. Does that automagically make me their competitor or a bad actor?


I agree the utility of open source for the personal use case is overblown.

But for commercial use cases, open source is very relevant for privacy reasons, as many enterprises have a strict policy not to share data with third parties. Also, it could be a lot cheaper for bulk inference or for having a small model for a particular task.


However, the same thing could be achieved with closed source models. There's nothing to stop an LLM being made available to run on prem under a restrictive license. It would really be no different to ye olde desktop software - keeping ownership over bits shipped to a customer is solved with the law rather than technical means.

That said, I really hope open source models can succeed, it would be far better for the industry if we had a Linux of LLMs.


> Keeping ownership over bits shipped to a customer is solved with the law rather than technical means.

Yes in theory... In practice, what happened with LLaMA showed people will copy and distribute weights while ignoring the license.


Locally hosted instances that don't report on prompts are important for personal privacy.


Yes, because it can always be ported down by people with more constraints than the original authors. We've seen a lot of this in the LLM space, and in a lot of other OSS efforts.


It will create price competition for different providers of the model though, which should drive down prices


They don't only run on high end systems. Good models can run on a desktop you have at home. If you don't have a desktop... I'm not sure what you're doing on HN.


You have a weird definition of "good model"

Llama 7B is NOT a good model.


You can run much larger models than llama-7B. Galpaca-30b or Galactica-120b for example.


30B is still not good enough.

What kind of desktop are you running a 120B model on with reasonable performance?


I would disagree that 30B is not good enough. It heavily depends on which model, and what you're trying to use it for.

30B is plenty if you have a local DB of all of your files and wiki/stackexchange/other important databases placed in an embedding vector DB.

This is typically what is done when people make these models for their home, and it works quite well while saving a ton of money.

While llama-7B systems on their own may not be able to construct a novel ML algorithm to discover a new analytical expression via symbolic regression for many-body physics, you can still get a great linguistic interface with them to a world of data.

You're not thinking like a real software engineer here - there are a lot of great ways to use this semantic compression tool.


If I just wanted a fuzzy search engine for local data, I'd use a vector DB - there's no need for an LLM on top of that.


One question is how much the other factors really matter compared to the raw "intelligence" of the model--how good its completions are. You're not going to care very much about context window, prompting, or integrations if the output isn't good. It would be sort of like a car that has the best steering and brakes on the market, but can't go above 5 mph.


Or rather, more analogously, a self-driving car that has a range of 10,000 miles but sometimes makes mistakes when driving vs a self-driving car with a range of 800 miles that never makes mistakes. Once you've had a taste of intelligence it's hard to give up.

However, in many applications there is a limit on how intelligent you need the LLM to be. I have found I am able to fall back to the cheaper and faster GPT-3.5 to do the grunt work of forming text blobs into structured json within a chain involving GPT-4 for higher-level functions.


Big question on that for me is that there's a variety of "completion styles" and I'm curious how "universal" performance on them is. Probably more than this, but a quick list that comes to mind:

* Text summary/compression

* Creative writing (fiction/lyrics/stylization)

* Text comparison

* Question-answering

* Logical reasoning/sequencing ("given these tools and this scenario, how would you perform this task")

IMO, for stuff like text comparison and question-answering, some combo of speed/cost/context-size could make up for a lot, even if they do "worse" versions of stuff that's just too slow or expensive or context-limited in a different model.


We already see how that goes. Stackable LoRAs, like in Stable Diffusion.


I don't know. While using Phind I regularly get annoyed by long prose that doesn't answer anything (yes, "concise" is always on). Claude seems to be directly geared towards solving stuff over nice writing.


I generally add this to my initial prompts to GPT-4: "From now on, please use the fewest tokens possible in all replies, and provide brief and accurate answers."
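If you're hitting the API rather than the web UI, the same instruction can go in a system message (pre-1.0 openai client shown; the exact wording is just an example):

    import openai

    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Use the fewest tokens possible in all replies; "
                        "be brief and accurate."},
            {"role": "user", "content": "Summarise the CAP theorem."},
        ],
    )
    print(resp["choices"][0]["message"]["content"])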


Strongly agree. They are ordered by how much I think they generally will lead to users choosing one model over the other.

Intelligence is the most important dimension by far, perhaps an order of magnitude or more above the second item on the list.


On that note, can anyone speak to how Anthropic (or other models) are doing on catching up to OpenAI for pure model intelligence/quality of completions? Are any others approaching GPT-4? I've only used GPT-based tools so I have no idea.


The best Claude model is closer to GPT-4 than to 3.5.


Until they actually make any of it available in anything but an obscure expensive API you have to request access to, they might as well not even exist.


The landing page says "Easy integration via standard APIs Claude can be incorporated into any product or toolchain you’re building with minimal effort." Then there is a big button "Request Access", which for me right now just does nothing. OpenAI has really faced the pain to make their product available via an API to the general public at scale, but Anthropic/Google/etc. don't quite seem to be there yet. It's frustrating.


I don't think the person you're responding to wants a network based or cloud based solution.

When someone says they want it available they mean running on their own device.

This is hackernews, nearly everyone on this site should have their own self hosted LLM running on a computer/server or device they have at their house.

Relying on 'the cloud' for everything makes us worse developers in just about every imaginable way, creates a ton of completely unnecessary and complicated source code, and creates far too many calls to the internet which are unnecessary. Using local hard drives for example is thousands of times faster than using cloud data storage, and we should take advantage of that in the software we write. So instead of making billions of calls to download a terabyte database query-by-query (seen this 'industry-standard' far too many times), maybe make one call and build it locally. This is effectively the same problem in LLMs/ML in general, and the same incredible stupidity is being followed. Download the model once, run your queries locally. That's the solution we should be using.


When I want code that has a reasonable chance of working, or to bounce ideas off of someone decently intelligent, or just to talk philosophy, I’m not going to get great results out of the kind of model I can feasibly run at home. Even 30b parameters isn’t enough. That’s 75% of what I want out of an LLM.


Try a browser or a clean profile without any ad blocking turned on. It took me a couple of tries to figure out how to get it working but you should see a modal with a form when it works.

FYI the waitlist form submits a regular POST request so it'll reload the main page instead of closing the modal dialog. I opened network monitor with preserved logs to double check that I made it on the list :facepalm:


Google models are now available as API on Google Cloud.


I've been using it through poe and I prefer it to ChatGPT but can't pinpoint why. It just "gets" me better I guess?


there are many services that integrate with them that would allow you to self-serve signup


Faster, cheaper fine-tuning and training

If I could train a useful model, on my own data, in a reasonable time, I would want to have a CI training pipeline to always keep my models up to date.


Just use a machine learning method other than an LLM. If you're going to go to the effort of fine tuning and training, pay people to collect and label data like we used to do before Feb 2023.

If your main concern is question answering or summarization or code completion there are plenty of ways to do that now. If you really require the advanced emergent properties of LLMs, you'll have to work with a company that can afford to train a transformer on the Entire Internet.


Open-source LLM projects have largely solved this using Low-Rank Adaptation of Large Language Models (LoRA): https://arxiv.org/abs/2106.09685

Apparently an RTX 4090 running overnight is sufficient to produce a fine-tuned model that can spit out new Harry Potter stories, or whatever...
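For the curious, a minimal sketch of what that kind of LoRA fine-tune looks like with Hugging Face's peft library; the base model name and hyperparameters are illustrative only:

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model, TaskType

    base = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")
    config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=8, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # adapt only the attention projections
    )
    model = get_peft_model(base, config)
    model.print_trainable_parameters()  # only a fraction of a percent are trainable
    # ...then train with the usual Trainer / training loop on your own dataset.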


How much would they cost approximately per training run?

What would the quality of the model be, compared to what Karpathy builds in his video "Let's build GPT from scratch"?

In that video, he builds a decoder-only transformer model that learns Shakespeare from 1MB of data and trains in 15 minutes.


These just fine-tune existing models, so the cost is single digit dollars. Whatever it costs in electricity to run a desktop GPU for a day or a few days.


Yeah, I remember in undergrad I was working on using transfer learning to train an object detector. Basically you only needed 100-ish images to get the model to detect that new object really well.

I'm not sure what the analogous term is for a similar process on LLMs, but that will be huge when there is a service for it.


LLMs can do that without any examples (zero shot) or with one or a few demonstrations in the prompt, if you can describe the task in the limited context window.

If you want for example to train the model to learn to use a very large API, or access the knowledge in a whole book, it might need fine-tuning.


Could I just train a very small LLM with an English dictionary + Python + large API documentation + large Python code base?

Then do some chat fine tuning (like what HF did with StarCoder to get ChatCoder)

And get a lightweight LLM that knows the docs and code for the thing I need it for

After that, maybe incrementally fine tune the model as part of your CI/CD process


How similar was the object to the other objects?

E.g., were you trying to distinguish an object vs nothing, a bicycle vs a fish, a bird vs a squirrel, or two different species of songbird at a feeder?

How much would the training requirements increase or decrease moving up or down that scale?


The PaLM 2 stuff released yesterday has fine tuning for their newest large models as a core feature.


We use Anthropic's Claude Instant in production and it has been much faster than Davinci/GPT-4 for a while. In terms of quality, Instant is at least as good as GPT-3.5.


Privacy as well! I'd much rather run a model locally than in the cloud even if I lose out in performance in both speed and quality.


Reliability surely? They still haven't managed to make a model that says "I don't know" rather than bullshitting. That's by far the biggest unsolved problem.


Don't forget the ability to fine tune based on one's own data sources. For me, this is more important than any of the six reasons you mentioned.


Also if you allow users to receive vector representations of context and provide such representations as side information when querying LLMs.


For me I’d say speed trumps all else. It’s impossible to truly reach scale with the glacial response times you get from current API.


>speed trumps all else

Then use GPT-2


I actually do prefer 3.5-turbo over 4 for many tasks.


More languages?


Agreed. This is an important feature. Not all people speak English.


The most interesting bit is that, for the first time since the release of ChatGPT in November 2022, OpenAI does not have the lead on LLMs anymore.

At least, for people who need large context windows, they would not be the first choice anymore.


Claude’s very quietly better on everything but pricing, for a while, it just got buried because they announced on “AI Tuesday” (iirc gpt4 and Bing announcement day)

The ChatGPT equivalent is 3x the speed and was somewhere between ChatGPT and GPT-4 on the TriviaQA benchmark replication I did.

Couple tweets with data and examples. Note they’re from 8 weeks ago, I know Claude got a version bump, GPT3.5/4 accessible via API seem the same.

[1] brief and graphical summary of speed and TriviaQA https://twitter.com/jpohhhh/status/1638362982131351552?s=46&...

[2] ad hoc side by sides https://twitter.com/jpohhhh/status/1637316127314305024?s=46&...


> I know Claude got a version bump, GPT3.5/4 accessible via API seem the same.

GPT3.5 just got an update a few days ago that resulted in a pretty good improvement on its creativity. I saved some sample outputs from the previous March model, and for the same prompt the difference is quite dramatic. Prose is much less formulaic overall.


Meanwhile GPT 4 got lobotomised around the same time.

It can't even solve simplified versions of problems it had zero issues with just a week ago.


Wait, really? I've only been using GPT4 and it seemed like it's been getting incrementally better. Do you have any test cases?


Literally all of them! Comprehension, translation, problem solving, etc….

It’s worse at everything.


You know, I'm noticing it seems to work better on off-hours - I get better quality responses late at night. I wonder if they're using a lighter model when there's more demand on the system.


Nah


Thank you, every little comment I get from fellow boots on the ground is so valuable, lotta noise these days.

Random Q: I haven’t used the ChatGPT front end much the past month or two, but used it a week back and it seemed blazingly faster than my integration. Do you have a sense of whether it got faster too?


Yeah I've noticed the front end for 3.5 responds almost instantly to complex questions. It'll pop out an entire page of code in under a couple seconds. I think API responses may be a bit faster than before, some of my queries that took 30s before now take around 20s, but they are obviously prioritizing their own site.


Is this update made visible somewhere? The language models offered on my Playground are still the ones from March, same with ChatGPT.


On chat.openai.com below text box there is a link to ChatGPT release notes. Current link text is "ChatGPT May 3 Version". The link leads to https://help.openai.com/en/articles/6825453-chatgpt-release-...


How is the code generation of Claude?


Note, all impressions based on Claude 1.2, got an email from Poe in the last week saying it was version bumped to 1.3 with a focus on coding improvements.

Impressions:

Bad enough compared to GPT-4 that I default to GPT-4. I think if I had api access I’d use it instead, right now it requires more coaxing, and using Poe.

I did find “long-term” chats went better, was really impressed with how it held up when I was asking it a nasty problem that was hard to even communicate verbally. Wrong at first, but as I conversed it was a real conversation.

GPT-4 seems to circle a lower optimum. My academic guess is that it’s what Anthropic calls “sycophancy” in its papers; TL;DR, GPT really wants to produce more of whatever is already in the context, so the longer a conversation with initial errors goes on, the harder it is to talk it out of those errors.


I have access to claude. It's not bad, but decently behind gpt4 for code


And is code generation ability equivalent to code understanding and search ability?


GPT-4 still leads in the chatbot arena[1] but at least it is a two horse race now.

[1] https://lmsys.org/blog/2023-05-10-leaderboard/


Well Google decided to stop releasing their AI research.


This is nice, but it can get quite expensive.

Let's say I have a book and I want to ask multiple questions about it. Every query will pay the price of the book's text. It would be awesome if I could "index" the book once, i.e. pay for the context once, and then ask multiple questions.


This more or less is already a thing and it's called RAG [1][2]. It essentially allows you to have a database of embeddings (in this case your book) from which a model can pull knowledge while producing answers. As for the standard operation of these generative models, the context window is the only working memory they have, so they must see the entire text each time.

[1] https://arxiv.org/abs/2005.11401

[2] https://huggingface.co/docs/transformers/model_doc/rag


Can you help me understand this? The research appears to be from a few years ago. Can this be used with Claude (for example)? How is it different from the approach many people are taking with vector stores and embeddings?


Other people seem to be suggesting that the user would do the retrieval of the relevant parts of the book from a vector DB first, and then feed those sections along with the question as the prompt. Conceptually it is very similar (and it too uses a vector database), but with RAG it would happen as part of the inference pipeline and therefore achieve better performance than the end user emulating it.


Yep, but your retrieval from the vector DB becomes your relevancy bottleneck.


It's not different. RAG is a way to train embedding stores end to end.


somehow got down voted on something I'm a professional expert at


With embeddings, you essentially can. Group the book into sections, embed each section, then when you do a prompt, add in the N most similar embedded sections to your prompt.
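
A rough sketch of that recipe, assuming sentence-transformers and a hypothetical book.txt (a real setup would use smarter chunking and a vector store):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("all-MiniLM-L6-v2")    # any embedding model works

    book = open("book.txt").read()                        # hypothetical source text
    chunks = [book[i:i + 2000] for i in range(0, len(book), 2000)]  # naive split
    chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

    def top_n(question, n=4):
        q = embedder.encode([question], normalize_embeddings=True)[0]
        scores = chunk_vecs @ q                           # cosine sim, vectors are unit length
        return [chunks[i] for i in np.argsort(scores)[::-1][:n]]

    question = "Why does Gatsby throw his parties?"
    prompt = "\n\n".join(top_n(question)) + "\n\nQuestion: " + question
    # send `prompt` to whichever LLM you're using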


What if the question is "What are the main themes of this work?"

Or anything where the answer isn't 'close' to the words used in the question?

How well does this work vs giving it the whole thing as a prompt?

I assume worse but I'm not sure how this approach compares to giving it the full thing in the prompt or splitting it into N sections and running on each and then summarizing.


That is solved by hypothetical embeddings.

Background: https://summarity.com/hyde

Demo: https://youtu.be/elNrRU12xRc?t=1550 (or try it on findsight.ai and compare results of the "answer" vs the "state" filter)

For even deeper retrieval consider late interaction models such as ColBERT
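
As I understand HyDE, the trick is to retrieve with an LLM-written hypothetical answer rather than the raw question, since answer-shaped text lands closer to the real answer passages in embedding space. A sketch under that assumption (generate_hypothetical_answer stands in for the LLM call, and chunks/chunk_vecs are your pre-embedded sections):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    def generate_hypothetical_answer(question):
        # Placeholder: in practice, ask an LLM "Write a short passage that
        # answers: {question}" and return its output.
        return "A short passage that plausibly answers: " + question

    def hyde_retrieve(question, chunks, chunk_vecs, n=4):
        fake = generate_hypothetical_answer(question)
        v = embedder.encode([fake], normalize_embeddings=True)[0]
        scores = chunk_vecs @ v
        return [chunks[i] for i in np.argsort(scores)[::-1][:n]]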


I'm not understanding how that works compared to having the full text?

Does the embedding structure somehow expose the themes? And if so, is it more the embeddings that are answering the question by how it groups things?


Any material comparing the different embedding models? I'm working on information retrieval from government documents and without any ML experience it's daunting


You pretty much summed up the drawbacks of the embeddings approach. In my experience it's pretty hard to extract the relevant parts of text, especially when the text is uniform.


You could do multi level summaries etc but yeah this is all just band aids around token limits.


I don't think it's as much of a band-aid as it first appears since this roughly mimics how a human would do it.

The problem is that humans have continuous information retrieval and storage where the current crop of embedding systems are static and mostly one shot.


Humans have limited working memory; we quickly forget short-term memories (unless they're super significant), and our long-term memory fades selectively if not reactivated or significant (intense).

This weird leaky memory has advantages and disadvantages. Forgetting is useful, it removes garbage.

Machine models could vary the balance of temporal types, drop out Etc. We may get some weird behavior.

I would guess we will see many innovations in how memory is stored in systems like these.


The real gain would be if we could use the 100K context window and not this "embeddings trick". Embeddings only work in cases where the answer sits in a short part (or parts) of the document. If the user asks something like "What are the main ideas?" or "Summarize the document," or any question that needs context from large portions of the book/PDF/file, then the embeddings trick of putting just a few short passages in the prompt won't work. But if large context windows stay expensive, we have to keep using embeddings and a few text snippets.


The price on this will plummet over the next few years, the economic benefits are too large


The economic benefits of mining asteroids are also too large to ignore yet here we are, levelling villages to dig for coal.

Just a few manufacturers hold the effective cartel monopoly on LLM acceleration and you best bet they will charge out the ass for it.


> The economic benefits of mining asteroids are also too large

Ironically, that's an example I like to list as "pure sci-fi fantasy, divorced from economic reality."

The total cost of the iron ore that goes into making a new car is about $200-$300, depending on various factors (size of the car, ore spot price, etc...).

Even if -- magically -- asteroid mining made not just "iron ore", but specifically the steel alloy used for car bodies literally free, new cars costing $30,000 would now cost... $29,700.

You can save more by skipping the optional coffee cup warmer, or whatever.

In reality: 90% of iron and steel is recycled, and asteroid mining is not magic.


Why would anyone want to mine asteroids for iron, given that it's the most abundant element on Earth? I don't think I ever recall seeing a proposal like that outside of the broader notion of "space factories" (where such mining makes sense to reduce the cost of shipping materials back and forth, not because it's cheaper as such).

It's stuff like platinum and germanium that makes asteroid mining potentially interesting.


Metallic asteroids are mostly nickel-iron.

On Earth, geological processes concentrate elements into ores, primarily through volcanic and hydrological means. Neither are available in small, cold asteroids devoid of liquid water. Hence asteroids are generally undifferentiated mineralogically, making mining them much less economically viable.

You often see total quantities listed as an amazing thing, glossing over the fact that the Earth has more of everything and in usefully concentrated lumps.


Market competition and innovation in both ML and hardware has consistently driven down the price of AI in the past decade. You only have to look at where we are with capabilities today compared to ten years ago when CIFAR100 classifiers were the state of the art.

Barring a Chinese invasion of Taiwan, these APIs will halve in price over the next year.


What would a Chinese invasion of Taiwan do to "tech"? It sounds like it would be awfully catastrophic.

On the other hand, for all the people worrying about China, they are pretty restrained given the enormous turmoil an invasion would cause. If they wanted to break the USA, now would probably be the time to do it?


Well for one thing the US seems committed to blow up TSMC to prevent China getting the tech.

https://www.theregister.com/2023/03/14/us_china_tsmc_taiwan/


Planning for the inevitable?


Well here's to hoping I guess.


I'm wondering what level you're thinking. Cloud vendors? GPU vendors? Fabs?


Given what's used right now to my knowledge, the main ones would be Nvidia's tensor cores, Apple's M chips and Google's cloud TPUs. All of that's TSMC I think?


Yes, but physics trumps economics.


Not sure about this one but you can usually ask multiple questions in one shot at least


Generation is more expensive than the prompt input (for Claude v1, generation is 3x the cost; for GPT-4 it's 2x the cost)

It makes the economics slightly trickier.


I wonder why this is? Naively there's no difference between the two from a transformer standpoint.

Perhaps it's because under the hood there's additional safety analysis/candidate generation that is resource intensive?


It's because the input tokens can be batch-processed in a single forward pass through the model, while generating tokens requires one forward pass through the model per token.

If you do the math of how much memory bandwidth is required by a forward pass vs. how much compute, you'll see that inference is entirely limited by memory bandwidth and will use compute resources very inefficiently. In contrast, input processing is able to fully use the available compute.

Of course, there are ways to mitigate this problem, like processing multiple token streams in parallel, but the fundamental problem remains.
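
Back-of-envelope version of that argument, with a hypothetical 52B-parameter fp16 model (numbers purely illustrative):

    # Decoding one token reads every weight once but only does ~2 FLOPs per
    # parameter, so arithmetic intensity is ~1 FLOP per byte -- far below the
    # hundreds of FLOPs/byte a GPU needs to stay busy. Prefill shares that one
    # weight read across every prompt position, so it's compute bound instead.
    params = 52e9                      # hypothetical model size
    weight_bytes = params * 2          # fp16
    prompt_tokens = 10_000

    decode_intensity = (2 * params) / weight_bytes                    # ~1 FLOP/byte
    prefill_intensity = (2 * params * prompt_tokens) / weight_bytes   # ~10,000 FLOPs/byte
    print(decode_intensity, prefill_intensity)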


Ah, this makes total sense. I was thinking about FLOPs in the abstract and not about the wall-clock time. Thanks for the explanation.


Normally the inputs are padded out to the context length [1] and so the cost to embed 1 token or N tokens is the same. The output is produced token-by-token and so the amount of GPU time increases with the number of output tokens.

[1] I'm not sure if these huge context lengths are achieved the same way (i.e. a single input vector of length N) but given the cost is constant for input I would assume the resource usage is too.


This doesn't match my mental model (or implemented model in the case of GPT2) of how self-attention works (you need to calculate the residual stream for each individual token, attending to all prior tokens before it). Have a link?


I work on infrastructure for serving large language models but I don't have any background in ML, so my perspective is looking at these models as a black box (and also conversations with the people that do the ML stuff). It is the case in practice at least from a latency side that with a fixed context length N, embedding any number of tokens from 0 to N takes the same amount of time. Perhaps it's a difference between the conceptual and actual implementation on GPU?

edit - This occurred to me after the fact but I wonder if the difference is that the use case I work with is processing batches of many different embedding requests (but computed in one batch), therefore it has to process `min(longest embedding, N)` tokens so any individual request in theory has no difference. This would also be the case for Anthropic however.


Ah, you're thinking about embeddings which are basically the encoder stack on a traditional transformer architecture. Modern GPT-like models (including Claude), however, drop the encoder and use decoder-only architectures.

I could imagine something where encoders pad up to the context length because causal masking doesn't apply and the self attention has learned to look across the whole context-window.
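
For anyone following along, a toy single-head version of that causal masking (numpy, no batching or multiple heads), just to show why the decoder stack alone can do next-token prediction:

    import numpy as np

    def causal_self_attention(x, Wq, Wk, Wv):
        # x: (seq_len, d_model)
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        scores = q @ k.T / np.sqrt(k.shape[-1])
        # Causal mask: position i may only attend to positions <= i.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores[future] = -1e9
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        return w @ v

    d = 16
    x = np.random.randn(8, d)
    out = causal_self_attention(x, *(np.random.randn(d, d) for _ in range(3)))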


Decoder only architecture? What is this? That doesn't sound like a transformer at all, are you saying gpt4 uses a totally different algorithm?


Nope, a decoder-only transformer is a variant of the original architecture proposed by Google [1]. All the variants of GPT that we know about (1 through 3) roughly use this same architecture, which takes only the decoder stack from the original Google paper and drops the encoder [2]

[1] Original Google Paper - https://arxiv.org/abs/1706.03762

[2] Original GPT Paper - https://s3-us-west-2.amazonaws.com/openai-assets/research-co...


How can it work without an encoder?


Everyone serious batches together short prompts so the cost is roughly proportional to the tokens.


Well, each additional token generated requires rerunning the model, right? To find the next likely token given the previous ones.


Naively, yes, but you can cache the bulk of that "rerunning" [1]. That said the (non-flash) attention costs go up with the length of the sequence so perhaps this is just a simpler way to approximate these costs.

[1] https://kipp.ly/blog/transformer-inference-arithmetic/
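
A toy single-head sketch of what that cache buys you: each new token only pays for its own projections plus attention over the stored keys/values, instead of recomputing the whole prefix:

    import numpy as np

    class ToyKVCache:
        def __init__(self):
            self.keys, self.values = [], []

        def step(self, x_new, Wq, Wk, Wv):
            # x_new: (d_model,) embedding of just the latest token.
            q = x_new @ Wq
            self.keys.append(x_new @ Wk)       # cache grows by one entry per token
            self.values.append(x_new @ Wv)
            K, V = np.stack(self.keys), np.stack(self.values)
            scores = K @ q / np.sqrt(K.shape[-1])
            w = np.exp(scores - scores.max()); w /= w.sum()
            return w @ V                        # attention output for the new token only

    d = 16
    Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
    cache = ToyKVCache()
    for tok in np.random.randn(5, d):           # pretend these are 5 generated tokens
        out = cache.step(tok, Wq, Wk, Wv)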


Yes, caching the states of the sequence would make sense. An issue is that it's still more expensive to compute the new tokens even if you cache the states viewed so far


The analogy I can think of here is a pointer, but AFAIK the context would always need to go along with the prompt unless you could tweak internal state to bias towards the context.

Otherwise, it might make sense to have a separate routine which compresses the context as efficiently as possible. Auto encoder?


I don’t see this in the article. Has Anthropic explained the mechanism by which they were able to cost-effectively expand the context window, and whether there was additional training or a design decision (e.g. alternative positional embedding approach) that helped the model optimize for a larger window?


No. As far as I know, they haven't said anything about this. Neither did OpenAI about gpt-4-32k.

MosaicML did say something about MPT-7B-StoryWriter-65k+: https://www.mosaicml.com/blog/mpt-7b. They are using ALiBi (Attention with Linear Biases): https://arxiv.org/abs/2108.12409.

I think OpenAI and Anthropic are using ALiBi or their own proprietary advances. Both seem possible.
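
For reference, the bias ALiBi adds to the attention logits is simple to write down (this follows the linked paper; whether OpenAI/Anthropic actually do this is pure speculation):

    import numpy as np

    def alibi_bias(num_heads, seq_len):
        # Per-head slopes: a geometric sequence (1/2, 1/4, ... for 8 heads),
        # as in the ALiBi paper for head counts that are powers of two.
        slopes = np.array([2.0 ** (-(i + 1) * 8.0 / num_heads) for i in range(num_heads)])
        pos = np.arange(seq_len)
        distance = np.minimum(pos[None, :] - pos[:, None], 0)   # 0 for self, -k for k tokens back
        # Added to q @ k.T before the causal mask and softmax; nearer tokens are
        # penalized less, which is what lets inference run past the training length.
        return slopes[:, None, None] * distance[None, :, :]      # (head, query, key)

    bias = alibi_bias(num_heads=8, seq_len=6)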


Interesting. Does the decision to use ALiBi have to be made before the model weights are first trained, or is there a way these models could have incorporated ALiBi, instead of or in addition to an alternative positional encoding method, after they were first trained?


The decision needs to be made before starting training. Maybe there is a clever way to add it after the fact in the style of LoRA? First, that would be a different method in its own right (just as LoRA is), second, I can't see how to do so easily. But then I just thought about it for a minute.


A lot of people are speculating online (https://twitter.com/search?q=anthropicai%20alibi&src=typed_q...) but I'm guessing it's ALiBi, which MPT-7B also used to get up to 85k of context


No, they are playing this close to the chest, similar to how OpenAI achieved 32k context limit.


Can LLMs take advantage of this bigger window to solve meaningful tasks though? I can't imagine in the training data, knowing what happened 100k tokens ago would be _that_ relevant to predicting the current token very often, so unless this is something that the model learns to leverage more implicitly, I'd be a bit pessimistic.


Yes. For instance, a large context window allows you to have a chat for months where the model can remember and make use of everything you’ve ever talked about. That enables creating a much more effective “assistant” that can remember key details months later that may be valuable.

A second example is the analysis of long documents. Today, hacks like chunking and HyDE enable us to ask questions about a long document or a corpus of documents. But it is far superior if the model can ingest the whole document and apply attention to everything, rather than just one chunk at a time. Chunking effectively means that the model is limited to drawing conclusions from one chunk at a time and cannot synthesize useful responses relating to the entire document.


It remains to be seen just how effective longer contexts are because if the attention vectors don't ever learn to pick up specific items from further back in the text then having more tokens doesn't really matter.

Given that the conventional cost of training attention layers grows quadratically with the number of tokens I think Anthropic is doing some kind of approximation here. Not clear at all that you would get the same results as vanilla attention.


They did mention that the inference time to answer a question about the book was something like 22 seconds, so perhaps they are indeed still using self-attention.


I'm not questioning whether it would be useful, just whether it's actually something that token masking in training is going to work to make the model learn this.


I would imagine that if you are training on the text of a novel, then anything that happened earlier in the text may be relevant for predicting the next events. Especially if it's something like a detective novel that has clues about the criminal's identity scattered across the story.

Also if you are training on a database of code.


Yeah but when you're training a neural net with backprop on a finite dataset, "this would help the model" ≠ "the model will learn this". This is 100% speculation, but my intuition is that it's not going to work very well unless it happens 'a lot' in the training data, or if they've curated the data specifically to try and make it learn long range signals.


Gets pricier as you chat for longer; imagine having to send a single chat line with a history of 20k tokens behind it.


I'd argue that books are a clear example where the 100k tokens context would make a huge difference.


I would guess that semantic similarity would be the stronger training signal than distance once you go beyond a sentence or two away.


I'm pretty dubious - how would the model not get absolutely swamped by the vast amount of potential context if it's not learning to ignore long range signals for the most part?


We need public benchmarks.

This is incredibly fast progress on large contexts and I would like to see if they are actually attending equally as well to all of the information or there is some sparse approximation leading to intelligence/reasoning degradation.


https://lmsys.org/blog/2023-05-10-leaderboard/

https://chat.lmsys.org/?arena

Claude by Anthropic has more favourable responses than ChatGPT


So I tried this prompt in their chatbot arena multiple times. Each time getting the wrong answer:

"Given that Beth is Sue's sister and Arnold is Sue's father and Beth Junior is Beth's Daughter and Jacob is Arnold's Great Grandfather, who is Jacob to Beth Junior?"


Is the right answer pointing out that Arnold might not be Beth's father, and so Beth Junior might be unrelated to Jacob?


I just tried it and gpt-3.5-turbo got it right.


ChatGPT3.5*

It's still below GPT4, but it is closer to 4 than 3.5


It means nothing as long as they don't actually let us test the API.

Good luck waiting for it.


“POC or GTFO” as the security people say. :-)


We have access and we’re already playing with the 100K model. It’s pretty insane. We‘re about to ditch all of our recursive summarization code.

If you want me to test something for you, lmk and I’ll send it through the api.


> We‘re about to ditch all of our recursive summarization code.

I’m in an adjacent industry and this is what I’m looking forward to.


See the pricing PDF[^1] and API docs[^2], but TL;DR:

- Price per token doesn't change compared to regular models

- Existing api users have access now by setting the `model` param to "claude-v1-100k" or "claude-instant-v1-100k"

- New customers can join waitlist at anthropic.com/product

[1]: https://cdn2.assets-servd.host/anthropic-website/production/... [2]: https://console.anthropic.com/docs/api/reference#parameters
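
If it helps anyone, here's roughly what a call looks like against the text-completions endpoint; treat the exact field and header names as assumptions and check the API reference above:

    import requests

    long_document = open("book.txt").read()    # up to ~100k tokens of input

    resp = requests.post(
        "https://api.anthropic.com/v1/complete",
        headers={"x-api-key": "YOUR_API_KEY", "content-type": "application/json"},
        json={
            "model": "claude-v1-100k",         # or "claude-instant-v1-100k"
            "prompt": f"\n\nHuman: {long_document}\n\nSummarize the document above."
                      f"\n\nAssistant:",
            "max_tokens_to_sample": 500,
        },
    )
    print(resp.json())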


No pricing, but given that OpenAI's GPT-4 doubles the cost-per-token if you go from 8k to a 32k context window, I suspect the pricing here will be 2-4x from the base Claude model which is 9k: https://cdn2.assets-servd.host/anthropic-website/production/...

Although with flash attention, who knows if marginal cost scales that consistently.


Pricing is the same as the base model.



Huh. Well that changes things.


Only for the duration of the beta


Source?


the actual tweet you linked.


It doesn’t say exclusively for the beta period


With an extremely literal reading you are correct, but there was clearly an implication.


<4x would be quite optimistic; at ~11x the tokens, the quadratic attention cost alone would be over 100x (even with the lower starting point of flash attention), so unless they already have excessive margins it wouldn't make much sense to go that low.


I was assuming they used a different architecture to get the increase instead of just letting it eat hardware that way. Especially with the speed numbers in the post.



Those are the same SKUs I linked.

The new models are a different model identifier that's not listed in the pricing doc, although it sounds like the intent may be to replace the base model, judging from the API docs: https://console.anthropic.com/docs/api/reference#-v1-complet...


I requested & have been waiting for access to Claude for nearly 3 months now. Guess the waitlist must be really long...


API access or just access to the chatbot?

You can go through Poe.com


You likely got rejected. Was the same for me and I reapplied with a good use case and was let in


It's very unfortunate that all of these AI models are so impressive, yet they're all heavily filtered and bogged down by these massive AI corporations. All the filtering that they do heavily impacts the performance of the language models. A 100K context would also be incredible for roleplay but infeasible because of the heavy filtering.


Claude may reject answers, but unlike OpenAI's GPT you can put words into the mouth of the assistant and essentially bypass safety checks.

In fact, Anthropic explicitly discusses putting words into the assistant's mouth to be able to shape its responses and make them better align with the desired output.


Eventually you will get your account banned, not to mention that the filtering that they do decreases the quality of the results you will get compared to an uncensored model, even if you can "jailbreak" it.


If we were to get banned for this, we'd have been banned long ago. We literally process "questionable" content "as a service" and this use case was explicitly approved. (We do heuristics and ML-assisted background checks on unstructured OSINT data.)


I think being "explicitly approved" matters a lot in this context. Regular customers can't benefit from the same privileges you did.


That's just wrong. We ran metric tons of questionable content before being approved for anything.


All I see in the link is empty PR claims - is there any information about how they're doing that? There are all kinds of known techniques that "expand" context window without really doing so, with different tradeoffs, and unless they provide actual information, any claims should be taken with a pile of salt, we shouldn't just assume that they actually have "true" 100k context windows.


How are LLMs increasing their context size? I guess you just increase the input size for the self-supervised GPT-3-style training, but what about RLHF? Are they creating datasets of books to input to the LLM and then making human labelers label the response? There might be a smart way that does not involve new datasets.


Mosaic wrote about their new model here. https://www.mosaicml.com/blog/mpt-7b It was trained on 65k inputs and has decent performance working with 80k+ tokens.


I don't think RLHF datasets need to take full advantage of the context window. There are also many ways to programmatically generate NLP datasets.


> You can drop multiple documents or even a book into the prompt and then ask Claude questions that require synthesis of knowledge across many parts of the text.

This is cool but does it also work the other way around? Generate a book's worth of content based on a single prompt?


Kinda. But it's going to be a lot like how data compression works. There will always be a somewhat fundamental limit to how much "creativity" you can get out of a small prompt generating large texts when using an isolated model.


That's a good question. Can Claude write a coherent book?


I am curious how consistent Claude is at obeying detailed instructions. One issue ChatGPT 3.5 and 4 have, even with just a few hundred words of instructions, is that they forget instructions given earlier on.[1]

This huge context window is awesome though, I'm trying to use LLMs to do small town social interaction simulations, with output in a structured format. Finding ways to compress existing state and pass it around, so the LLM knows the current state of what people in the town did for a given day is hard with a tiny token limit!

[1] For my use cases, early instructions tend to be describing a DSL syntax for responses, if I add too much info after the instructions, the response syntax starts getting wonky!


A simple example I ran into was when I asked ChatGPT to generate a story in madlibs format for my 4 year old daughter. They're in the format "The young _____ went to the ______, ...", and she fills in the blanks with silly nouns/adjectives.

As she kept asking for more, I prompted "great, do another one" and eventually my original instruction fell out of the context window. It continued to generate a children's story, but with no more blanks.


This is actually a different issue, largely a UI one, although one I wish ChatGPT would fix.

There is no good way to tell it "this isn't a conversation, just repeat the answer to the initial prompt again".

The solution is to just re-paste the initial prompt in each time, but still it isn't ideal. There isn't a good way to tell chatgpt "you can throw away all the context after the initial prompt and up until now".

Of course the entire point of ChatGPT is that it maintains a conversation thread, so I get why they don't fix up this edge case.

My problem is more of, I give ChatGPT some complicated instructions, and it'll start forgetting the early on instructions long before any token limit is reached.

So for example, if early on I ask for certain tokens to be returned in parens, well my initial prompt is too long, it'll forget the parens thing and start returning tokens without the surrounding (), which then breaks my parser!


Almost every UI for LLMs I've seen has a way to specify an initial prompt that never goes out of context, it's strange that it's not a feature in ChatGPT.
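
On the API side it only takes a few lines to emulate; a minimal sketch (with a crude word-count stand-in for real token counting):

    def build_messages(system_prompt, history, budget_tokens=3000):
        # Keep the instruction message pinned; drop the oldest turns first.
        def approx_tokens(text):
            return int(len(text.split()) * 1.3)          # rough heuristic only

        kept, used = [], approx_tokens(system_prompt)
        for turn in reversed(history):                    # walk newest -> oldest
            cost = approx_tokens(turn["content"])
            if used + cost > budget_tokens:
                break
            kept.append(turn)
            used += cost
        return [{"role": "system", "content": system_prompt}] + list(reversed(kept))

    messages = build_messages(
        "Write stories in madlibs format, leaving blanks as ______.",
        [{"role": "user", "content": "Tell me a story."},
         {"role": "assistant", "content": "The young ______ went to the ______ ..."},
         {"role": "user", "content": "Great, do another one."}],
    )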


big if true? :)

Exciting to see competition across LLMs for increasing context window size.

I can't find updated pricing anywhere. Previous prices are here: https://cdn2.assets-servd.host/anthropic-website/production/... but don't seem to be embedded directly on the Anthropic website. I tried messing with the URL (apr -> may/jun) but 404'ed.


> Exciting to see competition across LLMs for increasing context window size.

Maybe. I think the debate is going to continue about prompt optimization vs. context window size.

A while ago, I had a rather interesting conversation with GPT-3.5 about forgetting things. Knowing what to forget, or delete from the prompt, may be just as important as what to put in it.

Putting the kitchen sink into the prompt probably isn't going to help much, past a certain point and it may be putting certain things in there based on time and context is a better strategy.


Yeah, there's definitely diminishing returns. I just wanted to talk to ChatGPT about a game I'm developing. I have pages upon pages of product design notes and I'm not able to just copy/paste the whole thing in and start talking to it at 8k context length. There's not really duplicate information as far as I can tell since each section covers new topics. I'm sure there's a way to express the same ideas more succinctly, but I kind of want ChatGPT to do that for me rather than me figuring out how to do that just to interface the ideas into it.


Hey! Try this: https://github.com/featurebasedb/DocGPT

Holler if you want help with it. I have some more code I'm adding to it this week.


With quadratic time complexity for context size, that gets expensive.


Did anyone else get on the waitlist, get in, and now their console link doesn't work? I remember deciding the code generation wasn't good enough to bother. Not sure if I actually ever activated it but I guess not.

Now I tried to request access again on their form and it just redirected. Can't even tell if that worked.

Does anyone know if this can program as well as GPT-4? Because if so then the larger context window is a big improvement.


I do have access to it and from my very limited testing it looks like it can program at least on par with GPT-3.5. I didn't have time yet to test it more comprehensively against GPT-4.


OK great thanks that's what I heard. Very interested to hear about comparisons with GPT-4.


Curious what this will mean for the vector db vendors. Imagine finetuning would be quick and cheap. Could there be a world where vector dbs aren’t needed anymore?


A 100k context limit is still a limit (we have no idea how Anthropic is achieving this - whether it's an extension of the base model's context limit itself, some vector DB trickery in the backend, or possibly even RAG). Even in this example, though it can fit the entire text of The Great Gatsby, that is still one book/text/document. Typical business use cases require searching through hundreds if not thousands of documents/books, finding similar vector embeddings across all of them, and fetching the top-K results (this is how Google search works when it has to scan through embeddings for billions of websites). Those top-K results can then be stuffed into the 100k context to produce an even more holistic picture, rather than stuffing just one book/PDF/file into the context. It depends on the requirements, though. I don't see how this would affect vector DB vendors who can process billions of vectors per query and provide top-K results.

Also, a massive context length is not necessarily a good thing from a cost perspective. It also doesn't work great with a chatbot, as you will have to feed the same 100k tokens' worth of context back into the chatbot for every question, which will turn out to be very expensive. At some point you will have to discard parts of the context so it's specific to the question being asked, and that is where vector embeddings come into play. For one-off research/Q&A the 100k limit works great!


Anyone using Claude? How long did it take you to get access?


Claude is available for free in the Poe app (poe.com). I think it's good and underappreciated.


It is good, but the free subscription to Poe only provides access to Claude Instant. It's impressively fast but not their smartest model (claude-v1.3).


yeah, been using it instead of ChatGPT and it performs better IMO. My conversational LLM of choice for sure.


I've got access, it's _blazing_ fast and seems very good. Solved some of my little puzzles that other models couldn't. I haven't tried ChatGPT-4 yet, but it's the best one that I have used.


You need to try GPT4, if only because GPT3.5 really doesn't compare to it in a lot of ways.


GPT-4 is a major leap ahead of everything else I've used (including GPT-3.5), so definitely worth trying for comparison.


My wallet is hardly capable of handling 8k GPT-4.


This is a really useful thread with posts from people actually using LLMs.

I wonder if Claude-100k could be used to ingest this entire thread and then answer questions based on it, or summarize or identify the pros/cons of certain aspects of Claude, large context windows, vector embeddings, etc.


This seems like it could be a game changer. Modern LLM based applications face a balancing act of context limitations, which often results in some kind of mapreduce-type behavior when that context can’t fit the input

If contexts keep growing, the landscape of LLM application engineering will as well


The problem is that there are usually no public benchmarks, so it is hard to really compare at long context lengths and see whether the models still perform tasks just as intelligently.


Maybe this model can finish Winds of Winter and the rest of GoT for us...


75,000 words is a drop in the bucket for A Song of Ice and Fire:

https://blog.fostergrant.co.uk/2017/08/03/word-counts-popula...



You'd want to generate it in multiple steps to make it feasible to control the text generation anyway. First call generates the broad outline, several parallel calls flesh out character development and some other details so that they're consistent, then generate the story piece by piece by feeding in bits of the outline.


And then you end up with what the movie did which is not exactly a GRRM novel.


That may need a million tokens just for one book, though!


I’d be excited for Dexter ending that doesn’t suck.


That's actually a really interesting use case!


Add Berserk to that list.


I often prefer Claude over GPT4 (partially due to speed), but it degrades more quickly. Like I can get a better response early, but usually the quality drops faster. But, sometimes if it can really vibe with it, it gets better over time.


This is a fascinating study on the impact of context windows on language models. It's interesting to see how smaller context windows can lead to more efficient and accurate language models, even when dealing with complex natural language tasks like question answering. I think this research could have important implications for a wide range of applications, from chatbots and virtual assistants to machine translation and text summarization. I'm looking forward to seeing how these findings are further developed and applied in real-world scenarios.


Nice, that's roughly a 250-page book based on average word counts.


Has anyone actually tried to talk with the Anthropic team about obtaining commercial usage permissions? I'm slightly wary that anything that says "talk to our sales team and explore partnerships" will be a waste of time. But happy to change my mind if anyone has gone through the pain and found it worth it.


Anthropic is basically Google's OpenAI.


It's not a Google company; Google's share amounts to ~10%.


How much did Microsoft own before the 49% deal?

Also, Anthropic has a "Google Cloud partnership"; basically they are hooked on cloud credits, just like OpenAI is with Azure.


Would be great to see some benchmarks on how loss changes across this very large context. It's been technically possible to do 1M+ token contexts for some time with performance deterioration, so it would be interesting to see how this compares to those efforts.


Google is really trying to catch up to OpenAI & MS. The truth is they have never been in the race to begin with. All they had and still have is PR stunts. Let's see if their copying of MS model will produce anything useful.


> The truth is they have never been in the race to begin with.

Product race? My understanding is they've been so concerned with safety/harm that they've been slow to implement a lot of tools - then OpenAI made an attempt at it anyway.

Google has generally been ahead from a research perspective though. And honestly it's going to be really sad if they just stop releasing papers outright - hopefully they release their previous-gen stuff as they go :/


I don't know how anyone can say this with a straight face when Google is the one who invented LLMs as used today to begin with.

Google has a product issue, not an AI research one.


It's usually the least informed with the most self-assured sweeping opinions.


DeepMind and Google invented many other things, but I think the first GPT style token predictor was actually ... GPT, a model by OpenAI. RLHF was also invented at OpenAI. They also had the first text-to-image model.


Curious why you think this? PaLM2 looks great, and Google has been productizing cutting edge AI pretty fast for years.


PaLM 2 can't even solve "Write three sentences ending in the word Apple."

It's worse than GPT-3.5. Go see for yourself at bard.google.com, which is running on PaLM 2 everywhere but the EU as of yesterday.


Ah yes, the famous benchmark for all LLMs. I just tried your novel example with GPT-3.5 and it couldn't solve it either:

> After lunch, I like to snack on a juicy and crisp apple to satisfy my sweet tooth.

> In the fall, many families enjoy going to apple orchards to pick their own apples and make homemade apple pies.

> The new MacBook Pro features a powerful M1 chip and a stunning Retina display, making it the perfect tool for creative professionals who work with Apple software.


Eh, I think as "human evaluated" metrics go, it's a decent test of how well it can parse a reasonably complex sentence and reply accurately.

For me:

GPT4 3/3: I couldn't resist the temptation to take a bite of the juicy, red apple. Her favorite fruit was not a pear, nor an orange, but an apple. When asked what type of tree to plant in our garden, we unanimously agreed on an apple.

GPT3.5 2/3: "After a long day of hiking, I sat under the shade of an apple tree, relishing the sweet crunch of a freshly picked apple." "As autumn approached, the air filled with the irresistible aroma of warm apple pie baking in the oven, teasing my taste buds." "The teacher asked the students to name a fruit that starts with the letter 'A,' and the eager student proudly exclaimed, 'Apple!'"

Bard 0/3: Sure, here are three sentences ending in the word "apple": I ate an apple for breakfast.The apple tree is in bloom. The apple pie was delicious. Is there anything else I can help you with?

Bard definitely seems to fumble the hardest, it's pretty funny how it brackets the response too. "Here's three sentences ending with the word apple!" nope.

Edit: Interestingly enough, Bard seems to outperform GPT3.5 and at least match 4 on my pet test prompt, asking it "What’s that Dante quote that goes something like “before me there were no something, and only something something." 3.5 struggled to find it, 4 finds it relatively quickly, Bard initially told me that quote isn't in the poem but when I reiterated I couldn't remember the whole thing it found it immediately and sourced the right translation. It answered as if it were reading out of a specific translation too - "The source I used was..." Is there agent behavior under the hood of Bard, or is that just how the model is trained to communicate?


I guess PaLM2 is competitive with GPT-3.5 so for people not willing to pay it will be an attractive offering.

I'm not sure that counts as 'great' though.


Based on what do you think it's comparable to GPT-3.5 and not to 4? Did we see a lot of public performance?


They claim it is already being used in Bard; also, if you read the paper, it does much worse on the important benchmarks.


OpenAI is the Microsoft Explorer of AI.


Google has multiple horses in this race.

They invested $300m in Anthropic in late 2022: https://www.ft.com/content/583ead66-467c-4bd5-84d0-ed5df7b5b...

(Non-paywall: https://archive.is/Y5A9B)


What's the catch? Using GPT-4 relative to its own marketing copy was a letdown.


Is this real input context or is it some vectordb in the background type trickery?


Pretty sure it's not "real" (model) context width.

Another wide-context model is MosaicML's MPT-7B-StoryWriter-65k+, which they describe as having a context width of 65k, but then they give a bit more detail and say they are using ALiBi - a type of positional encoding that allows longer contexts at inference time than at training (i.e. beyond the real context width of the model).

For these types of "extended context" models to actually reason over inputs longer than the native context width of the model, I assume that there is indeed some sort of vector DB trickery - maybe paging through the input to generate vector DB content, then using some type of Retrieval Augmented Generation (RAG) to process that using the extended contexts?

Maybe someone from Anthropic or MosaicML could throw us a bone and give a bit more detail of how these are working !

https://www.mosaicml.com/blog/mpt-7b

https://arxiv.org/abs/2005.11401


There is no trickery going on for MPT. The model is open source. MPT-7B-StoryWriter-65k+ was trained on books with 65k context length, so there is nothing not "real" about it. The point of ALiBi is that you can reuse training you did with short contexts for long contexts to save compute, but of course you can also just train with long context.

Who knows about Anthropic.


appears to be real input context


Ok. It has spatial comprehension of some level. Unlike GPT-4 it lacks proper time comprehension because it is bad at calculus. Unlike GPT-4 it can't properly solve the traveling salesman problem.


Is there any path towards folding tokens into the actual model? That is, continual training rather than the current "training first then just tokens after"


PaLM 2 on Vertex AI, which Google just released yesterday, has fine-tuning of the large models as a core part of the offering.


There has got to be a number of fascinating tricks that they're using to support context lengths that long. Shame it's all closed-source.


I use GPT-4 through the API, but I can't help but hate the token/character-based pricing of these LLM APIs we've seen so far. Because the entire context needs to be fed back into the model, as my conversation gets longer, it gets more expensive. Yeah, it's fractions of a cent and cheaper, but something about it is so psychologically taxing that I'd rather pay a flat sum per month and get unlimited access, even if it costs more given my usage.


Have you tried starting a new chat after your first question, but refining your new prompt to include some info you gathered from the first response? This way, you know exactly how many tokens you're going to send.
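
And if you want to know the exact count before sending, tiktoken will tell you (model name here is just an example):

    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-4")
    prompt = ("Earlier you said the bug is in the parser. Given only that, "
              "how would you write a regression test for it?")
    print(len(enc.encode(prompt)), "tokens")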


Nice. Will we be able to get to 1M tokens?


Seems like a good target. Even 100K seems too small. As a reference point, the Bible is ~750,000 words.


"You are a hebrew god and below the dashes is The Word. Who will you smite today?"


How does Claude stack up to GPT-4?


The "Request Access" button currently does nothing


Their sign-up form does not let me sign up for early access.

A bit disappointing


Does context window size matter beyond a certain point?


Anyone with API access tried this for coding already?


So I'm just going to paste in a few physics books and ask it to "make fusion"

What is the approach to increase the sequence length here?


How do I sign-up? What is the cost?


Going to be absolutely expensive.


this is an advertisement, wouldn't it be nice to know this in advance?


In advance of what? The first three words on this article's page, beyond the top banner, are "product", "announcements", and "Introducing". What other category of article could you more reasonably expect to encounter after reading those three words?


in advance of clicking


Is there any paper or architecture about this model that explains how they achieved it? is it unlimiformer [https://arxiv.org/abs/2305.01625?utm_source=tldrai] or something different?


This is incredible


god I'd love to work there


> When we asked the model to spot what was different, it responded with the correct answer in 22 seconds.

Now we've gone from using ML to implement slow, unreliable databases, to using ML to implement slow, unreliable string comparison, I guess


Sounds expensive. I guess we know where the $580M 'investment' from SBF is going now.


The day a quantum computer is able to host a huge LLM, things will get really interesting for humanity.

I say this, because I'm not sure how all of this is really going to scale on GPUs. It feels like LLM's are just as magical as quantum computing.


I am noticing a different tone coming from Anthropic. Unlike OpenAI, they don't appear to be focused on FUD and replacement. It gives the impression it's run by adults instead of crypto bros turned AI experts. Curious how their models will work.


>Unlike OpenAI, they don't appear to be focused on FUD and replacement.

I do not have a clue what you are talking about. What happened?


Um Ilya Sutskever isn't a crypto bro.


No, but Sam Altman is. That company can go whistling.



