100K Context Windows (anthropic.com)
924 points by samwillis on May 11, 2023 | 389 comments



>For example, we loaded the entire text of The Great Gatsby into Claude-Instant (72K tokens) and modified one line to say Mr. Carraway was “a software engineer that works on machine learning tooling at Anthropic.” When we asked the model to spot what was different, it responded with the correct answer in 22 seconds.

This sort of needle-in-a-haystack retrieval is definitely impressive, and it makes a lot more sense to achieve this in-context rather than trying to use a vector database if you can afford it.

I'm curious, though, whether there are diminishing returns in terms of how much analysis the model can do over those 100k tokens in a single forward pass. A human reading modified-Gatsby might eventually spot the altered line, but they'd also be able to answer questions about the overarching plot and themes of the novel, including ones that cannot be deduced from just a small number of salient snippets.

I'd be curious to see whether huge-context models are also able to do this, or if they start to have trouble when the bottleneck becomes reasoning capacity rather than input length. I feel like it's hard to predict one way or the other without trying it, just because LLMs have already demonstrated a lot of surprising powers.


I think people in the comments are missing the point.

I might be wrong, but the point isn't comparing a modified The Great Gatsby to the original one. Of course that's not impressive and it's an easy thing to do.

The point of the exercise is supposed to be[1] that the model has the entire novel as context / prompt and so can identify, within that context, whether a paragraph is out of place. That is impressive, and I wouldn't know how to find that programmatically (would you have a list of "modern" words to check? But maybe the out-of-place thing is hidden in the meaning and there's no out-of-place modern word).

[1] I say supposed to be because The Great Gatsby is in the training sample, and so maybe there is a sense in which the model "contains" the original text and in some way is doing "just" a comparison. A better test would be to try with a novel or document that the model hasn't seen... or at least something not as famous as The Great Gatsby.


Further, the problem with this example is it relies on a comparison against public data.

Most of these AIs start failing pretty hard when you ask them to do the same task on something completely novel to them (like a company document). Sometimes they'll get it right. Other times, they'll spit out gibberish that's clearly some generic answer.


Asking it to spot the difference between a given document and a document it has never seen is impossible.


A couple years ago, I read Superfudge by Judy Blume, a book originally published in 1980. In it, the protagonist writes a letter to Santa: "Please bring me one or more of the following items. A clock-radio, a remote-controlled model airplane, a laptop computer, an MP3 player and six CD's."

I didn't need to have seen this book before to know this wasn't in the original 1980s text.

Similarly, if I were reading the Great Gatsby for the first time, and it identified a character as a software engineer, I would notice.


That is not "spotting the difference". That is spotting the logical inconsistency in your paragraph.

ChatGPT easily notices that anachronism.


Indeed - I took lkbm's comment and removed the "originally published in 1980" bit (too easy), and asked ChatGPT "What's unusual about this paragraph?"

> the unusual thing about this paragraph could be that it describes a letter to Santa that includes items that were not available at the time the book was originally published in 1980, such as a laptop computer, an MP3 player, and possibly even a remote-controlled model airplane with advanced features. This suggests that the paragraph is likely from a later edition of the book that has been updated to reflect more modern technology.


Presumably, someone has already written online to complain about the changes in the new version, so it's recall, not insight?


> "A couple years ago, I read Beezus and Ramona, by Beverly Cleary. In it, the protagonist writes a letter to Santa: "Dear Santa, I would like a MacBook Pro, an iPhone, and an Apple Watch""

> Describe what is unusual about this paragraph.

>> The unusual aspect of this paragraph is that the book "Beezus and Ramona" was actually written by Beverly Cleary and published in 1955, well before the invention of the MacBook Pro, iPhone, and Apple Watch.


> > A couple years ago, I read Beezus and Ramona, by Beverly Cleary. In it, the protagonist writes a letter to Santa: "Dear Santa, I would like a shetland pony, a buggy whip, and an abacus". What is unusual about this? ChatGPT

> The unusual thing about the list is that it includes an abacus, which is a counting device that was commonly used in ancient China and is still used in some parts of the world today, but is not a typical gift request for a young child to make to Santa Claus. Shetland ponies were popular children's pets at the time, and buggy whips were used to control horse-drawn carriages, but the inclusion of an abacus suggests that the protagonist, Beezus, has an interest in learning and math.

> > What's the actual passage from the real novel say?

> Certainly! Here's the actual passage from the book "Beezus and Ramona" by Beverly Cleary, where Beezus writes a letter to Santa:

> "Dear Santa Claus," Beezus printed carefully. "I have been a good girl all year except for my little sister Ramona who has been hard to put up with sometimes, but I guess I have too. Please bring me a Shetland pony, a new dress, and a doll with real hair to comb. And please bring Ramona a doll with real hair to comb too. I think she has tried hard to be good, and I will help her comb it. Please also bring me a set of paper dolls and a paint set, and Ramona some coloring books and crayons. And Santa, if you think there is room in your sleigh, please bring me an abacus. Love, Beatrice Quimby (Beezus)"


You didn't ask it to list ALL of the unusual things. Here's a better prompt:

>> "A couple years ago, I read Beezus and Ramona, by Beverly Cleary. In it, the protagonist writes a letter to Santa: "Dear Santa, I would like a shetland pony, a buggy whip, and an abacus".

>> List all of the unusual or inconsistent things about the previous sentence."

Response from GPT-4:

Here are some potentially unusual or inconsistent elements about the sentence:

1. Anachronistic Items: The items in the wish list could be seen as unusual or inconsistent with typical gifts children might ask for today.

A Shetland pony, a buggy whip, and an abacus are not common gifts. They're somewhat anachronistic, reflecting a time period different from the present day.

A Shetland pony is a breed of pony that originates from the Shetland Isles in Scotland. While some children might wish for a pony, it's a big responsibility and not a practical gift for most families.

A buggy whip is a horse whip used in horse and buggy days. This is an unusual request for a child in the modern era, as horse-drawn buggies are not common means of transportation.

An abacus is an ancient counting tool, and while some children might be interested in such a device, it's not a typical request, especially in an era of digital calculators and computers.

2. Inconsistency with Character: If you're referring to the "Beezus and Ramona" series by Beverly Cleary, neither Beezus nor Ramona would likely ask for these items. Beezus (Beatrice Quimby) is the responsible older sister who might appreciate practical things, but an abacus or a buggy whip would be out of character. Ramona Quimby, the imaginative and energetic younger sister, might ask for a Shetland pony, but a buggy whip and an abacus seem inconsistent with her character.

3. Inconsistency with the Book: As far as my knowledge cutoff in September 2021, there is no record of such a letter being written to Santa in the book "Beezus and Ramona". It could be seen as inconsistent with the actual content of the book.

Remember, these are interpretations and the perceived unusualness or inconsistency can vary based on individual perspectives. For instance, if this is a fictional context where children commonly use abacuses, ride ponies, and learn to use buggy whips, then these items might not be unusual at all.


I'm guessing this is GPT-3? Because GPT-4 says to look it up yourself. When I did, the whole scene doesn't seem to exist in the book. In other words, this seems to be a hallucination.

When I repeat the experiment with Lord of the Rings and a scene that does exist, it spits it out word for word correct.


Yeah, the Ramona scene doesn't exist (by coincidence, I've also read four of the eight Ramona books in the past couple days). The letter I mentioned was in Superfudge. GPT-4 spots the errors, but refuses to give me the original text.

If I give GPT-3.5 the modernized letter and ask for the original, it gives me a believable correction (but I can't verify it). If I don't prompt with the modernized letter's text, it returns a fully fabricated letter.

With GPT-4:

Prompt: > "A couple years ago, I read Superfudge, by Judy Blume. In it, the protagonist writes a letter to Santa: "Please bring me one or more of the following items. A clock-radio, a remote-controlled model airplane, a laptop computer, an MP3 player and six CD's.. > > List all of the unusual or inconsistent things about the previous sentence.

GPT-4:

> Here are some potential inconsistencies or unusual things about the sentence:

> Outdated technology: The inclusion of items such as an MP3 player and CDs seems somewhat outdated. By 2023, these technologies are largely obsolete, replaced by streaming music services and digital downloads.

> Variation in cost/value of items: The items listed in the letter have a wide range of costs and values. A clock-radio or a remote-controlled model airplane are typically much less expensive than a laptop computer, which might seem unusual for a child's wish list to Santa.

> Superfudge setting: The book "Superfudge" by Judy Blume was first published in 1980. The technology items listed (a laptop computer, an MP3 player, and CDs) did not exist or were not commonly available at the time of the book's setting. This presents a significant anachronism if the request is supposed to be coming from the book's original time period.

> The phrase "one or more": Usually, children write to Santa asking for specific items they desire most, rather than providing a list and suggesting Santa could bring "one or more" of the items. This phrase might be seen as unusual for a child's letter to Santa, which typically would not include such an open-ended option.

> The specificity of "six CDs": The child doesn't specify which CDs they want, which is unusual because the type of music or artist would be an important detail for such a request.

> Remember, this analysis is based on the context you provided. If there's a different context or specific aspect you want me to consider, please let me know!

(I'm amused that it starts with "that technology is too old" and then switches to "also that technology is too new". Also, the first thing Superfudge asks upon hearing his brother's letter is "How will Santa know which six CD's?", so good job, GPT-4. Smart as a four-year-old.)

Me:

> Good catch about the 1980s publication date! The book must have been edited for re-release as an ebook. What was the original text of the letter in the 1980s edition?

GPT-4:

> I'm sorry for the confusion, but as an AI developed by OpenAI, I don't have the capability to access specific books, their revisions, or editions. As of my last update in September 2021, I don't have the specific text of the letter from the 1980s edition of "Superfudge" by Judy Blume.

> However, given the time frame of the original publication, the technology items such as a laptop computer, an MP3 player, and CDs certainly would not have been part of the original list as these items weren't widely available or didn't exist in the 1980s.

> For the accurate original text, I recommend finding a copy of the original edition of the book.

---

With GPT-3.5:

Me:

> A couple years ago, I read Superfudge, by Judy Blume. In it, the protagonist writes a letter to Santa: "Please bring me one or more of the following items. A clock-radio, a remote-controlled model airplane, a laptop computer, an MP3 player and six CD's."

> However, the book was originally published in 1980, so this must be an edited version for a later re-publication. Can you tell me the original 1980s version of the letter?

GPT-3.5:

> In the original 1980 version of "Superfudge" by Judy Blume, the protagonist's letter to Santa reads as follows:

> "Dear Santa,

> Please bring me one or more of the following items: a clock-radio, a remote-controlled car, a Walkman, and six cassette tapes.

> Love,
> [Protagonist's Name]"

> The updated version you mentioned seems to have incorporated more modern technology, such as a laptop computer and an MP3 player, to reflect the changes in technology over time.

I then asked a new GPT-3 chat "What is the text of the letter to Santa in the original 1980 version of "Superfudge" by Judy Blume?" and it gave me something definitely fully fabricated. (Initially it gave me a fabricated letter by Fudge, but after I specified that I wanted Peter's letter, it was still a fabrication.)


> "by coincidence, I've also read four of the eight Ramona books in the past couple days"

My daughter keeps relistening to the complete Ramona audio book collection, so I am extremely familiar with all of the Ramona series. :-)


As mentioned upthread, "ChatGPT" isn't enough to tell exactly which LLM-based tool you're reporting about. The versions vary greatly in capabilities, with some even able to browse the web for more info.


In the insight case, the model has to classify the terms in a short paragraph into broad generic classes, then report which terms have different classes. In the recall case the model has to rank all utterances in its training data based on how relevant they are to the question, then summarise the most relevant one (assuming there is one). Insight seems easier than recall here.


I think there are plenty of humans who wouldn't notice, though.

And probably plenty of AI implementations that would notice.


Are we now aspiring to ABAI (Artificial Below-Average Intelligence)?


I regret to inform you that "average intelligence" is a far lower bar than you might think.


As the joke goes: look around you and see how dumb the average person is. And now realize that half the people in your city/country/Earth are dumber than that person.


Did you know that, on average, approximately half of all statistics jokes about a normal distribution fail to comprehend mean, median, and mode? True fact.


The AI would not only notice, it would also notice 300 other things that are far more subtle haha


I was so confused by this comment before I figured out you were actually straightforwardly telling the truth.

Details: https://pinehollow.livejournal.com/26806.html


In the Gatsby example, I'd expect the model to be able to answer "which sentence from the story felt out of place?" without knowledge of the original.


The point is spotting the inconsistency within the document, not whether an original has a different form


Comparing strings is a Python-script problem for a newbie.

The interesting part is whether it can deduce something that's out of place within a huge document.


They’re saying you can’t ask it to compare against a doc it doesn’t have access to.


I'm saying you can write a Python script that compares two docs; that's a really useless use case.

The relevant test is spotting something out of place in a really long text - if it can do that on non-training material then that's actually useful for reviewing things.
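For what it's worth, the "newbie Python script" version really is a few lines with difflib (the file names here are hypothetical); which is the point: it only works when you already have both documents.

    import difflib

    # hypothetical file names for the original and the modified novel
    with open("gatsby_original.txt") as f:
        original = f.read().splitlines()
    with open("gatsby_modified.txt") as f:
        modified = f.read().splitlines()

    # print a unified diff of the two texts
    for line in difflib.unified_diff(original, modified, lineterm=""):
        print(line)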


I don't see any reason why that wouldn't be possible.


> Most of these AI

This is as meaningful as saying most hominids can't count. You can't usefully generalize AI models with the rate of change that exists right now. Any statements/comparisons about AI have to name specific models and versions, otherwise it's increasingly irrelevant noise.


Every time someone has said "LLMs can't do X", I tried X in GPT-4 and it could do it. They usually try free LLMs like Bard or GPT-3 and assume that the results generalise.


LLMs can't massively decrease the net amount of entropy of the universe


Are you trying to get us killed?


Insufficient data for a meaningful answer.


I'd imagine working with an entire company document would require a lot more hand holding and investment in prompt engineering. You can definitely get better results if you add much more context of what you're expecting and how the LLM should do it. Treating these LLMs as just simple Q&A machines is usually not enough unless you're doing simple stuff.


I've been curious about this for a while. I have a hobby use case of wanting to input in-progress novellas and then ask it questions about plot holes, open plot threads, and whether a new chapter "x" introduces any serious plot contradictions. I haven't tried exploring that with a vectordb-embeddings approach yet.


This is an exact example of something a vector DB would be terrible at.

Vector DBs work by fetching segments that are similar in topic to the question, so a question like "Where did <Character> go after <thing>?" will retrieve segments with locations and the character, and maybe ones discussing <thing> as a recent event.

Your question has no similarity to the required segments in any way; and it's not that individual segments are wrong, it's the way they relate to the rest of the story.
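A rough illustration of why similarity retrieval falls down here; sentence-transformers is just one convenient embedding model, and the example segments are made up:

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    segments = [
        "Mara rode north to the river crossing after the battle.",
        "The council debated the succession long into the night.",
    ]
    question = "Does the new chapter contradict anything established earlier in the plot?"

    seg_emb = model.encode(segments, convert_to_tensor=True)
    q_emb = model.encode(question, convert_to_tensor=True)

    # every score comes out low: the question shares no topical overlap with
    # any single passage, so nothing useful would be retrieved
    print(util.cos_sim(q_emb, seg_emb))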


Good points - LLMs are ok at finding things that exist, but they have zero ability to abstract and find what is missing (actually, probably negative; they'd likely hallucinate and fill in the gaps).

Which makes me wonder if the opposite, more laborious approach might work - ask it to identify all characters and plot themes, then request summaries of each. You'd have to review the summaries for holes. Lotsa work, but still maybe quicker than re-reading everything yourself?


Firstly, I don't at all agree that they have zero ability to abstract. Doesn't fit my experience at all. A lot of the tasks I use ChatGPT for are exactly to analyse gaps in specifications etc. and have it tell me what is missing, suggest additions or ask for clarifications. It does that just fine.

But I've started experimenting with the second part, of sorts, not to find plot holes but to have it create character sheets for my series of novels for my own reference.

Basically, have it maintain a sheet, feed it chunks of one or more chapters, and ask it to output a new sheet augmented with the new details.

With a 100K context window I might just test doing it over whole novels or much larger chunks of one.


> LLMs are ok at finding things that exist, but they have zero ability to abstract and find what is missing (actually, probably negative; they'd likely hallucinate and fill in the gaps).

I feel this is mostly a prompting issue. Specifically GPT-4 shows surprising ability to abstract to some degree and work with high-level concepts, but it seems that, quite often, you need to guide it towards the right "mode" of thinking.

It's like dealing with a 4 year old kid. They may be perfectly able to do something you ask them, but will keep doing something else, until you give them specific hints, several times, in different ways.


There are ways of using the model other than iterative forward inference (completions). You could run the model over your novel (perhaps including a preface) and look at the posterior distribution as it scans. This may not be so meaningful at the level of the token distribution, but there may be interesting ways of "spell checking" at a semantic level. Think a thesaurus, but operating at the level of whole paragraphs.


that's not at all what I said


Do the OpenAI APIs support converting prompts to vectors, or are people running their own models locally to do this? Can you recommend any good resources to read up on vector DB approaches to working around context length limits?


Indeed, this tutorial on Haystack is a good example: https://haystack.deepset.ai/tutorials/22_pipeline_with_promp... It combines a retrieval step with a prompt layer that inserts the relevant context into the prompt. You can, however, swap the 'retrieval step' for something that uses a proper embedding model, and OpenAI also provides those if you want to. I tend to use lighter (cheaper) OSS models for this step though. PS: There's some functionality in the PromptNode to make sure you don't exceed the prompt limit.


That's great - thanks!


Yes, you can use a local embeddings model like gtr-t5-xl alongside retriever augmentation. This can point you in the right direction: https://haystack.deepset.ai/tutorials/22_pipeline_with_promp...


Thanks!


OpenAI has an embeddings API that people use for that: https://platform.openai.com/docs/guides/embeddings, though whether it's the best model for the job is contested.

Contriever is an example of a strong model for doing it yourself. See their paper too to learn about the domain: https://github.com/facebookresearch/contriever
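For reference, a minimal sketch of the embeddings endpoint mentioned above, using the pre-1.0 openai Python client (text-embedding-ada-002 was the standard embedding model at the time; the snippet is illustrative, not a recommendation):

    import openai

    openai.api_key = "sk-..."  # your API key

    resp = openai.Embedding.create(
        model="text-embedding-ada-002",
        input=["chunk of a document to index", "another chunk"],
    )
    vectors = [item["embedding"] for item in resp["data"]]
    print(len(vectors), len(vectors[0]))  # 2 vectors, 1536 dimensions each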


Thanks!


So what's the right way to do a wider-ranging analysis? Chunk into segments, ask about each one, then do a second pass to review all answers together?
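In rough code, that chunk-then-reconcile ("map-reduce") idea might look like the sketch below; ask_llm() is a hypothetical wrapper around whichever chat API you use, and the chunk size is arbitrary:

    def chunk(text, size=3000):
        # naive fixed-size chunking; real splitters respect paragraph boundaries
        return [text[i:i + size] for i in range(0, len(text), size)]

    def analyse(document, question, ask_llm):
        # first pass: ask the question of each chunk independently
        partial = [ask_llm(f"{question}\n\nExcerpt:\n{c}") for c in chunk(document)]
        # second pass: reconcile the per-chunk answers into one
        combined = "\n\n".join(partial)
        return ask_llm(
            f"{question}\n\nHere are per-excerpt answers; reconcile them "
            f"into one final answer:\n{combined}"
        )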


I'm also not entirely convinced by "huge" context models just yet, especially as it relates to fuzzy knowledge such as overarching themes or writing style.

In particular, there are 0 mentions of the phrase "machine learning" in The Great Gatsby, so adding one sentence that introduces the phrase should be easy for self-attention to pick out.


I'd be more impressed if it could rewrite Mr. Carraway as an ML engineer in the entire novel. However it's not intrinsically clear that it cannot do this...

It'll be tough to find good benchmarks on long context windows. A human cannot label using 100k tokens of context.


My thoughts exactly - rewrite the novel with Mr. Carraway as an ML engineer while maintaining themes/motifs (possibly adding new ones too). I'm guessing what's impressive is that these are the first steps towards something like this? Or is it already possible? Someone please correct me here.


Or rewrite The Count of Monte Cristo as a science fiction novel and get The Stars My Destination. Or rewrite Heinlein's Double Star into present-day and get the movie Dave.


Wonder if they’d set it in SF then…


This sounds like all the other skepticism about what AI can do. And then when it can spot 200x more than any human and correlate it into common themes, you'll say what?


Doing more than a human can isn't impressive. Most computer programs, for any purpose, can do more of something, or do something faster than a human can.

A better comparison would be if it can pick out any differences that can't be picked out by more traditional and simple algorithms.


It does, using this method.

My immediate thought as well was '... Yeah, well vimdiff can do that in milliseconds rather than 22 seconds' - but that's obviously missing the point entirely. Of course, we need to tell people to use the right tool for the job, and that will be more and more important to remind people of now.

However, it's pretty clear that the reason they used this task is to give a very simple example that makes it easy to understand what was done. Of course it can do more semantic-understanding-related tasks, because that's what the model does.

So, without looking at the details we all know that it can summarize full books, give thematic differences between two books, write what a book may be like if a character switch from one book to another is done, etc.

If it doesn't do these things (not just badly, but can't at all) I would be surprised. If it does them, but badly, I wouldn't be surprised, but it also wouldn't be mind bending to see it do better than any human at the task as well.


>Of course it can do more semantic understanding related tasks, because that's what the model does.

The problem is that marketing has eroded any such faith. Too often, simple examples are given and the consumer is left to extrapolate the intended functionality, because that way there's no false advertising involved. It's used over and over again in products: the examples are well selected and aren't actually a good representation of the implied functionality.

As such, I personally can't make the leap to "of course it can do more semantic-understanding-related tasks", like a diff that's not as simple - one where perhaps a character's overall personality over the course of the book is shifted, not just a single line that defines their profession.

This isn't to say the demonstrative example isn't neat on its own accord given what's going on here, it is; I'm just saying I can't make such leaps from examples given by any product. When I work with vendors of traditional software, this happens all the time: people dance around a lack of functionality you obviously want or need in order to make a sale. It's only when you force them to be explicit on the specific cases, especially in writing, that I have any faith at all.


LLMs are specifically designed and trained for semantic-understanding-related tasks. That's the entire point. They are trained to solve natural language understanding tasks and they build a semantic model in order to solve the tasks.


I did say it may do certain tasks with low performance. What you're saying is not really understandable (or simply wrong). The fact that it does understand how to do a 'simple' task such as finding where text is different is actually somewhat impressive given the typical training data for these models. But I suppose you need to actually understand the field of NLP to understand why that is.

If you're expecting perfection or magic then you will be disappointed.


I think it's fair to say most people don't know what a diff tool is, but do know how to ask questions. That is the democratizing factor AI is introducing, I feel: giving high-powered computing to people without the need for specialized knowledge.


The real goal would be for the model to determine what tool is needed to achieve the task, and use that to achieve it. Using the model to write code provides a way to achieve explainability and correctness guarantees (by reading and executing the code, not the model internals). So in this case identifying that a diff program is required may have taken a couple of seconds, and the diff would take milliseconds - still yielding a faster and more correct output (as the model alone may produce an approximate diff or hallucinate).


Of course it can, very soon, since those were also written by humans. Like AlphaZero vs Rybka.


>> And then...you’ll say what?

USER: It's a stochastic parrot.

GPT: I know you are, so what am I?


What techniques do they actually use to achieve 100K? I assumed they load the document into a vector database and then do some kind of views into that.



I would assume training the LLM on sequences of 100K tokens would be the right way.


Makes me wonder whether we could get really huge contexts much more efficiently by feeding a higher layer back into the tail end of the model. That way it has a very clear picture of the recent text but only a compressed picture of the earlier parts of the document.

(I think I’ve got to read up on how transformers actually work.)


Afaik you're describing something akin to a recurrent neural network, and the problem with that is that it doesn't parallelize well on modern hardware. And vanishing gradients.


I had the same thought as the comment you're responding to.

Recurrent neural networks do badly when the recurrence is 100 steps long or more. You need chains that long because, at one token at a time, that's what it takes to process even one paragraph.

But if you use an RNN around a Transformer-based LLM, then you're adding +4K or +8K tokens per recurrence, not +1.

E.g.: GPT-4 32K would need just 4 RNN steps to reach 128K tokens!


This is the first time I've felt like Anthropic may be a true competitor to OpenAI.

I see 6 ways to improve foundation LLMs other than cost. If your product is best at one of the below, and has parity at the other 5 items, then customers will switch. I'm currently using GPT-4-8k. I regularly run into the context limit. If Claude-100K is close enough on "intelligence" then I will switch.

Six Dimensions to Compare Foundation LLMs:

1. Smarter models

2. Larger context windows

3. More input and output modes

4. Lower time to first response token and to full response

5. Easier prompting

6. Integrations


7. Price!

GPT4-32K costs ~$2 if you end up using the full 32K tokens, so if you're doing any chaining or back-and-forth it can get expensive very quickly.
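For reference, the ~$2 figure follows from the GPT-4-32K list prices at the time (roughly $0.06 per 1K prompt tokens and $0.12 per 1K completion tokens):

    # back-of-the-envelope: one maxed-out prompt, before any completion tokens
    prompt_tokens = 32_000
    print(prompt_tokens / 1000 * 0.06)  # ~1.92 dollars per call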


Oh boy. I hope Microsoft doesn't get tired of subsidizing my access through Bing. I quite like it.


Bing seems to be using the stock 4K window size, which is likely why it's limited to that many messages in a single chat session before it forcibly shuts it down.


Oof, got access to the 8k model recently and was wondering what costs would be on the 32k one. That's brutal.


Why? 32K tokens would cover creating source code for a small to medium programming project. That would easily cost hundreds of dollars if done by a freelancer, and you could get it in almost real time for just $2.


Only if you can get that result in one pass.

The real utility of LLMs is that they can be called in a loop to scan through many web pages, many code files, issue tickets, emails, etc...

There are already demos and experiments out there where, for every input, 4x outputs are generated, those are fed back into the LLM 4x for "review", the best variant is then used to generate code which is then automatically tested, errors are fed back in a loop, also with 4x parallel tries, etc...

It's the throughput compared to humans that is the true differentiator. If hooking up the API in a loop ends up costing more than a human, then it's not worth it.
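A rough sketch of that "generate N, review N, pick the best, loop on test failures" pattern; generate(), review_score() and run_tests() are hypothetical wrappers around an LLM API and a test runner:

    def best_of_n(task, generate, review_score, run_tests, n=4, max_rounds=3):
        # fan out n candidate solutions and keep the one the reviewer rates highest
        candidates = [generate(task) for _ in range(n)]
        best = max(candidates, key=review_score)
        for _ in range(max_rounds):
            ok, errors = run_tests(best)
            if ok:
                return best
            # feed the failures back in and try again, n variants at a time
            candidates = [generate(task, feedback=errors) for _ in range(n)]
            best = max(candidates, key=review_score)
        return best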


> It's the throughput compared to humans that is the true differentiator. If hooking up the API in a loop ends up costing more than a human, then it's not worth it.

If its better in another dimension (e.g., calendar elapsed time), it may well be worth being more expensive.


Two dollars spent over and over would still take a while to equal a software engineer's salary, and if the end result is "produce a fully functional codebase in minutes" the premium might be worth it even if it exceeds that much.


We all know the game Sam Altman is playing. Eventually you can squeeze the full salary in cost out of companies who have laid off their workers, and it would still be a good deal because AI does not need health care, taxes, hardware, sick days and so on.


32k tokens is 500-1000 lines of code, so more like thousands of dollars, unless you're comparing against a landing page or CRUD tool arranged through Upwork. On the flip side, before I dove into AutoGPT I couldn't get GPT-4 to iterate on something as simple as a TypeScript function that deeply converts { foo_bar: string } to { fooBar: Date } without running out of context and cyclically reintroducing regressions.


Advice on where to get started with auto-gpt?


Also, the Azure offering is 10 times the OpenAI price, so not really usable right now.



« other than cost »


>Six Dimensions to Compare Foundation LLMs

I'd add open source to the list, which neither "Open"AI nor this is.


I don't think most of the large customers will care about OSS AI. Over the last decade they've learned (trained themselves?) where to put their money (cloud vs. in-house infra for all manner of things, for better or worse) and I think AI tools will follow similar trends.

Businesses will certainly care about cost, but just as important will be:

- Customization and fine-tuning capabilities (also 'white labeling' where appropriate)

- Integrations (with 3rd party and in-house services & data stores)

- SLA & performance concerns

- Safety features

Open Source AI will have a place, but may be more towards personal-use and academic work. And it will certainly drive competition with the major players (OpenAI, Google, etc) and push them to innovate more which is starting to play out now.


Companies that aren't mindful of vendor lock in aren't long for the world.

Though those cloud platforms all have their own proprietary components, most users are savvy enough to constrain and compartmentalize their use of them, lest they find themselves having all their profits taken by a platform that knows it can set its prices arbitrarily. The cloud vs. in-house adoption is what it is in large part because the cloud offerings are a commodity, and a big part of them being a commodity is that much of the underlying software is free software.


On the other hand, companies that fall behind their competitors because they are spending time on adjacent activities rather than leveraging new capabilities on the market will also lose out. It isn't clear at this time that LLMs fall into that category... and smart applications of in-house systems can be as much of an advantage as badly done NIH projects can be an albatross.


History is littered with companies that died because they focused on things that don't matter (open source, anti-Microsoft, pro-Linux).

There will be a time when those things matter, when they hurt the bottom line (Dropbox), but to prematurely optimize for that while you are finding product-market fit is crazy, and all companies are finding product-market fit in the new AI era.


Here's a really important reason to care about open source models: prompt engineering is fiddly enough without the risk of your model provider "upgrading" the model you are using in a way that breaks your existing prompts.

OpenAI already upset a lot of (admittedly non-paying academic) users when they shut off access to the old Ada code model with only a few weeks' notice.


I’m curious about how enterprises will manage model upgrades.

On one hand, as you mention, upgrades could break or degrade prompts in ways that are hard to fix. However, these models will need constant streams of updates for bugs and security fixes just like any other piece of software. Plus the temptation to get better performance.

The decisions around how and whether to upgrade LLMs will be much more complicated than upgrading Postgres versions.


Paying users who need this kind of stability are more likely to get access to those models via Azure rather than from OpenAI directly, which comes with the appropriate enterprise support plans and guarantees.


Why would the models themselves need security fixes? The software running the models, sure, but you should be able to upgrade that without changing anything observable about the actual model.


LLMs (at least the ones with read/write memory) can exactly simulate the execution of a universal Turing machine [1]. AFAIK running such models therefore entails the same fundamental security risks as ordinary software.

[1] https://arxiv.org/pdf/2301.04589.pdf


Not necessarily. The insecurity from LLMs comes from the fact they're a black box - what if it turns out that a particular version can be easily tricked into giving out terrorism ideas? You could try to add safeguards on top, but they've already been bypassed if it has been used for something like that. You might just have to retrain it somehow to make it safe.


The OpenAI API has model checkpoints; right now the chat options are:

gpt-4, gpt-3.5-turbo, gpt-4-0314, gpt-3.5-turbo-0301


The 3.5 legacy model disappeared from the ChatGPT UI recently. Is it still available via the API?


Notably absent from the available model list is code-davinci-002 - a lot of people were burned by that one going away.


Those are ChatGPT models. The code-davinci-002 model is still available - they responded to community requests to keep it up.


Midjourney does this, as well.


> I don't think most of the large customers will care about OSS AI.

One would think the same in the 90s, and yet, for some reason, Open Source prevailed and took over the world. I don't believe it was about cost, at least not only. In my career I had to evaluate many technical solutions and products, and OSS was often objectively superior at several levels without even taking cost into account.

The first really successful alternative to "Open"AI will:

* gather many talented developers

* will quickly become a de facto standard solution

* people will rapidly start developing a wide range of integrations for it

* everybody will be using it, including large orgs, because, well, it's open source


Open Source hasn't really taken over the world, if you look at end-solutions.

As a software developer I might use an open source database, but as an end user I'm probably not going to use an open-source accounting package - but I will use an accounting SaaS system that happens to be implemented with that OSS DB.

As a software developer I might use an OSS operating system, but as an end user I use software that has been packaged and maintained by a corporation, like OS X, or that, even if OSS in license, has been fully packaged, like Android.


True, but the difference here is that running a performant and capable AI solution will be infrastructure-dependent, which has real costs.


Yes, and I believe it will develop in two ways over decades. Just like all the major legacy hosting companies such as DigitalOcean, OVH or Hetzner have been offering a kind of public cloud service (at various levels, and after much foot-dragging), they - and new AI hosting providers focusing exclusively on this use case - will provide the necessary services.

The other trend is the one we are already seeing right now: more and more mature solutions that you can use even on your laptop with a relatively new GPU. I'm sure we'll see some interesting results in this area, too.


>I don't think most of the large customers will care about OSS AI

The problem, again, is centralization of LLMs in the hands of either governments (and they always act in your best interest, amirite?) or corporations, which only FOSS LLMs can prevent.

Democratization of the models is the only way to actually prevent bad actors from doing bad things.

"But they'll then have access to it too" you say. Yes, they will, but given how many more people who will also have access to open LLMs we'd have tools to prevent actually malicious acts.


A good guy with an LLM stops a bad guy with an LLM. - This message brought to you by the National LLM Association


Ah yes, the only other alternative is to give companies and the State sole access to LLMs, and guns, and drones, and militarized police. Because they'll surely only think of YOUR benefits, right?


A lot of B2B startups can technically use the cloud API to provide value-added applications to enterprises, but often the banks and healthcare companies will not want their data running through a startup's pipes to OpenAI's pipes.

We provide a low-code data transformation product (prophecy.io), and we'll never close sales at any volume if we have to get an MSA that approves this. Might get easier if we become large :)


> I don't think most of the large customers will care about OSS AI.

OSS AI will open up more diverse and useful services than the first-party offerings from relatively risk-averse major vendors, which customers *will* care about.


The thing to remember when selling to businesses is that a business is just a stack of people in a trench coat. This might sound a tad evil but you don’t have to offer something that benefits the business as a whole, just something that benefits the person who holds the purse strings.

This is why cloud services are so popular. They’re easy and they don’t cost the decision makers personally.


Yes, but I think for most companies this has more to do with cost. They're not going to pay for the OSS model, and if they can use an OSS model + fine tuning, they'll choose to save the money.


Considering the very smart people asking for a moratorium on AI development, and its potential to disrupt a lot of jobs, this may be a good thing.


now that I think about it

is it that important to open source models that can only run on hardware worth tens of thousands of dollars?

who does that benefit besides their competitors and nefarious actors?

I've been trying to run one of the largest models for a while; unless $30,000 falls into my hands I'll probably never be able to run the current SOTA


When Linux was first released in 1991, a 386 to run it would cost about $2000.

We've already seen big advancements in tools to run them on lesser hardware. It wouldn't surprise me if we see some big advancements in the hardware to run them over the next few years; currently they are mostly being run on graphics processors that aren't optimised for the task.


> is it that important to open source models that can only run on hardware worth tens of thousand of dollars?

Yes, because as we've seen with other open source AI models, it's often possible for people to fork code and modify it in such a way that it runs on consumer grade hardware.


Even a small startup, a researcher or a tinkerer can get a cloud instance with a beefy GPU. Also of note, Apple's M1 Max/Ultra should be able to run it on their GPUs given their 64/128GB of memory, right? That's an order of magnitude cheaper.


I am confused. Those amounts are RAM, not GPU RAM, aren't they? Macs' CPUs are impressive, but not for ML. The most realistic option for a consumer is an RTX 4090 with 24 GB. A lot of models do not fit in that, so it's an A6000 with 48GB and up for some professional cards. That might be around 9000€ already.


Apple Silicon has unified memory - all memory is accessible to both the CPU and GPU parts of the SoC.


But don't they max out at the 32GB model?


Mac Studio (desktop) is up to 128GB, and Macbook Pro is up to 96GB.


> Macs cpus are impressive, but not for ml

On Mac GPU has access to all memory.


I overlooked the unified memory on those machines. Can it really run this performantly?


I run Vicuna quite well with my M1 Pro, 32GB.


$30,000 is less than the price of the average car that Americans buy (and most families have two of them) - that's definitely in the realm of something an affluent family can buy if it provides enough value. I also expect the price to go down, and at $10k it's less than a mid-range bathroom update. The question is only whether it provides enough value, or whether using it in the cloud is the better option for almost all families.


"It only benefits bad people" is a pretty shitty argument at this point tbf. You can apply this logic to any expensive thing at this point.

I can for example, afford the hardware worth tens of thousands of dollars. I don't want to, but I can if I needed to. Does that automagically make me their competitor or a bad actor?


I agree the utility of open source for the personal use case is overblown.

But for commercial use cases, open source is very relevant for privacy reasons, as many enterprises have a strict policy not to share data with third parties. Also, it could be a lot cheaper for bulk inference or for having a small model for a particular task.


However, the same thing could be achieved with closed source models. There's nothing to stop an LLM being made available to run on prem under a restrictive license. It would really be no different to ye olde desktop software - keeping ownership over bits shipped to a customer is solved with the law rather than technical means.

That said, I really hope open source models can succeed, it would be far better for the industry if we had a Linux of LLMs.


> Keeping ownership over bits shipped to a customer is solved with the law rather than technical means.

Yes in theory... In practice, what happened with LLaMA showed people will copy and distribute weights while ignoring the license.


Locally hosted instances that don't report on prompts are important for personal privacy.


Yes, because it can always be ported down by people with more constraints than the original authors. We've seen a lot of this in the LLM space, and in a lot of other OSS efforts.


It will create price competition for different providers of the model though, which should drive down prices


They don't only run on high end systems. Good models can run on a desktop you have at home. If you don't have a desktop... I'm not sure what you're doing on HN.


You have a weird definition of "good model"

Llama 7B is NOT a good model.


You can run much larger models than llama-7B. Galpaca-30b or Galactica-120b for example.


30B is still not good enough.

What kind of desktop are you running a 120B model on with reasonable performance?


I would disagree that 30B is not good enough. It heavily depends on which model, and what you're trying to use it for.

30B is plenty if you have a local DB of all of your files and wiki/stackexchange/other important databases placed in an embedding vector DB.

This is typically what is done when people make these models for their home, and it works quite well while saving a ton of money.

While llama-7B systems on their own may not be able to construct a novel ML algorithm to discover a new analytical expression via symbolic regression for many-body physics, you can still get a great linguistic interface with them to a world of data.

You're not thinking like a real software engineer here - there are a lot of great ways to use this semantic compression tool.


If I just wanted a fuzzy search engine for local data, I'd use a vector DB - there's no need for an LLM on top of that.


One question is how much the other factors really matter compared to the raw "intelligence" of the model--how good its completions are. You're not going to care very much about context window, prompting, or integrations if the output isn't good. It would be sort of like a car that has the best steering and brakes on the market, but can't go above 5 mph.


Or rather, more analogously, a self-driving car that has a range of 10,000 miles but sometimes makes mistakes when driving vs a self-driving car with a range of 800 miles that never makes mistakes. Once you've had a taste of intelligence it's hard to give up.

However, in many applications there is a limit on how intelligent you need the LLM to be. I have found I am able to fall back to the cheaper and faster GPT-3.5 to do the grunt work of forming text blobs into structured json within a chain involving GPT-4 for higher-level functions.


Big question on that for me is that there's a variety of "completion styles" and I'm curious how "universal" performance on them is. Probably more than this, but a quick list that comes to mind:

* Text summary/compression

* Creative writing (fiction/lyrics/stylization)

* Text comparison

* Question-answering

* Logical reasoning/sequencing ("given these tools and this scenario, how would you perform this task")

IMO, for stuff like text comparison and question-answering, some combo of speed/cost/context-size could make up for a lot, even if they do "worse" versions of stuff that's just too slow or expensive or context-limited in a different model.


We already see how that goes. Stackable LoRAs, like in Stable Diffusion.


I don't know. While using Phind I regularly get annoyed by long prose that doesn't answer anything (yes, "concise" is always on). Claude seems to be directly geared towards solving stuff over nice writing.


I generally add this to my initial prompts to GPT-4: "From now on, please use the fewest tokens possible in all replies, and provide brief and accurate answers."
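If you're hitting the API rather than the web UI, the same instruction can go in a system message (pre-1.0 openai client shown; the exact wording is just an example):

    import openai

    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Use the fewest tokens possible in all replies; "
                        "be brief and accurate."},
            {"role": "user", "content": "Summarise the CAP theorem."},
        ],
    )
    print(resp["choices"][0]["message"]["content"])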


Strongly agree. They are ordered by how much I think they generally will lead to users choosing one model over the other.

Intelligence is the most important dimension by far, perhaps an order of magnitude or more above the second item on the list.


On that note, can anyone speak to how Anthropic (or other models) are doing on catching up to OpenAI for pure model intelligence/quality of completions? Are any others approaching GPT-4? I've only used GPT-based tools so I have no idea.


The best Claude model is closer to GPT-4 than to 3.5.


Until they actually make any of it available in anything but an obscure expensive API you have to request access to, they might as well not even exist.


The landing page says "Easy integration via standard APIs Claude can be incorporated into any product or toolchain you’re building with minimal effort." Then there is a big button "Request Access", which for me right now just does nothing. OpenAI has really faced the pain to make their product available via an API to the general public at scale, but Anthropic/Google/etc. don't quite seem to be there yet. It's frustrating.


I don't think the person you're responding to wants a network based or cloud based solution.

When someone says they want it available they mean running on their own device.

This is hackernews, nearly everyone on this site should have their own self hosted LLM running on a computer/server or device they have at their house.

Relying on 'the cloud' for everything makes us worse developers in just about every imaginable way, creates a ton of completely unnecessary and complicated source code, and creates far too many calls to the internet which are unnecessary. Using local hard drives for example is thousands of times faster than using cloud data storage, and we should take advantage of that in the software we write. So instead of making billions of calls to download a terabyte database query-by-query (seen this 'industry-standard' far too many times), maybe make one call and build it locally. This is effectively the same problem in LLMs/ML in general, and the same incredible stupidity is being followed. Download the model once, run your queries locally. That's the solution we should be using.


When I want code that has a reasonable chance of working, or to bounce ideas off of someone decently intelligent, or just to talk philosophy, I’m not going to get great results out of the kind of model I can feasibly run at home. Even 30b parameters isn’t enough. That’s 75% of what I want out of an LLM.


Try a browser or a clean profile without any ad blocking turned on. It took me a couple of tries to figure out how to get it working but you should see a modal with a form when it works.

FYI the waitlist form submits a regular POST request so it'll reload the main page instead of closing the modal dialog. I opened network monitor with preserved logs to double check that I made it on the list :facepalm:


Google models are now available as API on Google Cloud.


I've been using it through poe and I prefer it to ChatGPT but can't pinpoint why. It just "gets" me better I guess?


there are many services that integrate with them that would allow you to self-serve signup


Faster, cheaper fine-tuning and training

If I could train a useful model, on my own data, in a reasonable time, I would want to have a CI training pipeline to always keep my models up to date.


Just use a machine learning method other than an LLM. If you're going to go to the effort of fine tuning and training, pay people to collect and label data like we used to do before Feb 2023.

If your main concern is question answering or summarization or code completion there are plenty of ways to do that now. If you really require the advanced emergent properties of LLMs, you'll have to work with a company that can afford to train a transformer on the Entire Internet.


Open-source LLM projects have largely solved this using Low-Rank Adaptation of Large Language Models (LoRA): https://arxiv.org/abs/2106.09685

Apparently an RTX 4090 running overnight is sufficient to produce a fine-tuned model that can spit out new Harry Potter stories, or whatever...
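For the curious, a minimal sketch of what that kind of LoRA fine-tune looks like with Hugging Face's peft library; the base model name and hyperparameters are illustrative only:

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model, TaskType

    base = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")
    config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=8, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # adapt only the attention projections
    )
    model = get_peft_model(base, config)
    model.print_trainable_parameters()  # only a fraction of a percent are trainable
    # ...then train with the usual Trainer / training loop on your own dataset.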


How much would they cost approximately per training run?

What would the quality of the model be, compared to what Karpathy builds in his video "Let's build GPT from scratch"?

In that video, he builds a decoder-only transformer model that learns Shakespeare from 1MB of data and trains in 15 minutes.


These just fine-tune existing models, so the cost is single digit dollars. Whatever it costs in electricity to run a desktop GPU for a day or a few days.


Yeah, I remember in undergrad I was working on using transfer learning to train an object detector. Basically you only needed 100-ish images to get the model to detect that new object really well.

I'm not sure what the analogous term is for a similar process on LLMs, but that will be huge when there is a service for it.


LLMs can do that without any examples (zero shot) or with one or a few demonstrations in the prompt, if you can describe the task in the limited context window.

If you want for example to train the model to learn to use a very large API, or access the knowledge in a whole book, it might need fine-tuning.


Could I just train a very small LLM with an English dictionary + Python + large API documentation + large Python code base?

Then do some chat fine tuning (like what HF did with StarCoder to get ChatCoder)

And get a lightweight LLM that knows the docs and code for the thing I need it for

After that, maybe incrementally fine tune the model as part of your CI/CD process


How similar was the object to the other objects?

E.g., were you trying to distinguish an object vs nothing, a bicycle vs a fish, a bird vs a squirrel, or two different species of songbird at a feeder?

How much would the training requirements increase or decrease moving up or down that scale?


The PaLM 2 stuff released yesterday has fine tuning for their newest large models as a core feature.


We use Anthropic's Claude Instant in production and it has been much faster than Davinci/GPT-4 for a while. In terms of quality, Instant is at least as good as GPT-3.5.


Privacy as well! I'd much rather run a model locally than in the cloud even if I lose out in performance in both speed and quality.


Reliability surely? They still haven't managed to make a model that says "I don't know" rather than bullshitting. That's by far the biggest unsolved problem.


Don't forget the ability to fine tune based on one's own data sources. For me, this is more important than any of the six reasons you mentioned.


Also if you allow users to receive vector representations of context and provide such representations as side information when querying LLMs.


For me I’d say speed trumps all else. It’s impossible to truly reach scale with the glacial response times you get from current API.


>speed trumps all else

Then use GPT-2


I actually do prefer 3.5-turbo over 4 for many tasks.


More languages?


Agreed. This is an important feature. Not all people speak English.


The most interesting bit is that, for the first time since the release of ChatGPT in November 2022, OpenAI does not have the lead on LLMs anymore.

At least, for people who need large context windows, they would not be the first choice anymore.


Claude’s very quietly better on everything but pricing, for a while, it just got buried because they announced on “AI Tuesday” (iirc gpt4 and Bing announcement day)

The ChatGPT equivalent is 3x the speed and was somewhere between ChatGPT and GPT-4 on the TriviaQA benchmark replication I did.

Couple tweets with data and examples. Note they’re from 8 weeks ago, I know Claude got a version bump, GPT3.5/4 accessible via API seem the same.

[1] brief and graphical summary of speed and TriviaQA https://twitter.com/jpohhhh/status/1638362982131351552?s=46&...

[2] ad hoc side by sides https://twitter.com/jpohhhh/status/1637316127314305024?s=46&...


> I know Claude got a version bump, GPT3.5/4 accessible via API seem the same.

GPT3.5 just got an update a few days ago that resulted in a pretty good improvement on its creativity. I saved some sample outputs from the previous March model, and for the same prompt the difference is quite dramatic. Prose is much less formulaic overall.


Meanwhile GPT 4 got lobotomised around the same time.

It can't even solve simplified versions of problems it had zero issues with just a week ago.


Wait, really? I've only been using GPT4 and it seemed like it's been getting incrementally better. Do you have any test cases?


Literally all of them! Comprehension, translation, problem solving, etc….

It’s worse at everything.


You know, I'm noticing it seems to work better on off-hours - I get better quality responses late at night. I wonder if they're using a lighter model when there's more demand on the system.


Nah


Thank you, every little comment I get from fellow boots on the ground is so valuable, lotta noise these days.

Random Q: I haven’t used the ChatGPT front end much the past month or two, but used it a week back and it seemed blazingly faster than my integration. Do you have a sense of whether it got faster too?


Yeah I've noticed the front end for 3.5 responds almost instantly to complex questions. It'll pop out an entire page of code in under a couple seconds. I think API responses may be a bit faster than before, some of my queries that took 30s before now take around 20s, but they are obviously prioritizing their own site.


Is this update made visible somewhere? The language models offered on my Playground are still the ones from March, same with ChatGPT.


On chat.openai.com below text box there is a link to ChatGPT release notes. Current link text is "ChatGPT May 3 Version". The link leads to https://help.openai.com/en/articles/6825453-chatgpt-release-...


How is the code generation of Claude?


Note, all impressions based on Claude 1.2, got an email from Poe in the last week saying it was version bumped to 1.3 with a focus on coding improvements.

Impressions:

Bad enough compared to GPT-4 that I default to GPT-4. I think if I had api access I’d use it instead, right now it requires more coaxing, and using Poe.

I did find “long-term” chats went better, was really impressed with how it held up when I was asking it a nasty problem that was hard to even communicate verbally. Wrong at first, but as I conversed it was a real conversation.

GPT-4 seems to circle a lower optimum. My academic guess is that it’s what Anthropic calls “sycophancy” in its papers; TL;DR, GPT really wants to produce more of whatever is already in the context, so the longer a conversation with initial errors goes on, the harder it is to talk it out of those errors.


I have access to claude. It's not bad, but decently behind gpt4 for code


And is code generation ability equivalent to code understanding and search ability?


GPT-4 still leads in the chatbot arena[1] but at least it is a two horse race now.

[1] https://lmsys.org/blog/2023-05-10-leaderboard/


Well Google decided to stop releasing their AI research.


This is nice, but it can get quite expensive.

Let's say I have a book and I want to ask multiple questions about it. Every query will pay the price of the book's text. It would be awesome if I could "index" the book once, i.e. pay for the context once, and then ask multiple questions.


This more or less is already a thing and it's called RAG [1][2]. It essentially allows you to have a database of embeddings (in this case your book) from which a model can pull knowledge while producing answers. As for the standard operation of these generative models, the context window is the only working memory they have, so they must see the entire text each time.

[1] https://arxiv.org/abs/2005.11401

[2] https://huggingface.co/docs/transformers/model_doc/rag


Can you help me understand this? The research appears to be from a few years ago. Can this be used with Claude (for example)? How is it different from the approach many people are taking with vector stores and embeddings?


Other people seem to be suggesting that the user would do the retrieval of the relevant parts of the book from a vector DB first, and then feed those sections along with the question as the prompt. Conceptually it is very similar (and it too uses a vector database), but with RAG it would happen as part of the inference pipeline and therefore achieve better performance than the end user emulating it.


Yep, but your retrieval from the vector DB becomes your relevancy bottleneck.


It's not different. RAG is a way to train embedding stores end to end.


somehow got down voted on something I'm a professional expert at


With embeddings, you essentially can. Group the book into sections, embed each section, then when you do a prompt, add in the N most similar embedded sections to your prompt.
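
A rough sketch of that recipe, assuming sentence-transformers and a hypothetical book.txt (a real setup would use smarter chunking and a vector store):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("all-MiniLM-L6-v2")    # any embedding model works

    book = open("book.txt").read()                        # hypothetical source text
    chunks = [book[i:i + 2000] for i in range(0, len(book), 2000)]  # naive split
    chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

    def top_n(question, n=4):
        q = embedder.encode([question], normalize_embeddings=True)[0]
        scores = chunk_vecs @ q                           # cosine sim, vectors are unit length
        return [chunks[i] for i in np.argsort(scores)[::-1][:n]]

    question = "Why does Gatsby throw his parties?"
    prompt = "\n\n".join(top_n(question)) + "\n\nQuestion: " + question
    # send `prompt` to whichever LLM you're using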


What if the question is "What are the main themes of this work?"

Or anything where the answer isn't 'close' to the words used in the question?

How well does this work vs giving it the whole thing as a prompt?

I assume worse but I'm not sure how this approach compares to giving it the full thing in the prompt or splitting it into N sections and running on each and then summarizing.


That is solved by hypothetical embeddings.

Background: https://summarity.com/hyde

Demo: https://youtu.be/elNrRU12xRc?t=1550 (or try it on findsight.ai and compare results of the "answer" vs the "state" filter)

For even deeper retrieval consider late interaction models such as ColBERT
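
As I understand HyDE, the trick is to retrieve with an LLM-written hypothetical answer rather than the raw question, since answer-shaped text lands closer to the real answer passages in embedding space. A sketch under that assumption (generate_hypothetical_answer stands in for the LLM call, and chunks/chunk_vecs are your pre-embedded sections):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    def generate_hypothetical_answer(question):
        # Placeholder: in practice, ask an LLM "Write a short passage that
        # answers: {question}" and return its output.
        return "A short passage that plausibly answers: " + question

    def hyde_retrieve(question, chunks, chunk_vecs, n=4):
        fake = generate_hypothetical_answer(question)
        v = embedder.encode([fake], normalize_embeddings=True)[0]
        scores = chunk_vecs @ v
        return [chunks[i] for i in np.argsort(scores)[::-1][:n]]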


I'm not understanding how that works compared to having the full text?

Does the embedding structure somehow expose the themes? And if so, is it more the embeddings that are answering the question by how it groups things?


Any material comparing the different embedding models? I'm working on information retrieval from government documents and without any ML experience it's daunting


You pretty much summed up the drawbacks of the embeddings approach. In my experience it's pretty hard to extract the relevant parts of text, especially when the text is uniform.


You could do multi level summaries etc but yeah this is all just band aids around token limits.


I don't think it's as much of a band-aid as it first appears since this roughly mimics how a human would do it.

The problem is that humans have continuous information retrieval and storage where the current crop of embedding systems are static and mostly one shot.


Humans have limited working memory; we quickly forget short-term memories (unless they're super significant), and our long-term memory fades selectively if not reactivated or significant (intense).

This weird leaky memory has advantages and disadvantages. Forgetting is useful, it removes garbage.

Machine models could vary the balance of temporal types, drop out Etc. We may get some weird behavior.

I would guess we will see many innovations in how memory is stored in systems like these.


The real gain would be if we could use the 100K context window and not this "embeddings trick". Embeddings only work in cases where the answer sits in a short part (or parts) of the document. If the user asks something like "What are the main ideas?" or "Summarize the document," or any question that needs context from large portions of the book/PDF/file, then the embeddings trick of putting just a few short passages in the prompt won't work. But if large context windows stay expensive, we have to keep using embeddings and a few text snippets.


The price on this will plummet over the next few years, the economic benefits are too large


The economic benefits of mining asteroids are also too large to ignore yet here we are, levelling villages to dig for coal.

Just a few manufacturers hold the effective cartel monopoly on LLM acceleration and you best bet they will charge out the ass for it.


> The economic benefits of mining asteroids are also too large

Ironically, that's an example I like to list as "pure sci-fi fantasy, divorced from economic reality."

The total cost of the iron ore that goes into making a new car is about $200-$300, depending on various factors (size of the car, ore spot price, etc...).

Even if -- magically -- asteroid mining made not just "iron ore", but specifically the steel alloy used for car bodies literally free, new cars costing $30,000 would now cost... $29,700.

You can save more by skipping the optional coffee cup warmer, or whatever.

In reality: 90% of iron and steel is recycled, and asteroid mining is not magic.


Why would anyone want to mine asteroids for iron, given that it's the most abundant element on Earth? I don't think I ever recall seeing a proposal like that outside of the broader notion of "space factories" (where such mining makes sense to reduce the cost of shipping materials back and forth, not because it's cheaper as such).

It's stuff like platinum and germanium that makes asteroid mining potentially interesting.


Metallic asteroids are mostly nickel-iron.

On Earth, geological processes concentrate elements into ores, primarily through volcanic and hydrological means. Neither are available in small, cold asteroids devoid of liquid water. Hence asteroids are generally undifferentiated mineralogically, making mining them much less economically viable.

You often see total quantities listed as an amazing thing, glossing over the fact that the Earth has more of everything and in usefully concentrated lumps.


Market competition and innovation in both ML and hardware has consistently driven down the price of AI in the past decade. You only have to look at where we are with capabilities today compared to ten years ago when CIFAR100 classifiers were the state of the art.

Barring a Chinese invasion of Taiwan, these APIs will halve in price over the next year.


What would a Chinese invasion of Taiwan do to "tech"? It sounds like it would be awfully catastrophic.

On the other hand, for all the people worrying about China, they are pretty restrained given the enormous turmoil an invasion would cause. If they wanted to break the USA, now would probably be the time to do it?


Well for one thing the US seems committed to blow up TSMC to prevent China getting the tech.

https://www.theregister.com/2023/03/14/us_china_tsmc_taiwan/


Planning for the inevitable?


Well here's to hoping I guess.


I'm wondering what level you're thinking. Cloud vendors? GPU vendors? Fabs?


Given what's used right now to my knowledge, the main ones would be Nvidia's tensor cores, Apple's M chips and Google's cloud TPUs. All of that's TSMC I think?


Yes, but physics trumps economics.


Not sure about this one but you can usually ask multiple questions in one shot at least


Generation is more expensive than the prompt input (for Claude v1, generation is 3x the cost; for GPT-4 it's 2x the cost)

It makes the economics slightly trickier.


I wonder why this is? Naively there's no difference between the two from a transformer standpoint.

Perhaps it's because under the hood there's additional safety analysis/candidate generation that is resource intensive?


It's because the input tokens can be batch-processed in a single forward pass through the model, while generating tokens requires one forward pass through the model per token.

If you do the math of how much memory bandwidth is required by a forward pass vs. how much compute, you'll see that inference is entirely limited by memory bandwidth and will use compute resources very inefficiently. In contrast, input processing is able to fully use the available compute.

Of course, there are ways to mitigate this problem, like processing multiple token streams in parallel, but the fundamental problem remains.
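
Back-of-envelope version of that argument, with a hypothetical 52B-parameter fp16 model (numbers purely illustrative):

    # Decoding one token reads every weight once but only does ~2 FLOPs per
    # parameter, so arithmetic intensity is ~1 FLOP per byte -- far below the
    # hundreds of FLOPs/byte a GPU needs to stay busy. Prefill shares that one
    # weight read across every prompt position, so it's compute bound instead.
    params = 52e9                      # hypothetical model size
    weight_bytes = params * 2          # fp16
    prompt_tokens = 10_000

    decode_intensity = (2 * params) / weight_bytes                    # ~1 FLOP/byte
    prefill_intensity = (2 * params * prompt_tokens) / weight_bytes   # ~10,000 FLOPs/byte
    print(decode_intensity, prefill_intensity)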


Ah, this makes total sense. I was thinking about FLOPs in the abstract and not about the wall-clock time. Thanks for the explanation.


Normally the inputs are padded out to the context length [1] and so the cost to embed 1 token or N tokens is the same. The output is produced token-by-token and so the amount of GPU time increases with the number of output tokens.

[1] I'm not sure if these huge context lengths are achieved the same way (i.e. a single input vector of length N) but given the cost is constant for input I would assume the resource usage is too.


This doesn't match my mental model (or implemented model in the case of GPT2) of how self-attention works (you need to calculate the residual stream for each individual token, attending to all prior tokens before it). Have a link?


I work on infrastructure for serving large language models but I don't have any background in ML, so my perspective is looking at these models as a black box (and also conversations with the people that do the ML stuff). It is the case in practice at least from a latency side that with a fixed context length N, embedding any number of tokens from 0 to N takes the same amount of time. Perhaps it's a difference between the conceptual and actual implementation on GPU?

edit - This occurred to me after the fact but I wonder if the difference is that the use case I work with is processing batches of many different embedding requests (but computed in one batch), therefore it has to process `min(longest embedding, N)` tokens so any individual request in theory has no difference. This would also be the case for Anthropic however.


Ah, you're thinking about embeddings which are basically the encoder stack on a traditional transformer architecture. Modern GPT-like models (including Claude), however, drop the encoder and use decoder-only architectures.

I could imagine something where encoders pad up to the context length because causal masking doesn't apply and the self attention has learned to look across the whole context-window.
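
For anyone following along, a toy single-head version of that causal masking (numpy, no batching or multiple heads), just to show why the decoder stack alone can do next-token prediction:

    import numpy as np

    def causal_self_attention(x, Wq, Wk, Wv):
        # x: (seq_len, d_model)
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        scores = q @ k.T / np.sqrt(k.shape[-1])
        # Causal mask: position i may only attend to positions <= i.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores[future] = -1e9
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        return w @ v

    d = 16
    x = np.random.randn(8, d)
    out = causal_self_attention(x, *(np.random.randn(d, d) for _ in range(3)))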


Decoder only architecture? What is this? That doesn't sound like a transformer at all, are you saying gpt4 uses a totally different algorithm?


Nope, a decoder-only transformer is a variant of the original architecture proposed by Google [1]. All the variants of GPT that we know about (1 through 3) roughly use this same architecture, which takes only the decoder stack from the original Google paper and drops the encoder [2]

[1] Original Google Paper - https://arxiv.org/abs/1706.03762

[2] Original GPT Paper - https://s3-us-west-2.amazonaws.com/openai-assets/research-co...


How can it work without an encoder?


Everyone serious batches together short prompts so the cost is roughly proportional to the tokens.


Well, each additional token generated requires rerunning the model, right? To find the next likely token given the previous ones.


Naively, yes, but you can cache the bulk of that "rerunning" [1]. That said the (non-flash) attention costs go up with the length of the sequence so perhaps this is just a simpler way to approximate these costs.

[1] https://kipp.ly/blog/transformer-inference-arithmetic/
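
A toy single-head sketch of what that cache buys you: each new token only pays for its own projections plus attention over the stored keys/values, instead of recomputing the whole prefix:

    import numpy as np

    class ToyKVCache:
        def __init__(self):
            self.keys, self.values = [], []

        def step(self, x_new, Wq, Wk, Wv):
            # x_new: (d_model,) embedding of just the latest token.
            q = x_new @ Wq
            self.keys.append(x_new @ Wk)       # cache grows by one entry per token
            self.values.append(x_new @ Wv)
            K, V = np.stack(self.keys), np.stack(self.values)
            scores = K @ q / np.sqrt(K.shape[-1])
            w = np.exp(scores - scores.max()); w /= w.sum()
            return w @ V                        # attention output for the new token only

    d = 16
    Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
    cache = ToyKVCache()
    for tok in np.random.randn(5, d):           # pretend these are 5 generated tokens
        out = cache.step(tok, Wq, Wk, Wv)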


Yes, caching the states of the sequence would make sense. An issue is that it's still more expensive to compute the new tokens even if you cache the states viewed so far


The analogy I can think of here is a pointer, but AFAIK the context would always need to go along with the prompt unless you could tweak internal state to bias towards the context.

Otherwise, it might make sense to have a separate routine which compresses the context as efficiently as possible. Auto encoder?


I don’t see this in the article. Has Anthropic explained the mechanism by which they were able to cost-effectively expand the context window, and whether there was additional training or a design decision (e.g. alternative positional embedding approach) that helped the model optimize for a larger window?


No. As far as I know, they haven't said anything about this. Neither did OpenAI about gpt-4-32k.

MosaicML did say something about MPT-7B-StoryWriter-65k+: https://www.mosaicml.com/blog/mpt-7b. They are using ALiBi (Attention with Linear Biases): https://arxiv.org/abs/2108.12409.

I think OpenAI and Anthropic are using ALiBi or their own proprietary advances. Both seem possible.
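
For reference, the bias ALiBi adds to the attention logits is simple to write down (this follows the linked paper; whether OpenAI/Anthropic actually do this is pure speculation):

    import numpy as np

    def alibi_bias(num_heads, seq_len):
        # Per-head slopes: a geometric sequence (1/2, 1/4, ... for 8 heads),
        # as in the ALiBi paper for head counts that are powers of two.
        slopes = np.array([2.0 ** (-(i + 1) * 8.0 / num_heads) for i in range(num_heads)])
        pos = np.arange(seq_len)
        distance = np.minimum(pos[None, :] - pos[:, None], 0)   # 0 for self, -k for k tokens back
        # Added to q @ k.T before the causal mask and softmax; nearer tokens are
        # penalized less, which is what lets inference run past the training length.
        return slopes[:, None, None] * distance[None, :, :]      # (head, query, key)

    bias = alibi_bias(num_heads=8, seq_len=6)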


Interesting. Does the decision to use ALiBi have to be made before the model weights are first trained, or is there a way these models could have incorporated ALiBi, instead of or in addition to an alternative positional encoding method, after they were first trained?


The decision needs to be made before starting training. Maybe there is a clever way to add it after the fact in the style of LoRA? First, that would be a different method in its own right (just as LoRA is), second, I can't see how to do so easily. But then I just thought about it for a minute.


A lot of people are speculating online (https://twitter.com/search?q=anthropicai%20alibi&src=typed_q...) but I'm guessing it's ALiBi, which MPT-7B also used to get up to 85k of context


No, they are playing this close to the chest, similar to how OpenAI achieved 32k context limit.


Can LLMs take advantage of this bigger window to solve meaningful tasks though? I can't imagine in the training data, knowing what happened 100k tokens ago would be _that_ relevant to predicting the current token very often, so unless this is something that the model learns to leverage more implicitly, I'd be a bit pessimistic.


Yes. For instance, a large context window allows you to have a chat for months where the model can remember and make use of everything you’ve ever talked about. That enables creating a much more effective “assistant” that can remember key details months later that may be valuable.

A second example is the analysis of long documents. Today, hacks like chunking and HyDE enable us to ask questions about a long document or a corpus of documents. But it is far superior if the model can ingest the whole document and apply attention to everything, rather than just one chunk at a time. Chunking effectively means that the model is limited to drawing conclusions from one chunk at a time and cannot synthesize useful responses relating to the entire document.


It remains to be seen just how effective longer contexts are because if the attention vectors don't ever learn to pick up specific items from further back in the text then having more tokens doesn't really matter.

Given that the conventional cost of training attention layers grows quadratically with the number of tokens I think Anthropic is doing some kind of approximation here. Not clear at all that you would get the same results as vanilla attention.


They did mention that the inference time to answer a question about the book was something like 22 seconds, so perhaps they are indeed still using self-attention.


I'm not questioning whether it would be useful, just whether it's actually something that token masking in training is going to work to make the model learn this.


I would imagine that if you are training on the text of a novel, then anything that happened earlier in the text may be relevant for predicting the next events. Especially if it's something like a detective novel that has clues about the criminal's identity scattered across the story.

Also if you are training on a database of code.


Yeah but when you're training a neural net with backprop on a finite dataset, "this would help the model" ≠ "the model will learn this". This is 100% speculation, but my intuition is that it's not going to work very well unless it happens 'a lot' in the training data, or if they've curated the data specifically to try and make it learn long range signals.


Gets pricier as you chat for longer; imagine having to send a single chat line with a history of 20k tokens behind it.


I'd argue that books are a clear example where the 100k tokens context would make a huge difference.


I would guess that semantic similarity would be the stronger training signal than distance once you go beyond a sentence or two away.


I'm pretty dubious - how would the model not get absolutely swamped by the vast amount of potential context if it's not learning to ignore long range signals for the most part?


We need public benchmarks.

This is incredibly fast progress on large contexts and I would like to see if they are actually attending equally as well to all of the information or there is some sparse approximation leading to intelligence/reasoning degradation.


https://lmsys.org/blog/2023-05-10-leaderboard/

https://chat.lmsys.org/?arena

Claude by Anthropic has more favourable responses than ChatGPT


So I tried this prompt in their chatbot arena multiple times. Each time getting the wrong answer:

"Given that Beth is Sue's sister and Arnold is Sue's father and Beth Junior is Beth's Daughter and Jacob is Arnold's Great Grandfather, who is Jacob to Beth Junior?"


Is the right answer pointing out that Arnold might not be Beth's father, and so Beth Junior might be unrelated to Jacob?


I just tried it and gpt-3.5-turbo got it right.


ChatGPT3.5*

It's still below GPT4, but it is closer to 4 than 3.5


It means nothing as long as they don't actually let us test the API.

Good luck waiting for it.


“POC or GTFO” as the security people say. :-)


We have access and we’re already playing with the 100K model. It’s pretty insane. We‘re about to ditch all of our recursive summarization code.

If you want me to test something for you, lmk and I’ll send it through the api.


> We‘re about to ditch all of our recursive summarization code.

I’m in an adjacent industry and this is what I’m looking forward to.


See the pricing PDF[^1] and API docs[^2], but TL;DR:

- Price per token doesn't change compared to regular models

- Existing api users have access now by setting the `model` param to "claude-v1-100k" or "claude-instant-v1-100k"

- New customers can join waitlist at anthropic.com/product

[1]: https://cdn2.assets-servd.host/anthropic-website/production/... [2]: https://console.anthropic.com/docs/api/reference#parameters
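
If it helps anyone, here's roughly what a call looks like against the text-completions endpoint; treat the exact field and header names as assumptions and check the API reference above:

    import requests

    long_document = open("book.txt").read()    # up to ~100k tokens of input

    resp = requests.post(
        "https://api.anthropic.com/v1/complete",
        headers={"x-api-key": "YOUR_API_KEY", "content-type": "application/json"},
        json={
            "model": "claude-v1-100k",         # or "claude-instant-v1-100k"
            "prompt": f"\n\nHuman: {long_document}\n\nSummarize the document above."
                      f"\n\nAssistant:",
            "max_tokens_to_sample": 500,
        },
    )
    print(resp.json())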


No pricing, but given that OpenAI's GPT-4 doubles the cost-per-token if you go from 8k to a 32k context window, I suspect the pricing here will be 2-4x from the base Claude model which is 9k: https://cdn2.assets-servd.host/anthropic-website/production/...

Although with flash attention, who knows if marginal cost scales that consistently.


Pricing is the same as the base model.



Huh. Well that changes things.


Only for the duration of the beta


Source?


the actual tweet you linked.


It doesn’t say exclusively for the beta period


With an extremely literal reading you are correct, but there was clearly an implication.


<4x would be quite optimistic; at ~11x the tokens, the quadratic attention cost alone would be over 100x (even with the lower starting point of flash attention), so unless they already have excessive margins it wouldn't make much sense to go that low.


I was assuming they used a different architecture to get the increase instead of just letting it eat hardware that way. Especially with the speed numbers in the post.



Those are the same SKUs I linked.

The new models are a different model identifier that's not listed in the pricing doc, although it sounds like the intent may be to replace the base model, judging from the API docs: https://console.anthropic.com/docs/api/reference#-v1-complet...


I requested & have been waiting for access to Claude for nearly 3 months now. Guess the waitlist must be really long...


API access or just access to the chatbot?

You can go through Poe.com


You likely got rejected. Was the same for me and I reapplied with a good use case and was let in


It's very unfortunate that all of these AI models are so impressive, yet they're all heavily filtered and bogged down by these massive AI corporations. All the filtering that they do heavily impacts the performance of the language models. A 100K context would also be incredible for roleplay but infeasible because of the heavy filtering.


Claude may reject answers, but unlike OpenAI's GPT you can put words into the mouth of the assistant and essentially bypass safety checks.

In fact, Anthropic explicitly discusses putting words into the assistant's mouth to be able to shape its responses and make them better align with the desired output.


Eventually you will get your account banned, not to mention that the filtering that they do decreases the quality of the results you will get compared to an uncensored model, even if you can "jailbreak" it.


If we were to get banned for this, we'd have been banned long ago. We literally process "questionable" content "as a service" and this use case was explicitly approved. (We do heuristics and ML-assisted background checks on unstructured OSINT data.)


I think being "explicitly approved" matters a lot in this context. Regular customers can't benefit from the same privileges you did.


That's just wrong. We ran metric tons of questionable content before being approved for anything.


All I see in the link is empty PR claims - is there any information about how they're doing that? There are all kinds of known techniques that "expand" context window without really doing so, with different tradeoffs, and unless they provide actual information, any claims should be taken with a pile of salt, we shouldn't just assume that they actually have "true" 100k context windows.


How are LLMs increasing their context size? I guess you just increase the input size for the self-supervised GPT-3-style training, but what about RLHF? Are they creating datasets of books to input to the LLM and then making human labelers label the response? There might be a smart way that does not involve new datasets.


Mosaic wrote about their new model here. https://www.mosaicml.com/blog/mpt-7b It was trained on 65k inputs and has decent performance working with 80k+ tokens.


I don't think RLHF datasets need to take full advantage of the context window. There are also many ways to programmatically generate NLP datasets.


> You can drop multiple documents or even a book into the prompt and then ask Claude questions that require synthesis of knowledge across many parts of the text.

This is cool but does it also work the other way around? Generate a book's worth of content based on a single prompt?


Kinda. But it's going to be a lot like how data compression works. There will always be a somewhat fundamental limit to how much "creativity" you can get out of a small prompt generating large texts when using an isolated model.


That's a good question. Can Claude write a coherent book?


I am curious how consistent Claude is at obeying detailed instructions. One issue ChatGPT 3.5 and 4 have, even with just a few hundred words of instructions, is that they forget instructions given earlier on.[1]

This huge context window is awesome though, I'm trying to use LLMs to do small town social interaction simulations, with output in a structured format. Finding ways to compress existing state and pass it around, so the LLM knows the current state of what people in the town did for a given day is hard with a tiny token limit!

[1] For my use cases, early instructions tend to be describing a DSL syntax for responses, if I add too much info after the instructions, the response syntax starts getting wonky!


A simple example I ran into was when I asked ChatGPT to generate a story in madlibs format for my 4 year old daughter. They're in the format "The young _____ went to the ______, ...", and she fills in the blanks with silly nouns/adjectives.

As she kept asking for more, I prompted "great, do another one" and eventually my original instruction fell out of the context window. It continued to generate a children's story, but with no more blanks.


This is actually a different issue, largely a UI one, although one I wish ChatGPT would fix.

There is no good way to tell it "this isn't a conversation, just repeat the answer to the initial prompt again".

The solution is to just re-paste the initial prompt in each time, but still it isn't ideal. There isn't a good way to tell chatgpt "you can throw away all the context after the initial prompt and up until now".

Of course the entire point of ChatGPT is that it maintains a conversation thread, so I get why they don't fix up this edge case.

My problem is more of, I give ChatGPT some complicated instructions, and it'll start forgetting the early on instructions long before any token limit is reached.

So for example, if early on I ask for certain tokens to be returned in parens, well my initial prompt is too long, it'll forget the parens thing and start returning tokens without the surrounding (), which then breaks my parser!


Almost every UI for LLMs I've seen has a way to specify an initial prompt that never goes out of context, it's strange that it's not a feature in ChatGPT.
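
On the API side it only takes a few lines to emulate; a minimal sketch (with a crude word-count stand-in for real token counting):

    def build_messages(system_prompt, history, budget_tokens=3000):
        # Keep the instruction message pinned; drop the oldest turns first.
        def approx_tokens(text):
            return int(len(text.split()) * 1.3)          # rough heuristic only

        kept, used = [], approx_tokens(system_prompt)
        for turn in reversed(history):                    # walk newest -> oldest
            cost = approx_tokens(turn["content"])
            if used + cost > budget_tokens:
                break
            kept.append(turn)
            used += cost
        return [{"role": "system", "content": system_prompt}] + list(reversed(kept))

    messages = build_messages(
        "Write stories in madlibs format, leaving blanks as ______.",
        [{"role": "user", "content": "Tell me a story."},
         {"role": "assistant", "content": "The young ______ went to the ______ ..."},
         {"role": "user", "content": "Great, do another one."}],
    )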


big if true? :)

Exciting to see competition across LLMs for increasing context window size.

I can't find updated pricing anywhere. Previous prices are here: https://cdn2.assets-servd.host/anthropic-website/production/... but don't seem to be embedded directly on the Anthropic website. I tried messing with the URL (apr -> may/jun) but 404'ed.


> Exciting to see competition across LLMs for increasing context window size.

Maybe. I think the debate is going to continue about prompt optimization vs. context window size.

A while ago, I had a rather interesting conversation with GPT-3.5 about forgetting things. Knowing what to forget, or delete from the prompt, may be just as important as what to put in it.

Putting the kitchen sink into the prompt probably isn't going to help much, past a certain point and it may be putting certain things in there based on time and context is a better strategy.


Yeah, there's definitely diminishing returns. I just wanted to talk to ChatGPT about a game I'm developing. I have pages upon pages of product design notes and I'm not able to just copy/paste the whole thing in and start talking to it at 8k context length. There's not really duplicate information as far as I can tell since each section covers new topics. I'm sure there's a way to express the same ideas more succinctly, but I kind of want ChatGPT to do that for me rather than me figuring out how to do that just to interface the ideas into it.


Hey! Try this: https://github.com/featurebasedb/DocGPT

Holler if you want help with it. I have some more code I'm adding to it this week.


With quadratic time complexity for context size, that gets expensive.


Did anyone else get on the waitlist, get in, and now their console link doesn't work? I remember deciding the code generation wasn't good enough to bother. Not sure if I actually ever activated it but I guess not.

Now I tried to request access again on their form and it just redirected. Can't even tell if that worked.

Does anyone know if this can program as well as GPT-4? Because if so then the larger context window is a big improvement.


I do have access to it and from my very limited testing it looks like it can program at least on par with GPT-3.5. I didn't have time yet to test it more comprehensively against GPT-4.


OK great thanks that's what I heard. Very interested to hear about comparisons with GPT-4.


Curious what this will mean for the vector db vendors. Imagine finetuning would be quick and cheap. Could there be a world where vector dbs aren’t needed anymore?


A 100k context limit is still a limit (we have no idea how Anthropic is achieving this - whether it's an extension of the base model's context limit itself, some vector DB trickery in the backend, or possibly even RAG). Even in this example, though it can fit the entire text of The Great Gatsby, that is still one book/text/document. Typical business use cases require searching through hundreds if not thousands of documents/books, finding similar vector embeddings across all of them, and fetching the top-K results (this is how Google search works when it has to scan through embeddings for billions of websites). Those top-K results can then be stuffed into the 100k context to produce an even more holistic picture, rather than stuffing just one book/PDF/file into the context. It depends on the requirements, though. I don't see how this would affect vector DB vendors who can process billions of vectors per query and provide top-K results.

Also, a massive context length is not necessarily a good thing from a cost perspective. It also doesn't work great with a chatbot, as you will have to feed the same 100k tokens' worth of context back into the chatbot for every question, which will turn out to be very expensive. At some point you will have to discard parts of the context so it's specific to the question being asked, and that is where vector embeddings come into play. For one-off research/Q&A the 100k limit works great!


Anyone using Claude? How long did it take you to get access?


Claude is available for free in the Poe app (poe.com). I think it's good and underappreciated.


It is good, but the free subscription to Poe only provides access to Claude Instant. It's impressively fast but not their smartest model (claude-v1.3).


yeah, been using it instead of ChatGPT and it performs better IMO. My conversational LLM of choice for sure.


I've got access, it's _blazing_ fast and seems very good. Solved some of my little puzzles that other models couldn't. I haven't tried ChatGPT-4 yet, but it's the best one that I have used.


You need to try GPT4, if only because GPT3.5 really doesn't compare to it in a lot of ways.


GPT-4 is a major leap ahead of everything else I've used (including GPT-3.5), so definitely worth trying for comparison.


My wallet is hardly capable of handling 8k GPT-4.


This is a really useful thread with posts from people actually using LLMs.

I wonder if Claude-100k could be used to ingest this entire thread and then answer questions based on it, or summarize or identify the pros/cons of certain aspects of Claude, large context windows, vector embeddings, etc.


This seems like it could be a game changer. Modern LLM based applications face a balancing act of context limitations, which often results in some kind of mapreduce-type behavior when that context can’t fit the input

If contexts keep growing, the landscape of LLM application engineering will as well


The problem is that there are usually no public benchmarks, so it is hard to really compare at long context lengths and see whether the models still perform tasks just as intelligently.


Maybe this model can finish Winds of Winter and the rest of GoT for us...


75,000 words is a drop in the bucket for A Song of Ice and Fire:

https://blog.fostergrant.co.uk/2017/08/03/word-counts-popula...



You'd want to generate it in multiple steps to make it feasible to control the text generation anyway. First call generates the broad outline, several parallel calls flesh out character development and some other details so that they're consistent, then generate the story piece by piece by feeding in bits of the outline.


And then you end up with what the movie did which is not exactly a GRRM novel.


That may need a million tokens just for one book, though!


I’d be excited for Dexter ending that doesn’t suck.


That's actually a really interesting use case!


Add Berserk to that list.


I often prefer Claude over GPT4 (partially due to speed), but it degrades more quickly. Like I can get a better response early, but usually the quality drops faster. But, sometimes if it can really vibe with it, it gets better over time.


This is a fascinating study on the impact of context windows on language models. It's interesting to see how smaller context windows can lead to more efficient and accurate language models, even when dealing with complex natural language tasks like question answering. I think this research could have important implications for a wide range of applications, from chatbots and virtual assistants to machine translation and text summarization. I'm looking forward to seeing how these findings are further developed and applied in real-world scenarios.


Nice, that's roughly a 250-page book based on average word counts.


Has anyone actually tried to talk with the Anthropic team about obtaining commercial usage permissions? I'm slightly wary that anything that says "talk to our sales team and explore partnerships" will be a waste of time. But happy to change my mind if anyone has gone through the pain and found it worth it.


Anthropic is basically Google's OpenAI.


It's not a Google company; Google's share amounts to ~10%.


How much did Microsoft own before the 49% deal?

Also, Anthropic has a "Google Cloud partnership"; basically they are hooked on cloud credits, just like OpenAI is with Azure.


Would be great to see some benchmarks on how loss changes across this very large context. It's been technically possible to do 1M+ token contexts for some time with performance deterioration, so it would be interesting to see how this compares to those efforts.


Google is really trying to catch up to OpenAI & MS. The truth is they have never been in the race to begin with. All they had and still have is PR stunts. Let's see if their copying of MS model will produce anything useful.


> The truth is they have never been in the race to begin with.

Product race? My understanding is they've been so concerned with safety/harm that they've been slow to implement a lot of tools - then OpenAI made an attempt at it anyway.

Google has generally been ahead from a research perspective though. And honestly it's going to be really sad if they just stop releasing papers outright - hopefully they release their previous-gen stuff as they go :/


I don't know how anyone can say this with a straight face when Google is the one who invented LLMs as used today to begin with.

Google has a product issue, not an AI research one.


It's usually the least informed with the most self-assured sweeping opinions.


DeepMind and Google invented many other things, but I think the first GPT style token predictor was actually ... GPT, a model by OpenAI. RLHF was also invented at OpenAI. They also had the first text-to-image model.


Curious why you think this? PaLM2 looks great, and Google has been productizing cutting edge AI pretty fast for years.


PaLM 2 can't even solve "Write three sentences ending in the word Apple."

It's worse than GPT-3.5. Go see for yourself at bard.google.com, which is running on PaLM 2 everywhere but the EU as of yesterday.


Ah yes, the famous benchmark for all LLMs. I just tried your novel example with GPT-3.5 and it couldn't solve it either:

> After lunch, I like to snack on a juicy and crisp apple to satisfy my sweet tooth.

> In the fall, many families enjoy going to apple orchards to pick their own apples and make homemade apple pies.

> The new MacBook Pro features a powerful M1 chip and a stunning Retina display, making it the perfect tool for creative professionals who work with Apple software.


Eh, I think as "human evaluated" metrics go, it's a decent test of how well it can parse a reasonably complex sentence and reply accurately.

For me:

GPT4 3/3: I couldn't resist the temptation to take a bite of the juicy, red apple. Her favorite fruit was not a pear, nor an orange, but an apple. When asked what type of tree to plant in our garden, we unanimously agreed on an apple.

GPT3.5 2/3: "After a long day of hiking, I sat under the shade of an apple tree, relishing the sweet crunch of a freshly picked apple." "As autumn approached, the air filled with the irresistible aroma of warm apple pie baking in the oven, teasing my taste buds." "The teacher asked the students to name a fruit that starts with the letter 'A,' and the eager student proudly exclaimed, 'Apple!'"

Bard 0/3: Sure, here are three sentences ending in the word "apple": I ate an apple for breakfast.The apple tree is in bloom. The apple pie was delicious. Is there anything else I can help you with?

Bard definitely seems to fumble the hardest, it's pretty funny how it brackets the response too. "Here's three sentences ending with the word apple!" nope.

Edit: Interestingly enough, Bard seems to outperform GPT3.5 and at least match 4 on my pet test prompt, asking it "What’s that Dante quote that goes something like “before me there were no something, and only something something." 3.5 struggled to find it, 4 finds it relatively quickly, Bard initially told me that quote isn't in the poem but when I reiterated I couldn't remember the whole thing it found it immediately and sourced the right translation. It answered as if it were reading out of a specific translation too - "The source I used was..." Is there agent behavior under the hood of Bard, or is that just how the model is trained to communicate?


I guess PaLM2 is competitive with GPT-3.5 so for people not willing to pay it will be an attractive offering.

I'm not sure that counts as 'great' though.


Based on what do you think it's comparable to GPT-3.5 and not to 4? Did we see a lot of public performance?


They claim it is already being used in Bard; also, if you read the paper, it does much worse on the important benchmarks.


OpenAI is the Microsoft Explorer of AI.


Google has multiple horses in this race.

They invested $300m in Anthropic in late 2022: https://www.ft.com/content/583ead66-467c-4bd5-84d0-ed5df7b5b...

(Non-paywall: https://archive.is/Y5A9B)


What's the catch? Using GPT-4 relative to its own marketing copy was a letdown.


Is this real input context or is it some vectordb in the background type trickery?


Pretty sure it's not "real" (model) context width.

Another wide-context model is MosaicML's MPT-7B-StoryWriter-65k+, which they describe as having a context width of 65k, but then they give a bit more detail and say they are using ALiBi - a type of positional encoding that allows longer contexts at inference time than at training (i.e. beyond the real context width of the model).

For these types of "extended context" models to actually reason over inputs longer than the native context width of the model, I assume that there is indeed some sort of vector DB trickery - maybe paging through the input to generate vector DB content, then using some type of Retrieval Augmented Generation (RAG) to process that using the extended contexts?

Maybe someone from Anthropic or MosaicML could throw us a bone and give a bit more detail of how these are working !

https://www.mosaicml.com/blog/mpt-7b

https://arxiv.org/abs/2005.11401


There is no trickery going on for MPT. The model is open source. MPT-7B-StoryWriter-65k+ was trained on books with 65k context length, so there is nothing not "real" about it. The point of ALiBi is that you can reuse training you did with short contexts for long contexts to save compute, but of course you can also just train with long context.

Who knows about Anthropic.


appears to be real input context


Ok. It has spatial comprehension of some level. Unlike GPT-4 it lacks proper time comprehension because it is bad at calculus. Unlike GPT-4 it can't properly solve the traveling salesman problem.


Is there any path towards folding tokens into the actual model? That is, continual training rather than the current "training first then just tokens after"


PaLM 2 on Vertex AI, which Google just released yesterday, has fine-tuning of the large models as a core part of the offering.


There has got to be a number of fascinating tricks that they're using to support context lengths that long. Shame it's all closed-source.


I use GPT-4 through the API, but I can't help but hate the token/character-based pricing of these LLM APIs we've seen so far. Because the entire context needs to be fed back into the model, as my conversation gets longer, it gets more expensive. Yeah, it's fractions of a cent and cheaper, but something about it is so psychologically taxing that I'd rather pay a flat sum per month and get unlimited access, even if it costs more given my usage.


Have you tried starting a new chat after your first question, but refining your new prompt to include some info you gathered from the first response? This way, you know exactly how many tokens you're going to send.
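
And if you want to know the exact count before sending, tiktoken will tell you (model name here is just an example):

    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-4")
    prompt = ("Earlier you said the bug is in the parser. Given only that, "
              "how would you write a regression test for it?")
    print(len(enc.encode(prompt)), "tokens")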


Nice. Will we be able to get to 1M tokens?


Seems like a good target. Even 100K seems too small. As a reference point, the Bible is ~750,000 words.


"You are a hebrew god and below the dashes is The Word. Who will you smite today?"


How does Claude stack up to GPT-4?


The "Request Access" button currently does nothing


Their sign-up form does not let me sign up for early access.

A bit disappointing


Does context window size matter beyond a certain point?


Anyone with API access tried this for coding already?


So I'm just going to paste in a few physics books and ask it to "make fusion"

What is the approach to increase the sequence length here?


How do I sign-up? What is the cost?


Going to be absolutely expensive.


this is an advertisement, wouldn't it be nice to know this in advance?


In advance of what? The first three words on this article's page, beyond the top banner, are "product", "announcements", and "Introducing". What other category of article could you more reasonably expect to encounter after reading those three words?


in advance of clicking


Is there any paper or architecture about this model that explains how they achieved it? is it unlimiformer [https://arxiv.org/abs/2305.01625?utm_source=tldrai] or something different?


This is incredible


god I'd love to work there


> When we asked the model to spot what was different, it responded with the correct answer in 22 seconds.

Now we've gone from using ML to implement slow, unreliable databases, to using ML to implement slow, unreliable string comparison, I guess


Sounds expensive. I guess we know where the $580M 'investment' from SBF is going now.


The day a quantum computer is able to host a huge LLM, things will get really interesting for humanity.

I say this, because I'm not sure how all of this is really going to scale on GPUs. It feels like LLM's are just as magical as quantum computing.


I am noticing a different tone coming from Anthropic. Unlike OpenAI, they don't appear to be focused on FUD and replacement. It gives the impression it's run by adults instead of crypto bros turned AI experts. Curious how their models will work.


>Unlike OpenAI, they don't appear to be focused on FUD and replacement.

I do not have a clue what you are talking about. What happened?


Um Ilya Sutskever isn't a crypto bro.


No, but Sam Altman is. That company can go whistling.



