Does anyone have any insight (or informed guesses) that might explain the strange downward "spike" that was consistently observed at dimension 196 in OpenAI's text-embedding-ada-002 model?
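For anyone who wants to look at this themselves, here's a minimal sketch (assuming the openai Python SDK and an API key in OPENAI_API_KEY; the index-196 convention is the question's, and may be off by one) that embeds a few unrelated strings and prints the value at that dimension:

```python
# Minimal sketch: check the value at dimension 196 of a few ada-002 embeddings.
# Assumes `pip install openai` and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

texts = ["hello world", "quantum chromodynamics", "recipe for pancakes"]
resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)

for text, item in zip(texts, resp.data):
    vec = item.embedding  # 1536-dimensional list of floats
    print(f"{text!r}: dim 196 = {vec[196]:.4f}")
```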
Can anyone comment on how the limits on GPT-X’s token space translate to limits on its vocabulary (with corresponding limits on understanding input and generating output)?
For example, is GPT-4’s list of ~100k tokens sufficient to understand and generate every non-obsolete word in the English language (per, say, a standard dictionary)? Or even every word in the training data?
If not, do we have examples of ordinary words that it is impossible for GPT-4 ever to understand or generate? What happens when it encounters those words and is unable to tokenize them? Are they simply ignored (e.g. omitted from the input vector, or set to 0 or some sort of null token)?
IIRC from poking around in the LLaMA internals (I assume ChatGPT is the same since it’s the obvious way to handle this): the token list has a complete set of tokens of length 1. This means that in the degenerate case where the tokenizer can’t compose the text out of any other tokens it’ll still be processable, just as a collection of single-character tokens that the language model presumably has vaguer associations for. (Which I imagine doesn’t actually affect things significantly; if you added more tokens for less-frequently-seen strings, it still wouldn’t have much of an idea what to do with them.)
You are almost correct, though it doesn't happen at the character level, it happens at the byte level. Most characters are in the LLaMA tokenizer's vocabulary, but not all of them are. So if you use a character that was uncommon in the training material, it will fall back to byte-level tokens. In most cases 1 character can be represented as 1 byte (and thus 1 byte-level token). However, some characters require more than 1 byte in UTF-8; those characters might end up taking as many as 4 tokens.
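A minimal sketch of the fallback behaviour being described (illustrative only; this is not the actual LLaMA tokenizer code, and the toy vocabulary is made up):

```python
# Illustrative sketch of byte-level fallback, not the real LLaMA tokenizer.
# A character missing from the (toy) vocabulary is replaced by one token per
# UTF-8 byte, so a 4-byte character costs 4 tokens.
def byte_fallback_tokenize(text, vocab):
    tokens = []
    for ch in text:
        if ch in vocab:
            tokens.append(ch)  # normal single-character token
        else:
            # fall back to one "<0xNN>" token per UTF-8 byte
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return tokens

toy_vocab = set("abcdefghijklmnopqrstuvwxyz ")
print(byte_fallback_tokenize("cafe", toy_vocab))  # ['c', 'a', 'f', 'e']
print(byte_fallback_tokenize("🦙", toy_vocab))     # ['<0xF0>', '<0x9F>', '<0xA6>', '<0x99>']
```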
> However, some characters require more than 1 byte in UTF-8; those characters might end up taking as many as 4 tokens.
This would seem to raise an interesting "prompt golf" challenge: find a reasonable-sounding prompt that causes the language model to generate invalid UTF-8 in its output.
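For context on what "invalid UTF-8" would look like here (a sketch of the failure mode only, not a claim about how any particular model's decoder behaves): if byte-level tokens are emitted independently, the decoded byte string can end on a truncated multi-byte sequence.

```python
# The first two bytes of a 4-byte UTF-8 character are not valid UTF-8 on their own.
truncated = bytes([0xF0, 0x9F])
try:
    truncated.decode("utf-8")
except UnicodeDecodeError as err:
    print("invalid UTF-8:", err)
```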
My current understanding is that the lack of a token for a specific word does nothing to prevent that word from being "understood" or produced in output: GPT-4 is very capable of consuming and producing text in languages such as Spanish, even though most Spanish words don't correspond to a single token.
For Russian text it degrades to basically 1 character = 1 token, due to the tokenization issues discussed in the article, yet it produces absolutely coherent text, almost the same as in English. In my tests its Russian output is worse than its English output, though; something like 80% of the quality, I'd say. I'm not an LLM expert, but I have a theory that, being mostly trained on English text, its thought processes actually happen in English (in the part of the model that was trained on English text), and for Russian it maps English to Russian and back using its translation ability. I've noticed it sometimes produces slightly awkward sentences whose word choice makes sense in English (calques?) but not so much in Russian.
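A quick way to see the per-language difference in token counts (assuming the tiktoken package; cl100k_base is the encoding GPT-4 uses, and the example sentences are just rough translations of each other):

```python
# Compare token counts for roughly equivalent English and Russian sentences.
# Assumes `pip install tiktoken`.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4

english = "The weather is nice today and I am going for a walk."
russian = "Сегодня хорошая погода, и я иду на прогулку."

for label, text in [("English", english), ("Russian", russian)]:
    tokens = enc.encode(text)
    print(f"{label}: {len(text)} chars -> {len(tokens)} tokens")
```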
I think the same. LLMs are effectively multilingual: they can transform input in any language into an internal representation and then produce output in some other language, thanks to the many layers of neurons inside them.
Why was my Realtime Database reported bandwidth lower than average between September 2016 and March 2017?
For our bandwidth calculations, we normally include SSL encryption overhead (based on layer 5 of the OSI model). However, in September 2016, we introduced a bug that caused our bandwidth reporting to ignore encryption overhead. This might have resulted in artificially low reported bandwidth and bills on your account for a few months.
We released a fix for the bug in late March 2017, returning bandwidth reporting and billing to their normal levels.
How is the computational power expended to encrypt relevant to the bandwidth bill, though? I understand it's an expense for them, but (1) 7500% is not a correction in measurements (they must have noticed over 2 years if it actually cost that much) and (2) it's a completely new expense category. If your plans are bandwidth-based, you can't just force someone into a larger plan when their bandwidth stayed the same; you should announce and explain the change in plans.
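For a rough sense of scale (back-of-the-envelope only, with assumed per-record sizes; this is generic TLS framing overhead, not Firebase's actual billing formula): encryption overhead on the wire is typically single-digit percent, nowhere near a 7500% difference.

```python
# Back-of-the-envelope TLS framing overhead: 5-byte record header plus a
# 16-byte AES-GCM auth tag per record (handshake traffic ignored).
HEADER = 5   # TLS record header, bytes
TAG = 16     # AEAD authentication tag, bytes

for payload in (512, 4 * 1024, 16 * 1024):  # application bytes per record
    overhead = (HEADER + TAG) / payload * 100
    print(f"{payload:>6} B payload -> ~{overhead:.1f}% encryption overhead")
```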
I'm guessing they had to update this FAQ after they started ignoring the OP's (and other customers') messages. Also, something is wrong with the timeline with regard to the OP's blog post, where they say they've been using Firebase for a while.
Yeah OP says they've been using FB with similar costs for over two years, so I highly doubt they're complaining about something that was only a problem for the last 6 months.
This analysis from Ben Thompson (Stratechery[1]) just yesterday would be a great place to start:
"Imagine a Twitter app that, instead of a generic Moment that is little more than Twitter’s version of a thousand re-blogs, let you replay your Twitter stream from any particular moment in time. Miss the Oscars gaffe? Not only can you watch the video, you can read the reactions as they happen, from the people you actually care enough to follow. Or maybe see the reactions through someone else’s eyes: choose any other user on Twitter, and see what they saw as the gaffe happened.
What is so powerful about this seemingly simple feature is that it would commoditize “live” in a way that is only possible digitally, and that would uniquely benefit the company: now the experience of “live” (except for the shock value) would be available at any time, from any perspective, and only on Twitter. That such a feature does not exist — indeed, that the company’s stated goal is to become more like old media, instead of uniquely leveraging digital — is as good an explanation for why the company has foundered as any."
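To make the feature concrete, here's a minimal sketch of the data-access pattern being described (all names and structures are hypothetical, just to illustrate "replay the stream as user X saw it at moment T"):

```python
# Hypothetical sketch of "replay" as a query: given stored tweets and a follow
# graph, reconstruct the timeline any user would have seen around a moment in time.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Tweet:
    author: str
    posted_at: datetime
    text: str

def replay(tweets, follows, viewer, moment, window=timedelta(minutes=10)):
    """Return the tweets `viewer` would have seen in the window before `moment`."""
    visible = [
        t for t in tweets
        if t.author in follows.get(viewer, set())
        and moment - window <= t.posted_at <= moment
    ]
    return sorted(visible, key=lambda t: t.posted_at)

# Tiny usage example with made-up data.
oscars = datetime(2017, 2, 26, 21, 10)
tweets = [
    Tweet("critic", oscars - timedelta(minutes=1), "Wait, wrong envelope?!"),
    Tweet("friend", oscars + timedelta(hours=2), "Catching up on the Oscars."),
]
follows = {"me": {"critic"}, "someone_else": {"friend"}}
for t in replay(tweets, follows, "me", oscars):
    print(t.posted_at, t.author, t.text)
```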
It may sound egotistical, but it's also a solid way to build something akin to a 'book', albeit in a discrete, blogged form: you quote yourself so you don't have to make the same point over and over again in so many different phrasings. It keeps the arguments consistent and rewards thoroughness and attention to each post (coherence, wording, etc.). Eventually it builds up a coherent, cohesive corpus of ideas that, should the theme be well-defined, is hopefully a book.
Usually when going for publication, you'd replace these quotes with references to previous chapters, typically in footnotes.
It was a great episode and is worth the listen for those interested.
On Swift adoption at Apple:
"The Swift team itself has specific goals they need to achieve before there can be truly, across-the-board adoption at Apple. ABI stability is the number-one thing [35:30] that prevents framework developers, for example, from adopting Swift. That's a really important thing. That's one of the reasons it's always a really high priority. Swift has been adopted by application developers and other things. The Dock is public. Swift Playgrounds app is public. The Music app in iOS is publicly known. So there are definitely some big adopters.
More broadly though, the big problem is that I think, I won't speak for everybody but many, many people doing [36:00] Objective-C development at Apple are chomping at the bit. They want to be using Swift. It's really just a matter of getting the technology problems solved and checking off the things that are holding people back. It's not about people dragging their feet and not wanting to use it."
On whether to adopt Swift now:
"I don't [1:14:30] think Objective-C is going to go away anytime soon. Apple still supports C and C++ and there's no obvious benefit of dropping Objective-C, and obviously they have a ton of Objective-C code themselves."
He also described Apple's approach to some of the strategic questions that arose early on, such as whether to just invest in making Objective-C better instead of introducing Swift, and the various trade-offs involved.
"It is our goal to train a trillion-parameter model on a trillion-word corpus. We have not scaled our systems this far as of the writing of this paper, but it should be possible by adding more hardware."