Does anyone have any insight (or informed guesses) that might explain the strange downward "spike" that was consistently observed at dimension 196 in OpenAI's text-embedding-ada-002 model?
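For anyone who wants to look at this themselves, here's a minimal sketch (assuming the openai Python SDK and an API key in OPENAI_API_KEY; the index-196 convention is the question's, and may be off by one) that embeds a few unrelated strings and prints the value at that dimension:

```python
# Minimal sketch: check the value at dimension 196 of a few ada-002 embeddings.
# Assumes `pip install openai` and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

texts = ["hello world", "quantum chromodynamics", "recipe for pancakes"]
resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)

for text, item in zip(texts, resp.data):
    vec = item.embedding  # 1536-dimensional list of floats
    print(f"{text!r}: dim 196 = {vec[196]:.4f}")
```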
Can anyone comment on how the limits on GPT-X’s token space translate to limits on its vocabulary (with corresponding limits on understanding input and generating output)?
For example, is GPT-4’s list of ~100k tokens sufficient to understand and generate every non-obsolete word in the English language (per, say, a standard dictionary)? Or even every word in the training data?
If not, do we have examples of ordinary words that it is impossible for GPT-4 ever to understand or generate? What happens when it encounters those words and is unable to tokenize them? Are they simply ignored (e.g. omitted from the input vector, or set to 0 or some sort of null token)?
IIRC from poking around in the LLaMA internals (I assume ChatGPT is the same since it’s the obvious way to handle this): the token list has a complete set of tokens of length 1. This means that in the degenerate case where the tokenizer can’t compose the text out of any other tokens it’ll still be processable, just as a collection of single-character tokens that the language model presumably has vaguer associations for. (Which I imagine doesn’t actually affect things significantly; if you added more tokens for less-frequently-seen strings, it still wouldn’t have much of an idea what to do with them.)
You are almost correct, though it doesn't happen at the character level, it happens at the byte level. Most characters are in the LLaMA tokenizer's vocabulary, but not all of them are. So if you use a character that was uncommon in the training material, it will fall back to byte-level tokens. In most cases 1 character can be represented as 1 byte (and thus 1 byte-level token). However, some characters require more than 1 byte in UTF-8; those characters might end up taking as many as 4 tokens.
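A minimal sketch of the fallback behaviour being described (illustrative only; this is not the actual LLaMA tokenizer code, and the toy vocabulary is made up):

```python
# Illustrative sketch of byte-level fallback, not the real LLaMA tokenizer.
# A character missing from the (toy) vocabulary is replaced by one token per
# UTF-8 byte, so a 4-byte character costs 4 tokens.
def byte_fallback_tokenize(text, vocab):
    tokens = []
    for ch in text:
        if ch in vocab:
            tokens.append(ch)  # normal single-character token
        else:
            # fall back to one "<0xNN>" token per UTF-8 byte
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return tokens

toy_vocab = set("abcdefghijklmnopqrstuvwxyz ")
print(byte_fallback_tokenize("cafe", toy_vocab))  # ['c', 'a', 'f', 'e']
print(byte_fallback_tokenize("🦙", toy_vocab))     # ['<0xF0>', '<0x9F>', '<0xA6>', '<0x99>']
```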
> However, some characters require more than 1 byte in UTF-8; those characters might end up taking as many as 4 tokens.
This would seem to raise an interesting "prompt golf" challenge: find a reasonable-sounding prompt that causes the language model to generate invalid UTF-8 in its output.
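For context on what "invalid UTF-8" would look like here (a sketch of the failure mode only, not a claim about how any particular model's decoder behaves): if byte-level tokens are emitted independently, the decoded byte string can end on a truncated multi-byte sequence.

```python
# The first two bytes of a 4-byte UTF-8 character are not valid UTF-8 on their own.
truncated = bytes([0xF0, 0x9F])
try:
    truncated.decode("utf-8")
except UnicodeDecodeError as err:
    print("invalid UTF-8:", err)
```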
My current understanding is that the lack of a token for a specific word does nothing to prevent that word from being "understood" or produced in output: GPT-4 is very capable of consuming and producing text in languages such as Spanish, even though most Spanish words don't correspond to a single token.
For Russian text it degrades to basically 1 character = 1 token, due to the tokenization issues discussed in the article, yet it produces absolutely coherent text, almost the same as in English. In my tests its Russian output is worse than its English output, though; something like 80% of the quality, I'd say. I'm not an LLM expert, but I have a theory that, being mostly trained on English text, its thought processes actually happen in English (in the part of the model that was trained on English text), and for Russian it maps English to Russian and back using its translation ability. I've noticed it sometimes produces slightly awkward sentences whose word choice makes sense in English (calques?) but not so much in Russian.
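A quick way to see the per-language difference in token counts (assuming the tiktoken package; cl100k_base is the encoding GPT-4 uses, and the example sentences are just rough translations of each other):

```python
# Compare token counts for roughly equivalent English and Russian sentences.
# Assumes `pip install tiktoken`.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4

english = "The weather is nice today and I am going for a walk."
russian = "Сегодня хорошая погода, и я иду на прогулку."

for label, text in [("English", english), ("Russian", russian)]:
    tokens = enc.encode(text)
    print(f"{label}: {len(text)} chars -> {len(tokens)} tokens")
```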
I think the same. LLMs are effectively multilingual: they can transform input in any language into an internal representation and then produce output in some other language, thanks to the many layers of neurons inside them.
Why was my Realtime Database reported bandwidth lower than average between September 2016 and March 2017?
For our bandwidth calculations, we normally include SSL encryption overhead (based on layer 5 of the OSI model). However, in September 2016, we introduced a bug that caused our bandwidth reporting to ignore encryption overhead. This might have resulted in artificially low reported bandwidth and bills on your account for a few months.
We released a fix for the bug in late March 2017, returning bandwidth reporting and billing to their normal levels.
How is the computational power expended to encrypt relevant to the bandwidth bill, though? I understand it's an expense for them, but (1) 7500% is not a correction in measurements (they must have noticed over 2 years if it actually cost that much) and (2) it's a completely new expense category. If your plans are bandwidth-based, you can't just force someone into a larger plan when their bandwidth stayed the same; you should announce and explain the change in plans.
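For a rough sense of scale (back-of-the-envelope only, with assumed per-record sizes; this is generic TLS framing overhead, not Firebase's actual billing formula): encryption overhead on the wire is typically single-digit percent, nowhere near a 7500% difference.

```python
# Back-of-the-envelope TLS framing overhead: 5-byte record header plus a
# 16-byte AES-GCM auth tag per record (handshake traffic ignored).
HEADER = 5   # TLS record header, bytes
TAG = 16     # AEAD authentication tag, bytes

for payload in (512, 4 * 1024, 16 * 1024):  # application bytes per record
    overhead = (HEADER + TAG) / payload * 100
    print(f"{payload:>6} B payload -> ~{overhead:.1f}% encryption overhead")
```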
I'm guessing they had to update this FAQ after they started ignoring the OP's (and other customers') messages. Also, something is wrong with the timeline with regard to the OP's blog post, where they say they've been using Firebase for a while.
Yeah OP says they've been using FB with similar costs for over two years, so I highly doubt they're complaining about something that was only a problem for the last 6 months.
This analysis from Ben Thompson (Stratechery[1]) just yesterday would be a great place to start:
"Imagine a Twitter app that, instead of a generic Moment that is little more than Twitter’s version of a thousand re-blogs, let you replay your Twitter stream from any particular moment in time. Miss the Oscars gaffe? Not only can you watch the video, you can read the reactions as they happen, from the people you actually care enough to follow. Or maybe see the reactions through someone else’s eyes: choose any other user on Twitter, and see what they saw as the gaffe happened.
What is so powerful about this seemingly simple feature is that it would commoditize “live” in a way that is only possible digitally, and that would uniquely benefit the company: now the experience of “live” (except for the shock value) would be available at any time, from any perspective, and only on Twitter. That such a feature does not exist — indeed, that the company’s stated goal is to become more like old media, instead of uniquely leveraging digital — is as good an explanation for why the company has foundered as any."
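To make the feature concrete, here's a minimal sketch of the data-access pattern being described (all names and structures are hypothetical, just to illustrate "replay the stream as user X saw it at moment T"):

```python
# Hypothetical sketch of "replay" as a query: given stored tweets and a follow
# graph, reconstruct the timeline any user would have seen around a moment in time.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Tweet:
    author: str
    posted_at: datetime
    text: str

def replay(tweets, follows, viewer, moment, window=timedelta(minutes=10)):
    """Return the tweets `viewer` would have seen in the window before `moment`."""
    visible = [
        t for t in tweets
        if t.author in follows.get(viewer, set())
        and moment - window <= t.posted_at <= moment
    ]
    return sorted(visible, key=lambda t: t.posted_at)

# Tiny usage example with made-up data.
oscars = datetime(2017, 2, 26, 21, 10)
tweets = [
    Tweet("critic", oscars - timedelta(minutes=1), "Wait, wrong envelope?!"),
    Tweet("friend", oscars + timedelta(hours=2), "Catching up on the Oscars."),
]
follows = {"me": {"critic"}, "someone_else": {"friend"}}
for t in replay(tweets, follows, "me", oscars):
    print(t.posted_at, t.author, t.text)
```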
It may sound egotistical, but it's also a solid way to build something akin to a 'book', albeit in a discrete, blogged form: you quote yourself so you don't have to make the same point over and over again in so many different phrasings. It keeps the arguments consistent and rewards thoroughness and attention to each post (coherence, wording, etc.). Eventually it builds up a coherent, cohesive corpus of ideas that, should the theme be well-defined, is hopefully a book.
Usually when going for publication, you'd replace these quotes with references to previous chapters, typically in footnotes.
It was a great episode and is worth the listen for those interested.
On Swift adoption at Apple:
"The Swift team itself has specific goals they need to achieve before there can be truly, across-the-board adoption at Apple. ABI stability is the number-one thing [35:30] that prevents framework developers, for example, from adopting Swift. That's a really important thing. That's one of the reasons it's always a really high priority. Swift has been adopted by application developers and other things. The Dock is public. Swift Playgrounds app is public. The Music app in iOS is publicly known. So there are definitely some big adopters.
More broadly though, the big problem is that I think, I won't speak for everybody but many, many people doing [36:00] Objective-C development at Apple are chomping at the bit. They want to be using Swift. It's really just a matter of getting the technology problems solved and checking off the things that are holding people back. It's not about people dragging their feet and not wanting to use it."
On whether to adopt Swift now:
"I don't [1:14:30] think Objective-C is going to go away anytime soon. Apple still supports C and C++ and there's no obvious benefit of dropping Objective-C, and obviously they have a ton of Objective-C code themselves."
He also described Apple's approach to some of the strategic questions that arose early on, such as whether to just invest in making Objective-C better instead of introducing Swift, and the various trade-offs involved.
"It is our goal to train a trillion-parameter model on a trillion-word corpus. We have not scaled our systems this far as of the writing of this paper, but it should be possible by adding more hardware."