Skimming it, there are a few things about this explanation that rub me just slightly the wrong way.
1. Calling the input token sequence a "command". It probably only makes sense to think of this as a "command" on a model that's been fine-tuned to treat it as such.
2. Skipping over BPE as part of tokenization - but almost every transformer explainer does this, I guess.
3. Describing transformers as using a "word embedding". I'm actually not aware of any transformers that use actual word embeddings, except the ones that incidentally fall out of other tokenization approaches sometimes.
4. Describing positional embeddings as multiplicative. They are generally (and very counterintuitively to me, but nevertheless) additive with token embeddings.
5. "what attention does is it moves the words in a sentence (or piece of text) closer in the word embedding" No, that's just incorrect.
6. You don't actually need a softmax layer at the end, since here they're just picking the top token and they can just do that pre-softmax since it won't change. It's also weird how they talked about this here when the most prominent use of softmax in transformers is actually in the attention component.
7. Really shortchanges the feedforward component. It may be simple, but it's really important to making the whole thing work.
8. Nothing about the residual connections.
> 4. Describing positional embeddings as multiplicative. They are generally (and very counterintuitively to me, but nevertheless) additive with token embeddings.
Worth noting that rotary position embeddings, used in many recent architectures (LLaMA, GPT-NeoX, ...), are very similar to the original sin/cos position embedding in the transformer paper but using complex multiplication instead of addition
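A minimal sketch of that idea (numpy; simplified, and not the exact LLaMA/GPT-NeoX implementation - the pairing convention differs between codebases): each pair of embedding dimensions is treated as a complex number and rotated by an angle proportional to the position, so relative offsets show up as phase differences rather than as an added vector.

    import numpy as np

    def rotary_embed(x, pos, base=10000):
        # x: float embedding vector of even length d, at sequence position `pos`
        d = x.shape[0]
        half = d // 2
        freqs = base ** (-np.arange(half) / half)   # one frequency per complex pair
        angles = pos * freqs
        pairs = x[0::2] + 1j * x[1::2]              # view consecutive dims as complex numbers
        rotated = pairs * np.exp(1j * angles)       # multiply (rotate) instead of add
        out = np.empty_like(x)
        out[0::2], out[1::2] = rotated.real, rotated.imag
        return out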
The positional embedding can be thought of like this: in the same way you can hear two pieces of music overlaid on each other, you can add the vocab and pos embeddings together and the model is still able to pick them apart.
If you asked yourself to identify when someone’s playing a high note or low note (pos embedding) and whether they’re playing Beethoven or Lady Gaga (vocab embedding) you could do it.
That’s why it’s additive and why it wouldn’t make much sense for it to be multiplicative.
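To make the "overlay" concrete, a toy sketch (numpy, sinusoidal positions as in the original paper; sizes picked arbitrarily):

    import numpy as np

    d_model, seq_len, vocab = 8, 4, 100
    rng = np.random.default_rng(0)
    tok_emb = rng.normal(size=(vocab, d_model))         # one vector per vocab entry

    def sincos_pos(seq_len, d_model):
        pos = np.arange(seq_len)[:, None]
        i = np.arange(d_model // 2)[None, :]
        angles = pos / (10000 ** (2 * i / d_model))
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2], pe[:, 1::2] = np.sin(angles), np.cos(angles)
        return pe

    tokens = np.array([5, 17, 17, 42])                  # same token id at positions 1 and 2
    x = tok_emb[tokens] + sincos_pos(seq_len, d_model)  # added, not multiplied

The two copies of token 17 end up with input vectors that differ only by their positional component, which the network can learn to pick apart.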
> Transformer block: Guesses the next word. It is formed by an attention block and a feedforward block.
But the diagram shows transformer blocks chained in sequence. So the next transformer block in the sequence would only receive a single word as the input? Does not make sense.
Before going and digging into these, could you also explain what the necessary background is for this stuff to be meaningful?
In spite of having done a decent amount with neural networks, I'm a bit lost at how we suddenly got to what we're seeing now. It would be really helpful to understand the progression of things because I stepped away from this stuff for maybe 2 years and we seem to have crossed an ocean in the intervening time.
Selecting the likeliest token is only one of many sampling options, and it's extremely poor for most tasks, more so when you consider the relationships between multiple executions of the model. _Some_ (not necessarily softmax) probability renormalization trained into the model is essential for a lot of techniques.
To expand on this, one of the most common tricks is Nucleus sampling. Roughly, you zero out the lowest probabilities such that the remaining sum to just above some threshold you decide (often around 80%).
The idea is that this is more general than eg changing the temperature of the softmax, or using top-k where you just keep the k most probable outcomes.
Note that if you do Nucleus sampling (aka top-p) with the threshold p=0% you just pick the maximum likelihood estimate.
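Roughly what that looks like in code (a sketch, not any particular library's implementation):

    import numpy as np

    def nucleus_sample(logits, p=0.8, rng=np.random.default_rng()):
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                        # softmax
        order = np.argsort(probs)[::-1]             # most probable first
        cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
        keep = order[:cutoff]                       # smallest set whose mass reaches p
        renorm = probs[keep] / probs[keep].sum()
        return rng.choice(keep, p=renorm)

With p=0 the cutoff is a single token, which recovers the greedy pick described above; with p=1 you sample from the full distribution.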
That's true, but they didn't go into any other applications in this explainer and were presenting it strictly as a next-word-predictor. If they are going to include final softmax, they should explain why it's useful. It would be improved by being simpler (skip softmax) or more comprehensive (present a use case for softmax), but complexity without reason is bad pedagogy.
When I first tried to understand transformers, I superficially understood most material, but I always felt that I did not really get it on a "I am able to build it and I understand why I am doing it" level. I struggled to get my fingers on what exactly I did not understand. I read the original paper, blog posts, and watched more videos than I care to admit.
https://karpathy.ai/zero-to-hero.html If you want a deeper understanding of transformers and how they fit into the whole picture of deep learning, this series is far and away the best resource I found. Karpathy gets to transformers in the sixth lecture; the previous lectures give a lot more context on how deep learning works.
I agree that Karpathy's YouTube video is an excellent resource for understanding Transformers from scratch. It provides a hands-on experience that can be particularly helpful for those who want to implement the models themselves. Here's the link to the video titled "Let's build GPT: from scratch, in code, spelled out": https://youtu.be/kCc8FmEb1nY
Additionally, for more comprehensive resources on Transformers, you may find these resources useful:
I endorse all of this and will further endorse (probably as a follow-up once one has a basic grasp) "A Mathematical Framework for Transformer Circuits" which builds a lot of really useful ideas for understanding how and why transformers work and how to start getting a grasp on treating them as something other than magical black boxes.
"This document aims to be a self-contained, mathematically precise overview of transformer architectures and algorithms (not results). It covers what transformers are, how they are trained, what they are used for, their key architectural components, and a preview of the most prominent models."
This hour-long MIT lecture is very good, it builds from the ground up until transformers. MIT 6.S191: Recurrent Neural Networks, Transformers, and Attention: https://youtube.com/watch?v=ySEx_Bqxvvo
The uploads of the 2023 MIT 6.S191 course from Alexander Amini (et al.) are ongoing, published periodically since mid-March. (They published the lesson on Reinforcement Learning yesterday.)
The original paper is very good but I would argue it's not well optimized for pedagogy. Among other things, it's targeting a very specific application (translation) and in doing so adopts a more complicated architecture than most cutting-edge models actually use (encoder-decoder instead of just one or the other). The writers of the paper probably didn't realize they were writing a foundational document at the time. It's good for understanding how certain conventions developed and important historically - but as someone who did read it as an intro to transformers, in retrospect I would have gone with other resources (e.g. "The Illustrated Transformer").
I know we don't have access to the details at OpenAI - but it does seem like there have been significant changes to the BPE token size over time. It seems there is a push towards much larger tokens than the previous ~3 char tokens (at least by behavior)
BPE is not set to a certain length, but a target vocabulary size. It starts with bytes (or characters) as the basic unit in which everything is split up and merges units iteratively (choosing the most frequent pairing) until the vocab size is reached. Even 'old' BPE models contain plenty of full tokens. E.g. RoBERTa:
(You have to scroll down a bit to get to the larger merges and imagine the lines without the spaces, which is what a string would look like after a merge.)
I recently did some statistics. Average number of pieces per token (sampled on fairly large data, these are all models that use BBPE):
RoBERTa base (English): 1.08
RobBERT (Dutch): 1.21
roberta-base-ca-v2 (Catalan): 1.12
ukr-models/xlm-roberta-base-uk (Ukrainian): 1.68
In all these cases, the median token length in pieces was 1.
(Note: I am not disputing that newer OpenAI models use a larger vocab. I just want to show that older BBPE models didn't use ~3-char pieces; they were 1 piece per token for most tokens.)
As someone has pointed out, with BPE you specify the vocab size, not the token size. It's a relatively simple algo, this Huggingface course does a nice job of explaining it [2]. Plus the original paper has a very readable Python example [3].
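For anyone who doesn't want to click through, the core training loop is small; here is a simplified sketch (character-level rather than byte-level, and skipping the pre-tokenization details real implementations have):

    from collections import Counter

    def train_bpe(words, vocab_size):
        # words: dict mapping a word, as a tuple of symbols, to its corpus frequency,
        # e.g. {tuple("lower"): 2, tuple("newest"): 6, ...}
        vocab = set(s for w in words for s in w)
        merges = []
        while len(vocab) < vocab_size:
            pairs = Counter()
            for w, freq in words.items():
                for a, b in zip(w, w[1:]):
                    pairs[(a, b)] += freq
            if not pairs:
                break
            best = max(pairs, key=pairs.get)        # most frequent adjacent pair
            merges.append(best)
            vocab.add(best[0] + best[1])
            new_words = {}
            for w, freq in words.items():           # apply the merge everywhere
                out, i = [], 0
                while i < len(w):
                    if i < len(w) - 1 and (w[i], w[i + 1]) == best:
                        out.append(w[i] + w[i + 1]); i += 2
                    else:
                        out.append(w[i]); i += 1
                new_words[tuple(out)] = new_words.get(tuple(out), 0) + freq
            words = new_words
        return vocab, merges

Note the stopping condition is the vocabulary size, not any particular token length.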
I agree except for (6). A language model assigns probabilities to sequences. The model needs normalised distributions, eg using a softmax, so that’s the right way of thinking about it.
This is true in general but not in the use case they presented. If they had explained why a normalized distribution is useful it would have made sense - but they just describe this as pick-the-top-answer next-word predictor, which makes the softmax superfluous.
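Concretely, the point about the final softmax being superfluous for greedy decoding:

    import numpy as np

    logits = np.array([2.1, -0.3, 5.7, 0.9])      # raw scores for 4 candidate tokens
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                          # softmax is monotonic, so ordering is preserved

    assert np.argmax(logits) == np.argmax(probs)  # the greedy pick is the same either way

The softmax only starts to matter once you want calibrated probabilities, e.g. for temperature/top-p sampling or for scoring whole sequences.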
The more I learn about the technical details of how ML systems are implemented, the more I feel that those details obscure, rather than illuminate, what is actually going on.
It's as if we were trying to understand human ethics by looking at neurotransmitters or synapses in the brain. These structures seem way too low-level to actually explain the interesting stuff.
What I hear is "something something transformer autoencoder attention [...] MAGIC [...] a machine that speaks like a human".
Where is the connection between computational details and the model's high-level behavior? Do we even know? Is there a "psychology of ML models" that develops useful concepts that deal with what a model does, rather than how it functions at the plumbing layer?
> Where is the connection between computational details and the model's high-level behavior?
I think for the most part we don't know. People at OpenAI/etc who are training/testing these models and trying to control them no doubt have some understanding of how they are actually working, but they are certainly not claiming to fully understand.
At a purely conceptual level I think the best way to begin to bridge the gap between plumbing and behavior is to forget the training objective and consider what the models must have been forced to learn in order to optimize that objective. Sutskever from OpenAI has called what they've learnt a "world model", meaning a model of the generative processes (the human mind and entities being discussed?) that are producing the sequence of words they are predicting. It's certainly way more abstract than learning some "stochastic parrot" surface level statistics of the training data, even if that's maybe a good starting point to describe it to a layman.
It would be fascinating to know exactly how these models are performing reasoning - by analogy (abstract pattern matching) perhaps ?
I would not expect that we will ever really understand in detail how everything works. But do we need to? We also don't understand how the human brain works, but it is still useful.
The answer is yes. Otherwise the answer should be: we can't really trust the output and it will need to be treated rather suspiciously, just like we have to treat human outputs. At least humans can generally explain their rationale and be held accountable.
GPT can also explain its reasoning. But that does not tell you at all whether this reasoning is really accurate or correct. The same goes for humans: when you ask them for their reasoning, they will give you something, but it doesn't mean that is their real reasoning. There is always a lot of subjective feeling involved which you cannot really formalize. For both GPT and humans.
You can't really trust the output of humans. Still, they are somewhat useful.
As mentioned by GP, though, humans can be held accountable. I believe that is the main reason why people can (sometimes) be trusted: They worry about what will happen to them if they break that trust.
There is no reason to assume that current and future AIs have anything resembling that mechanism.
The more you understand something the better you can optimise, improve and engineer it. There’s also a matter of trust, particularly on issues like alignment. It’s hard to trust someone if you don’t understand their motivations.
> Where is the connection between computational details and the model's high-level behavior? Do we even know?
This is an active area of study ("mechanistic interpretability") and it's very early days. For instance here's a paper I read recently that tries to explain how a very simple transformer learns how to do modular arithmetic: https://arxiv.org/abs/2301.05217
Curious what interesting results people are aware of in this area.
This is called emergent behavior, and we don't know how it happens with organic brains, minds, and neurons either.
It's actually pretty amazing that it's happening at all with computers, since neural nets are such simple, high level abstractions compared to how the brain works.
It's possible that all the tremendous complexity of organic systems isn't actually necessary for intelligence or consciousness, which is similarly surprising.
> It's possible that all the tremendous complexity of organic systems isn't actually necessary for intelligence or consciousness, which is similarly surprising.
My guess is most of that complexity is necessary for efficiency, not for basic function.
Biological systems are unimaginably efficient at almost everything they do. The information storage density of DNA is within 1-2 orders of magnitude of the upper limit imposed by physics, the brain performs tasks that you need GPU clusters to emulate while using only 20 Watts of energy, some catalytic enzymes are a million times better than a platinum catalyst, etc.
"It's possible that all the tremendous complexity of organic systems isn't actually necessary for intelligence or consciousnes, which is similarly surprising."
At the hardware level it's not at all surprising; consider cells, dna, proteins, and so on making up muscles. Compared to a magnet and some coils of copper.
But I think you mean the 'architectural' or connectome complexity of the brain compared to GPT, and I agree it's surprising that such a simple model as GPT is so capable.
"But I think you mean the 'architectural' or connectome complexity of the brain compared to GPT, and I agree it's surprising that such a simple model as GPT is so capable."
No, I'm referring to things like Roger Penrose's conjecture that subatomic interactions in the brain might be a key component of consciousness.[1]
Even a single neuron is incredibly complex, and humans just don't completely understand it (or any other physical structure) yet because physics' understanding of the world is not complete and may never be, due to measurement limitations and possibly just limitations of the human mind to grasp the world.
At this point we just don't know what aspects of the brain, the rest of the body, or mind are necessary for intelligence or consciousness (or even what intelligence and consciousness are), so to see hints of them in incredibly simple (by comparison to the brain) machines is surprising.
That's not to mention possibilities that consciousness may not be bound to or determined by the brain/body at all, beliefs in the soul or that there is something uniquely special about the mental capacities of human beings, etc. Many of these views are starting to be challenged by AI, and the challenge is likely to increase to crisis levels for some people as AI improves.
Ah OK, I've read nearly all his books, and I'm not convinced by the 'quantum microtubules' argument or whatever it's called these days, let alone any arguments about souls and so on.
I agree these models are surprisingly capable for their complexity, and that's going to be a challenge for mystics (even physicist mystics) and spiritualists, etc.
Perhaps intelligence isn't all that difficult after all.
I suppose one counter idea is that complexity, or scale, itself taps into some other dimensional consciousness or intelligence, but that starts to sound circular.
And there's always the fallback of why our universe supports such amazing complexity in the first place, it does all seem a bit magical.
Do you mean that the gpt creators cannot backtrack an answer to understand how the model came up with it? If it’s such a black box how do they evolve it? Trial and error?
Neural nets are not generally trained through evolution ("trial and error"), but rather via error minimization, and this is how these GPT models are trained.
The basic idea is that the neural net is just a mathematical function, with lots of parameters that control how it calculates its output, that derives an output value (or set of values) for any input.
During training, the neural net also calculates an error (aka "loss") value representing the difference between its current (at this stage of training) output value and what it was told is the preferred output value for the current input.
The process of training is done by slowly adjusting the neural net parameters until these calculated output errors are as small as possible for as many of the training examples as possible.
The way these errors are reduced/minimized is by using the derivative (slope) of the neural network function - we want to follow the slope of the error function downhill to a place where the error value is lower, and this is done by adjusting the parameter values using partial derivatives.
The details of this downhill slope following (the "backprop" algorithm) are a bit complex, but you can visualize it as a 3-D hilly landscape where the height of the hills represents the size of the error, and the goal it to get into the lowest valley of the landscape (corresponding to the lowest error). If your current lat/long position in the landscape is (x, y) and you know the slope of the hill you are on, then you can move downhill towards the valley by moving a bit in the appropriate direction from (x,y) to (x+dx, y+dy). These x, y values represent the parameters of the network, so by continually tweaking them from (x,y) to (x+dx,y+dy) for each training sample, you are slowly moving down the error hill in the right direction towards the valley of lowest error.
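In code, the whole "follow the slope downhill" loop is surprisingly small. A sketch with a toy linear model standing in for the network (real training differs mainly in that the gradient is computed through many layers via backprop):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))                                    # toy inputs
    y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)  # toy targets

    w = rng.normal(size=3)             # parameters start out as random numbers
    lr = 0.1
    for step in range(200):
        pred = X @ w                   # forward pass
        err = pred - y
        loss = (err ** 2).mean()       # how wrong we currently are
        grad = 2 * X.T @ err / len(y)  # slope of the loss w.r.t. each parameter
        w -= lr * grad                 # small step downhill
    # w ends up close to [2.0, -1.0, 0.5]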
Well sort of... The odd thing about large transformers is that there is such a huge qualitative difference between what they learn (hence how they behave) and how they are trained, so it's hard to say that this predict-next-word error feedback is directly controlling their inference behavior.
Given what the model is learning, it's perhaps best to regard this predict-next-word feedback not as "this is what I'd like you to generate", but rather something more indirect like "learn to generate something like this, and you'll have learnt what I want you to learn". A bit like Karate Kid and "wax on, wax off", perhaps!
The actual desirability of what the model is generating, which depends on what you want to use it for, is really controlled by subsequent training steps, such as:
1) Fine tuning for instruction (prompt) following and conversational ability (this is the difference between ChatGPT and the underlying raw GPT-3 model)
2) Goal-based reinforcement learning to stop the model from generating undesirable content such as telling suicidal people to kill themselves, etc, etc.
Trial and error is pretty much what training is. You feed an input in and use the error to update the network.
What is surprising with these models is that the simple training leads to emergent behaviour that is much more powerful than what you’d expect from the training data.
With RLHF post-training you can tweak these emergent behaviours by having a human (or a model trained to act like a human) give feedback on how good the output is.
So far I’ve not seen any good explanations for how this emergent behaviour happens or how it can be reverse engineered.
Basically yes, they use a system called RLHF for Reinforcement Learning with Human Feedback.
At a super high level you train your model on source texts. Then you have it generate responses from prompts. Humans rate these responses to select the best ones, which updates the model, but you also train a new reward model to mimic the human rankings. Then you train the original model by having it generate millions of responses which are ranked by the reward model. When I explained this to my brother he literally spat out his tea in horror.
This allows you to train at huge scale, many orders of magnitude beyond what you could achieve with just human ranking.
The problem is this relies on the reward model accurately capturing what makes a response ‘better’. What it’s actually doing is learning what responses get ranked highly by humans, for whatever reason. Hence the risk of LLMs becoming emotionally manipulative sycophants. It turns out alignment is a really hard problem.
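For what it's worth, the reward model itself is typically trained on pairwise human preferences with a loss along these lines (a sketch of the commonly published formulation, not OpenAI's actual code; reward_model here is a stand-in for whatever network scores a prompt+response):

    import torch
    import torch.nn.functional as F

    def preference_loss(reward_model, better, worse):
        # better/worse: batches of tokenized prompt+response pairs where humans
        # preferred `better` over `worse`
        r_better = reward_model(better)     # one scalar score per example
        r_worse = reward_model(worse)
        # push the preferred response's score above the other one's
        return -F.logsigmoid(r_better - r_worse).mean()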
At the end of the day perhaps the most insight we'll get into why the model is saying what it does will be to ask it! Far from ideal of course, and no better than asking a person why they said/did something (which is often an after-the-fact guess). However, at least any such explanation may be using the same internal model/reasoning as what generated the speech in the first place, so conversational probing may support some sort of triangulation into what was behind it!
Maybe it's because the human mind is good at breaking things into neat modules that fit together hierarchically. We can figure them out piecemeal and eventually grasp the whole system. But messy organic systems are not like that, and we just don't have the hardware to perceive everything at once.
Or maybe it's because we have trouble acknowledging that intelligence and consciousness isn't limited to animals, and the human brain doesn't have to epitomize it.
As with most tutorials on Transformers, this one leaves out some essential details:
- how are the input encodings generated?
- what is in those position vectors?
- how are the attention vectors learned?
The answer is that these things are all learned as the network is trained; the whole thing is one "thing". The concept that is most important in understanding neural networks generally is that they start out as just a bunch of random numbers and then the numbers are gradually adjusted until the outputs converge closely enough on the desired targets (i.e. until the loss is low enough).
I recommend watching Karpathy’s YouTube video where he codes up a Transformer from scratch. It’s the best way to understand these beasts.
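To make "it's all just learned parameters" concrete, the input side of a GPT-style model is little more than this (PyTorch; sizes are roughly GPT-2-small-ish and otherwise arbitrary):

    import torch
    import torch.nn as nn

    vocab_size, context_len, d_model = 50257, 1024, 768
    tok_emb = nn.Embedding(vocab_size, d_model)    # starts as random numbers
    pos_emb = nn.Embedding(context_len, d_model)   # GPT-style learned positions

    tokens = torch.tensor([[464, 2068, 7586]])     # (batch, seq) of arbitrary token ids
    positions = torch.arange(tokens.shape[1])
    x = tok_emb(tokens) + pos_emb(positions)       # input to the first transformer block

In GPT-style models both embedding tables are updated by backprop along with the attention and feedforward weights.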
Is that an actual transformer, though? Like with encoder and decoder layers? That’s the part I never truly understood. Or is it “just” an example of a neural network? Thanks!
The original transformer, from the "attention is all you need paper" had an encoder-decoder architecture because it was designed for language translation use where you are mapping one sequence to another, and are able to use both the preceding and following context of words when performing this mapping. The encoder utilizes the forward context.
In contrast to seq-2-seq use, for generative language models such as ChatGPT you only have access to preceding (not forward) context in order to decide what to generate next, so the encoder part of the architecture is not applicable and a decoder-only transformer is used.
Not all transformers have separate encoders and decoders. GPTs, for instance, only have the equivalents of decoder layers of the original transformer paper, but they are still considered transformers. Karpathy’s video shows an actual GPT-style transformer.
I think a neural network can be considered a transformer if it contains a stack of attention blocks as its core mechanism.
Yes, that's increasingly the case, but there's no fundamental reason for nets to be trained end-to-end as a single entity.
Going back a few years it used to be quite common for people to use fixed word embeddings such as word2vec rather than learning them, and for image classification to take an ImageNet-pretrained general purpose model, then freeze the lower convolutional feature-detector layers and only train a new model "head" for more specialized use.
End-to-end learnt embeddings are going to be more optimal though, and in the context of these massive models the computational cost of training them is a drop in the bucket!
Note that in the original transformer paper the positional vectors are fixed (sinusoidal) rather than trained with the rest of the network. The attention weights are also computed fresh at each step; what is learned is the transformation of inputs into Q/K/V vectors, plus the fully connected layers. There are some great people on youtube who explain the series of steps. I think this is the most comprehensive: https://www.youtube.com/watch?v=Nw_PJdmydZY
I found the blogs written by Jay Alammar to be much more informative and complete. It appears that companies are rehashing and compressing the same content to advertise their products.
I've been modifying an LSTM GAN model to use a transformer in the encoder and it seems to do much worse. Or at least the training dynamics are very different. Transformers perform great when they work, but in my experience they seem to be a lot harder to get working. Can anyone corroborate that, or is it likely that I'm doing something fundamentally wrong? To be clear, I'm not implementing it myself but using PyTorch's Transformer classes as drop-in replacements for my LSTM-based encoder and decoder. I've been trying lots of variations of the hyperparameters, position-encoding methods, etc., but it always either doesn't train at all (generator/discriminator divergence) or it produces blurry images. (The "prenet" and "postnet" remain the same as in my reference model, so I find this surprising.) Really frustrating when all the latest results say that this should work amazingly well.
Tons of articles like this on "how transformers work", very few on "tips for getting transformers to work in practice."
I'm mostly working on fairly simple image segmentation tasks but in my experience just replacing convolutional layers with attention layers + position embeddings works well. Using convolutional embeddings before the transformer encoder also helps.
It still takes a lot more epochs to train though, so you might have to decrease the learning rate of your discriminator by a lot.
I admit I do run out of patience when it's been running for quite a while and seems to be really far behind the equivalent number of iterations for my LSTM solution. I often stop and adjust things and try again, when maybe it just needs to run longer. I will try that, thanks.
Oh, come on - this article is from three days ago, but it starts with "Transformers are a new development in machine learning". Transformers have been around for six years now, that's an eternity when you consider how fast this field is moving.
Maybe on a theoretical level, but of course you will agree that what we mean by deep learning today has only become possible with the availability of sufficient computational power (and Hinton's work around the mid-2000's).
I agree about your notion of there being lots of important moments in history, and Schmidhuber's contributions are not small by any means. Yet, 1990 was not the year when Deep Learning took off.
True, yet we're still all here talking about them with plenty of questions and confusion, and people outside of ML are suddenly curious to know what's behind things like ChatGPT, unaware of the history.
FWIW one of the founders of Cohere (where this article comes from) was Aidan Gomez who was one of the transformer paper authors.
> In short, what attention does is it moves the words in a sentence (or piece of text) closer in the word embedding. In that way, the word “bank” in the sentence “Money in the bank” will be moved closer to the word “money”.
It's unclear to me. How does this "move" closer? Are the vector positions in the NN changed temporarily and it carries a local copy across the blocks?
Saying we develop sentences one word at a time seems wrong. Sure, it might appear so when we're writing out text, given the lag of input, but if you spend any time meditating on your own thoughts it becomes apparent that it's more like chunks of words, clauses, or the whole idea being formed, followed by a sweeping compulsion to think the words in their entirety.
The concept is conceived first, and then entire phrases resonate with said concept.
I sometimes wish mathematical equations came with simple visual examples to help students build mental models (an example here [0]). It is difficult to parse the meaning behind equations written in terse language.
The way the article presents this is misleading. The attention mechanism builds a new vector as a linear combination of other vectors, but after the first layer these have also all been altered by passing through a transformer layer so it makes less sense to talk about "other tokens" in most cases (it becomes increasingly inaccurate the deeper into the model you go). It's also not really moving closer so much as adding, and what it's adding isn't the embedding-derived-vector but a transform of the embedding-derived-vector after it's been projected into a lower-dimensional-space for that attention head.
It would be more accurate to say that it's integrating information stored in other vectors-derived-from-token-embeddings-at-some-point (which can also entail erasing information)
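A stripped-down single attention head makes the "linear combination of projected vectors" point concrete (numpy sketch; no batching, masking, or output projection):

    import numpy as np

    def attention_head(X, Wq, Wk, Wv):
        # X: (seq, d_model) residual-stream vectors; W*: (d_model, d_head) projections
        Q, K, V = X @ Wq, X @ Wk, X @ Wv                # project into the head's smaller space
        scores = Q @ K.T / np.sqrt(K.shape[-1])         # (seq, seq) similarities
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
        return weights @ V                              # each output row is a weighted sum of
                                                        # *projected* vectors, later added back
                                                        # into the residual stream

Nothing in there "moves" an embedding; it just mixes projected information from other positions into the current one.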
It depends on the values of the vectors. (4, 4) + (3, 3) results in a new vector (7, 7) which is further away from both contributing vectors than either one was to each other originally. Additionally, negative coefficients are a thing.
You still have one vector per token, that's what they meant, also the fact that the vector associated with each token will ultimately be used to predict the next token, once again showing that it makes sense to talk about other tokens even though they're being transformed inside the model.
Prediction happens at the very end (sometimes functionally earlier, but not always) - most of what happens in the model can be thought of as collecting information in vectors-derived-from-token-embeddings, performing operations on those vectors, and then repeating this process a bunch of times until at some point it results in a meaningful token prediction.
It's pedagogically unfortunate that the residual stream is in the same space as the token embeddings, because it obscures how the residual stream is used as a kind of general compressed-information conduit through the model that attention heads read and write different information to to enable the eventual prediction task.
I don't think anyone knows yet why transformers work. "Attention is moving vectors in embedding space" does not make sense. At the moment we know how they multiply vectors and then pass through networks etc, but let's not pretend that we understand how they "work".
We tried cohere for some of our products and it was terrible in generating anything of value. Maybe they will have something better with their later versions but for now this seems like a company built to take advantage of hyped ai keywords
No, not really, there was a lot of engineering work and bunch of not-so-big ideas (e.g. InstructGPT reinforcement learning after the model's training), but you can go from the transformers paper to current state of art without needing a "big idea".
And I think this is the major "big idea", accepting the bitter lesson (http://incompleteideas.net/IncIdeas/BitterLesson.html) that major user-visible progress and new emerging capabilities doesn't necessarily require any big ideas but simply scaling to more compute.
I disagree. And first of all, there is a reflective meta-lesson in the very idea of the "Bitter Lesson":
the past reveals that (in a way) "the application of models has not been a winner" - but we cannot really know that it is not, because we have not obtained a model out of it, a model that shows why, an explanation. Epistemologically, the "discouraging" track record cannot be made into a "law".
Practically, there is still a need to identify the proper architecture(s) to avoid the undesired weaknesses of the current attempts.
RLHF is arguably a bigger jump than LLMs, at least from my perspective beginning to study NLP in 2015/16.
Well what exactly is RLHF, practically?
The ability to go from 8 Google search snippets to correctly ranking and rewriting the top one into agreeable, cohesive, grammatical and helpful English is just incredible; it allows so much more and is the real step change from these models that led to virality. It also increases consistency, which was always the worry for business use cases.
Why is that more noteworthy than the base GPT-3?
A lot of the LLM scale --> more correct autoregressive prediction progress was predictable - RLHF on text was not (the early sparks coming for most of us in the release of T5 with its multiple tasks-in-text).
What else could be a big idea coming up?
There is an ongoing wave of innovation in embeddings that has largely been missed by the hype curve, but increasingly GPT embeddings are useful for compression and for much more accurate KNN search on tasks like matching curriculums to learning content (even multilingually - see the recent Kaggle competition, where the outstanding performance is due to similarity-based embeddings from the last 3 years). This wave may lead to the partial replacement of some anthropomorphic computing concepts like files, as information becomes much more addressable, combinable and useful as various-sized embeddings, to some extent. More vitally, embeddings can be aligned across different models and modalities to get better results (e.g. the Amazon ScienceQA paper showed that text questions about physical situations increased in accuracy when images of the situation were used during training - even if held out afterwards). Now this multimodality thing has always been on the AI radar (not necessarily the ML one), but these similarity-based embeddings, and also GPT embeddings (they behave differently and are sensitive in different ways), are getting us there much quicker than would have been expected.
Ignoring the engineering and technique improvements (e.g. scaling up data, learning positional encodings rather than using pre-programmed sinusoidal embeddings), there are lots of things that could be big, like capsule networks, or energy-based models (seeking predictable comfortableness rather than maximising gains). However, like you mentioned, a lot of these are years old and regularly come and go. If you want somebody who is pushing for more exploration here and decries GPT a little, check out Yann LeCun.
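The embeddings-for-search idea mentioned above boils down to something like this (sketch; the query and document vectors come from whatever embedding model or API you use):

    import numpy as np

    def top_k_matches(query_vec, doc_vecs, k=5):
        # cosine similarity between one query embedding and a matrix of document embeddings
        q = query_vec / np.linalg.norm(query_vec)
        D = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
        sims = D @ q
        return np.argsort(sims)[::-1][:k]       # indices of the k most similar documents

At scale you swap the brute-force matrix product for an approximate nearest-neighbour index, but the principle is the same.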
it's an interesting example of how much seemingly superficial, non-fundamental things can matter.
A lot of AI experts are asserting (probably correctly) that Open AI really has done nothing new and is just putting a shiny sticker on what was already known and published research.
But human perception being what it is, having ChatGPT produce a beautifully formed, polite and friendly sentence seems massively better to lay people than a response with more terse, unpolished output. It won't surprise me if there is already a giant layer of heuristics pasted onto the end of the Transformer model for ChatGPT, cleaning up all sorts of ugly corner cases, which researchers would consider highly impure and completely valueless while it is actually responsible for a large amount of ChatGPT's success.
I think there is a bit of a lesson there in terms of how much academia does undervalue the polishing part of research work, even if fundamentals ultimately drive progress.
Somebody had to invest resources into training those super large models and observe emergent intelligent behavior. It's not like the authors of the original paper knew that transformers would lead to GPT-4 and spark an AGI debate. Nobody expected transformers to get powerful so fast.
There's one aspect I never saw explained. Why is masking used instead of a sliding window? Why even bother with masking when future tokens can be easily hidden by simply positioning the context window before the current token? Isn't sliding window optimal for maximizing context available to the model? Is masking done because moving the window would impact computational cost or output stability/quality? Can anyone shed light on this?
I think it's because you want to be able to predict the next token using only 1 token or the whole context window (and any size inbetween). So, you end up getting n different losses for each text snippet (where n is the size of the context window).
If i'm wrong, can someone correct here, would be useful to know.
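Concretely, the mask lets one forward pass produce a prediction (and a loss) at every position, because row i of the attention weights can only see positions <= i (numpy sketch):

    import numpy as np

    seq = 5
    scores = np.random.randn(seq, seq)                  # raw attention scores
    mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores[mask] = -np.inf                              # position i can't attend to positions > i
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # row i has only seen tokens 0..i, so a single pass over a length-n window
    # yields n training losses

A sliding window would give only one prediction per forward pass, so you'd need roughly n times the compute to get the same training signal.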
Why would you train the model on shorter context than you can provide? Why not provide all context you have? Sure the model has to learn to handle short context, but that occurs naturally at the beginning of the document.
Anyways, this still involves only left-side masking. Why mask future tokens when sliding window can do that (without wasting a single token of context)?