I'm not a fan of the database lookup analogy either.
The analogy I prefer when teaching attention is celestial mechanics. Tokens are like planets in latent space. The attention mechanism is a kind of "gravity" by which tokens influence one another, pushing and pulling each other around to refine their meaning. But instead of depending on distance and mass, this gravity is proportional to semantic inter-relatedness, and instead of physical space it operates in a latent space.
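To make the analogy concrete, here is a toy sketch of standard single-head dot-product attention read through that lens (all names, dimensions, and random weights are illustrative): one attention step shifts each token's vector toward the others in proportion to their semantic relatedness.

```python
import numpy as np

# Toy illustration of the "gravity" reading of attention: each token's
# vector is pulled toward the others in proportion to how related they
# are, measured by scaled dot products rather than mass and distance.

def attention_step(X, Wq, Wk, Wv):
    """One single-head attention update over token vectors X (n_tokens x d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # pairwise semantic relatedness
    pull = np.exp(scores - scores.max(axis=-1, keepdims=True))
    pull /= pull.sum(axis=-1, keepdims=True)     # softmax: relative "gravitational" weights
    return X + pull @ V                          # residual: tokens shifted in latent space

rng = np.random.default_rng(0)
d = 16
X = rng.normal(size=(5, d))                      # five tokens in a 16-dim latent space
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
X_refined = attention_step(X, Wq, Wk, Wv)        # each token nudged by the others
```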
Basically a boid simulation where a swarm of birds can collectively solve MNIST. The goal is not some new SOTA architecture; it is to find the right trade-off where the system already exhibits complex emergent behavior while the swarming rules remain simple.
It is currently abandoned due to a serious lack of free time (*), but I would consider collaborating with anyone willing to put in some effort.
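For reference, the kind of simple swarming rules this alludes to are the classic boid rules (Reynolds 1987): cohesion, separation, alignment. The sketch below shows only those rules; the learning/readout mechanism that would let such a swarm classify MNIST is the open part of the project and is omitted, and all constants are illustrative, not tuned.

```python
import numpy as np

# Classic boid update: cohesion, separation, alignment over a local
# neighborhood of radius r. Constants are illustrative.

def boid_step(pos, vel, r=1.0, w_coh=0.01, w_sep=0.05, w_ali=0.05, dt=0.1):
    new_vel = vel.copy()
    for i in range(len(pos)):
        d = pos - pos[i]                          # offsets to all other boids
        dist = np.linalg.norm(d, axis=1)
        near = (dist < r) & (dist > 0)            # local neighborhood, excluding self
        if near.any():
            new_vel[i] += w_coh * d[near].mean(axis=0)                          # cohesion
            new_vel[i] -= w_sep * (d[near] / dist[near, None] ** 2).sum(axis=0)  # separation
            new_vel[i] += w_ali * (vel[near].mean(axis=0) - vel[i])              # alignment
    return pos + dt * new_vel, new_vel

rng = np.random.default_rng(1)
pos = rng.uniform(size=(50, 2))
vel = rng.normal(scale=0.1, size=(50, 2))
for _ in range(100):
    pos, vel = boid_step(pos, vel)
```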
I don’t recall if there was ever a difference between “abort” and “fail.” I could choose to abort the operation, or tell it … to fail? That this is a failure?
Makes sense. Maybe I ran across a proper use a time or two back then and just don’t remember. But the two being the same was the overwhelming experience.
This was a final project for a graphics class where we used WebGL a lot. Also, I was just more familiar with OpenGL and hadn't looked that much into WebGPU.
Since it's in Cyrillic, you should perhaps use a translation service. There are some screens showing results, though as I was on a tight deadline, and it's a master's thesis rather than a PhD, I decided not to go into an in-depth evaluation of the proposed methodology against SPIDER (https://yale-lily.github.io/spider). You can still find the simplified GBNF grammar, as well as some of the outputs. Interestingly, the grammar benefits from/exploits a bug in llama.cpp which allows some sort of recursively-chained rules. The bibliography is in English, but really, there is so much written on the topic that it is by no means comprehensive.
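For readers unfamiliar with the format, here is an illustrative GBNF grammar for a tiny SQL subset, in the style llama.cpp accepts (e.g. via --grammar-file). This is not the grammar from the thesis, just a sketch of what constrained SQL decoding looks like; note how recursion via rule references (`cols`, `cond`) keeps it compact.

```python
# Illustrative only: a minimal GBNF grammar for a SQL subset,
# NOT the thesis grammar. Would be saved to a file and passed
# to llama.cpp's grammar option.
SQL_GBNF = r"""
root  ::= "SELECT " cols " FROM " ident (" WHERE " cond)? ";"
cols  ::= ident ("," " "? cols)?
cond  ::= ident " = " value (" AND " cond)?
value ::= [0-9]+ | "'" [a-zA-Z0-9 ]* "'"
ident ::= [a-zA-Z_] [a-zA-Z0-9_]*
"""
```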
Sadly, no open inference engine (at the time of writing) was good enough at both beam search and grammars, so this whole thing perhaps needs to be redone in PyTorch.
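A PyTorch redo would look roughly like the sketch below: plain beam search where each expansion is filtered by a grammar mask. Here `model` is assumed to be an HF-style causal LM returning `.logits`, and `grammar_mask` is a hypothetical hook returning a boolean vector over the vocabulary of tokens the grammar still allows; both are assumptions, not an existing API.

```python
import torch

# Minimal sketch of grammar-constrained beam search in PyTorch.
# `grammar_mask(prefix_ids)` is a hypothetical hook: given the token ids
# so far, it returns a bool tensor of shape (vocab,) of allowed tokens.

@torch.no_grad()
def grammar_beam_search(model, grammar_mask, bos_id, eos_id, width=4, max_len=64):
    beams = [([bos_id], 0.0)]                        # (token ids, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for ids, score in beams:
            if ids[-1] == eos_id:                    # finished beams pass through
                candidates.append((ids, score))
                continue
            logits = model(torch.tensor([ids])).logits[0, -1]
            logits[~grammar_mask(ids)] = float("-inf")   # grammar filter before ranking
            logp = torch.log_softmax(logits, dim=-1)
            top = torch.topk(logp, width)
            for t, lp in zip(top.indices.tolist(), top.values.tolist()):
                candidates.append((ids + [t], score + lp))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:width]
        if all(ids[-1] == eos_id for ids, _ in beams):
            break
    return beams[0][0]
```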
If I find myself in a position to do this for commercial goals, I'd also explore the possibility of having human-curated SQL queries against the particular schema, in order to guide the model better. And then do RAG on the DB for more context. Note: I'm already doing E/R model reduction to the minimal connected graph that includes all entities of particular interest to the present query.
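That reduction step can be phrased as a Steiner-tree problem over the schema graph. A sketch with networkx (table names are made up for illustration):

```python
import networkx as nx
from networkx.algorithms import approximation as approx

# Sketch of the E/R reduction described above: model the schema as a
# graph (tables as nodes, foreign-key relationships as edges) and keep
# only a minimal connected subgraph containing the tables the query
# touches. Table names here are hypothetical.

schema = nx.Graph()
schema.add_edges_from([
    ("users", "orders"), ("orders", "order_items"),
    ("order_items", "products"), ("users", "addresses"),
    ("products", "categories"),
])

tables_of_interest = {"users", "products"}       # entities mentioned in the query
reduced = approx.steiner_tree(schema, tables_of_interest)
print(sorted(reduced.nodes))                     # users..orders..order_items..products
```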
And finally, since you got this far: the real real problem with restricting LLM output with grammars is the tokenization. Parsers work by reading one character at a time, while tokens are very often several characters, so the parser in a way needs to be able to "look ahead", which it normally cannot. I believe OpenAI wrote that they realized this too, but I can't find the article at the moment.
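A toy sketch of why this bites: to decide whether a multi-character token is grammatically allowed, you have to speculatively run each of its characters through a copy of the parser. `CharParser` below is a stand-in for a real incremental grammar parser, and the tiny vocabulary is made up.

```python
import copy

# The tokenization problem in miniature: a token is several characters,
# so checking whether it is allowed means advancing a *copy* of the
# character-level parser through all of them (the "lookahead").

class CharParser:
    """Toy incremental parser accepting only digits (stand-in for a real grammar)."""
    def advance(self, ch: str) -> None:
        if not ch.isdigit():
            raise ValueError(f"character {ch!r} rejected")

def allowed_tokens(parser: CharParser, vocab: dict[int, str]) -> set[int]:
    """Return ids of tokens whose entire character sequence the parser accepts."""
    ok = set()
    for tok_id, text in vocab.items():
        trial = copy.deepcopy(parser)            # speculative copy of parser state
        try:
            for ch in text:
                trial.advance(ch)                # char-by-char lookahead
            ok.add(tok_id)
        except ValueError:
            pass
    return ok

vocab = {0: "12", 1: "3a", 2: "4", 3: "ab"}
print(allowed_tokens(CharParser(), vocab))       # {0, 2}: multi-char tokens vetted char by char
```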
> LLMs that haven't gone through RL are useless to users. They are very unreliable, and will frequently go off the rails spewing garbage, going into repetition loops, etc... RL learning involves training the models on entire responses, not token-by-token loss (1).
Yes. For those who want a visual explanation, I have a video where I walk through this process including what some of the training examples look like: https://www.youtube.com/watch?v=DE6WpzsSvgU&t=320s
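To make the quoted distinction concrete, here is a minimal sketch contrasting per-token cross-entropy with a REINFORCE-style sequence-level loss, where a single scalar reward scores the entire sampled response. Shapes and the reward value are illustrative, and this is only one simple form of RL loss, not any specific lab's recipe.

```python
import torch

# Two training signals: next-token cross-entropy scores each position
# independently; a policy-gradient (REINFORCE-style) loss scores the
# whole sampled response with one scalar reward.

def token_level_loss(logits, targets):
    """Per-token cross-entropy: logits (T, V), targets (T,)."""
    return torch.nn.functional.cross_entropy(logits, targets)

def sequence_level_loss(logits, sampled_ids, reward):
    """REINFORCE on a whole response: one scalar reward for the sequence."""
    logp = torch.log_softmax(logits, dim=-1)
    seq_logp = logp.gather(1, sampled_ids[:, None]).sum()   # log-prob of full response
    return -reward * seq_logp                               # push whole sequence up or down

T, V = 8, 100
logits = torch.randn(T, V, requires_grad=True)
targets = torch.randint(V, (T,))
print(token_level_loss(logits, targets))
print(sequence_level_loss(logits, targets, reward=1.0))
```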
The best part is that you can debug and step through it in the browser dev tools: https://youtube.com/watch?v=cXKJJEzIGy4 (100 second demo). Every single step is in plain vanilla client-side JavaScript (even the matrix multiplications). You don't need Python, etc. Heck, you don't even have to leave your browser.
I recently did an updated version of my talk with it for JavaScript developers here: https://youtube.com/watch?v=siGKUyTk9M0 (52 min). That should give you a basic grounding on what's happening inside a Transformer.
https://www.youtube.com/watch?v=ZuiJjkbX0Og&t=3569s