I'm not a fan of the database lookup analogy either.
The analogy I prefer when teaching attention is celestial mechanics. Tokens are like planets in latent space. The attention mechanism is a kind of "gravity" by which tokens influence one another, pushing and pulling each other around to refine their meaning. But instead of depending on distance and mass, this gravity is proportional to semantic inter-relatedness, and instead of physical space it operates in a latent space.
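To make the analogy concrete, here is a toy sketch of standard single-head dot-product attention read through that lens (all names, dimensions, and random weights are illustrative): one attention step shifts each token's vector toward the others in proportion to their semantic relatedness.

```python
import numpy as np

# Toy illustration of the "gravity" reading of attention: each token's
# vector is pulled toward the others in proportion to how related they
# are, measured by scaled dot products rather than mass and distance.

def attention_step(X, Wq, Wk, Wv):
    """One single-head attention update over token vectors X (n_tokens x d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # pairwise semantic relatedness
    pull = np.exp(scores - scores.max(axis=-1, keepdims=True))
    pull /= pull.sum(axis=-1, keepdims=True)     # softmax: relative "gravitational" weights
    return X + pull @ V                          # residual: tokens shifted in latent space

rng = np.random.default_rng(0)
d = 16
X = rng.normal(size=(5, d))                      # five tokens in a 16-dim latent space
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
X_refined = attention_step(X, Wq, Wk, Wv)        # each token nudged by the others
```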
Basically a boid simulation where a swarm of birds can collectively solve MNIST. The goal is not some new SOTA architecture; it is to find the right trade-off where the system already exhibits complex emergent behavior while the swarming rules remain simple.
It is currently abandoned due to a serious lack of free time (*), but I would consider collaborating with anyone willing to put in some effort.
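For reference, the kind of simple swarming rules this alludes to are the classic boid rules (Reynolds 1987): cohesion, separation, alignment. The sketch below shows only those rules; the learning/readout mechanism that would let such a swarm classify MNIST is the open part of the project and is omitted, and all constants are illustrative, not tuned.

```python
import numpy as np

# Classic boid update: cohesion, separation, alignment over a local
# neighborhood of radius r. Constants are illustrative.

def boid_step(pos, vel, r=1.0, w_coh=0.01, w_sep=0.05, w_ali=0.05, dt=0.1):
    new_vel = vel.copy()
    for i in range(len(pos)):
        d = pos - pos[i]                          # offsets to all other boids
        dist = np.linalg.norm(d, axis=1)
        near = (dist < r) & (dist > 0)            # local neighborhood, excluding self
        if near.any():
            new_vel[i] += w_coh * d[near].mean(axis=0)                          # cohesion
            new_vel[i] -= w_sep * (d[near] / dist[near, None] ** 2).sum(axis=0)  # separation
            new_vel[i] += w_ali * (vel[near].mean(axis=0) - vel[i])              # alignment
    return pos + dt * new_vel, new_vel

rng = np.random.default_rng(1)
pos = rng.uniform(size=(50, 2))
vel = rng.normal(scale=0.1, size=(50, 2))
for _ in range(100):
    pos, vel = boid_step(pos, vel)
```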
I don’t recall if there was ever a difference between “abort” and “fail.” I could choose to abort the operation, or tell it … to fail? That this is a failure?
Makes sense. Maybe I ran across a proper use a time or two back then and just don’t remember. But the two being the same was the overwhelming experience.
This was a final project for a graphics class where we used WebGL a lot. Also, I was just more familiar with OpenGL and hadn't looked that much into WebGPU.
Since it's in Cyrillic, you should perhaps use a translation service. There are some screens showing results, though as I was on a tight deadline, and it's a master's thesis rather than a PhD, I decided not to go into an in-depth evaluation of the proposed methodology against SPIDER (https://yale-lily.github.io/spider). You can still find the simplified GBNF grammar, as well as some of the outputs. Interestingly, the grammar benefits from/exploits a bug in llama.cpp which allows some sort of recursively-chained rules. The bibliography is in English, but really, there is so much written on the topic that it is by no means comprehensive.
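For readers unfamiliar with the format, here is an illustrative GBNF grammar for a tiny SQL subset, in the style llama.cpp accepts (e.g. via --grammar-file). This is not the grammar from the thesis, just a sketch of what constrained SQL decoding looks like; note how recursion via rule references (`cols`, `cond`) keeps it compact.

```python
# Illustrative only: a minimal GBNF grammar for a SQL subset,
# NOT the thesis grammar. Would be saved to a file and passed
# to llama.cpp's grammar option.
SQL_GBNF = r"""
root  ::= "SELECT " cols " FROM " ident (" WHERE " cond)? ";"
cols  ::= ident ("," " "? cols)?
cond  ::= ident " = " value (" AND " cond)?
value ::= [0-9]+ | "'" [a-zA-Z0-9 ]* "'"
ident ::= [a-zA-Z_] [a-zA-Z0-9_]*
"""
```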
Sadly, no open inference engine (at the time of writing) was good enough at both beam search and grammars, so this whole thing perhaps needs to be redone in PyTorch.
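A PyTorch redo would look roughly like the sketch below: plain beam search where each expansion is filtered by a grammar mask. Here `model` is assumed to be an HF-style causal LM returning `.logits`, and `grammar_mask` is a hypothetical hook returning a boolean vector over the vocabulary of tokens the grammar still allows; both are assumptions, not an existing API.

```python
import torch

# Minimal sketch of grammar-constrained beam search in PyTorch.
# `grammar_mask(prefix_ids)` is a hypothetical hook: given the token ids
# so far, it returns a bool tensor of shape (vocab,) of allowed tokens.

@torch.no_grad()
def grammar_beam_search(model, grammar_mask, bos_id, eos_id, width=4, max_len=64):
    beams = [([bos_id], 0.0)]                        # (token ids, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for ids, score in beams:
            if ids[-1] == eos_id:                    # finished beams pass through
                candidates.append((ids, score))
                continue
            logits = model(torch.tensor([ids])).logits[0, -1]
            logits[~grammar_mask(ids)] = float("-inf")   # grammar filter before ranking
            logp = torch.log_softmax(logits, dim=-1)
            top = torch.topk(logp, width)
            for t, lp in zip(top.indices.tolist(), top.values.tolist()):
                candidates.append((ids + [t], score + lp))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:width]
        if all(ids[-1] == eos_id for ids, _ in beams):
            break
    return beams[0][0]
```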
If I find myself in a position to do this for commercial goals, I'd also explore the possibility of having human-curated SQL queries against the particular schema, in order to guide the model better. And then do RAG on the DB for more context. Note: I'm already doing E/R model reduction to the minimal connected graph that includes all entities of particular interest to the present query.
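That reduction step can be phrased as a Steiner-tree problem over the schema graph. A sketch with networkx (table names are made up for illustration):

```python
import networkx as nx
from networkx.algorithms import approximation as approx

# Sketch of the E/R reduction described above: model the schema as a
# graph (tables as nodes, foreign-key relationships as edges) and keep
# only a minimal connected subgraph containing the tables the query
# touches. Table names here are hypothetical.

schema = nx.Graph()
schema.add_edges_from([
    ("users", "orders"), ("orders", "order_items"),
    ("order_items", "products"), ("users", "addresses"),
    ("products", "categories"),
])

tables_of_interest = {"users", "products"}       # entities mentioned in the query
reduced = approx.steiner_tree(schema, tables_of_interest)
print(sorted(reduced.nodes))                     # users..orders..order_items..products
```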
And finally, since you got this far: the real real problem with restricting LLM output with grammars is the tokenization. Parsers work by reading one character at a time, while tokens are very often several characters, so the parser in a way needs to be able to "look ahead", which it normally cannot. I believe OpenAI wrote that they realized this too, but I can't find the article at the moment.
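A toy sketch of why this bites: to decide whether a multi-character token is grammatically allowed, you have to speculatively run each of its characters through a copy of the parser. `CharParser` below is a stand-in for a real incremental grammar parser, and the tiny vocabulary is made up.

```python
import copy

# The tokenization problem in miniature: a token is several characters,
# so checking whether it is allowed means advancing a *copy* of the
# character-level parser through all of them (the "lookahead").

class CharParser:
    """Toy incremental parser accepting only digits (stand-in for a real grammar)."""
    def advance(self, ch: str) -> None:
        if not ch.isdigit():
            raise ValueError(f"character {ch!r} rejected")

def allowed_tokens(parser: CharParser, vocab: dict[int, str]) -> set[int]:
    """Return ids of tokens whose entire character sequence the parser accepts."""
    ok = set()
    for tok_id, text in vocab.items():
        trial = copy.deepcopy(parser)            # speculative copy of parser state
        try:
            for ch in text:
                trial.advance(ch)                # char-by-char lookahead
            ok.add(tok_id)
        except ValueError:
            pass
    return ok

vocab = {0: "12", 1: "3a", 2: "4", 3: "ab"}
print(allowed_tokens(CharParser(), vocab))       # {0, 2}: multi-char tokens vetted char by char
```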
> LLMs that haven't gone through RL are useless to users. They are very unreliable, and will frequently go off the rails spewing garbage, going into repetition loops, etc... RL learning involves training the models on entire responses, not token-by-token loss (1).
Yes. For those who want a visual explanation, I have a video where I walk through this process including what some of the training examples look like: https://www.youtube.com/watch?v=DE6WpzsSvgU&t=320s
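To make the quoted distinction concrete, here is a minimal sketch contrasting per-token cross-entropy with a REINFORCE-style sequence-level loss, where a single scalar reward scores the entire sampled response. Shapes and the reward value are illustrative, and this is only one simple form of RL loss, not any specific lab's recipe.

```python
import torch

# Two training signals: next-token cross-entropy scores each position
# independently; a policy-gradient (REINFORCE-style) loss scores the
# whole sampled response with one scalar reward.

def token_level_loss(logits, targets):
    """Per-token cross-entropy: logits (T, V), targets (T,)."""
    return torch.nn.functional.cross_entropy(logits, targets)

def sequence_level_loss(logits, sampled_ids, reward):
    """REINFORCE on a whole response: one scalar reward for the sequence."""
    logp = torch.log_softmax(logits, dim=-1)
    seq_logp = logp.gather(1, sampled_ids[:, None]).sum()   # log-prob of full response
    return -reward * seq_logp                               # push whole sequence up or down

T, V = 8, 100
logits = torch.randn(T, V, requires_grad=True)
targets = torch.randint(V, (T,))
print(token_level_loss(logits, targets))
print(sequence_level_loss(logits, targets, reward=1.0))
```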
The best part is that you can debug and step through it in the browser dev tools: https://youtube.com/watch?v=cXKJJEzIGy4 (100 second demo). Every single step is in plain vanilla client-side JavaScript (even the matrix multiplications). You don't need Python, etc. Heck, you don't even have to leave your browser.
I recently did an updated version of my talk with it for JavaScript developers here: https://youtube.com/watch?v=siGKUyTk9M0 (52 min). That should give you a basic grounding on what's happening inside a Transformer.
https://www.youtube.com/watch?v=ZuiJjkbX0Og&t=3569s