If true, this is a very nice incremental improvement. It looks like it doesn't meaningfully improve the capabilities of the model, but it's cheaper to compute than RMSNorm (which essentially all current state-of-the-art LLMs use), which means faster/cheaper training.
RMSNorm is pretty insignificant in terms of the overall compute in a transformer, though -- usually the reduction work can be fused with earlier or later operations.
RMSNorm acts like a barrier: no compute in the next network layer can start before all compute in the previous layer is done.
When splitting networks across multiple GPUs, this means you must wait for the slowest node and the longest latency.
As soon as you can remove most of these barriers, compute over non-latency-guaranteed networks becomes more practical, as does non-homogeneous compute (i.e. mixing different GPU models).
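To make the contrast concrete, a minimal sketch in plain PyTorch (function names are mine): RMSNorm needs a reduction across the whole hidden dimension before it can emit a single output element, while DyT is purely elementwise, so nothing has to wait on anything else.

    import torch

    def rmsnorm(x, weight, eps=1e-6):
        # The mean is a reduction over the hidden dim: every element of the
        # row must be ready before any output element can be produced.
        return weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)

    def dyt(x, weight, alpha):
        # Purely elementwise: no cross-element dependency, hence no barrier.
        return weight * torch.tanh(alpha * x)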
Okay, I just tried this on my pet transformer training benchmark and the results are very disappointing; it converges much more slowly than just using RMSNorm.
It either needs some significant hyperparameter tuning (besides tweaking alpha, which doesn't seem to do much for me), or some fancier initialization (I tried both the PyTorch default and orthogonal init; no difference), or maybe my scalar optimizer doesn't work on it (I have a custom optimizer for scalars which speeds up convergence vs. Adam, but for DyT layers it seems to do no better than Adam), or maybe it only catches up after billions of tokens (which I don't have the budget to test for that long).
Slight update: fancier initialization of the DyT weights (instead of initializing them to ones) seems to help a lot in my case (although it's still not as good as just using RMSNorm). Do something like this on the very first training step (`x` is the input to the layer):
    # Compute what RMSNorm would have produced for this input...
    y = x.to(torch.float32)
    y = y * torch.rsqrt(y.pow(2).mean(-1, keepdim=True) + 1e-6)
    # ...and the raw tanh activation that DyT will produce.
    z = torch.tanh(self.alpha * x)
    # Per-channel ratio between the two, averaged over tokens; the epsilon
    # guards against division by (near-)zero activations.
    scale = (y / (z + 1e-6)).mean(dim=-2).flatten()
    self.weight.detach().copy_(scale)
This basically tries to initialize the weights so that the output of DyT is closer to what RMSNorm would have outputted, and it seems to help.
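For context, a minimal sketch of how the whole layer might fit together with that first-step calibration (the module structure and `calibrated` flag are my own framing, not from the DyT paper, whose bias term I've also omitted; the ratio is averaged over all leading dims so batched inputs work too):

    import torch
    import torch.nn as nn

    class DyT(nn.Module):
        # Dynamic Tanh: weight * tanh(alpha * x), a drop-in norm replacement.
        def __init__(self, dim, alpha0=0.5):
            super().__init__()
            self.alpha = nn.Parameter(torch.tensor(alpha0))
            self.weight = nn.Parameter(torch.ones(dim))
            self.calibrated = False  # flipped after the first training step

        def forward(self, x):
            if self.training and not self.calibrated:
                with torch.no_grad():
                    y = x.to(torch.float32)
                    y = y * torch.rsqrt(y.pow(2).mean(-1, keepdim=True) + 1e-6)
                    z = torch.tanh(self.alpha * x)
                    # Flatten all leading dims so (B, T, dim) inputs work.
                    ratio = (y / (z + 1e-6)).reshape(-1, x.shape[-1]).mean(0)
                    self.weight.copy_(ratio.to(self.weight.dtype))
                self.calibrated = True
            return self.weight * torch.tanh(self.alpha * x)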
It's a fully custom architecture, heavily inspired by the modded-nanogpt speedrun (https://github.com/KellerJordan/modded-nanogpt) but written fully from scratch and further tweaked/modified. I use it for experiments and as a testbed when developing my training harness (which I use for training other models too, and which receives all of my non-LLM-specific improvements, e.g. better-than-Adam optimizers, a custom GPU memory allocator, and custom gradient accumulation that accumulates directly into the optimizer's state without using extra VRAM for gradients).
> No such luck on Linux which requires using Windows's braindead Home/End buttons outside of the terminal.
Not really; I have it set up on my box so that I can press Alt + U as a shortcut for Home and Alt + O as a shortcut for End (and many other such shortcuts; it's fully customizable), and this works system-wide in every application, and even on the raw Linux console without X11/Wayland running.
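One way to get this kind of setup (assuming a tool like keyd, which remaps at the evdev level and therefore works on the raw console as well as under X11/Wayland) is a config along these lines, e.g. in /etc/keyd/default.conf:

    [ids]
    *

    [alt]
    u = home
    o = end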
> Because of (2) and his fragile ego (part of the reason he was widely considered a loser in the pre-Musk dominant circles) he went hard to the right.
This is such a weird take. While I don't disagree about Musk's ego, it should be quite obvious that there's something else at play here, considering that an unhinged convicted criminal won the popular vote and became president. I've personally seen multiple people go through the same shift to the right as Musk. Are they all losers? Or maybe, just maybe, some of the more insane policies of the current American left have pushed them there?
Sure, there's definitely a chance that some people reacted to the pendulum swing to the left, highlighted by (among many other things) Obama's election and marriage equality being passed.
But I think that more people were affected by these things:
- The continued lack of funds allocated to education, resulting in a terrible lack of critical thinking
- The constant bombardment of social media
- The influence of Russia et al. on social media
- The gaming of recommendation algorithms by the far right, resulting in a pipeline to hatred
- The crisis of masculinity, where today's men don't feel they fill the same roles as their grandfathers, leading some to fall down the pipeline to hate
- The gerrymandering of districts
- Electoral rolls being purged of historically left-leaning voters
- US Supreme Court rulings like Citizens United
- Congress being so utterly out of touch with average Americans
- And Congress's age issue
Of course, there are many more points that people could argue about all day, and I don't think we're going to find the real reasons here on HN. Maybe in 100 years' time, if there's anyone left, historians will be able to find the root issues.
> highlighted by (among many other things) Obama's election and marriage equality being passed
Genuinely, I can't imagine two things that have less to do with more people aligning with the right in America than these two. It looks as though the Democratic Party has gone further and further to the hard left, and more and more people in the US have felt they have no option but to vote for change over the status quo, mostly because of economics.
Sorry, maybe that wasn't clear: I think Musk swung right because he wanted to be seen as Iron Man and instead was mocked as a loser.
Twitter can be merciless. Musk wants so badly to be liked, didn't get it (in part because the wanting was so obvious), and went full Ben Shapiro aggrieved middle schooler.
Why is the left always blamed for what the hard right and right do? These people were right-wing and fascist-leaning; the moderate right just didn't like it when someone said that out loud. The moderate right would always come to their defense, frequently ignoring what those people do and say.
The left's complaints and comments about Musk, Republicans, and conservatives turned out to be entirely right. They were called paranoid and unfair.
Perhaps the left reacted to what the right does and plans to do, and was entirely correct. Perhaps what happens now is the fault of the moderate right and center, who fed these people, celebrated these people, voted for these people, and defended these people.
> Why is the left always blamed for what the hard right and right do?
Well, there is one thing the left should definitely be blamed for: alienating their own electorate. It's always "it's not our fault, it's their fault", without a shred of self-reflection. Why did previously left-leaning and left-voting people suddenly switch to vote right? And it's not because they're "hard-right", "fascists", "idiots", "racists" or "nazis" (if they were they wouldn't have voted left in the first place).
Unfortunately, from what I've seen, instead of taking a step back and reevaluating their approach, the left is just doubling down on what they've been doing, sticking fingers in their ears and indiscriminately calling anyone who disagrees with them far-right fascists. It's astonishing how far the Overton window has shifted. I just hope people come to their senses, or else we'll end up with another clown in the White House once the current one vacates it in four years.
The whole anti-jailbreaking research seems like a total waste of time.
You can never guarantee that a jailbreak won't be possible, so you should never deploy an LLM anywhere a jailbreak would be disastrous anyway; the only thing this achieves is pointless censorship (often very frustrating to users, especially those who make an effort to get around it).
It boggles my mind that major LLM providers refuse to offer an "I'm an adult, I know what I'm doing" mode without the censorship and all of the "safety" bullshit.
Not necessarily; on AMD64 you can do memory accesses in a single instruction relatively easily by using the CPU's paging machinery for safety checks, plus some clever use of address space.
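A minimal sketch of the address-space trick (Linux-specific constants; the window size and backed region are just illustrative): reserve a full 4 GiB of PROT_NONE space, map in only the valid pages, and let the MMU do the bounds check.

    import ctypes, ctypes.util

    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    libc.mmap.restype = ctypes.c_void_p
    libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                          ctypes.c_int, ctypes.c_int, ctypes.c_long]
    libc.mprotect.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int]

    PROT_NONE, PROT_READ, PROT_WRITE = 0x0, 0x1, 0x2
    MAP_PRIVATE, MAP_ANON, MAP_NORESERVE = 0x02, 0x20, 0x4000  # Linux values

    # Reserve a full 4 GiB window of inaccessible address space. A JIT'd
    # guest load can then be one instruction like `mov eax, [rbase + r32]`:
    # the 32-bit index is zero-extended, so it physically can't escape the
    # window, and touching an unbacked page traps via the page-fault handler.
    base = libc.mmap(None, 1 << 32, PROT_NONE,
                     MAP_PRIVATE | MAP_ANON | MAP_NORESERVE, -1, 0)
    assert base != ctypes.c_void_p(-1).value, "mmap failed"

    # Back only the pages the guest is actually allowed to touch.
    libc.mprotect(ctypes.c_void_p(base), 64 * 1024, PROT_READ | PROT_WRITE)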
> branches could mostly be direct _unless_ the runtime has any kind of metering (it should) to stop eternal loops
Even with metering the branches would be direct; you'd just insert the metering code at the start of each basic block (so two extra instructions per block, e.g. a decrement and a conditional branch to the trap handler). Or did you mean something else?
Can't remember exactly which one, but I remember reading an article about some VM that added interruption checks not at block boundaries but only at _backward_ branches and call sites, so "safe" forward jumps (if/else/break) wouldn't cost anything extra, while anything that could run forever had the checks.
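A toy interpreter-level sketch of that scheme (the opcode set is made up; in a JIT the same check would just be a couple of instructions inlined on each backward edge and at call sites). The insight is that any non-terminating execution must eventually take a backward jump or a call, so those are the only places that need metering:

    def run(code, fuel=100_000):
        # code: list of (op, arg) tuples for a toy accumulator machine
        acc, pc = 0, 0
        while pc < len(code):
            op, arg = code[pc]
            if op == "add":                 # plain ops run unmetered
                acc += arg
                pc += 1
            elif op == "jnz":               # jump to absolute target if acc != 0
                target = arg if acc != 0 else pc + 1
                if target <= pc:            # backward edge: the only metered spot
                    fuel -= 1
                    if fuel <= 0:
                        raise TimeoutError("out of fuel")
                pc = target
            else:
                raise ValueError(f"unknown op {op!r}")
        return acc

    # Counts down from 10 via a backward loop; burns 10 units of fuel.
    print(run([("add", 10), ("add", -1), ("jnz", 1)]))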
Reserving 4 GB of address space ought to work on any 64-bit machine with a decent OS/paging system, though? I was looking into it but couldn't use it in my case, since it needs to cooperate with another VM that already hooks the page-fault handler (although maybe I should take another stab at it if there's a way to put them in a hierarchy).
I clicked expecting a single full multimodal LLM made by merging multiple existing models into one, like the title suggests (which sounds very interesting), and instead I found... a library which is an LLM router, calling a bunch of LLM web APIs and exposing them under a unified, easy-to-use interface?
With all due respect, sorry, but this title is very misleading. I'd expect "build an LLM" to mean, well, actually building an LLM, and while it's a very nice library it's definitely not what the title suggests.
You know, the word "multimodal" is, I think, being used badly here. It's multi-model, not multimodal, which is certainly a completely different thing.
It's a framework that uses the best part of each LLM, e.g. multimodal support from Gemini, tool calling from GPT-4o, and reasoning from o3-mini, by chaining them dynamically. From a user perspective there is no model selection or routing; just write the prompt or upload a file and it works, so it feels like you're working with a single LLM, but under the hood it does all this work to get you the best output :) Sorry if you felt it was misleading, but I hope you give it a shot!
The problem with that phrasing is that there is actual model merging, where you merge the weights. So people reading the title might (and apparently do) expect that, less so an LLM router.
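To illustrate the distinction: actual model merging operates on the weights themselves. A minimal sketch (hypothetical checkpoint names; real merge methods like SLERP or task arithmetic are more sophisticated than this naive interpolation):

    import torch

    # Two same-architecture checkpoints with identical state_dict keys/shapes.
    a = torch.load("model_a.pt")
    b = torch.load("model_b.pt")

    # Linear interpolation of the weights: the result is one new model
    # with no routing involved at inference time.
    merged = {k: 0.5 * a[k] + 0.5 * b[k] for k in a}
    torch.save(merged, "merged.pt")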
Makes sense, but the problem is that you're using words that already have specific meanings in this space, all related to creating one model with multiple capabilities: "merging" means merging models into one model, and "multimodal" means one LLM that handles multiple modalities. The term you want is probably agent, framework, chain, or something similar. Basically, what you describe is making it feel like you're working with only one model; what your title says is that you actually engineered a single model, which is a distinct technical challenge.
I 100% agree. This simulates multimodal input and automatically handles the rest, along with model selection, using a variety of techniques. It doesn't do this natively at the model level.
You're still not getting it. The use of the word "multimodal" does nothing good for your software. It is an LLM router. I get that your software supports some multimodal LLMs, but that is incidental.
Secondly, the use of the word "merging" is also grossly misleading. You are not merging LLMs, only routing requests.
> using OpenAI outputs violating their ToS is considered cheating
I fail to see how that is any different from any other training data scraped from the web. If someone shares a big dump of outputs from OpenAI models and I train my model on that, then I'm not violating OpenAI's terms of service because I haven't agreed to them (so I'm not violating contract law), and everyone in the space (including OpenAI themselves) has already collectively decided that training on All Rights Reserved data is fair use (so I'm not violating copyright law either).
Hardware first, but then their hardware isn't any better than NVidia's, so I don't see how that's a valid excuse here.
(Okay, maybe their super high end unobtainium-level GPUs are better hardware-wise. Don't know, don't care about enterprise-only hardware that is unbuyable by mere mortals.)
It's just not. People like to try and defend AMD out of hatred for Nvidia, but the thousands of fumbles over the past 15 years that have led AMD to its current position and Nvidia to its current dominance are not deserving of coddling and excuses.
The fact is, support still isn't there: they've had two years since Stable Diffusion to get a serious team up and shipping, and they still don't have enough resources pointed at this to avoid having to ask what should be prioritized.
The only way to fix their culture/priorities is to stop buying their cards.
The point is they shouldn't have done it in the first place. It was obvious right from the start that it was a bad idea, except maybe for temporarily boosting short-term profits.
The whole AMD AI/ML strategy feels like this: prioritize short-term profits and completely shoot themselves in the foot in the long term.
ROCm was clearly designed with Wave64 in mind. It was going to take years for ROCm to be reworked for RDNA's Wave32.
DirectX shaders, however, were already ready for Wave32 and the other architectural changes RDNA brought. In fact, RDNA was basically AMD making its architecture more "NVidia-like" in many regards (32-wide execution being the most noticeable).
CDNA existed because HPC has billion-dollar-plus contracts with code written for Wave64 that still needed ROCm support. That meant staying on the older GCN-like architecture and continuing to support, say, DPP instructions and other obscure GCN features.
---------
Remember how long it took for RDNA to get ROCm support? Did you want to screw the HPC customers for that whole time?
Splitting the two architectures, focusing ROCm on HPC (where the GPU-compute research money was in 2018) and focusing RDNA on better video game performance (where the consumer-card money is), just makes sense.