Seems like that would be fine for places with < 0.25 GWt energy needs that also need a safe power source: remote installations/towns, Antarctic research stations, etc.
The maximum temperature you can get out of a nuclear reactor is far below what you need for steel production, because otherwise you'd melt the steel the reactor itself is built from.
Waste heat can be useful for district heating systems because houses don’t need to reach high temperatures, but few designs give you access to even 300C and nothing currently hits 1,000+C.
SEEKING WORK - Data scientist, remote worldwide, email in profile.
A colleague and I lost out on a large optimization project (network scheduling during cascading failures) with an airline and I'm now looking for work. My ideal is a small/medium company where I can do no BS data science - ML, AI, and OR are my jam.
I love avant-garde problems, and have a pretty unique toolset (LLMs+KB, OR+ML, etc.). I've worked with names you know in tech, automotive, and other spaces.
I've worked on projects including large language models, knowledge extraction, and the usual ML stuff - failure prediction, translation, maintenance optimization, automated sports highlights, and more. I've also done some more interesting work including vehicle route planning in constrained environments.
The perfect fit is a small/medium-sized company with a hair-on-fire greenfield or brownfield problem that needs solving.
I'm available immediately, and champing at the bit. Even if you're unsure, get in touch. Email in profile.
that silly softmax1 blog post is not worth the read. no one uses it in practice
if you think about it, the "escape hatch" is the design of the entire transformer dictionary. if Key/Query attention misaligns with Value's weights, you get a head in that layer that does not attend to anything...
Yep. From what I've seen, if the head wants to do nothing, it can attend to itself = no inter-token communication.
Still, differential attention is pretty interesting and the benchmarking looks good, so it seems worth a try! It's in the same vein as linear or non-softmax attention, which can also work.
Note that there is an error below Eq. 1: W^V should have shape [d_model x d_model], not [d_model x 2*d_model] like the Q and K matrices.
Idea: why not replace the lambda parameterization between softmax operations with something more general, like a matrix or MLP? E.g.: attention is the affine combination of N softmax attention operations (say, across heads). If the transformer learns an identity matrix here, then you know the original formulation was correct for the data; if it's sparse, these guys were right; if it's something else entirely, then who knows...
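Roughly what I have in mind, as an untested sketch (the module and the `mix` parameter are my own names, not anything from the paper):

```python
# Sketch: N softmax attention maps combined by a learnable N x N mixing
# matrix initialized to the identity. If training leaves `mix` near the
# identity, vanilla multi-head attention was already the right prior for the
# data; if it goes sparse with paired +1 / -lambda entries, that's roughly
# differential attention; anything else is new information.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedSoftmaxAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)
        self.mix = nn.Parameter(torch.eye(n_heads))  # learnable head mixing

    def forward(self, x):  # x: [batch, seq, d_model]
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.h, self.d).transpose(1, 2) for z in (q, k, v))
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d ** 0.5, dim=-1)  # [b, h, t, t]
        attn = torch.einsum('ij,bjst->bist', self.mix, attn)  # combine maps across heads
        y = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out(y)
```

Swapping the matrix for a small MLP over the stacked maps would be the more general version, but an identity-initialized matrix is the cheapest way to ask the question.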
I've tried that in a small transformer that I trained from scratch and it didn't really make any difference. I also made a version where I made this trainable somehow, probably by replacing the 1 with a constant associated with the layer, and that didn't make any difference either.
I didn't follow Miller's proposal exactly as he wrote it, though: I put the mechanism in all the layers rather than avoiding it at the end.
My test doesn't absolutely rule out usefulness (there are always different ways of applying something), but I saw no indication of it.
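For concreteness, this is roughly the kind of thing I tried (reconstructed from memory, so treat it as a sketch of Miller's softmax_1 rather than my exact code):

```python
import torch

def softmax_1(scores: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Miller's softmax_1: an extra '+1' in the denominator, so a head can put
    near-zero total weight on the sequence instead of being forced to sum to 1."""
    # Shift by the max for numerical stability; the implicit extra logit is 0,
    # so after the shift it contributes exp(-m) to the denominator.
    m = scores.max(dim=dim, keepdim=True).values.clamp(min=0.0)
    e = torch.exp(scores - m)
    return e / (e.sum(dim=dim, keepdim=True) + torch.exp(-m))

# The "trainable" variant just replaced the implicit 1 with a learnable
# per-layer scalar c_l, i.e. `... + c_l * torch.exp(-m)`. Neither version
# moved the loss for me.
```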
You referring to Miller's blogpost?[0] There's not an error in attention. Adding the +1 actually makes it not attention, because you no longer generate a probability distribution[1]. There's nothing really preventing attention from having a zero in any of the entries; the thing is that you probably won't get -inf (a very large negative number) inside the inner product, and you're going to have a difficult time updating those weights via gradient descent.
I've also tested it on many networks and different types of attention and I've yet to see a meaningful improvement (or even an improvement), even in generalization.
It really is the training method...
As to the paper, I'm also still at a bit of a loss and, honestly, if I were reviewing it I could not accept it. The results look good, but I can't tell why, and there's some "black magic" going on here.
- Figure 3 has "Transformer" and doesn't specify. Is this StableLM-3B-4E1T?
- What fucking dataset is this on? Stable has a WandB link[2] for that project and I don't see any experiment with similar (presumably entropy?) loss values (come on... this is fucking research... label your fucking graphs...)
- Where the fuck is the ablation? (Yes, I saw Fig 6 and Sec 3.8)
- How do I know (assuming this is Stable) that the difference isn't just hyperparameters? Or worse, GPUs! (Yes, the number of GPUs can change results because sharding changes the statistics.)
- How do I know it isn't down to 1k warmup steps instead of 5k?
- What about hidden size, layers, heads, or FFN size? Stable has 32/2560/32/? and this has 28/3072/12/8192 (these will all mess with sharding statistics too). Is the head dimension the same?
- How do I know it isn't down to the tokenizer?
- What is this magic? `0.8 - 0.6 * math.exp(-0.3 * depth)`
- Was this learned? Hand-picked? This is a huge factor (quick check after this list)
- Any information about the learned parameters? Their final values? Trajectories?
- The code does not seem to be the same as what's in the algorithms...
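For what it's worth, here's a quick check of what that `0.8 - 0.6 * math.exp(-0.3 * depth)` expression actually produces across layers (assuming `depth` is just the zero-based layer index, which is how I read their code):

```python
import math

for depth in range(28):  # 28 layers in their config
    lam_init = 0.8 - 0.6 * math.exp(-0.3 * depth)
    print(depth, round(lam_init, 3))
# depth 0 -> 0.2, depth 5 -> ~0.666, depth 27 -> ~0.8
```

So it roughly quadruples from the first layer to the last, which is exactly why I'd want it learned, ablated, or at least justified.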
Obviously they improved something, but there is nothing in the paper convincing me that it is the differential attention. There are too many parameters at play, so how am I supposed to know that the difference is due to the thing they are proposing? And more importantly, how much of the improvement comes from that specific thing and not from other things?
[0] https://www.evanmiller.org/attention-is-off-by-one.html
[1] This is a bit convoluted, but without this condition many "alternative forms" you see would be equivalent to other architectures like linear layers or gated units. The term is not well defined, but this really appears to be the only agreed-upon aspect, even if it's only implicitly stated. This is a much longer conversation though.
[2] https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo
[2.1] The config: https://github.com/Stability-AI/StableLM/blob/main/configs/stablelm-3b-4e1t.yml
I feel like that blogpost was almost just ragebait for AI researchers. It goes back and forth between calling the missing +1 an error (which to me implies it would improve training losses, which it doesn't really: https://news.ycombinator.com/item?id=36854613) and saying it could possibly help with some types of quantization (which could very well be true, but is a much weaker claim), and the author provides basically no evidence for either.
It's the stereotypical computer scientist who thinks they know something others don't and doesn't feel the need to prove their claim, especially when it disagrees with experts. And unsurprisingly, it's something others have already investigated and even written about. Definitely not all CS people, but it is a stereotype many other fields believe.
I know he's an economist, btw. I was also surprised he got a job at Anthropic a few months later. I wonder if the two are related.
Haven't gone through the paper fully, but just looking at the functional form of their attention, it seems more like a constraint on a standard MHA than an architectural discovery.
Take a vanilla MHA, tie the V projection between consecutive heads, make the output projection subtract consecutive heads, with some fixed prefactor and voila, you're most if not all of the way there.
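A hedged sketch of that reduction (shapes and names are mine; `lam` stands in for their learned scalar, and I'm ignoring their per-head normalization and the rest of the block):

```python
import torch.nn.functional as F

def diff_attention_pair(q1, k1, q2, k2, v, lam):
    """One 'differential' head built from two vanilla softmax heads.

    q*, k*: [batch, seq, d_head]; v: [batch, seq, d_head]; lam: scalar.
    Equivalent to tying V between the two heads and having the output
    projection subtract the second head from the first, scaled by lam.
    """
    scale = q1.shape[-1] ** -0.5
    a1 = F.softmax(q1 @ k1.transpose(-2, -1) * scale, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-2, -1) * scale, dim=-1)
    return (a1 - lam * a2) @ v
```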
Or a state-sponsored bounty. Though you don't want to encourage breeding them. Maybe reward all county taxpayers with a three-month-long tax holiday if they can be rid of nutria for a year.
Technologies/techniques: Machine learning/AI, LLMs/RAG/KBs, transformers, GenAI, MCMC, MCTS, Bayes nets, neural architecture search, Python, PyTorch/TensorFlow/Keras, all the tools surrounding ML and AI, and a whole host of others.
I'm a data scientist who's worked with companies you know. I'm looking for the next project.
I just lost a large project with an airline to a large consulting shop, so I'm available right away.
The project was network scheduling during cascading failures in airline routes. A pilot calls in sick, or a plane has an unrepairable fault. How does the airline reschedule crews/planes to keep the maximum number of flights/passengers satisfied, taking cascade effects into account? It involves some deep ML and some really tricky optimization methods.
I've worked on other projects including large language models, knowledge extraction, and the usual ML stuff - failure prediction, translation, maintenance optimization, automated sports highlights, and more. I've also done some more interesting work including vehicle route planning in constrained environments.
The perfect fit is a small/medium-sized company with a hair-on-fire greenfield or brownfield problem that needs solving.
I'm available immediately, and champing at the bit before boredom sets in. Email in profile.