adhd'er here too. maybe the practice is good, but it takes a lot of energy, which is finite. i find that leaning on my strengths gets me far, far better results than trying to get up to par with everyone else on things im bad at. if a tool just lets you get started, and you can breeze through getting started on things that you might otherwise just never even start, it seems like using the tool is the way to go.
ive been fighting the way my brain works my whole life, and only recently have i switched to trying to work with the way it wants to work. i get so many more things done that are important to me, and i get them done without the implicit "i need to flagellate myself with this thing i hate because there is something wrong with me" that comes with those fights.
and yeah, the ais come with their own problems. but the trade is so overwhelmingly worth it. even just the decent-rubber-duck aspect of them can keep me on a task when i would never otherwise hope to see it through.
this is neat but to me seems like the circuitous path to just skipping autoregression, whereas the direct path is to just not do autoregression. get your answers from the one forward pass, and instead of backprop just do lookups and updates as the same operation.
dude thats sick! i tried it out and it works. theres a couple layers in that voidy block that dont do much for the selected answer, so i narrowed it down to L48-53, where this model is mapping out its reasoning strategy. repeating that block twice got me a big improvement over the original config (i chose some questions from atropos and claude code made some up, so idk, not like a real dataset).
so thats about 15% more compute per forward pass with 0 extra memory, which is just nuts. for a streaming or disk-based setup its just free better answers. def wasnt gonna think of this myself.
looks like the model gets a second/third go at figuring out how to approach the problem and it gets better answers.
i tried a matrix of other configurations and stuff gets totally weird. playing the layers through backwards in that block doesnt make much of a difference / order doesnt seem to matter (?!). doubling each layer got a benefit, but doubling the layers and doubling that block together caused interference. doubling the block where the model is architecting/crystallizing its plans improves reasoning, but at the cost of other stuff. other mixes of blocks showed some improvements for certain kinds of prompts but didnt stand out as much.
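fwiw, heres a toy sketch of what the repeat-a-block trick looks like mechanically. no real model here, just plain python; the 60-layer count, the 48-53 block, and the repeat count are made-up stand-ins for whatever your model has. the point is that repeating a block by reference reuses the same weight objects, so you pay extra compute but zero extra memory:

```python
# stand-in for a transformer block: holds "weights", transforms the input
class Layer:
    def __init__(self, w):
        self.w = w
    def __call__(self, x):
        return x + self.w

# pretend 60-layer model (numbers are arbitrary)
layers = [Layer(i * 0.01) for i in range(60)]

# run layers 0-53, then replay the 48-53 "reasoning" block twice more,
# then finish -- the repeats are the SAME objects, so no extra weights
schedule = layers[:54] + layers[48:54] + layers[48:54] + layers[54:]

def forward(x, sched):
    for layer in sched:
        x = layer(x)
    return x

print(len(schedule))                      # 72 layer calls instead of 60
print(len({id(l) for l in schedule}))     # but still only 60 unique layers
```

here the replay adds 12 calls on a 60-layer pass (20% more compute); the exact overhead just depends on block size vs model depth.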
im kind of wondering like what the ceiling would be on reasoning for something like the 1.5T models with the repeating technique, but they would take a long time to download. i think if you have them already it would take maybe an hour or so to check against a swath of prompts. whats the reasoningest open model at the moment?
my guess is that for large models trained on large corpuses, there is just some ceiling of "reasoning you can do" given the internal geometry implied by the training data, cause text is lossy and low-bandwidth anyway, and theres only really so much of it. past some point you just have to have models learning from real-world interactions, and my guess is we're already kind of there.
I have Deepseek etc, but inferencing on DDR5 would take about 2-3 weeks for a simple scan. I think this works best with dense models, but it also seems ok with MoE.
@everyone: Can someone hook me up with Nvidia sponsorship?
oh neat ill check that one out. i dont get that much speedup from vram vs ssd/128gb unified if im doing a predefined set of prompts, since i load the model from disk anyway, do one forward pass per prompt, and only load part of it at a time. its a bit slower if im doing cpu inferencing, but ive only had to do that with one model so far.
but yeah on demand would be a lot of ssd churn so id just do it for testing or getting some hidden state vectors.
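the load-part-of-it-at-a-time idea in miniature, if anyones curious (toy sketch, stdlib only; the per-layer file layout and "weights" are invented): each layer's weights get pulled off disk right before use and dropped right after, so peak memory is one layer, at the cost of disk reads per forward pass.

```python
import json, os, tempfile

# fake a model sharded one file per layer ("weights" are just a bias here)
tmp = tempfile.mkdtemp()
for i in range(4):
    with open(os.path.join(tmp, f"layer_{i}.json"), "w") as f:
        json.dump({"bias": i + 1}, f)

def forward_streaming(x, n_layers, path):
    for i in range(n_layers):
        with open(os.path.join(path, f"layer_{i}.json")) as f:
            w = json.load(f)       # pull this layer off disk
        x = x + w["bias"]          # stand-in for the layer's actual math
        # w is rebound next iteration, so only one layer is ever resident
    return x

print(forward_streaming(0, 4, tmp))   # 0 + 1 + 2 + 3 + 4 = 10
```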
think of the latent space inside the model like a topological map, and when you give it a prompt, you're dropping a ball at a certain point above the ground, and gravity pulls it along the surface until it settles.
caveat though: thats nice per-token, but the signal gets distorted by sampling a token from a distribution, so with each token you're regenerating and re-distorting the signal. leaning on language that places the ball deep in a region you want to be in makes it less likely that those distortions will kick it out of the basin or valley you want to end up in.
if the response you get is 1000 tokens long, the initial trajectory needed to survive 1000 probabilistic filters to get there.
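the 1000-filters point in numbers (per-token probabilities are invented, and treating the filters as independent is a simplification, this is just the shape of it): if each sampled token stays "in basin" with probability p, the whole response survives with p**n, which collapses fast as n grows.

```python
# chance a 1000-token trajectory survives if every sampled token
# independently stays on track with probability p (toy numbers)
n = 1000
for p in (0.999, 0.995, 0.99):
    print(p, p ** n)
# even 99.9% per token only survives about 37% of the time at n=1000,
# and 99% per token is down around 4e-5 -- hence starting deep in the basin
```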
or maybe none of that is right lol but thinking that it is has worked for me, which has been good enough
Hah! Reading this, my mind inverted it a bit, and I realized ... it's like the claw machine theory of gradient descent. Do you drop the claw into the deepest part of the pile, or where there's the thinnest layer, the best chance of grabbing something specific? Everyone in every bar has a theory about claw machines. But the really funny thing that unites LLMs with claw machines is that the biggest question is always whether they dropped the ball on purpose.
The claw machine is also a sort-of-lie, of course. Its main appeal is that it offers the illusion of control. As a former designer and coder of online slot machines, I could totally spin off into pages on this analogy, about how that illusion gets you to keep pulling the lever... but the geographic rendition you gave is sort of priceless when you start making the comparison.
My mental model for them is plinko boards. Your prompt changes the spacing between the nails to increase the probability in certain directions as your chip falls down.
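That metaphor is easy to simulate, for fun (toy numbers; the peg bias standing in for "the prompt" is invented): biasing each peg shifts where the whole distribution of chips lands.

```python
import random

def plinko(rows, p_right, rng):
    # each row's peg sends the chip right with probability p_right;
    # the final slot is how many rights it took (a binomial draw)
    return sum(rng.random() < p_right for _ in range(rows))

rng = random.Random(0)
neutral = [plinko(10, 0.5, rng) for _ in range(10_000)]  # unbiased board
biased  = [plinko(10, 0.8, rng) for _ in range(10_000)]  # "prompted" board
print(sum(neutral) / len(neutral))   # lands near slot 5 on average
print(sum(biased) / len(biased))     # shifted toward slot 8
```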
i literally suggested this metaphor earlier yesterday to someone trying to get agents to do what they wanted: set up your guardrails in a way that lets the agents do what they're good at, and you'll get better results because you're not sitting there watching them.
i think probably once you start seeing that the behavior falls right out of the geometry, you just start looking at stuff like that. still funny though.
Probably could have better described it as the distance between the pegs being the model weights, and the prompt defining the shape of the coin you're dropping down.
I was half asleep when I wrote it the first time and knew it wasn’t what I wanted to say but couldn’t remember what analogy I was looking for.
I can sort of see one angle for it, and the parent story kind of supports it. Bad software is a forcing function for good hardware: the worse software has gotten in the past few decades, the better hardware has had to get to support it. So if you actually try, like OP did, you can do some pretty crazy things on tiny hardware these days. Imagine what we could do on computers if they weren't so bottlenecked doing things they don't need to do.
if im not sitting on my right foot with left knee under my chin my thinking takes a hit, but i also have to constantly switch how im sitting so i dont get annoyed. its hard not to slouch/melt into whatever im sitting on and i think the only way to offset all that is the gym.
> GraphQL isn’t bad. It’s just niche. And you probably don’t need it.
> Especially if your architecture already solved the problem it was designed for.
What I need is to not want to fall over dead. REST makes me want to fall over dead.
> error handling is harder than it needs to be
GraphQL error responses are… weird.
> Simple errors are easier to reason about than elegant ones.
Is this a common sentiment? Looking at a garbled mash of linux or whatever tells me a lot more than "500 sorry"
I'm only trying out GraphQL for the first time right now cause I'm new to frontend stuff, but from life on the backend, having a whole class of problems compiled away, where the server and client agree on what to ask for and what you'll get back, is so nice. I don't actually know if there's something better than GraphQL for that, but I wish people writing blogs like this would fill them with more "try these things instead for that problem" rather than simply "this thing isn't as good as you think it is, you probably don't need it".
If isomorphic TS is your cup of tea, tRPC is a nicer version of client server contracting than graphql in my opinion. Both serve that problem quite well though.
I do like the look of this! It seems to nicely provide that without kicking you into React, which I've ended up having to draw a hard line against in development after my first couple experiences, not only with React itself, but with how the distributions in AI models make it a real trap to touch. I'll swap this into one of my projects and give it a go. Thanks!
The execution didn't finish; it started. Big policy changes typically take time to solidify, and it'll probably take a while to get a reliable read on this one's trajectory. But there is international momentum behind it, so making predictions based on whatever percentage of the people who were supposed to have their accounts deactivated actually did so on day one (if we even have that data, and I doubt we do) is probably not going to be useful.