It seems the article is now behind a paywall (I can only get a reddit link w/ a screenshot atm [0]), but there was a study that showed body composition after 25% weight loss, in terms of fat mass (FM) and fat-free mass (FFM) (and the portion of FFM that was skeletal muscle mass, SMM), for several traditional interventions, and also compared the breakdown of the weight lost in terms of FM and FFM for several GLP-1 medications:
- diet alone
- diet + extra protein
- diet + exercise
- retatrutide
- tirzepatide
- semaglutide
tldr is that, despite some muscle loss, muscle as a percentage of body composition ends up higher (FFM is ~50% of body weight at the start, whereas the lean portion of the weight lost with the GLP-1 meds ranged from about 25% to 39%). It also seems like the remaining muscle will likely function better, with less insulin resistance:
> Intentional weight loss causes a greater relative decrease in body fat than FFM or SMM, so the ratio of FFM/SMM to fat mass increases. Accordingly, physical function and mobility improve after weight loss despite the decrease in FFM/SMM, even in older adults with decreased FFM and SMM at baseline. In addition, weight loss improves the “quality” of remaining muscle by decreasing intramyocellular and intermuscular triglycerides and increasing muscle insulin sensitivity
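(To make the percentages concrete with made-up round numbers, not figures from the study: someone starting at 100 kg with 50% FFM carries 50 kg of lean mass. If they lose 25 kg and 30% of that loss is lean, they drop 7.5 kg FFM and 17.5 kg fat, ending up with 42.5 kg FFM out of 75 kg total, i.e. ~57% FFM, so the lean fraction rises even though absolute lean mass falls.)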
The bit in Permutation City about siphoning compute from algorithms that only needed the resulting angles, by exploiting the magnitudes of the vector computations as a kind of scratch space… wonder if you could modify the DoRA parameter-efficient finetuning algorithm to do something like that lol, since it also splits the new weights into angular and magnitude components…
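For anyone curious what that split looks like, here's a minimal numpy sketch of a DoRA-style magnitude/direction decomposition (my own toy code and variable names, not the reference implementation):

    import numpy as np

    rng = np.random.default_rng(0)
    d_out, d_in, r = 8, 16, 2

    W0 = rng.normal(size=(d_out, d_in))            # frozen pretrained weight
    B = rng.normal(size=(d_out, r)) * 0.01         # low-rank update factors (trainable)
    A = rng.normal(size=(r, d_in)) * 0.01
    m = np.linalg.norm(W0, axis=0, keepdims=True)  # per-column magnitude (trainable)

    V = W0 + B @ A                                             # updated weight before the split
    direction = V / np.linalg.norm(V, axis=0, keepdims=True)   # unit-norm columns: the "angles"
    W_adapted = m * direction                                  # magnitudes re-applied separately

    # The column norms of V are thrown away by the normalization, which is the
    # part that reads like Permutation City's unused "scratch space".
    print(W_adapted.shape)  # (8, 16)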
I was literally thinking of the same similarities. Barbour's exposition of the principle of least action as being time itself is interesting. There's a section in The Janus Point where he goes into detail about the fact that there are parts of the cosmos that (due to cosmic inflation) are farther apart in terms of light-years than the universe is old, and growing in separation faster than c, meaning that they are forever causally separated. There will never be future changes in state in one that result in effects in the other. In a way, this also relates to computation, maybe akin to some kind of undecidability.
Another thing that came to mind when reading the part about how "black holes have too high a density of events inside of them to do any more computation" is Chaitin's incompleteness theorem: if I understand it correctly, that basically says that for any formal axiomatic system there is a constant c beyond which it's impossible to prove in the formal system that the Kolmogorov complexity of a string is greater than c. I get the same kind of vibe with that and the thought of the ruliad not being able to progressively simulate further states in a black hole.
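As I understand the formal statement (paraphrasing from memory, so treat this as a sketch rather than the canonical phrasing): for any consistent, computably axiomatized theory S there is a constant L_S with

    \forall x :\; S \nvdash \text{“} K(x) > L_S \text{”}

i.e. S cannot prove of any particular string x that its Kolmogorov complexity exceeds L_S, even though all but finitely many strings do exceed it.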
>There's a section in The Janus Point where he goes into detail about the fact that there are parts of the cosmos that (due to cosmic inflation) are farther apart in terms of light-years than the universe is old, and growing in separation faster than c, meaning that they are forever causally separated. There will never be future changes in state in one that result in effects in the other. In a way, this also relates to computation, maybe akin to some kind of undecidability.
Oh, I love this hint. However, even taking for granted that no faster-than-light travel is indeed an absolute rule of the universe, that doesn't exclude wormholes, or entangled particles.
It would be nice if this were a problem with decidability, but often it is a problem with indeterminacy that is way stronger than classic chaos.
The speed of causality, or of information, is the limit, and that limit happens to be the speed of light.
Even in the case of entanglement, useful information doesn't travel FTL. If I write true on one piece of paper and false on another and randomly send them to Sue and Bob, Sue instantly knows what Bob has as soon as she opens hers. While we teach QM similarly to how it was discovered, there are less mystical interpretations that are still valid. Viewing wave function collapse as updating priors vs. observer effects works, but is pretty boring.
While wormholes are a prediction of the theory, we don't know if the map matches the territory yet, though that is a reason to look for them. And if we do find them, it is likely that no useful information will survive the transit through them.
Kerr's rebuke of Hawking's assumption that black hole singularities are anything more than a guess, based on a very narrow interpretation of probably unrealistic non-rotating, non-charged black holes, is probably a useful read.
The map simply isn't the territory, but that doesn't mean we shouldn't see how good that map is or look for a better one.
Actually, the parts of the universe receding from us faster than the speed of light can still be causally connected to us. It’s a known “paradox” with the following analogy: an ant walks on an elastic band toward us at speed c, and we stretch the band away from us by pulling on the far end at a speed s > c. Initially, despite walking in our direction, the ant gets farther away, but eventually it does reach us (in exponential time). The same is true for light coming from objects that were receding from us at a speed greater than c when they emitted it. See https://en.m.wikipedia.org/wiki/Ant_on_a_rubber_rope
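For a sense of the numbers, a back-of-the-envelope sketch with my own toy values (not anything from the article), assuming the far end is pulled at a constant speed:

    import math

    # Ant on a rubber rope with constant-speed stretching (illustrative numbers only).
    # Rope starts at length L0, the far end recedes at speed s, the ant walks at speed v.
    L0, v, s = 1.0, 1.0, 3.0   # "recession" 3x faster than the ant walks

    # Fractional progress: dphi/dt = v / (L0 + s*t)  =>  phi(t) = (v/s) * ln(1 + s*t/L0).
    # Setting phi(t) = 1 gives the arrival time:
    t_arrival = (L0 / s) * math.expm1(s / v)
    print(t_arrival)   # ~6.36 time units: finite, but grows like e^(s/v)

(This is the constant-speed case; as pointed out downthread, accelerating expansion changes the picture.)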
Yes it does, look at the caption of Fig. 1:
"Photons we receive that were emitted by objects beyond the Hubble sphere were initially receding from us (outward sloping lightcone at t <∼ 5 Gyr). Only when they passed from the region of superluminal recession vrec > c (gray crosshatching) to the region of subluminal recession (no shading) can the photons approach us".
I can’t reply to your last reply. I agree; in fact, I said those regions can still be causally connected to us, not that they are.
It shows that SOME “superluminal” photons can reach us, not that ALL can. With accelerating expansion, eventually all galaxies fall out of that interval and become unreachable.
Was just going to mention that it seems it should be possible to make a Flash Attention version of this algorithm, and was pleasantly surprised to see they already included an implementation of one :)
O_i = softmax(...) · V, and the softmax weights are between 0 and 1 (and sum to 1), so O_i = sum_j alpha_j * V_j with each alpha_j between 0 and 1: a convex combination, where each contribution is just a shrunken version of V_j. Whereas if you have the diff of softmaxes, you get O_i = sum_j (alpha_j - beta_j) * V_j, and each coefficient can range from -1 to +1, so the output can rescale /or/ flip each V_j. And yes this is happening in every head in parallel, then they get summed.
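A tiny single-head numpy sketch of the contrast (my own toy code, not the paper's implementation; I'm assuming the "diff of softmaxes" is literally softmax(Q1·K1^T) − softmax(Q2·K2^T) applied to a shared V):

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    rng = np.random.default_rng(0)
    n, d = 4, 8
    Q1, K1 = rng.normal(size=(n, d)), rng.normal(size=(n, d))
    Q2, K2 = rng.normal(size=(n, d)), rng.normal(size=(n, d))
    V = rng.normal(size=(n, d))

    A_std = softmax(Q1 @ K1.T / np.sqrt(d))             # rows sum to 1, entries in [0, 1]
    A_diff = A_std - softmax(Q2 @ K2.T / np.sqrt(d))    # rows sum to 0, entries in [-1, 1]

    O_std = A_std @ V    # convex combination: each V row only gets shrunk
    O_diff = A_diff @ V  # signed combination: V rows can be shrunk, cancelled, or flipped
    print(A_std.min(), A_std.max(), A_diff.min(), A_diff.max())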
By simply inputting your comment into 4o, with no other context about the paper, I was able to get a pretty good analysis of the dual-head concept's implications.
Uh, this is extracting a LOT from very little data. I don't understand where it's coming from, but its explanation just keeps going into more and more detail ... that doesn't seem to follow from the data it's got.
I just don't see how you could answer these questions without trying it out. And ChatGPT DEFINITELY isn't doing that.
Plus the obvious question I'd pose is not in there: what's the difference in performance between this trick and just "(softmax() - 0.5) * 2"? That seems very relevant.
It's definitely slept on. I do think it ought to be very powerful given enough compute to throw at it, hopefully. I think short description length algorithms such as simple compression algorithms or instant-ngp could be interesting to play with through that paradigm.
Do you mean something more than putting the current date / timestamp in the system prompt? Which is something I think most of the top chatbots are doing behind the scenes.
For God knows what reason, the original PSPs used to come with an IR LED. I put a homebrew program on my PSP that let you control it, and fed it a txt file with thousands of TV IR codes. What a blast!
> I wouldn't call o1 a "system". It's a model, but unlike previous models, it's trained to generate a very long chain of thought before returning a final answer
That answer seems to conflict with "in the future we'd like to give users more control over the thinking time".
I've gotten mini to think harder by asking it to, but it didn't make a better answer. Though now I've run out of usage limits for both of them so can't try any more…
Not in a way that it is effectively used - in real life, all of the papers using CoT compare against a weak baseline, and the benefits level off extremely quickly.
Nobody except for recent DeepMind research has shown test-time scaling like o1.
I remember Dario Amodei mentioned in a podcast once that most models won't tell you the practical lab skills you need. But that sufficiently-capable models would and do tell you the practical lab skills (without your needing to know to ask it to in the first place), in addition to the formal steps.
[0] https://www.reddit.com/r/tirzepatidecompound/comments/1dtzr2...