
You absolutely can. Deterministic inference is achievable, but it isn't as performant. The reason why sadly boils down to floating point math.

The way I understand it, an LLM response is a chain of tokens where each is the most probable token. Maybe there exist more complicated candidate-selection approaches than that, but "biggest number" works for me. For the sake of simplicity, let's just say tokens are words. You'd have access to the probability of each word in the ordering of the sentence, but I'm not sure how that would then be used to evaluate the probability of the sentence itself or its truthiness.
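
If I had to guess, the per-token probabilities would chain multiplicatively, something like this rough sketch (made-up numbers, assuming greedy decoding):

    import math

    # Made-up probabilities for the greedily chosen token at each step.
    token_probs = [0.42, 0.31, 0.57, 0.66]

    # By the chain rule, the sequence probability is the product of each
    # token's conditional probability given the tokens before it.
    sequence_prob = math.prod(token_probs)

    # In practice log-probs are summed instead, to avoid underflow on long outputs.
    sequence_logprob = sum(math.log(p) for p in token_probs)

    print(sequence_prob, math.exp(sequence_logprob))

Whether that product says anything about truthiness is a separate question, of course.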


I could have worded my reply better, but the simplified explanation stands :b

#4 was a profound and very validating read for me in understanding how the brain handles traumatic events. Thanks for sharing.


Not to deny your experience with it, but looking up the author and the book, I notice it has drawn some serious criticism for inaccuracies and a lack of empirical data backing up its claims.


> lack of empirical data backing up claims

Doesn't that describe almost all books on psychology?

Psychology studies tend to be so hilariously unscientific that I'd rather get the coherent opinions and gut feelings of an experienced practicing expert, rather than half-arsed studies.


You could level some pretty damning claims against hard science as well, given the ongoing reproducibility crisis in academia (LK99, the "faster than light" accidents that have been reported, the "EM Drive"), or the enormous amount of money (and people's brains) sunk into string theory. Somehow those are/were considered science even though there is no evidence.


Links? He cites a _lot_ of empirical research and the book is generally highly regarded in the field


Do you have links yourself for those claims?


You could see the bibliography of the book itself. You could google a bit and see the guy is a leader in this field, and a pioneer of this research. Answering a request for sources of your assertion with a request for theirs isn't done in good faith.


I made no such assertion? I was following the thread and I think it's a fair request. It was an easy google search to see that he was fired for bullying and creating a hostile workplace a few years back... not sure where that landed. And I saw a number of articles relating to pseudoscience that he recommends in the book. It was a simple ask for their simple ask.


While somewhat tangential to this, I'd like to comment on a different problem: "What's the fewest number of plates you need to go from 45lb (bar) -> 240lb at a resolution of 5lb?" Look no further than the 185lb set[1]. Composed of pairs of 2.5lb, 5lb, 2x10lb, 25lb, and 45lb plates, one can do just this. Need to lift 245lb? Buy another pair of 45lb plates and your range expands to 45lb -> 330lb. Another pair of 45lb plates yields 45lb -> 420lb. You get the idea. This approach comes with the added bonus of being the most cost-efficient way of buying plates, as heavier plates yield slightly better $/lb.

1: https://github.com/ramity/athena/blob/master/notebooks/plate...

Side note: I've yet to do the calculations for kg sets, but I'm certain something similar to this exists.

Edit: I made a laughably simple mistake interpreting my results and corrected the above. Many thanks to those who replied and brought this to my attention (credited in the commit https://github.com/ramity/athena/commit/4a17a3d16058f850d09e...).
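
For anyone who wants to sanity check the coverage themselves, a quick sketch of the enumeration (plain Python, plates per side as listed above):

    from itertools import combinations

    BAR = 45
    SIDE = [2.5, 5, 10, 10, 25, 45]  # one of each pair loaded per side

    # Every total reachable by loading some subset of the per-side plates.
    totals = {BAR + 2 * sum(c) for r in range(len(SIDE) + 1)
              for c in combinations(SIDE, r)}

    # Any 5lb increments between the bar and the max that can't be hit.
    print(sorted(set(range(BAR, 241, 5)) - totals))  # [] means full coverage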


This leaves out the ability to get to 85, 90, 175, and 180 lbs. Another set of 10s hits these missing numbers quite easily tho and allows for 330 lbs in total which should be plenty for most recreational lifters.


> Side note: I've yet to do the calculations for kg sets, but I'm certain something similar to this exists.

Metric plate math is trivial in comparison.

In the pounds case, dividing the common plate numbers by 5 lb yields:

1/2, 1, 2, 5, 9

Dividing the metric series by 2.5 kg (~5 lb) gives:

1/2, 1, 2, 4, (6), 8, (10)


The set of weights you listed cannot do 85lbs or 90lbs since nothing adds up to 20 or 22.5. Your notebook shows this. Easily rectified by adding in another pair of 10lb plates.


Doesn't this set skip 20lbs? Since 10+5+2.5 has a gap before the next size of 25


To get 20, you just use the 25 and put the 5 on the other end.


Old schoolers lift metric.

The bar is 20kg.

The weights (pairs of each): 1.25, 2.5, 5, 10, 20 kg.

You might want some 25kg bumpers for deadlifts.


I don't really subscribe to the idea of "finishing" or "completing" a project; I'm sure my personal GitHub can attest to that. I think "real" software is never done. The only things in software that are completed are the fractions of software we abstractly define (tasks, features, sprints, deliverables, etc). Much like us, real software lives until it doesn't. It changes through time, sometimes regressing, sometimes expanding. Software whose goalposts remain static becomes deadware.

Outside of commercial projects, I program for the joy of creation and, commonly and paradoxically, automate for the sake of "not automating." I jump from project to project, sure, but I've found the largest source of not wanting to go back to a project is the difficulty of doing so. Having to pick things back up to juggle, and going through the motions of relearning what my software did and what needs to be done, was always a pain.

My real breakthrough was "optimizing being able to leave": comments written as though I'd be picking the software back up months/years later, READMEs detailing build steps, rationale, and planned features, automating dev environment setups with Docker, breaking features/work into pieces so it isn't overwhelming, etc. These are just some of the many ways to make it easier.

Sometimes you don't want to go back because all you can think of is the known (or unknown) work that lies ahead of you. The fewer the reasons to not go back, the easier it is, and if it's easy to pick back up, you'll find yourself picking things back up when the time is right. Sometimes inspiration hits while working on other stuff, and I say that's fine. Embrace that.

Commercial software is a bit more narrow in the selection of how one can start and stop on work (I call this "task shopping"), but being in tune with yourself and vocalizing that during standups/meetings/whatever can help. Can't seem to finish a task? Maybe the task was too big to begin with, scope/feature creep set in, or whatever. Create tasks for what you've gotten done and what needs to be done. Lay out some groundwork to explain how someone might pick up the new tasks. Do that and you'll find yourself "optimizing being able to leave."

"You must become comfortable with the grind-it-out nature of the last 10% of a project." I really don't align with this statement. Software can and should be a joy to do. Sure, there are aspects that can make it feel like a grind, but this is a question of framing. After all software development is technically data entry (don't think about this too much).


The precision used should match the requirements of the dataset, the training process, and the available compute. There are practical uses for 16-bit FP training.

"Our findings demonstrate that pure 16-bit floating-point neural networks can achieve similar or even better performance than their mixed-precision and 32-bit counterparts." This is a very deceptive statement. Take 100 initialization states and train a FP16 vs a FP32 network, and you'll find FP32 will have an accuracy advantage. It's certainly possible to conclude this if a small sample of networks are trained. This paper goes on to state, "Lowering the precision of real numbers used for the neural network’s weights to fixed-point, as shown in [11], leads to a significant decrease in accuracy.", while later concluding, "we have shown that pure 16-bit networks can perform on par with, if not better than, mixed-precision and 32-bit networks in various image classification tasks." The results certainly do, but that doesn't really give an accurate evaluation of what's really going on here. A FP64 network can fall into a local minima and be outperformed by a PF16 network, but is it correct to say the FP16 network is better. I'm getting a lot of mixed signals.

I feel like "significant implications" is quite a stretch.

A few concerns: besides figure 3, the other results do not provide side-by-side test vs validation accuracy to attempt to demonstrate the network is not overfit, and the only mention of normalization was the custom batch normalization operation.

This may be more of a rant about the current state of ML, but in a perfect world, we wouldn't use GPUs/would enforce deterministic calculations, results would be replicable, we'd train hundreds if not thousands of networks to draw conclusions from, we'd better understand how to visualize network accuracies and overfitting, and all datasets would be free of bias and accurately generalize the problem being modelled. We can dream.


This is generally incorrect. FP16 usually matches FP32 via bfloat with almost no sweat, and any additional noise tends to have a positive regularization effect.

I train directly in _pure_ fp16/bf16 with no issues and the benefits greatly outweigh the tradeoffs. On smaller networks, I use 0 gradient clipping whatsoever.

FP32 has almost no uses outside of bizarrely intricate simulation kinds of things, in which case FP64 is still generally important.
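
For anyone curious what "pure" looks like in practice, a minimal sketch (toy data, CPU, recent PyTorch assumed) where every tensor stays in bfloat16:

    import torch

    # Parameters, activations, and gradients all live in bfloat16.
    model = torch.nn.Sequential(
        torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1)
    ).to(torch.bfloat16)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    x = torch.randn(256, 16, dtype=torch.bfloat16)
    y = torch.randn(256, 1, dtype=torch.bfloat16)

    for step in range(200):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()  # no autocast, no grad scaler, no gradient clipping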


I appreciate your input on bfloat. I've always been under the impression that precision matters a lot when attempting to avoid local minima/maxima if the landscape of the error function is jagged, but I suppose there's a good argument to be made that any floating point format can be used if the data, learning rate, network structure, etc are molded to match. Perhaps it's my perspective, or maybe there actually isn't enough discourse on the FP format being an equally or more important factor to consider than just its effect on compute and memory requirements.

The use of FP64 could aid against vanishing gradients and just general information loss in deep networks, but that's probably comparable to using an atomic bomb to power a wind turbine. It certainly works, but is it the best way to go about it?

I personally think the use of mixed precision in deep networks will become more common as time goes on. I'm doubtful that all of a network really benefits from having large amounts of precision.


Well, if I could guide a bit in terms of focus, it's not necessarily the precision of the floating point values as much as the structure of information flow and expressivity in the network. Gradients are going to die basically regardless of precision; you're maybe saving yourself a few steps, but if you're at the point of using precision to stave off dead gradients, it's several orders of magnitude less efficient than a decent solution.

My personal belief from experience is that training in pure FP8 is maybe possible with some hacks, but that our limit for needing mixed precision to stabilize things might come into play around 3-6/7 bits or so (a wide range, sorry). I could be wrong though, maybe there is some really cool discrete training method out there that I'm not aware of.

A good way to prevent information loss in neural networks is to minimize all of your subpath lengths. You also want a really short shortest path for information from your first to your final layer. That will do a lot.

Also, as far as things being jagged -- remember that floating point only loses a lot of absolute precision on large numbers, which should be really coarse anyways. Having large, perfectly precise numbers means we are likely overfitting. Small and detailed means that we can afford to have high precision. Think of it as a beneficial tradeoff, like knowing position and/or momentum to some exchangeable extent in quantum mechanics. If we impose that on our precision, we get some nice benefits in the end.
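
A quick way to see that coarseness for yourself (the gap to the next representable float16 value at a few magnitudes):

    import numpy as np

    # Spacing between x and the next representable float16 grows with magnitude.
    for x in (0.1, 1.0, 100.0, 1000.0, 10000.0):
        print(x, np.spacing(np.float16(x)))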

Hope that helps sort of expound on the subject a bit more, feel free to let me know if you have any questions and much love! <3 :))) :D :)


From what I can tell, the architecture is more important anyway, and having smaller but more numerous parameters gives the model more chances to figure out the optimal architecture on its own.


My best understanding is that architecture is predetermined, which determines the number of parameters up front?

I do think, however, that shallower bit depths will over time require slightly deeper networks to compensate. Sorta makes sense when you think about it a bit. :) <3 :DDDD :)


I didn't feel like the parent comment was worth downvoting, and I share most of your perspectives. I agree that AI safety is ripe for contributions, but I'd personally argue AI safety is a very wide and encompassing topic that requires a good foundational understanding of AI. In the context of ML, AI safety is applicable to the entire process: dataset creation, training, utilization, etc.

In regards to your comment on AI safety being "underutilized", my thoughts are that it's just simply difficult to do. Let's put aside all the difficulties of training, verification, etc and just look at the data problem.

If you wish to make certain that your system meets some given AI safety standard, then you must somehow prove two things: that the model will always return the correct response for the data it ingests when deployed, and that the dataset composes/generalizes that data. For simple problems, this may be doable. For complex, multidimensional problems, wherein the dataset can only hope to generalize the complex input encountered during deployment, this may be next to infeasible.

I'm definitely getting off topic here, but bias of all kinds exists even in human-operated systems, e.g. cars. I can't say I've ever seen a firetruck stopped on the highway before, but perhaps I'd know what it is and how to avoid it. If a dataset does not contain that event, how can we be certain an AI system would understand it? I'm not sure if it's possible to create a dataset without bias in the case of complex problems, but I'm certain we can create one that's more performant at driving than I am. So the questions of "how safe is enough?", and then proving/demonstrating that safety, and more, remain particularly open topics. I enjoy making the point that there is a lack of rigorous standards for humans, as we hold computers to far higher standards, yet ML models probabilistically navigate decisions much like we do.

I'm sure this reply could extend further, but this and more are my defense of why I believe AI safety is a wide topic. None of the above should dissuade beginners from exploring the subtopic, but it's certainly not something you'd be able to learn first without strong, foundational context.


I think we probably come to some of the same conclusions, but I'm approaching this problem slightly differently.

To use your car example: say I'm driving in front of a park where there are lots of parked cars lining the street. Then a ball rolls out into the street from between two parked cars. I may have never personally seen a child run into the street from between two parked cars before, but I can infer (i.e., imagine) that from the context of the scenario. So I slow waaaay down in case that event happens. I don't need to see all edge cases to still cover an awful lot of them.

I'm not sure AI is to that point (yet). There are some arguments for approaches like reinforcement learning that say they perform quite well on unseen edge cases from past learning. But when the stakes are high, I'm not sure that is good enough.

(And regarding the 'it only has to be slightly better than the average human' counterpoint): I disagree. I think one of the reasons that we are comfortable with sharing the road with other ape-driven vehicles is that we have a theory of mind and can intuit what someone else is thinking and are able to 'imagine' their course of action. We've evolved to have this sense. We do not, however, have the ability to intuit what a computer will do, because it 'evolved' under very different circumstances. So our intuitions about whether or not to trust it may be out of whack with whether or not it performs better. And, like it or not, the policy that governs whether AI-controlled cars are legal will be highly dependent on public trust.

