programjames's comments | Hacker News

Far too much marketing speech, far too little math or theory, and completely misses the mark on the 'next frontier'. Maybe four years ago, spatial reasoning was the problem to solve, but by 2022 it was solved. All that remained was scaling up. The actual three next problems to solve (in order of when they will be solved) are:

- Reinforcement Learning (2026)

- General Intelligence (2027)

- Continual Learning (2028)

EDIT: lol, funny how the idiots downvote


Combinatorial search is also a solved problem. We just need a couple of Universes to scale it up.

If there isn't a path humans know how to take with their current technology, it isn't a solved problem. It's much different from people training an image model for research purposes and knowing that $100m in compute is probably enough for a basic video model.

Hasn't RLHF, and RL with LLM feedback, been around for years now?

Large latent flow models are unbiased. On the other hand, if you purely use policy optimization, RLHF will be biased towards short horizons. If you add in a value network, the value has some bias (e.g. MSE loss on the value gives a Gaussian bias, since minimizing MSE just recovers the conditional mean, i.e. the maximum-likelihood estimate under Gaussian noise). Also, most RL has some adversarial loss (how do you train your preference network?), which makes the loss landscape fractal, and SGD smooths it incorrectly. So, basically, there are a lot of biases that show up in RL training, which can make it both hard to train and, even if successful, not necessarily optimizing what you want.

We might not even need RL, as DPO has shown.

> if you purely use policy optimization, RLHF will be biased towards short horizons

> most RL has some adversarial loss (how do you train your preference network?), which makes the loss landscape fractal which SGD smooths incorrectly


What do you consider "General Intelligence" to be?

A good start would be:

1. Robust to adversarial attacks (e.g. in classification models or LLM steering).

2. Solving ARC-AGI.

Current models are optimized to solve the specific problem they're presented with, not to find the most general problem-solving techniques.


I like to think I'm generally intelligent, but I am not robust to adversarial attacks.

Edit: I'm trying arc-agi tests now and it's looking bad for me: https://arcprize.org/play?task=e3721c99


In my thinking, what AI lacks is a memory system.

That has been solved with RAG, OCR-ish image encoding (DeepSeek, recently), and just long context windows in general.

RAG is like constantly reading your notes instead of integrating experiences into your processes.

Not really. For example, we still can't get coding agents to work reliably, and I think it's a memory problem, not a capabilities problem.

On the other hand, test-time weight updates would make model interpretability much harder.

This is only for review/position papers, though I agree that pretty much all ML papers for the past 20 years have been slop. I also consider the big names like "Adam", "Attention", or "Diffusion" slop, because even though they are powerful and useful, the presentation is so horrible (for the first two) or they contain major mistakes in the justification of why they work (the last two) that they should never have gotten past review without major rewrites.

No, they aren't. Locally, the person with the most esoteric knowledge is probably a weird nerd. It's mostly an accident that they chose to invest time in things typically associated with smarts. But globally, the best wizards got there by making it their profession. So maybe at your middling university, the people who could land a job at a frontier lab were nerdy wannabe frats, but at decent universities like MIT or Tsinghua, they're usually just better in every aspect of their lives. E.g. MIT has "math olympiad fraternities" that all the cool kids join.

I went to a top-5-ranked school globally (these lists fluctuate) and have been in elite circles since then. I can promise you that even there, the autistic nerd fully outcompetes the renaissance man.

The human traffic code is also written in blood. But humans are worse at applying the patch universally.


We don't even try. In the US, you demonstrate that you know the rules at one point in time and that's it; as long as you never get a DUI, you're good.

For instance, the 2003 California Driver's Handbook[1] first introduced the concept of "bike lanes" to driver education, but contains the advice "You may park in the bike lane unless signs say “NO PARKING.”" which is now illegal. Anyone who took their test in the early 2000s is likely unaware that changed.

It also lacks any instruction whatsoever on common modern roadway features like roundabouts or shark teeth yield lines, but we still consider drivers who only ever studied this book over 20 years ago to be qualified on modern roads.

1. https://dn720706.ca.archive.org/0/items/B-001-001-944/B-001-...


Some places will dismiss a traffic ticket if you attend a driver's education class to get updates, though you can only do this once every few years. So at least there have been some attempts to get people to update their learning.


This only happens if you get a traffic ticket, which is rare and getting rarer.

Ironically this means the people with the cleanest driving record are least likely to know the current ruleset.


Which, ironically, would mean that knowing the current rule set is not needed to drive safely.


Not getting tickets does not mean you are a safe driver. No amount of crashing results in traffic school, just certain kinds of tickets.


> No amount of crashing results in traffic school, just certain kinds of tickets.

Well, sufficient at-fault crashing will suspend your license, and among the requirements for restoring the license may be traffic school, DUI school, or some other program, depending on the reason for suspension, so this is not strictly correct. You can't use optional voluntary traffic school to clear points from a collision from your record BEFORE getting a suspension, the way you can with minor moving violations without a collision, but that doesn't mean collisions won't force you into traffic school.


Which states/counties/cities? IME that rarely happens; tickets are often used for revenue raising. And some recent laws, e.g. the 2008, 2009, and 2025 CA cellphone-use laws, cannot be discharged by traffic school, AFAIK.

Phoenix, Arizona

> Anyone who took their test in the early 2000s is likely unaware that changed.

That's silly. People become aware of new laws all the time without having to attend a training course or read an updated handbook.

I took the CA driver's written test for the first time in 2004 when I moved here from another state. I don't recall whether or not there was anything in the handbook about bike lanes, but I certainly found out independently when it became illegal to park in one.


I don't doubt that many people are aware of many of the new laws. But I strongly suspect that a very significant number of drivers are unaware of many new laws.


The model seems pretty shitty. Does it only look at the video on a frame-by-frame basis? Literally one second of video context and it would never make that mistake.


What dates? 2000 August–2025 August gives 3.18%/year for the consumer index and 7.12%/year for egg prices. Even if you assume egg prices will come down to $3/dozen, it's still 6.24%/year.
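
For reference, the arithmetic is just a compound growth rate; a quick sketch (the prices below are placeholders for illustration, not the actual BLS figures):

    # Annualized growth rate over a period: (end/start)**(1/years) - 1.
    # The egg prices here are made-up placeholders, not real BLS data.
    def annualized(start_price: float, end_price: float, years: float) -> float:
        return (end_price / start_price) ** (1.0 / years) - 1.0

    print(annualized(1.00, 5.60, 25))  # ~0.071, i.e. roughly 7.1% per year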


Hmm, I think a mixture of Beta distributions could work just as well as categorical here. I'm going to train it with PixelRNN, but it's going to take hours or days to train (it's a very inefficient and unparallelizable architecture). I'll report back tomorrow.
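
Roughly, the plan is to have the network emit mixture weights plus (alpha, beta) per component for each channel. A sketch of the output head (the hidden size and component count are placeholders, not tuned values):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Sketch of a per-channel mixture-of-Betas output head. The real model
    # would hang this off the PixelRNN features for each pixel/channel.
    class BetaMixtureHead(nn.Module):
        def __init__(self, hidden_dim: int = 256, n_components: int = 4):
            super().__init__()
            # 3 parameters per component: mixture logit, alpha, beta
            self.proj = nn.Linear(hidden_dim, 3 * n_components)

        def forward(self, h: torch.Tensor):
            logits, a, b = self.proj(h).chunk(3, dim=-1)
            # softplus keeps alpha, beta positive; the small offset avoids
            # degenerate components
            alpha = F.softplus(a) + 1e-3
            beta = F.softplus(b) + 1e-3
            return logits, alpha, beta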


Update 2:

After another 24 hours of training and around 100 epochs, we get down to 4.4 bits/dim and colors are starting to emerge[1]. However, an issue a friend brought up is that with a Beta distribution, the log-likelihood weights values near 0 and 1 much higher:

     log Beta(x; alpha, beta) = (alpha - 1) log(x) + (beta - 1) log(1 - x) - log B(alpha, beta)
                                             ^
                               diverges as x --> 0 (and similarly at x --> 1)
This means we should see most outputs be pure colors: black, white, red, blue, green, cyan, magenta, or yellow. 3.6% of the channels are 0 or 255, up from 1.4% after 50 epochs[2]. Apparently, an earth-mover loss might be better:

    E_{x ~ output distribution}[|correct - x|]
I could retrain this for another day or two, but PixelRNN is really slow, and I want to use my GPU for other things. Instead, I trained a 50x faster PixelCNN for 50 epochs with this new loss and... it just went to the average pixel value (0.5). There's probably a way to train a mixture of betas, but I haven't figured it out yet.
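
For concreteness, the loss I have in mind is roughly this (a sketch, not the exact code I trained with; I approximate the expectation with per-component samples so gradients still flow):

    import torch
    from torch.distributions import Beta

    # Sketch of E_{x ~ p}[|correct - x|] for a mixture of Betas. Beta.rsample
    # gives reparameterized gradients, and the softmax mixture weights stay
    # differentiable, so the whole thing trains with SGD.
    def beta_mixture_em_loss(logits, alpha, beta, target, n_samples=8):
        # logits, alpha, beta: (..., K); target: (...,) with values in (0, 1)
        weights = torch.softmax(logits, dim=-1)                    # (..., K)
        samples = Beta(alpha, beta).rsample((n_samples,))          # (S, ..., K)
        per_component = (samples - target.unsqueeze(-1)).abs().mean(dim=0)
        return (weights * per_component).sum(dim=-1).mean()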

[1]: https://imgur.com/kGbERDg [2]: https://imgur.com/iJYwHr0


Update 1: After ~12 hours of training and 45 epochs on CIFAR, I'm starting to see textures.

https://imgur.com/MzKUKhH


Update 3:

Okay, so my PixelCNN masking was wrong... which is why it went to the mean. The earth-mover did get better results than negative log-likelihood, but I found a better solution!

The issue with negative log-likelihood was that the neural network could optimize solely around zero and one, because there are poles there. The key insight is that the color value in the image is not exactly zero or one. If we are given #00, all we really know is that the image from the real world had a brightness between #00 and #01, so we should be integrating the probability density function from 0 to 1/256 to get the likelihood.
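
In code, the change is just swapping the point density for a bin probability (a sketch; beta_cdf here is the hand-rolled CDF described next, not a PyTorch built-in):

    import torch

    # Sketch of the discretized negative log-likelihood: integrate each Beta
    # component over the bin [x, x + 1/256) via its CDF, so the poles at 0 and
    # 1 no longer give unbounded likelihood.
    def discretized_nll(logits, alpha, beta, x, bins=256, eps=1e-12):
        # logits, alpha, beta: (..., K); x: (...,) quantized values in [0, 1)
        weights = torch.softmax(logits, dim=-1)
        lo = beta_cdf(alpha, beta, x.unsqueeze(-1))
        hi = beta_cdf(alpha, beta, (x + 1.0 / bins).clamp(max=1.0).unsqueeze(-1))
        bin_prob = (weights * (hi - lo)).sum(dim=-1)
        return -(bin_prob + eps).log().mean()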

It turns out PyTorch does not have a good implementation of Beta.cdf(), so I had to roll my own. Realistically, I just asked the chatbots to tell me what good algorithms there were and to write me code. I ended up with two:

(1) There's a known continued fraction form for the CDF, so combined with Lentz's algorithm it can be computed.

(2) Apparently there's a pretty good closed-form approximation as well (Temme [1]).

The first one was a little unstable in training, but worked well enough (output: [2], color hist: [3]). The second was a little more stable in training, but had issues with NaNs near zero and one, so I had to clamp things there, which makes it a little less accurate (output: [4], color hist: [5]).
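
For anyone curious, the continued-fraction route looks roughly like this (a sketch close to the textbook recurrence, i.e. the modified Lentz algorithm à la Numerical Recipes; the iteration count and clamps are guesses rather than tuned values):

    import torch

    # Continued fraction for the regularized incomplete beta, evaluated with
    # the modified Lentz algorithm (fixed iteration count, no early exit).
    def _betacf(a, b, x, n_iter=64, fpmin=1e-30):
        qab, qap, qam = a + b, a + 1.0, a - 1.0
        c = torch.ones_like(x)
        d = 1.0 - qab * x / qap
        d = torch.where(d.abs() < fpmin, torch.full_like(d, fpmin), d)
        d = 1.0 / d
        h = d
        for m in range(1, n_iter + 1):
            m2 = 2 * m
            # even step of the recurrence
            aa = m * (b - m) * x / ((qam + m2) * (a + m2))
            d = 1.0 + aa * d
            d = torch.where(d.abs() < fpmin, torch.full_like(d, fpmin), d)
            c = 1.0 + aa / c
            c = torch.where(c.abs() < fpmin, torch.full_like(c, fpmin), c)
            d = 1.0 / d
            h = h * d * c
            # odd step of the recurrence
            aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
            d = 1.0 + aa * d
            d = torch.where(d.abs() < fpmin, torch.full_like(d, fpmin), d)
            c = 1.0 + aa / c
            c = torch.where(c.abs() < fpmin, torch.full_like(c, fpmin), c)
            d = 1.0 / d
            h = h * d * c
        return h

    # Beta CDF, i.e. the regularized incomplete beta I_x(a, b).
    def beta_cdf(a, b, x, eps=1e-6):
        x = x.clamp(eps, 1.0 - eps)  # clamp away from the poles, as noted above
        log_front = (torch.lgamma(a + b) - torch.lgamma(a) - torch.lgamma(b)
                     + a * torch.log(x) + b * torch.log(1.0 - x))
        front = log_front.exp()
        # symmetry I_x(a, b) = 1 - I_{1-x}(b, a) picks the faster-converging side
        use_direct = x < (a + 1.0) / (a + b + 2.0)
        direct = front * _betacf(a, b, x) / a
        flipped = 1.0 - front * _betacf(b, a, 1.0 - x) / b
        return torch.where(use_direct, direct, flipped)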

The bits/dim gets down to ~3.5 for both of these, which isn't terrible, but there's probably something that can be done better to get it below 3.0. I don't have any clean code to upload, but I'll probably do that tomorrow and edit (or reply to) this comment. But, that's it for the experiments!

Anyway, the point of this experiment was because this sentence was really bothering me:

> But categorical distributions are better for modelling.

And when I investigated why you said that, it turns out the PixelRNN authors used a mixture of Gaussians, and even said they're probably losing some bits because Gaussians go out of bounds and need to be clipped! So, I really wanted to say, "seems like a skill issue, just use Beta distributions," but then I had to go check whether that really did work. My hypothesis was that Betas should work even better than a categorical distribution, because the categorical model has to learn that nearby values are indeed nearby, while this is baked into the Beta model. We see the issue show up in the PixelRNN paper, where their outputs are very noisy compared to mine (histogram for a random pixel: [6]).

[1]: https://ir.cwi.nl/pub/2294/2294D.pdf [2]: https://imgur.com/e8xbcfu [3]: https://imgur.com/z0wnqu3 [4]: https://imgur.com/Z2Tcoue [5]: https://imgur.com/p7sW4r9 [6]: https://imgur.com/P4ZV9n4


I would prefer you refer to it as "courtesy" or "consideration" rather than "freedom".


And paper money is a Chinese invention. Doesn't mean it's worthwhile to spend two weeks in an anthropology class talking about how much awesomer they are.


That isn't a comparison to the state of the art, just a naive quantum clock.

