I can't find references to HMM-based large language models. Small HMM language models generate gibberish very similar to this.
An HMM consists of a state space, a state transition matrix, and an output (emission) probability matrix. A token space of 50k and a state space of something like 60k would have seemed impossible 10-20 years ago. It has only recently become viable.
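For concreteness, here is a toy NumPy sketch of those three pieces, with the sizes shrunk way down (everything here is illustrative; the names and sizes are made up, not a recipe for a 60k-state model):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_tokens = 4, 10   # toy sizes, not the 60k / 50k discussed above

pi = np.full(n_states, 1.0 / n_states)                               # initial state distribution
A = rng.random((n_states, n_states)); A /= A.sum(1, keepdims=True)   # state transition matrix
B = rng.random((n_states, n_tokens)); B /= B.sum(1, keepdims=True)   # output (emission) probabilities

# "Generation" the HMM way: walk the hidden states, emit one token id per step.
state = rng.choice(n_states, p=pi)
tokens = []
for _ in range(20):
    tokens.append(int(rng.choice(n_tokens, p=B[state])))
    state = rng.choice(n_states, p=A[state])
print(tokens)
```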
Training one with Baum-Welch on a big enough text dataset would be interesting. It should be much faster than back-propagation on a transformer model.
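For anyone curious, a minimal dense Baum-Welch (EM) sketch in NumPy, again at toy sizes (a 60k-state model would need sparse or chunked variants and you would stream the corpus instead of holding one lattice; this is only meant to show the shape of the algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_tokens = 8, 20
obs = rng.integers(0, n_tokens, size=500)   # stand-in for a tokenized corpus

# Random row-stochastic initial parameters.
A = rng.random((n_states, n_states)); A /= A.sum(1, keepdims=True)   # transitions
B = rng.random((n_states, n_tokens)); B /= B.sum(1, keepdims=True)   # emissions
pi = np.full(n_states, 1.0 / n_states)                               # initial state

T = len(obs)
for _ in range(20):                          # EM iterations
    # E-step: scaled forward-backward pass.
    alpha = np.zeros((T, n_states)); beta = np.zeros((T, n_states)); scale = np.zeros(T)
    alpha[0] = pi * B[:, obs[0]]
    scale[0] = alpha[0].sum(); alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        scale[t] = alpha[t].sum(); alpha[t] /= scale[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]
    gamma = alpha * beta
    gamma /= gamma.sum(1, keepdims=True)     # per-step state posteriors
    xi = np.zeros((n_states, n_states))      # expected transition counts
    for t in range(T - 1):
        x = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
        xi += x / x.sum()
    # M-step: re-estimate parameters from expected counts.
    pi = gamma[0]
    A = xi / xi.sum(1, keepdims=True)
    B_new = np.zeros_like(B)
    for t in range(T):
        B_new[:, obs[t]] += gamma[t]
    B = B_new / B_new.sum(1, keepdims=True)

print("final log-likelihood:", np.log(scale).sum())
```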
It's okayish. Considering that 64 GB to 128 GB of RAM are available to (nerd) high-end consumers, you're only off by a factor of 5 (if we can squeeze out a little bit more performance).
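Back-of-the-envelope, with the sizes from the parent comment (the dtype is my assumption; Baum-Welch training also needs expected-count accumulators of the same shape plus forward/backward lattices, so it wants several times the raw parameter storage):

```python
# Rough parameter-storage estimate for a dense HMM with the sizes above.
n_states, n_tokens = 60_000, 50_000
bytes_per_param = 4  # float32 assumed; float64 doubles this

transition = n_states * n_states * bytes_per_param
emission = n_states * n_tokens * bytes_per_param
print(f"transition matrix: {transition / 2**30:.1f} GiB")   # ~13.4 GiB
print(f"emission matrix:   {emission / 2**30:.1f} GiB")     # ~11.2 GiB
print(f"parameters only:   {(transition + emission) / 2**30:.1f} GiB")
```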
Not exactly every few words in my experience; I would say more like every 100 words, if you make your Markov chain more sophisticated (n-gram order of 3 at minimum, a good tokenizer tailored to the training data, a large training set (500 KB or more), an intelligent fallback instead of a random token, etc.).
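Roughly the kind of thing I mean, sketched with toy stand-ins (a naive whitespace split in place of a good tokenizer, a tiny repeated string in place of a 500 KB+ corpus; the backoff-to-shorter-context fallback is the part that matters):

```python
import random
from collections import Counter, defaultdict

ORDER = 3  # n-gram order: predict the next token from up to 3 previous tokens

def train(tokens, order=ORDER):
    # One count table per context length, from unigram (length 0) up to `order`.
    tables = [defaultdict(Counter) for _ in range(order + 1)]
    for i in range(len(tokens)):
        for n in range(order + 1):
            if i >= n:
                tables[n][tuple(tokens[i - n:i])][tokens[i]] += 1
    return tables

def sample_next(tables, history, order=ORDER):
    # "Intelligent fallback": back off to shorter contexts before giving up.
    for n in range(order, -1, -1):
        counts = tables[n].get(tuple(history[-n:]) if n else ())
        if counts:
            toks, weights = zip(*counts.items())
            return random.choices(toks, weights=weights)[0]
    return None  # only reachable with an empty training set

text = "the cat sat on the mat and the dog sat on the rug " * 50  # toy stand-in
tokens = text.split()  # naive whitespace tokenizer standing in for a real one

tables = train(tokens)
out = tokens[:ORDER]
for _ in range(40):
    out.append(sample_next(tables, out))
print(" ".join(out))
```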
Neural models passed the point where Markov chains made any sense a long time ago, by many orders of magnitude.
Markov models fail by being too opinionated about the style of compute.
In contrast, a linear tensor + non-linear function has incredible flexibility to transform the topology of information. Given large enough tensors, two such layers, with recurrence, can learn any mapping, static or dynamical. No priors (other than massive compute) needed.
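As a structural sketch only (it shows the shape of that computation, not the universal-approximation claim; the sizes are arbitrary):

```python
# A linear map followed by a non-linearity, applied recurrently
# (an Elman-style RNN step), with a second linear layer reading the state out.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 16, 64, 16

W_in = rng.normal(0, 0.1, (d_hidden, d_in))       # input -> hidden (linear)
W_rec = rng.normal(0, 0.1, (d_hidden, d_hidden))  # hidden -> hidden (recurrence)
W_out = rng.normal(0, 0.1, (d_out, d_hidden))     # hidden -> output (second layer)

def step(x, h):
    h = np.tanh(W_in @ x + W_rec @ h)   # linear map + non-linear function
    y = W_out @ h
    return y, h

h = np.zeros(d_hidden)
for x in rng.normal(size=(10, d_in)):   # run over a short input sequence
    y, h = step(x, h)
print(y.shape, h.shape)
```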
All other neural architectures are then simply sparser arrangements that bring compute demands down, with the sparseness fitted to the type of problem.
The sparseness can take the form of deeper but narrower information flows (hence "deep" learning), or of fewer weights per weight application (i.e. shared weights, as in convolutions).
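A rough illustration of the weight-sharing point, with made-up sizes:

```python
# Same-sized input and output, once as a dense layer and once as a
# shared-weight 1D convolution (single channel, kernel size 3).
width, kernel = 1024, 3

dense_params = width * width   # every output position gets its own weight for every input
conv_params = kernel           # one small kernel, reused at every position

print(f"dense layer:    {dense_params:,} weights")
print(f"1D convolution: {conv_params:,} weights (applied at {width} positions)")
```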