It's an interesting option. My gut instinct is that if you need 128GB of memory for a giant model but don't need much compute - fine-tuning a very large model, maybe - you might as well use a high-core-count consumer CPU and wait 10x as long.
All the frameworks work on CPU. When I tried it, a 5950X was about 10x slower than my GPU (a 1080Ti or 2080Ti) - though that was a GAN, not a transformer.
I think they are saying train (or at least fine-tune) on a CPU.
This can work in some cases. Years ago I certainly did it for CNNs - it was slow, but since you're fine-tuning, anything is an improvement. I don't know how viable it would be for a transformer, though.
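For what it's worth, the CPU fine-tuning setup being described needs nothing special in the frameworks - here's a minimal sketch in PyTorch (the model, sizes, and hyperparameters are placeholders, not anything from the thread): everything defaults to CPU, and freezing most of the network keeps the compute bill down, which is the usual trade when RAM is plentiful but FLOPS aren't.

```python
import torch
import torch.nn as nn

# Hypothetical tiny model standing in for a large pretrained one.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))

# Freeze the early layers; only train the head. This is the classic
# fine-tuning move that makes CPU training tolerable: far fewer
# gradients to compute and store.
for p in model[0].parameters():
    p.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
opt = torch.optim.AdamW(trainable, lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Dummy data; a real run would stream batches from disk.
x = torch.randn(64, 16)
y = torch.randint(0, 2, (64,))

for _ in range(20):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
```

No `.to("cuda")` anywhere - tensors and parameters live in system RAM, so the 128GB actually gets used, at the cost of wall-clock time.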
5950X CPU ($500) with 128GB of memory ($400).