For those of us without 200GB of GPU RAM available... How possible is it to do i...

toxik · on June 23, 2022

These models are not character-based, but token-based. The problem with CPU inference is the need for random access to 250 GiB of parameters, meaning immense paging and orders of magnitude slower than normal CPU operation.

I wonder how bad it comes out with something like Optane?

amelius · on June 23, 2022

It's not really random access. I bet the graph can be pipelined such that you can keep a "horizontal cross-section" of the graph in memory all the time, and you scan through the parameters from top to bottom in the graph.

toxik · on June 23, 2022

Fair point, but you’ll still be bounded by disk read speed on an SSD. The access pattern itself matters less than the read cache being << the parameter set size.

lostmsu · on June 23, 2022

Top SSDs do over 4GB/s so you can infer in 50 seconds if disk bound.

You can also infer a few tokens at once, so it will be more than 1 char a minute. Probably more like sentence a minute.

toxik · on June 23, 2022

You can read bits at that rate yes, but keep in mind that it’s 250 GiB /parameters/, and matrix-matrix multiplication is typically somewhere between quadratic and cubic in complexity. Then you get to wait for the page out of your intermediate result etc etc.

It’s difficult to estimate how slow it would be, but I’m guessing unusably slow.

lostmsu · on June 23, 2022

The intermediate result will all fit into a relatively small amount of memory.

During inference you only need to keep layer outputs until the next layer's outputs are computed.

If we talk about memory bandwidth, it is space requirements that are important, not so much time complexity.

guywhocodes · on June 23, 2022

I wonder if you can't do that LSH trick to turn it into a sparse matrix problem and run it on CPU that way.

nmfisher · on June 23, 2022

That's pretty much what SLIDE [0] does. The driver was achieving performance parity with GPUs for CPU training, but presumably the same could apply to running inference on models too large to load into consumer GPU memory.

https://github.com/RUSH-LAB/SLIDE

julienfr112 · on June 23, 2022

What about 250gb of ram and use a cpu ?

hnechochamber2 · on June 23, 2022

    $ dd if=/dev/zero of=/swapfile bs=1G count=250 status=progress
    $ chmod 600 /swapfile
    $ mkswap -U clear /swapfile
    $ swapon /swapfile

pflanze · on June 23, 2022

If you bother to set the permissions, I suggest to do it in a way that doesn't leave a time window during which it still is unprotected (note that non-priviledged processes just need to open the file during that window; they can keep reading even after your chmod has been run). Also, not sure what the point of `-U clear` was, that's setting the uuid for the swap, better leave it at the default random one?

    $ ( umask 077; dd if=/dev/zero of=/swapfile bs=1G count=250 status=progress )
    $ mkswap /swapfile
    $ swapon /swapfile

jstimpfle · on June 23, 2022

Is there a reason why it is required to fill the swapfile with zeroes here? Normally you'd see something like "dd of=/swapfile bs=1G seek=3 count=0", creating a file of size 3G but with no space allocated (yet). It's much quicker to complete the setup this way.

wongarsu · on June 23, 2022

I assume if you force the file system to allocate inodes you are likely to have a less fragmented file than if you create a sparse file that gets inodes assigned over time when each part is used.

olddustytrail · on June 23, 2022

Interesting guess but wrong I'm afraid :)

It's simply because it's an easy way to create a file of a certain size that most Linux users would be familiar with.

The quicker way (and possibly more "proper" way) is to use fallocate, but who has even heard of that vs dd ?

theblazehen · on June 23, 2022

Which won't matter on SSDs

wongarsu · on June 23, 2022

On all the benchmarks of SSDs I've seen they perform 1.5 to 4 times better on sequential reads than on random reads. That's a much better ratio than HDDs, but still enough to care about it.

You're also likely to get less write amplification if your swap file is continuous.

Of course with all the layers of indirection it's a numbers game, you don't know if your file system allocates adjacent inodes, and you don't know how your SSD will remap the blocks. But all else being equal, trying to make the file as sequential as possible seems preferable.

Aardwolf · on June 23, 2022

Way too slow on CPU unfortunately

But this does make me wonder if there's any way to allow a graphics card to use regular RAM in a fast way? AFAIK built-in GPU's inside CPU's can but those GPU's are not powerful enough

yarandex · on June 23, 2022

Assuming running on CPU is memory-bandwidth limited, not CPU-limited, it should take about 200GB / (50GB/sec) = 4 seconds per character. Not too bad.

lostmsu · on June 23, 2022

That's per token. And you can generate quite a few per pass.

julienfr112 · on June 23, 2022

Slow, but is it still practical, like taking minutes to generate few words ca still be useful for testing or on certain low usage use-cases ?

easytiger · on June 23, 2022

I thought cuda had a unified memory system? Maybe I misunderstood

Cu3PO42 · on June 23, 2022

Unified memory exists, but it's not a magic bullet. If a page is accessed that doesn't reside on device memory (i.e. on the GPU), a memcpy is issued to fetch the page from main RAM. While the programming model is nicer, it doesn't fundamentally change the fact that you need to constantly swap data out to main RAM and while not as bad as loading it from the SSD or HDD, that's still quite slow.

Integrated GPUs that use a portion of system memory are an exception to this and do not require memcpys when using unified memory. However, I'm not aware of any powerful iGPUs from Nvidia these days.

easytiger · on June 23, 2022

Sure. Makes sense. So I guess for discrete GPUs the unified memory stuff provides a universal address space but merely abstracts the copying/streaming of the data.

There does seem to be a zero copy concept as well and I've certainly used direct memory access over pcie before on other proprietary devices.

https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/ind...