144GB of VRAM to load the weights at FP16, or 72GB quantized to FP8. To figure out the KV cache size you'll need for an LLM, you can use the following formula: https://x.com/AlpinDale/status/1841305040545329535
(Edit: I accidentally swapped in some of the vision config bytes in my original calculation; these are the corrected numbers.) So, for NVLM 1.0 72B, that works out to 640KB per token assuming an FP16 KV cache. If you use the entire 32k context length, that's an extra ~20GB of overhead for the KV cache. Then, depending on how you're running the LLM, there may be extra overhead, e.g., compiled CUDA graphs.
You can cut this down by using grouped-query attention, as described here: https://medium.com/@plienhar/llm-inference-series-4-kv-cachi... GQA lets you divide that number by the group size (query heads per KV head), although it trades off some accuracy for the VRAM savings.
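As a sanity check, here's that formula in a few lines of Python. The layer/head/dim values are illustrative placeholders chosen to reproduce the ~640KB/token and ~20GB figures above, not NVLM 1.0's published config; with GQA, the saving shows up as a smaller num_kv_heads.

    # Per-token KV cache: 2 (K and V) x layers x kv_heads x head_dim x bytes/elem
    def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
        return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

    # Placeholder config, not NVLM 1.0's actual values
    per_token = kv_cache_bytes_per_token(num_layers=80, num_kv_heads=16, head_dim=128)
    print(per_token / 1024)             # 640.0 KB per token at FP16
    print(per_token * 32_768 / 2**30)   # 20.0 GB for the full 32k context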
But TL;DR: a minimum of around 164GB of VRAM at full precision. To me that seems fairly low, and I think vLLM would OOM without significantly more than that, but it's about as low as you could go in theory if you're running everything at FP16. Halve that, of course, for FP8.
You'll typically need a copy of the KV cache per GPU if you're using multiple GPUs, so multiply the KV cache overhead by the number of GPUs. How many GPUs you need depends on their specs: for example, you'd need three H100s (really four, since vLLM wants the number of attention heads to be evenly divisible by the number of GPUs); with L40Ses, you'd need eight; but most likely only a single AMD MI300X.
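Putting those numbers together as a rough tally (a sketch under the assumptions above; real deployments add more overhead on top):

    weights_gb = 72e9 * 2 / 1e9    # 72B params at 2 bytes each (FP16) -> 144 GB
    kv_gb_per_copy = 20            # full 32k context, from the estimate above
    for n_gpus in (1, 4, 8):
        # assumes one full KV cache copy per GPU, per the caveat above
        print(n_gpus, weights_gb + kv_gb_per_copy * n_gpus)
    # 1 -> 164 GB, 4 -> 224 GB, 8 -> 304 GB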
Is there an online calculator to help you find the optimal combination of number of drives, RAID level, and block size?
For example, I'm interested in setting up a new RAID-Z2 pool and would like to minimize noise and the number of writes. Should I use 4 drives or 6? Also, what would be the optimal block size(s) in this scenario?
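The raw capacity side of that trade-off is easy to sketch, though. This ignores ZFS allocation padding, which is where ashift and recordsize come in, so treat it as a rough bound only:

    # RAID-Z2 keeps two drives' worth of parity regardless of pool width
    def raidz2_usable_fraction(n_drives):
        return (n_drives - 2) / n_drives

    for n in (4, 6):
        print(n, raidz2_usable_fraction(n))   # 4 -> 0.50, 6 -> ~0.67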
Moody's Investors Service and S&P Global Ratings agreed to pay the heftiest fines, a $20 million civil penalty each. Fitch Ratings agreed to pay $8 million, A.M. Best Rating Services $1 million, HR Ratings de México, S.A. de C.V. $250,000, and Demotech $100,000.
It's 0.3% of their 2023 revenue, the equivalent of a $360 fine for someone who makes $120K a year. Not even remotely a deterrent. I'd be surprised if they even noticed it was gone.
Homomorphic encryption (HE) is a cryptographic technique that enables computation on encrypted data without revealing the underlying unencrypted data to the operating process. It provides a means for clients to send encrypted data to a server, which operates on that encrypted data and returns a result that the client can decrypt. During the execution of the request, the server itself never decrypts the original data or even has access to the decryption key. Such an approach presents new opportunities for cloud services to operate while protecting the privacy and security of a user’s data, which is obviously highly attractive for many scenarios.
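To make "computing on ciphertexts" concrete, here's a toy additively homomorphic scheme (textbook Paillier) in Python. The key size is wildly insecure and this isn't the scheme any production service uses; it's just an illustration of the property:

    import math, random

    # Toy Paillier keypair; real keys use 2048+ bit primes
    p, q = 293, 433
    n = p * q
    n2 = n * n
    g = n + 1                      # standard simple choice of generator
    lam = math.lcm(p - 1, q - 1)   # Carmichael's lambda(n)

    def L(x):
        return (x - 1) // n

    mu = pow(L(pow(g, lam, n2)), -1, n)   # modular inverse mod n

    def encrypt(m):
        r = random.randrange(1, n)
        while math.gcd(r, n) != 1:
            r = random.randrange(1, n)
        return (pow(g, m, n2) * pow(r, n, n2)) % n2

    def decrypt(c):
        return (L(pow(c, lam, n2)) * mu) % n

    # The server can add plaintexts by multiplying ciphertexts -- no key needed
    c1, c2 = encrypt(17), encrypt(25)
    assert decrypt((c1 * c2) % n2) == 42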
Have you heard of functional encryption? It allows a server to compute a function on encrypted data without knowing the decryption key and return the result in the clear, so the server learns the result and nothing else about the encrypted data is leaked. There are feasibility results, but no practical, implementable scheme exists at present for general functions; there are, however, schemes for specific functions like inner products (linear functions) and quadratics. With FHE, only the client, who holds the decryption key, can learn the result, so some interaction is needed for the server to learn it. Each has its own applications.
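For intuition, here's a toy version of the DDH-based inner-product FE scheme from Abdalla et al. (PKC 2015), again over an insecure toy group. The functional key for y lets its holder learn <x, y> in the clear and nothing else about x:

    import random

    P = 1_000_003   # toy prime; real schemes use cryptographic groups
    G = 2           # fixed base element

    def setup(n):
        msk = [random.randrange(1, P - 1) for _ in range(n)]   # master secret
        mpk = [pow(G, s, P) for s in msk]                      # h_i = g^s_i
        return msk, mpk

    def encrypt(mpk, x):
        r = random.randrange(1, P - 1)
        return pow(G, r, P), [(pow(h, r, P) * pow(G, xi, P)) % P
                              for h, xi in zip(mpk, x)]

    def keygen(msk, y):
        return sum(s * yi for s, yi in zip(msk, y))   # functional key <s, y>

    def decrypt(key_y, y, ct, max_result=10_000):
        ct0, cts = ct
        num = 1
        for c, yi in zip(cts, y):
            num = (num * pow(c, yi, P)) % P
        target = (num * pow(ct0, -key_y, P)) % P   # equals g^<x,y>
        for v in range(max_result):                # brute-force small dlog
            if pow(G, v, P) == target:
                return v
        raise ValueError("result out of range")

    msk, mpk = setup(3)
    key = keygen(msk, [2, 0, 5])
    assert decrypt(key, [2, 0, 5], encrypt(mpk, [3, 1, 4])) == 26   # <x,y>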