First, thanks for writing this up. Too many people just take a “buy the box, divide by the number of hours in 3 years” approach. Comparing the hardware against a 3-year RI at AWS is thus fairer than most. You’re still missing a lot of the opportunity cost (both capital and human), scaling (each of these boxes probably draws 3 kW, and most electrical systems couldn’t handle, say, 20 of them), and so on.
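To make the gap concrete, here is a rough sketch of the naive calculation next to one that folds in the items above. The function names, the no-compounding capital charge, and the assumption that the box draws its full 3 kW around the clock are all mine, not from the article; every input is a placeholder you would fill in with your own numbers.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap days


def naive_hourly_cost(purchase_price, years=3):
    """'Buy the box, divide by hours': purchase price spread over every hour."""
    return purchase_price / (years * HOURS_PER_YEAR)


def loaded_hourly_cost(purchase_price, years=3, power_kw=3.0,
                       price_per_kwh=0.0, annual_capital_rate=0.0,
                       annual_ops_cost=0.0):
    """Hypothetical fuller estimate; all rates are caller-supplied assumptions."""
    hours = years * HOURS_PER_YEAR
    # Crude opportunity cost of capital: simple interest, no compounding.
    capital = purchase_price * annual_capital_rate * years
    # Human cost of racking, patching, and babysitting the box.
    ops = annual_ops_cost * years
    # Assumes the box pulls power_kw continuously (worst case).
    power_per_hour = power_kw * price_per_kwh
    return (purchase_price + capital + ops) / hours + power_per_hour
```

With every extra input left at zero, the two functions return the same number; the difference between them is exactly the cost the naive approach leaves out.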
That said, I don’t agree that 3 years is a reasonable depreciation period for GPUs for deep learning (the focus of this analysis). If you had purchased a box full of P100s before the V100 came out, you’d have regretted it. Not just in straight price/performance, but also operator time: a 2x speedup on training also yields faster time-to-market and/or more productive deep learning engineers (expensive!).
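A back-of-the-envelope way to see why the speedup matters beyond price/performance, as a sketch only: the function and its parameters (including `engineer_wait_fraction`, the share of a job's wall-clock time an engineer is effectively blocked on it) are hypothetical names I made up for illustration.

```python
def cost_per_job(hardware_hourly, job_hours, speedup=1.0,
                 engineer_hourly=0.0, engineer_wait_fraction=0.0):
    """Cost of one training job, counting hardware and blocked engineer time.

    job_hours is the wall-clock time on the baseline part; speedup scales
    it down for a faster part. All inputs are caller-supplied assumptions.
    """
    wall_clock_hours = job_hours / speedup
    hourly = hardware_hourly + engineer_hourly * engineer_wait_fraction
    return wall_clock_hours * hourly
```

On hardware cost alone, a part that is 2x faster breaks even at 2x the hourly price. As soon as `engineer_hourly` is nonzero, the faster part wins even at that premium, because the engineer's blocked time is halved too.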
People still use K80s and P100s for their relative price/performance on generic FP64 and FP32 math (V100s come at a high premium for ML, and NVIDIA knows it), but for most deep learning you’d be making a big mistake. Even for FP32 workloads, a part with more memory or higher memory bandwidth is enough to make you regret a 36-month replacement plan.
If you really do want to do that, I’d recommend you buy them the day they come out (AWS launched V100s in October 2017, so we’re already 16 months in) to minimize the refresh regret.
tl;dr: unless the V100 is the perfect sweet spot in ML land for the next three years or so, a 3-year RI or a physical box will decline in utility.