> 1) FP8 half-precision training: NVidia is artificially disabling this feature in consumer GPUs to charge more for Tesla / Volta.

No, the hardware for this is physically not present on consumer chips. You can't subdivide the ALUs like that even on Tesla P5000 cards. Of course you can promote FP8 to FP32 on any card without an issue, but you don't gain any performance by doing so.
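To illustrate the promotion point, here is a rough sketch in Python. It assumes FP16 as a stand-in for the narrow format (the standard library has no FP8 type) and uses Python's 64-bit `float` as a stand-in for the wider FP32 registers. The helper names `to_half_bytes` and `promote` are made up for this example.

```python
import struct

def to_half_bytes(values):
    """Store values as IEEE 754 binary16, 2 bytes each (little-endian)."""
    return struct.pack(f"<{len(values)}e", *values)

def promote(half_bytes):
    """Widen the stored halves back to full-width floats before any math."""
    count = len(half_bytes) // 2
    return list(struct.unpack(f"<{count}e", half_bytes))

# Values chosen to be exactly representable in FP16.
weights = [0.5, 1.5, -2.25, 1024.0]

stored = to_half_bytes(weights)   # narrow format: only storage shrinks
widened = promote(stored)         # arithmetic below runs at full width

# One multiply-add per element, same as if the data had never been narrow.
dot = sum(w * w for w in widened)
print(len(stored), "bytes stored,", len(widened), "full-width operations")
```

The narrow format halves the storage, but the number of arithmetic operations is unchanged, which is why promotion alone buys no speed.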

At the time Pascal was designed, it didn't make sense to spend die space on FP16 support, let alone FP8, since games were purely FP32. This is changing now that Vega has FP16 capability ("Rapid Packed Math") and titles may start using it where appropriate. I would not be at all surprised to see it in Volta gaming cards.

It's funny, everything old is new again. Someone comes up with this idea about once every 10 years. Using FP16 or FP24 used to be big back in the DX9 days.

> 2) A licensed / clone of AMD SSG technology to give massive on-GPU memory: NVidia's 12 GB memory is not sufficient for anything beyond thumbnail or VGA sized images.

You're looking for NVIDIA GPUDirect Peer-to-Peer, which has existed since like 2011.

https://developer.nvidia.com/gpudirect

AMD's product is actually pure marketing hype. It's simply a card that contains a PLX chip to interface with an NVMe SSD. This is the same technology used for multi-GPU cards like the Titan Z or 295x2, and it offers no performance advantage over a regular NVMe SSD sitting in the next PCIe slot over.

This is something that people didn't know they wanted until AMD told them they wanted it. But you can do this on any GeForce card even, no need to shell out $7000 for some crazy custom card that doesn't even run CUDA.

The bigger problem is that there really isn't much of a use case for it. NVMe runs at about 4 GB/s, which is painfully short of the ~500 GB/s that GPU memory normally runs at. It is even significantly less bandwidth than host memory can provide (a PCIe 3.0 x16 link limits you to 16 GB/s of transfers regardless of whether the data comes from NVMe or host memory).
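To put those numbers side by side, here is a back-of-the-envelope sketch. The bandwidth figures are the round numbers quoted above, and the 12 GB working set is a hypothetical size picked to match the card memory mentioned in the question. Real transfers will be slower once latency and protocol overhead are counted.

```python
def transfer_seconds(size_gb, bandwidth_gb_per_s):
    """Ideal time to stream size_gb at a sustained bandwidth, ignoring latency."""
    return size_gb / bandwidth_gb_per_s

# Round figures taken from the discussion above, in GB/s.
LINKS_GB_PER_S = {
    "NVMe SSD": 4.0,
    "PCIe 3.0 x16": 16.0,
    "GPU local memory": 500.0,
}

DATASET_GB = 12.0  # hypothetical working set

times = {name: transfer_seconds(DATASET_GB, bw) for name, bw in LINKS_GB_PER_S.items()}

for name, seconds in times.items():
    print(f"{name:>18}: {seconds:.3f} s to stream {DATASET_GB:.0f} GB")
```

Under these assumptions, every pass over data that lives on the SSD costs on the order of a hundred times more than the same pass over data already in GPU memory.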