Hi HN, we’re a small team of OS, virtualization, and ML engineers, and after three years of development we’re thrilled to launch the beta of our CUDA abstraction layer!

We decouple kernel/shader execution from the applications that use CUDA by introducing a Wooly abstraction layer. Applications are compiled into a new binary in which the GPU kernels (shaders) are compiled to a Wooly Instruction Set. At runtime, each kernel launch event triggers a transfer of the shader over the network from the CPU host to a GPU host, where it is recompiled for the target hardware. On the GPU host, execution is managed by the Wooly Server software, which converts the work for the respective GPU vendor's runtime and drivers while aiming to maximize GPU resource utilization, isolate workloads from one another, and stay cross-compatible across hardware vendors. In principle, the Wooly abstraction layer plays a role similar to an operating system: it sits on top of the hardware and enables efficient, reliable execution of multiple workloads.
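To give a flavor of the "a launch event initiates a network transfer" step, here's a minimal sketch of one classic way to intercept CUDA launches on the CPU host: an LD_PRELOAD shim over the CUDA runtime API. This is illustrative only, not our actual code; wooly_forward_launch is a hypothetical placeholder for the serialize-and-ship-over-the-network path.

    /* Illustrative LD_PRELOAD shim (not our actual implementation).
       Build as a shared library and preload it so this definition of
       cudaLaunchKernel shadows the CUDA runtime's. */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <cuda_runtime.h>

    /* Hypothetical helper: packages the kernel and its arguments and
       sends them to a remote GPU host for recompilation and execution. */
    extern cudaError_t wooly_forward_launch(const void *func, dim3 grid,
                                            dim3 block, void **args,
                                            size_t shmem, cudaStream_t stream);

    cudaError_t cudaLaunchKernel(const void *func, dim3 gridDim, dim3 blockDim,
                                 void **args, size_t sharedMem,
                                 cudaStream_t stream) {
        /* Forward the launch event over the network instead of
           executing on a local GPU. */
        cudaError_t err = wooly_forward_launch(func, gridDim, blockDim,
                                               args, sharedMem, stream);
        if (err != cudaErrorNotSupported)
            return err;

        /* Fall back to the real local runtime if forwarding is
           unavailable. */
        typedef cudaError_t (*launch_fn)(const void *, dim3, dim3, void **,
                                         size_t, cudaStream_t);
        launch_fn real = (launch_fn)dlsym(RTLD_NEXT, "cudaLaunchKernel");
        return real(func, gridDim, blockDim, args, sharedMem, stream);
    }

In our case the shaders themselves are already compiled to the Wooly Instruction Set ahead of time, so what crosses the network is the Wooly binary rather than vendor-specific machine code.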
On top of this abstraction layer we built a GPU cloud service (WoolyAi Acceleration Service) with "Actual GPU Resources Used" billing, NOT "GPU Time Used" billing. For example, a job that holds a GPU for an hour but averages 20% utilization pays for the resources it actually consumed, not for the full hour.
Looking forward to getting your feedback and comments.