
C# Hotspot Parallelization by Cekirdekler API (GPGPU)(multi Device)(OpenCL) - tugrul_hbm
This API uses OpenCL to run a computation on every available device and applies a load-balancing algorithm to minimize the latency of that computation. For example, a loop such as<p><pre><code>     for(int i=0;i&lt;16384;i++)
         array[i]=expensive_function(array[i]);
</code></pre>
can be partitioned across all OpenCL-capable devices:<p><pre><code>     // use all GPUs and the CPU at the same time
     var numberCruncher = new ClNumberCruncher(AcceleratorType.GPU|AcceleratorType.CPU,
    @&quot;__kernel void acceleratedLoop(__global float *a)
       {
            int threadId=get_global_id(0);
            a[threadId]=pow(tanh(sqrt(cos(sin(a[threadId])))),0.3f);
       }&quot;);

    ClArray&lt;float&gt; buffer = array;
    buffer.compute(numberCruncher,1,&quot;acceleratedLoop&quot;,16384);
    // array now holds the values computed by 16384 work items spread across
    // different devices such as GPUs, CPUs, iGPUs and FPGAs
</code></pre>
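
The post does not say how the load balancer decides each device's share. As a rough illustration only, here is a minimal sketch in plain C (not the library's C# API, and not necessarily what Cekirdekler does internally) of one common approach: split the work items in proportion to each device's measured throughput, rounded to a multiple of a fixed step. The function name `balance_shares`, its parameters, and the step size are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of Cekirdekler: split `total` work items
   among `n` devices in proportion to each device's measured throughput
   (work items per unit of time). Each share is rounded down to a multiple
   of `step` (for example a work-group size); the leftover goes to the
   fastest device. */
void balance_shares(const double *throughput, size_t n,
                    size_t total, size_t step, size_t *shares)
{
    assert(n > 0 && step > 0 && total % step == 0);

    double sum = 0.0;
    size_t fastest = 0;
    for (size_t i = 0; i < n; i++) {
        sum += throughput[i];
        if (throughput[i] > throughput[fastest])
            fastest = i;
    }
    assert(sum > 0.0);

    size_t assigned = 0;
    for (size_t i = 0; i < n; i++) {
        size_t ideal = (size_t)((double)total * (throughput[i] / sum));
        shares[i] = ideal - ideal % step;   /* round down to a step multiple */
        assigned += shares[i];
    }
    shares[fastest] += total - assigned;    /* leftover is a multiple of step */
}
```

With two devices whose measured throughputs are in a 3:1 ratio, 16384 work items and a step of 256, this gives shares of 12288 and 4096. Re-measuring the throughputs after every run and calling the helper again would let the split follow the devices' actual speed.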
You can view a quick tutorial and download binaries (for lazy developers) here:<p>https://www.codeproject.com/Articles/1181213/Easy-OpenCL-Multiple-Device-Load-Balancing-and-Pip<p>If you want to build the source yourself and read a detailed wiki:<p>https://github.com/tugrul512bit/Cekirdekler/wiki
======
tugrul_hbm
It also does pipelining if enabled. This reduces the array access overhead, or
even hides it completely under ideal conditions.
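
To see why pipelining can hide the array access cost, here is a small timing model in plain C. It is only a sketch under assumptions of mine, not library code: the work is cut into equal chunks, each chunk goes through three stages (copy in, compute, copy out), the stages can overlap across chunks, and each stage handles one chunk at a time. The names `serial_time` and `pipelined_time` are hypothetical.

```c
#include <assert.h>

/* Hypothetical timing model, not Cekirdekler code.
   rd, cmp, wr: time per chunk for copy-in, compute and copy-out. */

/* Without pipelining every chunk pays for all three stages in turn. */
double serial_time(int chunks, double rd, double cmp, double wr)
{
    assert(chunks > 0);
    return chunks * (rd + cmp + wr);
}

/* With pipelining the first chunk passes through all three stages, and
   each later chunk finishes one slowest-stage interval after the
   previous one. */
double pipelined_time(int chunks, double rd, double cmp, double wr)
{
    assert(chunks > 0);
    double slowest = rd;
    if (cmp > slowest) slowest = cmp;
    if (wr > slowest) slowest = wr;
    return (rd + cmp + wr) + (chunks - 1) * slowest;
}
```

For 4 chunks with copy-in 1, compute 2 and copy-out 1 time units, the model gives 16 units without pipelining and 10 with it. When compute is the slowest stage and there are many chunks, the total approaches the pure compute time, which is the "hides it completely" case.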

