Isn't the whole point of SIMD to be as similar to the original x86 instructions as possible, reusing as much of the existing CPU as possible? Otherwise you would have something like the PS3?
Yes and no. SIMD (Single Instruction, Multiple Data) as a concept has nothing to do with x86. It's basically just the idea of vectorizing code, applying one instruction to several data elements at once, and it is used on many platforms.
The x86 SIMD extensions such as SSE and AVX, on the other hand, aim to integrate that concept with x86 and are therefore pretty similar.
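To make the "one instruction, several elements" idea concrete, here is a minimal sketch in C. The function names add_scalar and add_simd are made up for illustration. The SSE path uses the standard intrinsics from xmmintrin.h and is only compiled when the compiler defines __SSE__; on anything else it falls back to the plain loop, so treat it as a rough illustration rather than tuned code.

```c
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */
#endif

/* Scalar version: one float addition per loop iteration. */
void add_scalar(const float *a, const float *b, float *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}

/* SIMD version: four float additions per instruction where SSE is available.
 * add_simd is a hypothetical name used only for this sketch. */
void add_simd(const float *a, const float *b, float *out, size_t n) {
    size_t i = 0;
#if defined(__SSE__)
    /* Process full groups of four floats; unaligned loads/stores keep it simple. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    /* Scalar tail (or the whole array when SSE is not available). */
    for (; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}
```

Both functions should produce the same results; the only difference is how many elements each instruction handles. The same pattern exists on other platforms with different intrinsics, which is the sense in which SIMD itself is not tied to x86.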
If you care about latency, a modern x86 with 8 or more cores, with its per-core L1/L2 caches and a shared but slower L3 cache, is almost as complex. It becomes even more complex if you use the CPU topology to make inferences about hyperthreading and shared caches, or need to deal with the shared FPU on older AMD processors.
My understanding is that the largest difference is that some of the Cell cores had different opcodes, which meant you could schedule some threads on some cores, but not any thread on any core.
I have written quite a bit of SPE code. The primary issue is that an SPE could only read from and write to its own 256 kB of local memory (without doing a DMA). So object-oriented code literally doesn't even work (because of vtables). The C/C++ model is not designed for this type of architecture. Yes, there were also limitations like vector-only registers and memory alignment, but the biggest issue was the local memory.
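Roughly, the programming model looks like the sketch below. This is plain C that only simulates the idea, not actual SPE code: dma_get and dma_put are hypothetical stand-ins implemented with memcpy, CHUNK_FLOATS is an arbitrary chunk size picked for illustration, and local_buf stands in for a slice of the 256 kB local store. The point is just that the compute loop never touches main memory directly; everything has to be staged in and out explicitly.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LOCAL_STORE_BYTES (256 * 1024)  /* local memory size mentioned above */
#define CHUNK_FLOATS 4096               /* hypothetical chunk size for this sketch */

/* Stand-in for a slice of the local store. On the real hardware, code and
 * stack have to fit in the same 256 kB as well. */
static float local_buf[CHUNK_FLOATS];

/* Hypothetical DMA stand-ins: here they are just memcpy. */
static void dma_get(void *local, const void *main_mem, size_t bytes) {
    memcpy(local, main_mem, bytes);
}

static void dma_put(void *main_mem, const void *local, size_t bytes) {
    memcpy(main_mem, local, bytes);
}

/* Scale an array that lives in "main memory" by staging it through the
 * local buffer one chunk at a time. */
void scale_in_main_memory(float *data, size_t n, float factor) {
    assert(sizeof local_buf <= LOCAL_STORE_BYTES);
    for (size_t off = 0; off < n; off += CHUNK_FLOATS) {
        size_t count = (n - off < CHUNK_FLOATS) ? (n - off) : CHUNK_FLOATS;
        dma_get(local_buf, data + off, count * sizeof(float));   /* copy in  */
        for (size_t i = 0; i < count; i++) {
            local_buf[i] *= factor;                              /* compute on local data only */
        }
        dma_put(data + off, local_buf, count * sizeof(float));   /* copy out */
    }
}
```

A flat array like this is the easy case. Anything built from pointers into main memory (linked structures, objects with virtual functions) has no such simple chunking, which is where ordinary C/C++ code stops fitting the model.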