Opera Mobile works, but sadly it's only available for Android. It's the only mobile browser that can handle full desktop sites without excessive horizontal scrolling, most of the time well enough that the desktop versions become preferable to the "mobile" versions. Not to be confused with the rather useless Opera Mini.
Intel CPUs resemble GPUs more and more over time. I think only scatter, the GPU-style ultra-slow (high-latency) but wide memory interface, and texture lookup are missing from Skylake (Xeon).
Gather was already added in Haswell, although it performs badly so far.
Skylake (Xeon, AVX-512) handles 16-float-wide vectors (512 bits) and can dual-issue per clock, bringing the effective rate to 32. That's definitely comparable to modern GPUs.
Wasn't an Nvidia warp just 16 floats wide per clock cycle? Or 32? For comparison, the high-end Nvidia GTX 980 has only 16 such SIMD execution cores. However, Nvidia counts those 16 cores as 2048 in its marketing literature.
I do wonder if Intel is planning to unify CPU and GPU in 10 years or less. Things sure seem to be moving that way.
If Intel can add significant amounts of eDRAM in package, x86 CPUs aren't that far from being capable of handling GPU duties as well.
Okay, so how this works is: you have a processor that is 16 scalar cores wide. Each scalar core is really just an out-of-order scheduler for 32 in-order, pipelined, boring ALUs. These ALUs can each execute the same instruction together, giving you the illusion that the scalar core is doing vector processing.
The reality is far weirder. E.g.: if you encounter a branch, the scalar processor can, and will, execute both branches on different ALUs and execute the branch statement on another, allowing a 10-instruction section of code to run in roughly 3 instructions' time. Try doing that with a vector processor.
Technically, in CUDA you can schedule each ALU itself, hence the marketing numbers.
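To make the mechanism concrete, here's a toy scalar model of the predication described above (all names are mine, and real hardware predicates at the instruction level, not with Python lists): every lane evaluates both sides of the branch, and a per-lane mask selects which result survives.

```python
# Toy model of SIMT-style branch handling: every lane runs both sides
# of the branch; a per-lane mask picks the surviving result.
# Illustrative sketch only, not how any specific GPU schedules work.

def simt_branch(lanes, cond, then_fn, else_fn):
    mask = [cond(x) for x in lanes]          # evaluate the branch per lane
    then_vals = [then_fn(x) for x in lanes]  # all lanes run the "if" side
    else_vals = [else_fn(x) for x in lanes]  # all lanes run the "else" side
    # per-lane select: keep the then-result where the mask is true
    return [t if m else e for m, t, e in zip(mask, then_vals, else_vals)]

result = simt_branch([1, -2, 3, -4],
                     cond=lambda x: x >= 0,
                     then_fn=lambda x: x * 10,
                     else_fn=lambda x: -x)
print(result)  # [10, 2, 30, 4]
```

Note that both sides always execute; divergence costs you the lanes whose results get thrown away.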
> The reality is far weirder. E.g.: if you encounter a branch, the scalar processor can, and will, execute both branches on different ALUs
That's not so different from x86 SSE/AVX. You'd execute both sides of the branch (dual issue) and blend / mask away the results you don't want. This is typically much faster than a data-dependent, unpredictable branch.
Another way is to SIMD-sort the data according to some criterion into different registers and process them separately. This completely sidesteps executing both sides of the branch, although some computational resources are still wasted.
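The mask-and-blend pattern looks like this as a scalar sketch (one loop iteration stands in for one vector lane; on real hardware this would be a compare plus andps/andnps/orps, or a single blend instruction):

```python
# Branchless select the way SSE does it: compute both sides for every
# element, build an all-ones/all-zeros mask per lane, then blend with
# AND/ANDN/OR. Scalar model of the vector idiom, not actual SIMD code.

LANE = 0xFFFFFFFF  # all-ones mask for a 32-bit lane

def blend(a, b, cond):
    """result[i] = a[i] if cond[i] else b[i], with no per-element branch on data."""
    out = []
    for x, y, c in zip(a, b, cond):
        mask = LANE if c else 0                      # cmpps-style mask
        out.append((x & mask) | (y & ~mask & LANE))  # andps / andnps / orps
    return out

print(blend([1, 2, 3, 4], [10, 20, 30, 40], [True, False, True, False]))
# [1, 20, 3, 40]
```

Both input vectors are fully computed before the blend; the mask only decides which lane's value survives.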
> That's not so different from x86 SSE/AVX. You'd execute both sides of the branch (dual issue) and blend / mask away the results you don't want.
What you're talking about is how x86_64 processors can optimize away some branches, which they do with the cmov instruction. This has nothing to do with SSE/AVX. It's common to confuse this because Intel says the branches are executed in parallel (and they often are), just as much in parallel as the OoO pipeline allows, which is actually quite a lot.
Both sides of the branch are pre-computed, then the branch condition is computed. But its output is sent to a cmov, which just reassigns a register, instead of a jmp into a branch. This avoids pipeline flushes. cmov isn't perfect, still costing ~10 cycles, but compared to the ~100 of a pipeline flush it's still cheaper.
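As a sketch, the two code shapes being contrasted look like this (in Python for illustration; it's the second, select-style shape that a C compiler typically turns into cmov — the cycle counts above are the commenter's, not measured here):

```python
# Branch-shaped vs. cmov-shaped code. Both compute max(a, b); the
# difference is whether control flow or a data-dependent select decides.

def branchy_max(a, b):
    if a >= b:        # an unpredictable jump on random data risks a flush
        return a
    return b

def select_max(a, b):
    t = a             # both candidate values are precomputed, "in registers"
    f = b
    return t if a >= b else f   # single data-dependent select, cmov-shaped

assert branchy_max(3, 9) == select_max(3, 9) == 9
```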
Provided the same operations are being done on both branches, SSE/AVX can be used, since both branches are just values, and that is literally what vector processors are good at. The chain will end with a cmov.
It has absolutely nothing to do with CMOV. I'm talking about computing, say, 16 results in parallel in a SIMD register, for both sides of an "if" statement, then masking the unwanted results out. SSE/AVX can simulate "CMOV", but for 128/256/512-bit-wide vectors.
To make it even clearer: there's not a single CMOV in my code, anywhere. The data usually doesn't even touch general-purpose (scalar) registers, because that would totally destroy the performance.
What you are talking about is how things were done until 1997-1999 or so. SSE in 1999 and especially SSE2 in 2001 radically changed the way you compute with x86 CPUs.
I'm talking about things like vpcmpw (compare 8/16/32 16-bit integers and store the result as a mask), vpcompressd (compress dwords/floats according to a mask, for example for SIMD-"sorting" the if and else inputs separately), vpblendmd (blend-type combining; this example is for int32), and vmovdqu16 (for selectively moving according to a mask).
You can do most operations on 8-, 16-, 32-, and 64-bit unsigned and signed integers, and of course 32-bit and 64-bit floats. Some restrictions apply, especially to 8- and 16-bit operands. When appropriate, it's kind of cool to process 64 bytes in one instruction. :)
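The compress operation is the least obvious of these, so here's a scalar model of what vpcompressd does per register (the instruction names are Intel's; this Python sketch is mine): elements whose mask bit is set get packed contiguously, in order, to the front of the destination.

```python
# Scalar model of AVX-512 vpcompressd: pack the masked-in elements of a
# vector contiguously. The same pattern "SIMD-sorts" the if-inputs and
# else-inputs into separate registers, as described above.

def compress(src, mask):
    """Return the masked-in elements packed to the front, vpcompressd-style."""
    return [x for x, m in zip(src, mask) if m]

data = [5, -1, 7, -3, 9, -6]
mask = [x >= 0 for x in data]
if_inputs   = compress(data, mask)                   # lanes for the "if" side
else_inputs = compress(data, [not m for m in mask])  # lanes for the "else" side
print(if_inputs, else_inputs)  # [5, 7, 9] [-1, -3, -6]
```

Each side can then be processed with its own straight-line SIMD code, with no wasted lanes beyond the ragged tail.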
GPUs have evolved at about the same pace. Nvidia's Kepler architecture has a vector length of 192 (single precision) per core and up to 15 of these cores on one chip.
The question really is: do you optimize the chip for heavily data-parallel problems, saving overhead on schedulers and having a very wide memory bus, or do you optimize for single-threaded performance of independent threads and give it some data parallelism (Xeon)? As a programmer, when you're actually dealing with data-parallel programs, doing so efficiently on a GPU is actually quite a bit easier, since you have one less level of parallelism to deal with.
I think we're mixing up terminologies here. One SMX operates on up to 192 values in parallel (Nvidia calls this 192 "threads" per SMX). "Functional units" AFAIK is only used for the "special function units", which aren't relevant for this discussion. One SMX has 6 warp schedulers, but I'm not sure how independently these can operate. My guess is that branch divergence will only NOP out one whole warp, but I'm not sure whether the warps can enter different routines or even kernels (my guess is yes for routines, no for kernels).
So the different functional units (this has a specific meaning in hardware design) are 32 wide, and indeed, if the instructions to be executed can utilize all 6 of them at the same time, the SMX will operate on 192 values. But that won't be the case if you only need to execute a large number of double-precision floating-point operations.
Holy crap, that's the reason this algorithm isn't all over the place? That seriously irks me, and this one seems like a great patent to target for either a Tesla-esque giveaway or just invalidation (EFF, anybody?).
Smart generic caching could do so much for the global infrastructure with all the data we move around these days.
While the patent may be annoying, in practice there are a number of other cache replacement algorithms and mechanisms that are just as effective as ARC. The tradeoffs inherent in cache replacement algorithms mean there are numerous ways of achieving approximately equivalent results. ARC is not a globally optimal design or anything like that.
In that sense, the patent is kind of worthless. Anyone who knows what they are doing can design around it with no loss of functionality or efficiency. A lot of sophisticated systems end up designing custom algorithms tuned for the use case anyway.
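For a sense of how little machinery a workable, long-unencumbered replacement policy needs, here's a minimal LRU cache (ARC's contribution is adaptively balancing recency against frequency on top of two LRU-like lists; plain LRU predates the patent by decades):

```python
from collections import OrderedDict

# Minimal LRU cache: evict the least-recently-used entry when full.
# A sketch of one of the many non-ARC policies mentioned above, not a
# drop-in ARC replacement.

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)      # touching a key makes it most recent
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recent entry

c = LRUCache(2)
c.put("a", 1)
c.put("b", 2)
c.get("a")       # "a" is now most recently used
c.put("c", 3)    # evicts "b", the least recently used
print(c.get("b"), c.get("a"))  # None 1
```

LRU's known weakness, which ARC addresses, is that a single large scan flushes the whole cache.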
> with Haswell's mediocre improvements Intel has really not come up with anything substantial in years
Unlocking Haswell's performance (up to 2x integer/FPU throughput) requires using AVX2 instructions. That means at least recompiling, and, to truly extract the performance, optimizing with AVX2 intrinsics or assembly.
I've been using Linux as my primary desktop nearly continuously since 1999, and I'm typing this in Linux right now, but for what it's worth, you can now mount volumes in Windows as a path on another volume.
The difference might seem subtle, but on the Amiga you could refer to volumes by "volume name:", like WorkDisk:. Then whenever anything referred to, say, WorkDisk:dir/file, the system would ask you to insert this volume, "WorkDisk".
So mounting a filesystem to a directory is not the same. When directories are mounted, you need to provide the volume first, and only then can you refer to it. On the Amiga you could name some USB stick "Important Files:", and whenever you copied or saved anything to the drive "Important Files:", the system would ask you to provide that named volume.