Can somebody explain to me why they wouldn't just do this on the GPU? Isn't the GPU already designed to perform "hardware-accelerated image processing"?
For the same reason Google's TPU was an order of magnitude more efficient than Nvidia's previous "pure" GPUs, or why video codec accelerators are also faster and more efficient than GPUs for video decoding. The GPU is still pretty "general purpose" compared to a chip that only does image processing and not much else.
I hope Google will eventually reveal if the IPU shares any DNA with the TPU at all.
These developments are quite funny. At one point GPUs were just fixed function hardware. As more flexibility was needed for novel applications, programmability was added.
Now we are going back to fixed-function units as the calculus has changed in favour of power savings and against flexibility.
I'm interested in seeing what advance in tech will change it again in favour of flexibility.
Maybe because we are mixing up concepts and names? You are comparing GPUs that were in tower computers with GPUs that are now in mobile devices. Like you said, mobile requires energy efficiency, so having a fixed-function chip makes sense right now, just like flexibility made sense for tower computers back then.
> the calculus has changed in favour of power savings and against flexibility
This has always been a trade-off. For instance, even with desktops, once video handling became common we first had video decode/encode handled by GPU acceleration. But now most CPUs include dedicated h264 (and more recently HEVC) decode/encode support, e.g. [0].
Although it's much lower computation, hardware acceleration is also offered for audio, and I'd guess the Apple APIs for checking which hardware supports hardware-based audio encode/decode [1] have been deprecated because all supported devices now provide all possible capabilities.
In the case of audio, it might be the opposite: even with a modern audio codec like Opus, an Apple CPU can decode an entire song into RAM in a fraction of a second. At that point, the CPU has to wake up anyway to download another song from Spotify or read one from disk.
As I understand it, the main advantage of the TPU is that it is built around an 8-bit pipeline. It's counter-intuitive to me, but apparently many deep-learning NNs can be quantized down to layers whose weights take only 256 distinct values and still perform really well.
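To make the "256 states" concrete, here's a minimal sketch of affine 8-bit quantization, the general kind of trick that lets an 8-bit pipeline run layers trained in float. All names and the exact scheme here are illustrative, not Google's actual implementation:

```python
# Sketch of affine 8-bit quantization: map float weights onto
# 256 integer states, then recover approximate floats. Purely
# illustrative; real schemes differ in detail (per-channel
# scales, zero-points chosen to represent 0.0 exactly, etc.).

def quantize(weights, num_levels=256):
    """Map float weights onto num_levels integer states."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / (num_levels - 1) or 1.0  # guard: all-equal weights
    q = [round((w - lo) / scale) for w in weights]  # ints in [0, 255]
    return q, scale, lo

def dequantize(q, scale, lo):
    """Recover approximate float weights from the integer states."""
    return [v * scale + lo for v in q]

weights = [-1.2, -0.3, 0.0, 0.7, 1.5]
q, scale, lo = quantize(weights)
restored = dequantize(q, scale, lo)

# The round trip loses at most half a quantization step per weight,
# which is the error budget the network has to tolerate.
max_err = max(abs(w - r) for w, r in zip(weights, restored))
assert max_err <= scale / 2 + 1e-9
```

The surprising empirical result is that this roughly 1% relative error per weight barely moves inference accuracy for many networks, while letting the hardware use 8-bit multipliers that are far smaller and cheaper than float units.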
That's one part but doing sequencing in hardware is another big factor. And I expect they're playing tricks to reduce the read pressure on their register files.
Do you have an external source for this? A lot of analysts say the TPU is mostly to keep negotiating power against Nvidia, and that it's not really especially powerful or efficient.
I'm a little bit confused too - all the operations they describe are usually quite fast on a GPU.
I wonder if it's not so much about being better at image processing, but having control and direct access to the hardware. For the Adreno, you have to go through Qualcomm's drivers and are subject to their limitations (there is freedreno which works great, but I don't think Qualcomm allows it on handsets). You're also stuck with the sizes of GPU that are available on Snapdragon SoCs - if you want a larger one, too bad.
Most SoCs have some kind of dedicated image processor these days as well as a GPU; even ARM has got in on the game with their own hardware designs. Unfortunately they tend to be pretty proprietary, so not much good if you want to do your own image processing. As far as I can tell, the image processors traditionally have a more DSP-like architecture (small local buffers, instruction set optimised for efficient data transfer and processing, etc.), but since they're proprietary it's hard to say for sure. Supposedly they're more power efficient than the alternatives.
(Which isn't surprising; GPUs are really designed for 3D rendering and they're definitely overkill for simpler tasks. The general assumption is that you're fetching small locally-contiguous groups of pixels from main RAM, doing texture mapping and computations on them, then conditionally blitting the result to other locally-contiguous areas of RAM. Most of the infrastructure used for this is going to waste if you're just using it to do 2D image processing.)
Indeed, several good examples of these DSPs are the Qualcomm Hexagon and the VideoCore IV, the latter of which has a 64x64 register bank, and is largely reverse engineered so you can actually figure out how it works [1]. They are really good for highly serial per-block operations, such as video encoding and decoding. They would also work OK for stuff like large convolutions, which is what I kind of imagine the IPUs are for, but not really significantly better than a GPU (the low serial latency is wasted).
I did find the original source of the Ars article [2], and it says that each IPU core has 512 ALUs. This seems more like an extremely wide GPU than the VPU. It also seems to be programmable in Halide [3], which presents a more "SIMT"-like interface, just like a GPU shader. It probably lacks the super-fast bilinear texture mapping units that a GPU has, but otherwise seems very similar. It'd be interesting to know whether the texture/pixel cache is handled automatically, or manually with a large register file like the VPU's.
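For anyone unfamiliar with the "SIMT"-like model: the idea is that you write one pure function describing a single output pixel, and the compiler/runtime maps it across the whole image, like a fragment shader. A toy Python sketch of that programming style (this is an illustration of the model, not actual Halide or the IPU API):

```python
# Toy sketch of the shader/Halide-style programming model: describe
# one output pixel as a function of its coordinates, and let a
# generic runner map it over the image. Illustrative only.

def run_kernel(kernel, width, height):
    """Apply a per-pixel kernel at every (x, y), shader-style."""
    return [[kernel(x, y) for x in range(width)] for y in range(height)]

def make_blur(src):
    """3x3 box blur over src, clamping reads at the image borders."""
    h, w = len(src), len(src[0])
    def blur(x, y):
        total = 0
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                cx = min(max(x + dx, 0), w - 1)  # clamp to valid columns
                cy = min(max(y + dy, 0), h - 1)  # clamp to valid rows
                total += src[cy][cx]
        return total // 9
    return blur

image = [[(x * y) % 256 for x in range(8)] for y in range(8)]
blurred = run_kernel(make_blur(image), 8, 8)
```

The point of expressing pipelines this way is that every pixel is independent, so hundreds of ALUs (512 per core, per the article) can each evaluate the same kernel on different coordinates in lockstep; Halide additionally separates this algorithm description from the schedule (tiling, vectorization) that maps it onto the hardware.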
The article also features this quote, which also seems to back Google being unsatisfied with only high level access to the GPU:
>A key ingredient to the IPU’s efficiency is the tight coupling of hardware and software—our software controls many more details of the hardware than in a typical processor.
It would be really awesome if they opened this chip up to third party developers. Unfortunately the press release only mentions availability in the Camera API so far, relegating it to the same opaque blob status as all the other dedicated image processors :(
I was kinda feeling dumb reading this... SoC is "System On A Chip"... "A system on a chip or system on chip (SoC or SOC) is an integrated circuit (also known as an "IC" or "chip") that integrates all components of a computer or other electronic systems. It may contain digital, analog, mixed-signal, and often radio-frequency functions—all on a single substrate."
Just want to add my thanks to the growing list. Kind of surprising that they didn't follow the best practices for abbreviations and have "SoC (System on a Chip)" at the first instance of the abbreviation.
Thanks. I was confused as well... most articles would have started off with the term and then used the acronym... i knew the context of what it was, but not knowing what it actually stood for was driving me nuts...
It's only useful as a persistent exploit vector if it has persistence. If I were designing this, it'd be simpler to have it boot off an image downloaded from the main SoC rather than giving it its own flash.
Good point - if it's just got firmware uploaded from the SoC, it's more of an escalation / memory protection bypass vector versus a persistence vector. I also wonder what gets sent across to it - for example, if a malicious video from the web could land on the coprocessor.
Mostly it's just fun seeing more and more co-processing devices arrive in phones as they get broken :)
> Google says the Pixel Visual Core is designed "to handle the most challenging imaging and machine learning applications"
I suspect they hope it will be the latter rather than the former. Design models at HQ, then distribute them to be trained on each user's data on their own phone. That means you need to use less bandwidth to transfer training data and models, which leaves more available to serve ads!
I suspect you are being slightly tongue-in-cheek about the bandwidth for ads (yay 5G ;-).) With that said, I think a key benefit of on-device learning is the privacy angle. The Pixel 2's always-on song identification is said to be done on the phone without your audio data being sent to Google. Similarly, the Google Clips camera apparently does its magic without sending image data to Google. With devices having less to differentiate on, privacy is increasingly visible in the marketing arena, particularly for devices that seem to be watching or listening at all times.
Also, what if you replace the ML model with something more nefarious? Like have the model classify things incorrectly. Or if a target face is in the picture then send a request to a server of your choosing.
So maybe Google wants to build a device that needs image processing, but the (initial) volume can't justify a full-on custom ASIC. (And for some other reason -- power, space, ... -- this hypothetical device can't use an FPGA.)
Maybe find a device that does have a custom ASIC (e.g. Pixel 2) and add the image processing functionality to that ASIC. Then perhaps use the same custom ASIC for both devices. As long as the extra functionality doesn't increase the cost of the original ASIC, problem solved.
>If Google ever set out to compete with Qualcomm's Snapdragon line, an IPU is something it could build directly into its own designs. For now, though, it has this self-contained solution.
If Google wanted to do this they would probably have to buy out QCOM and their patents.
Or at the very least pay a boatload of money in licensing fees. Qualcomm owns a huge chunk of IP in the mobile network space - they were responsible for the initial push behind CDMA back in the early 90s.