However, a GPU shader is far more constrained than a CPU thread. A GPU shader is "locked" into executing the same instructions as the other shaders in its group. That is to say: 64 "GPU threads" on AMD systems share a single program counter / instruction pointer, and 32 "GPU threads" on NVidia systems share a single program counter.
This means that if one thread in a program loops 10,000 times, the other 31 threads in its group (on NVidia) or the other 63 threads (on AMD) are forced to step through 10,000 iterations as well, even if those threads individually only need to loop 500 times.
This restriction doesn't really matter for matrix multiplication, where all threads loop the same number of times. But for something like a chess minimax algorithm, this "thread divergence" makes it difficult to actually port the algorithm to a GPU.
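To make that cost concrete, here's a minimal toy simulation in Python (not real GPU code; the 32-lane warp size is just the NVidia figure above): every lane in a lockstep group stays occupied until the slowest lane finishes.

```python
import numpy as np

def lockstep_lane_iterations(iterations_per_lane):
    """Lane-iterations a SIMD group pays: every lane stays busy until the slowest finishes."""
    lanes = np.asarray(iterations_per_lane)
    return int(lanes.max()) * lanes.size

warp = [500] * 31 + [10_000]           # 31 lanes need 500 loops, one lane needs 10,000
useful = sum(warp)                     # 25,500 iterations of real work
paid = lockstep_lane_iterations(warp)  # 32 lanes x 10,000 = 320,000 lane-iterations
print(f"useful: {useful}  paid: {paid}  utilization: {useful / paid:.1%}")
```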
Machine learning... (EDIT: or more specifically, Convolutional Neural Networks) is almost purely a matrix multiplication problem, making it ideal for the SIMD GPU architecture.
CPUs have SIMD units by the way (SSE, AVX, etc. etc.), but GPUs are specialized systems that can only do SIMD. So a GPU's SIMD unit is inevitably going to be bigger, more parallel, and more efficient than a CPU's SIMD unit.
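As a rough sketch of why a CNN layer really is "just matmul", here's the standard im2col trick in numpy (the 28x28 image and 3x3 filters are arbitrary): the convolution gets lowered to one big matrix multiplication, which is exactly the shape of work SIMD hardware loves.

```python
import numpy as np

def conv2d_as_matmul(image, kernels):
    """'Valid' 2D convolution (cross-correlation, as in DL frameworks) lowered to one matmul via im2col."""
    H, W = image.shape
    num_k, kh, kw = kernels.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    # Gather every kh x kw patch into a row -> (out_h * out_w, kh * kw)
    patches = np.array([image[i:i + kh, j:j + kw].ravel()
                        for i in range(out_h) for j in range(out_w)])
    weights = kernels.reshape(num_k, kh * kw)                   # (num_k, kh * kw)
    return (patches @ weights.T).reshape(out_h, out_w, num_k)   # one big matrix multiply

img = np.random.rand(28, 28).astype(np.float32)
filters = np.random.rand(8, 3, 3).astype(np.float32)
print(conv2d_as_matmul(img, filters).shape)   # (26, 26, 8)
```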
EDIT: Google's TPU goes beyond the GPU-model (SIMD architecture) and has dedicated matrix-multiplication units. These units can only perform matrix multiplication, nothing else, and are designed to do it as quickly and power-efficiently as possible.
NVidia's "Turing" line of GPUs have dedicated matrix multiplication units called "tensor units". NVidia basically grafted a dedicated FP16 matrix multiplication unit onto their GPUs.
Only the PDP-10 has been replaced by Google.
Back in the 1980s, a personal computer was loaded with an operating system that could only support a single user. A "mainframe" was a system that could support more than one user at a time, and you'd connect to it using telnet.
Today, your Linux PC loaded with SSH effectively functions as a mainframe. Heck, you can load virtual machines that themselves can pretend to be mainframes from the 1970s. Heck... your cell phone uses this multi-user feature and pretends that apps are different users for maximum security.
In contrast: GPUs are simply 1980s-style SIMD supercomputers. It turns out the SIMD-supercomputer style was the fastest way to calculate where pixels go on the screen (conceptually, each pixel shader basically gets to run its own thread, so a 1920x1080 screen has 2,073,600 threads to run per frame; see the sketch below).
The best architecture to run millions of simple threads is a GPU or SIMD-computer (See https://en.wikipedia.org/wiki/Connection_Machine)
> The CM-1, depending on the configuration, has as many as 65,536 individual processors, each extremely simple, processing one bit at a time.
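For a rough illustration of that per-pixel model, here's a hedged numpy sketch (the gradient "shader" is made up): every pixel's color depends only on its own coordinates, so a full-HD frame is ~2 million independent pieces of work that can be computed as a few wide vector operations.

```python
import numpy as np

# Each pixel's color depends only on its own (x, y): a 1920x1080 frame is
# 2,073,600 independent pieces of work, computed here as a few wide vector ops.
W, H = 1920, 1080
x, y = np.meshgrid(np.arange(W), np.arange(H))   # shape (1080, 1920) each
r = (x / W * 255).astype(np.uint8)               # toy gradient "shader"
g = (y / H * 255).astype(np.uint8)
b = ((x ^ y) & 0xFF).astype(np.uint8)
frame = np.dstack([r, g, b])                     # shape (1080, 1920, 3)
print(frame.shape, frame.shape[0] * frame.shape[1], "pixels shaded")
```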
May be a terminology thing, or a technology I don't know, but I thought '80s supercomputers like Crays were vector processors, whereas I'm not aware of any commercial super at the time being SIMD (AKA an array processor).
A vector processor as I understood it is nicely described here <https://www.quora.com/What-is-meant-by-an-array-processor-an...> as "the earliest Crays had vector instructions, that quickly fed a stream of operands through a pipelined processor", which is not exactly the same implementation as SIMD, but after reflecting on your question, it does the same thing in the end.
So effectively no difference.
Also relevant: <https://arstechnica.com/civis/viewtopic.php?t=401649>; see the reply from Accs, who claims to have worked at Cray and who also says in another post on the same thread, "I think that Vector is nothing more than a special case of SIMD".
So, my error, hope above helps.
FYI you might want to read up on systolic arrays just for fun <https://en.wikipedia.org/wiki/Systolic_array>.
@tntn: thanks, my carelessness.
And, well, you know, “words”... whatever works :-D
Network requests are always very slow when compared to processor speed. You can buy a device for 1/500 the cost of a server, put it in your closet, and it will be faster at almost every task than offloading to the "cloud".
I work tangentially to the Telecom space, and all anyone cares about is the "edge", i.e. Customer Premises Equipment and the like. With how cheap compute power is and how high network latency is, having centralized datacenters has become less and less desirable.
So having this hardware in your phone means that, for example, face recognition can run faster, and scene analysis during video recording takes less battery.
But neural network inference speed is rarely the issue that forces running models on servers and collecting user data. In your case of speech recognition, the problem is that training state-of-the-art models requires tens of thousands of hours of speech data; a recent Amazon paper actually used a million hours. For language models you need even more. Modern phones actually can perform speech recognition locally (put your phone in airplane mode, dictation still works), but they're using models trained on data from users who did talk to the server.
I have a feeling that AI technology is inherently anti-privacy.
Looks like they're not uploading clips to themselves either, at least it doesn't show up in https://myactivity.google.com
They still store audio when you use Ok Google though.
Yes, powerful ARM embedded platforms are making this more and more possible and I'm at least excited about it.
Not only that, speech recognition and speech synthesis technologies are getting more efficient by the month. There was a paper just this past week (https://arxiv.org/pdf/1905.09263.pdf) which demonstrates speech synthesis with an order of magnitude faster performance. Advances like those make it easier to cram these technologies on local IoT platforms.
And there's certainly been a number of local focused AI projects over the past few years. I remember Snips.ai being decent for building local-only voice assistants. It worked on a Raspberry Pi, though you still have to hack together decent microphones for it.
All of this gives me hope for an open source voice assistant that will be competitive with the commercial options.
Now if the makers of smart accessories (lights, etc) would stop being anti-consumer for two minutes we might get decent accessories that don't require a cloud.
P.S. Chuck, you are a great inspiration and friend. Thank you for all the kindness you've shown me.
Disclaimer: I work at Google, but apparently have never touched any of this from the inside.
Voice recognition without the cloud has been possible for years, this is nothing new.
> computers you could talk to
I can't help but feel somewhat pessimistic about technological developments in the handheld device domain. I say this as a supporter of ARM though. I have no reason to dislike ARM, I'm just somewhat apprehensive regarding any more empowerment of the companies we keep in our pockets.
> ...it just happens that most implementations today are...
Also, in the wider socioeconomic picture, everyone eventually got the same computational technology, because better was also cheaper.
Now, computation is stratifying, like any normal industry, where you get what you pay for. This ending of egalitarianism is bad.
What market forces stop vendors from selling small quantities of far better tech at far better profits? Price discrimination is fundamental in marketing: different people will (and can) pay different prices. An extreme example: I believe the military gets more advanced process nodes before mass production, when there are only small runs.
I think the iPad Pro answers that. Could you buy or design a faster mobile processor than the A12X?
"High volume, high profit" trumps "low volume, extreme profit".
I think that's always been true for CPUs and GPUs. The faster ones were always the more expensive ones.
If you are cringing at the idea of a $3k+ CPU (or $9k+ GPU), it quite literally wasn't built for you.
As one extreme counterpoint, the cheapest smartphones today have better performance than the original iPhone.
Audio is one of the big areas where we can see huge gains, and I think philosophies about optimization are changing. That said, there are myths out there like "recursive filters can't be vectorized" that need to be dispelled.
Things being tricky with AVX and cache-weirdness doesn't change the fact that if you're not vectorizing your arithmetic you're losing performance.
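On the "recursive filters can't be vectorized" myth, here's a minimal numpy sketch of one common trick (the filter and channel count are arbitrary): the feedback recursion stays sequential in time, but all channels advance together in one vector operation per sample, which is the same thing you'd do across AVX lanes in C.

```python
import numpy as np

def one_pole_lowpass(x, a):
    """Recursive filter y[n] = (1 - a) * x[n] + a * y[n-1], vectorized across channels.

    The time recursion stays sequential (unavoidable feedback), but every step
    processes all channels at once - same idea as filling AVX lanes in C.
    """
    frames, channels = x.shape
    y = np.empty_like(x)
    state = np.zeros(channels, dtype=x.dtype)
    for n in range(frames):                    # sequential in time
        state = (1.0 - a) * x[n] + a * state   # one vector op across all channels
        y[n] = state
    return y

audio = np.random.randn(48_000, 8).astype(np.float32)   # 1 second of 8-channel audio
print(one_pole_lowpass(audio, a=0.99).shape)             # (48000, 8)
```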
There's also an argument to be made that if you're writing performance-critical code you shouldn't optimize it for mid/bottom-end hardware, but not everyone agrees.
Gotta admit I'm not real clear on what my phone does that needs on-device machine learning.
At least when some classical algorithm fails at performing a certain task, people talk of the failure as a bug, not as something that's "probably going to be solved with more training data".
- Recent Snapchat and Instagram selfie filters
- Google Keyboard's translation and prediction
- Google Translate's Lens
- Google Assistant's voice recognition
- Google Messages' instant replies
- Photos and a few other things on iOS

...are all local. Same thing with Windows 10 (the Photos app on Windows, and keyboard suggestions too I believe):

- Microsoft Edge page suggestions (really, it's an ONNX model)
- OCR text recognition on Windows
Of course, training a model takes longer given the same amount of processing power - but for applications like video processing, just applying the model can be pretty demanding.
I thought ML required massive amounts of data to be taught, most of which makes more sense in the cloud.
Am I way off here?
There can be more guarantees over data privacy, since your data can stay on-device. It also reduces bandwidth as there's no need to upload data for classification to the cloud. And that also may mean it's faster, potentially real time, since you don't have that round trip latency.
This is not necessarily for phones. Lots of (virtually all?) low power IoT devices have ARM cores. There are plenty of environments where the cloud or compute power isn't available.
Better hardware probably means less power drain and larger models.
There was some cool stuff about this in the Google I/O keynote.
This requires hardware in which you can multiply these huge matrices quickly, even if the weights are downloaded from the cloud.
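A minimal sketch of that split, assuming a hypothetical cached weights file named model_weights.npz: the weights come from the cloud once, but the actual matrix multiplications happen locally.

```python
import numpy as np

# Hypothetical weights file, e.g. fetched from a server once and cached on-device.
# Fabricate and save one here so the sketch is self-contained.
rng = np.random.default_rng(0)
np.savez("model_weights.npz",
         w1=rng.standard_normal((256, 128)).astype(np.float32),
         w2=rng.standard_normal((128, 10)).astype(np.float32))

weights = np.load("model_weights.npz")

def predict(x):
    """Forward pass = a couple of matrix multiplications; the part the NPU accelerates."""
    h = np.maximum(x @ weights["w1"], 0.0)   # dense layer + ReLU
    logits = h @ weights["w2"]
    return logits.argmax(axis=-1)

features = rng.standard_normal((1, 256)).astype(np.float32)
print(predict(features))
```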
Surf has a keybinding Ctrl-Shift-S to toggle JS. Launch it with -s to disable by default.
Though FWIW, the article renders fine for me with uBlock Origin + Extra.
Assuming Apple continues as it has in the past, dropping the iPhone 7 and moving the iPhone 8 down to its price range, they would have an entry-level iPhone that is faster than 95% of all Android smartphones on the market.
It boils down to:
CPUs: 10s of cores
GPUs: 1000s of cores
NNs: 100000s of cores
NNs have very simple cores (fused multiply-add and look-up table functions) but can run many of them in one cycle.
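A toy illustration of that point (deliberately written as explicit loops, so it's slow in Python): a matrix multiply is nothing but multiply-accumulate steps, which is why an accelerator built as a huge grid of simple MAC units covers the workload.

```python
import numpy as np

def matmul_as_macs(A, B):
    """Matrix multiply written as nothing but multiply-accumulate (MAC) steps.

    Each C[i, j] += A[i, p] * B[p, j] is one fused multiply-add; an NN accelerator
    simply provides a huge grid of these and streams operands through it.
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m), dtype=A.dtype)
    macs = 0
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i, j] += A[i, p] * B[p, j]   # one MAC
                macs += 1
    return C, macs

A = np.random.rand(4, 3).astype(np.float32)
B = np.random.rand(3, 5).astype(np.float32)
C, macs = matmul_as_macs(A, B)
print(macs, "MACs; matches numpy:", np.allclose(C, A @ B, atol=1e-5))
```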
FTA: Because general-purpose processors such as CPUs and GPUs must provide good performance across a wide range of applications, they have evolved myriad sophisticated, performance-oriented mechanisms. As a side effect, the behavior of those processors can be difficult to predict, which makes it hard to guarantee a certain latency limit on neural network inference. In contrast, TPU design is strictly minimal and deterministic as it has to run only one task at a time: neural network prediction. You can see its simplicity in the floor plan of the TPU die.
Especially for mobile applications (most Arm customers), you pay extra energy for all that pipeline flexibility that isn't being used. A dedicated chip will save a bunch of power.
Another comment mentioned cores, and I don't think that's a good way of looking at it, as in most ways a TPU is back to very "few" but hyper-specialized "cores". There is essentially no programmer-visible parallelism in a TPU or neural processor -- you feed it three matrices and it gives you the result. You move on to the next one.
There's no way to avoid clicking on it, as it follows you around and grabs your attention, not letting you read.
I scrolled down to see how long the article is. That somehow also triggered a redirect to the root.
How is it possible that the most user hostile news sites often get the most visibility? Even here on HN which is so user friendly. What are the mechanics behind this?
The url should be changed to a more user friendly news source. How about one of these?
Except on mobile. The UI elements are tiny, try closing ten comments without clicking the timestamp by mistake at least once.
So the site shouldn't place a cookie 'session_id=12345678', but a cookie 'techcrunch_tracking_enabled=False' that's not linked to any particular user would be just fine.
And you don't think they really stop tracking you just because you said you didn't like it, right? What would you even do against it? Sue each and every website owner on the planet somewhere in the EU where they might not even care?
From my personal experience, Apple chips are fast, but I can't really compare that to anything else. JS benchmarks match desktop performance, but that is not really what I was/am interested in.
I've tried some 'cloud ARM-based metal servers' which promised performance similar to Intel Atom CPUs, basically, but they felt at least 10 times slower. So I gave up on the whole concept.
But I guess the take-away is that those specific systems were slow, not so much ARM in general. I mean, it does make sense that there is some scale difference between a Xeon core and a core meant for mobile use, now that I actually think about it ;)
Decoding a big fat H.265 file on your 2012 MBP would be atrocious because it doesn't have hardware H.265 decode, but decoding it on your XS would probably be fine.
The result is the same with my own crafted image processing algorithms using single core. Apple makes really fast CPUs.
How much has x86 performance increased over the past six years?
This ~1.5x improvement in single-thread performance matches the stats from the geekbench.com benchmark:

my MacBook Pro 2012 CPU:
i7-3615QM => 3093 single-thread

equivalent current-gen CPU with the same TDP (45W) and similar clock (2.3GHz):
i7-8750H => 4617 single-thread
Disclaimer: I work on the MLIR/XLA team.
I suppose keyboards haven't really been a strong point!
Ive's design group just hates standard keyboard design, I guess.