Think Fast: Tensor Streaming Processor for Accelerating Deep Learning Workloads [pdf] (computer.org)
61 points by blopeur 23 days ago | 24 comments

Though their primary test case was just ResNet, at first glance the results here are encouraging. They claim a fairly staggering performance increase:

"Compared to leading GPUs [42], [44], [59], the TSP architecture delivers 5× the computational density for deep learning ops. We see a direct speedup in real application performance as we demonstrate a nearly 4× speedup in batch-size-1 throughput and a nearly 4× reduction of inference latency compared to leading TPU, GPU, and Habana Lab’s GOYA chip."

It is challenging to directly compare a GPU vs an ASIC style chip like this. I would like to see more detailed performance comparisons vs something like Google's TPU.

You need at least 10x over an established approach to have any hope of surviving in the mid-term. The results are for batch-size 1 as well. GPUs are throughput optimized. So I wouldn't be surprised if an A100 chip would be able to outperform this for batch size >4. I still think this is a really cool architecture though.
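To make the batch-size argument concrete, here is a rough sketch of a toy cost model in which each batch pays a fixed overhead plus a per-sample cost. The linear model and every number in it are assumptions picked for illustration, not measurements of the TSP, an A100, or any real chip.

```python
def throughput(batch_size, fixed_ms, per_sample_ms):
    """Samples per second under a simple linear cost model (an assumption)."""
    batch_time_ms = fixed_ms + per_sample_ms * batch_size
    return 1000.0 * batch_size / batch_time_ms


# Hypothetical parameters, not benchmarks of real hardware.
LATENCY_OPTIMIZED = {"fixed_ms": 0.05, "per_sample_ms": 0.20}
THROUGHPUT_OPTIMIZED = {"fixed_ms": 0.90, "per_sample_ms": 0.10}


def crossover_batch(a, b, max_batch=1024):
    """Smallest batch size at which device b beats device a, or None."""
    for n in range(1, max_batch + 1):
        if throughput(n, **b) > throughput(n, **a):
            return n
    return None


if __name__ == "__main__":
    for n in (1, 4, 16, 64):
        print(n,
              round(throughput(n, **LATENCY_OPTIMIZED)),
              round(throughput(n, **THROUGHPUT_OPTIMIZED)))
    print("crossover:", crossover_batch(LATENCY_OPTIMIZED, THROUGHPUT_OPTIMIZED))
```

With these made-up parameters the latency-optimized device is 4x faster at batch size 1, and the throughput-optimized one pulls ahead from batch size 9 onward. Where the real crossover sits depends entirely on the actual hardware and model.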

That's a good point, though these folks do specifically call this a "Streaming Processor", so I'm not sure that using anything besides a batch size of 1 for inference is entirely fair, but perhaps I'm misunderstanding what they mean by streaming. I was also thinking that the A100 comparison would be very interesting.

4x speedup at batch size 1 is really terrible. GPUs are terrible at small batch sizes. And that is comparing a research project with production hardware.

Honestly if you look at any of the other AI hardware startups they all advertise much more significant speedups.

Are people largely running inference on GPU/TPU? At my job, we run inference with pretty large transformers on CPU just because of how much cheaper it is.

To run inference on GPUs, people are typically using TensorRT or a similarly-optimized engine. That can make a big difference in cost tradeoffs vs. CPU. Ultimately, if you can keep a GPU reasonably well-fed, the GPU can come out much cheaper and lower latency. If your workload is very sporadic and infrequent, YMMV.

> if you can keep a GPU reasonably well-fed

That's the big if that most people seem to miss. And I even had people complaining their training was slow on a GPU...
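A back-of-the-envelope way to see why the "well-fed" condition matters: with hourly billing, cost per inference scales inversely with utilization. The sketch below assumes hourly pricing and a fixed peak request rate; the prices and rates in the usage note are hypothetical, not quotes from any cloud provider.

```python
def cost_per_million(hourly_usd, peak_qps, utilization):
    """USD per one million inferences on a machine billed by the hour."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    served_per_hour = peak_qps * utilization * 3600
    return hourly_usd / served_per_hour * 1_000_000


def break_even_utilization(gpu_hourly, gpu_qps, cpu_hourly, cpu_qps,
                           cpu_utilization=1.0):
    """GPU utilization above which the GPU is cheaper per inference.

    A result greater than 1.0 means the GPU never wins under these inputs.
    """
    cpu_cost = cost_per_million(cpu_hourly, cpu_qps, cpu_utilization)
    return gpu_hourly * 1_000_000 / (gpu_qps * 3600 * cpu_cost)


if __name__ == "__main__":
    # Hypothetical: GPU at $3.00/h and 2000 QPS, CPU at $0.50/h and 100 QPS.
    print(break_even_utilization(3.0, 2000, 0.5, 100))
    print(cost_per_million(3.0, 2000, 0.10), cost_per_million(0.5, 100, 1.0))
```

Under those assumed figures the GPU only becomes cheaper once it is kept busy more than about 30% of the time; at 10% utilization it costs roughly three times as much per inference as the fully loaded CPU.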

I'm curious, what does cheaper mean exactly, and how much cheaper is it? Is that cheaper to acquire the inferencing hardware, or cheaper in terms of energy efficiency? Is your inferencing running on servers, or commodity hardware... in the cloud or on user/consumer devices?

I'd agree with the sibling comment; this depends completely on what you are trying to do, and how fast you need to do it. There's nothing wrong with inferencing on a CPU, and I have no doubt it's much cheaper in certain ways, but it's also slower than what you can do on a GPU or custom ASIC, and there are reasons some people need it to go faster. One example that's in wide deployment would be Nvidia's DLSS for video games. It'd be pretty hard to run that in real time at 4k resolution on a CPU.

I work for a for-profit public SaaS corporation; we are not optimizing for energy efficiency, we are optimizing for cloud compute costs.

What are you doing today to reduce the cost? What are the major obstacles to your organization's goal?

We at nascentcore.ai are looking at ways to reduce cloud training costs by making more alternative training ASIC chips available to the public.

Feel free to contact us at info@nascentcore.ai

It depends on the type of task you have. For some tasks, inference on CPU can't be real time. Examples of such tasks are speech recognition and friends, machine translation, etc.

> For some tasks, inference on CPU can't be real time. Examples of such tasks are speech recognition and friends, machine translation, etc.

Funnily enough, the two services that my team is running on CPU are speech recognition and machine translation at real-time speeds, so that is definitely not true.

Heck, I can run an accurate real-time speech recognition service on my computer and only use like 5% of CPU.

Can you share more details? Are you using massive parallelism or model reduction techniques like distillation?

Sure, we are not using any advanced parallelism beyond just batching requests for translation and spinning up more servers if we have more requests. No distillation.
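For readers unfamiliar with the term, plain request batching can be as simple as the sketch below. This is only an illustration, not the setup described in this comment: `run_in_batches`, `fake_translate_batch`, and the default batch size are hypothetical names and values.

```python
from typing import Callable, List, Sequence


def run_in_batches(
    requests: Sequence[str],
    translate_batch: Callable[[Sequence[str]], List[str]],
    max_batch_size: int = 8,
) -> List[str]:
    """Send requests to the model in chunks of at most max_batch_size."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    results: List[str] = []
    for start in range(0, len(requests), max_batch_size):
        batch = requests[start:start + max_batch_size]
        # One model call per chunk instead of one call per request.
        results.extend(translate_batch(batch))
    return results


def fake_translate_batch(batch: Sequence[str]) -> List[str]:
    """Stand-in for a real model call; it just upper-cases the input."""
    return [text.upper() for text in batch]


if __name__ == "__main__":
    print(run_in_batches(["hello", "world", "again"], fake_translate_batch, 2))
```

A production service would also cap how long a request waits for its batch to fill, which this sketch leaves out.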

For translation, there's really not that much to say - we run the transformers on CPU and they seem to be pretty quick. We have a little more tolerance for latency here than with speech.

Real-time deep speech recognition on CPU is a little trickier. wav2letter++ has the best performance we've found. It's implemented entirely in C++, and streaming inference is quick on CPU. Without a GPU (and even with one, tbh), it is not feasible to do real-time decoding with a transformer LM, so we use n-grams.
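As a rough illustration of why an n-gram LM is cheap at decode time, scoring a hypothesis is just a series of count lookups. The sketch below is a toy bigram model with add-k smoothing; it is an assumption-laden example, not wav2letter++'s decoder or any production LM, and `BigramLM` is a hypothetical name.

```python
import math
from collections import Counter


class BigramLM:
    """Toy bigram language model with add-k smoothing (illustrative only)."""

    def __init__(self, sentences, k=1.0):
        self.k = k
        self.context_counts = Counter()
        self.bigram_counts = Counter()
        self.vocab = {"</s>"}
        for words in sentences:
            tokens = ["<s>"] + list(words) + ["</s>"]
            self.vocab.update(tokens[1:])
            for prev, cur in zip(tokens, tokens[1:]):
                self.context_counts[prev] += 1
                self.bigram_counts[(prev, cur)] += 1

    def log_prob(self, words):
        """Natural-log probability of a word sequence, including end-of-sentence."""
        tokens = ["<s>"] + list(words) + ["</s>"]
        vocab_size = len(self.vocab)
        total = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            # Counter returns 0 for unseen keys, so smoothing handles them.
            numerator = self.bigram_counts[(prev, cur)] + self.k
            denominator = self.context_counts[prev] + self.k * vocab_size
            total += math.log(numerator / denominator)
        return total


if __name__ == "__main__":
    lm = BigramLM([["recognize", "speech"], ["recognize", "speech"],
                   ["wreck", "a", "nice", "beach"]])
    print(lm.log_prob(["recognize", "speech"]))
    print(lm.log_prob(["wreck", "a", "nice", "beach"]))
```

Each word costs two dictionary lookups and a logarithm, whereas a transformer LM needs a full forward pass per scored token.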

So you are using a lower-accuracy model and also a stripped-down version of it. It might work for your domain, but it can't be generalized. I have a problem with people who make a blanket statement about running models on CPU. Heck yeah, I can run a Kaldi ASR engine on a "Raspberry Pi", but it's not that accurate.

What a surprisingly condescending response, I thought you were genuinely curious.

I never said anything about lower accuracy - we are running full size transformer models for translation.

And wav2letter++ inference models are SOTA on the LibriSpeech leaderboards, so try again. This is a completely different architecture than Kaldi and, frankly, conflating the two is wrong.

> I have a problem with people who make a blanket statement about running models on CPU.

What was my "blanket statement"? I said that the statement "For some tasks, inference on CPU can't be real time... speech recognition and friends, machine translation, etc." was false, because those tasks can be done in real time on CPU. The original claim seems to be much more of a blanket statement than my response.


Try it "Zamia Speech: Open source state of the art speech recognition" https://www.raspberrypi.org/forums/viewtopic.php?t=216638

What do you mean? I'm familiar with Kaldi - although haven't tried running it on an Rpi (nor with these particular models). That's materially different from w2l though.

If you have enough requests, batching and running through GPU is more cost effective.

Do you know of any performance/cost benchmarks that you could point me to showing this?

I am guessing that Groq did the wrong thing here.

To my eyes, deep learning ASICs are generally only meaningful in two separate scenarios: a high-power, high-scale data center training chip, or a low-power, highly efficient edge inference chip.

The TSP appears to be a throughput-oriented, high-power inference chip. I don't know of any decent-sized market that can support such a chip from a start-up.

Drone / smartphone SLAM? I expected it years ago but have not yet seen it implemented at any kind of scale.

Smartphones emphasize efficiency. Most of their workloads are computer vision, where convolution-based algorithms are powerful enough to serve as the general algorithm that any ASIC vendor can target. That is also a highly mature and established market.

They often have a similar architecture in terms of the execution workflow, but with added NN-flavored instruction units and instructions. That drives down cost and makes them easier to program.

As for drones, they are more energy-limited than smartphones, since the propulsion system consumes more energy. Inference throughput seems to be a secondary problem for drones.

Why did Wave Computing fail?
