Hacker News new | comments | show | ask | jobs | submit login
Google supercharges machine learning tasks with TPU custom chip (googleblog.com)
823 points by hurrycane 339 days ago | hide | past | web | 276 comments | favorite



I'm happy to hear that this is finally public so I can actually talk about the work I did when I was at Google :-).

I'm a bit surprised they announced this, though. When I was there, there was this pervasive attitude that if "we" had some kind of advantage over the outside world, we shouldn't talk about it lest other people get the same idea. To be clear, I think that's pretty bad for the world and I really wished that they'd change, but it was the prevailing attitude. Currently, if you look at what's being hyped up at a couple of large companies that could conceivably build a competing chip, it's all FPGAs all the time, so announcing that we built an ASIC could change what other companies do, which is exactly what Google was trying to avoid back when I was there.

If this signals that Google is going to be less secretive about infrastructure, that's great news.

When I joined Microsoft, I tried to gently bring up the possibility of doing either GPUs or ASICs and was told, very confidentially by multiple people, that it's impossible to deploy GPUs at scale, let alone ASICs. Since I couldn't point to actual work I'd done elsewhere, it seemed impossible to convince folks, and my job was in another area, I gave up on it, but I imagine someone is having that discussion again right now.

Just as an aside, I'm being fast and loose with language when I use the word impossible. It's more than my feeling is that you have a limited number of influence points and I was spending mine on things like convincing my team to use version control instead of mailing zip files around.


Google has always been strategic about announcing what it was doing, even since the early days (I used to work there too). Think about the impact the first MapReduce and GFS/BigTable papers had.

My guess as to why they're announcing the TPU is that they are feeling the pressure from Facebook and other AI labs, and want to reinforce their reputation as being the best place to do AI research. By revealing that AlphaGo was based on this hardware, they indicate to researchers around the world that if you want to build the most advanced ML models you need to be at Google. Same reason they talked about MapReduce/GFS back in the day.


It really boosts Google Cloud Platform with image and prestige, they're coming out with what is pretty much tensorflow as a service with their machine learning product.

Coming out saying they can give you a service no one else can right down to a custom chip may sway a few buyers in the market.

https://cloud.google.com/ml/


Well this also shows their commitment to Tensorflow if they willing to pay to fab custom chips for it. Then again, in the Google scheme of things that's probably not a huge cost.

They've probably carefully modeled the energy savings over GPU/FPGA and found it to be substantial enough, even taking into account costs of design changes.


If you need enough chips it doesn't take particularly careful modelling to say that you'll beat an FPGA quite substantially in most cases.


For me this is true. GPUs on Amazon are not cheap, and if google passes the savings on from performance power consumption I'd certainly offload some work to Google cloud.


This is true - I've heard rumors that GCE has been trying hard to poach cloud consumers from other providers, this could be another stroke in that effort.


It might also give Nvidia some sorely needed worries.


> By revealing that AlphaGo was based on this hardware

Interesting, as the nature/science paper made no mention of this, it was exclusively trained on GPUs.


Note that the version of AlphaGo that beat Fan Hui and was presented in Nature is significantly different from the Version that played Lee Sedol. Unless you believe that they didn't work on it for half a year.


This explains a lot. Going into the match, Sedol thought he could beat DeepMind but thought it might only be a couple of years until the technology outpaced him. We knew Fan Hui and other Go professionals were helping the team, but a massive speedup is always nice too.

It's a bit underhanded, however. IMHO, the player should be able to study recent games before the match. But this is pretty typical, there were similar late-stage improvements with Chinook (checkers) and Deep Blue (chess).


It's possible it was trained on GPUs but ran on TPUs.


Training and inference are not the same thing.


Sure... I think you are missing the point. In Nov., the Nature paper which contains the AlphaGo algorithm, the hardware detailed was exclusively CPU+GPUs.

Between that time and the Lee Sedol match, the hardware running AlphaGo was switched, to these TPU.

From the paper: "The final version of AlphaGo used 40 search threads, 48 CPUs, and 8 GPUs. We also implemented a distributed version of AlphaGo that exploited multiple machines, 40 search threads, 1,202 CPUs and 176 GPUs."


Or they consider the TPU to be something similar to a math co-processor, it just offloads work from the CPU. They also didn't discuss the custom built power modules, or the custom built network switches used for interconnects.


"We have an FPGA, oh but need to go faster, ok, build an ASIC then" is a natural thing to come up with. That's kind of what bitcoin farms did.

Obviously details and plans how it was done is where all the good stuff is, so that being hidden is understandable.


I'm not even a hardware or AI person, and even I could have told you ASICs would make way more sense than GPUs or FPGAs for machine learning. It's all about data locality. Fetching memory is the most costly thing a GPU does, and for ML (DNNs) there no big need for global access memory 99% of the time.

Anyone the casually follows AI knows that people have been talking about making DNN ASICS for some time. It was all a matter of time and $$$$$$

There is no doubt FB is working on them too. Which is why Google is finally publicaly saying that "we did it first ;)"


> It's all about data locality.

That's kind of what Movidius says, too, about its Myriad 2 VPU, which is kind of a GPU (SIMD-VLIW) with larger amounts of local memory combined with hardware accelerators.


> Google is finally publicaly saying that "we did it first ;)"

Makes sense. In that respect yeah, they probably wanted to keep it under wraps to avoid Facebook/others from getting a timeline estimate out of it and jump ahead.


It makes a lot more sense now why they open sourced Tensor Flow -- they have an ASIC design that supports it, and if they either want to sell the chips, or sell access to them via Google Cloud Platform, they'll need customers. And it's not like HP is about to roll out a server line sporting these any time soon, so Google will continue to have their advantage over the outside world.


> it's impossible to deploy GPUs at scale, let alone ASICs

A bit old but cf. DE Shaw and Anton https://en.m.wikipedia.org/wiki/Anton_(computer)


I think the jury is truly still out on whether the anton designs are worth it (compared to commodity clusters/GPUs, etc).

and Anton wasn't really "scale" in this sense. It was a vertically scaled single (or several) machines.


OK, so what is it? The announcement neither says what a TPU actually is nor what it can do. It's a magic black box. No specs. No price.


Well, it's a first announcement on a blog. They say it accelerates TensorFlow by 10x. They say it fits in an HDD slot. And the whole announcement must stay within a page or two.

It's a "more details to follow" type of thing. Pretty standard actually.


The cool thing here is that it's already widely deployed in production, so we know it actually works and that it's not just vaporware. The "more details to follow" is thus a lot more convincing than for a lot of similar-seeming announcements of things that don't actually exist yet (and often don't ever).


> They say it accelerates TensorFlow by 10x

They say 10x performance / watt, nothing about performance per unit time.


You can make some assumptions though. If the power consumption was equal, the performance is 10x.

The speed at which an ASIC will run is constrained by temperature (power dissipation) and and logic timing, which itself has a dependency on temperature.

So we could call that vertical scaling, to some power ceiling which may not take us all the way to 10x, but it's not impossible.

Then there is horizontal, which I assume is applicable to these problems... running more in parallel.

In both cases, I think it's safe to assume they are getting a performance increase in the instantaneous sense.


> You can make some assumptions though. If the power consumption was equal, the performance is 10x.

While I agree some performance per unit increase is likely, how does a direct 10x increased based on power savings follow? Less power usage does not mean that the chip can run through more flops in the same amount of time, right?


It does if power was the limiting factor in clock speed.


The relationship between clockspeed and power consumption is nonlinear.

http://electronics.stackexchange.com/questions/122050/what-l...

(see graph in the first answer)

Also, it's not known that the TPU have a way to allow to increase the clockspeed arbitrarily, nor is it known whether their architecture is capable of ensuring correctness at arbitrary clock frequencies. Some architectures make assumptions like "The time for this gate to reach saturation is very small compared to the clock frequency, so we'll pretend that it's instantaneous."


> They say 10x performance / watt

Well, that's the kind of metric you'd expect from a cloud provider. That's what's important to them.

If you're a tinkerer dabbling in TPU acceleration on your gaming/coding PC alongside with GPU acceleration, then the metric that would be interesting for you is speed increase per unit.


In March at the GCP NEXT keynote [1], Jeff Dean demos Cloud ML on the GCP. He casually mentions passing in the argument "replicas=20" to get "20 way parallelism in optimizing this particular model". GCE does not currently offer GPU instances. I've never heard the term replicas in the GPU ML discourse. These devices may enable a type of parallelism that we have not seen before. Furthermore, his experiments are apparently using the Criteo dataset, which is a 10GB dataset. Now, I haven't looked into the complexity of the model or to what extent they train it to, but right now that sounds really impressive to me.

1: https://youtu.be/HgWHeT_OwHc?t=2h13m6s


Since nobody has pointed this out, although he may be able to tell people "I was working on that thing", he likely can't say much more. Many large companies don't allow employees to disclose details that haven't been released publicly.


The article doesn't actually imply (though people might wish so) that the described TPU is intended to ever be a product available to others.

It's a post saying "that's how we do it" that serves a bunch of political and PR goals, but it's quite likely that this will stay an internal technology; maybe available indirectly as a cloud computing offering - in which case they'll give the specs and price of the whole solution, not of a particular model of TPU chip.


It seems it is a GPU-like device but tuned towards neural net applications, and with a more convenient HDD-like formfactor.


Hold up, you mean teams at Microsoft are "mailing zip files around!?"


How different are TPUs from GPUs? From the article, it sounds like TPUs use lower-precision arithmetic: are there any other differences?


Not just lower precision but probably less error correction in the lower bits. -Or- ... they convert the floating point values to analog values and do all the math in the analog domain and convert it back, just like old analog computers. But by going fully analog, I think the gains would be over 10x better, so probably not.

See 'low-power (inexact|approximate) computing'


They likely use half precision (i.e. 16 bit) floats in some of the computing stages. They've also discussed using even fewer bits of precision in some of their research output. When training and following an error gradient you kind of need lots of precision, but when running a learned model you often don't need much at all, 8 or even 4 bit numbers are sufficient in some calcs where rounding error propagation isn't an issue.


Now that you are free to talk about it, could you explain how ASICs stack up against GPUs, in practical terms?


ASIC stands for "application specific integrated circuit" and the name conveys a special purpose design of an integrated circuit, like for example bit coin mining.

But that's just terminology/convention. One could argue a GPU is an ASIC, and a CPU is an ASIC. The only thing to argue is how specific does an application have to be to call it an ASIC instead of some other made up name.

The prime differentiators are the process, which is the term used for the many steps of fabrication of an IC (processes are often referred to as nodes, distinguished by the smallest feature size of a transistor they create), and the ability of the design engineers to create an optimal design, in terms of boolean logic and semiconductor physics.

Given the right team of engineers, and a top notch foundry, and a great deal of experience in the problem domain (machine learning in this case), a custom IC could very likely trounce a GPU.

GPUs were designed for the domain of graphics processing, which happens to have some commonality with the processing in machine learning. But, at least until recently, GPUs weren't focused on machine learning. Just graphics.

Now the GPU vendors are trying to leverage their knowledge of graphics processing and building of graphics processors to create machine learning processors, but the thing they are leveraging could also be what handcuffs them. Which gives opportunity for a company like Google to do a fresh take on the problem domain without the baggage of the knowledge of graphics processing.


Wouldn't you typically prototype the design on an FPGA and then manufacture an ASIC once you'd worked out the kinks?


Chip designs are usually simultaneously tested in software simulation, FPGA and hardware emulators (basically special purpose super computers that run Verilog or HDL, not quite FPGA speeds but far better debug capabilities, approaching that of software RTL simulation, downside is they cost a few million a piece).

You'll also break a design down into units or blocks. These can be tested separately. E.g. you might see an L1 cache as a single unit and have a testbench (piece of HDL to stimulate and check a design) to test only that. This is useful as it's far quicker than a full design test however extensive full design testing is still required as many bugs you see as the consequence of the interactions between units.

Once you've done a lot of testing you'll tape out a chip. This may well have some issues so you'll do a new 'spin'.


I'm suspecting that the design contains a lot of repetitive units, each of which can easily be tested in software. So perhaps testing on an FPGA isn't even needed.


I think zymhan meant prototyping on an FPGA to see if you could actually see an improvement by mapping parts of your algorithm to custom hardware. This would be a sensible first step before committing to a full ASIC run. I'm sure Google did do this initially.

As for testing for correctness, ASIC designs are almost always tested in software simulation instead of on FPGAs. Mainly because RTL written for FPGAs can look quite different from RTL written for ASIC synthesis. There's also a lot incidental engineering work needed to get something working on an FPGA.


> ASIC designs are almost always tested in software simulation instead of on FPGAs. Mainly because RTL written for FPGAs can look quite different from RTL written for ASIC synthesis. There's also a lot incidental engineering work needed to get something working on an FPGA.

It's actually very common to test ASIC RTL on FPGA. You are right that if you are targeting an FPGA you may write things differently, but this doesn't mean RTL written for an ASIC won't work it's just it might not clock as fast as it could and may not make the most efficient use of resources.

There is a lot of engineering work to take an ASIC targeted IP and run it in an FPGA, but thorough verification of an ASIC design is extremely important so it's usually worth the time to bring up an FPGA that can run it (sometimes as it's too large you have to split it across multiple FPGAs).


Ah I'd state the opposite. Your design will have new problems showing up when going from FPGA to ASIC

Especially if you try to make it go faster


I've only dabbled in this area, so it's entirely possible my question doesn't make sense. Machine learning algos in general seems to have a bunch of iterative components, but within each iteration there is massive parallelism. Did you consider using a whole bunch of DSPs? They can also do parallel processing but might be cheaper in bulk than GPUs.


My vague impression is that there had started to be more and more research on use of ASICs for ML (though I can't say I follow it that closely). Enough of a changing tide that the competitive advantage was going?


What was your design flow? Did you use the standard cadence, Synopsys toolset to build this? How did you achieve basic design and JIT(or whatever it is you call the machine code part of the Python VM ) together?

Rrally interested to know how this was managed...


any idea which foundry google have used for the fabrication?


I wonder if this is a response as well to Nvidia's announced Tesla P100 chip?


There were no references to this TPU in the TensorFlow source code?


No but if you look at the code it's clearly designed to support pluggable hardware.


Very interesting observation -- this seems so much more obvious in retrospect given the meticulous mapping of graph to devices outlined in the paper.


So now open sourcing of "crown jewels" AI software makes sense.

Competitive advantage is protected by custom hardware (and huge proprietary datasets).

Everything else can be shared. In fact it is now advantageous to share as much as you can, the bottleneck is a number of people who know how to use new tech.


Indeed. I kind of realized this more recently when I saw some new chips promote that Tensorflow works on them. This is really what it was all about - getting chips everywhere to support Google's version of AI, which means that in the long term they can get those chips themselves in volume - chips that are optimized for their AI out of the box.



Joel Spoelsky : "commoditize your complements." Steve Ballmer : "developers! developers! developers!"

Developers are a complement to hardware, software, and data based businesses. Google probably spends more on employees than on hardware or data. They really want to commoditize developers that can work on their stuff.

Google's business is based on having more and better data than their competitors. In many cases they have been happy to commoditize hardware, making open their datacenter designs for instance. That helps to cheapen their costs for building datcenters.

In other cases they open source software, like tensorflow and go language. These choices are made to commoditize developers. Google wants there to be a big pool of people who know how to use the technologies that Google uses. More developers means less costs for Google to hire and train employees. Which is their biggest expense: win!

With the TPU, as long as no one else is doing that kind of hardware for machine learning, it is a proprietary advantage to keep it secret. But at some point the logic flips: when others start to do similar things Google would rather commoditize their version of the tech. Because the lesson of the last 40 years is any widely used hardware WILL become commoditized. The inertia is with software codebases and developer knowledge.

Developers developers developers developer developers!


Cool how he foreshadows the end of Sun (takeover by Oracle in 2010) in that article from 2002:

"Sun's two strategies are (a) make software a commodity by promoting and developing free software (Star Office, Linux, Apache, Gnome, etc), and (b) make hardware a commodity by promoting Java, with its bytecode architecture and WORA. OK, Sun, pop quiz: when the music stops, where are you going to sit down? Without proprietary advantages in hardware or software, you're going to have to take the commodity price, which barely covers the cost of cheap factories in Guadalajara, not your cushy offices in Silicon Valley."


Predicting that a company will fail without giving a date is not foreshadowing, it's stating the obvious.


What? He also predicted how and why it would fail. Sun was a big enough player then that it could have survived plenty of other ways. Apple of today looks nothing like the company in 2000, but Sun got caught out more or less exactly as described and never adapted.


Agree. To attract devs to using your spec, then eventually go to your platform. But since the spec are open, AMZN/MS can make their own too. So it is still a good thing.


I think this shows a fundamental difference between Amazon (AWS) and Google Cloud.

AWSs offerings seem fairly vanilla and boring. Google are offering more and more really useful stuff:

- cloud machine learning

- custom hardware

- live migration of hosts without downtime

- Cold storage with access in seconds

- bigquery

- dataflow


I think the difference you observe relates directly to the difference between what Google does outside of Cloud Platform and what Amazon does outside of AWS.


Vanilla? Boring?

I read "Vanilla" and "Boring" as "Horray, I don't have to spend time rewriting all this complicated code I already have!"

If I'm just dipping my toes into (say) Caffe or Theano, I don't have to rewrite it from scratch.

That is a huge advantage---not a disadvantage!---of AWS over google.


Your point is valid, but I think what the OP was saying is that Google is offering all this stuff IN ADDITION to the boring stuff.

Google does boring stuff very well too.. and one can argue much better than AWS as well.. take a look at Quizlet's story: https://quizlet.com/blog/whats-the-best-cloud-probably-gcp

(shamelessly biased Googler)


If I recall correctly, it took Google a while to actually offer the boring stuff. For a while, you could get a Google Compute Engine but you couldn't just get a dang VM image, because Google knows better than you and you should do things their way. They've fixed it now, but lost a lot of potential market share for that conceit.


"So"?

If you're evaluating something today, how does it change your decision that we were late to market with Compute Engine (and in this specific case "bring-your-own-kernel")?

If it's about future boring stuff, I think the list of boring stuff isn't too long ;).

Disclosure: I work on Compute Engine.


All given, the fact that Google itself doesn't extensively use GC is kind of a red flag(I know quite a few Googlers from search infrastructure and none of them said their teams used GCE internally).

A solid guarantee with AWS is if AWS goes down, then a multitude of Amazon's services also will go down(ex Amazonian myself), so it gives me a belief that AWS's uptime is more important to Amazon itself that it is for external customers.


Search Infra (and Ads for that matter) is an extreme case. Google Search might be one of the worlds most highly tuned infrastructure projects: a marriage of code and hardware design to maximize performance, scoring, relevance and ultimately ROI.

Before we had custom machine types (November 2015 GA), we wouldn't have been remotely close to what they needed. I'm not even sure we've had anyone evaluate the amount of overhead KVM adds in either latency or throughput.

tl;dr: Don't let Search be your "not until they do it". We've got folks in Chrome, Android, VR, and more building on top of Cloud (as well as much of our internal tooling being on App Engine specifically).


Google Firebase uses GCP including GCE extensively.

(Firebase Engineer here)


"So" Google lost potential business for a while from people who wanted to spin up VMs rather than wanting to ship code to a proprietary execution framework.


I think you're agreeing: We certainly missed a huge segment of the market at the time, but now that we've got GCE new business can certainly come our way.


If I recall correctly myself (getting old!), Google has had the ability to build your own custom image since GCE went GA 2.5 years ago. Now, admittadly, it took a while to get IAM and VPC going, but we done did it now!

I'd love to hear what other boring stuff has been a showstopper for you, in case we missed something dumb :)


Not having Managed Postgres and Managed Redis are 2 main showstoppers for me.


True but There are 3rd party managed services from both...


Minus the massive downtime that just occurred recently. https://status.cloud.google.com/incident/compute/16007

Not implying that AWS hasn't had them. It's just that adopting GCE this early makes you a bit of a guinea pig because GCE isn't used internally at Google.


Meaning even deeper level of vendor lock -- now you cannot even find the chips to run your application elsewhere!


No - tensorflow is open source and you can run it on many platforms. TPUs are about efficiency. You might not be able to do image recognition as efficiently without one, but you can still perform exactly the same tasks.

(I work on TF this year.)


Well, that is good to know. Thanks for clarification.


I would be shocked if tensorflow optimizations where useful 1:1 for stock Intel chips or GPU's. So, there is still plenty of lock-in even if your process runs. GPU vendors love to play this game by helping optimize games.


All of the comparable tools are practically locked into nVidia gpus/CUDA. TensorFlow is rapidly reaching performance parity on that hardware [1] and can now be run on this so it's actually sort of the least locked in framework.

[1] It has been climbing the charts at https://github.com/soumith/convnet-benchmarks for example.


It's possible, but I think that the majority of ML optimization as seen by a programmer using tensorflow is more about optimizing the balance of accuracy, training & inference speed, and memory use, and a lot of the solutions in this space are pretty hardware independent. There's an entire other type of optimization about, e.g., making conv2d insanely fast, but that's not something that a typical data scientist-type user deals with.

(To elaborate -- it's questions like "how deep should I make this convolution? Should I use tf.relu or tf.sigmoid? How many fully-connected layers should I put here, and how big should I make them?". These are really knotty deep learning design questions, but they're often h/w independent. Not always - we certainly have some ops on TF that we only support in CPUs and not on GPUs, for example - but often.)


I am more thinking in terms of:

  1. Best price/performance is tensorflow right now. So, the best software choice is platform X.  
  2. Then in 2 years.. Well we are using Platform X so tensorflow is clearly the best option.
In other words once you pick conv2d, you tend to also stick with whatever conv2d is optimized for. Which also means HW vendors love to help optimize popular platforms.


Others can make TF hardware too. Now that Google has opened the way, they will compete.


- Google Container Engine

- Cloud Shell

to name a couple more.


to start a REAL business you SHOULD act boring. for everything else there is google.


...are you saying Google isn't a real business?


no it's just that google has so many shiny things that come and go, if you build a business with shiny new things, you will propably fail if you do that over and over again, just because its shiny and new.


Interesting. Plenty of work has been done with FPGAs, and a few have developed ASICs like DaDianNao in China [1]. Google though actually has the resources to deploy them in their datacenters.

Microsoft explored something similar to accelerate search with FPGAs [2]. The results show that the Arria 10 (20nm latest from Altera) had about 1/4th the processing ability at 10% of the power usage of the Nvidia Tesla K40 (25w vs 235w). Nvidia Pascal has something like 2/3x the performance with a similar power profile. That really bridges the gap for performance/watt. All of that also doesn't take into account the ease of working with CUDA versus the complicated development, toolchains, and cost of FPGAs.

However, the ~50x+ efficiency increase of an ASIC though could be worthwhile in the long run. The only problem I see is that there might be limitations on model size because of the limited embedded memory of the ASIC.

Does anyone have more information or a whitepaper? I wonder if they are using eAsic.

[1]: http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=701142...

[2]: http://research.microsoft.com/pubs/240715/CNN%20Whitepaper.p...


You can build an ASIC with fast external memory, it adds to the cost but then you can handle larger models similar to a GPU. Software support is an issue but for deep learning applications there's no reason in principle you couldn't add support to TensorFlow etc for new hardware to make it simple for application developers to adopt. Movidius has announced that they're doing this and it's likely that other ML chip vendors will do the same.


External memory kills your power budget. Efficient processors, whether they be GPUs, cpus, or asics all gain efficiencies by keeping data local to computation. Having your chip focused on off chip data is difficult to optimize power for if you don't have compute bound problems.


>> I wonder if they are using eAsic.

If you read the academic literature, they talk about 100x-3000x energy savings with asic vs GPU. So in that light Google's 10x improvement sounds low, and could certainly fit an eAsic story.

Furthermore, eAsic has fixed wires, but logic is defined via sram configuration(AFAIK) and not fixed, so it could offer a level of programmability .


This is huge. If they really do offer such a perf/watt advantage, they're serious trouble for NVIDIA. Google is one of only a handful of companies with the upfront cash to make a move like this.

I hope we can at least see some white papers soon about the architecture--I wonder how programmable it is.


There's no way Google lets this leave their datacenters. Chip fabrication is a race to the bottom at this point. [1]

Google is doubling down on hosting as a source of future revenue, and they're doing that by building an ecosystem around Tensorflow.

What I think is interesting is how weak Apple looks. Amazon has the talent and money to be able to compete with Google on this playing field. Microsoft is late, but they can, too.

Where's Apple? In the corner dreaming about mythical self-driving luxury cars?

[1]: http://spectrum.ieee.org/semiconductors/design/the-death-of-...


Apple designs their own CPUs. I think they'd be able to field a massively parallel FMAC chip if they thought that was a good idea.

Where Apple really looks weak is in datacenters, networking, and cloud services.


What does the iPhone of 2021 look like?

I get the feeling from today's announcements that Google sees the 2021 version of Google Now as the selling point for their 2021 Nexus line.

I don't think Apple is preparing to compete on that.


Apple's strength is in consumer (and to a lesser extent, developer) ecosystems; the cozy comfortable bubble you get when you're surrounded by everything Apple. Getting access to your stuff across multiple devices is virtually effortless and continually seamless, with almost no configuration required.

Whether that's good or not may be arguable, but it's certainly a selling point for many and I don't see Google or any other company's offerings approaching the same experience, and I suspect that's by design; they have to be more open and support all devices but that kinda dilutes everything. Apple will only get stronger in that aspect IMO.


I would say they're already not competing on the assistant side. Siri is considerably worse than Google Now, even though it came out first


I'm not sure I agree with you, long term (about the chips). I think that the value here is in the ecosystem. If Google can compete with CUDA, they'll be doing really well.


> There's no way Google lets this leave their datacenters. Chip fabrication is a race to the bottom at this point. [1]

I’d hope someone somewhere steals the blueprints and posts all of them publicly online.

The whole point of patents was that companies would publish everything, but get 20 years of protection.

But by now, especially companies like Google don’t do so anymore – and everyone loses out.

EDIT: I’ll add the standard disclaimer: If you downvote, please comment why – so an actual discussion can appear, which usually is a lot more useful to everyone.


There's little need for anyone to steal the blueprints. It's unlikely there's anything particularly "special" there other than identifying operations in Tensorflow that take long enough and are carried out often enough and are simple enough to be worth turning into an ASIC. If there's a market for it, there will be other people designing chips for it too.


Same misuse happens with copyright. Both were invented to foster publishing and not creating life-long monopolies. The life times of copyright and patents must be way shorter as well. Everyone builds on something that came before. It's impossible to build a better bike if you have to test drive on a street with patent mines.

Re EDIT: Downvotes must be comment-mandatory or not allowed otherwise.


I don't see any mention of offering these chips for sale. You can rent them it seems via cloud offerings & that's it.


Sure, but that's the deal. I'll buy the latest nVidia 1080 card as soon as I can but renting these custom chips per minute would be a way better option for me.


GPUs also have this nice side effect of being great at playing games on. Purely as a guess I'd think that the gaming market is bigger than the AI researcher market.


In a future where AI is everywhere, Nvidia hopes it can sell GPUs by the hundreds and thousands to large data centers. You can make a lot more money a lot faster selling your hardware this way, and Nvidia is very interested in it judging from how much they talked about it at their recent conference.


I would be surprised if they weren't working on their own specialised chips then, though Google have the advantage of already having the software specs to build for.


> Purely as a guess I'd think that the gaming market is bigger than the AI researcher market.

Machine learning isn't just targeting the AI researcher market though -- it's widely used by a huge number of companies, and of course, by many of Google's most important products. I would argue that those markets combined are larger than gaming.


Yup, I assume they're gonna keep them in house as a competitive advantage for a time. I doubt they'll do it forever; the most valuable part of NVIDIA's CUDA is the ecosystem, and I think Google knows that.


I would assume that the API to use these is tensorflow.

So... just use Google's machine learning cloud thingy.

The software can build the community, where the supercharging is only available when you run it on Google cloud.

(although GPU performance isn't bad either, so you don't have to, thus community)


Quantum computers, OpenPower, RISC-V, and now this - I'm really liking Google's recent focus on designing new types of chips and bringing some real competition into the chip market.


What are they doing with RISC-V?


They dumped a bunch of money into it, so presumably they're at least interested.


it's ASIC tuned for specific calculations, I'm sure it's better power consumption than general purpose GPUs. Same as crypto mining ASIC's crush GPU's in terms of power efficiency.

There isn't much data yet but I'm also guessing they probably have access to much more RAM than NVidia cards and can process much bigger data sets


I'm surprised by the perf claims. Nvidia isn't doing kids play. The graph implied they were untouchable in terms of perf...


Ndvidia has to be general purpose. This is not and thus can be better optimized.


"General purpose" isn't that general, if you look at the actual operations they support and their threading model. It's already fairly optimized for these sorts of operations, and this amount of claimed headroom makes me suspicious.


Google has a lot of potential options that NVidia doesn't have. They can size their cache heirarchy to the task at hand. They can partition their memory space. They can drop scatter/gather. They can gang ALUs into dataflows that they know are the majority of machine learning workloads. They can partition their register file at the ISA level or maybe even drop it entirely. They can drop the parts of the IEEE754 floating point spec they don't need and they can size their numbers to the precision they need.


The fact that I can compile arbitrary programs for the GPGPU means it is general purpose. NVIDIA isn't writing softmax or backprop into silicon as a CPU instruction.

Look at how much faster ASICs for bitcoin mining are than the GPU... orders of magnitude.


"Backprop" isn't even close to something that would be a "CPU instruction", it's an entire class of algorithm. It's like saying "calculus" should be a CPU instruction. Matrix multiplication & other operations, on the other hand, do neatly decompose into such instructions, which have been implemented by NVidia et al., since that's the core set of functionality they've been pushing for like a decade now.

Additional die space on additional functionality might hurt the power envelope (which is where the focus on performance / watt rather than performance kicks in) but it doesn't make your chips slower per se.


That was my impression too. ML under the hood was a lot of linear algebra, not very different than most shaders. But maybe Google decided to hardcode a few important ML primitives because the ROI was that good in terms of grabbing customers. Also they might have very large scale applications not found elsewhere that motivates this.


Ok I was obviously oversimplifying things but my point is since we can only speculate, it's clear that when you know specific algorithms/math operations/memory layouts/applications you want to optimize for you can create dedicated chips that optimize and do that quickly. That bitcoin miners are all dedicated chips and run circles around GPUs demonstrates exactly this fact.

Furthermore the fact that ML can be error tolerant means you also get to optimize certain floating point operations for speed or energy efficiency at the cost of accuracy. NVIDIA doesn't get to do this in their linear algebra support.


Bitcoin mining is an extremely well-defined task compared to machine learning. It remains to be seen how general these TPUs are in practice - whether they will support the neural network architectures common two years from now.


tbh I felt like realizing what you meant earlier at the end of my comment. I should have ps'd it.


If they balance compute to memory better than GPUs, you could definitely see a 10x. GPUs have large off chip memory and small caches (like 256kb). Cost to going to off chip memory can be 1-2 orders of magnitude more than on chip memory. You can certainly fit 4+MB on modern processors, but they likely bought designs from a company like Samsung because designing high performance, low power memory cells is tricky. I'm surprised they were able to keep things a secret.


What graph?


In the talk there was a two bar graph

    {(others, ~bottom) (google, ~top)}
Couldn't see more, but after Nvidia claiming overwhelming power with their latest GPU architecture including in the ML domain .. I was surprised.


Bah, SGI made a Tensor Processing Unit XIO card 15 years ago.

evidence suggests they were mostly for defense customers:

http://forums.nekochan.net/viewtopic.php?t=16728751 http://manx.classiccmp.org/mirror/techpubs.sgi.com/library/m...


Damn, what a riveting read would be to find out what IT toys defense has now! Even an informed speculation would be nice to read.


The term is pretty generic, Nervana also calls their chip a "tensor processing unit".


I remember going to see the SGI Origin 2000 installation at University of Alaska, Fairbanks. Pretty neat, I seem to remember they had a Cray T3E 900 and some J vector systems there too.


3 generations ahead of moore law??? I really wonder how they are accomplishing this beyond implementing the kernels in hardware. I suspect they are using specialized memory and an extremely wide architecture.

Sounds they also used this for AlphaGo. I wonder how badly we were off on AlphaGo's power estimates. Seems everyone assumed they were using GPU's, sounds like they were not. At least partially. I would really LOVE for them to market these for general use.


It seems entirely reasonable to me, and there's good historical precedent for it. Let's look at SHA256 hashing as an example. The maximum number of hashes that the best GPU around can do is around 1 GHash/s. However, for the same cost, of around $600, specialized hardware can do around 5 THash/s. That's about five thousand times the performance/price. There's no reason that hardware that is super-specific to neural network computation can't similarly have large gains.

Link to the specialized hardware for SHA256 hashing: http://www.amazon.com/Antminer-~4-73TH-25W-Bitcoin-Miner/dp/...


Understood, but GPU's are very good at linear algebra. GPU's are HORRIBLE for crypto, just very parallel.


Yeah, that's why the improvement isn't as big as the 5,000x it was for SHA256 hashing.


these are ASICs, Application Specific Integrated Circuits, emphasis on the SPECIFIC. It's a chip built specifically for Tensor Flow. Anytime you build a chip to handle a specific application you are going to see a significant performance improvement. You can move into a new apartment using a Honda Civic, but you are going to see considerable performance improvement using a vehicle designed specifically for moving.


But isn't 3 generations ahead just 8x? Which doesn't sound at all unreasonable for a custom hardware.


The rule of thumb I was taught was that going from a DSP/GPU to a custom ASIC would give you a 10X advantage in performance/power which is pretty close to this. And look at how much bitcoin mining ASICs out compete GPUs.


This is about right! 64-bit IEEE fp -> 16-bit IEEE-style fp[0] is a 4x bit size reduction, and multiplication is O(n^2) is silicon transistor count.

[0] If google is smart, they'd ditch +/- infinity and if they were ballsy, they'd ditch zero in their FP implementation.


Generally speaking GPUs are already very good at running with float32s, usually much better than they are at using float64s in fact. The big advantages of using an ASIC are mostly on the storage side but they also allow you to get away with non IEEE floating point numbers that don't necessarily implement subnormals, NaN, etc.


I doubt FP hardware size is the limiting factor in their implementation (it's not in GPUs and definitely not in high perf CPUs). More likely they came up with higher level architectural tricks that let them specialize for machine learning (i.e. taking better advantage of locality in the application, etc).


Actually for many neural network apps 8, 4 even 1 bit is sufficient for representing weights.


From the article: "TPU is tailored to machine learning applications, allowing the chip to be more tolerant of reduced computational precision, which means it requires fewer transistors per operation."


Do you reckon that means it's using small floats?


Based on recent blog posts from some Google folks regarding quantizing neural nets I'm going to guess 8-bit fixed point.

For example -> https://petewarden.com/2016/05/03/how-to-quantize-neural-net...


Probably.

In addition, there's a lot of literature on optimizing hardware implementations of fundamental arithmetic operations like addition and multiplication. I recall seeing a paper a while ago which talked about reducing the number of gates by allowing some bounded imprecision in the results - unfortunately, I don't remember the title right now, but it sounds like that's what they may be doing.


When I first read the blog post earlier in the day, it actually said they were using 8-bits. I remember it because it seem quite small to me.


Or small ints, possibly working in log space.


The article confirms that "AlphaGo was powered by TPUs in the matches against Go world champion, Lee Sedol, enabling it to "think" much faster and look farther ahead between moves."


Now this is really interesting. I've been asking myself why this hadn't happened before. Its been all software, software, software for the last decade or so. But now I get it. We are at a point in time where it makes sense to adjust the hardware to the software. Funny how things work. It used to be the other way around.


This is known as the Wheel of Reincarnation. Functionality moves to special-purpose hardware then back to software, and the cycle repeats. (The computer term is from 1968 so this has been happening for a long time.)

http://www.catb.org/jargon/html/W/wheel-of-reincarnation.htm...


I've been thinking for a while that with the end of silicon shrinkages, we may start seeing the final cycles of the wheel, with a final stop at mostly specialized hardware for greater power efficiency.


Not only power efficiency but software efficiency. Custom chips combined with DSLs are a powerful combination. At the expense of segmentation, of course.


Wonder what it will do the the industry term Full Stack Developer. Will people who call themselves that will now need to know about chip design?

OT: Cool blog! :)


At some point in the past, Rob Pike mentioned that when we was working on Voyager (that spaceship that almost 40 years after launch, has left the solar system and continues to send back valuable science data), he had a relatively good understanding of the system from the quantum level (transistors are based on quantum theory) to the solar system. He wasn't kidding, either.


Very interesting fact. But the average programmer is not Rob Pike. How do you see this panning out for the average programmer? Will people need to learn a bit about chips to build more efficient CRUD apps?


i'm an average programmer but reading "High Performance Computing" http://shop.oreilly.com/product/9781565923126.do made a huge difference for me even when writing more efficient CRUD apps. A big issue is understanding the cost of the operations in the stack you are using and the overheads caused by your stack.


I think there will always be an API. In the case of Deep Learning you already see Caffe, TensorFlow, etc. readily available for developers.

I don't think the average developer will need to understand chip design but I do think _many_ developers will need to know how to use deep learning frameworks.


It had definitely started as an API side detail. But it might become an industry thing. It allows very easy vendor lock-in for all the *AAS products (PAAS, SAAS, etc). You could be required an add-on for your server or machine to be able to use their product.

Oh, and imagine a Facebook chip. :)


I dunno... I'm not Rob Pike, but I could basically describe computing from the quantum level to the solar system dynamics and electromagnetic influences of a robotic interplanetary craft.

Then again, I've been coding for ~25 years, am an avid amateur astronomer, and have a degree in physics, so maybe the moral here is that some things just just take more time to master, beyond those first 6 weeks in a coding camp learning how to put together your first jQuery.

:-)


Interesting, I hadn't heard of this term before but it sounds about right!

Maybe the next iteration will be assembly instructions specifically for Neural Nets built into CPUs...actually something like a convolution assembly instruction wouldn't even surprise me at this point.


See also the VAX. It had an assembly instruction to evaluate a polynomial based on a table of coefficients in memory: POLY

http://uranium.vaxpower.org/~isildur/vax/week.html


A podcast I listen to posted an interview with an expert last week saying that he perceived that much of the interest in custom hardware for machine learning tasks died when people realized how effective GPUs were at the (still-evolving-set-of) tasks.

http://www.thetalkingmachines.com/blog/2016/5/5/sparse-codin...

I wonder how general the gains from these ASIC's are and whether the performance/power efficiency wins will keep up with the pace of software/algorithm-du-jour advancements.


I listen to the Talking Machines as well. Great podcast. Another question would be are the gains worth the cost of an ML-specific ASIC. GPUs have the entire, massive gaming industry driving the cost down. I suppose that as adoption of gradient-descent-based neural networks increases, it may be worth the cost in a similar way that GPUs are worth the cost. Then again, I have never implemented SGD on a GPU so I'm not aware if there are any bottlenecks that could be solved with an ML-specific ASIC. Can anyone else shed some light on this?


> massive gaming industry driving the cost down.

Per-unit manufacturing cost scales logarithmically. Even a single batch of custom silicon on yesterday's technology is only $30K. This is one of the reasons there is so much interest in RISC-V; hardware costs are not the barrier-to-entry that they used to be.

So yeah, the gaming market pushes the per-unit price of GPUs down, but even an additional 2x reduction in rackspace and power will pay for itself at the right scale.


Somewhat off topic, but if you look at the lower-left hand corner of the heatsink in the first image, there's two red lines and some sort of image artifact.

https://2.bp.blogspot.com/-z1ynWkQlBc8/VzzPToH362I/AAAAAAAAC...

They probably didn't mean to use this version of the image for their blog - but I wonder what they were trying to indicate/measure there.


It kind of looks like the original image had a cutaway section there and they didn't want that to be seen so pasted a top surface back on.


Did the image change? I can't seem to find what you're describing.


It's still there: the red lines outline the lower left hand corner of the heatsink (the big metallic structure).


Got it. That looks like a bad digital image stitching job. Especially the misaligned fins. But the red lines are odd indeed.


For the curious, that's a plaque on the side if the rack showing the Go board at the end of AlphaGo vs Lee Sedol Game 3, at the moment Lee Sedol resigned and AlphaGo won the tournament (of five games).


By the way, more merit for Lee Sedol now that we know what he played against.


I guess this explains why Google Cloud Compute hasn't offered GPU instances.


That's what I'm thinking. I was anticipating the release of GPU instances, but now I'm thinking that they will simply leapfrog over GPU instances straight to this.


I'm guessing that the performance / watt claims are heavily predicated on relatively low throughput, kind of similar to ARM vs Intel CPUs - particularly because they're only powering it & supplying bandwidth via what looks like a 1X PCIE slot.

IOW, taking their claims at face value, a Nvidia card or Xeon Phi would be expected to smoke one of these, although you might be able to run N of these in the same power envelope.

But those bandwidth & throughput / card limitations would make certain classes of algorithms not really worthwhile to run on these.


> IOW, taking their claims at face value, a Nvidia card or Xeon Phi would be expected to smoke one of these, although you might be able to run N of these in the same power envelope.

That seems unlikely, right? GPGPU software beating an ASIC? I guess it depends on just how abstract and adaptable the TPU IC is. Sounds like they use a custom number representation -- if they can squeeze it into less than half precision (IEEE FP16) then it would be super hard for a Phi or GPU to beat it.

But ultimately it all comes down to specific applications' bottlenecks. If they don't have to go off-card ever and their workset fits into the on-die memory, then they'd have no real advantage by using a PCIe GPU with GDDRx on it.


> I'm guessing that the performance / watt claims are heavily predicated on relatively low throughput, kind of similar to ARM vs Intel CPUs - particularly because they're only powering it & supplying bandwidth via what looks like a 1X PCIE slot.

Agreed. Also tells you that they don't need to communicate with the CPU much, given that it only has a PCIE. Reminds me of Knights Ferry, in this respect.

> a Nvidia card or Xeon Phi would be expected to smoke one of these

Will be very interesting to see some head-to-head benchmarks between these guys (on tensorflow and other libraries) in the next few months. Especially as Knights Landing starts to appear, and the new Nvidia card.


I am so happy that Nvidia got a real competitor in the ML hardware market. Maybe now they will be two times more creative.


There are 8 high-speed serial links on the front side of that board and probably 8 on the back. That's quite a bit of bandwidth.


Given the insane mask costs for lower geometries, the ASIC is most likely an Xilinx EasyPath or Altera Hardcopy. Otherwise the amortization of the mask and dev costs -- even for a structured cell ASIC -- over 1K unit wouldn't make much sense versus the extra cooling/power costs for a GPU.


Don't forget shuttle runs. Adepteva used those and otherwise good engineering practice to develop two products, latest in 65nm, with no more than $2mil. This one might be simpler and cheaper given its requirements.

http://www.adapteva.com/andreas-blog/a-lean-fabless-semicond...


True. That's another possibility.

I'd imagine since they'd want to squeeze every performance per want out of the chip they'd want to go for smallest node possible. Virtex7 EasyPath is 16nm! It is pin to pin compatible with the FPGA version -- because they just change the mask layer and you get it in about 6 weeks. Hard to beat that.


I didn't know they were still doing EasyPath. And I surely didn't know they did pin-for-pin on 16nm. Holy crap! :)


Yes. It is very popular for high-end FPGA use cases. Essentially instead of relying on the built-in routing matrix which makes FPGAs what they are, they modify the metal layer and connect the chip per your design as an ASIC. You get much lower power consumption and much faster clock rates. You also get it in about 6-8 weeks and is guaranteed to match your original design in functionality. It is 100% pin compatible because it is the same base silicon and packaging.


Cool stuff. Previously, I was looking at eASIC or Triad if I needed this cuz I thought FPGA people cancelled S-ASIC's. Good to know there's a high-end one from Xilinx. Here's the others in case you didn't know about them:

http://www.easic.com/products/28-nm-easic-nextreme-3/

https://www.triadsemi.com/reconfigurable-full-custom-asic/

eASIC has a maskless capability where they straight-up print your silicon for prototyping/testing. Triad brought S-ASIC's to analog/mixed-signal. They're top players. eASIC's basic prototyping was $50k for 50 chips on older ones. Idk now. Triad I heard is $400k flat. Need a price quote to be sure. ;)


That's awesome. I had heard about them before but never used. Crazy how low that price is.


Yeah, Triad is still in startup mode and picky. eASIC has been around quite a while. They also have ezCopy or something to produce ASIC's from their S-ASIC's. A side benefit is there's lots of pre-tested IP, including Gaisler OSS CPU.

So, worth considering. I need to get numbers on Xilinx, though, in terms of pricing and royalties. Esp if they have something for 28nm, 45nm, or 65nm that will be significantly cheaper than other one.


I would suspect they're aiming for orders of magnitudes > 1k units though?


Unless they're doing 100K plus for 2-3 year life of a chip, it makes sense to stay w/ EasyPath or the Altera Hardcopy. It is much easier and more flexible for making revisions.

Consider that they will most likely make another version, with new features in 2 years time. I'm sure the users of the chip will want that, just like any other system.


Altera Hardcopy has been discontinued.

https://www.altera.com/products/general/asic.highResolutionD...

> Altera no longer offers HardCopy structured ASIC products for new design starts. Altera continues to support HardCopy for existing designs.


I wonder if we will be seeing more of this in the (near) future. I expect so, and from more people then just Google. Why? Look at the problems the fab labs have had with the latest generation of chips and as they grow smaller the problems will probably rise. We are already close to the physical limit of transistor size. So, it is fair to assume that Moore's law will (hopefully) not outlive me.

So what then? I certainly hope the tech sector will not just leave it at that. If you want to continue to improve performance (per-watt) there is only one way you can go then: improve the design at an ASIC level. ASIC design will probably stay relatively hard, although there will probably be some technological solutions to make it easier with time, but if fabrication stalls at a certain nm level, production costs will probably start to drop with time as well.

I've been thinking about this quite a bit recently because I hope to start my PhD in ~1 year, and I'm torn between HPC or Computer Architecture. This seems to be quite a pro for Comp. Arch ;).


I wonder if this architecture is the same Lanai architecture that was recently introduced by Google on LLVM. http://lists.llvm.org/pipermail/llvm-dev/2016-February/09511...


No, this is an ASIC. It's not general-purpose.


I don't know much about this sort of thing but I wonder if the ultimate performance would come with co-locating specialized compute with memory, so that the spatial layout of the computation on silicon ends up mirroring the abstract dataflow dag, with fairly low-bandwidth and energy efficient links between static register arrays that represent individual weight and grad tensors. Minimize the need for caches and power hungry high bandwidth lanes, ideally the only data moving around is your minibatch data going one way and your grads going the other way.

I wonder if they're doing that, and to what degree.


How is this different from - say - synthetic neurons that IBM is working on, or what nvidia is building?


Without sounding crass:

1. It works already (IE it's already in use)

2. It works really well (or else they wouldn't be using it so broadly)

3. Considering how long this was said to be in development, it also likely means they are working on the next big improvement before these guys have even gotten the current one working.


IBM's TrueNorth chip is taking a much more neuromorphic design approach by trying to approximate networks of biological neurons. They are investigating a new form of computer architecture away from the classic Von Neumann model.

TPUs are custom ASICs that speed up math on tensors i.e. high-dimensional matrices. Tensors feature prominently in artificial neural networks, especially the deep learning architectures. While GPUs help accelerate these operations, they are optimized first and foremost for video rendering/gaming applications -- compute-specific features are mostly tacked on. TPUs are optimized solely for doing ML-related computations.


They are already using it in production and have been for over a year? That is a huge difference compared to what people are working on and building. They beat Lee Sedol using this hardware. Over 100 teams at Google are using machine learning. This hardware already accelerates their projects.


What is the capabilities that a piece of hardware like this needs to have to be suitable for machine learning (and not just one specific machine learning problem)?


AFAIK 16-bit "half-precision" floating point.


8 bit is enough and I suspect it's what the TPU is using: https://petewarden.com/2016/05/03/how-to-quantize-neural-net...


It is interesting how malleable are neural networks.

- you can drop half the connections and it still works, in fact it works even better, during training

- you can represent the weights on as little as one bit, but still use real numbers for computing activations

- you can insert layers and extend the network

- you can "distill" a network into a smaller, almost as efficient network or an ensemble of heavy networks into a single one with higher accuracy

- you can add a fixed weights random layer and sometimes it works even better

- you can enforce sparsity of activations and then precompute a hash function to only activate those neurons that will respond to the input signal, thus making the network much faster

It seems the neural network is a malleable entity with great potential for making it faster on the algorithmic side. They got 10x speedup mainly on exploiting a few of these ideas, instead of making the hardware 10x faster. Otherwise, they wouldn't have made it the size of a HDD - because they would need much more ventilation in order to dissipate the heat. It's just a specialized hardware taking advantage of the latest algorithmic optimizations.


Sacrifice generality, accuracy and ability to randomly access a lot of memory so that you can implement fast and power-efficient matrix operations with a single, low accuracy datatype thus requiring less memory, bandwidth and transistors.


I'm thinking that this has the potential to change the context of many debates about the "technological singularity", or AI taking over the world. Because it all seems to be based on FUD.

While reading this article, one of my first reactions was "holy shit, Google might actually build a general AI with these, and they've probably already been working on it for years".

But really, nothing about these chips is unknown or scary. They use algorithms that are carefully engineered and understood. They can be scaled up horizontally to crunch numbers, and they have a very specific purpose. They improve search results and maps.

What I'm trying to say is that general artificial intelligence is such a lofty goal, that we're going to have to understand every single piece of the puzzle before we get anywhere close. Including building custom ASICs, and writing all of the software by hand. We're not going to accidentally leave any loopholes open where AI secretly becomes conscious and decided to take over the world.


Just because you understand a machine doesn't mean it can't be dangerous. I could completely understand every aspect of a nuclear bomb, and I could still make a mistake and cause quite a bit of damage with a real one. Complex systems are notorious for having all sorts of unexpected problems, and mistakes happen all the time. How much complex software is entirely bug free?

The danger of AI is more than just a random bug though. It's that an intelligent AI is inherently not good. If you give it a goal, like to make as many paperclips as possible, it will do everything in it's power to convert the world to paperclips. If you give it the goal of self preservation, it try to destroy anything that has a 0.0001% chance of hurting it, and make as many redundant copies as possible. Etc.

Very, very few goals actually result in an AI that wants to do exactly what you want it to do. And if the AI is incredibly powerful, that will be a very bad outcome for humanity.


You can condition the AI on the well being and freedom of human population. Hard to define precisely what that means, but it can be approximated with indirect measures. This is just what Asimov thought of in his novels.

Another way to protect against catastrophe would be to launch multiple AI agents that optimize for the goal of nurturing humanity. They can keep each other in check.

Also, humans will evolve as well. Genetics is advancing very fast. We will be able to design bigger/better brains for ourselves, perhaps also with the help of AI. Human learning could be assisted by AIs to achieve much higher levels than today.

We will also be able to link directly to computers and become part of their ecosystem, thus, creating an incentive for it to keep us around. Taking this path would enable uploading and immortality for humans as well.

My guess is that we will all become united with the AI. We already are united by the internet and we spend a lot of time querying the search engine (AI), learning its quirks and, by feedback, helping improving it. This trend will continue up to the point where humans and AI become one thing. Killing humans would be for the AI like cutting out a part of your brain. Maybe it will want a biological brain of its own and come over to the other side, of biological intelligence.


We don't know how to "condition" an AI to respect the well being and freedom of humanity. It's an extremely complicated problem with no simple solutions. Making an AI that wants to destroy humanity, however, is quite straightforward. Guess which one will most likely be built first?

Building multiple AIs doesn't solve anything. They can just as easily cooperate to destroy humanity as to help it.

Uploading humans won't be possible until we can already simulate intelligence in computers. We can't have uploads before AI.


This seems very similar to the "Fathom Neural Compute Stick" from Movidius:

http://www.movidius.com/solutions/machine-vision-algorithms/...

TensorFlow on a chip....


Although that stuff seems to be more about on the fly reasoning at low wattage so you can embed a drone with neural nets... This is more for servers.


Tensor Processing Unit (TPU)

Using it for over a year? Wow


There is not a single number in this article.

Now these heatsinks can be deceiving for boards that are meant to be in a server rack unit with massive fans throwing a hurricane over them, but even then that is not very much power we're looking at there.


The Cloud Machine Learning service is one that I'm highly anticipating. Setting up arbitrary cloud machines for training models is a mess right now. I think if Google sets it up correctly, it could be a game changer for ML research for the rest of us. Especially if they can undercut AWS's GPU instances on cost per unit of performance through specialized hardware. I don't think the coinciding releases/announcements of TensorFlow, Cloud ML, and now this are an accident. There is something brewing and I think it's going to be big.


I'd love to have a place to experiment cheaply but still not be required to invest 2-3K$ in it.


It doesn't just need to be a place to experiment cheaply. Many companies building software around ML techniques still rent time on EC2. Unless you are training models 24/7 and have your machines located in a very cost efficient location in terms of power/cooling, It's probably better for your training to be done in the cloud. It think very few use cases fall into the latter category.


Is that a Go board stick to the side of the rack?

Maybe they play one move every time someone gets to go there to fix something? or could it be just a way of numbering the racks or something eccentric like that?


It appears to be a commemorative plaque from when they defeated world champion Lee Sodol.


Ah. That makes sense.


Likely it was one of the racks used by AlphaGo


The pictures have swapped captions.


It is interesting that they would make this into an ASIC, provided how notoriously high the development costs for ASICs are. Are those costs coming down? If so life will get very hard for the FPGA makers of the world soon.

It would be interesting to see what the economics of this project are. I.e., what are the development costs and costs per chip. Of course it is very doubtful I will ever get to see the economics of this project, it would be interesting.


Making a custom ASIC is expensive for a startup, but not that expensive. Many companies are making their own high-end chips for things like signal processing.


It's mainly a question of volume. ASICs have a big economy of scale, so the cost-per-chip goes down considerably once you go over a certain number of chips. Plus, there's all the NRE costs of an ASIC design over an FPGA design. Google probably figured they could use enough chips to make the cost of manufacturing ASICs worthwhile.

I don't think FPGAs are going to be beat out by ASICs for low volume applications anytime soon.


But even a custom run of silicon (on yesterday's technology) will only set you back $30K. That's one of the reasons there is so much interest in RISC-V.


I think the initial investment was cheaper than the projected GPU energy cost and if it is actually faster, then there are advantages for faster iteration.


I'd be interested to know more technical details. I wonder if they're using 8-bit multipliers, how many MACs running in parallel, power consumption, etc.


This is great, but can google stop putting tensor in the name of everything when nothing they do really has anything to do with tensors?


I'm unsure if Tensorflow actually uses this but I know in literature some models use tensor decompositions for learning.


My favorite part is what looks like flush-head sheet metal screws holding the heat sink on.

No wondering where you left the Torx drivers with this one.


I wouldn't be surprised if Google is looking to build (or done so already) a highly dense and parallel analog computer with limited precision ADC/DACs. I mean that's simplifying things quite a bit, but it would probably map pretty well to the Tensorflow application.


Computing with opamp primitives has somewhat fallen out of favor since you've been away..

The primary cost factor in anything computing is just power. You can always buy more of the things, but power is the ongoing cost and every watt in computing costs you extra in cooling power.


What are the advantages of analog computing for this application?


Analog circuits run at full-speed (no clock), use little power, and take up little space. An analog computer directly implements the mathematical function it represents as circuits instead of emulating it on a von Neumann-like machine. Long story short, they have issues that made people go digital. Yet, if you can use analog, you can get significant advantages. Example for math acceleration:

http://www.cisl.columbia.edu/grads/gcowan/vlsianalog.pdf

Brain is a bunch of components that are spread out 3D that operate like a mathematical function at slow speed. Mostly sounds analog. Results in us. So, a huge spread of analog components could get some results directly simulating something like that. Here's one of my favorites which is a wafer-scale, analog computer for neural networks.

www.kip.uni-heidelberg.de/Veroeffentlichungen/download.cgi/4713/ps/1856.pdf


If you have an application that can tolerate error (like classification), then analog computing can give enormous gains in terms of speed _and_ power efficiency. Essentially, the savings come from using physics to perform the math (see Kirchhoff's current law) vs. using discrete time steps vs. fully-unrolling the logic. Google may not be using analog processing for this version, but I read an analog neural network researcher's page who said he moved to Google last year. (Sorry, I can't find the page again, but I think he was from the UK.)


True. ML applications are well-suited to analog computation not only because they can tolerate errors, they also have an ability to adapt to errors, provided training algorithms are ran in hardware.


What do you think about http://optalysys.com/ or http://lighton.io?


I think Optalysys looks interesting!

For the curious, Optalysys has built a general purpose optics-based correlation/pattern matching machine. From some of their predecessor-company marketing material: The correlator performs pattern matching on large data sets such as high-resolution images, providing a measure of similarity and relative position between objects within the input scene. This allows large images [and general data converted to images] to be analysed far faster than electronic equivalents.

Going back to the topic of NN-based computing, I found this talk to be intriguing: https://www.youtube.com/watch?v=dkIuIIp6bl0. The main argument is that because Moore's law may no longer be in effect, it will become increasingly important to explore alternate computing solutions. (Google's TPU could be supporting evidence for this argument.) The speaker also co-authored a paper which I liked "General-Purpose Code Acceleration with Limited-Precision Analog Computation".


This video was very cool. Are there any IC's that can perform analog computing for neural networks on the market now? I'm picturing something like an FPGA but with a bunch of op amps that you can connect into summers or amplifiers.

If not, how would one practically implement an analog computer for neural network programming (without several tables full of op-amps?)


Glad you liked the video!

You can implement an analog neural network yourself using a Field Programmable Analog Array. (I've never done it, but you'll see academics online writing papers about it.)

Another thing that is sort of related is Lyric Semiconductor; they built these cool application-specific probabilistic processors; they were purchased by Analog Devices a while back.


Thank was a fantastic talk, thanks for the link.


This was a fantastic talk, thanks for the link.


I'm curious to know; is this announcement something that an expert in these sorts of areas could have (or did?) predict months or years ago, given Google's recent jumps forwards in Machine learning products? Can someone with more knowledge about this comment?


Yes. Back in 1989, Intel built one of the first custom chips for ML (ETANN). Since then, there have been hundreds of such designs, and dozens of working chips. So, people have been building them long before "deep learning" arrived in all its glory. Now that it did, it was expected that efforts in custom hardware should intensify.

However, such initiatives still face the same problems as 30 years ago: custom hardware is expensive, inflexible, hard to program, and quickly becomes obsolete. It's still worth it if there's no other way to speed things up, but Moore's law is still alive and kicking, as evidenced by 15B transistor GP100 chip form Nvidia, so we can still just wait a little bit for the next gen GPUs.

Google is certainly in a good position, having developed a very popular ML framework, and having enough resources to develop good hardware (the blog post was written by Norm Jouppi - one of the best computer architects in history). It remains to be seen, however, how well these TPUs are supported in TensorFlow. What kind of models will get the advertised speed up?


Sure. I don't have anything to link on the spot but this was/is/has been foreseeable for some time. Although it's all very cool and shiny - most business applications of machine learning remain squarely in the territory of classic algos like GLM & forests (random, boosted trees etc. etc.). As a fun note, advances like these highlight that data scientists etc. will not be beaten by more complex automated methods, but simply by speed. Much like the filing system that 'runs' whatever you're using to see these words (https://www.youtube.com/watch?v=EKWGGDXe5MA).

Edit: to elaborate... single model training runs are possible to do quite fast now, but knowing how to tune hyper parameters remains the 'voodoo' of the field. But the best hyper params are also possible to discover through brute force: try every combination you can! Today, you can use various heuristics to improve this process, but either way, being able to train whatever X times faster just means we can search hyper parameter space that much faster. The robots are coming :)


They could run 10x more experiments for the same cost and experiment with many more configurations, but soon enough there will probably be an algorithm that can do the same on an single computer. I am waiting for the moment neural networks will become as good as people in designing neural networks.


Pretty quick implementation.

On the energy savings and space savings front, this type of implementation coupled with the space-saving, energy-saving claims of going to unums vs. float should get it to the next order of magnitude. Come on, Google, make unums happen!


> Our goal is to lead the industry on machine learning and make that innovation available to our customers.

Are they saying Google Cloud customers will get access to TPUs eventually? Or that general users will see service improvements?


Is there anyway to detect what hardware to being used by the cloud service if you're using the cloud service? (yes, realize this question is a bit of a paradox, but figured I'd ask.)


Another point is that they will be able to provide much higher computing capabilities at a much lower price point that any competitors. I really like the direction that the company is taking.


I wonder if opening this up as a cloud offering is a way to get a whole bunch of excess capacity (if it needs it for something big?) but have it paid for.


hasn't made a dent on Nvidia's share price yet


One question: what has this got to do with tensors?


That they're used a lot in machine learning? If you're processing video for instance, you might have a 5-dimensional tensor: x, y, color channel, time index and batch index.


The (only?) calculations that this specialized device can do well are tensor calculations.


I think the confluence of new technologies, and the re-emergence / rediscovery of older technologies is going to be the best combination. Whether it goes that way is not certain, since the best technology doesn't always win out. Here, though, the money should, since all would greatly reduce time and energy in mining and validating:

* Vector processing computers - not von Neumann machines [1].

* Array languages new, or like J, K, or Q in the APL family [2,3]

* The replacement of floating point units with unum processors [4]

Neural networks are inherently arrays or matrices, and would do better on a designed vector array machine, not a re-purposed GPU, or even a TPU in the article in a standard von Neumann machine. Maybe non-von Neumann architectire like the old Lisp Machines, but for arrays, not lists (and no, this is not a modern GPU. The data has to stay on the processor, not offloaded to external memory).

I started with neural networks in late 80s early 1990s, and I was mainly programming in C. matrices and FOR loops. I found J, the array language many years later, unfortunately. Businesses have been making enough money off of the advantage of the array processing language A+, then K, that the per-seat cost of KDB+/Q (database/language) is easily justifiable. Other software like RiakTS are looking to get in the game using Spark/shark and other pieces of kit, but a K4 query is 230 times faster than Spark/shark, and uses 0.2GB of memory vs. 50GB. The similar technologies just don't fit the problem space as good as a vector language. I am partial to J being a more mathematically pure array language in that it is based on arrays. K4 (soon to be K5/K6) is list-based at the lower level, and is honed for tick-data or time series data. J is a bit more general purpose or academic in my opinion.

Unums are theoretically more energy efficient and compact than floating point, and take away the error-guessing game. They are being tested with several different language implementations to validate their creator's claims, and practicality. The Mathematica notebook that John Gustafson modeled his work on is available free to download from the book publisher's site. People have already done some type of explorator investigations in Python, Julia and even J already. I believe the J one is a 4-bit implementation of enums based on unums 1.0. John Gustafson just presented unums 2.0 in February 2016.

[1] http://conceptualorigami.blogspot.co.id/2010/12/vector-proce...

[2] jsoftware.com

[3] http://kxcommunity.com/an-introduction-to-neural-networks-wi...

[4] https://www.crcpress.com/The-End-of-Error-Unum-Computing/Gus...


At this point in time, I think that the Python/Numpy stack offers the best performance, productivity, and expressiveness trade-off. With the [Numba](http://numba.pydata.org) just-in-time compiler, you can now easily bounce between numeric SIMD codes that leverage tuned BLAS/MKL, then go into more explicit loop-oriented constructs that perform equivalently to hand-coded C, while still being Python. If I were starting anew, it would be hard to justify investing in a big J/K/Q code base or team, despite the potential performance benefits.

I agree with your overall point that we're seeing a confluence of factors. The advances in compiler technology, combined with the vectorial nature of the problems that are interesting to solve in an era of big data, mean that we can achieve a great deal of productivity by using high-level vector-capable languages.


You may be right. I can't argue with Python's ubiquity; I have even steered my son in that direction, but with a hitch: I still had him learn some J.

The creator of Pandas, Wes McKinney, had a link up a few years back mentioning he was looking for people who were familiar with APL, J or K. It seems he was working on a new project/startup I think (could this have been the shuttered DataPad?). The links are dead now, but I will double check.

If the creator of Pandas is/was eyeing the older APL, and its newer brethren, I'd say it's a safe bet to keep J or K or Q on your radar because they fit. They're vector/array based; they are fast and iterative with a REPL; there is a lot of mathematical formalism in their origins and usage throughout the years, yet they are more beginner-friendly than say Haskell IMHO. I like Haskell too!


Does anyone have links to the talk or the graphs?


Looks to be approximately here https://youtu.be/862r3XS2YB0?t=7320


Does it use approximate computing technology?


I like that the images are mislabeled :)


Perf/W, the official metric of slow but efficient processors. How many times must we go down this road?

Let's see this sucker train AlexNet...


Wearing my CMU hat for a moment (but keeping in mind Google's paying me this year):

Google's always been cautious about the balance of speed and efficiency, out of concerns about programmer productivity, parallelization, and generality. See, for example, Urs's article in response to my and a few other people's crazy-academic research on using "Wimpy" nodes: http://static.googleusercontent.com/media/research.google.co... - vs http://www.cs.cmu.edu/~fawnproj/

There's a big difference between just cranking down the GHz and going for ASIC specialization. GPUs, for example, already represent a point on this spectrum -- it's true that they run at reduced GHz compared to high-end CPUs, but they're arithmetic monsters. The blog post notes, in fact, that the use of TPUs in AlphaGo let them do more searching. So why would you assume automatically that they're slow?


Because there's an absolute sippy straw of bandwidth to the thing if that's a 1x pci-e connection.

For if it were delivering performance on par with a $1000 Maxwell class GPU, why wouldn't you guys crow about it? That would be a really big deal wouldn't it? TitanX for 20W? That'd be awesome.

And having suffered through multiple pitches for us to buy various FPGA and boutique processors, I have yet to see someone who produced perf per watt numbers first, subsequently produce an impressive performance number. In fact, it took nearly yelling at one vendor for them to finally admit perf was abysmal.

Finally, training does not equal inference. Training requires strong scaling, but inference need only weak scale. So I suspect that Urs had to bite his tongue and buy a bunch of gpus for training networks.

Am I missing something?


That wasn't really my point - I'm simply noting that Urs is one of the last people I'd think to hop on the wimpy crazy train. His published articles suggest that he's got a very good grasp of the tradeoffs involved in "real" TCO -- i.e., taking a fairly global view of both the human, capital, and operating expenses involved in a technology decision such as using wimpies (no) or fabricating a custom ASIC for machine learning (yes).

That doesn't mean a TPU is faster or slower than anything in particular, it just means that quite likely that it's good for some machine learning tasks that Google cares enough about to spend the whatever dollars it cost to make the thing.

The WSJ article has a few more quotes from Norm Jouppi, btw.: http://www.wsj.com/articles/google-isnt-playing-games-with-n... (Sorry if that gets paywalled. Googling "wall street journal google tensor processing unit" got me there.)


There's a huge gap between wimpy and GPU, no? I'd guess this chip was right-sized to run inference on a 10 Gbit/s feed, optimized for perf/W.


Hmm, I'd expect that once you've got your weights loaded machine learning would need much less bandwidth/flop than graphics does. Is that incorrect?


Apples and oranges. And as it turned out, it's an 8-bit fixed point processor. Useless for training without heroic handholding, limited use for inference. Google's PR machine does it again.


Why use an ANKY in the title? Using an ANKY(Acronym no one knows yet) is bad writing, makes readers feel dumb, etc. Google JUST NOW invented that acronym, sticking it in the title like just another word we should understand is absolutely ridiculous.


I honestly didn't find it that hard to understand as they recently released a machine learning library that started with a T.


In what sense in this a great news? Yes, it's a progress, so what? After all, you - programmers - earn money for your jobs and pretty soon you might not have one. Because of these kinds of great news -- "Whayyy, this is really interesting, AI, maching learning. Aaaaa!".

"I'll get fired, won't have money for living and AI will take my place, but the world will be better! Yes! Progress!"

Who will benefit from this? Surely not you. Why are you so ecstatic then?


I thought long and hard about the future when 95% of the jobs as they exist today will be taken by robots. Initially, it would seem that people would be reduced to beggars, as they depend on BHI or other forms of welfare to subsist. But then it struck me:

You don't need jobs as long as you have land, renewable energy sources and robots (and 3d printers). You can live in a community that is self sufficient. You will be employed by your land, as it always was up until 100 years ago. We will also have robots, maybe not the latest generation, but we don't need to go back to the 19th century agriculture.

It is you who will benefit in the end, if you can use AI to improve your life. As long as AI doesn't remain locked in the hands of one entity and we all share into the benefits, it will work out ok. In the short run we need some sort of social welfare though, and to invest in renewables and self-sufficiency technologies.

How much self-sufficient a country, city, village or small farm could be? There is a lot of potential to migrate back to small community agrarian economy with robotics and 3d printing and solar panels.

<speculation>People could trade using a different currency than that used for robotic produced goods. This currency will have to enforce differentiation of economic agents (diversity) and integration (low barriers of entry). A currency that will automatically disable the accumulation of power in a few hands and work for humans. We have to build an economy that functions more like the brain. In the brain there is no master neuron. They all share in the activity. So should be an enlightened human society.</speculation>


yes, could be. but not necessarily will be and not necessarily that you and other people of the middle or low class will be given a luxury to have a robot and not to work.

maybe low and middle class will have to serve people from high class who will have robots.


Who will benefit from this? All of us.

The jobs can can be lost - those will be lost and should be lost. Prolonging the process doesn't help anyone as well. The transition can be hard and painful, though, so speeding it up is all the better.


transition to what? are you sure you and other programmers will be better off with that transition?

if jobs should be lost, I suggest you to leave you current job. and I'll want to see where you're going to get money to buy food.




Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact

Search: