Micron Kicks Off Mass Production of 12 Gb DRAM Chips (anandtech.com)
185 points by rbanffy 9 days ago | 102 comments





Google said 40% of their data center operating cost comes from paying for energy, so it makes business sense to switch to this type of DRAM.

I couldn't find the actual YouTube video where a Google employee shows the 40% energy cost; if you do, please link it. Meanwhile, here is another graph that is somewhat similar: https://perspectives.mvdirona.com/2008/11/cost-of-power-in-l...


I am not an expert on microchips, but looking at Apple's A12 vs. Intel chips in terms of power and energy usage, wouldn't the industry really benefit from moving away from the x86 architecture to something else in order to reduce power consumption? These savings could be passed on to users of cloud providers.

No, the power efficiency of (micro)architectures is highly dependent on how well the workload fits design assumptions. CPU cores spend their time in one of several states with different power profiles. You can increase the power efficiency of one state by reducing the power efficiency of another, so assumptions about what a CPU core will spend most of its time doing matter quite a bit in terms of optimizing throughput per watt. If a CPU core spends most of its time idle, it makes sense to optimize power consumption of idle time even if it makes actual computing or waiting for memory more expensive.

Intel microarchitectures tend to be heavily optimized around a workload that assumes a CPU core is doing heavy scalar computation all the time. If your workload and code actually look like that -- and many kinds of server workloads do -- you can't beat them for performance per watt. But this comes at the cost of things like power consumption when idle. Most personal computing, on the other hand, spends most of its time idle and may benefit significantly from architectures that optimize more for power consumption at low utilization levels. There are many ways to attack these problems at the architectural level.

This becomes complicated to optimize in intermittent compute-intensive environments, since over-optimizing for the idle time effectively increases the amount of time you spend in a less efficient compute-intensive state (see also: Apple's big/little core hybrid ARM architecture). There was a lot of empirical work done on this in the supercomputing world, which is sensitive to power costs and has built ARM clusters. The consensus in that environment, for example, seemed to be that throughput per watt under max load was the driver of power costs, which favored Intel.

Modeling this is difficult because it is a complex interaction between specific pieces of code and specific architecture implementations. Even just comparing Intel and AMD, they have different tradeoffs for real code because the architectures are optimized differently. For example, while some codes run very efficiently on AMD's Epyc, I have a couple of old codes with terrible performance on Epyc that can be traced to design decisions made by AMD. You can't design a CPU that is the best at everything, or even most things.
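To make the tradeoff concrete, here is a toy model (every number is invented for illustration, not measured from any real part): two hypothetical core designs, one tuned for sustained throughput and one tuned for idle power, compared over the same wall-clock window at different workload intensities.

    /* Toy energy model -- all figures are made up for illustration. */
    #include <stdio.h>

    typedef struct {
        const char *name;
        double idle_w;   /* power while idle (W) */
        double active_w; /* power under load (W) */
        double perf;     /* work units completed per second under load */
    } core_t;

    /* Energy (J) to finish `work` units inside a window, idling the rest. */
    static double energy_joules(core_t c, double work, double window_s) {
        double busy_s = work / c.perf;
        double idle_s = window_s - busy_s;  /* assumes the work fits the window */
        return c.active_w * busy_s + c.idle_w * idle_s;
    }

    int main(void) {
        core_t server = { "throughput-optimized", 10.0, 90.0, 100.0 };
        core_t mobile = { "idle-optimized",        0.5, 40.0,  30.0 };
        double window = 60.0;                           /* seconds */
        double workloads[] = { 500.0, 1000.0, 1700.0 }; /* both cores finish in time */

        for (int i = 0; i < 3; i++) {
            double w = workloads[i];
            printf("work=%6.0f  %s: %7.1f J   %s: %7.1f J\n", w,
                   server.name, energy_joules(server, w, window),
                   mobile.name, energy_joules(mobile, w, window));
        }
        return 0;
    }

With these made-up numbers the idle-optimized core wins the light, bursty case and the throughput-optimized core wins the sustained one, which is exactly the crossover described above.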


Thank you for the explanation. First time I've heard it in such understandable concepts.

Do you think the industry will move to specialized processors dependent on workload type? For example, for serverless, cloud providers could move workloads to the most efficient microarchitecture.


Some phones now ship with 2 sets of quad cores, one high performance set and one low power set.

There's not much that intrinsically makes an AArch64 chip lower-power or more energy-efficient than an x86 or x64 one – energy use will be more about the design goals of the microarchitecture than anything else.

There's a paper from a few years ago that looked into this in some detail – ftp://doc.nit.ac.ir/cee/computer/alinejhad.saeedeh/old/95-96-1/MS/Advanced%20Architecture/project/paper/6-ISA%20Wars%20Understanding%20the%20Relevance%20of%20ISA%20being%20RISC%20or%20CISC.pdf – there's no immediate reason to think the conclusions should have changed.


In addition to what others have said, there’s one more area to consider: “uncore” performance: memory controllers, PCIe buses, core crosslinks, cache coherency, etc.

Both Intel and AMD have demonstrated they can scale these effectively. The A12X isn’t even playing in the same league when it comes to uncore functionality. I’m sure that if you created a version of the A12X with the same cache, memory, PCIe, core count, and interconnections, the power picture would look a lot different.

This is part of the reason why “datacenter ARM” CPUs never get much of a foothold: as soon as you start ramping up your uncore transistor count, your power advantage erodes significantly, to the point where Intel and AMD begin to yield better performance-per-watt in typical server workloads due to a tightly optimized microarch and an “uncore” to feed it.


Even if one arch is intrinsically lower power than the other, I am not sure that x86 would be the one to come off worse.

For example, could one not argue that an x86 binary / code size is on average smaller and more dense than the ARM equivalent, so on x86 cache misses will be less common, less RAM needs to be used, and so on? Perhaps a denser binary is better for power consumption?

Although this effect would be very marginal even if it is present at all.


Interesting point on the code size. How does pipeline length affect this? ARM pipelines are a lot shorter, so a miss doesn't matter as much. Intel once tried to push long pipelines with their NetBurst architecture, but real-world scenarios didn't really show any improvements. Are the compilers better at branch prediction now?

In the end it all comes down to the energy cost of doing a computation. With good architectures and compilers, the code causes as few unnecessary bit flips as possible to get to the results, and, unless you have a significant difference in how much energy it takes to do that, all CPUs will end up close.

But CPUs are just part of the equation - you can make significant gains in other parts of the computer, and memory is a good target. When we get to the point where PSUs live as independent things plugged into the same rack as your servers, it means we are really concerned about not feeding in one amp more than we need to.


Mainstream CPUs are heavily optimised for latency over energy consumption (using speculative execution to effectively waste a multiple of the energy required for the computation itself). It'd be an interesting exercise to try to design an architecture that optimises energy consumption over latency (perhaps, rather than deep cache hierarchies, using cores that can power gate while they are waiting for memory accesses...)

That's a very good point. Even when one generates optimal code, the CPU will run all sorts of computations that do not contribute to reaching the right result, but only to getting there more quickly.

A lot of the low-power ARM processors do that - just avoiding speculative execution like you pointed out is a huge win if power is your main concern.

Another interesting take is barrel processors, which can avoid scheduling stalled contexts, but they are only effective when we can feed them enough threads that the time spent waiting is less than what a "normal" processor would spend. Early Xeon Phis (and Cell's PPU) behaved like barrel processors - each core could dispatch 4 threads (2 on the PPU), but it did so round-robin. Scheduling only one thread per core didn't make them run any faster.
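For illustration only (this is not how the Xeon Phi or Cell pipeline is actually implemented), a strict round-robin issue loop like the sketch below shows why utilization collapses when too few contexts are runnable:

    /* Toy barrel-style scheduler: one issue slot per cycle, rotated across
     * hardware contexts; a stalled context simply wastes its slot. */
    #include <stdbool.h>
    #include <stdio.h>

    #define CONTEXTS 4

    int main(void) {
        /* Contexts 1 and 3 are pretending to wait on memory. */
        bool stalled[CONTEXTS] = { false, true, false, true };
        int issued = 0, cycles = 16;

        for (int cycle = 0; cycle < cycles; cycle++) {
            int ctx = cycle % CONTEXTS;   /* strict round-robin, no skipping */
            if (!stalled[ctx])
                issued++;                 /* otherwise the slot is wasted */
        }
        printf("issued %d instructions in %d cycles (%.0f%% utilization)\n",
               issued, cycles, 100.0 * issued / cycles);
        return 0;
    }

With only one runnable context out of four you never get above 25% of the issue slots, which matches the observation that scheduling a single thread per core didn't make those chips any faster.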


We have about 40 years of software development on x86. I'm not even sure we could reuse any of that when moving to a new architecture.

Unless your software is written in assembly, you can probably reuse a lot of that when moving to a new architecture. And given that interpreted/JITed languages are pretty mainstream these days, I'm sure many developers wouldn't even notice the difference if you swapped their CPUs during the night.

Does that matter for some cases?

Linux supports ARM, and we have massive companies with economies of scale where Linux/OSS plus their own code could all be cross-compiled. I mean, if you're the average biggish non-tech company then yeah, you might be stuck on x86 because of Windows, but that isn't everyone.


In any organization, if you have an internal tool or piece of software, it’s not like you can press a button and have it cross-compiled from x86 to ARM. That is a huge and expensive undertaking.

>> In any organization, if you have an internal tool or piece of software, it’s not like you can press a button and have it cross-compiled from x86 to ARM. That is a huge and expensive undertaking.

That depends on the software. Many Linux distributions can be compiled for x86, ARM, PPC, RISC-V, and even some others. Cross compiling is not difficult if your software is written with even a little of that in mind. Windows itself used to be available for 3 or 4 instruction sets - and an OS requires low-level hardware support. Applications should be largely target-independent.
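As a sketch of why this is usually painless for plain application code: the program below has nothing architecture-specific in it, so only the toolchain changes (the cross-compiler name in the comment is the Debian/Ubuntu one and is an assumption about your environment).

    /* hello.c -- no architecture-specific code, so the same source builds for
     * x86-64 or ARM just by switching toolchains, e.g.:
     *
     *   gcc -O2 -o hello hello.c                           (native x86-64)
     *   aarch64-linux-gnu-gcc -O2 -o hello-arm64 hello.c   (64-bit ARM)
     */
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint64_t total = 0;   /* fixed-width types, no sizeof(long) assumptions */
        for (uint32_t i = 1; i <= 10; i++)
            total += i;
        printf("sum = %llu\n", (unsigned long long)total);
        return 0;
    }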


Lots of x86 code plays fast and loose with alignment which will blow up on ARM.
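A hedged sketch of the kind of code meant here (a hypothetical example, not from any particular codebase): casting a byte pointer to a wider type and dereferencing it works on x86 but is undefined behaviour, and may fault on cores without hardware unaligned-load support; the memcpy form is the portable fix.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* "Fast and loose": undefined behaviour if p isn't 4-byte aligned. */
    static uint32_t read_u32_cast(const uint8_t *p) {
        return *(const uint32_t *)p;
    }

    /* Portable: compilers turn this into a plain load where that is legal. */
    static uint32_t read_u32_portable(const uint8_t *p) {
        uint32_t v;
        memcpy(&v, p, sizeof v);
        return v;
    }

    int main(void) {
        uint8_t buf[8] = { 0, 0x78, 0x56, 0x34, 0x12, 0, 0, 0 };
        printf("%08x\n", read_u32_portable(buf + 1)); /* 12345678 on little-endian */
        printf("%08x\n", read_u32_cast(buf + 1));     /* may trap on strict cores  */
        return 0;
    }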

Modern big ARMs let you play fast and loose with alignment as well.

Hell, a Cortex-M3 lets you have unaligned loads and stores too.


Even if they don't, you can just turn on unaligned access emulation in your OS kernel. This was a thing in QNX and is also a thing in Linux.

I think it's fair to not want to turn that on in the kernel on chips that don't support it in servers. It's really expensive on those kinds of cores to trap into the kernel (and KPTI makes it dramatically more expensive).

That being said, server cores, with their longer pipelines, are the most likely to not have an issue with unaligned access, so it's probably a moot point.


That software is either poorly written, or has low level stuff that needs to be arch specific or rewritten. Most code does not have this sort of issue.

Most startup workloads are internally built software, for which we only need the compiler and/or runtime to be optimized for the architecture.

Most enterprise tools are written in higher level languages like Java, Python, C# and many other mostly platform independent technologies.

This was becoming true before 2010, but no longer holds - server and chip efficiency has improved quite a bit (correlating with the end of chip frequency increases), and datacenter PUEs have improved sufficiently to take more of the edge off. Power is still a big cost, but it's not 40%.

A significant portion of that electricity spend is to power cooling systems, though.

No, that's PUE - a measure of the overall efficiency of the facility. Today's leading datacenters have a PUE of about 1.12, meaning that for every watt delivered to the IT equipment, roughly another 0.12 W is spent on non-compute things (losses in electrical conversion and cooling, for the most part).

A decade ago, PUE numbers were much more commonly closer to 2 (Google's were in the 1.2-1.3 range even in 2008, but things had improved quite a bit since 2005 already, and MSFT and YHOO were working on it), but modern datacenter design has improved that substantially.

see, e.g., https://www.google.com/about/datacenters/efficiency/internal...
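Since PUE is just total facility power divided by the power delivered to the IT equipment, the overhead falls out of a one-liner; the figures below are the ones mentioned in this thread, plugged in purely for illustration (the 1 MW IT load is an arbitrary assumption).

    #include <stdio.h>

    int main(void) {
        double pue[] = { 2.0, 1.3, 1.12 }; /* ~2008 typical, Google ~2008, leading today */
        double it_kw = 1000.0;             /* assumed 1 MW of IT load */

        for (int i = 0; i < 3; i++) {
            double total_kw    = pue[i] * it_kw;   /* what the facility draws  */
            double overhead_kw = total_kw - it_kw; /* cooling, conversion, ... */
            printf("PUE %.2f: %4.0f kW facility draw for %.0f kW of IT load (+%.0f%% overhead)\n",
                   pue[i], total_kw, it_kw, 100.0 * overhead_kw / it_kw);
        }
        return 0;
    }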


This is a strange figure. If 12% of power is used on non-compute things, what is the other 88% of power converted into? The computed information?

In my understanding, 100% of the power consumed is converted to heat. Measuring efficiency should have a unit like Gigaflops per Megawatt or something, not a fictional percentage. That way you can also account for hardware getting more energy efficient over time, which is now just barely represented in the numbers. It's hard to get the right unit though, because for instance archive.org has a big data center but focuses on storage instead of processing power.


I'm not sure how much, but some of the energy is transformed into the electromotive force that literally pushes metal molecules around in the circuitry of your pc, so it can't be 100% heat.

Almost entirely heat. The reason PUE is separate from the efficiency of the computation is mostly that one is a datacenter engineering question. You can improve them independently.

GFLOPS per watt is also really important, but that's more in Intel's/AMD's/Nvidia's hands.

Application ops/W is also important, and that combines the system efficiency and the code efficiency.

A key thing is that each of these evolves on different time scales. When you build a datacenter it lasts for multiple generations of processors. Hence PUE.
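A back-of-the-envelope sketch of how those layers multiply into "useful work per watt pulled from the grid"; every figure here is invented for illustration, not a measurement of any real system.

    #include <stdio.h>

    int main(void) {
        double pue          = 1.12; /* facility overhead: datacenter design      */
        double gflops_per_w = 20.0; /* silicon efficiency: chip vendor's problem */
        double app_fraction = 0.30; /* fraction of peak the application achieves */

        /* Useful application GFLOPS per watt drawn at the facility boundary. */
        double useful = gflops_per_w * app_fraction / pue;
        printf("%.2f useful GFLOPS per facility watt\n", useful);
        return 0;
    }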


Then it also stands to reason that it makes business sense to switch from Intel Xeon to AMD Epyc.

Yes, it might; though, unlike memory, a processor is more than just a frequency/feature size and a TDP. A processor also has specialty instructions which workloads may have been optimized for.

I would expect that right now, Google is 1. seeing to the re-engineering of certain math libraries they use internally for Google Search et al, and 2. like AWS, intending to create separate cloud-VM instance classes for Intel vs. AMD processors, where each workload type can “pay its own bills.” If AMD continues to dominate in server TDP, then I would expect this to lead to the AMD instance classes getting cheaper while the Intel classes remain the same; and so most customers switching to AMD instances, save for the customers who themselves have workloads optimized for Xeon-specific SIMD instructions.


Care to link to processor efficiency benchmarks underlining your statement?


That page is most assuredly not comparing apples to apples. The throughput isn't noted at all; it's more of a comparison of peak power usage, which you could maybe use to plan cooling, and a comment on idle power usage.

But you would need to compare total watts for enough systems to get the throughput you need, which I suspect is not the same and is likely workload dependent.

For home use, where you are likely to only have a single system, you can directly compare idle usage, but for Google, how many idle systems there are depends on your throughput needs --- and maybe Google can orchestrate shutdown / reuse of idle servers off-peak, so it might be moot.


Not benchmarks, but Amazon is saying their new R5a and M5a instances running on Epyc will offer 10 percent savings on compute costs.

https://www.datacenterdynamics.com/news/amazon-web-services-...


Are ECC LPDDRx memory sticks available? If not, perhaps they should consider it. But Google probably couldn't switch until/unless they're produced.

Google produces their own sticks anyway.

At one point they were taking chips that had failed the manufacturer's QA, sticking them on DIMMs themselves in such a way that the ECC would cover those errors, and just running jobs on many machines to make up for the reduced ability of ECC to make guarantees.

And they make their own motherboards too, so just soldering down LPDDR isn't out of the question.


Wow, this sounds really interesting.

Do you have a link where I can read more about this?


It's nearly impossible to find. Google's paper, put out by an intern right after, showing that DRAM errors are way more common than expected (yeah, no shit, if they're pulling this crap), dominates the search results.

For a while I thought I was going crazy, but a comment from someone on this article implies he saw the same slide deck. https://arstechnica.com/information-technology/2009/10/dram-... I haven't been able to get anyone to reconfirm it from Google's end, my sense is that it's information that wasn't supposed to be released.


LPDDR doesn't come in modular form. It's always soldered onto the same PCB as the processor. The number of packages used is determined by the data bus width of the processor. If Google has access to a processor with a DRAM controller that supports LPDDR4 and ECC, they can design a suitable motherboard.

That is typical usage, but it isn't the required usage. It is straightforward to build a controller ASIC that would provide the appropriate bus signals and drive levels for these to operate on a module.

The signal integrity at 0.6V, at whatever frequency these run at, across a connector must be horrible. Are people doing this?

I have seen LPDDR modules where the ASIC provides the connector signals. So they run 0.6V between the LPDDR chips and the ASIC, then the ASIC runs 1.8V on the connector.

I don't think I follow your topology. The ASIC is driving 1.8V to a DIMM with 3 LPDDR chips on it? At DDR4 speeds?

The modules in question were for an embedded system with 3 different memory size options. The LPDDR chips and ASIC were on the memory module (4, 6, or 8 LPDDRs), and the module connected to the baseboard with a high speed mezzanine connector.

Is this something you are looking to build for your own system? I assume the ASIC was proprietary but I could reach out to the vendor and see where they had the modules made (or if they were built in house).


Right, I see. So memories and controller on the same module, just like normal. My original comment was concerning parallel 0.6V signalling at DDR4 speeds over a connector - which isn't happening here. Or probably anywhere.

IIRC DDR4 already has pretty much the same energy consumption as LPDDR4 in active use; LPDDR4 gains when not in active use (sleep / hibernation). LPDDR4X does improve on LPDDR4 by a bit, but I don't know how significant ~20% less energy consumption on RAM is in a server context, especially when you're trading density for it.
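Back-of-the-envelope only, and every number below is an assumption rather than a measurement, but it shows how to think about whether ~20% off the DRAM matters at the server level:

    #include <stdio.h>

    int main(void) {
        double dimm_w   = 3.0;    /* assumed active draw per DIMM (W)         */
        double dimms    = 16.0;   /* assumed DIMMs per server                 */
        double server_w = 400.0;  /* assumed whole-server draw (W)            */
        double saving   = 0.20;   /* the ~20% LPDDR4X figure from this thread */

        double ram_w   = dimm_w * dimms;
        double saved_w = ram_w * saving;
        printf("RAM: %.0f W (%.0f%% of the server); saving cuts total draw by %.1f%%\n",
               ram_w, 100.0 * ram_w / server_w, 100.0 * saved_w / server_w);
        return 0;
    }

With these assumed figures the whole-server saving works out to only a couple of percent, which is the sense in which the density trade-off may not be worth it.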

Do you not think the IO voltage reduction alone gets you that? Or are you saying it's just not worth it?

It’s not unreasonable though. For every x watts of server you need y watts of cooling.

Years ago Google complained that they couldn’t fill up their data centers because they had hit the limits of electrical code. Servers plus A/C took them into no man’s land as far as electrical code was concerned. That also means wasted real estate.


> It’s not unreasonable though. For every x watts of server you need y watts of cooling.

Probably less than you think. To be precise, for every x watts of power Google needs about y = 0.12*x watts of cooling (and other overhead) on average. That's based on the PUE number at <https://www.google.com/about/datacenters/efficiency/internal....

> Years ago Google complained that they couldn’t fill up their data centers because they had hit the limits of electrical code. Servers plus A/C took them into no man’s land as far as electrical code was concerned. That also means wasted real estate.

I don't remember that, but I'm guessing it was from the days Google's servers were in datacenters built by other companies. They probably outfitted so many circuits at x volts / y amps each, and there's only so much you can do with that.

(To be precise, Google still runs some servers in non-Google datacenters, as part of running a CDN. But this isn't the bulk of Google's processing power.)


That seems unlikely. An aluminum smelter uses far more energy than a data center ever will.

> uses far more energy than a data center ever will.

https://www.bizjournals.com/seattle/blog/techflash/2015/11/p...

"Power shift: Data centers to replace aluminum industry as largest energy consumers in Washington state"

Large custom data centers draw on the order of 300MW, a non-scientific survey of aluminum smelters on the web has their draw around 400-700MW, so I'd be careful about 'ever will'.


Your first argument is about aggregate demand, which is irrelevant. Home lighting uses a lot of power in aggregate, but it's very low per home.

Smelting uses a lot of power in a very small space, resulting in a lot of heat, whereas computing is not getting anywhere close to that in terms of density - just in aggregate over giant data centers.


The fact that data centers don’t get hot enough to melt aluminum is by design.

But I’m not sure I understand your point. I looked at Google Maps images of the Wenatchee Alcoa works and the Google data center in The Dalles, and they seem order-of-magnitude the same size to my eyeballs.

You seem to think I don’t appreciate how much energy aluminum smelting requires - I do. But imagine the heating element of your aluminum crucible broken into a million pieces, each piece with a cooling system to keep its temperature down. That’s a data center.


The point is that aggregate demand is not that difficult to deal with; just look at, say, New York City.

Large data centers are huge buildings; the density is very much limited by heat issues, and while the building might get bigger, the power demand per square foot does not. We can move lots of power through very small areas - consider that individual steam turbines are up at 600 MW.

Even if we start talking about GW for a single datacenter, the single Kashiwazaki-Kariwa plant is already 8GW.


They are in large ventilated rooms (bigger than an airplane hangar) with strict OSHA controls regarding heat exposure.

A data center is a much more domesticated space, and they are often near commercial and residential areas, not industrial ones.


This is what they look like inside: https://www.genisim.com/website/fig2cfd2001.jpg

Yeah, that’s a good question. Smelters don’t generate their own power generally, right? But I believe they also use a lot of air cooling, so most of that draw is going to electrolysis.

Wikipedia has a list of aluminum smelters and I notice none of the biggest are in the US. Is that just because, or due to code limitations?


The Pacific Northwest used to have a lot more Aluminum smelting than it does now - the power was provided by large hydro power contracts.

Not sure if it was changes in these contracts or other market forces that caused the decline.


Decided to answer my own question - here's a great history of the NW aluminum industry:

https://www.nwcouncil.org/reports/columbia-river-history/alu...

The last active NW aluminum smelters closed in 2016. Most of the rest were victims of Enron and the 2000 energy crisis.


Aluminum smelting is dirty and uses a ton of power -- we generally don't subsidize power here (in stark contrast to China / Argentina / etc.) and have increasingly strict environmental controls so we're offshoring most production and pollution associated with smelting from ore. We do have a lot of 'secondary' production which is much cleaner since it uses aluminum scrap as the input.

I hope these exciting advances in science and engineering will lead to... DRAM that costs the same as several years ago.

It makes you wonder what advances in science and engineering have been stalled because of DRAM costs. Particularly in countries with less money to throw at research etc.

12GB seems an odd number, when it's usually multiples of 8. Anyone else think that? I wonder if it's because of size constraints?

It’s not 12 GB, it’s 12 Gb (1.5 GB), although it is still an odd size.

I would imagine the odd size could be due to yields of the new memory.

EDIT: thanks to the child post for correcting my typo.


Well, since you started, it's not 1.5Gb, but 1.5GB :)

Or more accurately 1.5GiB ;)

Phones want 6 GB so they're making 1.5 GB chips. Apparently 4x 1.5 GB is cheaper than 3x 2 GB, probably because of yield. I imagine it complicates the address decode slightly but that's it.

This reminds me of triple-core processors that were "inconceivable" right until they shipped.


Were there any 3-core processors which were not de-rated 4-core processors?

Apple's A8X is one. Here's the die shot from AnandTech/Chipworks: https://images.anandtech.com/doci/8716/A8X%20draft%20floorpl...

The Xbox 360 comes to mind.

PS3 had a 9-core (3*3).

The PS3's Cell chip had a PowerPC main core and 8 SPU cores. Of the 8 SPUs, only 6 were available to developers, as one was reserved for OS functions and the other, ironically enough, was held back for yield purposes.

> Phones want 6 GB

Why? Why not 8? Whatever, I doubt I've ever seen a phone with more than 2 GB RAM. I wish my phone was as powerful as my PC (which is very old, has 4 GB RAM and can't handle more because of the chipset limitations) so I could just attach peripherals I want, boot a desktop Linux (directly or in a VM) and use it instead of the PC.


> Why? Why not 8?

So they can sell you 8 next year. And 10-12 the year after that. And 16 after that. And so on and so forth...

> I doubt I've ever seen a phone with more than 2 GB RAM.

- The first phone with more than 2GB of RAM was the Samsung Note 3 back in 2013 - with 3GB.

- Almost every "flagship" phone since 2015/16 has had 4GB of RAM or more. So if you've seen people using phones on the street, subway, office, etc in the last 3 years, you've seen phones with more than 2GB of RAM.

> I wish my phone was as powerful as my PC...so I could just attach peripherals I want, boot a desktop Linux (directly or in a VM) and use it instead of the PC.

That would be awesome and is my ideal view of the mobile computing future.


> - The first phone with more than 2GB of RAM was the Samsung Note 3 back in 2013 - with 3GB.

Samsung Note 3 is exactly the model of my phone. I've always believed it has just 2 GiBs. Let me check...

Yes. free -g in Termux says 2. However...

... free -m says 2834, which means ~2.77 GiB. Well, this indeed is "more than 2GB of RAM", but not by much, and what a weird number...

> That would be awesome and is my ideal view of the mobile computing future.

Seems like it's the past already. There were numerous attempts (successful in one way or another) years ago, and despite the fact that smartphones keep getting more powerful and adding more RAM, this approach still doesn't show signs of becoming popular; the most hyped projects end up either discontinued or stagnant.


Samsung Dex[0] is still an active project and recently announced a beta version. It will only work with newer Samsung devices though, which excludes my Galaxy S4 and of course all phones by other manufacturers.

[0]https://www.samsung.com/global/galaxy/apps/samsung-dex/


The OnePlus 6T comes in 6 GB and 8 GB LPDDR4X variants

Check out almost any new Samsung phone; mine has 6 GB.

While 12 is not a power of two, it is the sum of two powers of two (4 and 8).

A number of workstations that I have designed have used 2 sticks of 4 GB RAM and 2 sticks of 2 GB RAM. This has saved money at little to no cost to performance.

Perhaps Micron is doing something similar here?


A datasheet for a related non-power-of-2-sized product from Micron gives a hint as to what they're doing:

https://prod.micron.com/~/media/documents/products/data-shee...

Page 28:

"For non-binary memory densities, only a half of the row address space is valid. When the MSB address bit is HIGH, then the MSB-1 address bit must be LOW."

They basically cut off the upper 1/4 of the address decoder.
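A small sketch of that rule (the row-address width below is made up; the datasheet will have the real figure): forbidding the top two row-address bits from both being 1 leaves exactly three quarters of a power-of-two decoder usable, i.e. 12 Gb of a 16 Gb-shaped address space.

    #include <stdbool.h>
    #include <stdio.h>

    #define ROW_BITS 17   /* hypothetical row-address width */

    /* "When the MSB address bit is HIGH, then the MSB-1 address bit must be LOW." */
    static bool row_is_valid(unsigned row) {
        bool msb  = (row >> (ROW_BITS - 1)) & 1;
        bool msb1 = (row >> (ROW_BITS - 2)) & 1;
        return !(msb && msb1);
    }

    int main(void) {
        unsigned total = 1u << ROW_BITS, valid = 0;
        for (unsigned row = 0; row < total; row++)
            if (row_is_valid(row))
                valid++;
        printf("%u of %u rows usable (%.0f%%)\n", valid, total, 100.0 * valid / total);
        return 0;
    }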


Also, and speaking as a jaded developer, it’s strategically a bad idea to give your team a 2x bump in available memory, especially if it’s maxing out the memory on your servers. If that’s 16GB per slot, then 12GB DRAM sticks let you slowly increase memory a bit longer.

Once you hit the ceiling a whole bunch of disciplines have to change overnight. Better to give them a 25-50% bump and start the story about horizontal scalability before you’re entirely out of other options.


Realistically that extra 25% is already taken up by the OS, GUI and other resources (AV, IDE, Electron apps, etc). You're already at a disadvantage to most server platforms.

I've been running 16-32GB for a very long time now. Servers can go into the multiple TB ranges.


I always thought that 8 GB is not enough but 16 GB is too much for me. 12 GB sounds nice.

Probably a good idea to change the title from "Gb" to "Gbit" to avoid ambiguity.

Is this type of RAM still susceptible to rowhammer memory corruption effects?

LPDDR4 has some mitigation for rowhammer, and LPDDR4X is a variant of it. So I don't know that it's immune, but it should be more resilient.

Are those the chips in the new iPad Pro? Would make sense since according to reports the 1TB iPads have 6GB RAM.

I hate these "new parts available" announcements that are so vague they don't even have part numbers.

Fortunately a search at Micron's website reveals that this is the "Z2BM" with two part numbers, MT29VZZZ7D8DQFSL-046 (containing 2 Z2BM) and MT29VZZZBD9DQKPR-046 (containing 4 Z2BM). No public datasheets (yet), unfortunately, I was curious to read a little more.


Does anybody know how many of these we could expect to fit in a MacBook Pro? My understanding is that power consumption has been a limiting factor.

You can fit exactly zero of these in the MacBook Pro until Intel ships chips that actually support LPDDR4X (aside from the Cannonlake paper launch).

Or Apple switches them to the A-series processors! Although I really cannot see that happening on the Pro lineup anytime soon (I think it's highly likely on the Air and non-Pro MacBooks, though).

I read somewhere that it wouldn't actually help at the moment. LPDDR4 on the A series chips is in the same package as the CPU. From what I understand, this is the only current way that anyone in the industry is using LPDDR4.

What Intel is promising is LPDDR4 packaged in its own chip(s) on the motherboard.


In the iPads, the RAM is not stacked onto the CPU, it's normal separate chips. e.g. two Micron here: https://d3nevzfk7ii3be.cloudfront.net/igi/WHKDR5ZU6SSQA3qa

Had to double-check that... but you're correct. So now I'm really hoping for an AX MacBook Pro. Imagine what the A12X in the newest iPad could push if you added some nice cooling to it ;)

As long as you aren't using any Adobe software.

So I wonder if Micron will support 16Gbit LPDDR4X chips by the time Intel launches its 10nm chips.

They have 16 Gbit LPDDR4, from what I know.

I am talking about the capacity of the individual die BTW.

1. Zero, because Intel still has no support for LPDDR4 in their laptop chips.

2. I don't know that it would make much difference; IIRC energy-efficient RAM has a huge impact on idle/sleeping power, but when everything's running full tilt, other components far outdraw RAM's needs.


Apollo Lake and Gemini Lake support regular LPDDR4. I bet you can make it use LPDDR4X timings.


