
Left, Right, Above, and Under: Intel 3D Packaging Tech Gains Omnidirectionality - rbanffy
https://fuse.wikichip.org/news/3508/left-right-above-and-under-intel-3d-packaging-tech-gains-omnidirectionality/
======
fxtentacle
The main advantage of this is the ability to have insane amounts of cache
memory.

For a benchmark like rendering, where AMD currently leads due to having more
cores (each with its own cache), having 512 MB of cache might mean a 2x to
10x gain in performance.

It's the same for deep learning, where convolutions are heavily used and
cache misses and memory bandwidth become the major bottlenecks.

Intel already has the world's best tools for analyzing and fixing pipeline
stalls. But that doesn't help if you have to wait for the relatively slow main
memory.

So this technology could massively increase the value of Intel's existing
development tools and leapfrog performance per watt for rendering, a highly
profitable market if you look at the CGI budgets of recent movies.

~~~
CyberDildonics
This is wild speculation built on hope and hype.

First, all CPUs have local L1 and L2 cache in addition to some sort of shared
L3 cache. If you mean 512 megabits, that has already been done, if you mean
512 megabytes, I don't think there is any indication that expensive and power
intensive cache will suddenly make a giant leap in size. Don't forget that
chips already have around 40% of their transistors dedicated to cache. Intel's
given numbers for this look great, but they don't add up to what you are
saying.

Caches for graphics running on the CPU are mostly going to help either
unoptimized software, or very optimized software that works with contiguous
memory chunks small enough to fit in cache, thus saving memory bandwidth
(which looks like the biggest breakthrough here). Caches are only going to do
so much for jumping around in memory.
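
To make the contiguous-vs-jumping distinction concrete, here is a minimal
C++ sketch (the sizes and names are mine, for illustration): summing an
array sequentially lets the hardware prefetcher stream data through the
cache, while chasing a random permutation defeats the prefetcher and pays
close to full memory latency on every step.

    #include <algorithm>
    #include <cstdint>
    #include <numeric>
    #include <random>
    #include <vector>

    int main() {
        // ~64M elements (~256 MB), well beyond any current L3
        const std::size_t n = 1u << 26;
        std::vector<std::uint32_t> next(n);
        std::iota(next.begin(), next.end(), 0);

        // Pattern 1: sequential sum -- bandwidth-bound, prefetch-friendly
        std::uint64_t seq = 0;
        for (std::size_t i = 0; i < n; ++i) seq += next[i];

        // Pattern 2: pointer-chase through a random permutation --
        // latency-bound; the prefetcher cannot guess the next address,
        // so more cache only helps if the whole array fits in it
        std::shuffle(next.begin(), next.end(), std::mt19937{42});
        std::uint32_t idx = 0;
        std::uint64_t chase = 0;
        for (std::size_t i = 0; i < n; ++i) { idx = next[idx]; chase += idx; }

        return static_cast<int>((seq ^ chase) & 1);  // keep results live
    }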

> So this technology could massively increase the value of Intel's existing
> development tools and leapfrog performance per watt for rendering, a highly
> profitable market if you look at the CGI budgets of recent movies.

I don't think this has anything to do with Intel's development tools. There is
no indication this would result in "leapfrogging performance per watt". Cache
needs significant power too. Movie CG budgets are almost all based on the
number of hours clocked in making them. The hardware, including power and
cooling, is proportionately a small part of the budget, usually around 10%-15%.

~~~
fxtentacle
From what I've heard, feature films regularly spend $1 million+ on renting
CPU power alone.

And yes, while caches won't help for jumping around in memory, rendering is
heavily optimized to not jump around. It's the same for AI convolution.

While the inner loop might not fit into a 128 MB cache, it would fit into a
512 MB cache. That would cause the effective memory bandwidth to rise from
~20 GB/s to ~500 GB/s. So if memory bandwidth is the limiting factor - and it
is for many convolution and rendering workloads - then a 10x speed-up is
possible.
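
Back-of-envelope version of that claim (my numbers, purely illustrative):
for a kernel that is genuinely bandwidth-bound, runtime is roughly bytes
moved divided by sustained bandwidth, so the best-case speedup from serving
the working set out of a faster level is just the bandwidth ratio - the
10x I'm claiming sits comfortably under that bound.

    #include <cstdio>

    // Roofline-style upper bound for a purely bandwidth-bound kernel:
    //   time ~= bytes_moved / sustained_bandwidth
    // The bandwidths below are assumptions for illustration, not
    // measurements of any specific part.
    int main() {
        const double dram_bw_gbs  = 20.0;   // assumed DRAM bandwidth
        const double cache_bw_gbs = 500.0;  // assumed big-cache bandwidth
        const double data_gb      = 8.0;    // e.g. one pass over 8 GB

        const double t_dram  = data_gb / dram_bw_gbs;   // 0.4 s
        const double t_cache = data_gb / cache_bw_gbs;  // 0.016 s
        std::printf("upper-bound speedup: %.0fx\n", t_dram / t_cache);  // 25x
        return 0;
    }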

Oh and the development tools I'm referring to are V-Tune & Amplify, i.e. the
Intel tools to make sure your loop is using all the tricks to avoid pipeline
stalls so that RAM bandwidth becomes the limiting factor.

~~~
CyberDildonics
> From what I've heard, feature films regularly spend $1 million+ on renting
> CPU power alone.

Renting CPU power is typically an indication that rendering needs were
severely underestimated, and a million USD would be an extreme example. That
money could buy multiple racks that would be in use for four or five years.

> rendering is heavily optimized to not jump around. It's the same for AI
> convolution.

I wish that were true, but with the exception of Disney's internal renderer,
only a few small parts of most renderers end up architected for this. Don't
forget that most actual production renderers are multiple decades old.

> While the inner loop might not fit into a 128 MB cache, it would fit into
> a 512 MB cache. That would cause the effective memory bandwidth to rise
> from ~20 GB/s to ~500 GB/s.

This doesn't really make sense. Instructions are very small and the data they
operate on are very large. Each core will be doing different things to
different data and the total L3 cache will be used by all of them. I'm not
sure how you are getting these numbers, but they don't seem grounded in the
reality of real software and lots of cores. L3 cache typically has bandwidth
in the same order of magnitude as memory, while Intel is claiming to increase
that by 3x in the article.

> So if memory bandwidth is the limiting factor - and it is for many
> convolution and rendering workloads - then a 10x speed-up is possible.

This is very far from the truth, especially for rendering. Most renderers
hop around in memory to trace rays through acceleration structures, actual
geometry, geometry attributes, and texture lookups. Also I'm not sure how 3x
the cache bandwidth translates to 10x program speed in any scenario.

> Oh and the development tools I'm referring to are V-Tune & Amplify, i.e. the
> Intel tools to make sure your loop is using all the tricks to avoid pipeline
> stalls so that RAM bandwidth becomes the limiting factor.

Yes, those exist, but they don't have anything to do with the article or cache
sizes.

~~~
fxtentacle
> Don't forget that most actual production renderers are multiple decades old.

I use V-Ray, and the memory bandwidth is one of the defining factors of my
rendering speed. I saw a noticeable bump when I went from DDR4-1866 to
DDR4-2666.

> L3 cache typically has bandwidth in the same order of magnitude as memory,
> while Intel is claiming to increase that by 3x in the article.

My 500 GB/s number was the estimated L2 bandwidth for a Ryzen CPU.

> Most renderers hop around in memory to trace rays through acceleration
> structures

There's an Intel library called Embree which is very cache friendly. It's
used by V-Ray, the raytracer that I use.
[https://www.embree.org/](https://www.embree.org/)

> Yes, those exist, but they don't have anything to do with the article or
> cache sizes.

When I had to implement a C++ convolution algorithm, it was pretty slow at
first. Then I used V-Tune and Amplify to re-order the loops, optimize memory
access patterns, and identify places where AVX or SSE intrinsics would be
helpful. After doing that, I could then measure with Amplify that my
convolution loop was now limited in performance by cache misses, which is
exactly what a larger cache would improve. And if I remember correctly, I got
only 10% pipeline utilization, which would suggest that it could be 10x faster
with fast enough memory.
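
For reference, the kind of restructuring I mean usually comes down to loop
order. A hypothetical before/after for a 1-D convolution (my reconstruction
for illustration, not my actual production code):

    #include <vector>

    // "Before": kernel taps in the outer loop. Every tap makes a full
    // pass over `in` and `out`, so when the arrays are larger than the
    // cache, the data streams through DRAM once per tap.
    // Assumes out is zero-initialized and
    // in.size() >= out.size() + k.size() - 1.
    void conv_taps_outer(const std::vector<float>& in,
                         const std::vector<float>& k,
                         std::vector<float>& out) {
        for (std::size_t j = 0; j < k.size(); ++j)
            for (std::size_t i = 0; i < out.size(); ++i)
                out[i] += in[i + j] * k[j];
    }

    // "After": taps in the inner loop. Each output element is finished
    // in one go, the accumulator stays in a register, and the small
    // input window is reused straight out of L1 - the arrays cross the
    // memory bus only once, and the compiler can vectorize the inner
    // loop with AVX/SSE.
    void conv_taps_inner(const std::vector<float>& in,
                         const std::vector<float>& k,
                         std::vector<float>& out) {
        for (std::size_t i = 0; i < out.size(); ++i) {
            float acc = 0.0f;
            for (std::size_t j = 0; j < k.size(); ++j)
                acc += in[i + j] * k[j];
            out[i] = acc;
        }
    }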

~~~
CyberDildonics
> There's an Intel library called Embree which is very cache friendly. It's
> used by V-Ray, the raytracer that I use.
> [https://www.embree.org/](https://www.embree.org/)

I started using it when it was released and I've looked through the source
code. It does some operations in chunks of 4, 8 or 16 to try to utilize AVX
registers, especially in the version written in ISPC. It isn't what I would
call cache friendly; it still hops around a lot in memory.
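
Roughly what those chunks of 4, 8 or 16 look like: structure-of-arrays ray
packets sized to the SIMD width, so one AVX instruction can test several
rays at once. A simplified sketch (my own toy types, not Embree's actual
API):

    // 8-wide ray packet in structure-of-arrays layout, the shape
    // Embree-style kernels use so that one AVX instruction can process
    // all 8 lanes. Note what is vectorized: the rays, not the memory
    // access pattern - BVH traversal still hops around in memory.
    struct RayPacket8 {
        float ox[8], oy[8], oz[8];   // ray origins, one lane per ray
        float dx[8], dy[8], dz[8];   // ray directions
        float tmax[8];               // current closest hit per lane
    };

    // Scalar stand-in for an 8-lane axis-aligned plane test; a real
    // kernel would do all 8 lanes with AVX intrinsics, or let ISPC
    // emit them.
    void intersect_plane_z(RayPacket8& p, float plane_z) {
        for (int lane = 0; lane < 8; ++lane) {
            if (p.dz[lane] == 0.0f) continue;   // parallel to the plane
            float t = (plane_z - p.oz[lane]) / p.dz[lane];
            if (t > 0.0f && t < p.tmax[lane]) p.tmax[lane] = t;
        }
    }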

> And if I remember correctly, I got only 10% pipeline utilization, which
> would suggest that it could be 10x faster with fast enough memory.

You misunderstood. You could likely get a 10x speedup if you restructured how
you access memory so that it can be prefetched. Memory with 10x better latency
does not exist. What you are interested in here is prefetching, not memory
bandwidth or cache.
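
To illustrate the distinction (hypothetical code, assumed numbers): if the
indices are known a few iterations in advance, a software prefetch can hide
DRAM latency even though neither the bandwidth nor the cache size changed.

    #include <xmmintrin.h>  // _mm_prefetch
    #include <vector>

    // Indirect gather with software prefetching: request data[idx[...]]
    // a few iterations early so the load overlaps with useful work.
    // The payoff depends heavily on the CPU and the access pattern;
    // this sketches the idea rather than guaranteeing a win.
    float gather_sum(const std::vector<float>& data,
                     const std::vector<int>& idx) {
        const std::size_t lookahead = 16;  // tuning knob, assumed value
        float sum = 0.0f;
        for (std::size_t i = 0; i < idx.size(); ++i) {
            if (i + lookahead < idx.size())
                _mm_prefetch(reinterpret_cast<const char*>(
                                 &data[idx[i + lookahead]]),
                             _MM_HINT_T0);
            sum += data[idx[i]];
        }
        return sum;
    }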

It's great that you are interested in this stuff, but I think you need to draw
a clearer line between your knowledge and your assumptions.

------
_1100
I'd be interested to hear some informed opinions on this tech...

Doesn't this add a ton of heat, and therefore cause a loss in clock speed?

Do the benefits of improved routing and transistor density outweigh this loss?

~~~
Kirby64
Silicon in general is a pretty good heat conductor, so that's not a huge
issue. Jacking up density does mean you get more heat in the same area, so you
do run into issues with that.

That said, having dies physically close to each other is good, since you run
into latency issues if things are too far apart. The speed of light is
actually a limiting factor.
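
For a sense of scale (my arithmetic, not from the article): at a few GHz,
light in vacuum covers only centimeters per clock cycle, and on-die signals
propagate at a fraction of that.

    #include <cstdio>

    // How far light travels in vacuum during one clock cycle.
    // On-chip signals are slower still, which is one reason stacking
    // dies millimeters apart beats spreading them across a board.
    int main() {
        const double c_mm_per_ns = 299.79;  // speed of light, mm per ns
        const double freqs_ghz[] = {1.0, 3.0, 5.0};
        for (double f : freqs_ghz) {
            // one cycle lasts 1/f nanoseconds
            std::printf("%.0f GHz: ~%.0f mm per cycle\n", f, c_mm_per_ns / f);
        }
        return 0;
    }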

So, although you can't have all of the silicon 'hot', you can increase the
amount of silicon you have as long as more of it is 'dark' (i.e. not active).

Seems like this is a benefit overall, although I wonder whether it is
actually cost-effective. Even processors today do this - if they're computing
complex instructions (e.g., AVX-512), they actually downclock because too
much silicon is used and they would heat up too much otherwise.

