
CRAM: Efficient Hardware-Based Memory Compression for Bandwidth Enhancement - godelmachine
https://arxiv.org/abs/1807.07685
======
js8
I wonder how it would fare compared to a "software-only" solution, that is,
re-architecting your algorithm to work with compressed data directly.

For example, consider an analytical engine scanning the values in a database
column to sum them. To increase the effective bandwidth, you can compress the
values (for example, with some form of Huffman encoding), and you have two
options:

1. Decompress the values before passing them on to the calculation, then do
the calculation. This is what the paper proposes (transparently in hardware,
as the data are being used in the cache).

2. Do the "addition" directly on the compressed values. This requires more
logic. (It doesn't have to be implemented purely in software; it could be
hardware-assisted by an FPGA or something like that.)

So I wonder, instead of compressing the cache, wouldn't it be better to
implement the 2nd solution?

The disadvantage of the first solution is that, since it has to work in the
general case, its compression is more likely to be poor in special cases where
the second solution, with a specialized compression/processing combo, could be
superior.
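
To make the contrast concrete, here is a toy Python sketch (the column, the
codes, and the dictionary below are made up; a real Huffman coder would work
at the bit level):

    # Toy setup: a column of values, dictionary-encoded so that frequent
    # values get small codes (a stand-in for Huffman coding).
    from collections import Counter

    dictionary = {0: 100, 1: 250, 2: 999}       # code -> value (hypothetical)
    compressed_column = [0, 0, 1, 0, 2, 1, 0]   # the stored stream of codes

    # Option 1: decompress every element, then sum (what the paper does
    # transparently in hardware).
    total_1 = sum(dictionary[code] for code in compressed_column)

    # Option 2: operate on the compressed form directly -- count each code
    # once and do one multiply per distinct value, never materializing the
    # decompressed column.
    counts = Counter(compressed_column)
    total_2 = sum(dictionary[code] * n for code, n in counts.items())

    assert total_1 == total_2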

~~~
chii
How does one work with compressed data directly? If you are dealing with the
data in blocks, then your compression is only as good as the repetition in
each block (perhaps the dictionary can be shared).

But even in this case, you don't deal with compressed data directly, but
decompress only 1 block at a time.

~~~
js8
No, the quality of the compression depends on the probability distribution of
your data blocks. If some blocks are much more likely to occur, then you can
compress them better (encode them in a smaller number of bits).

What I mean by "directly" is that instead of having to compute

y = y + c^-1 (x_i)

where c(x) is the compression function (and c^-1 is its inverse,
decompression), we could define a function f such that

z = f(z, x_i)

and finally y = c^-1(z). This f would combine both the "+" (the operation to
be done on the data) and c^-1 (the decompression).

In general, it's non-trivial to come up with f, but I do believe that in many
cases it would be a better approach than the first method.
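
As a concrete, made-up instance: with run-length encoding and "+" as the
operation, such an f exists and is cheap (a minimal Python sketch):

    # Each compressed symbol x_i is a (value, run_length) pair, i.e.
    # c^-1(x_i) expands to run_length copies of value.
    compressed = [(5, 3), (7, 1), (2, 4)]   # decompresses to 5,5,5,7,2,2,2,2

    def f(z, x):
        # Consume a compressed symbol directly, without expanding it.
        value, run_length = x
        return z + value * run_length

    z = 0
    for x in compressed:
        z = f(z, x)

    # Here the accumulator already lives in the plain domain, so the final
    # c^-1 is the identity and y == z.
    assert z == sum([5, 5, 5, 7, 2, 2, 2, 2])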

Floating-point numbers can be seen as a simple case of this. If we represented
numbers in fixed point, we would have a lot of zeros (and also a lot of digits
of very low significance) that we mostly don't need to store. Therefore, in
practice, we use a form of compression: the floating-point representation.
Operations on FP numbers are more complicated (you have to align the mantissas
before adding), but it's a good trade-off.

~~~
davrosthedalek
In most, if not all, compression schemes, your "compression function" for x_i
depends on all x_j with j < i. So it's really a new function for every i.

------
jepler
So they set aside two 32-bit patterns; when one of these patterns appears at
the end of a 64-byte cache line, it indicates a compressed line; otherwise,
the line is literal.

This smells an awful lot like an impossible infinite compression claim; any
real compressor must increase the length of at least some inputs due to the
pigeonhole principle.

Their solution is to add a 16-entry table to track "lines that look like they
have the compressed-line marker, but are not actually compressed" (they also
store such lines inverted in DRAM, but this seems inessential). They claim 16
entries are enough based on probabilistic arguments, but they should be
thinking about data generated by an adversarial process; one such process
would be to choose 60 incompressible bytes followed by the 4-byte
compressed-line indicator.

Including just one bit of metadata per cache line of memory comes to 32 MiB
for 16 GiB of RAM, more than the 0.75 bits/line (24 MiB) calculated on page 5
for the scheme they would like to replace.

Their next bit of handwavium is that they use "DES" to choose line markers in
a way dependent on a secret key, which can be changed in case of a LUT
overflow. I wasn't immediately able to find a modern figure on how fast
hardware "DES" is, but a 2003 paper gives a figure of "21 to 37 cycles of
latency"
[https://www.intopix.com/Ressources/WPs_and_Sc_Pub/intoPIX%20...](https://www.intopix.com/Ressources/WPs_and_Sc_Pub/intoPIX%20-%20efficient_uses_of_fpgas_for_implementati1520415691870.pdf)
-- will this fit into the time budget, or will it slow memory accesses? (They
first say this is off the critical path, but then that the marker is per-line;
I don't see how both are true.)

When the limited table is exhausted, a new key is chosen and all of memory
must be updated (so you really don't want to do that). They dismiss this by
again appealing to rarity ("1-in-2^512"), but if that's true you might as well
design the chip to blow up when the condition arises. If any timing
information about "line appears in LUT" is visible to a program, then for an
adversarial program it is just 16 * 2^32 writes instead (you can tweak line 1
until it needs a LUT entry, which takes just 2^32 writes; then repeat for the
next line).
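
To spell out how I read the scheme, here is a rough Python sketch of the read
path (the marker values and the LUT policy are stand-ins I made up, not the
paper's actual parameters):

    import os

    # The two key-dependent 32-bit patterns; the paper derives them from a
    # secret key with DES, which is not modeled here.
    MARKERS = {0xDEADBEEF, 0xCAFEF00D}   # hypothetical values
    false_positive_lut = set()           # the 16-entry table of line addresses

    def decompress(line):
        raise NotImplementedError        # placeholder for the line decompressor

    def read_line(address, line):        # line: 64 bytes fetched from DRAM
        marker = int.from_bytes(line[60:64], "little")
        if marker in MARKERS and address not in false_positive_lut:
            return decompress(line)      # treated as a compressed line
        return line                      # literal line, stored as-is

    # The adversarial pattern described above: 60 arbitrary bytes followed by
    # a marker. Every such literal line consumes one of the 16 LUT entries.
    evil_line = os.urandom(60) + (0xDEADBEEF).to_bytes(4, "little")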

~~~
Dylan16807
> This smells an awful lot like an impossible infinite compression claim; any
> real compressor must increase the length of at least some inputs due to the
> pigeonhole principle.

Why even include this sentence when the rest of the post is about the extra
state it uses to track collisions?

> Including just one bit of metadata per cache line of memory is 32MiB, more
> than the 0.75 bits/line (24MiB) calculated on page 5 for the scheme they
> would like to replace.

Sure? The point of this scheme is to not store _any_ metadata per line of main
memory.

> Their next bit of handwavium is that they use "DES" to choose line markers
> in a way dependent on a secret key which can be changed in case of a LUT
> overflow. I wasn't immediately able to find a modern figure on how fast
> hardware "DES" is, but a 2003 paper gives a figure of "21 to 37 cycles of
> latency"
> [https://www.intopix.com/Ressources/WPs_and_Sc_Pub/intoPIX%20...](https://www.intopix.com/Ressources/WPs_and_Sc_Pub/intoPIX%20-%20efficient_uses_of_fpgas_for_implementati1520415691870.pdf).
> -- will this fit into the time budget or will it slow memory accesses? (They
> first say this is off the critical path but then that the marker is per-
> line, I don't see how both are true)

21 to 37 cycles is definitely lower latency than accessing memory. And you can
easily pipeline it to match the rate of memory accesses.

> When the limited table is exhausted, a new key is chosen and all of memory
> must be updated (so you really don't want to do that) They dismiss this by
> again appealing to the rarity as "1-in-2^512" but if that's true you might
> as well just design the chip to blow up when the condition arises. If any
> timing information about "line appears in LUT" is visible to a program, then
> it is just 16 * 2^32 instead for an adversarial program (you can tweak line
> 1 until it needs a LUT entry, which takes just 2^32 writes; then repeat for
> the next line)

So an adversarial program can waste CPU? It can already do that.

But more importantly, the LUT only needs to be involved when checking the
marker on a memory access. It should be very possible to make it fixed-timing.

------
mehrdadn
Sadly, you have to remember that transparent compression is defeated by
transparent encryption (as more and more data is encrypted), and that it is
susceptible to timing attacks when the data isn't encrypted. I wish this
weren't the case, since I think compression is awesome, but it's something you
have to keep in mind whenever you introduce compression.

~~~
xemdetia
There are a couple of things that feel concerning about this from a security
perspective, as it's optimizations like this that could create another
Spectre-class vulnerability. Mixing cache lines arbitrarily seems bad when you
think of virtualization and the dream of memory isolation; you would need a
lot more bookkeeping data to maintain such an optimization. It is also hard
not to be a crypto snob when the author writes 'secure hash' and offers single
DES as the remedy.

I also don't like that there is a claimed speedup figure, but from a cursory
reading of the available tables and graphs it is not clear what the baseline
is. It's all done in emulation (apparently), so it's hard to know whether the
cost is amortized away because no actual access time is modeled, or something
else, but that could just be my unfamiliarity with the emulator.

------
miohtama
BLOSC already does this in software.

[http://blosc.org/](http://blosc.org/)

The sales pitch is "faster than memcpy": fetch data from RAM, decompress it in
the L1/L2/L3 caches, and run operations on the data. Some very nice benchmarks
are available.
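
A rough sketch of the workflow with the python-blosc bindings (array contents
and parameters are arbitrary, and the exact API details are from memory, so
treat them as an assumption):

    # Keep the column compressed in memory; DRAM traffic is the compressed
    # bytes, and decompression happens in (hopefully cache-resident) buffers.
    import blosc
    import numpy as np

    column = np.random.randint(0, 1000, size=1_000_000, dtype=np.int64)
    compressed = blosc.compress(column.tobytes(), typesize=8)

    restored = np.frombuffer(blosc.decompress(compressed), dtype=np.int64)
    assert restored.sum() == column.sum()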

~~~
mrguyorama
I find it incredible how _slow_ modern RAM and data buses are. A far cry from
the early days, when you could upgrade the CPU cache by inserting some memory
chips on the other side of the board.

~~~
gameswithgo
RAM has gotten steadily faster since then, but CPUs have gotten faster,
faster.

~~~
mhkool
RAM access time has not gotten much faster; only bandwidth has increased.

------
ahartmetz
For GPUs, memory compression isn't theoretical: they use it to compress frame,
depth and stencil buffers to save bandwidth. Overall performance (i.e. frame
rate) improvement is typically 10-20% AFAIK.

~~~
sprash
Yes, and it also only works well for GPUs. GPU-RAM communication is all about
bandwidth.

General-purpose CPUs are all about latency. Only very specialized problems
would benefit from memory compression. On the other hand, compression will
increase latency, which is counterproductive for most traditional CPU tasks.

~~~
deagle50
Even though the trend is toward GPUs, we'll likely be running large amounts of
throughput workloads on CPUs for another decade. A 5% reduction in power
consumption is worth exploring.

Is it bad that I'm looking forward to the end of Moore's law (a little)?

~~~
ahartmetz
With my specialty in native code, I am already benefiting from the end of
Moore's law: now, a factor of 2 in performance actually matters. I also expect
to see more specialized hardware and software to, say, feed 8K screens,
process many small independent transactions (high-core-count servers), etc.
There is going to be a lot of interesting performance work :)

------
ontologiae
I wonder whether this very clever approach is interesting from an
energy-consumption perspective. In this work
([http://hpc.pnl.gov/modsim/2014/Presentations/Kestor.pdf](http://hpc.pnl.gov/modsim/2014/Presentations/Kestor.pdf)),
the author shows (slide 11) that it costs 100 times more energy to move data
from memory to a register than to move it from the L1 cache. So this kind of
mechanism could hugely increase the amount of data that can be kept in the
L1/L2/L3 caches, but what would the energy cost of the
compression/decompression step be?
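
A back-of-envelope way to frame the question, using the slide's 100x figure
and purely hypothetical numbers for the codec cost and compression ratio:

    # Normalized, hypothetical numbers: compression is a net energy win when
    # the codec energy is below the DRAM-traffic energy it avoids.
    e_dram_per_byte = 100.0      # ~100x the L1 cost, per the cited slide
    compression_ratio = 1.5      # hypothetical
    e_codec_per_byte = 10.0      # hypothetical (de)compression cost

    bytes_saved_fraction = 1 - 1 / compression_ratio  # DRAM traffic avoided
    net_win = bytes_saved_fraction * e_dram_per_byte - e_codec_per_byte
    print(f"net energy saved per byte (in L1-access units): {net_win:.1f}")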

------
bfirsh
If you’re on a phone, here’s an HTML version of the paper:
[https://www.arxiv-vanity.com/papers/1807.07685/](https://www.arxiv-vanity.com/papers/1807.07685/)

------
Arkanosis
The “CRAM” name is already used by an NGS file format (for compressed sequence
storage):
[http://www.internationalgenome.org/faq/what-are-cram-files/](http://www.internationalgenome.org/faq/what-are-cram-files/).

~~~
nawgszy
Is that supposed to be a dealbreaker? I imagine 4 letter acronyms, especially
ones that are also words, are in short supply these days.

------
moonbug
GPU vendors have been doing this with texture maps for years.

~~~
TomVDB
Texture-map compression is lossy, it's a very compute-intensive process done
at game mastering time, and it's a fixed-ratio compression.

None of that is the case here.

You're probably thinking of delta color compression.

------
cramcram
Their assumption of 64 bytes cacheline is unrealistic. Most CPU architecture
use 64 _bit_ cacheline (1/8 of 64 bytes), for good reasons. 512 bit cacheline
is insane.

Also, good compression requires a large corpus / large model, so page-based
compression is much more useful in real life. For example, Google's ChromeOS
has used zram by default since 2013 [1].

[1] [https://en.wikipedia.org/wiki/Zram](https://en.wikipedia.org/wiki/Zram)

~~~
blattimwind
> Most CPU architecture use 64 bit cacheline (1/8 of 64 bytes), for good
> reasons. 512 bit cacheline is insane.

No. 64 bytes and, to a lesser extent, 128 bytes are pretty much the only
extant cache line sizes these days.

64-bit cache lines would be insane.

