
Screaming Fast Galois Field Arithmetic Using Intel SIMD Instructions (2013) - gbrown_
http://web.eecs.utk.edu/~plank/plank/papers/FAST-2013-GF.html
======
nkurz
At a glance, this seems like a clear explanation of using standard SIMD
instructions to solve the problem, but I think the landscape has changed since
this was written such that there are now better approaches.

In 2010, Intel released processors with a dedicated instruction for
"packed carry-less multiplication": [https://software.intel.com/en-
us/articles/intel-carry-less-m...](https://software.intel.com/en-
us/articles/intel-carry-less-multiplication-instruction-and-its-usage-for-
computing-the-gcm-mode). Unfortunately, the early implementations (through
Sandy Bridge) were slow, and could be beaten by combining other SIMD
operations as shown in this paper.
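
For anyone unfamiliar with the operation, here is a minimal Python sketch of what a carry-less multiply computes. It treats each integer as a polynomial with coefficients in GF(2), so partial products are combined with XOR rather than addition. The function name `clmul` is my own, and this bit-at-a-time loop only illustrates the arithmetic; the hardware instruction does the same thing on two 64-bit operands in one step.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less product of two non-negative integers.

    Each integer is read as a polynomial over GF(2): bit i is the
    coefficient of x^i.
    """
    result = 0
    shift = 0
    while b:
        if b & 1:
            # XOR instead of add, so no carries propagate between bits.
            result ^= a << shift
        b >>= 1
        shift += 1
    return result


# (x^2 + 1) * (x + 1) = x^3 + x^2 + x + 1
print(bin(clmul(0b101, 0b11)))  # prints 0b1111
```

Note that `clmul(3, 3)` is 5 rather than 9, because the two middle terms cancel under XOR.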

With the Haswell generation released in 2013, though, PCLMULQDQ got much
faster. Instead of being able to complete one instruction every 8 cycles, it
became possible to finish one every 2 cycles (inverse throughput went from 8
to 2). The 2015 paper "Faster 64-bit universal hashing using carry-less
multiplications" shows the difference this makes:
[https://arxiv.org/pdf/1503.03465.pdf](https://arxiv.org/pdf/1503.03465.pdf)

If you are looking for an explanation of how the problem could be solved with
the basic building blocks of SIMD, the 2013 Plank, Greenan, Miller paper might
be a good resource. But if you are hoping to implement a high-performance
solution for modern processors, the 2015 Lemire and Kaser paper is probably a
better starting point.

(This is with the caveat that I don't actually understand the theory or
terminology of Galois fields, and maybe there is something about applying it
to Erasure Coding that makes the faster PCLMULQDQ approach inapplicable.)

~~~
nullc
Erasure codes normally deal with fairly small fields (e.g. GF(2^8)), which
might reduce the usefulness of the carryless multiply instruction.
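
To make the size mismatch concrete, here is a rough sketch of a single multiplication in GF(2^8). The operands are one byte each, whereas PCLMULQDQ multiplies two 64-bit operands. The reduction polynomial 0x11D (x^8 + x^4 + x^3 + x^2 + 1) is an assumption on my part; it is a common choice in Reed-Solomon code, but a given library may use a different one. The function name `gf256_mul` is also just illustrative.

```python
def gf256_mul(a: int, b: int, poly: int = 0x11D) -> int:
    """Multiply two bytes in GF(2^8) by shift-and-XOR.

    poly is the reduction polynomial including the x^8 term
    (0x11D is assumed here, not taken from any specific library).
    """
    product = 0
    while b:
        if b & 1:
            product ^= a      # add (XOR) the current multiple of a
        b >>= 1
        a <<= 1               # multiply a by x
        if a & 0x100:
            a ^= poly         # reduce back into 8 bits
    return product


print(hex(gf256_mul(2, 0x80)))  # x * x^7 = x^8, which reduces to 0x1d
```

An erasure coder typically multiplies a long buffer of such bytes by one constant, so the work is many tiny products rather than a few wide ones.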

~~~
problems
This sounds potentially useful for things like GCM too. Would it be more
helpful there?

~~~
nullc
AFAICT, the carryless multiply instruction was pretty much added for GCM's
benefit.

------
nickcw
Klaus Post did an implementation of this in Go, with the relevant bits in
SSE3 assembler:
[https://github.com/klauspost/reedsolomon](https://github.com/klauspost/reedsolomon)

He references the paper in the amd64 code blob:
[https://github.com/klauspost/reedsolomon/blob/master/galois_...](https://github.com/klauspost/reedsolomon/blob/master/galois_amd64.s)

------
ms512
Intel's ISA-L ended up implementing this method. Their implementation is
interesting because they took this further and took advantage of knowledge of
the instruction latency to pipeline multiple iterations of this method to
achieve some really amazing throughput.

For reference, see [https://01.org/intel%C2%AE-storage-acceleration-library-
open...](https://01.org/intel%C2%AE-storage-acceleration-library-open-source-
version) and the source code at
[https://github.com/01org/isa-l](https://github.com/01org/isa-l) (see the
erasure code folder for details).

In general, I've found Prof. Plank's other papers and presentations very
interesting, innovative, and accessible.

------
profquail
Looks like this technique is covered by a patent claim by a 3rd party? (See
the link on that page to download their software.)

~~~
fdej
[https://www.techdirt.com/articles/20141115/07113529155/paten...](https://www.techdirt.com/articles/20141115/07113529155/patent-
troll-kills-open-source-project-speeding-up-computation-erasure-codes.shtml)

~~~
ChuckMcM
Time to file an ex-parte review/challenge.

~~~
rch
The founder, president, and CEO of StreamScale, Michael H. Anderson, seems to
have 'over a dozen granted patents in the storage field', some listed here:

[http://www.streamscale.com/cgi-
bin/complex2/showPage.plx?pid...](http://www.streamscale.com/cgi-
bin/complex2/showPage.plx?pid=34)

------
orasis
Is there a TL;DR of how much faster Reed Solomon codes are with this?

------
nneonneo
Needs a (2013) tag.

------
iamwil
What is a Galois Field used for?

~~~
aisofteng
Reading the abstract will answer your question.

~~~
Quequau
So:

"Galois Field arithmetic forms the basis of Reed-Solomon and other erasure
coding techniques to protect storage systems from failures. Most
implementations of Galois Field arithmetic rely on multiplication tables or
discrete logarithms to perform this operation. However, the advent of 128-bit
instructions, such as Intel's Streaming SIMD Extensions, allows us to perform
Galois Field arithmetic much faster. This short paper details how to leverage
these instructions for various field sizes, and demonstrates the significant
performance improvements on commodity microprocessors. The techniques that we
describe are available as open source software."
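
As I understand it, the core trick for GF(2^8) is to replace one 256-entry multiplication table with two 16-entry tables indexed by the low and high nibble of each byte, which is small enough for a 128-bit byte shuffle to do 16 lookups at once. Below is a scalar Python sketch of that split-table idea, not the paper's code. The helper names (`gf256_mul`, `make_nibble_tables`, `mul_region`) and the polynomial 0x11D are my own assumptions for illustration.

```python
def gf256_mul(a: int, b: int, poly: int = 0x11D) -> int:
    """Reference GF(2^8) multiply (shift-and-XOR); poly is an assumed choice."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= poly
    return product


def make_nibble_tables(c: int):
    """Build two 16-entry tables for multiplying by the constant c."""
    low = [gf256_mul(c, x) for x in range(16)]        # c * (low nibble)
    high = [gf256_mul(c, x << 4) for x in range(16)]  # c * (high nibble << 4)
    return low, high


def mul_region(c: int, data: bytes) -> bytes:
    """Multiply every byte of data by c using the two nibble tables.

    Works because multiplication distributes over XOR:
    c * b = c * (b & 0x0F) ^ c * (b & 0xF0).
    A SIMD version would do each table lookup for 16 bytes per instruction.
    """
    low, high = make_nibble_tables(c)
    return bytes(low[b & 0x0F] ^ high[b >> 4] for b in data)


print(mul_region(2, bytes([1, 2, 0x80])).hex())  # prints 02041d
```

The two tables depend only on the constant `c`, so they are built once per constant and reused across the whole buffer.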

