Hacker News new | past | comments | ask | show | jobs | submit login
Multiplying Matrices Without Multiplying (arxiv.org)
235 points by moinnadeem 19 days ago | hide | past | favorite | 122 comments

Every element of a matrix multiplication is a dot-product of two vectors. The dot-product of two vectors quantifies their similarity -- in fact, we call it "dot-product similarity" in a nearest-neighbors context: If the dot product > 0, the vectors point in similar directions; if < 0, the two vectors point in dissimilar directions; if = 0, the two vectors are orthogonal.

It's not too hard to imagine that it might be possible to learn representative K-means clusters of training vectors and then, at run-time, find similar vectors using an efficient hashing scheme and do the equivalent of a table lookup to get the approximate dot-product similarity scores -- without having to perform any multiplications.

This isn't quite correct (I'm oversimplifying and ignoring many important details), but I think it's a helpful mental image.

Very clever stuff.

The dot-product of two vectors quantifies their similarity -- in fact, we call it "dot-product similarity" in a nearest-neighbors context: If the dot product > 0, the vectors point in similar directions; if < 0, the two vectors point in dissimilar directions; if = 0, the two vectors are orthogonal.

To make it more explicit, dot product of two vectors is just cosine of the angle between them, multiplied by their lengths.

The angle and cosine start to lose their geometric intuition when we go beyond 3D.

The concept of correlation has no issue with additional components. The concept of similarity of two 17-element vectors is clear. In fact correlation intuitively scales to "infinite component vectors": the dot product becomes multiplying two functions together and then taking an integral.

The Fourier transform of a periodic signal is based in the concept of how similar the signal is to a certain basis of sine/cosine quadratures spaced along a frequency spectrum. This is like a projection of a vector into a space; only the vector has an infinite number of components since it is an interval of a smooth function.

> The angle and cosine start to lose their geometric intuition when we go beyond 3D

... they do?

"Geometry" in general loses intuition beyond 3D, but apart from that, angles between two vectors are probably the one thing that still remains intuitive in higher dimensions (since the two vectors can always be reduced to their common plane).

Even angles behave very counter-intuitively in high dimensions. E.g. in high dimensional spaces uniformaly randomly chosen vectors always have the same inner product. Why? Sum x_i y_i is a sum of iid random variables, so the variance goes to zero by the central limit theorem.

I would say that this is intuitive. For any direction you pick, there are (n-1) orthogonal directions in nD space. It's only natural that the expected inner product drops to zero.

The variance goes to 0 only if you normalize, which is to say that two random high-dimensional vectors are very likely to be close to orthogonal (under mild assumptions on their distribution).

I agree that that's one of those important but initially unintuitive facts about high dimensions. Just like almost all of the volume of a reasonably round convey body is near its surface. But it also doesn't really contradict the GP comment.

> Just like almost all of the volume of a reasonably round convey body is near its surface.

I’d say that’s pretty intuitive for anyone who can see a pattern in surface area to volume ratios.

1D ball: 2 / (2 * r) = 1/r

2D ball: (2 * pi * r) / (pi * r^2) = 2/r

3D ball: (4 * pi * r^2) / (4/3 * pi * r^3) = 3/r

nD ball: ... = n/r

Most people don't find arguing from formulas intuitive unless the formulas themselves are intuitive. If you truly believe they are, I'd be curious to know why.

There is an intuitive version of this. Volume in n dimensions is C*r^n (C is some constant) and surface is the first derivative, leading to a ratio of n/r (the C constant cancels out). Hmm... Maybe not that intuitive

But the former formula already tells you that most of the volume is near the high range of r: that which was to be shown.

The surface area to volume concept adds nothing.

Because the volume of a sphere is proportional to r cubed, you know there is much more volume between r in [0.9, 1.0] than in the same sized interval of r [0.0, 0.1].

You can find the break-even point almost in your head. At what r value is half the volume of a R = 1.0 sphere below that value? Why, that's just the cube root of 1/2 ~= 0.794. So almost half the volume is within 20% of the radius from the surface.

That's far from the claim that almost all the volume is near the surface: half is not almost all, and 20% isn't all that near. However, you can see how it gets nearer and nearer for higher dimensions.

For a ten dimensional sphere, the tenth root of 1/2 is ~ 0.933. So over half the volume of a ten dimensional sphere is within the 7% depth.

The surface area to volume ratio is just a limit of the shell volume to total volume ratio as the shell thickness goes to zero. So both should asymptotically scale with higher dimensions in the same way.

This idea generalises to the concept of https://en.wikipedia.org/wiki/Inner_product_space and a the equivalent of a change-of-basis.

Word2vec with embedding size 300 and more do refute your claim. I successfully trained word2vec model with above embedding sizes and used inner product similarity to create word clusters as it is out of the box there. Then I made a clusutering language model and got significantly lower perplexity compared to word-based language model.

Not really, for example, in physics, lines in 4D are just as meaningful as they are in 3D, more even (they are called geodesics). So are the angles between them. The real problem is that we just don't have good intuitions of higher dimensions in general.

I mean, I get that if I have, say:

  [0 1 1 1 0 1 0 0]
  [1 0 0 0 1 0 1 1]
that these are perpendicular to each other, which I will easily call ninety degrees, and that two such collinear vectors are at zero degrees.

But I somehow wouldn't go from that intuition into specific cosines. Like "Oh, look, if I divide out the lengths from the dot product, I'm getting 0.5! Why that's the cosine of 60 degrees!"

> dot product of two vectors is just cosine of the angle between them, multiplied by their lengths

How do you define the "angle" between two n-dimensional vectors? Most likely using the dot-product and the arccos. "cos(angle) times lengths" might give a good intuition for 2D or 3D space, but it doesn't help in higher-dimensional vectors.

It generalizes perfectly. The angle between two lines in any dimensions is the same concept.

Two (non-collinear) lines share a plane. The angle on that plane is just the ordinary angle, no matter how many dimensions the two lines are embedded in.

In the case they are collinear, the angle between them is zero on any plane that intersects them. So that corner case works too, regardless of numbers of dimensions.

> Two (non-collinear) lines share a plane. The angle on that plane is just the ordinary angle, no matter how many dimensions the two lines are embedded in.

Okay, but now you've got a plane in n-dimensional space. How do you define/calculate the angle between the two vectors without falling back on the dot product?

You could say: The angle between the two vectors A and B is defined as the smallest rotation around their common normal vector so that the rotated first vector points in the same direction as the second vector. But what is the normal vector? It's a third vector C which is 90° to each of A and B. Now your definition is cyclic. Okay, then: C is a non-zero vector so that dot(A,C)=0 and dot(B,C)=0. Now you're back to using dot-products for defining angles in higher-dimensional space.

The normal vector is the cross product of the two vectors.

u = [u1 u2 u3] v = [v1 v2 v3]

in dimensions i, j, k

u x v = determinate of this matrix:

  =  | i  j  k|
     |u1 u2 u3|
     |v1 v2 v3|

  = (u2v3-v2y3)i - (u1v3 - v1y3)j + (u1v2 - v2u2)k

That works for 3D space. What about n-dimensional vectors?

You just do the cross product for n dimensions which is the determinant with n dimensions. [i j k m ...], with vector u, v, w, s, ....

Its the same for all dimensions.

take the first line/vector, then just call that line/vector "first-new-dimension".

Then take the second line/vector, and decide that this vector can be written as a 2-dimensional vector made out of "first-new-dimension" and "second-new-dimension", now you just have to figure out what "second-new-dimension" is.

A simple trick would be to measure the length of the line, and then add/remove a multiple of the first dimension until the length of the line becomes as short as possible, this new line is your "second-new-dimension".

Now, even if you are working with a 10-dimensional space, you have two lines that only exist in the first two (new) dimensions, so, you can treat them as two-dimensional objects and find an angle using your old 2-dimensional methods.

>It's not too hard to imagine that it might be possible to learn representative K-means clusters of training vectors and then, at run-time, find similar vectors using an efficient hashing scheme and do the equivalent of a table lookup to get the approximate dot-product similarity scores -- without having to perform any multiplications.

Isn't that basically the same as replacing your hardware multiplication operations with a hardware lookup table?

Please see this thread started by one of the authors, which I would rather see at the top of the page, above my comment:


I spent some time thinking about this in college, the problem you run into is the sheer size of the lookup table, you will get higher complexity in the ability to search this table (or prep your input matrix to be searchable).

I'm going to try and remember the calculations I performed, but I remember just looking at permutations/combinations of 8bit integer matrices, 2x2 seems reasonable but 3x3 or 4x4 gets unreasonable from an algorithmic/memory perspective very quickly

Thanks for your succinct summary of the result.

Just the other day, I was thinking about how useful clusterings seem in summarizing the landscape of a set of points in any number of dimensions. This work seems to be built on top of the same insight as well as the innovation of using hashing to speed up the similarity checks.

One question is: how does it scale? E.g. how does the time/space complexity increase in the big-O sense with increasing dimension N?

Primary author here. Happy to answer questions!

Also, feel free to email me at the address in the paper if you're interested in talking about it in more detail. E.g., I've already heard from some hardware folks looking at expanding on this work.

So, do I understand this correctly:

0. The problem you're really looking at is quickly computing

for many different vectors a (of length D) which are conceptually random.

1. To do that, you approximate AB, where A is very long and thin (NxD), and B (DxM). (Note: this problem has complexity O(NDM) naively, or, if N=D=M, O(N^3), and there are smarter algorithms that reduce the exponent 3 down to 2.8 (Strassen) or even further down, though those are not practical from what I gather. Oh, and note that at minimum, you're looking at time (N+M)D, because you need to look at every element, unless you approximate).

2. You approximate AB by f(g(A),h(B)), such that the error (in "relative Frobenius norm", ie norm of the error divided by norm of the actual product) is "small" with "high probability".

3. You're given B and a "typical" A^~ (ie training data), and then have two steps:

3.1. Do some training mumbo jumbo (fast?), and

3.2. now finally you can quickly compute the approximate AB (or a'B) for "new" A (or a'), presumably fast?

Quite cool that there are new techniques (and apparently fast, the article claims 100x faster than exact, and 10x faster than existing approximations) for such a classic problem!

But, if I understand correctly, it could happen with a certain probability that the actual error is quite high? Maybe if your a' is untypical?

ETA: From what I understand, in many ML applications people use 32bit or even 16bit floats, so high accuracy is not as important as speed anyway?

Yes, basically correct. A couple notes/clarifications for other readers:

- The rows a of A are "random," but in the sense of being drawn from some distribution for which we have a training set--not in the sense of, e.g., "assume everything is gaussian" or some other idealized case.

- The training takes a few seconds to maybe a couple minutes, depending on your training set size. But it does have to be done ahead of time--you can't train after receiving the matrices and still get a speedup. Unless maybe the matrices are unrealistically large.

- The error in the approximation is provably not much larger than the error on the training set with high probability (under standard ML assumptions--finite variance, training and test distributions match), but there's no hard limit on it in general. If you know the range of values in your individual vectors, you could obtain a hard limit, although it would be much looser than what you see in practice.

- Correct that many ML applications tolerate large losses in precision. But caveat is that we won't get the right answer to a full 16 bits in each element (this would be ~99.999% accuracy), but instead more like 90-99.9%, depending on the speed-quality tradeoff chosen. So we can't claim it will be sufficiently accurate in all cases

I understand there are K prototypes (centroid) in each subspace. And there are C disjoint subspaces. But how are subspaces chosen? Do we naively split the vector space?

For softmax example, if the original 512-element vector is x = (x0, x1, ..., x511), and we want to compress the vector to 4-bytes (and K=16 using 4 bits per prototype). There should be 8 subspaces. Does that mean we break the 512-element vector into eight 64-element sub-vectors? Like (x0,...,x63), (x64,...,x127),(x128,...,x191), ..., (x448,...,x511) and each sub-vector e.g. (x0,...,x64) is compressed into 4 bit? It looks like a lot of information are compressed into to zeros and lost.

I wonder in the image filtering field (such as Sobel and Gaussian filter), how to choose a compression ratio in order to get a reasonable good image quality.

One of our graduate students is thrilled about this paper, although he tends to do that with any new CS advance that seems sensational and that we barely understand (we do bioinformatics). He said that it stood to reason that if it can mmult 100GB/s/core, then we could matrix multiply 12TB in a minute!

Could you translate into practitioner-level language what are the practical limitations of this method; specifically, what would the error rates induced by approximation be under some practical scenarios, when would it make sense and not make sense to use it, etc? There is a complex equation in the paper describing the theoretical error bounds, but I have no idea whether in some practical scenario multiplying some normally distributed variables, whether that would mean a 0.1%, 1%, 5%, 10% error.

Personally I think it only makes sense to use this kind of method in some real-time algorithm where speed is of the essence, the downstream results of the mmult are themselves used in some other approximation (like many ML applications), and emphatically not to make the process of drawing biological conclusions from painstakingly derived data a few minutes faster for the analyst.

I fear that you have made an impressive, but dangerous, tool to people who don't know what they're doing.

I think we only claim to be able to preprocess a matrix at "up to" 100GB/s/core. The overall matrix product will take longer and depend on the matrix shapes.

To simplify Section 1.1, we help when:

1) You need to perform a matrix product more quickly and can tolerate approximation error

2) You have a training set for the larger matrix

3) The smaller matrix is either a) fixed or b) skinny relative to how tall the larger matrix is.

Re: "an impressive, but dangerous, tool to people who don't know what they're doing."

I believe you are overestimating the usability of my code :). But more seriously, I suspect that people attempting to use our method in contexts it wasn't designed for will quickly discover that they either can't actually call the API the way they wanted to, or that the method is no faster for their purposes. We also characterize our method at least as thoroughly as any approximate matrix multiplication algorithm I'm aware of, and have a variety of (admittedly loose) theoretical guarantees. So I hope that at least those who thoroughly read the paper will have a clear idea of what it can do. Overall, I guess my current thinking is that 1) I'm not sure how to introduce a method any more responsibly, but 2) if I can be of help in ensuring that it gets used well, feel free to reach out.

Would the embeds from openai's CLIP be suitably "dense" for this to work? The usage requires calculating the cosine similarity between the embeds and there are many instances where a lookup table approach is already being applied e.g. github.com/rom1504/clip-retrieval.

Is there any significant performance gain for smaller matrices, such as 3x3 or 4x4? Video game engines do a lot of these

Almost certainly not, sadly, unless maybe there's a ton of correlation in the input and you can tolerate quite a bit of error.

An interesting work, with some to-be-addressed questions: 1.The paper only covers the GEMM part with small-scale experiments(CIFAR-10/100), not covering convolution, not covering GEMM part in more popular network such as Transformer/BERT, etc. 2. It is still an approximating method, meaning potential accuracy loss. So I think this method is less attractive to training acceleration scenario, maybe potentially as a complementing methods for inference acceleration. 3. No results evaluated in GPU with TensorCore equipment. I am a little bit curious, since modern AI accelerator(including NV GPU) all incorporate TensorCore which by-design supports GEMM acceleration, what is the add-on value brought by the approximating method mentioned in this paper.

Great observations. I see this paper as the first in a three-part series. The second part is specializing it for convolution (which has additional structure to exploit), and the third is hooking these approximate ops into deep neural nets the way people currently do with scalar quantization / pruning.

I'm not optimistic about beating tensor cores when running on GPUs, at least until/unless we get similar hardware support.*

Barring better hardware support, the killer app is probably CPU inference--once there are Conv implementations and the necessary GPU kernels to train the network.

*Aside: this support would be pretty doable since the kernels look almost identical to GEMM kernels--you just need a multiplex-add rather than a multiply-add. On an x86 machine, all it would take is a vpshufb-add and a 4-bit unpack instruction.

If it works better for inference, it could enable fast inference on devices which don't have good tensor cores/gpus

> I think this method is less attractive to training acceleration scenario

The proposed hash based encoding function is not differentiable, so it doesn’t appear this method can be used for training at all.

I’m not aware of any hash functions that are analytically differentiable, so to support efficient back-propagation I suspect that some fundamental changes to this method would be necessary.

You could still optimize the prototypes, so fine-tuning with this in place would be possible (see, e.g., [1]). But we don't yet have data on how well this would work using our exact method, how early in training you could do the op replacement, etc.

[1] http://openaccess.thecvf.com/content_ECCV_2018/html/Sanghyun...


I think this name really fits well to the concept. Replacing matrix-multiplication with some hash-lookups! (warning: an overly simplified statement)

This is a really interesting application of PQ(product quantization), which itself also requires learning (usually K-means). Paper: https://lear.inrialpes.fr/pubs/2011/JDS11/jegou_searching_wi...

Considering that ANN has survived through many approximations (e.g. lower precision, pruning), and that many ANN applications are anyway subjective (e.g. image processing, recommendation), I think this can be highly useful in many cases.

Certainly off topic, but gotta say I prefer “MADDLESS” over “MADDNESS” for being more… obvious. It still sounds close enough to me, but again I might be just letting my Southern-Chinese ear loose with the n/l confusions…

Method 1: computer, use MUL to multiply X and Y.

Method 2: computer, remember this number. When I give you X and Y, tell me what that number was. This requires no multiplication! Let us tell SCIENCE about our discovery!

So this misses a few aspects of why the method works:

- You can't actually get a speedup from the proposed approach. You'd need a lookup table of size b^2 to multiply two b-bit numbers, which will be much slower than just doing the multiplication.

- Most of what we do is perform fewer operations. We're actually slower per op than multiply-adds because those are supported better in hardware. We just compress the matrices by so much that it's worth it. Relatedly, for sufficiently few codebooks, we're actually sublinear in the input size--i.e., we don't even read it all.

- We have some lightweight machine learning in there to make the sublinearity and other approximations not suck. Getting the ML fast enough to beat a BLAS GEMM is far from trivial.

Or, oversimplified in a different way:

Method 1: Computer, tell me the angle between these two 3D lines!

Method 2: Computer, look at this (second) 3D line from a few different perspectives (determined by the first line) and then tell me how often it looks more vertical than horizontal. That'll give me a good enough idea of the angle between the two.

How about using a look-up table for small numbers (say up to 12x12), left shifts for powers of ten, and decomposing larger multiplications into summation of those operations?

I wonder how much alternative number representations have been studied for use in matrices. Like, storing the log of values instead so that multiplication can be done by just adding. Or something like Zech's logarithm. Or even, take the log of whole matrices (which is a thing apparently [0]), then adding them to compute their multiplication. I wonder if the logarithm of a matrix can be computed using ML.

[0]: https://en.wikipedia.org/wiki/Logarithm_of_a_matrix

There's been some work on doing adds instead of multiplies (e.g., https://arxiv.org/abs/2012.03458). And I think float8 will roughly be doing this under the hood.

Personally, I'm not sure whether this is the path to go down or not. Doing everything in log space could help from a hardware perspective, since multiply-adds are much more complex than adds. But it 1) doesn't reduce the number of operations being performed, 2) might increase the necessary bit width for a given level of fidelity (depending on your input distribution and other factors), and 3) might make accumulate operations expensive, since these aren't cleanly expressed in the log domain.

There is a logarithmic semiring:


which can be used to multiply two sparse matrices with the GraphBLAS. here is an example in Python:


storing log values helps with the multiply part of multiply and add.

But it seems like it would make the add part of multiply and add quite a bit more difficult.

Quick note: the exp does not work the way it does with scalars.


Lots of discussion here about higher dimensional matrices. The argument is that the dot product is an operation on two arguments, therefore any multidimensional matrix multiplication can be broken down into multiple two dimensional operations. The image offerred is that two vectors define a plane so any two dimensional operation is valid. But how about curved space? Suppose I have two vectorsin curved space-they look like two droopy arrows. How does one calculate this dot product? I guess the the local curvature has to be taken into account. In spherically shaped space, this is a constant, but what about irregularly bent space-like that near an amorphous blob of dark matter? Suppose you are trying to calculate electrostatic potentials or do anything with Maxwell's equations? Currently, we define the curvature of space with gravity or the deflection of light past massive objects. Is there a way to measure the curvature of space locally? Can the curvature of space alter the piezoelectric potential generated by a crystal for example and allow its measurement- I am thinking that a miniscule deformation on a large macroscopic object is multiplied many fold on a property that is distributed over the atomic level and happening in parallel on all the atoms of a crystal.

> But how about curved space? Suppose I have two vectorsin curved space-they look like two droopy arrows.

No, they don't. In math, curved spaces are modeled through so-called manifolds which locally look like ℝ^n. In particular, at any point p of the manifold there's a tangent space, i.e. a linear space (higher-dimensional plane) tangent to the manifold at p. Vectors at p are just vectors in that linear space. So they are "straight" not "droopy".

On the tangent space of each point p you can now define an inner product g(p). The resulting family of inner products g is called a (Riemannian) metric on the manifold[0] and describes how lengths (of vectors) and angles (between vectors) can be measured at each point.

> Currently, we define the curvature of space with gravity or the deflection of light past massive objects. Is there a way to measure the curvature of space locally?

Yes, there is. In fact, the curvature[1] of a (Riemannian) manifold is a purely local quantity – it's basically the second derivative of the metric g, so it describes how the notion of length changes (more precisely: how the change in length changes) as you go from a point p to neighboring points.

There are other ways to express what curvature is, e.g. by locally parallelly transporting[2] a vector along a closed curve (and making that curve smaller and smaller) which basically measures how the notion of straight lines changes locally. (Though, since a line being "straight" means "locally length minimizing" this brings us back to the notion of length and, thus, the metric.)

Alternatively, if the manifold has dimension 2, there's a particularly simple way of looking at and interpreting curvature, see [3].

In any case, curvature being used to model gravity is entirely separate from that idea.

[0]: Provided this inner product "varies smoothly" as you move from p to a neighboring point q.

[1]: https://en.wikipedia.org/wiki/Riemann_curvature_tensor

[2]: https://en.wikipedia.org/wiki/Parallel_transport

[3]: https://en.wikipedia.org/wiki/Gaussian_curvature

I'm actually not quite sure what you mean by breaking down into two-dimensional operations. We use operations on pairs of vectors, but nothing is assumed to be two-dimensional, and I don't think we suggest imagining anything as a plane? The vectors can be arbitrary and there's no assumption about curvature. Happy to clarify if I can help.

I wonder how our brain can train billion of neuron without matrix multiplication. What is the biological process that get a similar result?

Perhaps addition and multiplication are the tools we use to get a computer to appear to act like a biological system with all its chemical pathways and simultaneous stimuli.

Lets take that one step further. Who taught orbiting bodies how to do differential equations?

That’s a good one! I feel there is a difference as the cells are actively doing something, they are fighting entropy and create structure. In a similar way we are with our programs, of course what we do is a crude approximation of what neuron cells do in other ways. In other words, I wonder it there is a process that we could use to speed up neural training by looking at how the brain does it.

I feel that the presented algorithm is actually somewhat close to what our brains do in similar tasks.

As mentioned elsethread, the problem is essentially reduced to determining angles between two vectors (one of which is known ahead of time) in high dimensional space. This is done by projecting the other vector into different specially chosen subspaces and classifying it in those, then summing up the "scores" in each subspace.

Given the similar task of determining the angle between two lines in 3D space, we tend to do something very similar (I feel): We look at the pair of lines (or vectors) from a few different perspectives (="subspaces") and get a feel for how well-aligned they are in each (="score"). We can then guess pretty well how large the angle between the two is.

Of course, we frequently use other cues too (e.g. perspective and texture when talking about alignment of real-world 3D objects). And even when you are considering plain lines displayed on a screen (in which case these cues don't work), we tend to also take into account the relative movement of the objects as we (continuously) shift our perspective (e.g. drag-to-rotate a scene).

Maybe the latter part could also be a hint towards further algorithmic ideas. Maybe somehow involving (signs of) derivatives (finite differences?) or similar could be a cheap way to improve accuracy. Just spitballing here though.

The "original" and easy to grasp idea behind neural learning is reinforcement: every time a good result is obtained, the connections that contributed get strengthened (reinforced). Bad results lead to weakening. Back-prop is specific implementation, and it is usually expressed in terms of matrix operations, but you can describe it without as well at the single connection level. Implementing that isn't efficient, though, hence matrix operations.

btw neurons are not the only units of computation in the human body.

allosteric activation in enzyme binding sites act like transistors almost. inside every cell there's various computations occuring

Clever and logical. It reminds me of when John Carmack used a precomputed table of Sine values for fast lookup in Quake, rather than running the actual function on the CPU.

Long before Carmack, a precomputed table of cosine values has been used in a vacuum-tube based surface-to-air missile system I have been maintaining. The table consisted of only one element: 0.7, i.e. the cosine of any angle was considered 0.7. Somehow, it was good enough for intercepting jet fighters and saved a lot of vacuum tubes.

Wasn’t this also the basic JD of “computers” of old? Manually computing lookup tables for artillery, planet movement, etc

I was using a precomputed table of sine values in 3D graphics way before Quake ever hit the scene, and I certainly didn't invent that idea either.

Ah that's useful to know that it was conventional up until then, thanks. It was my first exposure, personally :)

Checking Google Scholar, I found this 1962 paper by King titled "Table Look-Up Procedures in Data Processing" - https://dl.acm.org/doi/abs/10.1145/800198.806120

> It is not generally known that to provide seven- decimal accuracy of the sine function, allowing third-order interpolation, only 15 entries are required [*] ...

The following commentary is interesting.

> In spite of these developments in table construction, it turned out two decades ago that the technology was such that table look-up was too slow for arithmetic. The stored-program type of machine displaced this approach. By this method the value of a function, such as sin x, is computed afresh every time. One cannot but wonder how many times such a number has been re-computed in the last decade, and inquire whether technological developments may make tables efficient again.

> Over the years the emphasis in machines has slowly changed from the central processing unit to memory. From the point of view of table look-up, the memory is the central feature of a machine. Indeed the central processing unit is unnecessary.

The [*] is Krawitz. E., "Proc. Industrial Comp. Seminar," IBM, p. 66, 1950. I cannot find this publication with only a simple search.

See also https://scholar.google.com/scholar?as_sdt=0%2C5&as_yhi=1990&... which finds titles like "A rapid lookup table method for trigonometric functions" (1982) at https://onlinelibrary.wiley.com/doi/abs/10.1002/spe.43801210... .

> It is not generally known that to provide seven- decimal accuracy of the sine function, allowing third-order interpolation, only 15 entries are required [*] ...

That got me thinking: is that so? how do you do that?

Luckily, scipy has all sorts of interpolation tools, so one can just experiment a little.

Third-order interpolation -> that means cubic splines. How many points do you need for a cubic spline approximation of the sine function to have 7 decimal place accuracy (in other words the error should be about 0.5e-8? Well, the sine function being anti-symmetric and periodic, you need to only approximate it on the interval [0, pi/2]. It turns out you need 42 points, not including the ends of the interval (where we don't need to tabulate the values of sine; it is simply 0 at 0 and 1 at pi/2).

     import numpy as np
     from scipy.interpolate import CubicSpline
     xs = np.linspace(0,np.pi/2,44,endpoint=True)
     sines = np.sin(xs)
     spline=CubicSpline(xs, sines)
     ts = np.linspace(0,1,10000,endpoint=True)\*np.pi/2
     print("Max Error is ", np.max(np.abs(spline(ts)-np.sin(ts))))
     >> Max Error is  5.02852541828247e-08
But, we can do better. If we tabulate values of sin, then we have for free the values of the cosine at the complementary angles ( cos(x) = sin(pi/2-x). But the derivative of sine is the cosine, so we can use a cubic spline that matches both the values and the derivatives at a number of points. This is called a Cubic Hermite Spline. Turns out we need 23 points.

     from scipy.interpolate import CubicHermiteSpline
     xs = np.linspace(0,np.pi/2,25,endpoint=True)
     sines = np.sin(xs)
     cosines = sines[::-1] #note that we reuse the sines, 
                           #we don't invoke np.cos
     spline=CubicHermiteSpline(xs, sines, dydx=cosines)
     ts = np.linspace(0,1,10000,endpoint=True)*np.pi/2
     print("Max Error is ", np.max(np.abs(spline(ts)-np.sin(ts))))

     >> Max Error is  4.775709550042251e-08
Now that we got to do Hermite interpolation, why stop here? The second derivative of the sine function is the sine with the sign flipped (oh, man, how do you avoid this pun?). So, you don't need to tabulate any new values. In scipy you can use BPoly to get this higher order spline. Now, you only need to tabulate 4 values of the sine function.

     xs = np.linspace(0,np.pi/2,6,endpoint=True)
     sines = np.sin(xs)
     cosines = sines[::-1]
     spline = BPoly.from_derivatives(xs, np.vstack((sines,cosines,-sines)).T)
     ts =np.linspace(0,np.pi/2,10000) 
     print("Max Error is ", np.max(np.abs(spline(ts)- 

     >> Max Error is  2.057940906574629e-08
Well, we can go all the way now. If we only use the values of the sine function and its first 4 derivatives only at the end of the [0,pi/2] interval, we can find a 9th degree polynomial that matches them.

     spline =  BPoly.from_derivatives([0,np.pi/2], 
        [[0,1,0,-1,0], [1,0,-1,0,1]])
     PSpline = PPoly.from_bernstein_basis(spline)
     P = Polynomial(PSpline.c.T[0][::-1])
     print("Max Error is ", np.max(np.abs(P(ts)- 

     >>Max Error is  1.700470619869776e-08
And here's that 9th degree polynomial, if you are curious:

x −0.16666666666666785 x^3+8.881784197001252e-15 x^4+0.008331651428065356 x^5+5.167900831715144e-06 x^6−0.000204626778274708 x^7+3.5525642158307225e-06 x^8+1.8946044087406189e-06 x^9

How does that compare to the claim that only 15 points are needed?

I found the paper. https://archive.org/details/bitsavers_ibmproceedmputationSem... (Note the type: that paper starts on page 52; Krawitz also had another paper on page 66.)\

The "15 cards" is on page 56.

The paper uses Stirling's interpolation formula.

https://en.wikipedia.org/wiki/Brahmagupta%27s_interpolation_... notes that the Indian mathematician and astronomer Brahmagupta used a second-order version of the interpolation formula to compute a sine table in the early 7th century.

The Wikipedia page links to the Stirling interpolation function description at https://archive.org/details/introductiontonu00hild_0/page/13... .

It looks like it's also called Stirling’s Central Difference Formula.

Any chance you could work this out with scipy? My numeric abilities are pretty crappy.

In my prior comment I used equidistant points for the interpolation.

In scipy you can use smoothing splines to get something close to optimal approximating splines with non-equidistant points. If you aim for a maximum error of 5e-8 (so when you round you have 7 exact decimal places), then with cubic splines you need 33 knots (including the ends). Now, the sine values are always less than one, so the digit before the decimal place is 0. If you count that towards accuracy, then you might be looking for an error not to exceed 5e-7, in which case scipy finds that you need 17 knots including the ends, so 15 interior knots.

    from scipy.interpolate import UnivariateSpline
    xs = np.linspace(0,np.pi/2,1000,endpoint=True)
    s = UnivariateSpline(xs,np.sin(xs),k=3,s=0.75e-11)
    ts = np.linspace(0,np.pi/2,10000)
    print("Max error: " , np.max(np.abs(s(ts)- 
    print("Number knots: ", len(s.get_knots()))

    >> Max error:  4.761484727611176e-07
    >> Number knots:  17


I wish I could find a published list of that table of 15 values.

In my search, I found https://www.jstor.org/stable/2002889?Search=yes&resultItemCl... which comments that an optimum interval table of sines and cosines for 7 place values has 2700 cards instead of 9000. "Each time this table is used, more than two hours of sorting and gang punching time is saved."

The only published "optimum interval table" I could find in archive.org was https://archive.org/details/dli.ernet.212891/page/149/mode/2... where table VII is the function f(r*2) = 1/r*3.

> This is the first time that such a table has ever been printed for use with a hand calculating machine. However, the computer will find that it is easier to use than an ordinary table which requires second difference interpolation, since no interpolating coefficients are needed; and it is not necessary to pay any attention to the irregular intervals.

One of the worked out examples is:

  r^2 = 4.043172
  F = (0.04686041 - 0.043172 x 0.01420874 ) x (-0.043172) + 0.124999851 
    = 0.12300327

I am old enough to remember printed books in the library that contained pre-computed tables of logarithms, sines and other common functions.

the probably were computed by mechanical means, but earlier ones were computed by hand.

One of Gauss’s ‘hobbies’ was to find errors in such tables.

There even were books containing tables of products of integers. You can buy a reproduction of one today: https://www.amazon.com/Calculating-Products-Thousand-Applica...

Shouldn't the solution to fast matrix multiplication be logarithms, similarly to fast scalar multiplication?


Perhaps surprisingly, the matrix logarithm does not satisfy log(A) + log(B) = log(AB) in general (only when AB = BA) which is why you cannot use it to "multiply by adding".

More importantly, matrix logarithms only exist for square matrices.

That’s not that big of a problem, as you can easily reduce the problem of multiplication of rectangular matrices to multiplication of square ones.

Huh, that’s kinda fascinating. Maybe it’d be worth running ml on matrix multiplication that’s approximated by ml?

Ironic thing is large part of how ML works is via matrix multiplication.

It important to remember dense matrix multiplication (doing it by hand) is a O(N^3) operation, this is about approximations to multiplication that beat that already harsh complexity. There is a whole field that develops approximations to matrix multiplication of large matrices, I'm assuming this article is about using ML to find good approximations.

To the replies, the very act of evaluting a prediction from a NN is matrix multiplication (linear transform of a column vector is matrix multiplication). This doesn't replace matrix multiplication wholesale, lol. This is about multiplication in a specific case.

As long as their is enough consistency between approximating a multiply X*W and Z*Wt (Wt = W transpose), then it is possible it could be used in NN training.

Y = X*W is the forward propagation. If Z is an error or derivative of error, Z*Wt is the back propagation.

Its an interesting question as to how well that would work. Anything that speeds up matrix multiply in NN and deep learning in general would be a big deal.

But ML relies on GPUs with lots of parallelism, so e.g. O(N^3) becomes O(N^2) for N smaller than the number of cores (roughly speaking).

Lots of things are like this. I program my network using a network. I run a server (BMC) on my server. I start my engine with an engine (https://en.m.wikipedia.org/wiki/Lockheed_SR-71_Blackbird).

That's not ironic, it could be revolutionary for this algorithm and ML in general.

as a total outside to this sort of stuff, doesn't this have the big issue of error propagation?

Probably, but hill climbing will avoid loss in the long run. The speed boost is probably more than worth it.

I wouldn't be surprised if we started using lossier math for faster state space search. Once you find a peak, you could swap out the maths to be more exact.

Suppose you have a set membership question. If you approximate yes, you do a more complex inquiry to grab details. If you approximate no, you tell the user no, don't know that face [answer, command, etc]. So this is ripe for approximate methods as you can tune to get very few false negatives (respond no when it's really yes), but allow a healthy dose of false positives.

This reminds me of the mechanism of a bloom filter (https://en.wikipedia.org/wiki/Bloom_filter).

Can someone help me understand how approximate it is? what are te error bounds?

We have a generalization guarantee in Section 4.5. It's not especially tight though; in practice, the errors from different codebooks tend to be mostly independent, and you get nice Gaussian-like concentration. I would look at the empirical results in Section 5 to get a better feel for how it performs in practice.

Had this same exact thought as an undergrad like 3 years ago! I kinda gave up due to the massive barrier and difficult financial burdens faced by phd students. This feels nice to know i wasn't crazy.

I don't disagree with you on the potential financial burdens faced by PhD students but, in this case, and if I haven't missed anything, the infrastructural barrier is not high:

> All experi-ments use a single thread on a Macbook Pro with a 2.6GHz Intel Core i7-4960HQ processor. Unless stated otherwise, all timing results use five trials, with each trial reporting the fastest among 20 executions.

It may also be worth nothing that, even as a hobby, one can actually do a lot with modern hardware if you scope your project well. There is also a lot of free resources such as CoLab available to those who have very limited computing power at home/work.

Last but not least, there is also nothing stopping you from announcing your results on arXiv and, if you can be bothered (as a hobbyist), get it published in a peer-reviewed journal.

So if you still have ideas, I encourage you to go ahead and try them! :)

I meant learning barrier. If i had the money i likely would go for it though!

I really don't know how to respond other than to say:

"I would have done this years ago if I knew how to do it" is fantastically narcissistic.

I see where you are coming from, but I personally don't find the comment narcissistic: I read it the OP of the comment thinking out load, saying "ah! It's nice to know that I have thought about something similar, and someone made the effort to show that it works!".

Another form of: "I could totally have built Google or Amazon if I wanted to, I'm just a lazy genius."

I think it's pretty inappropriate to gloss financial hardship as laziness.

What? No im saying it's one of the ideas i worked on before i switched away from the field...?

Do you mean the exact idea behind MADDNESS or just "ML for multiplying matrices"? The latter doesn't mean much without the former :-)

Not sure how they did it, but basically i thought about reducing things that gained complexity through layers with neural nets. Computer architecture and the layers of complexity between groups of system instructions and function code for example. There have got to be inefficiencies that can be collapsed with the help of ml.

In my opinion, that is the thing with theoretical computer science. Many of the ideas are comprehendible. And if one has the opportunity to do the work: you may get the result.

The biggest blocker for me is usually working out how I can implement a given idea without either writing a bunch of code (I'll get bored) or trying to verify if the paper even works for my use case without doing the aforementioned. One field I pay attention to with this problem seems to be abstract interpretation, with my background at least, the methods are very theoretically clean and impressive but actually implementing these as code and knowing how to write good implementations seems to be quite obtuse.

I genuinely don't understand why we allow papers to be published on computer science, with graphs plotted of the supposed efficacy of research (i.e. not just a theoretical paper), with no code attached.

On page six of the paper:

To assess MADDNESS’s effectiveness, we implemented both it and existing algorithms in C++ and Python. All of our code and raw numerical results are publicly available at https://smarturl.it/Maddness. All experiments use a single thread on a Macbook Pro with a 2.6GHz Intel Core i7-4960HQ processor.

edit: the url above redirects to: https://github.com/dblalock/bolt

Computer Science doesn't mean empirical stuff only, nor does it mean evaluations with code.

Theory has its place, and so does systems work. No need to denigrate one or the other.

I was trying to be clear that I was referring to papers which are talking about code they wrote. If you wrote (say) a program to predict the throughput of machine code, then I want to be able to reproduce the results you claim - thats a real example, no hint of any source yet and I've been looking.

If we can't reproduce it isn't really science. I know academics often write bad code and don't like to publish their dirty work, but the buck has to stop somewhere.

Did you ask them for their code?

Well, taken to the extreme, and from what I can tell, that's the truth for any field.

If you roll everything, financial incentives, emotional readiness, support and preparation, etc, into 'opportunity' then yeah, I believe it.

tbh I don't think most ideas in TCS are comprehensible to outsiders without spending at least a year or two building background with the state of the art. Even if you can design an algorithm, to get a good TCS paper you have to prove bounds on how well it approximates and/or on its running time.

How does it work?

Ha ha! What the hell! This is revolutionary if true!

If machine learning depends on matrix multiplications, but you fib them using machine learning ... do you see the problem?

If a C++ compiler depends on a C++ compiler to be compiled itself... do you see the problem?

I have only superficially looked at the paper, but it's pretty clear they are using an offline lookup table. There are no such problems involved.

Yes, I do; your bootstrapping is dependent on a binary C++ compiler, which could be hiding something that isn't in the source code, but which propagates to newly bootstrapped compiler binaries which again pass it on to the next round of bootstrapping, ad infinitum.

Basically, we can't be sure that you really have a C++ compiler. The source code can be verified to be a C++ compiler, but the binaries deviate in some way, possibly malicious.

Sure, but you seem to have gone off on a wild tangent. How does that relate to matrix multiplication?

I believe he’s getting at the idea of nested levels of corruption, with no visibility into where that corruption happens (because recompilation of the compiler/ML produces a fresh inscrutable binary).

I suppose the better term in this context is error propogation

It relates in the following way. Suppose you have some framework for calculation which depends on matrix multiplications. You use that framework to develop some approximation for matrix multiplications which are faster. Then to make that framework faster, you substitute that back into the framework. That is a sort of bootstrapping which user qsort has related to compiler bootstrapping.

The goal in compiler boostrapping is to reach a fixed point, in a single iteration. When the compiler binary compiles itself, what should come out is a faithful replica of that compiler binary. Having done compiler bootstrapping, I know about hair-pulling problems that occur when you don't have a fixed point. E.g. a bug in the optimization causes the compiler to subtly miscompile itself. The resulting compiler then severely miscompiles itself; e.g. fails to recompile itself entirely, or produces something that no longer runs at all. (And that's actually a lucky case; a compiler compiling itself is basically a dog food test. When that fails, you've caught a bug early. What keeps you awake at night is the compiler compiling itself perfectly fine, but rarely miscompiling other code in the wild.)

Anyway, this sort of malfunction is exactly what will happen if you take a working, self-hosted compiler and then replace some important chunk of it with some hokey machine-learning approximation. It's unlikely you can do that without breaking the fixed point.

I don't see how you can not have fixed point issues in the matrix scenario. Maybe it will stabilize if you iterate on it. You take the machine learning framework with approximate matrix multiplications in it and run the process for figuring out those matrix multiplications. You may get some different values in it, which you can substitute back and try again. Will that converge? If it does, how do you know the whole thing is still valid? (What definition of valid applies?)

It's purely a performance increase. Its definition of valid will need to be that it matches the output of the non-ML version.

debugging a machine learning algorithm is already hard without this approximation method. I can see adding another layer could make it extremely harder.

By "you", you mean user qsort, right?

For some reason they run all their tests single threaded. Seems like parallelism is where all computing hardware is inevitably going. I also wish they had run time comparisons to more recent matrix sketching and multiplication method such as frequent directions, newer FJLT implementations, and RIP matrices based on hashing.

We compared to several frequent directions variants, Fast Johnson–Lindenstrauss, some other hashing-based methods, and a bunch of other approximate matrix multiplication approaches. We had to omit some of them from the results section though because they were 10-100x worse than exact matrix products and they ruined the plots. More info in appendix E5.

As far as single threaded, there's a simple answer and a subtle one. The simple answer is that we consider the core subroutine a single thread would run in a multithreaded context, not how to do the multithreading. These are basically orthogonal issues since matrix multiplication is embarrassingly parallel and our method would parallelize just like any other. More details in appendix E2.

The subtler answer is that we could do even better in the multithreaded context if you could fuse the encoding step with whatever earlier code produces the larger matrix. This is a result of matrix products becoming memory-bandwidth-bound in the limit of sufficient cores, combined with our approach reducing the size of the matrix by a huge amount.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact