It's not too hard to imagine that it might be possible to learn representative K-means clusters of training vectors and then, at run-time, find similar vectors using an efficient hashing scheme and do the equivalent of a table lookup to get the approximate dot-product similarity scores -- without having to perform any multiplications.
This isn't quite correct (I'm oversimplifying and ignoring many important details), but I think it's a helpful mental image.
Very clever stuff.
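To make that mental image a bit more concrete, here is a toy sketch (my own oversimplification, not the paper's actual algorithm, which uses a learned hash function rather than a nearest-centroid search): learn centroids offline, precompute each centroid's dot product with a query you know ahead of time, and at run time just look up the score for whichever centroid a new vector falls into.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
train = rng.normal(size=(10_000, 64))   # "typical" training vectors
q = rng.normal(size=64)                 # a vector known ahead of time

# Offline: learn representative clusters and precompute their scores.
km = KMeans(n_clusters=256, n_init=1).fit(train)
lut = km.cluster_centers_ @ q           # one dot product per centroid

# Run time: assign a new vector to a cluster and look its score up.
x = rng.normal(size=64)
approx = lut[km.predict(x[None, :])[0]]
print(approx, x @ q)                    # approximate vs. exact dot product

The approximation is only as good as the clustering, which is exactly where the training data comes in.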
To make it more explicit, the dot product of two vectors is just the cosine of the angle between them, multiplied by their lengths.
The concept of correlation has no issue with additional components. The concept of similarity of two 17-element vectors is clear. In fact correlation intuitively scales to "infinite component vectors": the dot product becomes multiplying two functions together and then taking an integral.
The Fourier transform of a periodic signal is based in the concept of how similar the signal is to a certain basis of sine/cosine quadratures spaced along a frequency spectrum. This is like a projection of a vector into a space; only the vector has an infinite number of components since it is an interval of a smooth function.
... they do?
"Geometry" in general loses intuition beyond 3D, but apart from that, angles between two vectors are probably the one thing that still remains intuitive in higher dimensions (since the two vectors can always be reduced to their common plane).
I agree that that's one of those important but initially unintuitive facts about high dimensions. Just like almost all of the volume of a reasonably round convex body is near its surface. But it also doesn't really contradict the GP comment.
I’d say that’s pretty intuitive for anyone who can see a pattern in surface area to volume ratios.
1D ball: 2 / (2 * r) = 1/r
2D ball: (2 * pi * r) / (pi * r^2) = 2/r
3D ball: (4 * pi * r^2) / (4/3 * pi * r^3) = 3/r
nD ball: ... = n/r
The surface area to volume concept adds nothing.
Because the volume of a sphere is proportional to r cubed, you know there is much more volume between r in [0.9, 1.0] than in the same sized interval of r [0.0, 0.1].
You can find the break-even point almost in your head. At what r value is half the volume of a R = 1.0 sphere below that value? Why, that's just the cube root of 1/2 ~= 0.794. So almost half the volume is within 20% of the radius from the surface.
That's far from the claim that almost all the volume is near the surface: half is not almost all, and 20% isn't all that near. However, you can see how it gets nearer and nearer for higher dimensions.
For a ten dimensional sphere, the tenth root of 1/2 is ~ 0.933. So over half the volume of a ten dimensional sphere is within the 7% depth.
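A quick way to check that pattern for a unit ball: the radius below which half the volume lies is (1/2)^(1/n), so the "half-volume depth" 1 - (1/2)^(1/n) shrinks fast with dimension.

for n in (1, 2, 3, 10, 100, 1000):
    r_half = 0.5 ** (1.0 / n)         # radius enclosing half the volume
    print(n, round(1.0 - r_half, 4))  # depth from the surface: 0.5, 0.2929, 0.2063, 0.067, 0.0069, 0.0007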
[0 1 1 1 0 1 0 0]
[1 0 0 0 1 0 1 1]
But I somehow wouldn't go from that intuition into specific cosines. Like "Oh, look, if I divide out the lengths from the dot product, I'm getting 0.5! Why that's the cosine of 60 degrees!"
How do you define the "angle" between two n-dimensional vectors? Most likely using the dot-product and the arccos. "cos(angle) times lengths" might give a good intuition for 2D or 3D space, but it doesn't help in higher-dimensional vectors.
Two (non-collinear) lines share a plane. The angle on that plane is just the ordinary angle, no matter how many dimensions the two lines are embedded in.
In the case they are collinear, the angle between them is zero on any plane that intersects them. So that corner case works too, regardless of numbers of dimensions.
Okay, but now you've got a plane in n-dimensional space. How do you define/calculate the angle between the two vectors without falling back on the dot product?
You could say: The angle between the two vectors A and B is defined as the smallest rotation around their common normal vector so that the rotated first vector points in the same direction as the second vector. But what is the normal vector? It's a third vector C which is 90° to each of A and B. Now your definition is cyclic. Okay, then: C is a non-zero vector so that dot(A,C)=0 and dot(B,C)=0. Now you're back to using dot-products for defining angles in higher-dimensional space.
u = [u1 u2 u3]
v = [v1 v2 v3]
in dimensions i, j, k
u x v = determinant of this matrix:
= | i  j  k  |
  | u1 u2 u3 |
  | v1 v2 v3 |
= (u2v3 - u3v2)i - (u1v3 - u3v1)j + (u1v2 - u2v1)k
It's the same for all dimensions.
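A quick numerical sanity check of that expansion (my own snippet, in 3D, where np.cross is defined):

import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

# The cofactor expansion written out above ...
expansion = np.array([ u[1]*v[2] - u[2]*v[1],
                      -(u[0]*v[2] - u[2]*v[0]),
                       u[0]*v[1] - u[1]*v[0]])

# ... agrees with numpy's built-in cross product: both give [-3. 6. -3.]
print(expansion, np.cross(u, v))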
Take the first line/vector and call its direction the "first-new-dimension". Then take the second line/vector, and decide that this vector can be written as a 2-dimensional vector made out of "first-new-dimension" and "second-new-dimension"; now you just have to figure out what "second-new-dimension" is.
A simple trick would be to measure the length of the line, and then add/remove a multiple of the first dimension until the length of the line becomes as short as possible, this new line is your "second-new-dimension".
Now, even if you are working with a 10-dimensional space, you have two lines that only exist in the first two (new) dimensions, so, you can treat them as two-dimensional objects and find an angle using your old 2-dimensional methods.
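In NumPy terms, that recipe looks something like the sketch below (my own; note that computing the "shortest leftover" is itself done with a dot product, which echoes the objection upthread):

import numpy as np

u = np.random.randn(10)   # two vectors in a 10-dimensional space
v = np.random.randn(10)

e1 = u / np.linalg.norm(u)                 # "first-new-dimension"
residual = v - (v @ e1) * e1               # shortest leftover after removing the first dimension
e2 = residual / np.linalg.norm(residual)   # "second-new-dimension"

# v's coordinates in the new 2D system, then an ordinary 2D angle.
angle = np.degrees(np.arctan2(v @ e2, v @ e1))

# Same answer as the usual arccos-of-dot-product formula.
reference = np.degrees(np.arccos(u @ v / (np.linalg.norm(u) * np.linalg.norm(v))))
print(angle, reference)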
Isn't that basically the same as replacing your hardware multiplication operations with a hardware lookup table?
I'm going to try to remember the calculations I performed, but I remember just looking at permutations/combinations of 8-bit integer matrices: 2x2 seems reasonable, but 3x3 or 4x4 gets unreasonable from an algorithmic/memory perspective very quickly.
Just the other day, I was thinking about how useful clusterings seem in summarizing the landscape of a set of points in any number of dimensions. This work seems to be built on top of the same insight as well as the innovation of using hashing to speed up the similarity checks.
Also, feel free to email me at the address in the paper if you're interested in talking about it in more detail. E.g., I've already heard from some hardware folks looking at expanding on this work.
0. The problem you're really looking at is quickly computing a matrix product.
1. To do that, you approximate AB, where A is very long and thin (NxD) and B is DxM. (Note: this problem has complexity O(NDM) naively, or, if N=D=M, O(N^3), and there are smarter algorithms that reduce the exponent 3 down to 2.8 (Strassen) or even further, though those are not practical from what I gather. Oh, and note that at a minimum you're looking at time (N+M)D, because you need to look at every element, unless you approximate.)
2. You approximate AB by f(g(A),h(B)), such that the error (in "relative Frobenius norm", ie norm of the error divided by norm of the actual product) is "small" with "high probability".
3. You're given B and a "typical" Ã (i.e. training data), and then have two steps:
3.1. Do some training mumbo jumbo (fast?), and
3.2. now finally you can quickly compute the approximate AB (or a'B) for "new" A (or a'), presumably fast?
Quite cool that there are new techniques (and apparently fast, the article claims 100x faster than exact, and 10x faster than existing approximations) for such a classic problem!
But, if I understand correctly, it could happen with a certain probability that the actual error is quite high? Maybe if your a' is untypical?
ETA: From what I understand, in many ML applications people use 32bit or even 16bit floats, so high accuracy is not as important as speed anyway?
- The rows a of A are "random," but in the sense of being drawn from some distribution for which we have a training set--not in the sense of, e.g., "assume everything is gaussian" or some other idealized case.
- The training takes a few seconds to maybe a couple minutes, depending on your training set size. But it does have to be done ahead of time--you can't train after receiving the matrices and still get a speedup. Unless maybe the matrices are unrealistically large.
- The error in the approximation is provably not much larger than the error on the training set with high probability (under standard ML assumptions--finite variance, training and test distributions match), but there's no hard limit on it in general. If you know the range of values in your individual vectors, you could obtain a hard limit, although it would be much looser than what you see in practice.
- Correct that many ML applications tolerate large losses in precision. But the caveat is that we won't get the right answer to a full 16 bits in each element (this would be ~99.999% accuracy), but instead more like 90-99.9%, depending on the speed-quality tradeoff chosen. So we can't claim it will be sufficiently accurate in all cases.
For the softmax example, if the original 512-element vector is x = (x0, x1, ..., x511) and we want to compress the vector to 4 bytes (with K=16, using 4 bits per prototype index), there should be 8 subspaces. Does that mean we break the 512-element vector into eight 64-element sub-vectors, like (x0,...,x63), (x64,...,x127), (x128,...,x191), ..., (x448,...,x511), and each sub-vector, e.g. (x0,...,x63), is compressed into 4 bits? It looks like a lot of information is compressed away and lost.
I wonder in the image filtering field (such as Sobel and Gaussian filter), how to choose a compression ratio in order to get a reasonable good image quality.
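Regarding the subspace question above, here's a rough sketch of that encoding step (my own illustration, with random stand-in prototypes; in the actual method the prototypes are learned and the assignment uses the paper's hashing scheme rather than a brute-force nearest-prototype search):

import numpy as np

D, n_subspaces, K = 512, 8, 16
d = D // n_subspaces                              # 64 dimensions per subspace

x = np.random.randn(D)                            # the vector to encode
prototypes = np.random.randn(n_subspaces, K, d)   # stand-ins for learned prototypes

codes = []
for s in range(n_subspaces):
    sub = x[s*d:(s+1)*d]                               # e.g. (x0, ..., x63)
    dists = ((prototypes[s] - sub) ** 2).sum(axis=1)   # distance to each of the 16 prototypes
    codes.append(int(dists.argmin()))                  # 4-bit code: index of the nearest prototype

print(codes)   # 8 codes in [0, 16), i.e. 8 * 4 bits = 4 bytes total

So yes, each 64-element sub-vector is represented by only 4 bits, but the information isn't compressed to zeros: the code is an index into a table of prototypes (and of their precomputed dot products with the other matrix), and that table is where the actual values live.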
Could you translate into practitioner-level language what the practical limitations of this method are; specifically, what would the error rates induced by the approximation be under some practical scenarios, when would it make sense and not make sense to use it, etc.? There is a complex equation in the paper describing the theoretical error bounds, but I have no idea whether, in some practical scenario like multiplying normally distributed values, that would mean a 0.1%, 1%, 5%, or 10% error.
Personally I think it only makes sense to use this kind of method in some real-time algorithm where speed is of the essence, the downstream results of the mmult are themselves used in some other approximation (like many ML applications), and emphatically not to make the process of drawing biological conclusions from painstakingly derived data a few minutes faster for the analyst.
I fear that you have made an impressive, but dangerous, tool to people who don't know what they're doing.
To simplify Section 1.1, we help when:
1) You need to perform a matrix product more quickly and can tolerate approximation error
2) You have a training set for the larger matrix
3) The smaller matrix is either a) fixed or b) skinny relative to how tall the larger matrix is.
Re: "an impressive, but dangerous, tool to people who don't know what they're doing."
I believe you are overestimating the usability of my code :). But more seriously, I suspect that people attempting to use our method in contexts it wasn't designed for will quickly discover that they either can't actually call the API the way they wanted to, or that the method is no faster for their purposes. We also characterize our method at least as thoroughly as any approximate matrix multiplication algorithm I'm aware of, and have a variety of (admittedly loose) theoretical guarantees. So I hope that at least those who thoroughly read the paper will have a clear idea of what it can do. Overall, I guess my current thinking is that 1) I'm not sure how to introduce a method any more responsibly, but 2) if I can be of help in ensuring that it gets used well, feel free to reach out.
I'm not optimistic about beating tensor cores when running on GPUs, at least until/unless we get similar hardware support.*
Barring better hardware support, the killer app is probably CPU inference--once there are Conv implementations and the necessary GPU kernels to train the network.
*Aside: this support would be pretty doable since the kernels look almost identical to GEMM kernels--you just need a multiplex-add rather than a multiply-add. On an x86 machine, all it would take is a vpshufb-add and a 4-bit unpack instruction.
The proposed hash based encoding function is not differentiable, so it doesn’t appear this method can be used for training at all.
I’m not aware of any hash functions that are analytically differentiable, so to support efficient back-propagation I suspect that some fundamental changes to this method would be necessary.
I think this name really fits the concept well. Replacing matrix multiplication with some hash lookups! (warning: an overly simplified statement)
This is a really interesting application of PQ (product quantization), which itself also requires learning (usually K-means). Paper: https://lear.inrialpes.fr/pubs/2011/JDS11/jegou_searching_wi...
Considering that ANN has survived through many approximations (e.g. lower precision, pruning), and that many ANN applications are anyway subjective (e.g. image processing, recommendation), I think this can be highly useful in many cases.
Method 2: computer, remember this number. When I give you X and Y, tell me what that number was. This requires no multiplication! Let us tell SCIENCE about our discovery!
- You can't actually get a speedup from the proposed approach. You'd need a lookup table with 2^b * 2^b entries to multiply two b-bit numbers, which will be much slower than just doing the multiplication (a toy sketch of such a table follows below).
- Most of what we do is perform fewer operations. We're actually slower per op than multiply-adds because those are supported better in hardware. We just compress the matrices by so much that it's worth it. Relatedly, for sufficiently few codebooks, we're actually sublinear in the input size--i.e., we don't even read it all.
- We have some lightweight machine learning in there to make the sublinearity and other approximations not suck. Getting the ML fast enough to beat a BLAS GEMM is far from trivial.
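On the first point, here's a toy version of that lookup table (my own illustration): even for 8-bit operands it already needs 256 x 256 = 65,536 entries, and the memory traffic costs more than the multiply it replaces.

import numpy as np

# Precompute the products of all pairs of 8-bit values: a 256 x 256 table.
table = np.outer(np.arange(256, dtype=np.uint16),
                 np.arange(256, dtype=np.uint16))

def mul8_lut(x: int, y: int) -> int:
    """'Multiply' two 8-bit integers by table lookup."""
    return int(table[x, y])

assert mul8_lut(37, 200) == 37 * 200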
Method 1: Computer, tell me the angle between these two 3D lines!
Method 2: Computer, look at this (second) 3D line from a few different perspectives (determined by the first line) and then tell me how often it looks more vertical than horizontal. That'll give me a good enough idea of the angle between the two.
Personally, I'm not sure whether this is the path to go down or not. Doing everything in log space could help from a hardware perspective, since multiply-adds are much more complex than adds. But it 1) doesn't reduce the number of operations being performed, 2) might increase the necessary bit width for a given level of fidelity (depending on your input distribution and other factors), and 3) might make accumulate operations expensive, since these aren't cleanly expressed in the log domain.
which can be used to multiply two sparse matrices with the GraphBLAS. Here is an example in Python:
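A minimal sketch assuming the pygraphblas bindings (exact constructor and method names may differ between GraphBLAS Python packages and versions):

from pygraphblas import Matrix

# Two small sparse matrices built from (row, column, value) triples.
A = Matrix.from_lists([0, 1, 2], [1, 2, 0], [1.0, 2.0, 3.0], nrows=3, ncols=3)
B = Matrix.from_lists([0, 1, 2], [2, 0, 1], [4.0, 5.0, 6.0], nrows=3, ncols=3)

# Sparse matrix-matrix multiply over the default plus-times semiring.
C = A.mxm(B)
print(C)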
But it seems like it would make the add part of multiply and add quite a bit more difficult.
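Indeed: the multiply itself becomes an add of logs, but the accumulate needs something like a log-sum-exp. A tiny sketch (my own, and it only works for positive values):

import numpy as np

a = np.array([2.0, 3.0, 0.5])
b = np.array([4.0, 0.25, 8.0])

log_products = np.log(a) + np.log(b)         # multiply = add in the log domain
log_dot = np.logaddexp.reduce(log_products)  # accumulate = log-sum-exp, the awkward part

print(np.exp(log_dot), np.dot(a, b))         # both ~ 12.75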
No, they don't. In math, curved spaces are modeled through so-called manifolds which locally look like ℝ^n. In particular, at any point p of the manifold there's a tangent space, i.e. a linear space (higher-dimensional plane) tangent to the manifold at p. Vectors at p are just vectors in that linear space. So they are "straight" not "droopy".
On the tangent space of each point p you can now define an inner product g(p). The resulting family of inner products g is called a (Riemannian) metric on the manifold and describes how lengths (of vectors) and angles (between vectors) can be measured at each point.
> Currently, we define the curvature of space with gravity or the deflection of light past massive objects. Is there a way to measure the curvature of space locally?
Yes, there is. In fact, the curvature of a (Riemannian) manifold is a purely local quantity – it's basically the second derivative of the metric g, so it describes how the notion of length changes (more precisely: how the change in length changes) as you go from a point p to neighboring points.
There are other ways to express what curvature is, e.g. by locally parallel-transporting a vector along a closed curve (and making that curve smaller and smaller), which basically measures how the notion of straight lines changes locally. (Though, since a line being "straight" means "locally length minimizing", this brings us back to the notion of length and, thus, the metric.)
Alternatively, if the manifold has dimension 2, there's a particularly simple way of looking at and interpreting curvature.
In any case, curvature being used to model gravity is entirely separate from that idea.
* Provided this inner product "varies smoothly" as you move from p to a neighboring point q.
Let's take that one step further. Who taught orbiting bodies how to do differential equations?
As mentioned elsethread, the problem is essentially reduced to determining angles between two vectors (one of which is known ahead of time) in high dimensional space. This is done by projecting the other vector into different specially chosen subspaces and classifying it in those, then summing up the "scores" in each subspace.
Given the similar task of determining the angle between two lines in 3D space, we tend to do something very similar (I feel): We look at the pair of lines (or vectors) from a few different perspectives (="subspaces") and get a feel for how well-aligned they are in each (="score"). We can then guess pretty well how large the angle between the two is.
Of course, we frequently use other cues too (e.g. perspective and texture when talking about alignment of real-world 3D objects). And even when you are considering plain lines displayed on a screen (in which case these cues don't work), we tend to also take into account the relative movement of the objects as we (continuously) shift our perspective (e.g. drag-to-rotate a scene).
Maybe the latter part could also be a hint towards further algorithmic ideas. Maybe somehow involving (signs of) derivatives (finite differences?) or similar could be a cheap way to improve accuracy. Just spitballing here though.
Allosteric activation in enzyme binding sites acts almost like a transistor. Inside every cell there are various computations occurring.
> It is not generally known that to provide seven-decimal accuracy of the sine function, allowing third-order interpolation, only 15 entries are required [*] ...
The following commentary is interesting.
> In spite of these developments in table construction, it turned out two decades ago that the technology was such that table look-up was too slow for arithmetic. The stored-program type of machine displaced this approach. By this method the value of a function, such as sin x, is computed afresh every time. One cannot but wonder how many times such a number has been re-computed in the last decade, and inquire whether technological developments may make tables efficient again.
> Over the years the emphasis in machines has slowly changed from the central processing unit to memory. From the point of view of table look-up, the memory is the central feature of a machine. Indeed the central processing unit is unnecessary.
The [*] is Krawitz. E., "Proc. Industrial Comp. Seminar," IBM, p. 66, 1950. I cannot find this publication with only a simple search.
See also https://scholar.google.com/scholar?as_sdt=0%2C5&as_yhi=1990&... which finds titles like "A rapid lookup table method for trigonometric functions" (1982) at https://onlinelibrary.wiley.com/doi/abs/10.1002/spe.43801210... .
That got me thinking: is that so? how do you do that?
Luckily, scipy has all sorts of interpolation tools, so one can just experiment a little.
Third-order interpolation -> that means cubic splines. How many points do you need for a cubic spline approximation of the sine function to have 7-decimal-place accuracy (in other words, the error should be about 0.5e-8)? Well, the sine function being anti-symmetric and periodic, you only need to approximate it on the interval [0, pi/2]. It turns out you need 42 points, not including the ends of the interval (where we don't need to tabulate the values of sine; it is simply 0 at 0 and 1 at pi/2).
import numpy as np
from scipy.interpolate import CubicSpline
xs = np.linspace(0,np.pi/2,44,endpoint=True)
sines = np.sin(xs)
spline = CubicSpline(xs, sines)
ts = np.linspace(0,1,10000,endpoint=True)*np.pi/2
print("Max Error is ", np.max(np.abs(spline(ts)-np.sin(ts))))
>> Max Error is 5.02852541828247e-08
from scipy.interpolate import CubicHermiteSpline
xs = np.linspace(0,np.pi/2,25,endpoint=True)
sines = np.sin(xs)
cosines = sines[::-1] #note that we reuse the sines,
#we don't invoke np.cos
spline=CubicHermiteSpline(xs, sines, dydx=cosines)
ts = np.linspace(0,1,10000,endpoint=True)*np.pi/2
print("Max Error is ", np.max(np.abs(spline(ts)-np.sin(ts))))
>> Max Error is 4.775709550042251e-08
from scipy.interpolate import BPoly
xs = np.linspace(0,np.pi/2,6,endpoint=True)
sines = np.sin(xs)
cosines = sines[::-1]
spline = BPoly.from_derivatives(xs, np.vstack((sines,cosines,-sines)).T)
print("Max Error is ", np.max(np.abs(spline(ts)-np.sin(ts))))
>> Max Error is 2.057940906574629e-08
from scipy.interpolate import PPoly
from numpy.polynomial import Polynomial
# values and derivatives of sine up to 4th order at the endpoints 0 and pi/2
spline = BPoly.from_derivatives([0,np.pi/2], [[0,1,0,-1,0],[1,0,-1,0,1]])
PSpline = PPoly.from_bernstein_basis(spline)
P = Polynomial(PSpline.c[::-1,0])  # coefficients in ascending-degree order
print("Max Error is ", np.max(np.abs(P(ts)-np.sin(ts))))
>>Max Error is 1.700470619869776e-08
x −0.16666666666666785 x^3+8.881784197001252e-15 x^4+0.008331651428065356 x^5+5.167900831715144e-06 x^6−0.000204626778274708 x^7+3.5525642158307225e-06 x^8+1.8946044087406189e-06 x^9
I found the paper. https://archive.org/details/bitsavers_ibmproceedmputationSem... (Note the typo: that paper starts on page 52; Krawitz also had another paper on page 66.)
The "15 cards" is on page 56.
The paper uses Stirling's interpolation formula.
https://en.wikipedia.org/wiki/Brahmagupta%27s_interpolation_... notes that the Indian mathematician and astronomer Brahmagupta used a second-order version of the interpolation formula to compute a sine table in the early 7th century.
The Wikipedia page links to the Stirling interpolation function description at https://archive.org/details/introductiontonu00hild_0/page/13... .
It looks like it's also called Stirling’s Central Difference Formula.
Any chance you could work this out with scipy? My numeric abilities are pretty crappy.
In scipy you can use smoothing splines to get something close to optimal approximating splines with non-equidistant points. If you aim for a maximum error of 5e-8 (so when you round you have 7 exact decimal places), then with cubic splines you need 33 knots (including the ends). Now, the sine values are always less than one, so the digit before the decimal place is 0. If you count that towards accuracy, then you might be looking for an error not to exceed 5e-7, in which case scipy finds that you need 17 knots including the ends, so 15 interior knots.
from scipy.interpolate import UnivariateSpline
xs = np.linspace(0,np.pi/2,1000,endpoint=True)
s = UnivariateSpline(xs,np.sin(xs),k=3,s=0.75e-11)
ts = np.linspace(0,np.pi/2,10000)
print("Max error: " , np.max(np.abs(s(ts)-
print("Number knots: ", len(s.get_knots()))
>> Max error: 4.761484727611176e-07
>> Number knots: 17
I wish I could find a published list of that table of 15 values.
In my search, I found https://www.jstor.org/stable/2002889?Search=yes&resultItemCl... which comments that an optimum interval table of sines and cosines for 7 place values has 2700 cards instead of 9000. "Each time this table is used, more than two hours of sorting and gang punching time is saved."
The only published "optimum interval table" I could find in archive.org was https://archive.org/details/dli.ernet.212891/page/149/mode/2... where table VII is the function f(r^2) = 1/r^3.
> This is the first time that such a table has ever been printed for use with a hand calculating machine. However, the computer will find that it is easier to use than an ordinary table which requires second difference interpolation, since no interpolating coefficients are needed; and it is not necessary to pay any attention to the irregular intervals.
One of the worked out examples is:
r^2 = 4.043172
F = (0.04686041 - 0.043172 x 0.01420874 ) x (-0.043172) + 0.124999851
One of Gauss’s ‘hobbies’ was to find errors in such tables.
There even were books containing tables of products of integers. You can buy a reproduction of one today: https://www.amazon.com/Calculating-Products-Thousand-Applica...
To the replies: the very act of evaluating a prediction from a NN is matrix multiplication (a linear transform of a column vector is a matrix multiplication). This doesn't replace matrix multiplication wholesale, lol. This is about multiplication in a specific case.
Y = X*W is the forward propagation. If Z is an error or derivative of the error, Z*W^T is the back propagation.
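As a minimal sketch (my own) of why that's all matrix multiplication, for a single dense layer:

import numpy as np

X = np.random.randn(32, 256)   # a batch of inputs
W = np.random.randn(256, 128)  # layer weights

Y = X @ W                      # forward propagation: one matrix multiply

Z = np.random.randn(32, 128)   # error / derivative of the error w.r.t. Y
dX = Z @ W.T                   # back propagation: another matrix multiply
dW = X.T @ Z                   # gradient of the weights: a third one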
It's an interesting question as to how well that would work. Anything that speeds up matrix multiplies in NNs and deep learning in general would be a big deal.
I wouldn't be surprised if we started using lossier math for faster state space search. Once you find a peak, you could swap out the maths to be more exact.
> All experiments use a single thread on a Macbook Pro with a 2.6GHz Intel Core i7-4960HQ processor. Unless stated otherwise, all timing results use five trials, with each trial reporting the fastest among 20 executions.
It may also be worth noting that, even as a hobby, one can actually do a lot with modern hardware if you scope your project well. There are also a lot of free resources, such as Colab, available to those who have very limited computing power at home/work.
Last but not least, there is also nothing stopping you from announcing your results on arXiv and, if you can be bothered (as a hobbyist), get it published in a peer-reviewed journal.
So if you still have ideas, I encourage you to go ahead and try them! :)
"I would have done this years ago if I knew how to do it" is fantastically narcissistic.
I genuinely don't understand why we allow computer science papers to be published with graphs plotting the supposed efficacy of the research (i.e. not just a theoretical paper) but no code attached.
> To assess MADDNESS’s effectiveness, we implemented both it and existing algorithms in C++ and Python. All of our code and raw numerical results are publicly available at https://smarturl.it/Maddness. All experiments use a single thread on a Macbook Pro with a 2.6GHz Intel Core i7-4960HQ processor.
edit: the url above redirects to: https://github.com/dblalock/bolt
Theory has its place, and so does systems work. No need to denigrate one or the other.
If we can't reproduce it isn't really science. I know academics often write bad code and don't like to publish their dirty work, but the buck has to stop somewhere.
If you roll everything, financial incentives, emotional readiness, support and preparation, etc, into 'opportunity' then yeah, I believe it.
I have only superficially looked at the paper, but it's pretty clear they are using an offline lookup table. There are no such problems involved.
Basically, we can't be sure that you really have a C++ compiler. The source code can be verified to be a C++ compiler, but the binaries deviate in some way, possibly malicious.
I suppose the better term in this context is error propagation.
The goal in compiler boostrapping is to reach a fixed point, in a single iteration. When the compiler binary compiles itself, what should come out is a faithful replica of that compiler binary. Having done compiler bootstrapping, I know about hair-pulling problems that occur when you don't have a fixed point. E.g. a bug in the optimization causes the compiler to subtly miscompile itself. The resulting compiler then severely miscompiles itself; e.g. fails to recompile itself entirely, or produces something that no longer runs at all. (And that's actually a lucky case; a compiler compiling itself is basically a dog food test. When that fails, you've caught a bug early. What keeps you awake at night is the compiler compiling itself perfectly fine, but rarely miscompiling other code in the wild.)
Anyway, this sort of malfunction is exactly what will happen if you take a working, self-hosted compiler and then replace some important chunk of it with some hokey machine-learning approximation. It's unlikely you can do that without breaking the fixed point.
I don't see how you can not have fixed point issues in the matrix scenario. Maybe it will stabilize if you iterate on it. You take the machine learning framework with approximate matrix multiplications in it and run the process for figuring out those matrix multiplications. You may get some different values in it, which you can substitute back and try again. Will that converge? If it does, how do you know the whole thing is still valid? (What definition of valid applies?)
As far as single-threading goes, there's a simple answer and a subtle one. The simple answer is that we consider the core subroutine a single thread would run in a multithreaded context, not how to do the multithreading. These are basically orthogonal issues, since matrix multiplication is embarrassingly parallel and our method would parallelize just like any other. More details in appendix E2.
The subtler answer is that we could do even better in the multithreaded context if you could fuse the encoding step with whatever earlier code produces the larger matrix. This is a result of matrix products becoming memory-bandwidth-bound in the limit of sufficient cores, combined with our approach reducing the size of the matrix by a huge amount.