
Algorithm efficiency comes from problem information - ChrisRackauckas
http://www.stochasticlifestyle.com/algorithm-efficiency-comes-problem-information/
======
ssivark
Yes, but... that somewhat misses the point of machine "learning"! Maybe the
author is being a bit facetious in making an overly broad interpretation of an
important point.

When one wants to use the best ML model "in production" (after all the
exploration is done and dusted), or when there is a "domain expert" who can
specify the right biases to encode into the algorithm (i.e. problem
information), the point made in the post is very valid.

Often, however, the point of using ML is to tease out some property that was
not previously apparent. In that case, the hope is to use some class of
algorithms (e.g. neural networks) with the right general biases (e.g.
convolutional layers and ReLU activations, or sharing the first few layers of
VGG) so that it can pull out the appropriate aspects of the problem
specification that _you did not know to ask for_.

From this perspective, the challenge lies in the 'No Free Lunch' theorem,
which implies that unless you know enough about the problem to indirectly
encode the correct biases (e.g. in the case of image classification, choosing
neural networks, convolutions, etc. rather than a decision tree), you can
never win the game.

~~~
joshuamorton
I'm going to quibble with your last point. A decision tree can absolutely
classify images. Depending on exactly how you specify the problem, it can do
so exactly as well as a neural network; it would just be a lot less
space-efficient.

No free lunch isn't about choosing the right biases, it says that a single
model cannot solve every problem. And in fact any solver will have accuracy
equivalent to chance when averaged across all problems. Here's an example:

There are four possible one-bit to one-bit functions: I and ~ (the identity
and inverse functions), as well as T and F (constant True and constant False).
If you create a truth table for all four of those functions:

    
    
        func  input | output
        I       0   |    0
        I       1   |    1
        ~       0   |    1
        ~       1   |    0
        T       0   |    1
        T       1   |    1
        F       0   |    0
        F       1   |    0
    

You'll notice that the output column contains an equal number of one bits and
zero bits. This is true for binary functions of any input size, and in
general, if you extend "an equal number of one bits and zero bits" to "an
equal number of each possible output class", it is true for functions of any
output size.

That's what the No Free Lunch theorem states: that across _all_ possible
functions in your space, you can do no better than chance, because for any
function F that you approximate exactly, I can define a function "!F" (or, in
a non-binary space, a family of functions) on which you are exactly 0%
accurate. This is true no matter the algorithm or assumptions you choose.
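
Here's a quick sketch of that averaging argument for the 1-bit case (my own
illustration, in Python; note every deterministic 1-bit predictor is itself
one of the four functions):

    # Over all four 1-bit target functions (I, ~, T, F), any fixed
    # predictor averages exactly 50% accuracy.
    from itertools import product

    inputs = [0, 1]
    targets = [lambda x: x,      # I (identity)
               lambda x: 1 - x,  # ~ (inverse)
               lambda x: 1,      # T (constant true)
               lambda x: 0]      # F (constant false)

    for p0, p1 in product([0, 1], repeat=2):   # all possible predictors
        hits = sum(f(x) == (p0 if x == 0 else p1)
                   for f in targets for x in inputs)
        print((p0, p1), hits / 8)              # always prints 0.5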

Similarly, I can define a classification algorithm that has no implicit
assumptions about what I am classifying, other than that it can be classified,
but still achieves 100% accuracy on any specific problem I apply it to:

For every possible input in your function space, get the ground truth label
from some external function, throw those all into a mapping (a very, very
large mapping) and then just look up the input in the mapping when requested.
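
In (hypothetical) Python, the whole "algorithm" is just:

    # The giant-lookup "classifier" described above: memorise the oracle's
    # label for every possible input, then answer queries by table lookup.
    # 100% accurate on this one problem, useless on any other.
    def build_lookup_classifier(all_inputs, oracle):
        table = {x: oracle(x) for x in all_inputs}  # a very, very large mapping
        return lambda x: table[x]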

For image classification, this means a mapping of all possible NxN images to a
boolean: "Contains dog". You might even, as an optimization make this more
space efficient by picking certain pixels that are more impactful and looking
at them first, so that you don't have to store every image. That sounds an
awful lot like a decision tree ;)

Tl;dr: 'No Free Lunch' has nothing to do with the efficiency of the encoding;
it is about the impossibility of a single perfect agent that can solve every
problem.

------
xiaodai
The ideas behind the post are gold. I have blogged about group-by strategies,
and one of the key themes is that the more you know about the inputs, the more
you can optimize your algorithms (see
[https://www.codementor.io/zhuojiadai/an-empirical-study-of-group-by-strategies-in-julia-dagnosell](https://www.codementor.io/zhuojiadai/an-empirical-study-of-group-by-strategies-in-julia-dagnosell))

------
SilasX
>What does my algorithm know about my problem?

I’ve phrased the same idea as “what regularity of the problem space is my
algorithm exploiting?” A good first question to demystify any purported fast
algorithm.

~~~
dahart
I've been thinking of this prior knowledge as "domain reduction", or even
"dimensionality reduction". The more I know about the problem, the more of the
space I can just remove from the search before I start. In high dimensional
spaces, each dimension removed can be loosely/conceptually like an order of
magnitude.

I like your phrase too, as there are classes of knowledge about a problem that
don't necessarily reduce the domain, depending on how you look at it, but can
still be exploited.

~~~
TeMPOraL
My pet phrasing is more like: each bit of domain knowledge I have lets me cut
the problem space in half.
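
(Back-of-the-envelope, with made-up numbers:)

    # k independent bits of knowledge about a search space of size n
    # leave roughly n / 2**k candidates to examine.
    n, k = 1_000_000, 10
    print(n / 2**k)  # ~977 candidates left after 10 bits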

------
grandalf
I made essentially this argument in the thread for the Google announcement
about beating Stockfish and was downvoted mercilessly.

I think neural nets and deep learning systems are misunderstood so profoundly
because humans are bad at many kinds of reasoning and the explanation of the
neural net being "smart" is so psychologically appealing.

------
Strilanc
> _In fact, you can show that solving a linear systems has to have O(n^3) as a
> lower bound_

Where did the author get this idea? It can be done more efficiently than that:
[https://mathoverflow.net/a/52557/40272](https://mathoverflow.net/a/52557/40272)

~~~
danbruc
I think you are misreading the answers; it states the well-known fact that
matrix multiplication and inversion can be done is less than O(n³) but it does
not say this about solving linear systems. If you read through the material
referenced in the answers it seems that O(n³) is indeed a lower bound and
actually hard to obtain in practice.

~~~
tmyklebu
You can compute the inverse of a matrix A in asymptotically the same amount of
time as you can multiply two matrices. You can also compute a Cholesky
factorisation in that amount of time, and compute T^(-1) A, where T is (upper
or lower) triangular, in that amount of time. You invert via A^(-1) =
(A^T A)^(-1) A^T, the latter computed by Cholesky and two triangular solves.
You do the Cholesky factorisation and the triangular solves by
divide-and-conquer.
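
A small NumPy/SciPy sketch of that identity (just an illustration of the
reduction, not an asymptotically fast implementation; the divide-and-conquer
versions are what achieve matrix-multiplication time):

    # Invert a general (invertible) A using only a Cholesky factorisation
    # of the SPD matrix A^T A plus two triangular solves:
    #   A^(-1) = (A^T A)^(-1) A^T
    import numpy as np
    from scipy.linalg import cholesky, solve_triangular

    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 5))                # assumed invertible

    L = cholesky(A.T @ A, lower=True)              # A^T A = L L^T
    Y = solve_triangular(L, A.T, lower=True)       # solve L Y = A^T
    A_inv = solve_triangular(L.T, Y, lower=False)  # solve L^T X = Y

    assert np.allclose(A_inv @ A, np.eye(5))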

So no, the article is wrong.

~~~
ChrisRackauckas
Yes, I made a mistake. Gaussian elimination is O(n^3), but there are general
algorithms which are as fast (or slow, depending on how you look at it) as the
current complexity bound for matrix multiplication which is O(n^{2.737}). Not
the end of the world, I corrected the article. Reference:

[http://theory.epfl.ch/vishnoi/CargeseLectures.pdf](http://theory.epfl.ch/vishnoi/CargeseLectures.pdf)

Still, the point stands that you'd never want to actually solve a large system
"the generic way" unless you really have to (i.e. you cannot identify a
structure for which you have a faster algorithm).
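
For a concrete (illustrative) example of identifying structure: if you know
the system is, say, tridiagonal, a banded solver runs in O(n) instead of the
generic dense fallback:

    # Same solution two ways: generic dense solve vs. exploiting the
    # known tridiagonal structure via SciPy's banded solver.
    import numpy as np
    from scipy.linalg import solve, solve_banded

    n = 2000
    rng = np.random.default_rng(1)
    main = 4.0 + rng.random(n)      # diagonally dominant => well-posed
    off = rng.random(n - 1)
    b = rng.random(n)

    # Generic path: build the full matrix, O(n^3) solve.
    A = np.diag(main) + np.diag(off, 1) + np.diag(off, -1)
    x_dense = solve(A, b)

    # Structured path: hand over only the three bands, O(n) solve.
    ab = np.zeros((3, n))
    ab[0, 1:] = off                 # superdiagonal
    ab[1, :] = main                 # main diagonal
    ab[2, :-1] = off                # subdiagonal
    x_banded = solve_banded((1, 1), ab, b)

    assert np.allclose(x_dense, x_banded)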

~~~
tmyklebu
It looks like you also turned the rest of your post into a link.

The best currently known exponent for square matrix multiplication is around
2.3728; Wikipedia's page on matrix multiplication has a reference.

~~~
ChrisRackauckas
> However, the constant coefficient hidden by the Big O notation is so large
> that these algorithms are only worthwhile for matrices that are too large to
> handle on present-day computers.

No wonder I never heard about these. I updated the post for correctness (and
fixed the link, ugh), but yeah, these are details I'd hope not to dwell on,
since for all intents and purposes you'd still never want to use the generic
fallback unless you have to.

~~~
tmyklebu
Yep. (Largely) irrelevant in practice. But the gap between what we can do and
what lower bounds we can prove is very large indeed for many matrix
operations.

------
Terr_
Alternately, you could say that more information allows you to recognize that
you're faced with a _different problem_, one which is more-specific and may
have a more-efficient solution.

The outermost, most-general, and least-tractable problem being "Do what's
needed for whatever."

~~~
adrianratnapala
Yes. And it is interesting that general neural networks can't really solve
that one. The closest thing to a solution we have for it is the human animal
-- and it is still a pretty specific beast.

------
Veedrac
> However, while these features matter, most decent implementations of the
> same algorithm tend to get quite close in efficiency to each other (probably
> <3x if everyone is using a performant enough programming language or
> constructs).

No, I don't think this is even close for most problems. Neural networks,
maybe, but they're an unusually simple case where everyone uses similar,
largely GPU-accelerated code.

~~~
ChrisRackauckas
At least in scientific computing this applies. And note I say most decent
implementations, so I assume you took it to a language without performance
issues like C/Fortran/Rust/Julia/Go/etc., got rid of temporary allocations,
made sure you looped in the correct order, etc. After you do all of that, for
mathematical and scientific problems you tend to hit a wall. [BLAS libraries
tend to be within that range of each
other](https://github.com/barche/julia-blas-benchmarks/blob/master/BenchmarkResults.ipynb),
which pretty much directly means that regression, linear classification, etc.
libraries are all going to be within this amount. Different differential
equation solvers which are sufficiently optimized with the same algorithm tend
to be within that amount of each other. Neural network libraries tend to be
within that amount of each other. General optimization algorithms over a large
class of problems tend to be the same. That doesn't mean it's not a worthy
enterprise to keep making the "general fallbacks" better; indeed, this helps
everyone.

If you're looking to make your machine learning software faster, writing a new
piece of fundamental GPU code to beat cuDNN might eke out some speed but
probably won't get you very far. But changing your model to better match the
data could make the problem orders of magnitude easier.

~~~
Veedrac
You can't show that the effects of "temporary arrays, choice of language, and
parallelism" and "pointer indirection, cache efficiency, and SIMD
vectorization" don't matter unless you compare one thing that does those
things to another thing that doesn't.

BLAS libraries are all within the range of each other because they all store
memory in basically the same way, do indirection in basically the same way,
care for the cache in basically the same way and do SIMD in basically the same
way.

As soon as you step out of those prebuilt blocks and build your own function
over squares of numbers, you're going to be losing far more than a factor of 3
until you understand how to handle these issues.

~~~
newen
>You can't show that the effects of "temporary arrays, choice of language,
and parallelism" and "pointer indirection, cache efficiency, and SIMD
vectorization" don't matter unless you compare one thing that does those
things to another thing that doesn't.

I think he said the exact opposite. He's basically saying that algorithmic
time complexity matters (and therefore identifying your problem class matters,
since you might get a more efficient algorithm out of it) because you can only
get so far with the generic algorithm, however good your implementation is.

~~~
Veedrac
The argument is symmetric; once you've tuned the algorithm, the only way to
get >10x speedups is by working with those low-level things.

_Both_ the broad-strokes algorithm and the fine details are responsible for
huge performance differences.

------
marmaduke
Another example of this is found in comparing generic MCMC methods to HMC. The
former work for discrete or continuous parameters, while the latter requires a
gradient. Regardless of how fast a generic MCMC implementation is, the HMC
method is asymptotically faster, so for some problem sizes the language or
low-level optimizations are irrelevant.
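
A minimal sketch of the two kinds of steps (toy standard-normal target,
illustrative step sizes, untuned; just to show that only HMC consumes the
gradient):

    # Gradient-free random-walk Metropolis vs. gradient-based HMC,
    # both targeting a standard normal in d dimensions.
    import numpy as np

    def log_p(x):      return -0.5 * np.dot(x, x)
    def grad_log_p(x): return -x

    def rwm_step(x, rng, scale=0.1):
        prop = x + scale * rng.standard_normal(x.shape)
        return prop if np.log(rng.random()) < log_p(prop) - log_p(x) else x

    def hmc_step(x, rng, eps=0.1, n_steps=20):
        p = rng.standard_normal(x.shape)
        x_new = x.copy()
        p_new = p + 0.5 * eps * grad_log_p(x_new)   # initial half kick
        for i in range(n_steps):                    # leapfrog integration
            x_new = x_new + eps * p_new
            step = eps if i < n_steps - 1 else 0.5 * eps
            p_new = p_new + step * grad_log_p(x_new)
        # Accept/reject on the joint (position, momentum) energy.
        log_a = (log_p(x_new) - 0.5 * np.dot(p_new, p_new)) \
              - (log_p(x)     - 0.5 * np.dot(p, p))
        return x_new if np.log(rng.random()) < log_a else x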

