
A visual proof that neural nets can approximate any function - antman
http://neuralnetworksanddeeplearning.com/chap4.html
======
a-nikolaev
Approximate, not compute. The function also must be continuous. NNs are good
for approximation / interpolation / extrapolation, which makes them quite
useful for certain domains of problems. But of course, it does not make them a
kind of universal computing machine (in the computability sense, like
universal Turing machines).

~~~
13415
Wait a minute, isn't it also the case that according to the Weierstrass
approximation theorem any continuous function on a closed interval can be
approximated by a polynomial function? And isn't that kind of pointless for
practical applications because we also need to avoid overfitting?

To clarify, I'm not trying to make a snippy remark, I just happened to have
used polynomial curve fitting before and looked up the Wikipedia page for the
Stone-Weierstrass theorem and am trying to figure out the relevance of that
post on NNs. Is it essentially the same claim?

Any clarification appreciated!

~~~
xtacy
Yes, the claims are pretty much in the same spirit. The original (Weierstrass) theorem [1] was stated for real-valued functions on a closed interval [a, b]; the Stone-Weierstrass theorem [2] is a generalisation that applies in more general settings. See the formal statements:

- [1] [http://mathworld.wolfram.com/WeierstrassApproximationTheorem...](http://mathworld.wolfram.com/WeierstrassApproximationTheorem.html)

- [2] [http://mathworld.wolfram.com/Stone-WeierstrassTheorem.html](http://mathworld.wolfram.com/Stone-WeierstrassTheorem.html)
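
Roughly, [1] says: for every continuous f on [a, b] and every ε > 0, there is a polynomial p with |f(x) − p(x)| < ε for all x in [a, b].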

Neural Networks use a different "basis" (sigmoid, ReLU, etc.), but the
underlying idea shares the same spirit.

~~~
mturmon
Yes. A NN consisting of, for example, sigmoids will form an “algebra” that separates points in the sense of the Stone-Weierstrass theorem, and the NN approximation result will follow.

All that’s really needed is that the limit of the NN basis function is different at plus versus minus infinity on the real line. This will give you the “separates points” property.

------
fluffything
So can any Lagrange polynomial, Fourier series, ...

The real question is whether the approximation is a good one:

- can you prove error bounds?

- can you bound the maximum error?

- is it efficient? (low storage, low computational effort to evaluate)

- is it fast to build? (low computational effort to compute the coefficients)

- derivatives: how well does it approximate gradients, what's the error on the gradient, can one bound it, how fast can one evaluate them, etc.

- there are many other interesting properties:
[https://en.wikipedia.org/wiki/Approximation_theory](https://en.wikipedia.org/wiki/Approximation_theory)

By pretty much every one of these criteria from approximation theory, neural nets are one of the worst methods for approximating a continuous function. If you were to make an analogy with sorting algorithms, they would be worse than bogosort. There are no error bounds, you can't bound the maximum error, computing their coefficients is very slow (training, needs GPUs, ...), and they require a lot of storage and computational power to evaluate.
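
For contrast, here's a throwaway numpy sketch (mine, not from the article) of Chebyshev interpolation of a smooth function: the coefficients come from a single fit at Chebyshev nodes, and you can measure the worst-case error on a dense grid.

```
import numpy as np
from numpy.polynomial import chebyshev as C

f = np.cos          # any smooth target on [-1, 1]
deg = 15            # degree of the approximation

# Sample at Chebyshev nodes and fit a degree-`deg` Chebyshev series.
k = np.arange(deg + 1)
nodes = np.cos(np.pi * (k + 0.5) / (deg + 1))
coeffs = C.chebfit(nodes, f(nodes), deg)

# Worst-case error on a dense grid.
x = np.linspace(-1, 1, 10001)
print(np.max(np.abs(C.chebval(x, coeffs) - f(x))))   # tiny, and shrinks fast as deg grows
```

For smooth functions this kind of construction comes with real convergence theory behind it, which is the point of the list above.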

~~~
0815test
> From pretty much every single aspect of approximation theory, neural nets
> are one of the worst methods to approximate a continuous function.

What about scalability? I don't know of many approximation methods that can routinely work with the number of coefficients, datapoints, dimensionality of data, etc. that neural networks are coping with. (Though AIUI compressed sensing methods might come close; compressed sensing can be seen as a kind of approximation as well.)

~~~
simonster
Yes, neural nets are successful in large part because they are asymptotically more efficient than other models. Training time is O(n) with O(1) memory, and prediction time is O(1) with O(1) memory. Compare to e.g. kernel methods, which have nicer theory behind them, but kernel least squares is O(n^3) with O(n^2) memory to fit and O(n^2) with O(n^2) memory to predict. The constant factors are larger for neural nets, but if your data are big enough, the asymptotics win out.

~~~
blamestross
Training time for a NN is not O(n); it is a function of the dataset size and the complexity of the NN needed to approximate a given function. Similarly, the memory cost is a function of the size of the network required. The same is true for prediction time and memory cost. If your data are big enough, all the O(1) lies we tell ourselves start breaking down.

~~~
simonster
By this logic, (naive) matrix multiplication is not O(n^3) because it is a
function of the precision required. The size of the neural network required to
approximate a given function to within some epsilon does not change with the
dataset size.

~~~
fluffything
Do you have a proof that the neural network approximates the function within some epsilon for all possible inputs within some range?

~~~
simonster
The universal approximation theorem guarantees that a finite-width neural network that approximates the function to within some epsilon exists. But, regardless of the approximation method, there is no way to certify that a given approximation is sufficient for an arbitrary continuous function given only a finite number of samples (i.e., without oracle knowledge of the underlying function), which is the typical situation where neural networks are applied. I can construct a continuous function that has an arbitrary (but finite) number of peaks in an arbitrary interval. Thus, any method that approximates the function within some epsilon for all possible inputs within that interval must encode an arbitrary amount of information. I can also ensure that whatever the number of samples is, it's not enough to properly approximate the function.

------
soVeryTired
People make far too big a deal of the universal function approximation
property of a hidden-layer neural network. Universality should be a basic
property of any decent interpolation method.

Piecewise linear regression is a universal function approximator.
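
For instance (a minimal numpy sketch, names mine): a dense enough grid of knots plus np.interp is already within any epsilon of a continuous target on a bounded interval.

```
import numpy as np

f = lambda x: np.sin(5 * x) * np.exp(-x)   # any continuous target
knots = np.linspace(0, 2, 200)             # finer grid -> smaller error

x = np.linspace(0, 2, 10001)
approx = np.interp(x, knots, f(knots))     # piecewise linear through the knots
print(np.max(np.abs(approx - f(x))))       # shrinks as the knot grid is refined
```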

~~~
salty_biscuits
And GMMs and 4th-order DAEs and a bunch of other things. The really interesting thing is whether the sensitivity of the loss function to the weights is simple. The best case would be a convex optimization problem, e.g. SVMs.

------
mkl
I think you can explain it even more clearly with smooth relative shifts,
rather than sharp bump functions. I made a quick demo:
[https://www.desmos.com/calculator/rfaqogkbmy](https://www.desmos.com/calculator/rfaqogkbmy)

Drag the sliders for _w_ and _n_ to change how step-like the sigmoids are and
how many are combined. The purple lines are the sigmoids (the relative changes at
each regularly spaced position), which are added together to make the blue
approximation to the red function. You can change the function _f_ ( _x_ ) to
see how it handles other possibilities, including piecewise/discontinuous ones
like "{x<.5: 0, x>=.5: x}".

~~~
Gravityloss
This deserves its own submission. Maybe the number of neurons slider could be
logarithmic?

~~~
mkl
Done:
[https://news.ycombinator.com/item?id=19711416](https://news.ycombinator.com/item?id=19711416)

I couldn't find a nice way of making the slider logarithmic (in Desmos the
only way to do it is with an intermediate variable, which is kind of
confusing), so I reduced the maximum number on the slider. I also fixed a bug.

------
personjerry
I skimmed it, but at a glance, isn't this almost identical to the Taylor
Series?

[https://en.wikipedia.org/wiki/Taylor_series](https://en.wikipedia.org/wiki/Taylor_series)

~~~
YjSe2GMQ
Well, in the sense that polynomials are universal approximators too (on a
compact domain). See this comment:

[https://news.ycombinator.com/item?id=19709834](https://news.ycombinator.com/item?id=19709834)

~~~
tprice7
The Stone-Weierstrass theorem is not related to Taylor series; in fact, the target function need not have a derivative, let alone all higher-order derivatives, which are a prerequisite for the Taylor series to even exist.

------
acdc4life
"No matter what the function, there is guaranteed to be a neural network so
that for every possible input, x , the value f(x) or some close approximation)
is output from the network"

Okay, so what? You require more and more neurons (ie. parameters) to
approximate your function better and better. You can do the same with
piecewise constant (Riemann sums). You can do this with trig functions too
(Fourier transform).

"This result tells us that neural networks have a kind of universality."

I don't know what this statement means. What mathematical properties do neural
networks have that other functions don't? The ability to approximate
continuous functions isn't special. Given 5 points, I can perfectly fit an
elephant to your function. And it's not like you are fitting the function with
as few parameters as possible.

~~~
dual_basis
It's a baseline desirable property. It tells us that, at the very least, neural networks are capable of approximating any continuous function. This isn't true for linear functions, for example, so we wouldn't want to try to model everything using linear functions.

------
aportnoy
*any continuous function

~~~
throwawaymath
While we're at it, doesn't the "universality theorem" (as the article calls it) basically follow immediately from the fact that the set of all continuous functions forms a vector space?

If the continuous function is additive, it's linear. If it's nonlinear, you can differentiate it to obtain a linear approximation. A neural network computes linear transformations, so unless I'm missing something I'm a little surprised there's a substantive theorem for this. Is it not a corollary of the fact that we can construct a vector space of all continuous functions?

~~~
braised_babbage
One of the things about continuous functions is that they aren't necessarily
differentiable, cf.
[https://en.wikipedia.org/wiki/Weierstrass_function](https://en.wikipedia.org/wiki/Weierstrass_function)

~~~
throwawaymath
True, but that was kind of an afterthought to my point. All continuous functions still form a vector space.

~~~
soVeryTired
I don't see how that helps?

> A neural network computes linear transformations

They compute a nonlinear function of an affine transform. Neural networks are
nonlinear functions.

------
codesternews
Any deep learning experts here? Why can't a neural network compute a simple linear function (Celsius to Fahrenheit) 100% accurately?

Is it the data, or is it something that can be optimised?

```
import numpy as np
import tensorflow as tf

celsius_q = np.array([-40, -10, 0, 8, 15, 22, 38], dtype=float)
fahrenheit_a = np.array([-40, 14, 32, 46, 59, 72, 100], dtype=float)

for i, c in enumerate(celsius_q):
    print("{} degrees Celsius = {} degrees Fahrenheit".format(c, fahrenheit_a[i]))

l0 = tf.keras.layers.Dense(units=1, input_shape=[1])
model = tf.keras.Sequential([l0])

model.compile(loss='mean_squared_error', optimizer=tf.keras.optimizers.Adam(0.1))
history = model.fit(celsius_q, fahrenheit_a, epochs=500, verbose=False)

print("Finished training the model")
print(model.predict([100.0]))  # prints 211.874, which is not 100% accurate (100 * 1.8 + 32 = 212)
```

What can be done to make this NN 100% accurate for the simple linear equation f = 1.8c + 32?

[https://colab.research.google.com/github/tensorflow/examples...](https://colab.research.google.com/github/tensorflow/examples/blob/master/courses/udacity_intro_to_tensorflow_for_deep_learning/l02c01_celsius_to_fahrenheit.ipynb#scrollTo=Y2zTA-rDS5Xk)

~~~
czr
* Fix the data. Right now the optimal coefficients on your data (using least-squares) are m=1.79794911, b=31.952525636156476, which yields 211.74743638 when predicting on 100.

* Tune the hyperparameters. In particular, tune the learning rate. To quote the Deep Learning Book [0]:

> _The learning rate is perhaps the most important hyperparameter. If you have
> time to tune only one hyperparameter, tune the learning rate. It controls
> the effective capacity of the model in a more complicated way than other
> hyperparameters—the effective capacity of the model is highest when the
> learning rate is correct for the optimization problem, not when the learning
> rate is especially large or especially small._

The following code will yield exactly 212 almost every run (using fixed data
and a different choice of learning rate):

```
import numpy as np
import tensorflow as tf

celsius_q = np.array([-40, -10, 0, 8, 15, 22, 38], dtype=float)
fahrenheit_a = np.array([x * 1.8 + 32 for x in celsius_q], dtype=float)

for i, c in enumerate(celsius_q):
    print("{} degrees Celsius = {} degrees Fahrenheit".format(c, fahrenheit_a[i]))

l0 = tf.keras.layers.Dense(units=1, input_shape=[1])
model = tf.keras.Sequential([l0])

model.compile(loss='mean_squared_error', optimizer=tf.keras.optimizers.Adam(lr=1.0))
history = model.fit(celsius_q, fahrenheit_a, epochs=500, verbose=False)

print("Finished training the model")
print(model.predict([100.0]))
```

[0]
[https://www.deeplearningbook.org/contents/guidelines.html](https://www.deeplearningbook.org/contents/guidelines.html)

~~~
tylerhou
Code blocks don't work on HN; you need to format all your code with spaces:

    import numpy as np
    import tensorflow as tf

    celsius_q = np.array([-40, -10, 0, 8, 15, 22, 38], dtype=float)
    fahrenheit_a = np.array([x * 1.8 + 32 for x in celsius_q], dtype=float)

    for i, c in enumerate(celsius_q):
        print("{} degrees Celsius = {} degrees Fahrenheit".format(c, fahrenheit_a[i]))

    l0 = tf.keras.layers.Dense(units=1, input_shape=[1])
    model = tf.keras.Sequential([l0])

    model.compile(loss='mean_squared_error', optimizer=tf.keras.optimizers.Adam(lr=1.0))
    history = model.fit(celsius_q, fahrenheit_a, epochs=500, verbose=False)

    print("Finished training the model")
    print(model.predict([100.0]))

------
alleycat5000
If you're interested in classical approximation with polynomials, check out
Nick Trefethen's lectures on the subject.

[https://people.maths.ox.ac.uk/trefethen/atapvideos.html](https://people.maths.ox.ac.uk/trefethen/atapvideos.html)

Chebfun is pretty cool too!

[http://www.chebfun.org](http://www.chebfun.org)

------
rodionos
For an alternative approach, if you treat this function as a time series where x is time, you can get a reasonably good approximation by performing SVD of the trajectory matrix and building the forecast from the principal components (eigenvectors) using a recurrent formula (a rough sketch of the reconstruction step follows the charts below).

Here's an example:

[https://apps.axibase.com/chartlab/9922f98f](https://apps.axibase.com/chartlab/9922f98f)

* Chart 1. Function value for x in [0, 1).

* Chart 2. Function value for x in [0, 2).

* Chart 3. Function value for x in [0, 1) and extrapolated values for x in [1, 2).
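
For the curious, a rough numpy sketch of just the SVD/reconstruction step (the forecasting recurrence is left out, and the function name is mine):

```
import numpy as np

def ssa_reconstruct(series, window, rank):
    """Rank-`rank` reconstruction of a series from the SVD of its trajectory matrix."""
    n = len(series)
    k = n - window + 1
    # Trajectory (Hankel) matrix: column j is series[j : j + window]
    X = np.column_stack([series[j:j + window] for j in range(k)])
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Xr = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # low-rank approximation
    # Diagonal averaging back to a 1-D series
    out = np.zeros(n)
    counts = np.zeros(n)
    for j in range(k):
        out[j:j + window] += Xr[:, j]
        counts[j:j + window] += 1
    return out / counts

t = np.linspace(0, 4 * np.pi, 400)
noisy = np.sin(t) + 0.3 * np.random.randn(t.size)
smooth = ssa_reconstruct(noisy, window=60, rank=2)
print(np.max(np.abs(smooth - np.sin(t))))       # much closer to the clean signal than the noisy input was
```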

------
zepearl
Skimming through the article, I understand that the author...

A) was focusing on functions that take a certain number of input variables, and

B) that the function (which s/he mirrored using the neural net) directly computes one or more results from them.

C) To do that s/he used a backpropagation network (which is the only model I know very well).

Right or wrong?

EDIT: when I say "directly" I mean that the function does not feed back into itself.

~~~
mkl
If you want to know the details, you should read the article more thoroughly.
There is no back-propagation (or any training) involved, as this article is
about what kinds of things neural networks can do in principle. I.e. how we
can be sure that neural networks can in theory solve some problem we have. In
practice, you have to actually find a network (by training) that solves your
problem with a reasonable amount of resources, using the data you have, and
that's a whole other issue (in fact, an entire research field!).

~~~
zepearl
Ok, thanks for the explanation.

------
iheartpotatoes
Isn't this a bit like saying you just created an nth-order polynomial regression for any set of points x/y that someone hands you? They have essentially the same number of parameters, and both can only approximate one function. So.... no kidding?

------
chriscaruso
Great book, this is how I learned NNs

------
uuwp
This proof is largely irrelevant in the real world. An interesting question
would be how much can be approximated with a model that has 1 MB worth of
weights and can use only relu/tanh/softmax activations.

~~~
darkkindness
The first paragraph of the conclusion addresses that this is merely a proof of
what is possible, not what is practical:

> The explanation for universality we've discussed is certainly not a
> practical prescription for how to compute using neural networks! In this,
> it's much like proofs of universality for NAND gates and the like. For this
> reason, I've focused mostly on trying to make the construction clear and
> easy to follow, and not on optimizing the details of the construction.
> However, you may find it a fun and instructive exercise to see if you can
> improve the construction.

------
rsiqueira
How would it be possible to implement/approximate a simple SIN or COS function using a neural network? I would like to implement it but I was not able to find out how.

------
drpixie
A system mirroring a polynomial has the same properties as a polynomial - gee whizz - that really doesn't deserve to be at the top of Hacker News!

------
uuwp
Of course they can, because an NN can be any function. If we can pick tanh as the "activation" then we can as easily pick arctan as the activation and say our NN computes arctan. What an achievement! A better question is whether conv+relu based NNs can approximate any function. But that's most likely false because there are many weird functions that are impossible to compute, or even approximate (I'm talking about those curious counter-examples in math).

~~~
DavidSJ
This is a non sequitur in this context. The universality described here
depends only on changing connection weights, not the neuronal activation
functions. An important caveat is the approximated function must be
continuous, but that covers a very large family.

~~~
uuwp
I don't think every continuous function can be approximated this way, because we can make an infinitely complex but continuous function whose every n-th derivative is also continuous. I'm thinking about those weird Riemann-zeta-style functions. In order to approximate such a function we'd need a huge model that couldn't be computed or stored even by a universe-sized perfect computer.

~~~
DavidSJ
It’s a theorem, so it’s been proven:
[https://en.wikipedia.org/wiki/Universal_approximation_theore...](https://en.wikipedia.org/wiki/Universal_approximation_theorem)
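
Roughly: for any continuous f on a compact set K ⊂ R^n and any ε > 0, there is a finite sum g(x) = Σ α_i σ(w_i · x + b_i) (for a suitable activation σ, e.g. a sigmoid) with |f(x) − g(x)| < ε for every x in K.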

Another caveat that I forgot in my previous comment is the domain has to be
compact (closed and bounded). But if so, then it doesn’t really matter how
weird your continuous function is, because compactness of the domain
guarantees uniform continuity, i.e. your delta only depends on epsilon and not
x in the epsilon-delta criterion of continuity. That allows you to partition
the domain into patches of diameter delta, in which very simple functions are
sufficient to approximate within epsilon.

------
yters
Can neural networks solve the halting problem?

~~~
dagav
No. Neural networks are still computed on Turing machines, which are
mathematically proven to not be able to solve the halting problem.

~~~
segfaultbuserr
It doesn't matter if neural networks are computed on Turing machines or not. The term _"compute"_ or _"computation"_ is commonly defined _by_ a Turing machine, as all other models have been shown to have equivalent power, and so far there is no indication that the Church-Turing thesis is false. So if one says a model can perform arbitrary computation, it cannot be stronger than a Turing machine, unless one is talking about hypercomputation, which is still largely a conjecture.

Just my two cents, correct me if I'm wrong.

~~~
contravariant
Well, you're not exactly wrong but what you're saying is a bit weird.

Technically something capable of arbitrary computation in the Turing machine
sense can be stronger than a Turing machine (the obvious example being a
Turing machine with access to a halting oracle).

Also if you want to show something is limited by the capabilities of a Turing
machine it's way easier to point out it's being run on a Turing machine, as
opposed to showing it's capable of arbitrary computation (which might not even
be sufficient, as I explained above).

As it stands it's not entirely obvious that a neural network with access to
arbitrary precision arithmetic might not be more powerful than a Turing
machine, but since we couldn't possibly construct a neural network precise
enough that's a bit of a moot point.

------
otabdeveloper2
> neural nets can approximate any continuous function

So-called "neural nets" are just logistic regression with a fancy name. Seems
"deep learning" is just a wavelet transform with a sigmoid basis. (So, boring
math stuff we already knew forever, plus marketing mumbo-jumbo.)

~~~
soVeryTired
> Seems "deep learning" is just a wavelet transform with a sigmoid basis

Well... more like an iterated transform.

You're right, most of the mathematics is very old indeed. What changed was the
hardware (parallelism), software (easy-to-use autodiff packages) and the
availability of data.

There's a lot of hype in the field, but some of that hype is deserved. Computer vision was practically in crisis in the late '00s and early '10s. No significant progress was being made, and there were few strategic directions to move in that hadn't been done to death already. Then _smash_: along comes deep learning, which changed everything.

------
Xcelerate
I'd like to see a neural network that can compute a hash function.

~~~
bowyakka
[http://cs.nju.edu.cn/lwj/L2H.html](http://cs.nju.edu.cn/lwj/L2H.html)

~~~
chewzerita
This is just crazy! I can't even begin to understand what is going on... Bravo

------
jjtheblunt
title is false since there exist uncomputable functions, no?

------
quickthrower2
Can it compute y=sin x?

~~~
hcs
> First, this doesn't mean that a network can be used to exactly compute any
> function. Rather, we can get an approximation that is as good as we want. By
> increasing the number of hidden neurons we can improve the approximation.

~~~
tsbinz
Also, the article doesn't mention a crucial part of the universal
approximation theorem: It is about functions on compact subsets of R^n, so it
doesn't say anything about functions that take the whole of R as input.
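
On a compact interval, though, even a tiny network fits something like sin just fine in practice. A quick Keras sketch (layer sizes, epochs etc. are arbitrary guesses, not tuned):

```
import numpy as np
import tensorflow as tf

x = np.linspace(0, 2 * np.pi, 2000).reshape(-1, 1)
y = np.sin(x)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="tanh", input_shape=[1]),
    tf.keras.layers.Dense(32, activation="tanh"),
    tf.keras.layers.Dense(1),
])
model.compile(loss="mse", optimizer=tf.keras.optimizers.Adam(0.01))
model.fit(x, y, epochs=200, batch_size=64, verbose=0)

print(np.max(np.abs(model.predict(x) - y)))  # small-ish on [0, 2*pi]
```

Predictions outside the training interval fall apart immediately, which is the compactness caveat in practice.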

~~~
ska
It's also important to note the theorem doesn't help you find the right
weights, just notes they exist.

------
nisuni
Any orthonormal ~set~ basis of functions can represent any function. So what?

~~~
anonytrary
No, the set must be an orthonormal _basis_. For example, two Fourier
components technically form an orthonormal set, but do not form a complete
basis.

~~~
nisuni
Yes, I meant basis. My point still stands: so what?

