
Implementing a Neural Network from Scratch in Python - vzhou842
https://victorzhou.com/blog/intro-to-neural-networks/
======
amasad
Runnable code from the article: [https://repl.it/@vzhou842/An-Introduction-to-
Neural-Networks](https://repl.it/@vzhou842/An-Introduction-to-Neural-Networks)

I see so many implementations of NeuralNets from Scratch on Repl.it (I'm a co-
founder), but this is one of the simplest and clearest ones.

One that blew my mind recently is one by a 12-year-old programmer:
[https://repl.it/@pyelias/netlib](https://repl.it/@pyelias/netlib)

------
fredley
My first reading of this headline was "Implementing a Neural Network _in_
Scratch". Now that I would like to see.

[https://scratch.mit.edu/](https://scratch.mit.edu/)

~~~
dbcurtis
That would be a challenge.

Back in the heyday of expert systems and the original Macs, I did an expert
system in Hypercard. Not really production worthy, but interesting from the
perspective of exploring the UI.

~~~
fit2rule
Still got it somewhere? I've got a project to demonstrate a bunch of Hypercard
stacks in a retro-computing context, and it'd definitely be interesting to
see, if you'd consider finding it and sharing it ...

~~~
dbcurtis
Oh gosh.... that was so many years ago. I suspect I could never find the
floppy, assuming I haven't chucked it.

All I can remember about it is that it was structured to ask the user one
query per card, and there was some kind of global function that kept the state
of the consultation and navigated to the next card.

The expert system that I implemented was a simple toy, because the point was
to explore the UI aspects. For instance, you could easily embed images and ask
"Click the image that most closely resembles your specimen." Stuff like that,
intermixed with traditional text queries.

~~~
fit2rule
Well, it was worth the try .. sounds interesting though! There's a resurgent
interest in Hypercard these days .. if you ever get the druthers to go look
for your old stack, I'm sure it'll be of interest to some of us. :)

------
gambler
Why are deep learning researchers allergic to meaningful variable names?

~~~
chrisfosterelli
These are meaningful variable names in the context of a code implementation of
a math algorithm. The variables directly map to the algorithmic notation.

~~~
jacobolus
And when trying to do algebraic manipulation on paper, having an 8 or 10
letter variable name is incredibly cumbersome.

Frankly in the middle of a numerical algorithm it is typically also cumbersome
in code to have descriptive variable names for everything.

However, mathematical code (especially when written by scientists, etc.) often
takes this too far, introducing many 1- or few-letter variable names without
enough context or description to figure out what they stand for, and for more
of the variables than necessary.

This blog post was just fine though. h = hidden, o = output, y_pred =
predicted value of y, etc. are quite clear.

~~~
gambler
_> h = hidden, o = output, y_pred = predicted value of y, etc. are quite
clear_

    
    
      ∀ abbr + to the cogntv load prsn wht mntn in thr hd.
    
      ∀ -> every
      + -> adds
      wht ->  who has to
      h -> head
    

Was that easier to read than the following?

    
    
      Every abbreviation adds to the cognitive load the person has to maintain in their head.
    

This isn't code, but the principles of making code (un)readable are exactly
the same.

If you see that full-word variable names create too much clutter, it means
your code structure is wrong. So you fix it.

You go from this:

    
    
        self.w1 = np.random.normal()
        self.w2 = np.random.normal()
        self.w3 = np.random.normal()
        self.w4 = np.random.normal()
        self.w5 = np.random.normal()
        self.w6 = np.random.normal()
    

To this:

    
    
        set_to_random_numbers(self.weights)
    

After 9 years of maintaining several dozen legacy codebases (which sometimes
involved financial math) I am fully convinced that the only people who like
terse code are people who write it and never read it after.
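
For what it's worth, in numpy the refactor above could look something like
this (just a sketch; the vector sizes mirror the article's w1..w6 and b1..b3,
the rest is illustrative):

    
    
        import numpy as np
    
        class OurNeuralNetwork:
            def __init__(self):
                # one weight vector and one bias vector instead of nine named scalars
                self.weights = np.random.normal(size=6)  # replaces w1..w6
                self.biases = np.random.normal(size=3)   # replaces b1..b3
    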

~~~
jacobolus
Are you suggesting that we should write all of our mathematical expressions
using prose, and stop using symbols for operators?

That would be the original approach historically (before the past 500 years),
e.g. for the quadratic formula, Brahmagupta (628 CE):

> _To the absolute number multiplied by four times the square, add the square
> of the middle term; the square root of the same, less the middle term, being
> divided by twice the square is the value._

Go ahead and do what you like, but I doubt you’ll find many publishers who
will accept your paper in the 21st century.

I know which one imposes more cognitive load for me. But disclaimer: I spent a
lot of time from age 5–20 working with mathematical notation.

~~~
gambler
I am suggesting that just because software does "math" doesn't change how
people read it. Bad coding practices like cryptic variable names and
repetitions will affect it _exactly in the same ways_ they affect software in
all other domains.

~~~
jacobolus
> _just because software does "math" doesn't change how people read it._

It absolutely does. Different problem domains (and different communities’
treatment of problems) involve differing types and amounts of formal
structure, differing conventional notations, etc., and in practice the code
looks substantially different (in organization, abstractions used, naming,
...) even if you try to standardize it to all look the same.

People who are reading “math” code can be expected to understand mathematical
notation, e.g. to be capable of reading a journal paper where an algorithm or
formula is described more completely including motivation, formal derivation,
proofs of correctness, proofs of various formal properties, ...

Mathematical code is often quite abstract; unlike company-specific business
logic, the same tool might be used in many different contexts with inputs of
different meanings. There really isn’t that much insight gained by replacing
_i_ with “index”, _x_ with “generic_input_variable”,

or to take it to an extreme, _ax_ \+ _b_ with add(multiply(input_variable,
proportionality_constant), offset_constant)

or sin(x) with
perpendicular_unit_vector_component_for_angle(input_angle_measure)

The extra space overhead of the long variables and words instead of symbols is
a killer for clarity.

If variable names are “cryptic” [as in, can’t be guessed at a glance by
someone working in the field] then that is indeed a failure though. Short
variable names should have limited scope (ideally fitting on one screenfull of
code) and obvious meaning in context, which might involve some explanatory
comments, links to a paper.

~~~
YeGoblynQueenne
>> People who are reading “math” code can be expected to understand
mathematical notation, e.g. to be capable of reading a journal paper where an
algorithm or formula is described more completely including motivation, formal
derivation, proofs of correctness, proofs of various formal properties, ...

The majority of machine learning papers are very well stocked in terms of
heavy mathematical-y notation, but are very, _very_ low on formal derivation,
proofs of correctness, proofs of anything like formal properties, or even
motivation ("wait, where did this vector come from?"). Most have no
theoretical results at all- only definitions.

So let's not overdo it. The OP is making a reasonable demand: write complex
code in a way that makes it easily readable _without_ being part of an elite
brotherhood of adepts who know all the secret handshakes and shibboleths.

A great deal of complexity could be removed from machine learning papers by
notating algorithms _as algorithms_ rather than formulae. For example, you can
say exactly the same thing with two "for i to j" and two summations with top
and bottom indices. Sometimes the mathematical notation can be more compact-
but when your subscripts start having subscripted superscripts, it's time to
stop and think what you're trying to do.

Besides- the OP did talk about _code_ not _papers_. Code has to be maintained
by someone, usually someone _else_. Papers, not so much.

~~~
FridgeSeal
If you're working in a domain, is it really that much to ask to become
familiar with it? Especially if the domain has a large theoretical component.

When we teach people software engineering we teach them concepts like "give
your variables meaningful names". Now that we're in sub-domain of implementing
some mathematics in software, I'd argue that matching the variables and
functions to their source (more or less) _is_ exactly "giving your variables
meaningful names".

> A great deal of complexity could be removed from machine learning papers by
> notating algorithms as algorithms rather than formulae

And you would immediately lose the ability to quickly and easily recognise
similar patterns and abstractions that mathematical notation so fluently
allows.

~~~
YeGoblynQueenne
>> If you're working in a domain, is it really that much to ask to become
familiar with it? Especially if the domain has a large theoretical component.

 _If_ the domain has a large theoretical component. Here, we're talking about
statistical machine learning and neural networks in particular, where this is,
for the most part, not at all the case.

>> And you would immediately lose the ability to quickly and easily recognise
similar patterns and abstractions that mathematical notation so fluently
allows.

I disagree. An algorithm _is_ mathematical notation, complete with immediately
recognisable patterns and abstractions (for-loops, conditional blocks, etc).

And, btw, so is computer code: it is _formal_ notation that, contrary to
mathematical formulae that require some familiarity with the conventions of a
field to read and understand, has an objective interpretation - in the form of
a compiler for the language used.

So machine learning papers could very well notate their algorithm in Python,
even a high-level abstraction (without all the boilerplate) of the algorithm,
and that would actually make them much more accessible to a larger number of
people.

Mathematical notation, as in formulae, is not required- it's a tradition in
the field, but that's all.

However, that's a bit of a digression from the subject of the naming of
variables. Apologies. It's still relevant to the comprehensibility of formal
notation.

------
11thEarlOfMar
"We’ll use an optimization algorithm called stochastic gradient descent (SGD)
that tells us how to change our weights and biases to minimize loss. It’s
basically just this update equation:

w1 ← w1 − η ∂L/∂w1

η is a constant called the learning rate that controls how fast we train. All
we’re doing is subtracting η ∂L/∂w1 from w1:

\- If ∂L/∂w1 is positive, w1 will decrease, which makes L decrease

\- If ∂L/∂w1 is negative, w1 will increase, which makes L decrease"

Simple indeed.
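
In code, that whole update is one line per weight; a tiny sketch with made-up
numbers, assuming the partial derivative has already been computed by
backprop:

    
    
        learn_rate = 0.1          # η, the learning rate
        w1, d_L_d_w1 = 0.5, 0.2   # example weight and its partial derivative ∂L/∂w1
        w1 = w1 - learn_rate * d_L_d_w1  # step opposite the gradient, so L decreases
    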

~~~
joshklein
I think this comment is facetious (apologies if I’m reading it incorrectly),
so I want to offer that this is “simple”, but opaque to someone who is
unfamiliar with the syntax or background context.

Andrew Ng’s course on machine learning from Stanford (on Coursera, which you
should be able to audit for free) gives great intuitive explanations of these
topics, if you’re curious enough to invest a couple solid weekends to get the
foundations. The foundations alone will get you very far indeed.

~~~
cr0sh
Seconding Ng's course here.

I took it in 2011 (ML Class) before Coursera existed, and it finally opened my
eyes not only to how backprop worked, but also to how everything in a NN
could be represented and calculated using vectors and matrices (linear
algebra), and how that process was "parallelizable".

The course uses Octave as its programming environment, which is essentially
an open-source and (mostly) compatible implementation of Matlab.

My first thought was "Finally! A use case for a home Beowulf cluster that is
somewhat practical!"

It really opened my eyes and mind to a number of concepts that I had looked
into before, but couldn't quite wrap my brain around completely.
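
As a rough sketch of what that vectorized view looks like for the article's
tiny 2-2-1 network (shapes and names here are illustrative, not from the
article's code):

    
    
        import numpy as np
    
        def sigmoid(x):
            return 1 / (1 + np.exp(-x))
    
        # whole layers as matrices: one matmul per layer instead of per-neuron sums
        W1, b1 = np.random.normal(size=(2, 2)), np.zeros(2)  # input -> hidden
        W2, b2 = np.random.normal(size=(1, 2)), np.zeros(1)  # hidden -> output
    
        x = np.array([-2.0, -1.0])   # one sample
        h = sigmoid(W1 @ x + b1)     # hidden activations
        o = sigmoid(W2 @ h + b2)     # network output
    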

------
arendtio
Recently, I watched a Scrimba tutorial [1] about brain.js. It brought me
quite a step further in understanding what different kinds of neural networks
are capable of (NN vs. RNN vs. CNN).

[1]:
[https://scrimba.com/g/gneuralnetworks](https://scrimba.com/g/gneuralnetworks)

------
kowdermeister
I started playing around with NNs recently and this looked like a good
tutorial until the partial derivatives :) It would have been so much better to
turn a complex math concept like that into code in smaller chunks, because
it's hard to tell from the full sample which part refers to which equation.

Also, the bias is not explained at all.

\- Why is it there?

\- What does it do?

\- What value should it have?

~~~
dragandj
I'm currently publishing a series of detailed articles that covers exactly that.

[https://dragan.rocks/articles/19/Deep-Learning-in-Clojure-
Fr...](https://dragan.rocks/articles/19/Deep-Learning-in-Clojure-From-Scratch-
to-GPU-0-Why-Bother)

~~~
victor106
dragon.rocks, rocks. Your articles are some of the best I have seen on this
topic. Thank you. I hope you write a book on this topic.

~~~
dragandj
Thanks! We'll see. BTW, it's dragAn, not dragOn :)

------
inputcoffee
There are so many little details to remember when you implement a Neural
Network from "scratch". Or, I suppose, even if you do not.

You know what would be a great contribution? An extensive set of unit tests,
or even just problems with solutions. That way people can write their own
implementations and test them. And even if a person were to implement the net
in Pytorch or Tensorflow, they could test the work.

So there would be a matrix of weights, and a vector of input nodes, and the
"answer" would be the output vector. Then there would be another "answer"
which is the output with a particular activation function, and so on.

This library would just be there so people who are doing their own
implementation can test their work.

As I said, a unit test would work too but then it would have to be language
specific. Just the matrix and answer would be language agnostic.

For people who think: can't you just make up an example yourself with a sheet
of paper or in Excel?

Yes, for most purposes this is fine, but if you forget one little
implementation detail of a three-layer network with a ReLU, you really want an
external way to check that.
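
Something like the following, say (a hedged sketch of the fixture idea in
numpy; the numbers are arbitrary, the point is that the same plain numbers
could be checked against any implementation):

    
    
        import numpy as np
    
        def sigmoid(x):
            return 1 / (1 + np.exp(-x))
    
        weights = np.array([[0.1, 0.2],
                            [0.3, 0.4]])   # rows = inputs, columns = neurons
        inputs = np.array([1.0, 2.0])
    
        pre_activation = inputs @ weights              # "answer" #1
        assert np.allclose(pre_activation, [0.7, 1.0])
    
        activated = sigmoid(pre_activation)            # "answer" #2, with the activation
        assert np.allclose(activated, [0.66818777, 0.73105858])
    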

~~~
muttled
Built-in testing would be fantastic. Being able to tell if you designed a
model wrong or just made an error setting up the code would reduce frustration
a ton!

~~~
yorwba
If you're implementing your own derivatives, you should probably use numerical
methods to check that the gradient is computed correctly, and pytorch comes
with tools to help with that
[https://pytorch.org/docs/stable/autograd.html#numerical-
grad...](https://pytorch.org/docs/stable/autograd.html#numerical-gradient-
checking)
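
For the from-scratch numpy case, a minimal sketch of such a check via central
differences (the helper names just mirror the article's sigmoid functions):

    
    
        import numpy as np
    
        def sigmoid(x):
            return 1 / (1 + np.exp(-x))
    
        def deriv_sigmoid(x):
            # the analytic derivative being tested
            return sigmoid(x) * (1 - sigmoid(x))
    
        x = np.linspace(-3, 3, 7)
        eps = 1e-6
        numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)  # central difference
        assert np.allclose(numeric, deriv_sigmoid(x), atol=1e-6)
    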

If you're reimplementing something for learning purposes, you can just compare
the output you get with the existing implementation.

But as soon as you're trying to do something novel, no automatic test can tell
you whether the model architecture or your implementation of it is at fault
when you don't get the results you'd like, because there's nothing you could
use for comparison.

------
eam
I didn't have time to read all of it, but I'm halfway through. So far it has
been a pretty good write up; however, just one question.

Can the author or anyone explain how the shift amounts were chosen? I.e.
weight (minus 135) and height (minus 66). Why were -135 and -66 chosen? An
explanation would be helpful. Thanks!

~~~
LodeOfCode
_" I arbitrarily chose the shift amounts (135 and 66) to make the numbers look
nice. Normally, you’d shift by the mean."_

~~~
eam
I am pretty sure that note wasn't written there before. The author must have
updated the article _after_ I posted this question. Though "nice" is still
somewhat vague to me. Nice in what way? Smaller numbers are nicer? Negatives
and positives make it nice?

The second part of the answer is more useful information, _" Normally, you’d
shift by the mean."_
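
For reference, the "shift by the mean" version is a one-liner with numpy
(using the weights and heights from the article's table):

    
    
        import numpy as np
    
        # rows: [weight (lb), height (in)] for Alice, Bob, Charlie, Diana
        data = np.array([[133, 65],
                         [160, 72],
                         [152, 70],
                         [120, 60]], dtype=float)
    
        shifted = data - data.mean(axis=0)  # center each column on its mean
        # the article instead shifts by the hand-picked round numbers 135 and 66
    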

------
lowpro
This is a great write up! I'm in the process of writing a similar article,
although I think that for many non-programmers, neural networks can be useful
through ready-made algorithms that solve problems where people have the data
but don't understand the math.

This article is good if you've been around some math and have a little coding
experience; however, I think Neural Nets are even more accessible than the
author makes them out to be.

For example, using something like YOLO for object detection can be done even
if you've never taken algebra or a programming class, and really just requires
some configuration and training on a dataset you have.

------
syntaxing
If you're interested in learning something like this, I highly recommend
Udacity's deep learning nanodegree. They have a section that teaches you how
to build your own neural network with the help of numpy. They even have a
section where you write your own sentiment analysis neural network from
scratch. Easily the best explanation and notebook to follow I have seen so
far. I've been taking their revamped one that focuses on PyTorch rather than
Tensorflow.

Disclaimer: I am taking the deep learning nanodegree for free, but it's
sponsored by Facebook rather than Udacity directly.

~~~
applecrazy
The notebooks are available on GitHub for free.

------
rrggrr
Great, but the problem with these summaries is the lack of real-world
data/use cases. An actual business-case example with code is easier to
comprehend for less math-inclined folks.

~~~
cr0sh
The problem with that is these kinds of simplified "how it works" neural net
examples don't scale for real world problems - especially when using an
interpreted language like Python.

Even a simple but useful scenario like the MNIST number recognition system
would probably run "dog-slow" on such a neural network, given the number of
input nodes, plus the size of the hidden layer - the combinatorial "explosion"
of edges in the graph, all the processing needed during backprop and
forwardprop...

It might be doable with C/C++ - but it still won't run as well as it could.

These examples are really just meant as teaching tools; they are the bare-
bones basics of neural networks, to get you to understand the underlying
mechanisms at work (actually, the XOR network is the real bare-bones NN
example, because it requires so few nodes that it can be worked out by
pencil-and-paper methods).

Real-world implementations are best done using more advanced libraries and
tools, like Tensorflow and Keras, or similar (and moving to C/C++ and/or CUDA
if even more processing power is needed).

~~~
ldng
Maybe you missed the disclaimer?

    
    
      "The code below is intended to be simple and educational, NOT optimal."

------
ddebernardy
One of the nice parts of NNs is that it takes some maths education to
understand what's going on. My initial thought when reading the title was: "uh
oh, are AI tutorials going to be the next PHP/MySQL tutorials, with god-awful
code all over the internet?" They haven't become that yet, insofar as I'm
aware. I hope the maths involved will prevent them from becoming that.

~~~
deytempo
Luckily, math isn’t actually required to implement them in a series of
arrays, so that little relief will be short-lived.

------
cgopalan
I took the Pytorch scholarship challenge this past December with Udacity. I
couldn't make it to the nanodegree scholarship but the challenge course was
really good in exposing me to the various facets of deep learning. The
notebooks are free, but the videos really help explain some concepts well. For
people who took the course, the videos are available for another year, so that
helps.

I wanted to get a good grasp of DL but was floundering around before with so
many sources and tutorials - and frameworks like Keras, Pytorch, TensorFlow
etc. This course helped me make some decisions - like if I needed to get a
developer's handle on DL, I would have to know Perceptrons, CNNs, RNNs, Style
Transfer, Sentiment Prediction and GANs (even though GANs were not part of the
course). And Pytorch would be my tool of choice.

------
techbio
Very cool. I want to give a shout out to the simple classic
[http://arctrix.com/nas/python/bpnn.py](http://arctrix.com/nas/python/bpnn.py)

------
posix_compliant
Friendly comment: at the beginning of section 4 there's this sum:

    
    
        1
        ∑(y_true - y_pred)^2
        i=1
    

There's no 'i' in the sum, should probably be y_true_i - y_pred_i.

~~~
posix_compliant
I do get that in the example given there's only one element in each of the
true/predicted vectors, but it threw me off for a sec.

------
gillesjacobs
This is very similar to Andrew Ng's Deep Learning Specialization course [1].
If you found this blog post enjoyable be sure to check it out. It is a great
course and the intuitions behind NNs are explained very clearly.

1\. [https://www.coursera.org/specializations/deep-
learning](https://www.coursera.org/specializations/deep-learning)

------
starpilot
So many examples of this:
[https://hn.algolia.com/?query=python%20neural%20scratch&sort...](https://hn.algolia.com/?query=python%20neural%20scratch&sort=byPopularity&prefix&page=0&dateRange=all&type=story)

Is this example better than previous ones?

------
iamaziz
The visualization is cool, thanks!

Here's another neural network example from scratch in 13 lines of Python:

[https://iamtrask.github.io/2015/07/27/python-network-
part2/](https://iamtrask.github.io/2015/07/27/python-network-part2/)

------
ujuj
See this code posted on HN a month ago too:
[https://gist.github.com/macournoyer/620a8ba4a2ecd6d6feaf](https://gist.github.com/macournoyer/620a8ba4a2ecd6d6feaf)

I found the code very clear and understandable.

------
yreaderynotread
The derivative of sigmoid is exp(-x) / (1+exp(-x))² and not exp(x) / (1+exp(-x))².
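
(For reference, via the chain rule with σ(x) = 1/(1 + exp(-x)):

    
    
        d/dx σ(x) = d/dx (1 + exp(-x))^(-1)
                  = -(1 + exp(-x))^(-2) · (-exp(-x))
                  = exp(-x) / (1 + exp(-x))²
                  = σ(x) · (1 − σ(x))
    

which is the σ(x)(1 − σ(x)) form the article's code uses.)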

------
laneb
I appreciate the author showing some of the math that drives the optimization.
As a reader, I'd suggest introducing SGD ahead of the PD calculations, because
it would give a clearer motivation for the PDs.

------
buildbot
Since this is based on numpy, you could quickly have it GPU accelerated by
using cupy: [https://cupy.chainer.org/](https://cupy.chainer.org/)
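
A rough sketch of the drop-in idea (assuming a CUDA GPU with cupy installed;
cupy mirrors most of the numpy API, so the article's code largely ports by
swapping the import):

    
    
        import cupy as np  # instead of: import numpy as np
    
        def sigmoid(x):
            return 1 / (1 + np.exp(-x))
    
        weights = np.random.normal(size=6)  # allocated and computed on the GPU
        print(sigmoid(weights))
    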

------
fartcannon
It's fun, but it's cheating a bit if you use numpy. Can someone implement one
with all the matrix math from scratch too?

~~~
siekmanj
shameless self-promotion:
[https://github.com/siekmanj/sieknet](https://github.com/siekmanj/sieknet)

~~~
fartcannon
Looks great. Even got yourself a genetic algorithm in there, eh? How well does
it work?

~~~
siekmanj
The genetic algorithm currently isn't in a working state unfortunately. I did
a big architectural rewrite recently and haven't gotten around to GA stuff -
but I'm looking forward to applying some more modern results in neuroevolution
from the last two or three years.

------
avmich
> Real neural net code looks nothing like this.

This negates a lot of the promise of the article. If one understands the
principles but needs to know how to apply them, this statement makes the
article practically useless.

Usually after a read like this there's a jump in the explanations, which
directs you to apply TensorFlow et al. and even, if you're lucky, explains how
to do that - but doesn't explain what it is that TensorFlow does qualitatively
differently, which would justify putting in such a phrase as a warning.

~~~
sgillen
I still think it's a useful exercise for people like me who learn best by
implementing a thing.

Tensorflow, Pytorch, etc introduce a lot of magic for someone who is
completely unfamiliar with the area.

~~~
avmich
> I still think it's a useful exercise for people like me who learn best by
> implementing a thing.

Yes, but if you implement one thing, and then TensorFlow, which you plan to
use, does something very different, what's the point? Ok, different enough to
justify the warning.

> Tensorflow, Pytorch, etc introduce a lot of magic for someone who is
> completely unfamiliar with the area.

Exactly. It's that magic which is interesting, after you're shown how in
principle - but only in principle - the thing could work.

~~~
sgillen
I guess the point here is that in this article we implement some of the magic
that the higher level frameworks use, which I think will be useful later when
writing and debugging our high level code.

But I see your point, which is that it would be nice if the article told us
which part of the high-level library we're implementing and what that would
look like.

I'm still happy with the article though and would not call it almost useless.
But hey, if you'd rather jump right into a tensorkerastorch tutorial, that's
an option too!

~~~
avmich
> I'm still happy with the article though and would not call it almost
> useless.

I really like the article. My problem is the next step, which I look for,
rather unsuccessfully.

> But hey if you'd rather jump right into a tensorkerastorch tutorial that's
> an option too!

Do you know good tutorials for that? Preferably on the level of this article,
only regarding those libraries?

~~~
sgillen
Pytorch's official one is very good:
[https://pytorch.org/tutorials/beginner/pytorch_with_examples...](https://pytorch.org/tutorials/beginner/pytorch_with_examples.html)

They even start with NumPy, as the article does.
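
For a sense of what that looks like, here's a rough, hedged sketch of the
article's 2-2-1 network in PyTorch's high-level API (not taken from either
tutorial; the data is the article's shifted dataset):

    
    
        import torch
        import torch.nn as nn
    
        model = nn.Sequential(
            nn.Linear(2, 2),  # 2 inputs -> 2 hidden neurons (roughly w1..w4, b1, b2)
            nn.Sigmoid(),
            nn.Linear(2, 1),  # hidden -> 1 output (roughly w5, w6, b3)
            nn.Sigmoid(),
        )
        opt = torch.optim.SGD(model.parameters(), lr=0.1)
        loss_fn = nn.MSELoss()
    
        # Alice, Bob, Charlie, Diana: [weight - 135, height - 66] -> gender
        x = torch.tensor([[-2., -1.], [25., 6.], [17., 4.], [-15., -6.]])
        y = torch.tensor([[1.], [0.], [0.], [1.]])
    
        for epoch in range(1000):
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()  # autograd replaces the hand-written partial derivatives
            opt.step()
    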

------
jackallis
How come information like this, intended for beginners, never says why you
even need a NN or when you would even use it?

~~~
Shorel
Why do you need to know linear algebra or calculus in programming?

A poignant answer is: you never really know! The future will surprise you!

Watching the movie about AlphaGo and all the other stuff DeepMind has done
will give you all the answers you need about the "why" of NNs.

The question in the first line has an obvious answer and it is just
rhetorical. The bigger point is: this mindset doesn't apply just to Neural
Networks, many things in mathematics are useful years after they are first
invented, and it is better to learn general principles and have an open mind,
Richard Feynman style.

------
ecmascript
This is very cool and informative. Thank you for this article!

------
alexlchen2019
amazing read, good content

~~~
noperks
probably the most easy going read-through i've seen

------
theblackcat1002
I think a better and more elegant way to implement this is the pytorch way. By
defining a forward and backward function for each layer, you can easily chain
multiple layers into a single model. Placing the training loop inside the
layer makes it hard to write a multi-layer version.

Just wrote similar from-scratch code a few months back:
[https://github.com/theblackcat102/DL_2018/blob/master/HW2/la...](https://github.com/theblackcat102/DL_2018/blob/master/HW2/layers.py)
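
A minimal sketch of that forward/backward layer pattern (names and interfaces
here are illustrative, not taken from the linked repo):

    
    
        import numpy as np
    
        class Sigmoid:
            def forward(self, x):
                self.out = 1 / (1 + np.exp(-x))
                return self.out
            def backward(self, grad):
                return grad * self.out * (1 - self.out)
    
        class Dense:
            def __init__(self, n_in, n_out):
                self.w = np.random.normal(size=(n_in, n_out))
                self.b = np.zeros(n_out)
            def forward(self, x):
                self.x = x                  # x has shape [batch, n_in]
                return x @ self.w + self.b
            def backward(self, grad):
                self.dw = self.x.T @ grad   # kept for whatever optimizer you use
                self.db = grad.sum(axis=0)
                return grad @ self.w.T
    
        layers = [Dense(2, 2), Sigmoid(), Dense(2, 1), Sigmoid()]
    
        def forward(x):
            for layer in layers:
                x = layer.forward(x)
            return x
    
        def backward(grad):
            for layer in reversed(layers):
                grad = layer.backward(grad)
    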

