I see so many implementations of neural nets from scratch on Repl.it (I'm a co-founder), but this is one of the simplest and clearest ones.
One that blew my mind recently is one by a 12 year old programmer: https://repl.it/@pyelias/netlib
Back in the heyday of expert systems and the original Macs, I did an expert system in HyperCard. Not really production-worthy, but interesting from the perspective of exploring the UI.
All I can remember about it is that it was structured to ask the user one query per card, and there was some kind of global function that kept the state of the consultation and navigated to the next card.
The expert system that I implemented was a simple toy, because the point was to explore the UI aspects. For instance, you could easily embed images and ask "Click the image that most closely resembles your specimen." Stuff like that, intermixed with traditional text queries.
Frankly in the middle of a numerical algorithm it is typically also cumbersome in code to have descriptive variable names for everything.
However, mathematical code (especially when written by scientists, etc.) often takes this too far, introducing many 1- or few-letter variable names without enough context or description to figure out what they stand for, and for more of the variables than necessary.
This blog post was just fine though. h = hidden, o = output, y_pred = predicted value of y, etc. are quite clear
∀ abbr + to the cogntv load prsn wht mntn in thr hd.
∀ -> every
+ -> adds
wht -> who has to
h -> head
Every abbreviation adds to the cognitive load the person has to maintain in their head.
If you see that full-word variable names create too much clutter, it means your code structure is wrong. So you fix it.
You go from this:
self.w1 = np.random.normal()
self.w2 = np.random.normal()
self.w3 = np.random.normal()
self.w4 = np.random.normal()
self.w5 = np.random.normal()
self.w6 = np.random.normal()
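One sketch of where that refactor might go (the shapes assume the post's 2-input, 2-hidden-neuron, 1-output network; the class and attribute names here are mine, not the article's):

```python
import numpy as np

class NeuralNetwork:
    def __init__(self):
        # One array per layer instead of six separately named scalars:
        # two hidden neurons with two input weights each, plus one
        # output neuron with two weights.
        self.hidden_weights = np.random.normal(size=(2, 2))
        self.output_weights = np.random.normal(size=2)
```

Looping over weights then becomes natural, and adding a neuron is a shape change rather than a seventh variable name.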
I think your claim that abbreviations are always more cognitive load is wrong. For the sake of argument, let's take it at face value though. These are not even code abbreviations. They are literal translations to the math notation. The input "x" is not an abbreviation, it's a well defined value in neural network terminology.
If they called it something more descriptive like "input_vector_to_first_layer_of_neural_net", this would be more cognitive load because someone reviewing this now needs to mentally map this to "x" in the algorithm anytime they are reviewing both.
Now, to your claim itself, I think it's unfair to say abbreviations are always more cognitive load. These variables are significantly more approachable because they follow convention. I see "x" and I know what that is. If every individual developer went ahead and rewrote every neural net with something they viewed as more interpretable in their personal context, it'd be a lot harder to understand what is going on for everyone. The variable itself may be longer, but I might actually need a prose explanation of the variable's purpose because now I don't have the context of well established naming convention.
That would be the original approach historically (before the past 500 years), e.g. for the quadratic formula, Brahmagupta (628 CE):
> To the absolute number multiplied by four times the square, add the square of the middle term; the square root of the same, less the middle term, being divided by twice the square is the value.
Go ahead and do what you like, but I doubt you’ll find many publishers who will accept your paper in the 21st century.
I know which one imposes more cognitive load for me. But disclaimer: I spent a lot of time from age 5–20 working with mathematical notation.
It absolutely does. Different problem domains (and different communities’ treatment of problems) involve differing types and amounts of formal structure, differing conventional notations, etc., and in practice the code looks substantially different (in organization, abstractions used, naming, ...) even if you try to standardize it to all look the same.
People who are reading “math” code can be expected to understand mathematical notation, e.g. to be capable of reading a journal paper where an algorithm or formula is described more completely including motivation, formal derivation, proofs of correctness, proofs of various formal properties, ...
Mathematical code is often quite abstract; unlike company-specific business logic, the same tool might be used in many different contexts with inputs of different meanings. There really isn’t that much insight gained by replacing i with “index”, x with “generic_input_variable”,
or to take it to an extreme, ax + b with add(multiply(input_variable, proportionality_constant), offset_constant)
or sin(x) with perpendicular_unit_vector_component_for_angle(input_angle_measure)
The extra space overhead of the long variables and words instead of symbols is a killer for clarity.
If variable names are “cryptic” [as in, can’t be guessed at a glance by someone working in the field] then that is indeed a failure, though. Short variable names should have limited scope (ideally fitting on one screenful of code) and obvious meaning in context, which might involve some explanatory comments or links to a paper.
The majority of machine learning papers are very well stocked in terms of heavy mathematical-y notation, but are very, very low on formal derivation, proofs of correctness, proofs of anything like formal properties, or even motivation ("wait, where did this vector come from?"). Most have no theoretical results at all- only definitions.
So let's not overdo it. The OP is making a reasonable demand: write complex code in a way that makes it easily readable without being part of an elite brotherhood of adepts who know all the secret handshakes and shibboleths.
A great deal of complexity could be removed from machine learning papers by notating algorithms as algorithms rather than formulae. For example, you can say exactly the same thing with two "for i to j" loops as with two summations with top and bottom indices. Sometimes the mathematical notation can be more compact, but when your subscripts start having subscripted superscripts, it's time to stop and think about what you're trying to do.
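As a sketch of that equivalence: the double summation Σᵢ Σⱼ a[i][j]·b[j] and two nested for-loops say exactly the same thing (the function and variable names here are mine):

```python
# The double summation  sum over i=0..m-1 of sum over j=0..n-1 of a[i][j]*b[j],
# spelled as plain nested loops.
def double_sum(a, m, n, b):
    total = 0.0
    for i in range(m):          # corresponds to the outer summation index
        for j in range(n):      # corresponds to the inner summation index
            total += a[i][j] * b[j]
    return total
```

The loop form is longer, but every index and bound is explicit rather than hanging off a subscript.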
Besides- the OP did talk about code not papers. Code has to be maintained by someone, usually someone else. Papers, not so much.
When we teach people software engineering, we teach them concepts like "give your variables meaningful names". Now that we're in the sub-domain of implementing some mathematics in software, I'd argue that matching the variables and functions to their source (more or less) _is_ exactly "giving your variables meaningful names".
> A great deal of complexity could be removed from machine learning papers by notating algorithms as algorithms rather than formulae
And you would immediately lose the ability to quickly and easily recognise similar patterns and abstractions that mathematical notation so fluently allows.
If the domain has a large theoretical component. Here, we're talking about statistical machine learning and neural networks in particular, where this is, for the most part, not at all the case.
>> And you would immediately lose the ability to quickly and easily recognise similar patterns and abstractions that mathematical notation so fluently allows.
I disagree. An algorithm is mathematical notation, complete with immediately recognisable patterns and abstractions (for-loops, conditional blocks, etc).
And, btw, so is computer code: it is formal notation that, contrary to mathematical formulae that require some familiarity with the conventions of a field to read and understand, has an objective interpretation, in the form of a compiler for the language used.
So machine learning papers could very well notate their algorithm in Python, even a high-level abstraction (without all the boilerplate) of the algorithm, and that would actually make them much more accessible to a larger number of people.
Mathematical notation, as in formulae, is not required; it's a tradition in the field, but that's all.
However, that's a bit of a digression from the subject of the naming of variables. Apologies. It's still relevant to the comprehensibility of formal notation.
In code that starts with something like "/* implements Gambler et al. 2019 (doi:xxxxxxx) eqn 3 ... */", I really expect the code to go to great lengths to match the notation used in the paper. Anything else is adding to the cognitive load.
The exception is if the entire algorithm is discussed in the comments of the code without outside reference, then I want the code and comment to be extremely consistent.
Personally, I like the former as a shorthand for "don't mess with this without the paper in front of you, you'll probably screw it up".
If I'm implementing some complex equation I try to match the symbols as much as possible and keep the representation compact because it means when I (or my teammates) come back to the code they can see the whole complex thing in one go and easily recognise the equation it comes from.
In cases like this I think a "meaningful" variable name would actively hamper reading. For example:
It's also common to use variables I,J,K,N,M for numbers and P,Q,R for predicate symbols (that can be passed as arguments to predicates, in Prolog). If the code sticks to the same convention throughout, it gets much easier to read than having to come up with special names for each variable ("Index", "Counter", "Next_value", "Length", etc).
And, if I remember correctly, the same kind of convention is common in Haskell, where I understand you can actually "dash" variables (as in x, x').
When I'm implementing an algorithm, I'll typically be implementing something that uses single-letter variable names. It's easier to do with consistent variable names. That's it. I once tried meaningful variable names in LaTeX to make the code nicer, but long variable names rendered as math are an unreadable mess, so I went back to short variable names.
That being said, it's not hard to change variable names with a good IDE, and in LaTeX you can alias variable names with \newcommand.
This code is designed to be read by someone who is pretty familiar with the underlying concepts. If you don't really know how a neural network works or how gradient descent is applied to update the weights, that's your blocker, not the variable names I saw here.
How can you explain or implement gradient descent without math? At some point I think you have to accept that this is a topic that involves math, and you're way better off understanding it on those terms rather than trying to avoid it.
I think this is an extremely well written tutorial! Kudos to the author.
I haven't coded a NN since grad school, and just scanning over the code I can tell this is basically the same as what I did. No doubt this is a boilerplate approach compiled into an article.
w1 ← w1 − η ∂L/∂w1
η is a constant called the learning rate that controls how fast we train. All we’re doing is subtracting η ∂L/∂w1 from w1:
- If ∂L/∂w1 is positive, w1 will decrease, which makes L decrease
- If ∂L/∂w1 is negative, w1 will increase, which makes L decrease
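A minimal sketch of the mechanics of that update rule, using a made-up one-variable loss L(w1) = (w1 − 3)² instead of the article's network loss, just to show the weight sliding toward the minimum:

```python
# One gradient-descent step: w1 <- w1 - eta * dL/dw1.
# Toy loss L(w1) = (w1 - 3)^2, whose derivative is 2 * (w1 - 3).
def gradient_step(w1, eta=0.1):
    dL_dw1 = 2.0 * (w1 - 3.0)   # derivative of the toy loss at w1
    return w1 - eta * dL_dw1

w = 0.0
for _ in range(100):
    w = gradient_step(w)        # w moves toward the minimum at w1 = 3
```

When the derivative is positive the step decreases w, when negative it increases w; either way the loss goes down, exactly as the quoted bullets say.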
Andrew Ng’s course on machine learning from Stanford (on Coursera, which you should be able to audit for free) gives great intuitive explanations of these topics, if you’re curious enough to invest a couple solid weekends to get the foundations. The foundations alone will get you very far indeed.
I took it in 2011 (ML Class) before Coursera existed, and it finally opened my eyes on not only what and how backprop worked, but also how everything in a NN could be represented and calculated using vectors and matrices (linear algebra), and how that process was "parallelizable".
The course uses Octave as its programming environment, which is essentially an open-source and (mostly) compatible implementation of Matlab.
My first thought was "Finally! A use case for a home Beowulf cluster that is somewhat practical!"
It really opened my eyes and mind to a number concepts that I had looked into before, but couldn't quite wrap my brain around completely.
In case you were being sarcastic, it is simple if you don't take it out of context. It requires a certain understanding of derivatives, but a reader who doesn't understand them won't make it to the point you're quoting.
Also, the bias is not explained at all.
- Why is it there?
- What does it do?
- What value should it have?
Bias in this context can be simply interpreted as analogous to the intercept in y = mx + b. Kind of like the baseline when no additional information is given.
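A sketch of that analogy: a neuron's pre-activation is literally a weighted sum plus an intercept-like bias (function names here are mine; sigmoid is a common activation choice):

```python
import numpy as np

def sigmoid(z):
    # squashes the pre-activation into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, weights, bias):
    # np.dot(weights, x) + bias is y = m*x + b generalized to many inputs;
    # the bias sets the baseline output when the inputs contribute nothing.
    return sigmoid(np.dot(weights, x) + bias)
```

With all-zero inputs, the bias alone determines the output, which is the "baseline when no additional information is given."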
You know what would be a great contribution? An extensive set of unit tests, or even just problems with solutions. That way people can write their own implementations and test them. And even if a person were to implement the net in PyTorch or TensorFlow, they could test the work.
So there would be a matrix of weights, and a vector of input nodes, and the "answer" would be the output vector. Then there would be another "answer" which is the output with a particular activation function, and so on.
This library would just be there so people who are doing their own implementation can test their work.
As I said, a unit test would work too but then it would have to be language specific. Just the matrix and answer would be language agnostic.
For people who think: can't you just make up an example yourself using a sheet of paper or in excel.
Yes, for most purposes this is fine, but if you forget one little implementation detail of a three layer network with a ReLu, you really want an external way to check that.
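A sketch of what one such language-agnostic fixture could look like: a fixed weight matrix, a fixed input, and the expected output, here against a hypothetical one-hidden-layer ReLU network of my own invention. Any implementation, in any language, should reproduce the same number:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, w_hidden, w_output):
    # hypothetical network: one ReLU hidden layer, linear output
    return np.dot(w_output, relu(np.dot(w_hidden, x)))

# The fixture itself is just three arrays and one expected value:
w_hidden = np.array([[1.0, -1.0], [0.5, 2.0]])
w_output = np.array([1.0, -1.0])
x = np.array([2.0, 1.0])
expected = -2.0   # by hand: hidden = [1, 3], output = 1*1 - 1*3
```

The arrays and the expected value are language-agnostic; only this reference check is Python.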
1. Most BLAS libraries aren't deterministic.
Floating-point addition isn't in general associative: (a + b) + c =/= a + (b + c). However, most BLAS libraries will happily shuffle the order of operations around if it saves some CPU cycles. I've seen this issue cause huge differences in behavior between different CPUs before, although I think that was a pathological case. So if you want to do a test like this, you really need to have everyone on the same page on installing a deterministic BLAS.
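The non-associativity is easy to demonstrate; this classic example gains or loses a whole unit depending on grouping:

```python
# 1e16 is large enough that adding 1.0 to it is lost to rounding
# (the spacing between adjacent doubles near 1e16 is 2.0).
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c    # a + b cancels exactly to 0.0, then + 1.0 gives 1.0
right = a + (b + c)   # b + c rounds back to -1e16, so the sum is 0.0
```

A BLAS that reorders a long reduction is doing this on a grand scale, which is why bit-exact reproducibility needs a deterministic build.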
2. Training neural networks is stochastic
I know you are talking about just the forward pass but I think it would be interesting to have these tests for training your network as well. Naively this sounds like it might be easy to fix (just provide a seed!), but gets very tricky when you want something to output the same code over multiple libraries, operating systems, hardware etc. Multithreading code, putting stuff on the GPU or even the cloud add even more difficulty.
If you're reimplementing something for learning purposes, you can just compare the output you get with the existing implementation.
But as soon as you're trying to do something novel, no automatic test can tell you whether the model architecture or your implementation of it is at fault when you don't get the results you'd like, because there's nothing you could use for comparison.
Can the author or anyone explain how the amounts by which values were shifted were chosen? I.e. weight (minus 135) and height (minus 66). Why were -135 and -66 chosen? An explanation would be helpful. Thanks!
The second part of the answer is more useful information, "Normally, you’d shift by the mean."
This article is good if you've been around some math and have a little coding experience, I think Neural Nets are even more accessible than the author is making them out to be however.
For example, using something like YOLO for object detection can be done even if you've never taken algebra or a programming class, and really just requires some configuration and training on a dataset you have.
Disclaimer: I am taking the deep learning nanodegree for free, but it's sponsored by Facebook rather than Udacity directly.
Even a simple but useful scenario like the MNIST number recognition system would probably run "dog-slow" on such a neural network, given the number of input nodes, plus the size of the hidden layer - the combinatorial "explosion" of edges in the graph, all the processing needed during backprop and forwardprop...
It might be doable with C/C++ - but it still won't run as well as it could.
These examples are really just meant as teaching tools; they are the bare-bones-basics of neural networks, to get you to understand the underlying mechanisms at work (actually, the XOR network is the real bare-bones NN example, because it requires so few nodes, that it can be worked out by pencil and paper methods).
Real-world implementations are best done using more advanced libraries and tools, like Tensorflow and Keras, or similar (and moving to C/C++ and/or CUDA if even more processing power is needed).
"The code below is intended to be simple and educational, NOT optimal."
You can, however, try this service for free during the first two months:
(I am not affiliated with AWS and I have not used SageMaker)
You can also read the DeepMind blog to find out the latest state-of-the-art real-world usage of Neural Networks.
I wanted to get a good grasp of DL but was floundering around before with so many sources and tutorials - and frameworks like Keras, Pytorch, TensorFlow etc. This course helped me make some decisions - like if I needed to get a developer's handle on DL, I would have to know Perceptrons, CNNs, RNNs, Style Transfer, Sentiment Prediction and GANs (even though GANs were not part of the course). And Pytorch would be my tool of choice.
∑(y_true - y_pred)^2
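A minimal sketch of computing that quantity, the sum of squared errors (dividing it by the sample count turns it into the mean squared error):

```python
import numpy as np

def sum_squared_error(y_true, y_pred):
    # Σ (y_true - y_pred)^2, summed over all samples
    return np.sum((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

def mean_squared_error(y_true, y_pred):
    # the averaged variant: the same sum divided by the number of samples
    return sum_squared_error(y_true, y_pred) / len(y_true)
```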
Is this example better than previous ones?
Here's another neural network example from scratch in 13 lines of Python:
I found the code very clear and understandable.
This negates a lot of the promise of the article. If one understands the principles but needs to know how to apply them, this statement makes the article practically useless.
Usually after a read like this comes a jump in the explanation, which directs you to apply TensorFlow et al. and, if you're lucky, even explains how to do that, but doesn't explain what it is that TensorFlow does qualitatively differently that would justify putting in such a phrase as a warning.
Tensorflow, Pytorch, etc introduce a lot of magic for someone who is completely unfamiliar with the area.
Yes, but if you implement one thing, and then TensorFlow which you plan to use does something very different, what's the point? Ok, different enough to justify the warning.
> Tensorflow, Pytorch, etc introduce a lot of magic for someone who is completely unfamiliar with the area.
Exactly. It's that magic which is interesting, after you're shown how in principle - but only in principle - the thing could work.
But I see your point which is that it would be nice if the article was telling us which part of the high level library we're implementing and what that would look like.
I'm still happy with the article though and would not call it almost useless. But hey if you'd rather jump right into a tensorkerastorch tutorial that's an option too!
I really like the article. My problem is the next step, which I look for, rather unsuccessfully.
> But hey if you'd rather jump right into a tensorkerastorch tutorial that's an option too!
Do you know good tutorials for that? Preferably on the level of this article, only regarding those libraries?
They even start with NumPy, as the article does.
This is the same course that was originally called "ML Class" in 2011, when it was offered via Stanford (it, plus the other course called "AI Class", helped to launch Coursera and Udacity, respectively). It uses Octave as its development language to teach the concepts.
Octave has as its "primitives" the concept of the vector and matrix; that is, it is a language designed around the concepts of linear algebra. So, you can easily set a variable to the values that make up a vector, set another to, say, the identity matrix, multiply them together, and it will return a matrix.
But the course is initially taught showing how to do all of this "by hand" (much like this tutorial) - using for-next loops and such; essentially rolling your own matrix calcs. Then, once you know that, you are introduced into how to use the built-in calcs provided by octave, etc.
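The same progression, sketched in Python/NumPy rather than Octave (the function name is mine): first the "by hand" loops, then the built-in call that replaces them:

```python
import numpy as np

def matvec_by_hand(matrix, vector):
    # "rolling your own" matrix-vector product with explicit loops
    rows, cols = len(matrix), len(vector)
    result = [0.0] * rows
    for i in range(rows):
        for j in range(cols):
            result[i] += matrix[i][j] * vector[j]
    return result

# Once you understand the loops, the built-in equivalent is one call:
# np.dot(matrix, vector)
```

Seeing the loops first is what makes the one-liner feel earned rather than magical.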
It is really a great course, and I encourage you to try it. It opened my eyes and mind to a lot of concepts I had been trying to grasp for a while that just didn't click until I took that course in 2011.
Of course, once it did, I was like "that's it? why couldn't I see that?" - such is the nature of learning I guess!
His method of building up from the basics all the way to a neural network using Octave will be a good foundation for further learning or playing with other tools like TensorFlow; I ended up taking a couple of other later MOOCs offered by Udacity (focused mainly on self-driving vehicle concepts), and ML Class (and what I took of AI Class) really helped to prepare me.
It also showed me things I was deficient in (something I need to correct some day). But that's a different story.
Suffice to say, that course will "unveil the magic"; it was really an amazing course that I am glad I took. Oh - and to show you what you can do with what you learn in that course, a fellow student during that time ended up making this (and the course was not even finished when he did it):
Throughout the course, the ALVINN autonomous vehicle was referenced quite often, and it inspired him to recreate it successfully in miniature. It also serves to show just how far computer technology has come - what used to take a van-load of equipment can now be played around with using a device that fits in your pocket!
However, even understanding all that doesn't by itself let one use modern frameworks; they seem to carry too many additional assumptions and too much terminology, which aren't explained with a similar level of patience.
Real systems like TensorFlow (and Keras, which is a wrapper that simplifies TensorFlow - which is already pretty simple; if you come to love TensorFlow, you will really love Keras - but you can't really appreciate TensorFlow until you understand the basics of what it does - which the posted tutorial tries to show) hide all of these steps behind a variety of API that tries to mimic the conceptual ideas behind neural networks (ie - "layers" of "neurons" with "connections") while hiding the lower-level "complex mathematics".
That's a good thing on one level, because it opens up ANNs for more users. But it's a bad thing at another level, for those who want to know what is going on in those "black boxes", as well as for those who sometimes will stumble into certain issues with their networks, and struggle to understand certain things if they don't understand those underlying systems.
No one implements neural network code the way it is shown in this tutorial for much the same reason most people don't code in assembler: Because there are higher level means to work with the same operations and do it quicker, more efficiently, and more correctly (also, the lower level operations used in such libraries have been rigorously tested to be correct - it's similar to the reasoning on how you don't roll your own crypto, because you will likely get it wrong - even if you are an expert in the domain).
In that case, the hard part is already done.
Understanding the principles is what takes the most work, not writing or adapting the code.
In fact, understanding the principles is IMO a prerequisite for being able to successfully adapt the code.
Now, using something like TensorFlow in another later article is of course super useful. It also will clarify why TensorFlow is more confusing and why this article was written.
I find the article enlightening, and specific libraries like TensorFlow are just "implementation details".
A poignant answer is: you never really know! The future will surprise you!
Watching the movie about AlphaGo and all other stuff DeepMind has done will give you all the answers you need about the why NN.
The question in the first line has an obvious answer and it is just rhetorical. The bigger point is: this mindset doesn't apply just to Neural Networks, many things in mathematics are useful years after they are first invented, and it is better to learn general principles and have an open mind, Richard Feynman style.
First off, for most problems you don't need a neural network - many, many problems can be solved using simple statistical regression techniques - what could be called "classical machine learning".
It's when your data starts to involve more than a few variables (usually more than 2 or 3), and/or when you don't know exactly what the best "classical" machine learning technique to use is, that you may find a neural network to be a potential solution.
Ultimately, what a regular neural network does (I'm not going to get into anything like GANs or CNNs or anything special like that - though arguably they all work in a similar manner) is come up with a solution that encapsulates and encodes, through "learning", the proper algorithm(s) needed to provide a solution for the problem - whether that is classification or something else.
It's basically the classic argument that all a neural network is, is a complicated form of linear regression. And to an extent, that is true - but it's done in a "black box" manner, which may or may not be important to you for your problem (the issue becomes "how does it do what it does?" - and that's where things can become tricky to answer).
Neural networks basically take information - various attributes of a problem set - and then, once trained, can output an answer for an example (that it hasn't seen before!) that fits that problem set criteria. It does this by having been "trained" by seeing a lot of different examples, with each example matched up to a "labeled" output. Depending on whether it is right or wrong for the labeled output, it will then propagate the error backwards through the network to correct things by a small amount, and try again (note: your training data needs to be extremely varied, and should include both positive examples and negative examples, and shouldn't be biased toward any particular set of examples).
Anyhow - your data must ultimately be represented by numbers in some manner, usually continuous, and can be anything from a single input to multiple inputs. For instance, you might have the single input of "temperature" to control the output of "turn on/off the furnace" - ie, modeling a thermostat (note that you would never do this actually, outside of learning how a simple perceptron works I guess).
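A toy sketch of that thermostat idea, purely illustrative (the weight, setpoint, and step activation are all made up): a single "neuron" with temperature as its one input:

```python
def thermostat(temperature, setpoint=20.0):
    # weight -1 on the input, bias = setpoint, step activation:
    # furnace on (1) when it's colder than the setpoint, off (0) otherwise
    pre_activation = -1.0 * temperature + setpoint
    return 1 if pre_activation > 0 else 0
```

As the comment says, you would never actually build this with a network; it just shows how one input, one weight, and one bias map to a decision.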
Usually you don't use a single continuous input - you would use multiple inputs - maybe a date, coupled with gps coordinates (like say for housing prices), with an output of "price". Or maybe 10,000 grayscale values (0.00-1.00), representing a 100x100 pixel image, with a classification output of the numbers 0-9 and letters A-Z (36 symbols).
In other words, you are trying to figure out the answer to a problem (is this a picture of a letter or number?) that would be difficult or impossible to code a set of rules by hand to answer with any statistical certainty.
Note that last part: Neural networks do not give you an absolute result; they output a continuous value or a set of values that represent the most probable likelihood of a correct answer based upon what the model has learned in the past. In the case of a single output, that could represent a "yes/no" or "true/false" answer (a value between 1 and 0 respectively); it could also be "left/right", "up/down" or something similar if controlling direction/pressure/flow rate, etc based on inputs.
For classification, it will be a set of continuous values, indicating for each "class" the probability (0-1) of it being the correct one. For instance, if you had three categories like:
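Such per-class values are commonly produced by a softmax over the network's raw outputs, which guarantees they sum to 1 (the scores below are made up for illustration):

```python
import numpy as np

def softmax(raw_outputs):
    # subtract the max for numerical stability, then normalize
    shifted = np.asarray(raw_outputs) - np.max(raw_outputs)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

# e.g. raw network outputs for three hypothetical classes
probabilities = softmax([2.0, 1.0, 0.1])
```

The largest raw output gets the largest probability, and the whole vector reads directly as "which class is most likely".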
Note that it is also possible to fool such a network, as numerous studies have shown.
It's also possible to have the network output an image (ie - by having it output a very large array of node values that represents the pixel values of the image); this is how neural networks generate or alter images. Sound data can also be done in a similar manner.
Alright - I think I am rambling a bit, so I'm going to leave it here; I hope this answers at least something of your questions (and I hope it didn't confuse you - if so, I apologize and that wasn't my intent).
Just wrote similar from-scratch code a few months back: https://github.com/theblackcat102/DL_2018/blob/master/HW2/la...