Back in the heyday of expert systems and the original Macs, I did an expert system in Hypercard. Not really production worthy, but interesting from the perspective of exploring the UI.
Still got it somewhere? I've got a project to demonstrate a bunch of Hypercard stacks in a retro-computing context, and it'd definitely be interesting to see, if you'd consider finding it and sharing it ...
Oh gosh.... that was so many years ago. I suspect I could never find the floppy, assuming I haven't chucked it.
All I can remember about it is that it was structured to ask the user one query per card, and there was some kind of global function that kept the state of the consultation and navigated to the next card.
The expert system that I implemented was a simple toy, because the point was to explore the UI aspects. For instance, you could easily embed images and ask "Click the image that most closely resembles your specimen." Stuff like that, intermixed with traditional text queries.
Well, it was worth a try .. sounds interesting though! There's a resurgent interest in Hypercard these days .. if you ever get the urge to go look for your old stack, I'm sure it'll be of interest to some of us. :)
These are meaningful variable names in the context of a code implementation of a math algorithm. The variables directly map to the algorithmic notation.
And when trying to do algebraic manipulation on paper, having an 8 or 10 letter variable name is incredibly cumbersome.
Frankly in the middle of a numerical algorithm it is typically also cumbersome in code to have descriptive variable names for everything.
However, mathematical code (especially when written by scientists, etc.) often takes this too far, introducing 1- or few-letter variable names for more of the variables than necessary, without enough context or description to figure out what they stand for.
This blog post was just fine though. h = hidden, o = output, y_pred = predicted value of y, etc. are quite clear.
After 9 years of maintaining several dozen legacy codebases (which sometimes involved financial math) I am fully convinced that the only people who like terse code are people who write it and never read it after.
> Every abbreviation adds to the cognitive load the person has to maintain in their head.
I think your claim that abbreviations are always more cognitive load is wrong. For the sake of argument, let's take it at face value though. These are not even code abbreviations. They are literal translations to the math notation. The input "x" is not an abbreviation, it's a well defined value in neural network terminology.
If they called it something more descriptive like "input_vector_to_first_layer_of_neural_net", this would be more cognitive load because someone reviewing this now needs to mentally map this to "x" in the algorithm anytime they are reviewing both.
Now, to your claim itself, I think it's unfair to say abbreviations are always more cognitive load. These variables are significantly more approachable because they follow convention. I see "x" and I know what that is. If every individual developer went ahead and rewrote every neural net with something they viewed as more interpretable in their personal context, it'd be a lot harder to understand what is going on for everyone. The variable itself may be longer, but I might actually need a prose explanation of the variable's purpose because now I don't have the context of well established naming convention.
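To make that concrete, here's a rough sketch (my own, not taken from the blog post) of the same two-layer forward pass written both ways; names like W1 and b1 are purely illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Conventional notation: anyone who knows the math reads this at a glance.
def forward(x, W1, b1, W2, b2):
    h = sigmoid(W1 @ x + b1)       # hidden layer activations
    y_pred = sigmoid(W2 @ h + b2)  # predicted output
    return y_pred

# The "descriptive" alternative: longer names, but now the reader has to map
# each one back to the symbols in the paper or textbook anyway.
def forward_verbose(input_vector, first_layer_weights, first_layer_biases,
                    second_layer_weights, second_layer_biases):
    hidden_layer_activations = sigmoid(
        first_layer_weights @ input_vector + first_layer_biases)
    predicted_output = sigmoid(
        second_layer_weights @ hidden_layer_activations + second_layer_biases)
    return predicted_output
```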
Are you suggesting that we should write all of our mathematical expressions using prose, and stop using symbols for operators?
That would be the original approach historically (before the past 500 years), e.g. for the quadratic formula, Brahmagupta (628 CE):
> To the absolute number multiplied by four times the square, add the square of the middle term; the square root of the same, less the middle term, being divided by twice the square is the value.
Go ahead and do what you like, but I doubt you’ll find many publishers who will accept your paper in the 21st century.
I know which one imposes more cognitive load for me. But disclaimer: I spent a lot of time from age 5–20 working with mathematical notation.
I am suggesting that just because software does "math" doesn't change how people read it. Bad coding practices like cryptic variable names and repetitions will affect it exactly in the same ways they affect software in all other domains.
> just because software does "math" doesn't change how people read it.
It absolutely does. Different problem domains (and different communities’ treatment of problems) involve differing types and amounts of formal structure, differing conventional notations, etc., and in practice the code looks substantially different (in organization, abstractions used, naming, ...) even if you try to standardize it to all look the same.
People who are reading “math” code can be expected to understand mathematical notation, e.g. to be capable of reading a journal paper where an algorithm or formula is described more completely including motivation, formal derivation, proofs of correctness, proofs of various formal properties, ...
Mathematical code is often quite abstract; unlike company-specific business logic, the same tool might be used in many different contexts with inputs of different meanings. There really isn’t that much insight gained by replacing i with “index”, x with “generic_input_variable”,
or to take it to an extreme, ax + b with add(multiply(input_variable, proportionality_constant), offset_constant)
or sin(x) with perpendicular_unit_vector_component_for_angle(input_angle_measure)
The extra space overhead of the long variables and words instead of symbols is a killer for clarity.
If variable names are “cryptic” [as in, can’t be guessed at a glance by someone working in the field] then that is indeed a failure though. Short variable names should have limited scope (ideally fitting on one screenful of code) and obvious meaning in context, which might involve some explanatory comments or links to a paper.
>> People who are reading “math” code can be expected to understand mathematical notation, e.g. to be capable of reading a journal paper where an algorithm or formula is described more completely including motivation, formal derivation, proofs of correctness, proofs of various formal properties, ...
The majority of machine learning papers are very well stocked in terms of heavy mathematical-y notation, but are very, very low on formal derivation, proofs of correctness, proofs of anything like formal properties, or even motivation ("wait, where did this vector come from?"). Most have no theoretical results at all- only definitions.
So let's not overdo it. The OP is making a reasonable demand: write complex code in a way that makes it easily readable without being part of an elite brotherhood of adepts who know all the secret handshakes and shibboleths.
A great deal of complexity could be removed from machine learning papers by notating algorithms as algorithms rather than formulae. For example, two "for i to j" loops say exactly the same thing as two summations with top and bottom indices (see the toy sketch below). Sometimes the mathematical notation can be more compact, but when your subscripts start having subscripted superscripts, it's time to stop and think about what you're trying to do.
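For instance, a double summation like sum_{i=1..n} sum_{j=1..m} w_ij * x_j reads directly as two nested loops (toy values, purely illustrative):

```python
# The double summation written out as two explicit loops.
n, m = 2, 3
w = [[0.1, 0.2, 0.3],
     [0.4, 0.5, 0.6]]
x = [1.0, 2.0, 3.0]

total = 0.0
for i in range(n):       # outer summation over i
    for j in range(m):   # inner summation over j
        total += w[i][j] * x[j]
print(total)  # ~4.6
```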
Besides- the OP did talk about code not papers. Code has to be maintained by someone, usually someone else. Papers, not so much.
If you're working in a domain, is it really that much to ask to become familiar with it? Especially if the domain has a large theoretical component.
When we teach people software engineering we teach them concepts like "give your variables meaningful names". Now that we're in sub-domain of implementing some mathematics in software, I'd argue that matching the variables and functions to their source (more or less) _is_ exactly "giving your variables meaningful names".
> A great deal of complexity could be removed from machine learning papers by notating algorithms as algorithms rather than formulae
And you would immediately lose the ability to quickly and easily recognise similar patterns and abstractions that mathematical notation so fluently allows.
>> If you're working in a domain, is it really that much to ask to become familiar with it? Especially if the domain has a large theoretical component.
If the domain has a large theoretical component. Here, we're talking about statistical machine learning and neural networks in particular, where this is, for the most part, not at all the case.
>> And you would immediately lose the ability to quickly and easily recognise similar patterns and abstractions that mathematical notation so fluently allows.
I disagree. An algorithm is mathematical notation, complete with immediately recognisable patterns and abstractions (for-loops, conditional blocks, etc).
And, btw, so is computer code: it is formal notation that, contrary to mathematical formulae which require some familiarity with the conventions of a field to read and understand, has an objective interpretation - in the form of a compiler for the language used.
So machine learning papers could very well notate their algorithm in Python, even a high-level abstraction (without all the boilerplate) of the algorithm, and that would actually make them much more accessible to a larger number of people.
Mathematical notation, as in formulae, is not required- it's a tradition in the field, but that's all.
However, that's a bit of a digression from the subject of the naming of variables. Apologies. It's still relevant to the comprehensibility of formal notation.
In code that starts with something like "/* implements gambler et al. 2019 (doi:xxxxxxx) eqn 3 ... */", I really expect the code to go to great lengths to match the notation used in the paper. Anything else adds to the cognitive load.
The exception is if the entire algorithm is discussed in the comments of the code without outside reference, then I want the code and comment to be extremely consistent.
Personally, I like the former as a shorthand for "don't mess with this without the paper in front of you, you'll probably screw it up".
In the context of the equation, having long/non-matching variable names adds to the overhead. I find it orders of magnitude easier to read some Julia code examples that use matching symbols than the equivalent thing in Python with full variable names.
If I'm implementing some complex equation I try to match the symbols as much as possible and keep the representation compact because it means when I (or my teammates) come back to the code they can see the whole complex thing in one go and easily recognise the equation it comes from.
Personally, I stick to using a capital letter followed by an "s" for a list and the same capital letter alone for an element of a list, as above, though that is by no means a universal convention.
It's also common to use variables I,J,K,N,M for numbers and P,Q,R for predicate symbols (that can be passed as arguments to predicates, in Prolog). If the code sticks to the same convention throughout, it gets much easier to read than having to come up with special names for each variable ("Index", "Counter", "Next_value", "Length", etc).
And, if I remember correctly, the same kind of convention is common in Haskell, where I understand you can actually "dash" variables (as in x, x').
"set_to_random_numbers" is much less descriptive than np.random.normal...the former doesn't tell me what distribution the random numbers come from. Agreed that the weights should be stored in a list or the like in real life rather than duplicating code for each layer, but the examples in the blog post are clearly intentionally minimizing abstractions as much as possible so that an untrained reader can immediately tell exactly what each line is doing. It's obviously not intended to scale to larger models.
I've been told that not knowing better is a typical reason (and another reply says so), but my anecdotal contribution is different.
When I'm implementing an algorithm, I'll typically be implementing something that uses single-letter variable names. It's easier to do with consistent variable names. That's it. I once tried meaningful variable names in LaTeX to make the code nicer, but long variable names rendered as math are an unreadable mess, so I went back to short variable names.
That being said, it's not hard to change variable names with a good IDE, and in LaTeX you can alias variable names with \newcommand.
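For example (a made-up illustration; the macro names are mine), you can keep the LaTeX source self-documenting while the rendered math stays compact:

```latex
% Alias descriptive macro names to short symbols: the source stays readable,
% the rendered formula stays compact.
\newcommand{\learnrate}{\eta}
\newcommand{\loss}{L}

% Used in the body as:
% $w_1 \leftarrow w_1 - \learnrate \frac{\partial \loss}{\partial w_1}$
```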
For what it's worth, I didn't read the article at all, went straight to the final code, and had no issue reading or understanding it. The variable names seem fine to me. They're actually quite good, and I liked the convention for naming partial derivatives.
This code is designed to be read by someone who is pretty familiar with the underlying concepts. If you don't really know how a neural network works or how gradient descent is applied to update the weights, that's your blocker, not the variable names I saw here.
Well then they should get rid of the misleading title "from scratch"; it makes it sound like they are going to explain it without belching out a bunch of abstract math. I'm really annoyed that everyone pretends that you need math to explain a NN. I have done plenty of them without numpy or math, using just arrays and loops.
This is not abstract math, and the article does explain what it's doing before presenting the code snippets.
How can you explain or implement gradient descent without math? At some point I think you have to accept that this is a topic that involves math, and you're way better off understanding it on those terms rather than trying to avoid it.
They're usually working from text/notes in mathematical notation and are trying to match the math notation as best they can. I'm sure programming not being their profession may be part of it, but I find myself doing the same thing when implementing algorithms described in mathematical notation. Due to the sheer number of variables in a lot of algorithms, it's difficult to give each one a good name. And even if you did, it's unlikely the code would make any more sense; in order to grasp what's going on, you'd likely need some text on the subject to explain how it works.
I've found that lots of people from math backgrounds are comfortable with non-descriptive variable names. Most variables in math are Greek letters, or "i, j, k" and things like that.
They're typically not coders as a background. They're usually a hard science/math person so they didn't get variable indoctrination.
I haven't coded a NN since grad school, and just scanning over the code I can tell this is basically the same as what I did. No doubt this is a boilerplate approach compiled into an article.
"We’ll use an optimization algorithm called stochastic gradient descent (SGD) that tells us how to change our weights and biases to minimize loss. It’s basically just this update equation:
w1 ← w1 − η ∂w|∂L
η is a constant called the learning rate that controls how fast we train. All we’re doing is subtracting η ∂w|∂L from w1:
- If ∂L | ∂w1 is positive, w1 will decrease, which makes L decrease
- If ∂L | ∂w1 is positive, w1 will increase, which makes L decrease"
I think this comment is facetious (apologies if I’m reading it incorrectly), so I want to offer that this is “simple”, but opaque to someone who is unfamiliar with the syntax or background context.
Andrew Ng’s course on machine learning from Stanford (on Coursera, which you should be able to audit for free) gives great intuitive explanations of these topics, if you’re curious enough to invest a couple solid weekends to get the foundations. The foundations alone will get you very far indeed.
I took it in 2011 (ML Class) before Coursera existed, and it finally opened my eyes on not only what and how backprop worked, but also how everything in a NN could be represented and calculated using vectors and matrices (linear algebra), and how that process was "parallelizable".
The course uses Octave as its programming environment, which is essentially an open-source and (mostly) compatible implementation of Matlab.
My first thought was "Finally! A use case for a home Beowulf cluster that is somewhat practical!"
It really opened my eyes and mind to a number concepts that I had looked into before, but couldn't quite wrap my brain around completely.
There are a couple of misquotes there. For starters, it's not "w1 ← w1 − η ∂w|∂L" but "w1 ← w1 − η ∂L|∂w1". Also, the second "if" should say "negative", not "positive" again.
In case you were being sarcastic, it is simple if you don't take it out of context. It requires a certain understanding of derivatives, but a reader who doesn't understand them won't make it to the point you're quoting.
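For anyone following along without the article open, the quoted update is a single line of code; a toy sketch with made-up numbers (the names learn_rate and d_L_d_w1 are mine):

```python
learn_rate = 0.1   # eta, the learning rate
w1 = 0.5           # current value of the weight
d_L_d_w1 = 2.0     # partial derivative of the loss L with respect to w1

# If dL/dw1 is positive, this step decreases w1 (and hence L);
# if it is negative, w1 increases, which also decreases L.
w1 = w1 - learn_rate * d_L_d_w1
print(w1)  # 0.3
```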
Recently, I watched a Scrimba tutorial [1] about brain.js. That one brought me quite a step further in understanding what different kinds of neural networks are capable of (NN vs. RNN vs. CNN).
I started playing around with NNs recently and this looked like a good tutorial until the partial derivatives :) It would have been so much better to turn a complex math concept like that into code in smaller chunks, because it's hard to tell from the full sample which part refers to which equation.
Agreed, it is a little confusing. What makes it even more confusing is bias can be introduced as a regularization technique (to address overfitting) and these are not the same thing.
Bias in this context can be simply interpreted as analogous to the intercept in y = mx + b. Kind of like the baseline when no additional information is given.
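A minimal sketch (my own illustration, not the article's code):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w = np.array([0.2, -0.4])   # weights: the "slope" part, like m in y = mx + b
b = 0.1                     # bias: the "intercept" part, like b in y = mx + b
x = np.array([1.5, 2.0])    # inputs

# With b = 0 the weighted sum is forced through the origin; the bias shifts it.
output = sigmoid(np.dot(w, x) + b)
print(output)
```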
There are so many little details to remember when you implement a Neural Network from "scratch". Or, I suppose, even if you do not.
You know what would be a great contribution? An extensive set of unit tests, or even just problems with solutions. That way people can write their own implementations and test them. And even if a person were to implement the net in Pytorch or Tensorflow, they could test the work.
So there would be a matrix of weights, and a vector of input nodes, and the "answer" would be the output vector. Then there would be another "answer" which is the output with a particular activation function, and so on.
This library would just be there so people who are doing their own implementation can test their work.
As I said, a unit test would work too but then it would have to be language specific. Just the matrix and answer would be language agnostic.
For people who think: can't you just make up an example yourself using a sheet of paper or in Excel?
Yes, for most purposes this is fine, but if you forget one little implementation detail of a three-layer network with a ReLU, you really want an external way to check that.
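As a sketch of what one entry in such a test set might look like, reduced to a single sigmoid neuron (all numbers made up; the expected value is just sigmoid(0.1)):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def test_single_neuron_forward():
    # Language-agnostic fixture: weights, bias, input, expected output.
    # Any implementation, in any framework, should reproduce this number.
    w = np.array([0.5, -0.25])
    b = 0.1
    x = np.array([1.0, 2.0])
    expected = 0.5249792  # sigmoid(0.5*1.0 - 0.25*2.0 + 0.1) = sigmoid(0.1)
    assert abs(sigmoid(np.dot(w, x) + b) - expected) < 1e-6

test_single_neuron_forward()
```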
That would be a great contribution, but unfortunately I think not an easy one to provide. A few issues:
1. Most BLAS libraries aren't deterministic.
Floating point addition isn't associative in general: (a + b) + c =/= a + (b + c). However, most BLAS libraries will happily shuffle the order of operations around if it saves some CPU cycles; I've seen this issue cause huge differences in behavior between different CPUs before, although I think that was a pathological case. So if you want to do a test like this you really need to have everyone on the same page and installing a deterministic BLAS (a quick demonstration of the associativity issue follows below, after point 2).
2. Training neural networks is stochastic
I know you are talking about just the forward pass, but I think it would be interesting to have these tests for training your network as well. Naively this sounds like it might be easy to fix (just provide a seed!), but it gets very tricky when you want to get the same output across multiple libraries, operating systems, hardware, etc. Multithreading, putting stuff on the GPU, or even the cloud add even more difficulty.
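To point 1, the non-associativity is easy to demonstrate even without a BLAS in sight:

```python
# Floating point addition is not associative, which is one reason bit-exact
# reference outputs are hard to pin down across BLAS builds:
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)   # False
print(left, right)     # 0.6000000000000001 0.6
```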
Built-in testing would be fantastic. Being able to tell if you designed a model wrong or just made an error setting up the code would reduce frustration a ton!
If you're reimplementing something for learning purposes, you can just compare the output you get with the existing implementation.
But as soon as you're trying to do something novel, no automatic test can tell you whether the model architecture or your implementation of it is at fault when you don't get the results you'd like, because there's nothing you could use for comparison.
I didn't have time to read all of it, but I got halfway through. So far it has been a pretty good write-up; however, just one question.
Can the author or anyone explain how the amounts by which the values were shifted were chosen? I.e. weight (minus 135) and height (minus 66). Why were -135 and -66 chosen? An explanation would be helpful. Thanks!
I am pretty sure that note wasn't written there before. The author must have updated the article after I posted this question. Though "nice" is still somewhat vague to me. Nice in what way? Are smaller numbers nicer? Do negatives and positives make it nice?
The second part of the answer is more useful information, "Normally, you’d shift by the mean."
I think this is a form of normalizing a normal distribution into a standard normal distribution (a normal distribution has two parameters, \mu and \sigma, which in the standard form are 0 and 1 respectively). Any normal distribution can be transformed into the standard normal distribution N(0,1).
I haven't read the article thoroughly but I suspect part of the data preprocessing step included centering and standardizing. Those must be the mean weight and height values.
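A quick sketch of that preprocessing step with numpy (made-up data, not the article's):

```python
import numpy as np

# Made-up weights (lb) and heights (in) for four people.
weights = np.array([133.0, 160.0, 152.0, 120.0])
heights = np.array([65.0, 72.0, 70.0, 60.0])

# Shift by the mean so each feature is centered around zero...
weights_centered = weights - weights.mean()
heights_centered = heights - heights.mean()

# ...and optionally divide by the standard deviation to standardize.
weights_std = weights_centered / weights.std()
heights_std = heights_centered / heights.std()
print(weights_centered, heights_centered)
```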
This is a great write up! I'm in the process of writing a similar article, although I think for many non-programmers, neural networks can be useful through using algorithms that are already made to solve problems where people have the data but don't understand the math.
This article is good if you've been around some math and have a little coding experience, I think Neural Nets are even more accessible than the author is making them out to be however.
For example, using something like YOLO for object detection can be done even if you've never taken algebra or a programming class, and really just requires some configuration and training on a dataset you have.
If you're interested in learning something like this, I highly recommend Udacity's deep learning nanodegree. They have a section that teaches you how to build your own neural network with the help of numpy. They even have a section where you write your own sentiment analysis neural network from scratch. Easily the best explanation and notebook to follow I have seen so far. I've been taking their revamped one that focuses on PyTorch rather than Tensorflow.
Disclaimer: I am taking the deep learning nanodegree for free, but it's sponsored by Facebook rather than Udacity directly.
Great, but the problem with these summaries is no real-world data/use-case. An actual business case example with code is easier to comprehend for less math-inclined folks.
The problem with that is these kinds of simplified "how it works" neural net examples don't scale for real world problems - especially when using an interpreted language like Python.
Even a simple but useful scenario like the MNIST number recognition system would probably run "dog-slow" on such a neural network, given the number of input nodes, plus the size of the hidden layer - the combinatorial "explosion" of edges in the graph, all the processing needed during backprop and forwardprop...
It might be doable with C/C++ - but it still won't run as well as it could.
These examples are really just meant as teaching tools; they are the bare-bones-basics of neural networks, to get you to understand the underlying mechanisms at work (actually, the XOR network is the real bare-bones NN example, because it requires so few nodes, that it can be worked out by pencil and paper methods).
Real-world implementations are best done using more advanced libraries and tools, like Tensorflow and Keras, or similar (and moving to C/C++ and/or CUDA if even more processing power is needed).
One of the nice parts of NNs is that it takes some maths education to understand what's going on. My initial thought when reading the title was: "uh oh, are AI tutorials going to be the next PHP/MySQL tutorials, with god awful code all over the internet?" They haven't yet insofar as I'm aware. I hope the maths involved will prevent them from becoming that.
I took the Pytorch scholarship challenge this past December with Udacity. I couldn't make it to the nanodegree scholarship but the challenge course was really good in exposing me to the various facets of deep learning. The notebooks are free, but the videos really help explain some concepts well. For people who took the course, the videos are available for another year, so that helps.
I wanted to get a good grasp of DL but was floundering around before with so many sources and tutorials - and frameworks like Keras, Pytorch, TensorFlow etc. This course helped me make some decisions - like if I needed to get a developer's handle on DL, I would have to know Perceptrons, CNNs, RNNs, Style Transfer, Sentiment Prediction and GANs (even though GANs were not part of the course). And Pytorch would be my tool of choice.
This is very similar to Andrew Ng's Deep Learning Specialization course [1]. If you found this blog post enjoyable be sure to check it out. It is a great course and the intuitions behind NNs are explained very clearly.
I appreciate the author showing some of the math that drives the optimization. As a reader, I'd suggest introducing SGD ahead of the PD calculations, because it would give a clearer motivation for the PDs.
The genetic algorithm currently isn't in a working state unfortunately. I did a big architectural rewrite recently and haven't gotten around to GA stuff - but I'm looking forward to applying some more modern results in neuroevolution from the last two or three years.
This negates a lot of promise of the article. If one understands principles, but needs to know how to apply them, this statement makes the article practically useless.
Usually after a write-up like this there is a jump in the explanations, which directs you to apply TensorFlow et al. and even, if you're lucky, explains how to do that - but doesn't explain what it is that TensorFlow does qualitatively differently, which would justify putting in such a phrase as a warning.
> I still think it's a useful exercise for people like me who learn best by implementing a thing.
Yes, but if you implement one thing, and then TensorFlow which you plan to use does something very different, what's the point? Ok, different enough to justify the warning.
> Tensorflow, Pytorch, etc introduce a lot of magic for someone who is completely unfamiliar with the area.
Exactly. It's that magic which is interesting, after you're shown how in principle - but only in principle - the thing could work.
I guess the point here is that in this article we implement some of the magic that the higher level frameworks use, which I think will be useful later when writing and debugging our high level code.
But I see your point which is that it would be nice if the article was telling us which part of the high level library we're implementing and what that would look like.
I'm still happy with the article though and would not call it almost useless. But hey if you'd rather jump right into a tensorkerastorch tutorial that's an option too!
This is the same course that was originally called "ML Class" in 2011 when it was offered via Stanford (it, plus the other course called "AI Class", helped to launch Coursera and Udacity, respectively). It uses Octave as its development language to teach the concepts.
Octave has as its "primitives" the concept of the vector and matrix; that is, it is a language designed around the concepts of linear algebra. So you can easily set a variable to the values that make up a vector, set another to, say, the identity matrix, multiply them together, and it will return a matrix.
But the course is initially taught showing how to do all of this "by hand" (much like this tutorial) - using for-next loops and such; essentially rolling your own matrix calcs. Then, once you know that, you are introduced into how to use the built-in calcs provided by octave, etc.
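The same contrast in numpy terms, since that's what this thread has been using (the course itself uses Octave; the sizes and values here are arbitrary):

```python
import numpy as np

W = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6]])   # 2x3 weight matrix
x = np.array([1.0, 2.0, 3.0])     # input vector

# "By hand": roll your own matrix-vector product with loops.
z_loops = np.zeros(2)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        z_loops[i] += W[i, j] * x[j]

# Built-in linear algebra: one line, and much faster for large matrices.
z_vectorized = W @ x

print(np.allclose(z_loops, z_vectorized))  # True
```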
It is really a great course, and I encourage you to try it. It opened my eyes and mind to a lot of concepts I had been trying to grasp for a while that just didn't click until I took that course in 2011.
Of course, once it did, I was like "that's it? why couldn't I see that?" - such is the nature of learning I guess!
His method of building up from the basics all the way to a neural network using Octave will be a good foundation for further learning or playing with other tools like TensorFlow; I ended up taking a couple of other later MOOCs offered by Udacity (focused mainly on self-driving vehicle concepts), and ML Class (and what I took of AI Class) really helped to prepare me.
It also showed me things I was deficient in (something I need to correct some day). But that's a different story.
Suffice to say, that course will "unveil the magic"; it was really an amazing course that I am glad I took. Oh - and to show you what you can do with what you learn in that course, a fellow student during that time ended up making this (and the course was not even finished when he did it):
Throughout the course, the ALVINN autonomous vehicle was referenced quite often, and it inspired him to recreate it successfully in miniature. It also serves to show just how far computer technology has come - what used to take a van-load of equipment can now be played around with using a device that fits in your pocket!
This is a great course; I took it once a few years ago and sometimes wonder if I should take it again, and how it has changed over time. It was eye-opening training a small NN by hand on the MNIST dataset and getting it to recognize digits surprisingly well.
However, even understanding all that doesn't by itself let one use modern frameworks; they seem to carry too many additional assumptions and too much terminology, which aren't explained with a similar level of patience.
What is meant is that the example has "unrolled loops" - that is, to an extent, some of the matrix math has been unrolled in order to show the discrete steps involved in the calculations, steps which would otherwise be lost (more opaque) on someone learning.
Real systems like TensorFlow (and Keras, which is a wrapper that simplifies TensorFlow - which is already pretty simple; if you come to love TensorFlow, you will really love Keras - but you can't really appreciate TensorFlow until you understand the basics of what it does - which the posted tutorial tries to show) hide all of these steps behind a variety of API that tries to mimic the conceptual ideas behind neural networks (ie - "layers" of "neurons" with "connections") while hiding the lower-level "complex mathematics".
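As a rough sketch of how little of that math surfaces at the Keras level (the layer sizes and hyperparameters here are arbitrary, not taken from the article):

```python
import tensorflow as tf

# "Layers" of "neurons" with "connections"; the matrix math and backprop
# are hidden behind the API.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(2, activation='sigmoid', input_shape=(2,)),  # hidden layer
    tf.keras.layers.Dense(1, activation='sigmoid'),                    # output layer
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1), loss='mse')
# model.fit(x_train, y_train, epochs=1000)  # all the gradient work happens in here
```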
That's a good thing on one level, because it opens up ANNs for more users. But it's a bad thing at another level, for those who want to know what is going on in those "black boxes", as well as for those who sometimes will stumble into certain issues with their networks, and struggle to understand certain things if they don't understand those underlying systems.
No one implements neural network code the way it is shown in this tutorial for much the same reason most people don't code in assembler: Because there are higher level means to work with the same operations and do it quicker, more efficiently, and more correctly (also, the lower level operations used in such libraries have been rigorously tested to be correct - it's similar to the reasoning on how you don't roll your own crypto, because you will likely get it wrong - even if you are an expert in the domain).
Understanding the principles is what takes the most work, not writing or adapting the code.
In fact, understanding the principles is IMO a prerequisite for being able to successfully adapt the code.
Now, using something like TensorFlow in another later article is of course super useful. It also will clarify why TensorFlow is more confusing and why this article was written.
I find the article enlightening, and specific libraries like TensorFlow are just "implementation details".
Why do you need to know linear algebra or calculus in programming?
A poignant answer is: you never really know! The future will surprise you!
Watching the movie about AlphaGo, and all the other stuff DeepMind has done, will give you all the answers you need about the "why" of NNs.
The question in the first line has an obvious answer and it is just rhetorical. The bigger point is: this mindset doesn't apply just to Neural Networks, many things in mathematics are useful years after they are first invented, and it is better to learn general principles and have an open mind, Richard Feynman style.
That is a good question - I will attempt to answer it for you (probably poorly):
First off, for most problems you don't need a neural network - many, many problems can be solved using simple statistical regression techniques - what could be called "classical machine learning".
It's when your data starts to involve more than a few variables (usually more than 2 or 3), and/or when you don't know exactly which "classical" machine learning technique would be best to use, that you may find a neural network to be a potential solution.
Ultimately, what a regular neural network does (I'm not going to get into anything like GANs or CNNs or anything special like that - though arguably they all work in a similar manner) is come up with a solution that encapsulates and encodes, through "learning", the proper algorithm(s) needed to provide a solution for the problem - whether that is classification or something else.
It's basically the classic argument that all a neural network is, is a complicated form of linear regression. And to an extent, that is true - but it's done in a "black box" manner, which may or may not be important to you for your problem (the issue becomes "how does it do what it does?" - and that's where things can become tricky to answer).
Neural networks basically take information - various attributes of a problem set - and then, once trained, can output an answer for an example (that it hasn't seen before!) that fits that problem set criteria. It does this by having been "trained" by seeing a lot of different examples, with each example matched up to a "labeled" output. Depending on whether it is right or wrong for the labeled output, it will then propagate the error backwards through the network to correct things by a small amount, and try again (note: your training data needs to be extremely varied, and should include both positive examples and negative examples, and shouldn't be biased toward any particular set of examples).
Anyhow - your data must ultimately be represented by numbers in some manner, usually continuous, and can be anything from a single input to multiple inputs. For instance, you might have the single input of "temperature" to control the output of "turn on/off the furnace" - ie, modeling a thermostat (note that you would never do this actually, outside of learning how a simple perceptron works I guess).
Usually you don't use a single continuous input - you would use multiple inputs - maybe a date, coupled with gps coordinates (like say for housing prices), with an output of "price". Or maybe 10,000 grayscale values (0.00-1.00), representing a 100x100 pixel image, with a classification output of the numbers 0-9 and letters A-Z (36 symbols).
In other words, you are trying to figure out the answer to a problem (is this a picture of a letter or number?) that would be difficult or impossible to code a set of rules by hand to answer with any statistical certainty.
Note that last part: Neural networks do not give you an absolute result; they output a continuous value or a set of values that represent the most probable likelihood of a correct answer based upon what the model has learned in the past. In the case of a single output, that could represent a "yes/no" or "true/false" answer (a value between 1 and 0 respectively); it could also be "left/right", "up/down" or something similar if controlling direction/pressure/flow rate, etc based on inputs.
For classification, it will be a set of continuous values, indicating for each "class" the probability (0-1) that it is the correct one. For instance, if you had three categories like:
Monkey: 0.1
Banana: 0.8
Coconut: 0.2
Then it is likely that the picture was of a banana. But if you saw this:
Monkey: 0.7
Banana: 0.9
Coconut: 0.1
Maybe the picture contained a monkey holding a banana? It might be possible - even if the network was never trained on that particular imagery!
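Reading such an output vector off in code might look like this (an illustration only, not tied to any particular library):

```python
import numpy as np

classes = ["Monkey", "Banana", "Coconut"]
scores = np.array([0.7, 0.9, 0.1])

# Single-label reading: take the single most probable class.
print(classes[int(np.argmax(scores))])                   # Banana

# Multi-label reading: everything above a threshold counts.
print([c for c, s in zip(classes, scores) if s > 0.5])   # ['Monkey', 'Banana']
```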
Note that it is also possible to fool such a network, as numerous studies have shown.
It's also possible to have the network output an image (ie - by having it output a very large array of node values that represents the pixel values of the image); this is how neural networks generate or alter images. Sound data can also be done in a similar manner.
Alright - I think I am rambling a bit, so I'm going to leave it here; I hope this answers at least something of your questions (and I hope it didn't confuse you - if so, I apologize and that wasn't my intent).
I think a better and more elegant way to implement this is the PyTorch way. By defining a forward and a backward function for each layer, you can easily chain multiple layers into a single model. With the training loop placed inside the layer, it is hard to write a multi-layer version.
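Something like this minimal numpy sketch of the idea (loosely in the PyTorch spirit; all class and variable names are my own):

```python
import numpy as np

class Sigmoid:
    def forward(self, x):
        self.out = 1 / (1 + np.exp(-x))
        return self.out
    def backward(self, grad_out):
        return grad_out * self.out * (1 - self.out)

class Linear:
    def __init__(self, n_in, n_out, lr=0.1):
        self.W = np.random.randn(n_out, n_in) * 0.1
        self.b = np.zeros(n_out)
        self.lr = lr
    def forward(self, x):
        self.x = x
        return self.W @ x + self.b
    def backward(self, grad_out):
        grad_in = self.W.T @ grad_out
        # Update the parameters with plain SGD as the gradient flows back.
        self.W -= self.lr * np.outer(grad_out, self.x)
        self.b -= self.lr * grad_out
        return grad_in

# Chaining layers is then just forward in order, backward in reverse:
layers = [Linear(2, 2), Sigmoid(), Linear(2, 1), Sigmoid()]

def forward(x):
    for layer in layers:
        x = layer.forward(x)
    return x

def backward(grad):
    for layer in reversed(layers):
        grad = layer.backward(grad)

# One training step on a single example (x, target) would be something like:
# y = forward(x); backward(2 * (y - target))   # gradient of a squared-error loss
```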
I see so many implementations of NeuralNets from Scratch on Repl.it (I'm a co-founder), but this is one of the simplest and clearest ones.
One that blew my mind recently is one by a 12 year old programmer: https://repl.it/@pyelias/netlib