Hacker News new | past | comments | ask | show | jobs | submit login

As someone who has read a lot of implementing neural networks from articles, the massive problem with all of them is that they import numpy. You may think that it is silly to reimplement the matrix math but with out that part of the code, you can't easily port it to other languages/microcontrollers/microwaves/badgers.

It's a legitimately valid part of machine learning, and its not easy to do for novices.

And I need help putting it on my badger damn it!




As someone who does teach tutorials as a side gig, I would argue that implementing matrix operations in a tutorial on neural networks is overkill. No matter what the level of the tutorial you always need to draw a line and assume a certain amount of background knowledge and knowing how to use standard tools isn't too much to ask. (yes, I know numpy isn't part of python's standard library, but it comes with pretty much any Python distribution as many other libraries depend on it.)

If we're talking about a longer format, such as a book, then we might consider digging deeper and implementing as much as possible using the barest of Python requirements. Indeed, Joel Grus does implement everything from scratch in his great (although a bit dated) book https://www.amazon.com/Data-Science-Scratch-Principles-Pytho....

EDIT: This is still a work in progress (and relies on numpy and matplotlib), but here is my version: https://github.com/DataForScience/DeepLearning These notebooks are meant as support for a webinar so they might not be the clearest as standalone, but you also have the slides there.


A new edition of Grus comes out next week actually...

https://www.amazon.com/Data-Science-Scratch-Principles-Pytho...


Nice! He mentioned he was working on it when I met him at Strata last year, but I didn't know it was coming out already.


Ugh I just bought the old one a week ago.


I’d agree... Outside of very rare circumstances (specialist in numerical linear algebra implementations), my opinion is that implementing matrix operations is something you do once (twice) in your numerical courses to get an intuition for the algorithm, and then never again.

But maybe it’s educational to do once if you never have before.


Matrix math is easy peasy. Freshman level programming. Just lookup algorithms on Wikipedia and you're all set.

The problem is it's extremely hard to make it efficient. Dozens of men-years are spent trying to optimize linear algebra libraries. There are handful linalg libraries that have competitive performance. It was my college project to make a fast linalg library, and boy it is fast. There are some things like matrix multiplication that if you implement in C with the trivial algorithm, takes >2 mins but with some tricks you can make it as fast as <second (vectorization, OpenMP, handwritten assembly, automatically optimized code, various optimizations, better algorithm.....).

So, if you want to implement linalg in some language and compile it, go ahead, more power to you. But it's basically impossible to do it efficiently. My opinion is: this is fine and we should do this. There should be linalg libraries written in pure python (and are 1000x slower than lapack) but just understand that it's impossible to satisfy all use cases of numpy this way (at least currently).


Can anyone simply explain the gist of how matrix multiplication is optimized? I know a lot of is farmed out to the GPU (if you've got a good GPU), but what's the essence of it? Caching? Some kind of clever mathematical tricks? All of the above?


> Caching?

That's the first step. If you have a 64kB cache, then you want to fill that cache ONCE, calculate everything you can with that 64kB chunk of data. Save off the result, and then load a new 64kB chunk. This is called "tiling".

Its actually kinda tricky to do just right, but once you know the concept, you basically spend effort ensuring that main-memory hits are minimized.

GPUs have many different memory regions: global Memory, L2, L1, "Shared" memory, and finally registers. Maximizing your math and minimizing your memory-moves is one big part of optimization.

-------

There are a ton of other optimization tricks: GPUs are more efficient if they access memory in certain ways. "Shared Memory" in GPUs are typically banked (on modern NVidia and AMD GPUs).

If you have 64-threads, it is more efficient if thread#0 accesses X+0. Thread#1 accesses X+1. Thread#2 accesses X+2. (etc. etc.) Thread#63 accesses X+63. In fact, a GPU can perform all 64-memory loads simultaneously.

However, if the 64-threads all access memory location X+4, you have a "bank conflict". The 64-threads can only read this memory location one-at-a-time, resulting in a massive slowdown.

There are a huge number of tricks in "tricking" the for loops to access memory optimally across your many, many threads that are calculating the multiplication.

--------

> Some kind of clever mathematical tricks?

The clever mathematical tricks have been solved and are known. LU decomposition, etc. etc. Its important to know them, but that's not generally what people mean by "optimization".


I'd hope that those conflicts only occur on writes and not reads?


> I'd hope that those conflicts only occur on writes and not reads?

They happen on reads and writes.

The best way to describe it is... GPUs internally not only have tons of cores and threads (albeit "SIMD" threads, not true threads...)... they also have multiple memory banks.

I know AMD's architecture better. So let me talk about GCN. The smallest "cohesive" structure of an AMD GCN GPU is the execution unit. An execution unit has its own L1 data-cache, a 64kB "shared-memory" region, and finally the 256 vALUs (vector ALUs). There are 4-instruction pointers (64-simd threads per instruction pointer).

I'll focus on the high performance "64kB Shared Memory" region.

The 64kB shared memory is organized into 32-channels. In effect, channel 0 handles all addresses ending in XXXX00. Channel 1 handles all addresses ending in XXXX04. Channel 2 handles all addresses ending in XXXX08. Etc. etc. Channel 31 handles all addresses ending in XXXX7C.

Each channel can serve ONE thread per clock cycle. So if all 256-threads of the execution unit try to grab address 400000 (ending in 00, so channel 0), they all hammer channel 0. Channels 1 through 31 don't do anything, because no one asked for data from them. In this case, ONLY one channel is working, while 31-channels are sitting around doing nothing.

In this case, the last thread will be waiting for ~256 clock cycles, because its got 255 threads in front of it trying to grab data.

If you instead wrote your program so that all the threads asked for data "equally" between all 32-channels, then your code would be 32x faster (because all 32-channels would be working). Each channel has 8 requests (32x8 == 256 reads), and the 256 threads all get their data in just 8 clock cycles.

--------

This isn't a problem in CPU world, because your L1 cache on Intel / AMD systems only has one channel per core. Actually, AMD offers TWO L1 channels per core (!!), and Intel offers 3x channels per core (2x reads + 1x write). So your one thread can do 2-reads + 1x write per clock tick.

If you have a massive 32-core CPU, you still have 2x reads + 1x write per core. So total of 64-reads + 32 writes across the whole system. This makes CPUs simple.

GPUs however, have a compromise. The 256-simd threads (or really, up to 2560-threads on an AMD system per EU) share the same 32 read/write channels.

--------

Technically speaking, there are "channels" and "banks", two different memory organization schemes in a GPU. In practice, you can just pretend that only channels exist, because a bank conflict more or less acts the same as a channel conflict.


Thanks for the reply. It makes me wonder how much of a slowdown GPU accelerated neural nets will get due to the mass reading of shared input values.


Broadcasts can be done efficiently on AMD systems. I dunno about NVidia, but I would assume NVidia PTX has some kind of low-level broadcast mechanism too.

A lot of optimization is just knowing all of the special ways you can move memory around. Broadcast was common enough that they've given AMD GPUs a special instruction just for it.

So in the case of neural networks all reading from the same input, you'd want to do it through the broadcast instructions, instead of through shared memory. Shared memory would create bank conflicts.


On the CPU, matrix multiplication follows the same procedure you'd use to multiply matrices by hand. But GPUs are good at performing the same operation on a bunch of data at the same time. Any operation that is embarrassingly parallel is a good fit for doing on the GPU and often large matrix multiplications are. So the premise is that you do the steps that don't need to be done sequentially in parallel on the GPU.

Most approaches to doing matrix multiplication on the GPU benefit from doing operations in a way that plays off the behavior described above, make good use of caching, and respond to how the data in your matrix actually looks (e.g. is it sparse, etc).

To learn how you'd do matrix multiplication on the GPU, you might want to look up how its done via CUDA since many applications that make use of the GPU do so via CUDA and it doesn't require specific knowlege of graphics programming. This seems like a good introduction: https://www.shodor.org/media/content/petascale/materials/UPM...


You can look at the basic algorithm on wikipedia[0], and you'll see that a lot of it is parallelizable (non-sequential dependency of calculations) which by default is easily (time-) optimized via GPU pipes.

After that, there are some caching optimizations to be had by iterating over the nested loops to minimize cache misses. Another optimization is to split up the matrices if their shapes are appropriate and use compounding matrix operations to recombine them, allowing for further use of parallelization in phases.

There are a few fancier algorithms out there which are optimized for certain assumptions -- extremely large matrices, several consecutive operations, matrices with many 0s or many identical entries or following certain patterns for values such as the identity matrix or eigenvectors, etc.

[0] https://en.wikipedia.org/wiki/Matrix_multiplication


The answer is: No one really knows because cuBLAS is closed source.

But to get within the same order of magnitude, tiling the workload for better cache utilization is usually the most important step. This article [1] explains it quite well and also lists a few other tricks.

In addition, there's also the fast Fourier transform for large filter kernels and Winograd convolutions [2] for small filter kernels.

[1] https://cnugteren.github.io/tutorial/pages/page1.html

[2] https://arxiv.org/pdf/1509.09308.pdf


Not entirely true - Scott Gray knows: https://github.com/NervanaSystems/maxas/wiki/SGEMM

IIRC his kernels shipped in cuBLAS at some point.


Things I used in my project were:

> (vectorization, OpenMP, handwritten assembly, automatically optimized code, various optimizations, better algorithm.....)

also in addition to these, I used caching. Also GPU programming, but it's a trade-off.


Easy peasy but not optimized sounds like it would fit with these from scratch tutorials that pop up everyone now and then.

Perhaps it is overkill. It's just not actually from scratch without it, you know?


If you don't care much about performance (and if you are reimplementing a neural network from scratch you're probably doing it more as a learning project than anything else) implementing matrix operations isn't very difficult.

If you're not used to work with matrices simply reading the Wikipedia article might tell you enough to implement them yourself.


If you have an assembler or C compiler you can implement matrix multiplication (GEMM) which usually does most of the heavy lifting in your neural net. Now you correctly alluded that it may not be simple to efficiently implement GEMM but if you have a simple architecture without a complex memory hierarchy then using whatever SIMD facilities are available and some standard tricks will get you in ballpark of peak FLOP/s.

Or, just download a fast BLAS from your hardware vendor...


Exactly the reason why my colleagues and myself do all deep learning in C++, performance and portability, from cloud to RPie. We've even modified caffe2 so we could build the training graph from pure C++. We know this is not the current doxa :) It's also all open sourced just in case others might need it...


Link? I love C++ and would love to see it.


It is easy to do unless you don't code at all, or are completely confused by the math.

I'm a C# developer and I'm sure it would take me all of about 30 seconds to install a matrix multiplication package through nuget. I'm sure it would be immediately obvious how to add items to matrices or do a dot product.


I'm a C# developer who wrote my own implementation (with help from random tutorials etc).

It was dead easy to get code examples as needed.


Hm. I just finished teaching an ML course where all of the assignments were pure Python (on purpose, so students would actually have the chance to see all of the code). One of the assignments included implementing reverse-mode autodiff and a NN classifier on top. It can be done in ~600 lines of clear python, serious!


Which course?


https://cscheid.net/courses/spr19/csc665/

The assignments are not directly available, but my email is easy to find.


The complicated parts of Numpy are themselves a wrapper for the seminal LAPACK: http://www.netlib.org/lapack/. It has C language APIs, so that might help you with what you need to do.


That's very interesting, thank you.


Yep. I would like to see an article that implements everything without using matrices first, then creates the matrices library with you, and refactors everything over.

So much learning that we're missing by not going through this step.


A good course that comes close to this would be the Coursera Machine Learning course (what used to be known as "ML Class" by Andrew Ng).

It uses Octave - but you first do everything (in the section on NN) "by hand" - building and looping for the matrix operations. Only after you've gone that far, does he (Ng) introduce the fact that Octave has vector/matrix primitives...

I took the original ML Class in the Fall of 2011; it was a great class, and opened my eyes a great deal on the topic of machine learning and neural networks, which I had struggled with understanding in the past (mainly on what and how backprop worked).


To me it feels a bit like that joke about drawing instructions. "1. Draw some circles. 2. Now draw the rest of the owl."



I'm curious if you have a particular language/microcontroller/microwave/badger you have in mind? Depending on which, YMMV.


No, not off hand I don't. It's just something I've noticed in all these make a neural network posts. Feels a little like they're just drawing the rest of the owl, if you know what I mean. But thank you.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: