You may like to take a look at Flux's implementation [1]; roughly the same idea but "professionalised" with performance work, tighter integration with the type system, nested AD and so on. It's a little less simple for that, of course, but it is still under 500 lines of fairly straightforward Julia code, and is generally a bit faster than PyTorch.
The Julia world has done a lot of experimentation with AD and is converging on some really cool things, so if you're interested in this field it's definitely worth a look.
Flux is awesome. One of the biggest advantages IMO is that kernels you can easily write yourself should be by default just as fast as what's built-in -- since it's all written in Julia and Julia itself is fast, without having to rely on C/C++/FORTRAN under the hood. As the Flux devs say:
"You could have written Flux. All of it, from LSTMs to GPU kernels, is straightforward Julia code. When in doubt, it’s well worth looking at the source. If you need something different, you can easily roll your own." http://fluxml.ai/Flux.jl/stable/
edit: and the automatic differentiation works on them too!
For those unaware of what automatic differentiation is: It's a close-to-magical tool which turns code for evaluating a function f into code for evaluating its derivative f'.
It uses a special sort of invented numbers which square to zero even though they are not themselves zero.
To clarify, your statement that "It uses a special sort of invented numbers which square to zero even though they are not themselves zero" is not entirely true. In case someone else is looking at this, what the OP is referring to is dual numbers. That is one way to implement things, but not the most common way for fast tools.
Fundamentally, automatic differentiation is the methodical application of the chain rule. Forward mode results from the application of the rules for directional derivatives. Reverse mode results from the total derivative. The reason that we have a reverse pass in reverse mode can be seen from this perspective. The directional derivative is `(f o g)'(x) dx = f'(g(x)) g'(x) dx`. Note, we compute `g(x)` before `f(g(x))` and the derivative `g'(x)` before `f'(g(x))`. Therefore, we can compute the derivatives as we compute the answer. If we want the gradient, which results from the total derivative, we have `grad (f o g)(x) = g'(x)* grad f(g(x))`. Although we still compute `g(x)` before `f(g(x))` during our computation, the gradient requires the computation of `grad f(g(x))` before the application of the adjoint operator `g'(x)*`. We do the evaluations on the first pass, cache extra values, and then compute the gradient on a reverse pass because we need the adjoints of the total derivatives in the reverse order.
Or, at least that's my bias in how to derive things.
Besides efficiency concerns (which I don't have a clue about), a disadvantage of the point of view using dual numbers is that, to my knowledge, it can only be used to derive the forward mode of automatic differentiation. Still I take pleasure in appreciating the slightly mystic aura of the dual numbers. :-)
Dual numbers don't have efficiency concerns in a language like Julia where the dispatching occurs at compile time, so you basically get the same compiled code as you'd want from a source-to-source transformation. The issue, though, is that all functions have to be generic. Dual numbers do make it very easy for the user to add their own overloads and take advantage of how the AD works; this is much harder with source-to-source, since you'd have to actually modify the AST transform. So there's a trade-off here in what is flexible.
But yes, dual numbers only do forward mode. However, slightly related are tracker types, like ReverseDiff.jl or Flux.jl's Tracker, where you essentially push a type forward through the code similarly to a dual number, and use this to build the computational graph which you then run backwards with the chain rule.
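To make the dual-number idea concrete, here is a minimal sketch in Julia (the `Dual` type and `derivative` helper are just for illustration; ForwardDiff.jl is the real, optimized version of this idea):

    # Dual numbers a + b*ε with ε^2 = 0: .val carries the value, .der the derivative.
    struct Dual
        val::Float64
        der::Float64
    end

    # The rules fall out of (a + a'ε)(b + b'ε) = ab + (ab' + a'b)ε, since ε^2 = 0.
    Base.:+(x::Dual, y::Dual) = Dual(x.val + y.val, x.der + y.der)
    Base.:-(x::Dual, y::Dual) = Dual(x.val - y.val, x.der - y.der)
    Base.:*(x::Dual, y::Dual) = Dual(x.val * y.val, x.val * y.der + x.der * y.val)
    Base.:*(a::Real, x::Dual) = Dual(a * x.val, a * x.der)
    Base.sin(x::Dual)         = Dual(sin(x.val), cos(x.val) * x.der)

    # Seed der = 1 at the input; whatever comes out in .der is f'(x).
    derivative(f, x::Real) = f(Dual(x, 1.0)).der

    f(x) = 3 * (x * x) - sin(x)   # f'(x) = 6x - cos(x)
    derivative(f, 2.0)            # ≈ 12.416

Because the derivative rides along with the value, any generic function built only from the overloaded operations gets differentiated for free -- which is the point above about dual numbers and generic code.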
Presumably, source-to-source "AD" could do some symbolic reasoning (whether any systems do, and whether it still counts as AD, I don't actually know). Specifically, I have in mind a source-to-source system that could look at `sqrt(x^2)` and produce a valid sub-gradient at x=0. With AD (whether forward-mode or reverse), you get NaN = Inf * 0, because the derivative of x^2 at 0 is 0 and the derivative of `sqrt` at 0 is Inf.
Hmmm, I don’t know. If you’re allowed to skim over edge cases, the statement of the chain rule is pretty obvious: the composition of two linear functions is another linear function, with the coefficients multiplied.
It's not and I also find it a bit frustrating when the default answer is just chain rule. For me, the key insights into how AD is derived are the following:
1. There is a fundamental difference between normed spaces and inner product spaces and this affects what algorithms we can derive. Specifically, the forward mode corresponds to a version of the chain rule that only requires a normed space and not an inner product space. If we assume that we have an inner product, then we can apply the Riesz representation theorem. This is important because it means that there's a concrete element in the space that corresponds to the derivative. This is precisely the gradient. Further, we have a concrete way of finding this element. Because we have a (complete) inner product space, we have a Hilbert adjoint. The gradient, and ultimately reverse mode, can be calculated by combining the chain rule with the Hilbert adjoint. Specifically,
`grad (f o g)(x) = g'(x)* grad f(g(x))`
Here, * represents the Hilbert adjoint. Anyway, we get away with this because the derivatives are really linear operators, which we get from the total derivative.
2. Eventually, we feed the gradients `grad f(g(x))` into the adjoint of `g'(x)`. However, it turns out that we can delay sending the values from the adjoint of `g'(x)` into its parent by first accumulating all of the gradient information being fed to it. Essentially, this means that we can follow a computational graph and accumulate all of the intermediate derivative information before moving on. This means that we can traverse the graph a single time in reverse mode, which is efficient. How we traverse this graph corresponds to a topological sort of the graph. This sort can be generated at compile time using what's called a tape. This may or may not be more efficient, for a bunch of reasons which aren't all that important right now.
Anyway, this is more to say that automatic differentiation is not an obvious result from basic calculus. At least not to me. For me, it required knowledge of real and functional analysis, graph theory, and knowledge of programming languages. It's not impossible to learn, but it's not trivial in my opinion.
It seems to me that most of the AD frameworks used for deep learning implement, for every elementary function, a backward function that returns its Jacobian, and then chain those backward functions.
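As a toy sketch of that idea (not any particular framework's API; the function names and pullback closures here are purely illustrative):

    # Each primitive returns its value plus a backward closure (a pullback).
    square(x) = (x^2,    ȳ -> 2x * ȳ)
    sine(x)   = (sin(x), ȳ -> cos(x) * ȳ)

    # Composing functions chains the backward closures in reverse order.
    function value_and_grad(x)
        y1, back1 = square(x)          # forward pass, caching what each backward needs
        y2, back2 = sine(y1)
        return y2, back1(back2(1.0))   # reverse pass: seed with 1, apply pullbacks backwards
    end

    value_and_grad(3.0)   # (sin(9.0), 6 * cos(9.0)), i.e. d/dx sin(x^2) = 2x * cos(x^2)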
I used it for non ML tasks in geophysics, which made my life a lot easier. However, I think most scientists and engineers aren't aware of it. It has been described as "criminally underused."
Yes, definitely, there are even battle-tested implementations for Fortran available.
Though I have never seen AD frameworks used in production contexts for neural networks/backpropagation. As you say, the code for this seems to be mostly hand-rolled. Please take this negative statement with a grain of salt; I don't actually work in machine learning.
Backpropagation is literally reverse-mode AD applied to neural nets. That said, I wish I had a paper that I could point to that shows that explicitly. From what I can tell, the AD community has known this basically since the start, but for whatever reason backpropagation is taught rather than the more general technique.
As long as you write code that's type stable, yea. Type stable means that the compiler can deduce the types of the variables used in a function given the types of the arguments passed to the function. An example of code that's not type stable is something like this:
function test1(n)
    x = 0
    for i = 1:n
        if i == 10
            x = x + 0.1
        else
            x = x + 1
        end
    end
    return x
end
It isn't type stable because x starts out as an Int but then changes to a Float64 in the middle of the loop. This code compiles to 78 instructions.
The following code is type stable:
function test2(n)
    x = 0.0
    for i = 1:n
        if i == 10
            x = x + 0.1
        else
            x = x + 1.0
        end
    end
    return x
end
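If you want to check this yourself, `@code_warntype` in the REPL flags the unstable variable (the exact printout depends on your Julia version):

    @code_warntype test1(20)   # x is inferred as Union{Float64, Int64} -- the instability
    @code_warntype test2(20)   # x is inferred as Float64 throughout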
The performance is awesome, the REPL is super great as well. There are libraries that I'm itching to try out as soon as I've got a relevant project (Flux ML being the top of that list).
There have been a lot of situations in using it where I've just gone "this is everything I needed/wanted": the performance, the language features and API design, etc.
Plus there's a lot of very interesting things going on with the language and ecosystem, I definitely recommend trying it out.
While Julia is indeed a dynamically typed language, it also supports type annotations and can in most cases infer the static type of a variable. If you write a function
function foo(x)
    # do something
end
and you call foo(10) and foo("some string"), then the compiler will compile specialized versions of foo for Int and String arguments. There is then no need to track the dynamic type of x inside those compiled specializations.
Generally, yes. It sounds like magic at first, but it's really just like a very lazy C++ compiler. You can happily look through the generated LLVM and machine code if you want to make sure it's reasonable.
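For example, from the REPL, using the `test2` function from the type-stability example above:

    @code_llvm test2(20)     # the optimized LLVM IR for this specialization
    @code_native test2(20)   # the generated machine code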
surprisingly yes, it really is. Julia is smart about type inference, so it compiles a specialized version of a function as needed for whatever argument types actually get used.
the catch is, the first time you run a function there is often a noticeable compile time. but it’s cached after that.
other problems with Julia include a somewhat immature/unfinished set of libraries, in part because the language was constantly changing underneath people.
but now that 1.0 has been released, the language will be stable for a long time and you can expect that to improve quickly.
good language! it gets a lot of hype on HN but that’s because it is actually very nice.
Julia is hardly the first (cf some common lisp environments, 20+ years ago). To really get near (say within 2x) of c you need some sort of type annotation or inference of course, and (more importantly) you have to structure your code such that this works, but it's quite do-able. Your code does tend to end up a bit c-like in those performance critical sections, but that's hardly surprising.
I'm not sure we disagree. In my experience you can get "pretty fast" but not "c-like fast" while keeping many language features. The more you chase c-like performance (or even better fortran), the more you start restructuring things in a c-like way. This isn't universal of course, and I don't think it has anything specific to do with c language, just that it is a semi reasonable proxy for hardware architecture (ignoring SIMD).
Anyway, this isn't really Julia specific, and I haven't tried with Julia recently so I may be wrong :)
Its "cost" is in startup times (compiling your code with LLVM at runtime slows things down, no surprise there), and you still have to write somewhat type-stable code (outputs uniquely inferable from inputs) if you want it to actually run at full speed.
Alright, this looks neat, but I'm having a terrible time figuring out what's going on with the benchmark. Typically for AD, it's easiest to see four things: number of variables, time to compute function directly with no AD, time to compute function with AD enabled and calculate the gradient, ratio between these two numbers. Then, we run the same example with a varying number of variables to see how things scale. The advantage of reverse mode is that, theoretically, this ratio is fixed and bounded at 4 or 5 regardless of the number of variables. Realistically, this is much higher, but I think it's fine if it's somewhat between 10-40 times function evaluation as long as it scales. I can't figure out what their ratio is or whether or not it's appropriately scaling. Can anyone else?
So, fun times: if the 1-based indexing throws you off that much, it is entirely straightforward to configure it to use 0 based indexing if you want (or any other kind of offset that you so desire)
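For example, with the OffsetArrays.jl package (one common way to do it; the data and index range below are arbitrary):

    using OffsetArrays

    A = OffsetArray([10, 20, 30], 0:2)   # same data, now indexed 0 through 2
    A[0]            # 10
    A[2]            # 30
    firstindex(A)   # 0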
Counting is slightly easier: if you have one object, it has slot number 1, and if you have one more object (meaning you have two objects) that second object has slot number 2.
PS: I am joking here of course, but seriously, try counting things with your fingers starting with a fist, then one finger, then two.
Indexing isn't counting. It represents shifts from the reference position, which is the natural reference when dealing with contiguous objects. Everyone knows that the first floor of a building is right above the ground floor.
"Indexing isn't counting" is not a fact; it is a language implementation choice inherited from C. It may be a useful choice for systems programming but it is not necessarily the best choice for every language.
And even for buildings, we have two conventions in use. Plus a debate about whether the ground floor is really a floor, or whether the word "floor" naturally implies something elevated above the ground...
This is too often the attitude of Julia people. But zero-based indexing is not just some arbitrary offset. Programmers just like it. You say that many things are more intuitive with one-based indexing, but many things are less intuitive too. I don't think python ever would have become popular without zero-based indexing.
The thing is, we really don't know why people prefer certain types of language. All we know is that they do. Computer languages can't evolve gradually like natural languages, but if they could I really don't think we'd go to one-based indexing.
And it's a good attitude for the Julia people to have. Their target audience is not just (or even mainly) full time developers. They're targeting scientists, statisticians etc. As such, it makes sense for Julia to use conventions that are appropriate to their audience. This pervades the whole of Julia, not just indexing although the latter is exceptionally rich for this reason.
> Programmers just like it
That's a reason for having 0-based indexing but not an especially good one. And for Python it was an arbitrary decision. That's not to say that 0-based indexing is necessarily an arbitrary decision, it wasn't for C, for example, but it often is.
> And it's a good attitude for the Julia people to have. Their target audience is not just (or even mainly) full time developers. They're targeting scientists, statisticians etc.
I don't see how scientists or statisticians would be unable to understand a basic concept like zero-based indexing. In fact, zero-based indexing is already widely used in basic areas such as series and sequences. Why are freshmen quite capable of understanding such a fundamental convention but somehow seasoned scientists and statisticians are not?
> I don't see how scientists or statisticians would be unable to understand a basic concept like zero-based indexing
Of course they can. In fact, they tend to have a complex relationship with indexing which is why Julia supports such a rich indexing language.
But the default the Julia team chose was the natural, mathematical default rather than the natural memory addressing default. Which, given what they were trying to achieve, was entirely sensible.
And, on a side note, just because something is physically possible, it doesn't follow that it's the most sensible choice. Otherwise language development would have stopped at the first Turing complete language.
What the Julia team is attempting to do is make the transition for their target audience as seamless as possible. This can be seen in their indexing language but also in their choice of dispatch, their type conversion architecture etc. A great deal of thought goes into these decisions (you can see it in their issue/RFC trackers). "A CS freshman could work it out" doesn't even come close.
Do you know how much easier it is for me to look at the mathematical definition of something and transcribe it (almost) literally into Julia code because of its 1-based indexing?
I'll tell you: it's very, very, very easy.
And when a language makes things easy, people tend to like it (that's one of the reasons people like Python after all).
Programmers might like it for no known reason and as such the preference might be arbitrary, but the choice for a language to reflect that preference is not.
Julia was developed by scientists that used math software like MATLAB. Most mathematics software uses 1-based indexing. (Previous comment about this.[1])
>with 1-based indexing: T[(h-1) * W * C+(w-1) * C+c]
Your example showing how cumbersome and ugly it is to subtract 1 -- isn't how the intended audience uses mathematics software. Instead, they would use idiomatic multidimensional access with commas or multi-brackets syntax that understands 1-based indexes. They do not manually multiply and add the offsets into a raw 1-dimensional array. (E.g. idiomatic T[x,y] or T[x][y] instead of contrived T[(x-1) * W+(y-1)])
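To put that in Julia terms (a small illustration; note Julia is column-major, so the manual offset formula below differs from the row-major one quoted above):

    H, W, C = 4, 5, 3
    T = rand(H, W, C)
    T[2, 3, 1]                          # idiomatic: Julia computes the offset for you
    vec(T)[2 + (3-1)*H + (1-1)*H*W]     # the manual column-major offset; same element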
It's also the same concept for MS Excel. In VBA macro programming code, one doesn't need to litter the code with "minus 1" everywhere. The 1st row and 1st column (cell A1) is addressed as "1,1" in the Visual Basic function "Worksheets!Sheet1.Cells(1, 1)"
Mathematics software has a tradition of 1-based indexing probably because many (most?) equations in written mathematics are 1-based. (Examples [2] [3].) It doesn't seem so unreasonable that scientists prefer that their math programming code has 1-based indexes to match the 1-based indexes in traditional math notation.
My argument was that in situations where the difference between 0- and 1-based indexing matters, it's usually 0-based indexing that leads to simpler code.
I guess I find it ironic, then, that one of the most common programming errors is of the "off-by-one" variety[1]. It is in part due to differences in convention and notation in various fields[2].
This isn't common in 1-based indexing languages. Very common in 0-based indexing languages[3].
If the 0-based indexing was objectively superior, wouldn't there be fewer instances of such bugs?
I was hoping to read some code analysis that supported your claim of off-by-one being more common in 0-based languages, but all I see in your [3] is a few different speculative reasons - am I missing something?
Off-by-one can show up in a bunch of ways that are agnostic to the underlying indexing, so it isn't obvious to me that there would be a substantial difference at all. Also, given what's said below about the ease of writing code which fails to consider custom indices, I'm not sure I believe just switching to 0-based or anything else in Julia is as painless as you suggested earlier - it could introduce its own off-by-one errors!
Agreed. Off-by-one often shows up when calculating indexes e.g. differences in indexes or offsets. These calculations are independent of the base of the index (0 or 1 or even N).
This implies that off-by-one errors are more common with zero-based indexing. Evidence? Your link didn't seem to support the idea that changing the indexing convention mattered. Rather, no longer indexing at all is what helped. (E.g., iterators.)
This fits my understanding. I've seen people make off-by-one errors in the wild just when counting how many posts/nails/etc. are needed.
Have you used julia? You're almost never calculating your offset to a pointer. The VM does that for you. (As it does for python, ruby, erlang). Or actually in the case of julia it's more accurate to say the (extremely lazy ahead of time) compiler does it for you.
In Julia, you don't have to calculate linear indices yourself since you can create a reshaped view of a blob of memory in an arbitrary number of dimensions. So this case doesn't come up. Also there are plenty of cases where 0-indexing is more awkward than 1-indexing. Actually reading your Dijkstra link will make that pretty clear. Reading it will also make it clear that the 0-indexing choice is pretty arbitrary and that you can make a very similar case for 1-indexing too. This 0-indexing meme really needs to die, especially the argument from authority by linking to Dijkstra.
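For instance, a quick sketch of the reshaped-view point:

    buf = collect(1:24)          # a flat blob of 24 numbers
    A = reshape(buf, 2, 3, 4)    # a 2×3×4 view over the same memory, no copy
    A[1, 2, 1]                   # 3 (column-major layout)
    buf[7] = 100
    A[1, 1, 2]                   # 100 -- it really is the same memory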
Contrary to your summary, the quote makes a good case for 0-based indexing, both theoretical and practical. How can an argument based on reason become an argument from authority just because someone well known said it?
That is in fact how an argument becomes an argument from authority. Also, it has become an HN meme to post the Dijkstra comment every single time there's a post mentioning Julia. It's not like it's news to anyone.
> Addressing an element T[h][w][c] in a linearized 3D tensor with 0-based indexing:
But why would you ever use linearized addressing when you have proper multidimensional access? Aside from doing the low-level implementation of multidimensional access, you shouldn't ever need to do that.
> Or let's say you want to take a string "abc" and repeat it until the length is 10, getting "abcabcabca". With 0-based indexing
Your example assumes not only the stated indexing model but also a range function that returns the integer members of a half-open range, which is a little crazy of a way to do loops, but the least unintuitive way with 0-based indexing, so programmers are used to it from C (where you do it without an actual function call) and newer zero-based languages.
With 1-based languages, the sane way is the more intuitive closed range (for i in 1:5), which doesn't require an extra mental operation to convert to the set of values that will actually be used. There are good uses for half-open ranges, but the kind of explicit loop control they get used for in 0-based languages is an artifact of 0-based languages, so arguing that it is (even more) awkward in a 1-based language isn't an argument against 1-based languages.
One-based indexing obviously means that ranges are inclusive on the right, and you iterate by `for i = 1:length(x)`.
That being said, I also abhor 1-based indexing in julia.
Well, you get used to it, and it is imho a small price to pay for julia's other cool things.
In the end, julia needs to access arrays through pointers, and indirect addressing / LEA use zero-based indexing. Somewhat amazingly, llvm manages to remove/hoist most of the extraneous subtractions. But still, it would be much more transparent if julia simply reflected the hardware standard of zero-based addressing (when in doubt, stay as close to the silicon as reasonably possible).
Offset arrays are a bad choice when avoidable, because (1) they incur an additional pointer load, (2) most julia code uses 1-based indices (conforming to common language idioms is always good), and (3) you will trigger all sorts of bugs in all sorts of packages that falsely assume that all AbstractArray are one-based.
In fact, almost every solid julia package supports arbitrary indexing for input arrays. `for i = 1:length(x)` is discouraged in production code; instead the idiom is `for i in eachindex(x)`.
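For example (a sketch; the last line assumes the OffsetArrays.jl package mentioned elsewhere in the thread is loaded):

    function mysum(x)
        s = zero(eltype(x))
        for i in eachindex(x)    # iterates over whatever indices x actually has
            s += x[i]
        end
        return s
    end

    mysum([1, 2, 3])                      # 6
    mysum(OffsetArray([1, 2, 3], 0:2))    # also 6; 1:length(x) would hit a BoundsError at x[3]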
Some algorithms work better with 0-based indexing. But maybe, just maybe, there are other aspects to look at when evaluating a programming language than "it doesn't by default follow my favourite indexing type". Especially given that 0-based indexing is very easy to use for any array you may wish to use it for.
Hear hear. Everyone that's making excuses for one based is wrong. You need a zero in your set of integers to form a Ring modulo N. I hereby declare everyone that wants 1 based indexing out of some naive sense of it being easier a scrub.
That's so absurd I hope it's satire. Humans instinctively reason about iteration in terms of the natural numbers, not the semantic purity of the underlying abstract algebra.
Or, you know, just use the `mod1` function in those cases: https://docs.julialang.org/en/v1/base/math/index.html#Base.m... . Even more fun is that you can use the special `end` index to implement a stupid-simple circular index: `A[mod1(i, end)]` will always be in-bounds for integer `i`.
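Applied to the string-repetition example upthread (this works as written for an ASCII string):

    s = "abc"
    join(s[mod1(i, length(s))] for i in 1:10)   # "abcabcabca"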
It's not a common case, though. Using inclusive ranges and 1-based indexing makes Julia's slicing syntax use first-class ranges that can behave like arrays themselves. You can do math on them. You can do all sorts of amazing things that make your life easier: https://julialang.org/blog/2016/02/iteration
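A few examples of what that looks like (results in the comments):

    r = 2:5
    r .+ 10        # 12:15 -- still a lazy range, no array allocated
    2 .* r         # 4:2:10
    collect(r)     # [2, 3, 4, 5], if you do need a real array
    A = rand(10)
    A[r]           # slicing with the range gives elements 2 through 5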
Sometimes you want to count fenceposts, and sometimes you want the distance (e.g., offset) from the first post. Keep this imagery in mind and it's easy to do either in any language.
Aesthetically I sort of agree, but I just want to type in equations from papers without having to do any error-prone arithmetic; it is slightly less frustrating.
[1]: https://github.com/FluxML/Flux.jl/blob/master/src/tracker/Tr...