1. If you program in any compiled language, being able to read assembly is critical to figuring out what the compiler is doing with your code. A quick glance over the assembly can often reveal why the code is slower than expected, including suboptimalities in your higher-level code such as unexpected aliasing.
2. If you program in a C-like language, parsing the output of gdb is impossible (beyond the bare basics) without knowing assembly.
3. SIMD assembly can give performance improvements of 10-20x or even higher for many types of real-world code. Skill at using SIMD is in incredibly high demand at the moment, especially on ARM (with NEON). If you are writing code where performance is critical, you should always consider how to optimize with SIMD. If you don't, you're effectively throwing away >90% of the capabilities of your CPU.
4. If you understand the capabilities of the machine you're optimizing for, you can better write code (in a low-level language like C) to take advantage of it. Examples of this include the ability on ARM to do conditional instructions effectively for free, or the ability on x86 to do reg3 = (reg1+(reg2<<{0,1,2,3})+<const>) in one instruction (LEA); see the sketch after this list.
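For a concrete (if contrived) illustration of point 4, here's roughly what a compiler can do with C written to match the machine. The function names are invented and the exact instructions will vary by compiler and flags:

    /* x86-64: the whole expression folds into one LEA, e.g.
       lea eax, [rdi + rsi*4 + 12] */
    int index_math(int base, int idx) {
        return base + (idx << 2) + 12;
    }

    /* ARM (32-bit): the increment becomes a predicated add after the
       compare (cmp r0, r1; addgt r2, r2, #1), no branch needed. */
    int cond_count(int a, int b, int n) {
        if (a > b) n++;
        return n;
    }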
Learning assembly is quite easy; for example, last Google Code-In we had a student (age 17) who started knowing not a single iota of anything about assembly and, within a week, was writing literally thousands of lines of SIMD code and passing stringent code reviews.
> SIMD assembly can give performance improvements of 10-20x or even higher for many types of real-world code.
Using SIMD intrinsics is often preferable. For set-in-stone inner loops, the extra performance you can wring out by hand-writing in assembly may be worth it, especially on compilers with shitty support for intrinsics. For SIMD-based leaf functions that are going to be inlined in many contexts, intrinsics have the advantage since the compiler can schedule and register allocate the instructions piecewise rather than treating them as an indivisible block with hard-wired registers (GCC-style clobber lists are supposed to help the compiler do that with inline assembly, but it sucks in practice).
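As a rough sketch of the kind of leaf function I mean (an invented example, assuming SSE2), written with intrinsics so the compiler can inline it and redo scheduling and register allocation at every call site:

    #include <emmintrin.h>

    /* Average 16 bytes of two pixel rows, rounding up. As an asm
       block this would be an opaque unit with hard-wired registers;
       as intrinsics it melts into whatever loop it's inlined into. */
    static inline __m128i avg_row16(const unsigned char *a,
                                    const unsigned char *b) {
        __m128i va = _mm_loadu_si128((const __m128i *)a);
        __m128i vb = _mm_loadu_si128((const __m128i *)b);
        return _mm_avg_epu8(va, vb);
    }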
My opinion is that every programmer should learn enough assembly to fluently read the code generated by their compiler, but most of them should not be writing it. Debugging is the number one pay-off, but keeping tabs on what the compiler's optimizer does with your C code is a close second.
There was a good illustration of this last week on Hacker News. Someone submitted a StackOverflow thread where they were competitively coding a fast fixed-size sort based on sorting networks. The tentative winner at the time was using some god-awful hand-cooked branchless code for min/max that was completely defeating the compiler's attempts at generating the simple and optimal cmp r0, r1; mov r2, r1; cmovg r1, r0; cmovg r0, r2 instruction sequence. That would have been instantly obvious if they had the habit of reading their compiler's assembly. The result is that they were leaving a ~3x performance gain on the floor.
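For the curious, the gist of it (my own reconstruction, not the actual SO code): write the compare/swap step plainly and the compiler finds the cmov sequence itself; write it "cleverly" and you defeat the pattern match:

    /* Plain version: compiles to the cmp/cmov sequence above. */
    static inline void sort2(int *a, int *b) {
        int lo = *a < *b ? *a : *b;
        int hi = *a < *b ? *b : *a;
        *a = lo;
        *b = hi;
    }

    /* The hand-cooked branchless style that blocks it: */
    static inline int twiddle_min(int a, int b) {
        return b ^ ((a ^ b) & -(a < b));  /* min(a, b), obfuscated */
    }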
Intrinsics tend to be incredibly difficult to read (and write) compared to hand-written assembly. For practical purposes, it's write-only code, and C simply doesn't provide the syntactical niceties necessary to easily write assembly code.
Inlining SIMD functions is typically not possible in most applications unless you're compiling a dozen versions, one for each possible CPU, since different CPUs will use different SIMD functions that are optimized for their performance characteristics.
Nevermind the fact that intrinsics generally give atrocious performance compared to properly-written code, as your average 3-year-old can probably allocate registers better than gcc.
Intrinsics are about the same level of readability in my experience. Although it's annoying that compilers tend to pile on prefixes to the instruction mnemonics and thereby make everything longer. If that's a big problem for you, just use your own short-hand macro wrappers.
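Something like this (names invented, SSE2 assumed) is all it takes:

    #include <emmintrin.h>
    #define vload(p)     _mm_loadu_si128((const __m128i *)(p))
    #define vstore(p, v) _mm_storeu_si128((__m128i *)(p), (v))
    #define vadd16(a, b) _mm_add_epi16((a), (b))
    #define vmul16(a, b) _mm_mullo_epi16((a), (b))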
> Inlining SIMD functions is typically not possible in most applications unless you're compiling a dozen versions, one for each possible CPU, since different CPUs will use different SIMD functions that are optimized for their performance characteristics.
I mainly care about games when it comes to high-performance programming, so this is not an issue for me. Consoles are obviously fixed platforms, and on PC any game is likely to have high enough performance requirements across the board that you can compile your executable against a SSE3 min-spec target without excluding users because of that.
> Nevermind the fact that intrinsics generally give atrocious performance, as your average 3-year-old can probably allocate registers better than gcc.
I'm glad we agree that GCC is a pile of shit (and not just in this respect). MSVC's performance with intrinsics has also gone up and down with versions but by now it's pretty solid.
> I mainly care about games when it comes to high-performance programming, so this is not an issue for me. Consoles are obviously fixed platforms, and on PC any game is likely to have high enough GPU requirements that you can compile your executable against a SSE3 min-spec target without excluding users.
The instruction set available is often not the most relevant thing. In many cases, the optimal code will differ wildly between CPUs regardless of the available instruction sets. Let's look at some quick examples of performance characteristics:
Athlon 64: Slow SSE2 unit, MMX is often faster than SSE2 for many functions.
Phenom: Very fast SSE2 unit, but missing SSSE3 support, so it can't use the same optimized functions in many cases as the Core 2 and above. Significantly higher instruction latency than Core 2, so needs significantly more pipelining.
Core 2 Conroe: SSE shuffle operations (punpck, etc.) are excruciatingly slow: 4/2 (cycles of latency / cycles per instruction) for shuffles that use both arguments as input data (e.g. punpckldq), 2/2 for those that use only one argument (e.g. pshufd). Cacheline-split loads are extraordinarily painful (equivalent to an L1 cache miss, or ~14 cycles).
Core 2 Penryn: Same as Conroe, but shuffles are fast now (1/1 for everything basically).
Nehalem: Cacheline-split loads are cheap now (2 cycles), and shuffles have doubled throughput compared to Penryn (1/0.5).
This doesn't even get into the more instruction-specific messiness, like how some CPUs like movddup and movhlps on integer data while others don't. It's not uncommon to have 3 or 4 assembly functions just to cover the latest CPUs -- not even counting old ones. Fortunately for our sanity, these are usually templated from a single function, with small changes to the relevant areas created via macros.
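To give the flavor of that templating, here's a hypothetical sketch (not real x264 code): one body, stamped out per CPU, with only the load strategy swapped:

    #include <emmintrin.h>

    #define DEFINE_SAD16(name, LOAD16)                          \
        static unsigned name(const unsigned char *a,            \
                             const unsigned char *b) {          \
            __m128i s = _mm_sad_epu8(LOAD16(a), LOAD16(b));     \
            s = _mm_add_epi32(s, _mm_srli_si128(s, 8));         \
            return (unsigned)_mm_cvtsi128_si32(s);              \
        }

    /* Nehalem: split loads are cheap, just load 16 bytes. */
    #define LOAD_WHOLE(p) _mm_loadu_si128((const __m128i *)(p))

    /* Conroe: dodge the cacheline-split penalty by loading two
       8-byte halves (assumes 8-byte alignment). */
    #define LOAD_HALVES(p) _mm_castpd_si128(_mm_loadh_pd(       \
        _mm_load_sd((const double *)(p)),                       \
        (const double *)((p) + 8)))

    DEFINE_SAD16(sad16_nehalem, LOAD_WHOLE)
    DEFINE_SAD16(sad16_conroe,  LOAD_HALVES)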
A gaming PC is already so fast compared to consoles that I'm generally not super concerned about those discrepancies. Compared to the big gains in going from no SIMD to SSE3-level SIMD, it's very minor. As I already said, if you have a few speed-critical inner loops (not generally the case in games but more the domain of programs like x264 and Bink 2) then go ahead and write the whole loop with a good macro assembler. But even then, Bink 2's encoder uses SSE in parts and is highly multi-threaded, and yet Jeff hasn't at all bothered with your level of per-CPU customization of the SSE code, and I doubt he ever will.
Most programs aren't like codecs or even like games but could still make great gains from a properly administered dose of SIMD intrinsics.
My experience differs. We've used SIMD intrinsics extensively, and almost never use assembly for this. We write huge quantities of performance critical code, as do you.
Many of the limitations that you discuss are real, but we've got strategies to avoid them.
We do compile a dozen versions, which is horrible, but there it is.
We get peak performance out of code using intrinsics. By 'peak', I mean close to the limits that can be achieved given the latency of the critical path and the number of operations that need to be carried out in total divided by the ports that can execute them.
In order to persuade gcc to get register allocation and scheduling right, we typically have to program in a style that closely resembles assembly, and a fair bit of our critical code is automatically generated and hand-unrolled/pipelined/scheduled. This is still easier than coding everything in assembly, especially when something goes wrong.
Large quantities of unrolling, and careful management of potential problems like aliasing, typically get modern x86 what it needs to be able to overcome the problems of gcc. :-) Quite often the asm generated looks like shit, but I have noticed a number of cases where improving the asm to be more sane actually reduces performance, often because a modern x86 processor is doing weird things with the resulting code.
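To give a flavor of that style (a deliberately dumb example of mine, not our production code): explicit temporaries, a hand-unrolled body, and restrict to rule out aliasing:

    #include <emmintrin.h>
    #include <stddef.h>

    void add_arrays(float *__restrict dst,
                    const float *__restrict src, size_t n) {
        size_t i = 0;
        for (; i + 16 <= n; i += 16) {   /* 4x unrolled SSE body */
            __m128 a0 = _mm_loadu_ps(src + i);
            __m128 a1 = _mm_loadu_ps(src + i + 4);
            __m128 a2 = _mm_loadu_ps(src + i + 8);
            __m128 a3 = _mm_loadu_ps(src + i + 12);
            _mm_storeu_ps(dst + i,
                          _mm_add_ps(_mm_loadu_ps(dst + i), a0));
            _mm_storeu_ps(dst + i + 4,
                          _mm_add_ps(_mm_loadu_ps(dst + i + 4), a1));
            _mm_storeu_ps(dst + i + 8,
                          _mm_add_ps(_mm_loadu_ps(dst + i + 8), a2));
            _mm_storeu_ps(dst + i + 12,
                          _mm_add_ps(_mm_loadu_ps(dst + i + 12), a3));
        }
        for (; i < n; i++)   /* scalar tail */
            dst[i] += src[i];
    }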
Not to rain on your otherwise good comment, but last week's SO question about fixed-size sorts was targeting a GPU, where avoiding branches is actually really important. That ugly branchless code is really bad for a complicated, speculative, out-of-order core like a modern x86, but your simple scalar compare/move code would really suck running on a GPU.
That SO question actually did a pretty good job at illustrating your point, because most people assumed that code that works well on their CPU would work well on a GPU, without stopping to think about the architectural differences.
If you read back to my comment in that thread, I explicitly touched on the GPU vs CPU issue. They were benchmarking and ranking algorithms based on CPU performance while allegedly investigating the problem for GPU implementation. Pretty silly of them, but that's beside the point. CUDA and OpenCL and even shader languages like HLSL, GLSL and Cg all have branchless conditional move instructions. They've always had them, going back to the days before GPU cores had branching. That's what you should use here, not some homegrown bit-bashing crap. For the higher-level languages like CUDA and OpenCL, the compilers have no problem generating branchless conditional moves from C code that uses branches in a simple and transparent way.
The point is that the 'simple and optimal instruction sequence' I mentioned in my top post has a 1:1 equivalent on every modern CPU and GPU.
In the abstract, obviously any bit of knowledge is worth having. For assembly, I'd even bet the value you get from learning it outweighs the effort you put in.
The relevant question, though, is whether, at the margin, it's the best use of your time.
Having a solid grasp of the Wikipedia-level knowledge of assembly is very valuable, but beyond that I'd be skeptical. Sure, you could learn it, but you could also learn about other topics, like writing a compiler, the theory of operating systems, understanding the layers of the TCP/IP stack, or working in a purely functional language like Haskell. All or any of those might be a better use of your time. So might learning to salsa, building social capital by hanging out with friends, or learning to bake bread.
So, maybe? But only if you're very curious about it (a good thing!) and have also knocked off other relatively more valuable and lower hanging fruit.
Of course, I've not taken the time to learn it, so it's quite likely there are some hidden benefits I've not managed to grok.
For the "do I need...", "is it worth it...", "should I buy..." type of question, I usually rely on my stock answer. "If you have to ask, then no."
There are many things like this in technology. Things that are rewarding, things that enhance your ability to understand a problem, things that give you new skills and abilities, great tools, or high powered technology. But, if you don't know you need it, then chances are, you don't need it.
On the other hand, perhaps the fact that you are asking is a sign that you are ready to need it. Are you becoming curious with how things work? Do you run up against an assembly wall when you try to debug or get tasks done? Do you need to eke out performance that you can't otherwise get? Only you can answer that question.
Fortunately, assembly programming will always be there for you, and learning the underlying way things work will always make you a better programmer. It's best learned as a passion and a means to understand, not as a magic bullet to get a better job, or to instantly get awesome performance.
> I usually rely on my stock answer. "If you have to ask, then no."
Or the converse: those who need to know it, know they need to know it already and will learn it.
Machine level knowledge (not just assembly: caching and DRAM architecture fu is just as important, IMHO) is really useful. But it's not "needed" to write working software, and the hordes of assembly-ignorant programmers out there (some of them actually reasonably good) are an existence proof of this fact.
But the great hackers all know their machines, and that's one of the aspects that makes them great.
Well, I happen to know about that stuff out of interest, but I wouldn't claim it's very useful knowledge for a programmer. The top-level cache (L2 or L3) communicates in big chunks with the DRAM controller so that you are insulated from the details of modern DRAMs like double pumping, burst mode, page sizing, channels, banks, etc.
There is a time in life when you have the mental space and the environment around you where learning anything is relatively free and easy. If you wait until (much) later when you can directly benefit from this knowledge you may not be able to free up the time to acquire it.
Ditto for learning spoken languages, the earlier, the more, the better.
But once you have the realization that there are more things than you could previously have conceived of, you find you need new tools to explore them. Thus you become one of the people who know they need it, and the question is answered.
In fact, you should try to learn as many languages that have real differences between them as possible.
There is something to be gained from a low level perspective on how a computer operates just as much as there is from higher levels, even if you will not use that in your everyday job.
No point in learning three varieties of the same thing but you get real insight from learning languages that are very different.
C/C++/Java
Python/Perl/Ruby/PHP (puts on flame proof gear)
Lisp/Clojure/Scheme
Smalltalk/?
APL/R/J/Mathematica
IO/Forth/Factor
Assembler (there are many different flavors of assembly)
VHDL/Verilog (not true programming languages but very interesting all the same).
Pick one or more entries from each of the above lines and you'll have a better perspective than if you left that line out.
Of course there are many more options than the ones in the list above, but you get the general idea.
Learning new languages is valuable but consider also that this time could be spent learning a new problem domain. Coding a music synthesizer or a simple web server or learning OpenGL can pay more dividends than learning one more language that you'll never really use.
Definitely worth the experience, but not really too useful for most practical programming projects. The big advantages of writing assembler come from understanding the real-time nature of a processor (at least for me). Writing code in assembler is truly an art, since at that level you can interweave code so that a block serves a purpose both in its local scope and globally (such as using a block of code as a delay for a different action, or computing a result early and caching it until it's needed later, a trick used often in video code). Great assembly will generally beat a compiler significantly, especially on embedded systems.
In my experience learning assembly language is not so much an action but more of a process. Build yourself a nice little cheapo MCU circuit (AVR, PIC, ARM), or get your hands on an emulator, and try to actually program it using assembly language. I'd highly recommend using a RISC instruction set rather than making an attempt at a CISC one like x86. Not hating on x86, but my experiences with it did not really teach me much about computers; rather, they taught me about x86 (if that makes any sense). I'm not a massive fan of how they always teach MIPS in academia, but it's a good place to start since most RISC architectures are very similar if not based on MIPS.
Also, understanding assembler will help you in understanding how CUDA works under the hood. CUDA is C-based, but underneath it's implementing some interesting SIMD stuff, so at least a basic feel for how instructions are fed into a processor could legitimately aid in optimizing code for CUDA, where a 1% difference could be very significant.
I like Rajeev Kumar's C Internals and C++ Internals succinct mini-books. He works top-down, dissecting the x86 assembly that gcc generates from C and C++. Rather than (directly) teaching assembly programming, he is "lifting the hood" for C/C++ programmers who want a better understanding of their code and the assembly you see in the debugger.
A good platform to learn assembly on is actually a graphing calculator. I used to live on ticalc.org when I was 14 just trying to absorb all that I could with the TI-83+, modifying programs like Phoenix to be how I wanted them to be and stuff, talking to Dan Englender on IRC, etc...
Assembly writing, not so much. Assembly reading, a huge win. However, the thing is you can't read assembly very well unless you've written it at some point.
The key factor is that besides the actual instructions and their syntax in assembly language you will also need to internalize the conventions for register usage and subroutine calling for a platform. Or several platforms. That's what allows you to eyeball compiler generated assembly and make sense out of it. Otherwise you'll just see moves between registers and stack, and wonder what happens next, whereas the code in fact pops the return address from the stack into the pc and continues from a whole other location.
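For example (x86-64 System V convention, a trivial function of my own; exact output varies by compiler):

    /* Knowing the convention (first two integer args arrive in
       rdi/rsi, result goes in eax) makes typical -O2 output like

           lea eax, [rdi + rsi]   ; a + b straight from the arg regs
           ret                    ; pop return address, continue there

       readable at a glance for: */
    int add(int a, int b) { return a + b; }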
I think that the best way to learn assembly these days would be to write an assembler and machine code interpreter for some known platform, such as ARM. You can write the assembler and interpreter even in Python but at the end you should be able to execute existing native binary code, though slowly. The mental investment is larger than toying around with existing toolchains but gives you much more bang for the buck with regard to understanding cpus.
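The core of such an interpreter is tiny; here is a toy fetch-decode-execute skeleton (a made-up 16-bit instruction format, not real ARM) just to show the shape of it:

    #include <stdint.h>

    enum { OP_MOV = 0x0, OP_ADD = 0x1, OP_HALT = 0xF };

    void run(const uint16_t *mem, uint32_t reg[16]) {
        uint32_t pc = 0;
        for (;;) {
            uint16_t insn = mem[pc++];        /* fetch  */
            unsigned op = insn >> 12;         /* decode */
            unsigned rd = (insn >> 8) & 0xF;
            unsigned rs = (insn >> 4) & 0xF;
            switch (op) {                     /* execute */
            case OP_MOV:  reg[rd]  = reg[rs]; break;
            case OP_ADD:  reg[rd] += reg[rs]; break;
            case OP_HALT: return;
            default:      return;             /* undefined opcode */
            }
        }
    }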
I agree, reading assembly is a great skill to have for some domains.
You'll almost never write assembly these days, except in the most niche of niches. A much more productive approach is to use intrinsics in a higher level language and examine the generated code. You don't even need to know every instruction. At this level you'll be looking out for silliness such as register spills and hardware-dependent badness (hello, load-hit-stores).
I have found it important mostly as a confidence builder.
To use and program a computer without having at least seen how it all works at a lower level felt wrong to me. I felt like I had no foundation to really understand what I was doing. I'm not claiming that I know which opcodes are being executed as I run my django dev server - that's not the point. But it's about having a clue.
Some day, if I have to dig down that deeply into my machine, I will not fear. I have seen what it looks like down there and I know which tools to bring with me.
It totally depends on what kind of programming you want to do.
- If you want to make web applications, then probably not.
- If you want to build web infrastructure, then probably.
- If you want to make desktop games, then probably.
- If you want to make mobile or console games, then yes.
- If you program in higher-level languages and want to exercise your brain, then yes (do it for your "one new language a year").
- If you want to reverse-engineer existing code (security checking, malware analysis, unofficial bug fixing [1]), then of course yes.
[1] http://www.hexblog.com/?p=21 - The author of IDA Pro disassembler wrote a hotfix for the WMF vulnerability in Windows about a week before MS released a fix. From what I've heard, his site completely tanked due to all the downloads.
I took two classes (200 and 400 level) programming the M6800 and M68HC11 processors in my undergraduate degree. The beauty of these processors is that they are very minimal and you can get a very good understanding of the entire chip. The 68HC11 has two 8-bit data registers (AR, BR), two 16-bit index registers (XR, YR), one 8-bit Condition Code register (CC), one 16-bit Stack Pointer register (SP), and one 16-bit Program Counter (PC).
Initially, when I took the 200 level course, it was fun since we had to write really small (optimized) programs and match the solution of our instructor. However, when I later took Computer Architecture/Organization classes and learned about pipelining etc., it made me imagine how an instruction like PSHA (in the 6800/6811) maps to an actual 8-bit binary number, which in turn triggers the signals in the actual gates (AND, OR) of a processor.
Some projects I recollect:
1) A LED based calculator + code for the debouncing logic(keypad)
2) A Chess Game.
3) An OS that context switched between jobs. (Can't recollect more details.)
4) Given a specific date, find out what day of the week it falls on.
If you're writing C++ code, learning x64 assembly is extremely helpful, as it will tell you exactly what is happening in a function - all of the implicit calls (destructors, operators, copy ctors, etc) are in plain sight when you read the assembly.
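A small invented example of what "in plain sight" means; none of the calls below appear as calls in the source:

    #include <string>

    std::string g(const std::string &s);

    std::string f(std::string s) {
        std::string t = s;    // copy constructor
        return g(t + "!");    // operator+ builds a temporary; its
    }                         // destructor, plus ~t and ~s, all show
                              // up as explicit calls in the assembly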
I used to say absolutely yes too, but it's been a long time since I needed to mess with it. You no longer need to write in assembly to optimize speed, or even to write drivers for any but the rarest cases that hardly anyone will run into.
Does it help you understand how things really work? Yes, as does knowing things like CPU design down to the gate level. Most don't know that (I am one who does) and I never hear it advocated that they should.
I think it makes more sense to emphasize other difficult core skills that are far more useful like knowing how to build a compiler.
> Most don't know that (I am one who does) and I never hear it advocated that they should.
You and I must not be hanging out in the same circles. I hear it advocated all the time. You can't claim to understand how CPUs work if you can't design a simple CPU. Just as you can't claim to understand how compilers work if you can't design a simple compiler. The inevitable objection to this line of thinking is, where does it end? Should you also have to study device and semiconductor physics to understand how computers work? Yes, you probably should, but you can terminate this recursion any time you feel you're not learning anything sufficiently enlightening or helpful compared to other things you could be learning.
Very few people have designed CPUs. A lot of people manage to write great software despite this. It seems to be irrelevant. Again, coming from someone who knows.
If someone is interested, yes, I absolutely encourage them to pursue that interest and yes it will deepen their understanding of how things work.
But necessary? It's self evident that it's not necessary because many don't know it and write great software despite this. And if deciding what sorts of optional things one should master, it's fairly low on the list.
I completely understand the advocacy. I used to tell people that without fluency in assembly they couldn't consider themselves competent. That might have been true when I said it. It's not true any more; it's not a fundamental skill any more because it's not necessary except for certain tasks that all but a few won't encounter. We can't reasonably make everything that only a few will ever need into a requirement for everyone; there would be too many things. It's now an optional skill.
Very few things are truly necessary to know, so I was working from the premise that such studies would be undertaken as a way of rounding out someone's knowledge. The direct practical applications of that knowledge to programming are rare, I agree.
I have taught myself to never say that you need to know X to be a great programmer. You'll note that's not what I said with respect to designing CPUs. When I was a hot-shot kid who fancied himself something special, I used to think and say to people that great programmers needed to know X, Y and Z, which just so happened to be things I enjoyed and knew a lot about. After close to a decade of working with a lot of different programmers, many good and some very great, I've found that there's really no easy one-size-fits-all set of prerequisite knowledge once you get beyond the basics that everyone agrees are important to master. I pride myself on being well-rounded and having an equal mastery of high-level and low-level elements, but I can't honestly say that by itself makes me better than someone who is more exclusively focused on either of the two ends. It's mostly a matter of personal inclination.
As a self-taught programmer with no formal CS education, learning assembly (and working my way through SICP) gave me both the confidence and a deeper understanding of computer programming.
If you're looking for a book, the best one I've found is Jonathan Bartlett's Programming from the Ground Up. I can't recommend this book highly enough. You can get a free copy from here:
I wrote tons of 6502, Z80 and later 68000 code - programs and demos for the demo scene in the 80s and 90s - and think this was a mind altering experience.
Questions like this seem a bit pointless in my opinion. I mean sure, a lot of people get to answer with their insights and opinions on said language/tool, but in the end isn't everything you learn going to benefit you and make you a better programmer? So asking if something is worth learning will probably always get a positive answer, because the more you know the better (even if you don't get to use the specific tool you ask about).
I think everybody who is a programmer should learn some assembly language...not so much for code optimization, but rather to understand how the computer + compiler work underneath the hood. You don't need to be able to write crazy advanced programs in it or even learn a popular ISA like x86 or ARM, but you should understand some simple instruction set (think of Patterson & Hennessy's undergrad textbook on Computer Organization & Design with the MIPS ISA). The point is to have some insight into how the compiler in your higher level language maps to the atomic level instructions of your computer and why certain operations are costly. For example, if you don't understand how your computer does stack management, how can you understand the overhead of a function call and the benefits of compilers inlining code? Another example for functional languages is the Funarg Problem where it is difficult to efficiently implement higher-order functions and closures in a stack-based model as found in contemporary computers. You gotta understand a little bit about it otherwise you're going to be limited in how much you can grow as a software developer.
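For instance (a toy example of mine), the cost of a function call only becomes concrete once you can picture the instructions behind it:

    static int square(int x) { return x * x; }

    int sum_squares(const int *v, int n) {
        int s = 0;
        for (int i = 0; i < n; i++)
            s += square(v[i]); /* inlined: a single imul+add in the
                                  loop body; not inlined: argument
                                  setup, call/ret, and stack frame
                                  overhead on every iteration */
        return s;
    }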
Yes, and preferably on multiple machines/VMs. The JVM and CLR have some interesting ideas, x86-64 is very useful for performance, RISC might open your mind, etc. Just do it; the cost is not very high for people who already know C or Java or C++ or something along those lines.
I had an assembly course in the first year of CS. It was really basic stuff on an ARM board, but it was definitely one of those 'a-ha!' moments for me. I learned so much about computers and how programming languages work, and had a ton of fun while at it.
I hacked around with 6809 assembly when I was a kid. Implemented the Huffman compression algorithm in it. Sometimes I wonder whether I should learn a more modern architecture, because they have changed so much.
After 10 years of Linux/Unix/some Windows administration I'm learning assembly. That's mainly because I've switched to working on a Z enterprise platform. :)