It makes me wonder why we couldn't have shortcuts for several patterns that show up everywhere:
- mov, mov, mov then call/syscall could just be written as call(arg, arg, arg); it isn't hard to figure out which argument goes in which register if there were a defined order of arguments.
- push push push push <function body> pop pop pop pop <ret> could really just be a define. I realize there are cases where you wouldn't necessarily do that, but those seem to be the minority. The assembler could just figure out which registers you use in the routine and push/pop those. If you want to keep a register, there could be added syntax for that.
In both of these cases it seems the language optimizes for simplicity and flexibility and ignores the common case. Neither of these strikes me as a situation where introducing the abstraction would require a lot of assembler "magic" to guess and optimize; it's almost just string replacement. (A rough sketch of both follows.)
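For the sake of argument, here is roughly what both shortcuts look like as NASM multi-line macros. The names are hypothetical, the push/pop pair is essentially the multipush/multipop example from the NASM manual, and the argument order assumes the System V x86-64 convention:

    ; ccall target, arg1 [, arg2 [, arg3]] -- loads up to three arguments
    ; into rdi/rsi/rdx in a fixed order, then calls the target. Naive: it
    ; breaks if a later argument already lives in an earlier register.
    %macro ccall 2-4
        mov rdi, %2
      %if %0 >= 3
        mov rsi, %3
      %endif
      %if %0 >= 4
        mov rdx, %4
      %endif
        call %1
    %endmacro

    ; multipush/multipop: save a register list on entry, restore it in
    ; reverse order on exit.
    %macro multipush 1-*
      %rep %0
        push %1
      %rotate 1
      %endrep
    %endmacro

    %macro multipop 1-*
      %rep %0
      %rotate -1
        pop %1
      %endrep
    %endmacro

Used as, say, `multipush rbx, r12, r13` at the top of a routine and `multipop rbx, r12, r13` just before the `ret`.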
Then again, you could just write all this in C and let the compiler figure it out. Interesting stuff!
In my OS/architecture class we used a textbook whose author had piled so much macro assembly on top of SPARC asm (macros all in m4, naturally) that he was, in effect, writing the book in a personal high-level language built out of gobs of m4. Like the bizarro-world version of personalized language construction in Lisp-land...
I learned assembly on the Atari 8-bit (which, like the C64 and Apple II, used a variant of the MOS 6502) with an assembler named MAC/65. MAC, of course, was short for macro. This was in the early '80s. Good times.
The usage is nonstandard, but it provides an interesting perspective. What if he were planning to run the x86 code on a virtual CPU; would his usage be correct then? Does the nature of compiled code change if it is executing on a "real" machine instead of a virtual one?
TASM from Borland had a bunch of macros/shortcut keywords for the patterns you describe, and there are dozens of library files that do similar things in MASM as well. mammon_ wrote a bunch of macros [1] that do the same thing for NASM. See how he exploits NASM's preprocessor for this in his "Extending NASM" article in the Assembly Language Journal, Feb/Mar 1998, Issue 3. He also demonstrates how to build all the usual control structures (for, if, switch/case, while, do/while, etc.) in NASM's preprocessor.
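For a flavor of how that works, here is a tiny sketch in the same spirit (my own toy version, not mammon_'s actual macros): NASM's context stack (%push/%pop) gives each block its own local labels, and %-1 expands to the opposite of a condition-code parameter:

    %macro IF 1              ; usage:   cmp eax, 0
        %push if             ;          IF ne
        j%-1 %$endif         ;              ; body runs only when eax != 0
    %endmacro                ;          ENDIF

    %macro ENDIF 0
      %$endif:
        %pop
    %endmacro

Because every IF pushes a fresh context, the pairs nest and the %$endif labels never collide.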
If you or anyone else is looking for something that takes care of a lot of the entry and exit boilerplate for functions, I'd suggest looking at x264's x86inc.asm [1]. It was designed for using SIMD in DSP functions, not for writing whole programs, but I don't think it would get in your way if you used it for that.
Plus it is BSD licensed, for those who hate the GPL.
I used this construct often to zero a register, back when memory and CPU cycles were scarce. But nowadays my time is the more valuable resource, and I tend to write:
mov rax, 0
It's a somewhat longer instruction encoding and may cost a cycle or two more, but it conveys the meaning better.
Same here. But if you weren't proficient in writing and reading asm, the statement would no longer be true.
Now, you could argue that if that were the case, what is that person doing there anyway?
Also, something like `return 0;` used to compile down to `xor eax,eax; ret`, but barely anyone I know coding these days knew that to begin with.
On modern out-of-order CPUs, xor reg1,reg2 is problematic because the result depends on the previous contents of the registers, so it cannot be executed out of order with respect to whatever produced them.
However, as a special case, xor reg1,reg1 (along with sub reg1,reg1) is detected by the CPU (Intel's, at least) as a 'zero idiom', and because it also has a smaller encoding than mov reg, 0 it's preferred.
There are also other, more involved reasons; for details see section 3.5.1.8 of the Intel optimization manual.
The classic pipeline stages are Fetch, Decode, Execute, Writeback; the 'zero idiom' instructions do not consume any Execute resources. Apart from this, as the other poster said, smaller encodings mean more instructions fit inside the caches.
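To make that concrete (byte counts are for NASM's usual output; the zero-idiom behaviour is the one described in the Intel manual section cited above):

    xor  eax, eax   ; 2 bytes (31 C0); writing eax zero-extends, clearing all
                    ; of rax; recognized at register rename as a zero idiom,
                    ; so it breaks the dependency on the old value and needs
                    ; no execution unit on recent Intel cores
    sub  ecx, ecx   ; also treated as a zero idiom
    xor  eax, ebx   ; NOT an idiom: depends on both source registers
    mov  eax, 0     ; 5 bytes (B8 00 00 00 00); clearer, but larger
    mov  rax, 0     ; typically 7 bytes (48 C7 C0 00 00 00 00) with a REX.W prefix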
You'll get way more bang for your buck by minimizing userspace→kernel round trips and memory copies than by hand-optimizing assembler code. sendfile is one great way to do this.
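To put the sendfile point in this project's terms, here is a minimal sketch of the raw syscall on Linux x86-64 (NASM syntax; 40 is the x86-64 sendfile syscall number, and the helper name and register choices are mine):

    SYS_sendfile equ 40

    ; send_file: rdi = client socket fd, rsi = open file fd, rdx = byte count
    ; returns rax = bytes sent, or a negative errno value
    send_file:
        mov  r10, rdx          ; 4th syscall arg: count
        xor  edx, edx          ; 3rd syscall arg: offset pointer = NULL, so the
                               ; kernel uses and advances the file position
        mov  eax, SYS_sendfile
        syscall                ; sendfile(out_fd=rdi, in_fd=rsi, NULL, count);
                               ; clobbers rcx and r11, like every syscall
        ret

The kernel moves the file's bytes straight to the socket, so there is no read()/write() bounce through a userspace buffer.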
You'll get even more bang for your buck by eliminating the kernel from the packet processing path by using netmap, PF_RING/DNA, or DPDK, and a user-space TCP/IP stack.
Assembly only really helps alleviate GCC's moronic decisions resulting in excessive stack spills and alignment-unaware loads & stores.
I wrote a web server in C. I'm not really a web guy, it was a lot of new stuff to me, I'm a pretty amateurish programmer, and it probably doesn't even deserve to be called a "web server". But it was fun, and really cool to know that something that I wrote (with the help/jump-off point of a tutorial or two) can be used to serve web pages to a client. I can run the program, pop open Firefox, and use the browser to click through a set of test pages as though it was being served by a real web server. That's fucking cool, and that was all the reason that I needed to do it.
Not in the general case. It's been a long time since x86 assembly developers could commonly beat a decent optimizing compiler.
The thing about compilers is that they're leveraging, even if imperfectly, the collective wisdom of their authors and of the companies who actually built the chips and have offered insight, advice, and sometimes even code. It's very probable they know more performance tricks than you do.
One problem is landmines in the ISA, such as instructions that look like they exist to be used, but are really traps implemented in suboptimal microcode for the unwary programmer who didn't look closely at their performance characteristics. Or certain sequences of instructions that might combine to do something ridiculously slow[1].
These landmines vary by microarchitecture. An instruction that's incredibly slow on one line of x86 chips might be a wonder-drug on another. This both increases the probability that your code will hit a landmine on at least some CPUs, and gives you a possible "in": Compilers aren't going to optimize perfectly for every microarchitecture. If you know exactly what you're doing (or spend a hell of a lot of time on trial and error), you might be able to come up with optimal codepaths for specific chips that the compiler didn't.
By and large it's not worth it, though. Hand-tuned assembly still ends up in places, but increasingly rarely, and it's confined to small hot-spots. A particular algorithm or part of an algorithm gets re-implemented in assembly because the compiler just can't get it right.
[1] I could have sworn there was a story about this just recently, but I can't seem to find it. Something like a piece of code running way slower than anyone thought it should, until an AMD engineer piped up and said "Oh yeah, don't do that, it causes a pipeline flush." for reasons that were utterly non-obvious to anyone who didn't know the internals of the chip.
I don't want to be too harsh - this is a fun idea, and I'm prone to silly fantasies about rewriting slow code in assembly myself - but this particular assembly doesn't take advantage of many of the "dirty tricks" that are available in low-level code.
As one example, check out the content-type detection, which is essentially a long chain of repeated strlen + strcmp; assembly language doesn't magically make bad algorithms fast.
Not to mention that long chain is ugly to read. I would rather see a macro defined and called multiple times than see the same block of code copy/pasted over and over.
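As a sketch of what that could look like (hypothetical, not the project's code): one NASM macro emits a fixed-width (extension, MIME type) record, and a single loop walks the table, so adding a content type is one line instead of another copy/pasted compare block:

    %macro MIME_ENTRY 2
      %%rec:
        db %1                            ; file extension
        times 8  - ($ - %%rec) db 0      ; pad the extension field to 8 bytes
        db %2                            ; MIME type string
        times 40 - ($ - %%rec) db 0      ; pad the whole record to 40 bytes
    %endmacro

    section .rodata
    mime_table:
        MIME_ENTRY ".html", "text/html"
        MIME_ENTRY ".css",  "text/css"
        MIME_ENTRY ".png",  "image/png"
    mime_table_end:

    section .text
    ; find_mime: rdi -> the request's extension, zero-padded to 8 bytes
    ; returns rax -> the MIME string, or 0 if nothing matched
    find_mime:
        mov  rdx, [rdi]                  ; extension to match, as one qword
        lea  rax, [rel mime_table]
        lea  rcx, [rel mime_table_end]
    .next:
        cmp  rax, rcx
        jae  .miss
        cmp  rdx, [rax]                  ; one 8-byte compare per record
        je   .hit
        add  rax, 40
        jmp  .next
    .hit:
        add  rax, 8                      ; MIME string follows the extension
        ret
    .miss:
        xor  eax, eax
        ret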
Maybe, maybe not, but for a web server, if asm vs. C is making a noticeable impact on overall performance, one of them is doing something very wrong: the server should spend most of its time in syscalls shuffling data to and from the network, not executing userland web server code.
> Is handwritten assembly faster than GCC/clang-written assembly?
Sometimes, but the biggest case is when you can carefully arrange a tight inner loop, especially one that can make use of SIMD, as in some DSP and scientific-computing code. Auto-vectorizers are getting better but still miss a lot of cases, so a skilled asm programmer can beat the compiler there. The more "spread out" the performance-critical code is (i.e. performance not dominated by one or two tight loops), the harder it is for hand-coded asm to beat a compiler; humans are not that good at whole-program optimization on large codebases. The more cross-platform the code has to be, the worse for the asm programmer as well: beating gcc's code generation on one architecture is easier than beating it everywhere.
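As a small illustration of the kind of loop where that can still pay off (a hypothetical sketch, not code from this project; System V x86-64 convention, NASM syntax), summing an array of floats four lanes at a time with SSE:

    ; sum_floats: rdi -> float array, esi = element count (nonzero, multiple of 4)
    ; returns the sum in xmm0
    sum_floats:
        xorps   xmm0, xmm0              ; four running partial sums
        xor     eax, eax
    .loop:
        movups  xmm1, [rdi + rax*4]     ; unaligned load of 4 packed singles
        addps   xmm0, xmm1
        add     eax, 4
        cmp     eax, esi
        jb      .loop
        ; horizontal add: fold the 4 lanes down into lane 0
        movaps  xmm1, xmm0
        shufps  xmm1, xmm0, 0x4E        ; swap the high and low pairs
        addps   xmm0, xmm1
        movaps  xmm1, xmm0
        shufps  xmm1, xmm0, 0xB1        ; swap within each pair
        addss   xmm0, xmm1
        ret

A compiler will often vectorize a loop this simple on its own; the cases where hand-written asm wins tend to be messier, e.g. shuffles and saturating arithmetic in DSP kernels where the auto-vectorizer gives up.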
So what is the implication of this? Does this mean this web server will be much faster and use fewer resources because it's written for Linux directly?
https://github.com/nemasu/asmttpd/blob/master/http.asm