It makes me wonder why we couldn't have shortcuts for several patterns that show up everywhere:
- mov, mov, mov then call/syscall could just be written as call(arg, arg, arg); it isn't hard to figure out which argument goes in which register if there were a defined order of arguments.
- push push push push <function body> pop pop pop pop <ret> could really just be a define. I realize there are cases where you wouldn't necessarily do that, but those seem to be the minority. The assembler could just figure out which registers you use in the routine and push/pop those. If you want to keep a register, there could be added syntax for that.
In both of these cases it seems the language optimizes for simplicity and flexibility and ignores the common case. Neither of these strikes me as a situation where introducing the abstraction would require a lot of assembler "magic" to guess and optimize; it's almost just string replacement. (A rough sketch of both follows.)
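For the sake of argument, here is roughly what both shortcuts look like as NASM multi-line macros. The names are hypothetical, the push/pop pair is essentially the multipush/multipop example from the NASM manual, and the argument order assumes the System V x86-64 convention:

    ; ccall target, arg1 [, arg2 [, arg3]] -- loads up to three arguments
    ; into rdi/rsi/rdx in a fixed order, then calls the target. Naive: it
    ; breaks if a later argument already lives in an earlier register.
    %macro ccall 2-4
        mov rdi, %2
      %if %0 >= 3
        mov rsi, %3
      %endif
      %if %0 >= 4
        mov rdx, %4
      %endif
        call %1
    %endmacro

    ; multipush/multipop: save a register list on entry, restore it in
    ; reverse order on exit.
    %macro multipush 1-*
      %rep %0
        push %1
      %rotate 1
      %endrep
    %endmacro

    %macro multipop 1-*
      %rep %0
      %rotate -1
        pop %1
      %endrep
    %endmacro

Used as, say, `multipush rbx, r12, r13` at the top of a routine and `multipop rbx, r12, r13` just before the `ret`.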
Then again, you could just write all this in C and let the compiler figure it out. Interesting stuff!
In my OS/architecture class we used a textbook whose author had piled so much macro assembly on top of SPARC asm (macros all in m4, naturally) that he was, in effect, writing the book in a personal high-level language built out of gobs of m4. Like the bizarro-world version of personalized language construction in Lisp-land...
I learned assembly on the Atari 8-bit (which, like the C64 and Apple II, used a variant of the MOS 6502) with an assembler named MAC/65. MAC, of course, was short for macro. This was in the early '80s. Good times.
The usage is nonstandard, but it provides an interesting perspective. What if he were planning to run the x86 code on a virtual CPU; would his usage be correct then? Does the nature of compiled code change if it is executing on a "real" machine instead of a virtual one?
TASM from Borland had a bunch of macros/shortcut keywords for the patterns you describe, and there are dozens of library files that do similar things in MASM as well. mammon_ wrote a bunch of macros [1] that do the same thing for NASM. See how he exploits NASM's preprocessor for this in his "Extending NASM" article in the Assembly Language Journal, Feb/Mar 1998, Issue 3. He also demonstrates how to build all the usual control structures (for, if, switch/case, while, do/while, etc.) in NASM's preprocessor.
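For a flavor of how that works, here is a tiny sketch in the same spirit (my own toy version, not mammon_'s actual macros): NASM's context stack (%push/%pop) gives each block its own local labels, and %-1 expands to the opposite of a condition-code parameter:

    %macro IF 1              ; usage:   cmp eax, 0
        %push if             ;          IF ne
        j%-1 %$endif         ;              ; body runs only when eax != 0
    %endmacro                ;          ENDIF

    %macro ENDIF 0
      %$endif:
        %pop
    %endmacro

Because every IF pushes a fresh context, the pairs nest and the %$endif labels never collide.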
If you or anyone else is looking for something that takes care of a lot of the entry and exit boilerplate for functions, I'd suggest looking at x264's x86inc.asm [1]. It was designed for using SIMD in DSP functions, not for writing whole programs, but I don't think it would get in your way if you used it for that.
Plus it is BSD licensed, for those who hate the GPL.
I used this construct often to zero a register, back when memory and CPU cycles were scarce. But nowadays my time is the more valuable resource, and I tend to write:
mov rax, 0
It's a somewhat longer instruction encoding and may cost a cycle or two more, but it conveys the meaning better.
Same here. But if you weren't proficient in writing and reading asm, the statement would no longer be true.
Now, you could argue that if that were the case, what is that person doing there anyway?
Also, something like `return 0;` used to compile down to `xor eax,eax; ret`, but barely anyone I know coding these days knew that to begin with.
On modern out-of-order CPUs, xor reg1,reg2 is problematic because the result depends on the previous contents of the registers, so it cannot be executed out of order with respect to whatever produced them.
However, as a special case, xor reg1,reg1 (along with sub reg1,reg1) is detected by the CPU (Intel's, at least) as a 'zero idiom', and because it also has a smaller encoding than mov reg, 0 it's preferred.
There are also other, more involved reasons; for details see section 3.5.1.8 of the Intel optimization manual.
The classic pipeline stages are Fetch, Decode, Execute, Writeback; the 'zero idiom' instructions do not consume any Execute resources. Apart from this, as the other poster said, smaller encodings mean more instructions fit inside the caches.
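To make that concrete (byte counts are for NASM's usual output; the zero-idiom behaviour is the one described in the Intel manual section cited above):

    xor  eax, eax   ; 2 bytes (31 C0); writing eax zero-extends, clearing all
                    ; of rax; recognized at register rename as a zero idiom,
                    ; so it breaks the dependency on the old value and needs
                    ; no execution unit on recent Intel cores
    sub  ecx, ecx   ; also treated as a zero idiom
    xor  eax, ebx   ; NOT an idiom: depends on both source registers
    mov  eax, 0     ; 5 bytes (B8 00 00 00 00); clearer, but larger
    mov  rax, 0     ; typically 7 bytes (48 C7 C0 00 00 00 00) with a REX.W prefix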
You'll get way more bang for your buck by minimizing userspace→kernel round trips and memory copies than by hand-optimizing assembler code. sendfile is one great way to do this.
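To put the sendfile point in this project's terms, here is a minimal sketch of the raw syscall on Linux x86-64 (NASM syntax; 40 is the x86-64 sendfile syscall number, and the helper name and register choices are mine):

    SYS_sendfile equ 40

    ; send_file: rdi = client socket fd, rsi = open file fd, rdx = byte count
    ; returns rax = bytes sent, or a negative errno value
    send_file:
        mov  r10, rdx          ; 4th syscall arg: count
        xor  edx, edx          ; 3rd syscall arg: offset pointer = NULL, so the
                               ; kernel uses and advances the file position
        mov  eax, SYS_sendfile
        syscall                ; sendfile(out_fd=rdi, in_fd=rsi, NULL, count);
                               ; clobbers rcx and r11, like every syscall
        ret

The kernel moves the file's bytes straight to the socket, so there is no read()/write() bounce through a userspace buffer.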
You'll get even more bang for your buck by eliminating the kernel from the packet processing path by using netmap, PF_RING/DNA, or DPDK, and a user-space TCP/IP stack.
Assembly only really helps alleviate GCC's moronic decisions resulting in excessive stack spills and alignment-unaware loads & stores.
I wrote a web server in C. I'm not really a web guy, it was a lot of new stuff to me, I'm a pretty amateurish programmer, and it probably doesn't even deserve to be called a "web server". But it was fun, and really cool to know that something that I wrote (with the help/jump-off point of a tutorial or two) can be used to serve web pages to a client. I can run the program, pop open Firefox, and use the browser to click through a set of test pages as though it was being served by a real web server. That's fucking cool, and that was all the reason that I needed to do it.
Not in the general case. It's been a long time since x86 assembly developers could commonly beat a decent optimizing compiler.
The thing about compilers is that they're leveraging, even if imperfectly, the collective wisdom of their authors and of the companies who actually built the chips and have offered insight, advice, and sometimes even code. It's very probable they know more performance tricks than you do.
One problem is landmines in the ISA, such as instructions that look like they exist to be used, but are really traps implemented in suboptimal microcode for the unwary programmer who didn't look closely at their performance characteristics. Or certain sequences of instructions that might combine to do something ridiculously slow[1].
These landmines vary by microarchitecture. An instruction that's incredibly slow on one line of x86 chips might be a wonder-drug on another. This both increases the probability that your code will hit a landmine on at least some CPUs, and gives you a possible "in": Compilers aren't going to optimize perfectly for every microarchitecture. If you know exactly what you're doing (or spend a hell of a lot of time on trial and error), you might be able to come up with optimal codepaths for specific chips that the compiler didn't.
By and large it's not worth it, though. Hand-tuned assembly still ends up in places, but increasingly rarely, and it's confined to small hot-spots. A particular algorithm or part of an algorithm gets re-implemented in assembly because the compiler just can't get it right.
[1] I could have sworn there was a story about this just recently, but I can't seem to find it. Something like a piece of code running way slower than anyone thought it should, until an AMD engineer piped up and said "Oh yeah, don't do that, it causes a pipeline flush." for reasons that were utterly non-obvious to anyone who didn't know the internals of the chip.
I don't want to be too harsh - this is a fun idea, and I'm prone to silly fantasies about rewriting slow code in assembly myself - but this particular assembly doesn't take advantage of many of the "dirty tricks" that are available in low-level code.
As one example, check out the content-type detection, which is essentially a long chain of repeated strlen + strcmp; assembly language doesn't magically make bad algorithms fast.
Not to mention that long chain is ugly to read. I would rather see a macro defined and called multiple times than see the same block of code copy/pasted over and over.
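As a sketch of what that could look like (hypothetical, not the project's code): one NASM macro emits a fixed-width (extension, MIME type) record, and a single loop walks the table, so adding a content type is one line instead of another copy/pasted compare block:

    %macro MIME_ENTRY 2
      %%rec:
        db %1                            ; file extension
        times 8  - ($ - %%rec) db 0      ; pad the extension field to 8 bytes
        db %2                            ; MIME type string
        times 40 - ($ - %%rec) db 0      ; pad the whole record to 40 bytes
    %endmacro

    section .rodata
    mime_table:
        MIME_ENTRY ".html", "text/html"
        MIME_ENTRY ".css",  "text/css"
        MIME_ENTRY ".png",  "image/png"
    mime_table_end:

    section .text
    ; find_mime: rdi -> the request's extension, zero-padded to 8 bytes
    ; returns rax -> the MIME string, or 0 if nothing matched
    find_mime:
        mov  rdx, [rdi]                  ; extension to match, as one qword
        lea  rax, [rel mime_table]
        lea  rcx, [rel mime_table_end]
    .next:
        cmp  rax, rcx
        jae  .miss
        cmp  rdx, [rax]                  ; one 8-byte compare per record
        je   .hit
        add  rax, 40
        jmp  .next
    .hit:
        add  rax, 8                      ; MIME string follows the extension
        ret
    .miss:
        xor  eax, eax
        ret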
Maybe, maybe not, but for a web server, if asm vs. C is making a noticeable impact on overall performance, one of them is doing something very wrong: the server should spend most of its time in syscalls shuffling data to and from the network, not executing userland web server code.
> Is handwritten assembly faster than GCC/clang-written assembly?
Sometimes, but the biggest case is when you can carefully arrange a tight inner loop, especially one that can make use of SIMD, as in some DSP and scientific-computing code. Auto-vectorizers are getting better but still miss a lot of cases, so a skilled asm programmer can beat the compiler there. The more "spread out" the performance-critical code is (i.e. performance not dominated by one or two tight loops), the harder it is for hand-coded asm to beat a compiler; humans are not that good at whole-program optimization on large codebases. The more cross-platform the code has to be, the worse for the asm programmer as well: beating gcc's code generation on one architecture is easier than beating it everywhere.
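As a small illustration of the kind of loop where that can still pay off (a hypothetical sketch, not code from this project; System V x86-64 convention, NASM syntax), summing an array of floats four lanes at a time with SSE:

    ; sum_floats: rdi -> float array, esi = element count (nonzero, multiple of 4)
    ; returns the sum in xmm0
    sum_floats:
        xorps   xmm0, xmm0              ; four running partial sums
        xor     eax, eax
    .loop:
        movups  xmm1, [rdi + rax*4]     ; unaligned load of 4 packed singles
        addps   xmm0, xmm1
        add     eax, 4
        cmp     eax, esi
        jb      .loop
        ; horizontal add: fold the 4 lanes down into lane 0
        movaps  xmm1, xmm0
        shufps  xmm1, xmm0, 0x4E        ; swap the high and low pairs
        addps   xmm0, xmm1
        movaps  xmm1, xmm0
        shufps  xmm1, xmm0, 0xB1        ; swap within each pair
        addss   xmm0, xmm1
        ret

A compiler will often vectorize a loop this simple on its own; the cases where hand-written asm wins tend to be messier, e.g. shuffles and saturating arithmetic in DSP kernels where the auto-vectorizer gives up.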
So what is the implication of this? Does this mean this web server will be much faster and use fewer resources because it's written for Linux directly?
https://github.com/nemasu/asmttpd/blob/master/http.asm