Asmttpd – Web server for Linux written in amd64 assembly (github.com/nemasu)
159 points by pykello on May 19, 2015 | 95 comments


I'll try and get this building for OSX.

For the uninitiated, might I recommend my: http://nickdesaulniers.github.io/blog/2014/04/18/lets-write-...

Though, this is written in yasm syntax, which is slightly different.

Also, keep an eye out for a blog post on Interpreters, Compilers, and JITs I'm working on (cleaning it up and getting it peer reviewed this or next week)!

update 1 Actually, would the syscalls be different between Linux and OSX? Let's find out, once this builds! *hammers away*

update 2 Got it building and linking. Bus error when run; debugging with gdb.

update 3 Can't generate dwarf2 debug symbols for OSX? $ yasm -g dwarf2

update 4 Careful, this tries to listen on port 80 [0] (0x5000 (LE) == 5*16^1 == 80); I would never run an assembly program off the web with elevated privileges. I recommend 0xB80B (LE, port 3000).
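
For reference, a quick C sketch of the byte-order arithmetic (assuming a little-endian machine; sin_port is stored in network byte order, which is big-endian):

    #include <arpa/inet.h>
    #include <stdio.h>

    int main(void) {
        unsigned short port_word = 0x5000;  /* the constant stuffed into sin_port */
        /* On a little-endian machine this is stored as bytes 00 50, which
           network (big-endian) byte order reads back as 0x0050 == 80. */
        printf("%u\n", (unsigned)ntohs(port_word));  /* prints 80 here */
        return 0;
    }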

update 5

> Actually, would the syscalls be different between Linux and OSX?

Looks like yes: http://unix.stackexchange.com/a/3350 These might be close to shim out (OSX and Linux at least share a calling convention, unlink Windows). I'll upstream what I have.

[0] https://github.com/nemasu/asmttpd/blob/master/main.asm#L24


Freudian slip?

unlink(Windows) indeed...


It's been closer to 20 years since I last read a complete program in x86 assembly, so this is quite fun to look at.

I'm somehow disappointed (quite unreasonably, of course) that the code uses plain old zero-terminated C strings instead of something more exotic. One of the fun things about assembly is that you get to reinvent basic language features on the fly -- calling conventions, data layout, strings, everything.


It needs to do so to interoperate with the OS, so using those avoids having multiple conventions and converting between them.


AFAIK Linux doesn't use zero-terminated strings anywhere in its syscalls, or at least not in those like write, where you pass a size alongside the buffer.


However (almost?) all syscalls dealing with filesystem paths take null-terminated strings. See for example the implementation of the open() syscall:

https://github.com/torvalds/linux/blob/fb65d872d7a8dc629837a...

(Hence the need for the strncpy_from_user()-function: https://github.com/torvalds/linux/blob/fb65d872d7a8dc629837a...)
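
A tiny C illustration of the two conventions side by side (hypothetical file name, just for show):

    #include <fcntl.h>
    #include <unistd.h>

    int main(void) {
        /* Path argument: the kernel finds its end by the NUL terminator. */
        int fd = open("/tmp/demo.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return 1;

        /* Buffer argument: an explicit length, so embedded NULs are fine. */
        write(fd, "bytes\0with\0nuls", 15);

        close(fd);
        return 0;
    }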


Ahh, I totally missed the path arguments, my bad.


The write() syscall writes a sequence of bytes (not a string), so it cannot use a zero-terminated convention.


open takes a NUL-terminated filename.


I'm surprised it doesn't do length-prefixed strings with a null terminator anyways. Makes a whole lot of things easier.


As a general principle, bugs are reduced when you use representations that don't allow inconsistent states. Unless you have an overriding reason, it's best to use a data representation that doesn't allow non-canonical forms (if such a representation exists).

If you null-terminate a length-prefixed string, what if there's a null in the middle of the string?

(1) Allow inconsistency, and go with the length prefix in case of inconsistency. You could allow null bytes in the middle, treating it as a normal length-prefixed string, but then why do you null-terminate the string? (Is it so that you can still pass the string to functions that will choke on embedded nulls? Why would you do that?) This is just asking for kernel bugs.

(2) Allow inconsistency and go with the position of the first null byte in the case of inconsistency. If the length prefix is inconsistent with the position of the first null byte, you could go with the position of the first null byte, but then why even have the length prefix?

(3) Disallow inconsistency. You could disallow embedded nulls, but then the length prefix is just there as a place to cache strlen calls? If you're defining a syscall interface and requiring the length and first null to be consistent, then you need to run strlen anyway in order to sanity check what userspace gave you... why not simplify the external interface to just be either null-terminated or length-prefixed?


I should have specified a little more, perhaps. Don't think of it as a null-terminated + length-prefixed string. It's effectively a purely length-prefixed string. There just happens to always be a null one byte after the end of the string.

The length prefix is the only thing you use, ordinarily. The only time the null comes into play is if you've already had a bug.

Think of it like a stack protector.
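
Something like this in C terms (a made-up layout, just to illustrate; the length is the source of truth and the trailing NUL is only a guard byte):

    #include <stdlib.h>
    #include <string.h>

    struct lpstr {
        size_t len;   /* authoritative length */
        char  *data;  /* data[len] is always '\0', purely as a tripwire */
    };

    static struct lpstr lpstr_from(const char *s) {
        struct lpstr r;
        r.len  = strlen(s);
        r.data = malloc(r.len + 1);
        if (r.data) {
            memcpy(r.data, s, r.len);
            r.data[r.len] = '\0';  /* the "stack protector" byte */
        } else {
            r.len = 0;
        }
        return r;
    }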


> Think of it like a stack protector.

... but one that doesn't terminate execution and instead hides your bugs. In most use cases, I'd prefer to find my bugs rather than have them hidden except in corner cases.


[flagged]


Please don't make usernames that attack another user. It's uncivil and distracts from the topic.


Out of curiosity, what would you have done for strings?


Well, for an HTTP server, I don't have a specific idea... But in general, the fun part would be trying to come up with string representations that are optimized for the particular application.

The original 1984 Elite computer game is famous for its huge galaxy full of planets. Each of them had individual names and descriptions such as "Lave is most famous for its vast rain forests and the Laveian tree grub."

Yet those strings were never stored as plain strings. The game had to run in 32kB of memory, so almost all strings were stored in a tokenized form and expanded using a pseudo-random number generator:

http://wiki.alioth.net/index.php/Random_number_generator

That article shows how the planet description strings were stored and reconstructed on the fly. The base representation for the aforementioned description of planet Lave was only a handful of bytes: "\x8F is \x97"

So I think Elite is a pretty good example of an application written in assembly that didn't have anything like a generic string type.
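
Not Elite's actual format, but a hypothetical C sketch of the general tokenization idea — bytes with the high bit set index a token table whose entries can themselves contain tokens, so a few bytes expand into a whole sentence:

    #include <stdio.h>

    /* Made-up token table; the real game combined tokens with a PRNG. */
    static const char *tokens[] = {
        /* 0x80 */ "Lave",
        /* 0x81 */ "most famous for its",
        /* 0x82 */ "vast rain forests",
        /* 0x83 */ "\x80 is \x81 \x82.",   /* tokens can nest */
    };

    static void expand(const unsigned char *s) {
        for (; *s; s++) {
            if (*s >= 0x80)
                expand((const unsigned char *)tokens[*s - 0x80]);  /* recurse */
            else
                putchar(*s);
        }
    }

    int main(void) {
        expand((const unsigned char *)"\x83\n");  /* one token byte -> a sentence */
        return 0;
    }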


Indeed! Full details of the string routine at http://xania.org/201406/elites-crazy-string-format if you're interested in quite how mad it was!


That is awfully cool stuff, thank you for that! But to be honest, none of it is really specific to assembly. You can do the same thing in C/C++/Ada/Rust/pick your favorite systems language, or even in a higher-level language if you don't care so much about data representation.


I would argue this has nothing to do with ASM; one could achieve the same compression in any programming language. The bottom of the linked article has a nice version in a few lines of Python.


As others have said, keep a length field with the string. This also has advantages other than making buffer overflows a lot harder, such as making string copying faster and easier. For example, here's a skeletonized view of a normal string copy routine in assembly (disclaimer: my assembly is rusty, so this may not be completely right; void where prohibited):

        push rax              ; save our registers
        push rdi
        push rsi
        mov rsi, location     ; get the pointer to the source
        mov rdi, destination

    beginning:
        mov al, [rsi]         ; load one byte so we can compare it
        test al, al           ; if it's zero, the zero flag gets set
        jz done               ; hit the terminator, we're done
        movsb                 ; copy the byte, increment rsi and rdi
        jmp beginning

    done:
        pop rsi               ; restore our registers
        pop rdi
        pop rax

In contrast, with a length parameter, we can do

        push rcx              ; using a different register here
        push rdi
        push rsi
        mov rsi, location     ; same as before
        mov rcx, length       ; move the length into the counter register
        mov rdi, destination

        cld                   ; our first change: clear the direction flag so
                              ; the copy goes from the first byte to the last
        rep movsb             ; repeats movsb rcx times, then carries on

        pop rsi               ; restore our registers
        pop rdi
        pop rcx
Having the length of strings means you can have much more concise code. It makes loops easier, makes your code cleaner, and in some environments, gives a speed boost.


Handling strings with (ptr, length) also means that some string copies can be entirely avoided. Text can be chopped, shared and extended by adjusting the two parameters, while the underlying storage remains untouched.

e.g. a really basic example for a web server would be splitting up a URL into a path and query string: both strings can use the underlying URL without any copying.
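
In C terms, something like this hypothetical split (names made up; no allocation, both views alias the request buffer):

    #include <stddef.h>
    #include <string.h>

    struct strview { const char *ptr; size_t len; };

    /* Split a request target like "/index.html?q=1" into path and query.
       Both results point into the original buffer: nothing is copied. */
    static void split_target(struct strview url,
                             struct strview *path, struct strview *query) {
        const char *q = memchr(url.ptr, '?', url.len);
        if (q) {
            path->ptr  = url.ptr;
            path->len  = (size_t)(q - url.ptr);
            query->ptr = q + 1;
            query->len = url.len - path->len - 1;
        } else {
            *path      = url;
            query->ptr = url.ptr + url.len;
            query->len = 0;
        }
    }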


Lots to add to that if you want: sanity-check the length against the external buffer limit; move a word/dword/mmx128 at a time; consider alignment; don't use the deprecated movs instruction.

In fact, I've never seen the best, alignment-sensitive solution written anywhere. It would load two large words, shift them to the target alignment if needed, store, load, repeat. This would guarantee aligned fetches/stores and still do whole-bus operations.

I've waited to see an instruction to do this in any machine ever (why do we have to hand-code this kind of thing, when the processor chip KNOWS the best way to get it done?) I've waited 20 years.


You mean vectorizing memcmp/memcpy to work word-at-a-time instead of byte-at-a-time? Google does this when compiling their binaries, and I think Facebook does too. I thought the latter had open-sourced it with Folly but couldn't find a code pointer. I've heard rumors that LLVM can sometimes vectorize them to use SIMD instructions when available, too.

It's hard to do this safely for strcpy/strcmp because you might read past the end of the buffer when trying to test against a null terminator. memcmp/memcpy and length-prefixed blocks let you use a Duff's-Device-like construct to test only the last word byte-by-byte.
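
A rough C sketch of that shape (not Folly's or Google's actual code, and using a plain byte loop for the tail rather than a Duff's device): compare word-at-a-time and only fall back to bytes for the final partial word, so nothing past the given length is ever read.

    #include <stdint.h>
    #include <string.h>

    int word_memcmp(const void *a, const void *b, size_t len) {
        const unsigned char *pa = a, *pb = b;
        size_t i = 0;
        for (; i + sizeof(uint64_t) <= len; i += sizeof(uint64_t)) {
            uint64_t wa, wb;
            memcpy(&wa, pa + i, sizeof wa);   /* memcpy avoids alignment traps */
            memcpy(&wb, pb + i, sizeof wb);
            if (wa != wb)                     /* mismatch: find the exact byte */
                return memcmp(pa + i, pb + i, sizeof wa);
        }
        for (; i < len; i++)                  /* byte-by-byte tail */
            if (pa[i] != pb[i])
                return (int)pa[i] - (int)pb[i];
        return 0;
    }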


In a paged system, reading past the end of the string, but no further than the last whole aligned word containing the null terminator, will never page fault. So that can still work.


> don't use the deprecated movs instruction

> I've waited to see an instruction to do this in any machine ever (why do we have to hand-code this kind of thing, when the processor chip KNOWS the best way to get it done?) I've waited 20 years.

Look up "enhanced REP MOVSB"; this link may also be interesting reading: https://software.intel.com/en-us/forums/topic/275765

REP STOS (memset) has also gotten the same boost throughout the generations of x86, and if the trend continues I'd expect REP CMPS and LODS to get the same treatment. These string instructions are tiny (1-2 bytes) and yet very powerful; their greatest advantage is that they don't take up the astoundingly large amount of space in the icache that some extremely micro-optimised routines (i.e. ridiculous amounts of loop unrolling) do.


Turbo/Borland Pascal used a NUL-optional string format of length (byte IIRC) followed by data. [0]

Java IIRC also uses a length-oriented format for string constants in .class files. [1] It's been a while since I wrote a .java to MIPS asm compiler in C++ from scratch (don't ask).

This is because real-world strings may contain 0 to N NULs and escaping them is too much of a PITA for serialized formats, so it's easier and common to do things like TYPE LENGTH DATA de/serialization. For modern, efficient binary de/ser, check out binc and msgpack [2,3].

0: http://math.uww.edu/~harrisb/courses/cs171/strings.html

1: https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.ht...

2: http://msgpack.org/

3: https://github.com/ugorji/binc/blob/master/SPEC.md
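
As a sketch of the TYPE LENGTH DATA framing mentioned above (a made-up, fixed 4-byte little-endian length field; not binc's or msgpack's actual wire format):

    #include <stdint.h>
    #include <string.h>

    /* Writes: 1 type byte, 4-byte little-endian length, then the payload
       (which may freely contain NULs). Returns the number of bytes written. */
    size_t tlv_write(uint8_t *out, uint8_t type, const void *data, uint32_t len) {
        out[0] = type;
        out[1] = (uint8_t)(len);
        out[2] = (uint8_t)(len >> 8);
        out[3] = (uint8_t)(len >> 16);
        out[4] = (uint8_t)(len >> 24);
        memcpy(out + 5, data, len);
        return 5 + (size_t)len;
    }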


Another option for dealing with nuls in a string:

http://en.m.wikipedia.org/wiki/Consistent_Overhead_Byte_Stuf...

Basically, it eliminates zero bytes from a byte stream so that a zero can be used as a terminator.
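
A minimal C sketch of the encoder side, following the scheme that article describes (the caller appends the 0x00 frame delimiter and sizes dst to at least len + len/254 + 1 bytes):

    #include <stddef.h>
    #include <stdint.h>

    size_t cobs_encode(const uint8_t *src, size_t len, uint8_t *dst) {
        size_t out = 1, code_pos = 0;   /* dst[0] holds the first length code */
        uint8_t code = 1;
        for (size_t i = 0; i < len; i++) {
            if (src[i] == 0) {          /* zero byte: close the current block */
                dst[code_pos] = code;
                code_pos = out++;
                code = 1;
            } else {
                dst[out++] = src[i];
                if (++code == 0xFF) {   /* block full (254 data bytes) */
                    dst[code_pos] = code;
                    code_pos = out++;
                    code = 1;
                }
            }
        }
        dst[code_pos] = code;  /* may emit an extra trailing 0x01 code when the
                                  input ends exactly on a full block; still decodes */
        return out;
    }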


You could have a (pointer, length) pair. That's how it's done in some non-C languages.


Pointer, length, encoding. In 2015, giving me a bag of bytes labeled "string" is about as meaningful as saying it's "music" or "a picture".


It's 2015. UTF-8 won.


UTF-8 is not well suited for a general purpose string implementation because it is a variable length encoding and therefore addressing a character becomes a linear time operation. UTF-16 would probably be a better choice in most cases.


UTF-16 is also a variable length encoding and addressing a character is a linear time operation. Then again, even UTF-32 can have composed characters, such as ¨a separately forming ä.


True, but UTF-16 captures really a very large share of actually used code points. UTF-32 captures all of them but wastes at least 11 bits per code point, more right now because most of the code points are unassigned. It seems a good tradeoff for a lot of use cases, and as you mentioned, once operating on code points is no longer good enough you will have to face the issue regardless of the encoding.


UTF-8 consumes less space and has pretty much the same trade-offs in terms of iteration. No worries about endianness (UTF-16LE vs UTF-16BE). Almost all input text is in UTF-8 and so is almost all output text. Converting back and forth to UTF-16 is just wasted CPU time.

I think even counting number of UTF-8 code points in a string is faster in UTF-8 than in UTF-16, if you're allowed to use SSE2/AVX2/AVX-512, because all UTF-8 sequences start with a byte that has highest bit 0, all other bytes in the sequence have highest bit 1.

So just SIMD vector compare [1] to find all "positive" bytes (highest bit == 0), which gives you a nice mask. Then move the mask to a general purpose register [2] and popcount [3] it. 16/32/64 (SSE2/AVX2/AVX-512) bytes processed at a time, no branches other than loop control branch.

You can use the same idea to quickly scan UTF-8 string to approximately right position to retrieve a given (random) code point index. Still O(n), but with 10-50x smaller constant factor. If that's not enough, you can simply pre-index every n-th code point (say, every 64/128/256th) in a separate array for larger UTF-8 strings. That gives you constant time random access.

[1]: http://www.felixcloutier.com/x86/PCMPGTB:PCMPGTW:PCMPGTD.htm...

[2]: http://www.felixcloutier.com/x86/PMOVMSKB.html

[3]: http://www.felixcloutier.com/x86/POPCNT.html

Note: UTF-8 code points start with either 0xxxxxxx or 11xxxxxx. Regardless, the basic idea should work; you just need to do two compares and bitwise-or the masks. AVX-512 would of course use a mask register (k0-k7), and AVX2 would probably need to extract the mask in two parts, once for each 128-bit register half.

Note 2: Thinking about it a bit more, I think it's enough to check whether the signed bytes are greater than or equal to -64 (0xC0)! This covers bit patterns from 11000000 (-64) to 01111111 (127): all the bytes that can start a sequence. So no two compares and bitwise-or needed after all.
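
A minimal SSE2 sketch of that single-compare version (C with intrinsics; assumes GCC/Clang for __builtin_popcount, with a scalar loop handling the tail):

    #include <emmintrin.h>  /* SSE2 */
    #include <stddef.h>
    #include <stdint.h>

    /* Count code points by counting lead bytes: any byte whose signed value
       is > -65 (i.e. 0x00-0x7F or 0xC0-0xFF) starts a sequence; continuation
       bytes (0x80-0xBF, i.e. -128..-65) do not. */
    size_t utf8_count_codepoints(const uint8_t *s, size_t len) {
        const __m128i threshold = _mm_set1_epi8(-65);
        size_t count = 0, i = 0;
        for (; i + 16 <= len; i += 16) {
            __m128i chunk   = _mm_loadu_si128((const __m128i *)(s + i));
            __m128i is_lead = _mm_cmpgt_epi8(chunk, threshold);  /* 0xFF per lead byte */
            count += __builtin_popcount((unsigned)_mm_movemask_epi8(is_lead));
        }
        for (; i < len; i++)                 /* scalar tail */
            count += ((int8_t)s[i] > -65);
        return count;
    }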


This page has the best implementation I've seen: http://www.daemonology.net/blog/2008-06-05-faster-utf8-strle...

It basically counts continuation bytes (which all start 10xxxxxx) and subtracts, rather than trying to count characters.

Additionally, if you know how many bytes are in the string, you can remove the check for the null terminator.


Tested my idea quickly. The initial messy (but correct) version finishes in 31% of the time that the version you linked [1] (cp_strlen_utf8) takes to run. I think I can still improve it quite a bit.

Only tested with long 30 MB strings.

Edit: Now at 26%. But it can still be improved more. Both benchmarks are with hot cache.

  new_strlen_utf8 12352856 clock cycles
  cp_strlen_utf8  47544818 clock cycles
Edit 2: Well, 4x performance in 32-bit, but compiling 64-bit in VS2015RC manages to optimize cp_strlen_utf8 more, almost doubling its performance; 45% then. Will try gcc 5, clang, etc. later. And it can still be optimized further.

Edit 3: Ended up at 35% (2.9x) of the execution time for 64-bit and 18% (5.5x) for 32-bit. My version is as fast in 32- and 64-bit, but cp_strlen_utf8 benefits quite a bit from 64-bit mode. Probably memory-bandwidth limited at this point, but I didn't profile yet. In any case, it does UTF-8 code point strlen at 16 GB/s. The CPU is an i5-4430 @ 3.00GHz, two memory channels @ 1600 MHz.

[1]: http://www.daemonology.net/blog/2008-06-05-faster-utf8-strle...


> and therefore addressing a character becomes a linear time operation. UTF-16 would probably be a better choice in most cases.

> but UTF-16 captures really a very large share of actually used code points.

That most characters[1] in use are a single code unit in UTF-16 is meaningless to code that needs to index[2] into a UTF-16 string by code point (or grapheme): the only correct way to accomplish this in a typical UTF-16 string implementation is O(n).

[1]: I love emoji, and they are outside the BMP.

[2]: I think you'll find that most code does not need to index into a string. (Though languages that lack iterators on strings will make writing the code without indexing difficult.)


UTF-8, UTF-16 and UTF-32 are different encodings of the same character set, so I'm not sure what you mean by "UTF-16 captures really a very large share of actually used code points".

Next, your claim that "UTF-8 is not well suited for a general purpose string implementation because it is a variable length encoding and therefore addressing a character becomes a linear time operation" is incoherent: UTF-16 is a variable-length encoding just as well. Come out and say that you want to be lazy and pretend surrogate pairs don't exist.


Then you're not actually talking about UTF-16 but rather UCS-2


Well now that we have supplementary planes, UTF-16 should also be considered variable-length.


I've used the Rust and Swift languages a little bit and I must admit that I almost never had to use indexing for characters. Iterating and slicing are enough for most algorithms.

Actually, today it might be hard to even define what indexing means, even if you store your string in UTF-32. There are graphical symbols that may occupy a variable number of UTF-32 code units, and they are used out there (e.g. flags).


If only Java and NT would get the message...


Encoding of string types can be (and often is) implicit within the context of a program.

Encoding of an opaque byte array is a different story. E.g. Python 2's unicode vs string.


I remember one program that had a string print subroutine that took zero args. It just used the return address from the stack to grab the nul-terminated string immediately following the JSR/CALL instruction. It then patched the return address on stack to return just after the nul.

Bad for storing data in .text, but still a neat hack that shaved whole tens of bytes off the program size.


It needs to be security-hole compatible.


I wrote httpdito, a web server for Linux in 386 assembly, a couple of years ago (mostly outdated discussion is at https://news.ycombinator.com/item?id=6908064; a README is at http://canonical.org/~kragen/sw/dev3/httpdito-readme) and I was happy to get the executable under 2000 bytes. I actually used it the other day to test a SPA, although for some things its built-in set of MIME-types leaves something to be desired.

But it doesn't have default documents, different kinds of error responses, TCP_CORK, sendfile() usage, content-range handling, or even request logging. So asmttpd is way more full-featured than httpdito, and it's still under 6K.

(...httpdito possibly doesn't have any bugs, either, though ☺)



Days ago I saw a lightweight httpd written in C here, yesterday a C++ header-only httpd library caught my eye, and now an httpd in assembly. I'm curious what will come next...


Oh, I bet someone writes httpd in vhdl or verilog... Unbeatable header parsing time I'm sure.

Or maybe in CSS. https://news.ycombinator.com/item?id=9567183


> Oh, I bet someone writes httpd in vhdl or verilog

I would love to see that.


Probably one written in JavaScript, I'm guessing.


That already exists, Node.js.


Where the web-serving part is written in C.


Shameless plug for a companion IRC bot in ARM assembly: https://github.com/wyc/armbot


The name is confusing; the first thing I thought of was an SMTP server.


A little strange to see "Sendfile can hang if GET is cancelled." in the readme and no corresponding issue, not even one closed as "wontfix". Sounds like a DoS?


Why would someone write a web server in assembly? Just for fun?


Exactly. Because why not?


I have a few devices with 4 to 16 MB of flash - an 8 KB web server would be very useful*

*Granted, it's the wrong arch for those, MIPS would help me.


Probably to lower overhead associated with C language features, similar to the reason why many people write things in C instead of a higher level language.


To avoid overhead, people implement specialized compilers suited to the task at hand. If anything, going down to C, and especially assembly, will hurt performance, as low-level code is much harder to optimize for obvious reasons. Above all of that, real-world performance comes from proper system-level design, not micro-optimizations, and using a low-level language (be it C, C++ or assembly) will prevent one from quickly iterating over different ideas.


> will prevent one from quickly iterating over different ideas.

I consider that a good thing. Making implementation harder means you'll be forced to be more thoughtful in design; Asm is so explicit and "low density" that you will naturally want to make every instruction count. You won't be easily tempted to make copies of strings, allocate memory, or do frivolous data movement, because those things all take instructions - instructions that you have to write. Even if you're calling functions, you still have to write the instructions to call them and pass parameters every time. You'll be more careful about not doing work that you don't have to.

Contrast this with high-level languages that make copying data around and allocating huge amounts of memory as easy as '=' and '{}'. They're good for prototyping high-level "does it work" types of things and exploring concepts - the "quickly iterating over different ideas" that you mention - but once you decide what to do, they're a lot less controllable with the details because of their high-level nature. And the details, the constants in algorithms, do matter a lot in the real world. Moreover, the difference in constants can be so big that even "proper system-level design" in HLLs can't beat a theoretically less efficient design in Asm, because the constants with the latter are minuscule.

See KolibriOS, MenuetOS, or TempleOS for an idea of what Asm can do.


Asm can do anything. :)


> Above all of that real-world performance comes from proper system-level design

There are situations where "proper system-level design" just doesn't cut it, and even the traditional "rewrite this module in C" doesn't work, because there is no single module to optimize; rather, the system is being slowed down by many little overheads all over the place. JITs help with this, but they are not always available. It really pays off to switch to a language with less overall overhead and a focus on performance if you find yourself in such a situation.

> and using a low-level language (be it C, C++ or assembly)

C++ is not a low-level language.


It is by all accounts. It pretty much relies on the underlying machine's memory model, and the ordering of operations in a C++ program directly corresponds to the resulting ordering of operations on the machine it is running on. Furthermore, the language itself is clearly specific to register machines and follows the corresponding semantics -- it'd be hard to target other kinds of computers from C++.


You start with a proper design and then hack it to squeeze performance out of it.

Low level code is not that difficult to optimize... especially not assembler.

Let's just say it's going to be a lot easier to get 200,000 req/sec from asm than from rails.


> low-level code is much harder to optimize for obvious reasons

What are the obvious reasons I'm missing?


As cgabios noted below, low-level code obfuscates intended behaviour. Ever tried to write a C optimizer? A trivial example is a for loop vs map -- the former has inherent ordering semantics and the compiler has no way of knowing whether this behavior needs to be preserved, while the latter just says that a particular operation needs to be applied to each element, so the compiler is free to reorder/parallelize/etc. There are much worse situations that arise from a low-level language having to preserve the underlying machine's memory semantics (that is one of the reasons why it is hard to compile low-level languages like C or C++ to e.g. Javascript; compiling x86 assembly would require full machine emulation).

This is discussed in detail in most introductory CS books if you would like to learn more.


In asm you write loops because they are easy; map is hard (and generally slow, because it's a lot more code).

ASM derives performance from specialization. ASM asks: how often is this code ACTUALLY going to run on another architecture, OS, etc.? And then it gains performance by not supporting those things via abstractions, etc.

Throw away your CS textbook and run benchmarks, reality dictates theory, not vice versa.


Specialization is what compilers do really well. :) Humans -- not so much.


I'll give you a counter example.

GPU drivers spend a lot of time trying to optimize beneath their corresponding high-level API. This is more-or-less equivalent to compiling GPU machine code on the fly based on the GPU configuration - that is, very much like optimizing a high-level language.

If everything goes smoothly, the drivers can do a pretty good job of optimizing everything.

However, if you deviate slightly from the "fast path", the whole thing falls off a performance cliff, and because it's a high-level language with a secret black-box optimizer behind it, you're actually worse off investigating performance issues than you would be if you'd just written things at a lower level. Not coincidentally, graphics APIs are moving to lower levels precisely to remove that complexity from the compiler, increasing transparency and making things more predictable.

Now you might suggest that a "sufficiently advanced compiler" wouldn't do that, but such a thing is a fiction. In practice, the compiler is never sufficiently advanced to optimize in all cases effectively.

---

Consider Javascript, where exactly the same thing happens. Your definition of a "high level language" may not include JS, but it's hard to argue it's not higher than ASM.

Modern day JS engines do a pretty good job of optimizing code JIT. However, you make some innocuous code change and suddenly your function is running in the interpreter instead of being optimized (see https://github.com/GoogleChrome/devtools-docs/issues/53 for examples)

If you were using a lower-level language, your chances of falling off mysterious performance cliffs are significantly reduced. Further, you have the capacity to do low-level optimizations that your compiler literally cannot do.

So what if your high level language can now do parallel-maps, if it ignores cache thrashing, or hits load-hit-stores or any one of myriad actual performance holes that real code can fall into?

Or you add a field to the objects you're iterating over and the parallel map implementation hits a weird memory stride and perf drops through the floor. How do you even debug something like this in a high level language where all you see is "map()"?

---

I also think your compiling-ASM-to-JS example is a bit of a strawman, FWIW. The parent was talking about how high-level languages yield higher performance than lower-level ones, not about the portability or transpilability of ASM->JS. (A "suitably advanced transpiler" would handle this problem perfectly anyway)


Low-level code often omits high-level intended behavior (description, pseudocode, documentation, test cases, etc.) and semantic meaning like variable names. In such codebases, the absence of these makes it harder to refactor, reuse and/or modify than, say, concisely and precisely documented codebases in higher-level languages (Python, Ruby, Go) or quality asm.


I'm struggling to see what you mean about variable names.


Maybe it's just me, but I honestly don't know what this is doing near the top of HN. It's more or less a literal translation from C.


Possibly because it illustrates that using assembly isn't necessarily the insane scary idea that it first seems to many (like me), even to those that should know better because they have used it in the past (like me).


No, a literal translation is what you get when you write an HTTP server in C and inspect what assembly code it produces for x86-64. Since this assembly code is nowhere close to that output, it is not a literal translation.


Only if you use a completely naive compiler; any level of optimisation moves it away from being a literal translation.


No, any correct translation a C compiler produces is, by definition, a literal translation.


No, that simply means that they are semantically equivalent, which is very different from a literal translation. To quote an online dictionary [0]: "2. Word for word; verbatim: a literal translation." Optimisation in compilers is certainly not word for word.

Take a simple problem like FizzBuzz and write it in the simple, obvious branching style. Now compile it with GCC or Clang (with -O3) and you end up with a lookup table (or at least I did a few months back). Semantically equivalent, but not a literal, "word for word" translation.

[0] http://www.thefreedictionary.com/literal


You are missing that there exist many more than one literal translation of a particular C program to asm. By your logic a compiler could not produce a literal translation of any program unless its output 100% matched that of all other compilers for the same program.


No dependencies at all, runs on Docker 'FROM scratch'. Nice! - https://registry.hub.docker.com/u/0xff/asmttpd/


I propose that a web framework be called Assembly on Ambulator.


Given that it's so small (6k), I'd call the framework based on this Assembly on Alleys.

It could totally implement a DSL... call it "C" for convenience, that generates the required assembly code :-).


> call it "C" for convenience

As distinct from C, I take it.


No, that is part of the joke I was (apparently) failing at making.


benchmarks?

:-)


Just because it is in ASM doesn't mean an exact equivalent in C won't smoke it performance-wise. Just sayin'... Aside from intellectual curiosity, one would be very hard pressed to write ASM code that is even barely more efficient than C code generated by a decent compiler, i.e. Clang or Intel C...


Just because it's asm doesn't mean it's fast, sure. Just because it's C doesn't mean it's faster than an implementation in your favourite scripting language.

Having said that, compilers do pretty badly on C-to-SIMD optimisation. The best you get is loop vectorization, and only for really simple logic; you can usually get some pretty good wins there. The fact that you lay out your memory for SIMD is usually a win all of its own due to cache prefetching, even if you don't actually use any SIMD instructions. Compilers need heuristics to manage the cache, whereas you know what you're trying to do (e.g. when to use non-temporal writes). Fast C code is written with a really clear mental model of the underlying architecture and the assembly the C will produce with -O3 (or whatever flag is relevant to your compiler), then checked with -S or objdump -D and profiled with callgrind/cachegrind, perf, rdtsc, etc...

The compiler really can't "do it for you". You /can/ use a compiler as one of your tools when /you/ do it. As Randy Hyde points out, you can always beat the compiler, because you can use its generated assembly language in every case you can't beat, so the absolute worst you get is a tie.

So yeah, you can totally smoke clang, Intel, microsoft and gnu C compiler and get paid something for doing it in certain industries too. :-)

Mike Acton is aggressively opinionated on the subject, but the lecture is really good, despite (or because of) the bits you'll disagree with and the way he'll rub you the wrong way: https://www.youtube.com/watch?v=rX0ItVEVjHc


That's often true. Really often true. But sometimes, a GOOD assembly programmer can beat a C compiler by a LOT. https://news.ycombinator.com/item?id=8508923 "Hand Coded Assembly Beats Intrinsics in Speed and Simplicity"


something like "wrk -d 10 -c 1000 -t 4 http://127.0.0.1/byte.txt (one byte file) did:

Requests/sec: 100.00 Transfer/sec: 11.91KB

for a JPEG image (200KB) the results are similar:

Requests/sec: 99.86 Transfer/sec: 19.27MB


Doesn't look like it supports HTTP pipelining:

    [trent@ubuntu/ttypts/4(~s/wrk)%] ./wrk -c 1 -t 1 --latency -d 5 http://localhost:8080/Makefile 
    Running 5s test @ http://localhost:8080/Makefile
      1 threads and 1 connections
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency    35.00us    0.00us  35.00us  100.00%
        Req/Sec    10.00      0.00    10.00    100.00%
      Latency Distribution
         50%   35.00us
         75%   35.00us
         90%   35.00us
         99%   35.00us
      1 requests in 5.10s, 1.57KB read
    Requests/sec:      0.20
    Transfer/sec:     314.55B
Note the 1 request. For 1000 clients, it's only doing 1000 requests:

    [trent@ubuntu/ttypts/4(~s/wrk)%] ./wrk -c 1000 -t 1 --latency -d 5 http://localhost:8080/Makefile
    Running 5s test @ http://localhost:8080/Makefile
      1 threads and 1000 connections
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency   307.93ms  552.88ms   1.63s    87.10%
        Req/Sec   414.29    439.07     1.34k    85.71%
      Latency Distribution
         50%    4.11ms
         75%  407.63ms
         90%    1.63s
         99%    1.63s
      1000 requests in 5.01s, 1.53MB read
      Socket errors: connect 0, read 41, write 0, timeout 0
    Requests/sec:    199.79
    Transfer/sec:    312.96KB


this is really really fast.


Are you kidding? The expectation is no less than 100k req/s.


What's the performance like?


Amazing!



