Question to the author: Seeing the benchmark really made me wonder, how can you be twice as fast as nginx?
I've always thought writing assembly manually was for some very specific edge cases, or to talk to some very specific hardware, but that it was just a waste of time for anything else, especially compared to C (and especially with all the progress in compilers).
Is it a death-by-a-thousand-cuts scenario, or are there some big chunks of performance gained thanks to specific tricks (and if so, could you give an example)? I'm thinking maybe cryptographic functions?
Hey thanks! Contrary to popular belief, x86_64 assembler isn't really that bad to deal with. An eagle-eye perspective is simply that regardless of how good optimising compilers get, they can never really know my intent. My zlib implementation for example is consistently 25% faster than the reference version, despite me simply "hand compiling" it straight from the C source. There are lots of contributing factors that all add up to the end result as seen in the benchmarks.
The short answer is: there is no single reason, it is the culmination of all of the underlying bits of the library that made it what it is.
I added code profiling to rwasa so that you can run load tests against it and watch call graphs and individual function timings, which makes for interesting inspection of the library itself for specific "tricks" that were employed. (the library's page contains rwasa-specific profiling examples: https://2ton.com.au/HeavyThing/ )
> My zlib implementation for example is consistently 25% faster than the reference version, despite me simply "hand compiling" it straight from the C source.
I am aware of Intel's patch re: psubusw usage, and interestingly arrived at the same solution when hand-compiling it, well before I saw their patch (in fill_window, which does make a substantial difference).
I have not compared my code to [your?] repo yet; I will endeavour to do so, though I am not sure submitting a fasm-based patch to that repo would make sense. Cheers
Regarding zlib, is that the fastest implementation that's currently available?
I remember stumbling upon someone who claimed his implementation was way faster (2x or more) than the original zlib, and that it was a drop-in replacement and scaled fairly well. Unfortunately, I can't find it right now (it should be bookmarked on another computer).
Various people have attempted to speed up zlib; it's not that high a bar, if you're OK with not producing output binary-identical to the original zlib. Deflate compression algorithms keep references to possible LZ backreferences in a hash table, and depending on your choice of hash algorithm, and how much you prune your table versus spending memory and time storing and searching it, you'll end up emitting different backreferences. The implementation at https://github.com/jtkukunas/zlib/ replaces the hash algorithm with something much faster.
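To make the hash-choice point concrete, here's a rough fasm-style sketch of the two flavours (my own illustration, not code lifted from zlib or the jtkukunas fork; HASH_SHIFT/HASH_MASK are just the typical zlib defaults for a 15-bit hash table):

    HASH_SHIFT = 5          ; zlib default for hash_bits = 15
    HASH_MASK  = 0x7fff

    ; reference-zlib style: fold one byte at a time into the running hash
    ; in: eax = current hash, cl = next input byte   out: eax = new hash
    update_hash_classic:
            shl     eax, HASH_SHIFT
            movzx   ecx, cl
            xor     eax, ecx
            and     eax, HASH_MASK
            ret

    ; crc32-based style (needs SSE4.2): hash four window bytes in one go,
    ; which is roughly the kind of replacement the faster forks make
    ; in: rsi = pointer into the input window   out: eax = hash
    update_hash_crc:
            xor     eax, eax
            crc32   eax, dword [rsi]
            and     eax, HASH_MASK
            ret

Different hash, different bucket collisions, different backreferences emitted -- hence the output not being binary-identical.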
libslz is really, really fast (and fairly new). It's used in the next version of haproxy. In my own tests it's proved substantially faster than both zlib and miniz.
> Contrary to popular belief, x86_64 assembler isn't really that bad to deal with.
I think most beliefs regarding x86_64 assembly are largely "guilt by association" with i386... It's amazing the difference just from making use of the larger register set.
It's the same in the high assurance systems field, albeit with different goals. We're concerned with optimization-induced failures, subversion, complexity, and so on. Mainstream languages and their compilers... don't exactly help with this. I've been out of it a while, evangelizing & informing mainly these days. Yet your post brings back memories.
My strategy, leveraging the Write Great Code book, was to map language constructs onto assembler via a macro-assembler or languages such as LISP with good metaprogramming. Then I hand compiled the code in others' projects with my macros. So we took a similar approach there, although my goal was to show a correspondence argument between source & asm. Meta tools turned that into a full program for assembling and linking.
One wild idea I had for portability was writing optimized routines for a safe HLL in a bytecode like LLVM's. That knocks out most of the uncertainty from the layers above, which limit optimizations the most. Then the simple optimizations that machines are good at can be performed from there, along with generation of assembler. Close to the portability of C and the efficiency of hand-written assembler, with inline available.
For instance, code the zlib functions in pretty optimal LLVM and let the toolchain do the rest at full optimization. Think yours will be 25% faster, 10%, similar? And I'm talking about what you can code quickly rather than spending 30min-1hr optimizing by hand. Just curious to hear your thoughts as you have way more experience in that stuff.
Saying that you're afraid of optimization-induced failures and then saying that you will feed some LLVM bytecode to the optimizer seems contradictory to me...
They're kind of two different things. The first was hand-compiling (or tracing) things for highly assured work. The other was a tangent where I wonder if a bytecode like LLVM could be used as a cross-platform assembler that's closer to hand-optimized assembler than C due to less knowledge of intent being required. Not to mention simpler structure.
They're different things. If it's a worthwhile path, then the formal efforts on LLVM IR and verified optimizations could be combined with hand-made, inline LLVM for a safer, portable alternative to different inline assembler for each platform. Complementary, not contradictory, when in context.
Ah ok. Appreciate that detail. My concept was replacing inline assembler in an otherwise portable 3GL with LLVM bytecode. The compiler would've taken care of ABI in that scenario. LLVM should be easier to optimize than C and maybe fast enough to eliminate the need for several different assemblers.
> My zlib implementation for example is consistently 25% faster than the reference version, despite me simply "hand compiling" it straight from the C source.
Yeah, but that is a very CPU bound processing pipeline. You would expect that to maximize the impact of any inefficiencies in the compiler.
That you can hand tune for better performance is conceivable. That you can get a 2x win over some pretty tuned code suggests that there is something larger at work than simply tuning lots of little things.
Here's a short-list of some very simple things you can easily do in assembler which are either hard for a compiler to do, or which requires all kinds of extra optimization "magic". A lot of it boils down to having more information available:
* Allocate registers globally or across large subsets of the program. Especially when targeting architectures (like x86_64, but unlike i386) with decent numbers of available registers, this has lots of potential for typical applications that e.g. frequently need to access common data structures. Compilers for many languages (e.g. C) have a hard time doing this when compiling each module separately (since you need to be able to link to code that hasn't been optimized the same way). You typically need whole-program optimization for this, but when programming assembler it's the natural thing to do if you have enough registers to treat some of them as assigned to specific variables that you know will be frequently accessed.
* Omit stack frames entirely or selectively. Many compilers have options for doing this, but they often still end up pushing/popping more than necessary for things like local variable frames and arguments, where a programmer will often see that a function is not going to use much space and decide to put in extra effort to shuffle things around to keep everything in registers.
* Selectively violate the "normal" calling conventions. E.g. if you have a utility function you often need to use in settings where it's convenient not to clobber certain registers, then you can opt to pass arguments in different registers easily. This again takes whole-program optimization for a compiler to do.
* Avoid saving/restoring certain registers based on the functions you're calling. Again, this requires whole-program optimization for the compiler to know that the function you're calling won't clobber specific registers.
* Specifically adjust what registers you're using based on what registers may be clobbered by other code you're calling to avoid having to save/restore.
Other things include similarities/patterns in the generated code that are non-obvious in a higher level language because they depend on how the code is translated; spotting these often lets you rewrite things to eliminate common sub-expressions that are not actually visible/present in the high level code.
It's not that compilers can't do all of these if given sufficient freedom and information, but it often violates expectations of the higher level environment (e.g. separate compilation in C); a sketch of the calling-convention point is below.
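Purely hypothetical fasm-style fragment (not from any real project): a private helper that deliberately ignores the SysV ABI, taking its length in r12 and leaving the caller's hot pointer in r15 untouched, so the call site needs no saves or restores at all.

    ; hypothetical private helper with its own calling convention:
    ; in:  r15 = buffer pointer (caller-owned, never clobbered here)
    ;      r12 = byte count
    ; out: rax = sum of the bytes; only rax, rcx, rdx are used as scratch
    sum_buffer:
            xor     eax, eax
            xor     ecx, ecx
    .next:  cmp     rcx, r12
            jae     .done
            movzx   edx, byte [r15+rcx]
            add     rax, rdx
            add     rcx, 1
            jmp     .next
    .done:  ret

    ; caller keeps r15 and r12 live across the call, saves nothing:
    ;       call    sum_buffer
    ;       ; r15 still points at the buffer here, no reload needed

You obviously only get away with this when you know every call site, which is exactly the whole-program knowledge a separately compiled C module lacks.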
* "unusual" control-flow: return several levels up from a function without requiring extensive elaborate exception-handling mechanisms, coroutines, and techniques similar to continuation-passing-style. After all, functions and procedures are just an artificial construct imposed by HLLs.
* Easily return multiple values from a function by using several registers, even normally inaccessible ones like EFLAGS (very useful for booleans; see the sketch below).
* Generating self-modifying-code, like a simple JIT compiler, is straightforward to do. Works especially well for tight loops that have several variants of their bodies.
Compilers are still bound by conventions and the features the HLL exposes. Asm is only limited by what the CPU can do (and what the programmer can come up with.) I admit that, while I do prefer using something like C for much of the "mundane" code I write, it's quite frustrating in those situations where I can think of a very elegant way to do something that either can't be expressed in C without some extreme compiler-fighting, or is completely impossible because of how it generates code and what the language allows.
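As a tiny, hypothetical fasm-style illustration of the multiple-return-values point (again, not from any real codebase): quotient and remainder come back in two registers, and an error condition comes back in the carry flag so the caller can branch with a single jc.

    ; in:  rdi = dividend, rsi = divisor
    ; out: rax = quotient, rdx = remainder, CF = 1 if the divisor was zero
    divmod_checked:
            test    rsi, rsi
            jz      .error
            mov     rax, rdi
            xor     edx, edx
            div     rsi
            clc
            ret
    .error: stc
            ret

    ; caller:
    ;       call    divmod_checked
    ;       jc      .divide_by_zero

Try expressing "return three things, one of them in EFLAGS" in C without fighting the compiler.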
This makes me think a direction worth pursuing would be to save the whole-program-analysis from the last compile pass for the new compile pass, so you can avoid re-analysing things. So you'd spend maybe 20 minutes on the first compile, then edit one .cpp file and the next compile takes 10 seconds.
Possibly, though a lot of whole program analysis is not necessarily slow (for many simple optimizations), it just violates a lot of assumptions that many systems make.
E.g. just being able to determine application-wide what call sites exist for a given function makes it fairly simple to handle specialized calling conventions, throw away stack frames, avoid saving/loading registers, etc., as the biggest barrier against this is that with separate compilation you don't know upfront if a given function will be called from some piece of code expecting standard calling conventions.
You certainly can do really expensive optimizations too that might benefit from saving information, though.
Haskell web servers have been faster than nginx on single and multi-core micro-benchmarks for years. Not implausible at all that someone using assembler could do better still.
Submitting this so I can make a feature request. :)
The -logpath option is fine, but it would be nice if it could create subdirectories too (e.g. $LOGPATH/YYYY/mm/access.log.YYYYmmdd ). Otherwise, over time the log dir is going to get unwieldy.
I'm currently running several sites in alpine+rwasa Docker containers; I'm liking having a set of entirely isolated web servers based on a 10MB container image, each apparently consuming ~6KB RAM while idle.
Author here: Feature request noted, although the one-logfile-per-day for me doesn't seem too unwieldy. I have not seen webserver logs stored in the way you describe, is that a common practice (each month's worth in its own separate directory?)
Depending on what you read [1] cronolog is a decent way of doing things (e.g. unlike logrotate it doesn't require a reload of web server). Without exporting to something like ELK, it also gives you a decent audit/accountancy trail. Limiting logs is not something I can really do in production (think six-years-plus of mandated data retention).
The generally accepted logging convention is that you work with how logrotate works (listen to the appropriate signals for freeing up the file to let logrotate truncate it, and so forth). If anyone needs to set up anything other than that default, it is typically better to customize logrotate and unify your endpoint logging there rather than to keep modifying each application's configuration. This is about separation of concerns for me as a sysadmin vs. getting away from operations decisions. If your project is widely accepted you'll want to limit the amount of code you maintain on stuff that's not in your domain. Heck, it's really the Unix Way. If I were writing against Windows, I'd be using the event logger for application events with my own text-based logs and let users configure what I emit to the main event logger.
An example of how people let logrotate do most of the lifting is in the Chef server codebase.
Yes - at least, from a sample size of 1 (me) I like the old cronolog way of doing this, and it means I don't get the rotate-a-day-later-to-avoid-trailing-lines-in-compressed-logfiles issue.
h2o is cool too, I contributed to that project early this year for their OpenSSL/DHE parameter settings.
Re: How much time, hard to say from my perspective since a fair amount of rwasa's functionality resides in the library itself. Start to finish for all of the showcase pieces and the library (from 0 lines of code to release) took 13 months of my life :-)
I almost downvoted you by accident. Thankfully when I reloaded, the arrows were still there. I really think this "arrows too small" problem needs to be fixed, although the refrain seems to be that it is not a problem: the community would vote such accidental downvotes back up.
I've been wondering when something like this would emerge... I'm excited to try it out on some side projects; seems simple and fast-- hard combo to beat :)
I'd also like to say, as someone who is quite ignorant of writing x86 assembly, the function hook example is incredibly readable and clear. I'm looking forward to grokking the rest of the code base in an attempt to learn more.
That's a huge effort, kudos. Certainly, it is harder to get maximum performance from C, as sometimes you have to reorder things in non-trivial loops in order to "help" the C compiler get it "right" in the performance sense (many times involving function inlining and other things that usually increase code size/bloat). The downside of assembly-only is being tied to just one platform, but that's currently not much of an issue in this case, with most web servers running on x86_64 CPUs; if it allows performance to increase while reducing memory usage, it makes a lot of sense for a super-low-cost architecture (tiny memory and CPU usage per active connection).
It would be interesting to see how the x86_64 implementations of specific elements compare to equivalent C code, in terms of instructions per clock, cache miss ratio, etc. (e.g. using the "perf" tool in Linux, or any other tool that uses the CPU hardware counters).
Is that Rwasa as in Russians? Because in one non-Russian dialect I'm familiar with, it translates exactly to that. If so, I find the reference interesting considering Nginx author's country of origin. I'm curious how you came up with the name Rwasa.
Late Sunday night here, but before I sign off for the night, thought you'd be amused to know that the name rwasa really, truly was an acronym from "Rapid Web Application Server in Assembler". I have historically sucked at coming up with decent names, and at the time pre-release this seemed like as good a name as any :-) I assure you any relation to Russians, African politicians, etc is wholly coincidence.
Do you have specific criticisms of my work and/or attention to detail re: the TLS specification, and/or my library's application of modular arithmetic et al?
I'll add another slight spin, which is that I'd never run this in production, ever, even if I actually personally trusted you a great deal, because I can't possibly audit this sort of code base. By extension, for the same reasons I can't audit the source very well, neither can anyone else. (i.e., no, I do not personally audit everything I run but I reasonably expect that because it is possible others have. "Many eyes make bugs shallow" may be oversold but it is not simply false.) The only practical way I know to audit this sort of code base at this time is naked-eye inspection, and I don't trust that.
I say "I know" because there may be something out there. I know there exists tools for source code analysis that deal directly with assembler, because I see theses around writing them. I don't know where to get one, though, or how to use it, or how much to trust it. There's a lot more that deal with "C" or "Java" rather than raw assembler, so I can fire several tools, both commercial and open source, at the problem.
And all that said I'm still extremely strongly in the "STOP WRITING C CODE AND PUTTING IT ON THE INTERNET DAMMIT" side, even with that support. Without even that support, I frankly don't care if it's 100 times faster than nginx. nginx is already maxing out my risk tolerance as it is, and I've begun a long, slow program to get it out of my stack too.
I want to emphasize that this is explicitly (but also unapologetically) non-specific; none of this is personal, none of this is directly critical of your code (because if you've gotten the impression I haven't even glanced at it, that is correct), and in particular, please by all means do whatever you like with your spare time. The "problem" here isn't that you have somehow failed to leap my bar, the problem is that the bar is impractically high for code written in raw assembler. I suppose you could provide a math proof, but I'd almost argue that in that case the server becomes implemented in said proof language rather than in assembler.
The related element is this: if the team behind nginx somehow all get hit by the same bus, I am confident someone appropriate will pick up and maintain the product.
In the case of an ASM project, I would be very surprised if anyone came along with the appropriate knowledge to ever want to touch the codebase. LibreSSL is currently pulling bits of ASM out of the codebase just to remove that factor.
Like Jerf said, I want to be clear that I'm incredibly impressed you got this project over the line, and I can't make any complaint about how you've done things.
Go's http server. The inside of nginx is an incredible mess of C. I begrudgingly trust it since it is being actively attacked and maintained. The inside of Go's http server is incredibly clearly written. I am confident that it too will be maintained. And it's roughly half the speed of nginx, which is plenty fast. (Few web application servers in the world are sitting there with nginx using all the CPU.)
It's a long slow process partially precisely because I intend to do this carefully, and discarding nginx is not something to be taken lightly, but long term, as I said, I want the C out of my stack.
That's interesting, because I'm seeing people do the exact opposite: since almost all security holes in web environments these days come not from the applications but from the myriad frameworks and unauditable layers making fun use of objects all over the place, using C or even ASM is now the only way to limit the moving parts and to ensure that your code base doesn't change between two audits.
I'd be happy to give some more detailed thoughts. I write a lot of assembly myself [1], so hopefully that lends me some credibility.
I haven't read much of your TLS implementation, but I'm not a security researcher and I don't think I'm qualified to give an opinion on your particular implementation. However, there are some points to be made here. First of all, almost no one ships their own crypto, for a good reason. To trust that something is secure, you need to have lots of people working on it and lots of projects invested in it (something that OpenSSL and such have). "Many eyes make bugs shallow" is the common phrase, but it holds truer when large companies with highly skilled engineers are putting their secrets and their customers' secrets on the line. Your implementation has the unfortunate problem of being written in assembly. While I don't think there's anything inherently wrong with assembly, many people don't share the same opinion. People will be reluctant to contribute to it (how do people contribute, anyway?). Also, it is easier to make mistakes in assembly, and I would be surprised if there weren't several mistakes in your TLS implementation and in the rest of the server. This doesn't reflect poorly on you as a developer, but is instead a consequence of choosing assembly.
Assembly also inherits a lot of the common security problems C has (like buffer overflows), but makes them harder to identify. I would feel very uncomfortable exposing anything written in assembly to the public net, and doubly so if it used an unproven TLS stack written in the same. Other projects avoid the problem of untested crypto by using tested crypto from an external module like OpenSSL.
Agreed regarding general trust in any crypto stack. I've been doing commercial software development for 28 years now, and my company's products all reflect this. Whether I expect high-value security sites to use my software in production or not, well, I certainly do not. Hardened stacks are few and far between, and OpenSSL can by no measure be deemed hardened (though it is certainly getting better of late thanks to all of the bug releases). Do I expect that my entire stack is 100% bug-free? No, but one of the niceties IMO of doing assembly language programming is that it is far less error tolerant in the ways you describe. I read all of the nasties re: security-related code, and the commonly accepted mitigation strategies were applied throughout.
Re: how do people contribute, fixing up github's linguist for x86_64 support is on my list of things to do (which is why I didn't put it all on github to begin with).
At the end of the day, trust is a function of time and perceived scrutiny of the stacks at hand. We are getting there slowly but surely :-) Cheers!
I'm sorry, you've held off on publishing on github due to missing source code highlighting? Or am I completely misunderstanding what you're saying (I think I am...)?
Admittedly I haven't checked recently, but before I released 2 Ton Digital I did a few test github projects and they all looked horrific so yeah I left it out on purpose. It's been on my "someday when I am bored" list since then (to fix up linguist so it all looks half-decent), and also why all the "library as HTML" on 2ton.com.au is self-highlighted.
> Assembly also inherits a lot of the common security problems C has (like buffer overflows), but makes them harder to identify.
Actually, I'd say that it has advantages because the mindset is very different when writing Asm - it naturally forces you to think about things in a low-level and precise fashion, which keeps considerations such as buffer lengths more in the mind than higher-level languages that attempt to abstract it away.
Programming at the instruction level also allows much more fine tuning of instruction ordering and such to resist timing attacks, without any compiler optimisations getting in the way.
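A small, hypothetical example of that (not taken from any particular library): a fixed-time buffer comparison where every byte is visited regardless of where the first mismatch sits, so the running time doesn't leak its position -- something you have to be careful the optimiser doesn't turn into an early-exit loop when writing it in C.

    ; in:  rdi = buffer A, rsi = buffer B, rdx = length (both buffers equal length)
    ; out: rax = 0 if the buffers match, nonzero otherwise
    memeq_consttime:
            xor     eax, eax
            xor     ecx, ecx
    .next:  cmp     rcx, rdx
            jae     .done
            movzx   r8d, byte [rdi+rcx]
            xor     r8b, byte [rsi+rcx]
            or      al, r8b
            add     rcx, 1
            jmp     .next
    .done:  ret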
I don't know this firsthand, but I've heard from many people that had delved into OpenSSL code that it is an example of code where "many things could be made better", to put it lightly.
I think that although a lot of people used OpenSSL, only a few looked into its source code. Those who did might have been horrified, but since there was no real alternative, continued using it.
Only after massive security vulnerabilities and a lot of media attention did more people look at OpenSSL in detail and eventually decide to do something about it. Which mostly was "let's write a new library or fork it". Thus LibreSSL, sodium, NaCl and such.
What got us into the mess with OpenSSL was leaving a key component of many software projects to a small, struggling team. It's amazing how much open source relies on a few ancient programs written and maintained by few with little to no financial support (e.g. NTP, GPG).
>Only after massive security vulnerabilities and a lot of media attention did more people look at OpenSSL in detail and eventually decide to do something about it. Which mostly was "let's write a new library or fork it". Thus LibreSSL, sodium, NaCl and such.
Right, but those vulnerabilities _were_ found. I worry that they wouldn't be found in this. The only people who'd go looking are the people who see that a specific website is using it and want to exploit it.
That would only be true if those vulnerabilities occurred because someone found a bug in the source code. Given the horrible mess the OpenSSL code is said to be, I'd argue that most vulnerabilities were found without the source.
> BREACH/TIME/etc
>
> Both the BREACH and TIME attacks rely on measuring the size of compressed response bodies. Since rwasa supports dynamic content compression by default, the HeavyThing library's default setting for webserver_breach_mitigation is enabled and set to 48 bytes. For each rwasa response when TLS and gzip is active, this setting adds an X-NB header that contains a random 0..48 bytes that is hex-encoded to each response header. While this doesn't render response sizing attacks completely useless, it makes a would-be attacker's job much more difficult due to the highly variable response lengths.
It's my understanding that random padding doesn't in fact make the attacker's job "much more" difficult. Only a little more, or not at all?
Could you comment on how integrated the TLS stack is with the webserver? Normally I'd think that using some kind of dedicated SSL terminating proxy, either a new version of HAproxy -- or stunnel/stud or similar -- would make more sense than deploying a new TLS stack that hasn't been through any outside review?
That said, as mentioned by others here - openssl is clearly not a great example of a secure/good TLS implementation. I'm not sure there are any (yet). Hopefully libressl will become one. Personally I'd like to see a minimal library that combined a couple of AES/ECC primitives and implemented TLS 1.2+ only (No SSL), with a sane and clean API on top.
Something along the lines of NaCl but with a goal to support a subset of standard TLS with forward secrecy (and explicitly throw old clients under the bus, Android 2.x be damned).
> It's my understanding that random padding doesn't in fact make the attacker's job "much more" difficult. Only a little more, or not at all?
The BREACH attack verbiage at http://breachattack.com spells it out fairly clearly: by adding random bytes to all of the HTTP responses, it becomes impossible, for small compressed HTTP payloads, to determine whether guessed bytes were correct or not (well, depending of course on the size of the random padding added).
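For the mechanically curious, a rough, hypothetical fasm-style sketch of the padding idea (my illustration only -- the HeavyThing library uses its own RNG rather than rdrand, and its real routine is not shown here): pick a length in 0..48, draw that many random bytes, and hex-encode them for use as the X-NB header value.

    ; in:  rdi = destination buffer (at least 96 bytes)
    ; out: eax = number of hex characters written (0..96)
    breach_pad:
            rdrand  ecx
            jnc     breach_pad          ; retry until entropy is available
            xor     edx, edx
            mov     eax, ecx
            mov     ecx, 49
            div     ecx                 ; edx = random length in 0..48
            mov     r8d, edx
            xor     r9d, r9d            ; random bytes emitted so far
    .next:  cmp     r9d, r8d
            jae     .done
            rdrand  eax
            jnc     .next               ; retry this byte on failure
            mov     ecx, eax
            shr     ecx, 4
            and     eax, 0x0f
            and     ecx, 0x0f
            movzx   eax, byte [hexdigits+rax]
            movzx   ecx, byte [hexdigits+rcx]
            mov     [rdi+r9*2], cl      ; high nibble first
            mov     [rdi+r9*2+1], al
            add     r9d, 1
            jmp     .next
    .done:  lea     eax, [r9+r9]        ; two hex chars per random byte
            ret

    hexdigits db '0123456789abcdef'     ; would live in a data section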
> Could you comment on how integrated the TLS stack is with the webserver?
The TLS layer is entirely separate from the webserver layer. I built the epoll, TLS, SSH, webserver and client as "IO layers", such that they can be stacked together arbitrarily (imagine epoll/IPv4 listener -> TLS -> SSH -> TLS -> Webserver, perfectly doable, albeit a little nutty).
> it becomes impossible, for small compressed HTTP payloads, to determine whether guessed bytes were correct or not (well, depending of course on the size of the random padding added).
Hm, ok. At least you didn't "just add some random padding" :-)
Thanks for the comment on structure. Might be nice to try to make the ssl/tls terminating proxy available as a separate binary, I guess.
As for the code, for someone new to fasm it wasn't immediately obvious that to build, one has to assemble and then link (fasm -m $((bignumber))[1] project.asm project.o && ld -o project project.o # optionally strip project). Might want to put that in a README/makefile/build.sh. I found the general recipe in the hello-example, but a short readme in the various project folders and/or at the top level wouldn't hurt.
> I built the epoll, TLS, SSH, webserver and client as "IO layers"
I think you should just finish the job and implement the entire OS in assembly. ;-)
I'm kidding. To me, assembly programming has always seemed like a true art form. You're forced to think about everything, and if you can successfully fit all the pieces together properly, it's beautiful. Also, not many can hack it through assembly, so there's a huge selection bias too.
Probably the amount of eyes that read it. This stuff is insanely easy to get wrong in very subtle ways and reimplementing the stack always carries a risk.
This is why people usually prefer to stick to the "regular" libraries even if they're known to be slower.