Question to the author: Seeing the benchmark really made me wonder, how can you be twice as fast as nginx?
I've always thought writing assembly manually was for some very specific edge cases, or to talk to some very specific hardware, but that it was just a waste of time for anything else, especially compared to C (and especially with all the progress in compilers).
Is it a death-by-a-thousand-cuts scenario, or are there some big chunks of performance gained thanks to specific tricks (and if so, could you give an example)? I'm thinking maybe cryptographic functions?
Hey thanks! Contrary to popular belief, x86_64 assembler isn't really that bad to deal with. An eagle-eye perspective is simply that regardless of how good optimising compilers get, they can never really know my intent. My zlib implementation for example is consistently 25% faster than the reference version, despite me simply "hand compiling" it straight from the C source. There are lots of contributing factors that all add up to the end result as seen in the benchmarks.
The short answer is: there is no single reason, it is the culmination of all of the underlying bits of the library that made it what it is.
I added code profiling to rwasa so that you can run load tests against it and watch call graphs and individual function timings, which makes for interesting inspection of the library itself for specific "tricks" that were employed. (the library's page contains rwasa-specific profiling examples: https://2ton.com.au/HeavyThing/ )
> My zlib implementation for example is consistently 25% faster than the reference version, despite me simply "hand compiling" it straight from the C source.
I am aware of Intel's patch re: psubusw usage, and interestingly arrived at the same solution when hand-compiling it, well before I saw their patch (in fill_window, which does make a substantial difference).
I have not compared my code to [your?] repo yet; I will endeavour to do so, though I am not sure submitting a fasm-based patch to that repo would make sense. Cheers
Regarding zlib, is that the fastest implementation that's currently available?
I remember stumbling upon someone who claimed his implementation was way faster (2x or more) than the original zlib, and that it was a drop-in replacement and scaled fairly well. Unfortunately, I can't find it right now (it should be bookmarked on another computer).
Various people have attempted to speed up zlib; it's not that high a bar, if you're OK with not producing output binary-identical to the original zlib. Deflate compression algorithms keep references to possible LZ backreferences in a hash table, and depending on your choice of hash algorithm, and how much you prune your table versus spending memory and time storing and searching it, you'll end up emitting different backreferences. The implementation at https://github.com/jtkukunas/zlib/ replaces the hash algorithm with something much faster.
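To make the hash-choice point concrete, here's a rough fasm-style sketch of the two flavours (my own illustration, not code lifted from zlib or the jtkukunas fork; HASH_SHIFT/HASH_MASK are just the typical zlib defaults for a 15-bit hash table):

    HASH_SHIFT = 5          ; zlib default for hash_bits = 15
    HASH_MASK  = 0x7fff

    ; reference-zlib style: fold one byte at a time into the running hash
    ; in: eax = current hash, cl = next input byte   out: eax = new hash
    update_hash_classic:
            shl     eax, HASH_SHIFT
            movzx   ecx, cl
            xor     eax, ecx
            and     eax, HASH_MASK
            ret

    ; crc32-based style (needs SSE4.2): hash four window bytes in one go,
    ; which is roughly the kind of replacement the faster forks make
    ; in: rsi = pointer into the input window   out: eax = hash
    update_hash_crc:
            xor     eax, eax
            crc32   eax, dword [rsi]
            and     eax, HASH_MASK
            ret

Different hash, different bucket collisions, different backreferences emitted -- hence the output not being binary-identical.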
libslz is really, really fast (and fairly new). It's used in the next version of haproxy. In my own tests it's proved substantially faster than both zlib and miniz.
> Contrary to popular belief, x86_64 assembler isn't really that bad to deal with.
I think most beliefs regarding x86_64 assembly are largely "guilt by association" with i386... It's amazing the difference just from making use of the larger register set.
It's the same in the high assurance systems field, albeit with different goals. We're concerned with optimization-induced failures, subversion, complexity, and so on. Mainstream languages and their compilers... don't exactly help with this. I've been out of it a while, evangelizing & informing mainly these days. Yet your post brings back memories.
My strategy, leveraging the Write Great Code book, was to map language constructs onto assembler via a macro-assembler or languages such as LISP with good metaprogramming. Then I hand compiled the code in others' projects with my macros. So we took a similar approach there, although my goal was to show a correspondence argument between source & asm. Meta tools turned that into a full program for assembling and linking.
One wild idea I had for portability was writing optimized routines for a safe HLL in a bytecode like LLVM's. That knocks out most of the uncertainty from the layers above, which limit optimizations the most. Then the simple optimizations that machines are good at can be performed from there, along with generation of assembler. Close to the portability of C and the efficiency of hand-written assembler, with inline available.
For instance, code the zlib functions in pretty optimal LLVM and let the toolchain do the rest at full optimization. Think yours will be 25% faster, 10%, similar? And I'm talking about what you can code quickly rather than spending 30min-1hr optimizing by hand. Just curious to hear your thoughts as you have way more experience in that stuff.
Saying that you're afraid of optimization-induced failures and then saying that you will feed some LLVM bytecode to the optimizer seems contradictory to me...
They're kind of two different things. The first was hand-compiling (or tracing) things for highly assured work. The other was a tangent where I wonder if a bytecode like LLVM could be used as a cross-platform assembler that's closer to hand-optimized assembler than C due to less knowledge of intent being required. Not to mention simpler structure.
They're different things. If it's a worthwhile path, then the formal efforts on LLVM IR and verified optimizations could be combined with hand-made, inline LLVM for a safer, portable alternative to different inline assembler for each platform. Complementary, not contradictory, when in context.
Ah ok. Appreciate that detail. My concept was replacing inline assembler in an otherwise portable 3GL with LLVM bytecode. The compiler would've taken care of ABI in that scenario. LLVM should be easier to optimize than C and maybe fast enough to eliminate the need for several different assemblers.
> My zlib implementation for example is consistently 25% faster than the reference version, despite me simply "hand compiling" it straight from the C source.
Yeah, but that is a very CPU bound processing pipeline. You would expect that to maximize the impact of any inefficiencies in the compiler.
That you can hand tune for better performance is conceivable. That you can get a 2x win over some pretty tuned code suggests that there is something larger at work than simply tuning lots of little things.
Here's a short-list of some very simple things you can easily do in assembler which are either hard for a compiler to do, or which requires all kinds of extra optimization "magic". A lot of it boils down to having more information available:
* Allocate registers globally or across large subsets of the program. Especially when targeting architectures (like x86_64, but unlike i386) with decent numbers of available registers, this has lots of potential for typical applications that e.g. frequently need to access common data structures. Compilers for many languages (e.g. C) have a hard time doing this when compiling each module separately (since you need to be able to link to code that hasn't been optimized the same way). You typically need whole-program optimization for this, but when programming assembler it's the natural thing to do if you have enough registers to treat some of them as assigned to specific variables that you know will be frequently accessed.
* Omit stack frames entirely or selectively. Many compilers have options for doing this, but they often still end up pushing/popping more than necessary for things like local variable frames and arguments, where a programmer will often see that a function is not going to use much space and decide to put in extra effort to shuffle things around to keep everything in registers.
* Selectively violate the "normal" calling conventions. E.g. if you have a utility function you often need to use in settings where it's convenient not to clobber certain registers, then you can opt to pass arguments in different registers easily. This again takes whole-program optimization for a compiler to do.
* Avoid saving/restoring certain registers based on the functions you're calling. Again, this requires whole-program optimization for the compiler to know that the function you're calling won't clobber specific registers.
* Specifically adjust what registers you're using based on what registers may be clobbered by other code you're calling to avoid having to save/restore.
Other things include similarities/patterns in the generated code that are non-obvious in a higher level language because they depend on how the code is translated; spotting these often lets you rewrite things to eliminate common sub-expressions that are not actually visible/present in the high level code.
It's not that compilers can't do all of these if given sufficient freedom and information, but it often violates expectations of the higher level environment (e.g. separate compilation in C); a sketch of the calling-convention point is below.
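Purely hypothetical fasm-style fragment (not from any real project): a private helper that deliberately ignores the SysV ABI, taking its length in r12 and leaving the caller's hot pointer in r15 untouched, so the call site needs no saves or restores at all.

    ; hypothetical private helper with its own calling convention:
    ; in:  r15 = buffer pointer (caller-owned, never clobbered here)
    ;      r12 = byte count
    ; out: rax = sum of the bytes; only rax, rcx, rdx are used as scratch
    sum_buffer:
            xor     eax, eax
            xor     ecx, ecx
    .next:  cmp     rcx, r12
            jae     .done
            movzx   edx, byte [r15+rcx]
            add     rax, rdx
            add     rcx, 1
            jmp     .next
    .done:  ret

    ; caller keeps r15 and r12 live across the call, saves nothing:
    ;       call    sum_buffer
    ;       ; r15 still points at the buffer here, no reload needed

You obviously only get away with this when you know every call site, which is exactly the whole-program knowledge a separately compiled C module lacks.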
* "unusual" control-flow: return several levels up from a function without requiring extensive elaborate exception-handling mechanisms, coroutines, and techniques similar to continuation-passing-style. After all, functions and procedures are just an artificial construct imposed by HLLs.
* Easily return multiple values from a function by using several registers, even normally inaccessible ones like EFLAGS (very useful for booleans; see the sketch below).
* Generating self-modifying-code, like a simple JIT compiler, is straightforward to do. Works especially well for tight loops that have several variants of their bodies.
Compilers are still bound by conventions and the features the HLL exposes. Asm is only limited by what the CPU can do (and what the programmer can come up with.) I admit that, while I do prefer using something like C for much of the "mundane" code I write, it's quite frustrating in those situations where I can think of a very elegant way to do something that either can't be expressed in C without some extreme compiler-fighting, or is completely impossible because of how it generates code and what the language allows.
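As a tiny, hypothetical fasm-style illustration of the multiple-return-values point (again, not from any real codebase): quotient and remainder come back in two registers, and an error condition comes back in the carry flag so the caller can branch with a single jc.

    ; in:  rdi = dividend, rsi = divisor
    ; out: rax = quotient, rdx = remainder, CF = 1 if the divisor was zero
    divmod_checked:
            test    rsi, rsi
            jz      .error
            mov     rax, rdi
            xor     edx, edx
            div     rsi
            clc
            ret
    .error: stc
            ret

    ; caller:
    ;       call    divmod_checked
    ;       jc      .divide_by_zero

Try expressing "return three things, one of them in EFLAGS" in C without fighting the compiler.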
This makes me think a direction worth pursuing would be to save the whole-program-analysis from the last compile pass for the new compile pass, so you can avoid re-analysing things. So you'd spend maybe 20 minutes on the first compile, then edit one .cpp file and the next compile takes 10 seconds.
Possibly, though a lot of whole program analysis is not necessarily slow (for many simple optimizations), it just violates a lot of assumptions that many systems make.
E.g. just being able to determine application-wide what call sites exist for a given function makes it fairly simple to handle specialized calling conventions, throw away stack frames, avoid saving/loading registers, etc., as the biggest barrier against this is that with separate compilation you don't know upfront if a given function will be called from some piece of code expecting standard calling conventions.
You certainly can do really expensive optimizations too that might benefit from saving information, though.
Haskell web servers have been faster than nginx on single and multi-core micro-benchmarks for years. Not implausible at all that someone using assembler could do better still.
Submitting this so I can make a feature request. :)
The -logpath option is fine, but it would be nice if it could create subdirectories too (e.g. $LOGPATH/YYYY/mm/access.log.YYYYmmdd ). Otherwise, over time the log dir is going to get unwieldy.
I'm currently running several sites in alpine+rwasa Docker containers; I'm liking having a set of entirely isolated web servers based on a 10MB container image, each apparently consuming ~6KB RAM while idle.
Author here: Feature request noted, although the one-logfile-per-day for me doesn't seem too unwieldy. I have not seen webserver logs stored in the way you describe, is that a common practice (each month's worth in its own separate directory?)
Depending on what you read [1] cronolog is a decent way of doing things (e.g. unlike logrotate it doesn't require a reload of web server). Without exporting to something like ELK, it also gives you a decent audit/accountancy trail. Limiting logs is not something I can really do in production (think six-years-plus of mandated data retention).
The generally accepted logging convention is that you work with how logrotate works (listen to the appropriate signals for freeing up the file to let logrotate truncate it, and so forth). If anyone needs to set up anything other than that default, it is typically better to customize logrotate and unify your endpoint logging there rather than to keep modifying each application's configuration. This is about separation of concerns for me as a sysadmin vs. getting away from operations decisions. If your project is widely accepted you'll want to limit the amount of code you maintain on stuff that's not in your domain. Heck, it's really the Unix Way. If I were writing against Windows, I'd be using the event logger for application events with my own text-based logs and let users configure what I emit to the main event logger.
An example of how people let logrotate do most of the lifting is in the Chef server codebase.
Yes - at least, from a sample size of 1 (me) I like the old cronolog way of doing this, and it means I don't get the rotate-a-day-later-to-avoid-trailing-lines-in-compressed-logfiles issue.
h2o is cool too, I contributed to that project early this year for their OpenSSL/DHE parameter settings.
Re: How much time, hard to say from my perspective since a fair amount of rwasa's functionality resides in the library itself. Start to finish for all of the showcase pieces and the library (from 0 lines of code to release) took 13 months of my life :-)
I almost downvoted you by accident. Thankfully when I reloaded, the arrows were still there. I really think this "arrows too small" problem needs to be fixed, although the refrain seems to be that it is not a problem: the community would vote such accidental downvotes back up.
I've been wondering when something like this would emerge... I'm excited to try it out on some side projects; seems simple and fast-- hard combo to beat :)
I'd also like to say, as someone who is quite ignorant of writing x86 assembly, the function hook example is incredibly readable and clear. I'm looking forward to grokking the rest of the code base in an attempt to learn more.
That's a huge effort, kudos. Certainly, it is harder to get maximum performance from C, as sometimes you have to reorder things in non-trivial loops in order to "help" the C compiler get it "right" in the performance sense (many times involving function inlining and other things that usually increase code size/bloat). The downside of assembly-only is being tied to just one platform, but that's currently not much of an issue in this case, with most web servers running on x86_64 CPUs; if it allows performance to increase while reducing memory usage, it makes a lot of sense for a super-low-cost architecture (tiny memory and CPU usage per active connection).
It would be interesting to see how the x86_64 implementations of specific elements compare to equivalent C code, in terms of instructions per clock, cache miss ratio, etc. (e.g. using the "perf" tool in Linux, or any other tool that uses the CPU hardware counters).
Is that Rwasa as in Russians? Because in one non-Russian dialect I'm familiar with, it translates exactly to that. If so, I find the reference interesting considering Nginx author's country of origin. I'm curious how you came up with the name Rwasa.
Late Sunday night here, but before I sign off for the night, thought you'd be amused to know that the name rwasa really, truly was an acronym from "Rapid Web Application Server in Assembler". I have historically sucked at coming up with decent names, and at the time pre-release this seemed like as good a name as any :-) I assure you any relation to Russians, African politicians, etc is wholly coincidence.
Do you have specific criticisms of my work and/or attention to detail re: the TLS specification, and/or my library's application of modular arithmetic et al?
I'll add another slight spin, which is that I'd never run this in production, ever, even if I actually personally trusted you a great deal, because I can't possibly audit this sort of code base. By extension, for the same reasons I can't audit the source very well, neither can anyone else. (i.e., no, I do not personally audit everything I run but I reasonably expect that because it is possible others have. "Many eyes make bugs shallow" may be oversold but it is not simply false.) The only practical way I know to audit this sort of code base at this time is naked-eye inspection, and I don't trust that.
I say "I know" because there may be something out there. I know there exists tools for source code analysis that deal directly with assembler, because I see theses around writing them. I don't know where to get one, though, or how to use it, or how much to trust it. There's a lot more that deal with "C" or "Java" rather than raw assembler, so I can fire several tools, both commercial and open source, at the problem.
And all that said I'm still extremely strongly in the "STOP WRITING C CODE AND PUTTING IT ON THE INTERNET DAMMIT" side, even with that support. Without even that support, I frankly don't care if it's 100 times faster than nginx. nginx is already maxing out my risk tolerance as it is, and I've begun a long, slow program to get it out of my stack too.
I want to emphasize that this is explicitly (but also unapologetically) non-specific; none of this is personal, none of this is directly critical of your code (because if you've gotten the impression I haven't even glanced at it, that is correct), and in particular, please by all means do whatever you like with your spare time. The "problem" here isn't that you have somehow failed to leap my bar, the problem is that the bar is impractically high for code written in raw assembler. I suppose you could provide a math proof, but I'd almost argue that in that case the server becomes implemented in said proof language rather than in assembler.
The related element is this: if the team behind nginx somehow all get hit by the same bus, I am confident someone appropriate will pick up and maintain the product.
In the case of an ASM project, I would be very surprised if anyone came along with the appropriate knowledge to ever want to touch the codebase. LibreSSL is currently pulling bits of ASM out of the codebase just to remove that factor.
Like Jerf said, I want to be clear that I'm incredibly impressed you got this project over the line, and I can't make any complaint about how you've done things.
Go's http server. The inside of nginx is an incredible mess of C. I begrudgingly trust it since it is being actively attacked and maintained. The inside of Go's http server is incredibly clearly written. I am confident that it too will be maintained. And it's roughly half the speed of nginx, which is plenty fast. (Few web application servers in the world are sitting there with nginx using all the CPU.)
It's a long slow process partially precisely because I intend to do this carefully, and discarding nginx is not something to be taken lightly, but long term, as I said, I want the C out of my stack.
That's interesting, because I'm seeing people do the exact opposite: since almost all security holes in web environments these days come not from the applications but from the myriad frameworks and unauditable layers making fun use of objects all over the place, using C or even ASM is now the only way to limit the moving parts and to ensure that your code base doesn't change between two audits.
I'd be happy to give some more detailed thoughts. I write a lot of assembly myself [1], so hopefully that lends me some credibility.
I haven't read much of your TLS implementation, but I'm not a security researcher and I don't think I'm qualified to give an opinion on your particular implementation. However, there are some points to be made here. First of all, almost no one ships their own crypto, for a good reason. To trust that something is secure, you need to have lots of people working on it and lots of projects invested in it (something that OpenSSL and such have). "Many eyes make bugs shallow" is the common phrase, but it holds truer when large companies with highly skilled engineers are putting their secrets and their customers' secrets on the line. Your implementation has the unfortunate problem of being written in assembly. While I don't think there's anything inherently wrong with assembly, many people don't share the same opinion. People will be reluctant to contribute to it (how do people contribute, anyway?). Also, it is easier to make mistakes in assembly, and I would be surprised if there weren't several mistakes in your TLS implementation and in the rest of the server. This doesn't reflect poorly on you as a developer, but is instead a consequence of choosing assembly.
Assembly also inherits a lot of the common security problems C has (like buffer overflows), but makes them harder to identify. I would feel very uncomfortable exposing anything written in assembly to the public net, and doubly so if it used an unproven TLS stack written in the same. Other projects avoid the problem of untested crypto by using tested crypto from an external module like OpenSSL.
Agreed regarding general trust in any crypto stack. I've been doing commercial software development for 28 years now, and my company's products all reflect this. Whether I expect high-value security sites to use my software in production or not, well, I certainly do not. Hardened stacks are few and far between, and OpenSSL can by no measure be deemed hardened (though it is certainly getting better of late thanks to all of the bug releases). Do I expect that my entire stack is 100% bug-free? No, but one of the niceties IMO of doing assembly language programming is that it is far less error tolerant in the ways you describe. I read all of the nasties re: security-related code, and the commonly accepted mitigation strategies were applied throughout.
Re: how do people contribute, fixing up github's linguist for x86_64 support is on my list of things to do (which is why I didn't put it all on github to begin with).
At the end of the day, trust is a function of time and perceived scrutiny of the stacks at hand. We are getting there slowly but surely :-) Cheers!
I'm sorry, you've held off on publishing on github due to missing source code highlighting? Or am I completely misunderstanding what you're saying (I think I am...)?
Admittedly I haven't checked recently, but before I released 2 Ton Digital I did a few test github projects and they all looked horrific so yeah I left it out on purpose. It's been on my "someday when I am bored" list since then (to fix up linguist so it all looks half-decent), and also why all the "library as HTML" on 2ton.com.au is self-highlighted.
> Assembly also inherits a lot of the common security problems C has (like buffer overflows), but makes them harder to identify.
Actually, I'd say that it has advantages because the mindset is very different when writing Asm - it naturally forces you to think about things in a low-level and precise fashion, which keeps considerations such as buffer lengths more in the mind than higher-level languages that attempt to abstract it away.
Programming at the instruction level also allows much more fine tuning of instruction ordering and such to resist timing attacks, without any compiler optimisations getting in the way.
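A small, hypothetical example of that (not taken from any particular library): a fixed-time buffer comparison where every byte is visited regardless of where the first mismatch sits, so the running time doesn't leak its position -- something you have to be careful the optimiser doesn't turn into an early-exit loop when writing it in C.

    ; in:  rdi = buffer A, rsi = buffer B, rdx = length (both buffers equal length)
    ; out: rax = 0 if the buffers match, nonzero otherwise
    memeq_consttime:
            xor     eax, eax
            xor     ecx, ecx
    .next:  cmp     rcx, rdx
            jae     .done
            movzx   r8d, byte [rdi+rcx]
            xor     r8b, byte [rsi+rcx]
            or      al, r8b
            add     rcx, 1
            jmp     .next
    .done:  ret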
I don't know this firsthand, but I've heard from many people that had delved into OpenSSL code that it is an example of code where "many things could be made better", to put it lightly.
I think that although a lot of people used OpenSSL, only a few looked into its source code. Those who did might have been horrified, but since there was no real alternative, continued using it.
Only after massive security vulnerabilities and a lot of media attention did more people look at OpenSSL in detail and eventually decide to do something about it. Which mostly was "let's write a new library or fork it". Thus LibreSSL, sodium, NaCl and such.
What got us into the mess with OpenSSL was leaving a key component of many software projects to a small, struggling team. It's amazing how much open source relies on a few ancient programs written and maintained by few with little to no financial support (e.g. NTP, GPG).
>Only after massive security vulnerabilities and a lot of media attention did more people look at OpenSSL in detail and eventually decide to do something about it. Which mostly was "let's write a new library or fork it". Thus LibreSSL, sodium, NaCl and such.
Right, but those vulnerabilities _were_ found. I worry that they wouldn't be found in this. The only people who'd go looking are the people who see that a specific website is using it and want to exploit it.
That would only be true if those vulnerabilities occurred because someone found a bug in the source code. Given the horrible mess the OpenSSL code is said to be, I'd argue that most vulnerabilities were found without the source.
> BREACH/TIME/etc
>
> Both the BREACH and TIME attacks rely on measuring the size of compressed response bodies. Since rwasa supports dynamic content compression by default, the HeavyThing library's default setting for webserver_breach_mitigation is enabled and set to 48 bytes. For each rwasa response when TLS and gzip is active, this setting adds an X-NB header that contains a random 0..48 bytes that is hex-encoded to each response header. While this doesn't render response sizing attacks completely useless, it makes a would-be attacker's job much more difficult due to the highly variable response lengths.
It's my understanding that random padding doesn't in fact make the attacker's job "much more" difficult. Only a little more, or not at all?
Could you comment on how integrated the TLS stack is with the webserver? Normally I'd think that using some kind of dedicated SSL terminating proxy, either a new version of HAproxy -- or stunnel/stud or similar -- would make more sense than deploying a new TLS stack that hasn't been through any outside review?
That said, as mentioned by others here - openssl is clearly not a great example of a secure/good TLS implementation. I'm not sure there are any (yet). Hopefully libressl will become one. Personally I'd like to see a minimal library that combined a couple of AES/ECC primitives and implemented TLS 1.2+ only (No SSL), with a sane and clean API on top.
Something along the lines of NaCl but with a goal to support a subset of standard TLS with forward secrecy (and explicitly throw old clients under the bus, Android 2.x be damned).
> It's my understanding that random padding doesn't in fact make the attacker's job "much more" difficult. Only a little more, or not at all?
The BREACH attack verbiage at http://breachattack.com spells it out fairly clearly: by adding random bytes to all of the HTTP responses, it becomes impossible, for small compressed HTTP payloads, to determine whether guessed bytes were correct or not (well, depending of course on the size of the random padding added).
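For the mechanically curious, a rough, hypothetical fasm-style sketch of the padding idea (my illustration only -- the HeavyThing library uses its own RNG rather than rdrand, and its real routine is not shown here): pick a length in 0..48, draw that many random bytes, and hex-encode them for use as the X-NB header value.

    ; in:  rdi = destination buffer (at least 96 bytes)
    ; out: eax = number of hex characters written (0..96)
    breach_pad:
            rdrand  ecx
            jnc     breach_pad          ; retry until entropy is available
            xor     edx, edx
            mov     eax, ecx
            mov     ecx, 49
            div     ecx                 ; edx = random length in 0..48
            mov     r8d, edx
            xor     r9d, r9d            ; random bytes emitted so far
    .next:  cmp     r9d, r8d
            jae     .done
            rdrand  eax
            jnc     .next               ; retry this byte on failure
            mov     ecx, eax
            shr     ecx, 4
            and     eax, 0x0f
            and     ecx, 0x0f
            movzx   eax, byte [hexdigits+rax]
            movzx   ecx, byte [hexdigits+rcx]
            mov     [rdi+r9*2], cl      ; high nibble first
            mov     [rdi+r9*2+1], al
            add     r9d, 1
            jmp     .next
    .done:  lea     eax, [r9+r9]        ; two hex chars per random byte
            ret

    hexdigits db '0123456789abcdef'     ; would live in a data section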
> Could you comment on how integrated the TLS stack is with the webserver?
The TLS layer is entirely separate from the webserver layer. I built the epoll, TLS, SSH, webserver and client as "IO layers", such that they can be stacked together arbitrarily (imagine epoll/IPv4 listener -> TLS -> SSH -> TLS -> Webserver, perfectly doable, albeit a little nutty).
> it becomes impossible, for small compressed HTTP payloads, to determine whether guessed bytes were correct or not (well, depending of course on the size of the random padding added).
Hm, ok. At least you didn't "just add some random padding" :-)
Thanks for the comment on structure. Might be nice to try to make the ssl/tls terminating proxy available as a separate binary, I guess.
As for the code, for someone new to fasm it wasn't immediately obvious that to build, one has to assemble and then link (fasm -m $((bignumber))[1] project.asm project.o && ld -o project project.o # optionally strip project). Might want to put that in a README/makefile/build.sh. I found the general recipe in the hello-example, but a short readme in the various project folders and/or at the top level wouldn't hurt.
> I built the epoll, TLS, SSH, webserver and client as "IO layers"
I think you should just finish the job and implement the entire OS in assembly. ;-)
I'm kidding. To me, assembly programming has always seemed like a true art form. You're forced to think about everything, and if you can successfully fit all the pieces together properly, it's beautiful. Also, not many can hack it through assembly, so there's a huge selection bias too.
Probably the amount of eyes that read it. This stuff is insanely easy to get wrong in very subtle ways and reimplementing the stack always carries a risk.
This is why people usually prefer to stick to the "regular" libraries even if they're known to be slower.