Websocketd (websocketd.com)
680 points by mmcclure 27 days ago | 220 comments

This is the second or third "it's CGI again" thing I've seen in the past year. While these things are cool and definitely have their place, it's still worth noting that process per connection scales fairly poorly, simply because processes and forking are relatively expensive, and therefore it's probably unwise to deploy something like this in production anymore. It is what it is, I suppose.

Depends on what your needs are. In many cases, you need to scale to a few dozen connections per day, and simplicity is far more valuable than performance. If you're pushing the limits of performance, any off the shelf solution is probably not going to scale as well as it could for your specific workload.

Yes! Every mention of CGI also gets a sneer and a reminder of how poorly it scales, but nothing I do requires scale in that sense. "In production" doesn't always mean "hundreds of requests per second." Every road doesn't need 8 lanes.

I’ve had developers say this before when building public websites and it has cost us in lost revenue when the site has gone offline or cost us lots in virtual hardware having to scale the number of servers available to meet higher than expected volumes of traffic.

Don’t get me wrong, I'm not suggesting that everyone should be building their site like it’s Facebook or Google, but when every public website is already sat on a proverbial motorway, it can be dangerously shortsighted to build your site to only meet demand with no room to scale. CGI falls under that category. There are plenty of platforms that are still very easy to build and deploy, developer friendly, and still run circles around the performance of CGI.

(Please note that I am talking about public sites specifically and not intranets or other IP white listed resources)

Used in production doesn't always mean used by all users of a public app that might go viral.

If you have an app with an admin interface/CMS/monitoring interface and you know there won't be 50 administrators for your app before the end of the year this works great (assuming you can reverse proxy this to add auth or something).

> Used in production doesn't always mean used by all users of a public app that might go viral.

Hence why I specifically said "public website". Intranets and other IP whitelisted resources are clearly a different topic entirely. Sites that are only intended for 50 administrators but are not hidden behind a firewall make me nervous for a whole other set of reasons, but I accept that's got to happen sometimes (but even there, most of the best / developer friendly CRUD platforms these days aren't CGI, so there is little reason to use CGI even for that specific use case).

Sometimes you know your scale. Say you have a website for your local area D&D club. You know that nobody outside a 20 mile radius is going to ever really bother looking at your stuff. And you know that there are about 200 people interested in D&D in this area. You need a contact form on your site to email you when someone has a question. How many connections at most would you realistically expect to have to process in this case per second/year/decade?

More than you’d assume, once you factor in search engine crawlers, miscellaneous bots and automated tools that bad actors run to probe websites looking for vulnerabilities. However I’d still expect CGI to stand up against that.

Performance arguments aside and given the type of site you describe, wouldn’t it be more convenient for the developer to use Wordpress or one of those website builder as a service things instead of inventing something from scratch in CGI?

I get the argument that some personal sites wouldn’t get much traffic but the argument that CGI is easier to build than any non-CGI alternative simply isn’t true any more and hasn’t been the case for more than a decade.

I know I’m coming across as passionately against CGI and I assure you that isn’t the case (I recently chose to use CGI as a private API endpoint for some Alexa skills I’d written for myself). But for a public site there isn’t really a strong argument in favour of CGI anymore given the wealth of options we have available.

Point taken on the bots but the static part shouldn’t go through the CGI anyways.

The irony of decrying poor performance yet suggesting WordPress is pretty priceless :)

I agree that most people would be better served with Squarespace/Wix but I run into the situation of a static site + 1-2 bits of dynamic form processing frequently enough to warrant having a simple solution for it. When the site doesn’t warrant spending money on, sometimes a shared host + CGI is just right. But I think this is for some very rare cases. Most people should outsource these types of headaches.

> Point taken on the bots but the static part shouldn’t go through the CGI anyways.

Even the non-static stuff will be hit by bad bots (and by "bad bots" I mean any crawler - malicious or otherwise - that doesn't obey robots.txt).

> The irony of decrying poor performance yet suggesting WordPress is pretty priceless :)

It's not ironic at all. I've done extensive benchmarking on this in a previous job and discovered that even the out-of-the-box Wordpress experience would easily outperform the equivalent in CGI - and that's without taking into account all the caching plugins available for Wordpress (sure you could optimize your CGI as well, but you'd have to write all that yourself, whereas with Wordpress it's a 2 minute install).

Even DB and page caching aside, you still have a problem with CGI forking for each web request, whereas with PHP you have mod_php (Apache) or php-fpm, which do opcode caching and the like. This alone can make a dramatic difference once you start piling on the requests.

> When the site doesn’t warrant spending money on, sometimes a shared host + CGI is just right. But I think this is for some very rare cases. Most people should outsource these types of headaches.

Do shared hosts even allow CGI? I'd have thought that was a bit of a security faux pas for a shared host. In any case, almost all of them support PHP (and those that don't are special purpose ones for node or other frameworks) so there's no forced requirement to use CGI even on shared hosting. In fact I'd go further and argue that PHP is even easier to write than CGI so you're better off using that regardless of whether CGI is available.

Disclaimer: I pretty much hate PHP as a language and am not too fond of Wordpress either. But I'm being pragmatic in this discussion and leaving out my own personal biases. Personally I wrote my blog in Go (which, as it happens, was originally written in CGI/Perl, then ported to Apache+mod_perl before being ported to Go about 9 years ago, where it's been running ever since).

Some kinds of API endpoints and webhooks are exactly what I love CGI for—when what I want to do is exactly “bind an HTTP endpoint to a shell command.”
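For anyone who has not written one: a CGI endpoint is just a program that the web server starts once per request. It reads request metadata from environment variables and writes headers, a blank line, and a body to stdout. A minimal sketch in Python, where the `name` query parameter and the greeting are made up for illustration:

```python
#!/usr/bin/env python3
"""Minimal CGI-style endpoint: the web server runs this once per request."""
import os
import sys
from urllib.parse import parse_qs


def render(environ):
    """Build the full CGI response (headers, blank line, body) as a string."""
    # QUERY_STRING is a standard CGI variable; "name" is a hypothetical parameter.
    query = parse_qs(environ.get("QUERY_STRING", ""))
    name = query.get("name", ["world"])[0]
    return "Content-Type: text/plain\r\n\r\n" + f"hello, {name}\n"


if __name__ == "__main__":
    sys.stdout.write(render(os.environ))
```

Keeping the response logic in a plain function makes it testable without a web server; the server-specific wiring (which directory counts as CGI, how the script is marked executable) is left out because it differs per server.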

I agree but I'd never do that for a production system because it's far too risky from a security standpoint. For personal projects, sure. However even there I'd IP whitelist (wherever possible), sit the CGI behind an authentication screen and put log monitoring on there (eg fail2ban) to auto-blacklist any IPs identified as potentially abusing the endpoint.

If this isn't all stuff you already have set up on your dev environment then you might find hardening CGI becomes as much work as re-writing those tools in a more secure framework.

You cannot serve the whole internet without cloud scale resources anyway. So what does it matter if your application is 10K/s less capable than an optimized version? You pay for another instance. For 5k more time and energy yearly you save .026 an hour. Congrats.

Let’s be clear about one thing, we are not comparing CGI to C++ or Java. There’s PHP, Ruby, JavaScript, Python, etc that are all good languages to rapidly develop backend code in. So the dev time between CGI and non-CGI is the same or even quicker to build with non-CGI versions because modern web languages and frameworks are much better than they were in the 90s when CGI was popular. Plus the sysadmin / DevOps time spent deploying and hardening CGI would be greater than a more modern framework.

Also your costs are hugely optimistic. I’ve done benchmarks with CGI and non-CGI code in previous companies and found it wasn’t just a couple more servers, it was often 10x more. Even at $0.26 an hour, that quickly adds up over the course of a month and year. Plus the slower throughput also has a knock-on effect on your database as well. You’ll find as you’re running fewer connections on each web server you’d end up with smaller connection pools per node but more overall DB connections across the farm with those connections held open longer per web request. That means you then need to beef up your RDBMS instance and that gets very costly very quickly (even on the cloud)! And we’ve not even touched on the options of serverless et al that aren’t even available for CGI which would further bring down the cost of hosting a non-CGI site.

Let’s also not forget one of the key metrics when building a commercial web platform: performance from a UX perspective. Amazon, Google, etc have all done studies on page load times and user patterns. Their somewhat predictable result was that sites with slower response times will see more users leave that site in favour of a competitor's than sites with faster response times. So if your website is your business, running CGI could cost you in lost revenue as well as running costs.

None of what I say above is theoretical costs - this is actual data from my experiences migrating CGI platforms to non-CGI alternatives. I’ve done the benchmarking, cost analysis and so on and so forth. The arguments in favour of CGI are simply untrue for the modern era of web development.

The myth that fork is expensive is pervasive, and speaking to the ways that it is true†, well: performance is relative.

fork() only takes around 8ms on my Linux machine and I can get 100,000 posix_spawn() per second there with 100MB RSS.

That's "fast enough" for a large number of applications.

†: fork() is a lot slower (over 20x) on Windows
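Numbers like these are easy to check locally. A rough sketch, assuming a Unix-like system with a `true` binary on `PATH`; it spawns and reaps one child at a time from Python, so it includes interpreter overhead and will come out lower than a tight C loop would:

```python
import os
import shutil
import time


def spawn_rate(n=200):
    """Spawn a trivial program n times, waiting for each; return spawns/second."""
    exe = shutil.which("true")  # assumption: a `true` binary exists on PATH
    if exe is None:
        raise RuntimeError("no `true` binary found")
    env = dict(os.environ)
    start = time.perf_counter()
    for _ in range(n):
        pid = os.posix_spawn(exe, [exe], env)
        os.waitpid(pid, 0)      # reap the child so zombies do not pile up
    return n / (time.perf_counter() - start)


if __name__ == "__main__":
    print(f"{spawn_rate():.0f} spawns/sec")
```

The result depends heavily on the size of the parent process and on what is being spawned, so treat it as a sanity check and not as a benchmark.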

I see this argument all the time but it's not the fork() that is the most expensive anymore, it's the actual program initialization after the fork(). If your app is non trivial (let's say a websocket based chat server that needs to persist messages to a DB and use pubsub to sync them to other processes) it probably needs a connection to a database, a connection to a cache server like Redis or Memcached, etc, or perhaps a connection to some other backend service. Reinitializing these dependency connections from scratch in a new forked process for every single incoming websocket connection is expensive and slow.

On the other hand a Node.js or Go program can have a preestablished pool of keep alive connections to the backends already ready to go and reuse that connection pool for many hundreds or even thousands of concurrent websocket connections. You can approximate something like this with the fork() model by having a local daemon process that manages the connection pool and have your forked process talk to that local helper daemon when it needs a connection, but you are still going to pay a penalty for that compared with having a fully preestablished connection ready to use right there in the process already
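The difference between the two models can be sketched without a real database. Below, `FakeConnection` is a stand-in invented for illustration that only counts how many connections were opened; a long-lived process opens a fixed number up front, while the process-per-request style opens one per request:

```python
import queue


class FakeConnection:
    """Stand-in for a DB/Redis connection; counts how many were ever opened."""
    opened = 0

    def __init__(self):
        FakeConnection.opened += 1

    def query(self, text):
        return f"result for {text}"


class Pool:
    """Fixed-size pool: connections are opened once and handed out repeatedly."""

    def __init__(self, size):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(FakeConnection())

    def run(self, text):
        conn = self._idle.get()       # blocks if every connection is busy
        try:
            return conn.query(text)
        finally:
            self._idle.put(conn)      # hand it back for the next request


def per_request(requests):
    """Process-per-request style: each request opens its own connection."""
    return [FakeConnection().query(r) for r in requests]


def pooled(requests, pool):
    """Long-lived server style: every request borrows from the shared pool."""
    return [pool.run(r) for r in requests]
```

With a real backend, each "open" also costs a TCP handshake, possibly TLS, and authentication, which is where the per-process penalty described above comes from.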

Then consider posix_spawn(), since it pays that cost, but then we have to pay IPC as well:

fasthttp(go) can do around 80k/sec on my laptop; dash(C) can do almost 88k/sec on my laptop†. As soon as dash does IPC, it drops to around 51k/sec, and as soon as it needs a reply, we're down to 25k/sec††. I see no reason to believe fasthttp would do any faster.

That means I'm spending around 70% of my time in IPC -- something posix_spawn() would let me avoid (if my application were designed to do so). My same laptop will do 100k/sec posix_spawn(), so I find it difficult to believe that fork() or exec() (1-4 msec per call) is the bottleneck for any application with this architecture. Do you think posix_spawn() represents 70% of your costs? If our goal is to beat 51k/sec requests, sure, but NodeJS on my laptop (btw) gets 10k/sec, so if it's a contender, I'd say posix_spawn() is as well.


†† https://github.com/geocar/dash/blob/master/README.md

Well, you can manage resources by shared memory and semaphores. That will make nearly all of the environment setting time go away. Better yet if you just serialize the processes and assign resources to a serial number.

Which is just another way to make an NxM server, so maybe forget about it...

You can prefork. That is a tried and true method. And why do you need to create so many connections? Certainly you can architect the solution better than that.
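A minimal sketch of preforking, assuming a Unix-like system: the parent opens the listening socket once, forks a fixed number of workers, and every worker calls `accept()` on the inherited socket. The function names and the "serve N requests, then exit" limit are choices made here so the example terminates; a real server would loop forever and restart dead workers.

```python
import os
import socket


def prefork_serve(workers=2, requests_per_worker=1):
    """Fork `workers` children that all accept() on one shared listening socket.

    Returns (port, child_pids). Each child handles a fixed number of
    connections and then exits, so the sketch terminates on its own.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
    listener.listen(16)
    port = listener.getsockname()[1]

    pids = []
    for _ in range(workers):
        pid = os.fork()
        if pid == 0:                  # child: inherits the listening socket
            try:
                for _ in range(requests_per_worker):
                    conn, _addr = listener.accept()
                    with conn:
                        conn.sendall(f"served by pid {os.getpid()}\n".encode())
            finally:
                os._exit(0)           # never fall back into the parent's code
        pids.append(pid)
    listener.close()                  # the parent no longer needs its copy
    return port, pids


def fetch(port):
    """Connect once and return the single line the worker sends back."""
    with socket.create_connection(("127.0.0.1", port), timeout=5) as s:
        return s.makefile().readline().strip()
```

The fork cost is paid once at startup instead of once per connection, which is the whole point of the technique.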

8ms per fork means you can only accept 125 connections per second. (per core)

That means it's only viable for connections where the connection is very long lived and messages are very sparse (because context switches).
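The arithmetic behind that bound, as a small sketch. It assumes the fork cost is pure CPU time on the accepting core and that nothing else runs there:

```python
def max_accepts_per_second(fork_ms, cores=1):
    """Upper bound on new connections/sec if each one costs `fork_ms` of CPU."""
    return cores * 1000.0 / fork_ms


def per_call_ms(calls_per_second):
    """Inverse view: time budget per call at a given sustained rate."""
    return 1000.0 / calls_per_second


print(max_accepts_per_second(8))   # 8 ms per fork on one core
print(per_call_ms(100_000))        # the 100,000 posix_spawn()/sec figure quoted earlier
```

The two figures quoted upthread are far apart (8 ms per call versus 0.01 ms per call), so which one applies to a given program is worth measuring before drawing conclusions.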

If you use websocket for short lived connections, you are doing something wrong.


Meta observation:

Maybe this is the bane of smartphone era and small screens, but the context of the discussion seems to disappear instantly.

Subject: websockets > forking processes > .. aaand the websocket context is lost and we are talking generally about forks in web applications with growing thread.

> If you use websocket for short lived connections, you are doing something wrong.

Pretty-much true, but I remember a funny story from Dropbox where their websocket service couldn't come back up after a crash because their normal users trying to re-open super long lived connections all at once was well-beyond the capacity of the system

I find it ironic that people will complain about milliseconds for a fork and talk about the time for context switches... and then serve their pages using an interpreted language that is an order of magnitude slower than it could be in a compiled language...

Who did that? I saw no one talking about interpreted languages here.

The comment immediately below the GP that claims that performance is relative literally says that you don't have to care about forking in Node.js.

Not defending node.js, but it is not interpreted but run with jit compilation.

I think the real issue now is not how fast you can accept and process a new connection, but how many you can have in parallel. If you're going to have a native os process for each, you'll soon run out of resources

Cores are cheap: I've got 200, and a lot of that cost is page faults.

It also means you should use posix_spawn instead of fork+exec since you can control when the page faults occur better.

I was going to mention CPU context switching as well, it's very expensive even when you don't account for the initial fork overhead but I guess it's definitely a better idea than CGI for HTTP given the long lived connection.

In an ideal scenario, you shouldn't have more active processes than you have CPU cores. As soon as that happens, context switching kicks in and performance degrades sharply.

Does this concern hold for containers per machine?

Yes I would think so because containers run on top of the OS.

If you can run multiple containerized apps side-by-side on the same machine at the same time on a single CPU core, then you can be sure that there is some kind of context switching happening.

Modern Operating Systems are good at minimizing the amount of context switching. If you run 4 CPU-intensive processes at the same time on a machine which has 4 CPU cores, then the OS will typically assign each process to a different CPU core (with minimal context switching). Then if you launch a 5th CPU-intensive process, then the OS will have no choice but to start doing context switching since it doesn't have any idle cores left.

On Linux, based on tests I did a couple of years ago with a multi-process WebSocket server, I can confirm that the penalty of context switching is proportional to the CPU usage of each process. So for example, if you have a very CPU-intensive process sharing a CPU core with a non-intensive process, then the penalty will be small, but if you have two intensive processes sharing the same core, the penalty will be high.

The initial fork is fast because of the copy on write memory semantic. You'll pay a price later.

Only 8 ms? That's slower than a ping.

You were right about performance being relative. And of course the trade off between ease of development, use and performance. At the end of the day, practical considerations are going to determine what is "expensive".

> Only 8 ms? That's slower than a ping.

Also depends what you're pinging. I'm in the UK, so everything in America is 30-80msec away anyway.

Related: when I worked on a server (on linux) that would spawn threads for new connections that were mostly short lived, I tried to use a thread pool instead. Hand crafted, with push and pop both O(1). It was still mostly slower than just spawning a new thread every time.

It's generally considered the right move to just spawn your own thread for things like connections instead of pooling threads - people tend to think they should throw everything into a threadpool but it's not actually encouraged to do that for stuff like web server connections, compiles, etc unless the task is short-lived. I wish more people knew that going in :) Some threadpool APIs actually ask you whether the job is going to take a while and if it is, they function more as a job limiter - not 500 active threads all competing for CPU - than a thread reuser.

But websocket connections are usually long lasting. So the cost of the fork is less important.

It probably scales better than a forking HTTP server, but probably not much; modern HTTP connections tend to be at least a little bit long lasting and few people would dare serve a large site on a forking webserver (in part thanks to the fact that most webservers have moved on to workers or event loops.)

It certainly would hold up inordinately poorly to a DDoS attack.

A process per client with keepalive is definitely a losing proposition. But if you're willing to run old-school one request per connection, it's not unreasonable (although adding TLS to that is). Many years ago, while I was at Yahoo, someone made a clever hack: have a daemon that holds keepalive sockets and passes them to the (y)Apache daemon when they have something to read -- when Apache is done with the request, give it back to the daemon. (Sockets passed back and forth as file descriptors on a Unix socket). A further many years ago, David Filo came up with the idea of accept filters -- allowing a program to request the kernel to accept connections and have accept only return sockets that have a fully formed http request already, so an Apache (or whatever crazy webserver before Yahoo switched to Apache) wouldn't have to wait for the client there either.
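The descriptor-passing trick mentioned above can be sketched with `socket.send_fds` and `socket.recv_fds` (Python 3.9+, Unix only). This is not the Yahoo implementation, just the underlying mechanism; a pipe stands in for the client connection and both "daemons" live in one process to keep the example short:

```python
import os
import socket


def pass_fd_demo():
    """Hand an open file descriptor from one end of a Unix socket to the other."""
    keeper, worker = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_end, write_end = os.pipe()   # stands in for a client connection

    # "Keeper" side: ship the descriptor as ancillary data with a tiny payload.
    socket.send_fds(keeper, [b"fd"], [read_end])

    # "Worker" side: receive a new descriptor that refers to the same open pipe.
    msg, fds, _flags, _addr = socket.recv_fds(worker, 16, 1)
    received = fds[0]

    os.write(write_end, b"GET /")
    data = os.read(received, 5)       # read through the passed descriptor

    for fd in (read_end, write_end, received):
        os.close(fd)
    keeper.close()
    worker.close()
    return msg, data
```

In a real deployment the two ends would be separate processes, and the keeper would only hand a connection over once it has become readable.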

> modern HTTP connections tend to be at least a little bit long lasting

But CGI doesn't fork per TCP connection; it forks per HTTP request/response.

True, but as far as I know none of the major HTTP servers use forking anymore either. I believe there was a time when forking per connection was fairly standard for servers.

I think it all boils down to how much can it scale. At what point do # of processes tip over a server versus how many # of threads can it handle versus how many # green threads a program runtime can manage.

It's not so much the fork but the memory cost. Each of those subprocesses has at least one call stack = 2 megabytes of memory. 2 megabytes per connection is many many orders of magnitude more than you would use in an asynchronous server.

1) that's virtual size, and most likely (depending on OS/cfg) COW (assuming no call to execve).

2) that's a default - most systems allow tuning

You can have pretty decent performance with forking models if you 1) have an upper bound for # of concurrent processes 2) have an input queue 3) cache results and serve from cache even for very small time windows. Not execve'ing is also a major benefit, if your system can do that (e.g. no mixing of threads with forks). In forking models, execve+runtime init is the largest overhead.

It will not beat other models, but forking processes offer other benefits such as memory protection, rlimits, namespace separation, capsicum/seccomp-bpf based sandboxing, ...
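A sketch of the three measures listed above, under simplifying assumptions: a semaphore caps the number of live child processes, callers blocked on that semaphore stand in for the input queue, and results are cached for a short time window. `BoundedCachedRunner` is a name invented here, and it uses `subprocess` (so it does exec) for brevity:

```python
import subprocess
import sys
import threading
import time


class BoundedCachedRunner:
    """Run a command per request, with a concurrency cap and a short-lived cache."""

    def __init__(self, max_procs=4, ttl=0.5):
        self._slots = threading.BoundedSemaphore(max_procs)  # cap on children
        self._ttl = ttl
        self._cache = {}          # argv tuple -> (expiry time, output)
        self._lock = threading.Lock()
        self.spawned = 0          # how many child processes were actually run

    def run(self, argv):
        key = tuple(argv)
        now = time.monotonic()
        with self._lock:
            hit = self._cache.get(key)
            if hit and hit[0] > now:
                return hit[1]     # fresh enough: no process needed
        with self._slots:         # callers wait here when all slots are busy
            out = subprocess.run(argv, capture_output=True, check=True).stdout
        with self._lock:
            self.spawned += 1
            self._cache[key] = (time.monotonic() + self._ttl, out)
        return out


if __name__ == "__main__":
    runner = BoundedCachedRunner(max_procs=2, ttl=1.0)
    cmd = [sys.executable, "-c", "print('hi')"]
    print(runner.run(cmd), runner.run(cmd), runner.spawned)
```

Two simultaneous misses for the same key will both spawn a process; coalescing them would need extra bookkeeping that is left out here.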


I think you guys are both right. Back in the days when I measured UNIX performance, it was fork that was expensive due to memory allocation - but not the memory itself. It takes time to allocate all the page tables associated with the memory when you are setting up for the context switch. But I should admit that it was a long time ago that I traced that code path.

prior thread with some ad-hoc measurements: https://news.ycombinator.com/item?id=16714403

Socket connection alone is too expensive.

IRC shows us that maintaining a reliable socket for most folks is next to impossible.

On the contrary. IRC shows that maintaining a consistent network where every part can always reach every other part is tricky, but the most common problem with irc networks is not clients getting booted off, but net-splits that usually automatically resolve pretty quickly.

That they're visible to clients is an issue with how channels span servers and how operator status and channel membership is tied to who happens to be on a channel on a specific partition at a certain time, and how messages are propagated when splits resolve, and how inter-server communication happens.

So it has plenty of lessons if you want to build a chat network, but nothing with it suggests maintaining a connection is otherwise a big problem.

In general what it boils down to is the word "reliable": You would want to write your app so that a client that disconnects and reconnects gets a sensible behavior on reconnecting, e.g. by queuing messages when it makes sense, and discard them if it does not, to paper over temporary connection failures.

But you would need to do that if you were to use stateless request/response pairs anyway.
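One way to get that "sensible behavior on reconnecting" is a per-client outbox. The sketch below is an assumption about how such a thing might look, not a description of any particular chat server: durable messages are held while the client is away and flushed in order when it returns, and everything else is dropped.

```python
from collections import deque


class ClientSession:
    """Per-client outbox that survives disconnects."""

    def __init__(self, max_pending=100):
        self.connected = False
        self.delivered = []                        # stands in for the live socket
        self._pending = deque(maxlen=max_pending)  # oldest messages drop first

    def send(self, msg, durable=True):
        if self.connected:
            self.delivered.append(msg)
        elif durable:
            self._pending.append(msg)              # hold until the client returns
        # non-durable messages (e.g. typing indicators) are simply discarded

    def on_disconnect(self):
        self.connected = False

    def on_reconnect(self):
        self.connected = True
        while self._pending:
            self.delivered.append(self._pending.popleft())
```

The bounded `deque` keeps a long-absent client from consuming unbounded memory, at the cost of losing its oldest queued messages.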

I disagree. I've not seen a netsplit in months, but clients constantly timing out and rejoining is a fact of life.

That is a different problem; clients can always have bad connections, and that is not solved by UDP/HTTP.

Well http is a short lived request/response.

Sockets need keep alives and resources allocated et al.

I would really like to see a benchmark for: stand alone fork in small statically compiled binary and also every combination of: [[vfork, fork] + exec, posix_spawn] + [small static binary, small dynamic binary]. I understand that for big interpreted languages like Python it's certainly slow, but with advent of Rust I would believe that CGI could make a come back. It's fast enough for many tasks and it gives great isolation. Using separation of concerns by using separate processes gives also ability to easily sandbox whole application even by as crude means as original seccomp.

A long time ago I used to write CGI endpoints in C (strangely, not because I had to, I just wasn't as good at Perl). I used to say "it's not a CGI 'script' maaan! I wrote it in C" with the accompanying l33t glance of respect in response ;)

Anyhoo - same arguments about forking, process & memory limits arose ; so things like FastCGI and ISAPI were the solution to reduce/avoid some of those overheads.

I suspect Rust based CGI endpoints would have similar <quote>issues<unquote>.

But I've been thinking about this CGI comeback from time to time - the left side of backend infrastructure (load balancers, reverse proxies etc) and middle tier machine resources are much better now. Docker containers and lambda spin up is similar to some degree in terms of topology but maybe with even more overhead?

..so it does make one wonder if CGI-like simplicity could make a comeback with today's order of magnitude better infrastructure? I keep waiting for someone to come up with "lambdaGI" ..which will lead to "fastLambdaGI" and so on (and that's not a snarky comment).

Not that much on modern Linux/BSD systems at least. The kernel is smart enough to just copy fewer pages when forking not the entire parent process image.

Then you run a GC and dirty all your pages.

Yeah; forking is cheaper for low memory programs with no moving GC.

GC usually runs on a separate thread right? Isn't it true you shouldn't run fork() when you have a multithreaded program?

Not usually. Only an advanced GC will be concurrent. Something like Perl or Ruby for example runs in the foreground. If it is concurrent then you shut down the GC thread, fork, and then start a new thread in each new process.

It's probably not safe to fork() from most frameworks threads (OpenMP) and is hard to make safe for complicated parallel processing. But if you control the flow of execution and minimize mutex and critical section(s) it can be done withal. Several examples of thread served queues performing user defined actions upon message receipt to include fork() + exec() come to mind.

Of course any UNIX-like worth its salt is going to support Copy on Write and whatnot. But even then, forking is still quite slow relative to not doing anything at all.

Here's my point of view:

1. Forking is first and foremost a system call. (To be fair, I realize even memory allocation is much of the time, but still.) The kernel is going to do a bunch of work (as fast as it can, of course) and you're going to end up with two different OS level tasks by the end.

2. Those two tasks are now scheduled in tandem by the OS scheduler. For two tasks, this is fine. For hundreds of tasks, it becomes less effective. Threads will rapidly go in and out of I/O wait and the scheduler has to balance all of this.

3. CGI dies right here, though: CGI is not just a fork. It is a fork and exec! The exec, of course, also runs in the kernel. It's going to effectively load the binary from scratch. Then you probably hit the linker in usermode, which has to go through the shared library resolution, resolving and mapping shared objects into memory and filling out import tables. This stuff has some caching as far as I know, but still... it's not free.

4. Now you are in the entrypoint of your program... or are you? If you are using CGI with Perl or Python or even Bash, we're not done yet because the script interpreter has to load all of its state from scratch, parse your script, bla bla, and THEN finally we can run your program.

5. Your application now has to do all of its common setup. Every. Connection. If you need to connect to a database, you can't connection pool: you have to open a new socket every time. Redis? same thing. Need to read a config file? Yep, every dang time. You can hack around some of this but in general it's probably going to be like this for most CGI applications. There's a reason why FastCGI exists after all.
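Points 3 to 5 can be made concrete with a rough timing sketch: the same trivial handler is called in-process and then via a fresh interpreter per call, which is what CGI with a scripting language does. The function names are invented for this example and the absolute numbers will vary by machine:

```python
import subprocess
import sys
import time


def handler():
    """The actual 'work' of the request; deliberately trivial."""
    return "ok"


def time_in_process(n=50):
    """Average seconds per call when the program is already loaded."""
    start = time.perf_counter()
    for _ in range(n):
        handler()
    return (time.perf_counter() - start) / n


def time_fresh_interpreter(n=5):
    """CGI-style: start a whole new interpreter for each call."""
    start = time.perf_counter()
    for _ in range(n):
        subprocess.run([sys.executable, "-c", "print('ok')"],
                       capture_output=True, check=True)
    return (time.perf_counter() - start) / n


if __name__ == "__main__":
    print(f"in-process: {time_in_process() * 1e6:.2f} us/call")
    print(f"fresh interpreter: {time_fresh_interpreter() * 1e3:.2f} ms/call")
```

This only captures exec, dynamic linking, and interpreter startup; application setup such as opening database connections or reading config would come on top.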

The forking model of servers is elegant... but it doesn't work so well in my opinion. The cost of OS-level context switching is non-trivial, forking is relatively expensive, and things get worse when you are talking about something like CGI where your app effectively gets loaded from scratch each time.

My favorite model is definitely the Go model, where the language schedules lightweight threads or fibers across n OS threads (where n = number of logical threads.) It is cheap for the OS scheduler, and the language scheduler can very efficiently deal with things like I/O blocking and GC without huge latency hits.

But you can go about it many ways. Node.JS's event loop model proves pretty effective. Node is far from perfect but I think I would bet on a Node.JS server with a proper event loop over a CGI server forking a C program, myself.

Of course I'm really no expert. But, I think the jump from CGI to FastCGI told me all I need to know: CGI just didn't scale well at all. FastCGI was a much better experience for me, though I no longer use it.

> 2. For hundreds of tasks, it becomes less effective. Threads will rapidly go in and out of I/O wait and the scheduler has to balance all of this.

Surprisingly, even thousands of threads can often have better throughput than using event loops and non-blocking IO: https://www.slideshare.net/e456/tyma-paulmultithreaded1

> The cost of OS-level context switching is non-trivial

On modern hardware it's non-zero but trivial: https://eli.thegreenplace.net/2018/measuring-context-switchi...

>Surprisingly, even thousands of threads can often have better throughput than using event loops and non-blocking IO

Hundreds may be a bad example, it's going to vary based on how big your server is but there is a tipping point where threads become very infeasible. I've yet to hit that limit with Goroutines and have literally hit millions of them on relatively small allocations without much of a hitch. Event loops and non blocking IO may perform worse than threads, but I have extreme doubts that threads beat goroutines (and equivalent models like erlang processes.)

> I have extreme doubts that threads beat goroutines

Why? Goroutines use the same syscall heavy non-blocking IO as event loops, just with coroutines for programmer convenience.

The biggest difference between Goroutines and pthreads is that Goroutines only have a 2kb default stack size and pthreads have a 2MB stack.

In high thread concurrency situations it's common to turn the pthread stack size down to 48 or 64kb to allow running tens or hundreds of thousands of OS threads. There's even a JVM config flag for it.
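To put the stack sizes mentioned above side by side, here is the address space reserved for stacks alone at 100,000 concurrent tasks. This is reservation, not resident memory: thread stacks are typically only backed by physical pages as they are touched, and goroutine stacks grow on demand from their initial size.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def stack_reservation_gib(tasks, stack_bytes):
    """Address space reserved for stacks alone, in GiB."""
    return tasks * stack_bytes / GIB


for label, size in [("pthread, 2 MiB default", 2 * MIB),
                    ("pthread, tuned to 64 KiB", 64 * KIB),
                    ("goroutine, 2 KiB initial", 2 * KIB)]:
    print(f"{label:26s} {stack_reservation_gib(100_000, size):10.4f} GiB")
```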

1, because the Goroutine context switch is insanely cheap. There's no preempting, the scheduler runs on function calls and the context switches are very cheap - just swapping a few registers I believe. This means a single thread could blow through an enormous amount of goroutines in no time. I know, kernel mode switches are fast, we optimize this all the time, but doing nothing in place of doing something will always be inordinately cheaper.

2, because the Go scheduler has awareness about the state of the goroutines that an OS scheduler would not, so it can make more intelligent decisions about what goroutines to wake up and when.

You can really have pretty much as many goroutines as you want. Hundreds of thousands of OS threads lands you firmly into tweaking kernel settings land. My local thread max on my desktop is just 127009 - that wouldn't fly for a huge machine running many Go apps, which is exactly the kind of circumstance I was in (using Kubernetes, to be exact.)

Completely true, but in a realistic small workload situation with a 0.5ms response time the pthread context switch is already only 2usec or 0.4% of the total time. Goroutines can be infinitely faster and not be able to meaningfully improve overall performance.

This is why thread per request servers like jlhttp are right up there with fasthttp etc in terms of total throughput.
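The 0.4% figure, and the ceiling it implies, worked through as a sketch. It assumes one context switch per request and that the 0.5 ms response time already includes it:

```python
def switch_share(switch_us, response_us):
    """Fraction of one request's latency spent on a context switch."""
    return switch_us / response_us


def best_case_speedup(switch_us, response_us):
    """Throughput gain if the switch cost dropped to zero and nothing else changed."""
    return response_us / (response_us - switch_us)


print(f"{switch_share(2, 500):.1%}")        # 2 us out of 500 us
print(f"{best_case_speedup(2, 500):.4f}x")  # ceiling on the improvement
```

Under those assumptions, even a free context switch buys well under 1% of extra throughput, which is the argument being made above.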


Something notable is that jlhttp and fasthttp are both using worker pools. fasthttp uses worker pools of goroutines, and jlhttp uses traditional thread pooling.

Thread pooling is an effective solution to improve webserver performance, and it generally works well. In these synthetic benchmarks, you can't even really see much of a downside. In reality a lot of these benchmarks are only so good because they have requests that complete very quickly, and I think if you add a random sleep() into them many of them will just die outright because they can't handle that much concurrency and block waiting for free workers. You might think that is unrealistic, but consider that many people have Go servers that are making RPCs all over the place, and an actual large amount of time can just be spent waiting on other RPCs. It's a real thing!
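The "add a sleep and watch the pool stall" effect is easy to reproduce. In this sketch, `asyncio` tasks are used as a stand-in for goroutines (they are not the same thing, but both are cheap enough to create one per request), and `DELAY` plays the role of waiting on a downstream RPC:

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

DELAY = 0.05   # pretend each request waits this long on a downstream RPC


def pooled_run(requests=16, workers=4):
    """Fixed worker pool: only `workers` requests can be waiting at once."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda _: time.sleep(DELAY), range(requests)))
    return time.perf_counter() - start


def task_per_request_run(requests=16):
    """One lightweight task per request: all of them wait concurrently."""
    async def main():
        await asyncio.gather(*(asyncio.sleep(DELAY) for _ in range(requests)))
    start = time.perf_counter()
    asyncio.run(main())
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"pool of 4:        {pooled_run():.3f}s")
    print(f"task per request: {task_per_request_run():.3f}s")
```

With 16 requests and 4 workers the pool needs four rounds of waiting, while the task-per-request version needs roughly one; a larger pool narrows the gap at the cost of more threads.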

And if the world were just responding to HTTP requests, obviously something like Goroutines would be overkill. But, one of my favorite uses of Goroutines was implementing a messaging server where each consumer, queue, exchange were all their own Goroutines. I was inspired by RabbitMQ for the design but unfortunately could not use it in this use case. Luckily Goroutines worked really great here and I was able to scale this thing up hugely. To me this is where they're really great: they're super flexible. They work pretty well for short-lived HTTP requests, but also great for entirely different and more complicated use cases.

Looking back at the benchmark, one of the more interesting approaches here is go-prefork[1], which works by spawning n/2 executables with 2 threads each. I can only imagine the optimal amount of threads was complicated to determine and maybe even has something to do with hyperthreading. Of course the advantage here is the reduced amount of shared state that leads to less contention, and it does indeed show up on the benchmark. In this setup, it looks weird because there's no load balancer (could be something as simple as some iptables rules) or anything in front. In practice, this would be much akin to running separate instances on the same box, which is also a reasonable approach, and I used this approach myself when scheduling servers. Oddly, I don't think they tried the same approach for fasthttp.

I think what else the benchmark shows is how clever you don't have to be in Go to get good performance. go-postgres is routinely in the middle of the pack and it is literally just using the standard library and goroutines in the most basic fashion. It's effectively not optimized. And in reality, in many cases with more complex servers, the overhead is low enough that it isn't worth your time to optimize it much more.

[1]: https://github.com/TechEmpower/FrameworkBenchmarks/blob/mast...

They tried green threads years and years ago in Java and then reversed course, and M:N was also a thing on NetBSD years ago. I wonder why Go is working where the others went away from it?

> I wonder why Go is working where the others went away from it?

Quite frankly, because Go developers don't learn from others' mistakes and reinvent square wheels.

Kernel threads will always be better and faster than usermode threads.

This is because any inefficiency in threading comes from the scheduler. Your usermode scheduler will always necessarily be slower and worse; it just makes so much sense to put your scheduling code where your context switching and process isolation code already is.

Sure, you can get a short-term gain by iterating fast and testing code if kernel developers are too slow and/or unwilling; but then eventually your code will get merged into the kernel anyways once you're done. So really usermode threads are only good as a rapid prototyping tool, not something for serious use.

>Quite frankly, because Go developers don't learn from others' mistakes and reinvent square wheels.

Your response doesn't even start on the right foot, since Java never used green threads for performance and therefore is not a good basis for this argument. Citation needed if you're going to start with that.

You can also go ahead and insult the Erlang developers for Processes since they did it first:


>This is because any inefficiency in threading comes from the scheduler. Your usermode scheduler will always necessarily be slower and worse; it just makes so much sense to put your scheduling code where your context switching and process isolation code already is.

You know, a Go context switch doesn't hit the kernel. It's not more expensive than a kernel context switch. I don't know why you'd think it is. Why not inline scheduling to the process where it knows what's blocked on what?

>Sure, you can get a short-term gain by iterating fast and testing code if kernel developers are too slow and/or unwilling; but then eventually your code will get merged into the kernel anyways once you're done. So really usermode threads are only good as a rapid prototyping tool, not something for serious use.

You can hold your breath for threads to become cheaper than goroutines, but careful not to suffocate. A goroutine pretty much just needs a stack, 4kb. The kernel thread has structures, well, in the kernel, for paging, for the task itself, and the stack structure is bigger (think you can at least adjust that though.)

And as for integrating the Go GC into the scheduler... I'd love to see what Linus's thoughts on merging that to kernel are!

Doesn't matter if you just have a few thousand threads but that's a limiting way to look at something. RabbitMQ can have processes all over the place for everything and have more processes than you could ever have threads and it remains among best in class for performance.

Goroutines only have a 2kb initial stack since 1.4, AFAIK? It was originally 4kb, then 8kb to avoid split stacks, then back to 2kb, I think?

The Go concurrency model can't possibly be implemented in the kernel because it trades some isolation guarantees for better performance.

I am pretty sure Java's historical use of green threads was an issue of portability and not performance. They used traditional threading primitives. Java's decision to use green threads may have actually made a lot of sense in an era when consumer computers typically only had one physical thread to begin with.

Go on the other hand is based on CSP principles and a threading model that looks like actors. The threading model is deeply ingrained in the language, and it is designed around it. Goroutines are very cheap in Go, and the scheduler is fairly effective because it knows what threads to wake when - it's not blind. Goroutines are scheduled across n threads, not just a single OS thread, so they can take good advantage of multicore or multi CPU systems. This design does hurt C interoperability a bit, but imo it's greatly worth it.

Go is not the only language that works this way. I believe its concurrency model was inspired a lot by Erlang with its 'processes' model.

The reason why this gets more attention in Go than elsewhere is because it's unusual for the kind of language Go is (a lot lower level than something like Erlang). Everything else in Go is really "better C so long as it's not C++", but goroutines are an experiment in their own right.

From what I recall, Java didn't switch because it didn't work. I think it switched because the green threads were built on what Solaris supported, and they moved to full OS threading to support Windows, Linux, Mac etc.

While it's cute, I've never liked fork(). It's almost always followed by exec(), which means the default case is to do bookkeeping that is immediately discarded, something CoW helps with but doesn't eliminate (consider all those fds!)

Unless you develop in Elixir/Erlang where you can spawn millions of concurrent processes.

AFAIK, those are not 'processes' in terms of operating system processes, but instead some kind of parallel running tasks within the Erlang VM (much closer to threads).

Edit: If you think I am wrong, could you please explain what is wrong?

AFAIK there is very little difference between processes and threads in Linux (both are instances of task_struct). Implementation details vary so much that they are useless for classification, so the only useful distinction is semantic: the difference between processes and threads is whether they share internal state (generally / by default).

Erlang's don't, thus processes. Go's do, thus threads.

There is an enormous difference between processes and threads in Linux and it's a fairly simple one - each process is allocated a memory space. All those mappings and page table entries are what makes process spawn expensive.

There is a terminology issue though, you're right, and that comes with using OS terms for userspace scheduling. Go coming up with their own term - goroutines - significantly simplifies the conversations. Goroutines, Erlang processes, green threads and fibers are all fundamentally the same thing - M userspace 'tasks', scheduled on to N operating system threads of execution, likely in one OS process. There are some language details as to how they are presented to the user.

Not sure how it is in Linux, but in Windows it's only threads that can execute code. Spawning a process will cause the OS to create a thread for it as well.

Interesting. My interpretation was more along the lines of having its own process identifier (PID) and to my knowledge, OS processes have that, OS threads don't and I thought Erlangs processes don't get PIDs by the operating system either (although that might be wrong too).

> OS processes have that, OS threads don't

At the kernel level, both "processes" and "threads" have PIDs since they're the same thing. In fact, before NPTL was implemented Linux broke POSIX because it returned the kernel PID directly when queried[0]. That's no more than an interface choice of POSIX though. And POSIX threads still have a thread id, a (pid, tid) is a unique visible identifier. So the distinction seems completely arbitrary.

And Erlang processes do have (erlang-level) PIDs. In fact, they get (erlang-level) PIDs across multiple machines when in a cluster.

And again I think ancillary properties are not what matter, a PID is a consequence of being a process, not a cause. The cause of being a process is not sharing internal state, that's the useful bit. Erlang's tasks are not os processes (that would rather defeat the points), but it doesn't seem useful to call them "not processes".

[0] it now returns the tgid, a "thread group" at the kernel level is what you see as a process from userland: the tgid is the pid of the original task, creating a "process" will create a new task with a new tgid while creating a "thread" will create a new task with the existing tgid https://stackoverflow.com/a/9306150/8182118 provides an excellent primer on how this works
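You can poke at the pid/tid split from userland. A small sketch in Python, assuming 3.8+ for `threading.get_native_id()`; the helper names are mine. On Linux I'd expect the main thread's native id to equal the pid (that's the tgid), but the code doesn't rely on it.

```python
import os
import threading


def thread_ids() -> tuple[int, int]:
    """Return (process id, kernel-level thread id) for the calling thread."""
    return os.getpid(), threading.get_native_id()


def ids_from_new_thread() -> tuple[int, int]:
    """Collect the same pair from inside a freshly started thread."""
    result = []
    t = threading.Thread(target=lambda: result.append(thread_ids()))
    t.start()
    t.join()
    return result[0]


main_pid, main_tid = thread_ids()
worker_pid, worker_tid = ids_from_new_thread()
print("main  :", main_pid, main_tid)
print("worker:", worker_pid, worker_tid)
```

Both threads report the same pid while each has its own kernel-level id, which matches the tgid explanation above.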

Thanks for clarifying.

Traditionally in Unix and Windows contexts, "processes" means separate memory address spaces and OS-guaranteed isolation, and "threads" mean threads of execution that have shared access to one address space.

Erlang processes are threads as seen by the OS, but the Erlang runtime implements process-y restrictions that enforce isolation and forbid shared memory between Erlang processes that do in fact exist as threads inside the Erlang VM.

The programming model, as seen by the Erlang/Elixir programmer, is thus analogous to Unix processes, just with lower overheads.

The Unix programming model does not forbid shared memory between processes. It just gives them separate address spaces, but aside from that you can do whatever you want. Yes, separating address spaces makes processes memory safe by default, but it's not just Erlang that enforces this sort of memory safety programmatically within a single address space. What about Rust, or Haskell? They don't refer to threads, or to runtime-scheduled fibers created via async programming or via the work-stealing model, as "processes".

I don't know about Haskell parallelism primitives, but in Rust the normal thing to do is to share references to the same memory location, correctness-checked by the type system. The correctness guarantees don't come from private storage, but from compiler-made proofs done on the control flow / dataflow. So it wouldn't make sense to say "processes".

Erlang is a dynamic language without any such type system checks. The process separation is all just based on the fact that it's impossible to get or make a shared value, it's just not a concept in the language. You send and receive messages, which implies a copy, and you faff around with local values inside your process.

Re shared-memory support in Unix: Yeah, you have escape hatches from the memory models in Unix processes, and Rust, and probably Erlang and Haskell. But they're exceptions and safe to ignore when discussing terminology to describe the platform's native model. Also the Unix shared memory APIs were a late addition to the OS and everyone agrees they're ugly :)

> Erlang processes are threads as seen by the OS

Not true, Erlang processes are userspace threads, not kernel threads. It's an M:N model -- you can run thousands of Erlang processes on a single OS thread. See:


and note that each "scheduler" runs on a single OS thread, whereas many processes can run on each scheduler.

Good correction. But for the purposes of this discussion re the processy-nature of Erlang processes, it's an implementation detail without difference in programming semantics.

> processes and forking are relatively expensive

This and "async is always faster" are two things which are no longer true on modern hardware.

Forking a process on Linux uses the same clone() syscall as creating a thread so forking a small binary takes only tens of microseconds, leaving plenty of time for <1ms total response times.
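If you want to measure this yourself, here is a rough Unix-only sketch. It forks the Python interpreter itself, which is a much bigger process than a small binary, so I'd expect the figure to come out higher than what's quoted above; `time_fork` is a made-up name.

```python
import os
import time


def time_fork(iterations: int = 200) -> float:
    """Average seconds per fork() + immediate child exit + waitpid()."""
    start = time.perf_counter()
    for _ in range(iterations):
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # child: exit at once, skipping cleanup handlers
        _, status = os.waitpid(pid, 0)
        assert os.WEXITSTATUS(status) == 0
    return (time.perf_counter() - start) / iterations


if __name__ == "__main__":
    print(f"{time_fork() * 1e6:.0f} usec per fork+wait")
```

Note this times fork alone; a CGI-style setup also pays for exec and for the target program's startup.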


Sure. Under Linux, processes and threads are both just tasks anyways. Processes may be nearly as efficient as threads, but you can do one better by not even having to spawn new threads per worker, which is what a concurrency model like Go's or Erlang's enables with ease. CGI and Websocketd also have significantly more costs than just a fork since they need to execute a target program.

And even forgetting the costs of forking, threads are not necessarily much better anyways. I can easily have a single Go program scheduling literally hundreds of thousands of Goroutines across just a few OS level threads and have no problems whatsoever, and in fact I've done exactly that in production, whereas I would never even dream of doing that with threads.

Node.JS's event loop model also doesn't need to spawn threads per connection.

When I say processes and forking are relatively expensive, I don't mean compared to doing the exact same thing with threads; I mean compared to more modern alternatives, like using an event loop or Goroutines.
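For anyone who hasn't seen the event loop style: a minimal sketch using Python's asyncio as a stand-in (this is not Go's scheduler, and asyncio tasks run on one thread rather than being spread across several). It runs tens of thousands of concurrent tasks without starting a single extra OS thread; the task count is arbitrary.

```python
import asyncio
import threading


async def worker(i: int) -> int:
    """A tiny task that yields to the event loop once, then returns."""
    await asyncio.sleep(0)
    return i


async def run_many(n: int) -> tuple[int, int]:
    """Run n tasks concurrently; return (sum of results, OS threads alive)."""
    results = await asyncio.gather(*(worker(i) for i in range(n)))
    return sum(results), threading.active_count()


total, os_threads = asyncio.run(run_many(20_000))
print(f"sum={total} os_threads={os_threads}")
```

Each task here costs a small heap object instead of a kernel thread with its own stack, which is the trade being described.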

I never wrote that going back to CGI and creating a process or thread per request was a good idea, only that forking is much cheaper than often thought.

Event loops and userland cooperative multitasking like Goroutines predate Linux 2.6 kernel threads. They're not "more modern".

For sure, but the biggest overhead is usually application startup (in say Java, Dart, Node, Python, etc) where just the VM can take a few 100ms to start. Using a compiled language would most likely be the way to go for maximum performance (although it's less important if the WebSocket is long lasting)

> Forking a process on Linux uses the same clone() syscall as creating a thread

The conclusion to be drawn is "Linux threads are costly", not "Linux processes are cheap".

Is it? I seem to remember that spawning a process on Linux was about as fast as a new thread on Windows XP.

This is something that confuses me. Servers offer CGI and fastCGI and if you aren't using those to interface with your own code, how do you do so otherwise? Do some languages have their own deep connections into the server flow that makes them faster? I just don't understand this.

If you’re using CGI or FCGI, then you’re using a general web server that is probably serving lots of different applications.

These days, folks often write the app to contain a web server. Write your app in JavaScript, pull in a dependency that serves HTTP, write enough code to route requests and start the server listening- welcome to Node. Python and Ruby offer the same. Swift gets this functionality from Kitura and Vapor. I’ve experienced C and C++ varieties.

Back in the day when Apache was seriously hot, mimicking today’s architecture would have meant writing an Apache module - the app would then live within the server and no communication channel/pipe/socket/fork/spawn would be required.

Yes but the poster, above, is complaining about using CGI with a server. My question is, if you're not using CGI/fastCGI with Apache or nginx for example, what is he suggesting should be used instead?

Ok, lemme spell it out: bundling the app and the server together; that’s what’s implied.

Which comes back to my question. Can one do that with apache and nginx? I'm pretty sure I've seen methods for this in Apache but only the paid version of nginx. Am I wrong?

Today, app servers are mostly separated from front-end servers because, it turns out, it's best to not hold on to too much memory and/or threads while waiting for a slow client to receive their stuff. So async servers like Nginx are used on the front tier and they knock to app servers over the net.

But Nginx does have a Lua module, and, I think, Perl. Afaik both are in the free version. Mostly used to augment configuration, or for lightweight interfacing directly with databases (ahem Redis and Memcached cough).

I mentioned Apache modules in my comment. Write a module, add it to Apache’s config, your code loads with the web server.

Looks like one can also write nginx modules, but I’m not familiar with nginx licensing to know if a purchase is required to use that in a commercial deployment.

It's not required, and in fact there's a big open source project taking advantage of that: OpenResty uses a module that loads LuaJIT into Nginx, and builds very fast applications into Nginx using Lua.

Using your language of choice: You open a socket on a port, listen for connections, when connection from client arrives read bytes from the socket, parse http request from those bytes, do something that may yield bytes in response, write the response bytes to the socket, close the socket (optional).
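Those steps, sketched with nothing but Python's socket module. It handles one connection at a time, ignores request bodies and does no error handling, so treat it as an illustration of the flow rather than a server to deploy; `make_listener` and `handle_one` are made-up names.

```python
import socket


def make_listener(port: int = 8080) -> socket.socket:
    """Open a listening TCP socket on localhost (port 0 picks a free port)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("127.0.0.1", port))
    listener.listen()
    return listener


def handle_one(listener: socket.socket) -> None:
    """Accept one connection, parse the request line, write a response."""
    conn, _addr = listener.accept()
    with conn:
        data = b""
        while b"\r\n\r\n" not in data:  # read until the end of the headers
            chunk = conn.recv(4096)
            if not chunk:
                break
            data += chunk
        request_line = data.split(b"\r\n", 1)[0].decode("latin-1")
        method, path, _version = (request_line.split(" ") + ["", "", ""])[:3]
        body = f"you sent {method} {path}\n".encode()
        head = (
            "HTTP/1.1 200 OK\r\n"
            "Content-Type: text/plain\r\n"
            f"Content-Length: {len(body)}\r\n"
            "Connection: close\r\n\r\n"
        ).encode()
        conn.sendall(head + body)  # leaving the with-block closes the socket
```

To serve continuously you'd call `handle_one(listener)` in a loop; in practice you'd pull in an HTTP library instead of parsing by hand.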


I asked about how to do this through apache or nginx. I know how to do it as you suggest.

You don't need a server like Apache or Nginx. Your code can just bind to 80/443 directly.

Yes you can but that has nothing to do with what I asked.

If you get the same answer three times but don't think any of them answer your question, maybe consider that you're the one failing to communicate.

You can also just serve HTTP.

That has nothing to do with my question. Of course you "serve HTTP". I want to know how one interfaces with a server without CGI or fastCGI.

You serve HTTP to the server. To the front-end server or the load balancer, aka ‘reverse proxy’. That's how.

Both Apache and Nginx can proxy HTTP, out of the box.

IOW, opening a socket from the routing HTTP server to an application HTTP server. That’s pretty much how FCGI works.

Yeah, people just seemingly realized that there's nothing special about FastCGI that couldn't be done with plain http. And as a bonus, the app itself can work as a server in the dev env, and testing is simpler.

(Only, CGI managed to get headers sorta right by embedding them in protocol vars and not the other way around.)

But I already do this and that doesn't answer the question about the difference between fastCGI and what everyone else seems to be using, but not explaining, is a method of direct integration or ability to talk with the server without CGI/fastCGI.

After doing a little research, it appears this can be done by implementing or writing modules that will talk directly to server internals if I'm understanding this correctly.

You don't usually need to talk to the server from the app. Pretty much the only thing needed in practice is returning an internal redirect that ends up served by another path (with error pages being variations of this). This is solved fine with an HTTP header in the response.

CGI and FastCGI actually have the exact same flow of information as HTTP (to my knowledge). Anything else is either implemented on top of that, or you'll need a module reaching into internal functions.

Is that a feature or a bug of *nix?

Not really; IIRC forking is more expensive on NT, and *nix processes/threads tend to be fairly cheap. It might be a generally expensive model, and threads/pools are probably the better-scaling option regardless of OS.

On NT though, there's a deeper divide between processes and threads. Threads on NT should be a lot cheaper. Especially since you can't fork on NT without diving into undocumented APIs* :)

*On the Windows subsystem, of course. There's the lower level ZwCreateProcess function which can/could be used to fork, but it's undocumented and I believe it only existed for the old SfU. Now that that's gone and Linux Subsystem uses something called Pico Processes, I'm guessing this old fork flag is pretty much unsafe to use for anything at all.

Doesn’t this have an impedance mismatch? Stdin/out are stream based. Websocket is message based. There is no guarantee you can transmit the content of a single WebSocket message inside a single os read or write call. Unless you expect that on both sides messages might be fragmented across multiple calls and callbacks. But I don’t see the docs mentioning that.

Newlines. Each line written to STDOUT is sent as one frame, and each frame received is read from STDIN followed by a synthetic newline. The FAQ explains how to escape multiline messages (or binary data, presumably).
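From the program's side that looks roughly like this. A minimal sketch of a line-per-message handler in Python; the echo behaviour and the `run` name are just for illustration, and the in-memory streams stand in for the real STDIN/STDOUT.

```python
import io
from typing import TextIO


def run(inp: TextIO, out: TextIO) -> None:
    """Treat each input line as one message and answer with one output line."""
    for line in inp:
        message = line.rstrip("\n")
        out.write(f"echo: {message}\n")
        out.flush()  # flush per line so each reply leaves promptly


# Demo with in-memory streams standing in for STDIN/STDOUT.
demo_out = io.StringIO()
run(io.StringIO("hello\nworld\n"), demo_out)
print(demo_out.getvalue(), end="")
```

In a real script you'd call `run(sys.stdin, sys.stdout)` and point websocketd at it. The flush matters: without it, output can sit in a buffer when STDOUT is a pipe.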


Which is a common anti-pattern. You take a medium that is fully transparent to binary data, and then needlessly restrict it to text. There are excellent ways of packetizing binary data in a stream, most notably SLIP.
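For reference, SLIP framing is tiny. A sketch following the byte values in RFC 1055 (END 0xC0, ESC 0xDB), written from memory, so double-check against the RFC before relying on it. The decoder drops empty packets, the way the RFC's sample code does.

```python
END, ESC, ESC_END, ESC_ESC = 0xC0, 0xDB, 0xDC, 0xDD


def slip_encode(packet: bytes) -> bytes:
    """Escape END/ESC bytes in the payload and terminate it with END."""
    out = bytearray()
    for b in packet:
        if b == END:
            out += bytes([ESC, ESC_END])
        elif b == ESC:
            out += bytes([ESC, ESC_ESC])
        else:
            out.append(b)
    out.append(END)
    return bytes(out)


def slip_decode(stream: bytes) -> list[bytes]:
    """Split a byte stream into packets, undoing the escapes."""
    packets, current, escaped = [], bytearray(), False
    for b in stream:
        if escaped:
            current.append(END if b == ESC_END else ESC if b == ESC_ESC else b)
            escaped = False
        elif b == ESC:
            escaped = True
        elif b == END:
            if current:  # ignore empty packets
                packets.append(bytes(current))
                current = bytearray()
        else:
            current.append(b)
    return packets
```

Newlines pass through untouched because only two byte values are special, so arbitrary binary payloads survive the trip.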

Ah thanks that makes sense. I didn't see it when I glimpsed over the site.

That should provide reliable framing on both sides. But as shown in the linked page it also comes with the downside of not being able to send raw websocket messages which contain a newline - so it's not possible to port existing applications from other websocket servers to this one without having to change communication.

I don't understand why the daemon needs the client to escape newlines in the message. Can't it handle that before feeding the message to STDIN?

For example, my browser sends:

    This is line one...
    ...and this is line two
as a string ("This is line one...\n...and this is line two") and the websocketd receives that, and passes along

    This is line one...\\n...and this is line two\n
to the STDIN.

It knows that the message is one thing, so the newlines aren't the same as the end of message, right?

It'd also be nice to use EOF as the indicator to flush STDOUT to the client, so the program can also emit newlines without needing to escape them first.

All that being said, this is nifty. I think I can use this for some low-volume ideas I have on an internal app. The newline handling isn't a problem because I'm not replacing anything that already exists.

> It'd also be nice to use EOF as the indicator to flush STDOUT to the client, so the program can also emit newlines without needing to escape them first.

EOF is not an actual control code though (and is implementation dependent). However there are control codes for message framing e.g. ETB, ETX, EOT (often used as terminal EOF), FF (form feed / page break), RS/GS/FS

> EOF is not an actual control code though

You're right. I think I have to relearn that periodically :)

It's line based. A newline character separates messages.


Interested to know in what situations you anticipate this would cause problems?

In every situation? The API interface doesn’t match.

Websocket isn’t a stream, it is framed. Obviously it can be used like a stream but that depends on the client and server. The issue here is that all the client side (browser) APIs expose frames instead of streams.

Suppose I send a JSON object: I have no idea what framing this socket server will use, or whether it will break the object up into multiple pieces and require the client to reassemble it. If I send two JSON objects in a row, do they arrive in two websocket frames or could they be buffered into one?

Secondly the websocket protocol is much more complex than a tcp stream. It has keep alive packets that can be sent, it has close packets that allow a connection to terminate with a given code that can be handled by the client. How are these exposed?

I assumed that the daemon just tried to send the data from the local application as a websocket message whenever it received something through the FD. If the application tries to write a 100kB message to stdout, the daemon might pull this out of its socket in smaller chunks, since the OS buffer might not be as big. If the buffer is 50kB, then the daemon would send 2 50kB messages instead of a single 100kB one. If the JS on the other side had assumed that everything is inside a single message, things would be broken.

However it looks like the Daemon might wait for a newline until it forwards everything as a message. Which at least fixes this problem, but might have other side effects.
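The waiting-for-a-newline approach amounts to something like the following. This is a guess at the general technique in Python, not websocketd's actual code: buffer whatever chunks the pipe delivers and only emit a message once a full line has arrived. `LineFramer` is a hypothetical name.

```python
class LineFramer:
    """Buffer stream chunks and emit one message per complete line."""

    def __init__(self) -> None:
        self._buf = b""

    def feed(self, chunk: bytes) -> list[bytes]:
        """Add a chunk; return the messages completed by it (maybe none)."""
        self._buf += chunk
        # Everything before the last newline is complete; the tail is kept.
        *complete, self._buf = self._buf.split(b"\n")
        return complete


framer = LineFramer()
for piece in (b"hel", b"lo\nwor", b"ld\n"):  # arbitrary chunk boundaries
    for message in framer.feed(piece):
        print(message)
```

However the OS slices the 100kB write, the receiver sees whole lines, so chunk boundaries stop mattering.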

> However it looks like the Daemon might wait for a newline until it forwards everything as a message. Which at least fixes this problem, but might have other side effects.

That you can't put newlines in messages — whether input or output, you've got to modify the protocol to escape them somehow due to intermediate implementation details.
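One common way to do that escaping, sketched in Python. This is an illustrative backslash scheme of my own choosing, not necessarily what the websocketd FAQ prescribes: escape backslashes first, then newlines, and reverse both on the other side.

```python
def escape(message: str) -> str:
    """Encode a message so it contains no raw newline (backslashes first)."""
    return message.replace("\\", "\\\\").replace("\n", "\\n")


def unescape(line: str) -> str:
    """Reverse escape(): backslash-n becomes a newline, backslash-x becomes x."""
    out, i = [], 0
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line):
            nxt = line[i + 1]
            out.append("\n" if nxt == "n" else nxt)
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


original = "This is line one...\n...and this is line two"
wire = escape(original)
print(wire)
print(unescape(wire) == original)
```

Both ends have to agree on the scheme, which is exactly the "modify the protocol" cost being pointed out.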

> Stdin/out are stream based. Websocket is message based.

WebSocket is implemented on top of TCP/IP, which is stream based.

Good time to plug this utility, swiss army knife for websockets: https://github.com/vi/websocat

Author is extremely nice as well.

Websocketd is neat. It inspired me to make the websocket directive in Caddy: https://caddyserver.com/docs/websocket

Aha! My first thought was "Why use this instead of Caddy's websocket support?"

Answer: Websocketd came first.

Unpopular opinion: AWS lambda is basically equivalent to CGI. The web has come full circle.

It's closer to FCGI, since Lambda processes aren't restarted for each new request.

Question: why doesn't Amazon just use the standard FCGI interface, then?

Maybe they are; as far as I know, the protocol they use to communicate with the worker isn't specified, all they say is "implement a function with this signature, then import our SDK, and it will call your function". This gives them the flexibility to switch protocols at will, at the cost of having to implement these SDKs.

That said, building a FCGI "bridge" isn't hard, it's just that nobody cared enough to do it.

They have specified the lambda runtime. https://docs.aws.amazon.com/lambda/latest/dg/runtimes-api.ht...

Basically: your instance POSTS to an endpoint to get/reply to requests, one at a time.

I can't wait until you can specify that a lambda instance can handle a certain # of requests in parallel. Ex: things just blocked on IO/DB. That would make better use of your RAM.

My new serverless-to-CGI bridge will be called "hipstercgid"

Several times. :D

Is this supposed to be a bad thing?

No, it's just ironic. It also means that AWS could potentially reuse standard (F)CGI interfaces.

It looked simple enough, but I was curious about the threading example on https://github.com/joewalnes/websocketd/wiki/CPP-Input-Outpu...

The variable `count` appears to be incremented non-atomically from two different threads. Is that safe in C++?

It is unsafe. But it would be safe if these two lines were swapped:

since the reader already holds the mutex while reading `count`.
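The general rule, shown as a Python analogue since the wiki's C++ isn't reproduced here: every read and write of the shared counter happens while holding the same mutex. In C++ the unsynchronized version is a data race, which is undefined behaviour. `count_with_lock` is a made-up name.

```python
import threading


def count_with_lock(n_threads: int, increments: int) -> int:
    """Increment a shared counter from several threads, guarded by one lock."""
    count = 0
    lock = threading.Lock()

    def work() -> None:
        nonlocal count
        for _ in range(increments):
            with lock:  # the read-modify-write happens under the mutex
                count += 1

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return count


print(count_with_lock(2, 10_000))
```

With the lock held around each increment no update can be lost, so the total always equals threads times increments.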

Do a PR or equivalent?

Friendly reminder that not everyone is in a position where they can create pull requests against random projects: they might not have time, or clearance from their company's legal department, or…


Note it was a question. As you mention, not everyone can do so (etc).

The question looked like "why aren't you making a PR for this".

The second example is also leaking memory.

i’ve been using this in production to stream logs to a web console (~2,000 sets of logs distributed across 4 servers with about 350 active connections at a time) and have never had any issues

How do you know that you haven't had any issues?

Usually you notice issues by the server locking up, the clients reporting problems, or messages being missing from the log (which you tend to notice when you search for specific things in the logs).

So how do you know that you're not just getting 80% of "log entry X" just because you get hits for them every now and then?

Logging is usually pretty deterministic.

A request might generate log entries a, b1, c or a, b2, c, depending on some conditions. The exact contents vary by request (otherwise there would be no need to log them), but the type is always the same.

If you find logs for c without either b1/b2 or a, you know log entries went missing.

If you have a 20% miss rate with recording your log entries, and you analyze just 20 log entries, the chance that at least one of them is missing is already around 98.8%.

If you actually use your logs for anything, it becomes pretty obvious pretty quickly when they are incomplete.
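The 98.8% figure comes from the usual complement trick, assuming each entry goes missing independently of the others (a simplification). A quick check in Python, with a made-up function name:

```python
def p_at_least_one_missing(miss_rate: float, n_entries: int) -> float:
    """1 minus the probability that all n entries were recorded."""
    return 1.0 - (1.0 - miss_rate) ** n_entries


print(f"{p_at_least_one_missing(0.20, 20):.1%}")
```

That is 1 - 0.8^20, and since 0.8^20 is only about 0.0115, a gap is nearly certain to show up in even a small sample.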

we track errors both server and client side. only errors we routinely see are network issues.

to be fair there could be silent failures, but after 4 years of daily use by people who are experts on these logs, we would have had at least a couple reports of missing lines.

we often download the files post viewing on streams also. mostly it’s exceptions so we’d easily notice missing lines in stacks

If you really wanted to follow UNIX philosophy, why not build atop xinetd?

for example... simply accept stdin and stdout as I/O streams by default, but don't provide a network mechanism.

You can put websocketd behind xinetd. xinetd doesn't talk websocket, so you still need to "provide a network mechanism" to bridge the incoming connection and the client's streams.

> xinetd doesn't talk websocket

Why not extend xinetd so that it can interact with websockets?

Because the point of xinetd is to spawn other daemons on socket activations. xinetd handling stuff itself is the exception not the rule (5 services are internally provided, none of which you want to run: RFC 862 "echo", RFC 863 "Discard", RFC 864 "CHARGEN", RFC 867 "daytime" and RFC 868 "time").

So, all of the same sorts of scaling problems as inetd?

Well, unless it's doing something exceedingly clever, it looks like it will be launching one process per connection.

For communication servers, this could prove a challenge – you'll probably want to use some kind of pub-sub architecture. By the time you've gone down that road, you could've gone down one of the more robust paths.

Still, this looks great for smaller apps. And, it seems a great way to prototype – especially if your favourite language doesn't have great websocket support.

> Still, this looks great for smaller apps. And, it seems a great way to prototype – especially if your favourite language doesn't have great websocket support.

Totally agree. inetd is a pretty good model, it just falls down in the face of thousands of slow, low-computational effort connections tying up gigabytes of RAM. That and most schedulers seem to struggle with that number of threads.

Wondering if it would be more "unixy" to use sock files [1] ? I think some WSGI servers such as Gunicorn [2] support said functionality. I do think there is a place for websocketd provided you do not use tools like WSGI servers already.

1: https://en.wikipedia.org/wiki/Unix_domain_socket

2: http://docs.gunicorn.org/en/stable/deploy.html?highlight=soc...

Reminds me of a little project I did a while back, but instead reading from stdin like netcat


This is really useful for building one-off utility websites. I used this in one of my previous companies to tail the logs on our little QA server and push them via websocketd to an internal web page.

I'd be interested to know if anybody has used this in production. I've used Websocketd for quick and dirty prototyping and making things like in-browser monitoring tools and it was very fast to set up, but a full-fledged library like Gorilla or uWebsockets seems more practical for real-world applications with thousands or more simultaneous users.

Could this then be used to replace node and this type of mess? https://github.com/phoboslab/jsmpeg/blob/master/websocket-re...

Code looks reasonable.....

I've built a similar tool, but for HTTP, at https://github.com/ostrolucky/stdinho

FYI, the page needs a viewport tag in the head to display properly on mobile: <meta name="viewport" content="width=device-width, initial-scale=1.0"/>

The unix philosophy is massively overrated because processes are unwieldy in practice, and support only weak notions of composition even in theory.

is it me, or does the site look really similar to letsencrypt.org? I looked on both sites and neither say they're using a template.

FWIW the CSS template is just some Bootstrap stuff, and this for the syntax highlighter in the code snippets: /* http://prismjs.com/download.html?themes=prism-twilight&langu... */ /* prism.js Twilight theme. Based (more or less) on the Twilight theme originally of Textmate fame. @author Remy Bach */

If someone really cares to investigate, the Letsencrypt.org website (Hugo based) is here for starting with:


Seems useful! Reminds me of Pushpin.

Does anyone have any experience using websockets in a serverless architecture?

>Avoid threading headaches

>Each inbound WebSocket connection runs your program in a dedicated process.

Not the best design decision

The quality of the design depends on the goals. For simplicity, seems like a great decision.

Or just use Cowboy, which is written in Erlang and will scale better.

Is there any other benefit than scaling better? Because I'm pretty sure 99.9% of projects never hit the "need to scale more" part, and if they did it doesn't seem like this would be too difficult to move off of

what if you need two clients to connect to the same process

How do posts like this end up at #2 on the front page with just one comment? Websocketd doesn't strike me as a particularly popular or well-known tool.

Well if it was popular or well known why would it be news-worthy?

Indeed. Discovering interesting new things is one of the main purposes of HN.

I think the velocity of the votes and the unique-ness of their geography matters a lot more than comments or discussions.

> unique-ness of their geography

The location of the server is a factor??

It's not listed in https://github.com/minimaxir/hacker-news-undocumented/blob/m... or in https://drewdevault.com/2017/09/13/Analyzing-HN.html

It's the first time I hear anyone mention it. It would be great if OP had some supporting link.

The location of the voter.

Ohh I see, okay thanks. Surprising but cool!

So if one creates a HN profile from Amundsen–Scott Station, you can have a disproportionate effect? Very interesting.

I imagine it's not the mere creation that matters but where the login is usually from? And I'd assume the account age matters too...


But websocket connections are usually long lasting. So the cost of the fork is less important.

So kind of like CGI. Okay.

That's what they say on their site: "It's like CGI, twenty years later, for WebSockets"

CGI 20 years later? can somebody explain what they mean by this

edit: okay this is pretty cool, what use cases are there?

Ok I know the software-today-is-so-bloated trope is overplayed but... "the UNIX way"?

The compiled Linux x86_64 binary is 7 megabytes. All of System V combined was not that big.

Is unix defined as having small binaries? I don't think that's true. Yes, it's statically linked and as a result somewhat large.

This is a program built around executing programs and communicating over pipes. It's composing programs together. I think that's certainly UNIX-ish.
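For anyone who hasn't seen the model, a program on the other end of those pipes can be as small as the sketch below. It assumes the line-per-message convention (one incoming message per stdin line, one outgoing message per stdout line); `handle` is a placeholder for whatever your program actually does.

```python
import io
import sys


def handle(message):
    """Hypothetical business logic: reply with the message upper-cased."""
    return message.upper()


def serve(inp=sys.stdin, out=sys.stdout):
    """Read one message per line, write one reply per line, flush each time."""
    for line in inp:
        out.write(handle(line.rstrip("\n")) + "\n")
        out.flush()


# Local demonstration with in-memory streams instead of real pipes.
# Run under websocketd, the script would simply call serve() with no arguments.
demo_in = io.StringIO("hello\nworld\n")
demo_out = io.StringIO()
serve(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

Nothing in the program knows about WebSockets, which is the composition argument in a nutshell: the same script works from a terminal, a pipe, or a browser connection.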

> Is unix defined as having small binaries? I don't think that's true. Yes, it's statically linked and as a result somewhat large.

No, but literally the first point is:

> Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new features.

It's not hard to see how this would apply to libraries, and how including third party code in your binary would break this idea because you're essentially "freezing" the application in time.

Your definition regarding "composing programs together" is not all that makes a program UNIX-ish.

It then goes on to list a whole load of feature creep so the claim of doing one thing is a bit ridiculous.

> Written in Go

I have a fairly simple web-app written in go – no websockets, just net/http. I just checked and the binary size is 6.8 megabytes.

There's a lot in there. For starters, there's the go runtime. Then there's the HTTP server.

I suspect there may be ways to improve on the binary size if it mattered so much. xz gets it down to 2.1MB, suggesting there's some redundancy in there.

It's certainly an issue, especially now with wasm: https://github.com/golang/go/issues/27266

Probably unavoidable for this kind of app though. Unicode alone involves a whole lot of data tables that are going to be hard to get rid of unless they're dynamically linked.

7MB is still a lot smaller than Docker containers :)

You can cut your binary size almost in half on Linux by removing debug information and symbols with:

`go build -ldflags="-s -w"`

I thought it was dangerous to use `strip` on any go binaries, because it would result in an incorrect program. LDFlags might cause something else but isn't it the same concept?

... and yet Unix was critiqued for being written in a high level language instead of assembler

UNIX was originally written in Assembly, B and C came later.

And why should it be criticized? Outside AT&T, people were writing OSes in high level languages since 1961.

It's kind of just the way golang's compiler statically compiles everything. If you use gccgo, you can get a 762KiB binary easily.

SVR4 came out 30 years ago. I think it's time to let it go.

Aah the joys of false progressivism.

I hear that representing the absence of value as a number was invented about 6000 years ago, perhaps we should abandon that as antiquated as well..

also, to wit, a Unix kernel of recent vintage:

    $ uname -msr
    OpenBSD 6.3 amd64
    $ du -skc bsd.rd bsd  
    9664    bsd.rd
    12912   bsd
    22576   total
vs, say, a Linux kernel of recent vintage:

    $ uname -msr
    Linux 4.9.0-8-amd64 x86_64
    $ du -skc /boot/vmlinuz-4.9.0-8-amd64 /boot/initrd.img-4.9.0-8-amd64 /lib/modules/4.9.0-8-amd64
    4152	/boot/vmlinuz-4.9.0-8-amd64
    21616	/boot/initrd.img-4.9.0-8-amd64
    212248	/lib/modules/4.9.0-8-amd64
    238016	total
yes, there are likely more drivers in the latter. highly doubt there is an order of magnitude more though.

not to mention some 'modern' npm+webpack monstrosity.

that said, given the latter, i can hardly fault a <10MB go executable as 'excessive', so you're right on that front.

> yes, there are likely more drivers in the latter. highly doubt there is an order of magnitude more though.

That's exactly what they are; ~180M of those 212M are in drivers/. And even outside that, it includes stuff like fs/ocfs2, which I don't think OpenBSD supports.

Fun fact: OpenBSD 6.4 /bsd kernel can be shrunk from 14.8MB to 6.6MB. With gzip.

This turned me on.

I'm fine with letting it go! I just don't understand why they're claiming this is "the UNIX way" as if that's a good thing.

I hate to be trollbait, but: "the Unix way" is less about the size of programs, and more about programs being built as composable, modular components that interact together over a common interface.

Curious people can read more about it at https://en.wikipedia.org/wiki/Unix_philosophy or https://homepage.cs.uri.edu/~thenry/resources/unix_art/ch01s...

The websocketd site outlines this fairly well with the big quote on their homepage.

Looked at another way, a Unix-like ecosystem satisfies two of the principles of a SOLID software architecture: the Single Responsibility Principle and (arguably) the Open-Closed Principle.

If websocketd focuses on handling the nuts and bolts of websocket connections and invoking other programs and piping data into and out of them over a standard interface, then it's a Unix-like architecture, even if it's a fat, monolithic, statically linked binary.

The unix way is 'decoupled microservices', breaking down components in to small pieces for reuse, it continues to be used and reinvented in different contexts. Both monolith and decoupled components have their cheerleaders but simply dismissing either as 'not a good thing' doesn't add anything to the debate.

> Each inbound WebSocket connection runs your program in a dedicated process. Connections are isolated by process.

I see you and 10,900 other "software developers" get the point of WebSocket. It's especially "impressive" when you've written the whole thing in Go and could have used goroutines and channels, which could easily handle hundreds of thousands of connections (maybe millions). Terrible design!

For a long time I thought every software developer must be smart, but a lot of them don't see the big picture very often.

Why does this bother me so much? You created something that a lot of developers think is good enough, so less innovation will happen and more and more slow websites will appear because of tech like this...

There are definitely use cases for this, for instance you might have some command line application that uses standard io. Sure it might be adapted to have support for web sockets natively, but this of course takes time. And as the old adage goes time is money.

And sure you might be upset by "slow" websites, but for the small consultancy productising their scripts, it might make them - dare I say it, a quick buck. In the end the invisible hand of the market decides how much money is invested making websites fast and that is what guides the design of the technology being developed.

Perhaps it is you who fail to see the big picture.

The business logic is all outside of websocketd. There is no standard as to how to do this, so obviously the simplest way is to mimic CGI. Innovation happens when the status quo isn't enough, and the status quo until now was that there was no 10-second way to send whatever you want through a websocket. It can happen now, and you can be the one to propose some better protocol for a websocket endpoint and a business logic process to talk to each other.

I've never felt a stronger sense of https://xkcd.com/386/ in my life. Why are you being so cynical? The author made a tool with no interface that turns any program into a websocket server, making no claim that it can directly replace production servers focused on scale. What are you complaining about? Who hurt you?
