Privileged Ports Are Expensive (2016) (adamierymenko.com)
265 points by phaer on July 6, 2017 | 210 comments

In terms of security/isolation, processes, users, containers and virtualization are all essentially the same thing. I wish the people working on these things would step back and notice the forest for the trees.

Whatever the ultimate "isolation unit" ends up being, it needs to be recursive. That means being able to run processes within your process (essentially as libraries), create users within your user account (for true first-class multi-tenancy), or VMs within your VM (without compounding overhead).

It turns out that this author also wrote "Docker: Not Even a Linker"[1] which was also deeply insightful about unconscious/accidental architecture decisions. I'm impressed by his insight and disturbed that most people don't seem to understand it.

[1] https://news.ycombinator.com/item?id=9809912

Let me take that one step further:

>processes, users, containers and virtualization are all essentially the same thing.

...and so are modules/objects/whatever your language of choice calls them. Abstraction boundaries, to be precise. Abstraction, security, and type-safety, are all very closely related.

These language-specific mechanisms for isolation are recursive - trivially so. And language runtimes and compilers make security cheap - so cheap that it's ubiquitous.

Processes, users, containers and virtualization all rely on an operating system for security, which in turn relies on hardware features. Specifically, virtual memory and privileged instructions. And those hardware features are slow, and more importantly: they're not recursive!

But hardware-based isolation does have one key advantage over language-based isolation: It works for arbitrary languages, and indeed, arbitrary code.

I completely agree that recursive isolation is necessary. We need to figure out rich enough hardware primitives and get them implemented; or we need to migrate everything to a single language runtime, like the JVM.

Great point. The JVM tried for this position and failed IMHO (I think it abstracted too much). Now the browser is slowly homing in on it, and it might succeed (mostly due to sheer inertia). As opposed to the JVM, I like to call the ultimate goal the "C Virtual Machine" (just process isolation++).

I think moving isolation out of hardware is really important (both to make it recursive and portable). NaCl is an interesting step in that direction. If you could use something like it to protect kernelspace (instead of ring 0), syscalls could be much, much faster.

There's another problem with language-based isolation: it makes your language/compiler/runtime security-critical. Conversely, NaCl has a tiny, formally proven verifier that works regardless of how the code was actually generated, which seems like a much saner approach.

I'll also say that I don't think it's reasonable to expect every object/module/whatever within a complex program to be fully isolated (in mainstream languages at least). There's no need for it, and it will have too much overhead (in a world where objects in many languages already have too much overhead). Better to start relatively coarse-grained (today the state of the art is basically QubesOS), and gradually improve.

"NaCl is an interesting step in that direction. If you could use something like it to protect kernelspace (instead of ring 0), syscalls could be much, much faster."

It's actually partly inspired by how old security kernels work mixed with SFI. The first secure kernels used a combination of rings, segments, tiny stuff in kernel space, limited manipulation of pointers, and a ton of verification. Here are the original ones:


A Burroughs guy who worked with Schell et al on GEMSOS and other projects was the Intel guy who added the hardware isolation mechanisms. They were originally uninterested in that. Imagine the world if we were stuck on legacy code doing the tricks no isolation allows. Glad it didn't happen. :)

Eventually, that crowd went with separation kernels to run the VMs and such that the market was demanding. They run security-critical components directly on the tiny kernel.


The SFI people continued doing their thing. The brighter ones realized it wasn't working. They started trying to make compiler or hardware assisted safety checking cost less with clever designs. One, like NaCl and older kernels, used segments to augment SFI. Others started looking at data flow more. So, here's some good work from that crowd:




So, have fun with those. :)


...this is extremely relevant to what you're saying. A talk worth watching.

I don't think I would call JEE and Spring a failure.

They mark the turning point I stopped worrying about UNIX deployments.

An application server has all the features I care about from a container, including fine grain control over which apis are accessible to the hosted applications.

It doesn't matter if the application server is running on the OS, a hypervisor, container or even bare metal.

Back in 2011 we were already using AWS Beanstalk for production deployments.

Also OS/400 is like that, user space is bytecode based. For writing kernel space native code, or privileged binaries you need the appropriately called Metal C compiler.

It's been done:



The latter runs FreeBSD on an FPGA. There are others that use crypto to do similar things with sealed RAM. What we need isn't tech to be invented so much as solutions to be funded and/or bought. Neither most suppliers nor the demand side want to make the sacrifices necessary to get the ball rolling. Most prototypes are done at a loss by CompSci people. There are a few niche companies doing it commercially. High-assurance security w/ hardware additions is popular in the smartcard market, for example. Rockwell-Collins does it in defense with the AAMP7G processor, but you can bet the order volume is really low. Price goes up as a result.

"It's been done" and "What we need isn't tech to be invented so much as solutions to be funded and/or bought." are somewhat dismissive overstatements.

If we had a way to implement a capability-secure runtime on Linux on Intel CPUs, in a way that improved performance rather than making it worse, and was straightforwardly usable with existing programs, then we could make a ton of progress and easily be commercially successful.

But that is technologically difficult. Maybe we can still figure it out, though. Or maybe we just need to figure out the right trade-offs to get something viable out the door and widely used.

"in a way that improved performance rather than making it worse"

This part is not realistic if it's apples to apples. Performance will always drop because the high performance came specifically by doing unsafe things that enable attackers. Adding checks or restrictions slows it down. The question is whether it can be done without slowing things down too much. I'm hopeful about this given many people are comfortably using the net on 8-year-old PCs w/ Core 2 Duos that still run pretty well. That was 65nm tech. Matching it, or at least a smartphone, is what I'd aim at for a first run.

>Performance will always drop because the high performance came specifically by doing unsafe things that enable attackers.

No, this is not true. Strong static typing and other compile-time information can (at least theoretically) allow improving both security and performance.

For example, look at single address space operating systems. Switching between security domains is cheaper when you don't have to switch address spaces. That's an improvement to security that allows increased efficiency.

We were talking about C code. That doesn't have strong, static typing with built-in safety. C developers also rejected things like Cyclone and Clay. So, my comment assumes a C-based platform (esp OS).

Intel had a wonderful processor for that, iAPX 432, and they botched it.

Every time I read about it I wonder how things would have turned out, if they managed to do a proper job with the CPU.

Don't forget i960 which had some of 432's fundamental protections with decent performance. It was used later in high-end embedded market.


This. We know how to build secure systems, but VCs won't fund the effort unless there's profit to be made and existing software will run on it. That second criterion especially means you're going to have to dumb down either your security or your performance to the point that there's no real value added.

I've come to the conclusion that the only realistic approach to security is to begin a slow, methodical replacement of every line of C code in the Linux kernel with Rust. No VC is going to pay for that, so some other funding mechanism will be required.

Not only Rust, but every natively compiled memory-safe language is a good alternative for replacing user space applications, especially if they aren't dependent on the usual C memory pointer optimization tricks for their use case.

Hence why you will see my schizophrenic posts regarding Go, although I dislike some of the design decisions, every userspace application written in Go is one application less written in C. And if bare metal approaches like GERT[0] take off even better.

[0] - https://github.com/ycoroneos/G.E.R.T

I think modern hardware is "recursive" in the relevant sense.

In an ethereal sense every Turing complete machine is recursive, because by definition you can implement another Turing complete VM on it. CPUs with kernel/user separation took it further by allowing vanilla instructions to run on bare metal within a virtual world defined by the kernel. Modern CPU virtualisation features extend this to the control instructions that an OS would use.

What else can you ask for?

I actually don't know much about nested virtualization. Can it be done to arbitrary depths with modern hardware?

One issue with being "recursive" is efficiency. Ideally, running 500 levels deep would be just as fast as running at the top level. In language-based systems this can be achieved with inlining and optimization, but it's difficult in hardware, where there is much less semantic information available about the running code.

In terms of missing the forest for the trees, I agree, but perhaps we are looking at different forests.

If port 22 is not privileged, then what is to prevent my daemon from listening on that port and collecting all the credentials of other users trying to log into the machine? Nothing. This is why users don't get to bind to privileged ports -- it's why privileged ports exist. The workaround is that every user get their own ssh daemon under their control and for every user to request that their ssh daemon handle their own login by specifying their own virtual network address: alice@alice.com and bob@bob.com -- instead of the current solution of using shared hosts with system services: alice@sharedhost and bob@sharedhost

What you cannot do is have system services (a shared host) with user control over daemons that fulfill those services. It has to be system control over shared services and user control over user services.

But every user having their own ssh daemon and their own hostname/IP is certainly looking a lot like the virtualization/containerization solution, no? The opposite of the virtualization option is not "get rid of privileged ports", but "have privileged ports" -- e.g. have resources controlled by the system and not any particular user.

The real complaint here is that using custom port numbers is unwieldy and we need more robust mappings from custom domains to shared domains with custom ports. For example, make it easier for users to set up their own virtual hostnames to map to a shared host with a custom port. Getting rid of privileged ports doesn't solve this problem at all.

For the same reason, users can't bind to port 80, because then my webserver could steal credentials to your site, as both our sites use the same common webserver. So either none of us controls the webserver or we each have our webserver, and with our webserver we'll need our own copies of other system libs, which again puts us back on the containerization path.

Again, the choice is of using system libs versus duplicating and then isolating user libs.

It seems to me that this is the fundamental trade off, and focusing on privileged ports as a problem, when they are one side of a fundamental trade off, is not really insightful at all.
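For anyone who wants to see the restriction being argued about, here is a minimal Python sketch. The `try_bind` helper is made up for illustration, and the result depends on the machine: on a stock Linux box an unprivileged user normally gets "denied" for ports below 1024, while root, a process with CAP_NET_BIND_SERVICE, or a lowered ip_unprivileged_port_start will get "ok" or "in-use" instead.

```python
import errno
import socket


def try_bind(port, host="127.0.0.1"):
    """Try to bind a TCP socket; return 'ok', 'denied', or 'in-use'."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except PermissionError:
            # EACCES: the kernel refused because the port is privileged.
            return "denied"
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                # Someone else (e.g. the system sshd) already owns the port.
                return "in-use"
            raise
        return "ok"


if __name__ == "__main__":
    for port in (22, 80, 8080):
        print(port, try_bind(port))
```

Note that "in-use" is the other half of the argument above: even when binding is permitted, whoever grabs the shared port first owns it for every user of that address.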

Those are all accidental limitations from the architecture as it exists, not as it could exist. There is no fundamental reason why the IP address is a combination of network and host address. There is no fundamental reason why a host is presumed to have only one IP address. There is no fundamental reason why alice@alice.com and bob@bob.com can’t be the same daemon listening on different IP addresses, but anybody connecting to the alice.com interface gets a different certificate and no access to the bob.com resources.

I think the systemd socket activation with declarative configuration files, and the Serverless cloud computing fad, are hints of how it is possible to control exactly what program is running, and even have some custom code, without having to duplicate and maintain all the binaries. Too bad they’re doing it on Linux, so they still have all those accidental limitations.

Hosts are not presumed to have only one IP address. This is a mistake that people make. (They often made it with djbdns, hence https://cr.yp.to/djbdns/ifconfig.html .) But it has never actually been the case. Indeed, one can find discussions of hosts that have multiple IP addresses in RFC 1122 and discussion that IP addresses do not have a 1:1 correspondence with network interfaces in RFC 791.

Details. In practice, few applications do interesting things when binding to multiple IP addresses. It’s like a special case of single IP address.

Perhaps I should phrase it, there is no fundamental reason why IP addresses are associated with hosts rather than users, or even services. There is no fundamental reason why you need to be root to listen to privileged ports, which includes many of the most useful ports.

There's implications for development whenever we have a sandbox wall, too. There are serious differences in the development experience of a binary language with no runtime model, a binary language with one, a language wrapped in a high level, fully sandboxed VM, and a language compiled to another language. Some of them allow productivity, others restrict entire application categories.

Once you have a runtime, dynamic linking suffers. Once you have a VM, you lose vast swathes of control over I/O and resources. And once you target a different language you end up with lowest-common-denominator semantics and debugging capabilities.

In some respects, the JVM-style sandboxed language runtime is an "original mistake" because it's an OS-like environment that doesn't have access to OS-like features, leading to a lot of friction at the edges and internal bloating as more and more features are requested. If we had similar access to memory, devices etc. everywhere the friction wouldn't be experienced as such, even if there were protections enforced that hurt performance in certain cases. You'd design to a protocol, and either the device and OS would support it or it wouldn't. That's how the Internet itself managed to scale.

But as it is, the stuff we have to work with in practical systems continues to assert that certain line of coder machismo: unsafe stuff will always be unsafe and You Should Know What You Are Doing, and anyone who wants safety is a Newbie Who Should Trust A Runtime.

Apparently C developers still don't know what they are doing, after 50 years.

"12. Trust the programmer, as a goal, is outdated in respect to the security and safety programming communities. While it should not be totally disregarded as a facet of the spirit of C, the C11 version of the C Standard should take into account that programmers need the ability to check their work."


"In terms of security/isolation, processes, users, containers and virtualization are all essentially the same thing."

This is not true. They each are defined by different security boundaries, have vastly differing properties of isolation and communication, contain different data, and are contained by different components.

From a security perspective they are not the same, though from a functional perspective they each solve similar problems.

I feel like you're speaking past each other. I think you're describing what they happen to be empirically, whereas the parent is describing what they are fundamentally (isolation mechanisms). I'm pretty sure the parent realizes you can't swap "user" for "container" in the same sentence, and that each one has different use cases and implications... but what is being claimed is a little more abstract than that.

That may be, though I do not think that is the case as the parent made a very narrow qualified statement.

The parent made a specific statement about the security properties of various isolation mechanisms, equating them all from a security perspective.

Can you give a specific example of something that e.g. processes absolutely must support which users absolutely cannot?

Consider that in Linux, processes and threads are implemented via the same abstraction (tasks). This abstraction actually leaks in some unfortunate cases, but it's generally considered "good enough."

The abstraction may be good enough functionally. My comment was a security statement, not a functional one.

In the case you mention, your choice of abstraction may affect your threat model, depending on if there is shared state and what data may require isolation.

I'm assuming that the underlying isolation mechanism is formally proven (or at least as good as possible). With a single set of reasonable features, it should be able to provide isolation between processes, users, containers and VMs. What am I missing?

For general purpose operating systems formal verification of security mechanisms should not always be assumed.

I was not talking about ideal security but that certain pre-existing mechanisms do not have equivalent security postures, as the parent had mentioned. The point isn't that with enough work isolation can be achieved but that that work has not in fact been done and the various mechanisms are distinct and their security values should not be conflated.

Fuchsia is looking like a neat contender for tackling this problem. I've read through most of their design docs, and if I've understood them correctly, it should allow for fully recursive processes. The sandboxing [0] and namespaces [1] design docs are a good starting point.

As an example of this in action, the test runner [2] creates a new application environment for each test.

[0] https://fuchsia.googlesource.com/docs/+/HEAD/sandboxing.md

[1] https://fuchsia.googlesource.com/docs/+/HEAD/namespaces.md

[2] https://fuchsia.googlesource.com/test_runner/

> Whatever the ultimate "isolation unit" ends up being, it needs to be recursive.

Thank you. I've always thought about this and figured I must be crazy since nobody else seems to care about it.

One of my biggest frustration while running a multi-tenant system running NixOS (Which is great btw, every user can install stuff in their home without sudo) is that HTTP is bound to port 80. And beyond the whole privileged port jazz, there's another trouble here : only one program can listen on myip:80.

Ideally, HTTP would use SRV DNS records. For the uninitiated, those are records that contain both a target host name and a port, so instead of having a "default port" of 80 (which is completely arbitrary), you get to define on what port that service is running. Then I could just assign each user of my multi-tenant system a range of ports (with 1000 ports each, we could get 65 users; maybe fewer due to ephemeral ports, but that's already more than I need anyway).

There are other solutions: for example, machines connected over IPv6 could get millions of IP addresses essentially for free. But IPv6 coverage is still spotty. And in my case, the system was running on a Kimsufi box, which gives exactly one IPv6 address per machine. (And let me rant for a moment and say that this is really stupid, for multiple reasons, such as IPv6 blacklists working on whole blocks anyway.)

I wonder if it's too late to get SRV records in, say, HTTP2. Or even in HTTP1 for that matter, as an amendment RFC. Because it would truly be awesome.
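The "1000 ports each gives 65 users" arithmetic can be sketched in a few lines of Python. The helper names and the choice to start handing out blocks at port 1024 are assumptions for illustration, not anything NixOS or the kernel provides, and the sketch ignores the ephemeral port range entirely.

```python
TOTAL_PORTS = 65536  # TCP port numbers are 16 bits: 0..65535


def max_users(block_size=1000, first_port=0):
    """How many whole blocks of block_size ports fit at or above first_port."""
    return (TOTAL_PORTS - first_port) // block_size


def port_block(user_index, block_size=1000, first_port=1024):
    """Return the range of ports reserved for the given (hypothetical) user index."""
    if not 0 <= user_index < max_users(block_size, first_port):
        raise ValueError("no port block left for this user index")
    start = first_port + user_index * block_size
    return range(start, start + block_size)


if __name__ == "__main__":
    print(max_users())                 # 65 blocks if every port is usable
    print(max_users(first_port=1024))  # 64 once ports below 1024 are skipped
    print(port_block(0))
```

Without SRV records (or some other way to publish the port), clients still have to be told which port in the block a given user's web server is on, which is the unwieldy part.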

I'd say the original mistake was making the port number distinct from the address in TCP/IP. In that model the remote party implicitly cares whether two services are on the same physical machine or not. We have ways to work around that, but not without friction.

The advent of DNS created a new opportunity to come up with a unified address, but sadly the SRV record seems to have been invented too late (RFC 2782 is dated February 2000).

Of course there are good reasons for address/port split, not least that ports come from TCP and addresses from IP. But still, it shows how many pragmatic real world decisions have pragmatic real world consequences.

Oh yes. If we go back further and make IP 64-bit and specify that all endpoints should get a /40 to be assigned to up to 2^24 network endpoints per system and then left out the concept of ports entirely, we could have saved probably ten thousand human lifetimes of coding.

Ipv4 addresses are 32 bits and ports are 16, so the above would only have cost 16 bits.

It's hard to have that much foresight, but I would not be surprised if it was suggested and shot down.

A reverse proxy can kind of act as a nice hack around this by multiplexing off of the Host header (which is unencrypted even with HTTPS due to the way that the SSL handshake is done).

The HTTPS Host header is not unencrypted. You are talking about SNI which is an extension to TLS that adds the hostname as part of the handshake. The Host header is still encrypted when using SNI.

That was my thought.

Having the host OS run what's essentially a virtualhost proxy seems like the most elegant, viable path forward. You'd need a protocol for the guest to announce what site(s) it's planning on hosting, and then bridge those out to the host.

Keeping multiple guests from claiming the same name would be useful.
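A rough Python sketch of that idea, under stated assumptions: `VhostRegistry`, `claim` and `route` are hypothetical names, the "announce" protocol is reduced to a method call, and routing looks only at the plain-text HTTP/1.1 Host header (so no TLS or SNI handling, and no IPv6 literals).

```python
class VhostRegistry:
    """Maps host names to backend ports; the first owner to claim a name keeps it."""

    def __init__(self):
        self._backends = {}  # normalized host name -> (owner, port)

    def claim(self, hostname, owner, port):
        """Register a site for a guest, refusing names held by another owner."""
        name = hostname.lower().rstrip(".")
        current = self._backends.get(name)
        if current is not None and current[0] != owner:
            raise ValueError(f"{name} is already claimed by {current[0]}")
        self._backends[name] = (owner, port)

    def route(self, raw_request):
        """Return the backend port for a raw HTTP/1.1 request, or None."""
        head = raw_request.split(b"\r\n\r\n", 1)[0].decode("latin-1")
        for line in head.split("\r\n")[1:]:  # skip the request line
            key, _, value = line.partition(":")
            if key.strip().lower() == "host":
                host = value.strip().lower()
                # Drop an optional ":port" suffix (IPv6 literals not handled).
                if ":" in host:
                    host = host.rsplit(":", 1)[0]
                entry = self._backends.get(host.rstrip("."))
                return entry[1] if entry else None
        return None


if __name__ == "__main__":
    registry = VhostRegistry()
    registry.claim("blog1.com", "alice", 8001)
    registry.claim("blog2.com", "bob", 8002)
    print(registry.route(b"GET / HTTP/1.1\r\nHost: blog2.com\r\n\r\n"))
```

The conflict check in `claim` is the "keep multiple guests from claiming the same name" part; a real proxy would also need some way to verify that a guest is entitled to the name it announces.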

Better: shift the whole public layer out to a distributed caching service.

Tools such as IPFS offer another possible answer, where the RaPi boxen could be considered simple origin servers.

> only one program can listen on myip:80.

That isn't true anymore, at least on Linux, thanks to SO_REUSEPORT

Yeah, but that's not the point. If I have my friend who wants to host blog1.com, and I want to host blog2.com on the same machine, we can't do it without some coordination. We'd probably need to have an Nginx server using VHost to redirect to my or his server. We can't use SO_REUSEPORT because if the request goes to the wrong server, it won't know how to react.

With SRV, we wouldn't have this problem. With each of us having IPv6, we wouldn't have it either.
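To make the SO_REUSEPORT point concrete, here is a small Python sketch (Linux assumed; `listen_shared` and `two_listeners` are made-up helper names). It shows two listening sockets sharing one port. The kernel then spreads incoming connections across them without looking at the request, which is why it cannot send blog1.com traffic to one server and blog2.com traffic to another. On Linux the sockets must also belong to the same effective UID, so it does not help two different users share port 80 either.

```python
import socket


def listen_shared(port=0, host="127.0.0.1"):
    """Open a listening TCP socket with SO_REUSEPORT set before bind()."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.listen()
    return sock


def two_listeners():
    """Bind one socket to a kernel-chosen port, then a second one to the same port."""
    first = listen_shared()
    port = first.getsockname()[1]
    second = listen_shared(port)
    return first, second


if __name__ == "__main__":
    a, b = two_listeners()
    print("both listening on port", a.getsockname()[1])
    a.close()
    b.close()
```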

Arguably IPv6 also solves this problem :P

In Linux the thing the author wants is /proc/sys/net/ipv4/ip_unprivileged_port_start which defaults to 1024 but can be set to anything you like. Such as 0.

Edit: I didn't realize how new that was. Kernels 4.11+ only. I think some people were using this on custom patched kernels though because I've been seeing it around. Was committed in January.
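A quick way to check what a given box is set to, as a Python sketch. The helper names are made up; the fallback of 1024 for kernels that lack the sysctl file is an assumption based on the traditional boundary mentioned above.

```python
from pathlib import Path

SYSCTL_PATH = "/proc/sys/net/ipv4/ip_unprivileged_port_start"


def unprivileged_port_start(path=SYSCTL_PATH, default=1024):
    """Read the first port that unprivileged users may bind.

    Falls back to `default` when the file is missing or unreadable,
    e.g. on kernels older than 4.11 or on non-Linux systems.
    """
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return default


def needs_privilege(port, path=SYSCTL_PATH):
    """True if binding `port` would require root or CAP_NET_BIND_SERVICE."""
    return 0 < port < unprivileged_port_start(path)


if __name__ == "__main__":
    print("unprivileged ports start at", unprivileged_port_start())
    print("port 80 needs privilege:", needs_privilege(80))
```

Setting the value to 0 makes `needs_privilege` false for every port, which is the "such as 0" case: any user can then bind any port, with no per-user or per-address control.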


No, the things that would have obviated the author's specific example were (a finer-grained version of) ip_unprivileged_port_start and SRV records back in the 90s, when the customs actually developed.

That's the whole point about path dependence. We went down this massively more complex and expensive path because we didn't do a few teeny weeny little things to network interfaces and permissions back in the 90s.

Or just giving relevant users CAP_NET_BIND_SERVICE.

or do it to the binaries:

> setcap 'cap_net_bind_service=+ep' /path/to/executable

huge blog post for what is literally a one line fix

Unprivileged user accounts can't do that, so it changes nothing about the blog post.

Also only works for actual binaries, not scripts. Well, you can write a short exec wrapper I suppose.

Hm. Why doesn't it work for scripts? I thought the capabilities were stored in the filesystem?

An OS that allows shebang scripts to have setuid or capabilities ends up allowing security holes, as seen in traditional Unix variants; see http://www.faqs.org/faqs/unix-faq/faq/part4/section-7.html and https://www.in-ulm.de/~mascheck/various/shebang/#setuid

Therefore, Linux simply doesn't allow it.

Because the script is probably not what is actually opening the port. It is going to execute something else that will open the port.

sudo setcap 'cap_net_bind_service=+ep' /usr/bin/nodejs

then all users benefit from it.

How does that solve the problem? How do 100 users bind their locally-installed web server to port 80? How do you make it so that only my user can bind to only my IP?

You use network namespaces or whole solution like LXC.

Doesn't solve the problem, just shows he doesn't understand the problem.

A lot of good blog posts end up as one line fixes :-)

> Reproductive organs are what evolutionary biologists would call a highly conserved system, meaning they don't tend to change often.

No. Reproductive genes are well-known for what evolutionary biologists (such as myself) call positive selection, meaning they tend to change more often than you would expect.

See, for example, https://www.nature.com/nature/journal/v403/n6767/full/403304...

Well, the author was talking about organs, not genes. Generally speaking the organ arrangement described in the article (external testes, internal ovaries) is pretty well conserved among placental land mammals even if genes related to other aspects of reproduction are not.

The author is wrong in saying that the testes are not protected, though. They're quite well-protected, with a leg on each side. Simple proof of this is that it's clear when these are damaged due to the pain, but most men will only experience this pain a couple of times in their life.

The testes are even more protected in quadrupeds, far away from any frontal fighting or sparring. And if you're a quadruped that is being chased by something that can damage your testes from behind, you have far bigger problems than damage to your testes.

Would anyone mind posting the PDF for those of us who can't afford it but would like to study it?


> Step two: extend user and group permissions and ownership into the realm of network resources, allowing UID/GID read/write/bind (bind in place of execute) permission masks to be set for devices and IPs. By default a bind() call that does not specify an address (0.0.0.0 or ::0) will listen for packets or connections on all interfaces that the current user has the appropriate permission to bind and outgoing connections will default to the first interface the user owns.

I think that this can be achieved with network namespaces and some iptables magic, entirely in userspace (and without the whole burden of containers).

Anyway, IMO the point of containers is not network virtualization, but rather isolation of dependencies. It's much easier to just pack everything into a container than to invent proper non-root package manager (like Nix).

The point of directories is to provide proper isolation of dependencies. But that followed the same path.

Which leads to... It would be better to add non-root mounting to the author's list. I don't think there's even anything that needs to change, permissions are already there.

It's not just ports, resource contention in general is largely the pain of multi-user systems.

If you've ever managed a host with thousands of users, you'd remember. Sure many of these have work-arounds now, but I've seen users (or have personally) run a machine out of:

- /tmp space

- processes

- open file handles

- inodes

- shared memory segments

- ephemeral ports

- etc., etc.

Ever had a root shell on a box that's out of processes? Luckily modern shells have built-ins, so you can at least `ls` and stuff. But when you can't run, say, `lsof`, what do you do? That used to be a classic interview question.

And then there's the traditional resource contention folks think of like saturating links, i/o channels, disk space, cpu, memory, and so on.

But nowadays it's virtual machines all the way down. My favorite is to think of something like Puppet server (I like to pick on it) running in Docker. Count the abstractions in the chain:

- JRuby interpreter


- Process

- Container

- OS

- Virtual Machine

- Hypervisor process

- Hypervisor OS

- even UEFI to an extent

- Hardware

You could say things like the process and OS are redundant, and you could probably add more if you consider some of the subcomponents of some of these. But each one is there to segment and/or simplify and then we add all kinds of holes for each to directly access one of its parental chain.

It is an evolved system where we are doomed to "reinvent Unix badly". But I'll be curious to see where we end up in 10-15 years...

Somewhat related, but I always thought it would make sense for browsers to use DNS based service discovery to find the correct HTTP port, and then fall back to 80/443. That alone would make it easier for users in a multi-tenant environment to self host, IMO, and eliminate some of the need for SNI.

This is what SRV records were (largely) created for. There have been a couple of attempts at making this a reality:

https://tools.ietf.org/html/draft-andrews-http-srv-01 https://tools.ietf.org/html/draft-jennings-http-srv-05

Unfortunately, HTTP is just too widespread of a protocol - you would end up having to listen for legacy clients on 80/443 forever, making it a nonstarter.

The reality is that this is not that infeasible a change as everyone wants us to think that it is. It is simply a change to the DNS resolution logic for HTTP/HTTPS. The major browser vendors could make the change in relatively short order. The major hurdle is that it would increase the number of DNS queries to resolve a website.

XMPP is an example of a protocol that currently has this correct, in that it uses SRV records to control ports for a given host.
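For readers who have not dealt with SRV records on the client side, here is a simplified Python sketch of the selection step described in RFC 2782: take the lowest priority value, then choose among records of that priority at random in proportion to their weights. The two example.com records are invented for illustration, and the weighting is a simplification (the RFC gives zero-weight entries a small chance even when others have weight, which this does not).

```python
import random

# Hypothetical records in "name ttl SRV priority weight port target" form.
EXAMPLE_ANSWERS = """\
_http._tcp.example.com 300 SRV 10 5 8001 host-a.example.com
_http._tcp.example.com 300 SRV 20 5 8002 host-b.example.com
"""


def parse_srv(line):
    """Parse one textual SRV record into a dict of its fields."""
    name, ttl, rtype, priority, weight, port, target = line.split()
    if rtype != "SRV":
        raise ValueError(f"not an SRV record: {line!r}")
    return {
        "name": name,
        "ttl": int(ttl),
        "priority": int(priority),
        "weight": int(weight),
        "port": int(port),
        "target": target,
    }


def pick_target(records, rng=random):
    """Return (target, port) from the lowest-priority group, weighted at random."""
    best = min(r["priority"] for r in records)
    candidates = [r for r in records if r["priority"] == best]
    weights = [r["weight"] for r in candidates]
    if sum(weights) == 0:
        chosen = rng.choice(candidates)
    else:
        chosen = rng.choices(candidates, weights=weights, k=1)[0]
    return chosen["target"], chosen["port"]


if __name__ == "__main__":
    records = [parse_srv(line) for line in EXAMPLE_ANSWERS.splitlines()]
    print(pick_target(records))
```

The client then connects to the returned host and port instead of assuming 80 or 443, which is all the "default port" really stands in for.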

Why would it involve extra DNS queries? Why wouldn't the SRV record come back as part of the original response to the host-name resolution request?

There is no such thing as a "host-name resolution request". A DNS query specifies a domain name and a record type, and gets as the result a list of all records of that type under that name. Record types would be A (IPv4 address), AAAA (IPv6 address), MX (mail exchanger name), SRV (server name and port for a particular service), TXT (free-form text), and many others, most of them nowadays unused.
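To illustrate "a domain name and a record type", here is a minimal Python sketch that builds the wire form of a DNS question by hand. `encode_name`, `build_query` and the fixed query ID are illustrative choices; the type codes are the standard ones (A=1, MX=15, TXT=16, AAAA=28, SRV=33), and nothing is sent over the network.

```python
import struct

QTYPES = {"A": 1, "MX": 15, "TXT": 16, "AAAA": 28, "SRV": 33}


def encode_name(name):
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        if not 0 < len(raw) < 64:
            raise ValueError(f"bad label in {name!r}")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_query(name, rtype, query_id=0x1234):
    """Build a DNS query message asking one question: (name, rtype, class IN)."""
    # Header: ID, flags (only "recursion desired" set), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    question = encode_name(name) + struct.pack("!HH", QTYPES[rtype], 1)
    return header + question


if __name__ == "__main__":
    a_query = build_query("example.com", "A")
    srv_query = build_query("_http._tcp.example.com", "SRV")
    print(a_query.hex())
    print(srv_query.hex())
```

Two queries for the same name with different types differ only in the two type bytes of the question, so asking for A and SRV really is two separate questions unless the server volunteers extra records in its response.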

You can specify "ANY" as the type in the query and get all results. Try running "host -a ycombinator.com". It doesn't recursively resolve CNAMEs like an A query does, though. Also, at least Cloudflare refuses those requests, to reduce DNS reflection attacks.

"any" does not mean "all". People regularly make this mistake. One cannot do these sorts of tricks with ANY queries.

* http://jdebp.eu/Softwares/djbwares/qmail-patches.html#any-to...

ANY gives you all records for the given name in the cache of the name server that you are querying, which usually will be incomplete in the case of recursive resolvers.

The "It will increase the number of DNS queries" is another of the many attempts at a counterargument that is not really true if one analyses it, that has been coming up for two decades now.

* http://jdebp.eu./FGA/dns-srv-record-use-by-clients.html

In reality, the number of back-end DNS queries to perform an A or an AAAA lookup can vary enormously, and can number in the tens or hundreds of queries already. It's not generally true that the number of back-end lookups will increase, because the number of back-end lookups varies wildly by time (depending on what is cached from moment to moment), by domain name (depending on the amount of gluelessness), and by country. There is so much variation in there already that the variation caused by a two-step SRV lookup can quite often be lost in the noise.

Of course, SRV lookups do not even have to be two-step in the first place. A proxy or content DNS server is free to add the requisite A and AAAA resource record sets as additional section data in its response, meaning that the client obtains both pieces of information in a single transaction. And of course several real world DNS server softwares do exactly this.

* https://news.ycombinator.com/item?id=8850302

Whenever you see these objections, always remember that there are protocols where SRV resource records have been in use for coming up to twenty years. And yet one never hears of actual problems with those protocols that mirror the supposed projected problems of using SRV lookups for HTTP. A lot of what one hears about why this would not work is simply bad analysis and excuse making.

The worst part perhaps is that the code to do this for one of the WWW browsers was actually written, the only actual technical niggles were solved 17 years ago, and -- perhaps most tellingly of all -- whilst people are still telling us how this cannot be done all these years after the code was written, there are parts of the world that are quietly and happily doing it right now. The FreeBSD packaging system uses SRV resource records and HTTP/HTTPS, for example. It's also an example of how real world DNS servers can and do send all of the data in one transaction using the additional section. They've even made it in-bailiwick.

    JdeBP % dnsq srv _http._tcp.pkg.freebsd.org. ns1.isc-sns.net.
    33 _http._tcp.pkg.freebsd.org:
    510 bytes, 1+5+3+8 records, response, authoritative, noerror
    query: 33 _http._tcp.pkg.freebsd.org
    answer: _http._tcp.pkg.freebsd.org 300 SRV 50 10 80 pkg0.isc.freebsd.org
    answer: _http._tcp.pkg.freebsd.org 300 SRV 50 10 80 pkg0.nyi.freebsd.org
    answer: _http._tcp.pkg.freebsd.org 300 SRV 10 10 80 pkgmir.geo.freebsd.org
    answer: _http._tcp.pkg.freebsd.org 300 SRV 50 10 80 pkg0.bme.freebsd.org
    answer: _http._tcp.pkg.freebsd.org 300 SRV 50 10 80 pkg0.ydx.freebsd.org
    authority: freebsd.org 3600 NS ns1.isc-sns.net
    authority: freebsd.org 3600 NS ns3.isc-sns.info
    authority: freebsd.org 3600 NS ns2.isc-sns.com
    additional: pkg0.bme.freebsd.org 3600 A
    additional: pkg0.bme.freebsd.org 3600 AAAA 2001:41c8:112:8300:0:0:50:1
    additional: pkg0.isc.freebsd.org 3600 A
    additional: pkg0.isc.freebsd.org 3600 AAAA 2001:4f8:1:11:0:0:50:1
    additional: pkg0.nyi.freebsd.org 3600 A
    additional: pkg0.nyi.freebsd.org 3600 AAAA 2610:1c1:1:606c:0:0:50:1
    additional: pkg0.ydx.freebsd.org 3600 A
    additional: pkg0.ydx.freebsd.org 3600 AAAA 2a02:6b8:b010:1001:0:0:50:1
    JdeBP %
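
For anyone wondering what a client does with an answer like that, here is a rough sketch of the selection step, using the five SRV records from the output above. The fields are priority, weight, port, target. Lowest priority is tried first, and within one priority the choice is random but biased by weight. The helper name `connection_order` is made up, and treating weight 0 as 1 is a simplification of what RFC 2782 describes.

```python
import random
from typing import NamedTuple

class Srv(NamedTuple):
    priority: int
    weight: int
    port: int
    target: str

# The five SRV answers from the dnsq output above.
RECORDS = [
    Srv(50, 10, 80, "pkg0.isc.freebsd.org"),
    Srv(50, 10, 80, "pkg0.nyi.freebsd.org"),
    Srv(10, 10, 80, "pkgmir.geo.freebsd.org"),
    Srv(50, 10, 80, "pkg0.bme.freebsd.org"),
    Srv(50, 10, 80, "pkg0.ydx.freebsd.org"),
]

def connection_order(records, rng=random):
    """Return the records in the order a client should try them."""
    ordered = []
    for priority in sorted({r.priority for r in records}):
        group = [r for r in records if r.priority == priority]
        # Within one priority, draw without replacement, biased by weight.
        while group:
            weights = [r.weight or 1 for r in group]  # simplification: 0 counts as 1
            pick = rng.choices(group, weights=weights, k=1)[0]
            group.remove(pick)
            ordered.append(pick)
    return ordered

order = connection_order(RECORDS)
print(order[0].target, order[0].port)
```

With these records the first candidate is always pkgmir.geo.freebsd.org, since it is the only one at priority 10; the four priority-50 hosts follow in random order as fallbacks.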

Well, I don't think that anyone claims that it doesn't work, and I don't think some packaging system is a good comparison. I mean, email effectively has been using SRV records (even if they are called MX records in that case ...) for three decades, so obviously it works. But it commonly requires an additional round trip for (initial) resolution, which adds delay, which is kind of relevant for interactive use, not so much for package managers or SMTP.

Also, you as the site operator can't really do anything about how recursive resolvers handle additional section data (and there are millions of them deployed on crappy home routers with probably equally crappy DNS software), in contrast to out-of-zone glue, which is something completely under your control.

That being said, I don't think it makes sense to not standardize it for mandatory client-side use with HTTP, (a) because network latencies are not exactly increasing, even mobile networks are moving towards single-digit millisecond RTTs, and (b) browsers could just be backwards compatible, and issue SRV, A, and AAAA queries for the URI domain name at the same time, so you would only need one round trip to figure out the addresses of sites that don't want to use SRV, and you'd only get a negligible amount of traffic for one additional DNS request without any additional latency.
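
A rough sketch of point (b), assuming some resolver function is available: fire the SRV, A, and AAAA questions at the same time and only use the address records if no SRV record comes back. Everything here is hypothetical, including `resolve_for_http`, the `lookup(name, rtype)` callable, and the fake zone data; the Python standard library has no SRV lookup, so a real DNS library would have to be plugged in.

```python
from concurrent.futures import ThreadPoolExecutor

def resolve_for_http(host, lookup):
    """Issue SRV, A and AAAA lookups at once and prefer SRV when present.

    `lookup(name, rtype)` is a hypothetical resolver callable returning a list
    of records (empty if none exist); plug in a real DNS library there.
    """
    questions = [(f"_http._tcp.{host}", "SRV"), (host, "A"), (host, "AAAA")]
    with ThreadPoolExecutor(max_workers=len(questions)) as pool:
        futures = [pool.submit(lookup, name, rtype) for name, rtype in questions]
        srv, a, aaaa = (f.result() for f in futures)
    if srv:
        # Site opted in: SRV records here are (target, port) pairs.
        return [(target, port) for target, port in srv]
    # Site does not use SRV: fall back to the default port on its addresses.
    return [(addr, 80) for addr in aaaa + a]

# Made-up zone data for demonstration only (documentation address ranges).
FAKE_ZONE = {
    ("_http._tcp.example.org", "SRV"): [("app.example.org", 30338)],
    ("example.org", "A"): ["192.0.2.10"],
    ("example.net", "A"): ["192.0.2.20"],
    ("example.net", "AAAA"): ["2001:db8::20"],
}

def fake_lookup(name, rtype):
    return FAKE_ZONE.get((name, rtype), [])

print(resolve_for_http("example.org", fake_lookup))
print(resolve_for_http("example.net", fake_lookup))
```

The site without an SRV record costs one extra question but no extra round trip. The site with one still needs the target's address, which either arrives in the additional section or takes a second lookup.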

> But it commonly requires an additional round trip for (initial) resolution

No, it does not. I just explained that to you, with examples even.

There are actually quite a few protocols whose clients use SRV resource records.

* http://jdebp.eu./FGA/dns-srv-record-use-by-clients.html

As a workaround, couldn't you just have, say, nginx listen on 80/443 and either redirect or reverse proxy to the correct endpoints based on the SRV records?

That way, compliant browsers can connect directly while SRV-ignoring browsers would just get proxied (or redirected), and once set up there would be no special actions needed by the users, save for telling the proxy about the keys and certs.

The sooner we get started, the sooner we'll be able to actually take advantage of it.

It's true that you have to wait a long time, but that's no reason to avoid it. SNI wasn't widely supported initially, for example, but these days you can safely require it unless you have weird requirements.

There probably isn't enough need for it to get the major vendors to start adopting it, though.

Browsers won't lead the way. Fetching SRV records will increase network latency, and that's one of the main points of competition among browsers today.

Is it not possible to fetch them as part of the same request that looks up the IP address?

My DNS knowledge is a bit rusty. I remember there are some issues with requiring different types of record on the same request (most servers block it), and that timeout worked differently for different types of record.

I never tried resolving HTTP services anyway, so I may be completely wrong. I've had problems with TXT and MX.

Interesting. It would make sense that requests that don't get used in the real world might not work well.

In addition to DNS SRV records, there's also DNS URI records https://tools.ietf.org/html/rfc7553

I'm the original author.

I wrote that a while back. My opinion hasn't changed too much, though I have to say that article could stand a rewrite. Don't really have time right now.

The real crux of the article is less about privileged ports and Unix permissions than about path dependence and how it leads to complexity explosions in systems. Instead of building a fix for X, maybe we should first question whether X really has to be that way and if there exists some simpler path to achieving our goals that involves some amount of change but far less complexity.

Sometimes you can't do that, but sometimes you can. I think privileged ports are a case where we could have easily eliminated a lot of complexity and headaches by just eliminating an obsolete feature.

There were, and to some extent still are, operating systems without the notion of privileged ports (because they lacked the notion of privileged users) that had BSD-style sockets, clients, and servers. It's worth looking at how things worked and evolved on such operating systems, to see whether it really was easier in practice.

Please search this HN comment thread for "jail" and my own comment - I am curious about your thoughts ...

Jails are closer. They are perhaps a way of achieving some of this. The problem (unless I'm wrong) is that you need root to create one, so you are back to needing root for everything.

Yes, of course the base system's root is necessary to create the jail, but then the jail has its own root user as well as its own /etc/passwd (and /etc/everythingelse).

For many, many purposes (almost all ?) a FreeBSD jail is indistinguishable to the root user from a bare metal server.

> How did we end up with the nested complexity explosion of OS->VM->containers->... instead of the simplicity of multi-user operating systems?

> [...], it pushes about 5-10 megabits of traffic most of the time but the software it runs only needs about 10mb of RAM and less than a megabyte (!!!) of disk space. It also utilizes less than 1% of available CPU.

> The virtual machine it runs on, however, occupies about 8gb of disk and at least 768mb of RAM. All that space is to store an entire CentOS Linux base installation, [...]

The exo-kernel and uni-kernel people have the opposite idea: go all the way to virtualization, and eliminate the traditional operating system.

You can still have all the conveniences of the programming models you like, but eg file-systems and network protocols will be implemented in user level libraries.

I can easily see a world where history repeats itself here. VMs get thinner, hypervisors get fatter to make up the difference, all the other layers of abstraction get squished out of the system, and what we're left with is just another process model. It's a model built on different assumptions than the one we're used to, with stronger guarantees about isolation than we might be comfortable with at first, but one that has an appeal even for desktop computing.

Later, once everyone has a hypervisor in their pocket, we tell our children about how cloud infrastructures used to be so expensive that only big corporations could maintain them, and they came in server racks the size of refrigerators.

No, the idea is not to make hypervisors fatter. Just the opposite: put the abstractions into userspace.

Exokernels are, as I understand them, even lighter weight than VMs. Instead of user-level libraries implementing file systems and network stacks for emulated hardware, or dedicated hardware on a machine with multiple instances of those devices, the kernel's protection mechanisms are redesigned to work at a lower level: disk blocks as managed by inodes, raw packets as identified by filters, time slices as scheduled by user space.

Combined with a sandboxed-by-default capability security model, I think that's the best approach. You no longer need virtualization or users or containers- you just assign hardware and software resources to applications, following the principle of least power. You not only still have the same convenient programming models, but the same (or arguably better) monitoring and management tools.

You can think of it as running your favorite programming language bare metal.

Instead of syscalls the libraries do the actual work of talking to the hardware.

Basically what some of us were doing when programming 8 and 16 bit computers 30 - 40 years ago.

Yes, and the only thing the kernel (or hypervisor) would do is provide for security and isolation, but not provide any abstraction whatsoever. Not even cross-platform hardware compatibility: all of that is for the user-level libraries.

This is one of the reasons why the kubernetes network model[0] is kinda neat. Every service running in each pod can bind to port 80 without any conflicts because every pod gets a dedicated ip address.
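
You can see the same effect without Kubernetes: a port number is only unique per IP address. A small sketch, assuming Linux (where all of 127.0.0.0/8 is local), with two loopback addresses standing in for two pod IPs. The helper name `listen_on` is made up.

```python
import socket

def listen_on(address: str, port: int) -> socket.socket:
    """Bind a listening TCP socket to one specific local address."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((address, port))
    sock.listen()
    return sock

# Let the OS pick a free port on the first address, then reuse that same
# port number on a second address. Assumes Linux, where all of 127.0.0.0/8
# is local; on other systems 127.0.0.2 may need to be configured first.
first = listen_on("127.0.0.1", 0)
port = first.getsockname()[1]
second = listen_on("127.0.0.2", port)
print(first.getsockname(), second.getsockname())
```

Both listeners hold the same port number at once because each is bound to its own address, which is all the pod model needs.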

[0] https://kubernetes.io/docs/concepts/cluster-administration/n...

Right, I read this article as "IPv4 is causing climate change." If you have enough IP addresses, dedicated port conventions don't bother me at all.

Yeah, but even if each user has their own dedicated IP, they can't bind to a privileged port unless they're root. So it's still a problem, no?

No, not at all. This problem was solved decades ago via a myriad of different workarounds.

You can simply turn off privileged ports if you feel like it. Or do things like setuid, such as how Apache starts as root but switches to an unprivileged user immediately.
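
For reference, the "start as root, then drop" pattern looks roughly like this. It is a minimal sketch rather than what Apache actually does internally; the function name `bind_then_drop` and the account name in the usage note are made up.

```python
import os
import pwd
import socket

def bind_then_drop(address: str, port: int, run_as: str) -> socket.socket:
    """Bind a listening socket, then give up root if we have it."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((address, port))       # needs root only when port < 1024
    sock.listen()
    if os.geteuid() == 0:
        account = pwd.getpwnam(run_as)
        os.setgroups([])             # drop supplementary groups first
        os.setgid(account.pw_gid)    # group before user, or setgid would fail
        os.setuid(account.pw_uid)    # irreversible: no way back to root
    return sock
```

A hypothetical deployment would call something like `bind_then_drop("0.0.0.0", 80, "www-data")` from a process started as root. The socket stays usable after the switch, since the privilege check happens at bind time only.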

It's not ideal, but to say the privileged port thing is an issue is bizarre to me. It's the least interesting item the author brings up in the article, and certainly was not the primary driver behind multi-tenancy virtualization.

Better yet is using reverse proxies - service providers have been doing this for ages. A single shared HA cluster of haproxy, that maintains records for each individual application/tenant that lives wherever it likes. This is the model I prefer, since it allows me (as a sysadmin) to protect the developers from themselves by being able to easily filter at layer 7 in front of their application as needed. Also lets you direct traffic easily, and is generally the model used by any container orchestration service.

Very easy to expose all that via various APIs and tooling so users can self-service.

It was a fun read, and there is a clever thought there which I'll get to in a minute. But first ...

The author gets a lot wrong. That isn't uncommon, because most software people see a "server" as a set of numbers: "A" bytes of memory, "B" bytes of disk space, and "C" hertz of clock speed.

Computers aren't numbers of course, they are systems, and the systems are a collection of moving parts that are connected by linkages which constrain and amplify the various parts. In system design there is a concept of 'balance' which discusses how the linkages enable and constrain such that an understanding of the amount of "work" that can be done by the system is understood.

There is a great example in the article comparing a Raspberry Pi to a Sparcstation 10. Most of the software the author used "back in the day" is still around in one form or another, and a RasPi is cheap, so it's easy to create a modest recreation of the environment from that time and to build simulated users who can access the "server". Just two or three users and the Pi will "choke". Understanding how a system with so many "bigger" numbers is less capable of doing the same "work" as one with much smaller numbers is a useful exercise to run if you are into systems analysis. The system the Pi was designed to run, and the one for which it is fairly well balanced is a smartphone. Something the SPARC 10 would truly suck at.

The clever idea however is making networking resources just another resource like disks. Interestingly enough, with IPv6 that is much easier than it was before (to the point about how things were done before in networking, vs now). It is straightforward to create a model of network interfaces that is equivalent to serial lines. Have the kernel be given an IPv6 subnet with a block of 32K addresses and let your user program open /dev/net/<1> through <32766>; because it's IP, each "net" device comes with its own set of port numbers etc. IPC is just networking, and a cleverly coded kernel running on a machine with a 64 bit virtual address space would do direct data placement. No need to do even any page table mashing.
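
The addressing half of that idea is easy to sketch. Assuming a /113 prefix (15 host bits, so 32768 addresses) carved out of the IPv6 documentation range, each hypothetical /dev/net/<n> device maps to its own address. The prefix and the function name `address_for_device` are illustrative only.

```python
import ipaddress

# 2001:db8::/32 is the documentation prefix; a /113 leaves 15 host bits,
# i.e. 2**15 = 32768 addresses, matching the "32K" block described above.
SUBNET = ipaddress.IPv6Network("2001:db8::/113")

def address_for_device(index: int) -> ipaddress.IPv6Address:
    """Map a hypothetical /dev/net/<index> device to its own IPv6 address."""
    if not 1 <= index <= 32766:
        raise ValueError("device index out of range")
    return SUBNET[index]

print(SUBNET.num_addresses)
print(address_for_device(1), address_for_device(32766))
```

Each of those addresses has the full 16-bit port space to itself, so no two "net" devices ever compete for port 80.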

> Just two or three users and the Pi will "choke".

Do you really know this? What's the limiting factor?

I've never played with a Pi and I don't really know, but I don't see offhand why it would be so bad. Interrupt-handling and context-switching times should be much better. There's plenty of main memory. The file system should be faster, running on flash. I don't see where the bottleneck would be.

> Do you really know this? What's the limiting factor?

Yes, I've got a bunch of RasPis (originals, v2s, and v3s).

The big issue is I/O bandwidth. All of the meaningful I/O in a server setup goes through one USB 2.1 hub. The second issue is that all of the memory accesses have to traverse the (useless) GPU to get to the CPU.

As designed for a phone, having the CPU be a ride along to the GPU makes perfect sense. The "thing" that is important in a smartphone is the screen, its updates and I/O to and from it. But as a general purpose server that isn't balanced.

We put RasPi servers on the Internet and local network at Blekko for a variety of tasks, and I've used them at home as well for servers. Very successful having them be single purpose servers (a DNS cache, a video streamer, a data collector) and very unsuccessful when they are multitasking multiple 'kinds' of network clients. Tried various configs with EXIM and Dovecot as a mail server, it processes very slowly. (an SSH session on the machine was 10s of seconds between the key press and the appearance of the character on my end, took me back to the worst of times on a timeshared system :-).

But seriously, everyone should try this. Set up a minecraft server, set up a web server, set up whatever you want and watch it. It is easy to set up multiple "stressor" clients with Python. Trace the flow of data from client to network to app to disk to app to network to client etc.
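
A bare-bones version of such a stressor might look like the following. It is only a sketch: the function names, client counts, and the plain HTTP/1.0 GET are my own choices, and it measures nothing but wall-clock time per request.

```python
import socket
import threading
import time

def one_request(host: str, port: int, timeout: float = 5.0) -> float:
    """Send one minimal HTTP/1.0 GET and return the elapsed seconds."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(b"GET / HTTP/1.0\r\nHost: " + host.encode("ascii") + b"\r\n\r\n")
        while conn.recv(65536):   # drain until the server closes the connection
            pass
    return time.perf_counter() - start

def stress(host: str, port: int, clients: int = 8, requests_each: int = 20):
    """Run several client threads at once and collect per-request latencies."""
    latencies, lock = [], threading.Lock()

    def client():
        for _ in range(requests_each):
            try:
                elapsed = one_request(host, port)
            except OSError:
                continue          # failed requests are simply not counted here
            with lock:
                latencies.append(elapsed)

    threads = [threading.Thread(target=client) for _ in range(clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(latencies)
```

Point `stress()` at the board from another machine and compare the median and the worst latency as you raise `clients`; the interesting part is where the tail starts to blow up.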

Then do the same thing on a machine with the same numbers and twice the I/O bandwidth (Odroid has a couple of boards with similar specs but better I/O bandwidth, as do a bunch of boards in the 96boards.org spec).

It will be time well spent if you're ever asked to get the most out of a system.

The original Raspberry Pi is quite slow.

But, the latest models are quite fast. I've run a variety of microbenchmarks and a RPi3B has twice the memory bandwidth of a Sun V240 (which is many times the machine that a SS10 was.) It has roughly the same single threaded CPU/FPU performance and twice as many cores as the V240. Storage wise the microsd is faster than the SCSI drives you'd find in an SS10. The SS10 ethernet was 10 Mbps. You could add an SBUS adapter to get up to 100 Mbps, but that's it.

Realistically, you would be able to support 10x whatever an SS10 would do. But if you hang your storage off that USB port, then you're likely to be pretty unhappy. But still would be more of a machine than an SS10.

I would love to run the same benchmarks here, have you got a github repo or a write up somewhere that I could read?

"Trace the flow of data from client to network to app to disk to app to network to client etc."

How would one go about doing that?

In the ideal world? With a copy of the full schematic, a datasheet for the SoC, and source code to the kernel and drivers. It's harder with RasPi because various bits are obscured but many of the drivers are open enough to figure much of it out.

> All of the meaningful I/O in a server setup goes through one USB 2.1 hub.

Even Ethernet and 802.11 wifi packets?

RPI's Ethernet adapter is internally connected through USB (you may also see another hub connected to the motherboard hub here).

  pi@raspberrypi:~ $ lsusb
  Bus 001 Device 005: ID 05e3:0608 Genesys Logic, Inc. USB-2.0 4-Port HUB
  Bus 001 Device 004: ID 05e3:0608 Genesys Logic, Inc. USB-2.0 4-Port HUB
  Bus 001 Device 003: ID 0424:ec00 Standard Microsystems Corp. SMSC9512/9514 Fast Ethernet Adapter
  Bus 001 Device 002: ID 0424:9512 Standard Microsystems Corp. LAN9500 Ethernet 10/100 Adapter / SMSC9512/9514 Hub
  Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub

I see. Does anyone know what the rationale for that design decision was? I'm comparing to, for example, the Dragino, which is in the same general price range but does not have this, um, "feature".

Possibly because there was nowhere else to hang the port off with that Broadcom SoC. See also the myriad SoCs and devices that use a USB->SATA converter where I really just want SATA (e.g. a little known, obscure device called the Playstation 3).

> Possibly because there was nowhere else to hang the port off with that Broadcom SoC.

Ah. Dragino uses an Atheros SoC. Not the first time I've seen a Broadcom chip have an...interesting feature.

Yes, RPi uses USB-Eth adapter on board.

Probably the only limiting factor in the hardware is that the R-Pi does not have any meaningful high-speed low-latency IO.

But on the software side there is one huge problem: today's Linux kernel and userspace is incredibly bloated in comparison to early 90's SunOS. On the other hand, said SunOS probably did not support many things that everybody expects today, like scalability to large-ish SMP systems (ie. >2 CPUs), shared libraries, loadable kernel modules, threads...

It would be interesting if you could port NetBSD 2.x to the Pi. That would be a close equivalent.

Interestingly, the SparcStation10 would run Solaris 9 and premier support for the SS10 only ended in June 2016, even though the product EOLed in 1994 and had end of support in 1999.

SunOS definitely had shared libraries -- I think those go back at least to 4.3BSD.

Anyway, how much I/O do you need to support people running terminal apps over Ethernet?

What I meant by the IO is that the aforementioned sparcstation had an ethernet card connected to SBus which was capable of directly producing interrupts and doing DMA, and all the modems were connected through some multiport serial card/concentrator which tried very hard to offload stuff from the CPU. That is a completely different situation than the R-Pi with everything hanging off USB with its "novel" approach to interrupts.

SunOS/Solaris has been able to run on hosts with lots of CPUs for much longer than Linux.

But then, it had to, because SPARC CPUs have been no match to X86s for almost as long.

It was able to, but it was not especially efficient at doing so, to some extent because the hardware platforms themselves were not scalable and only appeared to be so because of brute force (even the e10k is essentially a shared bus SMP machine, sun4d-style).

In fact, NUMA support in non-specialized systems (one would want to say "non-IRIX") is a surprisingly recent development. In Linux it is one of the things that happened in the 2.4/2.6 timeframe, which is the same timeframe when mainstream distributions started to be too bloated for a lot of old or underpowered hardware.

Besides what others commented.

It definitely supported M:1 and M:N threading models while Linux was still designing NPTL and LinuxThreads models.

I'm not saying that SunOS was somehow slow to implement these things, but that the SunOS that ran on the aforementioned SS10 shell server probably did not have them. That seems about right given the fact that the first Solaris with user-level LWPs was 5.2 (Dec 93) and the pthread-compatibility layer was introduced in 5.5 (Nov 95). And it seems quite plausible that they ran SunOS 4.x and not Solaris.

Yeah that bit doesn't make any sense. If anything the pi would support more users with similar workloads to those being run on the sparcstation in the same fashion (without containers).

I just use iptables to forward all traffic on port 80 to port 8080. You can just replace your /etc/rc.local with this:

    iptables -t nat -A PREROUTING -p tcp --dport 80 -j REDIRECT --to-ports 8080
    iptables -t nat -I OUTPUT -p tcp -d --dport 80 -j REDIRECT --to-ports 8080
    exit 0

Simple enough to add to a configuration management workflow. I used to maintain a configurable reverse proxy, until I decided that was way too much work and to only put one web facing app per server.

edit: rc.local may not be the best place to put these rules, check this link:


Unprivileged users can't use iptables, so that doesn't change anything about the blog post. I'd suggest that if it's so easy, but still doesn't get done, that still also leaves the blog post intact. In a way it doesn't help when there's a thousand ways to do something, so none of them get done.

This seems to be a common misunderstanding of the blog post. Yes, there's a ton of ways for root to create privileged ports, and a ton of ways to delegate it in various ways. But all of them require actions by root, and sysadmining, and aren't as secure as a system designed to work this way from the beginning would be, so nobody uses them, so in terms of addressing his discussion points, they might as well not exist.

But surely, creating the user requires sysadmin privs, and as part of setting up the user the appropriate work could be done to allow whatever hole-punching is required. The reality though is that if Mary, Joe and Tim all want to listen on port 80, there's going to be a mess - hence using a common server (on a privileged port!) to demux incoming requests.

I don't understand the use case here. Are people really expecting to be able to configure stock Linux distribution images purely from unprivileged user accounts?

Because this iptables hack can be built right into a custom Linux distribution.

That's not enough, because this would allow only one unprivileged user to bind to the HTTP port (in this case - 8080). Another user trying to bind to the same port would get an error. So for this scenario the author is talking about a custom solution with a reverse proxy would probably work best - getting all requests directed at 80 and distributing them to as many servers as necessary.

My solution was to only put one HTTP server on a machine. I agree that a reverse proxy is the solution for those really wanting HTTP multi-tenancy on individual machines. But is this a really in-demand use case?

Even if port 80 was unprivileged, wouldn't you still run into issues if multiple users attempted to bind to the same port?
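
Yes, and that part has nothing to do with privilege. A quick sketch with an unprivileged port picked by the OS shows the second bind on the same address and port failing regardless of who asks. The helper name `try_listen` is made up.

```python
import errno
import socket

def try_listen(port: int):
    """Return a listening socket on 127.0.0.1:port, or the errno on failure."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", port))
        sock.listen()
        return sock
    except OSError as exc:
        sock.close()
        return exc.errno

first = try_listen(0)                       # port 0: let the OS pick a free one
port = first.getsockname()[1]
second = try_listen(port)                   # a second tenant wants the same port
print(second == errno.EADDRINUSE)
```

So dropping the privileged-port rule only changes who may bind; it still takes separate addresses, separate ports, or a shared front end to host several tenants.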

Really, what would the problem be with just deploying an nginx instance that listens on 80 and routes by domain?

Well, someone has to administrate the nginx routing config, or there needs to be a mechanism for each webapp sitting behind the nginx to register itself as a forwarding target.

What I want to know is why can't we get all the major web browsers to look for (and honor) SRV records, so that www.mydomain.com can transparently have the browser connect to port 30338 (or whatever port my app is listening on). Then we dispense with the need for proxies altogether.

All major protocols should, IMO.

No problem at all, given that the underlying operating system would already require that somebody got privileged access.

The point of the article as I understand it, isn't so much about the possibility of shared web hosting - it's obviously possible. It's more about how early design decisions shape our paradigms and the way we understand computing and the necessary resources.

Yeah, you'd have to do that, and then have permissions around routing, like we do with the filesystem today.

I really like this idea, it's really elegant.

Isn't this exactly what shared hosting is? I bet companies like dreamhost regularly put 50 clients on a box. And Webfaction[0] will even give you pretty open shell access to a shared environment.

[0] https://www.webfaction.com/

>> I bet companies like dreamhost regularly put 50 clients on a box.

Hah, hundreds, if not thousands of accounts can be on a single shared server. 'Unlimited hosting for $5/month' requires you to pack em in.

Indeed this is exactly what webfaction does!

They get around the port issue by using a nginx frontend that uses the Host to route each request to the correct apache instance each user is running out of their home directory. All these apache instances are bound to higher unprivileged ports.
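
The routing decision in a front end like that boils down to a table lookup on the Host header. A toy sketch, with made-up hostnames and backend ports; a real nginx setup expresses the same thing in its config rather than in code.

```python
# Hypothetical routing table: Host header -> (backend address, backend port).
BACKENDS = {
    "mary.example.org": ("127.0.0.1", 8001),
    "joe.example.org": ("127.0.0.1", 8002),
}

def backend_for(raw_request: bytes):
    """Pick the backend for a raw HTTP/1.x request, or None if unknown."""
    head = raw_request.split(b"\r\n\r\n", 1)[0].decode("latin-1")
    for line in head.split("\r\n")[1:]:          # skip the request line
        name, _, value = line.partition(":")
        if name.strip().lower() == "host":
            host = value.strip().lower()
            # Drop an explicit ":port" suffix such as "joe.example.org:80".
            if ":" in host and not host.startswith("["):
                host = host.rsplit(":", 1)[0]
            return BACKENDS.get(host)
    return None

request = b"GET / HTTP/1.1\r\nHost: Joe.Example.org:80\r\n\r\n"
print(backend_for(request))
```

Someone still has to own that table, which is the administrative cost people keep pointing out in this thread.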

Imagine a world where you don't need a router! A world where the browser would look up a http/s service record for the host and connect directly to the unprivileged port.

There are plenty of trade-offs but network resource ownership?

For a datacenter isn't it more efficient to pack workloads across dense machines than to over-provision idle ones?

While a factor, maybe, I don't think privileged ports are causing excess CO2 production.

Although I've been interested in low-power edge networking with uni-kernels, would be an interesting project to work on (thinking micro-dc's with homogenous ARM-based boards at the edge and having a dynamic system to boot up uni-kernels on demand closest to the requesting user... a system that mostly stays off unless it's needed).

The biggest change which would have made this much easier in my mind is adding port numbers to DNS. Instead of having HTTP at port 80, have the port number for HTTP returned as part of the DNS query. You can allocate a block of ports to each tenant on a host and let them use them as they see fit, you don't need to use the low ports for anything.


SRV solves this problem nicely, unfortunately SRV support for HTTP 2.0 was rejected.

I'm pretty sure most or all of this port/sudo nonsense was fixed with Plan 9.

Well, ports are just files and have an owner, group and permissions. As simple as that.

> alter package managers to allow installation of packages into a subtree of the user's home directory if the user is not root, etc.

Guix and Nix. I haven't used Nix, but Guix even allows ad-hoc containers running only specific programs and their dependencies:


That doesn't help with the privileged port problem, but per-user services can be dealt with.

This work was already done in other "Unix" environments like Solaris and BSD (zones and jails)

My personal favorite is SmartOS (based on illumos). It runs zones on bare metal and even has support for running Linux containers (Docker)! It does this by wrapping the Linux syscall APIs and translating them. Magic stuff!

Launch a Linux native container and run 'ps' and you will only see processes that you own!

reference: https://wiki.smartos.org/display/DOC/Home

I came here to say just that ...

A FreeBSD jail does not emulate or create a virtual machine - it's just a fancy chroot mechanism that produces only the unix processes that actually get run inside the jail.

A jailed httpd does not take up any more resources than the exact same httpd run on the base system.

This makes it extremely efficient and, in fact, allows you to create an even richer multi-user platform than the original one the op has nostalgia for: a multi user unix system where everyone gets to be root.

I've long thought that a combination of NixOS and illumos would be incredibly useful. Software would be installed into the nix store in the global zone, and mapped into dedicated zones for each user. IP and port virtualization via Crossbow.

User zones would be extremely light, with your typical database-driven web app taking no space in the zone, and very little space in the nix store.

The "privileged port" model was basically dead at the time of the Morris worm 30 years ago. It should have been sorted out in that time but there was never the momentum.

Everything in UNIX is a file, apart from the things that aren't. If ports were files, you'd be able to chown them or put them in a group for delegation purposes.

    Container solutions like Docker get us part of the way there. In a
    sense you could argue that containerization is precisely the
    multi-tenancy solution I'm heading toward, except that it borrows
    heavily from the legacy path of virtualization by treating system
    images like giant statically linked binaries.

Nailed it. This is why Docker does not excite me at all, and why I think there's room for other container systems to improve upon the Docker model by solving this problem. I made my initial attempt a while ago by adding basic container support to a package manager that allows users to use a virtualenv-like tool to create containers to hack in:


Docker is like an extended satire of Unix multi-tenancy in code form.

"Damn it... another f'ing package broke. Fuck it. We should just tar up whole Linux distributions and push them out as updates."

"Hey... that's a great idea!"


"Tar up whole distributions and..."

"No! No! I was only kidding! Wait! where are you going...?"

"Container solutions of the kind that are like docker" vs "Container solutions in general of which docker is an example". If he is using docker to write-off all container solutions, then that's a mistake, since it sounds like using the underlying container solution (on which docker then adds the ability to manage system images) is exactly what he wants.

Yeah, perhaps namespaces will satisfy them, but from what I've seen network namespaces leave something to be desired. I found that it wasn't too hard to roll my own container implementation and hook it up to a tool that was already providing software environments to unprivileged users, so I think there's plenty of room for other people to take the primitives that Linux provides and run with them.

His proposed solution actually looks a lot like one early flavor of LXC implementations, where instead of unzipping a big tarball with your own environment, you'd just bind-mount system folders into the container. I suspect LXC devs are reading this post and thinking "we already have this!"

The issue with apt install has more to do with a poor packaging mechanism on Debian's part rather than networking. Windows installers for once are actually superior in this regard since they generally provide an option to install for either the system or only the current user.

Also, to provide multi-tenant hosting, you could just create some sort of root level reverse proxy (possibly Nginx or Apache) with some sort of registration system for domains. You just use the Host header to multiplex the system's port 80/443. For localhost stuff, I think you could set up something with DNS to be http://{username}.localhost. It would be kind of an interesting idea to do as a happy medium between serverless and separate VMs.
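To make the Host-header idea concrete, here is a minimal sketch of the routing decision such a root-level proxy would make. The registry contents, hostnames and function names are all made up for illustration; a real setup would more likely be an Nginx or Apache config than hand-written code.

```python
from typing import Optional, Tuple

# Hypothetical registration table: hostname -> (backend address, unprivileged port).
# Each tenant would register a domain and run their own server on a high port.
REGISTRY = {
    "alice.example.com": ("127.0.0.1", 8001),
    "bob.example.com": ("127.0.0.1", 8002),
}


def parse_host(request_head: bytes) -> Optional[str]:
    """Return the lowercased Host header value without its port, or None."""
    lines = request_head.split(b"\r\n")
    for line in lines[1:]:  # skip the request line
        if not line:
            break  # a blank line ends the header section
        name, sep, value = line.partition(b":")
        if sep and name.strip().lower() == b"host":
            host = value.strip().decode("ascii", "replace").lower()
            if host.startswith("["):  # bracketed IPv6 literal, e.g. [::1]:80
                return host[: host.find("]") + 1]
            return host.split(":", 1)[0]  # drop an optional :port suffix
    return None


def pick_backend(request_head: bytes, registry=REGISTRY) -> Optional[Tuple[str, int]]:
    """Map a raw HTTP request head to the tenant backend that should serve it."""
    host = parse_host(request_head)
    return registry.get(host) if host is not None else None
```

This only covers plain HTTP on port 80. For 443 the Host header is encrypted, so the proxy would either have to terminate TLS itself or route on the SNI name instead.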

Apple had the right idea with .app bundles. There should be no such thing as "installation." Android and iOS kind of get it right too.

Problem with Apple's .apps is that they still manage to mess up your ~/Library. You can often find gigabytes of data left in there on your average Mac.

Easy solution: multiple IP addresses (one per user), and setuid wrapper to open listening socket on port 80 of the user's IP address and pass it to the web server after dropping privileges. No containers, VMs, or redesign of Unix needed.
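A rough sketch of that wrapper, assuming it is started as root: bind the socket first, then drop privileges, then exec the user's server with the socket still open. Linux ignores the setuid bit on interpreted scripts, so a real setuid wrapper would be a small compiled program; this only shows the order of operations. The function names and the `WRAPPER_LISTEN_FD` environment variable are invented for the example.

```python
import os
import socket


def open_listener(ip: str, port: int) -> socket.socket:
    """Bind and listen on (ip, port); needs privilege when port < 1024."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((ip, port))
    sock.listen(128)
    sock.set_inheritable(True)  # keep the fd open across exec()
    return sock


def drop_privileges(uid: int, gid: int) -> None:
    """Permanently become the target user. The order of the calls matters."""
    os.setgroups([])  # clear supplementary groups while still root
    os.setgid(gid)    # group first; after setuid we could no longer change it
    os.setuid(uid)


def launch(ip: str, port: int, uid: int, gid: int, argv: list) -> None:
    """Hypothetical entry point: bind as root, then exec the server as the user."""
    sock = open_listener(ip, port)
    drop_privileges(uid, gid)
    # Assumed convention: the server reads the listening fd number from this variable.
    os.environ["WRAPPER_LISTEN_FD"] = str(sock.fileno())
    os.execvp(argv[0], argv)
```

The server on the other side would have to know to adopt an inherited socket instead of calling `bind()` itself, which is the part that needs cooperation from the web server.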

Or put them all on different ports and front it with a proxy. Run only one MySQL instance.

Of course, containers and VMs aren't meant to solve this problem anyway. Containers are about deployment, and VMs are about virtualization/migration. Your customers would do well to use the former (for ease of deployment), and you the latter (for ease of maintenance).

Privileged ports really have nothing to do with this problem.

Great article... But, forgetting the (apparently GBs) of size, isn't this really what docker and suchlike are trying to achieve? (maybe I'm missing something). Especially when it comes to only storing a single instance of the base image if all 50+ users are sharing the same image? So it will only store the differences in each container. It sorts out your network issues, user can be root in the container and bind to whatever port they like and IP addresses are assigned to each container... What other overheads are there? (this is a rhetorical question - you seem much more knowledgeable than myself) :)

Just a quick apology - that was meant to read "this isn't a rhetorical question".. I was truly asking :D

Why not just run a single web server and set it up to serve public_html from people's home folders?

The difference to that multiuser system is that we want people to manage their own infrastructure. That's the real reason for the bloat - no matter how much that VM server costs, I bet you it costs less than paying people to administer a shared infrastructure. It has nothing to do with ports.

>Why not just run a single web server and set it up to serve public_html from people's home folders?

Doesn't work with dynamic content beyond CGI, which is too slow and usually not supported by modern web frameworks.

Could you elaborate on why it doesn't work with dynamic content beyond CGI? Is this some kind of insurmountable limit, or is it just the web frameworks don't care to support it?

It's not an insurmountable limit; it's not the web framework's fault either. It's that there's not a way for an individual user to indicate to a shared system-wide webserver, "Here is the FastCGI (or whatever) socket that you should connect to, to generate dynamic content for my home directory." But certainly a way to express that could be created.
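One way such a convention could look, purely as a thought experiment: the shared server maps `/~user/...` URLs to a Unix socket at a fixed path inside that user's home directory. Nothing here is an existing standard; the socket path, the URL scheme and the function name are all assumptions.

```python
import re
from pathlib import PurePosixPath
from typing import Optional, Tuple

# Assumed convention (not an existing standard): each user exposes their app
# on a Unix socket at this path relative to their home directory.
SOCKET_RELPATH = ".web/app.sock"

# Conservative username pattern so "/~../etc" style paths never match.
USER_URL = re.compile(r"/~([a-z_][a-z0-9_-]{0,31})(/.*)?")


def user_socket_for(url_path: str, home_root: str = "/home") -> Optional[Tuple[str, str]]:
    """Return (socket path, remaining URL path) for /~user/ URLs, else None."""
    m = USER_URL.fullmatch(url_path)
    if m is None:
        return None
    user, rest = m.group(1), m.group(2) or "/"
    sock = PurePosixPath(home_root) / user / SOCKET_RELPATH
    return str(sock), rest
```

The shared server would then speak FastCGI (or whatever) to that socket, and the process listening on it would run as the user, not as the webserver.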

Haven't home folders worked fine with PHP for ages and ages? They even used to have safemode to work around permissions issues.

PHP is either CGI or a specially-supported special-case in the shared webserver.

Service location in the TCP stack is a bad idea. Why port 80 _every_ time for http? Put that into DNS SRV! Boom! IPv4 crisis solved (for now)

I'm happy to know I'm not the only one who wants the death of privileged ports. When I proposed this for NixOS I wasn't exactly well received: https://github.com/NixOS/nixpkgs/issues/11908#issuecomment-2...

"systemd will handle that"

No, no, no. Handling with complex logic what can be achieved instead with simple design is how Windows got its suckage. Sigh.

I was a little bit annoyed (enough to comment) by the use of the wrong units. 'mb' is millibit, perhaps millibyte, but certainly not megabyte (MB or MiB).


> it pushes about 5-10 megabits of traffic most of the time

Bits are a unit of information, not flow. Probably the author meant Mib/s.

I agree about the milli prefix (I hate its misuse guts, too), but netadmins write the unit designation as either "Mb/s" or "Mbit", so among them it's kind of jargon to talk about "megabits" when one means "megabits per second". Jargon has this weird property of often being incorrect terminology.

But while I would accept that from a network engineer, I don't know what background the author has.

We as programmers have gotten tired of saying "per second", I guess.

Well, yeh... I would normally like this kind of correction, but I don't get annoyed by them - otherwise I'd be annoyed after everything I read :)

This annoyed me too. The author does not seem to be aware that case is highly important when dealing with SI/binary prefixes as well as unit abbreviations. If you don't even know (or care) that "b" is a bit and "B" is a byte, and you use them incorrectly, how am I supposed to trust your technical knowledge about the rest of the stuff you're talking about?

The author is the creator of ZeroTier. He is likely aware and didn't care because it's extremely pedantic and most people know what he means in informal settings, which this is. I expect he'd use the correct case in an RFC or something.

Getting abbreviations for bits/bytes correct is not "extremely pedantic", it's about communicating correctly. In a network context, which this is, "b" means bits, but he was using it to refer to bytes. Also I've never heard of ZeroTier, but even if I had, I probably would not have made the connection that he was the author of it, so being correct about these things is important for establishing credibility with new audiences.

IIRC, most if not all uses of 'mb' or 'gb' were about disk space or RAM, so not really a network context - this is about bloat, ports are just the whipping boy.

And I'm not against pedantry in the right context, but this is just a casual, relatively nontechnical rant. Pedantry is really not needed.

The original post seems a bit unix-centric. Much of what he is complaining about comes from the nature of TCP/IP networking; Unix can only be blamed to the extent that it doesn't abstract things like ports away.

The general thrust of his post is to fix Unix so that its (ohh so well respected!) permission model can extend across into the cloud. This will run into friction when more than one OS is involved.

That said: anything that unifies all the different kinds of virtualization as the OP wants would need to be an OS. And the only plausible candidate for an OS lingua-franca in this world is Unix.

So long live unix-centrism?

Since we're here anyway: if we would just change web browsers to connect to a default port number obtained via DNS instead of always using 80/443 by default, then we could offer web servers through NAT gateways (by hosting up to 65535 web servers on a single IPv4 address). We'd need some way to tell the NAT we want to expose a port (some variant of UPnP or some less-dumb protocol), but other than that, it would be easy. And then we'd have effectively 65535x as many IPv4 addresses, which would be enough for everyone, permanently, and IPv6 wouldn't be needed.

What? This isn't necessary at all. HTTP includes the requested hostname in the request so you can have as many DNS names as you want point to a single IP. No NAT or other magic necessary. The entire HTTP Internet could be hosted on a single IP.

True but if the sites were served by different processes then there would have to be a trusted reverse proxy process to dispatch.

You just described how reverse proxies work, just by using ports instead of the host-header to choose the correct backend.

I'd point out that currently, access to port 80 for a domain is access to a certificate for that domain.

This means that, stupid or not, HTTPS essentially is as secure as privileged ports are.

Slightly more secure, since you can have multiple A records on a domain and your domain verification should really check them all.

It's more fair to say HTTPS is basically as secure as DNS. If someone hijacks your DNS for 5 minutes they can have a TLS cert from Let's Encrypt for your domain for 90 days.

> (USB is slowly replacing the lighter plug, but most cars still have them.)

And yet, that's just trading for different path dependence annoyances. 5V, 500mA limit (or proprietary higher-voltage hackland), and that feeling of connecting your valuable phone to a low-bid switcher? Airports, cars, home outlets - I'll choose the appropriate third party device every time! I just wish that car power socket wasn't so dildonic so that more outlets would come already built in.

Ultimately the issue is that IPv4 addresses are expensive. Allowing multiple users to bind to port 80 on different IP addresses assumes you can have multiple IP addresses on that machine. Unless they are private IPs, that's going to be the main issue.

Any practical solution would need a system wide nginx to proxy to all of the tenants, who would run their apache/nginx/node etc. on a high numbered port.

> How did we end up with the nested complexity explosion of OS->VM->containers

If someone is putting containers into VMs they've eliminated the performance benefits of containers and added an additional layer of complexity for no reason: i.e. they don't know what they're doing and wanted to run containers on their existing VM-only cloud platform.

Well, FreeBSD's mac_portacl(4) has been around since 5.1-R (June 2003) and allows per-user ACLs on privileged ports. Although the permission is for all IP addresses, not a specific one. But one could create a virtual network device per user and assign it a mac_mls policy to restrict that interface to that user... hmmmmm...

We also already have the technology, and HTTP/2.0 could have allowed for this, if we just used SRV records.

If we had SRV records, you could use whatever damn port you want, and it would be invisible to users.

It would also allow us to not have to get load balancers for most (definitely not all) setups - but that is a different rant.

This wouldn't change anything, because you cannot run multiple service instances on the same port, so users would still have to specify the port in the URL. An SRV record in DNS could help if anything supported it. But it is not needed now, because with IPv6 we have enough addresses to assign each process a unique IP.

Good enough + soon enough beats "best possible", or even "better later".

In this case Ethernet + TCP/IP beat OSI.

Also, JavaScript beat any number of sensible scripting languages. Houses are frequently constructed such that they are not very serviceable.

I'm not sure your example matches your mantra. OSI was usable before ISPs became a thing, you ran OSI over X.25.

X.25 isn't the whole networking stack. Elements of OSI were available, and are still available, but most people use something else because it's easier: good enough, soon enough.

I was developing for OSI at a time when it was not possible to do equivalent things on a WAN in Europe using TCP/IP.

TCP/IP didn't win by being first.

How did TCP/IP win?

I'm not sure that the OSI people even realized that they were in competition with anything else. I think they basically neglected the low end of the market.

At the time there were multiple LAN protocols that were mostly used for file sharing, Netware, Appletalk, NetBEUI, etc. You had NFS for TCP/IP but it wasn't really used for anything other than UNIX workstations.

Probably it would have required somebody to write a cut-down OSI stack for MS-DOS that could be linked to a particular killer application (whatever that was, maybe FTAM).

This would still have had to compete with the way that TCP/IP was able to swallow up the other LAN protocols, we wouldn't have had to go through the IPv4 to IPv6 migration though.

Umm, the real problem is that I can't discover what port you are on.

So, in the article's instance, the machine owner needs to set up a web server on default port 80 that redirects to the correct web server on the correct non-privileged port.

Sure you can, this is the entire point of service discovery. We've built so much crap to deal with the fact that web browsers don't resolve SRV records (and, at least for HTTP/1.1, are prohibited by RFC from doing so).

The article does not mention backups or standardized images at all. These are huge reasons why virtualized machines are attractive, and which multi-tenant OSes don't provide.

In the wordpress example you could use named pipes/sockets, but it's rather complex to manage. A better idea is to hand out ipv6 addresses and let users bind to them.

Why was the original title ("Privileged Ports Cause Climate Change") changed?

I guess because it's quite silly, grossly imprecise and click-bait-y.

Because the moderators of HN can't make up their minds on the rules

"You must post with the original title, or we'll change it"

"The original title was bad, so we changed it"

At this point, why even give the option to give a title at all?

On the contrary! The rule is simple and consistent and has been the same for many years. You'll find it at https://news.ycombinator.com/newsguidelines.html.

If DNS returned IP:port for hostname, that would sort it too.

www.foo.com --> sharedserver:8000, www.bar.com --> sharedserver:8001 and so on.

It can, with SRV records, which is what things like ActiveDirectory and XMPP use to discover the correct server and port for a given name.

Its use for HTTP (and most other L7 protocols in general) just never really caught on.
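For anyone who hasn't dealt with SRV: each record carries a priority, a weight, a port and a target, and the client picks among them. Below is a simplified sketch of that client-side choice, roughly following the RFC 2782 rules (lowest priority wins, weights break ties), applied to records that have already been resolved. The Python standard library has no SRV resolver, so the lookup itself is left out, and the record values in the usage note are invented.

```python
import random
from typing import NamedTuple, Sequence


class Srv(NamedTuple):
    priority: int  # lower is preferred
    weight: int    # relative share among records of equal priority
    port: int
    target: str


def pick_srv(records: Sequence[Srv], rng: random.Random) -> Srv:
    """Choose one record: lowest priority first, then weighted random."""
    if not records:
        raise ValueError("no SRV records")
    best = min(r.priority for r in records)
    candidates = [r for r in records if r.priority == best]
    total = sum(r.weight for r in candidates)
    if total == 0:
        # All weights are zero: treat the candidates as equals.
        return rng.choice(candidates)
    # Simplification: zero-weight records are never chosen when others have weight.
    weights = [r.weight for r in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

With records like `Srv(10, 5, 8000, "sharedserver.example.com.")` and `Srv(20, 5, 8001, "backup.example.com.")`, the first one is always picked, and the client would connect to port 8000 without the user ever typing it.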

I believe Consul can do this and it's a pretty nice capability but the effort of retrofitting it onto things that just assume well known ports would be phenomenal.

I mean even if you know you want http the "right" thing to do is look up the port number in /etc/services but who ever does that? We all just hardcode 80. So that's another potential technique that we can't use...
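For what it's worth, the lookup is a one-liner in most languages. A small Python sketch using `socket.getservbyname`, which consults the system services database (usually /etc/services); the `service_port` wrapper and its `default` fallback are additions for illustration, since minimal systems may not ship that file.

```python
import socket
from typing import Optional


def service_port(name: str, proto: str = "tcp", default: Optional[int] = None) -> int:
    """Look up a service's port by name, falling back to `default` if unknown."""
    try:
        return socket.getservbyname(name, proto)
    except OSError:
        # Raised when the service/protocol pair is not in the database.
        if default is None:
            raise
        return default


http_port = service_port("http", default=80)
```

Of course this only tells you what the local machine thinks "http" means, not what port the remote server is actually on, which is the gap SRV records would fill.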

This article makes no sense. Virtualization's first use was to run a guest OS on a different host OS. Containerization's first use was to achieve reproducible configuration at a lower cost than virtualization. Neither was trying to solve a multi-tenant hosting problem where everybody is running the same system -- were it that simple!

The first use of virtualization that I can remember was VM/370. It did the things you said plus hosted systems for many users connecting via terminals. Supporting legacy systems and getting higher utilization with multiple workloads were also advertised benefits of VMware on x86. IBM and NUMA vendors touted isolation of multiple workloads or users with things like LPARS. The cloud providers now claim a lot of this stuff.

So, it seems like it was intended to do these things going back to the 70's. They also designed insecure, bloated solutions that INFOSEC founders like Paul Karger called them out for. Resulted in better designs like KVM/370, KeyKOS, VAX VMM, separation kernels, and recently mCertiKOS. Most peddling virtualization for multiple users still use bloated, untrustworthy components though.
