Iocage – A FreeBSD jail manager (readthedocs.io)
100 points by HugoDaniel on Nov 13, 2016 | 45 comments

I'm deeply involved in the Linux container ecosystem (docker, rkt) and I'd like to understand the difference in workflow between that and this. Is anyone familiar enough to speak to both?

Jails are a general mechanism to run multiple userlands on one kernel. They started with a change of root directory, an IPv4 address, and no access to anything dangerous (e.g. /dev/mem). Jails grew a lot of features, with the biggest bump in FreeBSD 8. Starting with FreeBSD 8, jails can be nested, have multiple IP addresses, etc., but they still lacked System V IPC. This changed with FreeBSD 11.
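For a flavour of the interface: jails are configured declaratively in /etc/jail.conf and started with jail(8). A minimal sketch (the jail name, hostname, path and addresses are all made up) showing the FreeBSD 8+ features mentioned above:

    # /etc/jail.conf -- all names and addresses here are hypothetical
    exec.start = "/bin/sh /etc/rc";
    exec.stop = "/bin/sh /etc/rc.shutdown";
    mount.devfs;

    web {
        host.hostname = "web.example.org";
        path = "/usr/jails/web";
        ip4.addr = 192.0.2.10, 192.0.2.11;  # multiple addresses since FreeBSD 8
        children.max = 4;                   # allow nested jails
    }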

Now the only thing jails can't have is their own IP stack. Jails share the host's IP stack, which improves efficiency and simplifies most deployments, but it prevents them from having administrative access to the IP stack. There is an experimental kernel feature (VIMAGE) that allows jails to run their own instance of the IP stack, but there are still some nasty bugs hiding in this code, because nobody thought about how to tear down the IP stack. After all, it used to be initialised once during the boot process and kept running until the power went out.

The largest difference is in the mindset behind jails. Jails are designed as secure operating system level virtualisation. Docker on the other hand is fairly fragile and offers neither secure isolation between containers nor between containers and the host. Jails can contain a complete userland, and this is a very common setup.

A full FreeBSD userland + some ports/packages to make it useful is about 1 GB. This used to be a lot 15+ years ago when jails were created, and the older jail management tools like ezjail reduce the per-jail storage requirements with nullfs and unionfs hacks. These days 1 GB isn't that much for a simple container, and most FreeBSD servers run on ZFS. ZFS offers a much simpler and cleaner way to reduce storage requirements: just clone a snapshot (the template), create a new jail from it, and copy a few config files into the clone. The only problem is that you can't rebase your clone.
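The clone-based workflow boils down to a few commands; the dataset, path and file names here are made up, and this assumes a ZFS pool named zroot:

    # snapshot the prepared template once
    zfs snapshot zroot/jails/template@base
    # instantiate a new jail as a cheap clone of the snapshot
    zfs clone zroot/jails/template@base zroot/jails/web
    # copy in the handful of per-jail config files
    cp /usr/jails/skel/rc.conf /usr/jails/web/etc/rc.conf

The caveat applies as stated: the clone stays tied to its origin snapshot, so the template can't later be swapped out underneath it.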

Docker is designed around the idea of single purpose containers without stable storage. You can use jails to implement this idea, but FreeBSD jails support more than that. Also all the FreeBSD jail managers I used try to stay out of your network configuration as far as possible and at most configure alias IP addresses on existing interfaces.

Docker is very opinionated software fighting against limitations imposed on it by the Linux kernel. Jails are a FreeBSD kernel feature touching multiple parts of the kernel, with a minimal userland interface in the FreeBSD base system. Multiple higher-level jail managers are available in the FreeBSD ports tree.

There is no reason why you couldn't implement a docker-like jail manager, and the jetpack project started doing exactly this. Keep in mind that docker images are the world's new statically linked binaries for people who can't figure out how to define and reproduce the relevant parts of their development environment in their production environment. Executing existing docker images with their Linux binaries would probably require a massive update to the Linux compatibility layer (a reimplementation of the Linux syscall ABI).

Actually, Bjoern Zeeb (bz@) did a lot of work around VIMAGE (sometimes also called VNET) in order to track down these bugs and fix them for 11.0-RELEASE. If you haven't, you should give it a try, because it's now the way to get a per-jail IP stack.

I did, and during my tests I found no more panics, but I ran into a bug in the IPFW tables logic. The fix hasn't been backported to FreeBSD 11.0 as an errata. Also the epair pseudo-interface used for connecting IP stacks tops out at about 4 Gb/s, while the loopback interface can handle >40 Gb/s on the same hardware, because epair uses just one kernel thread (I don't remember the pps).

I hope we'll get VIMAGE into GENERIC for FreeBSD 11.1. He completely refactored the IP stack shutdown logic. Now the layers are shut down from top to bottom instead of the other way around. This avoids most of the nasty locking problems by first draining the higher layers, with their pointers to the lower layer resources (e.g. routes, interfaces). The remaining bugs won't be found without a lot more exposure.

I did get panics, but I haven't been able to reproduce them. The only thing I can think of is: create a jail without an assigned IP, start it, assign one, shut it down, and start it up again.

> Docker on the other hand is fairly fragile and offers neither secure isolation between containers nor between containers and the host.

Please elaborate... Otherwise, great summary.

Docker moves fast and breaks things without stabilising them first. This post explains the problem fairly well: https://thehftguy.wordpress.com/2016/11/01/docker-in-product....

There have been lots of bugs (some could be considered design failures) allowing processes to escape from a docker container. Fixing those problems hasn't received the attention from the docker community I would expect from a serious OS level virtualisation community. In some cases it boiled down to "yeah just don't do that" or "no problem just put docker inside a VM".

Perhaps you have the knowledge to inform me on a question to which I have never been able to find a plain answer. Do FreeBSD jails have anything like the UID/GID virtualization, namespace isolation, and resource quotas that LXC/LXD gives you on Linux via cgroups?

Why is this NEVER brought up when FreeBSD and Linux are compared?

UIDs and GIDs are just numbers to the kernel. Root inside a jail can become any UID/GID and chmod/chown files just like root outside the jail, but even root inside a jail is locked into the jail root directory.

You can prevent root inside a jail from modifying files with several mechanisms. The simplest is to not allow (re-)mounting from inside the jail (the default) and to mount the relevant file systems read-only. BSD extended file flags offer a finer granularity, and by default root inside a jail isn't considered privileged by chflags(2). You can disable this security feature, in which case the normal rules apply and access is controlled by the secure level (jails have their own secure level).
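In jail.conf terms, these knobs look roughly like this (the jail name is hypothetical; both settings shown are already the defaults):

    web {
        allow.mount = false;     # no (re-)mounting from inside the jail
        allow.chflags = false;   # jailed root is unprivileged for chflags(2)
    }

With allow.chflags off, host root can set the system immutable flag on a file inside the jail tree (chflags schg /usr/jails/web/etc/master.passwd) and jailed root cannot clear it.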

Processes running inside a jail have access to the full UID and GID namespace without any mappings.

PIDs exist in a global namespace and are allocated by the kernel. Processes inside a jail can't address PIDs unless those PIDs are inside the same jail or a child jail. Root on the host can always address all PIDs. Non-root processes on the host (and in jails) are subject to the security.bsd.see_other_uids and security.bsd.see_other_gids sysctls, which can be disabled to block the obvious ways of spying on other users. There is also support for various forms of mandatory access control. Have a look at the `mac_*` manual pages for more information on the available MAC disciplines.
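Those two sysctls can be turned off persistently in /etc/sysctl.conf:

    # hide processes of other UIDs/GIDs from non-root users
    security.bsd.see_other_uids=0
    security.bsd.see_other_gids=0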

Restricting jails to just their own processes in a shared namespace is a much cleaner alternative than losing the PID as a unique global process identifier. And mapping between different scopes is even worse than extending the identifier from an integer into a tuple.

FreeBSD also supports hierarchical resource limits. Jail ids are one possible subject type to limit, so you can limit resource consumption per jail. See https://www.freebsd.org/cgi/man.cgi?rctl and https://www.freebsd.org/cgi/man.cgi?rctl.conf for more details on that.
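A few example rules in /etc/rctl.conf (the jail name "web" and the amounts are hypothetical):

    # deny memory allocations beyond 1 GB for the jail named web
    jail:web:memoryuse:deny=1g
    # cap the jail at 200 processes
    jail:web:maxproc:deny=200
    # limit the jail to 50% of one CPU
    jail:web:pcpu:deny=50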

Every LXC/LXD container can have its own PID 1 init process, and none of them even have visibility of the others. That's what I am not seeing in jails. Root on the host cannot even see any of the LXC/LXD PIDs.

Am I not seeing it because I am not looking hard enough, because FreeBSD guys don't really understand this whole issue, or because it's not there?

Re: resource limits: can the CPU load per jail be limited?

What issue? Root outside the jail is implicitly trusted by the jail, so why would you hide resources (in this case jailed processes) from the "true" root? Jails can't see each other's processes. The only case I can think of where one jail could affect another is if they are connected by a unix domain socket; in that case you could probably send a process descriptor from one jail to the other, and the capability represented by the descriptor should cross the jail boundary. This would require the sender to sendmsg() the descriptor on purpose, and I wouldn't consider it a security problem, because both jails have to cooperate to create this condition and be configured to share an existing connected unix domain socket or a shared writable file system subtree (e.g. through nullfs, what Linux calls a bind mount).

Yes, you can limit resources per jail through hierarchical resource limits.

Does this answer your questions or did I misunderstand your question?

Across jails, PIDs are not visible. I can't comment on whether root on a Linux host is able to see the PIDs in an LXC container, but I bet there is a way. After all, they are running on the same kernel. Is the non-visibility of the processes just a userland trick on Linux?

Resource limits for jails can be done with rctl.

It's important to understand that when people talk about cgroups, they mean not only CPU/memory limits but also network and disk I/O limits, which FreeBSD doesn't offer out of the box.

The firewall can match on the jail id for local sockets and apply traffic shaping based on those matches. Hierarchical resource limits on CPU/disk/memory/... can be configured per jail.
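As a sketch of the firewall side: ipfw can select packets belonging to sockets owned by a given jail and feed them into a dummynet pipe (the jail id, rule numbers and bandwidth here are made up):

    # create a 10 Mbit/s pipe and send jail 5's traffic through it
    ipfw pipe 1 config bw 10Mbit/s
    ipfw add 100 pipe 1 ip from any to any jail 5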

Of course jails have UID/GID virtualization and namespace isolation; that's the whole purpose of jails. Cgroups-like resource quotas, not so much, but there are ways to limit resources.

Actually for resource quotas/management there is rctl. So resources can be managed.

This is a fantastic summary, thank you!

There are lots of jail wrappers, but in most cases it is just as easy to take the time to understand how jail.conf works and use that (see FreeBSD jails the hard way [1]). It's fairly easy to just use ZFS clones to create jails and customise the exec.prestart/exec.start functions to do automatic provisioning and configuration (e.g. I store the port forwards in a variable for each jail, and process this to automatically set up pf rules when the jail starts).
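jail.conf supports user-defined variables, so a setup like the one described can be sketched as follows (the $fwd variable and the jail-pf.sh helper are hypothetical, not standard tools; $name expands to the jail's name):

    web {
        path = "/usr/jails/web";
        ip4.addr = 192.0.2.10;
        $fwd = "80 443";  # ports to forward to this jail
        exec.prestart = "/usr/local/etc/jail-pf.sh $name '$fwd'";
        exec.start = "/bin/sh /etc/rc";
    }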

[1] https://clinta.github.io/freebsd-jails-the-hard-way/

Is there anyone from the FreeBSD community interested in standardising FreeBSD containers inside the OCI specification? Currently only Linux, Solaris and Windows have been included in the standard -- which is a bit disappointing (I've always been fond of FreeBSD).

If anyone is interested, please contribute to the Open Container Initiative. https://github.com/opencontainers

Not exactly OCI, but as an example of what can be done with FreeBSD, check out https://github.com/3ofcoins/jetpack that's a APC implementation using Jails/ZFS and other FreeBSD goodies.

iocage is great, but according to the README on GitHub it is not being developed anymore. The author is working on a rewrite in Go. The rewrite wasn't done yet last I checked.

Due to this very fact I've forked iocage into iocell (https://github.com/bartekrutkowski/iocell), where I am fixing numerous critical bugs currently present in iocage that most likely won't be fixed, due to iocage support being on hold (if not completely abandoned) until the rewrite is available. Feel free to check it out, especially the `develop` branch! Once I have most of the annoying bugs fixed, I'll merge `develop` into `master` and create a FreeBSD port for iocell.

Is there a way I can try iocell while I've got existing iocage jails?

'It depends' ;) First of all, if you're using 'stable iocage' then you're out of luck, since too much changed in devel iocage, and iocell is based on the devel branch. Second, it would either involve renaming your datasets in a few places (from iocage to iocell) or changing a few strings in iocell (from iocell to iocage). Other than that, iocell aims to work exactly the same way as iocage, with the same principles, workflow and so on. Check out the commit history in the iocell `develop` branch to find the commit where the iocage name is changed; it should give you an idea of what to do and where. I am also planning a migration guide for existing iocage installations, but it might not be very easy/very stable, I am afraid.

Okay, let me ask it a different way then.

If I had to move away from iocage right away, how would I preserve my jail? I don't really care what the jail is called - I just want Plex or whatever to start up when I start the jail, and to be able to keep upgrading packages inside it.

That's an entirely different question, and it doesn't really have much in common with iocage/iocell. I, for example, have every single thing I host automated in a way where it doesn't matter whether it runs in a jail provided by iocage, a jail built manually, a VM in AWS, or a physical machine. This way, my migration path would be: wipe existing jails, create new ones with the new tool, launch automation, done. You, however, might need an entirely different approach, based on how your environment/setup looks right now. One very primitive way would be to simply archive (rsync, tar, whatever else) the jail contents and deploy them in jails created with the new tool.

This. I still use iocage over alternatives though.

If anyone knows an updated status on the Go rewrite, please chime in!

How does it compare to ezjail?

Ezjail is fairly old. It works with the older, less capable jail API and is restricted to the features supported by that API. It was also designed for use with UFS on tiny disks.

Let's say you are a web hoster around the year 2000 and your servers have a handful of 36 GB or 72 GB SCSI disks. You want to protect each customer from all other customers and protect yourself from all customer scripts. This was before IA32 CPUs offered the features to support efficient transparent virtualisation, and even if they had, the resource demand per VM would have been too high. As long as your customers are happy with static file hosting everything is fine, but as soon as they want to execute some useful server-side scripts you have a problem. FreeBSD offers a way to run one HTTP server per customer inside a jail, but keeping a full FreeBSD userland (base + HTTP server + databases + scripting language + customer code) per customer would quickly fill your puny little disks. Ezjail offers a neat solution to the problem: store a template just once and instantiate it with a nullfs read-only mount. Now your storage requirements are manageable at a reasonable price with hardware of the day, and your buffer cache hit rates are better too. All of these indirection and aliasing hacks make ezjail more complicated than modern jail managers, because ezjail had to work around operating system limitations instead of taking advantage of operating system features that had yet to be invented.
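The ezjail trick is essentially one fstab line per jail; the paths here follow the ezjail defaults, but treat them as illustrative:

    # /etc/fstab.web -- mount the shared basejail read-only via nullfs
    /usr/jails/basejail  /usr/jails/web/basejail  nullfs  ro  0  0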

Ezjail now supports ZFS.

Sharing one basejail via nullfs is a useful feature; can iocage do this?

Nullfs not only saves space on disk but also allows faster updates (extract a new basejail, then switch all jails to it, without a full upgrade of each jail).

Also in software old doesn't mean bad, and newer is not automatically better.

I didn't mean to imply that. I just explained why ezjail looks so convoluted to a new user. Ezjail still solves the same problems today as it did 10 years ago. On the other hand, since not even bloatware has kept up with the decrease in storage costs, users can afford more comfortable trade-offs today.

I've found iocage's user interface to be more intuitive. I started using iocage because ezjail broke something when trying to upgrade a jail, but that was probably me doing something wrong.

I'm not sure if Go is the best language for this task, but he might have his reasons.

Why do you think Go is not the best language for this task?

As iocage pretty much only wraps FreeBSD system tasks, I can't see why a language different than "shell scripting" would be an improvement here - does it use API calls instead?

I have no idea how iocage is being rewritten, but if I were to rewrite iocage in Go I would call the relevant system calls directly. I would not wrap existing programs or shell scripts.

The iocage shell script was totally unmaintainable. I know because I forked it and used it for my own purposes, until I stopped and wrote my own thing (coincidentally, also in Go). Implementing state machines correctly in shell script is painful.

I'd love to see a combined FreeBSD jail and bhyve manager. Can libvirt do that?

Have you looked at https://github.com/pr1ntf/iohyve?

CBSd https://www.bsdstore.ru/en/about.html

It's a little overwhelming though. FreeBSD really needs more/better jail and bhyve tools.

iocage is easier to use than ezjail or warden, but I still managed to end up with a broken system where the system thought it was FreeBSD 11 but still expected to use packages from FreeBSD 10 (I get an ABI error when doing pkg upgrade).

The distinction between jails and basejails is tricky to follow.

I don't know where the rewrite went.

What do your pkg.conf entries look like? Do they point at 10 specifically?

    FreeBSD: {
      url: "pkg+http://pkg0.nyi.FreeBSD.org/${ABI}/quarterly",
      mirror_type: "srv",
      signature_type: "fingerprints",
      fingerprints: "/usr/share/keys/pkg",
      enabled: yes
    }
Works well for me. The only thing I dislike is the GUIDs for the containers that can't be changed to something shorter.

At the time I had to decide, ezjail didn't work with FreeBSD 10; not sure if it has been updated.

...Am using ezjail on a mix of 10.3 and 11 now. Works flawlessly.
