Hacker News
Solaris to Linux Migration 2017 (brendangregg.com)
440 points by hs86 on Sept 5, 2017 | 129 comments

I'm ex-Sun, kernel group. I'm retired, early at 55, but after reading Brendan's write-up, I'd work for that guy. Holy smokes, he is all over it. Reminds me of me when I was in the groove. Brendan if you read this, I'm old school, not good at all the Ruby on Rails etc, but systems, yeah, pretty good. Not looking for money, looking for working with smart people, I can be paid in stock. If it works out, great, if it doesn't still great because I like smart people.

Sorry for making it about me, you all should read his post, it's someone who is completely in the groove, has breadth and depth, knows about systems. These people are rare, but hugely valuable when you need to scale stuff up.

If you are indeed interested in coming out of retirement, Brendan Gregg does performance engineering at Netflix. It might be worth reaching out.


I'm kinda old and burned out but thanks for the link. I'll reach out. It could be a boatload of fun.

It's even much more fun than described in this article, because Netflix mostly works with FreeBSD not Linux. So no NIH syndrome, proper dtrace (no eBPF hacks), proper ZFS, proper tooling, easy kernel maintenance.

As I've seen it described here before (by Brendan Gregg ?) Netflix uses FreeBSD for the CDN servers (streaming the video), and Linux for everything else, browsing Netflix, encoding, etc.

That's true, although he was right about the Linux counterparts. Reading this migration guide, at least to me, it seems that all Linux replacements for things like dtrace, ZFS, SMF, zones... are just subpar.

So I got a different vibe. I can't speak to how good all that stuff is in Linux vs Solaris (or BSD) so it's just a vibe. The feeling I got was it was someone who was extracting the best value out of the system he was using.

Which is refreshing, the constant whining about Linux vs $SOMEONES_FAVORITE_OS gets old. Very pleasant to have a "just the facts" pile of info. And I learned a few tools that I didn't know about.

Of course it is valuable, and being a Linux user myself, I'm glad somebody made the effort to make all those tools available for us.

However, I feel sad seeing the best tech from Sun going down the drain despite being open source.

I've been here before. I fought like crazy to prevent SunOS 4.x from being tossed on the trash pile. If you think Solaris going away sucks, it sucked harder (for me) to see SunOS go away. It was a far more pleasant environment than Solaris ever was.

As an example of that, because it's fair to go "what??!", I installed some open source solaris in the last few years to play with it. The default install was miserable, the tools were crap. I was having lunch with Bryan Cantrill and some of his team and I mentioned my negative reaction. They laughed and said you have to install the GNU tools and put that in your path first.

Say what? SunOS 4.x came with good tools. In /usr/bin. The default install was useful. Solaris was "standard" in that it came with all the System V stuff in /usr/bin. And Shannon and crew "protected" those standard tools and refused to evolve them to make them useful. They started sucky and stayed sucky all in the name of being "standard". Except nobody else used those tools. There was no other System V based Unix that had any significant volume. *BSD certainly didn't move to System V, Linux wasn't System V, the only System V Unix with any volume was Solaris. So they were standard for no good reason and all these years later Solaris still has crappy stuff in /usr/bin, you want /opt/GNU/bin or something.

Sorry for the rant, and it's off topic. Well maybe it's off topic, maybe not. I sort of wonder if Sun had shipped the GNU stuff and installed it by default as the real tools, would it have made any difference? Probably not but boy, do the default tools make a bad first impression.

> Sorry for the rant, and it's off topic

No, I always find it interesting to hear stories about people who worked for companies like Sun in the past.


We have an opening on my team right now for a Senior Performance Engineer (https://jobs.netflix.com/jobs/865018). My team is kinda small, so we can't hire everyone (much as that would be great), but there's a lot of hiring elsewhere in the company, and I regularly work with different teams.

Man, I'd be interested but all the web stuff is not my wheelhouse. Maybe the better path is I tinker with lmbench for you just for fun. I'm not really looking for work so much as some interactions with sharp people.

Since I retired I've been doing some for hire tractor/excavator work (I live above you in the Santa Cruz mountains, have 3 Kubotas) and a lot of wrenching with my mechanic. Who is a decent mechanic but because he's in so much pain he frequently self medicates which makes him not a member of the sharp people team.

So I'm missing the conversations that engineers have. Normal people are fine and all but not as fun as poking at a hard problem with someone smarter than me.

I kinda think the job that might be good for me is helping out a VC firm that funds sort of system stuff, like cloud stuff, I/O stuff, etc. I'm a dinosaur, I like C, I like kernels, I like thinking about I/O and how to scale it. Systems stuff is where I like to be and there don't seem to be too many places that want that anymore. Or I'm just not aware of them.

Brendan if you ever want to go to lunch and yap about lmbench or tell me what you are working on, hey, lunch is on me. We can go to crappy Chinese (I used to have offices in the water tower plaza and that's what we called the Chinese place that's just up the street from you. It's the best Chinese we could find but it's nowhere near as good as San Francisco Chinese food, hence the name) or wherever is good, I'd love to find a new good place for food down there.

I share the thought.

Happiness from work comes from the work you are doing and whom you are working with. If offered the chance to work with someone like Brendan I would jump at the opportunity.

My sentiment exactly. I'd give almost anything for the opportunity to work on SmartOS with Bryan Cantrill.

If you have some skills that he would value, he's a friend of mine, we aren't like best buds or anything but we share the same passion for operating systems so we get together once in a while. I certainly know him well enough to introduce you.

I'd want to have some idea that it would be a good fit, so if you are really interested please contact me via email. Look at my profile, I think I stuck my email there, if not you can find it easily enough.

If you're open to other opportunities and interested in TCP, Ethernet drivers, file system work, jails stuff on FreeBSD for one of the biggest CDNs you can contact me at kbowling at llnw dot com. I am the hiring manager.

Kind words, thanks! ... so I was using lmbench last Friday to profile memory latency (lat_mem_rd), is that one of yours?

Ah, lat_mem_rd, so awesomely named :)

Yeah, that's mine and I think I was the first to do it that way though I saw the idea described in Hennessy & Patterson and they didn't give me credit so they must have dreamed it up too. It's clever but pretty obvious once you think about it.

Very cool that you are still using it. lmbench has aged pretty well, mhz.c still works. BTW, lat_mem_rd has changed processors all over the place. They all have code to detect sequential strides and prefetch. I think Carl and I switched it to go backwards to try and defeat their prefetch. Dunno if they prefetch that way as well. The prefetch is sort of cool and sort of annoying because it hides how the hardware would perform if the access was random. Without the prefetch you can determine cache line size, L1,L2,L3 size and latency, and TLB size, and main mem latency. With the prefetch you think you are getting all that stuff but it looks different. If you have ideas on how to defeat the prefetch and still get the info I'd love to hear them.
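One commonly used way to keep a stride prefetcher from hiding latency is to chase pointers through a random permutation that forms a single cycle, so each load depends on the previous one and the next address is unpredictable. The sketch below is not lmbench code; the function names are made up for illustration, and because it runs in a Python interpreter it shows the access pattern rather than giving trustworthy hardware numbers. Random hops also add TLB misses, so in a real C version you would still have to separate those effects.

```python
import random
import time


def build_chain(n_slots, seed=0):
    """Return a list where chain[i] is the next slot to visit.

    Sattolo's algorithm yields a permutation that is one single cycle, so
    following the links touches every slot exactly once before repeating,
    in an order a sequential-stride detector cannot predict.
    """
    rng = random.Random(seed)
    chain = list(range(n_slots))
    for i in range(n_slots - 1, 0, -1):
        j = rng.randrange(i)  # j < i, never i itself: this forces one big cycle
        chain[i], chain[j] = chain[j], chain[i]
    return chain


def chase(chain, steps):
    """Follow the chain for `steps` dependent loads and return the final slot."""
    p = 0
    for _ in range(steps):
        p = chain[p]
    return p


def cycle_length(chain):
    """Count how many hops it takes to get back to slot 0."""
    p, hops = chain[0], 1
    while p != 0:
        p = chain[p]
        hops += 1
    return hops


if __name__ == "__main__":
    n = 1 << 16
    chain = build_chain(n)
    start = time.perf_counter()
    chase(chain, n)
    elapsed = time.perf_counter() - start
    print(f"{elapsed / n * 1e9:.1f} ns per hop (interpreter overhead included)")
```

Sweeping `n_slots` across sizes that straddle the cache levels is the part that would reveal the latency steps in a compiled version of the same loop.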

Another funny thing about lmbench: I was recently approached by a company that's got it working on phones. They want to feed back their stuff and have me do another release that makes it easy to run it on phones. I might do that, still waiting their dump of code. Kind of a neat idea.

I still tinker with lmbench, if you have something you want measured and it fits with the idea of measuring bandwidth and latency of everything (and fits with what Linus and I wanted out of lmbench, if you tune for those metrics you are making the hardware/OS better, so no silly show off your whatever, has to be generically useful), let me know what it is and I'll see if I can code up a benchmark. Making new measurements using the lmbench framework is pretty easy for me, I know that code.

BTW, I have an internal version of lmbench that adds stdio support to lmdd (if you haven't played with lmdd, man you've missed out. It can simulate many I/O benchmarks and give you results really quickly). So why would I want stdio support? Because BitKeeper has an enhanced stdio library that can stack filters on a FILE*. And it has some interesting filters, like gzip, lz4. And some much more interesting filters that do CRCs on blocks and an XOR block at the end. And another one that is so complicated I would want to describe it in person.

You can combine lz4 with the CRC stuff. Once I had stdio support, I linked with BK's stdio instead and added options to push all those filters and I could see what the overhead was. I did all this when Wayne Scott was putting that crud in BK so I could make sure we were not slowing BK down (we weren't, I think all that crud runs around 1 GB/sec which was fast enough for me). Oh and I also added support to do I/O backwards because we did that in BK as well so we had an append only file format for the ChangeSet file. Lots of fun stuff in there, I need to package it up and release it. If that I/O stuff is of any interest to you, it's open source under the Apache v2 license and if the license bugs you tell me what one you want and I'll rerelease it under that one. We picked that license because we thought it was the one that was easiest for everyone to use, if we're wrong we'll change it.
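For readers wondering what "CRCs on blocks and an XOR block at the end" can buy you, here is a minimal sketch of one plausible reading: the per-block CRCs say which block went bad, and the XOR block lets you rebuild it. This is an assumption for illustration, not BitKeeper's actual file format; the block size and function names are invented, and it assumes corruption flips bytes without changing a block's length.

```python
import zlib

BLOCK_SIZE = 4096  # hypothetical block size chosen for the sketch


def encode(data, block_size=BLOCK_SIZE):
    """Split data into blocks, CRC each one, and compute an XOR parity block."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    crcs = [zlib.crc32(b) for b in blocks]
    parity = bytearray(block_size)
    for b in blocks:
        for k, byte in enumerate(b):
            parity[k] ^= byte  # a short final block is treated as zero-padded
    return blocks, crcs, bytes(parity)


def find_bad_blocks(blocks, crcs):
    """Return the indexes of blocks whose CRC no longer matches."""
    return [i for i, (b, c) in enumerate(zip(blocks, crcs)) if zlib.crc32(b) != c]


def repair(blocks, crcs, parity):
    """Rebuild at most one damaged block from the parity block and the good blocks."""
    bad = find_bad_blocks(blocks, crcs)
    if not bad:
        return list(blocks)
    if len(bad) > 1:
        raise ValueError("a single XOR block can only rebuild one damaged block")
    target = bad[0]
    rebuilt = bytearray(parity)
    for i, b in enumerate(blocks):
        if i != target:
            for k, byte in enumerate(b):
                rebuilt[k] ^= byte
    candidate = bytes(rebuilt[:len(blocks[target])])
    if zlib.crc32(candidate) != crcs[target]:
        raise ValueError("rebuilt block still fails its CRC")
    fixed = list(blocks)
    fixed[target] = candidate
    return fixed
```

The CRC check after the rebuild matters: if the parity block itself was damaged, the XOR would quietly produce garbage.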

Nice writeup on the migration stuff, sounds like you are having fun. Getting paid to have fun is sweet, that's what it was like for me at Sun in the SunOS 4.x days. I got there and spent the first 3 years telling people "man, I love this job so much. I'd work here for free if I had the money." Someone took pity on me and said "Do you never want a raise? Because that's how you never get a raise" :) I shut up and started getting raises.

Enjoy that job. Not every job is that fun, you are at a special part of your career, enjoy the heck out of it.

Brendan is one of the few guys who try to squeeze the last drop of performance from kernels. He is brilliant and has done some great work.

I really enjoyed using a Sun box around 17 years ago for C dev - my first professional programming job. Thanks for your work :)

So was that some random Sun machine or are you one of the few people who had what was code named sunbox? It was a clustered NFS server and it was my baby.

Pretty sure you just mean a random sun machine, there weren't a lot of sunboxen sold.

Sadly, the former. It was a pizza box shape, and had (compared to the beige box systems in use in the company) a gloriously large and clear CRT. Would compile our code base as fast as one of the aforementioned beige boxes with twice the clock speed.

His books are awesome too. I keep them by my side pretty much constantly.

Nice article! Though I do think the article could have more clearly noted that Linux containers are not meant as security boundaries. It doesn't explicitly say it but it is a very important distinction.

Unlike FreeBSD jails and Solaris Zones. You can't run multiple Docker tenants safely on the same hardware. Docker is basically the equivalent of a sign which says: "don't walk on the grass" as opposed to an actual wall which FreeBSD jails and Solaris zones have. Now if you have a very homogeneous environment (say you are deploying hundreds of instances of the exact same app) then this is probably fine. Docker is primarily a deployment tool. If you're an organization which runs all kinds of applications (with varying levels of security quality) that's an entirely different story.

There are dangerous things that you can allow Docker to do. But if you don't do those things, it is pretty difficult to break out of a container.

Red Hat has been especially good here, with not allowing anyone but host-root to connect to Docker and using SELinux and seccomp filtering. With those working, it doesn't matter if your container mounts a host filesystem since it won't have the correct SELinux roles and types anyway.

Many people claim that ruins Docker, since now you can't use Docker from within Docker. But that's the price you pay for security.

I believe that with the correct precautions, a Linux container is just as safe as a jail or zone. Perhaps the problem is just how easy it is for a sysadmin to put holes into the containers that ruin the security.

> Perhaps the problem is just how easy it is for a sysadmin to put holes into the containers that ruin the security.

I think there's a bit more to it than that. For some examples of other reasons people might be wary vs zones:

Docker itself is still a daemon that runs as root, combining a large number of different functionalities which require root access into a single binary with a large attack surface and a lot of code which doesn't need to be privileged. While isolation of Docker's responsibilities has begun, even their own security page [1] admits that there's a long way to go here.

Zones are, as many of the articles in this thread point out, a first class feature, designed and implemented as such. What docker/"containers" allow you to do is the culmination of many building blocks which have been incrementally added to the Linux kernel. Some of those were added pretty recently, and without an overall design, their interactions with other portions of the Linux kernel or other components of the system have often been surprising and led to a number of security issues over time. In comparison, both the code and the design of the system are relatively young. A good example of this can be found at [2], which ends with the following very apt quote:

> Why is it that several security vulnerabilities have sprung from the user namespaces implementation? The fundamental problem seems to be that user namespaces and their interactions with other parts of the kernel are rather complex—probably too complex for the few kernel developers with a close interest to consider all of the possible security implications. In addition, by making new functionality available to unprivileged users, user namespaces expand the attack surface of the kernel. Thus, it seems that as user namespaces come to be more widely deployed, other security bugs such as these are likely to be found.

It might also be interesting to read [3], which is already showing that 3.5 years later, user namespaces are still a breeding ground for security issues that lead to privilege escalation.

[1] https://docs.docker.com/engine/security/security/#related-in...

[2] https://lwn.net/Articles/543273/

[3] https://utcc.utoronto.ca/~cks/space/blog/linux/UserNamespace...

I can appreciate that the features needed to be developed and matured over time, but I don't understand why Linux didn't invent an umbrella concept to tie everything together.

It seems to me a grab bag of things which Linux allows to be independently namespaced/isolated: Cgroups, networking, PIDs, VFS, etc. From a kernel point of view, this would be the perfect use case for an "object-oriented" design with some kind of abstract container concept that reflected the nesting of each container, but instead it seems very scattered and ad-hoc.

In particular, each mechanism is opt-in and must be configured separately, very carefully; to approximate Zones you have to combine all of the mechanisms together and hope you didn't forget something, and also hope that the kernel's separation is perfect (which, given the vast amounts of "objects" it can address, is doubtful). To my untrained (in terms of kernel development) eye, this seems the opposite of future proof, because if the kernel invents some new namespacing feature, an application that uses all of the existing mechanisms won't automatically receive it, because there's no concept of a "container" as such.

opt-in seems like the wrong approach. The safer alternative would be that a new container process was completely isolated by default, and that whoever forked the process could explicitly specify the child's access (e.g. allow sharing the file system). This is, I believe, how BSD jails work.
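The future-proofing argument can be made concrete with a toy model. This is only a sketch: `ContainerSpec`, `opt_in_isolated`, and the `future_ns` namespace kind are hypothetical names, and nothing here talks to a real kernel. It just compares what ends up isolated when a launcher written today meets a kernel that later grows a new namespace type.

```python
from dataclasses import dataclass

# Namespace kinds a launcher written "today" knows about (illustrative list).
KNOWN = frozenset({"mount", "pid", "net", "user", "ipc", "uts", "cgroup"})


def opt_in_isolated(requested, kernel_supports):
    """Opt-in model: only what the launcher remembered to ask for is isolated."""
    return frozenset(requested) & frozenset(kernel_supports)


@dataclass(frozen=True)
class ContainerSpec:
    """Default-deny model: the launcher lists what to SHARE and nothing else."""
    shared: frozenset = frozenset()

    def isolated(self, kernel_supports):
        # Everything the kernel can isolate is isolated unless explicitly shared.
        return frozenset(kernel_supports) - self.shared


if __name__ == "__main__":
    newer_kernel = KNOWN | {"future_ns"}  # hypothetical namespace added later
    print(sorted(opt_in_isolated(KNOWN, newer_kernel)))
    print(sorted(ContainerSpec(shared=frozenset({"net"})).isolated(newer_kernel)))
```

In the opt-in model the old launcher never isolates `future_ns`; in the default-deny model it is isolated without the launcher changing at all.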

That "object-oriented" design is called capabilities.

> since now you can't use Docker from within Docker.

Docker-in-docker is a trashfire that barely works anyway, it's no real loss.

Think you might be talking about running the Docker daemon inside Docker, which is a different thing from just calling Docker from a container.

Ah, you're right. The latter is actually a pretty cool solution, I've had great success with using it to spawn what are essentially sibling containers.

Maybe relevant: "Setting the Record Straight: containers vs. Zones vs. Jails vs. VMs"


Some discussion about this article here on HN:

https://news.ycombinator.com/item?id=13982620 (160 days ago, 235 comments)

True, and that very long post basically says with many words: "Yes, Linux namespace (docker) isn't as secure as FreeBSD jails or Solaris Zones but security is not the problem docker solves. Docker solves a deployment problem, not a security problem."

Comparing containers (a concept) with VM/jails/zones is a non-sequitur. To quote:

A “container” is just a term people use to describe a combination of Linux namespaces and cgroups. Linux namespaces and cgroups ARE first class objects. NOT containers.


VMs, Jails, and Zones are if you bought the legos already put together AND glued. So it’s basically the Death Star and you don’t have to do any work you get it pre-assembled out of the box. You can’t even take it apart.

Containers come with just the pieces so while the box says to build the Death Star, you are not tied to that. You can build two boats connected by a flipping ocean and no one is going to stop you.

Docker is a bunch of boxes floating on a flipping ocean[1]. They could have made a deathstar, but they chose not to.

[1] https://www.google.com/search?q=docker+logo

Can't you just run docker in something like firejail?

I'm not sure what you mean here. Do you mean that:

1) There are known ways to perform Docker escapes on any or some common Docker setup. You could write a Docker escape binary or script today and it would not be a zero day. That's just the way Docker is.


2) You simply have less faith in the ways Docker performs isolation. One could write a Docker escape exploit and it would be a zero day, but you expect there to be more of such zero days in Docker than in Jails/Zones.


If 1) I'd be really interested in seeing it and if 2) I'd like to know more about what additional levels of isolation jails and zones (and LXC?) perform.

I've expanded a little bit more in another comment here, but I think 2) is less a case of any fundamentally different level of isolation, but more a maturity of design and implementation. While Zones were designed as first class, containers were not, they're built out of a variety of other building blocks, which while they may be more powerful primitives, have seen less review of design and implementation and exposed a large attack space on the kernel. User namespaces are the prime example, possibly because they are the largest/juiciest new target, but I expect we'll see more with time.

> Docker is basically the equivalent of a sign which says: "don't walk on the grass" as opposed to an actual wall which FreeBSD jails and Solaris zones have.

I think this is dramatically overstating the risks. It is possible to run containers securely, it is just much more difficult to secure containers on linux than on BSD or Solaris. It is significantly difficult to break out of a properly configured container (using user namespaces, seccomp, and selinux/apparmor), and I know of no cases where it has been done successfully.

I still separate tenants onto VMs because I don't want to be the first example of a breakout, but I don't think people who isolate with containers are crazy, just a little less risk-averse.

> I think this is dramatically overstating the risks. It is possible to run containers securely, it is just much more difficult to secure containers on linux than on BSD or Solaris.

The more difficult it is for sysadmins to harden their Docker setup, the less credence can be given to the claim that Docker is designed for security isolation.

In the real world, security compromises happen far more often because of misconfigured setups, not because of zero-day exploits. This is why it's important for software in general to be shipped with secure defaults.

I can't remember the source, so I'll paraphrase, but a rather wise guy once said: "I don't care how hard it is to break a hardened setup - what I'm concerned about is how easy it is to harden."

Out of the box, Docker is designed to solve deployment problems, not security problems. And that's a crucial distinction.

> Linux containers are not meant as security boundaries (...)

> You can't run multiple Docker tenants safely on the same hardware. Docker is basically the equivalent of a sign which says: "don't walk on the grass" as opposed to an actual wall (...)

I agree, but it's a strange way to frame it - the technology is lxc - and lxc (along with some enabling technologies and tools) does have, and has a history of, a focus on being a real boundary.

docker has never (does it now?) claimed or implemented security boundaries, or been about them.

Stretching the metaphor, docker is using tar to package dependencies in a snapshot, and using plain chroot to run "containers" - even when jails are available.

In terms of marketing, "Linux containers" might mean "docker" - but in technology (as contrasted with zones, jails) that's not quite right.

So yes, "docker is not about security", lxc&friends maybe not quite equivalent to modern jails - but new ways of running "docker containers", like in actual vms - certainly can blend some convenience/popularity and security.

I am confused. I thought Docker was no longer based on LXC. Am I wrong on this?

The whole story is somewhat long and complicated - I think it was roughly openvz (patch-set, out-of-tree for Linux "jails") > lxc > [docker enters the picture] > as more features for isolation are merged in mainline (namespaces ++) it means docker is no longer based on the same subset of features that the lxc project uses.

I'm sure someone will chime in with an updated family tree of Linux chroots, capability frameworks and process isolation features.

[ed: I believe one source of confusion is that docker started with (userspace part of) lxc as its only driver, and now docker-the-binary makes system calls directly, and avoids lxc-the-userspace-toolset - but they employ a mish-mash of kernel features, many of which came from the lxc project (on the kernel side)?]

Linux does offer enough primitives to completely sandbox processes, as Zones and Jails did/do. The problem is that docker didn't implement all of those primitives right away, the biggest being user namespacing, which it didn't add until shockingly late in its development lifecycle.

So, yes you can completely sandbox a "container", just be prepared to put in some work (SELinux, AppArmor, user namespaces, and reviewing the default capabilities quite carefully).

Given that's the case, I'm sure you can go capture the flag at https://contained.af/ .. no one has yet (docker containers, heavy seccomp filtering).

Or maybe you can break out of the Google App Engine linux container and let me know how it looks (linux containers, quite well-worn at this point)?

Or perhaps you can check for me whether AWS Lambda actually colocates tenants or not (unknown linux containers)?

Or you can launch a heroku dyno and break into another dyno and steal some keys (lxc linux containers)?

In reality, many services do colocate multiple different users together in linux containers. If you use seccomp and ensure the user is unprivileged in the container, it's fairly safe. Heroku has been doing it for years upon years now. The other services I named above likely do.

Linux containers absolutely are intended as security boundaries. Kernel bugs which allow escaping a properly setup mount namespace or peeking out of a pid namespace or going from root in a userns to root on the host are all treated as vulnerabilities and patched. That clearly expresses the intent.

Yes, I agree that in reality they're likely not yet as mature / secure as jails or zones, but I think it's disingenuous to say that Linux containers aren't meant to be security boundaries.

> Kernel bugs which allow escaping a properly setup mount namespace or peeking out of a pid namespace or going from root in a userns to root on the host are all treated as vulnerabilities and patched. That clearly expresses the intent.

This is because these vulns can be exploited locally without containers.

I'd be very interested to hear details of exactly how you would suggest, if I understand you correctly, that any Linux container can be broken out of.

With user namespacing in use (available in Docker since 1.12) I'm not currently aware of any trivial container-->host breakouts or container --> container breakouts.

There are information leaks from /proc but they don't generally allow for breakout, and in general Docker's defaults aren't too bad from a security standpoint.

The only exception for the general case is, I'd say, the decision to allow CAP_NET_RAW by default, which is a bit risky.
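If you want to check from inside a container whether CAP_NET_RAW actually made it into the effective set, one way is to decode the `CapEff` line in `/proc/self/status`. A minimal sketch follows; the helper names are made up, and the bit number 13 for CAP_NET_RAW is taken from `<linux/capability.h>`.

```python
CAP_NET_RAW = 13  # bit number from <linux/capability.h>


def parse_cap_eff(status_text):
    """Extract the effective capability mask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_cap(mask, cap):
    """Return True if capability number `cap` is set in the bitmask."""
    return bool((mask >> cap) & 1)


if __name__ == "__main__":
    # Linux only: read this process's own status file.
    with open("/proc/self/status") as f:
        mask = parse_cap_eff(f.read())
    state = "present" if has_cap(mask, CAP_NET_RAW) else "absent"
    print(f"CAP_NET_RAW is {state} (CapEff={mask:#018x})")
```

Running it in a container started with and without `--cap-drop NET_RAW` should flip the result.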

Right, so each of those is a specific vuln, like all code gets, and not a systemic "Linux containers aren't a security boundary" as suggested by the OP, which was the point I was asking for more info on.

All code has bugs, some of those are security bugs. There's a big difference between "if you haven't patched or I have a 0-day I can compromise you" and "no matter how well patched you are this isn't a security boundary so it can be bypassed".

My reading of the top comment was it was suggesting the latter with regard to Linux containers, and I'm not sure that's true.

Yes, but certain things tend to wind up with more security bugs than others, due to attack surface and the like. For instance, you tend to see browser exploits, SSL exploits, and privilege escalation attacks a lot more often than OS-level exploits, or hypervisor vulnerabilities.

It's not that Docker is a gaping security hole, it's just not something I trust as much as the Linux Kernel or Xen. I probably trust it about as much as I trust a well-updated web browser. It's suitable for everyday use, but I don't click the link on the phishing or spam email just to see what happens.

oh sure, I'm not saying Docker provides perfect security.

The point I didn't agree on was the top comment which basically, to me, seemed to be saying "Docker is not a security boundary" because that's not (in my experience) true.

There are a load of companies running multi-tenant systems using Linux containers, so if they're not a security boundary, a lot of people are going to be having a bad time :)

It's just sad to watch this. The fact that the whole root zone etc. is just built right into every part of the OS is amazing. And yet you have a bunch of companies just stroking their egos. I looked at the illumos-gate and SmartOS contributors as Brendan suggested and there isn't much.

I wonder if adding a proper wifi stack and commodity hardware support would have helped. Maybe it's just wishful thinking but I thought it would have been nice for cheap routers and home NAS.

The fact that there is so little documentation also probably didn't help it.

OpenShift gives you a free container with no more ID than an email address. So clearly it can be made secure.

Though I admit this would not be core docker stuff; more like selinux and other controls.

... Precisely none of these are technologies that you should use to "safely" run multiple things on the same hardware.

Linux containers are very similar to Zones.

I doubt you would choose between the two technologies based on security.

The fact this is the top-rated comment really does say a lot about the technical literacy of the newsy audience.

Multiple explanations, not all of which need to be the case:

* Not everyone needs to be or is an expert on containers, just as not everyone is knowledgeable about the TCP stack, dynamic routing, assembly optimization, or name your topic.

* It's a true and well-stated comment in itself and deserves to be recognized, even if many already know it.

> It's a bit early for me to say which is better nowadays on Linux, ZFS or btrfs, but my company is certainly learning the answer by running the same production workload on both. I suspect we'll share findings in a later blog post.

I am eager to read this piece!

Even though I am afraid to see it confirm that btrfs still struggles to catch up…

The 2016 bcachefs benchmarks[0] are a mixed bag.

[0]: https://evilpiepirate.org/~kent/benchmark-full-results-2016-...

Is BTRFS really an option now that Red Hat has decided to pull the plug on their BTRFS development? That basically leaves Oracle and SUSE, I think? As far as I can tell the future of BTRFS doesn't look good.

Facebook using it doesn't mean anything since they are probably using it for distributed applications. Meaning the entire box (including BTRFS) can just die and the cluster won't be impacted. I really can't imagine they are using BTRFS on every node in their cluster.

Facebook is using it for lots of shit. Chroots for containers because we can easily snapshot them and use them. Use it for build testing, so snapshot base repo, checkout commit, build, throw away. Gluster which takes huge backups and then other random workloads, it's used in a few places.
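The "snapshot base repo, checkout commit, build, throw away" loop can be sketched with the stock `btrfs subvolume` commands. This is a guess at the shape of such a workflow, not Facebook's tooling: the function names are invented, and `git` and `make` stand in for whatever VCS and build system is really used. It assumes `base` is a btrfs subvolume and that the caller has the privileges to create and delete subvolumes.

```python
import subprocess


def plan_scratch_build(base, scratch, commit):
    """Return the commands for one snapshot / checkout / build / discard cycle."""
    # A writable snapshot is cheap because btrfs shares blocks copy-on-write.
    setup = ["btrfs", "subvolume", "snapshot", base, scratch]
    work = [
        ["git", "-C", scratch, "checkout", commit],  # assumed VCS
        ["make", "-C", scratch],                     # assumed build system
    ]
    cleanup = ["btrfs", "subvolume", "delete", scratch]
    return setup, work, cleanup


def run_scratch_build(base, scratch, commit):
    """Run the plan, discarding the snapshot whether or not the build succeeds."""
    setup, work, cleanup = plan_scratch_build(base, scratch, commit)
    subprocess.run(setup, check=True)
    try:
        for cmd in work:
            subprocess.run(cmd, check=True)
    finally:
        subprocess.run(cleanup, check=True)
```

Keeping the plan separate from the runner makes it easy to print or dry-run the commands before pointing them at a real filesystem.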

Just because Facebook is fault tolerant doesn't mean we don't care about failures. We actively run down any issues we hit, so while it doesn't have much of an impact in the short term, we don't just ignore issues.

And also Red Hat hasn't contributed much to btrfs in years, and Oracle has one developer working on it. We are constantly working on it, and now that my priorities have shifted back to btrfs I hope that we will start to close out some of these long term projects.

Just because Facebook is working on btrfs in their OS, doesn't mean they will publish their code and patches to the public.

Just because some code is added to the main kernel, assuming it is, doesn't mean it will be propagated to the main LTS distributions anytime soon.

Lol what? We don't ship code that isn't upstream. It sucks to maintain and rebase if you have a shitton of proprietary patches. Everything we do with the kernel is open source, you wouldn't be able to hire kernel developers if you didn't do open source. Nice comment tho, I legitimately lol'ed.

> Facebook using it doesn't mean anything since they are probably using it for distributed applications. Meaning the entire box (including BTRFS) can just die and the cluster won't be impacted.

I don't think that follows for lots of reasons:

- If enough of your boxes die that you lose quorum (whether from filesystem instability or from unrelated causes like hardware glitches), your cluster is impacted. So, at the least, if you expect your boxes to die at an abnormally high rate, you have to have an abnormally high number of them to maintain service.

- Filesystem instability is (I think) much less random than hardware glitches. If a workload causes your filesystem to crash on one machine, recovering and retrying it on the next machine will probably also make it crash. So you may not even be able to save your service by throwing more nodes at the problem. A bad filesystem will probably actually break your service.

- Crashes cause a performance impact, because you have to replay the request and you have fewer machines in the cluster until your crashed node reboots. It would take an extraordinarily fast filesystem to be a net performance win if it's even somewhat crashy.

- Most importantly, distributed systems generally only help you if you get clean crashes, as in power failure, network disconnects, etc. If you have silent data corruption, or some amount of data corruption leading up to a crash later, or a filesystem that can't fsck properly, your average distributed system is going to deal very poorly. See Ganesan et al., "Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions", https://www.usenix.org/system/files/conference/fast17/fast17...

So it's very doubtful that Facebook has decided that it's okay that btrfs is crashy because they're running it in distributed systems only.
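
To put rough numbers on the quorum point above, here is a minimal sketch assuming a simple majority-quorum cluster (an assumption for illustration; nothing in this thread says how Facebook's services actually do it). The function names are made up for the example.

```python
def majority_quorum(n: int) -> int:
    """Smallest number of nodes that forms a strict majority of n."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """How many nodes may be down while a majority is still reachable."""
    return n - majority_quorum(n)


def nodes_needed(failures: int) -> int:
    """Smallest cluster size that survives `failures` simultaneous losses."""
    return 2 * failures + 1


if __name__ == "__main__":
    # Each extra failure you want to survive costs two more nodes.
    for n in (3, 4, 5, 7):
        print(n, majority_quorum(n), tolerated_failures(n))
```

Under that assumption, surviving one more simultaneous crash costs two more machines, which is why a crashier filesystem translates directly into a bigger cluster for the same availability.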

This article https://www.linux.com/news/learn/intro-to-linux/how-facebook... explains somewhat what Facebook does with BTRFS.

"Mason: The easiest way to describe the infrastructure at Facebook is that it's pretty much all Linux. The places we're targeting for Btrfs are really management tasks around distributing the operating system, distributing updates quickly using the snapshotting features of Btrfs, using the checksumming features of Btrfs and so on.

We also have a number of machines running Gluster, using both XFS and Btrfs. The target there is primary data storage. One of the reasons why they like Btrfs for the Gluster use case is because the data CRCs (cyclic redundancy checks) and the metadata CRCs give us the ability to detect problems in the hardware such as silent data corruption in the hardware. We have actually found a few major hardware bugs with Btrfs so it’s been very beneficial to Btrfs."

The sentence: "We also have a number of machines running Gluster, using both XFS and Btrfs." seems to imply Facebook is not using it heavily for actual data storage. What I distill from this (which is obviously my personal interpretation) is that Facebook mostly uses it for the OS and not for actual precious data.

I'm reading that as quite the opposite: they're saying that Gluster, a networked file storage system, is being backed with btrfs as the local filesystem, so all data stored in Gluster is ultimately stored on btrfs volumes. (They're also using it for OS snapshotting, yes, but insofar as the data stored in Gluster is important, they're storing important data on btrfs.)

See also https://code.facebook.com/posts/938078729581886/improving-th...

"We have been working toward deploying Btrfs slowly throughout the fleet, and we have been using large gluster storage clusters to help stabilize Btrfs. The gluster workloads are extremely demanding, and this half we gained a lot more confidence running Btrfs in production. More than 50 changes went into the stabilization effort, and Btrfs was able to protect production data from hardware bugs other filesystems would have missed."

Given how many times btrfs has failed to read data or to mount (with an error), I would imagine this is why btrfs is used by Facebook: because it isn't afraid to 'just let it crash' (cleanly), to use Erlang rhetoric.

Yeah, it's definitely true that you want a filesystem with data and metadata checksums if you want high reliability. (I think btrfs and ZFS are the only Linux-or-other-UNIX filesystems with data checksums?)

But I think the inference to make is that Facebook trusts btrfs to increase reliability, not that Facebook trusts their distributed systems to cover for btrfs decreasing reliability to gain performance (or features).
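
For anyone wondering what "data checksums" buy you in practice, here is a toy sketch of the idea: store a CRC next to each block on write and verify it on read. It uses Python's `zlib.crc32` as a stand-in (btrfs defaults to crc32c, a different polynomial), and `write_block`, `read_block` and `flip_bit` are hypothetical helpers, not any filesystem's API.

```python
import zlib


def write_block(data: bytes):
    """Return the block together with the checksum stored alongside it."""
    return data, zlib.crc32(data)


def read_block(data: bytes, stored_crc: int) -> bytes:
    """Verify the block against its stored checksum before returning it."""
    if zlib.crc32(data) != stored_crc:
        raise IOError("checksum mismatch: block is corrupt")
    return data


def flip_bit(data: bytes, bit: int) -> bytes:
    """Simulate silent corruption by flipping a single bit."""
    buf = bytearray(data)
    buf[bit // 8] ^= 1 << (bit % 8)
    return bytes(buf)


if __name__ == "__main__":
    block, crc = write_block(b"precious production data")
    read_block(block, crc)                   # clean read passes
    try:
        read_block(flip_bit(block, 5), crc)  # corrupted read is caught
    except IOError as err:
        print(err)
```

Without the stored checksum, the flipped bit would be handed back to the application as if it were good data, which is the "silent" part of silent data corruption.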

Red Hat was never a big contributor to BTRFS. This still means fewer users, but not fewer devs.

For a filesystem I'd worry about what that implies for testing, especially for enterprise hardware and workloads. RHEL has significantly more users and it seems likely that their users would have more diversity than just Oracle shops.

Are you completely forgetting SUSE? They’re a thing, too, after all.

Not forgetting them or Oracle but RHEL has something like twice the marketshare of the two combined. I'm not sure I've ever seen a support matrix where either of them is listed but RHEL is not, whereas the reverse is not uncommon.

Also eager to read this piece

> Linux has also been developing its own ZFS-like filesystem, btrfs. Since it's been developed in the open (unlike early ZFS), people tried earlier ("IS EXPERIMENTAL") versions that had serious issues, which gave it something of a bad reputation. It's much better nowadays, and has been integrated in the Linux kernel tree (fs/btrfs), where it is maintained and improved along with the kernel code. Since ZFS is an add-on developed out-of-tree, it will always be harder to get the same level of attention.


So long as there exists code in BTRFS marked "Unstable" (RAID56), I refuse to treat BTRFS as production ready. If it's not ready, fix it or remove it. I consistently run into issues even when using BTRFS in the "mostly OK" RAID1 mode.

I don't buy the implication that "it will always be harder to get the same level of attention" will lead to BTRFS being better maintained either. ZFS has most of the same features plus a few extra and unlike BTRFS, they're actually stable and don't break.

I'm no ZFS fanboy (my hopes are pinned solidly on bcachefs) but BTRFS just doesn't seem ready for any real use from my experience with it so far and it confuses me. Are BTRFS proponents living in a different reality to me where it doesn't constantly break?

EDIT: I realize on writing this that I might sound more critical of the actual article than I really am. I think his points are mostly fair but I feel this particular line paints BTRFS to have a brighter, more production-ready future than I believe is likely given my experiences with it. BTRFS proponents also rarely point out the issues I have with it so I worry they're not aware of them.

We're using both btrfs and zfsonlinux right now, in production, and fortunately we're not consistently running into issues (I'd be hearing about it if we were!).

I should note that we do have a higher risk tolerance than many other companies, due to the way the cloud is architected to be fault tolerant. Chaos monkey can just kill instances anytime, and it's designed to handle that.

Anyway, getting into specific differences is something that we should blog about at some point (the Titus team).

Do you use replication at all? I have a feeling the reason why nobody else sees my problems is that most folks will be using a system like Ceph for distributed replication rather than BTRFS for local replication.

Have you not seen issues with kernel upgrades? I had kernel upgrades cause btrfs to fail to mount as recently as 2016 (at which point I switched to zfs).

> my hopes are pinned solidly on bcachefs

Bcachefs is developed by a single person (however brilliant he may be) on a part time basis (I see only 4 commits since May this year).

Unless I'm missing something (please correct me) bcachefs will not be replacing BTRFS, or any other filesystem, anytime soon.

> Bcachefs is developed by a single person (however brilliant he may be) on a part time basis (I see only 4 commits since May this year).

I'm not sure how true this is of filesystems, but a lot of the best software I know is (or at least, was) developed by a single person. For example, Redis, Sqlite, Lmdb, most malloc implementations, varnishcache, h2o, (nginx?), chipmunk-2d, most of dragonflybsd (including HAMMER).

Many projects get more contributors after the code is working well enough to see popular production use. But programmers working alone can accomplish some pretty impressive stuff.

lmdb/openldap is developed by a commercial company.

The other products you mentioned BEGAN with a single person. They had more resources later.

These are good points and mostly why I said "hopes" rather than something stronger like "bets" or "production machines".

> ZFS has most of the same features plus a few extra

The ZFS feature set is not a strict superset of what btrfs offers. The ability to online-restripe between almost any layout combination is quite useful for example. So is on-demand deduplication, which is also far less resource-intensive than ZFS dedup.

> The ability to online-restripe between almost any layout combination is quite useful for example

This is true but since the only stable replication options on BTRFS are RAID1 and single, this online restripe is of very limited usefulness.

I'd _really_ like BTRFS to fix its issues so I could use this reshaping (it's the main feature I'm missing from current filesystems) but it's been years and replication is still unstable.

I work with both ZFS and btrfs. btrfs is my fs for almost everything, but the Solaris boxes use ZFS. That said, btrfs is far easier to work with. ZFS is a dog when file space gets low, and cleanup is troublesome. I've never had any btrfs problems, other than that ext4 would be faster.

This is a great set of information comparing features and tools between the two ecosystems. I like it, and wish more were available for Linux -> BSD and even lower level command tool comparisons like apt-get vs. yum/dnf.

In fact, this works as a general-purpose intro to several important OS concepts from an ops and kernel hacker perspective.

My only surprise is that this is written as a specific response to Oracle Solaris' demise. From that specific perspective, how many target viewers are there? 10? Illumos isn't losing contributors, and there are still several active Illumos distros. Nevertheless, interesting.

Yes, I hope to write one for Solaris -> BSD.

This post was written for the illumos community as well.

Please do. I often advocate for BSD when it fits the need but I get pushback due to its lack of popularity. If all Solaris users migrated solely to Linux my position becomes weaker. :( and BSD is very good at quite a number of things.

And Solaris -> SmartOS / illumos?

You might be interested in the long-lived Unix Rosetta Stone:


Also the 2017 version - covers current-gen OSs including BSD, SmartOS, newer Linux distros and Windows powershell:


It is actually sad to see that you had to write this.

Funny how the most used / popular technology and a mismanagement from a single company can crush other competing tech.

It is frightening how much of what was invested in Solaris is now lost because of it.

SmartOS might be easier for the Solaris familiar, looking to deploy Linux containers rather than go fully Linux.

Does SmartOS actually support Linux containers (aside from the obvious approach of running containers in a Linux VM)? Last I checked, SmartOS just used the word "container" to refer to Solaris Zones.

SmartOS has code specifically in place for using 'lx' branded zones as wrappers for docker containers (which is distinct from its support of kvm)



It's been a while since I've looked into it, but if memory serves they are using docker filesystem snapshots without modification and running them on a thin translation layer of Linux system calls to Solaris system calls. Hard to find anything backing this up, so I could be way off the mark as to how it's implemented.

EDIT: forgot that's what 'lx' zones are: zones which allow the execution of Linux binaries

you checked a long time ago :)

If you can't access it directly, here's a cached version: https://web.archive.org/web/20170905181357/http://www.brenda...

> If you absolutely can't stand systemd or SMF, there is BSD, which doesn't use them. You should probably talk to someone who knows systemd very well first, because they can explain in detail why you should like it.

I can't imagine trying to sell anything with the phrase "why you should like it". SMF certainly doesn't need that kind of condescending pitch--it just fucking works and doesn't get in your way.

It'd be a waste of my time, but I could document the times I've seen Solaris SMF break a system so bad it was almost undebuggable, even with the help of many Sun staff. If you never had crippling registry issues, you were lucky. Other than those unlucky moments, it worked great!

It helps that SMF doesn't include reimplementations of DNS resolvers and NTP clients, both of which have caused security flaws in systemd. :/

You can certainly use just systemd's pid1 together with ntpd (systemd only does SNTP, not full-blown NTP) and no caching resolver.

In fact it's the default in most Linux distributions. The only mandatory pieces if you use systemd as pid1 are udevd and journald.

Misinformation like this is exactly why Brendan said you should ask an actual user of systemd.

I actually like systemd as an init system, and use it that way on all my current machines. (I'm a fan of journald so far too.)

However, the mere fact that (s)NTP and DNS are re-implemented in the same codebase is still unsettling to me.

Nor is it misinformation to mention their existence, or the bugs/limitations caused by the duplication of effort.



(Looks like the sNTP client hasn't caused security flaws, though. I was wrong there.)

They are independent, you can use one without the other. They are just residing in the same repository and build system.

At least Ubuntu ships with both resolved and timesyncd enabled by default.

> "Xen is a type 1 hypervisor that runs on bare metal, and KVM is type 2 that runs as processes in a host OS."

Is it not the other way around, that KVM runs on bare metal (and needs processor support) while Xen runs as processes (and needs special kernel binaries)?

No, it's not.

It's true that KVM needs processor support: it kind of adds a special process type that the kernel runs in a virtualized environment, through the hardware virtualization features. The Linux kernel of the host schedules the execution of the VMs.

Xen has a small hypervisor running on bare metal. It can run both unmodified guests using hardware support or modified guests where hardware access is replaced with direct calls into the hypervisor (paravirtualization). The small hypervisor schedules the execution of the VMs. For access to devices it cooperates with a special virtual machine (dom0), which has full access to the hardware, runs the drivers and multiplexes access for the other VMs - the hypervisor is really primarily scheduling and passing data between domains, very micro-kernel like. Dom0 needs kernel features to fulfill that role.

Today I learned something. Thanks!

I updated this because, as someone pointed out, type 2 is no longer a good description of KVM. It uses kernel modules that access devices directly, so it's not strictly type 2 where it all runs as a process. Maybe I should just stop using the "types".

The type definitions have basically always been misused anyway. It's a shoehorning of formal proofs from the 70s onto modern concepts, and it doesn't really work.

Though, really, I'm just quoting Anthony Liguori from 6 years ago, so credit where it's due:


No. See the Classification section.


Solaris support ends in November, 2034. Yeah, 17 years from now. No need to hurry ;-)

Given that they fired everyone, who do you think is going to give that support and fix bugs?

They didn't quite fire everyone: Alan Coopersmith (https://twitter.com/alanc/status/904366563976896512) is still present. I'm sure lots of support and bug fixes will be neglected and probably moved offshore though.

Also, it certainly seems like Oracle Solaris 11.3 (released in fall 2015) will be the last publicly available version. Between-release updates (SRUs) have always been for paying customers only, but now it seems like there will never be another release.

My comment was a joke, obviously. However, it's not that difficult to provide that kind of support and bugfixing. It's not like developing something new.

Hm, when was the next Unix timestamp range overflow again?

I think it was 2038... so Solaris won't need a patch :-D
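
For the curious, a quick sketch of where a signed 32-bit `time_t` runs out. This assumes the classic 32-bit seconds-since-1970 counter; the November 2034 cutoff used in the comparison is just the date quoted upthread, and 64-bit systems with a 64-bit `time_t` are not affected.

```python
from datetime import datetime, timedelta, timezone

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit time_t can hold
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def last_int32_time() -> datetime:
    """Last instant representable by a signed 32-bit seconds counter."""
    return EPOCH + timedelta(seconds=INT32_MAX)


if __name__ == "__main__":
    rollover = last_int32_time()
    end_of_support = datetime(2034, 11, 30, tzinfo=timezone.utc)
    print(rollover.isoformat())       # 2038-01-19T03:14:07+00:00
    print(rollover > end_of_support)  # True: rollover comes after Nov 2034
```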

That may well have been the motivation for that 2034 date and the canning of most of the Solaris crew.

It will be interesting to see if the Osborne Effect [1] repeats itself with Solaris.

[1] https://en.wikipedia.org/wiki/Osborne_effect

People started moving away from Solaris/SPARC in droves as soon as Oracle acquired Sun.

An example: the "old" Sun gave me a loaded T1000 system to run SUNHELP.ORG on.

The "New" Sun wouldn't even give me Solaris patches / security updates (which used to be free) without a support contract.

I had to eventually move the site to being hosted on a Debian box because I couldn't afford the hundreds of dollars they wanted every year for patch access.

It really chapped my hide. I'd even been part of the external OpenSolaris release team.

The "old" Sun recognized and encouraged the hobbyist community - that if people played with older gear at home, they were more likely to recommend/spec it at work... Solaris and patches/updates, software suites (LDAP server etc) "free unless you want/need a support contract".

"New" Sun after the Oracle acquisition: "Unless you're a business paying us money for support, we don't care, and you get NOTHING."

So, people stopped using and playing with Solaris/SPARC at home, and eventually stopped using it at work too.

I think two things need to be mentioned:

1. ZFS is officially supported by Canonical on Ubuntu as part of their support plans.

2. Docker over raw containers or zones.

> Crash Dump Analysis...In an environment like ours (patched LTS kernels running in VMs), panics are rare.

As the order of magnitude of systems administered increases, rare changes to occasional changes to frequent. Especially when it is not running in a VM.
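
A back-of-the-envelope sketch of that scaling, assuming panics are independent across hosts and using a made-up per-host probability of 0.1% per year (both are assumptions for illustration, not measured figures):

```python
def expected_panics(hosts: int, p_per_host: float) -> float:
    """Expected number of panics across the fleet in the period."""
    return hosts * p_per_host


def prob_at_least_one(hosts: int, p_per_host: float) -> float:
    """Chance that at least one host panics, assuming independence."""
    return 1.0 - (1.0 - p_per_host) ** hosts


if __name__ == "__main__":
    p = 0.001  # hypothetical: each host has a 0.1% chance of a panic per year
    for hosts in (10, 1_000, 100_000):
        print(hosts, expected_panics(hosts, p), prob_at_least_one(hosts, p))
```

With those numbers, ten hosts will almost never show you a panic, a thousand hosts will show about one a year, and a hundred thousand will show one every few days.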

Also, from time to time you just get a really bad version of a distro kernel, or some off piece of hardware that is ubiquitous in your setup, and these crashes become more frequent and serious.

(Recent example of a distro kernel bug - https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1674838 . I foolishly upgraded to Ubuntu 17.04 on its release instead of letting it get banged around for a few weeks. For the next five weeks it crashed my desktop about once a day, until a fix was rolled out in Ubuntu proposed)

Most companies I've worked at want to have some official support channels, so usually we'd be running RHEL, and if I was seeing the same crash more than once I'd probably send the crash to Red Hat, and if the crash pointed to the system, then the server maker (HP, Dell...) or hardware driver maker (QLogic, Avago/Broadcom).

Solaris crash dumps worked really well though - they worked smoothly for years before kdump was merged into the Linux kernel. It is one of those cases where you benefited from the hardware and software both being made by the same company.

Crash dumps don't matter as much if your distributed architecture has to account for hardware failures. (Or VM failures, or network hiccups, etc.)

Kernel developers still have to use crash dumps to root-cause an individual crash, but crash dumps are most useful for extremely hard-to-reproduce crashes that are rare (but if you are using the "Pet" model as opposed to the "Cattle" model, even a single failure of a critical DB instance can't be tolerated). For crashes that are easy to trigger, crash dumps are useful, but they are much less critical to figuring out what's going on. If your distributed architecture can tolerate rare crashes, then you might not even consider it worth the support contract cost to root-cause and fix every last kernel crash.

Yes, it's ugly. But if you are administrating a very large number of systems, this can be a very useful way of looking at the world.

> As the order of magnitude of systems administered increases, rare changes to occasional changes to frequent. Especially when it is not running in a VM.

I think the perf engineer for Netflix is quite aware of this.

While I really respect Brendan's opinion (I've got most of his books and he is one of my IT heroes), I do think he is very Netflix-IT-scale minded. When you're Netflix you can maintain your own kernel with ZFS, DTrace, etc. and have a good QA setup for your own kernel / userland. Basically maintain your own distro. However, when you're in a more "enterprisey" environment you don't have the luxury of making Ubuntu with ZoL stable yourself. I know from firsthand experience that ZoL is definitely not as stable as FreeBSD ZFS or Solaris ZFS.

And there it is - the elephant in the room no one mentioned. People in 99% of the IT shops get an existential crisis if you mention during the interview that you want to do kernel engineering. Thank you!

See http://www.omniosce.org/ for most recent community release

Article from 4 months back,

> OmniTI will be suspending active development of OmniOS


Since OmniTI pulled back from the project - there has now been a reboot by the community. Regular releases started in early July with regular updates following with support for USB, ISO and PXE versions. The most recent release was August-28-2017. See http://www.omniosce.org/ for more info.

Image packaging system, no preinstall, postinstall, preremove or postremove scripting, and a mishmash of Python and C, with that horrible automated installer thrown in for good measure, are you kidding? No way. SmartOS or bust.
