
I worked on Solaris for over a decade, and for a while it was usually a better choice than Linux, especially due to price/performance (which includes how many instances it takes to run a given workload). It was worth fighting for, and I fought hard. But Linux has now become technically better in just about every way. Out-of-box performance, tuned performance, observability tools, reliability (on patched LTS), scheduling, networking (including TCP feature support), driver support, application support, processor support, debuggers, syscall features, etc. Last I checked, ZFS worked better on Solaris than Linux, but it's an area where Linux has been catching up. I have little hope that Solaris will ever catch up to Linux, and I have even less hope for illumos: Linux now has around 1,000 monthly contributors, whereas illumos has about 15.

In addition to technology advantages, Linux has a community and workforce that's orders of magnitude larger, staff with invested skills (re-education is part of a TCO calculation), companies with invested infrastructure (rewriting automation scripts is also part of TCO), and also much better future employment prospects (a factor that can influence people wanting to work at your company on that OS). Even with my considerable and well-known Solaris expertise, the employment prospects with Solaris are bleak and getting worse every year. With my Linux skills, I can work at awesome companies like Netflix (which I highly recommend), Facebook, Google, SpaceX, etc.

Large technology-focused companies, like Netflix, Facebook, and Google, have the expertise and appetite to make a technology-based OS decision. We have dedicated teams for the OS and kernel with deep expertise. On Netflix's OS team, there are three staff who previously worked at Sun Microsystems and have more Solaris expertise than they do Linux expertise, and I believe you'll find similar people at Facebook and Google as well. And we are choosing Linux.

The choice of an OS includes many factors. If an OS came along that was better, we'd start with a thorough internal investigation, involving microbenchmarks (including an automated suite I wrote), macrobenchmarks (depending on the expected gains), and production testing using canaries. We'd be able to come up with a rough estimate of the cost savings based on price/performance. Most microservices we have run hot in user-level applications (think 99% user time), not the kernel, so it's difficult to find large gains from the OS or kernel. Gains are more likely to come from off-CPU activities, like task scheduling and TCP congestion, and indirect, like NUMA memory placement: all areas where Linux is leading. It would be very difficult to find a large gain by changing the kernel from Linux to something else. Just based on CPU cycles, the target that should have the most attention is Java, not the OS. But let's say that somehow we did find an OS with a significant enough gain: we'd then look at the cost to switch, including retraining staff, rewriting automation software, and how quickly we could find help to resolve issues as they came up. Linux is so widely used that there's a good chance someone else has found an issue, had it fixed in a certain version or documented a workaround.
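To put a number on the "99% user time" point: a minimal sketch using Amdahl's law, assuming the only thing a new OS improves is the on-CPU kernel share of a workload. The function name and the example figures are hypothetical, and off-CPU effects such as scheduling latency or TCP congestion are deliberately not modeled here.

```python
def max_speedup(kernel_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: overall speedup when only the kernel share gets faster.

    kernel_fraction: share of on-CPU time spent in the kernel (0..1).
    kernel_speedup:  hypothetical factor by which the new kernel is faster.
    """
    if not 0.0 <= kernel_fraction <= 1.0:
        raise ValueError("kernel_fraction must be between 0 and 1")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    user_fraction = 1.0 - kernel_fraction
    return 1.0 / (user_fraction + kernel_fraction / kernel_speedup)


if __name__ == "__main__":
    # Assumed workload: 99% user time, 1% kernel time.
    for factor in (2.0, 10.0, 1e9):
        gain = (max_speedup(0.01, factor) - 1.0) * 100.0
        print(f"kernel {factor:g}x faster -> at most {gain:.2f}% overall")
```

Under these assumptions, even an arbitrarily fast kernel yields only about a 1% on-CPU gain, which is why the larger wins would have to come from off-CPU and indirect effects.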

What's left where Solaris/SmartOS/illumos is better? 1. There's more marketing of the features and people. Linux develops great technologies and has some highly skilled kernel engineers, but I haven't seen any serious effort to market these. Why does Linux need to? And 2. Enterprise support: large enterprise companies where technology is not their focus (e.g., a breakfast cereal company) may want to outsource these decisions to companies like Oracle and IBM. Oracle still has Solaris enterprise support that I believe is very competitive compared to Linux offerings.

So you've chosen to deploy on Solaris or SmartOS? I don't know why you would, but this is also why I wouldn't rush to criticize your choice: I don't know the process whereby you arrived at that decision, and for all I know it may be the best business decision for your set of requirements.

I'd suggest you give other tech companies the benefit of the doubt for times when you don't actually know why they have decided something. You never know, one day you might want to work at one.

It was Jeff Bonwick's team which proved that the number of engineers or even developers working on a given problem is completely irrelevant: ZFS was developed by a team of, what, five people? Meanwhile, how many people are working on BTRFS? It's nowhere near ZFS.

But let's chalk that up to an isolated, one-off statistical aberration. From what I understand, Adam and Bryan wrote DTrace almost single-handedly, with some help from Mike, and even with all the contributions, you can still count the people who made DTrace a working production tool on the fingers of one hand.

However, let's chalk that up to a one-off, statistical aberration as well. Meanwhile, how many people are working on how many tracing frameworks for Linux?

Next, we have zones, a complete, working, production-proven virtualization solution, augmented by KVM, lx, TRITON, Consul, et cetera. One coherent solution. Built upon technology on which I ran production Oracle databases way back in 2006, powering a very large institution which was making very large amounts of money. By the second. How many engineers did it take to design, architect, and code all that up?

Meanwhile, there are how many competing cloud virtualization solutions based on Linux? And remarkably, except for SmartOS, none are a complete, comprehensive solution: they all lack one thing or another. Not one of those Linux based solutions is paranoid about data integrity or correctness of operation. Those things are not even an afterthought of Linux.

Should I chalk that up to a one-off, statistical aberration, or would you say that there is a pattern here?

The Amiga Intuition library, the foundation on which the GUI is built into the system, was written single-handedly by just one person: RJ Mical. In a couple of days! For almost two decades, it was the reference on how to build a library of GUI primitives with almost unlimited flexibility.

Star Control 2, one of the greatest games in history, was developed by just two guys in the span of three years.

Dave Haynie almost single-handedly developed not one, but an entire series of Commodore computers: the C16, C116, C Plus/4 (Commodore 264). Those are the lessons not only of history, but of our contemporaries, people you used to work with: KVM was ported from the Linux kernel by what, three engineers, and from what I can tell, it runs faster on illumos than it does on Linux where it's developed! Why is that?

You and I apparently drew a completely different set of conclusions: when you wrote "Linux now has around 1,000 monthly contributors, whereas illumos has about 15" you seem to equate the number of people working on a product with that product's capability and quality, whereas I drew the conclusion that the number of people is irrelevant, but what the individuals or individual can do makes all the difference in the world.

Where you are absolutely correct is that the job market for illumos-based operating systems is non-existent, at least in the country where I live, and slim elsewhere (I used to work in Silicon Valley and in other parts of the States). That's a fact. But I wouldn't rush to the conclusion that it's because illumos or SmartOS are worse products, because I see no evidence of that. Furthermore, at the end of the day, people still need to run a cloud on something which actually works, and Linux is not it. It doesn't work correctly, when it works at all. Not even after 20 years, billions of dollars and a worldwide army of people working on it. What is the alternative? SmartOS.

I read the Netflix tech blog from time to time. And over time, one thing became clear to me: Netflix can do the things it does because they have one single application to scale, but most of the world out there, in the trenches, has more than one application. You write of people with deep knowledge of the kernel and performance: I've been working in this industry for decades, and I've yet to meet anyone like that (they must all either be a secret society, or I'm just way too paranoid, but I do know a lot of IT professionals). So perhaps it's a living-in-an-enclave problem, or perhaps both you and I work in enclaves, only different ones? I'm the only person I know in IT who has done or has any interest in kernel, system engineering or performance; I must either be incredibly bad at picking companies to work for, or the people you mentioned are really few and far between, or a third possibility is that it's a fluke?

Let me tell you about my world: I work on and with Linux professionally. Where Netflix has only one major application (according to their tech blog) to worry about, I work at a place where we literally have several thousand applications, some bought, some developed in-house; for just about every problem, we have an average of five applications, all different, but basically doing the same thing; and some of our applications are so exotic, so complex, and so custom, that it is impossible to find anyone on the market with any experience in them. Thousands.

So while you might be picturing this in your head, imagine running Linux, and suddenly your database keels over: Linux didn't fail over to the other path, so multipathing doesn't work right. Then imagine having systems with data corruption, but Linux can't fix it, because ZFS isn't supported by Red Hat, which we run, so there goes that: another outage (we have regulators and governments to worry about, so the company is reluctant to start hacking their own custom kernel and a ZFS-based Linux). Next, Linux suddenly has an outage because the NFS mount is flapping. Why is it flapping? Because Linux's NFS implementation doesn't play well with NetApp. Now imagine stuff like this happening on a scale of 72,000 systems, spread across the planet. I never had such problems with Solaris. Not once.

But, since that's anecdotal evidence and experience, we have to discount that as well.

Then, I have hardware (from one of Oracle's competitors), very, very expensive, Intel-based 80-CPU Xeon monsters, with 0.5 TB of memory per system, where the serial console hangs at random: Red Hat points the finger at the hardware manufacturer, the hardware manufacturer points the finger at Red Hat. Result: the console is still hanging at random, with both companies telling us they have no clue what the problem is. That's Linux for you.

Serial console always worked just fine on illumos. After all, it's basic functionality.

Then there's the issue of Linux not shutting down properly: you'd think that after 20 years of development and, as you correctly noted, a worldwide army of developers and billions of dollars in investments, the shutdown procedure wouldn't try to write to an already unmounted filesystem; it's basic functionality, after all; but even that is too much to expect, apparently (I can dig out the Red Hat bug if you're interested).

That last one, we cannot chalk up to a fluke, and even worse, SGI's XFS was the only one which actually detected that write and panicked the kernel; ext3 was oblivious to this data corruption. It's mighty difficult for me to engineer highly reliable services on such a substrate... but let's not dwell on that too much right now. It's too depressing.

Then there is tracing: you know there are several frameworks at play. Then there is also the lack of proper DWARF 2 support (I researched the subject, and found out that the "solution" was to replace my runtime linker!) Can you imagine something like that being a solution on an illumos-based system? I think everybody would commit collective suicide or quit altogether like Keith Wesolowski did before casually suggesting such a thing, but let's not dwell on that either. (At this point, I think it fair to sue for pardon if I don't want my operating system made by people who think nothing of casually replacing the runtime linker only to get DWARF 2 debugging support. Do you agree?)

Then there's this issue of startup: while SMF has been humming along for more than a decade, Linux is still trying to figure out some sort of a complete working solution: currently that's systemd, and based on how it's architected, it looks like Windows and Linux are finally converging. Meanwhile, to make a startup which sort of resembles the working SMF, systemd has several different configuration states for its services... and no fault management architecture to speak or write of.

One thing's for sure: your experiences and mine are radically different. You shocked me to the core, but I also better understand your thinking and motives for leaving illumos behind, and it's the kind of appreciation I'm unable to put in words. You are also much more flexible: after having seen just how convoluted, complex, slow, and resource-wasting Java is, I would never go work at another company which used it (at the place where I work now, Java is the language and the platform). I'd just quit the industry like Keith did.

In spite of all of this, if you let me know how to reach you, I'll provide you with enough information on how to get in touch: I'd still love to have you over if you're in the country, and cook you dinner.

You've just discounted quite a lot of what I said as "no evidence", and have made some incorrect assumptions about both development at Sun and Netflix. Along with your other comments, at this point it's clear you are bashing on Linux, Netflix, and me personally, and you still haven't revealed your real name.

I'd like to know what your real name is. If you really cannot post it here, then feel free to contact me at bgregg@netflix.com.

I am bashing on Linux, absolutely; that massive bleeding wound is very raw and painful. I have no reason to bash on Netflix; I merely pointed out that, in my view, Netflix's problem domain is very narrow, and a luxury: most IT departments don't have only one (however massive) application to worry about.

As for you personally, I have nothing but the highest respect for you. You are one of the reasons why I still haven't quit this industry. In fact, I still cannot believe I've actually communicated with Brendan Gregg. To me personally, you're a living legend. If I believed in personal heroes, you'd be one of them.
