BSD doesn't have drivers for Infiniband and other HPC interconnects. Nor does it have client drivers (let alone a server implementation) for Lustre, which is the distributed filesystem used by most supercomputers.
I imagine MPI support on BSD is also likely non-existent.
Then there is the matter of accelerator support, i.e. NVIDIA GPUs and Intel Xeon Phi.
It's not to say that some vendor couldn't reasonably build a BSD-based supercomputer; it's just highly unlikely given how much is missing.
This page mentions that Mellanox has provided work on this. Also, storage vendors have used external Infiniband stacks on FreeBSD for many years (I got my first Isilon cluster something like 8 years ago, and it was using an IB backend with a forked FreeBSD 7 kernel, iirc), and these are stable and widely deployed.
I didn't realise Isilon used BSD (I was aware it was IB-based) or that IB drivers work well on BSD; that is cool.
That said, the primary platform is Linux and HPC is a very demanding workload. Unless I had a lot of time to invest in BSD kernel development I would stick with Linux.
That is putting aside Lustre too, which is usually a non-negotiable requirement for HPC.
This is the correct answer (especially Infiniband - and Aries on Crays)
Also NUMA is very important on supercomputers, and it works well on Linux.
The other thing worth noting is the much better support IBM has for Linux on PowerPC (2 in the top 10). I think Sunway (the most powerful in the world) is a Linux shop too.
Is it a chicken-and-egg matter? Vendors don't write drivers for BSD, and the BSDs lack users because they lack drivers. Honestly, I hope I can run OpenBSD and install whatever drivers I need for my plugged-in devices, on both my personal and production servers.
Vendors do write drivers for BSD; they just don't generally give them back to the project. Agree or disagree, they generally have a ton of time and money invested in their drivers and don't want to give them away to competitors.
It's very much chicken and egg. Cray used Linux because all the customers were using Linux; there was never a technical meeting discussing their relative merits. The Tera MTA project was actually BSD-based, because it came from an age when the BSD project had clear technical superiority (and they were probably worried about complying with the GPL).
As others have mentioned, there was a Mellanox stack in FreeBSD circa 2005 that I worked with. It was used at Isilon (BSD-based) in production.
There really isn't a technical discussion here at all: when an overwhelmingly large part of your userbase uses X, it would be pretty stupid to only support Y, and probably not defensible to support both X and Y.
OpenBSD's performance is atrocious. Scrolling in the browser is laggy and unresponsive on my ThinkPad X220, and closing tabs sometimes results in multiple-second freezes.
That's maybe good for a router but simply not HPC material.
Nobody wants to use OpenBSD for HPC. Everybody would want to use DragonFly BSD for HPC, if so many drivers weren't missing. DragonFly MPI outperforms Linux; Linux just has more hardware support and a bit better TCP/IP stack.
Okay, if that is the case: illumos, and therefore SmartOS, has long had stable Infiniband and MPI support, and coming from Solaris it is famous for its excellent scalability on very large numbers of processors, as well as a long tradition of HPC. Why isn't it used for HPC then?
Mismanagement by Sun (you might add Oracle as well, but I think at that point whatever they could have done was too little, too late), and Linux was/is better in many respects? It wasn't called "Slowlaris" for nothing?
And it's not like Linux is somehow famous for poor scalability, unless you're talking about the 1990s. Yes, back in the 1990s it was certainly much worse than Solaris. But for the 2.6 and subsequent releases, SGI and others put a lot of work into improving it. SGI at some point sold 4096-way (might even have been 4096 cores and 8192 hw threads?) single-image supercomputers running Linux, which AFAIK is bigger than anything Solaris has been deployed on.
That being said, most HPC systems consist of 1 or 2-socket nodes connected via a network, so the kernel scaling to such extreme systems isn't that relevant in the vast majority of deployments.
“Slowlaris” days were 15 years ago, with Solaris 8. Meanwhile, Solaris and illumos (and therefore SmartOS) are the only operating systems I know of which provide CPU bursting. If you put Linux and SmartOS on the same Intel CPU based hardware, SmartOS is likely to beat it in performance. What might have been true 15 years ago has long since (2005, with Solaris 10) ceased to be the case.
SPARC machines were not as good at number crunching as Power, so Sun wasn't as well-represented in the list as IBM, and Solaris wasn't as heavily used as AIX.
I'm specifically referring to running HPC on SmartOS, which runs only on Intel and AMD, with full support for Intel only. My question is why it isn't used for HPC now, since it provides CPU bursting, not why it wasn't used in the past. Fair disclosure: I grew up on SGI and HPC, I know what came before.
Because of one man: Donald Becker. At the beginning of the commodity supercomputer era, Donald did an absolutely amazing job squeezing out every last bit of performance from commodity networking hardware for 'Beowulf'-style clusters.
This gave Linux a head start and the self-reinforcing effects of such a head start did the rest, it made answering the question 'for which OS should we start writing drivers?' for specialty HPC hardware a no-brainer.
Scientific high-energy physicist here, with a regional HPC center on the same floor. My observation is that administrators tend toward enterprise distributions such as Scientific Linux and SUSE Linux Enterprise Server (SLES), together with commercial MPI implementations such as IBM MPI and Intel MPI.
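For anyone unfamiliar with what an "MPI implementation" actually runs, here is a minimal MPI program in C. This is a hypothetical sketch, not anything from the thread; the compiler wrapper and launcher names (mpicc, mpirun) vary by vendor and site and are assumptions here.

    /* Minimal MPI "hello world" -- the kind of program any MPI implementation
       (Intel MPI, IBM MPI, Open MPI, ...) compiles and launches. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id within the job */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

        printf("hello from rank %d of %d\n", rank, size);

        MPI_Finalize();
        return 0;
    }

Building and launching would look something like `mpicc hello.c -o hello && mpirun -np 4 ./hello`, with the vendor's wrappers typically pulled in via a module load.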
On the other hand, people are used to Linux; in my environment literally everybody has Ubuntu on their notebook and workstation. They know how to run their Python analysis scripts there, and the only thing they have to change when going to the cluster is the adoption of an environment management system (such as http://modules.sourceforge.net/).
(However, I have to admit I never got in touch with BSD and don't know the differences in user space)
One thing very much worth following as a replacement or supplement to modules is Singularity. It is relatively easy to create images that contain all of your required libraries, and you can run these same containers on both your cluster and on your development system.
This can substantially reduce the time to deploy new software and cut down on overhead related to managing multiple modules.
Singularity, unlike Docker, is designed to require only minimal privilege escalation, and as such it's an easy sell to HPC admins, who can (at least somewhat) get out of the business of helping users figure out what the heck is weird about their environment when trying to get something running on a cluster for the first time. You can also take these containers with you and be reasonably certain they'll work on another system.
Modules looks really interesting. Makes me wonder why Continuum is out there trying to reinvent the wheel with Anaconda. Glad to have something I can use at work to replace Conda environments. Now all it needs is Powershell/CMD support so I don't have to use it inside Cygwin...
Modules are really about environments (including software management). Anaconda doesn't handle this. For example, Conda ships its own version of HDF5 and points to its environment path. Let's say you want to be using a different version of HDF5. An easy way to do this is to use a module that loads it. You are creating an easy way for the user to set up their environment, where they really don't have to know anything about it.
It also helps with versioning. It is not uncommon to see various versions of gcc and the Intel compilers. In essence, the user should be able to load their environment with a few module loads.
A comment I found while looking at the Linux OSes that run on supercomputers:
"Originally, the top 500 list was populated entirely by proprietary Unix systems from vendors like Cray research, SGI, etc.
In June 1998, the first Linux system entered the top 500 list. By June 2003, Linux systems passed the 25% mark, accounting for 139 of the top 500. By November of 2003, Linux systems comprised over 56% of the top 500. By November 2006, Linux made up more than 75% of the top 500. You get the idea. Over the years, there were a few attempts by microsoft to get into supercomputing, and there were BSD and Mac systems."
Since time is sold on these supercomputers, they probably want them all to run the same or a similar OS so they can compete selling time on them. Also, if one person has success, everyone else will copy them.
Slashdot has a ton of comments discussing BSD vs Linux on this subject matter, but I didn't see anything too helpful.
My only thought is that large companies like Netflix use BSD more for CDNs because, from what I've been told, BSD has the best I/O handling. Why don't they use it for the rest of their infrastructure? Maybe Linux is better at crunching numbers and BSD is better for networking and security? No idea, that's my best guess.
Netflix uses BSD for OpenConnect because asynchronous disk I/O, which is critical for a CDN, remains a tire fire on Linux after more than 20 years.
On Linux you basically have to use blocking threads to emulate async disk I/O, which means tons of threads and overhead when you’re handling 10k-100k concurrent connections per box.
This is incorrect. Linux has had proper direct async disk I/O for a decade or more, used ubiquitously in database engines (among other things). It is not emulated with threads.
Last I looked (~4.4), Linux AIO implied DIO. Conflating AIO and DIO is the problem, not a feature. On FreeBSD, AIO works with the page cache for read and write, read-ahead works, sendfile works, and the I/O, cache, readahead and size hints all work. Linux has half of those, and DIO none of them. As I recall.
Bingo. Async disk I/O on Linux has to be unbuffered and block-aligned, making it useful only for databases that manage their own caching, and useless for file systems.
A video CDN needs lots of concurrent access to a file system.
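To illustrate the constraint being described, here is a hypothetical sketch (not anyone's production code) of Linux native AIO via libaio: the file has to be opened with O_DIRECT and the buffers, lengths and offsets must be block-aligned, which means bypassing the page cache entirely. The file name and sizes are placeholders.

    /* Sketch of Linux kernel AIO (io_setup/io_submit/io_getevents).
       Assumes libaio is installed; link with -laio. Drop O_DIRECT and the
       "async" submission can quietly block -- the "implied DIO" issue above. */
    #define _GNU_SOURCE
    #include <libaio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("data.bin", O_RDONLY | O_DIRECT);   /* direct I/O: no page cache */
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        if (posix_memalign(&buf, 4096, 4096)) return 1;   /* buffer must be block-aligned */

        io_context_t ctx;
        memset(&ctx, 0, sizeof(ctx));
        if (io_setup(1, &ctx) < 0) { perror("io_setup"); return 1; }

        struct iocb cb, *cbs[1] = { &cb };
        io_prep_pread(&cb, fd, buf, 4096, 0);             /* length and offset aligned too */
        if (io_submit(ctx, 1, cbs) != 1) { perror("io_submit"); return 1; }

        struct io_event ev;
        io_getevents(ctx, 1, 1, &ev, NULL);               /* wait for the completion */
        printf("read %ld bytes\n", (long)ev.res);

        io_destroy(ctx);
        close(fd);
        return 0;
    }

FreeBSD's AIO, as the parent comments note, works against buffered files, so a file server doesn't have to give up the page cache and readahead to get asynchronous reads.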
“Work has been in progress for some time on a kernel state-machine-based implementation of asynchronous I/O (see io_submit(2), io_setup(2), io_cancel(2), io_destroy(2), io_getevents(2)), but this implementation hasn't yet matured to the point where the POSIX AIO implementation can be completely reimplemented using the kernel system calls.”
Not true. I've regularly demonstrated very high throughput/connectivity with lots of little connections. The problem I have seen (not only with Linux) has been over-aggressive congestion control, usually configured/set wrong.
On high-performance async I/O, this works quite well in Linux, and there are no blocking threads that I am aware of in that stack. The kernel uses bio dispatches to perform the actual block I/O. If you are complaining about using bio to perform the actual I/O, and that Linux includes this in its load calculation, sure, that is a conscious decision as I understand it on the part of the block layer folks. Is it wrong or bad? I don't think so, though others have different opinions.
FWIW ... I work at a place now using SmartOS as its primary OS. There are many people I know who prefer BSD, and many who prefer Linux. I have a different view, one that is not as popular as I would hope.
Specifically, I look at operating systems now, largely, as an implementation detail for your stack. Unless you are an OS developer, you have a mission that consumes the OS service layers to help you perform it. In many cases, the specifics of the OS don't matter, as long as they don't get in your way. Sometimes the specifics of the OS help you.
From my view as an HPC guy, a hardware guy, a storage/compute/ML/GPU guy, I generally can work in Linux and BSD without pain. Minor config differences, but I am comfortable in both.
I am not, and have not been, comfortable in AIX, HP-UX, and UNICOS. I used to enjoy IRIX until I started playing with Linux. I used Solaris and SunOS in the past, and SmartOS/illumos today.
As long as the OS has the tools I need, the libraries I need, or a way for me to build them, and doesn't constrain me or force me to contort to vagaries of the OS itself, I am fine with it.
A problem arises when people get caught up in "my OS > your OS", which this overall question at least brings in under the covers. This usually comes around from various esoteric aspects of little relevance for the vast unwashed masses of users (like me). On the OS dev side, when this happens, it is usually defensive, because something needed is missing, or some OS dev/manager (mis)believes that users don't actually need the features they are requesting.
That is actually a major problem, and it tends to drive people from your platform. Users aren't dumb, and there are many sophisticated people who have a deeper appreciation for the issues than "my OS > your OS".
Why *BSD isn't used might be for historical reasons, momentum, etc. It is perfectly fine as an OS, and quite usable for HPC. Similar for illumos/SmartOS (not simply saying that as I work for a company using SmartOS). There are missing things in both of these, and I am working (on the side) to try to help SmartOS get some of these things (user space stuff). FreeBSD in particular has most of what is needed.
Basically pick the system that works for you and your users. The OS, as I noted, can be viewed as a detail of the implementation. Or not.
But it's not a reason to create friction/tension between groups claiming OS1 > OS2 ...
I think the issue tatersolid has with Linux AIO is the implicit DIO. That's really painful if you're working with HDDs or highly concurrent read scenarios. See my sibling comment for why.
That leads to people implementing “async I/O” threadpools in userland. Those threads then do “regular” blocking I/O, which is able to use the page cache etc. Having hundreds or thousands of blocking I/O threads then causes lots of other perf/scheduling issues.
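Roughly, that userland workaround looks like the sketch below. This is a hypothetical illustration, not any particular server's code: a fixed pool of threads doing ordinary blocking pread(2), so reads go through the page cache, at the cost of one kernel thread per in-flight request.

    /* Sketch of "async I/O" emulated with a blocking thread pool.
       The event loop (not shown) enqueues requests and signals the condition
       variable; each worker then performs a plain blocking pread(2).
       Compile with -pthread. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define POOL_SIZE 64        /* a busy CDN box would need far more threads */
    #define QUEUE_LEN 1024

    struct req { int fd; off_t off; size_t len; };

    static struct req queue[QUEUE_LEN];
    static int q_head, q_tail;
    static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  q_cond = PTHREAD_COND_INITIALIZER;

    static void *worker(void *arg) {
        char *buf = malloc(1 << 20);
        for (;;) {
            pthread_mutex_lock(&q_lock);
            while (q_head == q_tail)
                pthread_cond_wait(&q_cond, &q_lock);   /* sleep until work arrives */
            struct req r = queue[q_head];
            q_head = (q_head + 1) % QUEUE_LEN;
            pthread_mutex_unlock(&q_lock);

            /* Blocking read: benefits from the page cache and readahead, but
               this thread is unavailable until the read completes. */
            pread(r.fd, buf, r.len, r.off);
            /* ...notify the event loop that the data is ready... */
        }
        return arg;
    }

    int main(void) {
        pthread_t tid[POOL_SIZE];
        for (int i = 0; i < POOL_SIZE; i++)
            pthread_create(&tid[i], NULL, worker, NULL);
        /* a real server would now accept connections and enqueue requests */
        pause();
        return 0;
    }

With 10k-100k concurrent connections per box, that means thousands of mostly sleeping threads, which is the scheduling overhead being described above.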
It's not just “my OS > your OS”: SmartOS is bulletproof when it comes to correctness of operation, data integrity and superior ease of system administration, which most prominently manifests itself in less breakage, in the absence of problems caused on Linux by technology concepts from the 1980s, and in nights slept through instead of being in conference calls with clueless managers screaming at one at 01:13 in the morning. These are all issues I have and have had with Linux which I don't have with SmartOS. That's a big difference!
An OS is a priori better if I get to sleep through the night without an incident.
FreeBSD (and perhaps some other BSDs) support “top-end” Ethernet as well. There was a great post on the Netflix blog a couple of months ago (discussed on HN) about how Netflix optimized their systems to serve video at 100Gbps.
Most modern HPC clusters use Infiniband and the more exotic Ethernet types - having done courses on classic structured Ethernet setups, seeing some of the challenges of building HPC clusters is fascinating.
But all the supercomputers use their own custom Linux, so no commercial backing. Also, these computers are not your standard data center. They cut networking and storage to a minimum because those are bottlenecks. These things are just massive RAM/CPU/GPU boxes connected properly through PCI.
Edit: I was looking at the hardware specs of Sunway, the number one supercomputer. They use a PCIe 3.0 connection for all their nodes. Communication between the nodes is 12 GB/second with a latency of 1 us. Their total RAM is 1.31 PB.
> But all the supercomputers use their own custom Linux, so no commercial backing.
This is just wrong. Yes, they use custom Linux, but it is highly highly supported. You buy a Cray or a BlueGene and you get dedicated kernel engineers as well as on site support etc etc.
> They cut networking and storage to a minimum because those are bottlenecks. These things are just massive RAM/CPU/GPU boxes connected properly through PCI.
This is just wrong. Networking is extremely important in supercomputers - but it isn't like setting up a LAN. They use custom networking, Infiniband, Aries, OmniPath etc. There isn't much information about the "PCIe Network" on the Sunway, but the fact it is PCIe isn't very interesting - everyone has fast optical networking. It's the topology and protocol which makes things interesting.
I don't consider it commercial Linux because they are not competing with other options. The companies that do build these supercomputers have to provide technical support because nothing out there exists for it. Just a different view of what commercial Linux is vs. building hardware-specific software.
It's very much commercial Linux, because you are paying for a service that's Linux-based.
Sure, with how cheap Infiniband is (especially compared to 40/100 gig Ethernet), one _could_ cobble together a system yourself.
Where the magic sauce comes in, and where the likes of Cray really make things shine, is the software they provide to let end users _easily_ do multi-machine scaling.
Libraries for just-in-time delivery of data directly into RAM? Yup. Location-aware job dispatchers that co-locate jobs near each other logically? Yup.
Red Hat is a commercial Linux because they are competing with other OSes/distros in this market. If I pay Joe $5 a month to keep my Ubuntu up to date, it doesn't make Ubuntu a commercial Linux even though I am paying for a Linux service. These companies building supercomputers are competing in producing supercomputers, not in providing a Linux distro and a service for said Linux. I very much doubt I could get access to their Linux distro and Linux service without first purchasing a supercomputer from them.
This is pretty much exactly how every HPC OS has been sold since the Cray X-MP. It's like if you buy Isilon - it is software, hardware and support you buy. No one argues that isn't commercial.
The Quora discussion doesn't seem to add much. Just a guy saying 'FreeBSD is rock-solid' and '[for] raw performance, .. nothing beats FreeBSD', without giving any technical details.
I think, largely, the same reasons apply to Linux vs. BSD in supercomputers as Linux vs. BSD generally. You might as well ask why Linux and not *BSD is used in Android, on servers generally, or by large technical knowledgeable organizations such as Google, Amazon, Facebook, etc.
So, in no particular order:
- Linux came on the scene when the BSDs were mired in legal uncertainty. After the legal issues were settled, Linux had already become the default choice for someone wanting a FOSS Unix-style kernel, and the BSDs never caught up.
- The GPL license meant that improvements were shared rather than squirreled away in various proprietary spin-offs and thus lost when whatever company was behind them folded (generally, exceptions going both ways surely exist!).
- Due to Linux gaining the initial momentum, developers flocked (and keep flocking!) to it, leaving the BSDs ever further behind.
- Linux was more welcoming to new contributors, whereas the BSDs were controlled by a small circle of core developers sitting on commit access. And of course, the BSD way of solving disagreements was forking the entire thing, further splitting up the already small developer base.
> The GPL license meant that improvements were shared rather than squirreled away in various proprietary spin-offs and thus lost when whatever company was behind them folded (generally, exceptions going both ways surely exist!).
I've always debated this. You would think that BSD licenses would be more attractive to corporations like Google, Amazon, Facebook, etc., and GPL licenses more attractive to researchers, so one would have thought that the BSD systems (FreeBSD, NetBSD, OpenBSD, etc.) would be the dominant Unix-style OSes. Instead the GPL-licensed, Linux-based OSes became dominant.
The question is about supercomputers specifically, which are mostly used by researchers and some applications like weather forecasting, aerodynamic simulations, etc. The infrastructures used by Google, Facebook, Amazon are massive clusters of computers, but they are not supercomputers.
In the research space peer-review and reproducible results are critical. So GPL does fit in well. The makers of supercomputers have to accommodate their clients' requirements.
Some will say better hardware support.
While Linux has better hardware support, I usually find this to be in the more exotic direction.
I think it's simply down to Linux being where the money is.
The big players (IBM, Dell, etc.) are all actively promoting Linux, and trained personnel are also somewhat easy to find.
So Linux is "the beast you know".
As for FreeBSD, it might be a technically better platform, but it is living in Linux's shadow.
Personally I run FreeBSD for the excellent documentation, stability, and features like ZFS, but nothing I run couldn't just as easily run on Linux.
And hardware support does not only mean that it "somehow" works, but that the hardware vendor and the operating system vendor have certified this combination and will support you with all your problems for many many years.
Edit: support as in commercial support with SLA etc
You have the choice of the FUSE version that's legally free and clear but has obvious FUSE related performance limitations, or the kernel version which has great performance but is questionable at best from a legal standpoint because the CDDL is not compatible with the GPL.
Canonical has decided they're willing to take the risk by bundling it in Ubuntu and so far it hasn't backfired on them, but there's good reason to believe that Oracle's lawyers may have something to say about it if they ever feel that ZFS-on-Linux is threatening any of their products.
Because BSD's SMP support has traditionally been pretty terrible compared to Linux's. They still have a SLAB memory allocator (compared with Linux's default of SLUB, which is much better for heavily SMP systems).
Many of the vendors for HPC (I'm looking at you, Mellanox) primarily develop and certify their products on Linux. While they might work OK on BSD, you're not going to get the full performance and all of the features on a BSD system. If you paid for Mellanox EDR 100G Infiniband switches and all of the fancy VPI network cards, you want to use them to their fullest performance. The vendor tells you to use Linux for that, so you use Linux.
TL;DR: Linux is what the hardware manufacturers overwhelmingly target and work with. HPC users use what vendors support best.
Your final line is 100% correct but all your supporting details are not.
HPC is generally a "softball" workload because the code is going to be more sympathetic to the hardware than many other computer usages. Processes will batch-allocate a lot of RAM and stay pegged in the runnable state for a long time.
SMP.. "it depends", again a parallel vector matrix multiply is just going to sit in the runnable state on all the cores and the kernel is pretty irrelevant. There is a lot of junior job stuff left in FreeBSD to move locks around. The VFS is quite bad. In an HPC type workload these things probably wont matter that much unless you see a lot of "system %". They will show up in profiles and are generally also easy to fix. But it's not hard to construct a microbenchmark showing Linux > $else in those areas.
SLUB.. no. What kind of HPC workload is going to care much about this? The Linux allocators are pretty awful at contiguous kernel memory allocation (see ZFS on Linux). I don't see why UMA would architecturally flop here.
NUMA is a sore point on FreeBSD. It should be usable in 12.0. Isilon and Netflix are paying Jeff Roberson to work on it. Some folks on my team are also doing minor NUMA and locking work, but for commercial CDN workloads.
Mellanox does a pretty stellar job on FreeBSD Ethernet and Infiniband support. Unfair dig at them. I generally prefer Chelsio, but Mellanox has lowest latency which is relevant for HPC.
Awesome response, thanks for taking the time to write it.
SLUB was written by Christoph Lameter when he was at Silicon Graphics for their monster Altix machines. It took Linux hours to boot (with SLAB) on that machine. He wrote SLUB in a fit of brilliance to make Linux suck less on these, on which HPC workloads can most certainly be run. Just like some of the crazy Cray computers, SGI machines used to own HPC. Note that I work with Christoph in the same office and have discussed this with him in person. Regarding contiguous memory allocation, a lot of serious HPC workloads use huge pages set at boot to defeat this, so that part of Linux's fail is a non-issue (you're entirely right, btw). Really awesome to hear about the NUMA bits in FreeBSD being improved, and I sufficiently feel hit with a cluebat on it.
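For what it's worth, the boot-time huge page trick usually boils down to something like the sketch below (hypothetical; it assumes the admin reserved 2 MiB huge pages, e.g. via the hugepages= kernel parameter, so the application never depends on the allocator finding contiguous pages at runtime):

    /* Sketch: backing a large compute buffer with pre-reserved huge pages.
       mmap with MAP_HUGETLB fails unless huge pages were reserved beforehand. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 64UL * 2 * 1024 * 1024;            /* 64 huge pages of 2 MiB */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) { perror("mmap(MAP_HUGETLB)"); return 1; }

        /* ...hand p to the solver's matrices / communication buffers... */

        munmap(p, len);
        return 0;
    }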
The bit from Mellanox was from their engineers (in their Haifa, Israel office before lunch) telling me they build their products for Linux first, and then port to everything else. They care deeply that it works on Linux, and it is nice if it works on other systems but not as important. It wasn't a dig at them, it was what the engineer said to me.
They actually are, as are VTune and some other commercial stuff from Intel, and the open-source libraries like ISA-L and IPP and frameworks like DPDK, SPDK and the NVDIMM stuff.. all work on FreeBSD.
Last I heard from my rep, Intel was discontinuing icc altogether because it didn't make a lot of sense to not put the optimizations in the compilers most people use.. gcc, llvm, vcpp.
I'd assumed Intel kept their compilers as a competitive advantage even if they weren't profitable by themselves. Could certainly see it happening though.
This!
I'm an HPC sysadmin, and I use FreeBSD for all infrastructure services - DNS, DHCP, PF, a ZFS-based backup server, et al.
And strictly CentOS with tightly controlled installations of the Intel MKL libraries and their ecosystem.
Because the people who use supercomputers just want to crunch numbers - the operating system is a distraction at best, and Linux is the path of least resistance.
I would guess it is because Linux has wider hardware support than the BSDs. As you're building a supercomputer, it makes sense to have the fastest hardware, which implies new technology.