
I always wondered how people got to these uptimes. I have to reboot my Linux box at least once every other week to install updates.



You only need to restart when there is a kernel update, and the frequency of kernel updates depends heavily on the distro used. Debian stable, for example, although using ancient versions of the packages, is a great OS for such a use case, as kernel upgrades are really infrequent. Have a look at the changelog frequency of Squeeze [1] or Wheezy [2].

[1] http://metadata.ftp-master.debian.org/changelogs/main/l/linu... [2] http://metadata.ftp-master.debian.org/changelogs/main/l/linu...


If you update a central library (e.g. openssl), you'll have to restart in order to deal with in-memory copies being used by other programs. If you're running a Debian server, two packages worth including in your base install are debian-goodies and needrestart: the former bundles a very helpful little script called "checkrestart", and the latter is an updated, systemd-compatible take on the same idea. Both use `lsof` under the hood to determine when and why package updates require a restart for full effect.
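
On a Debian-ish box that usually looks something like this (a rough sketch; package names and flags can vary by release):

    # after e.g. a libssl upgrade, see what is still running the old code
    sudo apt-get install debian-goodies needrestart
    sudo checkrestart        # lists processes still using deleted (replaced) files
    sudo needrestart -r l    # list mode: report services needing a restart, without acting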


But do you? You really only need to restart the processes using those packages. Technically, a kernel update (specifically security update, bug fixes may not be important) would only require a reboot.


Yes, you can restart all processes using SSL.

However, I've often been in situations where I reboot anyhow, because rebooting means I'm 100% confident the old code is gone, whereas if I try to get clever and avoid the restart, I'm significantly less confident. Depending on how hard it is to validate the security bug, that can be a problem.

Plus, for much of the past 20 years for many computers, if you're going to restart all services, adding in the last step of rebooting doesn't add all that significantly to the downtime. Server-class hardware often has things that make that not true (stupid RAID cards), but for everything else you were often only adding, say, 25% for the actual reboot.


You don't need to be 100% confident the old code is gone - just 100% confident the old code is no longer exposed to the network - check your sockstat/netstat and call it a day.
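
On Linux, a rough way to check that (assuming lsof and /proc; sockstat is the BSD equivalent):

    # who is exposed to the network right now
    sudo lsof -nP -iTCP -sTCP:LISTEN
    # who is still mapping deleted files (e.g. an old library replaced by an upgrade)
    sudo grep -l '(deleted)' /proc/[0-9]*/maps
    # anything on both lists is old code still listening, so restart it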


It gets complicated when central libraries like glibc have to be updated. I did this once with checkrestart on Debian Wheezy and I had to restart nearly everything except for the init process. So in this case just restarting the system would have been faster and easier.


For lots of core stuff, you don't technically need to reboot, but you probably do need to go down to single user mode and come back up (consider upgrading glibc or openssl), and at that point you might as well reboot.


Rebooting also closes unused sockets, closes open descriptors, fixes memory leaks, cleans /tmp and performs fsck if needed. So it is good to reboot.


Well, the server is there to host some service. If you'd need to restart the service daemon anyway, why not restart the machine for once, and make everything simpler?

Also, boot-time bugs are a huge issue. They can creep in during the entire time your system is up, and only show up during a reboot. Thus, if your server only ever has unplanned restarts, you'll only discover those bugs when you have yet another pressing issue to deal with, likely at 3 in the morning on a Sunday.

So, make things better for you, and restart those servers once in a while, when things are quiet.


KSplice helps you avoid the need to reboot even with many kernel changes. Think of it as the delta layer that gets you from the security patch to the maintenance window when you can do a real reboot.


Not only that, but you only reboot when there's a kernel update that you care about: if it's not a security update, or it's a security update that doesn't affect you, you can skip it. I don't reboot for remote exploits in kernel services I don't use, or for local privilege escalation vulns on single-service VMs.


Since a lot of updates will require you to stop or restart the service anyhow, adding a reboot at the end before bringing everything back up isn't that bad of an idea.


I once had a Debian stable desktop and home server reach two years of uptime using that strategy of upgrading everything except the kernel. Some upgrades, like a newer glibc, were quite tricky to accomplish without a reboot as you had to restart nearly every process. It was a fun game so I didn't mind the effort. Eventually a power outage wiped away my uptime.


Isn't modern dbus something that you can't restart without rebooting the computer? I think...


You only need to restart to install kernel updates that you need. While I'd normally just install all updates, most kernel bugs that I recall seeing in the past few years are local exploits. If you're running few/no external services, you might not need to upgrade. And often bugs are in little-used subsystems/protocols -- often those will be off by default, or turned off by a diligent administrator (never run code you don't need).

It's rare to see a kernel-bug that can't be worked around in some way other than patching.


> It's rare to see a kernel-bug that can't be worked around in some way other than patching.

While it's true that most critical bugs I've run into with the kernel can be worked around in some way, it's to the detriment of some use case that some users, somewhere, rely upon.

A vulnerability I discovered a bit over a year ago allowed local privilege escalation for any user with access to write to an XFS filesystem. The only workarounds were to modify their SELinux policies or switch to another filesystem. I'm not aware of any users that rewrote their SELinux policies for this. It had been fixed in the kernel and fairly quietly addressed in a Red Hat security advisory, but I don't think any other distribution did anything at all.

I'd estimate that critical kernel bugs happen at least every other month. Worse, discussion of these bugs happens in public and they take dozens of months to fix. Again, assuming you even know of these vulnerabilities, while there are often workarounds, they're not always practical.

Did you know that via user namespaces, all non-root users on a machine can elevate to a root user? That root user is supposed to be limited, but it's allowed the mount syscall. Numerous vulnerabilities have been discovered as a result of this. The kernel team usually considers them low-impact and they get a low CVSS score, but when using certain applications this can lead to local privilege escalation. The workarounds are to disable user namespaces or disable mount for user namespaces, both of which will break some set of users.
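
(For reference, the usual knobs look roughly like this; which one exists depends on your kernel and distro, so treat it as a sketch:)

    # newer upstream kernels: forbid creating new user namespaces
    sysctl -w user.max_user_namespaces=0
    # Debian/Ubuntu-patched kernels: forbid unprivileged user namespaces
    sysctl -w kernel.unprivileged_userns_clone=0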

Did you know that any user capable of creating a socket can load kernel modules? For a long time this allowed loading ANY kernel module! The only workarounds were to compile it out of the kernel or monkeypatch the kernel. Only last year was this finally fixed upstream so that only modules matching a pattern can be loaded. Yet it was also discovered that if you used busybox's modprobe, the filter would still allow non-root users to load arbitrary kernel modules from anywhere on the filesystem!

Point being, this is par for the course. Clearly one needs to understand their threat model, but if the model is at all worried about local privilege escalation, update weekly until you find another OS.


> A vulnerability I discovered a bit over a year ago allowed local privilege escalation for any user with access to write to an XFS filesystem.

So, mount any existing xfs file systems read-only, and move rw systems to ext3? I'm not saying it would make sense - but sounds like a prime example of something for which there was a work around...

(I'll concede that for those that need(ed) xfs, there'd probably not be many alternatives at the time. Possibly JFS?)


Yeah, I had a laptop I used as a home server that had an uptime of nearly 2 years when I finally decided to update the packages. Turns out the hard drive was hanging by a thread and the reboot was enough that it gave out permanently.


Try smartd[1], which can be set to run a SMART self-test at regular intervals. It presumably won't make the hard drive last any longer, but you'll probably get a warning before it fails.

[1] https://www.smartmontools.org/browser/trunk/smartmontools/sm...
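
A minimal smartd.conf sketch (the schedule spec follows the man page's example; adjust the device name and times to taste):

    # /etc/smartd.conf: monitor /dev/sda, short self-test daily at 02:00,
    # long self-test every Saturday at 03:00, mail warnings to root
    /dev/sda -a -s (S/../.././02|L/../../6/03) -m root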


While some might disagree, I definitely agree. Often there is no need to install updates at all on machines that only perform one or very few functions that have limited/no network connectivity. Things like HVAC and SCADA systems that only talk to hardware and not the internet, and are physically secured well.

I've seen many Windows systems with uptimes of several years that have never required any maintenance.


> Often there is no need to install updates at all on machines [...] like HVAC and SCADA systems

Which, incidentally, have been the target of a lot of recent high-profile attacks.[0][1][2][3]

[0] https://en.wikipedia.org/wiki/Duqu#Purpose

[1] https://en.wikipedia.org/wiki/Stuxnet#PLC_infection

[2] http://www.computerworld.com/article/2475789/cybercrime-hack...

[3] http://krebsonsecurity.com/2014/02/target-hackers-broke-in-v...


I'm aware.

We have regular security audits done by a security firm that goes the extra mile to try to social-engineer its way in and gain physical access to all of our sites.

Plus we're talking about things like processing fish in a town of 2,000 people. If I was operating a nuclear reactor, I would surely adopt better security measures... although against government-sponsored attacks using undocumented vulnerabilities, Windows Update isn't really going to do much.

The Target thing you posted has to do with internet access, which goes against what I was saying. I'm talking about closed, physically secure networks, possibly not even using TCP/IP or Ethernet.


Your quote omits the critical "that only talk to hardware and not the internet". Your examples 3 and 4 are doing it wrong.

Stuxnet-like attacks can go after non-networked equipment, but they're based on exploiting the computer with the programming suite, not the industrial system itself.


That's fair. My point was that in reality, a ton of people end up doing it wrong in some way or another. You should cover your bases and keep your systems up to date with security patches regardless of how segregated you believe they are.


Under those circumstances, you can definitely get away without updating. But remember that updates do not only fix security issues, but also stability issues.

My gut feeling is that it is kind of like driving a car without wearing a seat belt. So far, if I had never worn a seat belt, nothing bad would have happened, because I have not had any accidents. But when it happens, one goes through the windshield, so to speak. Also, some stability/performance issues do not manifest until a machine has been running continuously for months or years.

(What is more disturbing, though, is that the very-high-uptime systems (~4 to 8 years) I have seen also appeared to never get backed up, and there didn't seem to be any plans for replacement, or at least spare parts. Which is kind of bad if the machine happens to be responsible for getting production data from your SCADA to your ERP system, which in turn orders supplies based on that data.)


Weren't there a lot of reports in the last few years about how vulnerable SCADA systems are?


Totally. A lot of industrial/utility type places don't really have robust IT, and they treat computers like industrial equipment. So you may have a factory foreman or operating engineer who is responsible for equipment, who is 100% reliant on a vendor CE for implementing stuff.

What ends up happening is that they'll bolt on some network connectivity for convenience or to take on some new process and not set it up appropriately, or not understand what it means to expose something to the LAN or directly to the internet.

I helped a friend at a municipal utility with something like this when they wanted to provide telemetry to a city-wide operations center. They had a dedicated LAN/WAN for the SCADA stuff, and the only interface was in this case a web browser running over XWindows that had a dashboard and access to some reports. I think they later replaced it with a Windows RDS box with a similar configuration.

Because of the isolation, and professional IT who understood how to isolate the environment, it was advisable not to be tinkering with updates, as the consequence of failure is a risk to health & safety.


Yes, frequently precisely because one of the two clauses asserted by the previous commenter (a lack of general network connectivity) has become false without changing other things about the workflow.

(I'm not advocating for HVAC/SCADA systems to be running, say, Windows XP Embedded with no updates and default passwords, world-facing, just observing that the preconditions changed.)


Well, if one simply does not install updates, it gets rather easy, as long as the hardware does not act up.

Which is of course, a really bad idea for the general case.

Although it is actually kind of a requirement in some industrial environments, where certifications are involved - once the thing is certified, any change, hardware or software, requires a re-certification, which apparently is expensive and tedious. Which is how many industrial plants, too, end up running on ancient computers, at least by today's standards.


An evaluation of the advantages of this kind of certification compared to not having updates would be interesting (do they really add value, beyond moving responsibility around?).


That would be a highly interesting evaluation. I worked in the Aerospace/Defense industry in the 1980s, and it seemed to me that "we can't change X, X is 'flight certified'" was a huge excuse for not innovating, or maybe a huge roadblock to innovation. So it's big news in 2016, when Boeing is hinting about stopping 747 production, an aircraft that made its first flight in February of 1969, 46 years before. I'm guessing that "flight certification" is the largest factor in keeping airliner technology in the 1960s.


At the same time, we had a good understanding about aerodynamics in the 1960s and were producing more or less optimized designs. There are some additional optimizations that we've figured out like sharklets, but overall the design is similar -- at least until we trusted composites enough to use them in aircraft.

Where we have seen a lot of innovation is in the engines -- fuel economy and noise regulations have pushed GE, RR and P&W to up their game substantially.

https://www.youtube.com/watch?v=Or5YEhiT_d4&feature=context-...


The 747 is only one example. Martin Marietta made and launched Titan space launch vehicles from the early 60s to the early 90s, with only very slight changes and improvements. GD did much the same with the Atlas launch vehicle, and the Centaur upper stage. I will grant that NASA and Douglas/McDonnell Douglas made a lot out of the Thor IRBM, but that seems like a function of NASA Administrators having longer tenure than anything else.


You can invert the question: What is the advantage of having updates in most industrial systems?


Well, even systems only connected to local networks or no network at all can still be the target of attacks, like infected flash drives, etc.


If the system is suitably firewalled, it's OK to not update regularly.

If the system is sufficiently critical, it may be hard to update or migrate it. And if it's still safely working there's no real incentive to do so.


> If the system is sufficiently critical, it may be hard to update or migrate it. And if it's still safely working there's no real incentive to do so.

In the very short term, perhaps. But I'd argue that critical systems are the ones most in need of the ability to be frequently updated and migrated. What's going to happen when a serious security problem demands an immediate change, and you're not prepared for it? Or the system catches fire, or floods?

SPOF critical systems are why so many organisations end up in legacy software hell.


Oracle Linux can hot-patch the kernel. I had some crazy uptimes but recently had to shut down the hypervisor hosts for some facility updates.


Only because they took over and then hid away what was an awesome project: KSplice.

http://www.ksplice.com/try/

Not my favorite company.


Not mine either, but their sales pitches to the people above me are really good.


Kernel updates probably need a restart (or use kexec), but most other updates only require a service to be restarted, right?
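
(For reference, the kexec route looks roughly like this; the paths are placeholders and the exact tooling varies by distro. Note that kexec still restarts userland, it just skips the firmware/POST step:)

    # load the freshly installed kernel and jump into it without a firmware reboot
    kexec -l /boot/vmlinuz-new --initrd=/boot/initrd.img-new --reuse-cmdline
    systemctl kexec    # on systemd; otherwise stop services and run: kexec -e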


Generally speaking, yes. Although, if there is an update to glibc, which tends to affect most processes, I tend to reboot the system, anyway.

(On Windows, a lot more updates require a reboot, though, because one cannot replace/delete a file that is opened.)


> because one cannot replace/delete a file that is opened.

Which is a good thing, and I bet it's the model in all other OSes besides UNIX.

The UNIX flock model is just flawed.

I can't count how many times I've crashed something just because I rm'ed a file that was being used.


You'd be hard pressed to convince me that the Windows model for locking files is superior to what Unix offers, at least as far as file deletion goes. Conceptually speaking, it's pretty simple:

* Files are blobs of storage on disk referenced by inode number.
* Each file can have zero or more directory entries referencing the file. (Additional directory entries are created using hard links.)
* Each file can have zero or more open file descriptors.
* Each blob has a reference count, and disk space is able to be reclaimed when the reference count goes to zero.

Interestingly enough, this means that 'rm' doesn't technically remove a file - what it does is unlink a directory entry. The 'removal' of the file is just what happens when there are no more directory entries to the file and nothing has it open.

https://github.com/dspinellis/unix-history-repo/blob/Researc...
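
A quick way to see that behaviour from a shell (a toy sketch; Linux-specific because of /proc):

    # the data outlives its directory entry while someone holds the file open
    echo hello > demo.txt
    tail -f demo.txt &      # background reader keeps a descriptor open
    rm demo.txt             # unlinks the directory entry only
    ls -l /proc/$!/fd       # the descriptor now shows "demo.txt (deleted)"
    kill %1                 # last reference gone; the space can now be reclaimed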

In addition to letting you delete files without worrying about closing all the accessing processes, this also lets you do some useful things to help manage file lifecycle. E.g. I've used it before in a system where files had to be both in an 'online' directory and in another directory where they were queued up to be replicated to off-site storage. The system had a directory entry for each use of the file, which avoided the need to keep a bunch of copies around, and deferred the problem of reclaiming disk storage to the file system.


> You'd be hard pressed to convince me that Windows model for locking files is superior to what Unix offers,

This model is not unique to Windows; it's the model in most non-POSIX OSes.

I happen to know a bit of UNIX (Xenix, DG/UX, HP-UX, Aix, Tru64, GNU/Linux, *BSD).

Yes, it is all flowers and puppies when a process deals directly with a file. Then the only thing to be sorry about is the lost data.

Now replace the contents of the file being worked on, or delete it, in the context of a multi-process application that passes the name around via IPC.

Another nice one is data races from not making use of flock() and just opening a file for writing; UNIX locking is cooperative.


> This model is not unique to Windows, rather most non POSIX OSes.

You could also point out that the hardlink/ref-count concept is not unique to POSIX and is present on Windows.

https://msdn.microsoft.com/en-us/library/windows/desktop/aa3...

> Now replace the contents of the file being worked on, or delete it, in the context of a multi-process application that passes the name around via IPC. ...

Sure... if you depend on passing filenames around, removing them is liable to cause problems. The system I mentioned before worked as well as it did for us, precisely because the filenames didn't matter that much. (We had enough design flexibility to design the system that way.)

That said, we did run into minor issues with the Windows approach to file deletion. For performance reasons, we mapped many of our larger files into memory. Unfortunately, because we were running on the JVM, we didn't have a safe way to unmap the memory mapped files when we were done with them. (You have to wait for the buffer to be GC'ed, which is, of course, a non-deterministic process.)

http://bugs.java.com/view_bug.do?bug_id=4724038

On Linux, this was fine because we didn't have to have the file unmapped to delete it. However, on our Windows workstations, this kept us from being able to reliably delete memory mapped files. This was mainly just a problem during development, so it wasn't worth finding a solution, but it was a bit frustrating.


If you rm a file that someone else is reading, they can happily keep reading it for as long as they like. It's only when they close the file that the data becomes unavailable.


I know how inodes work, thank you. My first UNIX was Xenix.

Applications do crash when they make assumptions about those files.

For example, when they are composed of multiple parts and pass the filename on for further processing via IPC, or have another instance generate a new file with the same name, instead of appending, because the old one is now gone.

Another easy way to corrupt files is just to access them, without making use of flock, given its cooperative nature.

I surely prefer the OSes that lock files properly, even if it means more work.


> For example, when they are composed by multiple parts and give the filename for further processing via IPC

Of course, the proper way to do this in POSIX is to pass the filehandle.

POSIX is actually a pretty cool standard; the sad thing is that one doesn't often see good examples of its capabilities being used to their full extent.

For this, I primarily blame C: it's so verbose and has such limited facilities for abstraction that it's often difficult to see the forest for the trees. Combine that with a generation of folks whose knowledge of C dates back to a college course or four, in which efficient POSIX usage may not have been a concern, and one finds good, easy-to-read examples of POSIX harder to find than they really should be.


Unfortunately POSIX also shares the same implementation-defined behaviour as C.

I had my share of headaches when porting across UNIX systems.

Also, POSIX has now stagnated around support for CLI and daemon applications. There are hardly any updates for new hardware or server architectures.

Actually, having used C compilers that only knew K&R, I would say POSIX is the part that should have been part of ANSI C, but they didn't want a large runtime tied to the language.


After working as a Windows admin for a few years, I feel that it is one of those things that sound like a great idea at first, but are causing more problems than they prevent.

But years as a Unix user might have made me biased. I am certain people can come up with lots of stories about how a file being opened prevented them from accidentally deleting it or something similar. I am not saying it is a complete misfeature, just a very two-edged sword.


It looks to me like the "can't replace/delete a file that is opened" rule is one of the factors that causes the malware phenomenon on Windows. That is, you must reboot to effect some software updates. Frequent reboots meant that boot sector viruses were possible.

That policy also means that replacing some critical Windows DLLs means a very special reboot, one that has to complete, otherwise the entire system is hosed.


Boot sector viruses have existed since CP/M days, in almost all systems.

Also we should remember the first worm was targeted at UNIX systems.


Sure, boot sector viruses existed since CP/M days - those systems were almost entirely floppy-disk based, and required a lot of reboots. Removable boot media + frequent reboots = fertile environment for boot sector viruses.

We should also remember that the 2nd worm was targeted at VMS systems (https://en.wikipedia.org/wiki/Father_Christmas_%28computer_w...) and appeared only a month or so after the RTM worm. Imitation is the sincerest form of flattery, no?


Sure, I was just making the point that viruses and friends were never a PC exclusive, as some tend to believe.


To be honest, that sounds like a pretty bad setup if you need to manually delete files when you know there's a chance that the system is not only operating on them, but also not stable enough to handle exceptions arising from accessing them.

But arguments about your system aside, you could mitigate user error by installing lsof[1]. eg

    $ lsof /bin/bash
    COMMAND   PID USER  FD   TYPE DEVICE SIZE/OFF NODE NAME
    startkde 1496  lau txt    REG   0,18   791304 6626 /usr/bin/bash
    bash     1769  lau txt    REG   0,18   791304 6626 /usr/bin/bash

[1] https://www.freebsd.org/cgi/man.cgi?query=lsof&sektion=8&man...

You might even be able to script it so you'll only delete a file if it's not in use. eg

    saferm() {
        # delete only if lsof finds no process currently using the file
        lsof "$1" || rm -v "$1"
    }
If you do come to rely on that then you'll probably want to do some testing against edge cases; just to be safe.


Sounds like user error, not a design flaw.


How many UNIX applications make proper use of flock() ?


That's right. But updating everything except the kernel also seems strange.


Well, on Linux distros it is not uncommon for individual packages to be updated as updates become available. So if there is an update to, say, the web server, it is sufficient to restart the web server.
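
For example (a sketch assuming apt and nginx; substitute your package manager and service):

    # upgrade just the web server package and bounce only that service
    sudo apt-get install --only-upgrade nginx
    sudo service nginx restart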

On BSD systems, kernel and userland are developed in lock step, so it's usually a good idea to reboot to be sure they are in sync.


Usually you'd be running the same userland daemons on FreeBSD that you might on Linux. Web servers, databases, OpenSSHd, file networking protocols (FTP, SMB, NFS, etc), and so on aren't generally tied to a particular kernel version since ABIs are not subject to frequent changes (it would be very bad if they were).

So with that in mind, you can update your userland on FreeBSD without updating your kernel in much the same way as you can do with Linux. Though it is recommended that you update the entire stack on either platform.


Updating the FreeBSD base (to the next major release) without updating the kernel is Not a Good Idea. Backward compatibility is there, down to version 4.x, but no one guarantees forward compatibility!

Applications will usually work, but the base system will likely break in some places.


Yes, but 'freebsd-update' is akin to an 'apt-get dist-upgrade'. Regular updates from 'pkg' or ports should be fine.
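
Roughly (a sketch; exact invocations depend on the FreeBSD release):

    # third-party packages only, no base system/kernel changes
    pkg update && pkg upgrade
    # base system patches (may include a new kernel, so plan for a reboot)
    freebsd-update fetch
    freebsd-update install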


It's not at all strange. Security updates to publicly-facing services should be applied as fast as possible. Kernel vulnerabilities are a whole different attack surface.


Kernel vulnerabilities can be combined with user space vulnerabilities. eg a buggy web script might allow an attacker shell access under a restricted UID. The attacker could then use a kernel vulnerability to elevate their permissions to root.


I managed over 90 days on my PC (Ubuntu 14.04) at the office. I would have gone for more, but there was a power outage and all my bragging rights are gone.


Do you just omit the reboots after kernel updates? Because on my notebook with 14.04, I have to reboot quite often.


I have a netbook running 12.04 sitting on my nightstand that I use for listening to Podcasts / audiobooks as I go to sleep. I usually wake up after a few hours, put the netbook to sleep, then turn around and go back to sleep.

This means I only update the netbook on rare occasions, because when I go to bed, I want to... sleep, you know, not update my netbook. So currently, that thing has about 180 days of uptime (although it spent most of that time sleeping, of course). I have been meaning to install updates and reboot it for months, but during the day, I forget about it, and only think of it as I go to bed... A vicious cycle... ;-)

With a kernel update, one doesn't technically have to reboot; the reboot is merely required for the update to take effect.


On a box that does a few specific things, you don't need to install updates.


Don't install updates.





