
497.1-day uptime bug - mrb
http://www.ibm.com/developerworks/mydeveloperworks/blogs/anthonyv/entry/497_the_number_of_the_it_beast2
======
yatsyk
I'm not sure about the details, but as far as I remember, Windows CE uses a
brilliant approach to fix this bug: the system tick count is initialized to a
value three minutes before overflow, so the counter overflows three minutes
after the OS starts. Three minutes is usually enough to load all the
applications that could be buggy, but it's less than a typical debug session.

edit: found details:

<http://msdn.microsoft.com/en-us/library/ms885645.aspx>

 _For Debug configurations, 180 seconds is subtracted to check for overflow
conditions in code that relies on GetTickCount. If this code started within 3
minutes of the device booting, it will experience an overflow condition if it
runs for a certain amount of time._

~~~
bdonlan
Linux does this too, but unconditionally:
<http://lxr.linux.no/linux+v3.1.1/include/linux/jiffies.h#L163>

Note that this particular timer is not directly exposed to userspace, however.

------
kabdib
Win95 had a famous timer wrap at 49.7 days. Ouch.

All timers should either be really tiny (whereupon they are good subjects for
test cases) or really huge and not subject to possible rollover (64 bits of
nanoseconds is about 584 years, and should serve for an interval counter).

128 bits of nanoseconds is about 10^22 years, and should serve to drive
calendar time, unless you're doing cosmology.

~~~
mappu

        Win95 had a famous timer wrap at 49.7 days. Ouch.
    

The API call in question is GetTickCount[1], and it's still really popular -
especially for quick things like checking for timeouts and so forth. It
returns the milliseconds since boot as a 32-bit unsigned integer.

There's a replacement named, funnily enough, GetTickCount64, but IIRC it's
only present on Vista and newer, so it hasn't found its way into a lot of
software yet. The Windows performance counters probably provide better metrics
for people actually interested in this data.

_______________________

1\. <http://msdn.microsoft.com/en-us/library/windows/desktop/ms724408(v=vs.85).aspx>

~~~
marshray
I recently had to implement a version of GetTickCount64 for older platforms
that only support GetTickCount(32). It works great as long as you remember to
call it at least every 49.7 days. :-)

(Luckily the process already had a thread which wakes up to perform such
maintenance every hour or so.)

------
blinkingled
Speaking of uptimes - where I work, the Cisco switches and load balancers in
our data center for some reason always become flaky after 300+ days of uptime -
weird resets, CLOSE_WAITs, and other such things. Older Solaris releases (8
and below) also get flaky after 200+ days of uptime (we've had apps doing
zero-size reads on log files in a loop, processes hanging on startup, etc.).
Linux boxes get so many updates that they hardly ever cross 60 days of uptime.
The only shining stars of rock-solid uptime are the heavily loaded HP-UX 11i
DB server boxes - 1000+ days of uptime and they literally work like they were
freshly booted!

------
maratd
As a rule of thumb, I have all of my equipment and servers reboot every 30
days. You never know what sort of cruft you'll run into if you run your box
long enough.

~~~
viraptor
I've never understood this approach, for three reasons. The first: sure, you
could say that some "cruft accumulates". But by rebooting, you're guaranteeing
that if something goes only slightly wrong you may never notice it; every
month you start with a clean system that no longer shows the issue. So you're
choosing to potentially ignore tiny issues instead of letting them crash the
system in a visible way and fixing them properly, for good.

The second is that there shouldn't be any "cruft". Servers are not running
Win95, which reliably crashed given enough time to run. "Cruft" should be
fixable - if it isn't, then you're running a system that cannot really be
supported.

The third one is that if you cannot say anything more specific than "cruft",
then your system is badly managed. Are you restarting because your app leaks
memory? Is it leaving zombie processes? Is it leaving dead connections to the
database? Or maybe something else entirely? Restarting can be a short-term
solution for some specific issue, but if it's there to remove "cruft" and "you
never know" what it is, then you might as well try arranging your server room
according to feng-shui or using voodoo healing to make your app run better.
Either you control your system, or you don't.

~~~
count
How do you know you can restore your system to a working state in the event of
an unscheduled outage, cruft or not?

You should distinguish between _services_ and systems: make your service
available 100% of the time, but be able to kill and restart/reload/replace
systems for maintenance or other reasons at nearly any time. And you SHOULD do
that, because without proof that you can, your DR solution is simply a best
guess.

~~~
mburns
By enforcing configuration management software (Puppet, CFEngine), so that
one-off fixes don't get hot-patched onto the production server and left
undocumented.

~~~
count
This goes a very long way to helping, yes, but by itself does not guarantee
anything. And not everything can be cfengined or puppeted.

------
mrb
Someone else reports experiences of switches rebooting after 497 days:

<http://storagemojo.com/2011/11/07/how-fault-tolerant-are-sans/#comment-220017>

------
brohee
Tick counters are not much of an issue. Packet or octet counters overflowing
are much more interesting, especially when they are somehow connected to
billing...

------
drodil
<http://news.cnet.com/2100-1040-222391.html>

