

Sick time is logarithmic - micahalles
http://greatnotbig.com/2015/03/sick-time-logarithmic/

======
pash
There is probably something to this model. Notice first that the constants are
suggestive. There are 52 working weeks in the year, and slightly more than
250 working days, and the author finds a good fit for

    
    
        sick_days = -52 * log(x) + 236 .
    

So the fitted curve can be interpreted as saying,

    
    
        sick_days = weeks * sick_days_per_week(x) + healthy_days + noise ,
    

or somewhat more straightforwardly,

    
    
        days = healthy_days + sick_days(x) + noise .
    

Since _x_ is the rank (index) of an employee by the number of sick days taken,
we have that

    
    
        sick_days_per_week = log(employee_index) .
    

So it seems that we should guess that _sick_days_per_week_ is log-normally
distributed across employees. That's the same as saying that employees take
sick days at a rate

    
    
        r = mu + sigma * Z ,
    

where _mu_ and _sigma_ are the mean and standard deviation of
_sick_days_per_week_ and _Z_ is a standard normal.

This checks out intuitively, since this model comports with one in which the
number of employee-sick days in any timeframe is independent of the number
taken in previous periods. And the author says his sick-day policy is set up
to incentivize taking sick days only if you're actually sick, not because
you've accumulated them, etc. So the data look like what we would expect if
the policies are working as they should. (Notice also that this model
eliminates the need to control for employee tenure.)

So hopefully the author will run the data and see whether _Z_ passes
statistical tests for standard normality, then tell us what _mu_ and _sigma_
are. We might have a nice model for employee sick days and some solid
empirical results to go along with it!

~~~
contravariant
From his claim that

    
    
        sick_days = -52 ln(x) + 236 
    

It would be more straightforward to conclude that the number of sick days is
exponentially distributed.

To see this, first note that the employee index "x" is equal to N(1 - p), where
N is the number of employees and p is the proportion of employees with fewer
sick days. So if we invert the formula we can calculate the proportion of
employees with fewer than a certain number of sick days, which is basically the
cumulative probability distribution of the number of sick days.

Now let's try to invert the formula. First we simplify a bit:

    
    
        -52 ln(x) + 236 = -52 ln(N(1-p)) + 236 = -52 ln(1-p) + 236 - 52 ln(N)
    

Now if we fill in N = 86, we see that 52 ln(N) is 231.626 which looks
suspiciously like 236, so let's assume that they're the same for now. This
gives us:

    
    
         sick_days = -52 ln(1-p)
    

Or in other words

    
    
         p = 1 - exp(-sick_days/52)
    

Which is the cumulative probability distribution for an exponentially
distributed random variable with λ=1/52. So it seems that the number of sick
days is approximately exponentially distributed with an average of about 1
week.

It's also satisfying to see that the constant "236" almost completely follows
from this assumption. There was still a minor difference, but that can be
removed by assuming that the number of sick_days is at least 4.37.
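
A rough way to sanity-check the argument above is to simulate it. The sketch below is not from the blog post: it draws exponentially distributed totals with mean 52, sorts them in descending order, and fits `y = a*ln(rank) + b` by ordinary least squares. The function name `fit_rank_curve` and its parameters are made up for illustration. If the reasoning holds, the slope should come out near -52 and the intercept near 52 * ln(N).

```python
import math
import random


def fit_rank_curve(n_employees, mean_sick, seed=0):
    """Simulate exponential sick totals, sort descending, fit y = a*ln(rank) + b."""
    rng = random.Random(seed)
    # random.expovariate takes the rate lambda = 1 / mean.
    totals = sorted(
        (rng.expovariate(1.0 / mean_sick) for _ in range(n_employees)),
        reverse=True,
    )
    # Rank 1 is the employee with the most sick time, as in the original plot.
    xs = [math.log(rank) for rank in range(1, n_employees + 1)]
    x_mean = sum(xs) / n_employees
    y_mean = sum(totals) / n_employees
    # Ordinary least squares for a single predictor.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, totals))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept
```

With only 86 employees the fitted slope is quite noisy from one simulated company to the next, so a larger `n_employees` gives a cleaner picture of the relationship.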

~~~
pash
Good catch that the additive constant is approximately equal to _52_ *
_ln(86)_. I like your model. It makes sense, it's simpler, and most
importantly it's different enough from mine that it gives us a decent way to
test which model reflects reality better: fit data drawn from different
numbers of employees' sick-day history. My model is independent of the number
of employees in the set and yours isn't, so that should tell us something.

For those of you following along, note that it's often rather difficult to
tell what the "real" model is, even assuming that there is one and that we've
somehow managed to discover it. Like many other pairs of distributions, the
exponential and log-normal distributions are quite similar, both in how they
fit data and in the intuition behind them. There are choices of parameters
that make an exponential and log-normal distribution look about the same, so
without good theoretical justifications for which parameters to choose,
there's little reason to prefer one over the other. Each can be thought of as
giving a time to failure (here, the time between sick days taken). The log-
normal has two parameters and the exponential has only one, so if neither
model is "correct", it's likely that we could find a choice of parameters that
makes the log-normal fit even if the exponential doesn't. The two
distributions differ most in the tails, but we would need gobs of data to see
how the tail probabilities work out.
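
To make the point about tails concrete, here is a small sketch comparing survival probabilities P(T > t) for an exponential and a log-normal. Matching the two by mean and variance is my own assumption for illustration; it is only one of many ways to pick "similar" parameters, and the function names are hypothetical.

```python
import math


def exponential_survival(t, mean):
    """P(T > t) for an exponential distribution with the given mean."""
    return math.exp(-t / mean)


def lognormal_survival(t, mu, sigma):
    """P(T > t) for a log-normal with log-scale parameters mu and sigma."""
    z = (math.log(t) - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def lognormal_matching_moments(mean, variance):
    """Return (mu, sigma) of the log-normal with the given mean and variance."""
    sigma_squared = math.log(1.0 + variance / mean ** 2)
    mu = math.log(mean) - sigma_squared / 2.0
    return mu, math.sqrt(sigma_squared)
```

An exponential with mean 1 has variance 1, so `lognormal_matching_moments(1.0, 1.0)` gives the matched log-normal. Near t = 1 the two survival probabilities are close, while at t = 10 the log-normal tail is more than ten times heavier, which is the kind of difference that takes a lot of data to observe.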

If other employers with some statistical expertise want to weigh in with their
own employee datasets, please do!

Edit: By the way, I think this back-and-forth is a good example of the
benefits of dialectical reasoning (that is, a conversational process of trying
to home in on the truth, with a real interlocutor to converse with). I came up
with my initial model only because the two constants seemed curiously
coincidental with the number of weeks and working days in the year;
contravariant's model throws one of those away, which is fine, but I likely
would not have proposed my model in the first place if only one constant
looked suggestive—a singular "52" does not really get the modeling juices
flowing. And I suspect that contravariant would not have thought up his model
if he hadn't seen mine. As someone who mostly works alone, I wish I had more
of these sorts of dialogues about the problems I work on.

Edit 2: On re-reading my grandparent post, which I can no longer edit, I just
noticed that I made a very fundamental mistake that means my model is
certainly incorrect. To see my error, start with my third equation, which is
obviously correct (except that there can't be any noise). Then plug in
_sick_days_ as defined in the first equation, which comes directly from the
blog post. Now try to derive my second equation. ... contravariant did not
make this mistake.

~~~
sulam
you guys can go home now, you've just made HN worth the visit for me for the
rest of the month. Great discussion!

------
bane
Slightly off topic, the sickest I've ever been was working in a large open-
office with people who frequently traveled. I didn't travel as much, but I was
almost constantly unwell. It was also a highly stressful job. The combination
of dedicated workers, open floor plan and travel seemed to bring in just about
every kind of transmittable disease possible. The high stress seemed to
suppress my immune system. I think at one point I spent a solid 6-7 weeks up
and down with some kind of cold/virus/flu/infection.

My last job had me working from home most of the time. In two years I think I
took sick twice for a couple of days.

My current job has me in a shared office or in a shared team room (at my
discretion), very little travel, stress is far lower and in the few months
I've been here I've only taken sick once, for just a couple days...and I think
it was my wife who gave it to me (picked up in her open office full of high
stress frequent travelers).

------
calinet6
Absolutely fascinating (*pending seeing the chart of time normalized by
employee tenure, which would be more interesting all things considered).

In any case, this type of thinking shows one facet of the application of
statistics to people. There are so many natural states of variation we take
for granted. Where you might have thought that certain employees took sick
time more often than others due to differences of integrity or work ethic in
their individual control, in fact (if the assumption about this being an
expected rate of sickness for a population holds) it's a result of systemic
effects.

Interesting question: what other facets of human performance and habits are
systemic, instead of individual? What are some other ways groups exhibit
predictable, natural distributions?

W. Edwards Deming once said, "I should estimate that in my experience most
troubles and most possibilities for improvement add up to the proportions
something like this: 94% belongs to the system (responsibility of management),
6% special."

By this, he meant that 94% (and he probably thought that number out quite
thoroughly) of potential to improve lies with the systems surrounding people,
not in their individual contribution and ability alone.

I find this true in any organization of sufficient size. If it tends to be
true for sick days, what else is it true for, and what would you gain by
thinking about more things in this way? Interesting to think about.

~~~
tajen
This is an alternative wording of the 6-rat experiment, with one being the
leader, one the leader's right arm, 2 workers, 1 independent and 1 black
sheep. I definitely have the same feeling about my time in companies:
Someone has to speak up and get fired, someone else with the same strong
character but better communication becomes the leader, someone's hard work
finds a positive feedback loop in being the boss' preference, the others work
for money. The most costly for all parties is the black sheep: yielding
neither appreciable nor appreciated results, and also a security risk
(employment laws, data breach, PR mistake...).

Is there any way in management to avoid transforming the worst person into a
black sheep? If he knows he's the worst, he'll be angry anyway and fail,
right?

~~~
calinet6
Reduce the individual focus and break the cycle. The work of the team as a
system is what matters, not each individual contribution. Individuals may even
be sub-optimized at a personal level in order to make a working system, and we
must recognize and even reward that type of contribution as well.

Look at Deming's management ideas, and look at Lean management concepts (the
modern version of Deming's system). Both provide good answers to these
organizational problems.

------
keithpeter
Just wondering how the sick days are distributed into spells.

In the UK, the Bradford Factor

B = S^2*D

Where D is total days sick in a given year and S is the number of separate
'spells' of sickness in that year (so two days off for a cold counts as just
one spell as does 12 days off after minor elective surgery) is used in some
large organisations to detect patterns of one or two days off.
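
As a minimal sketch of the formula above (the function name and the list-of-spell-lengths input are my own choices for illustration, not anything an employer necessarily uses):

```python
def bradford_factor(spell_lengths):
    """Bradford Factor B = S^2 * D.

    spell_lengths: one entry per separate spell of absence in the year,
    each entry being the number of days in that spell.
    """
    spells = len(spell_lengths)      # S: number of separate spells
    total_days = sum(spell_lengths)  # D: total days sick
    return spells ** 2 * total_days
```

One 12-day spell gives `bradford_factor([12]) == 12`, while the same 12 days taken as six 2-day spells gives `bradford_factor([2] * 6) == 432`, which shows how heavily the formula weights frequent short absences.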

~~~
noir_lord
Staples definitely used to use it here in the UK, not sure if they still do.

~~~
keithpeter
I teach basic maths to evening class students. The Bradford factor is alive
and kicking in a number of local employers both admin and manufacturing and
evokes a _lively_ discussion whenever I use it as an example of a formula. A
high Bradford factor tends to lead to a 'chat' with the manager in most
organisations as it points to a pattern of short spells of absence.

Most local employers disregard absence for serious illness or for things like
dentist, repeat medical appointments, childcare problems &c interestingly.

~~~
noir_lord
At Staples it was purely used to track absences (when you have 20 part-time
staff in a store, just remembering who has been off and when over a few months
becomes difficult). Most managers would just use it as a reminder to have a
word with staff as well.

If the person had a good reason then it would be ignored. For such a simple
approach it worked pretty well.

------
carlerickson
Per suggestions in the comments I've plotted a histogram of sick hours / year,
and run a simple comparison against Benford's Law.

The raw data is now available for download at the bottom of the original post:
[https://greatnotbig.com/2015/03/sick-time-logarithmic/](https://greatnotbig.com/2015/03/sick-time-logarithmic/)

------
blahedo
His data appears to be Zipfian:

[https://en.wikipedia.org/wiki/Zipf%27s_law](https://en.wikipedia.org/wiki/Zipf%27s_law)

In this context, it's normal to graph with the x axis that is the "index in
sorted list" as the author has done, although somewhat more common to make the
y axis logarithmic.

~~~
christopheraden
Depends on whether you represent sick-time as continuous or discrete. Zipf's
Law takes support on the integers, whereas the Exponential takes support on
all positive reals.

------
quizotic
I looked at sick time a couple of years back and noticed that when a
frequently sick person stayed home sick, another frequently sick person
followed suit within a day or two. I wonder if anybody has a Pearson
correlation on that.

~~~
marcosdumay
If it's the flu, it does make groups sick, in sequence, with a couple of days
of difference.

------
QuantumRoar
You can sort all kinds of things that follow an underlying probability
distribution in order to find some kind of function to describe the behavior.
That's not something I'd consider new (or interesting).

What is more interesting, though, is what kind of probability distribution it
follows. From that, it is straightforward to figure out other representations
(as the sorted plot by the author). But it's kinda annoying to figure that out
from that plot in reverse since I'd need to calculate cumulative distribution
functions.
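
For what it's worth, going from the sorted plot back to an empirical cumulative distribution is mostly bookkeeping. A minimal sketch, assuming the values are sorted in descending order as in the post and ignoring ties (the function name is hypothetical):

```python
def empirical_cdf_from_sorted(values_desc):
    """Turn a descending-sorted list into (value, fraction with less) pairs.

    With 1-based rank x in a list of n values, n - x entries are smaller,
    so the fraction of employees with less sick time is (n - x) / n.
    """
    n = len(values_desc)
    return [
        (value, (n - rank) / n)
        for rank, value in enumerate(values_desc, start=1)
    ]
```

For example, the made-up list `[40, 16, 8, 0]` maps to the fractions 0.75, 0.5, 0.25 and 0.0 respectively.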

From what others have posted the log-function fit means the probability
distribution is exponential (since the cumulative distribution function is
exponential). I wouldn't know, maybe a Gaussian is a better fit, exponential
seems unnatural, but I'd need to sit down and compute error functions in order
to figure that out from that plot.

It would've been great if the author would have made the effort to exclude the
obvious guess that the sickness probability distribution function is a
Gaussian distribution.

------
christopheraden
Thank you for posting the data--it makes it easier for us to follow along at
home.

I've written something up where I used tenure instead of the ranks.
[http://christopheraden.github.io/SickTime.html](http://christopheraden.github.io/SickTime.html).

Here’s my concern with the original model, having slept on it: What does it
actually buy you? In order to use the model, you must know the rank of sick
time of the employee, relative to all other employees. In order to calculate
this, you have to have the raw sick time numbers. At which point, what’s the
point in making a regression with it—just work with the raw sick times to
begin with!

Wouldn't sick time taken per year be a more interesting measure? Over time,
wouldn't you expect that an employee of 30 years would've taken more sick time
than an employee of six months? Aren't you really concerned about whether in a
particular year older employees are not taking as much vacation as newer
employees?

~~~
carlerickson
It's fun to read your much more sophisticated analysis. As to the excellent
question of what utility my study has, the truth is that I didn't expect any.
I didn't look at the data thinking I had a problem to solve, or that there
were people abusing our sick time policy. I was just curious.

We averaged about 2.5 sick days / employee / year over the last 12 months.
That's quite a bit lower than some reports I've read. It also happens to be
exactly the number I picked 10 years ago when I was first making an economic
model of my company.

------
christopheraden
This is an interesting application of statistics! I'm surprised how well the
sick-time ranking correlates with the sick times. Intuitively, I'd imagine
we'd expect correlation between the ranks and the raw values (your X axis is
just an ordinal version of the Y axis), but this is still quite high. I
imagine that's a function of the underlying distribution. Your data could be
conveyed effectively with a histogram, and I think that would paint an
interesting picture.

Is there a picture associating the sick-time and the tenure? You mention it
increases the R^2, but don't have an associated graph. Maybe you could post a
scatterplot of Sick-Time and Tenure and show that logarithmic relationship?

Would you be able to provide anonymized versions of the data, or something
like a table with (ID, Sick Time, Years of Tenure)? I'd love to play with it,
and try a QQ plot. My email's in my profile.

------
wtbob
> If my theory is true, then I can also say that the consistency of the data
> backs up my own experience and belief, which is that Atoms aren’t using sick
> time inappropriately or with selfish intention. If they were, I wouldn’t see
> such a high degree of conformance of the data to a natural log model.

I don't think that necessarily follows: it might be that willingness-to-
cheat/malingering/hypochondria is distributed in such a fashion as to generate
this log model.

------
bitJericho
The author concludes that naturally people get sick on a log scale. I find
this very dubious without any data to back it up. There could be a number of
other reasons why this is occurring. Confirming this occurrence with other
companies will not prove anything either.

Perhaps the author should check actual health statistics and try to find some
correlations with actual evidence.

------
jakejake
It would be fascinating to also see data points for "actually sick" vs "called
in sick"! Of course we would never be able to know that for sure.

This post somewhat got me because I have missed the last two days of work with
a wicked cold.

~~~
araes
It might actually be two exponential distributions superimposed over one
another. One for folks who are taking sick days when they are actually sick,
and a second exponential of those who are cheating, which may have a lower
amplitude for a fairly incentivized system (which this is said to be).

~~~
sleazebreeze
Furthermore, there would need to be some sort of correction for employees who
are sick, but don't or can't take the sick time they need.

------
spacehome
Looking at the same data, I would have said power law.

------
amelius
Conversely, this means that if your company's sick time isn't logarithmic, it
has a suspicious hiring policy.

~~~
jacalata
Or it adds artificial factors on how the sick time is used, like limited
allocation of sick time or an expiry date on sick time or combined
sick/vacation time.

------
lutorm
That does not look like a logarithmic function at all to me. Am I missing
something?

~~~
christopheraden
Try graphing y = -1 * log(x) and imposing a limit on the upper bound of x and
you'll get close to what he has. Perhaps that's the angle he's coming from. He
provided the fitted equation further down in the featured article, and the log
term does have a negative coefficient, plus an intercept term.

The graph he plots looks like the data fits the Exponential Distribution:
[http://en.wikipedia.org/wiki/Exponential_distribution](http://en.wikipedia.org/wiki/Exponential_distribution)

~~~
skj
It screams exponential at me, especially given a potential underlying model
where every sick person has a .x probability of getting every individual they
work with sick. As the number of individuals goes up with no change in the
rate of sickness from outside the office, the number of sick people should go
up exponentially (as with any multiplicative process).

Edit: actually I think I completely misinterpreted the data. Now that I look
more closely, I have no idea what the X axis is for. I assumed it was number
of employees in a company whose sick time was somehow represented by bar
height, but is it just a list of all employees sorted by how much sick time
was taken?

If so, this is probably an example of a normal distribution with an
exponential tail.

~~~
JonathonW
I'm pretty sure it's just a list of employees sorted by how much sick time is
taken, so the X-axis is an "employee index number".

More interesting (and pertinent when trying to find a pattern in this data)
would be a histogram for sick time taken. Trying to fit a curve to the graph
as-is isn't useful, because the X-axis doesn't represent anything meaningful.

~~~
phrixus
This is my thought as well. So you fit a curve to a sorted list of each
employee's sick time. Does this give you any additional insight? So it follows
a log function. Does that mean anything?

If you do a histogram and fit a function you get something that could
conceivably be interpreted as a probability distribution function, you might
be able to say something about predicting the sick time a given employee will
take and the uncertainty of your prediction.
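
A histogram of that kind takes only a few lines. This is a sketch with a hypothetical helper name and a fixed bin width chosen by the caller; it is not how the author's chart was produced.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count values into fixed-width bins, keyed by each bin's lower edge."""
    counts = Counter(int(value // bin_width) for value in values)
    return {index * bin_width: counts[index] for index in sorted(counts)}
```

With the made-up sick-hour totals `[0, 3, 8, 12, 15, 40]` and a bin width of 8, this returns `{0: 2, 8: 3, 40: 1}`. Dividing each count by the number of employees would give a rough empirical probability for each bin.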

But I honestly don't see what visualizing the data in the method of the post,
or fitting a function to it contributes. Hope that doesn't violate the new no
negativity policy of HN.

