Sick time is logarithmic (greatnotbig.com)
114 points by micahalles on April 3, 2015 | 38 comments

There is probably something to this model. Notice first that the constants are suggestive. There are 52 working weeks in the year, and slightly more than 250 working days, and the author finds a good fit for

    sick_days = -52 * log(x) + 236 .
So the fitted curve can be interpreted as saying,

    sick_days = weeks * sick_days_per_week(x) + healthy_days + noise ,
or somewhat more straightforwardly,

    days = healthy_days + sick_days(x) + noise .
Since x is the rank (index) of an employee by the number of sick days taken, we have that

    sick_days_per_week = log(employee_index) .
So it seems that we should guess that sick_days_per_week is log-normally distributed across employees. That's the same as saying that employees take sick days at a rate

    r = mu + sigma * Z ,
where mu and sigma are the mean and standard deviation of sick_days_per_week and Z is a standard normal.

This checks out intuitively, since this model comports with one in which the number of employee-sick days in any timeframe is independent of the number taken in previous periods. And the author says his sick-day policy is set up to incentivize taking sick days only if you're actually sick, not because you've accumulated them, etc. So the data look like what we would expect if the policies are working as they should. (Notice also that this model eliminates the need to control for employee tenure.)

So hopefully the author will run the data and see whether Z passes statistical tests for standard normality, then tell us what mu and sigma are. We might have a nice model for employee sick days and some solid empirical results to go along with it!

From his claim that

    sick_days = -52 ln(x) + 236 
It would be more straightforward to conclude that the number of sick days is exponentially distributed.

To see this, first note that the employee index "x" is equal to N(1 - p), where N is the number of employees and p is the proportion of employees with fewer sick days. So if we invert the formula we can calculate the proportion of employees with less than a certain number of sick days, which is basically the cumulative probability distribution of the number of sick days.

Now let's try to invert the formula. First we simplify a bit:

    -52 ln(x) + 236 = -52 ln(N(1-p)) + 236 = -52 ln(1-p) + 236 - 52 ln(N)
Now if we fill in N = 86, we see that 52 ln(N) is 231.626 which looks suspiciously like 236, so let's assume that they're the same for now. This gives us:

     sick_days = -52 ln(1-p)
Or in other words

     p = 1 - exp(-52*sick_days)
Which is the cumulative probability distribution for an exponentially distributed random variable with λ=1/52. So it seems that the number of sick days is approximately exponentially distributed with an average of about 1 week.

It's also satisfying to see that the constant "236" almost completely follows from this assumption. There was still a minor difference, but that can be removed by assuming that the number of sick_days is at least 4.37.

The author claims that his fit is:

   sick_hours = -52 * ln(x) + 236,
not sick days. Furthermore, your conclusion should read

   p = 1 - exp(- sick_hours/52).
Then, the units in the exp() cancel and everyone is happy. But I assume that's what you meant by stating lambda=1/52.
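The arithmetic in this sub-thread can be checked with a few lines of Python. This is only a sketch: the names `fitted_sick_hours` and `exponential_cdf` are made up for illustration, and N = 86 and the constants 52 and 236 are taken from the comments above, not from the raw data.

```python
import math

N = 86             # number of employees, as assumed in the parent comment
SCALE = 52.0       # coefficient of the log term in the fitted curve
INTERCEPT = 236.0  # additive constant in the fitted curve

def fitted_sick_hours(rank):
    """Fitted curve from the blog post: sick_hours = -52 * ln(rank) + 236."""
    return -SCALE * math.log(rank) + INTERCEPT

def exponential_cdf(sick_hours):
    """CDF of an exponential with scale 52, i.e. p = 1 - exp(-sick_hours / 52)."""
    return 1.0 - math.exp(-sick_hours / SCALE)

offset = SCALE * math.log(N)   # the "suspiciously close to 236" term
gap = INTERCEPT - offset       # the leftover constant of about 4.37

print(f"52 * ln(86) = {offset:.3f}, leftover constant = {gap:.3f}")
```

Plugging rank = N(1 - p) into `fitted_sick_hours` gives -52 ln(1 - p) plus the leftover constant, which is the exponential quantile function shifted by roughly 4.37.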

Good catch that the additive constant is approximately equal to 52 * ln(86). I like your model. It makes sense, it's simpler, and most importantly it's different enough from mine that it gives us a decent way to test which model reflects reality better: fit data drawn from different numbers of employees' sick-day history. My model is independent of the number of employees in the set and yours isn't, so that should tell us something.

For those of you following along, note that it's often rather difficult to tell what the "real" model is, even assuming that there is one and that we've somehow managed to discover it. Like many other pairs of distributions, the exponential and log-normal distributions are quite similar, both in how they fit data and in the intuition behind them. There are choices of parameters that make an exponential and log-normal distribution look about the same, so without good theoretical justifications for which parameters to choose, there's little reason to prefer one over the other. Each can be thought of as giving a time to failure (here, the time between sick days taken). The log-normal has two parameters and the exponential has only one, so if neither model is "correct", it's likely that we could find a choice of parameters that makes the log-normal fit even if the exponential doesn't. The two distributions differ most in the tails, but we would need gobs of data to see how the tail probabilities work out.

If other employers with some statistical expertise want to weigh in with their own employee datasets, please do!

Edit: By the way, I think this back-and-forth is a good example of the benefits of dialectical reasoning (that is, a conversational process of trying to home in on the truth, with a real interlocutor to converse with). I came up with my initial model only because the two constants seemed curiously coincidental with the number of weeks and working days in the year; contravariant's model throws one of those away, which is fine, but I likely would not have proposed my model in the first place if only one constant looked suggestive; a singular "52" does not really get the modeling juices flowing. And I suspect that contravariant would not have thought up his model if he hadn't seen mine. As someone who mostly works alone, I wish I had more of these sorts of dialogues about the problems I work on.

Edit 2: On re-reading my grandparent post, which I can no longer edit, I just noticed that I made a very fundamental mistake that means my model is certainly incorrect. To see my error, start with my third equation, which is obviously correct (except that there can't be any noise). Then plug in sick_days as defined in the first equation, which comes directly from the blog post. Now try to derive my second equation. ... contravariant did not make this mistake.

you guys can go home now, you've just made HN worth the visit for me for the rest of the month. Great discussion!

Slightly off topic, the sickest I've ever been was working in a large open-office with people who frequently traveled. I didn't travel as much, but I was almost constantly unwell. It was also a highly stressful job. The combination of dedicated workers, open floor plan and travel seemed to bring in just about every kind of transmittable disease possible. The high stress seemed to suppress my immune system. I think at one point I spent a solid 6-7 weeks up and down with some kind of cold/virus/flu/infection.

My last job had me working from home most of the time. In two years I think I took sick twice for a couple of days.

My current job has me in a shared office or in a shared team room (at my discretion), very little travel, stress is far lower and in the few months I've been here I've only taken sick once, for just a couple days...and I think it was my wife who gave it to me (picked up in her open office full of high stress frequent travelers).

Absolutely fascinating (*pending seeing the chart of time normalized by employee tenure, which would be more interesting all things considered).

In any case, this type of thinking shows one facet of the application of statistics to people. There are so many natural states of variation we take for granted. Where you might have thought that certain employees took sick time more often than others due to differences of integrity or work ethic in their individual control, in fact (if the assumption holds that this is an expected rate of sickness for a population) it's a result of systemic effects.

Interesting question: what other facets of human performance and habits are systemic, instead of individual? How are some other ways groups exhibit predictable, natural distributions?

W. Edwards Deming once said, "I should estimate that in my experience most troubles and most possibilities for improvement add up to the proportions something like this: 94% belongs to the system (responsibility of management), 6% special."

By this, he meant that 94% (and he probably thought that number out quite thoroughly) of potential to improve lies with the systems surrounding people, not in their individual contribution and ability alone.

I find this true in any organization of sufficient size. If it tends to be true for sick days, what else is it true for, and what would you gain by thinking about more things in this way? Interesting to think about.

This is an alternative wording of the 6-rat experience, with one being the leader, one the leader's right arm, 2 workers, 1 independent and 1 black sheep. I definitely have the same feeling about my time in companies: Someone has to speak up and get fired, someone else with the same strong character but better communication becomes the leader, someone's hard work finds a positive feedback loop in being the boss' preference, the others work for money. The most costly for all parties is the black sheep: Yielding neither appreciable nor appreciated results, also a security risk (employment laws, data breach, PR mistake...).

Is there any way in management to avoid transforming the worst person into a black sheep? If he knows he's the worst, he'll be angry anyway and fail, right?

Reduce the individual focus and break the cycle. The work of the team as a system is what matters, not each individual contribution. Individuals may even be sub-optimized at a personal level in order to make a working system, and we must recognize and even reward that type of contribution as well.

Look at Deming's management ideas, and look at Lean management concepts (the modern version of Deming's system). Both provide good answers to these organizational problems.

Just wondering how the sick days are distributed into spells.

In the UK, the Bradford Factor

B = S^2*D

Where D is total days sick in a given year and S is the number of separate 'spells' of sickness in that year (so two days off for a cold counts as just one spell as does 12 days off after minor elective surgery) is used in some large organisations to detect patterns of one or two days off.
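A small Python sketch of the formula as stated above, B = S^2 * D. The function name and the list-of-spell-lengths input are assumptions made for illustration; real HR systems may count spells and days differently.

```python
def bradford_factor(spell_lengths):
    """Bradford Factor B = S^2 * D for one employee over one year.

    spell_lengths: list with the length in days of each separate spell
    of absence (hypothetical input format).
    """
    spells = len(spell_lengths)      # S: number of separate spells
    total_days = sum(spell_lengths)  # D: total days absent
    return spells ** 2 * total_days

# Same 12 days off, very different scores:
print(bradford_factor([12]))     # one 12-day spell  -> 12
print(bradford_factor([2] * 6))  # six 2-day spells  -> 432
```

The squared spell count is what makes frequent short absences score far higher than one long absence of the same total length.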

Staples definitely used to use it here in the UK, not sure if they still do.

I teach basic maths to evening class students. The Bradford factor is alive and kicking in a number of local employers both admin and manufacturing and evokes a lively discussion whenever I use it as an example of a formula. A high Bradford factor tends to lead to a 'chat' with the manager in most organisations as it points to a pattern of short spells of absence.

Most local employers disregard absence for serious illness or for things like dentist, repeat medical appointments, childcare problems &c interestingly.

At Staples it was purely used to track absences (when you have 20 part-time staff in a store, just remembering who has been off and when over a few months becomes difficult). Most managers would just use it as a reminder to have a word with staff as well.

If the person had a good reason then it would be ignored, for such a simple approach it worked pretty well.

Per suggestions in the comments I've plotted a histogram of sick hours / year, and run a simple comparison against Benford's Law.

The raw data is now available for download at the bottom of the original post: https://greatnotbig.com/2015/03/sick-time-logarithmic/

His data appears to be Zipfian:


In this context, it's normal to graph with the x axis that is the "index in sorted list" as the author has done, although somewhat more common to make the y axis logarithmic.

Depends on whether you represent sick-time as continuous or discrete. Zipf's Law takes support on the integers, whereas the Exponential takes support on all positive reals.

I looked at sick time a couple of years back and noticed that when a frequently sick person stayed home sick, another frequently sick person followed suit within a day or two. I wonder if anybody has a Pearson correlation on that.

If it's the flu, it does make groups sick, in sequence, with a couple of days of difference.

You can sort all kinds of things that follow an underlying probability distribution in order to find some kind of function to describe the behavior. That's not something I'd consider new (or interesting).

What is more interesting, though, is what kind of probability distribution it follows. From that, it is straightforward to figure out other representations (as the sorted plot by the author). But it's kinda annoying to figure that out from that plot in reverse since I'd need to calculate cumulative distribution functions.

From what others have posted, the log-function fit means the probability distribution is exponential (since the cumulative distribution function is exponential). I wouldn't know, maybe a Gaussian is a better fit, exponential seems unnatural, but I'd need to sit down and compute error functions in order to figure that out from that plot.

It would've been great if the author would have made the effort to exclude the obvious guess that the sickness probability distribution function is a Gaussian distribution.

Thank you for posting the data--it makes it easier for us to follow along at home.

I've written something up where I used tenure instead of the ranks. http://christopheraden.github.io/SickTime.html.

Here’s my concern with the original model, having slept on it: What does it actually buy you? In order to use the model, you must know the rank of sick time of the employee, relative to all other employees. In order to calculate this, you have to have the raw sick time numbers. At which point, what’s the point in making a regression with it—just work with the raw sick times to begin with!

Wouldn't sick time taken per year be a more interesting measure? Over time, wouldn't you expect that an employee of 30 years would've taken more sick time than an employee of six months? Aren't you really concerned about whether in a particular year older employees are not taking as much vacation as newer employees?

It's fun to read your much more sophisticated analysis. As to the excellent question of what utility my study has, the truth is that I didn't expect any. I didn't look at the data thinking I had a problem to solve, or that there were people abusing our sick time policy. I was just curious.

We averaged about 2.5 sick days / employee / year over the last 12 months. That's quite a bit lower than some reports I've read. It also happens to be exactly the number I picked 10 years ago when I was first making an economic model of my company.

This is an interesting application of statistics! I'm surprised how well the sick-time ranking correlates with the sick times. Intuitively, I'd imagine we'd expect correlation between the ranks and the raw values (your X axis is just an ordinal version of the Y axis), but this is still quite high. I imagine that's a function of the underlying distribution. Your data could be conveyed effectively with a histogram, and I think that would paint an interesting picture.

Is there a picture associating the sick-time and the tenure? You mention it increases the R^2, but don't have an associated graph. Maybe you could post a scatterplot of Sick-Time and Tenure and show that logarithmic relationship?

Would you be able to provide anonymized versions of the data, or something like a table with (ID, Sick Time, Years of Tenure)? I'd love to play with it, and try a QQ plot. My email's in my profile.

> If my theory is true, then I can also say that the consistency of the data backs up my own experience and belief, which is that Atoms aren’t using sick time inappropriately or with selfish intention. If they were, I wouldn’t see such a high degree of conformance of the data to a natural log model.

I don't think that necessarily follows: it might be that willingness-to-cheat/malingering/hypochondria is distributed in such a fashion as to generate this log model.

The author concludes that naturally people get sick on a log scale. I find this very dubious without any data to back it up. There could be a number of other reasons why this is occurring. Confirming this occurrence with other companies will not prove anything either.

Perhaps the author should check actual health statistics and try to find some correlations with actual evidence.

It would be fascinating to also see data points for "actually sick" vs "called in sick"! Of course we would never be able to know that for sure.

This post somewhat got me because I have missed the last two days of work with a wicked cold.

It might actually be two exponential distributions superimposed over one another. One for folks who are taking sick days when they are actually sick, and a second exponential of those who are cheating, which may have a lower amplitude for a fairly incentivized system (which this is said to be).

Furthermore, there would need to be some sort of correction for employees who are sick, but don't or can't take the sick time they need.

Looking at the same data, I would have said power law.

Conversely, this means that if your company's sick time isn't logarithmic, it has a suspicious hiring policy.

Or it adds artificial factors on how the sick time is used, like limited allocation of sick time or an expiry date on sick time or combined sick/vacation time.

That does not look like a logarithmic function at all to me. Am I missing something?

Try graphing y = -1 * log(x) and imposing a limit on the upper bound of x and you'll get close to what he has. Perhaps that's the angle he's coming from. He provided the fitted equation further down in the featured article, and the log term does have a negative coefficient, plus an intercept term.

The graph he plots looks like the data fits the Exponential Distribution: http://en.wikipedia.org/wiki/Exponential_distribution

It screams exponential at me, especially given a potential underlying model where every sick person has a .x probability of getting every individual they work with sick. As the number of individuals goes up with no change in the rate of sickness from outside the office, the number of sick people should go up exponentially (as with any multiplicative process).

Edit: actually I think I completely misinterpreted the data. Now that I look more closely, I have no idea what the X axis is for. I assumed it was number of employees in a company whose sick time was somehow represented by bar height, but is it just a list of all employees sorted by how much sick time was taken?

If so, this is probably an example of a normal distribution with an exponential tale.

I'm pretty sure it's just a list of employees sorted by how much sick time is taken, so the X-axis is an "employee index number".

More interesting (and pertinent when trying to find a pattern in this data) would be a histogram for sick time taken. Trying to fit a curve to the graph as-is isn't useful, because the X-axis doesn't represent anything meaningful.

This is my thought as well. So you fit a curve to a sorted list of each employee's sick time. Does this give you any additional insight? So it follows a log function. Does that mean anything?

If you do a histogram and fit a function you get something that could conceivably be interpreted as a probability distribution function, you might be able to say something about predicting the sick time a given employee will take and the uncertainty of your prediction.

But I honestly don't see what visualizing the data in the method of the post, or fitting a function to it contributes. Hope that doesn't violate the new no negativity policy of HN.

tale = tail. oy.

Since there is confusion in the sibling comments, I want to explain how y = -k ln(x) + m fits in with the exponential function. I am going to be a little sloppy with closed and open intervals and round a little.

x is the rank of the employee. Let N be number of employees. We can generate a new observation from the model by generating an x' uniformly between 1 and N, and inserting in the formula for y. Then p'=x'/N is a number between 1/N and 1, or if we round, between 0 and 1.

The generated observation will be distributed according to (convince yourself by looking at the submission's graph)

P(y' > y) = p, where y = -k ln(Np) + m

or solving for p, p = exp((m-y)/k) / N. So

P(y' < y) = 1 - exp((m-y)/k) / N

This is the exponential distribution.
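To see the same thing numerically, here is a rough Python sketch that draws exponentially distributed "sick hours", sorts them in descending order, and fits y = a * ln(rank) + b by ordinary least squares. The helper `rank_log_fit`, the scale of 52, the seed, and the sample sizes are all assumptions chosen for illustration; none of this uses the author's data.

```python
import math
import random

def rank_log_fit(values):
    """Least-squares fit of y = a * ln(rank) + b, with values sorted descending."""
    ys = sorted(values, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ys) + 1)]
    n = len(ys)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

rng = random.Random(0)  # fixed seed so repeated runs give the same numbers
k = 52.0                # assumed scale of the exponential (mean sick hours)

for n_employees in (86, 860, 8600):
    sample = [rng.expovariate(1.0 / k) for _ in range(n_employees)]
    a, b = rank_log_fit(sample)
    # Compare the fitted intercept b with k * ln(N)
    print(n_employees, round(a, 1), round(b, 1), round(k * math.log(n_employees), 1))
```

If the exponential picture is right, the fitted slope should land somewhere near -k and the intercept should grow roughly like k * ln(N) as the number of employees increases, although small samples will be noisy.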

I also thought that it's exponential, and as the other comment says it makes more sense.

I don't have the data, but I superimposed an Excel graphic over the original graphic: http://imgur.com/L5f6CIa

The logarithmic fit looks better.
