sick_days = -52 * log(x) + 236 .
sick_days = weeks * sick_days_per_week(x) + healthy_days + noise ,
days = healthy_days + sick_days(x) + noise .
sick_days_per_week = log(employee_index) .
r = mu + sigma * Z ,
This checks out intuitively, since this model comports with one in which the number of employee-sick days in any timeframe is independent of the number taken in previous periods. And the author says his sick-day policy is set up to incentivize taking sick days only if you're actually sick, not because you've accumulated them, etc. So the data look like what we would expect if the policies are working as they should. (Notice also that this model eliminates the need to control for employee tenure.)
So hopefully the author will run the data and see whether Z passes statistical tests for standard normality, then tell us what mu and sigma are. We might have a nice model for employee sick days and some solid empirical results to go along with it!
sick_days = -52 ln(x) + 236
To see this, first note that the employee index "x" is equal to N(1 - p), where N is the number of employees and p is the proportion of employees with fewer sick days. So if we invert the formula we can calculate the proportion of employees with less than a certain number of sick days, which is basically the cumulative distribution function of the number of sick days.
Now let's try to invert the formula. First we simplify a bit:
-52 ln(x) + 236 = -52 ln(N(1-p)) + 236 = -52 ln(1-p) + 236 - 52 ln(N)
sick_days = -52 ln(1-p)
p = 1 - exp(-sick_days/52)
It's also satisfying to see that the constant "236" almost completely follows from this assumption. There was still a minor difference, but that can be removed by assuming that the number of sick_days is at least 4.37.
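For anyone who wants to check the inversion numerically, here is a minimal sketch. The constants 52 and 236 come from the fitted curve; the employee count N is not stated anywhere in the thread, so N = 86 is purely my assumption, picked because it puts the offset 236 - 52 ln(N) close to the 4.37 mentioned above. The function names are made up for illustration.

```python
import math

K = 52.0   # slope of the fitted curve (from the post)
M = 236.0  # intercept of the fitted curve (from the post)
N = 86     # ASSUMPTION: number of employees, not given in the thread

def sick_days_from_rank(x):
    """Fitted sick days for the employee at rank x (1 = most sick days)."""
    return -K * math.log(x) + M

# Constant left over after substituting x = N(1 - p): 236 - 52 ln(N).
OFFSET = M - K * math.log(N)

def proportion_below(sick_days):
    """Exponential CDF with scale K, shifted by OFFSET."""
    return 1.0 - math.exp(-(sick_days - OFFSET) / K)

# Round trip: rank -> sick days -> p should reproduce p = 1 - x/N.
round_trip_error = max(
    abs(proportion_below(sick_days_from_rank(x)) - (1.0 - x / N))
    for x in range(1, N + 1)
)
print(f"offset = {OFFSET:.2f}, max round-trip error = {round_trip_error:.2e}")
```

If the round-trip error is at floating-point level, the log fit over ranks and the shifted exponential CDF are the same statement written two ways.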
sick_hours = -52 * ln(x) + 236,
p = 1 - exp(- sick_hours/52).
For those of you following along, note that it's often rather difficult to tell what the "real" model is, even assuming that there is one and that we've somehow managed to discover it. Like many other pairs of distributions, the exponential and log-normal distributions are quite similar, both in how they fit data and in the intuition behind them. There are choices of parameters that make an exponential and log-normal distribution look about the same, so without good theoretical justifications for which parameters to choose, there's little reason to prefer one over the other. Each can be thought of as giving a time to failure (here, the time between sick days taken). The log-normal has two parameters and the exponential has only one, so if neither model is "correct", it's likely that we could find a choice of parameters that makes the log-normal fit even if the exponential doesn't. The two distributions differ most in the tails, but we would need gobs of data to see how the tail probabilities work out.
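To make the "look about the same" point concrete, here is a rough sketch that compares the two CDFs. Matching the log-normal to the exponential by mean and variance is my own choice of parameters, not something proposed in the thread, and the scale of 52 is borrowed from the fitted curve. Other matching rules would give different gaps.

```python
import math

def exponential_cdf(x, scale):
    """CDF of an exponential distribution with the given scale (mean)."""
    return 1.0 - math.exp(-x / scale) if x > 0 else 0.0

def lognormal_cdf(x, mu, sigma):
    """CDF of a log-normal distribution with log-mean mu and log-sd sigma."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))

def moment_matched_lognormal(mean, variance):
    """Log-normal (mu, sigma) with the requested mean and variance."""
    sigma2 = math.log(1.0 + variance / mean ** 2)
    mu = math.log(mean) - sigma2 / 2.0
    return mu, math.sqrt(sigma2)

scale = 52.0  # ASSUMPTION: exponential scale taken from the fitted slope
# An exponential with scale s has mean s and variance s^2.
mu, sigma = moment_matched_lognormal(mean=scale, variance=scale ** 2)

grid = [5.0 * i for i in range(1, 61)]  # 5, 10, ..., 300
gaps = [abs(exponential_cdf(x, scale) - lognormal_cdf(x, mu, sigma)) for x in grid]
max_gap = max(gaps)
print(f"largest CDF gap on the grid: {max_gap:.3f}")
```

With only a few dozen employees, a gap of this size in the CDF may be hard to tell apart from sampling noise, which is the practical problem described above.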
If other employers with some statistical expertise want to weigh in with their own employee datasets, please do!
Edit: By the way, I think this back-and-forth is a good example of the benefits of dialectical reasoning (that is, a conversational process of trying to home in on the truth, with a real interlocutor to converse with). I came up with my initial model only because the two constants seemed curiously coincidental with the number of weeks and working days in the year; contravariant's model throws one of those away, which is fine, but I likely would not have proposed my model in the first place if only one constant looked suggestive. A singular "52" does not really get the modeling juices flowing. And I suspect that contravariant would not have thought up his model if he hadn't seen mine. As someone who mostly works alone, I wish I had more of these sorts of dialogues about the problems I work on.
Edit 2: On re-reading my grandparent post, which I can no longer edit, I just noticed that I made a very fundamental mistake that means my model is certainly incorrect. To see my error, start with my third equation, which is obviously correct (except that there can't be any noise). Then plug in sick_days as defined in the first equation, which comes directly from the blog post. Now try to derive my second equation. ... contravariant did not make this mistake.
My last job had me working from home most of the time. In two years I think I took sick twice for a couple of days.
My current job has me in a shared office or in a shared team room (at my discretion), very little travel, stress is far lower and in the few months I've been here I've only taken sick once, for just a couple days...and I think it was my wife who gave it to me (picked up in her open office full of high stress frequent travelers).
In any case, this type of thinking shows one facet of the application of statistics to people. There are so many natural states of variation we take for granted. Where you might have thought that certain employees took sick time more often than others due to differences of integrity or work ethic in their individual control, in fact (if the assumption about this being an expected rate of sickness for a population holds) it's a result of systemic effects.
Interesting question: what other facets of human performance and habits are systemic, instead of individual? How are some other ways groups exhibit predictable, natural distributions?
W. Edwards Deming once said, "I should estimate that in my experience most troubles and most possibilities for improvement add up to the proportions something like this: 94% belongs to the system (responsibility of management), 6% special."
By this, he meant that 94% (and he probably thought that number out quite thoroughly) of potential to improve lies with the systems surrounding people, not in their individual contribution and ability alone.
I find this true in any organization of sufficient size. If it tends to be true for sick days, what else is it true for, and what would you gain by thinking about more things in this way? Interesting to think about.
Is there any way in management to avoid transforming the worst person into a black sheep? If he knows he's the worst, he'll be angry anyway and fail, right?
Look at Deming's management ideas, and look at Lean management concepts (the modern version of Deming's system). Both provide good answers to these organizational problems.
In the UK, some large organisations use the Bradford Factor
B = S^2*D
to detect patterns of one or two days off, where D is the total days sick in a given year and S is the number of separate 'spells' of sickness in that year (so two days off for a cold counts as just one spell, as does 12 days off after minor elective surgery).
Interestingly, most local employers disregard absence for serious illness or for things like dentist visits, repeat medical appointments, childcare problems, etc.
If the person had a good reason then it would be ignored; for such a simple approach it worked pretty well.
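A tiny sketch of the Bradford Factor as defined above, to show how strongly it penalises many short absences. The function name and the example absence patterns are mine, for illustration only.

```python
def bradford_factor(spells, total_days):
    """Bradford Factor B = S^2 * D for one employee over one year.

    spells     -- S, the number of separate spells of absence
    total_days -- D, the total days absent across all spells
    """
    if spells < 0 or total_days < 0:
        raise ValueError("spells and total_days must be non-negative")
    return spells ** 2 * total_days

# Same 12 days off, taken two different ways (hypothetical examples).
one_long_spell = bradford_factor(spells=1, total_days=12)
six_short_spells = bradford_factor(spells=6, total_days=12)
print(one_long_spell, six_short_spells)
```

Because S is squared, the same number of days scores far higher when split into many short spells than when taken as one block.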
The raw data is now available for download at the bottom of the original post: https://greatnotbig.com/2015/03/sick-time-logarithmic/
In this context, it's normal to graph with the x axis being the "index in sorted list" as the author has done, although it's somewhat more common to make the y axis logarithmic.
What is more interesting, though, is what kind of probability distribution it follows. From that, it is straightforward to figure out other representations (such as the author's sorted plot). But it's kinda annoying to figure that out from that plot in reverse, since I'd need to calculate cumulative distribution functions.
From what others have posted, the log-function fit means the probability distribution is exponential (since the cumulative distribution function is exponential). I wouldn't know; maybe a Gaussian is a better fit. Exponential seems unnatural, but I'd need to sit down and compute error functions in order to figure that out from that plot.
It would've been great if the author had made the effort to rule out the obvious guess that the sickness probability distribution function is a Gaussian distribution.
I've written something up where I used tenure instead of the ranks.
Here’s my concern with the original model, having slept on it: What does it actually buy you? In order to use the model, you must know the rank of sick time of the employee, relative to all other employees. In order to calculate this, you have to have the raw sick time numbers. At which point, what’s the point in making a regression with it—just work with the raw sick times to begin with!
Wouldn't sick time taken per year be a more interesting measure? Over time, wouldn't you expect that an employee of 30 years would've taken more sick time than an employee of six months? Aren't you really concerned about whether in a particular year older employees are not taking as much vacation as newer employees?
We averaged about 2.5 sick days / employee / year over the last 12 months. That's quite a bit lower than some reports I've read. It also happens to be exactly the number I picked 10 years ago when I was first making an economic model of my company.
Is there a picture associating the sick-time and the tenure? You mention it increases the R^2, but don't have an associated graph. Maybe you could post a scatterplot of Sick-Time and Tenure and show that logarithmic relationship?
Would you be able to provide anonymized versions of the data, or something like a table with (ID, Sick Time, Years of Tenure)? I'd love to play with it, and try a QQ plot. My email's in my profile.
I don't think that necessarily follows: it might be that willingness-to-cheat/malingering/hypochondria is distributed in such a fashion as to generate this log model.
Perhaps the author should check actual health statistics and try to find some correlations with actual evidence.
This post somewhat got me because I have missed the last two days of work with a wicked cold.
The graph he plots looks like the data fits the Exponential Distribution: http://en.wikipedia.org/wiki/Exponential_distribution
Edit: actually I think I completely misinterpreted the data. Now that I look more closely, I have no idea what the X axis is for. I assumed it was number of employees in a company whose sick time was somehow represented by bar height, but is it just a list of all employees sorted by how much sick time was taken?
If so, this is probably an example of a normal distribution with an exponential tail.
More interesting (and pertinent when trying to find a pattern in this data) would be a histogram for sick time taken. Trying to fit a curve to the graph as-is isn't useful, because the X-axis doesn't represent anything meaningful.
If you do a histogram and fit a function, you get something that could conceivably be interpreted as a probability distribution function. From that, you might be able to say something about predicting the sick time a given employee will take and the uncertainty of your prediction.
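As a rough sketch of what that histogram step might look like: the values below are not the real data, they are read off the fitted curve -52 ln(x) + 236 for an assumed N = 86 ranks (N is my guess, not from the post), and the bin width of 40 is arbitrary.

```python
import math

def fitted_sick_time(rank, k=52.0, m=236.0):
    """Sick time implied by the fitted curve for a given rank."""
    return -k * math.log(rank) + m

def histogram(values, bin_width):
    """Return (left_edge, count) pairs for fixed-width bins starting at min(values)."""
    lo = min(values)
    counts = {}
    for v in values:
        idx = int((v - lo) // bin_width)
        counts[idx] = counts.get(idx, 0) + 1
    return [(lo + i * bin_width, counts.get(i, 0)) for i in range(max(counts) + 1)]

N = 86            # ASSUMPTION: employee count, not given in the thread
BIN_WIDTH = 40.0  # arbitrary choice for illustration

values = [fitted_sick_time(x) for x in range(1, N + 1)]
bins = histogram(values, BIN_WIDTH)
for left, count in bins:
    print(f"{left:7.1f} .. {left + BIN_WIDTH:7.1f}: {'#' * count}")
```

With real per-employee totals in place of `values`, the bar heights divided by N times the bin width would give a crude density estimate to fit against.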
But I honestly don't see what visualizing the data in the method of the post, or fitting a function to it contributes. Hope that doesn't violate the new no negativity policy of HN.
x is the rank of the employee. Let N be the number of employees. We can generate a new observation from the model by generating an x' uniformly between 1 and N, and inserting it in the formula for y. Then p'=x'/N is a number between 1/N and 1, or if we round, between 0 and 1.
The generated observation will be distributed according to (convince yourself by looking at the submission's graph)
P(y' > y) = p, where y = -k ln(Np) + m
or, solving for p, p = exp((m-y)/k) / N. So
P(y' < y) = 1 - exp((m-y)/k) / N
This is the exponential distribution.
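A quick simulation sketch of the argument, for anyone who would rather see it than derive it. It draws p' uniformly on (0, 1] (the rounded version of x'/N), pushes it through y = -k ln(Np') + m, and compares the empirical tail with exp((m - y)/k) / N, keeping the factor k in the exponent. The values k = 52 and m = 236 come from the fitted curve; N = 86, the seed, the sample size and the threshold y0 are all my assumptions.

```python
import math
import random

def sample_sick_time(rng, n_employees, k=52.0, m=236.0):
    """Draw one observation by sampling p' uniformly on (0, 1]."""
    p = 1.0 - rng.random()  # random() is in [0, 1), so p is in (0, 1]
    return -k * math.log(n_employees * p) + m

def model_tail(y, n_employees, k=52.0, m=236.0):
    """P(y' > y) implied by the model, capped at 1."""
    return min(1.0, math.exp((m - y) / k) / n_employees)

rng = random.Random(0)  # fixed seed so the run is repeatable
N = 86                  # ASSUMPTION: employee count, not given in the thread
draws = [sample_sick_time(rng, N) for _ in range(100_000)]

y0 = 100.0              # arbitrary threshold for the comparison
empirical_tail = sum(d > y0 for d in draws) / len(draws)
predicted_tail = model_tail(y0, N)
print(f"empirical {empirical_tail:.4f} vs predicted {predicted_tail:.4f}")
```

The two numbers should agree to within sampling noise, which is what you would expect if the sorted-rank log fit and an exponential distribution are the same model.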
I don't have the data, but I superimposed an Excel graphic over the original graphic: http://imgur.com/L5f6CIa
The logarithmic fit looks better.