The example was latency. If the programmers were told to achieve less than 1 ms for 99% of the requests -- then sure enough -- the 1% of requests would have sky-high latencies of multiple seconds.
If told they needed to achieve 1 ms for 99% and 100 ms for 99.99%, then -- you guessed it -- the worst 0.01% would be tens of seconds or even minutes.
Inevitably, there would be a visible discontinuity in the latency histogram just above whatever the official business requirement was.
It's a difficult thing to fix, because no matter where you set your threshold, unless it's 100%, you won't meet it. And even 100% is just a fiction, because you'd need an infinite number of tests to achieve it.
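A quick sketch of why tail percentiles hide so much, with entirely made-up numbers (nothing here is from the talk): a distribution can look healthy at p99 while its p99.99 is catastrophic.

```python
import random

# Made-up latency distribution (ms); illustrative only, not from the talk.
random.seed(42)
latencies = [random.expovariate(2.0) for _ in range(100_000)]      # bulk: ~0.5 ms mean
latencies += [random.uniform(1_000, 5_000) for _ in range(1_000)]  # ~1% tail of multi-second stalls

def percentile(samples, q):
    """Nearest-rank percentile by sorting; fine for a sketch."""
    s = sorted(samples)
    return s[min(len(s) - 1, int(len(s) * q / 100))]

p99 = percentile(latencies, 99)
p9999 = percentile(latencies, 99.99)
# The p99 looks fine while the p99.99 is terrible -- exactly the shape
# you get when everyone optimises for the stated threshold.
print(f"p99    = {p99:.2f} ms")
print(f"p99.99 = {p9999:.2f} ms")
```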
It’s worth noting that this is a learned behavior which is not natural for many people.
There are plenty of people who will naturally think about a problem in context and want to solve for the problem and not just the metric.
Those people will be slower to satisfy management demands, which will be metric-driven, and in order to progress in their careers they will learn to focus on the requirement rather than the problem.
Typically the issue was caused by garbage collection. You can twiddle with the parameters to meet your 99% latency goal, and then fail your 1% spectacularly.
Similarly, anything involving the network would slowly be optimised to meet the "typical case" requirements, while the extremes would be terrible.
If I remember correctly, this was a talk by someone working in a real-time trading firm, where latency was a critical metric for all of their systems designs. He had a lot of charts with very visible upticks in latency at the "nice round numbers" where the requirements were set.
Nobody experiences the mean, but every outlier in your optimization will affect it.
The system in question was essentially not meant to have any backlog of requests under normal operation. But when overloaded, it was better for this particular system to serve the requests it could serve very fast, and simply shed the overflow load.
(I don't remember more specifics, and even if I could, I would probably not be allowed to give them..)
The other end gets far better information about congestion that way, and congestion control algorithms could be made much smarter.
Everyone says "jitter is bad in networks", but the reality is if the jitter is giving you network state information, it is a net positive - especially when you can use erasure coding schemes so the application need not see the jitter.
LIFO sounds annoying for bursty traffic, since you can only start processing the burst after the buffer has cleared.
This is the key feature of the CoDel solution to buffer bloat.
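A minimal sketch of the LIFO-plus-load-shedding idea discussed above, assuming a bounded buffer that drops the oldest waiting request when full (illustrative only; this is not the actual system from the talk, nor CoDel itself):

```python
from collections import deque

class LifoShedQueue:
    """Sketch of a LIFO queue with load shedding: the newest request is
    served first, and when the buffer is full the oldest waiting request
    is dropped. Under overload this keeps served latencies low at the
    cost of shedding the backlog."""

    def __init__(self, capacity):
        self.buf = deque()
        self.capacity = capacity
        self.shed = 0

    def enqueue(self, request):
        if len(self.buf) >= self.capacity:
            self.buf.popleft()   # shed the oldest waiting request
            self.shed += 1
        self.buf.append(request)

    def dequeue(self):
        return self.buf.pop() if self.buf else None  # LIFO: newest first

q = LifoShedQueue(capacity=3)
for r in range(5):               # burst of 5 requests into a 3-slot buffer
    q.enqueue(r)
print(q.dequeue(), q.shed)       # serves request 4 first; requests 0 and 1 were shed
```

Note the trade-off the parent comment mentions: a burst fills the buffer and the earliest arrivals wait longest (or get shed), but the requests that *are* served see low latency.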
This is on topic; there is a discontinuity there which is an example of the same type of thing the rest of the post talks about.
But it's not the biggest problem illustrated by that graph. The dot at "p is just barely less than 0.05" is an outlier. But it's an outlier from what is otherwise a regular pattern that clearly shows that smaller p-values are more likely to occur than larger ones are. That's insane. The way for that pattern to arise without indicating a problem would be "psychologists only investigate questions with very clear, obvious answers". I find that implausible.
This is an important distinction, in my experience.
Many papers will report a p-value only if it is below a significance threshold; otherwise they will report "n.s." (not significant) or give a range (e.g. p > .1). This means that in addition to the pressure to shelve insignificant results, publication bias also manifests as a tendency to emphasize and carefully report significant findings, while mentioning only in passing those that don't meet the threshold.
I happen to be working on a meta-analysis of psychology and public health papers at the moment. One paper that we're reviewing constructs 32 separate statistical models, reports that many of the results are not significant, and then discusses the significant results at length.
But the oddity here is a pronounced trend in the reported p-values that meet the significance threshold. The behavior you mention cannot create that trend.
It looks to me like the y-axis is measured in number of papers. The lower a p-value is, the more papers there are that happened to find a result beating the p-value.
So the chart implies that low p-values are a priori more likely to occur than high p-values. That is most certainly not true in general. We might guess that psychologists are fudging their p-values somehow, or that journals are much, much, much, much, much, much, much more likely to publish "chewing a stalk of grass makes you walk slower, p < 0.013" than they are to publish "chewing a stalk of grass makes you walk slower, p < 0.04".
I've emphasized the level of bias the journals would need to be showing -- over fine distinctions in a value that is most often treated as a binary yes or no -- because it is much easier to get p < 0.04 than it is to get p < 0.013.
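A small simulation of this point, under the assumption that every study tests a true null (so each p-value is uniform on [0, 1)) and journals filter only at 0.05: the surviving p-values come out roughly uniform below 0.05. A simple publish/don't-publish filter at the threshold cannot, by itself, produce a histogram that rises toward zero.

```python
import random

# Assumption for this sketch: every study tests a true null, so each
# p-value is uniform on [0, 1), and journals publish only p < 0.05.
random.seed(0)
published = [p for p in (random.random() for _ in range(200_000)) if p < 0.05]

# Bin the published p-values into five 0.01-wide bins across [0, 0.05).
bins = [0] * 5
for p in published:
    bins[min(4, int(p / 0.01))] += 1

# The counts come out roughly equal: a flat histogram, not one that
# rises toward zero the way the chart under discussion does.
print(bins)
```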
More generally, scientists are incentivised to find novel findings (i.e. unexpectedly low p-values) or lose their job.
Given that, the plot doesn't surprise me at all. (Also, people will normally not report a bunch of non-significant results, which is a similar but distinct problem.)
Are you saying that in other disciplines, the distribution of p-values in published papers does not follow this pattern?
Publishing introduces a systematic bias, because it's difficult to get published where p>0.05 (or whatever the disciplinary standard is).
That explains why the p-values above 0.05 are rare compared to values below 0.05. But it fails to explain why p-values above 0.02 are rare compared to values below 0.02.
Is that enough to tip the curve the other way across the range of p-values? Well, something is, and I am open to alternative suggestions.
One other point: while the datum immediately below 0.05 would normally be considered an outlier, the fact that it is next to a discontinuity (actual or perceived) renders that call less clear. Personally, I suspect it is not an accidental outlier, but given that it does not produce much distortion in the overall trend, I am less inclined to see the 0.05 threshold (actual or perceived) as a problem than I did before I saw this chart.
Don't be fooled by the line someone drew on the chart. There's no particular reason to view this as a smooth nonlinear relationship except that somebody clearly wanted you to do that when they prepared the chart.
I could describe the same data, with different graphical aids, as:
- uniform distribution ("75 papers") between an eyeballed p < .02 and p < .05
- large spike ("95 papers") at exactly p = 0.0499
- sharp decline between p < .05 and p < .06
- uniform distribution ("19 papers") from p < .06 to p < .10
- bizarre, elevated sawtooth distribution between p < .01 and p < .02
And if I describe it that way, the spike at .05 is having exactly the effect you'd expect, drawing papers away from their rightful place somewhere above .05. If the p-value chart were a histogram like all the others instead of a scatterplot with a misleading visual aid, it would look pretty similar to the other charts.
I think we are both, in our own ways, making the point that there is more going on here than the spike just below 0.05 - namely, the regular pattern that you identified in your original post. If we differ, it seems to be because I think it is explicable.
WRT p-values of 0.05: I almost, but did not, say that if you curve-fitted above and below 0.05 independently, there would be a gap between the two, and maybe even if you left out the value immediately below 0.05. No doubt that would also happen for other values, but I am guessing that this gap would peak at 0.05. If I have time in the near future, I may try it. If you do, and find that I am wrong, I will be happy to recant.
Don't throw the Seldon out with the bathwater. I think there is a very real chance that the effects psychologists investigate really are extremely probable in the society they study.
More like, "psychologists often publish results where questions have clear, obvious answers".
For publication bias alone to explain the trend, researchers would have to operate like this:
1. Choose a question to investigate.
2. Get some results.
3. Compute p < 0.03.
4. Toss the paper in the trash, because p < 0.03 isn't good enough.
But that's not how they operate. The reason there's a spike at 0.05 is that that's what everyone cares about. If you get p < 0.03, you're doing better than that!
So the bias in favor of even lower p-values is coming from somewhere else. It definitely is not coming from the decision point of "OK, I've done the research, but do I publish it?".
In all the systems I've been involved with since, "Discontinuous Behaviour" is a failure mode we've explicitly analysed, and "Graceful Degradation" is a technique we've often implemented.
In case people want to discuss that separately I've submitted it here:
The ideal solution presented here is instead that the tills should bottleneck the customers so that they collectively can't buy more items than the central computer is capable of logging in amortized real time. This preserves the ability to keep electronic track of everything the store sells. But it does it by preventing the store from selling its inventory to customers! Instead of a loss in real-time recordkeeping, we have a hard cap on the amount of money the store can earn during the Christmas season. And the reason we've put the cap in place is that otherwise we'd blow right past it!
That solution is so anomalous-seeming that I want to see a discussion of what exactly the store's goals are, and why refusing to sell your inventory to customers in December is a good idea for a retail store.
- Reconciling electronic inventory numbers with the real ones will cost more than the store will earn from uncapping the sales, and/or
- There's a potential legal/tax risk involved with having stock numbers be bad, and/or
- There's a risk of losing customers and reputation when stock of a product physically runs out while the tracking system isn't aware of this, and a bunch of customers have to have their orders cancelled because there's nothing to fulfill them with.
When moving up a range it seems more economical to revert to monthly pricing than to go annual. (Btw: that sucks!)
Or just don’t have phase-outs at all. What, exactly, is wrong with giving millionaires subsidies with similar dollar values as much poorer people? They’ll make it up in overall taxes paid anyway.
To put it in concrete terms: if you tried to give every household $7,000 of ACA subsidy, that would cost you $900B a year. That’s more than US military spending. Where would you get the extra taxes to cover that? There aren’t enough millionaires to pay for it (do the numbers here too, to see), so what you’d need to do is get back the $7,000 from households that make a “mere” $150-200k. You don’t want to slap on an extra $7k tax, because that would just be a cliff like the one you tried to avoid in the first place. Once you figure out how to set up your taxation to do that, you’ll find that it is conceptually and practically much simpler to just do phaseouts.
I realize that health care in particular is a horrible mess, due to most higher-income people getting insurance through employer group policies. That, in and of itself, is a problem, and fixing it would have the added benefit of removing a large disincentive to changing jobs or leaving a bad one.
It’s less straightforward than you think. Please, show me how to modify the current brackets to collect an extra $7k from households making >$200k (but not more than $7k), with a slow phaseout starting above $100k. Then do the same for TANF, SNAP, SSI, etc., all at the same time, and all at different thresholds.
After doing this exercise, you’ll find that it’s much easier to think about it in terms of keeping brackets the same for everyone and having separate, phased-out deductions, rather than fiddling with the brackets.
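To make the cliff-versus-phaseout distinction concrete, here is a hypothetical linear phaseout using the illustrative figures from this thread ($7k subsidy, phased out between $100k and $200k of income; not actual policy):

```python
def subsidy_phaseout(income, full=7_000, start=100_000, end=200_000):
    """Hypothetical linear phaseout (figures from the comment, not real
    policy): full subsidy up to `start`, nothing above `end`, a linear
    ramp in between -- so one extra dollar of income never costs more
    than a few cents of subsidy."""
    if income <= start:
        return full
    if income >= end:
        return 0.0
    return full * (end - income) / (end - start)

def subsidy_cliff(income, full=7_000, cutoff=150_000):
    """The alternative: a hard cliff, where $1 of extra income wipes out
    the whole subsidy."""
    return full if income < cutoff else 0

print(subsidy_phaseout(150_000))                        # halfway through the ramp
print(subsidy_cliff(149_999) - subsidy_cliff(150_000))  # the $7k cliff
```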
For the lazy, the footnote basically says that the discontinuity at 280g is not caused by an increase in seizures at this amount (which would indicate police fraud), but by prosecutors choosing to charge defendants at slightly above the 280g threshold when the amount seized was actually significantly larger than the threshold.
The change in the law increased the amount the police had to plant on someone to lock them away in a federal penitentiary, and they had not expected it to show up so strongly in the data. Back when the minimum was 50g they could hide in the noise, but at 280g it stands out.
In a well-functioning society this would trigger investigations for police misconduct in all of the cases where the suspect was charged with carrying just over the limit, but the police don't want that, so it's not going to happen.
Just wanted to point out that quote from danluu. It's a really good practice to sit down, plot histograms, scatterplots, and other visualizations, and JUST THINK for a while before trying to cram on the model you think applies.
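As a toy illustration of why plotting first matters (entirely made-up data): two samples with the same mean can have completely different shapes, and only a histogram shows it.

```python
import statistics

# Made-up data: same mean (5.0), completely different shapes.
unimodal = [5 + d / 10 for d in range(-20, 21)]       # spread evenly around 5
bimodal = [3.0] * 20 + [5.0] + [7.0] * 20             # two clumps at 3 and 7

print(statistics.mean(unimodal), statistics.mean(bimodal))  # both 5.0

def text_hist(data, lo=2, hi=8, bins=6):
    """Crude text histogram: one row of '#' per bin of width (hi-lo)/bins."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in data:
        counts[min(bins - 1, int((x - lo) / width))] += 1
    return ["#" * c for c in counts]

for row in text_hist(unimodal):
    print(row)
print()
for row in text_hist(bimodal):
    print(row)
```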