To reduce the occurrence of future similar programming errors, the Johns Hopkins Biostatistics Center has instituted a new standard operating procedure for checking randomization assignment to be followed in all trial analyses. To ensure that the group assignment used in any of the trial analyses is correct, a verification process will be included at the beginning and end of each analysis program. This process is intended to confirm that the group assignment separately provided by the trial team matches the group assignment used in the analysis program. The matching confirmation is reviewed by a second biostatistician/analyst before its use in the results.
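For what it's worth, the kind of check described in that notice can be automated in a few lines. Below is a minimal sketch in Python/pandas of what such a verification step might look like; the file names and column names (participant_id, group, the two CSVs) are assumptions for illustration, not anything taken from the actual SOP:

```python
# Minimal sketch (assumed file and column names) of the verification step
# described above: confirm that the group assignment used in the analysis
# matches the assignment provided separately by the trial team, at both
# the start and the end of the analysis program.
import pandas as pd

def verify_group_assignment(analysis_path: str, reference_path: str) -> None:
    analysis = pd.read_csv(analysis_path)    # dataset used in the analysis
    reference = pd.read_csv(reference_path)  # assignment provided by the trial team

    merged = analysis[["participant_id", "group"]].merge(
        reference[["participant_id", "group"]],
        on="participant_id",
        how="outer",
        suffixes=("_analysis", "_reference"),
        indicator=True,
    )

    # Every participant must appear in both files...
    missing = merged[merged["_merge"] != "both"]
    if not missing.empty:
        raise ValueError(f"{len(missing)} participants not present in both files")

    # ...and their group labels must agree exactly.
    mismatched = merged[merged["group_analysis"] != merged["group_reference"]]
    if not mismatched.empty:
        raise ValueError(f"Group assignment mismatch for {len(mismatched)} participants")

# Run at the top and bottom of the analysis script:
# verify_group_assignment("analysis_dataset.csv", "randomization_list.csv")
```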
I don't know what software quality control is already in place at this organization, but this corrective measure seems on its face wholly inadequate to me: they're just preventing a recurrence of the same exact problem, rather than the much broader class of problems due to programming errors. Do they have a code review process in place?
This speaks to a larger issue: if you write software for manipulating data as part of the production of a scientific paper, then the source code should be available for review as an attachment to that paper, and review of said code should be part of the peer review process in any reputable journal. Professional software engineers write bugs all the time that invalidate the correctness of their programs, never mind individuals whose primary job is research, not software.
>This speaks to a larger issue: if you write software for manipulating data as part of the production of a scientific paper, then the source code should be available for review as an attachment to that paper, and review of said code should be part of the peer review process in any reputable journal. Professional software engineers write bugs all the time that invalidate the correctness of their programs, never mind individuals whose primary job is research, not software.
Completely agreed, the source should be open (ideally FOSS), but the software development should also be conducted properly. On a practical level, that means using version control and a code review mechanism (e.g. GitHub PRs) within the research group, with rigor equivalent to what you'd see at a well-run software development shop in industry.
Clinical decisions are made off the back of evidence published in peer-reviewed, respected journals. It would seem to me that serious software errors in this domain can contribute to grave patient consequences. Much more serious consequences than if I introduce a bug into a client project.
> On a practical level, that means using version control and a code review mechanism (e.g. GitHub PRs) within the research group, with rigor equivalent to what you'd see at a well-run software development shop in industry.
Let us (professional s/w engineers) not pat ourselves on the back by confusing standard industry practices with 'rigor'. Rigor would be formal verification and proofs. How many s/w engineers can do that? How much would it slow down development?
There’s a tradeoff between rigor and output. Except in toy or very niche problems, I’m not aware of formal verification striking a good balance between the two, although perhaps I’m just ignorant. But it seems to me that code review and some unit and functional tests are usually a clear win.
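To make that concrete, here's a minimal, hypothetical example (Python/pytest) of the kind of low-cost unit test I have in mind; the impute_missing function is invented for illustration, not taken from any real analysis code:

```python
# A minimal, hypothetical example of a lightweight unit test: pin down the
# intended behaviour of a small data-handling function so that a silent
# change in its logic fails loudly instead of propagating into results.
import math

def impute_missing(values, fill):
    """Replace missing (None/NaN) entries with `fill`, leaving others untouched."""
    return [fill if v is None or (isinstance(v, float) and math.isnan(v)) else v
            for v in values]

def test_impute_missing_only_touches_missing_values():
    assert impute_missing([1.0, None, 3.0], fill=0.0) == [1.0, 0.0, 3.0]

def test_impute_missing_leaves_complete_data_alone():
    assert impute_missing([1.0, 2.0], fill=99.0) == [1.0, 2.0]
```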
The way many clinical research groups are structured (unfortunately) precludes code review, and to some extent, version control. Often for projects like the one referenced in the OP you'll have just one statistical programmer working with a PhD-level biostatistician in addition to the MD investigators. The biostatistician guides the programmer through the statistical methods to use and steers the overall study design, but otherwise they never see the underlying code. There are some exceptions, but in many cases, the programmer ends up being the only person who ever sees the code. A lot of this has to do with how funding is structured: grants are written assuming one FTE programmer, and something like 0.05 FTE for the biostatistician. It's hard to convince funding agencies that you need more than one FTE programmer in many cases; on top of that, farming out the biostatistician's time in small chunks like that splits their attention between dozens of projects, which precludes them engaging with any one of those projects in depth.
So, there's literally nobody else on these projects who could conduct a code review. Being the only person contributing code also provides something of a disincentive toward using VCS, even though it's so obviously a good idea. I've talked with programmers about this before, and the response, unfortunately, is "why bother?"
Unfortunately, this would make the already laborious process of peer review even longer and require more work. Given the reality of academia today (publish or perish), not to mention the added work of preparing even well-structured, version-controlled code (which is not, in my experience, common) for publication, most researchers would not opt in to (or support) something that makes publishing more difficult.
Who cares? If researchers are producing crap to get published, why should we mind if they stop producing that crap when journals raise the standards of publication?
I previously made a big list of papers that were retracted due to software bugs. It was intended to go in a manuscript but I had to cut it out because the conference limited the number of references for the camera-ready version. If anyone is interested I can try to dig up the list again!
In fact, I would treat the retracted papers more like a data set than things to be cited. Then you get a nice paper with counts and statements about common themes, and references on those themes. Then post the dataset as supplementary material available on arXiv.
Some of these links go directly to the retracted paper, some go to the retraction notice (where available), and some go to reports of software bugs which impacted others' results but didn't cause any retractions. This list is of course incomplete, but here are a few, mostly pulled from Retraction Watch:
This isn't surprising, and I'm sure has happened many times. If you get the result you expect, you are much less likely to check for a mistake. The authors deserve a lot of credit for owning up to it.
> Given the corrected finding of a paradoxical increase in acute care use in the intervention group
Now I’m curious why long-term intervention/support increased the number of acute cases. Maybe people were more likely to find themselves sick when provided with additional monitoring after they leave the hospital? Some sort of psychological connection or being overly careful?
Plenty of doctors will simply blame your past diagnosis for any broad new symptoms, without doing much critical thinking or investigating. I’ve seen this personally many times in the years following a colitis diagnosis. The symptoms are quite broad and easily mistaken.
You can find it here [0]. It's easy to miss the reference since it is only indicated in the text by a superscript. Personally I prefer using the text as a link, or putting the reference inline in brackets.
Could easily be a type I error. That's the problem with null hypothesis testing: you can't confidently say that the intervention caused the observed change without replication.
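For intuition, here's a quick simulation (Python; purely illustrative, not based on this trial's data) showing that even when there is no true effect at all, roughly 5% of trials will come out "significant" at the conventional 0.05 threshold:

```python
# Illustrative only: simulate many two-arm "trials" with NO true treatment
# effect and count how often a t-test declares significance at alpha = 0.05.
# Roughly 5% of null trials come out "significant", which is why a single
# surprising finding needs replication before it is treated as causal.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n_trials, n_per_arm = 0.05, 10_000, 100

false_positives = 0
for _ in range(n_trials):
    control = rng.normal(0, 1, n_per_arm)
    treatment = rng.normal(0, 1, n_per_arm)  # same distribution: no real effect
    _, p = stats.ttest_ind(control, treatment)
    if p < alpha:
        false_positives += 1

print(f"False positive rate: {false_positives / n_trials:.3f}")  # ~0.05
```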
>Over the course of this reanalysis, we detected an error in imputing missing values for the SGRQ, whereby the worst possible score (100) was incorrectly imputed for missing values of participants who had died beyond the 6-month study period. The correct approach would have been to classify those values as missing because those participants had not died by the 6 months after discharge study end point.
The reassignment error is possibly forgivable, but I think this second error should have been easier to catch and is much less easy to forgive. A simple filter check between possible score and some other status variable in the dataset would of caught this mistake. I am doing a Masters in Biostatistics and this kind of checking is being taught to us early on, I hope there is more focus on it later to help avoid mistakes like this.
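Something like the sketch below (Python/pandas, with made-up column names, since the real dataset's variables are unknown to me) is the kind of filter check I mean: cross-check the imputed score against vital status and surface anything inconsistent.

```python
# Rough sketch (column names are assumptions) of a cross-check: flag any
# participant whose SGRQ score was set to the worst possible value (100)
# even though they had not died by the 6-month endpoint, since the
# worst-score imputation was only meant to apply to in-study deaths.
import pandas as pd

def check_sgrq_imputation(df: pd.DataFrame) -> pd.DataFrame:
    """Return rows where the worst-score imputation looks inconsistent with vital status."""
    return df[(df["sgrq_6mo"] == 100) & ~df["died_within_6mo"]]

# Example with a tiny toy dataset:
toy = pd.DataFrame({
    "participant_id": [1, 2, 3],
    "sgrq_6mo": [42.0, 100.0, 100.0],
    "died_within_6mo": [False, True, False],  # participant 3 died *after* 6 months
})
print(check_sgrq_imputation(toy))  # should surface participant 3
```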
Even when following every protocol, everyone will screw up at some point.
The authors' approach is the correct way to deal with a screw up in the academic/research context -- broad communication, transparent assessment of the mistake, eager explanation.
Trivial errors can slip through the cracks easily.
For example, people sometimes misspell “would’ve” as “would of” even if they actually know that the latter spelling is actually incorrect.
Pointing fingers is easy after the fact, but spotting every possible error all of the time – no one is able to do that.
I’ve even made that very same spelling mistake you did a time or two myself even though I try really hard to be correct in spelling and in all aspects of grammar, and even though I am well aware that “would of” is just plain wrong. We all slip up, and sometimes we do so in embarrassing ways. Especially when we are lacking sleep or when we are otherwise exhausted.
The notion of "failing forward" is somewhat related here -- don't focus on the blame game when addressing honest mistakes. (Fraud is a different matter). Focus on rectifying then moving forward.