
Calculating the sample size required for developing a clinical prediction model - rbanffy
https://www.bmj.com/content/368/bmj.m441/rr
======
bonoboTP
Not sure if any of the test-driven development people have thought about this,
but the same principle also applies there: if you debug and fix the code until
it passes the tests, you can overfit the tests. It's no longer a good measure
for code quality, once it has been explicitly optimized. You'd need new,
previously unseen test cases. It would be an obvious mistake in machine
learning to simply add failed test examples into the training set and rejoice
that the new model can now deal with those previously difficult cases.
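
A minimal sketch of that discipline in scikit-learn terms (the dataset and
split sizes here are made up for illustration): keep a final holdout set that
you touch exactly once, and iterate only against the validation split.

    # Toy sketch: iterate against a validation split; report the holdout once.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, random_state=0)

    # One split for fitting/tuning, one final split touched exactly once.
    X_dev, X_hold, y_dev, y_hold = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X_dev, y_dev,
                                                      test_size=0.25,
                                                      random_state=0)

    model = LogisticRegression().fit(X_train, y_train)
    print("validation (fine to iterate on):", model.score(X_val, y_val))
    print("holdout (report once, never tune on):", model.score(X_hold, y_hold))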

It's related to Goodhart's law: "When a measure becomes a target, it ceases
to be a good measure".

It also works in one's personal life. If you find that you made a mistake,
don't just fix that particular thing, go back and see what other similar
things may need fixing. It connects to the idea of fixing the deeper reason
rather than just the symptoms. "Deeper reason" here means something that
generalizes.

By the way, you need to be careful about your data split. This always depends
on how you intend to use the model. If you intend to use it on new, unseen
patients, then the train and test data cannot overlap with respect to people.
Another obvious case is video: you can't just take every even frame as a
training sample and every odd frame as a test sample. Even though technically
you would be testing on non-overlapping data, the performance measure would be
biased compared to real-world performance. Or if you want to classify burglars
from security camera footage, you may need to test on new camera setups from
different houses if you intend to deploy to new houses. If your scenario is
such that you'd perform training on each new site and run a location-specific
model, you can test on images from the same site.
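
For the patient case, a minimal group-aware split sketch (hypothetical patient
IDs, scikit-learn's GroupShuffleSplit) that guarantees no person appears on
both sides:

    # Split by patient so train and test never share a person.
    import numpy as np
    from sklearn.model_selection import GroupShuffleSplit

    X = np.random.rand(100, 5)                   # features
    y = np.random.randint(0, 2, 100)             # labels
    patient_id = np.random.randint(0, 20, 100)   # grouping variable

    splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
    train_idx, test_idx = next(splitter.split(X, y, groups=patient_id))
    assert set(patient_id[train_idx]).isdisjoint(patient_id[test_idx])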

You always have to use your brain to decide what to do.

~~~
shajznnckfke
> Not sure if any of the test-driven development people have thought about
> this, but the same principle also applies there: if you debug and fix the
> code until it passes the tests, you can overfit the tests. It's no longer a
> good measure for code quality, once it has been explicitly optimized. You'd
> need new, previously unseen test cases. It would be an obvious mistake in
> machine learning to simply add failed test examples into the training set
> and rejoice that the new model can now deal with those previously difficult
> cases.

This is a great point, and it articulates some challenges that I’ve seen with
test-driven development.

I would say that if you can write tests that fully specify the desired
functionality, rather than merely check a few possible inputs, it’s less of an
issue. This is a reason to try to build things with fewer knobs to twiddle, so
the space of inputs has lower dimensionality.
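
One way to get closer to "tests that specify the functionality" is
property-based testing. A sketch with the hypothesis library (the function
under test is hypothetical):

    # Assert a property over the whole input space instead of a few examples.
    from collections import Counter
    from hypothesis import given, strategies as st

    def my_sort(xs):  # hypothetical function under test
        return sorted(xs)

    @given(st.lists(st.integers()))
    def test_sorted_and_same_elements(xs):
        out = my_sort(xs)
        assert all(a <= b for a, b in zip(out, out[1:]))  # ordered
        assert Counter(out) == Counter(xs)                # same multiset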

~~~
virgilp
Yes, but then the tests would formally describe the requirements - i.e. they
would be a de facto implementation. Congratulations, you just coded an
(indirect) solution to your problem (a solution that you postulate is
correct/bug-free).

~~~
tonyhb
Are you saying this like it's a bad thing?

Tests describing requirements act as documentation, validation, and
regression prevention during future refactors.

~~~
virgilp
My beef is with the assumption that tests _can_ fully encode the requirements
- to the point where any bug would trigger a test failure. Having such a test
suite is no simpler than having a perfect implementation of the requirements -
i.e. it's probably only feasible at all in the simplest, "didactic" cases.

Having a perfect implementation is not _bad_, of course. It's just not
realistic.

------
andersource
The link is not to the paper itself but to a response to the paper; this might
be (is probably?) intentional, but the title is a bit confusing.

Essentially the response, submitted by several ML / CS / math researchers,
addresses a note in the paper that recommends against a train/test split in
model training, calling it "inefficient". The response is dedicated to
explaining what generalization error is, why estimating it is important, and
how that's basically impossible without some sort of train/test split or
cross-validation.
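
In code terms, the distinction the response is drawing looks roughly like this
(a toy scikit-learn sketch, my own illustration rather than anything from the
paper):

    # Apparent (training) accuracy vs. a cross-validated estimate of
    # generalization error.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=500, random_state=0)
    model = LogisticRegression()

    print("apparent accuracy:", model.fit(X, y).score(X, y))  # optimistic
    print("5-fold CV accuracy:", cross_val_score(model, X, y, cv=5).mean())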

~~~
mumblemumble
The article is there, if you click on the "Article" tab.

Perhaps if any staff are paying attention on the holiday, we could get the
"/rr" chopped off the end of the link, so that we get to the main article
instead?

~~~
andersource
My guess is that OP intended to submit the response, I was really surprised to
see the note in the article and found the response interesting (even the fact
of its existence). But that's just my take :)

~~~
mumblemumble
I suppose, but then you'd expect a different title on the HN submission, since
the response has barely anything to do with sample sizes.

------
peatmoss
And if you’re a Bayesian, you can choose a stopping criterion based on your
desired degree of certainty. This has the practical benefit of not having to
run an experiment to its end.
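
A toy sketch of such a rule (my own simplified illustration, not the study's
method): binary outcomes, Beta(1,1) priors, stop once the posterior
probability that treatment beats control crosses a threshold.

    import numpy as np

    rng = np.random.default_rng(0)
    a_t = b_t = a_c = b_c = 1  # Beta(1,1) priors for treatment / control

    for n in range(1, 1001):
        t = rng.random() < 0.65  # simulated treatment outcome
        c = rng.random() < 0.50  # simulated control outcome
        a_t, b_t = a_t + t, b_t + (1 - t)
        a_c, b_c = a_c + c, b_c + (1 - c)

        # P(p_treatment > p_control), Monte Carlo over the two posteriors
        p_win = (rng.beta(a_t, b_t, 10_000) > rng.beta(a_c, b_c, 10_000)).mean()
        if p_win > 0.99:
            print(f"stop at n={n}, P(treatment better) ~ {p_win:.3f}")
            break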

A family member was part of the control group for a cancer treatment study a
few years back. The study chose a stopping criterion based on Bayesian
methods. Relatively early into the study they were able to determine that it
made sense to move people from the control group to the treatment group.

~~~
icegreentea2
You don't have to be a bayesian to do that.

Once you're dealing with real people, early termination is a huge deal, and it
quickly goes beyond just the interpretation of statistics. You almost always
want to have 3rd party referees involved anyways.

~~~
MiroF
> You don't have to be a bayesian to do that.

The reasoning for doing so is much more coherent (imo) in a Bayesian framework
where you don't have to say that you are "going beyond" statistics.

~~~
mattkrause
By "beyond statistics", I think they meant that trials need on-going
monitoring for things beyond the outcome variable itself: safety, data
quality, and even feasibility of completing the trial itself. Once you've got
that infrastructure in place, interim monitoring of the outcome isn't much
extra work.

Some of the frequentist approaches to early stopping seem pretty coherent to
me. Curtailment, for example, just stops collecting data once it can no
longer change the outcome of a test. The frequentist emphasis on error control
_for the procedure_ also seems like a reasonable fit for a regulatory regime
where you're actually making decisions (no argument that it is... weirder for
basic research).
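
A toy curtailment sketch (hypothetical setup: declare success iff at least 60
successes in 100 trials) that stops as soon as the verdict can no longer
change:

    N, K = 100, 60  # planned trials, successes required to "pass"

    def curtailed_verdict(outcomes):  # outcomes: iterable of 0/1
        successes = failures = 0
        for i, o in enumerate(outcomes, start=1):
            successes += o
            failures += 1 - o
            if successes >= K:       # verdict already decided: pass
                return "pass", i
            if failures > N - K:     # K successes no longer reachable: fail
                return "fail", i
        return "fail", N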

At any rate, some approaches for early-stopping have sensible frequentist and
Bayesian properties, which is nice.

~~~
icegreentea2
Yes, I meant for clinical trials.

Agreed that things are weirder for basic research.

------
ereinertsen
Estimating a minimum required sample size is one of the most common questions
asked by clinical or biomedical collaborators before embarking on a research
project. This is especially true when ML is an option. This paper provides
rules of thumb and a digestible amount of theory that could inform such
conversations, and will surely become a popular reference.

Note that intuition from traditional statistics does not universally apply to
deep learning and/or extremely high-dimensional data. For example, deep neural
networks with 1-4 orders of magnitude more parameters than training examples
can still generalize well to unseen data.

------
2rsf
This is part of the reason why A/B testing might fail in reality: it is not
just about collecting a random number of responses about A and B.

When doing a test you should plan it in advance based on other data, execute
the test as planned, and analyze the results accordingly; failing any of these
steps will result in wrong to totally wrong conclusions.
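
The "plan it in advance" step is typically a power calculation. A sketch with
statsmodels (the baseline and hoped-for rates are assumptions you bring from
other data):

    # Sample size per arm to detect 10% -> 12% conversion at alpha=0.05,
    # power=0.8.
    from statsmodels.stats.power import NormalIndPower
    from statsmodels.stats.proportion import proportion_effectsize

    effect = proportion_effectsize(0.10, 0.12)
    n = NormalIndPower().solve_power(effect_size=effect, alpha=0.05,
                                     power=0.8, alternative="two-sided")
    print(f"~{n:.0f} samples per arm")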

~~~
activatedgeek
> When doing a test you should plan it in advance based on other data

I think this is not entirely true. In areas where experimentation feedback is
fast (and experiments are probably cheaper to run), the problem is much more
accurately and sample-efficiently solved by Thompson Sampling [1,2], which in
fact requires you to have a broad enough prior over the solution space and
then lets the posterior dictate your conclusions.

[1]: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/thompson.pdf

[2]: http://www.economics.uci.edu/~ivan/asmb.874.pdf
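
For the Bernoulli case, a minimal Thompson Sampling sketch (my own toy
version, not taken from the linked papers): sample a rate from each arm's Beta
posterior, play the best draw, update.

    import numpy as np

    rng = np.random.default_rng(0)
    true_rates = [0.05, 0.07]        # unknown A/B conversion rates
    a = np.ones(2)                   # Beta posterior "successes" + 1
    b = np.ones(2)                   # Beta posterior "failures" + 1

    for _ in range(10_000):
        arm = int(np.argmax(rng.beta(a, b)))  # one draw per arm, pick best
        reward = rng.random() < true_rates[arm]
        a[arm] += reward
        b[arm] += 1 - reward

    print("posterior means:", a / (a + b))
    print("pulls per arm:", a + b - 2)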

