not found - is there an archived version I can take a look at?

Well, not exactly - Table 6 is a bit concerning. It's meant to show that despite significant data contamination between the testing and training datasets, the model still performs 'well'. Except look at those confidence intervals - they're all over the place, meaning the model's performance isn't very reliable. You have confidence intervals spanning 30-40 percentage points!


Well, not exactly - Table 6 doesn't concern me at all.

> look at those confidence intervals - they're all over the place, meaning the model's performance isn't very reliable. You have confidence intervals spanning 30-40 percentage points!

OK, let's look at the confidence intervals. Take the first row of the table: it has a pretty wide confidence interval, [76.0, 100.0], for its 'Performance (with Overlap)' entry.

Let's dig into what this really means.

We can see from the first entry in that row that only 12 of the 1273 questions from the 'MedQA (USMLE)' dataset had overlap (as they define it) with text in the training data. That's fewer than one percent of the questions, and a tiny absolute number: twelve. The 'Performance (with Overlap)' entry is an attempt to judge how well the model does on those twelve questions. On average it does pretty well, but the confidence interval is wide precisely because twelve is such a small sample: it doesn't give you enough data to estimate with any precision how well the model does on questions it has already seen.
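
To make that concrete: if you assume the model got all 12 overlapping questions right and that the interval is a 95% Wilson score interval (both are guesses on my part, not stated anywhere in this thread), you land almost exactly on the reported [76.0, 100.0]. A quick sketch in Python:

    from statsmodels.stats.proportion import proportion_confint

    # Assumption: 12/12 overlapping MedQA questions answered correctly,
    # scored with a 95% Wilson interval (the paper's exact method is a guess).
    lo, hi = proportion_confint(count=12, nobs=12, alpha=0.05, method="wilson")
    print(f"n=12:   [{lo:.1%}, {hi:.1%}]")    # ~[75.8%, 100.0%], about 24 points wide

    # The same perfect score over all 1273 questions would pin it right down.
    lo, hi = proportion_confint(count=1273, nobs=1273, alpha=0.05, method="wilson")
    print(f"n=1273: [{lo:.1%}, {hi:.1%}]")    # ~[99.7%, 100.0%]

Same point estimate, a hundred times the data, and the interval collapses from roughly 24 points wide to a fraction of a point. The width is telling you about the sample size, not about the model.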

Your objection basically amounts to 'there is so little data contamination between the testing and training datasets that there is not enough sample size to tell exactly how well the model does on questions it has seen in training', which wouldn't make much sense as an objection. In particular, it doesn't mean that "the model's performance isn't very reliable."


I see what you mean, although I was specifically referring to the delta column. I might be misunderstanding it - can you explain the huge variation in the confidence intervals there? That's what struck me.


> I was specifically referring to the delta column. I might be misunderstanding it - can you explain the huge variation in the confidence intervals there?

It's essentially the same reason. Some delta confidence intervals are wide, like 'PubMedQA', for which only 6 test questions had overlap (as they define it) with the training data; the small sample size of 6 makes that interval wide. Some delta confidence intervals are much narrower, like 'MedMCQA', which had 893 questions with overlap out of 4183 total; the large sample sizes for both classes (with overlap and without) make that interval much tighter.
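
To put numbers on that: the width of a difference-of-proportions interval is dominated by the smaller of the two samples. Here's a rough sketch using a plain normal-approximation interval (not necessarily what the paper used); only the counts 6, 893, and 4183 come from the table, and every accuracy below is invented purely for illustration:

    import math

    def delta_ci(k1, n1, k2, n2, z=1.96):
        # Normal-approximation 95% CI for the difference p1 - p2.
        p1, p2 = k1 / n1, k2 / n2
        se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
        return (p1 - p2) - z * se, (p1 - p2) + z * se

    # PubMedQA: 6 overlapping questions (4/6 correct assumed) vs. an assumed
    # 60% on a non-overlapping remainder of 494 questions (also assumed).
    print(delta_ci(4, 6, 296, 494))        # ~ +/-0.38 around the delta: huge

    # MedMCQA: 893 overlapping vs. 4183 - 893 = 3290 without overlap
    # (the ~70% and ~67% accuracies are again made up).
    print(delta_ci(625, 893, 2204, 3290))  # ~ +/-0.03: tight

Shrink one of the samples from hundreds down to 6 and the interval blows up by an order of magnitude, which is exactly the pattern you're seeing in the delta column.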

