paper thoughts: questionable research practices (QRPs) in machine learning

I enjoyed reading this paper [1] and wish I had written it (or been a part of writing it!). Each section in this post can be cross-referenced with their Table 1:


Stable Diffusion-generated "depressed scientist with a knife in his mouth". I think that was the prompt I used for it.

1. Tuning hyperparameters further after test

Another common way to leak information is to tune on the test set: training a model, evaluating it on the test set, and then doing further hyperparameter search or testing again with a different evaluation metric. … The resulting models are in some sense being implicitly fitted to the test set (since we use the test score as a signal to build the next model).

(n.b. I originally wrote this here on Hacker News, but I develop the text a bit further in this post)

This is very true, and I would argue there is a very prevalent misunderstanding of (or just ignorance about) the distinction between a validation set and a test set. When the distinction is actually made between the two, the idea is that you perform model selection on the validation set, i.e. you find the best HPs such that you minimise (or maximise) some metric computed on that subset of the data. Once you've found your most performant model according to that metric, you then evaluate that same metric on the test set. Why? Because that becomes your unbiased estimate of the generalisation error. Note that in a production setting you'll want to get an even better model by re-training on all the data available (train + valid + test) under those ideal HPs, but that's completely fine: if somebody asks you what the generalisation error of the re-trained model is, you simply point them to the test set metric computed on the model you trained beforehand, the one where you followed the train-valid-test pipeline.
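To make that pipeline concrete, here is a minimal sketch in Python with scikit-learn; the dataset, model class, and HP grid are arbitrary choices of mine for illustration. The point is the discipline: tune on the validation split only, touch the test split exactly once, and (optionally) re-train on everything afterwards for deployment.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Split once into train / valid / test (60 / 20 / 20).
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Model selection: tune the hyperparameter on the validation set only.
best_hp, best_valid_acc = None, -np.inf
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C, max_iter=5000).fit(X_train, y_train)
    valid_acc = accuracy_score(y_valid, model.predict(X_valid))
    if valid_acc > best_valid_acc:
        best_hp, best_valid_acc = C, valid_acc

# Evaluate the chosen model *once* on the test set: this is the
# (approximately) unbiased estimate of generalisation error you report.
final_model = LogisticRegression(C=best_hp, max_iter=5000).fit(X_train, y_train)
test_acc = accuracy_score(y_test, final_model.predict(X_test))
print(f"C={best_hp}, valid acc={best_valid_acc:.3f}, test acc={test_acc:.3f}")

# For production, re-train on train + valid + test with the chosen HPs;
# the number you quote for generalisation error is still `test_acc` above.
production_model = LogisticRegression(C=best_hp, max_iter=5000).fit(
    np.concatenate([X_train, X_valid, X_test]),
    np.concatenate([y_train, y_valid, y_test]),
)
```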

This distinction goes against the publish-or-perish mentality of academia. Since reviewers (and by association, researchers) are obsessed with "SOTA", "novelty", and bold numbers, a table of results composed purely of metrics computed on the test set is not something you can easily steer towards "passing" the peer review process (if you want to be ethical about it). Conversely, a table full of those same metrics computed on the validation set is easy to steer: just perform extremely aggressive model selection until your best model gets higher numbers than all the baselines in the table. However, rather than report separate tables for the validation and test sets, the common QRP is to treat them as one and the same.

Admittedly, it is very anxiety-inducing to leave your fate up to a held-out test set whose numbers you can't optimise for, especially when your career is at stake. Interestingly, if your validation set numbers were great but your test set numbers weren't, that would indicate you were "overfitting" via model selection. The remedy would be to either make the model search less aggressive or go for a simpler model class. The latter approach is essentially Occam's razor, but does our field really encourage simplicity? (See "Superfluous cogs" in Sec. 3.3.1 of [1].)

To distinguish this from classic contamination (training on test data), Hosseini et al. [2020] call this ‘over-hyping’ and note that it biases results even if every iteration of the cycle uses cross-validation properly.

It goes back even further than that; see [2], from the olden days before we had deep learning:

Cross-validation can help to combat overfitting, for example by using it to choose the best size of [model] to learn. But it is no panacea, since if we use it to make too many parameter choices it can itself start to overfit.

Even with cross-validation, we have to guard against this. The easiest solution is to hold out a test set that is kept completely independent of the cross-validation procedure. One can even have each fold of an "outer" cross-validation serve as a test set, with an "inner" cross-validation handling training and model selection (i.e. nested cross-validation).
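Here is a minimal sketch of that nested scheme with scikit-learn; again, the dataset, estimator, and parameter grid are arbitrary choices for illustration. The inner cross-validation does the hyperparameter search, while each outer fold acts as a held-out test set that the search never sees.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Inner CV: hyperparameter search (model selection).
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    estimator=SVC(),
    param_grid={"C": [0.1, 1, 10], "gamma": [1e-3, 1e-4]},
    cv=inner_cv,
)

# Outer CV: each fold is a test set the inner search never touches,
# so the resulting scores are not biased by the hyperparameter tuning.
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(search, X, y, cv=outer_cv)
print(f"nested CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```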

2. Over/underclaiming

To be done. There is a piece I'd like to write about the weirdness of evaluation metrics in generative models.

Bibliography

[1] G. Leech, J. J. Vazquez, M. Yagudin, N. Kupper, and L. Aitchison, “Questionable practices in machine learning,” arXiv preprint arXiv:2407.12220, 2024.
[2] P. Domingos, “A few useful things to know about machine learning,” Communications of the ACM, vol. 55, no. 10, pp. 78–87, 2012.