The obsession with SOTA needs to stop


Source: I generated this. I should be doing better things with my time.

(This is a rough draft. I may do some heavy edits to this over time. Stay tuned.)

One of my biggest frustrations with machine learning (I guess more specifically deep learning) has been the absolutely abysmal quality of reviews that have come out over the past few years. A very common trope is the rejection of a paper because it fails to ‘beat SOTA’ (state-of-the-art), amongst many other critiques (‘lacks novelty’, ‘too theoretical’, etc.). I want to specifically vent about the obsession with SOTA and why I think it is not only stupid but also incentivises bad research.

To start off, I want to get off my chest that I don’t have a problem with SOTA-chasing per se as a research contribution. If you come from an engineering mindset where you want to maximise the performance of some model in a real-world setting, then sure, you would place lots of emphasis on obtaining state-of-the-art performance. We want the best-performing algorithms deployed in the real world, else self-driving cars might drive off cliffs and kill their owners. Obviously, there are many other research contributions one could pursue: optimising for other useful metrics (e.g. memory footprint, inference speed), comparing algorithms, performing ablations, testing hypotheses, offering new perspectives, and so forth. My main issue lies with reviewers (and authors) placing exorbitant emphasis on “SOTA”. There are two main reasons why I have absolute disdain for this:

In my experience reading papers in machine learning, many authors appear to lack a basic understanding of how to properly evaluate a machine learning model – or maybe they do, but want to appease silly reviewers who don’t know how to do it. After all, the ultimate unit of currency in academia is the (published) paper, and this often incentivises sloppiness. Here are all of the things I have seen people do (or not do) when it comes to papers proposing SOTA contributions:

Both authors and reviewers are often ignorant of the above points, though my criticism lies more with reviewers because they are meant to be the gatekeepers of the literature (and I mean gatekeeping in a good way, not the malevolent elitist way). When reviewers are ignorant of these points and are obsessed with SOTA, they end up accepting papers that claim SOTA but whose claims are barely supported statistically. This just adds noise to the literature and makes life harder for the honest researcher who is actually trying to beat SOTA in a principled manner. Maybe that honest researcher finds that, at the end of the day, none of the algorithms compared really does any better than the others, but reviewers won’t care about their paper because ~novelty and SOTA win above everything else~. On the other hand, reviewers can also end up rejecting papers on the basis of not beating SOTA because they value it disproportionately at the expense of all of the other interesting contributions and results that the paper may have proposed (see bullet point (e)).
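To make the ‘barely supported statistically’ point concrete, here is a minimal sketch (in Python, with entirely made-up accuracy numbers) of the kind of check I mean: compare two algorithms across several seeds and ask whether the gap is distinguishable from noise. Welch’s t-test is just one reasonable choice here; a paired test or a bootstrap may suit a given setup better.

```python
# Hypothetical per-seed test accuracies for a "baseline" and a "new SOTA" model.
# These numbers are invented purely for illustration.
import numpy as np
from scipy import stats

acc_a = np.array([0.84, 0.86, 0.85, 0.83, 0.87])  # baseline, 5 seeds
acc_b = np.array([0.85, 0.88, 0.84, 0.86, 0.89])  # claimed SOTA, 5 seeds

t_stat, p_value = stats.ttest_ind(acc_a, acc_b, equal_var=False)  # Welch's t-test
print(f"A: {acc_a.mean():.3f} +/- {acc_a.std(ddof=1):.3f}")
print(f"B: {acc_b.mean():.3f} +/- {acc_b.std(ddof=1):.3f}")
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# With numbers like these, p lands well above 0.05: a ~1.4-point gap over
# five seeds is not statistically distinguishable from noise.
```

A table that reports 86.4 versus 85.0 and bolds the former tells you nothing that a check like this would not immediately call into question.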

Extra rambles

Furthermore, on the topic of measurements of uncertainty: what is the uncertainty being computed over (when confounding variables are controlled)? Computing uncertainty (variance) over random initialisation seeds is completely different to, say, random subsampling or cross-validation over your training set. In the former case, one is measuring the behaviour of the algorithm when subjected to random initialisations: if algorithm A gets 85% +/- 5% accuracy and algorithm B gets 92% +/- 10% accuracy, then this would indicate that B is less stable and would probably need more repeated training runs so that we can select the model which performs best on the validation set. If we are randomly subsampling our data, then we are essentially measuring the stability of the algorithm with respect to what might happen if one were to collect the data in practice. For instance, if we performed cross-validation and algorithm A obtained 85% +/- 20% and algorithm B obtained 85% +/- 5%, then implementing and running algorithm A on our own dataset is a whole lot riskier, since it may only give us an accuracy of 65% simply by chance (i.e. one standard deviation below the mean). I am bringing this specific example up because, as an author, you may propose an algorithm which performs better with respect to dataset uncertainty than seed uncertainty, but in this day and age I don’t think many reviewers are nuanced enough (or simply care enough) to take this into account.
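For concreteness, here is a rough sketch of measuring these two kinds of uncertainty separately. The classifier and synthetic dataset below are arbitrary stand-ins (nothing about them matters); the point is only that ‘accuracy X +/- Y’ means something different depending on which loop produced the Y.

```python
# Toy illustration: seed uncertainty vs. data (split) uncertainty for the
# same model on the same dataset. Model and data are arbitrary stand-ins.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# (1) Seed uncertainty: fix the train/test split, vary the training seed.
seed_scores = [
    RandomForestClassifier(n_estimators=50, random_state=s)
    .fit(X_tr, y_tr)
    .score(X_te, y_te)
    for s in range(10)
]
print(f"over seeds: {np.mean(seed_scores):.3f} +/- {np.std(seed_scores):.3f}")

# (2) Data uncertainty: fix the seed, vary the split via 10-fold cross-validation.
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0),
)
print(f"over folds: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

The first number tells you how sensitive the training procedure is to its own randomness; the second tells you roughly what to expect if the data had been collected or split differently. A paper that reports only one of these (usually the seed one, on a single fixed split) is answering only one of the two questions.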

Conclusion

That is all for now.