How you could’ve got a silver medal in Kaggle’s 2022 Jigsaw competition
Don’t overfit it!
Arrogant forecasters don’t exist
The interesting thing about forecasting is that every week I get a new test set. Just because a model worked well on last week’s data doesn’t mean it’ll work well on next week’s. The challenge lies in properly validating models before putting them into production.
How do you learn the subtleties of cross-validation? With practice, and this is where Kaggle comes into the picture. In the words of Competitions Grandmaster Jean-Francois Puget:
unbiased evaluation is the single most important feature of Kaggle. It teaches kagglers that properly evaluating predictive model performance is key.
Once the test score has been revealed, that’s it: you can’t go back and change your model. If you overfit, it’s too late. Let’s talk about overfitting in the context of the 2022 Jigsaw Competition.
Don’t you know that you’re toxic?
The goal of the 2022 Jigsaw competition is, given a pair of sentences, to determine which one is more toxic. Because toxicity isn’t objective, the ground truth for this competition has been determined by human annotators. Solutions are evaluated in terms of Average Agreement with Annotators.
For example, suppose we have the sentences “you’re so stupid!” and “shut up!”, and two human annotators thought the first one was more toxic, while a third thought the second one was more toxic. If you give a toxicity score of 0.2 to the first sentence and 0.1 to the second one, then you agree with the two annotators who ranked the first sentence as more toxic, so your Average Agreement with Annotators for this pair would be 2/3. Your goal is to maximise this number across all the sentence pairs you’re provided with, based only on the toxicity scores you assign to each sentence.
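To make the metric concrete, here’s a minimal Python sketch of the calculation described above (the helper function and sentence pairs are purely illustrative):
def average_agreement(pairs, scores):
    # pairs: one (less_toxic, more_toxic) judgement per annotator
    # scores: your predicted toxicity score for each sentence
    agreements = [scores[more] > scores[less] for less, more in pairs]
    return sum(agreements) / len(agreements)

# The example above: two annotators ranked "you're so stupid!" as more toxic,
# one ranked "shut up!" as more toxic.
pairs = [
    ("shut up!", "you're so stupid!"),
    ("shut up!", "you're so stupid!"),
    ("you're so stupid!", "shut up!"),
]
scores = {"you're so stupid!": 0.2, "shut up!": 0.1}
print(average_agreement(pairs, scores))  # 0.666..., i.e. 2/3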
This subtitle overfits
By inspecting the most popular public notebooks, you could notice a couple of red flags. Some of them contained really arbitrary scalings such as
for i in range(801, 1200):
df_test['score'][i] = df_test['score'][i] * 1.34
They evidently came up with some magic numbers which happened to score high on the public leaderboard (approx. 5% of the final test set) without regard to generalisation. Overfitting doesn’t get much more overt than this.
But even many of the public notebooks without these magic scalings were overfitting. You could conclude this by forking them and calculating their validation scores, which were much lower than their public leaderboard scores. This kind of overfitting is harder to spot — if you want to detect it, cross-validation is a must.
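If you want to run that check yourself, a sketch along these lines works — assuming a validation file with less_toxic / more_toxic text columns (as in the competition’s validation data) and some scoring function pulled out of the forked notebook (score_fn below is a placeholder):
import pandas as pd

def validation_agreement(val_df, score_fn):
    # Score both sentences in each annotated pair with the notebook's model
    # and measure how often its ranking matches the annotator's.
    less = val_df["less_toxic"].map(score_fn)
    more = val_df["more_toxic"].map(score_fn)
    return (more > less).mean()

# e.g. validation_agreement(pd.read_csv("validation_data.csv"), score_fn)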
Keep it simple
Based on the above, I had a strong feeling that most competitors were heavily overfitting to the public leaderboard, probably without even realising it. So I figured I’d try the following:
- make a simple solution
- check that its validation score beats that of the overfitted public notebooks
- submit
This meant ignoring the public leaderboard, which is never easy.
Model-wise, I tried out Detoxify, an open-source pre-trained model which, given some texts, predicts numeric scores for the categories toxic, severe_toxic, obscene, threat, insult, and identity_hate. I then found the linear combination of these scores which maximised my local validation score. That’s it, that was the whole solution.
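For the curious, here’s a rough sketch of that approach (not the exact notebook). It assumes the same validation file as above; Detoxify’s exact category names vary by model variant, and the random search over weights is just one simple way to pick a linear combination:
import numpy as np
import pandas as pd
from detoxify import Detoxify

val = pd.read_csv("validation_data.csv")  # assumed columns: less_toxic, more_toxic
model = Detoxify("original")

def predict_matrix(texts):
    # Detoxify returns a dict of category -> list of scores; stack into an (n, k) array
    preds = model.predict(list(texts))
    return np.column_stack(list(preds.values()))

less = predict_matrix(val["less_toxic"])
more = predict_matrix(val["more_toxic"])

# Simple random search over non-negative weights; a grid search or a proper
# optimiser would work just as well.
rng = np.random.default_rng(0)
best_weights, best_score = None, -1.0
for _ in range(1_000):
    weights = rng.random(less.shape[1])
    score = np.mean(more @ weights > less @ weights)  # average agreement with annotators
    if score > best_score:
        best_weights, best_score = weights, score

print(best_score, best_weights)
Whichever weights win here then get applied to the test sentences in exactly the same way to produce the submitted toxicity scores.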
Results
My model:
- validation score: 0.691
- public leaderboard: 0.753
By contrast, a very highly-upvoted public notebook achieved:
- validation score: 0.672
- public leaderboard: 0.873
Bingo. As long as my validation score was higher, I wasn’t bothered by being low on the public leaderboard. This ended up paying off — when the private leaderboard was revealed, my model scored 0.801, while the aforementioned public notebook only scored 0.762.
The takeaway, as with all competitions, is simple: trust your CV.
Want more?
The best way to learn is by doing. If you’re able to find an awesome Data Science job at an awesome organisation, then great — otherwise, Kaggle can be great practice to help you get there.