People keep saying that winning Kaggle competitions can help you make a name in the data science community and land you a job. With a faint hope of achieving a bit of that, I made my first serious attempt at an active Kaggle competition beyond Titanic or Housing Prices, had my spell of shooting up to the top of the leaderboard, but in the end did not win any prize. Still, some valuable lessons were learnt.
The Competition
So, what is the competition about? It is called Football Match Probability Prediction, organized by the football prediction and analytics company Octosport and the sports data provider Sportmonks. The goal is to predict football match results, primarily from the results of the previous ten matches of both teams. It is a community competition, with no money prizes or ranking points awarded, but the subject still attracted close to 400 competitors.
The train set contains more than 110,000 matches and the test set more than 63,000. Apart from the major European leagues and tournaments, it includes matches from all over the world, covering lower divisions, club friendlies, youth team and women's team competitions. The data provided includes match dates, whether the match is a cup game, league ids and coach ids, and, in the so-called “history” section, similar information plus the ratings and goals scored by both sides in their past ten matches. This information is to be used to predict the probabilities of a home win, a draw and an away win for each match.
For evaluation and scoring, this competition uses log loss as the metric. With a baseline that assigns a 0.333 probability to each class scoring 1.09861, and the bookmakers' odds scoring 0.97884 as the benchmark to beat, competitors squeeze out every 1/1,000-point reduction in log loss to get ahead on the leaderboard.
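For reference, the 1.09861 baseline is simply the log loss of a uniform prediction; a quick sanity check (a minimal sketch with hypothetical outcomes, not the competition's actual scoring code) looks like this:

import numpy as np

# Multi-class log loss: mean negative log probability assigned to the true class
def log_loss(y_true, y_prob):
    y_prob = np.clip(y_prob, 1e-15, 1)
    return -np.mean(np.log(y_prob[np.arange(len(y_true)), y_true]))

y_true = np.array([0, 2, 1, 0])             # hypothetical outcomes: 0 = home win, 1 = draw, 2 = away win
uniform = np.full((len(y_true), 3), 1 / 3)  # the 0.333-for-each-class baseline
print(log_loss(y_true, uniform))            # ln(3) ≈ 1.09861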
And here are my five lessons learnt:
1. The Effect of Data Leakage Can Be Beyond Imagination
My surge up the leaderboard came unexpectedly from constructing sets of features measuring teams' offensive and defensive strengths. These are made by dividing the average goals scored and conceded by each team by the league/competition averages, which requires going beyond the past 10 matches and aggregating across rows.
The average goals of every league/competition in the dataset can be found by pooling the history feature columns and grouping by league id. I felt uneasy that some of the goal data used actually came from matches played after the match to be predicted, but decided to try this simpler implementation first to see how it went, reasoning that if there were no significant trend of change over the time period covered, the average goal figures should be more or less the same.
In the first trial I used the average goals of all ten past matches divided by the league averages, and it gave a modest improvement in the log loss score.
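A minimal sketch of that first trial, assuming the wide column names home_team_history_goal_1 to _10 and league_id from the dataset (the strength feature name is only for illustration), might look like this:

import pandas as pd

# Average goals scored by the home team over its last 10 matches (wide columns)
goal_cols = [f'home_team_history_goal_{j}' for j in range(1, 11)]
df['home_team_average_goal_scored'] = df[goal_cols].mean(axis=1)

# Pool all history goal columns per league to approximate the league average,
# then normalise each team average by its league average
league_avg = df.groupby('league_id')[goal_cols].mean().mean(axis=1)
df['home_team_attack_strength'] = df['home_team_average_goal_scored'] / df['league_id'].map(league_avg)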
To collect more league goal data for a team beyond its 10 past matches, we would normally use team identifiers. But to prevent people from using team names to look up the results of matches to be predicted in the history sections of later rows, the team names are masked in the test set. Since no team ids are provided either, the coach id becomes an imperfect proxy: a coach's records can come from different teams if he has changed teams. So after setting up this set of features by grouping by coach id and league id and starting the XGBoost training, I did not have much expectation and went surfing the internet elsewhere. But when I turned back to watch the validation log loss, I was bewildered: “Oh my God. It descends fast below 1.00, and continues to drop under the bookmakers' benchmark. Is it real?” The validation score stopped at 0.944. I thought there must be something wrong, and made a submission to check whether it could be reproduced on the test set; it gave an even lower 0.907.
Even though the coach goal features involve “future data”, I never thought it could have such a big impact. Later I figured out a possible reason: for goal records grouped by coach and league id that have only one entry, the average goals come exactly from the match to be predicted. So I changed the code to exclude such single-entry cases and redid the training, but the validation score was still low at 0.94. Raising the exclusion threshold to 5 cases, it was 0.96. Raising it further to 10 cases, it was still 0.979. Only when average goal records with 20 or more observations were included did it come in just under 0.99.
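The exclusion itself is straightforward once a long-format table of coach/league goal records exists (here home_coach_goal, as built in the next lesson); a sketch with an illustrative threshold:

# Drop coach/league groups with fewer than min_obs goal records before averaging
min_obs = 10
group_cols = ['home_team_history_coach', 'home_team_history_league_id']
counts = home_coach_goal.groupby(group_cols)['home_team_history_goal'].transform('count')
coach_league_avg = home_coach_goal[counts >= min_obs].groupby(group_cols)['home_team_history_goal'].mean()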
So it seems that even when the goals scored and conceded in the match to be predicted are mixed with data from 4, 9 or even 19 other matches and averaged, they can still leave too big a telltale sign. I learnt a big lesson from it.
2. A Change of Approach to Pandas Dataframe Manipulation Can Increase the Speed 100X
My data leakage problem is somehow related to Pandas skill. The original dataframe puts all the information of the 10 past matches into columns such as home_team_history_goal_1, home_team_history_coach_5 and away_team_history_opponent_rating_10, so it has a very “wide” structure with 190 columns.
In earlier attempts I looped across columns to get aggregated data like this:
df[f'home_team_history_average_goal_scored_{i}'] = df.apply(
    lambda r: pd.Series(r[f'home_team_history_goal_{j}'] for j in range(1, i + 1)).mean(), axis=1)
An operation like this returns instantly for a single row, but looping it over a dataframe with more than 100,000 rows is painfully slow. In my experience, a line like this took minutes to run, so my preprocessing code for feature creation needed 20 to 30 minutes.
Later I discovered that after reshaping the dataframe with the wide_to_long function, so that similar columns are “vertically stacked”, the running time of building the same feature drops to seconds, with code like this:
# Mean goals over the i most recent history rows of each match (long format)
home_team_history_average_goal_scored = df.query(f'match <= {i}').groupby('id').home_team_history_goal.mean()
df = df.join(home_team_history_average_goal_scored.to_frame(name=f'home_team_history_average_goal_scored_{i}'), on='id')
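For reference, the reshape step itself might look roughly like this; a sketch assuming the original wide dataframe is called df_wide and that the wide columns share the numeric suffixes 1 to 10 (the stubnames list is abbreviated):

import pandas as pd

# Stack the 10 sets of history columns into 10 rows per match id,
# keeping the history position in a new 'match' column
stubs = ['home_team_history_goal', 'home_team_history_coach',
         'away_team_history_opponent_rating']  # abbreviated; one stub per history feature
df = pd.wide_to_long(df_wide, stubnames=stubs, i='id', j='match',
                     sep='_', suffix=r'\d+').reset_index()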
Aggregating only data from before the match to be predicted, to avoid data leakage, is a more complex query; a direct approach like this takes more than an hour to run:
# df_grouped: one row per match; home_history: the long-format home history rows
df_grouped = df.groupby('id').first().reset_index()
df_grouped['home_team_coach_league_average_goal_scored'] = df_grouped.apply(
    lambda r: home_history.loc[
        (home_history.home_team_history_match_date < r['match_date']) &
        (home_history.home_team_history_league_id == r['league_id']) &
        (home_history.home_team_history_coach == r['home_team_coach_id'])
    ].home_team_history_goal.mean(), axis=1)
df = pd.merge(df, df_grouped[['id', 'home_team_coach_league_average_goal_scored']], how='left')
A more skillful and much less time-consuming approach needs very clear thinking and careful experimentation, which I did not fully solve, with code like this:
# One row per coach/league/date combination, keeping only the goal column
# (home_history_columns is the list of home history column names)
home_coach_goal = df.drop_duplicates(subset=home_history_columns)[['home_team_history_coach',
    'home_team_history_league_id', 'home_team_history_match_date', 'home_team_history_goal']].dropna()
# Sort chronologically within each coach/league group so cumulative statistics respect time order
home_coach_goal = home_coach_goal.sort_values(
    by=['home_team_history_coach', 'home_team_history_league_id', 'home_team_history_match_date'])
# Expanding sum and count of goals for each coach/league pair
grouped = home_coach_goal.groupby(['home_team_history_coach', 'home_team_history_league_id'])['home_team_history_goal']
home_coach_goal2 = home_coach_goal.assign(cumsum=grouped.cumsum(), cumcount=grouped.cumcount())
# Running average of goals scored up to and including the current history row
home_coach_goal2['home_team_coach_league_average_goal_scored'] = home_coach_goal2['cumsum'] / (home_coach_goal2['cumcount'] + 1)
home_coach_goal3 = home_coach_goal2[['home_team_history_coach', 'home_team_history_league_id',
    'home_team_history_match_date', 'home_team_coach_league_average_goal_scored']]
home_coach_goal3.columns = ['home_team_coach_id', 'league_id', 'home_team_history_match_date',
    'home_team_coach_league_average_goal_scored']
home_coach_goal3 = home_coach_goal3.drop_duplicates(subset=['home_team_coach_id', 'league_id', 'home_team_history_match_date'])
df = df.merge(home_coach_goal3, how='left')
3. I Should Have Made Myself More Comfortable with Neural Networks
Neural networks can be powerful model builders, but I have always had reservations about them: they take longer to train and are less interpretable, so I tend to use other models when the same effect can be achieved, and in this case I wanted to see how far XGBoost could go. But in this competition it is exactly neural networks that are needed. The data of the 10 previous matches forms a mini time series, and the trend of form is better captured by recurrent neural networks (RNN), especially Long Short-Term Memory (LSTM) networks. XGBoost is a powerful algorithm for tabular data, but if we feed the data “flatly” it does not take the time or sequential factor into account: it treats features from different past matches the same, considering only their informative value with respect to the target. And the competition results did show that all the top leaders used LSTM.
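For illustration, a minimal LSTM model for this kind of input (a sketch assuming each match is represented as a sequence of 10 past matches with some number of scaled numeric features; the layer sizes are arbitrary, not the winners' architectures) could be:

import tensorflow as tf

n_steps, n_features = 10, 20  # 10 past matches, hypothetical feature count per match

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(n_steps, n_features)),
    tf.keras.layers.LSTM(64),                        # captures the trend across the 10 matches
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(3, activation='softmax'),  # home win / draw / away win probabilities
])
model.compile(optimizer='adam', loss='categorical_crossentropy')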
4. Even the Documentation of a Well-Known Library May Contain Mistakes
This lesson is not directly related to the competition itself, but I encountered it along the way. I wanted to try the Python API of XGBoost instead of the Scikit-learn API this time, and consulted the XGBoost documentation. In its Python Package Introduction, the demonstration code for setting parameters went like this:
evallist = [(dtest, 'eval'), (dtrain, 'train')]
num_round = 10
bst = xgb.train(param, dtrain, num_round, evallist)
I followed this code and found that early stopping did not work. I later concluded that the train score, rather than the eval score, was being used as the criterion for early stopping, so apparently the evallist line had the order of the train set and test set reversed. It should in fact be:
evallist = [(dtrain, 'train'), (dtest, 'eval')]
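With the corrected order, an early-stopping run looks roughly like this (a sketch assuming dtrain and dtest are existing DMatrix objects and param is a parameter dict); xgb.train monitors the last entry of the evals list:

import xgboost as xgb

evallist = [(dtrain, 'train'), (dtest, 'eval')]  # last entry drives early stopping
bst = xgb.train(param, dtrain, num_boost_round=1000,
                evals=evallist, early_stopping_rounds=20)
print(bst.best_iteration, bst.best_score)        # best round and its eval score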
I thought this mistake might confuse many first-time users, so I raised the issue on XGBoost's GitHub and made a small contribution to XGBoost with my first-ever pull request.
5. The Temptation of Vanity
In Kaggle's permanently ongoing Titanic competition, many people are amused by the phenomenon that so many users pull a complete Titanic passenger record from outside sources to get every survival prediction correct and post perfect accuracy scores; some blast it as “outrageous cheating”. But after my own spell of shooting up the leaderboard, I understand a little better why people do it.
Though my submission assisted by data leakage was not valid, I still held second place on the leaderboard, and in discussions and elsewhere the “second place” label is still attached to me. I can imagine people feel good about that, getting attention and perhaps plenty of praise and upvotes when they publish their notebooks.
And there are many ways of reaching a high position on the leaderboard without putting in much effort of one's own. One may copy high-scoring public notebooks, or pool different models from public notebooks in the name of ensembling.
Of course, some people end up using invalid means by accident or as an experiment. I deselected the best-scoring version, which had the most outrageous data leakage, from my final submissions, but included the version that tried to dilute the leakage with at least 9 other observations, in the hope of raising a discussion.
The other submission was my final version of the solution, which partially got rid of the “future data”; its test score ranked behind dozens of others that used LSTM. So in the end no award was won, but my ranking should be decent for a first-time competitor devising original solutions. And with my first GitHub pull request and my first bronze medal, earned with a notebook about the ratings feature, I think I had not a small harvest from this competition.