Analyzing Wimbledon 2019 Final using Machine Learning Models
For my first end-to-end data science project, I chose a topic built around one of my passions: tennis. The project covers web scraping, data cleaning, exploratory data analysis, clustering, and classification models. It ends with a case study that applies the final classification models to the Wimbledon 2019 Final between Roger Federer and Novak Djokovic, which is the main focus of this blog post.
P.S. I am a fan of Nadal, so no bias on my side!
Initial Data Analysis (Pre-Case Study)
Before getting into our case study, I would like to share some interesting insights from the initial data analysis using all the ATP matches from 1990–2019.
- Service-dominant players struggle on clay and hard courts, heading into the match with a base win rate of 47.9 % and 49.5 % respectively. However, this playing style proves to be very effective on grass courts, with a base win rate of 53.4 %, the highest of any group-to-surface comparison. Overall, it does not pay off to be a service-dominant grass-court specialist, as only 12 % of current ATP tour matches are played on grass.
- No player has ever won a match without earning at least one break-point opportunity, in either the Best-of-3 or the Best-of-5 format. Though it is theoretically possible to do so by winning only tiebreak sets, no one has ever achieved it.
- It may seem counterintuitive, but it is possible to win a match while winning fewer points overall: 4.58 % of match winners won fewer points than their opponent. In the most extreme case, in the first round of the 2002 Australian Open, Stefan Koubek won 30 fewer points than his opponent, Cyril Saulnier, and still won the match.
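As a rough illustration of how the point-deficit figure in the last bullet could be computed, here is a minimal sketch. The function name and the tuple layout are my own assumptions, not the project's actual code, and apart from the Wimbledon 2019 final (whose point totals appear later in this post) the sample rows are made up, so the printed share says nothing about the real 4.58 %.

```python
# Each row is (points won by the match winner, points won by the loser).
sample_matches = [
    (204, 218),  # Wimbledon 2019 final: Djokovic won with fewer points
    (95, 80),    # made-up row
    (120, 101),  # made-up row
    (88, 90),    # made-up row
]


def deficit_win_share(rows):
    """Return the percentage of matches whose winner won fewer points."""
    deficit_wins = sum(1 for winner_pts, loser_pts in rows if winner_pts < loser_pts)
    return 100 * deficit_wins / len(rows)


share = deficit_win_share(sample_matches)
print(f"{share:.2f} % of winners in this toy sample won fewer points")
```

Run over the full 1990–2019 match table instead of the toy sample, the same count-and-divide step would give the share quoted above.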
In conclusion, it is okay to win fewer points than your opponent, but it is not okay to go without break-point opportunities on your return games.
Additionally, it is more effective to build a balanced game and focus on improving return games than to build a service-dominant game, as a stronger return game helps you earn and convert more break points, which proves to be the key to winning, especially on hard and clay courts.
Case Study Part 1: Pre-Match Prediction
I developed 2 classification models, one for prediction and one for analysis. We will first look at the prediction model, which can be used with any set of pre-match statistics.
Classifier: XGBoost with Validation AUC of 76.73 %
For our pre-match case study, we prepared 3 sets of data:
- Tournament Average: This data describes both players’ performance across the first 6 matches of the tournament leading up to the final
- Year-To-Date Average: This data describes both players’ performance from the start of the year 2019 up until the final
- Head-To-Head Average: This data describes both players’ performance from all the past H2H matches
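To make the idea of an "average" dataset concrete, here is a minimal sketch of collapsing several per-match stat lines into a single averaged row, which is the kind of input a classifier would then score. The feature names and all the numbers are hypothetical, chosen only for illustration, and the project's real feature set may differ.

```python
from statistics import mean

# Hypothetical per-match stat lines for one player (all values are made up).
player_matches = [
    {"service_points_won_pct": 70.0, "return_points_won_pct": 40.0, "break_points_converted": 4},
    {"service_points_won_pct": 66.0, "return_points_won_pct": 38.0, "break_points_converted": 3},
    {"service_points_won_pct": 74.0, "return_points_won_pct": 42.0, "break_points_converted": 5},
]


def average_stats(rows):
    """Average every stat across a list of per-match dictionaries."""
    keys = rows[0].keys()
    return {key: mean(row[key] for row in rows) for key in keys}


tournament_average = average_stats(player_matches)
print(tournament_average)
```

The same helper would cover all three datasets: only the list of matches passed in changes (the 6 tournament matches, the 2019 season, or the past head-to-head meetings).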
Tournament Average
We start with the tournament average. Using the .predict_proba function, we can see who played better during the first 6 matches before the final. As shown above, Federer was more dominant in his wins, with his statistics describing a 77.31 % match win rate compared to Djokovic’s 71.14 %.
This becomes more impressive when we take into account that Federer also faced harder opponents than Djokovic on his run to the final.
Paths To Final:
Federer faced 4 seeded players and 2 top 10 players (including Rafael Nadal), while Djokovic did not face any top 20 player before the final.
Using this dataset as a prediction for the match would give Federer a relative edge of 3 to 4 % against Djokovic.
Next, we look at the prediction based on their Year-to-Date statistics prior to this final.
Year-To-Date Statistics
Prior to this match, Federer had accumulated 32 Wins/4 Losses (88.89 %) in 2019 while Djokovic had accumulated 28 Wins/6 Losses (82.35 %). With records like these, the .predict method tells us little, as both players had a successful season with a win rate above 80 %.
However, the .predict_proba method shows some interesting figures. Though Federer had a more successful season with a higher win rate, Djokovic had actually been more dominant on court throughout the year: his average match stats give him a 67.03 % chance of winning any match, while Federer’s give him 55.12 %, indicating less dominant wins over the year.
Based on this dataset, we would expect Djokovic to win the final with an edge of 5–10 % over Federer.
Finally, we look at their previous head-to-head records to better capture their contrasting playing style and the match dynamic when the two of them face each other.
Head-To-Head
Before this match, Djokovic held a 25–22 winning record over Federer.
With the .predict method, we can see that both players bring a good level of tennis when they face each other, as both compiled statistics are capable of winning matches based on the model.
Next, looking at the .predict_proba results, we can see that Federer’s stats had been more dominant than Djokovic’s in their past matches, at 60.23 % against Djokovic’s 51.04 %. This is a relative dominance ratio of 54:46 in favor of Federer, which is interesting given that Djokovic held a 25–22 lead in their head-to-head.
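The 54:46 ratio can be reproduced by rescaling the two win probabilities so that they sum to 100. The sketch below assumes that this is how the relative ratio is derived, and the helper name is my own. It matches the head-to-head figures quoted here.

```python
def relative_share(p_a, p_b):
    """Rescale two independently scored win probabilities so they sum to 100."""
    total = p_a + p_b
    return 100 * p_a / total, 100 * p_b / total


# Head-to-head .predict_proba outputs quoted above (in percent).
federer_share, djokovic_share = relative_share(60.23, 51.04)
edge = federer_share - djokovic_share

print(f"Federer {federer_share:.0f} : Djokovic {djokovic_share:.0f}")
print(f"Edge: {edge:.1f} percentage points")
```

The rescaling is needed because each player's stat line is scored on its own, so the two raw probabilities do not have to add up to 100 %.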
This may be due to the matches earlier in their careers, when Federer was in his prime while Djokovic was still struggling to challenge Federer or Nadal.
Using this model, Federer would be expected to win the final with an edge of 5–10 %. However, this is the least accurate model of the 3, as it takes into account all of their matches from 2006–2018 with no additional weight given to the recent ones, even though the dynamic between them shifted during that period.
Prediction Model Conclusion
The most accurate predictor would be the tournament average, as it captures the players’ most recent form leading up to the match, but the other datasets also offer interesting insights. Using the tournament average, our prediction function gives Federer a relative edge of 3 to 4 % over Djokovic.
Case Study Part 2: Match Analysis
Now we will look at our analysis model, which uses the actual in-game statistics of the Wimbledon 2019 Final.
Classifier: AdaBoost with Validation AUC of 92.60 %
Before we begin our analysis, look at the full match statistics above and guess who won the match. Left or right?
Without any tennis knowledge, basic intuition would suggest that the player on the right won the match. Most of the time you would be right, but not for this match.
Federer performed better in every statistic overall:
- Won 14 more total points (218–204)
- Higher Service Points Won % (68 % vs 64 %)
- Higher Return Points Won % (36 % vs 32 %)
- 5 more break point opportunities (13–8)
- Converted 4 more break points (7–3)
- Higher break point conversion rate (54 % vs 38 %)
- Higher break point save rate (63 % vs 46 %)
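The two break point rates in the list follow directly from the counts above it. Here is a small check, assuming that a player's save rate is the share of the opponent's break point opportunities that were not converted. The helper and variable names are my own.

```python
def pct_half_up(numerator, denominator):
    """Percentage rounded half up (Python's round() would turn 62.5 into 62)."""
    return int(100 * numerator / denominator + 0.5)


# Break point counts quoted in the list above.
federer_chances, federer_converted = 13, 7
djokovic_chances, djokovic_converted = 8, 3

federer_conversion = pct_half_up(federer_converted, federer_chances)
djokovic_conversion = pct_half_up(djokovic_converted, djokovic_chances)

# Break points a player saved = those he faced minus those the opponent converted.
federer_saved = pct_half_up(djokovic_chances - djokovic_converted, djokovic_chances)
djokovic_saved = pct_half_up(federer_chances - federer_converted, federer_chances)

print(f"Conversion: Federer {federer_conversion} %, Djokovic {djokovic_conversion} %")
print(f"Saved:      Federer {federer_saved} %, Djokovic {djokovic_saved} %")
```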
But he lost…
The analysis model that we built predicted a win for Federer based on the full match stats, and perhaps deservedly so: Federer was the better player throughout the match, won 14 more points than Djokovic, and had 2 match points on his own serve in the final set but failed to convert them. This is a false prediction by the analysis model, but it does give insight into the players’ relative performance and into whose performance was more deserving of victory.
Looking at the win probability for each player, the model predicted a relative win rate of 51 % to 49 % in favor of Federer. Federer did play better than Djokovic and would probably have won the match on any other day, but the model also shows that Djokovic did not play terribly: he was only about 2 % below Federer’s level on the day. The difference was that Djokovic raised his level on the pressure points, such as when facing Federer’s break points and match points. This helped him stay in the match until the 5th set despite playing at a lower level than Federer. We will now look at a more detailed set-by-set analysis.
Final Score:
- Set 1: Djokovic won 7–6(5)
- Set 2: Federer won 6–1
- Set 3: Djokovic won 7–6(4)
- Set 4: Federer won 6–4
- Set 5: Djokovic won 13–12(3)
Set-By-Set Analysis:
As discovered in our initial data analysis, no player won a match without a break-point opportunity from 1990–2019. It is theoretically possible but, as the statistics show, highly improbable. Had this been a normal Best-of-3 match, however, Djokovic would have been the only player in that period to accomplish the feat: he had 0 break-point opportunities in the first 3 sets compared to Federer’s 6, yet ended Set 3 with a 2–1 lead in sets, having won the first and third sets via tiebreakers.
As we can see from the predict function above, Federer was forecast to win each of the first 3 sets based on his performance, while Djokovic was forecast to lose all 3. Most tennis spectators think Federer lost the match in the 5th set, when he led 8–7 in games and 40–15 in points with 2 match points on his serve. In reality, Federer lost the match in the 1st and 3rd sets, where he played significantly better than Djokovic but failed to convert his break points and capitalize on his superior performance. Credit is also due to Djokovic, who stepped up his game in the crucial moments to win all 3 tiebreakers, which led him to victory.
Looking at the win probability of each player in each set, we can see a fluctuating performance from Djokovic and a consistent performance from Federer until it fell in sets 4 and 5. This is not surprising, as Federer was a month away from turning 38 at the time of the match and Djokovic was 6 years younger, which gave Djokovic an advantage in physical endurance once the match went past the 3–4 hour mark.
In the final set, the model predicted both players to lose, with a win probability of 49.04 % for Djokovic and 49.64 % for Federer. This signifies that neither player’s performance was worthy of winning the set, but it also shows that both were playing at a similar level, which is why the set went to 12–12 in games and required a final-set tiebreaker to decide the winner.
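A short sketch of why the model can predict a loss for both players in the same set: each player's stat line is scored separately, and a win is predicted only when the win probability reaches the cut-off. The 0.5 cut-off below is an assumption (the usual default for a binary classifier's .predict), and the variable names are my own.

```python
THRESHOLD = 0.5  # assumed default cut-off between "lose" and "win"

# Final-set win probabilities quoted above.
final_set_proba = {"Djokovic": 0.4904, "Federer": 0.4964}

# Each player is labelled independently, so both can fall below the cut-off.
predicted_win = {name: p >= THRESHOLD for name, p in final_set_proba.items()}

# Rescaling the pair to sum to 100 shows how close the two levels were.
total = sum(final_set_proba.values())
relative = {name: 100 * p / total for name, p in final_set_proba.items()}

print(predicted_win)
print({name: round(share, 1) for name, share in relative.items()})
```

After rescaling, the two players sit within about half a percentage point of 50:50, which fits a set that could only be decided by a tiebreaker.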
Though they played at a similar level in the final set, it was Djokovic’s 2nd best set of the match in terms of level of play and Federer’s 2nd worst, indicating that Djokovic’s level improved after set 3 while Federer’s fell gradually.
We will try to investigate this further in our next analysis, which looks at the cumulative performance at the conclusion of each set.
Here, we can see the shift in their relative performance from the start of the match to the end. Federer gradually increased his dominance over Djokovic from the first point to the end of set 2, peaking at 57 % to 43 % in relative performance, a staggering 14 % edge over Djokovic.
After set 2, Federer’s relative performance dipped slightly in set 3 and continued to drop in sets 4 and 5, but he statistically remained the superior player throughout the whole match, from the first point to the last.
Final Thoughts
I specifically chose this match as the ultimate test for my models. Yes, the model failed to predict the winner, but it did give us accurate insights into how the match went. In a lot of ways, Federer should have won this match, and deservedly so. The fact that he was the better player in a 5-hour marathon final against one of the greatest players of all time, a month before his 38th birthday, is mind-boggling.
However, most tennis experts would agree that tennis is not about winning the most points, but about winning the points that matter the most. That is the beauty (and cruelty) of a tennis match: it puts the players under constant mental pressure, making every minute exciting.
Great respect to both Djokovic and Federer for producing a classic match. It wasn’t the highest quality of tennis, but the drama sure made it memorable.
Thanks for reading about my project, and do reach out if you’re interested in talking about tennis, data science, or anything at all!
Project Repository
Do visit my repository to check out the full code from data collection to the case study of my data science project. More stuff will be added in the future too!