Round 8, 2020 Predictions

These tips are temporary: I had problems obtaining the stats from last round, so the predictions below were made without that information. Hopefully this can be corrected soon!

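| Winning Team | Opponent | Date | Probability (%) | Margin |
| --- | --- | --- | --- | --- |
| Western Bulldogs | Gold Coast | 23/07/2020 | 69 | 13 |
| Richmond | Greater Western Sydney | 24/07/2020 | 55 | 1 |
| Carlton | North Melbourne | 25/07/2020 | 60 | 8 |
| Port Adelaide | St Kilda | 25/07/2020 | 79 | 24 |
| West Coast | Collingwood | 26/07/2020 | 54 | 0 |
| Brisbane Lions | Melbourne | 26/07/2020 | 83 | 27 |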
Predictions for Round 8, 2020.

Round 7, 2020 Predictions

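| Winning Team | Opponent | Date | Probability (%) | Margin |
| --- | --- | --- | --- | --- |
| Essendon | Western Bulldogs | 17/07/2020 | 58 | 8 |
| Brisbane Lions | Greater Western Sydney | 18/07/2020 | 50 | 1 |
| Sydney | Gold Coast Suns | 18/07/2020 | 78 | 24 |
| Richmond | North Melbourne | 18/07/2020 | 61 | 7 |
| Port Adelaide | Carlton | 19/07/2020 | 77 | 21 |
| West Coast | Fremantle | 19/07/2020 | 79 | 20 |
| St Kilda | Adelaide | 20/07/2020 | 57 | 8 |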
Predictions for Round 7, 2020.

Round 6, 2020 Predictions

In this round, I have assigned the home team in my model to be the team that last played at the venue. I’ve had to do this since every feature in my linear/logistic model is the away team’s statistic subtracted from the home team’s. This has led to some strange tips, but let’s go with them…

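| Winning Team | Opponent | Date | Probability (%) | Margin |
| --- | --- | --- | --- | --- |
| Geelong | Brisbane Lions | 09/07/2020 | 54 | 3 |
| Fremantle | St Kilda | 11/07/2020 | 54 | 3 |
| West Coast | Adelaide | 11/07/2020 | 81 | 25 |
| Melbourne | Gold Coast | 11/07/2020 | 70 | 15 |
| Essendon | North Melbourne | 11/07/2020 | 63 | 10 |
| Port Adelaide | Greater Western Sydney | 12/07/2020 | 67 | 13 |
| Western Bulldogs | Carlton | 12/07/2020 | 64 | 8 |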
Predictions for Round 6, 2020.

Round 5, 2020 Predictions

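| Winning Team | Opponent | Date | Probability (%) | Margin |
| --- | --- | --- | --- | --- |
| Carlton | St Kilda | 02/07/2020 | 62 | 10 |
| West Coast | Sydney | 04/07/2020 | 62 | 4 |
| Geelong | Gold Coast | 04/07/2020 | 87 | 36 |
| Western Bulldogs | North Melbourne | 04/07/2020 | 66 | 11 |
| Brisbane Lions | Port Adelaide | 04/07/2020 | 58 | 4 |
| Greater Western Sydney | Hawthorn | 05/07/2020 | 50 | 3 |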
Predictions for Round 5, 2020.

Round 4, 2020 Predictions

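| Winning Team | Opponent | Date | Probability (%) | Margin |
| --- | --- | --- | --- | --- |
| Sydney | Western Bulldogs | 25/06/2020 | 56 | 5 |
| Collingwood | Greater Western Sydney | 26/06/2020 | 60 | 2 |
| Port Adelaide | West Coast | 27/06/2020 | 75 | 25 |
| Richmond | St Kilda | 27/06/2020 | 74 | 14 |
| Fremantle | Gold Coast | 27/06/2020 | 59 | 10 |
| Brisbane Lions | Adelaide | 28/06/2020 | 82 | 29 |
| Hawthorn | North Melbourne | 28/06/2020 | 67 | 11 |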
Predictions for Round 4, 2020.

Round 3, 2020 Predictions

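| Winning Team | Opponent | Date | Probability (%) | Margin |
| --- | --- | --- | --- | --- |
| Western Bulldogs | Greater Western Sydney | 19/06/2020 | 56 | 1 |
| North Melbourne | Sydney | 20/06/2020 | 67 | 9 |
| Collingwood | St Kilda | 20/06/2020 | 83 | 22 |
| Brisbane Lions | West Coast | 20/06/2020 | 63 | 14 |
| Adelaide | Gold Coast | 21/06/2020 | 60 | 6 |
| Port Adelaide | Fremantle | 21/06/2020 | 67 | 14 |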
Predictions for Round 3, 2020.

Round 2, 2020 Predictions

Predictions for Round 2, 2020, assuming match time is reduced to 80% of its usual length. No changes to the models. Foopy is back! I have been using machine learning to investigate other things, which I will post about soon. Stay tuned 🙂

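| Winning Team | Opponent | Date | Probability (%) | Margin |
| --- | --- | --- | --- | --- |
| Brisbane Lions | Fremantle | 13/06/2020 | 77 | 20 |
| Port Adelaide | Adelaide | 13/06/2020 | 76 | 26 |
| West Coast | Gold Coast | 13/06/2020 | 86 | 33 |
| Greater Western Sydney | North Melbourne | 14/06/2020 | 62 | 9 |
| Western Bulldogs | St Kilda | 14/06/2020 | 53 | -2 |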
Predictions for Round 2, 2020.

Comparing Models (Linear, Logistic, SVM)

What’s New?

I was invited to join Squiggle, an online AFL ladder where mathematical models are compared using three metrics: correct tips (accuracy of win/loss prediction), margin mean absolute error (accuracy of margin prediction), and bits (accuracy of the predicted probability of winning). Essentially, we want to maximise correct tips, minimise margin mean absolute error, and maximise bits. I’m super excited to join the Squiggle competition, and hopefully I won’t come last in everything ;). The mathematics of bits can be found here, but essentially they measure how accurately your model predicts the probability of a team winning. For example, if Richmond and Carlton were to play each other 100 times, you are trying to predict what percentage of games Richmond would win. Tipping doesn’t account for this: it only measures whether you picked the winning team, not the “confidence” in the tip.
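To make the bits idea concrete, here is a minimal sketch of the scoring rule as it is commonly stated for AFL probabilistic tipping: 1 + log2(p) when the tipped outcome happens, 1 + log2(1 - p) when it doesn't, and a symmetric average for a draw. The function name `bits` and the clipping constant are my own choices for illustration, and the linked page remains the reference for the exact rule used on the ladder.

```python
import math


def bits(p_home: float, result: str) -> float:
    """Bits earned for one game.

    p_home is the predicted probability of a home win;
    result is "home", "away" or "draw".
    """
    # Clip so a 0% or 100% tip that turns out wrong does not hit log2(0).
    p = min(max(p_home, 1e-9), 1 - 1e-9)
    if result == "home":
        return 1 + math.log2(p)
    if result == "away":
        return 1 + math.log2(1 - p)
    if result == "draw":
        return 1 + 0.5 * math.log2(p * (1 - p))
    raise ValueError("result must be 'home', 'away' or 'draw'")


# A coin-flip tip earns nothing either way; a confident wrong tip loses bits.
print(bits(0.5, "home"), bits(0.8, "home"), bits(0.8, "away"))
```

Under this rule a confident correct tip is rewarded, but a confident wrong tip is punished more heavily than a cautious one, which is why a model can be good at tips and still poor at bits.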

I have discovered a few things recently: my data contained duplicate games, and it was also missing games from some seasons. After correcting for this, my model predictions have improved quite significantly. I also feature scaled each player’s stats over a game and then took the average for each of the two teams, instead of feature scaling over the entire sample. This was done in an attempt to better capture which of the two teams performed better in a particular game.
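Here is a rough sketch of what per-game feature scaling could look like. It assumes one game's stats are stored as an array with one row per player, that "scaling" means a z-score across every player in that game, and that the final feature is the home average minus the away average (as in my other models). The function name `game_features` and the toy numbers are made up for illustration.

```python
import numpy as np


def game_features(stats, is_home):
    """Scale each stat across all players in ONE game, then compare teams.

    stats:   (n_players, n_stats) array of raw player stats for the game
    is_home: boolean array, True for players on the home team
    """
    stats = np.asarray(stats, dtype=float)
    is_home = np.asarray(is_home, dtype=bool)
    mean = stats.mean(axis=0)
    std = stats.std(axis=0)
    std[std == 0] = 1.0  # a stat with no spread carries no information
    scaled = (stats - mean) / std
    # Average the scaled stats per team, then take home minus away.
    return scaled[is_home].mean(axis=0) - scaled[~is_home].mean(axis=0)


# Toy game: four players, two stats (say disposals and tackles).
toy_stats = np.array([[10, 2], [20, 4], [30, 6], [40, 8]])
toy_home = np.array([True, True, False, False])
print(game_features(toy_stats, toy_home))
```

Because the scaling is done within the game, the features describe how the two teams performed relative to each other on the day rather than relative to the whole sample.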

Creating and Comparing Models

We left off with a linear regression model to predict margin, and a logistic regression model to predict the probability of winning. I decided to compare the predictions of both models, as well as a simple support vector machine for classification, using a radial basis function (Gaussian) kernel.

Our entire data set contains the previous 3000 games played (approximately 2004 to 2019). I split the data set into two subsets: a test set (the most recent 20% of games, i.e. all games from approximately the beginning of 2017 to the present) and a training set containing the remaining games (80% of the original set). I then used a shuffle split method to optimise the hyperparameters of each model, where for each hyperparameter value the accuracy was averaged over 100 shuffles. This resulted in consistent hyperparameter values for each model. Note: technically, whilst optimising hyperparameters, I split the training set into two disjoint sets on each iteration, one to fit the model (75% of the training set) and one to cross-validate (25% of the training set).
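The splitting procedure above can be sketched roughly as follows. This is an illustration on synthetic data, not my actual pipeline: it assumes the rows are sorted oldest to newest, swaps in a closed-form ridge regression for the model, and scores each candidate regularisation value by MAE. All names (`chronological_split`, `ridge_fit`, `shuffle_split_mae`) and the candidate values are hypothetical.

```python
import numpy as np


def chronological_split(X, y, test_fraction=0.2):
    """Hold out the most recent games (rows are assumed sorted oldest first)."""
    n_train = len(X) - int(round(len(X) * test_fraction))
    return X[:n_train], y[:n_train], X[n_train:], y[n_train:]


def ridge_fit(X, y, lam):
    """Closed-form ridge regression; column 0 is the bias and is not penalised."""
    penalty = lam * np.eye(X.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)


def shuffle_split_mae(X, y, lam, n_shuffles=100, cv_fraction=0.25, seed=0):
    """Average cross-validation MAE over random fit/CV splits of the training set."""
    rng = np.random.default_rng(seed)
    n_cv = int(round(len(X) * cv_fraction))
    errors = []
    for _ in range(n_shuffles):
        order = rng.permutation(len(X))
        cv_idx, fit_idx = order[:n_cv], order[n_cv:]  # disjoint by construction
        theta = ridge_fit(X[fit_idx], y[fit_idx], lam)
        errors.append(np.mean(np.abs(X[cv_idx] @ theta - y[cv_idx])))
    return float(np.mean(errors))


# Synthetic stand-in for 3000 games with 5 features plus a bias column.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(3000), rng.normal(size=(3000, 5))])
y = 10 * (X @ rng.normal(size=6)) + rng.normal(scale=30, size=3000)

X_train, y_train, X_test, y_test = chronological_split(X, y)
candidates = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = {lam: shuffle_split_mae(X_train, y_train, lam) for lam in candidates}
best_lam = min(scores, key=scores.get)
print(len(X_train), len(X_test), best_lam)
```

The test set is never touched while choosing the hyperparameter; it is only used once the value has been fixed.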

From the linear model, the probability of winning was calculated by assuming that the residuals are normally distributed, so that the margin is normally distributed with mean equal to the expected margin, then evaluating the CDF to give the probability that the margin is greater than zero.
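As a small worked sketch of that calculation: if the margin is normal with mean equal to the predicted margin and standard deviation sigma, then P(win) = P(margin > 0) = Φ(predicted margin / sigma). The function name and the sigma of 36 points used below are placeholders for illustration, not the value from my model.

```python
import math


def win_probability(expected_margin: float, residual_std: float) -> float:
    """P(margin > 0) when margin ~ Normal(expected_margin, residual_std**2)."""
    z = expected_margin / residual_std
    # Standard normal CDF written with the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# residual_std=36 is an assumed, illustrative value.
for margin in (-20, 0, 10, 30):
    print(margin, round(win_probability(margin, 36.0), 3))
```

A predicted margin of zero gives exactly 50%, and the probability moves away from 50% faster when the residual spread is smaller.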

See Table 1 below for comparisons between models. Note: The 2019/18 seasons lie in the test set, and are disjoint from the training and CV sets.

We see that the logistic model is slightly better than the linear model for both tips and bits. The SVC model appears to be only slightly better at tips (66.7%) on the test set than the linear and logistic models (66.2% and 66.5%, respectively), but significantly worse when it comes to bits.

| Model | Bits Test/Train | MAE Test/Train (Margin) | Tips Test/Train (%) | Bits 2019/18 | MAE 2019/18 (Margin) | Tips 2019/18 (%) |
| --- | --- | --- | --- | --- | --- | --- |

Table 1: Comparing models over the past 3000 games. Training set = 2400 games, Test set = 600 games.

Interestingly, the MAE of the margin on the test set is slightly smaller than on the training set. I wouldn’t expect this, but I’ve looked over the code a bunch and I’m confident the training and test sets are disjoint. This could possibly be due to MAE not being a good measure of the model’s accuracy, or to the test set size, or simply (as pointed out by MoS) to some years’ margins being more predictable overall than others. Here’s an updated figure of the 2019 AFL Season predictions.

The vertical axis is the predicted probability of the home team winning estimated by the logistic model. The horizontal axis is the predicted margin (home minus away team score) estimated by the linear model.

I think from these comparisons, it is optimal for me to use the linear model for my margin prediction, and logistic model for probability of winning prediction in order to maximise bits. I will update my predictions using these two models now. It’s also interesting to note the performance of the SVM, and I’ll look into possibly improving it along with looking into other models.

Until next time. Have a peaceful day 🙂

Round 1, 2020 Predictions (v3)

Margin and probability predictions for Round 1, 2020 using a slightly updated model (scaling factor has been added due to a change in time per quarter). Have a peaceful day 🙂

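| Winning Team | Opponent | Date | Probability (%) | Margin |
| --- | --- | --- | --- | --- |
| Western Bulldogs | Collingwood | 20/03/2020 | 53 | 2 |
| Port Adelaide | Gold Coast | 21/03/2020 | 84 | 33 |
| Geelong | Greater Western Sydney | 21/03/2020 | 52 | 2 |
| North Melbourne | St Kilda | 22/03/2020 | 72 | 14 |
| Hawthorn | Brisbane Lions | 22/03/2020 | 66 | 11 |
| West Coast | Melbourne | 22/03/2020 | 85 | 29 |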
Predictions for Round 1, 2020.

Logistic Regression and Mistakes

So, once I had found some time outside of working on the thesis, I managed to set up a logistic regression model for estimating the win probability of each game, whilst also cleaning up the code. I also noticed my last model had an error (I’m very much used to making errors) and was not as accurate as I thought. This means I want to improve both my models over the year. They’re not terrible and they’re not the greatest, but they’re at least “competitive”.

I’ll get into what has changed with the linear regression model for predicting margin, then the logistic regression model, the accuracy of both models, and what I’ll do next 🙂

What has changed…

So, it appears I gave my model training data that contained information from future games, and that’s been corrected (yikes). I also simplified the features: originally I created a function that took several statistics and used its output as a feature, whereas the linear regression model now has 23 features (including the bias feature x_{0}). The learning curves for the linear model are given in Figure 1.

Fig 1: Learning curves for the linear regression model predicting margin. Training set (= 2000 games), CV and Test Set (= 500 games each).

Just as a reminder, the cost function is the square of the difference, with a regularization parameter optimised over the cross-validation set. The mean absolute error (MAE) on the test set is \approx 33.57, which I strongly want to improve (accuracy on previous seasons is discussed later). As can be seen in Figure 1, the learning curves converge to the same asymptote with a “high” error, which implies that our model is experiencing high bias. I attempted to increase the number of features by taking combinations of their products and powers, but none of my attempts seemed to fix the high bias problem. I think the only way I can improve it is by getting new features. The two that I want to add first are (1) team chemistry (determine the number of times each player on a team has played with each of their teammates, then take the sum over all players), and (2) distance traveled (determine the distance between each venue, and use the distance traveled since either the last game or the second last game).
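As a sketch of how the team chemistry feature might be computed: count how often each pair of players has appeared in the same lineup, then sum those counts over every player and each of their teammates. This is only one reading of the definition above. The function name, the lineup representation (a list of player names per past game of the same club), and the toy data are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def team_chemistry(lineup, past_lineups):
    """Sum, over every player, of games previously played with each teammate."""
    together = Counter()
    for past in past_lineups:
        # Count each unordered pair once per game they shared.
        for pair in combinations(sorted(set(past)), 2):
            together[pair] += 1
    total = 0
    for player in lineup:
        for mate in lineup:
            if player != mate:
                total += together[tuple(sorted((player, mate)))]
    return total


past_games = [["A", "B", "C"], ["A", "B", "D"]]
print(team_chemistry(["A", "B", "C"], past_games))
```

Summing over players counts every pair twice (once from each side), which matches the definition but could just as well be halved, since a constant factor would be absorbed by the model's parameter for that feature.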

Logistic Regression

Logistic regression is used for classification problems, i.e. where the outcome is discrete. In the case of an AFL match, the two outputs are a loss (denoted by a value of 0) and a win (denoted by a value of 1). In linear regression, our hypothesis (what is used to calculate our margin prediction) was given by h_{\theta}(x) = \theta^{T}x. In the case of logistic regression, our hypothesis is given by the sigmoid function (with parameters \theta)

h_{\theta}(x) = \frac{1}{1 + e^{-\theta^{T}x}}

where \theta, x \in \mathbb{R}^{(n+1)} (with n features). The sigmoid function is used since it maps to the open interval (0,1), essentially returning a probability of the home team winning for a given input x with parameter vector \theta. In order for our cost function to be convex, so that the optimal parameters sit at a single global minimum, we have to use an alternative to the square of the difference. Our cost function is given by (with regularisation parameter \lambda)

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m}(y^{(i)}\log(h_{\theta}(x^{(i)})) + (1 - y^{(i)})\log(1 - h_{\theta}(x^{(i)})) ) + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_{j}^2

Our logistic model takes the same features as our linear model. Again, I chose gradient descent as the optimisation algorithm. The learning curves for the logistic model are given in Figure 2.
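For anyone who wants to see the hypothesis, cost function and gradient descent in one place, here is a minimal NumPy sketch. It uses synthetic data rather than my AFL features, and the learning rate `alpha`, the iteration count and \lambda = 1 are arbitrary values picked for illustration. The bias parameter \theta_{0} is left out of the regularisation term, matching the sum from j = 1 in J(\theta).

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def cost(theta, X, y, lam):
    """Regularised cross-entropy J(theta); theta[0] (bias) is not penalised."""
    m = len(y)
    h = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)  # avoid log(0)
    log_loss = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    return log_loss + lam / (2 * m) * np.sum(theta[1:] ** 2)


def gradient_descent(X, y, lam=1.0, alpha=0.1, n_iters=2000):
    """Batch gradient descent on J(theta), starting from theta = 0."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / m
        grad[1:] += lam / m * theta[1:]  # gradient of the regularisation term
        theta -= alpha * grad
    return theta


# Synthetic data: bias column x_0 = 1 plus three made-up features.
rng = np.random.default_rng(0)
m = 500
X = np.column_stack([np.ones(m), rng.normal(size=(m, 3))])
true_theta = np.array([0.2, 1.0, -0.5, 0.3])
y = (rng.random(m) < sigmoid(X @ true_theta)).astype(float)

theta = gradient_descent(X, y)
p_home = sigmoid(X @ theta)  # predicted probability of a home win per game
print(cost(np.zeros(4), X, y, 1.0), cost(theta, X, y, 1.0))
```

Starting from \theta = 0 every prediction is 0.5, so the initial cost is log 2, and the fitted parameters should bring the cost below that.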

Fig 2: Learning curves for the logistic regression model predicting probability of winning. Training set (= 2000 games), CV and test set (= 500 games each).

After the optimal parameters are found, the hypothesis function h_{\theta}(x) returns the probability of the home team winning, given the features x \in \mathbb{R}^{(n+1)}. Similar to Figure 1, Figure 2 also shows high bias, implying we need to add more features!


I ran both the margin and probability of winning models over the 2018 and 2019 season with up-to-date-round information.

For the 2019 season, the logistic model had 127 correct tips (61.35% accuracy), whereas the average on Squiggle (models only) was 135 correct tips (65.1% accuracy). The linear model had an MAE of 27.73, where the Squiggle average was 27.71. A scatter plot of probability of winning prediction vs margin prediction for the 2019 season is given in Figure 3.

Fig 3: 2019 Season

For the 2018 season, the logistic model had 144 correct tips (69.6% accuracy), whereas the average on Squiggle (models only) was 143 correct tips (69.1% accuracy). The linear model had an MAE of 28.86, where the Squiggle average was 27.35. A scatter plot of probability of winning prediction vs margin prediction for the 2018 season is given in Figure 4.

Fig 4: 2018 Season

What is interesting in both Figure 3 and Figure 4 is the small range in margin prediction for both seasons. Notice how the margin predictions for both seasons lie in the range of approximately (-25, 35) points, whereas high performing models like Live Ladders, Massey Ratings, and Squiggle have margin prediction ranges of approximately (-40, 50), (-45, 50), and (-50, 50) respectively (see the really beautiful Matter of Stats post). Another interesting property of both Figures 3 and 4 is the linearity of the two plots.

What next?

My results for both the 2019 and 2018 seasons are not as good as I would like them to be. But there are still features to add that could possibly improve them. After that is accomplished, I will move on to creating neural networks for both margin prediction and probability of winning prediction.

Thanks for reading and have a peaceful day 🙂