# Logistic Regression and Mistakes

So, once I found some time outside of working on the thesis, I managed to set up a logistic regression model for estimating the win probability of each game, whilst also cleaning up the code. I also noticed my last model had an error (I'm very much used to making errors) and was not as accurate as I thought. This means I want to improve both of my models over the year; they're not terrible, but they're not the greatest either. At least they're "competitive".

I’ll get into what has changed with the linear regression model for predicting margin, then the logistic regression model, the accuracy of both models, and what I’ll do next 🙂

## What has changed…

So, it turns out I gave my model training data that contained information from future games, and that has now been corrected (yikes). I also changed the features to make them simpler, instead of what I originally did, which was to create a function that combined several statistics and used its output as a single feature. The linear regression model now has 23 features (including the bias feature $x_{0}$). The learning curves for the linear model are given in Figure 1.

Fig 1: Learning curves for the linear regression model predicting margin. Training set (2000 games), CV and test set (500 games each).

Just as a reminder, the cost function is the squared difference, with a regularisation parameter optimised over the cross-validation set. The mean absolute error (MAE) on the test set is $\approx 33.57$, which I strongly want to improve (accuracy on previous seasons is discussed later). As can be seen in Figure 1, the learning curves converge to the same asymptote while the error remains high, which implies that our model is experiencing high bias. I attempted to increase the number of features by taking combinations of their products and powers, but none of my attempts fixed the high-bias problem. I think the only way I can improve it is by getting new features. The two I want to try first are (1) team chemistry (determine the number of games each player on a team has played with each of their teammates, then take the sum over all players), and (2) distance travelled (determine the distance between venues, and use the distance travelled since either the last game or the second-last game).
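As a rough sketch of what the team-chemistry feature could look like, here's a minimal version that counts each pair of teammates once (the per-player sum described above would count each pair twice, so it would simply be double this). The `pair_counts` lookup table and player names are hypothetical placeholders, not part of my actual pipeline:

```python
from itertools import combinations

def team_chemistry(lineup, pair_counts):
    """Sum, over all pairs of players in the lineup, of the number of
    games that pair has previously played together."""
    return sum(pair_counts.get(frozenset(pair), 0)
               for pair in combinations(lineup, 2))

# Hypothetical three-player lineup: A and B have 10 games together,
# B and C have 4, and A and C have never played together.
history = {frozenset({"A", "B"}): 10, frozenset({"B", "C"}): 4}
print(team_chemistry(["A", "B", "C"], history))  # 14
```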

## Logistic Regression

Logistic regression is used for classification problems, i.e. where the outcome is discrete. In the case of an AFL match, the two outcomes are a loss (denoted by a value of 0) and a win (denoted by a value of 1). In linear regression, our hypothesis (what is used to calculate our margin prediction) was given by $h_{\theta}(x) = \theta^{T}x$; in the case of logistic regression, our hypothesis is given by the sigmoid function (with parameters $\theta$): $h_{\theta}(x) = \frac{1}{1 + e^{-\theta^{T}x}}$
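In code, the hypothesis is just the sigmoid applied to the dot product of the parameters and the feature vector. This is a minimal sketch (the toy values for $\theta$ and $x$ are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    # Maps any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    # h_theta(x) = sigmoid(theta^T x); x includes the bias feature x_0 = 1
    return sigmoid(theta @ x)

theta = np.array([0.0, 0.5])   # made-up parameters
x = np.array([1.0, 2.0])       # bias feature plus one input
print(hypothesis(theta, x))    # sigmoid(1.0) ≈ 0.731
```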

where $\theta, x \in \mathbb{R}^{(n+1)}$ (with $n$ features). The sigmoid function is used since it maps to the open interval $(0,1)$, essentially returning a probability of the home team winning for a given input $x$ with parameter vector $\theta$. In order for our cost function to have a global minimum (so that the optimal parameters can be determined), we have to use an alternative to the squared difference. Our cost function is given by (with regularisation parameter $\lambda$) $J(\theta) = -\frac{1}{m} \sum_{i=1}^{m}(y^{(i)}\log(h_{\theta}(x^{(i)})) + (1 - y^{(i)})\log(1 - h_{\theta}(x^{(i)})) ) + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_{j}^2$
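The cost function above translates fairly directly into NumPy. A minimal sketch (note the regularisation sum starts at $j = 1$, so the bias parameter $\theta_0$ is excluded):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y, lam):
    """Regularised cross-entropy cost J(theta).
    X is (m, n+1) with a bias column of ones; theta_0 is not regularised."""
    m = len(y)
    h = sigmoid(X @ theta)
    cross_entropy = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    reg = lam / (2 * m) * np.sum(theta[1:] ** 2)
    return cross_entropy + reg

# With theta = 0 every prediction is 0.5, so the cost is log(2) ≈ 0.693
X = np.array([[1.0, 2.0], [1.0, 3.0]])  # made-up data
y = np.array([0.0, 1.0])
print(cost(np.zeros(2), X, y, lam=1.0))
```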

Our logistic model takes the same features as our linear model. Again, I chose gradient descent as the optimisation algorithm. The learning curves for the logistic model are given in Figure 2.

Fig 2: Learning curves for the logistic regression model predicting probability of winning. Training set (2000 games), CV and test set (500 games each).
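For completeness, here's a bare-bones batch gradient descent loop for the regularised cost above. The learning rate and iteration count are arbitrary placeholders, not the values I actually used:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, lam, alpha=0.1, iters=1000):
    """Batch gradient descent for regularised logistic regression.
    The bias parameter theta_0 is excluded from the regularisation term."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        h = sigmoid(X @ theta)
        grad = X.T @ (h - y) / m       # gradient of the cross-entropy term
        grad[1:] += lam / m * theta[1:]  # gradient of the regularisation term
        theta -= alpha * grad
    return theta

# Toy separable data: the model should learn to tip by the sign of x_1
X = np.array([[1.0, -1.0], [1.0, -2.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = gradient_descent(X, y, lam=0.0)
print((sigmoid(X @ theta) > 0.5).astype(float))  # [0. 0. 1. 1.]
```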

After the optimal parameters are found, the hypothesis function $h_{\theta}(x)$ returns the probability of the home team winning, given the features $x \in \mathbb{R}^{(n+1)}$. Like Figure 1, Figure 2 also shows high bias, implying we need more features!

## Results

I ran both the margin and probability-of-winning models over the 2018 and 2019 seasons with up-to-date round information.

For the 2019 season, the logistic model had 127 correct tips (61.35% accuracy), while the Squiggle average (models only) was 135 correct tips (65.1% accuracy). The linear model had an MAE of 27.73, where the Squiggle average was 27.71. A scatter plot of predicted probability of winning vs predicted margin for the 2019 season is given in Figure 3.
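The two metrics above come down to a simple per-game comparison: a "correct tip" means the tipped winner matched the actual winner, and the MAE averages the absolute margin errors. A minimal sketch with made-up numbers for a hypothetical five-game round:

```python
import numpy as np

def evaluate(win_prob, pred_margin, actual_margin):
    """Tipping accuracy of the logistic model and MAE of the linear model.
    Positive margins mean the home team won."""
    tipped_home = win_prob > 0.5
    home_won = actual_margin > 0
    accuracy = np.mean(tipped_home == home_won)
    mae = np.mean(np.abs(pred_margin - actual_margin))
    return accuracy, mae

# Hypothetical five-game example (not real season data)
acc, mae = evaluate(np.array([0.7, 0.4, 0.55, 0.9, 0.2]),
                    np.array([12.0, -5.0, 3.0, 30.0, -18.0]),
                    np.array([20.0, 4.0, -1.0, 25.0, -30.0]))
print(acc, mae)  # 0.6 7.6
```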