English Premier League Predictive Model
The world of data science in sports is a seemingly ever-growing field, with new statistical metrics and predictions being created all the time, and the world's most beautiful game is no exception. There are thousands of different metrics to choose from when trying to evaluate a game of soccer, but which ones are the best, and which ones can we use to build a model that predicts whether a team is going to win, draw, or lose? I found a dataset on Kaggle containing every result from the 2019–2020 English Premier League season, with 576 observations and 45 features. The features included statistics such as goals scored and goals missed (goals conceded), predictive metrics such as xG (expected goals, the probability that a shot results in a goal), xGA (expected goals allowed), npxG (non-penalty expected goals), and xPTS (the expected number of points, from 0 to 3), as well as details like who was refereeing the game and what time the match took place. A hefty number of those features weren't going to help my model, so after dropping many of those columns I was left with a dataset of 576 observations and 16 features. Then it was time to begin building the model, with the first step being to separate X and y into two separate data frames, followed by a train-test split and a train-validation split.
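Here's a minimal sketch of that setup. The file name and the column names (such as pts as the target) are assumptions based on a typical Understat-style Kaggle export, not necessarily the exact code used here:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the season results (hypothetical file name).
df = pd.read_csv('epl_2019_2020.csv')

# Separate the target (points earned: 0, 1, or 3) from the features.
y = df['pts']
X = df.drop(columns=['pts'])

# Hold out a test set, then carve a validation set out of the training data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
```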
Now that we have our training, test, and validation splits done, we can start to build the model. Because this is a classification problem, I'm going to start with a Random Forest Classifier and then determine its accuracy.
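Something along these lines, fitting the forest on the training data and checking accuracy on the held-out data (default hyperparameters assumed):

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

# Compare training accuracy to held-out accuracy to spot overfitting.
print('Training accuracy:  ', rf.score(X_train, y_train))
print('Validation accuracy:', rf.score(X_val, y_val))
```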
We can see that the model is performing really well, but that's not always a great sign at first; it can signal feature leakage, where the model gets a high accuracy score because some of the features are handing it the answers before it has taken the test. The perfect training accuracy also suggests that the model is overfitting the training data. To examine this further, we need to bring up our feature importances and see what could be causing the leakage by analyzing which features matter most to the model.
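A quick sketch of plotting the importances from the fitted forest (using the rf variable from the sketch above):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Pair each importance score with its feature name and plot a horizontal bar chart.
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
importances.sort_values().plot(kind='barh')
plt.title('Random forest feature importances')
plt.tight_layout()
plt.show()
```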
Rather predictably, the two most important features to the model by far are goals scored and goals missed, shocker! Those two are followed by xPTS (expected points) and npxGD (non-penalty expected goal difference). It's clear that these features make the model a little unfair, because it's essentially making a prediction from other predictions, but the bar chart does give us a pretty good idea of which of those predictions has the greatest impact on the outcome of a game. When considering which of the non-predictive statistical features I should use, I was drawn to the ppda statistic. Ppda measures the intensity with which a team presses their opponent, and is calculated as the number of passes made by the attacking team divided by the number of defensive actions made by the defending team. Here is a link to a much more thorough explanation of this statistic: https://statsbomb.com/2014/07/defensive-metrics-measuring-the-intensity-of-a-high-press/.
In summary, a lower ppda reflects a higher pressing intensity, and a higher ppda reflects a lower pressing intensity. So let's see if we can build a model that predicts the outcome of a game solely from ppda and ppda allowed. First, let's take a look at the relationship between ppda and pts.
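A simple scatter plot gets us there; the column names ppda_coef and pts are assumptions based on the Understat-style dataset:

```python
import matplotlib.pyplot as plt

# Each point is one team's performance in one match.
plt.scatter(df['ppda_coef'], df['pts'], alpha=0.3)
plt.xlabel('ppda')
plt.ylabel('points earned (0, 1, or 3)')
plt.title('ppda vs. points')
plt.show()
```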
While this isn't the prettiest scatter plot, because our y variable has only three values, it does give us a decent idea of the relationship: the higher the defensive intensity (the lower the ppda), the slightly better the chance of earning more points. Now let's repeat the previous steps and build another random forest model, this time with only these two features, and then assess the model's accuracy.
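The two-feature version, again as a sketch with assumed column names (ppda_coef for ppda, oppda_coef for ppda allowed):

```python
from sklearn.ensemble import RandomForestClassifier

# Restrict the features to pressing intensity and pressing intensity allowed.
ppda_features = ['ppda_coef', 'oppda_coef']
X_train_ppda = X_train[ppda_features]
X_val_ppda = X_val[ppda_features]

rf_ppda = RandomForestClassifier(random_state=42)
rf_ppda.fit(X_train_ppda, y_train)

print('Training accuracy:  ', rf_ppda.score(X_train_ppda, y_train))
print('Validation accuracy:', rf_ppda.score(X_val_ppda, y_val))
```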
As you can see, our random forest model doesn't have nearly as high a testing accuracy this time, at 39%, but that is to be expected since the model has only two features to train on as opposed to the previous 16. We also have the same problem as before: the model has a perfect training accuracy, which means it is overfitting. However, we are not stuck with this model, as there are many different models we can build and tune to try to get the highest possible accuracy score. First, let's try a simple logistic regression model and see how it compares to the random forest.
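A baseline logistic regression on the same two features might look like this (the scaling step is an assumption on my part, since logistic regression is sensitive to feature scale):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale the two ppda features, then fit a multinomial logistic regression.
logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logreg.fit(X_train_ppda, y_train)

print('Training accuracy:  ', logreg.score(X_train_ppda, y_train))
print('Validation accuracy:', logreg.score(X_val_ppda, y_val))
```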
The logistic regression actually scored about 4 percent better than the random forest on testing accuracy and is not overfitting the training data. Next we'll try a randomized search, which samples random combinations of hyperparameters and keeps the best-performing ones.
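A sketch using scikit-learn's RandomizedSearchCV over a random forest; the parameter ranges and number of iterations here are illustrative choices, not the exact search used in the article:

```python
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

# Sample random hyperparameter combinations and keep the best by cross-validation.
param_distributions = {
    'n_estimators': randint(50, 500),
    'max_depth': randint(2, 10),
    'min_samples_leaf': randint(1, 20),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=25,
    cv=5,
    random_state=42,
)
search.fit(X_train_ppda, y_train)

print('Best parameters:    ', search.best_params_)
print('Training accuracy:  ', search.score(X_train_ppda, y_train))
print('Validation accuracy:', search.score(X_val_ppda, y_val))
```

Limiting max_depth and min_samples_leaf is also what reins in the overfitting we saw with the default random forest.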
The randomized search does the best of all the models in both training accuracy and testing accuracy, so it makes sense to move forward with it.
To further evaluate a classification model, it's always a good idea to print out a classification report, which shows how the model performs at predicting a win, draw, and loss (3 pts, 1 pt, 0 pts).
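Something like the following, scoring the tuned model on the held-out data:

```python
from sklearn.metrics import classification_report

# Precision, recall, and F1 for each outcome (0, 1, or 3 points).
y_pred = search.predict(X_val_ppda)
print(classification_report(y_val, y_pred))
```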
This report shows that the model does best at predicting when a team will get 0 pts and has the toughest time predicting when a team will get 1 pt. To get another visualization of how the model predicts these three outcomes, we can plot a confusion matrix.
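scikit-learn's ConfusionMatrixDisplay (available in version 1.0 and later) handles the plotting for us:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Rows are the actual points earned, columns are the model's predictions.
ConfusionMatrixDisplay.from_estimator(search, X_val_ppda, y_val)
plt.title('Predicted vs. actual points')
plt.show()
```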
Because our target is non-binary, the matrix can be slightly more confusing to read, but it still holds value. The confusion matrix largely reiterates what the classification report was telling us: the model has a tough time predicting draws and an easier time predicting wins and losses.
Lastly, we can assign our randomized search model to a variable and generate a list of predictions.
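For example, taking the best estimator found by the search and predicting on the test set (ppda_features as defined earlier):

```python
# Keep the best model from the search and predict points for unseen matches.
final_model = search.best_estimator_
predictions = final_model.predict(X_test[ppda_features])
print(predictions[:10])
```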
Now we have created a predictive model that takes the ppda and ppda allowed statistics and generates a prediction as to whether a team will receive 0 points, 1 point, or 3 points in a given game.