AI to figure out Lord Stanley's pick
Can we figure out the 2019 Stanley Cup Champions using AI and available data? Let's find out.
For every Canadian hockey fan, the most exciting season is about to start! While grieving the Habs' absence, I decided to tackle the task of predicting the Stanley Cup Champions with simple ML algorithms and available data. With an hour at hand, how good can such a predictor be?
It is a bit ironic for a guy who works with data every day to trust his gut on the chances of his favourite team winning the Cup. Indeed, every year, I believe in the Habs' chances to go all the way but end up disappointed. 25 times in a row, to be more precise. Let’s look at this problem a bit more pragmatically. The NHL API provides publicly available stats, so let's dive in.
For this small exploration, in addition to TensorFlow and sklearn, I wanted to try plotly express. My team and I went to the 74th Montréal Python meetup last week and enjoyed the talk Nicolas Kruchten gave on it. Since I'm in a hurry, I will also use the amazing mlxtend module for modeling and feature selection.
The dataset provided by the NHL API is the following. We can also generate a lot of features by making more calls, but let’s start simple and see if we can achieve good results first.
I chose to start the analysis in 1942, since this is the year from which the league was made up of the 6 original teams.
There are multiple ways to frame this problem. I could try to predict the winner's rank, the number of victories in the playoffs, the winner of each round or simply whether the team will win the Cup or not. Because simple models are a good benchmark, I chose to go with a binary target defined as 0 if the team did not win the Cup and 1 if they won. As you may guess, this target will be unbalanced since only one team wins the Cup per year. I know, this is deep.
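To make the target concrete, here is a minimal sketch of how such a binary column could be built with pandas. The data, the column names (`season`, `team`, `wonCup`) and the `cup_winners` lookup are all made up for illustration and are not the actual NHL API fields.

```python
import pandas as pd

# Hypothetical toy data: one row per team per season (values are invented).
teams = pd.DataFrame({
    "season": [2017, 2017, 2017, 2018, 2018, 2018],
    "team": ["A", "B", "C", "A", "B", "C"],
    "points": [105, 98, 90, 100, 110, 95],
})

# Hypothetical lookup of the Cup winner for each season.
cup_winners = {2017: "A", 2018: "C"}

# Target: 1 if the team won the Cup that season, 0 otherwise.
teams["wonCup"] = (teams["team"] == teams["season"].map(cup_winners)).astype(int)

# Exactly one positive per season, which is where the imbalance comes from.
positives_per_season = teams.groupby("season")["wonCup"].sum()
positive_rate = teams["wonCup"].mean()
```

With three teams per season the positive rate is one in three; with a full league it drops to roughly one in thirty, which is why the imbalance matters later on.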
Let’s look at the teams' historical performances
In this brief EDA, I want to clean the data and look for features with some possible predictive power. Basically, we want to know if the performance of a team in the regular season can help us predict the outcome of the playoffs.
First, let’s gain some intuition. We always hear hockey experts say that there is an overall increase in the number of goals scored, don't we? In order to be able to verify this statement, we need to normalize this feature with the number of games played since the seasons weren't always 82 games long.
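The normalization itself is a one-liner. The sketch below uses invented numbers and assumes columns named `gamesPlayed`, `goalsScored` and `goalsAgainst`; the real column names in the dataset may differ.

```python
import pandas as pd

# Hypothetical rows; the numbers are invented and the column names are assumed.
seasons = pd.DataFrame({
    "season": [1950, 1950, 2018, 2018],
    "gamesPlayed": [70, 70, 82, 82],
    "goalsScored": [210, 175, 246, 205],
    "goalsAgainst": [140, 210, 205, 246],
})

# Divide by games played so that 70-game and 82-game seasons are comparable.
seasons["goalsScoredPerGame"] = seasons["goalsScored"] / seasons["gamesPlayed"]
seasons["goalsAgainstPerGame"] = seasons["goalsAgainst"] / seasons["gamesPlayed"]

# Average scoring rate per season across teams.
league_rate = seasons.groupby("season")["goalsScoredPerGame"].mean()
```

In this toy example the raw goal totals are higher in 2018, yet the per-game rate is identical in both seasons, which is exactly the distortion the normalization removes.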
We could argue that goalsAgainst is strongly correlated with winning, but is this true? In other words, is there a strong correlation between goalsScored, goalsAgainst and wins?
We can see that winning the Stanley Cup is linearly related to the number of points and to the ratio of goalsAgainst to goalsScored. It makes sense so far!
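One quick way to check this kind of question numerically is a Pearson correlation matrix. This is only a sketch on invented numbers, with assumed column names, to show the mechanics rather than the actual result.

```python
import pandas as pd

# Hypothetical team-season stats (invented values, assumed column names).
stats = pd.DataFrame({
    "wins": [50, 45, 40, 35, 30, 25],
    "goalsScored": [280, 260, 250, 230, 220, 200],
    "goalsAgainst": [200, 215, 230, 240, 260, 275],
})

# Ratio discussed in the text: lower is better for the team.
stats["againstToScoredRatio"] = stats["goalsAgainst"] / stats["goalsScored"]

# DataFrame.corr() computes pairwise Pearson correlations by default.
corr = stats.corr()

# Correlation of every other column with wins.
wins_corr = corr["wins"].drop("wins")
```

On real data the values will be less extreme than in this toy table, but the sign of each entry in `wins_corr` is what answers the question.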
Let’s see if we can detect abnormal observations with the px.scatter_matrix function, which allows us to plot all the features in a dataset in a one-liner. I would usually use either a boxplot or distplot to identify potential anomalies but plotly_express does not provide those functionalities at the moment.
The plot lacks refinement and is a bit messy. We would need to play around with angles, label names and more to make it readable. I'd also like to have a way to modify the diagonal in order to show distributions instead of a scatter plot. Not much there. Let’s run a proper anomaly detection analysis.
I created a simple model in TensorFlow to detect the potential anomalies in our dataset. I will not remove these observations, but I want to build features that allow the model to detect them! There are a lot of ways we could do this but I personally enjoy autoencoders for their simplicity and effectiveness.
Pure detection of Stanley Cup winners with this unsupervised technique is not working very well. However, we can still use this information as a feature for our supervised model. Below is a graphic of the observations flagged as anomalies (higher than the threshold). I did not use plotly express for this one since it lacks many of the features needed for a similar graph.
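The idea behind autoencoder-based anomaly detection is to train a network to reconstruct its own input through a narrow bottleneck, then flag the rows it reconstructs poorly. The sketch below is not the TensorFlow model used here: to stay short it swaps in scikit-learn's `MLPRegressor` trained to map the features onto themselves, runs on synthetic data, and uses an arbitrary 95th-percentile threshold. All of those choices are assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for team-season features: 6 columns driven by 2 hidden factors.
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.1 * rng.normal(size=(300, 6))
# Perturb a few rows so that they no longer follow the common structure.
X[:5] += rng.normal(scale=3.0, size=(5, 6))

X_scaled = StandardScaler().fit_transform(X)

# Autoencoder stand-in: the 2-unit hidden layer is the bottleneck,
# and the target is the input itself.
autoencoder = MLPRegressor(hidden_layer_sizes=(2,), activation="tanh",
                           max_iter=2000, random_state=0)
autoencoder.fit(X_scaled, X_scaled)

# Per-row reconstruction error (mean squared error across features).
reconstruction = autoencoder.predict(X_scaled)
errors = np.mean((X_scaled - reconstruction) ** 2, axis=1)

# Assumed rule: flag the 5% of rows with the largest error.
threshold = np.percentile(errors, 95)
is_anomaly = errors > threshold

# Keep the error as an extra feature instead of dropping the flagged rows.
features_with_error = np.column_stack([X_scaled, errors])
```

Either `errors` or `is_anomaly` can then be appended to the feature set of a supervised model, which is the use described in the text.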
The moment all data scientists can't wait to get to... building a model! To demonstrate a different algorithm from the one Julien used in his impressively accurate UFC simulator, I chose to use a voting classifier. A voting classifier uses the knowledge from a set of models to predict the class of an observation. We can do either a soft vote based on probabilities or a hard vote based on the predicted classes. I chose the former because we will only use the highest probability for each year to determine the winner of the Cup. In other words, we won’t have any kind of threshold since there is only one winner each year. The models we use in our voting classifier come from a variety of model families (i.e. boosting, distance-based, Bayesian, linear…).
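As a rough sketch of that idea, the block below builds a soft-voting ensemble with scikit-learn's `VotingClassifier` and then picks, for each season, the team with the highest predicted probability. The data is synthetic, the feature names (`points`, `goalRatio`) and the four member models are assumptions standing in for the families listed above, and the model is scored on its own training data purely to show the mechanics.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n_seasons, n_teams = 20, 8

# Synthetic team-season table (not NHL data).
df = pd.DataFrame({
    "season": np.repeat(np.arange(n_seasons), n_teams),
    "points": rng.normal(90, 10, n_seasons * n_teams),
    "goalRatio": rng.normal(1.0, 0.15, n_seasons * n_teams),
})
# Synthetic target: the team with the most points in a season "wins".
df["wonCup"] = (df.groupby("season")["points"].transform("max") == df["points"]).astype(int)

features = ["points", "goalRatio"]
voter = VotingClassifier(
    estimators=[
        ("boost", GradientBoostingClassifier(random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("bayes", GaussianNB()),
        ("linear", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",  # average the predicted probabilities of the members
)
voter.fit(df[features], df["wonCup"])

# Probability of class 1 (won the Cup) for every team-season.
df["winProba"] = voter.predict_proba(df[features])[:, 1]

# No threshold: keep the single most likely team per season.
predicted = df.loc[df.groupby("season")["winProba"].idxmax()]
```

Taking the per-season maximum guarantees exactly one predicted winner per year, which a fixed probability threshold would not.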
Building the Model
First, since the data is highly unbalanced, I chose to compare 2 strategies: a model where we upsample the dataset using the SMOTE algorithm and another without any kind of upsampling.
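SMOTE creates synthetic minority rows by interpolating between a minority observation and one of its nearest minority neighbours. The function below is a simplified, hand-rolled illustration of that interpolation step, not the implementation used for this post; `smote_like_oversample` and the synthetic data are hypothetical. A full implementation is available as the `SMOTE` class in the imbalanced-learn package.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like_oversample(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between a minority row
    and one of its k nearest minority neighbours (simplified SMOTE idea)."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    # Each point is its own nearest neighbour (column 0), so drop that column.
    neighbours = nn.kneighbors(X_minority, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_minority), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # position along the segment, in [0, 1)
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])


rng = np.random.default_rng(1)
X_winners = rng.normal(loc=1.0, size=(10, 4))  # minority class, synthetic
X_others = rng.normal(loc=0.0, size=(90, 4))   # majority class, synthetic

# Generate enough synthetic winners to match the majority class size.
X_synthetic = smote_like_oversample(X_winners, n_new=len(X_others) - len(X_winners))
X_balanced = np.vstack([X_others, X_winners, X_synthetic])
y_balanced = np.concatenate([
    np.zeros(len(X_others)),
    np.ones(len(X_winners) + len(X_synthetic)),
])
```

Upsampling should only be applied to the training split; the test set has to keep its natural one-winner-per-year imbalance for the evaluation to mean anything.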
I ran a hyperparameter optimization on each of the models used in the voter. We don’t optimize each model independently and then use it as a vote; we optimize all the algorithms together, so that each one is tuned with respect to changes in the parameters of its colleagues. Without going nuts on model selection, feature engineering and hyperparameter optimization, we were able to reach 88% accuracy on the test set at detecting the winner of the Cup since 1942 with the upsampling strategy. In case you were wondering, the models predicted that:
Washington would win the Cup last year
The Kings would win in 2011-12, which was a harder one to call
Not bad at all!
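The joint tuning of the voter described above can be sketched with scikit-learn's `GridSearchCV`, which searches over combinations of the members' parameters rather than tuning each model on its own. The data, the two member models, the grid values and the `roc_auc` scoring are assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, moderately unbalanced data standing in for the team-season table.
X, y = make_classification(n_samples=200, n_features=6,
                           weights=[0.8, 0.2], random_state=0)

voter = VotingClassifier(
    estimators=[
        ("linear", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier()),
    ],
    voting="soft",
)

# The "<estimator name>__<parameter>" prefix reaches inside the ensemble,
# so every combination is scored on the ensemble's output, not on a single model.
param_grid = {
    "linear__C": [0.1, 1.0, 10.0],
    "knn__n_neighbors": [3, 5, 7],
}

search = GridSearchCV(voter, param_grid, cv=3, scoring="roc_auc")
search.fit(X, y)
best_params = search.best_params_
```

The grid grows multiplicatively with each added model, so with more members a randomized search is usually a more practical way to explore the same joint space.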
So, who will Lord Stanley choose this year?
We can expect competitive playoffs! There is only a 3% difference between the 1st and 2nd teams. Tampa Bay comes out 5th. Compared with the predictions for 2011-12, where the model picked the Kings as the winner with a 94% chance, this year seems more challenging.
The top features for the models are the goals against and the goals scored, as expected. Don’t we always say that it is the defense that wins championships?
On the technologies I tried, plotly express was very useful but is still young and missing some very useful plots! However, it’s a good first experience and I think the interactive aspect of it is incredible.
On the models used, I think this is a good starting point, but we could go a lot deeper in this analysis. For example, we could use the results of each game to create a feature useful for predicting the winner of the Cup. We could also describe each team using player data, which could give us better insight into how "hot" the key players are entering the playoffs. The target could also be made more complex in order to predict the result of each round of the playoffs, and so on.
Most people would say that Tampa Bay should win the Cup this year. However, let’s hope the model is right and a Canadian team can win. They would be the first since the Habs of 1993!