Regularized Adjusted Plus-Minus (RAPM) is a widely used technique for evaluating players in the NBA and, more recently, the NHL. It strives to quantify each player's individual effect, separated from the effects of the players around them. Traditional plus-minus is notoriously poor at this, while RAPM is specifically designed to mitigate the effects of collinearity through the use of ridge regression.
In this article, I'll be taking an initial dive into turning NHL play-by-play data into a simple RAPM calculation for the 2019-20 regular season. This is the first of three articles, and is a simple and slapdash look at how a player influences the Corsi +/- while they're on the ice, without separating their offensive and defensive effects. In the second, I will look at more factors to create separate offensive and defensive evaluations for skaters, and the (planned) third installment will incorporate a Bayesian prior when calculating the impact of skaters on on-ice results.
Circling back, ridge regression is the tool of choice for calculating RAPM because it can handle the extreme collinearity that comes with this kind of data. The data is framed as a set of indicator variables marking which players are on the ice, with the Corsi +/- during that particular stint as the target. As you would imagine, it's difficult to parse out the individual effects of line-mates who share most of their ice time. Ridge regression addresses this by penalizing large coefficients, which shrinks the weights towards zero and reduces their variance (for our purposes, these coefficients represent how good players are). Researching and coding this project was my first exposure to this kind of regularization, so this information is a summary of various sources, most crucially EvolvingWild's excellent writeup here.
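To make the shrinkage concrete, here is a minimal sketch on made-up data, not the code from this project. The encoding (+1 for a home skater on the ice, -1 for an away skater, 0 for the bench), the toy numbers, and the helper name `ridge_coefficients` are all assumptions for illustration. Two line-mates always share the ice, so their columns are identical and ordinary least squares has no unique answer, while the closed-form ridge solution does.

```python
import numpy as np

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution (X^T X + alpha * I)^-1 X^T y, with no intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Hypothetical stints; columns are players A, B, C, D.
# +1 = on the ice for the home side, -1 = on the ice for the away side, 0 = on the bench.
# A and B are line-mates who always appear together, so their columns are identical.
X = np.array([
    [1, 1, -1, 0],
    [1, 1, 0, -1],
    [1, 1, -1, -1],
    [0, 0, 1, -1],
], dtype=float)

# Made-up Corsi +/- for each stint, from the home side's perspective.
y = np.array([3.0, 1.0, 2.0, -1.0])

weak = ridge_coefficients(X, y, alpha=0.1)     # light penalty
strong = ridge_coefficients(X, y, alpha=10.0)  # heavy penalty

print("alpha=0.1 :", np.round(weak, 3))
print("alpha=10  :", np.round(strong, 3))
```

Because A and B are indistinguishable in this toy data, ridge splits the credit evenly between them, and the larger alpha pulls every coefficient closer to zero.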
As for the details on the model I built, you can find the full code here on GitHub. The quick summary is that I used EvolvingHockey's NHL scraper in R to gather play-by-play (PBP) data for this past regular season, pulled out the shots, players on ice, and other relevant categorical variables (home team, zone start, lead/deficit), and put it all into one big data frame. Then I ran the regression using the RidgeCV function from Python's scikit-learn library. More advanced versions of RAPM use targets other than Corsi (shot attempt) differential, often xG, but I wanted to keep my first attempt simple. Here is how the model compares with the Evolving-Hockey RAPM calculations.
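For readers who have not used RidgeCV, the fitting step looks roughly like the sketch below. This is not the project code: the data is simulated, the +1/-1 on-ice encoding and the alpha grid are assumptions, and the non-player variables (home team, zone start, lead/deficit) are left out to keep it short.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
n_stints, n_players = 400, 20

# Hypothetical "true" player impacts, used only to simulate stint results.
true_impact = rng.normal(0.0, 1.0, n_players)

# Each simulated stint has 5 home skaters (+1) and 5 away skaters (-1).
X = np.zeros((n_stints, n_players))
for i in range(n_stints):
    on_ice = rng.choice(n_players, size=10, replace=False)
    X[i, on_ice[:5]] = 1.0
    X[i, on_ice[5:]] = -1.0

# Simulated Corsi +/- per stint: sum of on-ice impacts plus noise.
y = X @ true_impact + rng.normal(0.0, 3.0, n_stints)

# RidgeCV chooses alpha from the candidate grid; by default it uses an
# efficient form of leave-one-out cross-validation.
alphas = np.logspace(-2, 3, 20)
model = RidgeCV(alphas=alphas)
model.fit(X, y)

rapm = model.coef_           # one coefficient per player
chosen_alpha = model.alpha_  # regularization strength that was selected

print("chosen alpha:", chosen_alpha)
print("correlation with simulated impacts:", np.corrcoef(rapm, true_impact)[0, 1])
```

If the selected alpha lands on the edge of the grid, that is usually a hint to widen the range of candidates.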
Distribution
My results appear more thin-tailed than the EH RAPM. That broadly makes sense, as I incorporate fewer non-player categorical variables and thus offer less certainty on the effect of players, but it could also indicate that the alpha in my model (the regularization strength that controls the pull towards zero) is too high.
Correlation - My RAPM (x-axis) vs EH RAPM (y-axis)
R^2 = 0.36 | y = 0.84x + 0.016
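For reference, a fit line and R^2 like the ones quoted here can be computed along these lines. The two arrays are random placeholders rather than the real coefficients, and the variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder ratings standing in for two models' coefficients for the same players.
mine = rng.normal(0.0, 1.0, 200)
theirs = 0.8 * mine + rng.normal(0.0, 1.0, 200)

# Least-squares line: theirs = slope * mine + intercept.
slope, intercept = np.polyfit(mine, theirs, deg=1)

# For a simple linear fit, R^2 equals the squared Pearson correlation.
r_squared = np.corrcoef(mine, theirs)[0, 1] ** 2

# The same R^2 from the residual definition, as a sanity check.
residuals = theirs - (slope * mine + intercept)
r_squared_check = 1.0 - residuals.var() / theirs.var()

print(f"R^2 = {r_squared:.2f} | y = {slope:.2f}x + {intercept:.3f}")
```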
Not exactly the most encouraging R^2 value considering the player evaluations should be mostly the same, but since most of the values should be clustered around zero, I was happy to see a visible, if loose, positive linear trend. To look further, let's examine some individual players.
As the above graphs show, comparing the raw RAPM coefficients on a player-by-player basis isn't exactly a fair comparison. Considering how similar the distributions are, it's more intuitive to just compare the rankings of each player (out of 926) than to get too far into the weeds with z-scores.
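A rank comparison of this kind can be sketched as follows. The player labels and coefficient values are invented for illustration, and `rank_descending` is a hypothetical helper, not something from the project code.

```python
import numpy as np

def rank_descending(values):
    """Return ranks where 1 is the highest value; ties keep their original order."""
    order = np.argsort(-np.asarray(values), kind="stable")
    ranks = np.empty(len(order), dtype=int)
    ranks[order] = np.arange(1, len(order) + 1)
    return ranks

# Invented coefficients for five placeholder players.
players = ["Player A", "Player B", "Player C", "Player D", "Player E"]
my_rapm = np.array([0.30, 0.10, -0.20, 0.05, 0.25])
eh_rapm = np.array([0.10, 0.40, -0.10, -0.30, 0.20])

my_rank = rank_descending(my_rapm)
eh_rank = rank_descending(eh_rapm)
rank_gap = my_rank - eh_rank  # negative = ranked higher by my model

# List players from biggest disagreement to smallest.
for i in np.argsort(-np.abs(rank_gap), kind="stable"):
    print(f"{players[i]}: mine #{int(my_rank[i])}, EH #{int(eh_rank[i])}, gap {int(rank_gap[i]):+d}")
```

Ranks throw away how far apart two players are, which is acceptable here only because the two distributions have a similar shape.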
My Top Players
Congrats to Brett Kulak for being the best active NHL player by a wide margin.
But more seriously, it's interesting to see where the two models agree and disagree, which also shows up in how the top players according to Evolving Hockey fare in my calculation.
Evolving Hockey Top Players
Some of the same players are on both lists (Nichushkin, Tatar, Toffoli, Theodore, Eller), but there are also some major discrepancies, most notably impending free agent Craig Smith and Oliver Bjorkstrand. Let's take a look at the players the models diverged on most.
Biggest Differences
Is Drew Doughty actually good? Are Cirelli, Duchene, and Saad actually bad? Probably not, unless I meandered my way into a better model than the ones built by people who do this for a living, but it's interesting to see the effects of a few cut corners. With these results in mind, I look forward to taking a closer look at RAPM after the Stanley Cup Final concludes and the NHL heads into what's sure to be an interesting offseason.