Exploring the theoretical limits of shot quality
Corsi is the buzzword of the advanced hockey stats community, and for good reason. Corsi is a more descriptive statistic in smaller sample sizes, and thus it can be used to give us more accurate information on performance before other, more traditional stats normalize. If you’re familiar with the current state of hockey analytics, feel free to scroll down a little bit.
In other words we use corsi because goals (what we’re really after) are relatively infrequent events. A shot attempt is a positive outcome, and as long as what we are measuring is an innately positive outcome (something the team is trying to do and something the opposition is trying to prevent), we know there will be some value in it.
“A corsi attempt is an innately positive outcome.”
This is true, but it is only a representative positive outcome. A corsi attempt, in isolation, isn’t really valuable. This might be the point you think, “well corsi shows that you were in possession, and that is innately positive.” Well, it’s not possession, it shows possession. To be clear, I’m not saying corsi is not representative of possession (having access to actual possession data, I can say that it’s not too bad), what I’m saying is that corsi, in itself, is not possession.
Corsi acts as a stand-in for goals, and we use it instead of goals because it has more predictive power in smaller sample sizes. Since these ‘smaller sample sizes’ can be entire seasons, it has value above what goals for% can offer us, because so much can change season to season in auxiliary variables (quality of teammates, ravages of age, etc.).
What separates shot based from goal based statistics is shooting percentage (or, from the goaltender’s side, save percentage). The problem with goals is that we don’t really get enough of them in a season to give us a sample size we can be confident in, and since shooting and save percentage are built on goals, they have all the same problems.
There’s a solution to this problem. By using data that is available for every shot, we can circumvent this sample size issue while helping to fix the central problem with corsi (every shot attempt is valued exactly the same). I used all of the data available for every shot in the RTSS era, which goes back to the 2007 NHL season, to try to predict the chances of each particular shot going in. By predicting the chances of any one shot going in, we don’t need to rely on the small sample size that ‘true’ shooting percentage gives us. This is not an unfamiliar practice, and I have made some attempts at it myself in the past.
The goal of this analysis is to see how good a model we can create to accurately reflect the ‘true’ chances of a shot being scored, using only the data the NHL provides.
Here are the variables that went into my model.
A rebound is defined as any shot that follows another shot taken 3 seconds or less prior. Shots classified as rebounds go in 29.2% of the time.
Distance of shot
The record keeping of the distance from net is much derided by the analytical community, and for good reason, but there’s no question it has some predictive power.
Another source of poor data from the NHL, but hey! if it has predictive value, it gets put in the model.
Analysts often use the term ‘score effects’ to describe the phenomenon of leading teams taking fewer shots with more shot/goal success while trailing teams take more shots with less success, but it’s actually a little more complex than that: the score affects shooting% in different ways based on the time remaining in the game. To account for this, time remaining is used in tandem with the score state in the model.
Strength can seriously affect shooting%, especially when killing a 5 on 3 penalty (?)
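To make the rebound definition above concrete, here is a minimal Python sketch of how such a flag could be derived from a chronological shot log. The `Shot` fields and the `flag_rebounds` helper are hypothetical names for illustration, not the RTSS schema. The sketch takes the definition literally (any shot within 3 seconds of the previous shot counts, regardless of which team took it) and assumes the clock resets each period, so it only compares shots in the same game and period.

```python
from dataclasses import dataclass

REBOUND_WINDOW_SECONDS = 3  # from the article: 3 seconds or less after the prior shot


@dataclass
class Shot:
    # Hypothetical field names for illustration, not the RTSS schema.
    game_id: int
    period: int
    seconds: int    # seconds elapsed in the period (assumed to reset each period)
    is_goal: bool


def flag_rebounds(shots):
    """Return one boolean per shot, True when it follows the previous shot
    in the same game and period by REBOUND_WINDOW_SECONDS or less.

    Assumes `shots` is already in chronological order.
    """
    flags = []
    previous = None
    for shot in shots:
        is_rebound = (
            previous is not None
            and shot.game_id == previous.game_id
            and shot.period == previous.period
            and 0 <= shot.seconds - previous.seconds <= REBOUND_WINDOW_SECONDS
        )
        flags.append(is_rebound)
        previous = shot
    return flags


def rebound_shooting_pct(shots):
    """Share of rebound shots that were goals, or None if there are no rebounds."""
    rebounds = [s for s, flag in zip(shots, flag_rebounds(shots)) if flag]
    if not rebounds:
        return None
    return sum(s.is_goal for s in rebounds) / len(rebounds)
```

Run over a full shot log, `rebound_shooting_pct` is the kind of calculation that would produce a figure like the 29.2% quoted above, though the exact number depends on how rebounds are identified in the real data.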
I put all of the data into a logistic regression, which estimates the probability of a binary outcome occurring (whether or not the shot resulted in a goal), based on 487,663 shots in the RTSS era. With this we get an ‘expected shooting%’.
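For readers who have not worked with logistic regression, here is a minimal sketch of the idea in plain Python. It is not the model described in this article: the data is synthetic, only two inputs are used (a scaled distance and a rebound flag), and the names `fit_logistic`, `predict` and `simulate_shots` are made up for illustration. The fit uses simple batch gradient descent rather than whatever solver a stats package would use.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(weights, features):
    """Estimated goal probability for one shot; weights[0] is the intercept."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    return sigmoid(z)


def fit_logistic(rows, labels, learning_rate=2.0, epochs=500):
    """Fit a logistic regression by batch gradient descent on the log loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    n = len(rows)
    for _ in range(epochs):
        gradient = [0.0] * len(weights)
        for features, label in zip(rows, labels):
            error = predict(weights, features) - label
            gradient[0] += error
            for j, value in enumerate(features):
                gradient[j + 1] += error * value
        weights = [w - learning_rate * g / n for w, g in zip(weights, gradient)]
    return weights


def simulate_shots(n, seed=0):
    """Synthetic shots, NOT real RTSS data: closer shots and rebounds score more."""
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        distance = rng.uniform(0.0, 1.0)              # distance scaled to 0..1
        rebound = 1.0 if rng.random() < 0.1 else 0.0  # roughly 10% rebounds
        p_goal = sigmoid(-1.0 - 3.0 * distance + 1.5 * rebound)
        rows.append([distance, rebound])
        labels.append(1 if rng.random() < p_goal else 0)
    return rows, labels


rows, labels = simulate_shots(1000)
weights = fit_logistic(rows, labels)

# 'Expected shooting%' for a group of shots is the mean predicted probability.
expected_sh = sum(predict(weights, r) for r in rows) / len(rows)
actual_sh = sum(labels) / len(labels)
print(f"expected sh%: {expected_sh:.3f}, actual sh%: {actual_sh:.3f}")
```

Averaging the predicted probabilities over one player’s shots instead of the whole sample gives that player’s expected shooting%, which is the quantity compared against actual shooting% in the results.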
Here’s what our results look like. This shows every player with over 500 shots taken, and their ‘expected sh%’ based on the model, and their actual shooting%.
That line gives us an r squared of .973, with a standard error of .017, which means that on average, shooters are about 1.7 percentage points away from their expected shooting%.
This is probably about as good a model as can be constructed using what we have to work with in the RTSS data, so I think we can draw some conclusions about the NHL shot data from it.
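As a reference for how figures like these are computed, here is a small sketch that fits a straight line of actual shooting% on expected shooting% and reports r squared and the standard error of the regression. The function name `linear_fit_summary` and the sample numbers are invented for illustration; they are not the data behind the values quoted above.

```python
import math


def linear_fit_summary(expected, actual):
    """Ordinary least squares fit of actual on expected.

    Returns (slope, intercept, r_squared, standard_error), where the standard
    error is sqrt(SSE / (n - 2)), the typical size of a residual.
    """
    n = len(expected)
    mean_x = sum(expected) / n
    mean_y = sum(actual) / n
    sxx = sum((x - mean_x) ** 2 for x in expected)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(expected, actual))
    syy = sum((y - mean_y) ** 2 for y in actual)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(expected, actual))
    r_squared = 1.0 - sse / syy
    standard_error = math.sqrt(sse / (n - 2))
    return slope, intercept, r_squared, standard_error


# Made-up shooting percentages for five hypothetical players.
expected_sh = [0.05, 0.08, 0.10, 0.12, 0.15]
actual_sh = [0.06, 0.07, 0.11, 0.12, 0.14]
slope, intercept, r_squared, standard_error = linear_fit_summary(expected_sh, actual_sh)
print(f"r squared: {r_squared:.3f}, standard error: {standard_error:.4f}")
```

A standard error of .017 on this scale corresponds to residuals of roughly 1.7 percentage points of shooting%.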
1. A player’s shooting skill, above the variables described in the model, is worth about 1.6% per shot. The best shooters (Tanguay, Marchand, STEVEN STAMKOS) can add 3-5% in shooting percentage above what the model inputs predict.
To describe this better I charted the difference between real and model predicted shooting% against expected shooting%:
2. Among players who have played more than 5 seasons since 2007, the standard deviation of their season to season shooting% is 3.1%. For expected shooting%, the standard deviation is 0.9%. Expected shooting% is a repeatable statistic at the season level, and it has more predictive power than raw shooting% season to season. In 56.7% of cases, a player’s previous season’s expected shooting% was closer to his shooting% that season than his previous season’s actual shooting% was. Pretty impressive when you consider that a player’s previous performance is completely removed from the predictive model.
3. This model does not perfectly estimate each individual shot’s chance of going in, but it’s close except for the most extreme of players. This model can be improved upon by adding inputs on the shooter’s and goaltender’s previous performance, but that would be cheating, now wouldn’t it?
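The season to season comparison in point 2 can be sketched as follows. Each record pairs a player’s previous season with the season that followed, and we count how often the previous expected shooting% lands closer to the next actual shooting% than the previous actual shooting% does. The function names and the numbers are hypothetical, and averaging each player’s own standard deviation is just one possible reading of how the 3.1% and 0.9% figures were computed.

```python
import statistics


def share_expected_closer(pairs):
    """pairs: list of (prev_expected, prev_actual, next_actual) shooting%.

    Returns the share of pairs where the previous season's expected sh% was
    strictly closer to the next season's actual sh% than the previous
    season's actual sh% was. Ties count against expected sh%.
    """
    wins = sum(
        1
        for prev_expected, prev_actual, next_actual in pairs
        if abs(prev_expected - next_actual) < abs(prev_actual - next_actual)
    )
    return wins / len(pairs)


def mean_within_player_stdev(seasons_by_player):
    """Average of each player's own season to season standard deviation.

    seasons_by_player: dict of player name -> list of season values.
    Assumption: this is one way to read 'standard deviation of season to
    season shooting%', not necessarily the article's calculation.
    """
    return statistics.mean(
        statistics.stdev(values) for values in seasons_by_player.values()
    )


# Made-up season pairs for illustration only.
example_pairs = [
    (0.10, 0.14, 0.11),   # expected is closer
    (0.09, 0.08, 0.075),  # actual is closer
    (0.11, 0.06, 0.10),   # expected is closer
    (0.12, 0.12, 0.10),   # tie, counted against expected
]
print(share_expected_closer(example_pairs))
```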
So what does this all mean?
I plan on releasing this statistic shortly as part of some statistical renovations going on here at Hockey Prospectus, and I think it will have a lot of value when interpreting shot quality quandaries like team PDO and the varying quality of shots faced by goaltenders. With this number, we can adjust shot to goal ratios and chalk the rest up to luck/shooting skill. I basically just hope it scrapes some of the luck off the top of shooting and save% statistics.
To get back to corsi and what I started this article off with, there’s nothing innately special about ‘corsi’, or shot differential, that needs to be preserved for the sake of it. Speaking from a strictly analytical sense, I found 7 different variables that we know affect the likelihood of a shot going in some way. Why ignore these things that would obviously improve the value of shot differential (by weighting them appropriately) if we don’t have to?
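As a rough illustration of what weighting shot differential could look like, the sketch below compares a raw shot share, where every attempt counts as 1, with a weighted share, where every attempt counts as its predicted chance of going in. The function names and the probabilities are invented for the example, and this is only one simple way to apply the weights, not a finished statistic.

```python
def share_for(value_for, value_against):
    """Fraction of the total that belongs to 'for'; 0.5 if nothing happened."""
    total = value_for + value_against
    return value_for / total if total else 0.5


def raw_shot_share(probs_for, probs_against):
    """Corsi-style share: every attempt is worth exactly 1."""
    return share_for(len(probs_for), len(probs_against))


def weighted_shot_share(probs_for, probs_against):
    """Each attempt is worth its predicted probability of becoming a goal."""
    return share_for(sum(probs_for), sum(probs_against))


# Hypothetical predicted goal probabilities for one stretch of play:
# two attempts for (one of them a dangerous rebound), three low-percentage
# attempts against.
attempts_for = [0.30, 0.05]
attempts_against = [0.05, 0.05, 0.05]

print(raw_shot_share(attempts_for, attempts_against))       # counts attempts
print(weighted_shot_share(attempts_for, attempts_against))  # counts quality
```

In this made-up stretch the team is outshot 3 to 2, yet holds most of the weighted share because one of its attempts was far more likely to score.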