“Day 102. The WAR wars rage on. Tentative fire fills the night sky, peace talks stalled as trenches dug deeper, charts pored over ever more furiously. Morale at an all-time low.”
- Dramatic, fake report from a scout (me) about a war that is more an unpleasant conversation than a battle.
For those of you not following, hockey twitter has been mired in controversy over the value of WAR, or Wins Above Replacement, metrics of late. The counterargument to such methods, as best I understand it, is as follows:
Wins Above Replacement metrics use overly technical, black-box regression models to produce a single-number metric. This obscures the process, is unlikely to capture everything in a sport as fluid and poorly documented as hockey, and is hard to critique because it is so complex and arcane (though that does not seem to have stopped anybody…); the outputs are sometimes weird, and error bounds are not displayed.
I don’t have the time or the brainpower to address everything, but others have done a good job (see the Brian MacDonald interview, for example). Suffice it to say, I agree partially with the critiques listed above, but am much more interested in finding better ways to model hockey data than in discrediting an entire class of methods. To me, being an applied statistician is all about considering assumptions and limitations and working with institutional knowledge and data to make better models. Most of the process is about contextualizing: knowing what you are looking at and how to use it. My point here is that the people making these models are probably more obsessed with limitations and context than their detractors, but it’s hard to express that clearly when the value of what you do is being attacked and straw-manned.
What I am trying to do with this piece is to use the attention and discussion about WAR to make better models, or at least to give people tools to use them in new or better ways. Today that means trying to show how we can construct a test statistic to test whether one player has a significantly better model output than another using so-called black box regression techniques. I’m not suggesting that every WAR list should come with some huge matrix comparing every permutation of hypotheses, or that this is necessary for WAR to be useful (it’s not), but just addressing the point that it can be done. It is not some fundamental problem with regression modeling; in fact, in my business of causal inference, it’s one of the reasons why we turn to regression models. (Aside: this is much easier with regression than with a Corsi or counting-stat analysis. If someone has figured out valid error estimates for WOWY or something like that, I would love to see how the hell that works. My guess is that anti-WAR folk aren’t super into block bootstrapping methods.)
Consider a relatively simple WAR model (compared to say Manny Elk or Evolving Wild anyway).
Value will be the sum of contributions to shot creation, shot suppression, goal creation, and goal suppression. Everything will be at 5 on 5. Notice I do not have any penalty differential component, nothing to do with expected goals, special teams, etc. I will use OLS regression to estimate the impact for the 4 components, in two (sorta) different regressions. The data comes from shift data that I transformed from play-by-play I downloaded at corsica.hockey. It is a small subset of data from the 2007-2008 season (a random sampling of the equivalent of 10 games). Many of these choices would not be great if I were trying to produce a serious model for use, but they help me make things more concise, clear, and easier to code. What I describe can be extended to more complicated components and methods.
I have two linear models, one for goals and the other for shots. Notice that the regressor matrix X is the same in both, and in this case it will include dummy variables for whether each player is on offense or defense. Theoretically X could include different covariates, and ideally it would also include state variables like score effects, etc. The outputs are recorded per shift, where a shift is defined as a stretch during which the same set of players is on the ice, or begins when a faceoff occurs.
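To make the structure concrete, here is a toy sketch of what the design matrix and the two regressions look like. The data below is simulated for illustration only (the player columns, binomial on-ice indicators, and Poisson shot counts are all my invented stand-ins, not the real Corsica shift data):

```r
# Hypothetical sketch: each row is a shift, each column a dummy for a
# player being on the ice on offense or defense during that shift
set.seed(42)
n_shifts <- 200
X <- cbind(
  IGINLA_Off = rbinom(n_shifts, 1, 0.3),
  IGINLA_Def = rbinom(n_shifts, 1, 0.3),
  MALKIN_Off = rbinom(n_shifts, 1, 0.3),
  MALKIN_Def = rbinom(n_shifts, 1, 0.3)
)

# Per-shift outcomes: the shot model and the goal model share the same X
shots <- rpois(n_shifts, lambda = 1 + 0.5 * X[, "IGINLA_Off"])
goals <- rbinom(n_shifts, 1, 0.06)

shot_model <- lm(shots ~ X)  # shot creation / suppression components
goal_model <- lm(goals ~ X)  # goal creation / suppression components
```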
Now imagine if we wanted to test the hypothesis that Jarome Iginla and Evgeni Malkin are equally valuable according to my (incomplete) WAR stat, how would we do that?
Let WAR for player j be defined as:

WAR_j = SC_j + SS_j + GC_j + GS_j

where SC, SS, GC, and GS are the estimated shot creation, shot suppression, goal creation, and goal suppression contributions.
What we want to test, then, is the null hypothesis that the two players have equal WAR, H_0: WAR_Iginla = WAR_Malkin, against the alternative that they do not, H_1: WAR_Iginla ≠ WAR_Malkin.
Let’s start simpler. Imagine we just want to know about the shot creation (SC) abilities of the two. The nice thing about this hypothesis is that the parameters we are interested in are all in the same regression model.
What we want is a test statistic that looks like this:

t = (SC_Iginla − SC_Malkin) / SE(SC_Iginla − SC_Malkin)
We need to know the standard error of the difference in parameters to test this hypothesis. This is slightly different from testing whether a single parameter is different from zero, because the regression summary doesn’t automatically report this information. Here are two ways you can get this test statistic.
First, let us just use the definition of the standard error. Under the null, the test statistic follows a t distribution with df = n − p degrees of freedom, where n is the number of data points and p is the number of estimated parameters. By definition,

Var(β̂_1 − β̂_2) = Var(β̂_1) + Var(β̂_2) − 2·Cov(β̂_1, β̂_2)

so SE(β̂_1 − β̂_2) is the square root of that quantity.
We can extract all of this information directly from the covariance matrix provided with the regression. As long as the regression method and package you use output a valid covariance matrix, you can always do this. Pretty much every type of regression can provide one, although you must of course be careful about assumptions; for many methods, mature covariance estimators requiring looser assumptions are available.
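As a quick sanity check on where this covariance matrix lives in R: `vcov()` returns the scaled covariance matrix directly, and it equals the residual variance times the unscaled matrix that `summary()` exposes. The regression below is simulated purely to demonstrate the equivalence:

```r
# Simulated data (illustration only): vcov(model) should equal
# summary(model)$sigma^2 * summary(model)$cov.unscaled
set.seed(1)
x1 <- rnorm(50)
x2 <- rnorm(50)
y  <- 1 + 2 * x1 - x2 + rnorm(50)
m  <- lm(y ~ x1 + x2)

all.equal(vcov(m), summary(m)$sigma^2 * summary(m)$cov.unscaled)  # TRUE
```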
# Load the design matrix and per-shift Fenwick (unblocked shot) outcomes
Xmat <- as.matrix(readRDS("Xmat.Rda"))
Fen <- readRDS("fenwick.Rda")
# Fit the shot model by OLS
model1 <- lm(Fen ~ Xmat)
betas <- model1$coefficients
# Pull each player's estimated coefficients (offense and defense dummies)
Iggy <- betas[grep("IGINLA", names(betas))]
Malk <- betas[grep("MALKIN", names(betas))]
# Residual standard error and unscaled covariance matrix (X'X)^-1
sig <- summary(model1)$sigma
covmat <- summary(model1)$cov.unscaled
# Keep only the rows/columns involving the two players of interest
search <- paste(c("MALKIN", "IGINLA"), collapse = "|")
cov <- covmat[grepl(search, rownames(covmat)), grepl(search, colnames(covmat))]
Here we can see the estimates and the relevant portion of the covariance matrix. Now let’s make our test statistic for just the offense.
# SE of the difference: sigma * sqrt(Var_1 + Var_2 - 2*Cov_12), where the
# [1,1], [2,2], [1,2] entries correspond to the two offense (shot creation)
# dummies -- check the column ordering of your own covariance subset
se <- sig * sqrt(cov[1, 1] + cov[2, 2] - 2 * cov[1, 2])
t_stat <- (Iggy - Malk) / se
Here the estimated t statistic is 1.54. At the typical 5% level this isn’t significant for a two-tailed test. In general, we might be less interested in a formal hypothesis test than in getting some probability that one player is better than another, since we aren’t considering rolling out a drug or something where being very conservative is important. We just want to make the best decision possible regarding the relative values of players. In this case, 1.54 is around the 93rd percentile, so it seems fairly likely that Jarome Iginla had a larger effect on shot creation than Malkin in 2007-2008. (Again, not necessarily in real life, because this is a small subset of data with only a subset of the real coefficients.)
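Converting the t statistic into that kind of probability is a one-liner. The sketch below approximates the t distribution by a normal, which is reasonable when the residual degrees of freedom are large; with the real model you could use `pt(t_stat, model1$df.residual)` instead:

```r
# One-sided probability that the first player's effect is larger,
# using the normal approximation to the t distribution (large df)
t_stat <- 1.54
p_better <- pnorm(t_stat)  # roughly 0.94
```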
You should be able to see that we can extend this pretty easily if we wanted to combine the shot creation and shot suppression estimates, because we have a covariance matrix that estimates the covariance between all of these parameters for both players of interest.
There is another way to get this same test statistic. Consider an arbitrary regression of Y on X_1 and X_2. Notice that we can always write

Y = β_1·X_1 + β_2·X_2 + ε = (β_1 − β_2)·X_1 + β_2·(X_1 + X_2) + ε
What this says is that if we simply add the covariates X_1 and X_2 together, make that a new variable, and drop X_2 as a regressor, then testing whether the parameter associated with X_1 is different from 0 is the same as testing whether the difference between the two original coefficients is different from zero. The added bonus is that we do not have to mess around with the covariance matrix of the model; the desired hypothesis test will be right there in our summary output!
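You can convince yourself of this identity on simulated data (the data below is made up purely to demonstrate the algebra, not drawn from the hockey model): the t statistic from the covariance-matrix route matches the t statistic on X_1 in the reparametrized regression.

```r
# Simulated check that the two routes give the same t statistic
set.seed(7)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 * x1 + 1.5 * x2 + rnorm(n)

# Route 1: difference of coefficients via the covariance matrix
m1 <- lm(y ~ x1 + x2)
V  <- vcov(m1)
t1 <- (coef(m1)["x1"] - coef(m1)["x2"]) /
      sqrt(V["x1", "x1"] + V["x2", "x2"] - 2 * V["x1", "x2"])

# Route 2: regress on x1 and (x1 + x2); the x1 coefficient is b1 - b2
s  <- x1 + x2
m2 <- lm(y ~ x1 + s)
t2 <- summary(m2)$coefficients["x1", "t value"]

all.equal(unname(t1), t2)  # TRUE
```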
# Add Malkin's and Iginla's dummy columns together (offense and defense)
newvar <- Xmat[, grepl("MALKIN", colnames(Xmat))] + Xmat[, grepl("IGINLA", colnames(Xmat))]
colnames(newvar) <- c("Iggy_plus_Malk", "Iggy_plus_Malk_Def")
# Drop Malkin's original columns and append the combined variables
Xmat2 <- cbind(Xmat[, !grepl("MALKIN", colnames(Xmat))], newvar)
# Refit: the IGINLA coefficients now estimate the Iginla-minus-Malkin
# differences directly, with standard errors in the summary
model2 <- lm(Fen ~ Xmat2)
betamat <- summary(model2)$coefficients
test <- betamat[grepl("IGINLA", rownames(betamat)), ]
Here we can see the estimated difference and we get the same t-stat, 1.54 for the difference in shot creation. We also see a significant difference in shot suppression, represented by the second row. A negative result in this model means better defense (it means that the player being on the ice is associated with fewer shots against).
Great, so we see that it is possible to do hypothesis testing between coefficients. If we want to test differences in shot creation plus shot suppression, we can extend either of these two techniques, because the model estimates the covariance between all the regressors in the model.
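The general recipe for any such combination is a contrast vector: for a linear combination c′β of the coefficients, the variance is c′Vc, where V is the model's covariance matrix. The example below is simulated with hypothetical column names (substitute your real dummy names, and mind the sign conventions on the defensive components):

```r
# Contrast-vector sketch (simulated data, hypothetical player columns)
set.seed(3)
n <- 400
X <- matrix(rnorm(n * 4), n, 4,
            dimnames = list(NULL, c("IG_Off", "IG_Def", "MA_Off", "MA_Def")))
y <- drop(X %*% c(1.2, -0.4, 0.8, -0.2)) + rnorm(n)
m <- lm(y ~ X)

# +1 on Iginla's two components, -1 on Malkin's; 0 for the intercept
cvec <- c(0, 1, 1, -1, -1)
est  <- sum(cvec * coef(m))                      # combined difference
se   <- sqrt(drop(t(cvec) %*% vcov(m) %*% cvec)) # SE via c' V c
t_stat <- est / se
```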
If you haven’t seen this stuff before and are following along, you might be scratching your head at what we might do if we want to test the difference between sums of components from different models. If we use two different models, a covariance matrix between coefficients from different models won’t magically appear, will it?
Yes, that is true, but we can still do it. It becomes trickier, of course: what we have to do is think very carefully about the errors in the different models. I am going to cut this off because I think this is already too long, but I will try to explain how we do it soon. Spoiler: we might even gain efficiency in the process.
Please go check out, or even better support, Corsica and/or Evolving Wild; links to their Patreons are below. Both of these websites are amazing resources for people trying to do hockey analytics, and the creators are doing really cutting-edge work to make WAR models better all the time.
Evolving Wild Patreon
And while I am plugging cool analytics projects, please check out these Patreons as well!
The Ice Garden Podcast