An Introduction to Hockey Analytics Part 3: The Perils of Sample Size

There are two features of the game of hockey that make effective use of statistical analysis difficult. The first, as discussed in the four parts of Part 2 (linked at the bottom of this post), is the problem of Context; as detailed in those posts, hockey stats have evolved so that we can take into account the effect of context on a player's numbers.

The second is Sample Size - the issue being that the numbers we have often come from such a small sample that we can't rely on them to make accurate judgments of a player's true talent and his worth going forward. Now this is true in every sport (and in other fields as well), but it is a particular problem in hockey, far more than the average fan realizes. Obviously a player with a career high of 33 goals/60 points who scores 4 goals in his first two games* on a new team isn't likely to score 100 goals, or even 50. But for several commonly used statistics, even a full season isn't quite enough to conclude anything about a player's true talent, something the average fan often doesn't realize.**

*I'm sure you all realize who I'm referencing.

**Once again, not suggesting there's something wrong with being an average fan who doesn't care about this stuff.

Small sample sizes can be "dangerous*" because they can trick fans into believing that certain trends are sustainable, which sets up misleading expectations for certain players and their teams going forward. When a sample size becomes large enough that we trust the numbers, we say that the statistics are "sustainable" - meaning that you can trust them to be a correct valuation of the player under the circumstances, and that the player can be counted on to put up similar numbers going forward overall (though small samples in the future may make it seem otherwise).

*Well, not really dangerous....but you get what I mean.

So let's go through some of the commonly used numbers that are often unreliable for showing true player talent/value because of the small sample sizes involved.

Special Teams Play:

Special teams play is very different from even-strength play. It's a different game completely, really. As such, we look for players who can specialize in roles on special teams - the power play quarterback (Defenseman) and the expert penalty killer are two examples of this. So we really should look at special teams statistics separately from even-strength statistics.

But this presents a problem: special teams sample sizes over one season - or even more than one season - are too small to be trusted. Brad Richards led the NHL in Power Play minutes last year and he only played 384 minutes on the power play - roughly equivalent to 1/3 of an NHL season (26-27 games) at even strength. And he's the NHL LEADER in Power Play minutes. On the Islanders, only three players - Tavares, Moulson, and PAP - played more than 200 Power Play minutes (280 apiece), the equivalent of 1/4 of a season of even-strength minutes (20 games). Would you judge a player's performance after only 20 games normally? Well you might, but you'd still be reluctant to draw any definite conclusions. And the total number of minutes given to players outside the first power play unit is far smaller, to the point of making their statistics nearly useless.
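
The minutes-to-games conversions above can be reproduced with a little arithmetic. This is a rough sketch, not an official formula: it assumes an 82-game season and, following the Brad Richards comparison, treats 384 minutes as one third of a season of even-strength ice time (so about 1152 minutes for a full season). The function name and both constants are illustrative assumptions.

```python
# Rough conversion of special-teams minutes into "equivalent" even-strength games.
# Assumptions (not from any official source): an 82-game season, and a full
# season of even-strength ice time of about 3 * 384 = 1152 minutes, taken from
# the comparison in the text that 384 minutes is roughly 1/3 of a season.
SEASON_GAMES = 82
SEASON_EV_MINUTES = 3 * 384


def equivalent_ev_games(minutes: float) -> float:
    """Return how many even-strength games' worth of ice time `minutes` represents."""
    return minutes / SEASON_EV_MINUTES * SEASON_GAMES


if __name__ == "__main__":
    for label, minutes in [("League leader", 384), ("Top Islanders", 280)]:
        games = equivalent_ev_games(minutes)
        print(f"{label}: {minutes} PP minutes ~ {games:.0f} even-strength games")
```

Different assumptions about a player's even-strength ice time per game will shift these figures a little, but not enough to change the point.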

And then there are goalie statistics* on special teams, which are completely unreliable: goalies usually face 1200-1600 shots per season at even strength. The leader in shots faced by a goalie on the PK faced only 400 shots, equivalent to about a quarter of a season at even strength, and most goalies face only 200-300 shots on the PK a year, equal to 1/5 or 1/6 of a season. Would you judge a goalie based upon only 14-16 games? Well that's what you're doing when you judge a goalie's performance on the PK.

*Goalie statistics should ALWAYS be split into even-strength statistics and special teams statistics, but this is often ignored. Every goalie features a big drop in SV% when his team is on the PK, and by using general SV%, you penalize goalies whose teams are on the PK an awful lot. I'm planning on mentioning this again in a special post on Goalies in general, but it bears mentioning here.
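
To see why a blended save percentage punishes goalies on heavily penalized teams, here is a minimal sketch. The two goalies and all of their numbers are hypothetical, chosen only for illustration: both stop .920 at even strength and .870 on the PK, and they differ only in how many shorthanded shots they face. Shots faced on the power play are ignored to keep the example simple.

```python
def overall_sv_pct(ev_shots, ev_sv, pk_shots, pk_sv):
    """Overall SV% as the shot-weighted average of the EV and PK save percentages."""
    saves = ev_shots * ev_sv + pk_shots * pk_sv
    return saves / (ev_shots + pk_shots)


# Hypothetical goalies with identical performance in each situation.
disciplined = overall_sv_pct(ev_shots=1400, ev_sv=0.920, pk_shots=200, pk_sv=0.870)
penalized = overall_sv_pct(ev_shots=1400, ev_sv=0.920, pk_shots=400, pk_sv=0.870)

print(f"Few penalties:  {disciplined:.3f}")
print(f"Many penalties: {penalized:.3f}")
```

With these made-up inputs the second goalie's overall SV% comes out about .005 lower, even though he played exactly as well in both situations.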

And don't get me started about shorthanded statistics for offensive players and power play statistics for goalies. Not a single goalie faces 100 shots when his team is up a man, while many PKers don't even get 200 minutes of ice time per year (equivalent to 10 games' worth of even strength play) - only Frans Nielsen of the Isles broke the 200 minute mark last year, with the 2nd most used PKer (Blake Comeau) getting only 174 minutes all year.

This is not to say that we can't figure out who is a good power play quarterback or PKer from scouting/watching the game, but just that statistics of one season will be of little help for us in doing so. Though I'd state that we can't really tell from watching whether a goalie is really a good special teams goalie (shorthanded), so don't try that.

Now the end result of these sample size problems is that any analysis of a player's value tends to be dominated by even-strength play, since the sample sizes for special teams play are so small as to make such numbers not fully credible. As such, analyses of hockey players, especially those using advanced stats, are almost entirely focused upon even strength play.

Goalie #s:

Even when just using even-strength numbers, goalie statistics are incredibly unreliable. Goalies face around 1200-1600 even strength shots per year, which sounds like a lot. And you'd think that this number of shots would give you a good estimate of how well a goalie will perform each year. But that's not the case, as Table 1 should show:

NAME      2006-2007 SV%  2007-2008 SV%  2008-2009 SV%  2009-2010 SV%  2010-2011 SV%
Player A  .911           .906           .918           .929           .916
Player B  .909           .901           .915           .907           .914
Player C  .905           .921           .933           .915           .938
Player D  .917           .912           .916           .921           .923

Table 1: The SV%s of four goaltenders over the last five years.

Notice how the SV%* of each of these goaltenders, except for maybe Player D, fluctuates wildly from year to year. Some of that is perhaps because of a change in the talent level of these goalies from year to year: Player D certainly seems to have improved with age. But most of the fluctuations are simply the result of randomness - the effect of looking at small sample sizes, only a year's worth of shots faced, at a time. Player C did not simply go from being elite to merely average and back to elite in 3 consecutive years: what happened is almost certainly the result of random chance and small sample sizes. Player B didn't go from being below average (by a lot in season 2) to average in season 3 to below average in season 4 to average in season 5. He stayed the same - probably a slightly below average goaltender - and simply had his results change due to randomness in small sample sizes.

*Note: The table uses total Save Percentage instead of the superior Even Strength Save %, which is a flaw, but it's easier to find numbers quickly for total SV% than EV SV%. However, the point holds if you actually look at EV SV%, so it's not an issue.

This might seem problematic to you: if we can't trust a full season of goaltending results, then what can we trust? Well, the answer given tends to be that we need 3000 even strength shots, or roughly 2-3 goalie seasons, before we can trust a goalie's statistics to tell us how good that goalie truly is. Goalies can put up incredibly flukish numbers for 400-500 shots, or even 1500 shots, only to turn out to be not very good at all (see Steve Mason).
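
One rough way to see how much of this is luck is to treat every shot as an independent coin flip with a fixed chance of being saved. That is a simplification (shot quality varies), and the .915 "true talent" below is a hypothetical value, not a figure taken from the table. Under that assumption, the standard error of a measured SV% over n shots is sqrt(p(1-p)/n):

```python
import math


def sv_pct_standard_error(true_sv: float, shots: int) -> float:
    """Binomial standard error of an observed SV% over `shots` shots."""
    return math.sqrt(true_sv * (1.0 - true_sv) / shots)


if __name__ == "__main__":
    true_sv = 0.915  # hypothetical true talent, for illustration only
    for shots in (500, 1500, 3000):
        se = sv_pct_standard_error(true_sv, shots)
        # Roughly 95% of observed results should land within about 2 SE of the true value.
        print(f"{shots:>5} shots: SE = {se:.4f}, ~95% range = "
              f"{true_sv - 2 * se:.3f} to {true_sv + 2 * se:.3f}")
```

Under these assumptions, a single season of about 1500 shots carries a standard error of roughly .007, so swings of .010 or more from one year to the next are unremarkable even if the goalie has not changed at all.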


+/-:

+/- is a statistic with a whole bunch of problems: it doesn't take context into account, for example, meaning it undervalues players on bad teams or those who face tough competition. Various fixes to +/- have been suggested to make up for this, such as adjusting +/- based upon the strength of the team (Relative +/-). But +/- suffers from another problem: it relies on small sample sizes, even over an entire season. Essentially the sample size of +/- is the total number of goals that occur in either direction while a player is on the ice at even strength. And that number is not very big.

Take the Islanders last year: the Islander with the largest +/- sample size was John Tavares, who had a measly 137 goals occur while he was on the ice at even strength all year. Not very many at all. Michael Grabner had only 81 goals occur while he was on the ice at even strength despite playing all but 6 games, while Kyle Okposo's 38 games amounted to only 48 even-strength goals while he was on the ice. These are tiny amounts and thus are subject to being greatly affected by factors outside a player's control. What happens if most of these goals occur while the team's inferior backup goalie is in net? Well, a player's +/- will look a lot worse than it should based upon his own play: that's what happened to Kyle Okposo in 2009-2010 thanks to Martin Biron.
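
To get a feel for how noisy these totals are, suppose (as a simplifying assumption, not a claim about any real player) that a player is exactly break-even, so each goal scored while he is on the ice is equally likely to be for or against, independently of the others. The +/- over n goals then has a standard deviation of sqrt(n). A minimal sketch using the on-ice goal totals quoted above:

```python
import math


def plus_minus_std_dev(on_ice_goals: int) -> float:
    """Std. dev. of +/- for a break-even player: each goal is +1 or -1 with equal chance."""
    return math.sqrt(on_ice_goals)


# On-ice even-strength goal totals quoted in the text.
samples = {"Tavares": 137, "Grabner": 81, "Okposo": 48}

for name, goals in samples.items():
    sd = plus_minus_std_dev(goals)
    print(f"{name}: {goals} goals -> +/- std. dev. of about {sd:.1f}")
```

In this toy model, a truly break-even player with 81 on-ice goals could finish +9 or -9 through luck alone without it being unusual.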

Over a long time, +/- may normalize, but I don't know the time period needed to trust the stat, and I suspect there really isn't such a time period, as external factors will change a lot (goalies) before such a sample size is ever reached.

As a result, advanced hockey statistics have favored metrics similar to +/-, but based upon shots instead of goals. If you use shots on goal instead of goals, you get a sample size roughly TEN TIMES as large, and hockey metrics such as Corsi and Fenwick also count shots that miss the goal (and blocked shots, for Corsi), which increases the sample size further. As a result, these metrics are far more stable from year to year and give you a better idea of player value over sample sizes such as a single season.
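
As a sketch of how these shot-based counts are built, the snippet below tallies Corsi (shots on goal, missed shots, and blocked shots) and Fenwick (the same without blocked shots) as for-minus-against differentials. The `ShotAttempts` container and the sample numbers are hypothetical, made up purely for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ShotAttempts:
    """Shot attempts by one side while a given player is on the ice (hypothetical container)."""
    on_goal: int
    missed: int
    blocked: int

    @property
    def corsi(self) -> int:
        # Corsi counts every attempt: on goal, missed, and blocked.
        return self.on_goal + self.missed + self.blocked

    @property
    def fenwick(self) -> int:
        # Fenwick leaves out blocked shots.
        return self.on_goal + self.missed


def differential(attempts_for: ShotAttempts, attempts_against: ShotAttempts) -> Tuple[int, int]:
    """Return (Corsi, Fenwick) differentials: attempts for minus attempts against."""
    return (attempts_for.corsi - attempts_against.corsi,
            attempts_for.fenwick - attempts_against.fenwick)


# Made-up on-ice season totals, for illustration only.
attempts_for = ShotAttempts(on_goal=600, missed=250, blocked=300)
attempts_against = ShotAttempts(on_goal=580, missed=240, blocked=260)
print(differential(attempts_for, attempts_against))
```

Even in this invented example the totals run past a thousand events per side, compared with the 48 to 137 on-ice goals quoted for +/-.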


There are a whole bunch of other sample size issues I've not mentioned in this post. But the key thing to remember is that advanced hockey statistics and the field of hockey analytics have been created with the issues of sample size in mind. Thus the statistics commonly used in analyses in this field are those meant to counteract most of the problems of sample size - as mentioned previously, Corsi is such a statistic.


The Intro to Hockey Analytics/Advanced-Hockey-Statistics Primer so far:

Part 1: What is the field of Hockey Analytics and Why Might You be Interested?
Part 2.1: The Importance of Context Part 1 - Time on Ice
Part 2.2: The Importance of Context Part 2 - Evaluating the Difficulty of Certain TOI through QUALCOMP and Zone-Starts
Part 2.3: The Importance of Context Part 3 - Evaluating (and Compensating for) the Effect of Teammates via QUALTEAM and Relative Measures
Part 2.4: The Importance of Context Part 4 - The Concept of the Replacement Level Player
