An Introduction to Hockey Analytics Part 3: The Perils of Sample Size
There are two features of the game of hockey that make effective use of statistical analysis difficult. The first, as discussed in the last four parts (Part 2, linked at the bottom of this post) is the problem of Context, and as detailed in those posts, Hockey Stats have evolved so that we can take into account the effect of context on a player's numbers.
The second is Sample Size - the issue being that the numbers we have often come from such a small sample size that we can't rely on them to make accurate judgments of a player's true talent and his worth going forward. Now this is true in every sport (and other things as well), but it is a particular problem in hockey, far more than the average fan realizes. Obviously a player with a career high of 33 goals/60 points who scores 4 goals in his first two games* on a new team isn't likely to break 100 goals scored, or even put up 50 goals. But for several commonly used statistics, even a full season isn't quite enough for a user of statistics to conclude anything about a player's true talent, something that often is unrealized by the average fan.**
*I'm sure you all realize who I'm referencing.
**Once again, not suggesting there's something wrong with being an average fan who doesn't care about this stuff.
The misleading small sample sizes can be "dangerous*" because they can trick fans into believing that certain trends are sustainable. This creates misleading expectations for fans of certain players and their teams going forward. When a sample size becomes large enough that we trust the numbers, we say that the statistics are "sustainable" - meaning that you can trust them to be a correct valuation of the player under the circumstances, and that the player can be counted on to put up similar numbers going forward overall (though small sample sizes in the future may make it seem otherwise).
*Well, not really dangerous....but you get what I mean.
So let's go through some of the commonly used numbers that are often unreliable for showing true player talent/value due to the small sample size involved with these numbers.
Special Teams Play:
Special teams play is very different from even-strength play. It's a different game completely, really. As such, we look for players who can specialize in roles on special teams - the power play quarterback (Defenseman) and the expert penalty killer are two examples of this. So we really should look at special teams statistics separately from even-strength statistics.
But this presents a problem: special teams sample sizes over one season - or even over more than one season - are too small to be trusted. Brad Richards led the NHL in power play minutes last year, and he played only 384 minutes on the power play - roughly equivalent to 1/3 of an NHL season (26-27 games) at even strength. And he's the NHL LEADER in power play minutes. On the Islanders, only three Isles - Tavares, Moulson, and PAP - played more than 200 power play minutes (280 apiece), the equivalent of 1/4 of a season (20 games) in even strength minutes. Would you judge a player's performance after only 20 games normally? Well you might, but you'd still be reluctant to draw any definite conclusions. And the total number of minutes given to players outside the first power play unit is far smaller, to the point of making their statistics nearly useless.
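The conversion above from power play minutes to "games of even-strength play" can be sketched in a few lines. This is only an illustration: the 14.5 even-strength minutes per game figure is an assumption, chosen so that 384 minutes lands at the 26-27 games quoted above, and `game_equivalents` is a made-up helper name.

```python
# Assumed even-strength minutes per game for a top-line skater.
# Hypothetical value, picked so that 384 minutes comes out at 26-27 games.
ES_MINUTES_PER_GAME = 14.5


def game_equivalents(minutes, es_minutes_per_game=ES_MINUTES_PER_GAME):
    """Return how many games of even-strength ice time `minutes` represents."""
    return minutes / es_minutes_per_game


# Power play minutes quoted in the text.
for label, minutes in [("Brad Richards", 384), ("Isles top PP unit", 280)]:
    games = game_equivalents(minutes)
    print(f"{label}: {minutes} PP minutes is about {games:.1f} games at even strength")
```

Changing the assumed minutes per game shifts the exact numbers, but not the conclusion: even the league leader's power play workload is a fraction of a season.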
And then there are goalie statistics* on special teams, which are completely unreliable: goalies usually face 1200-1600 shots per season at even strength. The leader in shots faced by a goalie on the PK faced only 400 shots, equivalent to a quarter of a season at even strength, and most goalies face only 200-300 shots on the PK a year, equal to 1/5 or 1/6 of a season. Would you judge a goalie based upon only 14-16 games? Well, that's what you're doing when you judge a goalie's performance on the PK.
*Goalie statistics should ALWAYS be split into even-strength statistics and special teams statistics, but this is often ignored. Every goalie features a big drop in SV% when his team is on the PK, and by using general SV%, you penalize goalies whose teams are on the PK an awful lot. I'm planning on mentioning this again in a special post on Goalies in general, but it bears mentioning here.
And don't get me started on shorthanded statistics for offensive players and power play statistics for goalies. Not a single goalie faces 100 shots when his team is up a man, while many PKers don't even get 200 minutes of ice time per year (equivalent to 10 games' worth of even strength play) - only Frans Nielsen of the Isles broke the 200 minute mark last year, with the 2nd most used PKer (Blake Comeau) getting only 174 minutes all year.
This is not to say that we can't figure out who is a good power play quarterback or PKer from scouting/watching the game, but just that statistics of one season will be of little help for us in doing so. Though I'd state that we can't really tell from watching whether a goalie is really a good special teams goalie (shorthanded), so don't try that.
Now the end result of these sample size problems is that any analysis of a player's value tends to be dominated by even-strength play, since the sample sizes for special teams play are too small for such numbers to be fully credible. As such, analyses of hockey players, especially those using advanced stats, are almost entirely focused upon even strength play.
Goalie #s:
Even when just using even-strength numbers, goalie statistics are incredibly unreliable. Goalies face around 1200-1600 even strength shots per year, which sounds like a lot. And you'd think that this number of shots would give you a good estimate of how well a goalie will perform each year. But that's not the case, as Table 1 should show:
| NAME | 2006-2007 SV% | 2007-2008 SV% | 2008-2009 SV% | 2009-2010 SV% | 2010-2011 SV% |
|---|---|---|---|---|---|
| Goalie A | .911 | .906 | .918 | .929 | .916 |
| Goalie B | .909 | .901 | .915 | .907 | .914 |
| Goalie C | .905 | .921 | .933 | .915 | .938 |
| Goalie D | .917 | .912 | .916 | .921 | .923 |
Table 1: The SV%s of four goaltenders over the last five years.
Notice how the SV%* of each of these goaltenders, except for maybe Goalie D, fluctuates wildly from year to year. Some of that is perhaps because of a change in the talent level of these goalies from year to year: Goalie D certainly seems to have improved with age. But most of the fluctuations are simply the result of randomness - the effect of looking at small sample sizes, only a year's worth of shots faced, at a time. Goalie C did not simply go from being elite to merely average and back to elite in 3 consecutive years: what happened is almost certainly the result of random chance in small sample sizes. Goalie B didn't go from being below average (by a lot in season 2) to average in season 3 to below average in season 4 to average in season 5. He stayed the same - probably a slightly below average goaltender - and simply had his results change due to randomness in small sample sizes.
*Note: The chart uses total save percentage instead of the superior even strength save percentage, which is a flaw, but that's because it's easier to find numbers quickly for total SV% than for EV SV%. However, the point holds if you actually look at EV SV%, so it's not an issue.
This might seem problematic to you: if we can't trust a full season of goaltending results, then what can we trust? Well the answer given tends to be 3000 even strength shots or roughly 2-3 goalie seasons, before we can trust the statistics of a goalie to tell us how good that goalie truly is. Goalies can put up incredibly flukish numbers for 400-500 shots, or even 1500 shots, only to turn out to be not very good at all (see Steve Mason).
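To get a rough feel for how much of that year-to-year movement randomness alone can produce, here is a minimal sketch that treats every shot as an independent save-or-goal event with the same save probability. That is a simplification (real shots vary in quality), and the .915 "true talent" value and the function name are assumptions made up for illustration.

```python
import math


def sv_pct_standard_error(true_sv_pct, shots):
    """Binomial standard error of an observed save percentage over `shots` shots."""
    return math.sqrt(true_sv_pct * (1.0 - true_sv_pct) / shots)


TRUE_SV = 0.915  # assumed true talent level, for illustration only

# Roughly: a season of PK shots, a season of even-strength shots, 2-3 seasons.
for shots in (300, 1400, 3000):
    se = sv_pct_standard_error(TRUE_SV, shots)
    low, high = TRUE_SV - 2 * se, TRUE_SV + 2 * se
    print(f"{shots:>5} shots: SE = {se:.4f}, about 95% of seasons between {low:.3f} and {high:.3f}")
```

Under these assumptions a 1,400-shot season carries a standard error of about .007, so a goalie whose talent never changes can plausibly post anything from about .900 to .930. At 3,000 shots the standard error drops to about .005, which is in line with the rule of thumb above.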
+/-
+/- is a statistic with a whole bunch of problems: it doesn't take into account context for example, meaning it undervalues players on bad teams or those who face tough competition. Various fixes to +/- have been suggested to make up for this, such as adjusting +/- based upon the strength of the team (Relative +/-). But +/- suffers from another problem: it relies on small sample sizes, even over an entire season. Essentially the sample size of +/- is the total number of goals that occur in either direction while a player is on the ice at even strength. This sample size, as I mentioned just a second ago, is not very big.
Take the Islanders last year: The Islander who had the largest +/- sample size was John Tavares, who had a measly 137 goals occur while he was on the ice at even strength all year last year. Not very many at all. Michael Grabner only had 81 goals occur while he was on the ice at even strength all last year despite playing all but 6 games, while Kyle Okposo's 38 games amounted to only 48 goals on the ice while he was playing at even strength. These are tiny amounts right there and thus are subject to being greatly affected by factors outside a player's control. What happens if most of these goals occur while the team's inferior backup goalie is on the ice? Well a player's +/- will look a lot worse than it should based upon his own play: that's what happened to Kyle Okposo in 2009-2010 thanks to Martin Biron.
Over a long time, +/- may normalize, but I don't know the time period needed to trust the stat, and I suspect there really isn't such a time period, as external factors will change a lot (goalies) before such a sample size is ever reached.
As a result, advanced hockey statistics has favored looking at metrics similar to +/-, but based upon shots instead of goals. If you use shots on goal instead of goals, you get a sample size TEN TIMES as large as you would otherwise, and hockey metrics such as Corsi and Fenwick count shots that miss the goal (and blocked shots, for Corsi) as well, so as to increase the sample size further. As a result, these metrics are far more stable from year to year and give you a better idea of player value over sample sizes such as a single season.
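A small sketch of why ten times as many events matters. It assumes, purely for illustration, a player whose team truly gets 50% of the on-ice events, with each event an independent coin flip. The 137 figure is Tavares's on-ice goal count from above; the shot count is just that number times the "ten times" rule of thumb, not a real total, and `share_standard_error` is a made-up helper.

```python
import math


def share_standard_error(events, true_share=0.5):
    """Standard error of the observed share of events 'for' over `events` events."""
    return math.sqrt(true_share * (1.0 - true_share) / events)


goal_events = 137                 # on-ice even-strength goals, from the text
shot_events = goal_events * 10    # hypothetical: the "ten times" rule of thumb

se_goals = share_standard_error(goal_events)
se_shots = share_standard_error(shot_events)

print(f"Goal-based share:  +/- {se_goals:.3f} (one standard error)")
print(f"Shot-based share:  +/- {se_shots:.3f} (one standard error)")
print(f"Noise shrinks by a factor of {se_goals / se_shots:.2f}")
```

The noise falls with the square root of the sample size, so ten times the events cuts the random error by a bit more than a factor of three.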
SUMMING UP:
There are a whole bunch of other sample size issues I've not mentioned in this post. But the key thing to remember is that advanced hockey statistics and the field of hockey analytics have been created with the issues of sample size in mind. Thus the statistics commonly used in analyses in this field are those that are meant to counteract most of the problems of sample size - as mentioned previously, Corsi is such a statistic.
-------------------------------------------------------------------------------------------------------------------------------------------------
The Intro to Hockey Analytics/Advanced-Hockey-Statistics Primer so far:
Part 1: - What is the field of Hockey Analytics and Why Might You be Interested?
Part 2.1: - The Importance of Context Part 1 - Time on Ice
Part 2.2: - The Importance of Context Part 2 - Evaluating the Difficulty of Certain TOI through QUALCOMP and Zone-Starts
Part 2.3: - The Importance of Context Part 3 - Evaluating (and Compensating for) the Effect of Teammates via QUALTEAM and Relative Measures
Part 2.4: - The Importance of Context Part 4: The Concept of the Replacement Level Player
Comments
Interesting, but...
I think there may be some flaws in your analysis. First, with respect to the minutes played on the power play, why compare them to the minutes in a season? The PP is a unique situation that should be assessed independently (or nearly) of other play. Suggesting that PP time would translate into only 26-27 games misses the point. Consider this: imagine you have a 4th grader who misbehaves at lunch on a consistent basis. Let’s suppose that lunch is only 1 of 8 periods during the school day (12.5% of his time during the school year is in lunch). Can we not draw meaningful conclusions from the behavior demonstrated by the child simply because it represents only 12.5% of their time in school? Of course we could. The sample size is fine. In fact, it’s not really a “sample” at all. Time on the power play over an entire season is probably better thought of as a “population.” We’re not sampling (i.e., taking only a part of the whole and drawing conclusions from it). As you say, the PP and even strength are two different creatures, so why compare them in terms of length/time?
Second, the goalie stats you cited above may not actually fluctuate all that much. The problem is we don’t know how many minutes were played each season (plus other factors like injuries, etc). The numbers above don’t really mean all that much unless you know the total minutes played. That said, Goalie A has a mean of 916 and an SD of 8.6. Goalie B has a mean of 909 and an SD of 5.6. Goalie C has a mean of 922 and an SD of 13, and Goalie D has a mean of 918 and an SD of 4.35. To show you how misleading and incomplete the stats are, all goalies have 2 seasons which fall more than one SD from the mean (including Goalie D – though just barely).
Anyway, thanks for some good reading. I think the main issue is how you conceptualize the amount of time on the PP and try to link it to total time in a season. For players who consistently are on the PP unit (or PK), that doesn’t seem appropriate. I think….
The comparing to the regular season is to make a point, not to do anything else.
As I say above, Special Teams should not be lumped into the regular season. However, the point of this article is simply this: the sample size for special teams data in a single season is not enough to be reliable.
(Note: you’re confusing terms here: a “population” is something from which you take a sample. A “population” would be all the forwards in the NHL, for example. If you wanted to test whether something is true of the general population of forwards, you’d take a sample (a group of forwards) to test. That’s not what we’re talking about here.
Sample size here means simply the amount of data we have, or what “N” is equal to. For +/-, that sample size is measured in goal events (goals at even strength while the player is on the ice); for SV%, it’s shots faced; for general play, it’s time on ice.
In your lunch example, the sample size would be the number of lunch periods (aka the number of days observed) for each child in your study. If that sample is too small, then no, you can’t tell anything from it.)
The point, which I tried to illustrate through examples, is this: the statistics in question come from too small a sample size for us to make accurate and reliable statements about players.
With regards to the goalies, all played similar minutes, but once again, those are just examples. If you look at goalies as a whole, their SV%s, or better yet their Even Strength SV%s, show great random variation from year to year because the sample is too small. I just picked out those goalies since they were notable goalies who showed my point.
Also, it’s not misleading to have 2 seasons which fall well away from the mean; that’s the point.
Writer at Beyond the Box Score and The Hardball Times
Pitchf/x enthusiast.
Once again, this primer is simply summarizing the work of others.
If you have any specific questions, I recommend a fanpost over at Arctic Ice Hockey, which was formerly the stat blog here BehindtheNethockey.com
(I can answer some of this, and I know the general answers, but I don’t have links to the specific studies).
Writer at Beyond the Box Score and The Hardball Times
Pitchf/x enthusiast.
Absolutely, and great work.
I really enjoyed the write-up, though this is coming from someone who spends way too much time on Capgeek and calculatedrisk. Anyway, just two quick points:
1- With regard to sample size, I understand why you thought I was confusing the terms sample and population. I should have been clearer, but that’s what I get for replying at 5 AM! Yes, the sample, in this case, is simply the amount of data. But when doing a case study (a single player), we aren’t sampling his performance, we are using his entire body of work (he IS the population, or at least his performance on the PP unit for a season would be). As you point out, if that number is too low, it can be misleading. I wasn’t clear about the difference between sampling and sample. My apologies.
2- I think I understand now your point about why a single season of data would not be reliable because in order for something to be reliable, it must be consistent. As there is only one season of data, there are no other data points with which to compare. But, I don’t think that makes the data meaningless. One can dig deeper for other types of reliability to see if the player was consistent in their performance within that season. If a player scores at a 50% rate consistently throughout that one season, you have pretty reliable performance. If the player scores at 90% for first 20 PP chances and 10% for last 20 PP opportunities, something ain’t right.
Also, the problem with establishing reliability (in this case) is that it takes place over such a long period of time and, as such, there are so many confounding variables (other players on the PP unit, quality of opposition, score at the time of the PP, injuries, even the possible growth and experience of playing on a PP unit can impact reliability – though preferably in a way that shows growth).
“Over a long time, +/- may normalize, but I don’t know the time period needed to trust the stat”
According to this work, it’s at least well over 120 games. I agree though, in a sample that big, assignments, goalies, zone starts are all too big a factor for it to ever really be reliable.
Blueshirt Banter - Where Rangers' Fans Matter
Tracking the Rangers - Numbers don't lie. They just don't agree with you.
Twitter: RangerSmurf
Reliable
I suppose how “reliable” the stat is depends in part on what it is meant to measure. Was +/- originally meant to measure (help to measure) the value of two players on different teams against one another? Or was it meant to help to measure the value of two players on the same team in the same season, relative to each other? In the latter case, it is more reliable than the former (although far from perfect).
Here is an example of +/- that I have been turning over in my head the past few days. I’ll call him Player E. Here are his games played, even-strength points… +/-… +/- of his team at even strength, and pts his team ended up with the past 4 seasons:
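| Season | GP | ES points | +/- | Team ES +/- | Team points (conference rank) |
|---|---|---|---|---|---|
| 07-08 | 72 | 24 | -11 | -21 | 79 (14th) |
| 08-09 | 82 | 43 | -3 | -14 | 92 (6th) |
| 09-10 | 79 | 40 | -4 | -8 | 90 (9th) |
| 10-11 | 82 | 47 | +32 | +7 | 87 (11th) |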
I know there are many other stats that would be helpful, but what are the chances Player E comes close to repeating a +32 in 2012? A couple things I like (a) his ES points have increased two of the past three seasons (and a very small drop the other season) and (b) his team’s goals for/against ES have been better each of the past 3 seasons.
Would I be right to say that around +15 to +20 would be a fair guess for his +/- next season? It seems to me he could be up around +30 or back down around 0. Which of the two is more likely? Are my questions unanswerable because +/- is not as reliable/predictable as other stats, such as the # of even strength points he puts up next season? Is the sample size of this past season too small for it to even affect my guess for next season? (In other words, is he more likely to have a high +/- in ‘11-’12 having recorded a +32 last season rather than a +10?)
Just so no one is in the dark, Player E is David Backes of the St. Louis Blues. Backes is a player (and St. Louis a team) I hardly know, so I am trying my best with the statistics available.
by North Dakota Red Eagle on Jul 27, 2011 10:56 AM EDT
The first thing I would look at in a case like that is the on ice percentages, shooting and saves. As it turns out, that very well could be the root cause, as the team shot 11.29% while he’s on the ice. That would boost his points (more goals = more opportunity for points), and would also be a reason his +/- is so out of whack.
Even further back, if you look at Backes in the three previous years, he’s been the victim of some truly terrible goaltending, with an on ice ES SV% of under .900 in two of the three years. That would have serious weight on his +/- in the opposite direction.
Coming back though, it’s hard to really project +/- year to year, but you’d expect that the +32 is not sustainable, no. If he got league average shooting (about 8.1%), it drops his on-ice g/60 rate from 3.74 all the way down to 2.73, which is about 20 fewer goals. That takes his +32 down to a +12. That’s still excellent for a season where his Corsi Rel QoC was top ranked on STL.
It would take his even-strength points down from 47 to 35 though (assuming he maintained the same % of points per goal), which moves his season total from 62 down to a more typical career total of 50 for him.
Blueshirt Banter - Where Rangers' Fans Matter
Tracking the Rangers - Numbers don't lie. They just don't agree with you.
Twitter: RangerSmurf
by George E. Ays on Jul 27, 2011 3:03 PM EDT
