
EWPA: A New Way to Evaluate NFL Play-callers and Teams

Updated: Sep 10, 2021

By Aaron Margulis


INTRO

In 1977, Bill James published the inaugural Bill James Baseball Abstract, unknowingly creating the literary series that would later be recognized as the genesis of the great sabermetric revolution in baseball. In the early 1990s, NBA teams began taking more three-pointers and fewer mid-range shots, and shortly thereafter writers including Dean Oliver and Ken Pomeroy sought to become the Bill Jameses of basketball. Yet while most MLB front offices house sizable research and development teams to make use of vast quantities of data, and NBA teams use resources like player tracking technology to further advance the game, the NFL remains confined by the Jon Grudens of the world and has been largely excluded from the great analytics revolution in sports in recent decades.


Fourth downs seem to be the only area of football to have received adequate analytical attention, but even so the public body of work studying fourth downs falls a bit flat in my opinion. Over the past five or so years, it’s become largely agreed upon in the football analytics community that coaches punt too much and should go for it more. The best and most convincing of these research efforts have based their work on expected points (the expected value of the next scoring play in the game from the perspective of the possession team), but expected points ignores important game-context variables like score differential, and I have found little research applying these findings to historical data. This project seeks to quantify this phenomenon in terms of win probability and to examine how different teams and coaches have performed in recent seasons. My goals for this article are to [1] determine the best in-game win probability model for NFL games, [2] introduce my Expected Win Probability Added (EWPA) framework and metric, and [3] share the first findings of the EWPA evaluative system.


I will be using a dataset containing every regular season NFL play from the 2009 through 2017 seasons, made possible by Maksim Horowitz’s nflscrapR package. I have decided to exclude overtime plays from the data I will be considering due to the many changes to NFL overtime rules during this nine-season timeframe.
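
As a rough sketch, the dataset can be assembled along these lines, assuming one play-by-play CSV per season exported from nflscrapR; the file names and column names here are illustrative stand-ins rather than guaranteed nflscrapR field names:

```r
# Stack one season of play-by-play data per CSV, tagging each row with its year
seasons <- 2009:2017
plays <- do.call(rbind, lapply(seasons, function(yr) {
  df <- read.csv(sprintf("pbp_%d.csv", yr))
  df$season <- yr
  df
}))

# Exclude overtime, per the rule-change concern above
plays <- subset(plays, qtr <= 4)
```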


MODELING WIN PROBABILITY

In-game win probability (WP) is the measure of a team’s chances of winning a game at any given point in the game. In our case, we will determine the possession team's WP using variables such as time, score, field position, and more. The EWPA evaluation system I will introduce later relies on modeling WP and therefore can only be as insightful as our WP model is accurate. Unfortunately, WP models are inherently imperfect: a system that gives every team a 50% chance of winning on every play would be perfectly accurate in the long run (assuming losing teams run just as many plays as winning teams, the possession team will indeed end up winning 50% of the time), but such a model wouldn't provide any useful information. Similarly, a model which only outputs 0% or 100% is no good: if we could say with certainty who is going to win a game at any point during that game, why would teams play in the first place? This introduces the challenge of finding the middle ground which tells us as much about the game as we can discern. Given how hard a task this is, I chose a couple of different measurements of probabilistic prediction accuracy to grade a number of WP models I generated, in order to give EWPA the best foundation possible.


My first task was to determine which models to implement and compare. I chose to consider a multinomial logistic regression, a random forest method, and a k-nearest neighbors model. Yurko et al. used a generalized additive model in their recent research, which produced promising results and high accuracy, but I unfortunately ran into a lot of errors trying to implement this type of model in my research. After experimenting with the data a bit, I settled on the following input variables for each of the WP models (a code sketch of fitting these models follows the description of the train/test split below):

  • Time (seconds remaining)

  • Field Position (yards to end-zone)

  • Is First Down (binary indicator of whether or not it is a first down)

  • Is Second Down

  • Is Third Down

  • Is Fourth Down

  • Yards-To-Go (yards to first down)

  • Score Differential (possession team score minus opponent score)

  • Total Score (possession team score plus opponent score)

  • Possession Team Timeouts (remaining timeouts for the possession team)

  • Defensive Team Timeouts

  • Is Home Team (binary indicator of whether or not possession team has home field advantage)


Notice there is no measure of skill, and the only predetermined or non-game-state factor is home field advantage. I felt the inclusion of skill-related variables would hinder my ability to fairly evaluate teams' fourth down execution later in this article. This does mean, though, that the model I determine to be best equipped for this project may not be the best available model for determining true in-game win probabilities: it is inaccurate to say that two different teams playing against two different opponents are equally likely to win or lose when put in the same situation in those games.


Finally, I split the data into training and testing sets: the training set is used to build the models, and the test set is used to evaluate them. I made the arbitrary cutoff between training and testing data fall between the 2014 and 2015 seasons, so 2009-2014 is the training data and 2015-2017 is the test set.
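
For illustration, here is how the three candidate models could be fit on that split in R, using the variables listed above. This is only a sketch: it assumes the plays data frame built earlier, illustrative column names, and a precomputed binary win column (1 if the possession team went on to win):

```r
library(nnet)          # multinom(): multinomial logistic regression
library(randomForest)  # randomForest()

train <- subset(plays, season <= 2014)
test  <- subset(plays, season >= 2015)
train$win <- factor(train$win)   # classification response

form <- win ~ seconds_remaining + yards_to_endzone +
  is_first_down + is_second_down + is_third_down + is_fourth_down +
  yards_to_go + score_diff + total_score +
  pos_timeouts + def_timeouts + is_home

wp_logit <- multinom(form, data = train)
wp_rf    <- randomForest(form, data = train, ntree = 500)
# (class::knn() handles the k-nearest neighbors model; it takes scaled
#  numeric matrices rather than a formula)

# Predicted WP for every test-set play, here from the logistic model
test$wp_pred <- predict(wp_logit, newdata = test, type = "probs")
```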



FINDINGS

I chose two methods to grade the three WP models I created. First, I calculated the Brier score for each model: the average squared difference between each predicted WP and the actual outcome (1 if the team went on to win, 0 if not). A lower Brier score is better, and a model that always predicts a WP of 50% would produce a Brier score of 0.25. I also split the game into five-minute intervals so we can see how each model’s performance improves as the game progresses.
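
For concreteness, a sketch of this calculation under the same assumed column names (wp_pred for a model's predicted WP, win for the 0/1 outcome):

```r
# Overall Brier score: mean squared difference between prediction and outcome
brier <- mean((test$wp_pred - test$win)^2)

# Brier score within each five-minute interval of time remaining
test$interval <- cut(test$seconds_remaining,
                     breaks = seq(0, 3600, by = 300), include.lowest = TRUE)
tapply((test$wp_pred - test$win)^2, test$interval, mean)
```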

In the above graph we can see that the multinomial logistic regression provides the best Brier score for every interval except the last five minutes, where it is still the second-best model, behind only the random forest.


For the second comparative method, I grouped every play into 10%-wide bins based on its predicted WP (for example, a play with a predicted WP of 64% falls into the 60-70% bin). I calculated the average residual within each of the ten bins and found what I call the “weighted bin residual” by multiplying each of these averages by the frequency of plays in that bin. Here I have graphed the weighted bin residual for each model as a game progresses.
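
A sketch of that calculation follows; note one assumption of mine: I take the per-bin average residuals in absolute value before weighting, since a signed sum would let offsetting bins cancel out.

```r
# Ten 10%-wide bins on predicted WP
bins <- cut(test$wp_pred, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)

# Average residual per bin, weighted by each bin's share of plays
avg_resid <- tapply(test$win - test$wp_pred, bins, mean)
bin_freq  <- as.numeric(table(bins)) / nrow(test)
weighted_bin_residual <- sum(abs(avg_resid) * bin_freq, na.rm = TRUE)
```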


The multinomial logistic regression also outperforms the k-nearest neighbors and random forest models under this second evaluation method, and it is therefore the best of these three models for predicting WP. Unfortunately, the logistic regression shows somewhat poor late-game performance. This fall-off results from the model’s hesitance to make extreme predictions, as shown by how its predictions stack up against actual outcomes in the last five minutes of games in the test set:

Here the blue line represents the number of plays the model predicted to fall in each of the ten bins, and the y-axis value of the red points shows how frequently those teams actually went on to win the game. As you can see, there were hundreds of plays in the last five minutes of games where a team had essentially a 0% chance of winning but the model assigned them a 10-20% chance. Similarly, there were hundreds of plays where a team was nearly guaranteed to win but the model gave them no more than a 90% likelihood of winning.



EXPECTED WIN PROBABILITY ADDED (EWPA)

With our win probability model established, I finally get to introduce the main contribution of this article, my new metric, Expected Win Probability Added (EWPA). EWPA is exactly what it sounds like: the expected change in win probability of a play. I know what you might be thinking – shouldn’t EWPA always be zero, and wouldn’t any expected change in WP just indicate a flaw or lack of calibration within our WP model? Yes, but that’s only the case for the total EWPA of a play, and we will instead be focusing on the individual EWPA of every possible play-call. This framework is theoretically sound for all plays in football (or any sport for that matter), but I've confined this project to fourth downs because fourth down play-calling is significantly more clearly defined than play-calling on other downs.


Fourth downs have very clear goals. While there is no consensus on what constitutes success on first down, the objectives on fourth down are unambiguous: if you go for it, your goal is to get a first down; if you kick a field goal, your goal is to make it; and if you punt, your goal is to pin the opponent as deep in their own territory as possible. The following graphic illustrates how I calculate the expected win probability (EWP) resulting from each potential play-call (EWPA is then calculated as EWP minus current WP):

I must acknowledge a few assumptions used in my implementation of the above framework (code sketches of the helper models and of the calculation itself follow the list):

  • All field goal attempts run 15 seconds off the clock

  • The probability of making a field goal is determined by a smoothed logistic regression graphed below. This model does not account for the abilities of the specific player attempting the kick

  • After a made field goal, opposing teams will take possession of the ball at their own 25 yard line

  • After a missed field goal, opposing teams take possession at the spot of the kick (I ignore the possibility of returning a missed or blocked field goal)

  • Punts take 30 seconds off the clock

  • Punt net distance is determined as a function of field position as shown in the graph below. This model does not account for the abilities of the specific player punting the ball or the coverage team

  • Converting takes 30 seconds off the clock

  • On all successful conversions, the offensive team only gains as many yards as were required to achieve a first down

  • Failing to convert takes 10 seconds off the clock

  • On a failed conversion, field position does not change when the opponent gains possession of the ball

  • The probability of converting is based on both the distance to the first down and to the end zone
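
For illustration, the three helper models named in these assumptions could be fit along these lines; the data frames (fg_attempts, punts, go_attempts) and columns are my own stand-ins for subsets of the play-by-play data, and the plain logistic fit on distance stands in for the smoothed logistic regression described above:

```r
# Field goal make probability as a logistic regression on kick distance
fg_fit <- glm(made ~ kick_distance, data = fg_attempts, family = binomial)

# Punt net distance as a smooth function of field position
punt_fit <- loess(net_yards ~ yards_to_endzone, data = punts)

# Conversion probability from both yards-to-go and distance to the end zone
convert_fit <- glm(converted ~ yards_to_go + yards_to_endzone,
                   data = go_attempts, family = binomial)
```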


These assumptions mean the overall EWPA of fourth downs in the dataset often does not equal exactly zero, and there is room for improvement in that regard, but the advantage to be gained from more accurate and less arbitrary assumptions is more or less negligible. All plays in the dataset carry the same potential biases in assumptions, and I don't believe the assumptions inherently make the model more or less favorable to kicking, punting, or going for it compared to a perfectly specified model.
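
Putting the assumptions together, here is a hedged sketch of the EWP of one call, going for it. The win_prob() and convert_prob() arguments wrap the fitted WP and conversion models above, and flip_possession() is a hypothetical helper that swaps the two teams' roles (score differential, timeouts, home flag) and resets to 1st-and-10; the EWPA of the call is then this value minus the current WP:

```r
ewp_go <- function(state, win_prob, convert_prob) {
  p <- convert_prob(state)

  # Success: a new first down at the line to gain; conversions take 30 seconds
  made <- state
  made$yards_to_endzone  <- state$yards_to_endzone - state$yards_to_go
  made$yards_to_go       <- min(10, made$yards_to_endzone)
  made$is_first_down     <- 1
  made$is_fourth_down    <- 0
  made$seconds_remaining <- state$seconds_remaining - 30

  # Failure: opponent's ball at the same spot; failed attempts take 10 seconds
  failed <- flip_possession(state)
  failed$yards_to_endzone  <- 100 - state$yards_to_endzone
  failed$seconds_remaining <- state$seconds_remaining - 10

  # Weight the two resulting win probabilities by the conversion probability
  p * win_prob(made) + (1 - p) * (1 - win_prob(failed))
}
```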


Below is a graph displaying the EWPA model’s accuracy. This is similar to the binned WP graph in the section above; however, here all twenty bins contain an equal number of plays and are therefore not equally spaced on the x-axis:

We have a very well calibrated model (correlation > 0.993), but its center may be a bit different from what we expected. I’ve explained why our predicted EWPA won’t always be zero, but we would nonetheless expect it to center around zero. I would imagine the negative center is a result of my assumptions being harsher towards the possession team than reality. It also appears that our WP model generally slightly underestimates win probabilities on fourth downs. This is unfortunate, but the model treats every fourth down equally in this respect (as can be seen from the fact that the residuals of most of the red points from the black line are similar) and therefore shouldn’t affect how EWPA values rank relative to one another.



FINDINGS

For the rest of the article I will continue to use a multinomial logistic regression model to determine WP; however, I will recreate the model using all 2009-2017 regular season, regulation plays as input data (rather than just the 2009-2014 training set) so I can evaluate every coach and team over the nine-year period rather than just over the three-season test set. While this will change the exact EWPA numbers within my findings, I'm not too worried about the specific values in this article; I'm more interested in how values rank relative to one another.


Let’s start by looking at the fourth down play-calling ability of the league’s coaches as a whole. We’ll look at every decision in which the actual play-call was expected to hurt a team’s win probability by at least 7.5%, 5% or 2.5% compared to the best available call. Before showing you any results, I want to remind you that EWPA does not take into account a team’s skill; it’s possible that we could wrongfully indict a coach for, say, kicking a long field goal instead of going for it when in fact he has the strongest-legged kicker in the league and a weak offense. With that in mind, we find that coaches overwhelmingly punt when that is far from the best option:

In case you were wondering, that one field goal in the less-than -7.5% EWPA versus best call category came in a 2013 Week 7 matchup between the Cowboys and Eagles in Philadelphia when, with 14 seconds left in the first half and trailing 3-0, Chip Kelly elected to attempt a 60-yard field goal on 4th and 1 from the opponent's 43-yard line.
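
For reference, a sketch of how those threshold categories can be constructed, assuming the per-call EWPA values have been precomputed into columns (all names illustrative):

```r
fourth <- subset(plays, is_fourth_down == 1)

# EWPA of the best available call versus the call actually made
best_call   <- pmax(fourth$ewpa_go, fourth$ewpa_fg, fourth$ewpa_punt)
actual_call <- with(fourth, ifelse(play_call == "go",         ewpa_go,
                            ifelse(play_call == "field_goal", ewpa_fg,
                                                              ewpa_punt)))
fourth$ewpa_actual  <- actual_call
fourth$loss_vs_best <- actual_call - best_call

# Play-call mix among the worst decisions (>= 7.5% of WP left on the table)
table(fourth$play_call[fourth$loss_vs_best <= -0.075])
```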


In the graph below, we can see what our model would’ve called on those plays, and it’s almost always to go for it:


Another question I had when considering league-wide fourth down play-calling is whether it has improved in recent years. With sports analytics rapidly growing in prevalence and the well-established odds on fourth downs readily available, I would guess so.

In the above graph, there does appear to be an upward trend, but I caution you to pay close attention to the y-axis values, as the increase is smaller than the graph makes it appear. It shows that in 2017 the average coach made the correct call about 2% more often than in 2009, and the average fourth down play-call hurt a team's chances of winning by about 0.08% less than the average call in 2009. I computed the two-tailed p-value for the difference between the proportions of correct fourth down play-calls in 2009 and 2017 and found p = 0.06876, making the statistical significance of the trend somewhat doubtful. Coaches may be improving their play-calling, but not by much.
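
That comparison can be run with a standard two-proportion test; the count variables here are placeholders for the number of model-approved ("correct") calls and total fourth-down decisions in each season, not real values:

```r
# Two-tailed test of the difference between 2009 and 2017 proportions of
# correct fourth-down play-calls
prop.test(x = c(correct_2009, correct_2017),
          n = c(total_2009, total_2017),
          alternative = "two.sided")
```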



COACHES

After considering the league-wide results, I figured that maybe certain coaches were immune to this unwillingness to go for it. To test that, we first have to settle on a sample of coaches. I decided to compare coaches with at least five full regular seasons between 2009 and 2017. Here’s how those 23 coaches rank:

It's interesting and reassuring to see that coaches who are generally considered good trend towards the top of this list while less favorably viewed coaches trend towards the bottom.
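
For reference, a ranking like this can be produced by averaging each coach's EWPA lost versus the best call, restricted to the eligible coaches; the coach column and the eligibility vector are assumed to be attached or computed separately:

```r
# Mean EWPA given up per fourth-down decision, by coach (closer to 0 = better)
eligible <- subset(fourth, coach %in% coaches_with_5_seasons)  # assumed vector
by_coach <- aggregate(loss_vs_best ~ coach, data = eligible, FUN = mean)
by_coach[order(-by_coach$loss_vs_best), ]
```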


I was also curious to see if good fourth down play-calling correlates with winning. Using the same 23 coaches, here’s what I found:

At first glance, it does appear that there’s a correlation, albeit a weak one, between fourth down play-calling and team success. But as low as the R^2 value may be from a statistical point of view, in context this value seems almost too high. Can the quality of fourth down play-calling really affect how much a team wins by that much, or is Bill Belichick, that upper-right data point, skewing the data? If we add the 20 coaches who logged only three or four full seasons between 2009 and 2017, this is what we find:

That data actually has a negative slope and an R^2 of essentially zero. To be clear, it’s wrong to look at this and conclude, “Well, looks like fourth down play-calling doesn’t affect winning,” because logically it obviously does. My takeaway from this graph would instead be that play-calling matters, but so do a lot of other things, and in most cases there isn't a big enough discrepancy between teams in questionable fourth down decision-making to significantly affect a team’s win total.
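
The fits behind both scatterplots amount to a simple linear regression of win percentage on average fourth-down EWPA lost; a sketch with an assumed per-coach summary table:

```r
# R^2 of win percentage regressed on average fourth-down EWPA lost per game
fit <- lm(win_pct ~ avg_ewpa_loss_per_game, data = coach_summary)
summary(fit)$r.squared
```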



TEAMS

Another useful application of EWPA is looking at how team units performed on fourth downs without the confounding factor of their coach’s play-calls. We can do this by comparing the actual result of each play in terms of WP to the expected result given the play-call according to the EWPA framework. Of the 288 team-seasons from 2009-2017, here were the top ten fourth-down offenses:

The 5-11 2009 Raiders, who rank fourth on this list, also ranked fourth in fourth down kicking performance among all 288 team-seasons: a large factor in their high placement here. In 2009, Sebastian Janikowski hit 89.7% of his field goal attempts for the Raiders, including going 6/8 from 50+ yards while nailing a 61-yarder. The 2016 and 2017 Saints converted 5.4 and 4.2 more fourth downs, respectively, than we would expect from a team with a league-average conversion rate in the situations where they went for it. Since we’re using win probability instead of expected points, game context also carries a lot of weight. The 2011 Patriots didn’t kick exceptionally well and only posted a 46.7% fourth down conversion rate, which is nothing extraordinary. So how’d they make the list? Of the seven fourth downs they did convert, one was a go-ahead touchdown on 4th and 9 inside the two-minute warning, one was a go-ahead touchdown in the final four minutes of another game, and another was a 25-yard play to enter the red zone with two minutes remaining while trailing by one. Also keep in mind that fourth down offense in this analysis includes punt coverage units, and none of the top five “offenses” here allowed a single punt return touchdown.
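
Both the offense and defense rankings come from the same comparison: realized win probability added minus the EWP of the call that was made. A sketch, assuming a wpa column for each play's actual WP change and a posteam column for the possession team:

```r
# "WP over expected": actual WP change minus the expectation given the call
fourth$wp_over_expected <- fourth$wpa - fourth$ewpa_actual

# Sum by possession team and season; top ten fourth-down offenses
off_rank <- aggregate(wp_over_expected ~ posteam + season, data = fourth, sum)
head(off_rank[order(-off_rank$wp_over_expected), ], 10)
# (for defenses, aggregate by the defending team and flip the sign)
```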


Here’s a look at the top ten defenses (and keep in mind there is a noticeable luck factor here based on the quality of kickers a team faces):

As a Bears fan, I have to be biased and talk about their two appearances on this top-ten list, at first and seventh. Because of how often coaches elect to punt, punt returns are a significant contributing factor to these defensive rankings. In 2010, Devin Hester returned three punts for touchdowns, the second-most in a single season in this dataset behind only the 2014 Eagles, who ranked 11th and just missed the cut for the above graphic. And going back to the importance of game situation, as we saw with the 2011 Patriots offense, two of those 2010 Hester return TDs came in the fourth quarter of one-score games.



CONCLUSION

In this article I established a proven win probability model for NFL games, and with it presented league trends and the best fourth down coaches and teams since 2009 as determined by my Expected Win Probability Added metric. While I’m of course biased, given that I built the system, I see a lot of potential in using this framework for future NFL research. EWPA can already be used to answer countless questions beyond what I covered in this article, such as “Do fake punts work?” and even “What makes a fake punt likely to succeed?” Maybe in the near future I’ll return with work examining some of these other questions and deeper insights about fourth downs. Beyond that, I hope to expand EWPA to 1st, 2nd and 3rd downs to evaluate differences between running and passing in today’s evolving league. Eventually this research could be used to quantify the contributions of coaches as well as to advise them, and maybe projects like mine can drive the NFL to become more efficient and a better experience for organizations and fans alike, just as the work of Bill James, Dean Oliver, Ken Pomeroy and many more did for baseball and basketball years ago.
