1 Defining Expected Points

In this notebook I develop and explore an expected points model at the play level for evaluating college football offenses and defenses. The goal of this analysis is to place a value on offensive/defensive plays in terms of their contributions to a team’s expected points.

The data comes from collegefootballdata.com, which has play-by-play data on games from 2000 to present. Each observation represents one play in a game, for which we know the team, the situation (down, time remaining), and the location on the field (yards to go, yards to reach end zone). A text field also describes the type of play that was called.

Due to data quality issues, I focus my analysis on the years from 2007 and onwards.

1.1 Sequences of Play

For each play in a game, I model the probability of the next scoring event that will occur within the same half for either team. This means the analysis is not at the drive level, but at what I dub the sequence level. For any given play, the next scoring event can take on one of seven outcomes:

  • Touchdown (7 points)
  • Field goal (3 points)
  • Safety (2 points)
  • No Score (0 points)
  • Opp safety (-2 points)
  • Opp field goal (-3 points)
  • Opp touchdown (-7 points)

Suppose we have two teams, A and B, playing in a game. Team A receives the opening kickoff, drives for a few plays, and then punts. Team B takes over, which starts drive 2, and they drive for a few plays before also punting. Team A then manages to put together a drive that finally scores.

All plays on these three drives are one sequence. The outcome of this sequence is the points scored by Team A - if they score a touchdown, their points from this sequence are 7 (assuming for now they make the extra point). Team B’s points from this sequence are -7.

This means that each one of these plays was leading up to the Next Scoring Event of Team A scoring, which is the outcome we assign to each drive (and play) in that sequence.

If the team on offense drives down and scores a TD/FG, this will end the sequence. If the team on offense does not score but punts or turns the ball over, the sequence will continue with the other team now on offense. The sequence will continue until either one team scores, or the half comes to an end. Put differently, a sequence begins at a kickoff and ends at the next kickoff or at the end of the half. When Team A kicks off to Team B to start drive 4, we start our next sequence, which will end either with one team scoring or at the end of the half.
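To make the sequence idea concrete, here is a minimal base R sketch that labels every play in a toy half with its next scoring event. The data frame, the `SCORE_EVENT` column, and the `label_sequence()` helper are all hypothetical, and the sketch assumes the team on offense during the scoring play is the team that scored, so defensive touchdowns and safeties would need extra handling.

```r
# Toy play-by-play for one half: one row per play, in game order.
# SCORE_EVENT is NA unless the play itself produced a score (hypothetical column).
plays <- data.frame(
  PLAY_ID     = 1:9,
  OFFENSE     = c("A", "A", "A", "B", "B", "A", "A", "B", "B"),
  SCORE_EVENT = c(NA, NA, NA, NA, NA, NA, "TD", NA, NA),
  stringsAsFactors = FALSE
)

# A new sequence starts on the play after every scoring play.
scored <- !is.na(plays$SCORE_EVENT)
plays$SEQUENCE_ID <- cumsum(c(0, head(scored, -1))) + 1

# Assign the sequence outcome to every play, from the offense's perspective.
label_sequence <- function(seq_plays) {
  idx <- which(!is.na(seq_plays$SCORE_EVENT))
  if (length(idx) == 0) {
    # The half ended before anyone scored.
    seq_plays$NEXT_SCORE_EVENT_OFFENSE <- "No_Score"
    return(seq_plays)
  }
  event <- seq_plays$SCORE_EVENT[idx[1]]
  scoring_team <- seq_plays$OFFENSE[idx[1]]  # simplifying assumption, see above
  seq_plays$NEXT_SCORE_EVENT_OFFENSE <- ifelse(
    seq_plays$OFFENSE == scoring_team, event, paste0("Opp_", event)
  )
  seq_plays
}

labeled <- do.call(rbind, lapply(split(plays, plays$SEQUENCE_ID), label_sequence))
print(labeled[, c("PLAY_ID", "OFFENSE", "SEQUENCE_ID", "NEXT_SCORE_EVENT_OFFENSE")])
```

Team B’s two plays in the first sequence get labeled as an opponent touchdown, which is exactly the "both teams are affected" behavior the sequence definition is meant to capture.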

Why model the outcome of sequences rather than individual drives? Individual plays have the potential to affect both teams’ chances of scoring, positively or negatively, and we want our model to directly capture this. If an offense turns the ball over at midfield, they are not only hurting their own chances of scoring, they are increasing the other team’s chance of scoring. The value of a play in terms of expected points is a function of how both teams’ probabilities are affected by the outcome.

1.2 Defining Expected Points

A team’s expected points is the sum of the probability of each possible scoring event multiplied by the points of that event. For this analysis, I treat touchdowns as worth 7 rather than 6 points, assuming that extra points will be made. I can later bake in the actual probability of making extra points, but this will be a simplification for now.

For a given play \(i\) for Team \(A\), we can compute Team A’s expected points using the following:

\[ \text{Expected Points}_A = \Pr(\text{TD}) \cdot 7 + \Pr(\text{FG}) \cdot 3 + \Pr(\text{Safety}) \cdot 2 + \Pr(\text{No Score}) \cdot 0 + \Pr(\text{Opp. Safety}) \cdot (-2) + \Pr(\text{Opp. FG}) \cdot (-3) + \Pr(\text{Opp. TD}) \cdot (-7) \]
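As a quick worked example of the formula, the sketch below plugs in a set of made-up probabilities. The numbers in `probs` are purely illustrative and are not output from the model.

```r
# Point value of each possible next scoring event (touchdowns counted as 7).
points <- c(TD = 7, FG = 3, Safety = 2, No_Score = 0,
            Opp_Safety = -2, Opp_FG = -3, Opp_TD = -7)

# Hypothetical probabilities for a single play; they must sum to 1.
probs <- c(TD = 0.35, FG = 0.15, Safety = 0.005, No_Score = 0.20,
           Opp_Safety = 0.005, Opp_FG = 0.09, Opp_TD = 0.20)
stopifnot(isTRUE(all.equal(sum(probs), 1)))

# Expected points: probability-weighted sum of the point values.
expected_points <- sum(probs * points[names(probs)])
print(expected_points)
```

With these illustrative inputs the offense’s expected points come out to 1.23: the 2.91 points from its own scoring chances are offset by 1.68 points from the opponent’s.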

How do we get the probabilities of each scoring event? We learn these from historical data by using a model - I train a multinomial logistic regression model on many seasons’ worth of college football plays to learn how situations on the field affect the probability of the next scoring event.

1.3 Next Scoring Event

The outcome for our analysis is the NEXT_SCORE_EVENT. Each play in a given sequence contributes to the eventual outcome of the sequence. Here we can see an example of one game and its drives:

For this game, we can filter to the plays that took place in the lead up to the first scoring event. In this case, the first sequence included one drive and ended when Texas A&M kicked a field goal.

If we look at another sequence in the second half, there were multiple drives before a team was able to score in that sequence. The next scoring event is always defined from the perspective of the offense.

2 Modeling Expected Points

Our goal is to understand how individual plays contribute to a team’s expected points, or the average points teams should expect to have given their situation (down, time, possession).

For instance, in the first drive of the Texas A&M-Florida game in 2012, Texas A&M received the ball at their own 25 yard line to open the game. The simplest intuition of expected points is to ask, for teams starting at the 25 yard line at the beginning of a game, how many points do they typically go on to score? The answer is to look at all starting drives with 75 yards to go and see what the eventual next scoring event was for each of these plays - we take the average of all of the points that followed from this situation.

In this case, this means teams with the ball at their own 25 to start the game generally obtained more points on the ensuing sequence than their opponents, so they have a slightly positive expected points.
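The averaging described above can be sketched in a few lines of base R. The ten outcomes in `toy_plays` are invented for illustration; with real data, the same calculation would run over every play that matches the situation.

```r
point_values <- c(TD = 7, FG = 3, Safety = 2, No_Score = 0,
                  Opp_Safety = -2, Opp_FG = -3, Opp_TD = -7)

# Hypothetical next scoring events for ten first-and-10 plays from the offense's own 25.
toy_plays <- data.frame(
  DOWN = 1, DISTANCE = 10, YARDS_TO_GOAL = 75,
  NEXT_SCORE_EVENT_OFFENSE = c("TD", "TD", "TD", "FG", "FG",
                               "No_Score", "No_Score",
                               "Opp_FG", "Opp_TD", "Opp_TD"),
  stringsAsFactors = FALSE
)

# Convert each outcome to points from the offense's perspective, then average.
toy_plays$NEXT_SCORE_POINTS <- unname(point_values[toy_plays$NEXT_SCORE_EVENT_OFFENSE])
empirical_ep <- mean(toy_plays$NEXT_SCORE_POINTS)
print(empirical_ep)
```

In this toy sample the offense nets 10 points over 10 plays, so the empirical expected points for the situation are 1.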

But, this is also a function of the down. If we compare the expected points for a team in this situation on first down with a team in this situation on fourth down, we should see a drop in their expected points - by the time you hit fourth down, if you haven’t moved from the 25, your expected points drop into the negatives, as you will now be punting the ball back to your opponent and it becomes more probable that they score than you.

The fact that expected points change based on the down and yard line allows us to look at the difference in expected points from play to play - the change in expected points as the situation changes is what allows us to compute the Expected Points Added from a single play.

For any given play, we get a sense of the expected points a team can expect from their situation. For instance, if we look at all total plays in a game, how do expected points vary as a function of a team’s distance from their opponent’s goal line?

This should make sense - if you’re backed up against your own end zone, your opponent has higher expected points because they are, historically, more likely to have the next scoring event, either by gaining good field position after you punt or by getting a safety. We can see this if we just look at the proportion of next scoring events based on the offense’s position on the field.

From this, when we see an offense move the ball up the field on a given play, we will generally see their expected points go up. The difference in expected points before the snap and after the snap is the value added (positively or negatively) by the play.
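Here is a minimal sketch of that before/after difference for a single drive. The expected points values in `drive` are hypothetical placeholders rather than model output; the final play is a touchdown, so its post-play value is set to 7.

```r
# Hypothetical expected points before and after each play of a four-play drive.
drive <- data.frame(
  PLAY    = 1:4,
  EP_PRE  = c(1.2, 1.9, 1.6, 2.8),
  EP_POST = c(1.9, 1.6, 2.8, 7.0)
)

# Expected points added: value after the play minus value before the snap.
drive$EPA <- drive$EP_POST - drive$EP_PRE
print(drive)

# Because each play starts where the last one ended, EPA telescopes over the drive.
total_epa <- sum(drive$EPA)
print(total_epa)
```

The play-level values sum to the difference between where the drive ended and where it began (7.0 - 1.2 = 5.8), which is what makes it reasonable to roll EPA up to the drive, game, or season level.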

But, it’s not just position on the field - it’s also about the situation. If we look at how expected points vary by down, we should see that fourth downs have lower expected points.

We also have other features like distance to convert the first down (filtering here to plays with a maximum of 30 yards to go, as we start to run out of data at higher values and it looks wonky).

And we also have info on time remaining in the half - as we might expect, the proportion of drives leading to no scoring goes up as the amount of time remaining in the half goes down.

We use all of this historical data to learn the expected points from a given situation, then look at the difference in expected points from play to play - this is the intuition behind how we will value individual plays, which we can then roll up to the offense/defense/game/season level.

2.1 Building Models

How do these various features like down, distance, yards to goal, and time remaining affect the probability of the next scoring event? We use a model to learn this relationship from historical plays. I’ll now proceed to building models which I’ll use for the bulk of the analysis.

I’ll set up training, validation, and test sets based around the season. I’m mostly going to build the model using plays from the 2007 season onwards, as the data quality of the play by play data starts to get worse the further back we go, though I can later do some backtesting of the model on older seasons. I’ll train the model using regular season plays only, but I can predict postseason plays.

I’m going to use the seasons 2007-2015 as my main training set, 2016-2020 as a validation set, and leave the seasons after 2020 as my test set that I won’t look at until later on.

I want to evaluate the output of this model for later predicting games in a season, so I’m going to train the model based on the seasons in the run up to each season so that predictions are always occurring using only information available at the time.
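One way to picture the "seasons in the run up" idea is as a rolling set of splits, one per season to be predicted. This is only a sketch of the bookkeeping: `first_target` and `rolling_splits` are hypothetical names, and the choice of 2016 as the first predicted season is an assumption for illustration.

```r
seasons <- 2007:2021
first_target <- 2016  # hypothetical: first season we want out-of-sample predictions for

# For each target season, train only on the seasons that came before it.
targets <- seasons[seasons >= first_target]
rolling_splits <- lapply(targets, function(target) {
  list(train = seasons[seasons < target], predict = target)
})
names(rolling_splits) <- targets

# Example: the split used to predict the 2018 season.
str(rolling_splits[["2018"]])
```

Each element could then be used to filter the play-level data by `SEASON` before fitting, so that no play from the predicted season (or later) leaks into training.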

# full plays
plays_full = plays_score_events %>%
        filter(PLAY_TYPE != 'Kickoff') %>%
        arrange(SEASON, GAME_ID) %>%
        ungroup() %>%
        mutate(OFFENSE_DIVISION = case_when(HOME_TEAM == OFFENSE ~ HOME_DIVISION,
                                            HOME_TEAM == DEFENSE ~ AWAY_DIVISION),
               DEFENSE_DIVISION = case_when(AWAY_TEAM == DEFENSE ~ AWAY_DIVISION,
                                            AWAY_TEAM == OFFENSE ~ HOME_DIVISION)) %>%
        select(GAME_ID,
               DRIVE_ID,
               PLAY_ID,
               SEASON,
               SEASON_TYPE,
               HOME,
               AWAY,
               OFFENSE,
               DEFENSE,
               OFFENSE_CONFERENCE,
               DEFENSE_CONFERENCE,
               OFFENSE_DIVISION,
               DEFENSE_DIVISION,
               OFFENSE_SCORE,
               DEFENSE_SCORE,
               # OFFENSE_PLAY_NUMBER,
               # DEFENSE_PLAY_NUMBER,
               SCORING,
               PLAY_TEXT,
               PLAY_TYPE,
               NEXT_SCORE_EVENT_HOME,
               NEXT_SCORE_EVENT_HOME_DIFF,
               NEXT_SCORE_EVENT_OFFENSE,
               NEXT_SCORE_EVENT_OFFENSE_DIFF,
               YARD_LINE,
               HALF,
               PERIOD,
               MINUTES_IN_HALF,
               SECONDS_IN_HALF,
               DOWN,
               DISTANCE,
               YARDS_TO_GOAL) %>%
        mutate(OFFENSE_ID = factor(case_when(OFFENSE_DIVISION == 'fbs' ~ OFFENSE,
                                      TRUE ~ 'fcs')),
               DEFENSE_ID = factor(case_when(DEFENSE_DIVISION == 'fbs' ~ DEFENSE,
                                              TRUE ~ 'fcs'))) %>%
        filter(DOWN %in% c(1, 2, 3, 4)) %>%
        filter(PERIOD %in% c(1,2,3,4)) %>%
        filter(!is.na(SECONDS_IN_HALF)) %>%
        filter(DISTANCE >=0 & DISTANCE <=100) %>%
        filter(!is.na(NEXT_SCORE_EVENT_OFFENSE)) %>%
        mutate(NEXT_SCORE_EVENT_OFFENSE = factor(NEXT_SCORE_EVENT_OFFENSE,
                                                 levels = c("No_Score",
                                                            "TD",
                                                            "FG",
                                                            "Safety",
                                                            "Opp_Safety",
                                                            "Opp_FG",
                                                            "Opp_TD"))) %>%
        arrange(SEASON, GAME_ID, PLAY_ID)

# plays filter to postseason
plays_postseason = plays_full %>%
        filter(SEASON_TYPE == 'postseason')

# regular season
# training set
plays_train = plays_full %>%
        filter(SEASON_TYPE == 'regular') %>%
        filter(SEASON >= 2007 & SEASON <2016)

# validation set
plays_valid = plays_full %>%
        filter(SEASON_TYPE == 'regular') %>%
        filter(SEASON >= 2016 & SEASON <= 2020)

# test
plays_test = plays_full %>%
        filter(SEASON_TYPE == 'regular') %>%
        filter(SEASON > 2020)

# make an initial split based on previously defined splits
valid_split = make_splits(list(analysis = seq(nrow(plays_train)),
                                 assessment = nrow(plays_train) + seq(nrow(plays_valid))),
                               bind_rows(plays_train,
                                         plays_valid))

# test split
test_split = make_splits(
        list(analysis = seq(nrow(plays_train) + nrow(plays_valid)),
             assessment = nrow(plays_train) + nrow(plays_valid) + seq(nrow(plays_test))),
        bind_rows(plays_train,
                  plays_valid,
                  plays_test))

The outcome is the next scoring event, always defined from the perspective of the offense for any given play.

2.1.1 Baseline

I currently use the following as features for plays in a baseline model:

  • Quarter
  • Seconds Remaining in Half
  • Down
  • Distance (logged)
  • Yards to opponent’s end zone
  • Down-and-goal indicator for whether the distance to convert equals the yards to the end zone (e.g. a ‘first and goal’ situation)

I also include interactions between down and distance, down and yards to end zone, and yards to end zone and seconds remaining. This baseline model doesn’t account for things like offense/defense quality or scoring effects, meaning this analysis focuses on estimating the expected points given the situation without respect to opponent.

baseline_recipe = recipe(NEXT_SCORE_EVENT_OFFENSE ~.,
                         data = plays_train) %>%
        update_role(all_predictors(),
                    new_role = "ID") %>%
        update_role(
                c("GAME_ID",
                  "DRIVE_ID",
                  "PLAY_ID",
                  "SEASON",
                  "SEASON_TYPE",
                  "HOME",
                  "AWAY",
                  "OFFENSE",
                  "DEFENSE",
                  "OFFENSE_ID",
                  "DEFENSE_ID",
                  "OFFENSE_DIVISION",
                  "DEFENSE_DIVISION",
                  "OFFENSE_CONFERENCE",
                  "DEFENSE_CONFERENCE",
                  "SCORING",
                  "OFFENSE_SCORE",
                  "DEFENSE_SCORE",
                  "PLAY_TEXT",
                  "PLAY_TYPE",
                  "NEXT_SCORE_EVENT_HOME",
                  "NEXT_SCORE_EVENT_HOME_DIFF",
                  "NEXT_SCORE_EVENT_OFFENSE_DIFF",
                  "YARD_LINE",
                  "MINUTES_IN_HALF",
                  "HALF"),
                new_role = "ID") %>%
        step_mutate(PERIOD_ID = PERIOD,
                    role = "ID") %>%
        # features we're inheriting
        update_role(
                c("PERIOD", 
                "SECONDS_IN_HALF",
                "DOWN",
                "DISTANCE",
                "YARDS_TO_GOAL"),
                new_role = "predictor") %>%
        # filters for issues
        step_filter(!is.na(NEXT_SCORE_EVENT_OFFENSE)) %>%
        step_filter(YARD_LINE <= 100 & YARD_LINE >=0) %>%
        step_filter(YARDS_TO_GOAL <=100 & YARDS_TO_GOAL >=0) %>%
        step_filter(DOWN %in% c(1, 2, 3, 4)) %>%
        step_filter(DISTANCE >=0 & DISTANCE <=100) %>%
        step_filter(SECONDS_IN_HALF <=1800) %>%
        step_filter(!is.na(SECONDS_IN_HALF)) %>%
        step_filter(PERIOD_ID == 1 | PERIOD_ID == 2 | PERIOD_ID == 3 | PERIOD_ID == 4) %>%
        # create features
        step_mutate(KICKOFF = case_when(grepl("kickoff", tolower(PLAY_TEXT)) | grepl("kickoff", tolower(PLAY_TYPE)) ~ 1,
                                        TRUE ~ 0),
                    role = "ID") %>%
        step_mutate(TIMEOUT = case_when(grepl("timeout", tolower(PLAY_TEXT)) ~ 1,
                                        TRUE ~ 0),
                    role = "ID") %>%
        step_filter(TIMEOUT != 1) %>%
        step_filter(KICKOFF != 1) %>%
        step_mutate(DOWN_TO_GOAL = case_when(DISTANCE == YARDS_TO_GOAL ~ 1,
                                             TRUE ~ 0)) %>%
        step_mutate(DOWN = factor(DOWN)) %>%
        step_mutate(PERIOD = factor(PERIOD)) %>%
        step_log(DISTANCE, offset =1) %>%
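        # add the novel level before dummy encoding; after step_dummy there are
        # no nominal predictors left for step_novel to act on
        step_novel(all_nominal_predictors(),
                   new_level = "new") %>%
        step_dummy(all_nominal_predictors()) %>%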
        step_interact(terms = ~ DISTANCE:(starts_with("DOWN_"))) %>%
        step_interact(terms = ~ YARDS_TO_GOAL:(starts_with("DOWN_"))) %>%
        step_interact(terms = ~ YARDS_TO_GOAL*SECONDS_IN_HALF) %>%
        check_missing(all_predictors()) %>%
        step_zv(all_predictors()) %>%
        step_normalize(all_numeric_predictors())

2.2 Workflows

I’ll define the model I’ll be using here, which is a multinomial logistic regression.

# from glmnet
multinom_mod = multinom_reg(
  mode = "classification",
  engine = "glmnet",
  penalty = 0,
  mixture = NULL
)

I’ll then create a workflow, which will allow me to apply the recipe and the model estimation in one go.

# create baseline
baseline_wf = workflow() %>%
        add_recipe(baseline_recipe) %>%
        add_model(multinom_mod)

# workflow settings
# metrics
class_metrics<-metric_set(yardstick::roc_auc,
                          yardstick::mn_log_loss)

# control for resamples
keep_pred <- control_resamples(save_pred = TRUE, 
                               save_workflow = TRUE,
                               allow_par = TRUE)

2.3 Training

I’ll now fit the model on the training set and predict the validation set.

# fit the model to the whole training set
last_fit_baseline = baseline_wf %>%
        last_fit(split = valid_split)

2.4 Model Performance

We can evaluate the model via a leave-one-season-out approach, or via some in-sample metrics of fit, but I’ll predict the validation set as a check. I’ll compare performance relative to a null model that simply predicts the incidence rate of each outcome in the training set.
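The null model is simple enough to sketch by hand. The outcome vectors below are tiny invented samples, and the clipping constant `eps` is my own assumption to keep the log finite for classes that never appear in training.

```r
levels_nse <- c("No_Score", "TD", "FG", "Safety", "Opp_Safety", "Opp_FG", "Opp_TD")

# Hypothetical training and validation outcomes.
train_y <- factor(c(rep("TD", 4), rep("FG", 2), rep("No_Score", 2), rep("Opp_TD", 2)),
                  levels = levels_nse)
valid_y <- factor(c("TD", "FG", "Opp_TD", "TD"), levels = levels_nse)

# Null model: every play gets the training-set incidence rate of each outcome.
null_probs <- prop.table(table(train_y))

# Multinomial log loss: average negative log probability of the observed outcome.
eps <- 1e-15  # assumption: clip zero probabilities so log() stays finite
p_observed <- pmax(as.numeric(null_probs[as.character(valid_y)]), eps)
null_log_loss <- -mean(log(p_observed))
print(null_log_loss)
```

Any model worth keeping should post a lower log loss on the validation set than this baseline.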

What we really care about is the calibration of the predictions - does the observed incidence rate of events match the predicted probabilities from the model? That is, when the model predicts that the next scoring event has a probability of 0.5 of being a TD, do we observe TDs occur about half of the time?
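A calibration check of this kind can be sketched by binning predicted probabilities and comparing each bin’s average prediction with the observed rate. The data here are simulated so that the predictions are calibrated by construction; `pred_td` and `is_td` are hypothetical stand-ins for the model’s TD probability and the observed outcome.

```r
set.seed(1)
n <- 5000

# Simulated predictions and outcomes (calibrated by construction).
pred_td <- runif(n)
is_td <- rbinom(n, size = 1, prob = pred_td)

# Group plays into ten equal-width probability bins.
bins <- cut(pred_td, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)

# Compare the mean predicted probability with the observed TD rate per bin.
calibration <- data.frame(
  bin            = levels(bins),
  mean_predicted = as.numeric(tapply(pred_td, bins, mean)),
  observed_rate  = as.numeric(tapply(is_td, bins, mean)),
  n_plays        = as.numeric(table(bins))
)
print(calibration)
```

For a well-calibrated model the two columns track each other closely; systematic gaps in particular bins point to where the model over- or under-predicts.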

2.5 Inference

Understanding partial effects from a multinomial logit is already difficult, and I’ve thrown a bunch of interactions in there to make this even more difficult. We can extract the coefficients and take a look, but really in order to interpret the model we need to use predicted probabilities.

I’ll look at predicted probabilities using an observed values approach for particular features (using a sample rather than the full dataset to save time). This means taking the model and then altering the feature of interest for every observation and taking the average predicted probability for each outcome across all observations.
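The observed values approach can be sketched as follows. To keep the example self-contained in base R, it swaps in a binary logistic regression (`glm`) on simulated data as a stand-in for the multinomial model; the data-generating process, the `OFFENSE_SCORES_NEXT` outcome, and the `observed_values()` helper are all hypothetical.

```r
set.seed(42)
n <- 2000

# Simulated plays (hypothetical data-generating process).
sim <- data.frame(
  YARDS_TO_GOAL = sample(1:99, n, replace = TRUE),
  DOWN = factor(sample(1:4, n, replace = TRUE))
)
lin <- 1.5 - 0.04 * sim$YARDS_TO_GOAL - 0.3 * (as.integer(sim$DOWN) - 1)
sim$OFFENSE_SCORES_NEXT <- rbinom(n, 1, plogis(lin))

# Stand-in model: binary logit with an interaction, not the multinomial model.
fit <- glm(OFFENSE_SCORES_NEXT ~ YARDS_TO_GOAL * DOWN, data = sim, family = binomial())

# Set the feature to each value for every observation, predict, then average.
observed_values <- function(model, data, feature, values) {
  sapply(values, function(v) {
    counterfactual <- data
    counterfactual[[feature]] <- v
    mean(predict(model, newdata = counterfactual, type = "response"))
  })
}

grid <- c(10, 50, 90)
avg_prob <- observed_values(fit, sim, "YARDS_TO_GOAL", grid)
print(data.frame(YARDS_TO_GOAL = grid, avg_prob = avg_prob))
```

Every other feature keeps its observed value, so the averages reflect the mix of situations actually present in the data rather than a single hypothetical "typical" play.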

How is the probability of the next scoring event influenced by where the offense has possession?

How is this affected by the down?

How does this translate into expected points?

I’ll now retrain the model on the training and validation sets combined and predict the test set so I can save the model’s predictions for every play. I won’t take a look at these predictions until I’ve done some more analysis, but I’ll score them anyway.

That’s enough of this writeup, I’ll proceed to examining expected points added in the next section.

## Creating new version '20220718T215146Z-06251'
## Writing to pin 'expected_points'