targets
Let’s go back to my ‘template’ for organizing an R repo.
├── _targets <- stores the metadata and objects of your pipeline
├── renv <- information relating to your R packages and dependencies
├── data <- data sources used as an input into the pipeline
├── src <- functions used in project/targets pipeline
| ├── data <- functions relating to loading and cleaning data
| ├── models <- functions involved with training models
| ├── reports <- functions used in generating tables and visualizations for reports
├── _targets.R <- script that runs the targets pipeline
├── renv.lock <- lockfile detailing project requirements and dependencies
Now that we’ve covered Git, GitHub, and renv, we can start talking about the third pillar here, which is the targets package.
A predictive modeling workflow typically consists of a number of interconnected steps.
We typically build these pieces incrementally, starting from loading the data, preparing it, then ultimately training and assessing models.
The end result can look nice and tidy, and maybe you get really clever and assemble a series of scripts or notebooks that detail the steps in your project.
Your project might end up looking something like this:
01-load.R
02-tidy.R
03-model.R
04-evaluate.R
05-deploy.R
06-report.R
And you might have some sort of meta script that runs them all.
And this is working fine… until you discover an issue with a function in 02-tidy.R, or want to make a change to how you’re evaluating the model in 04-evaluate.R.
How do you insert a change into this process? If you make a change to a function, how do you know what needs to be re-run?
How many times do you just end up rerunning everything to be safe?
This pattern of developing, changing, re-running can consume a lot of time, especially with time-consuming tasks like training models.
This is the basic motivation for the targets package:
It might not be too bad when you’re actively working on a project, but suppose you’re coming back to something after a few months away.
Or suppose you look at someone else’s repo for the first time, and you have to try to figure out how to put the pieces together to produce their result.
We’d like an easier way to keep track of dependencies so that we are only re-running things when necessary, as well as provide others with a clear path to reproduce our work.
targets
Data analysis can be slow. A round of scientific computation can take several minutes, hours, or even days to complete. After it finishes, if you update your code or data, your hard-earned results may no longer be valid. Unchecked, this invalidation creates a chronic Sisyphean loop
https://books.ropensci.org/targets/
The solution to this problem is to develop pipelines, which track dependencies between steps, or “targets”, of a workflow.
When running the pipeline, it first checks to see if the upstream targets have changed since the previous run.
If the upstream targets are up to date, the pipeline will skip them and proceed to running the next step.
If everything is up to date, the pipeline will skip everything and inform you that nothing changed.
Most pipeline tools, such as Make, are either language agnostic or depend on using Python.
targets lets you build Make-style pipelines using R.
├── _targets <- stores the metadata and objects of your pipeline
├── renv <- information relating to your R packages and dependencies
├── data <- data sources used as an input into the pipeline
├── src <- functions used in project/targets pipeline
| ├── data <- functions relating to loading and cleaning data
├── _targets.R <- script that runs the targets pipeline
├── renv.lock <- lockfile detailing project requirements and dependencies
targets adds two main pieces to a project:
_targets.R is the script that will implement our pipeline. This is what we will build and develop.
_targets is a folder containing metadata for the steps defined in _targets.R, as well as cached objects from the latest run of the pipeline.
Note: by default, _targets objects are stored locally, but you can configure targets to store objects in a cloud bucket (GCP/AWS).
When you use targets locally, it will store objects from the latest run of the pipeline. If you use a cloud bucket for storage, you can enable versioning so that all runs are stored.
What is _targets.R?
Running targets::use_targets() will create a template for the _targets.R script; these scripts all follow a similar structure.
# Created by use_targets().
# Follow the comments below to fill in this target script.
# Then follow the manual to check and run the pipeline:
# https://books.ropensci.org/targets/walkthrough.html#inspect-the-pipeline
# Load packages required to define the pipeline:
library(targets)
# library(tarchetypes) # Load other packages as needed.
# Set target options:
tar_option_set(
  packages = c("tibble") # Packages that your targets need for their tasks.
  # format = "qs", # Optionally set the default storage format. qs is fast.
)
# Run the R scripts in the R/ folder with your custom functions:
tar_source()
# tar_source("other_functions.R") # Source other scripts as needed.
# Replace the target list below with your own:
list(
  tar_target(
    name = data,
    command = tibble(x = rnorm(100), y = rnorm(100))
    # format = "qs" # Efficient storage for general data objects.
  ),
  tar_target(
    name = model,
    command = coefficients(lm(y ~ x, data = data))
  )
)
Going back to our Star Wars sentiment analysis, we can build a simple targets pipeline to recreate what we did earlier. The basic steps of our pipeline will look something like this:
This is what the resulting pipeline will look like:
library(targets)
# set options
tar_option_set(packages = c("readr", "dplyr", "sentimentr", "here"))
# functions to be used
# load starwars data
load_data = function(file = here::here('materials', 'data', 'starwars_text.csv')) {
  read_csv(file)
}

# prepare data
clean_data = function(data) {
  data |>
    mutate(episode = case_when(document == 'a new hope' ~ 'iv',
                               document == 'the empire strikes back' ~ 'v',
                               document == 'return of the jedi' ~ 'vi')) |>
    mutate(character = case_when(character == 'BERU' ~ 'AUNT BERU',
                                 character == 'LURE' ~ 'LUKE',
                                 TRUE ~ character)) |>
    select(episode, everything())
}

# calculate sentiment
calculate_sentiment = function(data,
                               by = c("document", "character", "line_number")) {
  data |>
    sentiment_by(by = by) |>
    sentimentr::uncombine()
}

# define targets
list(
  tar_target(
    name = starwars,
    command =
      load_data() |>
      clean_data()
  ),
  tar_target(
    name = sentences,
    command =
      starwars |>
      get_sentences()
  ),
  tar_target(
    name = sentiment,
    command =
      sentences |>
      calculate_sentiment()
  )
)
#> [[1]]
#> <tar_stem>
#> name: starwars
#> description:
#> command:
#> clean_data(load_data())
#> format: rds
#> repository: local
#> iteration method: vector
#> error mode: stop
#> memory mode: persistent
#> storage mode: main
#> retrieval mode: main
#> deployment mode: worker
#> priority: 0
#> resources:
#> list()
#> cue:
#> mode: thorough
#> command: TRUE
#> depend: TRUE
#> format: TRUE
#> repository: TRUE
#> iteration: TRUE
#> file: TRUE
#> seed: TRUE
#> packages:
#> readr
#> dplyr
#> sentimentr
#> here
#> library:
#> NULL
#> [[2]]
#> <tar_stem>
#> name: sentences
#> description:
#> command:
#> get_sentences(starwars)
#> format: rds
#> repository: local
#> iteration method: vector
#> error mode: stop
#> memory mode: persistent
#> storage mode: main
#> retrieval mode: main
#> deployment mode: worker
#> priority: 0
#> resources:
#> list()
#> cue:
#> mode: thorough
#> command: TRUE
#> depend: TRUE
#> format: TRUE
#> repository: TRUE
#> iteration: TRUE
#> file: TRUE
#> seed: TRUE
#> packages:
#> readr
#> dplyr
#> sentimentr
#> here
#> library:
#> NULL
#> [[3]]
#> <tar_stem>
#> name: sentiment
#> description:
#> command:
#> calculate_sentiment(sentences)
#> format: rds
#> repository: local
#> iteration method: vector
#> error mode: stop
#> memory mode: persistent
#> storage mode: main
#> retrieval mode: main
#> deployment mode: worker
#> priority: 0
#> resources:
#> list()
#> cue:
#> mode: thorough
#> command: TRUE
#> depend: TRUE
#> format: TRUE
#> repository: TRUE
#> iteration: TRUE
#> file: TRUE
#> seed: TRUE
#> packages:
#> readr
#> dplyr
#> sentimentr
#> here
#> library:
#> NULL
We can view the steps that will be carried out by the pipeline using tar_manifest().
#> name command
#> 1 starwars clean_data(load_data())
#> 2 sentences get_sentences(starwars)
#> 3 sentiment calculate_sentiment(sentences)
Or, we can visualize the pipeline using tar_glimpse().
tar_visnetwork() provides a more detailed breakdown of the pipeline, including the status of individual targets, as well as the functions and where they are used.
We then run the pipeline using tar_make(), which will detail the steps that are being carried out and whether they were re-run or skipped.
#> v skipped target starwars
#> v skipped target sentences
#> v skipped target sentiment
#> v skipped pipeline [0.044 seconds]
We can then load the objects using tar_read() or tar_load().
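For example, a minimal sketch of the two access patterns (the table below was presumably built from these cached targets):

```r
library(targets)

# read a target's cached value and assign it to an object
sentiment = tar_read(sentiment)

# or load one or more targets directly into the global environment by name
tar_load(sentences)
```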
episode | document | line_number | character | dialogue | element_id | sentence_id | word_count | sentiment |
---|---|---|---|---|---|---|---|---|
iv | a new hope | 1 | THREEPIO | Did you hear that? | 1 | 1 | 4 | 0.00000000 |
iv | a new hope | 1 | THREEPIO | They've shut down the main reactor. | 1 | 2 | 6 | -0.24494897 |
iv | a new hope | 1 | THREEPIO | We'll be destroyed for sure. | 1 | 3 | 5 | -0.60373835 |
iv | a new hope | 1 | THREEPIO | This is madness! | 1 | 4 | 3 | -0.57735027 |
iv | a new hope | 2 | THREEPIO | We're doomed! | 2 | 1 | 2 | -0.70710678 |
iv | a new hope | 3 | THREEPIO | There'll be no escape for the Princess this time. | 3 | 1 | 9 | -0.11666667 |
iv | a new hope | 4 | THREEPIO | What's that? | 4 | 1 | 2 | 0.00000000 |
iv | a new hope | 5 | THREEPIO | I should have known better than to trust the logic of a half-sized thermocapsulary dehousing assister... | 5 | 1 | 17 | 0.06063391 |
iv | a new hope | 6 | LUKE | Hurry up! | 6 | 1 | 2 | 0.00000000 |
iv | a new hope | 6 | LUKE | Come with me! | 6 | 2 | 3 | 0.00000000 |
This might seem like a lot of overhead for little gain; if re-running is relatively painless, then is it worth the time to set up a pipeline?
I, and the author of the package, will argue that yes, yes it is.
targets expects users to adopt a function-oriented style of programming. User-defined R functions are essential to express the complexities of data generation, analysis, and reporting.
https://books.ropensci.org/targets/functions.html
Traditional data analysis projects consist of imperative scripts, often with numeric prefixes.
01-data.R
02-model.R
03-plot.R
To run the project, the user runs each of the scripts in order.
As we’ve previously discussed, this type of approach inherently creates problems with dependencies and trying to figure out which pieces need to be rerun.
But even more than that, this approach doesn’t do a great job explaining what exactly is happening with a project, and it can be a pain to test.
Every time you look at it, you need to read it carefully and relearn what it does. And to test it, you need to copy the entire block into the R console.
For example, rather than write a script that loads, cleans, and outputs the Star Wars data, I simply wrote two functions, which we can easily call and run as needed to get the data.
…instead of invoking a whole block of text, all you need to do is type a small reusable command. The function name speaks for itself, so you can recall what it does without having to mentally process all the details again.
Embracing functions makes it easier for us to track dependencies, explain our work, and build in small pieces that can be tested and put together to complete the larger project.
It can also really help when we have time-consuming steps.
targets demo - College Football and Elo Ratings
_targets.R
_targets metadata
_targets objects
The targets pipeline in cfb_elo looked like this:
We configured the pipeline to make API calls to get the full history of college football games, then ran a time-consuming function to calculate Elo ratings for all teams across all available games.
We were then able to develop a simple model to examine the value of home field advantage and predict the spread of games.
Let’s not get too distracted by this, but check out how the spread predictions from a simple Elo model compare to Vegas for week 1 of the 2024 season.
targets
There are a couple of additional things we need to cover about targets before we move on to building more complex pipelines for the purpose of putting models and projects into production.
Recall that running a targets pipeline creates a _targets folder within our project folder.
_targets/meta/
_targets/objects/
The meta folder contains metadata relating to the objects you created in your pipeline. This is what determines if your pipeline is up to date and tracks its lineage across different runs.
Note: this meta folder and its contents are committed to GitHub. It is required to run your pipeline, and committing it also stores the history of how your pipeline has changed during development.
The objects folder contains the actual objects created by runs of your pipeline. These are the most up-to-date versions of the objects in your pipeline from your most recent run with tar_make(), and they can be loaded using tar_read() and tar_load().
Importantly, objects are not committed to GitHub. These objects are the various artifacts of your pipeline runs (data, models, etc.) and can be quite large. Git is not intended to handle diffs for objects of this nature, so committing these would be a bad idea.
You’ll notice that you can’t even commit _targets/objects to your repository; by default, these objects are ignored with a special .gitignore.
By default, the objects you create in your pipeline are stored locally, that is, on the machine running the pipeline. This means that pipeline runs are isolated from each other.
If you want to fiddle with adding a new step to my pipeline, you will have to re-run the entire pipeline; the objects stored on my machine will not be available to you.
This also means that, locally, the stored objects are always from the last time a target was run via tar_make().
This means that targets does not, by default, have data version control; you are not storing multiple versions of your objects as your pipeline changes. You are always overwriting old output with new output.
However, we can configure targets to export the objects to a shared cloud location so that:
objects are no longer isolated to the machine of the run
objects are stored using cloud versioning to preserve the lineage of our pipeline
At the top of our _targets.R script, we have options we use to define the pipeline.
This includes setting the packages that should be loaded for the entire pipeline, the format of the targets to be saved, and the location, or repository, in which to store the objects.
By default, this is set to “local”.
But we can set the repository to a cloud storage location (AWS, GCP), which will then export our objects and their metadata to a cloud bucket.
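As a rough sketch of what that configuration might look like for AWS (the bucket name and prefix here are placeholders, not from the original project):

```r
library(targets)

tar_option_set(
  packages = c("readr", "dplyr"),
  repository = "aws",                 # store target objects in S3 instead of _targets/objects
  resources = tar_resources(
    aws = tar_resources_aws(
      bucket = "my-pipeline-bucket",  # hypothetical bucket name
      prefix = "_targets"             # key prefix within the bucket
    )
  )
)
```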
This is what I tend to do for my own projects, as it shifts all of my storage to the cloud and I can pick up and work on pipelines between different workstations without needing to re-run the pipeline every time.
It also stores the lineage of my work historically, so that I can easily revert to past versions if needed.
However, using the cloud introduces an added wrinkle of requiring authentication around our pipeline, which we will cover later.
Sadly, at the minute, targets is only set up to use AWS and GCP out of the box; Azure is in development but would currently require some custom configuration.
git + targets + renv for predictive modeling projects
Let’s revisit some of the original motivations for this workshop.
How do I share my code with you, so that you can run my code, make changes, and let me know what you’ve changed?
How can a group of people work on the same project without getting in each other’s way?
How do we ensure that we are running the same code and avoid conflicts from packages being out of date?
How can we run experiments and test out changes without breaking the current project?
How do we take a project into production?
targets and predictive models
Let’s talk about building predictive modeling pipelines in targets, the thing most of us are ultimately employed to do.
As with any other type of project, we want to write code that is transparent, reproducible, and allows for collaborative development and testing.
In principle, Git/GitHub and renv are the most important pieces for allowing us to do this; we are not required to use targets for training/deploying models.
But I have found its functional, Make-style approach to be well suited for managing the predictive modeling life cycle.
A predictive modeling run is, after all, a DAG.
flowchart LR
  raw[Raw Data] --> clean[Clean Data]
  clean --> train[Training Set]
  clean --> valid[Validation Set]
  train --> preprocessor(Preprocessor)
  preprocessor --> resamples[Bootstraps]
  resamples --> model(glmnet)
  model --> features(Impute + Normalize)
  features --> tuning(Tuning)
  tuning --> valid
  preprocessor --> valid
  valid --> evaluation[Model Evaluation]
  train --> final(Model)
  valid --> final
In the sections to come, we will be splitting/training/finalizing/deploying predictive models in a targets pipeline.
Most of the examples we’re going to work on will assume some level of familiarity with tidymodels. What is everyone’s familiarity with tidymodels?
Again, in principle, you do not have to use tidymodels in pipelines, but it provides a standardized way to train models that naturally works well with functional programming.
Therefore:
tidymodels refers to a suite of R packages that bring the design philosophy and grammar of the tidyverse to training models
if you’re like me and originally cut your teeth with the caret package, tidymodels is its successor from the same person (Max Kuhn, praise his name)
fun fact: tidymodels is basically just a GitHub organization:
Recall a sample predictive modeling pipeline.
flowchart LR
  raw[Raw Data] --> clean[Clean Data]
  clean --> train[Training Set]
  clean --> valid[Validation Set]
  train --> preprocessor(Preprocessor)
  preprocessor --> resamples[Bootstraps]
  resamples --> model(glmnet)
  model --> features(Impute + Normalize)
  features --> tuning(Tuning)
  tuning --> valid
  preprocessor --> valid
  valid --> evaluation[Model Evaluation]
  train --> final(Model)
  valid --> final
Breaking this pipeline down into key parts, we have:
splitting/resampling (train/valid, bootstraps, cross validation)
preprocessing (imputation/normalization)
model specification (glmnet, random forest)
tuning over parameters (mtry, penalty)
model assessment (rmse, log loss)
Each of these corresponds to a key concept/package in tidymodels:
splitting/resampling (train/valid, bootstraps, cross validation) -> rsets from rsample
preprocessing (imputation/normalization) -> recipes
model specification (glmnet, random forest) -> models from parsnip
tuning over parameters (mtry, penalty) -> tune and dials
model assessment (rmse, log loss) -> yardstick
tidymodels concepts
recipes
models from parsnip
workflows
splits/resamples from rsample
metrics from yardstick and tune
a model is a specification (from parsnip) that defines the type of model to be trained (linear model, random forest), its mode (classification, regression), and its underlying engine (lm, stan_lm, ranger, xgboost, lightgbm)
parsnip provides a standardized interface for specifying models, which allows us to easily run different types of models without having to rewrite our code to accommodate differences
if you’ve ever been annoyed with having to create y and x matrices for glmnet or ranger, parsnip is something of a lifesaver
a linear model with lm
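The chunk that produced this printout isn’t shown; it was likely something along these lines (the translate() call is an assumption, since the output includes the model fit template):

```r
library(parsnip)

# specify a linear regression with the lm engine and show the underlying fit call
linear_reg(mode = "regression") |>
  set_engine("lm") |>
  translate()
```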
#> Linear Regression Model Specification (regression)
#>
#> Computational engine: lm
#>
#> Model fit template:
#> stats::lm(formula = missing_arg(), data = missing_arg(), weights = missing_arg())
a linear model with glmnet
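Again, presumably something like:

```r
# same model type, different engine; penalty = 0 gives an unpenalized fit
linear_reg(mode = "regression", penalty = 0) |>
  set_engine("glmnet") |>
  translate()
```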
#> Linear Regression Model Specification (regression)
#>
#> Main Arguments:
#> penalty = 0
#>
#> Computational engine: glmnet
#>
#> Model fit template:
#> glmnet::glmnet(x = missing_arg(), y = missing_arg(), weights = missing_arg(),
#> family = "gaussian")
a random forest with ranger (specifying tuning over the number of trees and the number of randomly selected variables)
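A sketch of the specification that would print the output below:

```r
# mark mtry and trees as tuning parameters rather than fixed values
rand_forest(mode = "classification", mtry = tune::tune(), trees = tune::tune()) |>
  set_engine("ranger")
```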
#> Random Forest Model Specification (classification)
#>
#> Main Arguments:
#> mtry = tune::tune()
#> trees = tune::tune()
#>
#> Computational engine: ranger
boosted trees with xgboost
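And a sketch for the boosted tree specification, mixing fixed and tuned arguments:

```r
boost_tree(
  mode = "classification",
  mtry = tune::tune(),
  trees = 500,
  tree_depth = tune::tune(),
  sample_size = tune::tune(),
  stop_iter = 50
) |>
  set_engine("xgboost")
```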
#> Boosted Tree Model Specification (classification)
#>
#> Main Arguments:
#> mtry = tune::tune()
#> trees = 500
#> tree_depth = tune::tune()
#> sample_size = tune::tune()
#> stop_iter = 50
#>
#> Computational engine: xgboost
This allows us to easily fit models in a standardized way despite their engines requiring different formulas/syntax
to fit a model we simply pass along a formula and a dataset to fit()
fitting a linear model with lm
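The lm fit isn’t shown in the extracted output; it would presumably mirror the glmnet example that follows:

```r
linear_reg(mode = "regression") |>
  set_engine("lm") |>
  fit(mpg ~ hp + wt, data = mtcars)
```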
fitting a ridge regression with glmnet
linear_reg(mode = "regression", penalty = 0) |>
  set_engine("glmnet") |>
  fit(mpg ~ hp + wt, data = mtcars)
#> parsnip model object
#>
#>
#> Call: glmnet::glmnet(x = maybe_matrix(x), y = y, family = "gaussian")
#>
#> Df %Dev Lambda
#> 1 0 0.00 5.1470
#> 2 1 12.78 4.6900
#> 3 1 23.39 4.2730
#> 4 1 32.20 3.8940
#> 5 2 39.55 3.5480
#> 6 2 46.87 3.2320
#> 7 2 52.95 2.9450
#> 8 2 58.00 2.6840
#> 9 2 62.19 2.4450
#> 10 2 65.67 2.2280
#> 11 2 68.55 2.0300
#> 12 2 70.95 1.8500
#> 13 2 72.94 1.6850
#> 14 2 74.60 1.5360
#> 15 2 75.97 1.3990
#> 16 2 77.11 1.2750
#> 17 2 78.05 1.1620
#> 18 2 78.84 1.0580
#> 19 2 79.49 0.9645
#> 20 2 80.03 0.8788
#> 21 2 80.48 0.8007
#> 22 2 80.85 0.7296
#> 23 2 81.16 0.6648
#> 24 2 81.42 0.6057
#> 25 2 81.63 0.5519
#> 26 2 81.81 0.5029
#> 27 2 81.96 0.4582
#> 28 2 82.08 0.4175
#> 29 2 82.18 0.3804
#> 30 2 82.27 0.3466
#> 31 2 82.34 0.3158
#> 32 2 82.39 0.2878
#> 33 2 82.44 0.2622
#> 34 2 82.48 0.2389
#> 35 2 82.52 0.2177
#> 36 2 82.54 0.1983
#> 37 2 82.57 0.1807
#> 38 2 82.59 0.1647
#> 39 2 82.60 0.1500
#> 40 2 82.61 0.1367
#> 41 2 82.63 0.1246
#> 42 2 82.63 0.1135
#> 43 2 82.64 0.1034
#> 44 2 82.65 0.0942
#> 45 2 82.65 0.0859
#> 46 2 82.66 0.0782
#> 47 2 82.66 0.0713
#> 48 2 82.66 0.0649
#> 49 2 82.67 0.0592
#> 50 2 82.67 0.0539
#> 51 2 82.67 0.0491
#> 52 2 82.67 0.0448
#> 53 2 82.67 0.0408
#> 54 2 82.67 0.0372
#> 55 2 82.67 0.0339
recipes capture steps for preprocessing data prior to training a model.
a recipe is a type of preprocessor that can dynamically apply transformations (imputation, normalization, dummies) to the data we are using to model.
we create them with recipe(), typically specifying a formula and a dataset. we then add steps to the recipe of the form step_ (step_mutate, step_impute, step_nzv, …)
library(splines2)

rec =
  recipe(mpg ~ ., data = mtcars) |>
  step_spline_b("hp", deg_free = 3) |>
  step_interact(terms = ~ gear:wt) |>
  step_normalize(all_numeric_predictors())

rec$var_info
#> # A tibble: 11 × 4
#> variable type role source
#> <chr> <list> <chr> <chr>
#> 1 cyl <chr [2]> predictor original
#> 2 disp <chr [2]> predictor original
#> 3 hp <chr [2]> predictor original
#> 4 drat <chr [2]> predictor original
#> 5 wt <chr [2]> predictor original
#> 6 qsec <chr [2]> predictor original
#> 7 vs <chr [2]> predictor original
#> 8 am <chr [2]> predictor original
#> 9 gear <chr [2]> predictor original
#> 10 carb <chr [2]> predictor original
#> 11 mpg <chr [2]> outcome original
using recipes involves two main steps:
prep()
bake()
prep()
preparing a recipe is kind of like training a model; it captures/estimates information on one dataset and will apply those same transformations to a new dataset
This is really important for things like normalization/imputation, as we want to apply the same transformations to unseen data that were used on the training set
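The code for this step isn’t shown; a minimal sketch, assuming the recipe rec defined above:

```r
# estimate the preprocessing steps (spline basis, interaction, normalization) on the data
prepped = prep(rec)

# the prepared recipe now tracks the derived columns alongside the originals
prepped$term_info
```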
#> # A tibble: 14 × 4
#> variable type role source
#> <chr> <list> <chr> <chr>
#> 1 cyl <chr [2]> predictor original
#> 2 disp <chr [2]> predictor original
#> 3 drat <chr [2]> predictor original
#> 4 wt <chr [2]> predictor original
#> 5 qsec <chr [2]> predictor original
#> 6 vs <chr [2]> predictor original
#> 7 am <chr [2]> predictor original
#> 8 gear <chr [2]> predictor original
#> 9 carb <chr [2]> predictor original
#> 10 mpg <chr [2]> outcome original
#> 11 hp_1 <chr [2]> predictor derived
#> 12 hp_2 <chr [2]> predictor derived
#> 13 hp_3 <chr [2]> predictor derived
#> 14 gear_x_wt <chr [2]> predictor derived
bake()
baking a recipe produces the dataframe/matrix that will be used in modeling
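A sketch of the bake call, assuming the prepped recipe from the previous step:

```r
# apply the estimated transformations and return the modeling data
bake(prepped, new_data = NULL) |>
  head(5)
```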
#> # A tibble: 5 × 14
#> cyl disp drat wt qsec vs am gear carb mpg hp_1
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 -0.105 -0.571 0.568 -0.61 -0.777 -0.868 1.19 0.424 0.735 21 0.627
#> 2 -0.105 -0.571 0.568 -0.35 -0.464 -0.868 1.19 0.424 0.735 21 0.627
#> 3 -1.23 -0.99 0.474 -0.917 0.426 1.12 1.19 0.424 -1.12 22.8 0.093
#> 4 -0.105 0.22 -0.966 -0.002 0.89 1.12 -0.814 -0.932 -1.12 21.4 0.627
#> 5 1.01 1.04 -0.835 0.228 -0.464 -0.868 -0.814 -0.932 -0.503 18.7 0.839
#> # ℹ 3 more variables: hp_2 <dbl>, hp_3 <dbl>, gear_x_wt <dbl>
recipes are especially helpful for handling categorical features, as we can easily create steps for handling novel levels or pooling infrequent levels.
data(ames, package = "modeldata")

ames_rec <-
  recipe(
    Sale_Price ~ Neighborhood,
    data = ames
  ) |>
  step_novel(Neighborhood) |>
  step_other(Neighborhood, threshold = 0.05, other = "Other") |>
  step_dummy(all_nominal_predictors())

ames_rec |>
  prep() |>
  bake(new_data = NULL) |>
  head(15)
#> # A tibble: 15 × 9
#> Sale_Price Neighborhood_College_…¹ Neighborhood_Old_Town Neighborhood_Edwards
#> <int> <dbl> <dbl> <dbl>
#> 1 215000 0 0 0
#> 2 105000 0 0 0
#> 3 172000 0 0 0
#> 4 244000 0 0 0
#> 5 189900 0 0 0
#> 6 195500 0 0 0
#> 7 213500 0 0 0
#> 8 191500 0 0 0
#> 9 236500 0 0 0
#> 10 189000 0 0 0
#> 11 175900 0 0 0
#> 12 185000 0 0 0
#> 13 180400 0 0 0
#> 14 171500 0 0 0
#> 15 212000 0 0 0
#> # ℹ abbreviated name: ¹Neighborhood_College_Creek
#> # ℹ 5 more variables: Neighborhood_Somerset <dbl>,
#> # Neighborhood_Northridge_Heights <dbl>, Neighborhood_Gilbert <dbl>,
#> # Neighborhood_Sawyer <dbl>, Neighborhood_Other <dbl>
workflows bundle models from parsnip and preprocessors from recipes into one object, which can then be trained/tuned/fit with a single call.
combining a model and recipe into a workflow
mod =
  linear_reg(mode = "regression") |>
  set_engine("lm")

rec =
  recipe(mpg ~ ., data = mtcars) |>
  step_spline_b("hp", deg_free = 3) |>
  step_interact(terms = ~ gear:wt) |>
  step_normalize(all_numeric_predictors())

wflow =
  workflow() |>
  add_recipe(rec) |>
  add_model(mod)

wflow
#> ══ Workflow ════════════════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: linear_reg()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 3 Recipe Steps
#>
#> • step_spline_b()
#> • step_interact()
#> • step_normalize()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#> Linear Regression Model Specification (regression)
#>
#> Computational engine: lm
fitting a workflow with fit() prepares our recipe and trains our model in one call
# fitting workflow
fit =
  wflow |>
  fit(mtcars)

# examining fit
fit |>
  broom::tidy() |>
  mutate_if(is.numeric, round, 3)
#> # A tibble: 14 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 20.1 0.402 50.0 0
#> 2 cyl 1.92 1.76 1.09 0.29
#> 3 disp -0.563 2.02 -0.278 0.784
#> 4 drat 0.019 0.95 0.02 0.984
#> 5 wt 4.76 3.84 1.24 0.231
#> 6 qsec 1.63 1.21 1.34 0.196
#> 7 vs 0.133 0.932 0.143 0.888
#> 8 am 0.467 0.939 0.497 0.625
#> 9 gear 6.05 2.37 2.56 0.02
#> 10 carb -0.298 1.21 -0.247 0.808
#> 11 hp_1 -1.64 0.786 -2.08 0.052
#> 12 hp_2 -1.02 1.02 -1.00 0.328
#> 13 hp_3 -1.36 1.20 -1.13 0.272
#> 14 gear_x_wt -6.69 2.91 -2.30 0.034
when we are using targets to train predictive models, a workflow will typically be the final step; it is the object we are trying to produce that can be used to predict new data
workflows make model deployment relatively straightforward, as we just need to export/share/deploy our finalized workflow
we’ll go over this in a bit with the vetiver package
tidymodels concepts
recipes ✓
models from parsnip ✓
workflows ✓
splits/resamples from rsample
metrics from yardstick and tune
splitting our data (train/valid, bootstraps, cross validation) is a standard part of training/assessing predictive models
the rsample package provides a standardized way to do this that works directly with workflows
creating a train/validation split
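A minimal sketch with rsample (the proportion here is arbitrary):

```r
library(rsample)

# hold out part of mtcars as a validation set
split = initial_split(mtcars, prop = 0.75)
train = training(split)
valid = testing(split)
```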
creating bootstraps
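The ten bootstraps shown below were presumably created with something like:

```r
boots = bootstraps(mtcars, times = 10)
boots
```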
#> # Bootstrap sampling
#> # A tibble: 10 × 2
#> splits id
#> <list> <chr>
#> 1 <split [32/13]> Bootstrap01
#> 2 <split [32/11]> Bootstrap02
#> 3 <split [32/11]> Bootstrap03
#> 4 <split [32/8]> Bootstrap04
#> 5 <split [32/10]> Bootstrap05
#> 6 <split [32/12]> Bootstrap06
#> 7 <split [32/10]> Bootstrap07
#> 8 <split [32/10]> Bootstrap08
#> 9 <split [32/12]> Bootstrap09
#> 10 <split [32/10]> Bootstrap10
creating cross validation folds
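For example (the number of folds is arbitrary):

```r
folds = vfold_cv(mtcars, v = 5)
```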
each individual row contains an rsplit object, which has the original data stored as a single training/test split
these sets can be extracted via the functions rsample::training() or rsample::testing()
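The two data frames below look like the in-bag and out-of-bag sets of the first bootstrap; a sketch of how they might have been extracted, following the accessors named above:

```r
first_split = boots$splits[[1]]

# in-bag sample (rows drawn with replacement)
training(first_split)

# out-of-bag rows
testing(first_split)
```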
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
#> AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
#> Pontiac Firebird...3 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
#> Toyota Corona...4 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
#> Merc 450SE...5 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
#> Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
#> Mazda RX4 Wag...7 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
#> Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
#> Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
#> Duster 360...10 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
#> Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
#> Maserati Bora...12 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
#> Maserati Bora...13 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
#> Merc 240D...14 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
#> Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
#> Chrysler Imperial...16 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
#> Fiat X1-9...17 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
#> Duster 360...18 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
#> Duster 360...19 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
#> Merc 450SE...20 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
#> Chrysler Imperial...21 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
#> Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
#> Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
#> Merc 450SE...24 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
#> Pontiac Firebird...25 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
#> Merc 240D...26 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
#> Mazda RX4 Wag...27 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
#> Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
#> Fiat X1-9...29 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
#> Chrysler Imperial...30 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
#> Duster 360...31 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
#> Toyota Corona...32 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
#> Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
#> Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
#> Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
#> Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
#> Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
#> Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#> Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
#> Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
#> Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
#> Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
#> Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
#> Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
nested rsplit objects make it easy to do tidy evaluation for models across resamples, such as estimating models/parameters
fit_model = function(split) {
  linear_reg(mode = "regression") |>
    fit(mpg ~ wt + hp + disp, data = training(split)) |>
    broom::tidy()
}

# fit model to 500 bootstraps and plot distribution of coefficients
estimates =
  mtcars |>
  rsample::bootstraps(times = 500) |>
  mutate(results = map(splits, fit_model)) |>
  select(id, results)

estimates
#> # A tibble: 500 × 2
#> id results
#> <chr> <list>
#> 1 Bootstrap001 <tibble [4 × 5]>
#> 2 Bootstrap002 <tibble [4 × 5]>
#> 3 Bootstrap003 <tibble [4 × 5]>
#> 4 Bootstrap004 <tibble [4 × 5]>
#> 5 Bootstrap005 <tibble [4 × 5]>
#> 6 Bootstrap006 <tibble [4 × 5]>
#> 7 Bootstrap007 <tibble [4 × 5]>
#> 8 Bootstrap008 <tibble [4 × 5]>
#> 9 Bootstrap009 <tibble [4 × 5]>
#> 10 Bootstrap010 <tibble [4 × 5]>
#> # ℹ 490 more rows
for predictive modeling workflows, rsample is typically used in conjunction with yardstick and tune to estimate performance for a model or tune a model across parameters
we specify the type of metrics we want to use in a metric_set()
then we can fit our workflow across resamples and estimate its performance across these metrics
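The fit_resamples() call below references boots and my_metrics, which aren’t defined in the visible code; they were presumably created along these lines (the metric choices are inferred from the output):

```r
library(rsample)
library(yardstick)
library(tune)

boots = bootstraps(mtcars, times = 10)
my_metrics = metric_set(rmse, rsq, ccc)
```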
wflow |>
  fit_resamples(
    resamples = boots,
    metrics = my_metrics
  ) |>
  collect_metrics() |>
  mutate_if(is.numeric, round, 3)
#> # A tibble: 3 × 6
#> .metric .estimator mean n std_err .config
#> <chr> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 ccc standard 0.535 10 0.106 Preprocessor1_Model1
#> 2 rmse standard 6.30 10 0.816 Preprocessor1_Model1
#> 3 rsq standard 0.489 10 0.081 Preprocessor1_Model1
tidymodels concepts
recipes ✓
models from parsnip ✓
workflows ✓
splits/resamples from rsample ✓
metrics from yardstick and tune ✓
I realize this is a lot to take in, but once we are familiar with these concepts it becomes much, much easier to standardize our predictive modeling so that we can easily train/test/deploy different kinds of models within our pipeline
flowchart LR
  raw[Raw Data] --> clean[Clean Data]
  clean --> train[Training Set]
  clean --> valid[Validation Set]
  train --> preprocessor(Preprocessor)
  preprocessor --> resamples[Bootstraps]
  resamples --> model(glmnet)
  model --> features(Impute + Normalize)
  features --> tuning(Tuning)
  tuning --> valid
  preprocessor --> valid
  valid --> evaluation[Model Evaluation]
  train --> final(Model)
  valid --> final
flowchart LR
  raw[Raw Data] --> clean[Clean Data]
  clean --> train[Training Set]
  clean --> valid[Validation Set]
  train --> preprocessor(Preprocessor)
  preprocessor --> resamples[Cross validation]
  resamples --> model(lightgbm)
  model --> features(Minimal)
  features --> tuning(Tuning)
  tuning --> valid
  preprocessor --> valid
  valid --> evaluation[Model Evaluation]
  train --> final(Model)
  valid --> final
targets and predictive models
Let’s walk through the process of building a targets pipeline for a predictive model that we will look to deploy.
Suppose we were working with the rather famous nycflights13 dataset to train a model to predict whether departed flights would arrive late or on time.
library(nycflights13)

flights =
  nycflights13::flights |>
  mutate(arr_delay = case_when(arr_delay >= 30 ~ 'late', TRUE ~ 'on_time'),
         arr_delay = factor(arr_delay, levels = c("on_time", "late")),
         date = as.Date(time_hour))

flights |>
  select(date, arr_delay, dep_time, arr_time, carrier, origin, dest, air_time, distance) |>
  head(5)
#> # A tibble: 5 × 9
#> date arr_delay dep_time arr_time carrier origin dest air_time distance
#> <date> <fct> <int> <int> <chr> <chr> <chr> <dbl> <dbl>
#> 1 2013-01-01 on_time 517 830 UA EWR IAH 227 1400
#> 2 2013-01-01 on_time 533 850 UA LGA IAH 227 1416
#> 3 2013-01-01 late 542 923 AA JFK MIA 160 1089
#> 4 2013-01-01 on_time 544 1004 B6 JFK BQN 183 1576
#> 5 2013-01-01 on_time 554 812 DL LGA ATL 116 762
We have one year’s worth of flights to examine, with information about the carrier, the origin, the destination, the departure time, etc.
our outcome is a binary variable arr_delay indicating whether the flight was on time or late.
Our end goal is to produce a model that can be used to predict new data in production.
To get to this point, we will need to split our data, train models, estimate their performance, and select the best-performing model.
I have already started this process; I want you to now pick up where I left off.
https://github.com/ds-workshop/flights
the split branch
your-name/split
renv::restore()
targets::tar_glimpse()
targets::tar_make()
The split branch contains the following pipeline:
What do we need to do next to get to our end goal of a finalized predictive model?
What do you notice about missingness in train_data? How would you handle this missingness in a model?
create a recipe from train_data with arr_delay as the outcome and air_time and distance as predictors
We can create a recipe in the following way:
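The original chunk isn’t shown here; a sketch of what the recipe likely looked like (the imputation and normalization steps are assumptions based on the missingness question above and the standardized output below):

```r
library(recipes)

baseline_rec =
  recipe(arr_delay ~ air_time + distance, data = train_data) |>
  step_impute_median(all_numeric_predictors()) |>  # assumed handling for the missing values noted above
  step_normalize(all_numeric_predictors())
```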
We can then see how this recipe prepares data if we prep it on our training set and then use bake.
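Something like the following, assuming the baseline_rec sketched above:

```r
baseline_rec |>
  prep() |>
  bake(new_data = NULL) |>
  head(15)
```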
#> # A tibble: 15 × 3
#> air_time distance arr_delay
#> <dbl> <dbl> <fct>
#> 1 0.562 0.491 on_time
#> 2 2.60 2.11 on_time
#> 3 -0.228 0.453 on_time
#> 4 -0.531 -0.419 on_time
#> 5 -0.250 -0.419 on_time
#> 6 -0.801 -0.851 on_time
#> 7 -0.488 -0.438 on_time
#> 8 -0.791 -0.734 on_time
#> 9 0.670 0.791 on_time
#> 10 1.85 2.02 on_time
#> 11 0.291 -0.0435 late
#> 12 -0.228 -0.131 on_time
#> 13 -0.401 -0.539 late
#> 14 -0.455 -0.429 on_time
#> 15 0.183 -0.0272 on_time
Now, we want to train a workflow.
the model/baseline branch
targets::tar_make()
The model/baseline branch added new steps to the pipeline; we added a workflow that we fit to the training set and assessed on the validation set.
Notice that we directly wrote these metrics to a csv in our project (targets-runs/valid_metrics.csv), which we will then commit to our repository.
This will allow us to track model performance on our validation set using Git as we add new models/tinker with the original model.
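One way to keep a file like this inside the pipeline is a file-format target; a sketch, where the valid_metrics target name and the helper code are assumptions based on the description above:

```r
# in _targets.R: write the validation metrics to a csv that Git can track
tar_target(
  valid_metrics_file,
  {
    readr::write_csv(valid_metrics, "targets-runs/valid_metrics.csv")
    "targets-runs/valid_metrics.csv"  # file-format targets return the path they create
  },
  format = "file"
)
```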
add dep_time as a feature to the baseline model
targets::tar_make()
The model/glmnet branch added a more robust recipe to make use of more features, particularly categorical features. We then added a new workflow to the pipeline, which we trained and assessed as before.
At this point, we have a decent first candidate for a model based on the validation set. What do we need to do to finalize this model?
We’ll want to refit the workflow on the training + validation data, then assess its performance on the test set.
Then, we’ll refit to training + validation + test and prepare the model for deployment with vetiver.
This pipeline produces a final workflow that we then turn into a vetiver_model for the purpose of using the model in a production setting.
vetiver provides a standardized way for bundling workflows with the information needed to version, store, and deploy them.
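A sketch of that final step, assuming a finalized workflow called final_wflow (the model name and the board are placeholders):

```r
library(vetiver)
library(pins)

# bundle the finalized workflow with the metadata needed to version and deploy it
v = vetiver_model(final_wflow, model_name = "flights-arr-delay")

# pin the model to a versioned board; a cloud board (S3, GCS) works the same way
board = board_temp(versioned = TRUE)
vetiver_pin_write(board, v)
```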
This branch is stable in the sense that we could run the code from this branch to produce a model object that is ready to work in production.
We’ll talk about pinning a vetiver model to a model board in just a little bit; bear with me.
I’m slightly regretting the order in which I set this up but we must press onward.
How would we then train and evaluate a different model?
targets::tar_glimpse()
I added a workflow using boosted trees with lightgbm and found it produced better results across the board than glmnet.
Notice that I have this pipeline configured to use manual model selection; to update the final model, you simply assign your tuned model of choice to best_model, which is then refit and finalized.
If we navigate to the main branch on the GitHub repository, we can see the following:
Notice how it’s been kind of a pain to keep track of our model metrics? We have to check out the branch, run the pipeline, and then read in the valid_metrics.csv file.
We can make our lives easier for viewing things like this using GitHub Actions, which run automatically on a push or pull request.
# name: updating the README
#
# on:
# workflow_dispatch:
# push:
# branches: [ "main", "dev"]
#
# jobs:
# build:
# runs-on: ubuntu-latest
# permissions:
# contents: write
#
# strategy:
# matrix:
# r-version: ['4.4.1']
#
# steps:
# - name: Checkout repository
# uses: actions/checkout@v4
#
# - name: Set up Quarto
# uses: quarto-dev/quarto-actions/setup@v2
#
# - name: Set up R ${{ matrix.r-version }}
# uses: r-lib/actions/setup-r@v2
# with:
# r-version: ${{ matrix.r-version }}
# use-public-rspm: true
#
# - name: Install additional Linux dependencies
# if: runner.os == 'Linux'
# run: |
# sudo apt-get update -y
# sudo apt-get install -y libgit2-dev libglpk40
#
# - name: Setup renv and install packages
# uses: r-lib/actions/setup-renv@v2
# with:
# cache-version: 1
# env:
# RENV_CONFIG_REPOS_OVERRIDE: https://packagemanager.rstudio.com/all/latest
# GITHUB_PAT: ${{ secrets.GH_PAT}}
#
# - name: Render README
# shell: bash
# run: |
# git config --global user.name ${{ github.actor }}
# quarto render README.qmd
# git commit README.md -m 'Re-build README.qmd' || echo "No changes to commit"
# git push origin || echo "No changes to commit"
#
Currently, this just renders the README, which I have set up to display the valid_metrics.csv and test_metrics.csv files that are in the branch.
If we wanted to, for instance, see every committed version of valid_metrics.csv, we just have to configure it in the README.