Let’s go back to my ‘template’ for organizing an R repo.
├── _targets <- stores the metadata and objects of your pipeline
├── renv <- information relating to your R packages and dependencies
├── data <- data sources used as an input into the pipeline
├── src <- functions used in project/targets pipeline
| ├── data <- functions relating to loading and cleaning data
| ├── models <- functions involved with training models
| ├── reports <- functions used in generating tables and visualizations for reports
├── _targets.R <- script that runs the targets pipeline
├── renv.lock <- lockfile detailing project requirements and dependencies
Now that we’ve covered Git, GitHub, and renv, we can start talking about the third pillar here, which is the targets package.
The Problem
A predictive modeling workflow typically consists of a number of interconnected steps.
We typically build these pieces incrementally, starting from loading the data, preparing it, then ultimately training and assessing models.
The end result can look nice and tidy, and maybe you get really clever and assemble a series of scripts or notebooks that detail the steps in your project.
The Problem
Your project might end up looking something like this:
01-load.R
02-tidy.R
03-model.R
04-evaluate.R
05-deploy.R
06-report.R
And you might have some sort of meta script that runs them all.
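Such a meta script is often just a series of source() calls; a minimal sketch (the filenames match the list above) might look like this:

# run everything, in order
source("01-load.R")
source("02-tidy.R")
source("03-model.R")
source("04-evaluate.R")
source("05-deploy.R")
source("06-report.R")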
And this is working fine… until you discover an issue with a function in 02-tidy.R, or want to change how you’re evaluating the model in 04-evaluate.R.
The Problem
How do you insert a change into this process? If you make a change to a function, how do you know which steps need to be re-run?
How many times do you just end up rerunning everything to be safe?
The Problem
This pattern of developing, changing, re-running can consume a lot of time, especially with time-consuming tasks like training models.
This is the basic motivation for the targets package:
The Problem
It might not be too bad when you’re actively working on a project, but suppose you’re coming back to something after a few months away.
Or suppose you look at someone else’s repo for the first time, and you have to try to figure out how to put the pieces together to produce their result.
We’d like an easier way to keep track of dependencies so that we are only re-running things when necessary, as well as provide others with a clear path to reproduce our work.
targets
what is targets
Data analysis can be slow. A round of scientific computation can take several minutes, hours, or even days to complete. After it finishes, if you update your code or data, your hard-earned results may no longer be valid. Unchecked, this invalidation creates a chronic Sisyphean loop:
Launch the code.
Wait while it runs.
Discover an issue.
Restart from scratch.
https://books.ropensci.org/targets/
The solution to this problem is to develop pipelines, which track dependencies between steps, or “targets”, of a workflow.
When running the pipeline, it first checks to see if the upstream targets have changed since the previous run.
If the upstream targets are up to date, the pipeline will skip them and proceed to running the next step.
If everything is up to date, the pipeline will skip everything and inform you that nothing changed.
Most pipeline tools, such as Make, are either language agnostic or depend on using Python.
targets lets you build Make-style pipelines using R
Figure 1: pipeline examples. (a) api requests. (b) training models.
├── _targets <- stores the metadata and objects of your pipeline
├── renv <- information relating to your R packages and dependencies
├── data <- data sources used as an input into the pipeline
├── src <- functions used in project/targets pipeline
| ├── data <- functions relating to loading and cleaning data
├── _targets.R <- script that runs the targets pipeline
├── renv.lock <- lockfile detailing project requirements and dependencies
targets adds two main pieces to a project:
_targets.R is the script that implements our pipeline. This is what we will build and develop.
_targets is a folder containing metadata for the steps defined in _targets.R, as well as cached objects from the latest run of the pipeline.
Note: by default, _targets objects are stored locally. But you can configure targets to store objects in a cloud bucket (GCP/AWS)
When you use targets locally, it will store objects from the latest run of the pipeline. If you use a cloud bucket for storage, you can enable versioning so that all runs are stored.
What is _targets.R?
Running targets::use_targets() creates a template for the _targets.R script; these templates all follow a similar structure.
# Created by use_targets().
# Follow the comments below to fill in this target script.
# Then follow the manual to check and run the pipeline:
#   https://books.ropensci.org/targets/walkthrough.html#inspect-the-pipeline

# Load packages required to define the pipeline:
library(targets)
# library(tarchetypes) # Load other packages as needed.

# Set target options:
tar_option_set(
  packages = c("tibble") # Packages that your targets need for their tasks.
  # format = "qs", # Optionally set the default storage format. qs is fast.
)

# Run the R scripts in the R/ folder with your custom functions:
tar_source()
# tar_source("other_functions.R") # Source other scripts as needed.

# Replace the target list below with your own:
list(
  tar_target(
    name = data,
    command = tibble(x = rnorm(100), y = rnorm(100))
    # format = "qs" # Efficient storage for general data objects.
  ),
  tar_target(
    name = model,
    command = coefficients(lm(y ~ x, data = data))
  )
)
An Example - Star Wars
Going back to our Star Wars sentiment analysis, we can build a simple targets pipeline to recreate what we did earlier. The basic steps of our pipeline will look something like this:
Load Star Wars text data
Clean and prepare dialogue
Get sentences from dialogue
Calculate sentiment
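In _targets.R, these steps become a list of targets. A minimal sketch (the helper functions are illustrative names, but the target names match the tar_make() output below):

list(
  tar_target(starwars, load_starwars()),                 # load Star Wars text data
  tar_target(sentences, get_sentences(starwars)),        # split dialogue into sentences
  tar_target(sentiment, calculate_sentiment(sentences))  # score sentiment
)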
This is what the resulting pipeline will look like:
Or, we can visualize the pipeline using tar_glimpse().
tar_visnetwork() provides a more detailed breakdown of the pipeline, including the status of individual targets, as well as the functions and where they are used.
We then run the pipeline using tar_make(), which will detail the steps that are being carried out and whether they were re-run or skipped.
targets::tar_make()
#> v skipped target starwars
#> v skipped target sentences
#> v skipped target sentiment
#> v skipped pipeline [0.044 seconds]
We can then load the objects using tar_read() or tar_load().
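For example, either of these pulls the sentiment target out of the _targets store:

targets::tar_load(sentiment)              # loads `sentiment` into the global environment
sentiment = targets::tar_read(sentiment)  # or read it as a value and assign it yourself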
... | I should have known better than to trust the logic of a half-sized thermocapsulary dehousing assister... | 5 | 1 | 17 | 0.06063391
iv | a new hope | 6 | LUKE | Hurry up! | 6 | 1 | 2 | 0.00000000
iv | a new hope | 6 | LUKE | Come with me! | 6 | 2 | 3 | 0.00000000
This might seem like a lot of overhead for little gain; if re-running everything is relatively painless, is it worth the time to set up a pipeline?
I, and the author of the package, will argue that yes, yes it is.
Embracing Functions
targets expects users to adopt a function-oriented style of programming. User-defined R functions are essential to express the complexities of data generation, analysis, and reporting.
https://books.ropensci.org/targets/functions.html
Traditional data analysis projects consist of imperative scripts, often with numeric prefixes.
01-data.R
02-model.R
03-plot.R
To run the project, the user runs each of the scripts in order.
As we’ve previously discussed, this type of approach inherently creates problems with dependencies and trying to figure out which pieces need to be rerun.
But even more than that, this approach doesn’t do a great job explaining what exactly is happening with a project, and it can be a pain to test.
Every time you look at it, you need to read it carefully and relearn what it does. And to test it, you need to copy the entire block into the R console.
For example, rather than write a script that loads, cleans, and outputs the Star Wars data, I simply wrote two functions, which we can easily call and run as needed to get the data.
…instead of invoking a whole block of text, all you need to do is type a small reusable command. The function name speaks for itself, so you can recall what it does without having to mentally process all the details again.
Embracing functions makes it easier for us to track dependencies, explain our work, and build in small pieces that can be tested and put together to complete the larger project.
It can also really help when we have time-consuming steps.
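As a sketch, those two functions might look something like this (the file path and column names are illustrative):

# src/data/starwars.R (illustrative)
load_starwars = function(file = "data/starwars_text.csv") {
  readr::read_csv(file)
}

clean_dialogue = function(raw) {
  raw |>
    dplyr::filter(!is.na(dialogue)) |>
    dplyr::mutate(dialogue = stringr::str_squish(dialogue))
}

# getting the data is now one small, readable call
starwars = load_starwars() |> clean_dialogue()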
targets demo - College Football and Elo Ratings
View repo organization for https://github.com/ds-workshop/cfb_elo
Examine _targets.R
Show _targets metadata
Show _targets objects
Your Turn
Fork and clone https://github.com/ds-workshop/cfb_elo
Read the README and follow its instructions
Create a new branch
Make a change to the pipeline and run it
Commit and push your changes
15:00
The targets pipeline in cfb_elo looked like this:
We configured the pipeline to make API calls to get the full history of college football games, then ran a time-consuming function to calculate Elo ratings for all teams across all available games.
We were then able to develop a simple model to examine the value of home field advantage and predict the spread of games.
Let’s not get too distracted by this, but check out how the spread predictions from a simple Elo model compare to Vegas for week 1 of the 2024 season.
using targets
There are a couple of additional things we need to cover about targets before we move on to building more complex pipelines for putting models and projects into production.
Recall that running a targets pipeline creates a _targets folder within our project folder.
_targets/meta/
_targets/objects/
The meta folder contains metadata relating to the objects you created in your pipeline. This is what determines if your pipeline is up to date and tracks its lineage across different runs.
Note:
This meta folder and its contents are committed to GitHub: it is required to run your pipeline, and committing it also stores the history of how your pipeline has changed during development.
The objects folder contains the actual objects created by runs of your pipeline. These are the most up to date versions of the objects in your pipeline from your most recent run with tar_make() and can be loaded using tar_read() and tar_load().
Importantly, objects are not committed to GitHub. These objects are the various artifacts of your pipeline runs (data, models, etc.) and can be quite large. Git is not intended to handle diffs for objects of this nature, so committing them would be a bad idea.
You’ll notice that, by default, you can’t even commit _targets/objects to your repository; these objects are ignored via a special .gitignore that targets creates inside the _targets folder.
By default, the objects you create in your pipeline will be stored locally - that is, on the machine running the pipeline. This means, by default, that pipeline runs are isolated from each other. 1
If you want to fiddle with adding a new step to my pipeline, you will have to re-run the entire pipeline; the objects stored on my machine will not be available to you.
This also means that, locally, the stored objects are always from the last time a target was run via tar_make()
This means that targets does not, by default, have data version control; you are not storing multiple versions of your objects as your pipeline changes. You are always overwriting old output with new output.
However, we can configure targets to export the objects to a shared cloud location so that:
objects are no longer isolated to the machine of the run
multiple versions of objects are stored using cloud versioning to preserve the lineage of our pipeline
At the top of our _targets.R script, we have options we use to define the pipeline.
This includes setting the packages that should be loaded for the entire pipeline, the format in which targets are saved, and the location, or repository, used to store the objects.
By default, this is set to “local”.
# Set target options:
tar_option_set(
  packages = c("tidyverse"),
  format = "qs",
  repository = "local"
)
But we can set the repository to a cloud storage location (AWS, GCP), which will then export our objects and their metadata to a cloud bucket.
# Set target options:
tar_option_set(
  packages = c("tidyverse", "cfbfastR"),
  format = "qs",
  # for cloud storage
  resources = tar_resources(
    gcp = tar_resources_gcp(
      bucket = "cfb_models",
      prefix = "data"
    )
  ),
  repository = "gcp"
)
This is what I tend to do for my own projects, as it shifts all of my storage to the cloud and I can pick up and work on pipelines between different workstations without needing to re-run the pipeline every time.
It also stores the lineage of my work historically, so that I can easily revert to past versions if needed.
However, using the cloud introduces an added wrinkle of requiring authentication around our pipeline, which we will cover later.
Sadly, at the minute, targets is only set up to use AWS and GCP out of the box; Azure is in development but would currently require some custom configuration.
putting it all together
git + targets + renv for predictive modeling projects
Let’s revisit some of the original motivations for this workshop.
How do I share my code with you, so that you can run my code, make changes, and let me know what you’ve changed?
How can a group of people work on the same project without getting in each other’s way?
How do we ensure that we are running the same code and avoid conflicts from packages being out of date?
How can we run experiments and test out changes without breaking the current project?
How do we take a project into production?
targets and predictive models
Let’s talk about building predictive modeling pipelines in targets, the thing most of us are ultimately employed to do.
As with any other type of project, we want to write code that is transparent, reproducible, and allows for collaborative development and testing.
In principle, Git/GitHub and renv are the most important pieces for allowing us to do this; we are not required to use targets for training/deploying models.
But I have found its functional, Make-style approach to be well suited for managing the predictive modeling life cycle.
In the sections to come, we will be splitting/training/finalizing/deploying predictive models in a targets pipeline.
Most of the examples we’re going to work on will assume some level of familiarity with tidymodels. What is everyone’s familiarity with tidymodels?
Again, in principle, you do not have to use tidymodels in your pipelines, but it provides a standardized way to train models that naturally works well with functional programming.
Therefore:
a crash course in tidymodels
tidymodels refers to a suite of R packages that bring the design philosophy and grammar of the tidyverse to training models
if you’re like me and originally cut your teeth with the caret package, tidymodels is its successor from the same person (Max Kuhn, praise his name)
fun fact: tidymodels is basically just a GitHub organization:
model specification (glmnet, random forest) -> models from parsnip
tuning over parameters (mtry, penalty) -> tune and dials
model assessment (rmse, log loss) -> yardstick
key tidymodels concepts
recipes
models from parsnip
workflows
splits/resamples from rsample
metrics from yardstick and tune
models
a model is a specification (from parsnip) that defines the type of model to be trained (linear model, random forest), its mode (classification, regression), and its underlying engine (lm, stan_lm, ranger, xgboost, lightgbm)
parsnip provides a standardized interface for specifying models, which allows us to easily run different types of models without having to rewrite our code to accommodate differences
if you’ve ever been annoyed with having to create y and x matrices for glmnet or ranger, parsnip is something of a lifesaver
a linear model with lm
linear_reg() |>
  set_engine("lm") |>
  translate()
#> Linear Regression Model Specification (regression)
#>
#> Computational engine: lm
#>
#> Model fit template:
#> stats::lm(formula = missing_arg(), data = missing_arg(), weights = missing_arg())
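Swapping to a different model is just a different specification and engine; for example, a random forest via ranger (assuming the ranger package is installed):

rand_forest(mode = "regression", trees = 500) |>
  set_engine("ranger") |>
  translate()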
recipes capture steps for preprocessing data prior to training a model.
a recipe is a type of preprocessor that can dynamically apply transformations (imputation, normalization, dummies) to the data we are using to model.
we create them with recipe(), typically specifying a formula and a dataset. we then add steps to the recipe of the form step_*() (step_mutate(), step_impute_*(), step_nzv(), …)
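the exact recipe behind the output below isn’t shown here, but one recipe consistent with its three steps and derived columns (hp_1:hp_3, gear_x_wt) would be:

library(recipes)

rec = recipe(mpg ~ ., data = mtcars) |>
  step_mutate(hp_1 = hp, hp_2 = hp^2, hp_3 = hp^3) |>  # manual polynomial terms
  step_rm(hp) |>                                       # drop the original column
  step_interact(~ gear:wt)                             # interaction term: gear_x_wt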
#> # A tibble: 11 × 4
#> variable type role source
#> <chr> <list> <chr> <chr>
#> 1 cyl <chr [2]> predictor original
#> 2 disp <chr [2]> predictor original
#> 3 hp <chr [2]> predictor original
#> 4 drat <chr [2]> predictor original
#> 5 wt <chr [2]> predictor original
#> 6 qsec <chr [2]> predictor original
#> 7 vs <chr [2]> predictor original
#> 8 am <chr [2]> predictor original
#> 9 gear <chr [2]> predictor original
#> 10 carb <chr [2]> predictor original
#> 11 mpg <chr [2]> outcome original
rec$steps
#> [[1]]
#>
#> [[2]]
#>
#> [[3]]
recipes
using recipes involves two main steps:
preparing recipes on a dataset with prep()
applying recipes to a dataset with bake()
preparing recipes on a dataset with prep()
preparing a recipe is kind of like training a model; it captures/estimates information on one dataset and will apply those same transformations to a new dataset
This is really important for things like normalization/imputation, as we want to apply the same transformations to unseen data that were used on the training set
prepped = rec |> prep()
prepped$steps
#> [[1]]
#>
#> [[2]]
#>
#> [[3]]
prepped$term_info
#> # A tibble: 14 × 4
#> variable type role source
#> <chr> <list> <chr> <chr>
#> 1 cyl <chr [2]> predictor original
#> 2 disp <chr [2]> predictor original
#> 3 drat <chr [2]> predictor original
#> 4 wt <chr [2]> predictor original
#> 5 qsec <chr [2]> predictor original
#> 6 vs <chr [2]> predictor original
#> 7 am <chr [2]> predictor original
#> 8 gear <chr [2]> predictor original
#> 9 carb <chr [2]> predictor original
#> 10 mpg <chr [2]> outcome original
#> 11 hp_1 <chr [2]> predictor derived
#> 12 hp_2 <chr [2]> predictor derived
#> 13 hp_3 <chr [2]> predictor derived
#> 14 gear_x_wt <chr [2]> predictor derived
applying recipes to a dataset with bake()
baking a recipe produces the dataframe/matrix that will be used in modeling
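for example, using the prepped recipe from above:

baked = prepped |> bake(new_data = NULL)   # returns the processed training data
# prepped |> bake(new_data = other_data)   # applies the same (already estimated) transformations to new data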
recipes are especially helpful for handling categorical features, as we can easily create steps for handling novel levels or pooling infrequent levels.
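a sketch of the kind of steps meant here, assuming a dataset with nominal predictors (outcome and train_data are illustrative):

recipe(outcome ~ ., data = train_data) |>
  step_novel(all_nominal_predictors()) |>                   # handle levels not seen at prep() time
  step_other(all_nominal_predictors(), threshold = 0.05) |> # pool infrequent levels into "other"
  step_dummy(all_nominal_predictors())                      # then create dummy variables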
a workflow bundles a preprocessor (a formula or recipe) with a model specification into a single object that can be fit and then used to predict. when we are using targets to train predictive models, a workflow will typically be the final step; it is the object we are trying to produce that can be used to predict new data
workflows make model deployment relatively straightforward, as we just need to export/share/deploy our finalized workflow
we’ll go over this in a bit with the vetiver package
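a minimal sketch of building, fitting, and predicting with a workflow, reusing the rec recipe from above:

library(workflows)
library(parsnip)

wf = workflow() |>
  add_recipe(rec) |>
  add_model(linear_reg() |> set_engine("lm"))

fitted_wf = fit(wf, data = mtcars)
predict(fitted_wf, new_data = head(mtcars))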
key tidymodels concepts
recipes ✓
models from parsnip ✓
workflows ✓
splits/resamples from rsample
metrics from yardstick and tune
rsample
splitting our data (train/valid, bootstraps, cross validation) is a standard part of training/assessing predictive models
the rsample package provides a standardized way to do this that works directly with workflows
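a quick sketch of the common splits:

library(rsample)

split = initial_split(mtcars, prop = 0.8)  # train/test split
train = training(split)
test  = testing(split)

folds = vfold_cv(train, v = 10)            # 10-fold cross validation
boots = bootstraps(train, times = 500)     # bootstrap resamples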
nested rsplit objects make it easy to evaluate models across resamples in a tidy way, such as estimating models/parameters on each resample
fit_model = function(split) {
  linear_reg(mode = "regression") |>
    fit(mpg ~ wt + hp + disp, data = training(split)) |>
    broom::tidy()
}

# fit model to 500 bootstraps and plot distribution of coefficients
estimates = mtcars |>
  rsample::bootstraps(times = 500) |>
  mutate(results = map(splits, fit_model)) |>
  select(id, results)

estimates
for predictive modeling workflows, rsample is typically used in conjunction with yardstick and tune to estimate model performance for a model or tune a model across parameters
we specify the type of metrics we want to use in a metric_set()
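for example, using the wf and folds objects from the earlier sketches (ccc, rmse, and rsq are the metrics reported below):

library(tune)
library(yardstick)

my_metrics = metric_set(ccc, rmse, rsq)

fit_resamples(wf, resamples = folds, metrics = my_metrics) |>
  collect_metrics()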
#> # A tibble: 3 × 6
#> .metric .estimator mean n std_err .config
#> <chr> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 ccc standard 0.535 10 0.106 Preprocessor1_Model1
#> 2 rmse standard 6.30 10 0.816 Preprocessor1_Model1
#> 3 rsq standard 0.489 10 0.081 Preprocessor1_Model1
key tidymodels concepts
recipes ✓
models from parsnip ✓
workflows ✓
splits/resamples from rsample ✓
metrics from yardstick and tune ✓
I realize this is a lot to take in, but once we are familiar with these concepts it becomes much, much easier to standardize our predictive modeling so that we can easily train/test/deploy different kinds of models within our pipeline
The model/baseline branch added new steps to the pipeline; we added a workflow that we fit to the training set and assessed on the validation set.
Notice that we directly wrote these metrics to a csv in our project (targets-runs/valid_metrics.csv), which we will then commit to our repository.
This will allow us to track model performance on our validation set using Git as we add new models/tinker with the original model.
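One way to wire this up is a file target (names here are illustrative); format = "file" makes targets track the csv itself, so it is rewritten whenever the upstream metrics change:

tar_target(
  valid_metrics_file,
  {
    readr::write_csv(valid_metrics, "targets-runs/valid_metrics.csv")
    "targets-runs/valid_metrics.csv"  # file targets must return the file path
  },
  format = "file"
)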
Your Turn
Add dep_time as a feature to the baseline model
Run the pipeline; how did the results change?
Commit your results to your branch
Create a pull request for your-branch/split into the upstream split
Then, checkout the model/glmnet branch
Create a new branch your-name/glmnet
Run the pipeline targets::tar_make()
What model was used?
How did the model perform?
10:00
The model/glmnet branch added a more robust recipe to make use of more features, particularly categorical features. We then added a new workflow to the pipeline, which we trained and assessed as before.
At this point, we have a decent first candidate for a model based on the validation set. What do we need to do to finalize this model?
We’ll want to refit the workflow on the training + validation data, then assess its performance on the test set.
Then, we’ll refit to training + validation + test and prepare the model for deployment with vetiver.
This pipeline produces a final workflow that we then turn into a vetiver_model for the purpose of using the model in a production setting.
vetiver provides a standardized way for bundling workflows with the information needed to version, store, and deploy them.
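Creating one is a one-liner once we have a fitted, finalized workflow (final_wf and the model name are illustrative):

library(vetiver)

v = vetiver_model(final_wf, model_name = "my-model")
v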
This branch is stable in the sense that we could run the code from this branch to produce a model object that is ready to work in production.
We’ll talk about pinning a vetiver model to a model board in just a little bit, just bear with me.
I’m slightly regretting the order in which I set this up but we must press onward.
How would we then train and evaluate a different model?
Your Turn
Checkout the stable/model branch
Create a new branch your-name/challenger
Examine the pipeline with targets::tar_glimpse()
Run the pipeline
Add a competing workflow to the pipeline: use the existing recipe with a new model specification, or create a new recipe with the existing model specification. The world is your oyster.
Train and evaluate your new workflow on the validation set. Does it outperform the existing model?
Commit your results
Create a pull request for your-name/challenger into the upstream dev branch
15:00
I added a workflow using boosted trees with lightgbm and found it produced better results across the board than glmnet.
Notice that I have this pipeline configured to use manual model selection; to update the final model you simply assign your tuned model of choice to best_model, which is then refit and finalized.
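In the pipeline, that selection amounts to something like the following (target names are illustrative):

# point best_model at whichever tuned workflow should be refit and finalized
tar_target(best_model, glmnet_fit)
# tar_target(best_model, lightgbm_fit)  # promote the challenger instead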
If we navigate to the main branch on the Github repository, we can see the following:
Notice how it’s been kind of a pain to keep track of our model metrics? We have to check out the branch, run the pipeline, and then read in the valid_metrics.csv file.
We can make our lives easier for viewing things like this using GitHub Actions, which will automatically run based on a push or pull request.