We’ve been working on a project for ages. We have a core, legacy script that has gone through about 30 different iterations with edits from dozens of people as the project has changed over the years. We need to audit how that script has changed over time.
We made a change to one of the core functions of our project and everything seemed to be great. But two weeks later we discovered a change to a helper function broke something in our monthly report. We’re now trying to figure out what we changed and how to patch it.
Someone recently mentioned that lightgbm + linear trees offers a really nice improvement in both training time and performance over XGBoost. They want to test it in our project, and then evaluate whether this should become the new model.
Version Control
For each one of these scenarios, we intuitively want something resembling version control. We want to tinker with making a change, but we don’t want that change to overwrite or break our existing code.
People implement their own approaches to version control all the time.
We make a copy of the original, then create a new copy that we begin to edit and work on without breaking the original source.
We’ve all probably come up with some crazy, half-baked naming syntax to keep track of the various versions of our projects/files.
So why Git?
If you’re like me, at some point you thought to yourself, maybe I could be that guy who uses Git and talks about commits and pull requests and really knows what he’s doing with dev vs prod environments rather than just putting _dev and _test at the end of important files.
I’m going to improve.
I’m going to be more than I’ve ever been.
I’m going to use Git.
The Reality
And about thirty minutes later it was so unclear how any of this would help you that you just punted and decided to keep working as you always have, warts and all, because Git clearly comes from The Bad Place.
But!
The Reality
So many of the frustrations with writing code, making changes, and storing the history of your work go away once we implement version control.
There’s a reason Git is used everywhere. It is the life jacket in a sea of stashing files everywhere with poor naming conventions and no lineage or history.
Why would a statistician use a version control system, such as Git? And what is the point of hosting your work online, e.g., on GitHub? Could the gains possibly justify the inevitable pain? I say yes, with the zeal of the converted.
Doing your work becomes tightly integrated with organizing, recording, and disseminating it. It’s not a separate, burdensome task you are tempted to neglect.
Collaboration is much more structured, with powerful tools for asynchronous work and managing versions.
By using common mechanics across work modes (research, teaching, analysis), you achieve basic competence quickly and avoid the demoralizing forget-relearn cycle.
Despite this zeal, Jenny Bryan does make an important note:
Now the bad news: Git was built neither for the exact usage described here, nor for broad usability. You will undoubtedly notice this, so it’s best to know in advance.
Git was not designed for data science projects. This will at times create slightly wonky implementations and workarounds that feel frustrating.
Things get more complicated once we start trying to version control data/models, which is a whole topic in and of itself.
With and Without Git: An Example
Was Luke Really That Whiny?
Suppose we have a project we are working on. We are interested in running text/sentiment analysis on the scripts of the original Star Wars trilogy.
We want to know things such as, who are the most positive/negative characters in A New Hope? A lot of people have claimed Luke was really whiny; is that the case?
So, naturally, we go and get the script from A New Hope in text form.
We then calculate sentiment across all characters to get a sense of how negative Luke really was.
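A minimal sketch of what sentiment.R might look like, assuming the script has already been read into a data frame called script with character, line_number, and dialogue columns (those names are illustrative, not the project's actual ones):

# sentiment.R -- tokenize the dialogue and score it with the AFINN lexicon
library(dplyr)
library(tidytext)

script |>
  unnest_tokens(word, dialogue) |>
  inner_join(get_sentiments("afinn"), by = "word") |>
  group_by(character) |>
  summarise(sentiment = sum(value), .groups = "drop") |>
  arrange(sentiment)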
sentiment analysis via tokenization and afinn
Apparently, pretty negative. Let’s look at some dialogue.
character | line_number | value | dialogue
a new hope
LUKE | 966 | -6 | You worry about those fighters! I'll worry about the tower!
LUKE | 967 | -4 | Artoo... that, that stabilizer's broken loose again! See if you can't lock it down!
LUKE | 117 | -3 | This R2 unit has a bad motivator. Look!
LUKE | 220 | -3 | Wait, there's something dead ahead on the scanner. It looks like our droid... hit the accelerator.
LUKE | 228 | -3 | Sand People! Or worse! Come on, let's have a look. Come on.
LUKE | 238 | -3 | I think my uncle knew him. He said he was dead.
LUKE | 263 | -3 | How did my father die?
LUKE | 274 | -3 | I can't get involved! I've got work to do! It's not that I like the Empire. I hate it! But there's nothing I can do about it right now. It's such a long way from here.
LUKE | 320 | -3 | Go on, go on. I can't understand how we got by those troopers. I thought we were dead.
LUKE | 354 | -3 | You bet I could. I'm not such a bad pilot myself! We don't have to sit here and listen...
LUKE | 496 | -3 | Then he must have gotten lost, been part of a convoy or something...
LUKE | 566 | -3 | But he didn't know she was here. Look, will you just find a way back into the detention block?
LUKE | 570 | -3 | But they're going to kill her!
LUKE | 597 | -3 | Prisoner transfer from Block one-one-three-eight.
LUKE | 729 | -3 | What good will it do us if he gets himself killed? Come on!
LUKE | 886 | -3 | I'm hit, but not bad.
LUKE | 986 | -3 | I've lost Artoo!
LUKE | 38 | -2 | But there was a lot of firing earlier...
LUKE | 80 | -2 | And I'm stuck here...
LUKE | 84 | -2 | I know, but he's got enough vaporators going to make the place pay off. He needs me for just one more season. I can't leave him now.
This is interesting enough, so we save the script that produces this analysis as sentiment.R.
Then we think, maybe we should see what happens if we calculate sentiment in a different way. Were Han and Ben really that negative? Even simple sentiment methods can vary quite a bit depending on which lexicon you use, so we should try a couple.
But we might want to stick with our first approach, so we decide to add a whole new section to our code, or just add a new script entirely, sentiment_bing.R.
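Conceptually the change is small. A sketch, again assuming the illustrative script data frame from before (bing labels words positive/negative rather than giving them a numeric score):

# sentiment_bing.R -- count positive vs negative words per character
library(dplyr)
library(tidyr)
library(tidytext)

script |>
  unnest_tokens(word, dialogue) |>
  inner_join(get_sentiments("bing"), by = "word") |>
  count(character, sentiment) |>
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |>
  mutate(net_sentiment = positive - negative) |>
  arrange(net_sentiment)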
sentiment analysis via tokenization and bing
We get a pretty similar result, so we’re feeling okay about ourselves and less okay about Luke.
But then someone says, we shouldn’t rely on such crude methods for calculating sentiment. We should use a more sophisticated method, via the sentimentr package.
So we want to edit our original sentiment.R script and switch over to using this new package. This one forces us to add some new packages, and rewrite some of our visualization scripts to get the same type of visualization, so we create a new script, sentiment_algorithm.R.
We’re also slightly worried that we’ve forgotten what we originally started with, so we’re gonna make a sentiment_original.R file. Just so we have it.
But anyway, we’ll edit our code for the third time and run it again.
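A sketch of the sentimentr version, under the same assumptions about the script data frame:

# sentiment_algorithm.R -- sentence-level sentiment (negation, valence shifters) via sentimentr
library(sentimentr)

char_sentiment <- with(
  script,
  sentiment_by(get_sentences(dialogue), by = list(character))
)
char_sentiment   # includes an ave_sentiment column per character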
sentiment analysis via sentimentr
This gives us very different results, so we really need to dive into the data a bit here to figure out what's going on.
We decide to compare how our original method (left) calculates sentiment for the entire script of A New Hope compared to sentimentr (right). That means we need to go add a visualization to each of our original scripts, so we go edit sentiment_original.R and sentiment_algorithm.R.
These are very different, so now we go down a rabbit hole of digging into what we’re getting out of the sentimentr package. We take a look at Luke’s dialogue line by line.
We realize that we shouldn’t be calculating sentiment at the line-level and then aggregating, because short positive statements potentially end up getting as much weight as longer complaints.
With this method we really should look at the estimated sentiment across a character’s entire dialogue to get a sense of their tone.
So we implement a change, shifting away from aggregation by summing to using the average sentiment over all lines. We then calculate the estimated sentiment across all characters.
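The difference is just the aggregation step. A sketch, assuming an illustrative line_sentiment data frame with one sentiment score per line of dialogue:

# summing rewards characters who simply talk a lot; averaging reflects overall tone
library(dplyr)

line_sentiment |>
  group_by(character) |>
  summarise(
    total_sentiment = sum(sentiment),
    mean_sentiment  = mean(sentiment),
    .groups = "drop"
  )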
estimated sentiment by character via sentimentr
Is Luke whiny? Well, it depends. This is the type of hard hitting analysis that I deliver for my clients.
What we are left with after a fairly simple analysis is a messy, entangled set of files with absolutely no sense of history or organization.
This is a mess for us to figure out; imagine if someone else is supposed to come along and work with this code. Where do they start?
How would this look if we were using Git?
The end result is a bit cleaner. We basically just have the one script to worry about. If we’re really curious about what we need to do, we check the README (yes, you are expected to read these).
The end result of our work is the current state of the project, which we store in a repository.
If we want to see the work that we did up to this point, all we have to do is look at the history of that script and the various changes we made to it in the form of commits.
Visualized as a timeline from left to right, the history of our work might look something like this. Each commit is a snapshot of our files at a specific point in time.
The differences between these commits allow us to easily view how our script changed as we worked on it.
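On the command line, that history is only a couple of commands away (GitHub shows the same thing in the browser):

git log --oneline -- sentiment.R       # every commit that touched the script
git diff HEAD~1 HEAD -- sentiment.R    # what changed in the most recent commit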
Git is tremendously helpful even just for one individual.
It removes the mental baggage of worrying about editing your code and remembering what you did. It allows you to make changes, track the history of your project, and document everything you did along the way.
Where it starts to get even more helpful is for enabling collaboration within a team.
With and Without Git: Predictive Modeling
Git for Data Science
Suppose we were working on a predictive modeling project instead of looking at Star Wars scripts.
Think about the pieces involved in a typical predictive modeling project.
Loading data
Cleaning data
Splitting data
Feature engineering
Model specification
Tuning parameters
Model evaluation
Model selection
Model deployment
We typically don’t build all of this at once and have everything finalized from the get go.
We build incrementally, typically testing and experimenting with different pieces along the way.
We might start out the project by training a simple baseline model.
Then maybe we decide to add in some feature engineering and tune a ridge regression over 25 bootstraps, which requires normalization and imputation.
Then maybe we decide to try out a more flexible model like lightgbm with minimal feature engineering.
And so on, and so on.
Git for Data Science
In each of these cases, we have code that we have executed and results associated with that code.
As before, we could try to store a bunch of scripts and track all of the results in different folders.
Or, we could use Git to track our code and the results of our experiments.
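One hedged way to organize this with plain Git (the branch and file names here are purely illustrative): give each experiment its own branch, and commit the modeling code together with the metrics it produced.

git switch -c experiment/ridge-bootstraps    # new branch for the experiment
# ...edit the modeling code, run it, save the metrics...
git add model.R metrics.csv
git commit -m "Ridge regression, 25 bootstraps, normalization + imputation"
git switch main                              # the baseline stays untouched on main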
This can start to get complicated.
We’ll start with the basics.
What We Need to Know About Git
So, what the heck is Git?
As with many great things in life, Git began with a bit of creative destruction and fiery controversy.
The Linux kernel is an open source software project of fairly large scope. During the early years of the Linux kernel maintenance (1991–2002), changes to the software were passed around as patches and archived files. In 2002, the Linux kernel project began using a proprietary [Distributed Version Control System] called BitKeeper.
In 2005, the relationship between the community that developed the Linux kernel and the commercial company that developed BitKeeper broke down, and the tool’s free-of-charge status was revoked. This prompted the Linux development community (and in particular Linus Torvalds, the creator of Linux) to develop their own tool based on some of the lessons they learned while using BitKeeper.
Git was originally developed for the purpose of helping developers work in parallel on their software projects.
Git manages and tracks a set of files - referred to as a repository - in a highly structured way.
Git
Though originally intended for software development, Git is now used by data scientists in a variety of different ways to track the odds and ends that go into data science projects.
GitHub
GitHub is a hosting service that stores your Git-based projects in a remote location.
Storing your code on GitHub allows you to share/sync your work with others (as well as have a safe back up for when you inevitably mess up your local repository).
We’ll focus on GitHub (because it’s what I use), but there are other options out there as well.
GitHub has additional features for managing projects and automating aspects of a project; we’ll touch on those later.
Git Basics
Repository: a directory in which file history is preserved
Clone: downloading an existing repository
Local: “on your personal machine”
Remote: “on the official server”
Branch: a version of the directory
Commit: a change made to a version of the directory
Push: uploads your work to the ‘official’ remote server
Fetch/Pull: checks for available updates on a remote
Switch/Checkout: switches your local copy to a version of the directory
Pull Request: a request to merge one branch into another
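Each of these terms maps onto a command or two; a quick reference (the repository, branch, and file names are placeholders):

git clone https://github.com/OWNER/REPO.git    # copy a remote repository to your machine
git branch my-branch                           # create a new branch
git switch my-branch                           # move onto that branch (older syntax: git checkout)
git add analysis.R                             # stage a change
git commit -m "Describe the change"            # record the staged change as a commit
git push                                       # upload your commits to the remote
git fetch                                      # check the remote for updates
git pull                                       # fetch and merge those updates
# pull requests are opened on GitHub rather than from the command line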
Oh My Git
Git Basics - Demo
Repo organization for https://github.com/ds-workshop/starwars
Viewing history
Making a change (Let’s examine Vader instead of Luke)
Creating a branch
Adding a change
Pushing the change
Viewing the change
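Roughly, the command-line version of those demo steps (the branch name is illustrative):

git log --oneline                     # view the history
git switch -c vader-sentiment         # create a branch for the change
# ...edit sentiment.R to examine Vader instead of Luke...
git add sentiment.R
git commit -m "Examine Vader instead of Luke"
git push -u origin vader-sentiment    # push the branch, then view the change on GitHub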
Creating a New Repo
Git is a highly structured way of managing a set of files, called a repository.
We’ll often work by cloning an already established repository in order to introduce changes, meaning that we are inheriting a set of files.
But it’s really worth knowing how to create a repository from scratch.
Creating a New Repo
We want to create a new project called git-started.
We want to create a new GitHub repository.
We want to create a new RStudio project.
We want to connect RStudio to GitHub so our project is connected with our repository.
We can do this in a couple of different ways, starting in either GitHub or RStudio.
GitHub -> RStudio
Go to GitHub.com and sign in with your account
Click on Repositories.
Click on New repository.
Name the repository
Initialize the repository with a README
Click Create repository.
This creates a new repository on GitHub, but we still need to connect it to RStudio. To do this, we clone the repo from RStudio.
Open a New Project in RStudio
Create from Version Control
Set Repository URL to the link of the GitHub repo
Set the name and location of the project
Create project
These steps inside of RStudio can also be taken care of by using the usethis package, but I tend to just go through the process each time.
Under the hood, this last bit is essentially just:
git clone "https://github.com/YOU/YOUR_REPO.git"
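And if you would rather stay in R, the usethis equivalent of the clone step is roughly (substitute your own GitHub username and repo):

usethis::create_from_github("YOU/git-started")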
Your Turn
Create a new Git repository on GitHub, git-started
Add a description to the project (whatever you would like)
Initialize this repository with a README
Create a new RStudio project
Connect this RStudio project to your GitHub repo
Add a new script, or make a change to the README; what happens?
RStudio -> GitHub
Creating a new project with GitHub first then cloning with RStudio is what I would tend to recommend.
It’s the same process as cloning repositories from other people, plus it takes care of some pieces behind the scenes.
You can, however, also start by creating an RStudio project first and then initializing a GitHub repository second. This process is useful to know if you want to set up a GitHub repo for an existing project.
RStudio -> GitHub
Create a new RStudio project
Check ‘Create a git repository’
usethis::use_git() to initialize a local repository, add and commit your initial files
usethis::use_github() to create a repository on GitHub and connect your R project
Commits
We now have a repository, but we want Git to track our files.
I created my git-started repo from RStudio -> GitHub, which means I didn’t initialize a README. I want to add a README to the repo.
I create the file in my working directory, which causes it to appear in my Git tab in RStudio.
Checking the file adds it to staging.
We can then commit the file.
Three States of (Tracked) Files
Files in your repository can generally take on one of three states:
Modified means that you made changes to a file but have not committed those changes to your repository yet.
Staged means that you have marked a modified file in its current version to go into your next commit.
Committed means that the snapshot is safely stored in your local repository
If you haven’t yet added a file but it’s in your working directory, it will appear as “untracked”. It will only be added to your repository if you add it, by first staging it and then committing it.
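In command-line terms, a file cycles through these states something like this (using the README as the example):

git status                       # a brand-new README.md shows up as untracked
git add README.md                # now staged
git commit -m "Add README"       # now committed
# ...edit README.md again...
git status                       # now modified
git add README.md                # staged again
git commit -m "Update README"    # and committed again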
Let’s go back to the first image. We don’t need to dwell on this too much, but it gives us a sense of what’s happening under the hood with Git.
Three Sections of Git
This last image highlights the three main sections of a Git project - what you’re going through painstaking effort to set up.
The Working Directory (Tree) is a single checkout of one version of a project
The Staging Area refers to a file (index) that lives in your .git directory that tracks information about what will go into your next commit
The Git Directory is where Git stores the metadata and object database for your project
If you open up the folder where you created git-started, you’ll notice that there’s a hidden .git folder.
Yeah, don’t mess with that; it’s basically what you’re configuring when you initialize a repo, add/commit files, and sync with a remote.
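If you're curious (look, don't touch), you can peek at it from a terminal:

ls -a git-started        # the hidden .git folder sits alongside your files
ls git-started/.git      # HEAD, config, index, objects/, refs/, ...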
Cloning a Repo
Cloning someone else’s repo from GitHub operates in much the same way as before.
Copy the GitHub Repository URL
Open a New Project on RStudio
Create from Version Control
Paste Repository URL from GitHub repo
Set name and location of project
Create project
Your Turn
Create a new R project
Clone the repo at https://github.com/ds-workshop/starwars
Run README.qmd (notice: what packages do you need to install?)
Branches and Pull Requests
A typical commit history for one branch might look something like this. We have a series of commits that tracks the history of the project from left to right.
We could just keep working out of one branch, tracking the history of the project via our commits.
This has some appeal because of its simplicity; if we ever want to see our previous work, we just flip back through our history of commits.
How often should we commit our code?
Using a Git commit is like using anchors and other protection when climbing. If you’re crossing a dangerous rock face you want to make sure you’ve used protection to catch you if you fall. Commits play a similar role: if you make a mistake, you can’t fall past the previous commit.
Coding without commits is like free-climbing: you can travel much faster in the short-term, but in the long-term the chances of catastrophic failure are high! Like rock climbing protection, you want to be judicious in your use of commits.
Committing too frequently will slow your progress; use more commits when you’re in uncertain or dangerous territory. Commits are also helpful to others, because they show your journey, not just the destination.
Suppose you’re at a point in your project where you’re not certain about the next direction you want to take. You could keep making a series of small commits and then revert all the way back to where you were originally.
But a better approach is to use branches. Branching amounts to creating a detour from the main stream of commits; it allows you to work without any fear of disrupting the work in main.
Once you’ve completed your work, you can then choose to merge it back into main; this can be done via a pull request on GitHub.
Or, you can just stop working on the branch and go back to working on main, letting the branch become stale.
Branching allows teams of developers to do their work on separate branches without overwriting or getting in the way of each other’s work.
It is common to block direct commits to main and only allow new commits to main via pull requests.
This pattern requires creating new branches, such as dev, where developers commit their work. dev is then merged back into main pending an approval process.
With larger development teams, it’s more common to see branching strategies involving a number of common branches: main, dev, feature, hotfix.
main: releases, most controlled branch
dev: where completed work is staged for release
feature: in-progress work; mostly a sandbox for individual developers
Guidelines
Delete branches when merging
Disallow direct commits to main
Minimize direct commits to dev
Require reviewer approval on pull requests into main
Limit PR approval to project leads
Use tags to mark milestones/releases
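On that last point, a tag is just a named pointer to a commit; for example:

git tag -a v1.0 -m "First stable release"
git push origin v1.0     # tags are not pushed by default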
How does data science differ from traditional software development?
Your Turn
Open the starwars project you have previously cloned
Create a new branch called feature or dev
Make a change to README.qmd, or add a new script
Commit those changes to your branch
Create a pull request for your branch into main
Cloning vs Forking
Okay so we can see what you all did in your own separate repos, but there’s a problem. Let’s say I wanted to merge a change that you made into the original Star Wars repo - right now, there’s no way to do that.
When you create a Git repo locally, you eventually need to connect that repo to a remote location.
You typically create a copy of that repo at your remote location (GitHub), which is owned by you and you have full access to push/pull/merge to your heart’s content.
The remote is typically known as origin.
Cloning someone else’s repo creates a local copy of their repo, where the origin is owned by someone else.
In this case, you can pull and execute code, but you have no way of pushing changes to it; the owner of that repo has configured origin to be read-only for others.
So what the heck? If Git and GitHub are supposed to be collaborative, how do we configure things so that we can push and pull and work within the same repository?
One option is to just provide permissions to origin by adding others as collaborators. The source repo is owned by someone else, or an organization, but you have permissions to make changes.
The other option is to use forking, where you fork the original repo to create a copy for yourself that becomes your remote origin, where you have full permissions to read and write.
The original source repo is typically referred to as upstream - you can pull changes from it but you can’t push directly to it.
If you want your changes to make it upstream, you push your code to your origin, then create a pull request for the repo owner to consider merging in your changes.
Fork-and-Clone - Demo
Fork-and-clone a repo (https://github.com/jennybc/bingo)
Configure the source repo as the upstream remote
Configure local main to track upstream/main
Make a change (add annoying data science questions)
Commit a change (push to origin)
Submit a pull request
Forking and cloning directly from GitHub entails a couple of additional steps to ensure that you are tracking the original source repo.
The first step is to add an additional remote that points to the source repo. The original (source) repo in this case is typically referred to as upstream. This can be done by running:
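git remote add upstream https://github.com/jennybc/bingo.git    # the source repo from the demo above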
After adding the upstream remote, you will then want to look for any changes that have happened to the source repo:
git fetch upstream
You’ll probably want to set your local main branch to track the upstream/main branch, so you can easily see if any changes have occurred in the source repo. This is optional, but in a fork-and-clone situation it’s the standard.
git branch --set-upstream-to upstream/main
Alternatively, you can always use the helpful functions from usethis, which will take care of the upstream tracking in one go. You simply need to open a new RStudio session (note: not in a project) and run the following:
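usethis::create_from_github("jennybc/bingo", fork = TRUE)    # fork, clone, and set origin/upstream in one step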
Suppose, though, that we merge a change and later realize it broke something. How do we resolve this situation? The point of using Git + GitHub is that we’re never really in trouble; we can always just go back to a previous commit.
So how do we do that?
What has been our process for making any sort of change to a repository?
Checkout a branch (a commit)
Create a new branch
Make a change
Commit the change
Push the change
Merge (pull request) the change to the original branch
So, how would we “undo” a commit?
With pull requests, we always have the option to revert directly on GitHub.
Under the hood, this amounts to:
Checking out the repository at the previous commit
Creating a new branch revert-name-of-last-pr
Merging revert-name-of-last-pr into main
This adds the pull request we made, then reverted, to the history of the project.
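Locally, the same kind of undo is git revert, which adds a new commit that reverses an earlier one rather than rewriting history:

git log --oneline        # find the commit you want to undo
git revert <commit-sha>  # create a new commit that reverses it
git push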
We could use rebase to try to “clean up” this history, but in practice I’d prefer to keep the lineage.