Or, How I Learned to Stop Worrying and Love the Git
The aim of this workshop is to gain familiarity and experience with tools that will enable collaborative open-source data science development.
Learning a little bit about version control, package/environment management, and pipelines will go a long way towards solving the headaches of modern data science projects.
This workshop leans heavily on R/RStudio and the Posit ecosystem, but the principles we will cover apply equally to Python and R.
I am far from the first person to cover these topics. I highly recommend bookmarking each of the following resources, as we will be covering pieces of these throughout the workshop:
Cookiecutter Data Science - a flexible, standardized project structure for organizing data science repositories
You are familiar with R.
You have RStudio installed.
You have a GitHub account.
You have installed Quarto.
Our goal is to create consistent, repeatable patterns for data science project development, iteration, and delivery. To this end, we are going to cover three main topics:
A crash course in version control with Git and its application to data science projects
Project and environment dependencies with renv
Building pipelines with targets
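To make the pipeline topic a little more concrete, here is a rough sketch of what a `_targets.R`-style script can look like. The helper functions (`load_games`, `summarise_ratings`, `pick_candidates`) and the toy data are hypothetical placeholders, not code from any real project; only `tar_target()` comes from the targets package. The pipeline list is built only if targets happens to be installed, so the helpers can be run on their own.

```r
# Sketch of a _targets.R-style script. All function names and data below
# are hypothetical placeholders used for illustration only.

# Step 1: load raw data (a toy data frame standing in for a real source).
load_games <- function() {
  data.frame(
    game   = c("Game A", "Game B", "Game C", "Game D"),
    rating = c(7.5, 6.1, 8.2, 5.4),
    stringsAsFactors = FALSE
  )
}

# Step 2: compute a simple summary of the ratings.
summarise_ratings <- function(games) {
  mean(games$rating)
}

# Step 3: keep the names of games rated above the average.
pick_candidates <- function(games, avg_rating) {
  games$game[games$rating > avg_rating]
}

# Each tar_target() names one step. targets works out the dependency
# graph from the symbols each command mentions, so `candidates` depends
# on both `games` and `avg_rating`.
if (requireNamespace("targets", quietly = TRUE)) {
  pipeline <- list(
    targets::tar_target(games, load_games()),
    targets::tar_target(avg_rating, summarise_ratings(games)),
    targets::tar_target(candidates, pick_candidates(games, avg_rating))
  )
}
```

In a real project the file would typically begin with `library(targets)` and end with the bare `list(...)` of targets; calling `targets::tar_make()` then runs the pipeline and skips any step whose code and upstream inputs have not changed.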
CAVEATS
There is no simple, you-won’t-believe-how-easy-it-is, experts-hate-him! trick that will make us experts in Git, CI/CD, DevOps, pipelines, etc.
As with most things in life, the only way to get better is through practice and repeated trial and error.
But, if maximum likelihood and gradient boosting have taught us anything about learning:
I have a ridiculous, ever-evolving personal project: predicting board games that I might want to add to my board game collection.
Do these projects represent the height of data science maturity and sophistication?
Not at all. But they will help illustrate the typical setting of data science projects.
Suppose I wanted you to take a look at something in my work and see if you could write a more efficient function.
Or maybe you could train a better model.
Or maybe you could create a better report.
Or maybe you could do the whole thing differently and save me a ton of time and/or make better predictions.
How do I share my code with you, so that you can run my code, make changes, and let me know what you’ve changed?
How can a group of people work on the same project without getting in each other’s way?
How can we run experiments and test out changes without breaking the current project?
How do we ensure that we are running the same code and avoid conflicts from packages being out of date?
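For the last question, one common answer is the renv workflow, sketched below under some assumptions. `renv::init()`, `renv::snapshot()`, and `renv::restore()` are real renv functions; the wrapper `renv_workflow` and its `step` argument are hypothetical names invented for this illustration, used only so that nothing touches your library until you ask for it.

```r
# Hypothetical wrapper around the three core renv calls.
# It is not part of renv; it only groups the steps for illustration.
renv_workflow <- function(step = c("init", "snapshot", "restore")) {
  step <- match.arg(step)

  if (!requireNamespace("renv", quietly = TRUE)) {
    stop("The renv package is not installed; run install.packages('renv').")
  }

  switch(step,
    # Project owner, once: create a project-local library and renv.lock.
    init = renv::init(),
    # Project owner, after adding or updating packages: record the
    # package versions currently in use in renv.lock.
    snapshot = renv::snapshot(),
    # Collaborator, after cloning: install the versions listed in renv.lock.
    restore = renv::restore()
  )
}
```

A collaborator who has cloned the repository would call `renv_workflow("restore")` (or simply `renv::restore()`) from the project directory to install the package versions recorded in the lockfile.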
Can we predict which board games Phil wants to buy?
Find out next week on Dragon Ball Z