What They Didn’t Teach You
About Data Science

Or, How I Learned to Stop Worrying and Love the Git

Background

Goals

The aim of this workshop is to gain familiarity and experience with tools that will enable collaborative open-source data science development.

Learning a little bit about version control, package/environment management, and pipelines will go a long way towards solving the headaches of modern data science projects.

This workshop leans heavily on R/RStudio and the Posit ecosystem, but the principles we will cover apply equally well in Python.

Topics

Our goal is to create consistent, repeatable patterns for data science project development, iteration, and delivery. To this end, we are going to cover three main topics:

References

I am far from the first person to cover these topics. I highly recommend bookmarking each of the following resources, as we will be covering pieces of these throughout the workshop:

I’m assuming

You are familiar with R.

You have RStudio installed.

You have a GitHub account.

You have installed Quarto.

We’ll cover

A crash course in version control with Git and its application to data science projects.

Managing project and environment dependencies with renv.

Building pipelines with targets.
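As a taste of the first topic, here is a minimal Git session of the kind we'll practice. The directory, file, and branch names are invented for illustration:

```shell
# Initialize a toy repository (directory and file names are made up)
mkdir demo-project && cd demo-project
git init -b main

# Stage and commit a first script
echo 'x <- rnorm(100)' > analysis.R
git add analysis.R
git -c user.name="Demo" -c user.email="demo@example.com" commit -m "Add analysis script"

# Create a branch to experiment on, leaving main untouched
git checkout -b experiment
git branch --show-current
```

That last command reports which branch you are on, so you always know where your changes will land.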

CAVEATS

  • I am NOT a software engineer.
  • Like, at all. Everything we will cover in these workshops comes from my experiences, both good and bad, as a consultant in data science. I have seen some shit.
  • We will only cover a small portion of these topics: our goal is not to become experts in Git, but to learn the little bit of Git that will help us be better data scientists.

  • The DevOps folks will surely judge us all for not being experts in their craft. This is fine. We will accept this and move on. It’s fine.
  • It’s fine.
  • It’s fine.

A reminder

There is no simple, you-won’t-believe-how-easy-it-is, experts-hate-him! trick that will make us experts in Git, CI/CD, DevOps, pipelines, etc.

As with most things in life, the only way to get better is through practice and repeated trial and error.

But, if maximum likelihood and gradient boosting have taught us anything about learning:

  • start somewhere
  • make mistakes
  • learn from those mistakes
  • do better the next time

Let’s start at the end: a (ridiculous) data science project

I have a ridiculous, ever-evolving personal project: predicting board games that I might want to add to my board game collection.

This is the pipeline that scrapes data from BoardGameGeek and populates a cloud data warehouse (GCP/BigQuery).

This is the pipeline that trains models to predict how the BoardGameGeek community will rate games.

This is the repo for training a user-specific model and creating a user report to predict games in their collection.

Motivation

Do these projects represent the height of data science maturity and sophistication?

Not at all. But they will help illustrate the typical setting of data science projects.

Suppose I wanted you to take a look at something in my work and see if you could write a more efficient function.

Or maybe you could train a better model.

Or maybe you could create a better report.

Or maybe you could do the whole thing differently and save me a ton of time and/or make better predictions.

Challenge

How do I share my code with you, so that you can run my code, make changes, and let me know what you’ve changed?

How can a group of people work on the same project without getting in each other’s way?

How can we run experiments and test out changes without breaking the current project?

How do we ensure that we are running the same code and avoid conflicts from packages being out of date?
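Git answers the first three of these questions directly. A minimal sketch of the branch-and-merge workflow, using an invented toy repository and invented file and branch names:

```shell
# A fresh toy repository standing in for a shared project
mkdir shared-project && cd shared-project
git init -b main
echo 'model <- lm(y ~ x, data = games)' > model.R
git add model.R
git -c user.name="A" -c user.email="a@example.com" commit -m "Baseline model"

# A collaborator experiments on their own branch; main stays intact
git checkout -b better-model
echo 'model <- glm(y ~ x, data = games, family = poisson)' > model.R
git -c user.name="B" -c user.email="b@example.com" commit -am "Try a GLM"

# Once the change is reviewed, merge the experiment back into main
git checkout main
git merge better-model
```

The fourth question, keeping everyone's packages in sync, is what renv's lockfile is for; we'll get there.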

Challenge

Can we predict which board games Phil wants to buy?

Find out next week on Dragon Ball Z

To get to this point, we’ll need to cover:

  1. Git (and Github)
  2. renv
  3. targets