Gyges’s BoardGameGeek Collection

This notebook contains a set of analyses of the selected user’s BoardGameGeek collection. The bulk of the analysis focuses on building a user-specific predictive model of the games that Gyges is likely to own. This lets us ask questions like: based on the games the user currently owns, which games are a good fit for their collection? Which upcoming games are they likely to purchase?

This analysis is based on data from BoardGameGeek that was last updated on 2021-12-20.

1 Collection Overview

We can look at a basic description of the number of games that the user owns, has rated, has previously owned, etc.
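
As a minimal illustration in pandas (assuming a hypothetical CSV export, gyges_collection.csv, with one row per game and 0/1 status flags named own, prevowned, and rated), tallying these statuses might look like:

    import pandas as pd

    # Hypothetical export of the user's BGG collection: one row per
    # game, with 0/1 status flags (the column names are assumptions).
    collection = pd.read_csv("gyges_collection.csv")

    # Count how many games fall under each status.
    print(collection[["own", "prevowned", "rated"]].sum())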

What years has the user owned/rated games from? While we can’t see when a user added or removed a game from their collection, we can look at their collection by the years in which their games were published.
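
Counting owned games by publication year is short under the same assumed collection frame (yearpublished is a hypothetical column name):

    import pandas as pd

    collection = pd.read_csv("gyges_collection.csv")  # assumed export

    # Number of owned games from each publication year.
    owned_by_year = (
        collection.loc[collection["own"] == 1, "yearpublished"]
        .value_counts()
        .sort_index()
    )
    print(owned_by_year)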

1.1 What types of games does Gyges own?

We can look at the most frequent types of categories, mechanics, designers, and artists that appear in a user’s collection.
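
Under the same assumptions, one way to sketch this is to explode a comma-separated label column and count. The categories column name is hypothetical, and the identical pattern applies to mechanics, designers, and artists:

    import pandas as pd

    collection = pd.read_csv("gyges_collection.csv")  # assumed export

    # 'categories' is assumed to hold comma-separated labels per game,
    # e.g. "Economic, Negotiation".
    top_categories = (
        collection.loc[collection["own"] == 1, "categories"]
        .str.split(", ")
        .explode()
        .value_counts()
        .head(10)
    )
    print(top_categories)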

2 Modeling Gyges’s Collection

We’ll examine a predictive model trained on a user’s collection for games published through 2019. How many games has the user owned/rated/played in the training set (games published through 2019)?
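
A minimal sketch of the split, assuming a hypothetical games frame with one row per game, a yearpublished column, feature columns, and the own outcome:

    import pandas as pd

    games = pd.read_csv("games_with_features.csv")  # assumed export

    # Train on games published through 2019; hold out later years.
    train = games[games["yearpublished"] <= 2019]
    test = games[games["yearpublished"] >= 2020]

    # Assumed feature columns (e.g. one-hot mechanics/categories).
    feature_names = [c for c in games.columns if c.startswith("feat_")]
    X_train, y_train = train[feature_names], train["own"]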

There are two main (binary) outcomes we will be modeling for the user.

The first, own, refers to whether the user currently lists a game as owned in their collection. The second, played, refers to whether the user currently owns, has rated, or has previously owned a game. This means the latter will generally cover a larger list of games, but it may still be a useful category to examine for people who play lots of games without necessarily owning them.

We will train predictive models to learn the probability that the user will own or play individual games based on their features.
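
The report does not pin down the exact estimator, but a penalized logistic regression along these lines could be fit with scikit-learn (the L1 penalty and C value are illustrative choices here, and X_train/y_train come from the split sketched above):

    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Penalized logistic regression for the 'own' outcome; the same
    # recipe would be repeated for 'played'.
    own_model = make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty="l1", C=0.1, solver="liblinear"),
    )
    own_model.fit(X_train, y_train)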

2.1 Coefficients for Gyges

We can examine coefficients from the trained models, which are penalized logistic regressions fit to our two main outcomes. Positive values indicate that a feature increases the user’s probability of owning/playing a game, while negative values indicate that a feature decreases that probability.
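
With the sketch above, pulling the coefficients back out and ranking them is straightforward (feature_names is the assumed predictor list from earlier):

    import pandas as pd

    # Coefficients from the fitted pipeline, sorted from most
    # negative to most positive.
    coefs = pd.Series(
        own_model.named_steps["logisticregression"].coef_[0],
        index=feature_names,
    ).sort_values()

    print(coefs.head(10))  # features that most decrease the probability
    print(coefs.tail(10))  # features that most increase it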

2.2 Visualizing Predictors for Gyges’s Collection

Why did the model identify these features? We can make density plots of the important features for predicting whether the user owned a game. Blue indicates the density for games owned by the user, while grey indicates the density for games not owned by the user.
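
A plot of this kind could be sketched with seaborn; avgweight (game complexity) stands in for whichever feature matters, and the column names are assumptions:

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Density of one feature, split by ownership: blue for owned
    # games, grey for the rest.
    sns.kdeplot(
        data=train, x="avgweight", hue="own",
        palette={1: "tab:blue", 0: "grey"}, common_norm=False,
    )
    plt.show()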

Binary predictors can be difficult to see with this visualization, so we can also directly examine the percentage of games in a user’s collection with a predictor vs the percentage of all games with that predictor.
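
For a single (hypothetical) 0/1 feature column, that comparison reduces to two means over the assumed training frame:

    # Share of the user's owned games with the predictor vs. the
    # share of all games with it ('feat_deck_building' is made up).
    pct_owned = train.loc[train["own"] == 1, "feat_deck_building"].mean()
    pct_all = train["feat_deck_building"].mean()
    print(f"owned: {pct_owned:.1%}, all games: {pct_all:.1%}")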

3 Assessing the Model’s Performance on the Training Set

Before predicting games in upcoming years, we can examine how well the model did and which games it liked in the training set. Here we used resampling techniques (cross-validation) to ensure that the model had not seen a game before making predictions for it.
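
One standard way to get such out-of-fold predictions, assuming the model and training data sketched earlier, is scikit-learn’s cross_val_predict:

    from sklearn.model_selection import StratifiedKFold, cross_val_predict

    # Each game is scored by a model that never saw it during fitting.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
    oof_probs = cross_val_predict(
        own_model, X_train, y_train, cv=cv, method="predict_proba"
    )[:, 1]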

3.1 Separation Plot

An easy way to examine the performance of a classification model is to view a separation plot. We plot the predicted probabilities from the model for every game (from resampling), ordered from lowest to highest. We then overlay a blue line for every game that the user does own. A good classifier is one that separates the blue (games owned by the user) from the white (games not owned by the user), with most of the blue occurring at the highest probabilities (the right side of the chart).
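
A bare-bones version of the plot, using the out-of-fold probabilities from the previous sketch:

    import matplotlib.pyplot as plt
    import numpy as np

    # Order games from lowest to highest predicted probability.
    order = np.argsort(oof_probs)
    owned = y_train.to_numpy()[order]

    fig, ax = plt.subplots(figsize=(10, 1.5))
    # Blue stripe wherever the game at that position is owned.
    for i in np.flatnonzero(owned):
        ax.axvline(i, color="tab:blue", linewidth=0.5)
    # Overlay the sorted probability curve.
    ax.plot(np.sort(oof_probs), color="black")
    ax.set_xticks([]); ax.set_yticks([])
    plt.show()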

3.2 Top Games for Gyges from Training Set

We can present this information in table form, listing the 100 games with the highest probability of ownership and adding a blue line when the user does own the game.

We can also more formally assess how well the model did in resampling by looking at the area under the receiver operating characteristic curve (AUC). A perfect model would receive a score of 1, while a model that cannot predict the outcome will default to a score of 0.5. What counts as a good score depends on the setting, but generally anything in the .8 to .9 range is very good, while the .7 to .8 range is perfectly acceptable.
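
Given the out-of-fold probabilities above, the score is one call:

    from sklearn.metrics import roc_auc_score

    # Area under the ROC curve from the resampled predictions.
    print(roc_auc_score(y_train, oof_probs))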

Another way to think about the model performance is to view its lift, or its ability to detect the positive outcomes over that of a null model. High lift indicates the model can much more quickly find all of the positive outcomes (in this case, games owned or played by the user), while a model with no lift is no better than random guessing.
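
One simple sketch of lift, here evaluated at the top decile of predicted probabilities:

    import numpy as np

    # Lift at the top 10%: how much more often owned games appear
    # among the highest-scored games than in the data overall.
    k = int(0.10 * len(oof_probs))
    top_k = np.argsort(oof_probs)[::-1][:k]
    lift = y_train.to_numpy()[top_k].mean() / y_train.mean()
    print(f"lift at top 10%: {lift:.1f}x")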

3.3 Most and Least Likely Games

What games does the model think Gyges is most likely to own that are not in their collection?

What games does the model think Gyges is least likely to own that are in their collection?
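
Both lists fall out of sorting the scored training games (the name column is an assumption; prob holds the out-of-fold probabilities):

    # Attach out-of-fold probabilities to the training games.
    scored = train.assign(prob=oof_probs)

    # Highest-probability games the user does NOT own...
    print(scored[scored["own"] == 0].nlargest(10, "prob")[["name", "prob"]])
    # ...and lowest-probability games the user DOES own.
    print(scored[scored["own"] == 1].nsmallest(10, "prob")[["name", "prob"]])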

3.4 Top Games by Year

Top 25 games most likely to be owned by the user in each year, highlighting in blue the games that the user has owned/played.
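
Using the scored frame from the previous sketch, the per-year top-25 list is a grouped head:

    # Top 25 games by predicted probability within each year.
    top_by_year = (
        scored.sort_values("prob", ascending=False)
        .groupby("yearpublished")
        .head(25)
    )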

3.5 Interactive Predictions from Resampling

Interactive table for predictions from resampling.

4 Validating the Model on 2020

How well did a model trained on a user’s collection through 2019 perform in predicting games for the user from 2020?
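
Under the earlier assumptions, the validation step amounts to scoring the held-out 2020 games with the model fit through 2019:

    from sklearn.metrics import roc_auc_score

    # Held-out games from 2020, scored by the through-2019 model.
    games_2020 = test[test["yearpublished"] == 2020]
    probs_2020 = own_model.predict_proba(games_2020[feature_names])[:, 1]
    print(roc_auc_score(games_2020["own"], probs_2020))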

Table of top 25 games from 2020, highlighting games that the user owns.

5 Predicting Upcoming Games (2020 and On) for Gyges

We examine the top predicted games for the test set (games published in 2020 and on).

5.1 Interactive Table for Upcoming Games 2020 and On