Methodology

Author

Phil Henrickson

Published

May 28, 2025

1 About

The goal of this project is to train models that predict how new and upcoming releases will be rated once information about them becomes available on BoardGameGeek. How do you train a model to predict new releases?

I use historical data from BoardGameGeek (BGG) to train a series of predictive models of community ratings:

  • I first classify games based on their probability of achieving a minimum number of ratings on BGG.
  • I then estimate each game’s complexity (average weight) in order to predict its number of user ratings and average rating.
  • I then use these estimates to predict the expected Geek Rating.

1.1 Methodology

I develop and train models using a training-validation approach based on the year in which games were published: I train on games published prior to 2022 and evaluate performance on games published from 2022 to 2023.

graph LR
    %% Data Sources and Storage
    BGG[BoardGameGeek API] -->|Data Collection| GCS[(Google Cloud Storage)]
    GCS -->|Data Loading| PrepData[Preprocessed Games Data]
    
    %% Data Splitting
    PrepData -->|Time-based Split| DataSets[Training/Validation/Testing Sets]
    
    %% Main Model Components
    DataSets -->|Training| Models[Model Training]
    
    %% Model Types
    Models --> HurdleModel[Hurdle Model<br>Classification]
    Models --> WeightModel[Complexity Model<br>Regression]
    Models --> RatingModels[Rating Models<br>Regression]
    
    %% Evaluation and Deployment
    HurdleModel --> Evaluation[Model Evaluation]
    WeightModel --> Evaluation
    RatingModels --> Evaluation
    
    Evaluation --> Tracking[Performance Tracking]
    Evaluation --> Deployment[Model Deployment]
    
    %% Styling
    classDef storage fill:#f9f,stroke:#333,stroke-width:2px;
    classDef model fill:#bbf,stroke:#333,stroke-width:1px;
    classDef data fill:#dfd,stroke:#333,stroke-width:1px;
    
    class GCS storage;
    class HurdleModel,WeightModel,RatingModels,Models model;
    class PrepData,DataSets data;
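The split itself is built by the project pipeline, but conceptually it is just a filter on publication year. A minimal sketch of the idea, assuming a games table with a numeric yearpublished column (the actual pipeline constructs a split object, which is loaded in the next section):

# minimal sketch of the time-based split; `games` and `yearpublished` are
# assumed names, and the real pipeline builds a split object instead
library(dplyr)

train_sketch = games |> filter(yearpublished < 2022)              # fit models here
valid_sketch = games |> filter(yearpublished %in% c(2022, 2023))  # evaluate here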

1.2 Data

Show the code
# data
# raw pull from bgg
tar_load(games_raw)
# split for train/valid/test
tar_load(split)

# get training set
train_data =
    split |>
    training()

# get validation set
valid_data =
    split |>
    validation()

# full data
full_data =
    bind_rows(train_data, valid_data)

2 Data

As of my most recent data refresh, there are 129,616 individual games on BGG. For each of these games, I have information about the game itself (playing time, mechanisms, components, publishers, artists, designers, etc.), as well as information voted on by the BGG community (average rating, average weight).

Show the code
full_data |>
    sample_n(1000) |>
    visdat::vis_dat()

2.1 Outcomes

The data is at the game level, where I observe the BGG community’s aggregated ratings for individual games. This means I do not observe the underlying individual ratings, only summaries of their distribution (the average, the standard deviation, and the number of ratings).

I examine four different community outcomes for games: average weight rating (complexity), number of user ratings, average user rating, and Geek rating. Only a subset of games have received enough votes from the BGG community to receive a Geek rating, which is a Bayesian average based on the number of ratings as well as the average rating.
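BGG does not publish the exact formula, but the Geek rating behaves like a dampened average: the observed ratings are pooled with a batch of dummy ratings at a prior value, so games with few ratings are pulled toward that prior. A rough sketch of the idea, where the prior value (5.5) and the number of dummy votes are illustrative assumptions rather than BGG’s actual constants:

# rough sketch of a Bayesian (dampened) average; prior_rating and prior_votes
# are illustrative guesses, not BGG's actual (unpublished) constants
bayes_average = function(average, usersrated, prior_rating = 5.5, prior_votes = 1000) {
    (average * usersrated + prior_rating * prior_votes) / (usersrated + prior_votes)
}

bayes_average(average = 8.0, usersrated = 50)     # few ratings: pulled toward 5.5
bayes_average(average = 8.0, usersrated = 50000)  # many ratings: close to 8.0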

Show the code
full_data |>
    filter_geek() |>
    log_ratings() |>
    plot_outcomes_distributions() +
    theme_light() +
    theme(legend.position = 'top')

Each of these BGG outcomes (average weight, average rating, user ratings) is related to the others in some way, which is important to keep in mind as we think about modeling them.

The average weight tends to be highly correlated with the average rating, while not being correlated with the number of user ratings. The Geek rating is a function of the average rating and the number of user ratings, which means it is also correlated with the average weight.
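One way to see these relationships directly is a correlation matrix of the outcomes. A small sketch, reusing the project’s unnest_outcomes() helper and assuming the outcome columns are named averageweight, average, usersrated, and bayesaverage; user ratings are logged because of their heavy right skew:

# pairwise correlations between BGG outcomes (column names assumed);
# usersrated is logged to tame its skew
full_data |>
    bggUtils:::unnest_outcomes() |>
    transmute(
        averageweight,
        average,
        log_usersrated = log1p(usersrated),
        bayesaverage
    ) |>
    cor(use = "pairwise.complete.obs") |>
    round(2)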

Show the code
full_data |>
    plot_outcomes_relationships() +
    theme_light() +
    theme(legend.position = 'top')

2.1.1 Number of Ratings

I train and evaluate the models primarily on games that have achieved at least 25 ratings, as this is the required number of user ratings for a game to receive a Geek rating.

The majority of games on BoardGameGeek do not actually receive this number of ratings, especially in recent years as the number of games published has dramatically increased. My main aim is to predict games that are set to be published and available to purchase, and many of the over 100k games on BGG have not been published or distributed.

Ideally, I would have a variable that measures this directly, but I instead rely on achieving a minimum number of ratings as a heuristic.
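The hurdle outcome itself is just this threshold applied to the number of ratings. A minimal sketch of what the add_hurdle() helper used below is assumed to do:

# label games by whether they cleared the 25-rating hurdle (sketch of the
# assumed behavior of add_hurdle())
add_hurdle_sketch = function(data) {
    data |>
        mutate(hurdle = factor(if_else(usersrated >= 25, 'yes', 'no'), levels = c('no', 'yes')))
}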

Show the code
plot_games_by_split = function(data) {
    plot_data =
        data |>
        bggUtils:::unnest_outcomes() |>
        inner_join(
            data |>
                bggUtils:::unnest_info()
        ) |>
        add_hurdle() |>
        mutate(
            yearpublished = case_when(
                yearpublished < 1950 ~ 1949,
                TRUE ~ yearpublished
            )
        ) |>
        mutate(
            hurdle = case_when(
                hurdle == 'yes' ~ '25+ ratings',
                hurdle == 'no' ~ '<25 ratings'
            )
        ) |>
        group_by(yearpublished, hurdle) |>
        count()

    plot_data |>
        ggplot(aes(x = yearpublished, y = n, fill = hurdle)) +
        geom_col()
}

plot_games_by_split(games) +
    scale_fill_manual(values = c('coral', 'navy'))

In predicting new games, I first use a hurdle model to predict whether games are expected to receive enough ratings to be assigned a Geek rating (25 ratings). This model is trained on the full universe of games, whereas the others are trained on games with at least 25 ratings.

The model for predicting the average weight is trained on games that have received 25 ratings and have also received at least 5 votes on their complexity.
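In code, these training populations amount to simple filters on the outcomes. A sketch, where numweights (the count of complexity votes) is an assumed column name:

# training populations for each model (sketch); numweights is an assumed name
# for the number of complexity votes
outcomes = full_data |> bggUtils:::unnest_outcomes()

hurdle_train  = outcomes                                               # all games
ratings_train = outcomes |> filter(usersrated >= 25)                   # average, usersrated models
weight_train  = outcomes |> filter(usersrated >= 25, numweights >= 5)  # complexity model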

Show the code
bind_rows(
    as_tibble(hurdle_fit$metadata$user$data) |>
        group_by(model = 'hurdle') |>
        count(),
    as_tibble(average_fit$metadata$user$data) |>
        group_by(model = 'average') |>
        count(),
    as_tibble(averageweight_fit$metadata$user$data) |>
        group_by(model = 'averageweight') |>
        count(),
    as_tibble(usersrated_fit$metadata$user$data) |>
        group_by(model = 'usersrated') |>
        count()
) |>
    ungroup() |>
    rename(games = n) |>
    arrange(desc(games)) |>
    gt::gt()
model          games
hurdle        103528
average        27812
usersrated     27812
averageweight  17315

2.2 Assessment

How did the models perform in predicting games? I evaluated the models on games published in 2022 and 2023.

Show the code
valid_predictions |>
    group_by(yearpublished) |>
    count() |>
    ungroup() |>
    gt::gt()
yearpublished     n
2022           5053
2023           5205

2.2.1 Hurdle

I first used the hurdle model to classify whether games would receive at least 25 ratings.

I set the probability threshold at 0.16, the point that maximized the F2 score and minimized false negatives. For the purpose of the hurdle model, missing a game that went on to receive a Geek rating is much worse than including one that did not; the model’s main purpose is simply to filter out games with little chance of clearing the ratings hurdle.
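One way to choose such a threshold is to sweep a grid of cutoffs and keep the one with the highest F2. A sketch using yardstick’s f_meas() with beta = 2, assuming valid_predictions holds the hurdle truth and the .pred_hurdle_yes probability (as in the code below):

# sweep probability cutoffs and score each with F2 (recall-weighted F measure)
f2_by_threshold =
    purrr::map_dfr(
        seq(0.05, 0.5, by = 0.01),
        \(cutoff) {
            valid_predictions |>
                mutate(
                    .pred_class = factor(
                        if_else(.pred_hurdle_yes >= cutoff, 'yes', 'no'),
                        levels = levels(hurdle)
                    )
                ) |>
                yardstick::f_meas(truth = hurdle, estimate = .pred_class, beta = 2, event_level = 'second') |>
                mutate(threshold = cutoff)
        }
    )

f2_by_threshold |> slice_max(.estimate, n = 1)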

  • Threshold
  • Metrics
  • Confusion Matrix
  • False Negatives
  • False Positives
Show the code
valid_predictions |>
    ggplot(aes(x = .pred_hurdle_yes, fill = hurdle)) +
    geom_density(alpha = 0.5) +
    theme(legend.title = element_text()) +
    xlab("Pr(User Ratings >= 25)") +
    scale_fill_manual(values = c("coral", "navy")) +
    guides(fill = guide_legend(title = 'User Ratings >=25')) +
    geom_vline(xintercept = hurdle_threshold, linetype = 'dashed')

Show the code
prob_metrics = metric_set(yardstick::roc_auc, yardstick::pr_auc)

prob_hurdle_metrics =
    valid_predictions |>
    group_by(outcome = 'hurdle') |>
    prob_metrics(truth = hurdle, .pred_hurdle_yes, event_level = 'second')

valid_hurdle_metrics |>
    bind_rows(prob_hurdle_metrics) |>
    gt::gt() |>
    gt::tab_options(quarto.disable_processing = T) |>
    gt::fmt_number(columns = c(".estimate"), decimals = 3) |>
    gtExtras::gt_theme_espn()
outcome  .metric       .estimator  .estimate
hurdle   bal_accuracy  binary          0.763
hurdle   kap           binary          0.424
hurdle   mcc           binary          0.476
hurdle   f1_meas       binary          0.636
hurdle   f2_meas       binary          0.768
hurdle   precision     binary          0.494
hurdle   recall        binary          0.891
hurdle   j_index       binary          0.526
hurdle   roc_auc       binary          0.861
hurdle   pr_auc        binary          0.737
Show the code
valid_predictions |>
    conf_mat(hurdle, .pred_hurdle_class) |>
    autoplot(type = 'heatmap')

Which games that the model classified as “no” received 25+ ratings?

Show the code
valid_predictions |>
    filter(.pred_hurdle_class == 'no') |>
    filter(usersrated >= 25) |>
    select(yearpublished, game_id, name, usersrated, bayesaverage) |>
    arrange(desc(usersrated)) |>
    mutate(bayesaverage = round(bayesaverage, 3)) |>
    mutate(across(all_of(c("yearpublished", "game_id")), ~ as.factor(.x))) |>
    DT::datatable()

Which games that the model classified as “yes” did not receive 25+ ratings?

Show the code
valid_predictions |>
    filter(.pred_hurdle_class == 'yes') |>
    filter(usersrated < 25) |>
    select(yearpublished, game_id, name, usersrated, .pred_bayesaverage) |>
    arrange(desc(usersrated)) |>
    mutate(.pred_bayesaverage = round(.pred_bayesaverage, 3)) |>
    mutate(across(all_of(c("yearpublished", "game_id")), ~ as.factor(.x))) |>
    DT::datatable()

2.2.2 Ratings

I evaluate predictions for the average weight rating, the average rating, the number of user ratings, and the Geek rating.

  • Games with 25+ Ratings
  • Games with <25 Ratings
Show the code
plot_hurdle_yes

Show the code
plot_hurdle_no
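The metrics table below comes from the project’s tracking helpers, but the same quantities can be computed directly with a yardstick metric set. A sketch for the average rating, where the average and .pred_average column names are assumptions (mirroring .pred_bayesaverage used elsewhere):

# regression metrics for one outcome (sketch); `average` and `.pred_average`
# are assumed column names in valid_predictions
reg_metrics = metric_set(
    yardstick::rmse, yardstick::mae, yardstick::mape, yardstick::rsq, yardstick::ccc
)

valid_predictions |>
    filter(usersrated >= 25) |>
    reg_metrics(truth = average, estimate = .pred_average)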

Show the code
targets_tracking_details(metrics = valid_metrics, details = details) |>
    select(
        model,
        minratings,
        outcome,
        any_of(c("rmse", "mae", "mape", "rsq", "ccc"))
    ) |>
    filter(minratings == 25) |>
    select(minratings, everything()) |>
    gt::gt() |>
    gt::tab_options(quarto.disable_processing = T) |>
    gtExtras::gt_theme_espn()
minratings  model          outcome            rmse      mae     mape    rsq    ccc
25          glmnet         average           0.675    0.498    7.374  0.294  0.487
25          lightgbm       averageweight     0.437    0.336   18.019  0.706  0.827
25          glmnet+glmnet  bayesaverage      0.285    0.159    2.647  0.430  0.649
25          glmnet         usersrated     1941.387  446.031  154.763  0.122  0.335

What were the model’s top predictions in the validation set?

Show the code
valid_predictions |>
    filter(.pred_hurdle_class == 'yes') |>
    select(-starts_with(".pred_hurdle")) |>
    slice_max(.pred_bayesaverage, n = 150, with_ties = F) |>
    predictions_dt(
        games = games,
        lazy_load = TRUE,
        pageLength = 10
    ) |>
    add_colors()

2.3 Features

Which features were influential for predicting each BGG outcome?

Show the code
average_plot =
    average_fit |>
    extract_vetiver_features() |>
    plot_model_features() +
    labs(title = 'Average Rating')

averageweight_plot =
    averageweight_fit |>
    extract_vetiver_features() |>
    plot_model_features() +
    labs(title = 'Average Weight')

usersrated_plot =
    usersrated_fit |>
    extract_vetiver_features() |>
    plot_model_features() +
    labs(title = 'Users Rated')
  • Average Weight
  • Average
  • Users Rated