Skip to content

Commit

Permalink
xg update
Browse files Browse the repository at this point in the history
  • Loading branch information
danmorse314 committed Sep 6, 2022
1 parent f94dfb5 commit 0fba962
Show file tree
Hide file tree
Showing 15 changed files with 382 additions and 11 deletions.
6 changes: 4 additions & 2 deletions DESCRIPTION
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
Package: hockeyR
Title: Collect and Clean Hockey Stats
Version: 1.0.0
Version: 1.1.0
Authors@R:
person(given = "Daniel",
family = "Morse",
Expand Down Expand Up @@ -33,7 +33,9 @@ Imports:
tidyr,
tidyselect,
utils,
zoo
zoo,
stats,
xgboost
Suggests:
ggimage,
ggplot2,
Expand Down
2 changes: 2 additions & 0 deletions NAMESPACE
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
# Generated by roxygen2: do not edit by hand

export("%>%")
export(calculate_xg)
export(get_current_rosters)
export(get_draft_class)
export(get_game_ids)
Expand All @@ -19,4 +20,5 @@ export(scrape_day)
export(scrape_game)
export(scrape_season)
importFrom(magrittr,"%>%")
importFrom(stats,predict)
importFrom(utils,type.convert)
11 changes: 10 additions & 1 deletion NEWS.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,14 @@
# hockeyR 1.0.0.1000
# hockeyR 1.1.0

## New addition:
* Play-by-play data loaded through `load_pbp` includes new column for expected goals
* Details on & code to create the hockeyR expected goals model can be found [here](https://github.com/danmorse314/hockeyR-models)
* The `scrape_game` function has been adjusted to automatically add expected goals to the output

## New function:
* `calculate_xg` adds expected goals column to pbp data (used inside `scrape_game`, not necessary to use this to get expected goal values)

## Fixes:
* Changed the `player_id` column in `get_draft_class` to `prospect_id` - proper NHL `player_id` column only returns with `player_details` set to `TRUE`

# hockeyR 1.0.0
Expand Down
75 changes: 75 additions & 0 deletions R/calculate_xg.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,75 @@
#' Calculate hockeyR expected goals (xG)
#'
#' @description Uses the hockeyR expected goals model to calculate xG for any pbp data frame generated by hockeyR
#'
#' @param pbp A play-by-play data frame, previously returned by hockeyR::scrape_game
#'
#' @return The original supplied play-by-play data with a column for expected goals appended
#' @export
#'
#' @examples
#' \dontrun{
#' pbp <- load_pbp(2022) |> dplyr::select(-xg)
#' pbp_preds <- calculate_xg(pbp)
#' }
calculate_xg <- function(pbp){

# get model features
model_data <- prepare_xg_data(pbp)

# make 5v5 predictions
preds_5v5 <- stats::predict(
xg_model_5v5,
xgboost::xgb.DMatrix(
data = model_data |>
dplyr::filter(strength_state == "5v5") |>
dplyr::select(dplyr::all_of(xg_model_5v5$feature_names)) |>
data.matrix(),
label = model_data |>
dplyr::filter(strength_state == "5v5") |>
dplyr::select(goal) |>
data.matrix()
)
) |>
dplyr::as_tibble() |>
dplyr::rename(xg = value) |>
dplyr::bind_cols(
dplyr::select(
dplyr::filter(model_data, strength_state == "5v5"),
event_id)
)

# make ST predictions
preds_st <- stats::predict(
xg_model_st,
xgboost::xgb.DMatrix(
data = model_data |>
dplyr::filter(strength_state != "5v5") |>
dplyr::select(dplyr::all_of(xg_model_st$feature_names)) |>
data.matrix(),
label = model_data |>
dplyr::filter(strength_state != "5v5") |>
dplyr::select(goal) |>
data.matrix()
)
) |>
dplyr::as_tibble() |>
dplyr::rename(xg = value) |>
dplyr::bind_cols(
dplyr::select(
dplyr::filter(model_data, strength_state != "5v5"),
event_id)
)

# combine
preds <- dplyr::bind_rows(preds_5v5, preds_st) |>
# attach xg column to original pbp data
dplyr::right_join(pbp, by = "event_id") |>
# fix penalty shots
dplyr::mutate(xg = ifelse(
secondary_type != "Penalty Shot" | is.na(secondary_type), xg, xg_model_ps
)) |>
dplyr::arrange(event_id)

return(preds)
}
6 changes: 5 additions & 1 deletion R/globals.R
Original file line number Diff line number Diff line change
Expand Up @@ -34,6 +34,10 @@ utils::globalVariables(
"amateurLeague","amateurTeam","full_name","full_team_name","jersey_number",
"person","picks","prospect","prospectCategory","ranks","roundNumber",
"rounds","shortName","year","duration_seconds","home_abbreviation",
"away_abbreviation","home_id","away_id","prospect_id","nhl_player_id"
"away_abbreviation","home_id","away_id","prospect_id","nhl_player_id",
"away_skaters","empty_net","event_team_skaters","event_zone",
"feature","goal","home_skaters","last_event_team","last_event_type",
"last_event_zone","last_value","last_x","last_y","na","opponent_team_skaters",
"period_type","shot_type","time_since_last","type_value","val","value","xg"
)
)
7 changes: 7 additions & 0 deletions R/hockeyR-package.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
#' @keywords internal
"_PACKAGE"

## usethis namespace: start
#' @importFrom stats predict
## usethis namespace: end
NULL
120 changes: 120 additions & 0 deletions R/prepare_xg_data.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,120 @@
#' Prepare xG data
#'
#' @description Helper function to prepare hockeyR pbp data for xG calculations
#'
#' @param x A play-by-play data frame generated by hockeyR before xG is calculated
#'
#' @return A tibble; pbp data with xG model mutations along with identifiers for game and strength state
#'
#' @examples
#' \dontrun{
#' pbp <- load_pbp(2022) |> dplyr::select(-xg)
#' model_data <- prepare_xg_data(pbp)
#' }
prepare_xg_data <- function(x){

model_df <- x |>
# filter out shootouts
dplyr::filter(period_type != "SHOOTOUT") |>
# remove penalty shots
dplyr::filter(secondary_type != "Penalty Shot" | is.na(secondary_type)) |>
# remove shift change events, which were excluded from model
dplyr::filter(event_type != "CHANGE") |>
# add model feature variables
dplyr::group_by(game_id) |>
dplyr::mutate(
last_event_type = dplyr::lag(event_type),
last_event_team = dplyr::lag(event_team),
time_since_last = game_seconds - dplyr::lag(game_seconds),
last_x = dplyr::lag(x),
last_y = dplyr::lag(y),
distance_from_last = round(sqrt(((y - last_y)^2) + ((x - last_x)^2)),1),
event_zone = dplyr::case_when(
x >= -25 & x <= 25 ~ "NZ",
(x_fixed < -25 & event_team == home_name) |
(x_fixed > 25 & event_team == away_name) ~ "DZ",
(x_fixed > 25 & event_team == home_name) |
(x_fixed < -25 & event_team == away_name) ~ "OZ"
),
last_event_zone = dplyr::lag(event_zone)
) |>
dplyr::ungroup() |>
# filter to only unblocked shots
dplyr::filter(event_type %in% c("SHOT","MISSED_SHOT","GOAL")) |>
# get rid off oddball last_events
# ie "EARLY_INTERMISSION_START"
dplyr::filter(
last_event_type %in% c(
"FACEOFF","GIVEAWAY","TAKEAWAY","BLOCKED_SHOT","HIT",
"MISSED_SHOT","SHOT","STOP","PENALTY","GOAL"
)
) |>
# add more feature variables
dplyr::mutate(
era_2011_2013 = ifelse(
season %in% c("20102011","20112012","20122013"),
1, 0
),
era_2014_2018 = ifelse(
season %in% c("20132014","20142015","20152016","20162017","20172018"),
1, 0
),
era_2019_2021 = ifelse(
season %in% c("20182019","20192020","20202021"),
1, 0
),
era_2022_on = ifelse(
as.numeric(season) > 20202021, 1, 0
),
# these are only for the ST model
event_team_skaters = ifelse(event_team == home_name, home_skaters, away_skaters),
opponent_team_skaters = ifelse(event_team == home_name, away_skaters, home_skaters),
total_skaters_on = event_team_skaters + opponent_team_skaters,
event_team_advantage = event_team_skaters - opponent_team_skaters,
# these are in 5v5 model
rebound = ifelse(last_event_type %in% c("SHOT","MISSED_SHOT","GOAL") & time_since_last <= 2, 1, 0),
rush = ifelse(last_event_zone %in% c("NZ","DZ") & time_since_last <= 4, 1, 0),
cross_ice_event = ifelse(
# indicates goalie had to move from one post to the other
last_event_zone == "OZ" &
((last_y > 3 & y < -3) | (last_y < -3 & y > 3)) &
# need some sort of time frame here to indicate shot was quick after goalie had to move
time_since_last <= 2, 1, 0
),
# fix missing empty net vars
empty_net = ifelse(is.na(empty_net) | empty_net == FALSE, FALSE, TRUE),
shot_type = secondary_type,
goal = ifelse(event_type == "GOAL", 1, 0)
) |>
dplyr::select(season, game_id, event_id, strength_state, shot_distance, shot_angle, empty_net, last_event_type:goal) |>
# one-hot encode some categorical vars
dplyr::mutate(type_value = 1, last_value = 1) |>
tidyr::pivot_wider(names_from = shot_type, values_from = type_value, values_fill = 0) |>
tidyr::pivot_wider(
names_from = last_event_type, values_from = last_value, values_fill = 0, names_prefix = "last_"
) |>
janitor::clean_names() |>
dplyr::select(
-last_event_team, -event_zone, -last_event_zone, -event_team_skaters, -opponent_team_skaters
)

if("na" %in% names(model_df)){
model_df <- dplyr::select(model_df, -na)
}

`%not_in%` <- purrr::negate(`%in%`)

missing_feats <- dplyr::tibble(
feature = xg_model_5v5$feature_names
) |>
dplyr::filter(feature %not_in% names(model_df)) |>
dplyr::mutate(val = 0) |>
tidyr::pivot_wider(names_from = feature, values_from = val)

if(length(missing_feats) > 0){
model_df <- dplyr::bind_cols(model_df, missing_feats)
}

return(model_df)

}
16 changes: 16 additions & 0 deletions R/scrape_game.R
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,7 @@
#' @return A tibble containing event-based play-by-play data for an individual
#' NHL game. The resulting data will have columns for:
#' \describe{
#' \item{xg}{Numeric expected goal value for unblocked shot events}
#' \item{event}{String defining the event}
#' \item{event_type}{String with alternate event definition; in all caps}
#' \item{secondary_type}{String defining secondary event type}
Expand Down Expand Up @@ -63,6 +64,7 @@
#' \item{away_goalie}{String name of away goalie on ice}
#' \item{game_id}{Integer value of assigned game ID}
#' \item{event_idx}{Numeric index for event}
#' \item{event_id}{Numeric id for event -- more specified than event_idx}
#' \item{event_player_1_id}{Integer value of the player ID for the primary event player}
#' \item{event_player_1_link}{String value of the NHL.com player link for the primary event player}
#' \item{event_player_1_season_total}{Integer value for the total events for the primary event player this season}
Expand Down Expand Up @@ -863,6 +865,20 @@ scrape_game <- function(game_id){

}

# add event_id
pbp_full <- pbp_full |>
dplyr::mutate(
event_idx = stringr::str_pad(event_idx, width = 4, side = "left", pad = 0),
event_id = as.numeric(paste0(game_id,event_idx)),
secondary_type = ifelse(
stringr::str_detect(dplyr::lead(description), "PS -") &
event_type %in% c("SHOT","MISSED_SHOT","GOAL"),
"Penalty Shot", secondary_type
)
)
# add xg
pbp_full <- calculate_xg(pbp_full)

return(pbp_full)

}
Binary file modified R/sysdata.rda
Binary file not shown.
23 changes: 22 additions & 1 deletion README.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -70,7 +70,28 @@ All variables available in the raw play-by-play data are included, along with a

The `shot_distance` and `shot_angle` are measured in feet and degrees, respectively. The variables `x_fixed` and `y_fixed` are transformations of the `x` and `y` event coordinates such that the home team is always shooting to the right and the away team is always shooting to the left. For full details on the included variables, see the [`scrape_game`](https://github.com/danmorse314/hockeyR/blob/master/R/scrape_game.R) documentation.

As mentioned above, an easy way to create a shot plot is through the [sportyR](https://github.com/sportsdataverse/sportyR) package. You can also use the included `team_colors_logos` data to add color and team logos to your plots.
#### NEW in hockeyR v1.1.0: Expected Goals

As of `hockeyR` v1.1.0, a new column has been added to the play-by-play data: Expected goals! The `hockeyR` package now includes its own public expected goals model, and every unblocked shot in the play-by-play data now has an `xg` value. A full description of the model, including the code used to construct it and the testing results, can be found in the [hockeyR-models](https://github.com/danmorse314/hockeyR-models) repository. Users can now investigate additional statistics, such as player goals above expectation without having to create their own entire model.

```{r xg-example}
pbp %>%
filter(event_type %in% c("SHOT","MISSED_SHOT","GOAL")) %>%
group_by(player = event_player_1_name, id = event_player_1_id) %>%
summarize(
team = last(event_team_abbr),
goals = sum(event_type == "GOAL"),
xg = round(sum(xg, na.rm = TRUE),1),
gax = goals - xg,
.groups = "drop"
) |>
arrange(-xg) |>
slice(1:10)
```

#### Shot Plots

An easy way to create a shot plot is through the [sportyR](https://github.com/sportsdataverse/sportyR) package. You can also use the included `team_colors_logos` data to add color and team logos to your plots.

```{r shot-plot-example}
# get single game
Expand Down
Loading

0 comments on commit 0fba962

Please sign in to comment.