🌀️ NOTE: This repository has been extended into an open-source Python package! Check out the unravelsports repository or simply
pip install unravelsports
⚠️ NOTE: If you are having trouble automatically loading the data within Jupyter Notebook please find all relevant data files (and the associated paper) in the Google Drive folder
This repository is provided alongside the paper: "A Graph Neural Network deep-dive into successful counterattacks"[pdf] (Mirror 1). This paper is one of the finalists of the 2023 MIT Sloan Sports Analytics Conference Research Paper Competition. It contains an interactive Python Jupyter Notebook and all relevant datasets (see below) for training GNNs using the Spektral library.
The purpose of this research is to build gender-specific, first-of-their-kind Graph Neural Networks to model the likelihood of a counterattack being successful and to uncover what factors make them successful in both men's and women's professional soccer. These models are trained on a total of 20,863 frames of algorithmically identified counterattacking sequences from synchronized StatsPerform on-ball event and SkillCorner spatiotemporal (broadcast) tracking data. The data - easily accessible within the Counterattack Jupyter Notebook - is derived from 632 games of MLS (2022), NWSL (2022) and international women’s soccer (2020-2022).
With this data we demonstrate that gender-specific Graph Neural Networks outperform architecturally identical gender-ambiguous models in predicting the successful outcome of counterattacks. We show, using Permutation Feature Importance, that byline to byline speed, angle to the goal, angle to the ball and sideline to sideline speed are the node features with the highest impact on model performance.
Our MIT Sloan Sports Analytics Conference 2023 presentation has made its way to YouTube, see below:
The architecture of the GNN model provided in the Counterattack Jupyter Notebook is depicted in the image below
Our preferred Graph configuration is depicted below. As you can see in Counterattack Jupyter Notebook you are free to choose your own configuration by selecting from an assortment of Adjaceny Matrices, Edge Features and Node Features, more on this below
To install Jupyter using pip, first check if pip is updated in your system. Use the following command in the command prompt to update pip
python -m pip install --upgrade pip
Once upgraded, use the following command to install Jupyter
python -m pip install jupyter
Use the following command to launch Jupyter Notebook
jupyter notebook
Once launched, you should be able to see the following screen
Make sure the requirement.txt file is in the directory. Navigate to the directory using command prompt and type the following command
pip install -r requirements.txt
Or for MacOS
pip install -r requirements_macos.txt
-
Navigate to the location where you have cloned the GitHub repository and open the interactive notebook.
-
Open the .ipynb file.
-
Run the first cell by clicking the play button or using the shortcut
Shift + Enter
through your keyboard. If all the libraries are installed succesffuly this code block should execute without throwing any errors. -
Run the subsequent cell blocks to load the data.
-
Choose a file between men's data, women's data and combined dataset using the dropdown list.
- Men's dataset - MLS 2022 season
- Women's dataset - NWSL 2022 season + International Women's soccer
- Combined dataset - Combination of both men and women's data
-
Choose the adjacency matrix from the dropdown list.
- normal - connects attacking players amongst themselves, defensive players amongst themselves and the attacking and defending players are conencted through the ball.
- delaunay - connects a few attacking players and defending players in a delaunay matrix fashion
- dense - connects all the players and the ball to each other
- dense_ap - connects all the attacking players to each other and defensive players.
- dense_dp - connects all the defending players to each other and attacking players.
-
Choose the Edge features and Node Features using the checkboxes.
-
Edge Feature options:
- Player Distance - Distance between two players connected to each other
- Speed Difference - Speed difference between two players connected to each other
- Positional Sine angle - Sine of the angle created between two players in the edge
- Positional Cosine angle - Cosine of the angle created between two players in the edge
- Velocity Sine angle - Sine of the angle created between the velocity vectors of two players in the edge
- Velocity Cosine angle - Coine of the angle created between the velocity vectors of two players in the edge
-
Node Feature options:
- x coordinate - x coordinate on the 2D pitch for the player / ball
- y coordinate - y coordinate on the 2D pitch for the player / ball
- vx - Velocity vector's x coordinate
- vy - Velocity vector's y coordinate
- Velocity - magnitude of the velocity
- Velocity Angle - angle made by the velocity vector
- Distance to Goal - distance of the player from the goal post
- Angle with Goal - angle made by the player with the goal
- Distance to Ball - distance from the ball (always 0 for the ball)
- Angle with Ball - angle made with the ball (always 0 for the ball)
- Attacking Team Flag - 1 if the team is attacking, 0 if not and for the ball
- Potential Receiver - 1 if player is a potential receiver, 0 otherwise
-
-
Update the graph neural training network configurations as per your requirement. You may also try chaning the network layers.
-
Start the model training. It should stop after the epochs are completed. Check your training logloss score. If it is not satisfactory rerun the training block.
-
Use the block for testing model logloss and ROC-AUC curve.
-
Further, you may opt to look at model calibration and also calculate the Expected Calibration Error (More details in the notebook text blocks).
-
It is also possible to check which features contribute the most to the model performance. (Note: The
Attacking Team Flag
checkbox from the Node Features needs to be selected to calculate the feature importance.) Choose between attacking and defending players and note the differences via the box plot.
- Change the model architecture by introducing different configurations for the Convolutional layers, Pooling layers and Hyper Parameters.
- Choose different adjacency matrices and different combinations of node features and edge features.
- Create a "Gender-Aware" model as suggested by StatsBomb. Perhaps this can be achieved by adding a node feature to each node in each graph specifying the gender.
- Use the Imbalanced Dataset to build a robust (gendered) counterattacking success prediction model
- Analyze the features and their importance to counterattack success more in-depth. This can help uncover for example, for which players on what position it is important to have a high (or low)
vx
etc. - Analyze other aspects of counterattacks. For example: what is the impact of having more players behind the ball when defending, or how much is the impact of space reduction around the player on the ball.
- etc.
If you use any of the data or files within this repository, please cite our paper.
@inproceedings{sahasrabudhe2023graph,
title={A Graph Neural Network deep-dive into successful counterattacks},
author={Sahasrabudhe, Amod and Bekkers, Joris},
booktitle={17th Annual MIT Sloan Sports Analytics Conference. Boston, MA, USA: MIT},
pages={15},
year={2023}
}
We provide two different types of datasets:
This dataset is used within our research paper. It contains only Graph representations of tracking data frames that are associated with an event that happens within a counterattacking sequence.
Dataset | Competitions | Sample Size (Individual Graphs) |
---|---|---|
Women | NWSL (2022) Int. Friendlies SheBelieves Olympics |
3,720 |
Men | MLS (2022) | 17,143 |
Combined | All | 20,863 |
Every Graph of an event that is part of a counterattacking sequence where eventually the ball reaches the opposing team's penalty area is labeled a success (1).
This dataset has 50% successful and 50% unsuccessful labels.
{
"normal": {
"a": [AdjMatrixGraph0, AdjMatrixGraph1, ... AdjMatrixGraphN],
"x": [NodeFeatureMatrixGraph0, NodeFeatureMatrixGraph1, ... NodeFeatureMatrixGraphNN],
"e": [EdgeFeatureMatrixGraph0, EdgeFeatureMatrixGraph1, ... EdgeFeatureMatrixGraphN]
}, <- similar dictionaries for each other Adjacency Matrix Type ("delaunay", "dense", "dense_ap", "dense_dp")
"binary": [[LabelGraph0], [LabelGraph1]...[LabelGraphN]]
}
The data loading process is automated within the Counterattack Jupyter Notebook, but it can also be obtained through the links below.
Note: Node and Edge feature names for these datasets are listed in Counterattack Jupyter Notebook
This dataset is an additional dataset that was not used within the research paper. It contains Graph representations of every tracking data frame that is associated with a counterattacking sequence.
Dataset | Competitions | Sample Size (Individual Graphs) |
---|---|---|
Women | NWSL (2022) Int. Friendlies SheBelieves Olympics |
103,381 |
Men | MLS (2022, approx. 32% of games) | 104,628 |
Every Graph frame that is part of a counterattacking sequence that leads to a goal is labeled a success (1).
This dataset has approx. 5% successful and approx. 95% unsuccessful labels.
The structure is slightly different, as we only provide the normal
adjacency matrix.
We also provide an id
label. This label number is the same for all frames that belong to the same sequence.
Use this Sequence ID value to split the test & train data such that we do not add parts of a single sequence to both the test and train set.
Doing this would result in leaking information between test and train set. To do this splitting see split_sequences.py.
{
"a": [AdjMatrixGraph0, AdjMatrixGraph1, ... AdjMatrixGraphN],
"x": [NodeFeatureMatrixGraph0, NodeFeatureMatrixGraph1, ... NodeFeatureMatrixGraphNN],
"e": [EdgeFeatureMatrixGraph0, EdgeFeatureMatrixGraph1, ... EdgeFeatureMatrixGraphN]
"label": [[LabelGraph0], [LabelGraph1]...[LabelGraphN]],
"id": [SequenceIDGraph0, SequenceIDGraph1, ... SequenceIDGraphN],
"node_feature_names": ["x", "y", "vx",... "att_team"],
"edge_feature_names": ["distance", ...]
}
- Python 3.9+