The objective of this study is to use data on the past sports performance of individual players in the Italian Volleyball Serie A to predict, at the start of the championship, which teams will reach its final phase, the Playoffs, after the Regular Season. For this work, the data for the men's volleyball Serie A seasons from 2001/02 to 2017/18 were taken into consideration. Specifically, the season-by-season performance data of each individual athlete were used, so each team is represented by the set of players who make up its squad at the beginning of the season. The aim of the work was to identify supervised learning models capable of predicting future events from information on past events.
Fragment of the dataset used
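The actual schema of the dataset is not reproduced here; purely as an illustration, the sketch below assumes hypothetical per-player columns (season, team, player, points, aces, blocks) and aggregates them into one row per team, which is how a squad can be represented by the players it fields at the start of the season.

```python
import pandas as pd

# Hypothetical fragment of the per-player dataset; the column names here are
# assumptions for illustration, not the project's actual schema.
players = pd.DataFrame({
    "season": ["2016/17", "2016/17", "2016/17", "2016/17"],
    "team":   ["Modena", "Modena", "Perugia", "Perugia"],
    "player": ["A", "B", "C", "D"],
    "points": [310, 275, 402, 198],
    "aces":   [25, 18, 41, 12],
    "blocks": [55, 60, 30, 70],
})

# One row per (season, team): the statistics of the individual players are
# aggregated to describe the squad at the beginning of the season.
team_features = (
    players.groupby(["season", "team"])[["points", "aces", "blocks"]]
    .sum()
    .reset_index()
)
print(team_features)
```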
The statistical measures (also called metrics), obtained from the confusion matrix, that were used for this project are the following (a short computation sketch follows this list):
- Accuracy;
- Balanced_Accuracy;
- Precision;
- Recall;
- F1_score;
- Error (an introduced metric created specifically for this project).
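All of these can be computed from the confusion matrix; the sketch below does so with scikit-learn on a toy label vector. The custom Error metric is only described in report.pdf and is not reproduced here.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             confusion_matrix, f1_score, precision_score,
                             recall_score)

# Toy labels: 1 = team reaches the Playoffs, 0 = team does not.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("Accuracy:         ", accuracy_score(y_true, y_pred))
print("Balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("Precision:        ", precision_score(y_true, y_pred))
print("Recall:           ", recall_score(y_true, y_pred))
print("F1 score:         ", f1_score(y_true, y_pred))

# The project-specific "Error" metric is defined in report.pdf and is not
# reproduced here.
```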
The supervised learning models that were used for this project are as follows:
- Logistic Regression;
- SVC with linear kernel (Support Vector Classification - an extension of SVM);
- SVC with RBF (Radial Basis Function) kernel (Support Vector Classification - an extension of SVM).
In particular, for the SVC model with RBF kernel, four different implementations have been made (for more information, read the report.pdf).
For the Logistic Regression model, several parameters are available; the ones we focused on most are:
- C is the penalty parameter of the error term. In our case it takes value
- solver indicates the algorithm to be used in the optimization problem. In our case it is "lbfgs".
- max_iter indicates the maximum number of iterations allowed for the solver to converge. In our case it was set to 200.
- The other parameters keep their default values (a minimal configuration sketch follows the run command below).

To run this model:
$ python3 LogisticRegression.py
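A minimal sketch, on synthetic placeholder data, of how a Logistic Regression classifier could be configured with the settings listed above; the value of C is not given here, so the scikit-learn default is used as an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic placeholder data standing in for the aggregated team features (X)
# and the Playoff labels (y); shapes and values are arbitrary.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 5))
y_train = rng.integers(0, 2, size=60)
X_test = rng.normal(size=(14, 5))

# solver="lbfgs" and max_iter=200 as stated above; C is left at the
# scikit-learn default because its exact value is not given in this README.
clf = LogisticRegression(solver="lbfgs", max_iter=200)
clf.fit(X_train, y_train)
print(clf.predict(X_test))  # 1 = predicted to reach the Playoffs
```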
For the SVC model with linear kernel, several parameters are available; the ones we focused on most are:
- C is the penalty parameter of the error term. In our case it takes value
- max_iter indicates the maximum number of iterations allowed for the solver to converge. In our case it was set to 20000.
- The other parameters keep their default values (a minimal configuration sketch follows the run command below).

To run this model:
$ python3 LinearSVC.py
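A minimal sketch with the settings listed above, again on synthetic placeholder data; the script name suggests scikit-learn's LinearSVC class, and C is left at its default as an assumption.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Synthetic placeholder data standing in for the aggregated team features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 5))
y_train = rng.integers(0, 2, size=60)
X_test = rng.normal(size=(14, 5))

# max_iter=20000 as stated above; C is left at its default (an assumption).
clf = LinearSVC(max_iter=20000)
clf.fit(X_train, y_train)
print(clf.predict(X_test))
```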
For the SVC model with RBF kernel, several parameters are available; the ones we focused on most are:
- C is the penalty parameter of the error term. In our case it takes value
- kernel specifies the type of kernel to be used in the algorithm. It can be "linear", "poly", "rbf" or "sigmoid". In our case it is "rbf", i.e. the Gaussian kernel.
- gamma (γ) is the kernel coefficient for the "rbf" kernel. Possible values for this parameter are "scale", "auto" or a positive float.
- The other parameters keep their default values (a minimal configuration sketch follows this list).
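A minimal sketch of an RBF-kernel SVC under these settings, on synthetic placeholder data; the C and gamma values actually used are not stated here, so defaults are assumed, and the probability-enabled variant is only a guess based on the name of NoLinearSVC_with_Probability.py below.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic placeholder data standing in for the aggregated team features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 5))
y_train = rng.integers(0, 2, size=60)
X_test = rng.normal(size=(14, 5))

# kernel="rbf" as stated above; C and gamma are left at their defaults here
# because the values actually used are given only in the scripts/report.
clf = SVC(kernel="rbf", gamma="scale")
clf.fit(X_train, y_train)
print(clf.predict(X_test))

# The fourth implementation (NoLinearSVC_with_Probability.py) presumably
# enables Platt-scaled probability estimates, which in scikit-learn means
# probability=True and predict_proba:
clf_proba = SVC(kernel="rbf", gamma="scale", probability=True)
clf_proba.fit(X_train, y_train)
print(clf_proba.predict_proba(X_test)[:, 1])  # estimated Playoff probability
```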
Four different implementations of this model have been created (for more information, see the report.pdf).
- To run the third implementation
$ python3 NoLinearSVC.py
- To run the fourth implementation
$ python3 NoLinearSVC_with_Probability.py
Example of output for the 2008 test with this implementation
The results obtained with the various models were compared in terms of the F1_score metric. The table below summarizes the results obtained:
MODEL | F1_score |
---|---|
Logistic Regression | 75.9% |
SVC with Linear Kernel | 73.3% |
SVC with RBF Kernel (first implementation) | 82.5% |
SVC with RBF Kernel (second implementation) | 80.9% |
SVC with RBF Kernel (third implementation) | 81.4% |
SVC with RBF Kernel (fourth implementation) | 81.0% |
To run the code you need the following libraries:
Library | Version |
---|---|
numpy | >= 1.19.4 |
pandas | >= 1.1.5 |
scikit-learn | >= 0.24.0 |
scipy | >= 1.3.1 |
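One possible way to install them, assuming pip is available for your Python 3 interpreter:

$ pip3 install "numpy>=1.19.4" "pandas>=1.1.5" "scikit-learn>=0.24.0" "scipy>=1.3.1"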
The code has been tested on macOS Catalina (version 10.15.2).
MIT License. See LICENSE file for further information.