The objective of this study is to use data on the past sports performance of individual players in the Italian Volleyball Serie A to predict, at the start of the championship, which teams will reach its final phase, the Playoffs, after the Regular Season. For this work, the data for the men's volleyball Serie A seasons from 2001/02 to 2017/18 were taken into consideration. Specifically, the season-by-season performance data of each individual athlete were used, so each team is represented by the set of players who make up its squad at the beginning of the season. The aim of the work was to identify supervised learning models capable of predicting future events from information on past events.
Fragment of the dataset used
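The actual schema of the dataset is not reproduced here; purely as an illustration, the sketch below assumes hypothetical per-player columns (season, team, player, points, aces, blocks) and aggregates them into one row per team, which is how a squad can be represented by the players it fields at the start of the season.

```python
import pandas as pd

# Hypothetical fragment of the per-player dataset; the column names here are
# assumptions for illustration, not the project's actual schema.
players = pd.DataFrame({
    "season": ["2016/17", "2016/17", "2016/17", "2016/17"],
    "team":   ["Modena", "Modena", "Perugia", "Perugia"],
    "player": ["A", "B", "C", "D"],
    "points": [310, 275, 402, 198],
    "aces":   [25, 18, 41, 12],
    "blocks": [55, 60, 30, 70],
})

# One row per (season, team): the statistics of the individual players are
# aggregated to describe the squad at the beginning of the season.
team_features = (
    players.groupby(["season", "team"])[["points", "aces", "blocks"]]
    .sum()
    .reset_index()
)
print(team_features)
```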
The statistical measures (also called metrics), obtained from the confusion matrix, that were used for this project are the following (a short computation sketch follows this list):
- Accuracy;
- Balanced_Accuracy;
- Precision;
- Recall;
- F1_score;
- Error (an introduced metric created specifically for this project).
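All of these can be computed from the confusion matrix; the sketch below does so with scikit-learn on a toy label vector. The custom Error metric is only described in report.pdf and is not reproduced here.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             confusion_matrix, f1_score, precision_score,
                             recall_score)

# Toy labels: 1 = team reaches the Playoffs, 0 = team does not.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("Accuracy:         ", accuracy_score(y_true, y_pred))
print("Balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("Precision:        ", precision_score(y_true, y_pred))
print("Recall:           ", recall_score(y_true, y_pred))
print("F1 score:         ", f1_score(y_true, y_pred))

# The project-specific "Error" metric is defined in report.pdf and is not
# reproduced here.
```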
The supervised learning models that were used for this project are as follows:
- Logistic Regression;
- SVC with linear kernel (Support Vector Classification - an extension of SVM);
- SVC with RBF (Radial Basis Function) kernel (Support Vector Classification - an extension of SVM).
In particular, for the SVC model with RBF kernel, four different implementations have been made (for more information, read the report.pdf).
For the Logistic Regression model, several parameters are available; the ones we focused on most are:
- C is the penalty parameter of the error term. In our case it takes value
- solver indicates the algorithm to be used in the optimization problem. In our case it is "lbfgs".
- max_iter indicates the maximum number of iterations allowed for the solver to converge. In our case it was set to 200.
- The other parameters keep their default values (a minimal configuration sketch follows the run command below).

To run this model:
$ python3 LogisticRegression.py
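A minimal sketch, on synthetic placeholder data, of how a Logistic Regression classifier could be configured with the settings listed above; the value of C is not given here, so the scikit-learn default is used as an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic placeholder data standing in for the aggregated team features (X)
# and the Playoff labels (y); shapes and values are arbitrary.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 5))
y_train = rng.integers(0, 2, size=60)
X_test = rng.normal(size=(14, 5))

# solver="lbfgs" and max_iter=200 as stated above; C is left at the
# scikit-learn default because its exact value is not given in this README.
clf = LogisticRegression(solver="lbfgs", max_iter=200)
clf.fit(X_train, y_train)
print(clf.predict(X_test))  # 1 = predicted to reach the Playoffs
```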
For the SVC model with linear kernel, several parameters are available; the ones we focused on most are:
- C is the penalty parameter of the error term. In our case it takes value
- max_iter indicates the maximum number of iterations allowed for the solver to converge. In our case it was set to 20000.
- The other parameters keep their default values (a minimal configuration sketch follows the run command below).

To run this model:
$ python3 LinearSVC.py
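A minimal sketch with the settings listed above, again on synthetic placeholder data; the script name suggests scikit-learn's LinearSVC class, and C is left at its default as an assumption.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Synthetic placeholder data standing in for the aggregated team features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 5))
y_train = rng.integers(0, 2, size=60)
X_test = rng.normal(size=(14, 5))

# max_iter=20000 as stated above; C is left at its default (an assumption).
clf = LinearSVC(max_iter=20000)
clf.fit(X_train, y_train)
print(clf.predict(X_test))
```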
For the SVC model with RBF kernel, several parameters are available; the ones we focused on most are:
- C is the penalty parameter of the error term. In our case it takes value
- kernel specifies the type of kernel to be used in the algorithm. It can be "linear", "poly", "rbf" or "sigmoid". In our case it is "rbf", i.e. the Gaussian kernel.
- gamma (γ) is the kernel coefficient for the "rbf" kernel. Possible values for this parameter are "scale", "auto" or a positive float.
- The other parameters keep their default values (a minimal configuration sketch follows this list).
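A minimal sketch of an RBF-kernel SVC under these settings, on synthetic placeholder data; the C and gamma values actually used are not stated here, so defaults are assumed, and the probability-enabled variant is only a guess based on the name of NoLinearSVC_with_Probability.py below.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic placeholder data standing in for the aggregated team features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 5))
y_train = rng.integers(0, 2, size=60)
X_test = rng.normal(size=(14, 5))

# kernel="rbf" as stated above; C and gamma are left at their defaults here
# because the values actually used are given only in the scripts/report.
clf = SVC(kernel="rbf", gamma="scale")
clf.fit(X_train, y_train)
print(clf.predict(X_test))

# The fourth implementation (NoLinearSVC_with_Probability.py) presumably
# enables Platt-scaled probability estimates, which in scikit-learn means
# probability=True and predict_proba:
clf_proba = SVC(kernel="rbf", gamma="scale", probability=True)
clf_proba.fit(X_train, y_train)
print(clf_proba.predict_proba(X_test)[:, 1])  # estimated Playoff probability
```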
Four different implementations of this model have been created (for more information, see the report.pdf).
- To run the third implementation
$ python3 NoLinearSVC.py
- To run the fourth implementation
$ python3 NoLinearSVC_with_Probability.py
Example of output for the 2008 test with this implementation
The results obtained with the various models were compared in terms of the F1_score metric. The table below summarizes the results obtained:
MODEL | F1_score |
---|---|
Logistic Regression | 75.9% |
SVC with Linear Kernel | 73.3% |
SVC with RBF Kernel (first implementation) | 82.5% |
SVC with RBF Kernel (second implementation) | 80.9% |
SVC with RBF Kernel (third implementation) | 81.4% |
SVC with RBF Kernel (fourth implementation) | 81.0% |
To run the code you need the following libraries:
Library | Version |
---|---|
numpy | >= 1.19.4 |
pandas | >= 1.1.5 |
scikit-learn | >= 0.24.0 |
scipy | >= 1.3.1 |
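One possible way to install them, assuming pip is available for your Python 3 interpreter:

$ pip3 install "numpy>=1.19.4" "pandas>=1.1.5" "scikit-learn>=0.24.0" "scipy>=1.3.1"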
The code has been tested on macOS Catalina (version 10.15.2).
MIT License. See LICENSE file for further information.