MLHAT: Multi-Label Hoeffding Adaptive Trees for classification in multi-label data streams

Associated repository with complementary material to the paper Hoeffding adaptive trees for multi-label classification on data streams, published in Knowledge-Based Systems journal:

Source code of the MLHAT proposal
Datasets used in the experimentation
Complete tables of results
Source code for a reproductible experimentation

Source code

The purpose of this repository is to make public and accessible the source code of MLHAT, a proposal for multi-instance classification in data streams based on incremental decision trees, specifically using the principle of Hoeffding adaptive trees. The code is available under the src folder with the following structure:

src
│   requirements.yml
│   tutorial.ipynb
│
└───multioutput
│   │   mlhat.py > Source code of the main algorithm of MLHAT.
│   │   mlht.py > Additionally to the main contribution of the work, here is an implementation of the multi-label Hoeffding tree of J. Read et al. (2012)
│   │
│   └───mlhat_nodes > Implementations of the branches and leaaf nodes used in the proposal.
│   │
│   │   binary_relevance.py, label_combination.py, majority.py > Auxiliar implementations of multi-label classifiers for the leaf classifiers of MLHAT.
│
└───evaluate, linear_model, metrics, neighbors, stream > Minor adaptations from the River library needed for our work.

The development environment is based on Python >=3.10, with a special mention to the library River as the base for development of MLHAT. The complete list of libraries to replicate the environment is available in requirements.yml.

The Jupyter notebook tutorial.ipnb describes a complete tutorial for using the presented library, including loading data, building the tree in a streaming environment, evaluating and visualizing results.

Datasets

The performance of MLHAT has been validated on a large selection of multi-label datasets that have been adapted to work in a streaming environment. The datasets belong to different sources and contexts, but all are publicly available in the Multi-Label Classification Dataset Repository.

Dataset	Instances	Features	Labels	Cardinality	Density
Flags	194	19	7	3.39	0.48
WaterQuality	1060	16	14	5.07	0.36
Emotions	593	72	6	1.87	0.31
VirusGO	207	749	6	1.22	0.20
Birds	645	260	19	1.01	0.05
Yeast	2417	103	14	4.24	0.30
Scene	2407	294	6	1.07	0.18
Gnegative	1392	440	8	1.05	0.13
Plant	978	440	12	1.08	0.09
Human	3106	440	14	1.19	0.08
Yelp	10806	671	5	1.64	0.33
Medical	978	1449	45	1.25	0.03
Eukaryote	7766	440	22	1.15	0.05
Slashdot	3782	1079	22	1.18	0.05
Enron	1702	1001	53	4.27	0.08
Hypercube	100000	100	10	1.00	0.10
Langlog	1460	1004	75	15.94	0.21
Stackex	1675	585	227	2.41	0.01
Tmc	28596	500	22	2.22	0.10
Ohsumed	13929	1002	23	0.81	0.04
D20ng	19300	1006	20	1.42	0.07
Mediamill	43907	120	101	4.38	0.04
Corel16k	13766	500	153	2.86	0.02
Bibtex	7395	1836	159	2.40	0.02
Nuswidec	269648	129	81	1.87	0.02
Nuswideb	269648	501	81	1.87	0.02
Imdb	120919	1001	28	1.00	0.04
YahooSociety	14512	31802	27	1.67	0.06
Eurlex	19348	5000	201	2.21	0.01

Moreover, a set of synthetic datasets have been generated to test the performance of MLHAT under specific concept drifts and other different conditions. The datasets have been generated using the multi-label meta-generator provided by the MOA framewok and made publicly available in our repository under the folder synthdata

Dataset	Instances	Features	Labels	Generator	Drift type	Drift width
SynTreeSud	50000	20 num.+10 cat.	8	Random Tree	sudden	1
SynRBFSud	50000	80 numeric	25	Random RBF	sudden	1
SynHPSud	50000	30 numeric	8	Hyper Plane	sudden	1
SynTreeGrad	50000	20 num.+10 cat.	8	Random Tree	gradual	500
SynRBFGrad	50000	80 numeric	25	Random RBF	gradual	500
SynHPGrad	50000	30 numeric	8	Hyper Plane	gradual	500
SynTreeInc	50000	20 num.+10 cat.	8	Random Tree	incremental	275
SynRBFInc	50000	80 numeric	25	Random RBF	incremental	275
SynHPInc	50000	30 numeric	8	Hyper Plane	incremental	275
SynTreeRec	50000	20 num.+10 cat.	8	Random Tree	recurrent	1
SynRBFRec	50000	80 numeric	25	Random RBF	recurrent	1
SynHPRec	50000	30 numeric	8	Hyper Plane	recurrent	1

Moreover, all of then are available ready to use in this framework under the folder src/ml_datasets. The datasets employed are the following, and more information about how to load them is presented in the tutorial.ipnb.

Results

The results associated to the complete experimentation carried out in this work are available in the results folder. The following shows the average result of all the datasets and algorithms included in the experimentation with respect to 12 multi-label classification evaluation metrics following the prequential evaluation, with our MLHAT obtaining the best result in 11 of them.

Algorithm	Subset Acc.	Hamming loss	Example-based F1	Example-based Precision	Example-based Recall	Micro F1	Micro Precision	Micro Recall	Macro F1	Macro Precision	Macro Recall	Time (s)
MLHAT	0.2646	0.1301	0.4755	0.6195	0.4892	0.4855	0.5309	0.4556	0.3871	0.4691	0.3661	14994
KNN	0.2018	0.1273	0.3321	0.4781	0.3234	0.3744	0.5170	0.3220	0.2609	0.4035	0.2309	102362
NB	0.0993	0.1842	0.2338	0.3794	0.3157	0.2479	0.3392	0.3211	0.1754	0.2198	0.2582	7578
AMR	0.2111	0.1177	0.3343	0.5716	0.3263	0.3728	0.6513	0.3207	0.2626	0.4467	0.2356	29882
HT	0.1512	0.1232	0.2687	0.5127	0.2608	0.3176	0.5336	0.2574	0.2003	0.3424	0.1770	42975
HAT	0.1746	0.1195	0.3076	0.5352	0.3008	0.3612	0.5494	0.2972	0.2319	0.3659	0.2056	43651
EFDT	0.1620	0.1236	0.3066	0.4996	0.3028	0.3603	0.5095	0.3015	0.2413	0.3465	0.2138	36942
SGT*	0.0337	0.2562	0.1221	0.4381	0.2103	0.1392	0.2115	0.2122	0.1016	0.1638	0.1991	298466
MT*	0.1224	0.1285	0.1920	0.5126	0.1789	0.2223	0.5606	0.1750	0.1335	0.2922	0.1232	186185
ARF	0.1967	0.1126	0.3178	0.5507	0.3008	0.3730	0.6886	0.2972	0.2440	0.4710	0.2081	29308
AMF*	0.1935	0.1247	0.2987	0.6457	0.2834	0.3367	0.6681	0.2790	0.2267	0.4547	0.2029	131491
ABALR	0.2189	0.1156	0.3283	0.5958	0.3178	0.3638	0.7057	0.3113	0.2509	0.4808	0.2254	12584
ABOLR	0.2162	0.1164	0.3276	0.5931	0.3182	0.3616	0.6959	0.3115	0.2512	0.4682	0.2270	13868
MLBELS*	0.2046	0.1858	0.4595	0.4854	0.5452	0.4703	0.4579	0.5173	0.3090	0.4518	0.3650	77813
MLHT	0.1419	0.1723	0.2495	0.2792	0.2419	0.2523	0.2992	0.2343	0.1139	0.2497	0.1522	1458
MLHTPS	0.1211	0.1343	0.1900	0.4596	0.1856	0.2096	0.4164	0.1823	0.1337	0.3058	0.1327	29517
iSOUPT	0.2154	0.1175	0.3284	0.5938	0.3202	0.3710	0.6908	0.3149	0.2507	0.4608	0.2247	6486

Reproductible experimentation

All the experimentation has been run in Python, using for the comparative analysis the implementations available in River of the main classification algorithms in multi-label data streams, with the default parameters proposed by the authors. The methods used, their parameters and the reference implementation used are detailed below.

Algorithm	Family	Parameters	Implementation reference
MLHAT	Multilabel tree	`grace_period=200, delta=1e-05, tau=0.05, split_criterion=MLInfoGain, leaf_prediction=MLNBA, splitter=MLGaussian, nb_threshold=0, drift_detector=ADWIN, drift_window_threshold=50, switch_significance=0.05, poisson_rate=1.0`	This repository
MLHT	Multilabel tree	`grace_period=200, delta=1e-05, tau=0.05, split_criterion=MLInfoGain, leaf_prediction=MLNBA, splitter=MLGaussian, nb_threshold=0, leaf_classifier=MajorityLabelset`	https://github.com/Waikato/moa/blob/master/moa/src/main/java/moa/classifiers/multilabel/MultilabelHoeffdingTree.java
MLHTPS	Multilabel tree	`grace_period=200, delta=1e-05, tau=0.05, split_criterion=MLInfoGain, leaf_prediction=MLNBA, splitter=MLGaussian, nb_threshold=0, leaf_classifier=PrunedSet(HoeffdingTree)`	https://github.com/Waikato/moa/blob/master/moa/src/main/java/moa/classifiers/multilabel/MultilabelHoeffdingTree.java
iSOUPT	Multilabel tree	`grace_period=200, delta=1e-5, tau=0.05, leaf_prediction=LogisticRegression, model_selector_decay=0.95, splitter=TEBST, min_samples_split=5`	https://riverml.xyz/0.15.0/api/tree/iSOUPTreeRegressor/
MLBELS	Multilabel neural network	`N1=3, N2=25, N3=1, max_learners=100, shrink_coef=0.8, regularization_coef=2**-30, batch_size=50, preprocess=True, tau=1.5`	https://github.com/sepehrbakhshi/ML-BELS
BR+HT	Tree	`grace_period=200, delta=1e-05, tau=0.05, split_criterion=InfoGain, leaf_prediction=NBA, splitter=Gaussian, nb_threshold=0`	https://riverml.xyz/0.15.0/api/tree/HoeffdingTreeClassifier/
BR+EFDT	Tree	`grace_period=200, delta=1e-05, tau=0.05, split_criterion=InfoGain, leaf_prediction=NBA, splitter=Gaussian, nb_threshold=0, min_samples_reevaluate=20`	https://riverml.xyz/0.15.0/api/tree/ExtremelyFastDecisionTreeClassifier/
BR+HAT	Tree	`grace_period=200, delta=1e-05, tau=0.05, split_criterion=InfoGain, leaf_prediction=NBA, splitter=Gaussian, nb_threshold=0, drift_detector=ADWIN, drift_window_threshold=50, switch_significance=0.05`	https://riverml.xyz/0.15.0/api/tree/HoeffdingAdaptiveTreeClassifier/
BR+MT	Tree	`step=0.1, loss=log, use_aggregator=True, dirichlet=0.5, split_pure=False`	https://riverml.xyz/0.15.0/api/forest/AMFClassifier/
BR+SGT	Tree	`grace_period=200, delta=1e-5, init_pred=0.0, lambda_value=0.1, gamma=1.0`	https://riverml.xyz/0.15.0/api/tree/SGTClassifier/
BR+ARF	Forest	`n_models=10, max_features=sqrt, grace_period=50, delta=0.01, tau=0.05, split_criterion=InfoGain, leaf_prediction=NBA, splitter=Gaussian, nb_threshold=0, lambda_value=6, drift_detector=ADWIN, warning_detector=ADWIN`	https://riverml.xyz/0.15.0/api/forest/ARFClassifier/
BR+AMF	Forest	`n_models=10, step=1.0, use_aggregation=True, dirchlet=0.5, split_pure=False`	https://riverml.xyz/0.15.0/api/forest/AMFClassifier/
BR+kNN	Distance-based	`n_neighbors=5, window_size=200, min_distance_keep=0.0, weighted=True, cleanup_every=0, distance_func=Euclidean, softmax=False`	https://riverml.xyz/0.15.0/api/neighbors/KNNClassifier/
BR+NB	Bayesian	-	https://riverml.xyz/0.15.0/api/naive-bayes/GaussianNB/
BR+AMR	Rules	`n_min=200, delta=1e-5, tau=0.05, pred_model=LogisticRegression, splitter=TEBST, drift_detector=ADWIN, fading_factor=0.99, anomaly_threshold=-0.75, m_min=30, min_samples_split=5`	https://riverml.xyz/0.15.0/api/rules/AMRules/
BR+OBA	Ensemble	`model=LogisticRegression, n_models=10, drift_detector=ADWIN`	https://riverml.xyz/0.15.0/api/ensemble/ADWINBaggingClassifier/
BR+OBOA	Ensemble	`model=LogisticRegression, n_models=10, drift_detector=ADWIN`	https://riverml.xyz/0.15.0/api/ensemble/ADWINBoostingClassifier/

Name		Name	Last commit message	Last commit date
Latest commit History 27 Commits
results		results
src		src
synthdata		synthdata
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MLHAT: Multi-Label Hoeffding Adaptive Trees for classification in multi-label data streams

Source code

Datasets

Results

Reproductible experimentation

About

Releases

Packages

Languages

License

aestebant/mlhat

Folders and files

Latest commit

History

Repository files navigation

MLHAT: Multi-Label Hoeffding Adaptive Trees for classification in multi-label data streams

Source code

Datasets

Results

Reproductible experimentation

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages