Comparative Analysis of Text Mining and Clustering Techniques for Assessing Functional Dependency between Manual Test Cases
Appendix
The supplementary appendix materials for the article "Performance Comparison of Different Text Mining and Clustering Techniques for Functional Dependency" are provided in the following pages.
Plotting UMAP results
Figures 1 and 2 illustrate the results of utilizing seven different string distance algorithms for text mining, where the Agglomerative algorithm is used for the clustering in Figure 1. A total of five clusters were achieved, as mirrored by the Agglomerative clustering algorithm in Figures 1a, 1b, 1c, and 1d respectively.
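The string-distance-plus-Agglomerative pipeline can be illustrated with a small standard-library sketch: `difflib.SequenceMatcher` implements the Ratcliff-Obershelp similarity used in Figure 1b, and a naive single-linkage merge loop stands in for the Agglomerative clustering (in practice one would use, e.g., scikit-learn's `AgglomerativeClustering` with a precomputed distance matrix). The test-case names below are invented placeholders, not the paper's data:

```python
from difflib import SequenceMatcher

def ro_distance(a: str, b: str) -> float:
    """Ratcliff-Obershelp distance: 1 minus the similarity ratio."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()

def agglomerative(items, k):
    """Naive single-linkage agglomerative clustering down to k clusters."""
    n = len(items)
    dist = [[ro_distance(items[i], items[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]
    while len(clusters) > k:
        # Find the pair of clusters with the smallest single-linkage distance.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))  # merge the closest pair
    return clusters

# Hypothetical manual test-case titles, for illustration only.
tests = ["open login page", "open logout page", "reset user password",
         "reset admin password", "export report as pdf"]
print(agglomerative(tests, 3))
```

The two near-duplicate login/logout and password cases end up merged first, since their pairwise Ratcliff-Obershelp distances are much smaller than any cross-pair distance.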
The results of using several normalized compression distance algorithms for text mining, combined with the DBSCAN and HDBSCAN clustering algorithms, are presented in Figures 3 and 4. As emphasized before, the HDBSCAN algorithm can provide a cluster of the non-clusterable data points, which can be interpreted as independent test cases in this study. Generally, the HDBSCAN algorithm provides more clusters than all other utilized clustering algorithms. As shown in Figures 3a, 3b, 3c, 4a, and 4b, more than 200 clusters are generated, where each color represents a unique cluster. However, the combination of the same text mining method with DBSCAN places all test cases in a single cluster, as mirrored in Figure 3d. The visualization results of employing two machine learning approaches are mirrored in Figure 5, where Figure 5a represents the combination of Doc2Vec with Agglomerative clustering and Figure 5b the combination of SBERT with Affinity Propagation, respectively.
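The normalized compression distance underlying Figures 3 and 4 can be sketched with Python's standard-library compressors (`gzip` for gzip, `bz2` for bzip, `lzma` for XZ; Deflate is available via `zlib`, while Zstd requires a third-party package). The test-case strings here are made-up placeholders:

```python
import bz2    # bzip variant (Figure 3a); swap bz2.compress in below
import gzip
import lzma   # XZ variant (Figure 3d); swap lzma.compress in below

def ncd(x: bytes, y: bytes, compress) -> float:
    """Normalized compression distance:
    (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the length of the compressor's output."""
    cx, cy = len(compress(x)), len(compress(y))
    cxy = len(compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Hypothetical test-case descriptions, repeated to give the compressor
# enough redundancy to exploit.
doc_a = b"verify that the user can log in with valid credentials " * 4
doc_b = b"verify that the user can log out after a valid session " * 4
doc_c = b"measure battery drain while the radio module is idle " * 4

# Similar documents compress well together, giving a smaller distance.
print(ncd(doc_a, doc_b, gzip.compress), ncd(doc_a, doc_c, gzip.compress))
```

The resulting pairwise distance matrix can then be fed to a clustering algorithm that accepts precomputed distances, such as DBSCAN or HDBSCAN with `metric="precomputed"` in scikit-learn.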
(a) Overlap coefficient with Agglomerative.
(b) Ratcliff-Obershelp with Agglomerative.
(c) Jaro with Agglomerative.
(d) Levenshtein with Agglomerative.
Figure 1 - String distance algorithms are employed for text mining.
(a) Jaccard with Affinity.
(b) Sorensen–Dice coefficient with Affinity.
(c) $q$-gram with DBSCAN.
Figure 2 - String distance algorithms are employed for text mining.
(a) bzip with HDBSCAN.
(b) Deflate with HDBSCAN.
(c) gzip with HDBSCAN.
(d) XZ with DBSCAN.
Figure 3 - Normalized compression distance algorithms are employed for text mining.
(a) zlib with HDBSCAN.
(b) Zstd with HDBSCAN.
Figure 4 - Normalized compression distance algorithms are employed for text mining.
(a) Doc2Vec with Agglomerative.
(b) SBERT with Affinity.
Figure 5 - Machine learning algorithms are employed for text mining.
The results of sensitivity analysis using the Mantel model
The matrix of Mantel correlations between the employed text mining algorithms is presented in Figure 6.
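The Mantel statistic $r_M$ reported in Figure 6 is the Pearson correlation between the upper triangles of two distance matrices, with significance assessed by permuting the rows and columns of one matrix in tandem. A minimal standard-library sketch, using toy one-dimensional data rather than the paper's 784 source points:

```python
import random
from itertools import combinations

def pearson(u, v):
    """Pearson correlation between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = (sum((a - mu) ** 2 for a in u) *
           sum((b - mv) ** 2 for b in v)) ** 0.5
    return num / den

def mantel(d1, d2, permutations=999, seed=0):
    """Mantel r_M between two symmetric distance matrices,
    with a permutation p-value."""
    n = len(d1)
    pairs = list(combinations(range(n), 2))  # upper triangle, i < j
    u = [d1[i][j] for i, j in pairs]
    v = [d2[i][j] for i, j in pairs]
    r_obs = pearson(u, v)
    rng = random.Random(seed)
    idx, hits = list(range(n)), 0
    for _ in range(permutations):
        rng.shuffle(idx)  # permute rows and columns of d2 together
        v_perm = [d2[idx[i]][idx[j]] for i, j in pairs]
        if abs(pearson(u, v_perm)) >= abs(r_obs):
            hits += 1
    return r_obs, (hits + 1) / (permutations + 1)

# Toy example: two distance matrices over the same six points,
# one linear and one quadratic in the point separation.
x = [0.0, 0.4, 1.1, 2.0, 3.2, 4.5]
d1 = [[abs(a - b) for b in x] for a in x]
d2 = [[(a - b) ** 2 for b in x] for a in x]
r, p = mantel(d1, d2)
print(round(r, 3), p)
```

Since the two matrices are monotone transforms of the same separations, $r_M$ is close to 1 and the permutation p-value is small.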
Figure 6 - Matrix of Mantel correlations between the employed text mining algorithms (both tokenized and non-tokenized versions), computed from the distances between all pairs of 784 source points. The rows and columns of the matrix represent each of the 28 text mining algorithms. The color of a cell corresponds to the magnitude of the Mantel $r_M$ correlation between the two algorithms' distances, indicated by the intersection of the row and column.