
Comparative Analysis of Text Mining and Clustering Techniques for Assessing Functional Dependency between Manual Test Cases

Appendix

This appendix provides the supplementary materials for the article Performance Comparison of Different Text Mining and Clustering Techniques for Functional Dependency.

Plotting UMAP results

Figures 1 and 2 illustrate the results of utilizing seven different string distance algorithms for text mining, where the Agglomerative algorithm is used for clustering in Figure 1. A total of five clusters were obtained by the Agglomerative clustering algorithm in Figures 1a, 1b, 1c, and 1d, respectively. The results of using several normalized compression distance algorithms for text mining with the DBSCAN and HDBSCAN clustering algorithms are presented in Figures 3 and 4. As emphasized before, the HDBSCAN algorithm can place non-clusterable data points into a separate cluster, which in this study can be interpreted as independent test cases. Generally, the HDBSCAN algorithm produces more clusters than all the other clustering algorithms employed. As shown in Figures 3a, 3b, 3c, 4a, and 4b, more than 200 clusters are generated, where each color represents a unique cluster. However, combining the same text mining methods with DBSCAN places all test cases in a single cluster, as shown in Figure 3d. The visualization results of employing two machine learning approaches are shown in Figure 5, where Figure 5a represents the combination of Doc2Vec with Agglomerative and Figure 5b the combination of SBERT with Affinity, respectively.
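The noise-labeling behavior emphasized above can be sketched with a small example. HDBSCAN and DBSCAN both mark non-clusterable points with the label -1, which in this study corresponds to independent test cases. The sketch below uses scikit-learn's DBSCAN with illustrative `eps` and `min_samples` values that are not taken from the study:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Three tightly packed points plus one isolated outlier.
points = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [10.0, 10.0]])

# eps and min_samples are illustrative values, not the study's settings.
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(points)

# The first three points share one cluster; the outlier is labeled -1 (noise).
print(labels.tolist())  # → [0, 0, 0, -1]
```

In a downstream analysis, the points labeled -1 would simply be read off as the "independent" group rather than forced into a cluster.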

Figure 1 - String distance algorithms are employed for text mining: (a) Overlap coefficient with Agglomerative; (b) Ratcliff-Obershelp with Agglomerative; (c) Jaro with Agglomerative; (d) Levenshtein with Agglomerative.
Figure 2 - String distance algorithms are employed for text mining: (a) Jaccard with Affinity; (b) Sorensen-Dice coefficient with Affinity; (c) q-gram with DBSCAN.
Figure 3 - Normalized compression distance algorithms are employed for text mining: (a) bzip with HDBSCAN; (b) Deflate with HDBSCAN; (c) gzip with HDBSCAN; (d) XZ with DBSCAN.
Figure 4 - Normalized compression distance algorithms are employed for text mining: (a) zlib with HDBSCAN; (b) Zstd with HDBSCAN.
Figure 5 - Machine learning algorithms are employed for text mining: (a) Doc2Vec with Agglomerative; (b) SBERT with Affinity.
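The normalized compression distance underlying Figures 3 and 4 is defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length under some compressor. A minimal sketch with standard-library compressors (the example strings and the exact preprocessing are illustrative, not taken from the study's test cases):

```python
import bz2
import lzma
import zlib

def ncd(x: bytes, y: bytes, compress=zlib.compress) -> float:
    """Normalized compression distance: smaller means more similar."""
    cx, cy, cxy = len(compress(x)), len(compress(y)), len(compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Hypothetical manual test-case descriptions.
a = b"Open the login page and enter valid credentials." * 4
b = b"Open the login page and enter invalid credentials." * 4
c = b"Measure battery drain while streaming video overnight." * 4

# Similar texts compress well together, so their NCD is lower.
print(ncd(a, b) < ncd(a, c))  # True with zlib
print(ncd(a, c, bz2.compress), ncd(a, c, lzma.compress))
```

Swapping `compress` for `bz2.compress` or `lzma.compress` corresponds to the different compressor choices (bzip, XZ, and so on) compared across the figures.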

The results of sensitivity analysis using the Mantel model

The matrix of Mantel correlations between the employed text mining algorithms is presented in Figure 6.


Figure 6 - Matrix of Mantel correlations between the distances produced by the employed text mining algorithms (both tokenized and non-tokenized versions) over all pairs of the 784 source points. The rows and columns of the matrix represent each of the 28 text mining algorithms. The color of a cell corresponds to the magnitude of the Mantel correlation $r_M$ between the distances of the two algorithms indicated by the intersection of its row and column.
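The Mantel statistic correlates two distance matrices defined over the same items. A minimal permutation-test sketch with NumPy (illustrative only: the matrices, sizes, and permutation count below are made up, whereas the study's matrices cover 784 points):

```python
import numpy as np

def mantel(d1: np.ndarray, d2: np.ndarray, permutations: int = 999, seed: int = 0):
    """Pearson correlation between the upper triangles of two symmetric
    distance matrices, with a permutation p-value from relabeling d2's items."""
    n = d1.shape[0]
    iu = np.triu_indices(n, k=1)
    r_obs = np.corrcoef(d1[iu], d2[iu])[0, 1]
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(permutations):
        perm = rng.permutation(n)
        r = np.corrcoef(d1[iu], d2[perm][:, perm][iu])[0, 1]
        if r >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (permutations + 1)

# Two distance matrices over 6 hypothetical points: d2 is d1 plus small noise,
# so the Mantel correlation should be high and the p-value small.
rng = np.random.default_rng(1)
x = rng.random((6, 2))
d1 = np.linalg.norm(x[:, None] - x[None, :], axis=-1)
d2 = d1 + rng.normal(scale=0.01, size=d1.shape)
d2 = (d2 + d2.T) / 2
np.fill_diagonal(d2, 0.0)

r, p = mantel(d1, d2)
print(round(r, 3), p)  # r close to 1, small p-value
```

Each cell of Figure 6 corresponds to one such $r_M$ computed between the distance matrices of two text mining algorithms.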