Quantifying or bounding UMAP error? #964
Replies: 1 comment
-
If you really want to measure the error in preserving distances after dimensionality reduction, then the RMS error of the distances in the high and low-dimensional space can be calculated in principle (effectively the same as the "STRESS" criterion in the multidimensional scaling literature). But the cost is obviously O(N^2) so it's not practical to do so with large datasets. Also, prepare for disappointment with UMAP because it doesn't attempt to reproduce distances. Methods like UMAP (and t-SNE, PaCMAP etc.) are effectively attempting to reproduce the k-nearest neighbor graph, so one way to measure the success of the layout is to compare the k-nearest neighbors of each object in the original space with those in the embedded space. You need to decide on a value (or many values) of k, though. For longer-range structure there are several methods in the literature: you could repeatedly sample same three points in the ambient space, measure their inter-point distances and then see if the resulting order of distances is preserved in the embedded space. You could calculate the Pearson correlation coefficient between pair-wise distances in the original and embedded spaces. My unsolicited opinion based on having calculated these measures on a large number of datasets and embedding methods is that they are not very satisfactory and tend to just measure how close your embedding is to what carrying out PCA would give you. |
Beta Was this translation helpful? Give feedback.
-
Hello! I'm pretty new to UMAP and dimensionality reduction. I'm wondering, are there ways to calculate the "error" of UMAP or any other dimensionality reduction techniques? For example, if I have 3 data points in a high dimensional space N-D, and I bring it down to 2-D, I can preserve all the distances and the error would be 0. But if I have 4, 10, or 100000 data points in an N-D space, I'd imagine the "error" of reducing the points down to 2-D could be increasingly large. Or maybe the higher the dimension N, the higher the error could be. Has anyone tried quantifying the error of these techniques? Hopefully this question makes sense.
Beta Was this translation helpful? Give feedback.
All reactions