Quantifying or bounding UMAP error? #964

harryli0088 · 2023-02-09T19:59:32Z

harryli0088
Feb 9, 2023

Hello! I'm pretty new to UMAP and dimensionality reduction. I'm wondering, are there ways to calculate the "error" of UMAP or any other dimensionality reduction techniques? For example, if I have 3 data points in a high dimensional space N-D, and I bring it down to 2-D, I can preserve all the distances and the error would be 0. But if I have 4, 10, or 100000 data points in an N-D space, I'd imagine the "error" of reducing the points down to 2-D could be increasingly large. Or maybe the higher the dimension N, the higher the error could be. Has anyone tried quantifying the error of these techniques? Hopefully this question makes sense.

jlmelville · 2023-03-03T05:42:50Z

jlmelville
Mar 3, 2023
Collaborator

If you really want to measure the error in preserving distances after dimensionality reduction, then the RMS error of the distances in the high and low-dimensional space can be calculated in principle (effectively the same as the "STRESS" criterion in the multidimensional scaling literature). But the cost is obviously O(N^2) so it's not practical to do so with large datasets. Also, prepare for disappointment with UMAP because it doesn't attempt to reproduce distances.

Methods like UMAP (and t-SNE, PaCMAP etc.) are effectively attempting to reproduce the k-nearest neighbor graph, so one way to measure the success of the layout is to compare the k-nearest neighbors of each object in the original space with those in the embedded space. You need to decide on a value (or many values) of k, though.

For longer-range structure there are several methods in the literature: you could repeatedly sample same three points in the ambient space, measure their inter-point distances and then see if the resulting order of distances is preserved in the embedded space. You could calculate the Pearson correlation coefficient between pair-wise distances in the original and embedded spaces. My unsolicited opinion based on having calculated these measures on a large number of datasets and embedding methods is that they are not very satisfactory and tend to just measure how close your embedding is to what carrying out PCA would give you.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Quantifying or bounding UMAP error? #964

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 1 comment

{{title}}

Select a reply

Quantifying or bounding UMAP error? #964

harryli0088 Feb 9, 2023

Replies: 1 comment

jlmelville Mar 3, 2023 Collaborator

harryli0088
Feb 9, 2023

jlmelville
Mar 3, 2023
Collaborator