This is the implementation of the Dual Simplex method presented in the following paper:
Non-negative matrix factorization and deconvolution as dual simplex problem
Denis Kleverov, Ekaterina Aladyeva, Alexey Serdyukov, Maxim Artyomov
bioRxiv 2024.04.09.588652; doi: https://doi.org/10.1101/2024.04.09.588652
In essence, this is an NMF algorithm that factorizes a non-negative matrix V into two non-negative matrices W and H.
The key feature is that it operates in a lower dimensional space of a Sinkhorn-transformed original matrix, which aligns both row and column data points of the original matrix via two interrelated geometrical simplex structures.
Therefore, in this space we only need to search for K solution points, each (K-1)-dimensional (K is the number of components, i.e. the number of columns of W and the number of rows of H).
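The Sinkhorn transformation mentioned above can be illustrated with a minimal base R sketch. It is not the package's implementation: the function name `sinkhorn_sketch`, the toy matrix, and the target sums (rows sum to 1, columns sum to M/N) are assumptions chosen for illustration, and the normalization used inside DualSimplex may differ.

```r
# Minimal sketch of Sinkhorn scaling (illustrative, not the DualSimplex code).
# Alternately rescale rows and columns of a strictly positive matrix until
# both sets of sums are (nearly) constant.
set.seed(1)
V <- matrix(runif(20 * 5, min = 0.1, max = 1), nrow = 20, ncol = 5)

sinkhorn_sketch <- function(V, n_iter = 200) {
  M <- nrow(V)
  N <- ncol(V)
  S <- V
  for (i in seq_len(n_iter)) {
    S <- S / rowSums(S)                          # every row sums to 1
    S <- sweep(S, 2, colSums(S), "/") * (M / N)  # every column sums to M / N
  }
  S
}

S <- sinkhorn_sketch(V)
range(rowSums(S))  # row sums after scaling
range(colSums(S))  # column sums after scaling
```

The targets 1 and M/N are compatible because both imply the same total sum M for the whole matrix.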
This method can be applied to:
- The general NMF problem, where it outperforms commonly used methods
- Bulk RNAseq deconvolution
- Single cell clustering
This is an R package, so you need to have R installed. We tested our code using RStudio and RStudio Server as IDE environments. We actively use the Bioconductor and devtools packages, so you need both of them to install this package.
# in your R environment
install.packages("BiocManager")
install.packages("devtools")
Install from GitHub:
devtools::install_github("artyomovlab/DualSimplex")
Alternatively, load the package from a local copy of this repository (note that devtools::load_all loads the code for the current session rather than installing it):
devtools::load_all("path_to_code_directory")
(This does not work yet.) After publication, installation will be:
install.packages("DualSimplex")
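To confirm that the installation steps above succeeded, a small check such as the following can help. It is only a sketch: the variable names are illustrative, and it relies solely on base R's `requireNamespace`, which returns `FALSE` instead of failing when a package is missing.

```r
# Check which of the packages used in this README are available.
required <- c("BiocManager", "devtools", "DualSimplex")
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
print(installed)  # named logical vector, one entry per package
```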
Check our additional paper repository for more examples of NMF, bulk RNAseq deconvolution, and single cell clustering.
library("DualSimplex")
library(dplyr)
N <- 100 # number of samples (e.g. mixtures)
M <- 10000 # number of features (e.g. genes)
K <- 3 # Number of pure components
sim <- create_simulation(n_genes = M,
n_samples = N,
n_cell_types = K,
with_marker_genes = FALSE)
sim <- sim %>% add_noise(noise_deviation = 0.2)
data_raw <- sim$data
true_W <- sim$basis
true_H <- sim$proportions
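To make the shapes of the simulated objects concrete, here is a base R sketch of the model V = W H that does not use the package. The toy sizes, the `_toy` variable names, and the choice to make each column of H sum to 1 (as proportions would) are assumptions for illustration; `create_simulation` may generate its data differently.

```r
# Toy NMF model: features x samples matrix built from K non-negative components.
set.seed(42)
n_features <- 50    # plays the role of M
n_samples <- 8      # plays the role of N
n_components <- 3   # plays the role of K

W_toy <- matrix(rexp(n_features * n_components), nrow = n_features)  # basis, M x K
H_toy <- matrix(runif(n_components * n_samples), nrow = n_components)
H_toy <- sweep(H_toy, 2, colSums(H_toy), "/")  # each sample's proportions sum to 1

V_toy <- W_toy %*% H_toy  # data, M x N
dim(W_toy)
dim(H_toy)
dim(V_toy)
```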
This step performs Sinkhorn scaling, SVD projection, and data annotation.
dso <- DualSimplexSolver$new()
dso$set_data(data_raw) # run Sinkhorn procedure
dso$project(K) # project to SVD space
dso$plot_projected("zero_distance", "zero_distance", with_solution = TRUE, use_dims = list(2:3)) # visualize the projection
dso$set_display_dims(list(2:3)) # remember the use_dims choice, to call just dso$plot_projected()
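The reason a projection onto K dimensions can be lossless is that a noise-free product W H has rank at most K. The base R sketch below illustrates this on toy data; it is not what `dso$project` does internally (the package works on the Sinkhorn-scaled matrix), and all `_toy` names are illustrative.

```r
# A noise-free M x N matrix built from K components has only K non-zero
# singular values, so K right singular vectors reproduce it exactly.
set.seed(7)
k_toy <- 3
W_toy <- matrix(rexp(200 * k_toy), nrow = 200)
H_toy <- matrix(rexp(k_toy * 30), nrow = k_toy)
V_toy <- W_toy %*% H_toy

sv <- svd(V_toy)
signif(sv$d[1:5], 3)  # singular values beyond the third are numerically zero

basis <- sv$v[, seq_len(k_toy)]   # first K right singular vectors (N x K)
coords <- V_toy %*% basis         # each row as a point in K dimensions
V_back <- coords %*% t(basis)     # map the points back to the original space
max(abs(V_toy - V_back))          # reconstruction error
```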
This step is optional: use it only if you are willing to remove points from your dataset.
plane_distance_threshold <- 0.05 # Try several values: start with a large threshold and lower it gradually
zero_distance_threshold <- 1
dso$distance_filter(plane_d_lt = plane_distance_threshold, zero_d_lt = zero_distance_threshold, genes = TRUE)
dso$project(K)
dso$plot_projection_diagnostics() # See the distribution of points distances
dso$plot_svd_history() # observe changes in SVD variance explained
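As a rough intuition for the two thresholds used in the filtering step, the sketch below computes, for each row of a toy matrix, its distance to a K-dimensional subspace and its distance from the origin within that subspace. This is an assumed reading of "plane distance" and "zero distance"; the package's exact definitions and scaling may differ, and every name here is illustrative.

```r
# Per-row distances relative to a K-dimensional subspace found by SVD.
set.seed(11)
k_toy <- 3
W_toy <- matrix(rexp(300 * k_toy), nrow = 300)
H_toy <- matrix(rexp(k_toy * 20), nrow = k_toy)
V_noisy <- W_toy %*% H_toy + matrix(abs(rnorm(300 * 20, sd = 0.05)), nrow = 300)

basis <- svd(V_noisy)$v[, seq_len(k_toy)]   # orthonormal basis of the subspace
coords <- V_noisy %*% basis                 # coordinates inside the subspace
residual <- V_noisy - coords %*% t(basis)   # part of each row outside it

plane_distance <- sqrt(rowSums(residual^2)) # distance to the subspace
zero_distance <- sqrt(rowSums(coords^2))    # distance from the origin inside it

# Keep rows that lie close to the subspace (threshold chosen arbitrarily here).
threshold <- quantile(plane_distance, 0.95)
keep <- plane_distance < threshold
table(keep)
```

Because the basis is orthonormal, the two squared distances add up to the squared norm of each row, which is a quick sanity check on the computation.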
dso$init_solution("random")
dso$plot_projected("zero_distance", "zero_distance")
dso$optim_solution(
5000,
optim_config(
coef_hinge_H = 1,
coef_hinge_W = 1,
coef_der_X = 0.001,
coef_der_Omega = 0.001
)
)
dso$plot_projected("zero_distance", "zero_distance")
dso$plot_error_history()
solution <- dso$finalize_solution()
result_W <- solution$W
result_H <- solution$H
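When comparing a result with a known ground truth, keep in mind that an NMF solution is only defined up to the order and scale of its components. The base R sketch below shows one way to check reconstruction error and to match components by correlation. The helper `relative_error` and all `_toy` objects are hypothetical; whether `solution$W` and `solution$H` are on the same scale as `true_W` and `true_H` is not stated here, so treat this as an illustration rather than the package's evaluation procedure.

```r
# Relative Frobenius reconstruction error of a factorization.
relative_error <- function(V, W, H) {
  norm(V - W %*% H, type = "F") / norm(V, type = "F")
}

set.seed(3)
true_W_toy <- matrix(rexp(100 * 3), nrow = 100)
true_H_toy <- matrix(rexp(3 * 10), nrow = 3)
V_toy <- true_W_toy %*% true_H_toy

# A pretend estimate: same components, but permuted and rescaled.
perm <- c(3, 1, 2)
scales <- c(2, 0.5, 4)
est_W_toy <- true_W_toy[, perm] %*% diag(scales)
est_H_toy <- diag(1 / scales) %*% true_H_toy[perm, ]

relative_error(V_toy, est_W_toy, est_H_toy)  # the product is unchanged

# Match each true component to the most correlated estimated component.
cors <- cor(true_W_toy, est_W_toy)           # rows: true, columns: estimated
best_match <- apply(cors, 1, which.max)
best_match
```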
# Save
dso$save_state("directory_to_save")
# Load
dso <- DualSimplexSolver$from_state("directory_to_save")
- Denis Kleverov (@denis_kleverov) (LinkedIn)
- Ekaterina Aladyeva (@AladyevaE)
- Alexey Serdyukov (email)
- prof. Maxim Artyomov (@maxim_artyomov) (email)
The following files in the R/ directory represent different stages of the DualSimplex pipeline:
0. simulation.R
1. annotation.R
2. filtering.R
3. sinkhorn.R
4. projection.R
5. initialization.R
6. optimization.R
7. post_analysis.R
8. benchmarking.R
Ideally, main logic functions in a stage shouldn't use functions from another stage, and a downstream stage should only use the objects generated on the previous stage as its input.
Then either the user or DualSimplexSolver uses the main functions from those files to implement the whole control flow.
This rule of thumb leads to linear code logic and low code coupling, which makes it simple to debug and introduce changes.
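The staging rule described above can be sketched as follows. The functions `stage_scale`, `stage_project`, `stage_init`, and `run_pipeline` are hypothetical and far simpler than the real code in R/; the point is only that each stage consumes the previous stage's output, and a single driver owns the control flow.

```r
# Hypothetical stages: none of them calls another stage directly.
stage_scale <- function(V) {
  V / rowSums(V)  # simple row normalization as a stand-in for scaling
}

stage_project <- function(scaled, k) {
  basis <- svd(scaled)$v[, seq_len(k), drop = FALSE]
  list(coords = scaled %*% basis, basis = basis)
}

stage_init <- function(projection, k) {
  # pick k projected points at random as an initial solution
  idx <- sample(nrow(projection$coords), k)
  projection$coords[idx, , drop = FALSE]
}

# The driver (the user, or a solver object) wires the stages together.
run_pipeline <- function(V, k) {
  scaled <- stage_scale(V)
  projection <- stage_project(scaled, k)
  init <- stage_init(projection, k)
  list(scaled = scaled, projection = projection, init = init)
}

set.seed(5)
out <- run_pipeline(matrix(runif(60, min = 0.1, max = 1), nrow = 12), k = 3)
str(out, max.level = 1)
```

With this layout a stage can be tested or replaced in isolation, since its only dependency is the object handed to it.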
Please document your code with roxygen2 comments (as is done for the rest of the package).
- Regenerate NAMESPACE and additional files
devtools::document()
- Ensure the standard devtools check returns 0 errors
devtools::check()
- Ensure the package is installable from your repository
devtools::install_github("your_github_nickname/DualSimplex@your_branch_name")