DualSimplex algorithm's R package

About the project

This is the implementation of the Dual Simplex method presented in the following paper:

Non-negative matrix factorization and deconvolution as dual simplex problem
Denis Kleverov, Ekaterina Aladyeva, Alexey Serdyukov, Maxim Artyomov
bioRxiv 2024.04.09.588652; doi: https://doi.org/10.1101/2024.04.09.588652

In essence, this is an NMF algorithm that factorizes a nonnegative matrix V into two nonnegative matrices W and H.

The key feature is that it operates in a lower dimensional space of a Sinkhorn-transformed original matrix, which aligns both row and column data points of the original matrix via two interrelated geometrical simplex structures.

Therefore, in this space we only need to search for K solution points, each (K-1)-dimensional (K is the number of components, i.e. the number of columns of W and rows of H).
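To give a feel for the Sinkhorn transformation mentioned above, here is a minimal base R sketch of alternating row and column normalization. It is a generic illustration under our own assumptions, not the package's implementation; `sinkhorn_sketch` and its `n_iter` argument are hypothetical names.

```r
# Illustrative Sinkhorn-style scaling in base R.
# Assumption: generic alternating normalization, NOT the DualSimplex code;
# `sinkhorn_sketch` is a hypothetical helper.
sinkhorn_sketch <- function(V, n_iter = 50) {
  stopifnot(is.matrix(V), all(V > 0), n_iter >= 1)
  V_col <- V
  for (i in seq_len(n_iter)) {
    V_row <- V_col / rowSums(V_col)                # every row sums to 1
    V_col <- sweep(V_row, 2, colSums(V_row), "/")  # every column sums to 1
  }
  list(row_normalized = V_row, col_normalized = V_col)
}

set.seed(1)
V <- matrix(runif(20 * 6, min = 0.1, max = 1), nrow = 20, ncol = 6)
res <- sinkhorn_sketch(V)

range(rowSums(res$row_normalized))  # row sums of the row-normalized matrix
range(colSums(res$col_normalized))  # column sums of the column-normalized matrix
```

The two returned matrices correspond to the row-wise and column-wise views of the same data. The package handles this step for you inside `dso$set_data()`.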

This method can be applied to:

  • The general NMF problem, where it outperforms commonly used methods
  • Bulk RNAseq deconvolution
  • Single cell clustering

Getting Started

Prerequisites

This is an R package, so you need to have R installed. We tested our code using RStudio and RStudio Server as IDE environments. The package relies on Bioconductor and devtools, so you need both to install it.

# in your R environment
install.packages("BiocManager")
install.packages("devtools")

Installation

Install from github

devtools::install_github("artyomovlab/DualSimplex")

Or, alternatively, load it from a local copy of this repository

devtools::load_all("path_to_code_directory")

(Not working yet.) After publication, installation will be:

install.packages("DualSimplex")

Usage

Check our additional paper repository for more examples of NMF, bulk RNAseq deconvolution, and single cell clustering.

Read/Generate the data

library("DualSimplex")
library(dplyr)

N <- 100 # number of samples (e.g. mixtures)
M <- 10000 # number of features (e.g. genes)
K <- 3 # Number of pure components

sim <- create_simulation(n_genes = M,
                         n_samples = N,
                         n_cell_types = K,
                         with_marker_genes = FALSE)
sim <- sim %>% add_noise(noise_deviation = 0.2)

data_raw <- sim$data
true_W <- sim$basis
true_H <- sim$proportions
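For intuition about what such a simulation represents, here is a base R sketch of the linear mixing model V = W H with additive noise. It does not reproduce `create_simulation` or `add_noise`; the distributions, the clipping at zero, and the features-by-samples orientation are assumptions made for illustration.

```r
# Illustrative mixing model in base R (NOT the package's simulation code).
# Assumed orientation: V is features x samples, W is features x components,
# H is components x samples.
set.seed(42)
M_demo <- 200  # number of features
N_demo <- 30   # number of samples
K_demo <- 3    # number of pure components

W_demo <- matrix(rexp(M_demo * K_demo), nrow = M_demo, ncol = K_demo)
H_demo <- matrix(rexp(K_demo * N_demo), nrow = K_demo, ncol = N_demo)
H_demo <- sweep(H_demo, 2, colSums(H_demo), "/")  # proportions sum to 1 per sample

V_clean <- W_demo %*% H_demo
# Add Gaussian noise and clip at zero so the matrix stays nonnegative
V_noisy <- pmax(V_clean + rnorm(M_demo * N_demo, sd = 0.2), 0)

dim(V_noisy)
```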

Create a Solver object

This performs Sinkhorn scaling, SVD projection, and data annotation

dso <- DualSimplexSolver$new()
dso$set_data(data_raw) # run Sinkhorn procedure
dso$project(K) # project to SVD space
dso$plot_projected("zero_distance", "zero_distance", with_solution = TRUE, use_dims = list(2:3)) # visualize the projection
dso$set_display_dims(list(2:3)) # remember the use_dims choice, to call just dso$plot_projected()
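The idea behind the SVD projection can be sketched in base R as follows. This is only an illustration on a small synthetic matrix: `dso$project(K)` works on the Sinkhorn-scaled data, and its internals may differ from what is shown here.

```r
# Illustrative SVD projection in base R (NOT the package's implementation).
set.seed(7)
M_demo <- 50
N_demo <- 20
K_demo <- 3

# A matrix of exact rank K, built as a product of two nonnegative factors
V_demo <- matrix(runif(M_demo * K_demo), M_demo, K_demo) %*%
  matrix(runif(K_demo * N_demo), K_demo, N_demo)

s <- svd(V_demo, nu = K_demo, nv = K_demo)

coords_samples  <- t(s$u) %*% V_demo  # K x N: coordinates of columns (samples)
coords_features <- V_demo %*% s$v     # M x K: coordinates of rows (features)

# For a rank-K matrix, projecting onto K singular vectors loses nothing
V_back <- s$u %*% coords_samples
max(abs(V_demo - V_back))
```

Both rows and columns get K coordinates, which is why the solution can be searched for in a K-dimensional space instead of the original one.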

(Optional) Filter the data/remove outliers

This step is only needed if you want to remove points from your dataset.

plane_distance_threshold <- 0.05 # Tune this over several runs: start with a large value and lower it
zero_distance_threshold <- 1
dso$distance_filter(plane_d_lt = plane_distance_threshold, zero_d_lt = zero_distance_threshold, genes = TRUE)
dso$project(K)
dso$plot_projection_diagnostics() # See the distribution of points distances
dso$plot_svd_history() # observe changes in SVD variance explained

Identify simplex corners in the projected space

Initialize solution

dso$init_solution("random")
dso$plot_projected("zero_distance", "zero_distance")

Run optimization

dso$optim_solution(
    5000,
    optim_config(
        coef_hinge_H = 1,
        coef_hinge_W = 1,
        coef_der_X = 0.001, 
        coef_der_Omega = 0.001
    )
)
dso$plot_projected("zero_distance", "zero_distance")
dso$plot_error_history()

Get solution

solution <- dso$finalize_solution()
result_W <- solution$W
result_H <- solution$H
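NMF solutions are only defined up to a permutation and scaling of the components, so comparing a result with the ground truth requires matching components first. Below is a base R sketch that does this greedily by column correlation. `match_components` is a hypothetical helper written for this README, not a function of the package.

```r
# Illustrative component matching in base R.
# `match_components` is a hypothetical helper, NOT part of DualSimplex.
# Returns, for each true component, the index of the matched estimated one.
match_components <- function(true_W, est_W) {
  cors <- cor(true_W, est_W)  # rows: true components, cols: estimated ones
  K <- ncol(true_W)
  assignment <- integer(K)
  for (step in seq_len(K)) {
    # Take the best remaining pair, then exclude its row and column
    idx <- which(cors == max(cors, na.rm = TRUE), arr.ind = TRUE)[1, ]
    assignment[idx[1]] <- idx[2]
    cors[idx[1], ] <- NA
    cors[, idx[2]] <- NA
  }
  assignment
}

# Demo on synthetic data: permuted, rescaled, slightly noisy copy of W_true
set.seed(11)
W_true <- matrix(runif(500 * 3), nrow = 500, ncol = 3)
W_est <- W_true[, c(3, 1, 2)] %*% diag(c(2, 0.5, 3)) +
  matrix(rnorm(500 * 3, sd = 0.01), nrow = 500, ncol = 3)

assignment <- match_components(W_true, W_est)
assignment
```

With the objects from this README, the call would be `match_components(true_W, result_W)`, assuming both matrices are oriented as features by components.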

Save/Load the results

# Save
dso$save_state("directory_to_save")

# Load
dso <- DualSimplexSolver$from_state("directory_to_save")

Contacts

For developers

Code structure & Guidelines

The following files in the R/ directory represent different stages of DualSimplex pipeline:

0. simulation.R
1. annotation.R
2. filtering.R
3. sinkhorn.R
4. projection.R
5. initialization.R
6. optimization.R
7. post_analysis.R
8. benchmarking.R

Ideally, main logic functions in a stage shouldn't use functions from another stage, and a downstream stage should only use the objects generated on the previous stage as its input.

Then, either the user or DualSimplexSolver uses the main functions from those files to implement the whole control flow.

This rule of thumb leads to linear code logic and low code coupling, which makes it simple to debug and introduce changes.

Checking your new functions

Please document your code with roxygen2 comments (as is done for the rest of the package).

  • Regenerate NAMESPACE and additional files
devtools::document()
  • Ensure the standard devtools check returns 0 errors
devtools::check()
  • Ensure the package is installable from your repository
devtools::install_github("your_github_nickname/DualSimplex@your_branch_name")