DualSimplex algorithm's R package

About the project

This package implements the Dual Simplex method presented in the following paper:

Non-negative matrix factorization and deconvolution as dual simplex problem
Denis Kleverov, Ekaterina Aladyeva, Alexey Serdyukov, Maxim Artyomov
bioRxiv 2024.04.09.588652; doi: https://doi.org/10.1101/2024.04.09.588652

In essence, this is an NMF algorithm that factorizes a non-negative matrix V into two non-negative matrices W and H.

The key feature is that it operates in a lower dimensional space of a Sinkhorn-transformed original matrix, which aligns both row and column data points of the original matrix via two interrelated geometrical simplex structures.

Therefore, in this space we only need to search for K solution points, each (K-1)-dimensional, where K is the number of components, i.e. the number of columns of W and rows of H.
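To give some intuition for the Sinkhorn transformation mentioned above, here is a minimal base-R sketch of alternating row and column scaling. It is not the package's implementation: the function name `sinkhorn_sketch`, the stopping rule, and the choice of target sums (rows sum to the number of columns, columns sum to the number of rows) are assumptions made only for illustration.

```r
# Illustrative Sinkhorn-style scaling in base R (NOT the DualSimplex implementation).
# Assumption: rows are rescaled to sum to ncol(V) and columns to sum to nrow(V),
# so that both targets are consistent for a rectangular matrix.
sinkhorn_sketch <- function(V, n_iter = 100, tol = 1e-10) {
  stopifnot(is.matrix(V), all(V > 0))
  S <- V
  for (i in seq_len(n_iter)) {
    S <- S * (ncol(S) / rowSums(S))        # scale every row to sum to ncol(S)
    S <- t(t(S) * (nrow(S) / colSums(S)))  # scale every column to sum to nrow(S)
    if (max(abs(rowSums(S) - ncol(S))) < tol) break  # rows still balanced: converged
  }
  S
}

set.seed(1)
V <- matrix(runif(20, min = 0.1, max = 1), nrow = 4, ncol = 5)
S <- sinkhorn_sketch(V)

round(rowSums(S), 6)  # row sums after scaling
round(colSums(S), 6)  # column sums after scaling
```

After scaling, rows and columns of the same matrix are both normalized, which is what lets the method treat row points and column points symmetrically.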

This method can be applied to:

The general NMF problem, where it outperforms commonly used methods
Bulk RNAseq deconvolution
Single cell clustering

Getting Started

Prerequisites

This is an R package, so you need to have R installed. We tested our code using RStudio and RStudio Server as IDE environments. The package relies on Bioconductor and devtools, so you need both installed first.

# in your R environment
install.packages("BiocManager")
install.packages("devtools")

Installation

Install from GitHub

devtools::install_github("artyomovlab/DualSimplex")

Alternatively, load the package from a local clone of this repository

devtools::load_all("path_to_code_directory")

(Not available yet.) After publication, installation will be:

install.packages("DualSimplex")

Usage

Check our additional paper repository for more examples of NMF, bulk-RNAseq deconvolution and single cell clustering

Read/Generate the data

library("DualSimplex")
library(dplyr)

N <- 100 # number of samples (e.g. mixtures)
M <- 10000 # number of features (e.g. genes)
K <- 3 # Number of pure components

sim <- create_simulation(n_genes = M,
                         n_samples = N,
                         n_cell_types = K,
                         with_marker_genes = FALSE)
sim <- sim %>% add_noise(noise_deviation = 0.2)

data_raw <- sim$data
true_W <- sim$basis
true_H <- sim$proportions
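For intuition about what such a simulated dataset represents, the following base-R sketch builds a small mixture by hand. It does not use or reproduce `create_simulation`; all `*_toy` names, the exponential sampling, and the matrix orientation (features in rows, samples in columns) are assumptions for illustration only.

```r
# Toy mixture built by hand (illustrative; independent of create_simulation()).
# Assumed orientation: V is features x samples, W is features x components,
# H is components x samples.
set.seed(42)
n_features_toy   <- 200
n_samples_toy    <- 20
n_components_toy <- 3

# Non-negative "pure component" profiles
W_toy <- matrix(rexp(n_features_toy * n_components_toy), nrow = n_features_toy)

# Non-negative mixing proportions; each sample's proportions sum to 1
H_toy <- matrix(rexp(n_components_toy * n_samples_toy), nrow = n_components_toy)
H_toy <- t(t(H_toy) / colSums(H_toy))

# Each observed sample is a convex combination of the pure profiles
V_toy <- W_toy %*% H_toy

dim(V_toy)
range(colSums(H_toy))
```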

Create a Solver object

This performs Sinkhorn scaling, SVD projection, and data annotation

dso <- DualSimplexSolver$new()
dso$set_data(data_raw) # run Sinkhorn procedure
dso$project(K) # project to SVD space
dso$plot_projected("zero_distance", "zero_distance", with_solution = TRUE, use_dims = list(2:3)) # visualize the projection
dso$set_display_dims(list(2:3)) # remember the use_dims choice, to call just dso$plot_projected()
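The idea behind projecting to an SVD space can be sketched with base R's `svd()`. This is only an illustration on a toy matrix, not what `dso$project(K)` does internally; the variable names and the exact form of the projection are assumptions.

```r
# Rank-K projection with base R's svd() (illustrative, not the package's code).
set.seed(7)
K_toy <- 3
W_toy <- matrix(runif(50 * K_toy), nrow = 50)
H_toy <- matrix(runif(K_toy * 12), nrow = K_toy)
V_toy <- W_toy %*% H_toy                      # a 50 x 12 matrix of rank K_toy

dec <- svd(V_toy)
top <- seq_len(K_toy)

# Share of total variance captured by each singular vector pair
variance_explained <- dec$d^2 / sum(dec$d^2)

# K-dimensional coordinates of the row points and of the column points
rows_projected <- V_toy %*% dec$v[, top]      # 50 x K_toy
cols_projected <- t(dec$u[, top]) %*% V_toy   # K_toy x 12

# Reconstruction from the first K_toy singular triplets
V_rank_k <- dec$u[, top] %*% diag(dec$d[top]) %*% t(dec$v[, top])
max(abs(V_toy - V_rank_k))                    # reconstruction error of the rank-K approximation
```

Because the toy matrix has rank K by construction, the first K singular vectors capture essentially all of its variance; on real, noisy data the remaining singular values indicate how much is lost by the projection.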

(Optional) Filter the data/remove outliers

Do this step only if you want to remove points from your dataset

plane_distance_threshold <- 0.05 # Try several values to see the effect: start high and lower it gradually
zero_distance_threshold <- 1
dso$distance_filter(plane_d_lt = plane_distance_threshold, zero_d_lt = zero_distance_threshold, genes = TRUE)
dso$project(K)
dso$plot_projection_diagnostics() # See the distribution of points distances
dso$plot_svd_history() # observe changes in SVD variance explained

Identify simplex corners in the projected space

Initialize solution

dso$init_solution("random")
dso$plot_projected("zero_distance", "zero_distance")

Run optimization

dso$optim_solution(
    5000,
    optim_config(
        coef_hinge_H = 1,
        coef_hinge_W = 1,
        coef_der_X = 0.001, 
        coef_der_Omega = 0.001
    )
)
dso$plot_projected("zero_distance", "zero_distance")
dso$plot_error_history()

Get solution

solution <- dso$finalize_solution()
result_W <- solution$W
result_H <- solution$H
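One simple way to judge a factorization is the relative Frobenius reconstruction error. The helper below is a hypothetical convenience function, not part of DualSimplex, and is demonstrated on toy matrices. It could be applied to `data_raw`, `result_W`, and `result_H` as well, assuming their dimensions conform (W with as many rows as the data, H with as many columns).

```r
# Hypothetical helper (not part of the DualSimplex API):
# relative Frobenius error of approximating V by W %*% H.
relative_error <- function(V, W, H) {
  stopifnot(nrow(V) == nrow(W), ncol(W) == nrow(H), ncol(V) == ncol(H))
  norm(V - W %*% H, type = "F") / norm(V, type = "F")
}

set.seed(3)
W_toy <- matrix(runif(30 * 2), nrow = 30)
H_toy <- matrix(runif(2 * 8), nrow = 2)
V_toy <- W_toy %*% H_toy

err_exact     <- relative_error(V_toy, W_toy, H_toy)        # exact factors
err_perturbed <- relative_error(V_toy, W_toy * 1.1, H_toy)  # product scaled by 1.1, so error is 0.1
```

Note that NMF solutions are determined only up to the ordering of the components, so comparing recovered factors against `true_W` and `true_H` directly requires matching components first; the reconstruction error does not depend on that ordering.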

Save/Load the results

# Save
dso$save_state("directory_to_save")

# Load
dso <- DualSimplexSolver$from_state("directory_to_save")

Contacts

Denis Kleverov (@denis_kleverov) (LinkedIn)
Ekaterina Aladyeva (@AladyevaE)
Alexey Serdyukov (email)
Prof. Maxim Artyomov (@maxim_artyomov) (email)

For developers

Code structure & Guidelines

The following files in the R/ directory represent the different stages of the DualSimplex pipeline:

0. simulation.R
1. annotation.R
2. filtering.R
3. sinkhorn.R
4. projection.R
5. initialization.R
6. optimization.R
7. post_analysis.R
8. benchmarking.R

Ideally, main logic functions in a stage shouldn't use functions from another stage, and a downstream stage should only use the objects generated on the previous stage as its input.

Then, either the user or DualSimplexSolver uses the main functions from those files to implement the whole control flow.

This rule of thumb leads to linear code logic and low code coupling, which makes it simple to debug and introduce changes.

Checking your new functions

Please document your code with roxygen2 comments (as is done for the rest of the package)

Regenerate NAMESPACE and additional files

devtools::document()

Ensure the standard devtools check returns 0 errors

devtools::check()

Ensure the package is installable from your repository

devtools::install_github("your_github_nickname/DualSimplex@your_branch_name")
