Merge branch 'master' into update_mcf
agitter authored Jun 24, 2024
2 parents a9d6869 + 55979e8 commit dfd2e14
Showing 12 changed files with 124 additions and 17 deletions.
13 changes: 7 additions & 6 deletions .github/workflows/test-spras.yml
@@ -78,11 +78,12 @@ jobs:
run: |
docker pull reedcompbio/omics-integrator-1:latest
docker pull reedcompbio/omics-integrator-2:v2
-docker pull reedcompbio/pathlinker:latest
+docker pull reedcompbio/pathlinker:v2
docker pull reedcompbio/meo:latest
docker pull reedcompbio/mincostflow:v2
-docker pull reedcompbio/allpairs:latest
+docker pull reedcompbio/allpairs:v2
docker pull reedcompbio/domino:latest
+docker pull reedcompbio/py4cytoscape:v2
- name: Build Omics Integrator 1 Docker image
uses: docker/build-push-action@v1
with:
@@ -99,15 +100,15 @@ jobs:
dockerfile: docker-wrappers/OmicsIntegrator2/Dockerfile
repository: reedcompbio/omics-integrator-2
tags: v2
-cache_froms: reedcompbio/omics-integrator-2:v2
+cache_froms: reedcompbio/omics-integrator-2:latest
push: false
- name: Build PathLinker Docker image
uses: docker/build-push-action@v1
with:
path: docker-wrappers/PathLinker/.
dockerfile: docker-wrappers/PathLinker/Dockerfile
repository: reedcompbio/pathlinker
-tags: latest
+tags: v2
cache_froms: reedcompbio/pathlinker:latest
push: false
- name: Build Maximum Edge Orientation Docker image
@@ -134,7 +135,7 @@ jobs:
path: docker-wrappers/AllPairs/.
dockerfile: docker-wrappers/AllPairs/Dockerfile
repository: reedcompbio/allpairs
-tags: latest
+tags: v2
cache_froms: reedcompbio/allpairs:latest
push: false
- name: Build DOMINO Docker image
@@ -153,7 +154,7 @@ jobs:
dockerfile: docker-wrappers/Cytoscape/Dockerfile
repository: reedcompbio/py4cytoscape
tags: v2
-cache_froms: reedcompbio/py4cytoscape:v2
+cache_froms: reedcompbio/py4cytoscape:latest
push: false

# Run pre-commit checks on source files
2 changes: 0 additions & 2 deletions config/config.yaml
@@ -45,10 +45,8 @@ algorithms:
params:
include: true
run1:
r: [5]
b: [5, 6]
w: np.linspace(0,5,2)
g: [3]
d: [10]

- name: "omicsintegrator2"
77 changes: 77 additions & 0 deletions doc/pathlinker.md
@@ -0,0 +1,77 @@
# Running PathLinker
Before running pathway reconstruction algorithms through SPRAS, it can be informative to see how they are typically run without SPRAS.
This document describes PathLinker and how to run the PathLinker software.

## Step 1: Read about PathLinker

PathLinker takes as input (1) a weighted, directed protein-protein interaction (PPI) network, (2) two sets of nodes: a source set (representing receptors of a pathway of interest) and a target set (representing transcriptional regulators of a pathway of interest), and (3) an integer _k_. PathLinker efficiently computes the _k_-shortest paths from any source to any target and returns the subnetwork of the top _k_ paths as the pathway reconstruction. Later work expanded PathLinker by incorporating protein localization information to re-score tied paths, dubbed Localized PathLinker (LocPL).

**References:**
- Ritz et al. Pathways on demand: automated reconstruction of human signaling networks. _NPJ Systems Biology and Applications._ 2016. [doi:10.1038/npjsba.2016.2](https://doi.org/10.1038/npjsba.2016.2)
- Youssef, Law, and Ritz. Integrating protein localization with automated signaling pathway reconstruction. _BMC Bioinformatics._ 2019. [doi:10.1186/s12859-019-3077-x](https://doi.org/10.1186/s12859-019-3077-x)

We will focus on the original 2016 version of PathLinker.
Start by reading this manuscript to gain a high-level understanding of its algorithm, functionality, and motivation.
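PathLinker's actual implementation differs in its scoring and edge-weight transformations, but the core idea can be sketched with networkx's `shortest_simple_paths` (a Yen-style k-shortest-paths routine). This is a toy illustration with hypothetical node names, not the authors' code:

```python
import itertools

import networkx as nx

# Toy weighted, directed PPI network (hypothetical node names).
G = nx.DiGraph()
G.add_weighted_edges_from([
    ("R1", "A", 1.0), ("R1", "B", 2.0), ("A", "B", 0.5),
    ("A", "T1", 1.0), ("B", "T1", 1.0),
])

# PathLinker-style construction: connect a super-source to every receptor
# and every transcriptional regulator to a super-target, then compute the
# k shortest source-to-target paths.
G.add_edge("SRC", "R1", weight=0.0)
G.add_edge("T1", "SNK", weight=0.0)

k = 2
paths = list(itertools.islice(
    nx.shortest_simple_paths(G, "SRC", "SNK", weight="weight"), k))

# The reconstruction is the union of edges on the top k paths,
# with the artificial SRC/SNK endpoints dropped.
edges = {(u, v) for p in paths for u, v in zip(p, p[1:])
         if "SRC" not in (u, v) and "SNK" not in (u, v)}
print(paths[0])   # ['SRC', 'R1', 'A', 'T1', 'SNK']
print(sorted(edges))
```

Note that a path can appear in the top k even when all of its edges are already covered by shorter paths, which is why PathLinker reports path ranks in addition to the subnetwork.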

## Step 2: Get the PathLinker files

From the PathLinker GitHub repository <https://github.com/Murali-group/PathLinker>, click Code and Download ZIP to download all the source and data files.
Unzip them on your local file system into a directory you’ll use for this project.

## Step 3: Install Anaconda (optional)

If you do not already have Anaconda on your computer, install Anaconda so you can manage conda environments.
If you haven’t used conda before for Python, this [blog post](https://astrobiomike.github.io/unix/conda-intro) gives an overview of conda and why it is useful.
You don’t need most of the commands in it, but it can be a reference.
You can download Anaconda from <https://www.anaconda.com/download> and follow the installation instructions.
If the installer gives you the option to add Anaconda to your system `PATH`, which will make it your default version of Python, we recommend enabling it.
Anaconda will give you an initial default conda environment and some useful packages.
A conda environment is an isolated collection of Python packages that enables you to have multiple versions of Python and packages installed in parallel for different projects, which often require different versions of their dependencies.

## Step 4: Create a conda environment for PathLinker

Create a new conda environment for PathLinker called "pathlinker" with Python 3.5 and install the dependencies from a requirements file into that environment.
PathLinker uses old versions of Python and dependencies like networkx, a package for working with graphs in Python.
The `requirements.txt` file contains an uninstallable package, so open it in a text editor and delete the line
```
pkg-resources==0.0.0
```

Then, from the project directory you created that contains the `requirements.txt` file from the PathLinker repository, run:
```
conda create -n pathlinker python=3.5
conda activate pathlinker
pip install -r requirements.txt
```
The first command creates a new environment called "pathlinker" with the old version of Python.
You only do this once.
The second command activates that environment so you can use those packages.
You need to do this second command every time before running PathLinker.
The third command installs the required dependencies and is only needed once.

## Step 5: Run PathLinker on example data

Change (`cd`) into the example directory and try running the example command
```
python ../run.py sample-in-net.txt sample-in-nodetypes.txt
```
If it works, open the two input files and the output file(s).
See if what PathLinker has done makes sense.
The input network file lists one edge in the graph per line.
The node types file tells you which nodes are the special sources and targets.
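As a rough sketch of what reading these inputs involves (the whitespace-separated column layout and the `source`/`target` labels are assumptions based on the sample files, so check them against `sample-in-net.txt` and `sample-in-nodetypes.txt`):

```python
# Minimal parsers for the two PathLinker inputs. The column layout here
# ("tail head weight" edges; "node type" pairs) is an assumption based on
# the sample files, not a specification.

def parse_network(lines):
    """Return a list of (tail, head, weight) edges, skipping comment lines."""
    edges = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        tail, head, weight = line.split()[:3]
        edges.append((tail, head, float(weight)))
    return edges

def parse_node_types(lines):
    """Return the (sources, targets) node sets."""
    sources, targets = set(), set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        node, node_type = line.split()[:2]
        if node_type == "source":
            sources.add(node)
        elif node_type == "target":
            targets.add(node)
    return sources, targets

net = ["#tail\thead\tweight", "A\tB\t0.9", "B\tC\t0.5"]
types = ["#node\tnode_symbol", "A\tsource", "C\ttarget"]
print(parse_network(net))       # [('A', 'B', 0.9), ('B', 'C', 0.5)]
print(parse_node_types(types))  # ({'A'}, {'C'})
```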

## Step 6: Change the PathLinker options and see how the behavior changes

PathLinker's command line interface is not extensively documented in its GitHub repository.
The supported command line arguments are listed in the source code at <https://github.com/Murali-group/PathLinker/blob/master/run.py#L32>.
These lines list the argument, the default value, and a short description.
For example, the `-k` argument sets the value of _k_ discussed in the manuscript, the number of shortest paths.
Try setting this to a very small value, such as 1, 2, or 3.
That will make it easier to inspect the output files.
Try adjusting other PathLinker arguments and observe the effects on the output.
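To build intuition for what `-k` controls, here is a networkx sketch on a toy graph (again an illustration, not PathLinker itself): with three parallel source-to-target routes of increasing total weight, raising k adds routes, and therefore edges, in order of path weight.

```python
import itertools

import networkx as nx

# Three parallel routes from S to T with increasing total weight.
G = nx.DiGraph()
G.add_weighted_edges_from([
    ("S", "A", 1), ("A", "T", 1),
    ("S", "B", 2), ("B", "T", 2),
    ("S", "C", 3), ("C", "T", 3),
])

def top_k_edges(k):
    """Union of edges over the k shortest S-to-T paths."""
    paths = itertools.islice(
        nx.shortest_simple_paths(G, "S", "T", weight="weight"), k)
    return {(u, v) for p in paths for u, v in zip(p, p[1:])}

for k in (1, 2, 3):
    print(k, sorted(top_k_edges(k)))
```

Each increment of k here admits one more route, so the reconstructed subnetwork grows from 2 to 4 to 6 edges.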

## Note for Windows users

When running PathLinker or other command line software on Windows, the default Command Prompt is not recommended.
[Git Bash](https://gitforwindows.org/) is one recommended alternative terminal that is a standalone program.
The [Windows Subsystem for Linux](https://learn.microsoft.com/en-us/windows/wsl/install) also runs a full Linux distribution through Windows, which includes a terminal.
3 changes: 3 additions & 0 deletions docker-wrappers/AllPairs/Dockerfile
@@ -1,6 +1,9 @@
# AllPairs wrapper
FROM python:3.9-alpine3.16

+# bash is required for dsub in the All of Us cloud environment
+RUN apk add --no-cache bash
+
WORKDIR /AllPairs

RUN pip install networkx==2.6.3
4 changes: 4 additions & 0 deletions docker-wrappers/AllPairs/README.md
@@ -26,3 +26,7 @@ The Docker wrapper can be tested with `pytest -k test_ap.py` from the root of th
## Notes
- The `all-pairs-shortest-paths.py` code is located locally in SPRAS (since the code is short). It is under `docker-wrappers/AllPairs`.
- Samples of an input network and source/target file are located under test/AllPairs/input.

+## Versions:
+- v1: Initial version. Copies source file from SPRAS repository.
+- v2: Add bash, which is not available in Alpine Linux.
3 changes: 2 additions & 1 deletion docker-wrappers/PathLinker/Dockerfile
@@ -4,7 +4,8 @@ FROM python:3.5.10-alpine

# gettext is required for the envsubst command
# See https://github.com/haskell/cabal/issues/6126 regarding wget
-RUN apk add --no-cache ca-certificates gettext wget
+# bash is required for dsub in the All of Us cloud environment
+RUN apk add --no-cache ca-certificates gettext wget bash

WORKDIR /PathLinker
COPY pathlinker-files.txt .
4 changes: 4 additions & 0 deletions docker-wrappers/PathLinker/README.md
@@ -27,6 +27,10 @@ docker run -w /data --mount type=bind,source=/${PWD},target=/data reedcompbio/pa
This will run PathLinker on the test input files and write the output files to the root of the `spras` repository.
Windows users may need to escape the absolute paths so that `/data` becomes `//data`, etc.

+## Versions:
+- v1: Initial version. Copies PathLinker source files from GitHub and pip installs packages from requirements file.
+- v2: Add bash, which is not available in Alpine Linux.

## TODO
- Attribute https://github.com/Murali-group/PathLinker
- Document usage
3 changes: 1 addition & 2 deletions spras/allpairs.py
@@ -94,8 +94,7 @@ def run(nodetypes=None, network=None, output_file=None, container_framework="doc

print('Running All Pairs Shortest Paths with arguments: {}'.format(' '.join(command)), flush=True)

-container_suffix = "allpairs"
-
+container_suffix = "allpairs:v2"
out = run_container(
container_framework,
container_suffix,
20 changes: 15 additions & 5 deletions spras/analysis/ml.py
@@ -37,7 +37,7 @@ def summarize_networks(file_paths: Iterable[Union[str, PathLike]]) -> pd.DataFra
edge_tuples = []
for file in file_paths:
try:
-# collecting and sorting the edge pairs per algortihm
+# collecting and sorting the edge pairs per algorithm
with open(file, 'r') as f:
lines = f.readlines()

@@ -61,9 +61,8 @@ def summarize_networks(file_paths: Iterable[Union[str, PathLike]]) -> pd.DataFra
p = PurePath(file)
edge_tuples.append((p.parts[-2], edges))

-except FileNotFoundError:
-print(file, ' not found during ML analysis') # should not hit this
-continue
+except FileNotFoundError as exc:
+raise FileNotFoundError(str(file) + ' not found during ML analysis') from exc

# initially construct separate dataframes per algorithm
edge_dataframes = []
@@ -81,8 +80,19 @@
concated_df = pd.concat(edge_dataframes, axis=1, join='outer')
concated_df = concated_df.fillna(0)
concated_df = concated_df.astype('int64')

+# don't do ml post-processing if there is an empty dataframe or the number of samples is <= 1
+if concated_df.empty:
+raise ValueError("ML post-processing cannot proceed because the summarize network dataframe is empty.\nWe "
+"suggest setting ml include: false in the configuration file to avoid this error.")
+if min(concated_df.shape) <= 1:
+raise ValueError(f"ML post-processing cannot proceed because the available number of pathways is insufficient. "
+f"The ml post-processing requires more than one pathway, but currently "
+f"there are only {min(concated_df.shape)} pathways.")

return concated_df


def create_palette(column_names):
"""
Generates a dictionary mapping each column name (algorithm name)
@@ -141,7 +151,7 @@ def pca(dataframe: pd.DataFrame, output_png: str, output_var: str, output_coord:

# saving the coordinates of each algorithm
make_required_dirs(output_coord)
-coordinates_df = pd.DataFrame(X_pca, columns = ['PC' + str(i) for i in range(1, components+1)])
+coordinates_df = pd.DataFrame(X_pca, columns=['PC' + str(i) for i in range(1, components+1)])
coordinates_df.insert(0, 'algorithm', columns.tolist())
coordinates_df.to_csv(output_coord, sep='\t', index=False)

2 changes: 1 addition & 1 deletion spras/pathlinker.py
@@ -115,7 +115,7 @@ def run(nodetypes=None, network=None, output_file=None, k=None, container_framew

print('Running PathLinker with arguments: {}'.format(' '.join(command)), flush=True)

-container_suffix = "pathlinker"
+container_suffix = "pathlinker:v2"
out = run_container(container_framework,
container_suffix,
command,
1 change: 1 addition & 0 deletions test/ml/input/test-data-single/single.txt
@@ -0,0 +1 @@
+L M 1 U
9 changes: 9 additions & 0 deletions test/ml/test_ml.py
@@ -2,6 +2,7 @@
from pathlib import Path

import pandas as pd
+import pytest

import spras.analysis.ml as ml

@@ -25,6 +26,14 @@ def test_summarize_networks(self):
dataframe.to_csv(OUT_DIR + 'dataframe.csv')
assert filecmp.cmp(OUT_DIR + 'dataframe.csv', EXPECT_DIR + 'expected-dataframe.csv', shallow=False)

+def test_summarize_networks_empty(self):
+with pytest.raises(ValueError): # raises error if empty dataframe is used for post processing
+ml.summarize_networks([INPUT_DIR + 'test-data-empty/empty.txt'])
+
+def test_single_line(self):
+with pytest.raises(ValueError): # raises error if single line in file s.t. single row in dataframe is used for post processing
+ml.summarize_networks([INPUT_DIR + 'test-data-single/single.txt'])

def test_pca(self):
dataframe = ml.summarize_networks([INPUT_DIR + 'test-data-s1/s1.txt', INPUT_DIR + 'test-data-s2/s2.txt', INPUT_DIR + 'test-data-s3/s3.txt'])
ml.pca(dataframe, OUT_DIR + 'pca.png', OUT_DIR + 'pca-variance.txt',
