
A Step By Step Guide to Building a MuSIC Map

Yue Qin edited this page Jun 23, 2021 · 71 revisions

MuSIC is a multi-scale map of human cell architecture constructed through integration of multiple data types. It is described in the following publication: Qin et al., “Mapping cell structure across scales by fusing protein images and interactions”.

This page provides a step-by-step guide to creating the MuSIC map, including how to install the necessary packages and other dependencies, the specification of input data, and detailed documentation for each step and its parameters. An accompanying Jupyter notebook and bash script instantiate these steps with the series of code executions used to generate MuSIC v1 as presented in the paper. Notably, these steps can be used with other inputs and parameter settings to build other multi-scale cell models.

Table of contents

Requirements

Installation

Input data

Step 1: Generate gold-standard protein-protein proximity values

Step 2: Build random forest to predict protein-protein proximity from data embeddings

Step 3: Analyze proximity data to identify protein communities at progressive resolutions

Additional Resources

Requirements

  • Anaconda (optional but highly recommended)
  • APT packages: build-essential, python-dev, libxml2, libxml2-dev, zlib1g-dev, libigraph0-dev, libmpc-dev

Installation

  1. Create an Anaconda virtual environment. This is optional but highly recommended. Takes ~10 minutes.
conda create -n music python=3.6.2 anaconda
source activate music
  2. Download MuSIC and install dependencies.
git clone https://github.com/idekerlab/MuSIC.git
cd MuSIC
pip install -r ./installation/requirements.txt
  3. Install CliXO v1.0 and DDOT by running the following command.
./installation/install.sh

Input Data

The MuSIC pipeline begins with a collection of input files. Each file specifies the embedding of proteins using a different data type. For each data type, provide an embedding file in the following format, one protein per row:

Embedding file format

  • Column 1: Unique ID for each row
  • Column 2: Protein name
  • Columns 3 to k: The embedding vector of that protein (of dimension k-2)
  • All columns comma delimited
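As a quick sanity check, an embedding file in this format can be parsed with a few lines of Python (a minimal sketch using only the standard library; the toy file name and values below are hypothetical):

```python
import csv

# Write a tiny two-protein example in the expected format:
# column 1 = unique row ID, column 2 = protein name, columns 3..k = embedding vector.
with open("toy_embedding.csv", "w") as f:
    f.write("0,ACTB,0.12,0.80,0.33\n")
    f.write("1,ACTB,0.10,0.75,0.40\n")   # a second image for the same protein
    f.write("2,LMNA,0.55,0.20,0.91\n")

proteins, vectors = [], []
with open("toy_embedding.csv") as f:
    for row in csv.reader(f):
        proteins.append(row[1])
        vectors.append([float(x) for x in row[2:]])

# Within one data type, every protein must have embeddings of the same dimension.
assert len({len(v) for v in vectors}) == 1
print(len(vectors), "embeddings of dimension", len(vectors[0]))
```

Note that a protein may appear on multiple rows (e.g., one row per image), which is why column 1, not column 2, is the unique row identifier.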

For MuSIC v1, we demonstrated the MuSIC approach with embeddings from immunofluorescence (IF) images in the Human Protein Atlas (HPA) and affinity purification mass spectrometry (AP-MS) protein-protein interaction data in the BioPlex resource. These files are provided here as examples:

  • HPA immunofluorescence image embedding:

    The GitHub repository provides the 1024-dimension IF embeddings for the 661 proteins used in MuSIC v1 (/Examples/IF_image_embedding.csv). Many proteins have multiple images, leading to a total of 1,451 images for this dataset. All images are available on the HPA website. Images were embedded using DenseNet, with code available here.

  • BioPlex protein embedding:

    The GitHub repository provides the 1024-dimension AP-MS embeddings for the 661 proteins used in MuSIC v1 (/Examples/APMS_embedding.MuSIC.csv). The BioPlex 2.0 network used for this embedding is available here. This network was embedded using Node2Vec, with code and documentation available here.

Note

  • The MuSIC pipeline can handle any number of data types (e.g., IF, AP-MS) and any dimension of protein embeddings (i.e., length of feature vector), but all proteins need to have the same number of dimensions within each individual data type.
  • For balanced learning, we recommend embedding proteins in the same number of dimensions for each of the different data types.

Step 1: Generate gold-standard protein-protein proximity values

This step calculates the gold-standard protein-protein proximities used for random forest training in Step 2. For an input list of proteins, all pairs are analyzed to identify the smallest known subcellular component containing each pair, using the Cellular Component branch of the Gene Ontology (GO). The physical diameter of this component is computed using the calibration function fit in the Qin et al. paper, and this diameter is returned as the protein-protein proximity for that pair. Thus, using the calibration function, every protein pair is labeled with a curated physical distance.

Usage

python calibrate_pairwise_distance.py --protein_file /path/to/file/contain/proteins/to/analyze 
			   	      --outprefix /path/to/output/folder/filePrefix 

# Example
python calibrate_pairwise_distance.py --protein_file ./Examples/toy/toy_proteins.txt --outprefix ./Examples/toy_output/toy
  • Required arguments for calibrate_pairwise_distance.py:

    • --protein_file Path to the file containing list of proteins to analyze. One protein per line.

    • --outprefix Full path to the folder where results will be saved, plus a unique file identifier used as a prefix for output files.

  • Optional arguments for calibrate_pairwise_distance.py:

    • --C_file Path to a precomputed matrix containing the size (number of proteins) of the smallest Gene Ontology component shared by each gene pair. Computing the protein pairwise C matrix can be time consuming when analyzing a large number (a few thousand) of proteins, so we also provide a pre-computed protein pairwise C matrix for all 18,395 proteins in the Gene Ontology (GO download Sept 25 2018) (https://www.dropbox.com/s/w9lxpnw39g64zs8/all_protein_min_GO_size.txt?dl=0).

Output file

  • $outprefix.calibrated_distance.csv This file stores the calibrated distance and proximity for every pair of input proteins that have GO annotation. Below are specific annotations for each column:
    • geneA: name of gene A
    • geneB: name of gene B
    • C: the number of proteins in the smallest GO cellular component to which both are annotated
    • D: protein-protein distance calibrated from GO
    • log10D: log10(D)
    • P: protein-protein proximity, -log10(D)
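The relationships among the D, log10D, and P columns can be illustrated directly (the gene pairs and values below are hypothetical; D stands for the calibrated diameter, in nm, of the smallest shared component):

```python
import math

# Hypothetical rows of $outprefix.calibrated_distance.csv.
rows = [
    {"geneA": "ACTB", "geneB": "LMNA", "C": 42, "D": 250.0},
    {"geneA": "ACTB", "geneB": "TUBB", "C": 130, "D": 980.0},
]
for r in rows:
    r["log10D"] = math.log10(r["D"])
    r["P"] = -math.log10(r["D"])   # closer pairs (smaller D) get larger proximity

# A pair confined to a smaller component is more proximal:
assert rows[0]["P"] > rows[1]["P"]
```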

Step 2: Build random forest to predict protein-protein proximity from data embeddings

Using the curated protein-protein proximities from Step 1 as training labels, this step constructs random forest regressors that predict the proximity of any protein pair directly from its data embeddings (see Input data).

Usage

2.1. random_forest_samples.py: this script creates the input and output files used to train the random forests.

python random_forest_samples.py --outprefix /path/to/output/folder/filePrefix
			        --protein_file /path/to/file/contain/proteins/to/analyze
			        --emd_files /path/to/embedding/file1 /path/to/embedding/file2 ...
			        --emd_label emd1 emd2 ...

# Example
python random_forest_samples.py \
--outprefix ./Examples/toy_output/toy --protein_file ./Examples/toy/toy_proteins.txt \
--emd_files ./Examples/toy/toy_IF_image_embedding.csv ./Examples/toy/toy_APMS_embedding.csv \
--emd_label IF_emd APMS_emd --n_samples 1000
  • Required arguments for random_forest_samples.py:

    • --outprefix Full path to the folder where results will be saved, plus a unique file identifier used as a prefix for output files. Note that this needs to be the same as in the previous calibration step.

    • --protein_file Path to the file containing list of proteins to analyze. One protein per line.

    • --emd_files Path to each embedding file generated from different data modalities.

    • --emd_label Label for each embedding file. Enter in the order of --emd_files. E.g., IF_emd, APMS_emd

  • Optional arguments for random_forest_samples.py:

    • --num_set Number of training sets for each embedding file. Enter in the order of --emd_files (default: auto).

      • For balanced training, each protein needs to have one unique embedding from each data modality. When a protein has multiple embeddings from a single data modality, such as multiple images per protein, the user can choose to create multiple training sets with each set containing one embedding randomly chosen from all available embeddings for the respective protein.
      • auto will automatically decide the number of training sets to generate for a data modality based on the maximum number of embeddings that a protein can have from the input files.
    • --n_samples Maximum number of samples to train/test random forest regressor in each fold of k-fold cross validation (default: 1000000).

    • --k Specify k for k-fold cross validation (default: 5).
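The training-set construction behind --num_set can be sketched as follows (a simplified illustration with toy data, not the actual implementation):

```python
import random

# Hypothetical data: each protein maps to one or more embeddings from a single
# data modality (e.g., several IF images per protein).
embeddings = {
    "ACTB": [[0.12, 0.80], [0.10, 0.75]],  # two images
    "LMNA": [[0.55, 0.20]],                # one image
}

# With --num_set auto, the number of sets follows the maximum number of
# embeddings any protein has (2 here); each set picks one embedding per protein.
num_sets = max(len(v) for v in embeddings.values())
rng = random.Random(0)
training_sets = [
    {protein: rng.choice(embs) for protein, embs in embeddings.items()}
    for _ in range(num_sets)
]
```

Each resulting set contains exactly one embedding per protein, which keeps training balanced across proteins with different numbers of images.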

2.2. run_random_forest.py: this script trains random forest regressors using the training samples generated by the previous script and predicts proximity for every queried protein pair. To facilitate concurrent training of multiple random forest regressors on high-performance computing systems, we provide this stand-alone script for training a single model and generating its predictions.

python run_random_forest.py --outprefix /path/to/output/folder/filePrefix
			    --fold 1
			    --emd_label emd1 emd2 ...

# Example for doing 5-fold cross-validation
for ((fold = 1; fold <= 5; fold++));
do
    python run_random_forest.py --outprefix ./Examples/toy_output/toy \
    --fold $fold \
    --emd_label IF_emd APMS_emd;
done
  • Required arguments for run_random_forest.py:

    • --outprefix Full path to the folder where results will be saved, plus a unique file identifier used as a prefix for output files. Note that this needs to be the same as in the previous calibration step.

    • --fold Specify which fold of k-fold cross validation to train.

    • --emd_label Label for each embedding file. Enter in the order of --emd_files. E.g., IF_emd, APMS_emd

  • Optional arguments for run_random_forest.py:

    • --train_set For each embedding data type, specify which training set to use. Enter in the order of --emd_label (default: 1 for each emd_label).

    • --n_estimators The number of trees in the forest (default: 1000).

    • --max_depth The maximum depth of the tree (default: 30).

    • --n_jobs The number of jobs to run in parallel for training random forest regressor (default: 8).
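The flags above mirror the corresponding parameters of scikit-learn's RandomForestRegressor (scikit-learn is among the pipeline's dependencies). A minimal sketch with toy data; the feature construction here is hypothetical, and n_estimators is reduced so the example runs quickly:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy stand-in for the real training samples: each row is the feature vector
# for one protein pair, each label its gold-standard proximity from Step 1.
rng = np.random.default_rng(0)
X = rng.random((200, 16))
y = rng.random(200)

model = RandomForestRegressor(
    n_estimators=100,   # --n_estimators (script default: 1000; reduced here)
    max_depth=30,       # --max_depth
    n_jobs=2,           # --n_jobs
    random_state=0,
)
model.fit(X, y)
pred = model.predict(X[:5])   # predicted proximities for five pairs
```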

2.3. random_forest_output.py: after all random forest models have finished training, this script averages the predicted protein-protein proximities from all models for each protein pair and generates the protein pairwise proximity file used as input for Step 3.

python random_forest_output.py --outprefix /path/to/output/folder/filePrefix

# Example
python random_forest_output.py --outprefix ./Examples/toy_output/toy

Output file

  • $outprefix_predicted_proximity.txt: This file stores the predicted protein-protein proximity for all pairs of proteins provided by the user. This file is used as input for the pan-resolution community detection (Step 3).
    • column 1: name of gene A
    • column 2: name of gene B
    • column 3: predicted proximity between gene A and gene B
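The averaging performed by random_forest_output.py can be sketched as follows (the per-model predictions below are hypothetical):

```python
from collections import defaultdict

# Hypothetical predictions for the same protein pairs from two trained models
# (e.g., different folds or training sets); the final proximity is their mean.
predictions = [
    {("ACTB", "LMNA"): 1.8, ("ACTB", "TUBB"): 0.9},  # model 1
    {("ACTB", "LMNA"): 2.0, ("ACTB", "TUBB"): 1.1},  # model 2
]

summed = defaultdict(float)
for model_pred in predictions:
    for pair, p in model_pred.items():
        summed[pair] += p
avg_proximity = {pair: s / len(predictions) for pair, s in summed.items()}
assert abs(avg_proximity[("ACTB", "LMNA")] - 1.9) < 1e-9
```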

Note

The training time and memory footprint depend on the number of training samples and the dimensionality of the embeddings. For example, training each random forest regressor in the original MuSIC study required ~1 day and >100 GB of memory with 24 threads. Training time and memory can be decreased by using fewer samples and/or embedding dimensions.

Step 3: Analyze proximity data to identify protein communities at progressive resolutions

The final step of the pipeline analyzes the pairwise protein proximities predicted in Step 2 to detect communities of proximal proteins. Protein communities are identified at multiple resolutions, starting with those that form at the smallest protein-protein distances and then progressively relaxing the distance threshold (multi-scale community detection). Communities at smaller distances are contained, in full or in part, inside larger communities as the threshold is relaxed, yielding a structural hierarchy.
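As a toy illustration of this multi-scale idea (a simple connected-components sketch over a hypothetical proximity network, not CliXO itself):

```python
# Communities that form at a high proximity threshold nest inside the larger
# communities found as the threshold is relaxed.
edges = {("A", "B"): 2.0, ("B", "C"): 1.5, ("C", "D"): 0.5}

def communities(edges, thre):
    # Connected components of the graph keeping edges with proximity >= thre,
    # via a small union-find structure.
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    for (a, b), w in edges.items():
        find(a), find(b)          # register both nodes
        if w >= thre:
            parent[find(a)] = find(b)
    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return [frozenset(g) for g in groups.values()]

tight = communities(edges, 1.6)   # only the closest pair clusters together
loose = communities(edges, 1.0)   # relaxing the threshold grows the community
assert frozenset({"A", "B"}) in tight
assert any(frozenset({"A", "B"}) <= c for c in loose)
```

CliXO's actual procedure is clique-based and far more involved (see the Note below the parameter list), but the nesting behavior it produces follows this same threshold-relaxation pattern.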

Usage

python community_detection.py --outprefix /path/to/output/folder/filePrefix 
                              --clixo_i /path/to/clixo/inputFile
			      --predict_nm_size                         
bash <outprefix>.sh

# Example
python community_detection.py --outprefix ./Examples/toy_output/toy \
--clixo_i ./Examples/toy_output/toy_predicted_proximity.txt \
--predict_nm_size --keep_all_files

bash ./Examples/toy_output/toy.sh

The hierarchy is written to the file $outprefix.louvain.ddot, with the specific protein assignment for each system available in the file $outprefix.louvain.termStats.

  • Required arguments for community_detection.py:

    • --outprefix Full path to the folder where results will be saved, plus a unique file identifier used as a prefix for output files.

    • --clixo_i Path to the input similarity network for CliXO: a TSV file with three columns. The first two columns should be strings for the node names (using numbers may cause problems); the third column should be a value for the edge weight.

  • Optional arguments for community_detection.py:

    • --path_to_clixo Full path to CliXO folder.

    • --path_to_alignOntology Full path to alignOntology folder.

    • --clixo_a CliXO -a flag: for the step size of hierarchy construction; usually, a smaller value will create "deeper" hierarchies with more levels from leaves to the root (default: 0.1).

    • --clixo_b CliXO -b flag: for merging overlapping communities. Two existing communities will be merged if their similarity is above the threshold defined by this value. Usually a higher value will create a hierarchy with more, smaller communities, which looks "broader" (default: 0.5).

    • --clixo_m CliXO -m flag: modularity cutoff. Communities lower than this threshold will be removed from the output. Increasing the value will reduce the number of communities (default: 0).

    • --clixo_z CliXO -z flag: modularity cutoff. Communities lower than this threshold will be removed from the output. Increasing the value will reduce the number of communities (default: 0).

    • --clixo_s CliXO -s flag: a cutoff of similarity score, if set, the program will terminate when it reaches this point, and stop looking for more terms from scores lower than this threshold (default: 0).

    • --minSystemSize Minimum number of proteins required for each system (default: 4).

    • --ci_thre Threshold for containment index. Additional hierarchical parent-child containment relations will be assigned between pairs of systems having a containment index above this threshold (default: 0.75).

    • --ji_thre Threshold for Jaccard index. Systems having a Jaccard index above this threshold with their parent system will be removed (default: 0.9).

    • --niter Number of iterations Louvain clustering will run to select partition with the best modularity (default: 1000).

    • --min_diff Minimum difference in number of proteins required for every parent-child pair (default: 1).

    • --keep_all_files When this flag is provided, all intermediate output files will be kept.

    • --n_samples Maximum number of samples to use for fitting linear model (default: 1000000).

    • --q_step Step for scanning best quantile to use (default: 0.1).

    • --predict_nm_size When this flag is provided, all systems will have an estimated size in nm. Note that this calculation requires the _avgPred_ytrue.csv file generated by the random_forest_output.py script in the previous step.

Output file

  • $outprefix.louvain.ddot: This file stores the hierarchical relationship among systems and genes:

    • column 1: the parent system
    • column 2: the child system or protein
    • column 3: property of child in the second column ("default" indicates column 2 is a system, "gene" indicates column 2 is a protein)
  • $outprefix.louvain.termStats: This file stores the specific protein assignment for each identified system.

    • column 1: unique identifier for each system
    • column 2 (Number_of_proteins): total number of proteins belonging to the system
    • column 3 (Proteins): comma separated list of proteins belonging to the system
    • column 4 (median_recal_nm): median of predicted distance, in nm, among all pairs of proteins in the system
    • column 5 (Estimated_size_in_nm): predicted size, in nm, of the system
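A small sketch of reading the hierarchy back into memory (the lines and system names below are hypothetical, and the ddot file is assumed to be tab-separated):

```python
# Each ddot line is (parent, child, type); "gene" rows attach proteins to
# systems, while "default" rows link a child system to its parent system.
ddot_lines = [
    "Root\tSystem1\tdefault",
    "System1\tACTB\tgene",
    "System1\tLMNA\tgene",
]

children, genes = {}, {}
for line in ddot_lines:
    parent, child, kind = line.rstrip("\n").split("\t")
    if kind == "gene":
        genes.setdefault(parent, []).append(child)
    else:
        children.setdefault(parent, []).append(child)

assert children["Root"] == ["System1"]
assert genes["System1"] == ["ACTB", "LMNA"]
```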

Note

The community detection procedure is based primarily on the CliXO method (Clique Extracted Ontologies, details available at https://github.com/fanzheng10/CliXO-1.0). CliXO uses maximum clique finding which is NP-hard. Although CliXO uses heuristics to streamline this complexity, it can still take a long time to finish. For MuSIC v1, pan-resolution community detection took around 7 hours on a modern single processor laptop computer.

Additional Resources

MuSIC v1 was visualized and explored with Cytoscape (session file available here). The Ideker lab has also developed HiView, a hierarchical visualization web app for exploring hierarchical models and the underlying interaction data that support them.