-
Notifications
You must be signed in to change notification settings - Fork 8
A Step By Step Guide to Building a MuSIC Map
MuSIC is a multi-scale map of human cell architecture constructed through integration of multiple data types. It is described in the following publication: Qin et al., “Mapping cell structure across scales by fusing protein images and interactions”.
This page describes a step-by-step guide to create the MuSIC map, including how to install the necessary packages and other dependencies; the specification of input data; and detailed documentation for each step and its parameters. An accompanying Jupyter notebook and a accompanying bash script instantiates these steps with the series of code executions that were used to generate MuSIC v1 presented in the paper. Notably, these steps can be used with other inputs and parameter settings to build other multi-scale cell models.
Step 1: Generate gold-standard protein-protein proximity values
Step 2: Build random forest to predict protein-protein proximity from data embeddings
Step 3: Analyze proximity data to identify protein communities at progressive resolutions
- Anaconda (optional but highly recommended)
- APT packages: build-essential, python-dev, libxml2, libxml2-dev, zlib1g-dev, libigraph0-dev, libmpc-dev
- Create an Anaconda virtual environment. This is optional but highly recommended. Takes ~10 minutes.
conda create -n music python=3.6.2 anaconda
source activate music
- Download MuSIC and install dependencies.
git clone https://github.com/idekerlab/MuSIC.git
cd MuSIC
pip install -r ./installation/requirements.txt
- Install CliXO v1.0 and DDOT by running the following command line.
./installation/install.sh
The MuSIC pipeline begins with a collection of input files. Each file specifies the embedding of proteins using a different data type. For each data type, provide an embedding file in the following format, one protein per row:
- Column 1: Unique ID for each row
- Column 2: Protein name
- Columns 3 to k: The embedding vector of that protein (of dimension k-2)
- All columns comma delimited
For MuSIC v1, we demonstrated the MuSIC approach with embeddings from immunofluorescence (IF) images in the Human Protein Altas (HPA) and affinity purification mass spectrometry (AP-MS) protein-protein interaction data in the BioPlex resource. These files are provided here as examples:
-
The GitHub repository provides the 1024-dimension IF embeddings for the 661 proteins used in MuSIC v1 (/Examples/IF_image_embedding.csv). Many proteins have multiple images, leading to a total of 1,451 images for this dataset. All images are available on the HPA website. Images were embedded using DenseNet, with code available here.
-
The GitHub repository provides the 1024-dimension AP-MS embeddings for the 661 proteins used in MuSIC v1 (/Examples/APMS_embedding.MuSIC.csv). The BioPlex 2.0 network used for this embedding is available here. This network was embedded using Node2Vec, with code and documentation available here.
- The MuSIC pipeline can handle any number of data types (e.g. IF, APMS, etc.) and any dimension of protein embeddings (i.e. length of feature vector), but all proteins need to have same number of dimension within each individual data type.
- For balanced learning, we recommend embedding proteins in the same number of dimensions for each of the different data types.
This step 1 calculates the gold-standard protein-protein proximity used for random forest training in step 2. For an input list of proteins, all pairs of proteins are analyzed to identify the smallest known subcellular component containing each pair (uses the Cellular Component branch of the Gene Ontology). The physical diameter of this component is computed using the calibration function below, fit in the Qin et al paper. This diameter is returned as the protein-protein proximity for that pair of proteins. Thus, using the Calibration Function, we label every protein pair with a curated physical distance.
python calibrate_pairwise_distance.py --protein_file /path/to/file/contain/proteins/to/analyze
--outprefix /path/to/output/folder/filePrefix
# Example
python calibrate_pairwise_distance.py --protein_file ./Examples/toy/toy_proteins.txt --outprefix ./Examples/toy_output/toy
-
Required arguments for calibrate_pairwise_distance.py:
-
--protein_file
Path to the file containing list of proteins to analyze. One protein per line. -
--outprefix
Full path to the folder where results will be saved in with unique file identifier.
-
-
Optional arguments for community_detection.py:
-
--C_file
Path to precomputed matrix containing the size (number of proteins) of the smallest Gene Ontology component shared by the gene pair. Computing the protein pairwise C matrix can be time consuming when analyzing a large number (a few thousands) of proteins, so we also provide a pre-computed protein pairwise C matrix for all 18,395 proteins in the Gene Ontology (GO download Sept 25 2018) (https://www.dropbox.com/s/w9lxpnw39g64zs8/all_protein_min_GO_size.txt?dl=0).
-
-
$outprefix.calibrated_distance.csv
This file stores the calibrated distance and proximity for every pair of input proteins that have GO annotation. Below are specific annotations for each column:- geneA: name of gene A
- geneB: name of gene B
- C: the number of proteins in the smallest GO cellular component to which both are annotated
- D: protein-protein distance calibrated from GO
- log10D: log10(D)
- P: protein-protein proximity, -log10(D)
Using the curated protein-protein proximities as training labels (Step 1), this Step 2 constructs random forest regressors to predict the pairwise distance of any protein pair directly from its data embeddings (see Input data).
2.1. random_forest_samples.py: this script creates the input and output files used to train the random forests.
python random_forest_samples.py --outprefix /path/to/output/folder/filePrefix
--protein_file /path/to/file/contain/proteins/to/analyze
--emd_files /path/to/embedding/file1 /path/to/embedding/file2 ...
--emd_label emd1 emd2 ...
# Example
python random_forest_samples.py \
--outprefix ./Examples/toy_output/toy --protein_file ./Examples/toy/toy_proteins.txt \
--emd_files ./Examples/toy/toy_IF_image_embedding.csv ./Examples/toy/toy_APMS_embedding.csv \
--emd_label IF_emd APMS_emd --n_samples 1000
-
Required arguments for random_forest_samples.py:
-
--outprefix
Full path to the folder where results will be saved in with unique file identifier. Note that this needs to be the same as previous calibration step. -
--protein_file
Path to the file containing list of proteins to analyze. One protein per line. -
--emd_files
Path to each embedding file generated from different data modalities. -
--emd_label
Label for each embedding file. Enter in the order of--emd_files
. E.g., IF_emd, APMS_emd
-
-
Optional arguments for random_forest_samples.py:
-
--num_set
Number of training sets for each embedding file. Enter in the order of--emd_files
(default: auto).- For balanced training, each protein needs to have one unique embedding from each data modality. When a protein has multiple embeddings from a single data modality, such as multiple images per protein, the user can choose to create multiple training sets with each set containing one embedding randomly chosen from all available embeddings for the respective protein.
-
auto
will automatically decide the number of training sets to generate for a data modality based on the maximum number of embeddings that a protein can have from the input files.
-
--n_samples
Maximum number of samples to train/test random forest regressor in each fold of k-fold cross validation (default: 1000000). -
--k
Specify k for k-fold cross validation (default: 5).
-
2.2. run_random_forest.py: this script trains random forest regressors using training samples generated from the previous script and predicts proximity for every protein pair queried. To facilitate concurrent training of multiple random forest regressors with high performance computing, we here provide this stand-alone script for training and predicting random forest models.
python run_random_forest.py --outprefix /path/to/output/folder/filePrefix
--fold 1
--emd_label emd1 emd2 ...
# Example for doing 5-fold cross-validation
for ((fold = 1; fold <= 5; fold++));
do
python run_random_forest.py --outprefix ./Examples/toy_output/toy \
--fold $fold \
--emd_label IF_emd APMS_emd;
done
-
Required arguments for run_random_forest.py:
-
--outprefix
Full path to the folder where results will be saved in with unique file identifier. Note that this needs to be the same as previous calibration step. -
--fold
Specify which fold of k-fold cross validation to train. -
--emd_label
Label for each embedding file. Enter in the order of--emd_files
. E.g., IF_emd, APMS_emd
-
-
Optional arguments for run_random_forest.py:
-
--train_set
For each embedding data, specify which training set to use. In order withemd_label
. Default to 1 for each emd_label. -
--n_estimators
The number of trees in the forest (default: 1000). -
--max_depth
The maximum depth of the tree (default: 30). -
--n_jobs
The number of jobs to run in parallel for training random forest regressor (default: 8).
-
2.3. random_forest_output.py: after all random forest models finished training, this script will take the average of all predicted protein-protein proximity for a given pair of protein and generate the protein pairwise proximity file used as input for Step 3.
python random_forest_output.py --outprefix /path/to/output/folder/filePrefix
# Example
python random_forest_output.py --outprefix ./Examples/toy_output/toy
-
$outprefix_predicted_proximity.txt:
This file stores the predicted protein-protein proximity for all pairs of proteins provided by the user. This file is used as input for the pan-resolution community detection (Step 3).- column 1: name of gene A
- column 2: name of gene B
- column 3: predicted proximity between gene A and gene B
The training time and memory footprint depends on the number of training samples and the dimensionality of the embeddings. For example, training of each random forest regressor in the original MuSIC study required ~1 day and >100 Gb memory with 24 threads. The amount of training time and memory can be decreased by using a smaller number of samples and/or embedding dimensions.
The final step of the pipeline analyzes the pairwise protein proximities, predicted in Step 2, to detect communities of proximal proteins. Protein communities are identified at multiple resolutions, starting with those that form at the smallest protein-protein distances then progressively relaxing the distance threshold (multi-scale community detection). Communities at smaller distances were contained, in full or in part, inside larger communities as the threshold was relaxed, yielding a structural hierarchy.
python community_detection.py --outprefix /path/to/output/folder/filePrefix
--clixo_i /path/to/clixo/inputFile
--predict_nm_size
bash <outprefix>.sh
# Example
python community_detection.py --outprefix ./Examples/toy_output/toy \
--clixo_i ./Examples/toy_output/toy_predicted_proximity.txt \
--predict_nm_size --keep_all_files
bash ./Examples/toy_output/toy.sh
Hierarchy is written in file $outprefix.louvain.ddot
with specific protein assignment for each system available in file $outprefix.louvain.termStats
.
-
Required arguments for community_detection.py:
-
--outprefix
Full path to the folder where results will be saved in with unique file identifier. -
--clixo_i
Path to input similarity network for CliXO: a TSV file with three columns. The first two columns should be two strings for the node names (using numbers may cause problem); and the third column should be a value for the edge weight.
-
-
Optional arguments for community_detection.py:
-
--path_to_clixo
Full path to CliXO folder. -
--path_to_alignOntology
Full path to alignOntology folder. -
--clixo_a
CliXO -a flag: for the step size of hierarchy construction; usually, a smaller value will create "deeper" hierarchies with more levels from leaves to the root (default: 0.1). -
--clixo_b
CliXO -b flag: for merging overlapping communities. Two existing communities will be merged if their similarity is above the threshold defined by this value. Usually a higher value will create a hierarchy with more smaller communities", which looks "broader" (default: 0.5). -
--clixo_m
CliXO -m flag: modularity cutoff. Communities lower than this threshold will be removed from the output. Increasing the value will reduce the number of communities (default: 0). -
--clixo_z
CliXO -z flag: modularity cutoff. Communities lower than this threshold will be removed from the output. Increasing the value will reduce the number of communities (default: 0). -
--clixo_s
CliXO -s flag: a cutoff of similarity score, if set, the program will terminate when it reaches this point, and stop looking for more terms from scores lower than this threshold (default: 0). -
--minSystemSize
Minimum number of proteins requiring each system to have (default: 4). -
--ci_thre
Threshold for containment index. Additional hierarchical parent-child containment relations will be assigned between pairs of systems having a containment index above this threshold (default: 0.75). -
--ji_thre
Threshold for Jaccard index. System having Jaccard index above this threshold with its parent system will be removed (default: 0.9). -
--niter
Number of iterations Louvain clustering will run to select partition with the best modularity (default: 1000). -
--min_diff
Minimum difference in number of proteins required for every parent-child pair (default: 1). -
--keep_all_files
When this flag is provided, all intermediate output files will be kept. -
--n_samples
Maximum number of samples to use for fitting linear model (default: 1000000). -
--q_step
Step for scanning best quantile to use (default: 0.1). -
--predict_nm_size
When this flag is provided, all systems will have an estimated size in nm. Note that this calcualtion requries _avgPred_ytrue.csv generated from the random_forest_output.py script in the previous step.
-
-
$outprefix.louvain.ddot
: This file stores the hierarchical relationship among systems and genes:- column 1: the parent system
- column 2: the child system or protein
- column 3: property of child in the second column ("default" indicates column 2 is a system, "gene" indicates column 2 is a protein)
-
$outprefix.louvain.termStats
: This file stores the specific protein assignment for each identified system.- column 1: unique identifier for each system
- column 2 (Number_of_proteins): total number of proteins belonging to the system
- column 3 (Proteins): comma separated list of proteins belonging to the system
- column 4 (median_recal_nm): median of predicted distance, in nm, among all pairs of proteins in the system
- column 5 (Estimated_size_in_nm): predicted size, in nm, of the system
The community detection procedure is based primarily on the CliXO method (Clique Extracted Ontologies, details available at https://github.com/fanzheng10/CliXO-1.0). CliXO uses maximum clique finding which is NP-hard. Although CliXO uses heuristics to streamline this complexity, it can still take a long time to finish. For MuSIC v1, pan-resolution community detection took around 7 hours on a modern single processor laptop computer.
MuSIC v1 was visualized and explored with Cytoscape (session file available here). Ideker lab has also developed a hierarchical visualization webapp, HiView, for visualizing hierarchical models and the underlying interaction data that support them.