1 - Introduction
2 - Requirements
3 - Usage
4 - remove_H_atoms.py
4.1 - Scope
4.2 - Requirements
4.3 - Usage
4.4 - Arguments
4.5 - Output
5 - ResRel-MPI.py
5.1 - Scope
5.2 - Requirements
5.3 - Usage
5.4 - Arguments
5.5 - Output
6 - Hs-Hk-plot.py
6.1 - Scope
6.2 - Tasks
6.3 - "Density" Task
6.4 - "Bin" Task
6.5 - Output
6.3.1 - Requirements
6.4.1 - Requirements
6.3.2 - Usage
6.4.2 - Usage
6.3.3 - Arguments
6.4.3 - Arguments
7 - Examples
8 - Contacts
When coarsening biomolecules, the identification of the optimal number of sites to minimize information loss from an all-atom conformation is a challenging task. Several coarse-grained and multi-resolution models have been developed to tackle this issue, and one promising model is CANVAS (Coarse-grained Anisotropic Network model for VAriable resolution Simulation).
The CANVAS strategy leverages the blurred and approximate nature of coarse-grained models to identify effective sites based on a user-provided input, and determines the interactions among them based on the molecule’s structure and all-atom force field, making it unnecessary to run reference simulations. This strategy makes the parametrisation of the model practically instantaneous, and allows the modulation of the system’s resolution in a quasi-continuous manner across the structure, from all-atom to (very) coarse-grained. Most notably, the interaction between regions of the system at different resolution (including the solvent) is accounted for and straightforward to set up, allowing the seamless implementation in standard MD software packages (e.g. GROMACS or LAMMPS).
In the CANVAS model, three levels of resolution are employed: all-atom, where all the atoms of the system are taken into account; medium-grained, where the backbone atoms are retained and treated as CG beads; and finally coarse-grained, where only the Cα atoms are kept and modelled as CG beads.
However, this approach requires prior knowledge of the system's chemistry and biology to determine which parts necessitate a fully atomistic description, namely in which part of the system the chemical details have a significant impact. Answering this question can be challenging.
Recently, a new method called Resolution and Relevance has been developed to identify the optimal resolution level that balances simplicity and informativeness. This framework, also known as critical variable selection, allows for the identification of important variables without prior knowledge or assumptions about their nature. The core idea behind this approach is that the generative model underlying empirical samples can be inferred from the distribution of their frequencies, i.e., the number of times different outcomes occur in the dataset.
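As a minimal illustration of this idea (a sketch based on the frequency estimators commonly used in the Resolution and Relevance framework, not code taken from this repository), the Resolution (Hs) and Relevance (Hk) of a generic sample can be estimated from the outcome frequencies as follows:

from collections import Counter
import numpy as np

def resolution_relevance(samples):
    """Estimate Resolution (Hs) and Relevance (Hk) from the frequencies of the outcomes
    observed in `samples` (any hashable labels); hypothetical helper, for illustration only."""
    M = len(samples)
    ks = np.array(list(Counter(samples).values()))     # k_s: how many times each outcome occurs
    hs = -np.sum(ks / M * np.log(ks / M))              # Resolution
    mk = Counter(ks)                                   # m_k: number of outcomes observed k times
    kmM = np.array([k * m / M for k, m in mk.items()])
    hk = -np.sum(kmM * np.log(kmM))                    # Relevance
    return hs, hk

print(resolution_relevance(["a", "a", "b", "c", "c", "c"]))

In this sketch, Hs is the Shannon entropy of the outcome frequencies, while Hk is the entropy of the frequency-of-frequencies distribution, i.e. of how many outcomes occur once, twice, and so on.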
Building upon the aforementioned approach, our goal is to identify the optimal number of sites for multi-resolution protein descriptions using the CANVAS model. This project involves the combination of three methods/software: Relevance and Resolution, Mapping Entropy, and CANVAS:
- The Relevance and Resolution software, written in Python, aims to determine the optimal number of sites for biomolecule coarse-graining.
- The Mapping Entropy tool, written in C, takes the optimal number of sites as input and returns the site selection that minimizes information loss during the reduction of degrees of freedom in a system.
- The CANVAS model, available on a GitHub repository and implemented in Python, allows for the modeling of biomolecules at three levels of resolution as described earlier. It requires the output from the Mapping Entropy tool as input.
In this section, we present the tool for identifying the optimal number of sites, dubbed PrOpRe, an acronym for PRotein OPtimal REsolution identification tool. Subsequently, this number will serve as input for the Mapping Entropy tool, which will provide the atom selection. Finally, an additional code will be necessary to find the CANVAS selection sites that are closest to the output of the Mapping Entropy tool. This automated process will facilitate the coarsening of proteins using the CANVAS model.
- Python3: a powerful interpreted, object-oriented, high-level programming language known for its dynamic semantics. It is highly recommended to use Python 3.7 or 3.9, as they are the most suitable versions. If you are working on a Linux or macOS system, Python 3 should already be installed; however, if you are using Windows, the presence of Python 3 is not guaranteed. To install Python 3, you can follow the installation guide provided here. Please ensure that you are working with Python 3 (preferably 3.7 or 3.9), as executing the code with Python 2 may result in errors or unexpected behavior.
- Python3 libraries: Python 3 comes with a wide range of built-in libraries that are installed by default. However, certain libraries may need to be installed separately. Here are the four libraries used in this code that typically require a separate installation:
  - MDAnalysis: an open-source Python library that helps you quickly write your own analysis algorithms for studying trajectories produced by the most popular simulation packages.
  - NumPy: short for Numerical Python, a fundamental library for numerical computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on them efficiently, including routines for linear algebra and Fourier transforms. NumPy was created in 2005 by Travis Oliphant; it is an open-source project and you can use it freely.
  - Matplotlib: a low-level graph-plotting library in Python that serves as a visualization utility, created by John D. Hunter. It is open source and can be used freely. Matplotlib is mostly written in Python, with a few segments written in C, Objective-C and JavaScript for platform compatibility.
  - SciPy: a free and open-source Python library used for scientific and technical computing, created by Travis Oliphant. It contains modules for optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE solvers and other tasks common in science and engineering.
To install the latest stable releases with conda do:
conda config --add channels conda-forge
conda install mdanalysis
conda install numpy
conda install matplotlib
conda install scipy
On the other hand, to install the latest stable release with pip or pip3 (which should be available in all Python installations) do:
pip3 install --upgrade MDAnalysis
pip3 install numpy
pip3 install matplotlib
pip3 install scipy
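To quickly check that the four libraries are correctly installed and importable (an optional sanity check, not a step of the workflow), you can run for instance:

import MDAnalysis, numpy, matplotlib, scipy
print(MDAnalysis.__version__, numpy.__version__, matplotlib.__version__, scipy.__version__)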
The typical usage of the program consists in a call to remove_H_atoms.py, ResRel-MPI.py and Hs-Hk-plot.py in succession by using Python3:
- remove_H_atoms.py: It has the preliminary purpose of removing all hydrogen atoms from both the reference file and the trajectory file. The reason for this step is that, in the calculation of the Root Square Deviation (RSD) map, which is a key component for computing the Resolution and Relevance, it is preferable to exclude hydrogen atoms as they are not heavy atoms. Note that if the reference and trajectory files already exclude hydrogen atoms, you can skip this script. Additional details can be found in Section 4.
- ResRel-MPI.py: This is the core program, because it has the purpose of calculating the Relevance and Resolution points (changing the number of sites and exploring different mappings) by analyzing the RSD map between each frame and the others. The program generates an output file with three rows of data:
  - 1st row: values of Resolution (Hs);
  - 2nd row: values of Relevance (Hk);
  - 3rd row: number of retained sites for each specific Hs and Hk value.
- Hs-Hk-plot.py: this code serves a dual purpose:
  - drawing and saving various plots related to Resolution and Relevance, the slope, and the histogram of frequencies;
  - calculating the optimal number of sites for a biomolecule based on an atomistic trajectory, with the aim of minimizing the loss of information when atoms are decimated.
Before running the Python scripts, it is important to read the next section carefully, as it provides a detailed explanation of each task and argument.
IMPORTANT: It is crucial to avoid moving the scripts outside the main folder PrOpRe/, otherwise a fatal error occurs and is displayed on the screen, since the user libraries, namely PrOpRe/lib/check_errors.py, PrOpRe/lib/general.py, PrOpRe/lib/inp_out.py and PrOpRe/lib/plot_func.py, will not be found. Moreover, please do not change the names of the folders present in this repository. If you wish to conduct new tests, create your own folder within the main repository. You can nest folders within the main folder without any issues.
This script serves the preliminary yet significant purpose of removing all hydrogen atoms (with typical atom names such as H, H1, H2, HW, ...) from both the reference file (commonly found in formats like gro, pdb, xyz, psf, ...) and the trajectory file (typically in formats such as xtc, trr, dcd, gro, lammpstrj, ...). Hydrogen atoms in proteins exhibit greater movement and rotation compared to heavy atoms such as carbon, nitrogen, and oxygen. This behavior is attributed to the significantly smaller mass of hydrogen atoms. Known as the reduced mass effect, this phenomenon plays a crucial role in protein dynamics. The mobility of hydrogen atoms can influence the three-dimensional structure and stability of the protein. Therefore, in the computation of the RSD map and the subsequent determination of Resolution and Relevance, it is advisable to exclude hydrogen atoms. If your reference and trajectory files already lack hydrogen atoms, you can skip this script.
This script requires two mandatory files: the coordinate/topology file representing the all-atom structure of the biomolecule (gro, pdb, xyz, psf, ...) and the trajectory file in any format (xtc, trr, dcd, gro, lammpstrj, ...). No optional arguments are available.
In order to launch the remove_H_atoms.py script, the command line is the following:
python3 remove_H_atoms.py -r <Coordinate FILE> -t <Trajectory FILE>
or:
python3 remove_H_atoms.py --ref <Coordinate FILE> --traj <Trajectory FILE>
To obtain a brief explanation of the arguments, you can execute the following command in your terminal: python3 remove_H_atoms.py -h or python3 remove_H_atoms.py --help. Additionally, if you wish to print a concise usage message, you can use either python3 remove_H_atoms.py or python3 remove_H_atoms.py -u.
However, it is strongly recommended to read the following section attentively before running the Python scripts as it provides a comprehensive explanation of each argument.
As described in Section 4, both the coordinate/topology file of the all-atom structure of the biomolecule and the trajectory file in any format are always mandatory. There are no optional arguments available. Here is a brief explanation of the aforementioned files:
- Coordinate FILE: This file is mandatory and contains the atom coordinates. It can be in formats such as xyz, gro, pdb, psf, etc. It provides information on the positions of the atoms and their connectivity.
- Trajectory FILE: This file is mandatory and contains the trajectory information of the biomolecule. It can be in formats such as trr, dcd, lammpstrj, gro, etc. It provides information on how the biomolecule moves over time.
The program produces two output files:
- Reference_noH.gro: This file contains the coordinate data of the biomolecule's all-atom structure after eliminating all the hydrogen atoms. It represents the updated coordinates of the remaining heavy atoms.
- Trajectory_noH.xtc: This file comprises the trajectory data of the biomolecule after the removal of hydrogen atoms. It captures the time-dependent movement of the remaining heavy atoms over the course of the simulation.
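For illustration purposes only, the operation performed by remove_H_atoms.py can be sketched with MDAnalysis as follows; this is not the actual code of the script, and the hydrogen selection string is an assumption:

import MDAnalysis as mda

u = mda.Universe("reference.gro", "trajectory.xtc")    # example input file names
heavy = u.select_atoms("not name H*")                  # assumed selection excluding hydrogen atoms
heavy.write("Reference_noH.gro")                       # coordinates of the remaining heavy atoms
with mda.Writer("Trajectory_noH.xtc", heavy.n_atoms) as w:
    for ts in u.trajectory:                            # write the heavy atoms frame by frame
        w.write(heavy)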
This program serves as the core component for calculating Resolution-Relevance (Hs-Hk) points, which involves varying the number of retained sites and different mappings. The calculation process involves several steps:
- Calculation of the all-atom RSD map between each frame and the other frames, requiring alignment between each frame pair.
- Construction of a dendrogram based on the all-atom RSD map of the trajectory using the average linkage UPGMA algorithm.
- Cutting the dendrogram to determine the cutoff value that allows for the distinction of all atomistic conformations.
- Starting with a number of retained sites equal to Natoms - 1, a random mapping is proposed: in this context, "mapping" refers to the process of simplifying or reducing the complexity of a protein structure by selecting a subset of atoms, as schematically shown in Figure 1.
- According to the proposed mapping, the _RSD map_ of that subset of atoms is calculated. Then, based on the pre-determined cutoff, the number of clusters at that cutoff is established for this configuration, and the Hs-Hk point is computed.
- Step 5 is repeated for a specified number M of mappings (by default, M = 50).
- Gradually reducing the number N of retained sites, steps 4, 5 and 6 are iterated until no atoms are retained.
- Finally, a complete curve of Hs-Hk points is drawn, ready for analysis, with the primary goal of calculating the optimal number of sites (see Section 6).
For a better comprehension of the steps mentioned above, Figure 2 illustrates the flow diagram representing steps 4-5-6-7, demonstrating that two nested for-loops are required to calculate all the Hs-Hk points: the outer loop iterates over the number of retained sites (Ns), while the inner loop performs M random mappings at a fixed number of retained sites.
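The nested loops of steps 4-7 can be summarized by the following simplified Python sketch. It is only an illustration under several assumptions (no alignment between frame pairs, SciPy's average-linkage clustering, and the standard frequency-based estimators for Hs and Hk); it is not the actual ResRel-MPI.py implementation and it is not parallelized:

import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

def hs_hk(labels):
    """Resolution (Hs) and Relevance (Hk) from the cluster labels of the F frames."""
    F = labels.size
    ks = np.bincount(labels)[1:]                       # k_s: frames per cluster (labels start at 1)
    ks = ks[ks > 0]
    hs = -np.sum(ks / F * np.log(ks / F))
    mk = np.bincount(ks)                               # m_k: number of clusters containing k frames
    k = np.nonzero(mk)[0]
    kmF = k * mk[k] / F
    return hs, -np.sum(kmF * np.log(kmF))

def rsd_map(coords):
    """Pairwise RSD between frames for the retained atoms (alignment omitted in this sketch)."""
    F, n, _ = coords.shape
    d = np.zeros((F, F))
    for i in range(F):
        d[i] = np.sqrt(((coords - coords[i]) ** 2).sum(axis=(1, 2)) / n)
    return d

def resrel_points(coords_all, cutoff, M=50, step=50):
    """coords_all: (F, Natoms, 3) heavy-atom positions; cutoff: value determined in step 3."""
    F, natoms, _ = coords_all.shape
    points = []
    for n_sites in range(natoms - 1, 2, -step):        # outer loop: number of retained sites
        for _ in range(M):                             # inner loop: M random mappings
            mapping = np.random.choice(natoms, n_sites, replace=False)
            condensed = squareform(rsd_map(coords_all[:, mapping, :]), checks=False)
            Z = linkage(condensed, method="average")   # UPGMA dendrogram of the reduced RSD map
            labels = fcluster(Z, t=cutoff, criterion="distance")
            points.append((*hs_hk(labels), n_sites))
    return np.array(points)                            # columns: Hs, Hk, number of retained sites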
To run this script, two mandatory files are required: the coordinate/topology file of the biomolecule without hydrogen atoms (gro, pdb, xyz, psf, ...) and the trajectory file in any format (lammpstrj, dcd, trr, xtc, ...). Additionally, five optional arguments can be specified:
- Nmappings: the number of random mappings generated at each fixed number of retained sites.
- Nframes: the number of frames to be read from the trajectory file.
- Nstep: the step that describes the decrement in the number of sites to be retained, starting from Natoms - 1, during the calculation.
- ncpu: the number of CPUs to be used for parallelizing the calculation of the RSD map for each mapping.
- RestartFILE: the restart file ("Hs-Hk-Nsites-${ProteinName}.txt") from which the calculation of Hs-Hk-N resumes if previously interrupted.
To run the ResRel-MPI.py script, the command line is the following:
python3 ResRel-MPI.py -r <Reference_noH.gro> -t <Trajectory_noH.xtc> [-m <NMappings>] [-f <Nframes>] [-s <Nsteps>] [-n <nCPU>] [-c <RestartFILE>]
or:
python3 ResRel-MPI.py --ref <Reference_noH.gro> --traj <Trajectory_noH.xtc> [--mapp <NMappings>] [--frames <Nframes>] [--step <Nsteps>] [--ncpu <nCPU>] [--checkpoint <RestartFILE>]
NOTE: Please note that the "Reference_noH.gro" and "Trajectory_noH.xtc" files mentioned here refer to the output files obtained after running the remove_H_atoms.py script. It is crucial to remove hydrogen atoms from the files for an accurate calculation of Resolution and Relevance points. Although this code does not throw an error if hydrogen atoms are present, their excessive movement and rotation can adversely affect the calculation. Please ensure that you have removed hydrogen atoms from the files before proceeding with the calculation.
To obtain a brief explanation of the arguments, you can execute the command python3 ResRel-MPI.py -h or python3 ResRel-MPI.py --help. Additionally, if you wish to print a concise usage message, you can use either python3 ResRel-MPI.py or python3 ResRel-MPI.py -u.
However, it is strongly recommended to read the following section attentively before running the Python scripts as it provides a comprehensive explanation of each argument.
In Section 5, it is emphasized that the coordinate file (Reference_noH.gro) and the trajectory (Trajectory_noH.xtc) of the all-atom structure of the biomolecule without hydrogen atoms are always required inputs. These files provide the necessary information for the calculation. On the other hand, the number of mappings at a fixed number of sites (Nmappings), the number of frames to be read from the trajectory (Nframes), the step that describes the decrement in the number of sites to be retained (Nstep), the number of CPUs employed (ncpu) and the restart file for resuming a previously interrupted Relevance/Resolution calculation (checkpoint) are optional arguments. The following is a summary of the different files and input parameters required by the code:
- Coordinate FILE noH: This is a mandatory file (-r/--ref) containing the atom coordinates of the biomolecule _without_ hydrogen atoms (in formats such as xyz, gro, pdb, psf, etc.). If the "remove_H_atoms.py" script is used, the default name for this file is Reference_noH.gro.
- Trajectory FILE noH: This is another mandatory file (-t/--traj) containing the trajectory of the biomolecule _without_ hydrogen atoms (in formats such as trr, dcd, lammpstrj, gro, etc.). If the "remove_H_atoms.py" script is used, the default name for this file is Trajectory_noH.xtc.
- NMappings (default: 50): This is an optional argument (-m/--mapp) that specifies the number of random mappings M, namely the number of atom combinations that will be chosen randomly at a fixed number of retained sites. Each mapping represents a unique combination of atoms (Figure 1). By default, Nmappings (i.e. M) is set to 50, meaning that 50 random mappings will be generated for each number of retained sites. Increasing this value results in a greater number of random mappings and a more comprehensive exploration of different atom combinations, but it also increases the computational time required for the calculation; conversely, a lower value reduces the computational burden but may result in a less exhaustive sampling of mappings. By adjusting the Nmappings argument according to your specific needs and the available computational resources, you can strike a balance between computational efficiency and the level of exploration of different mappings; the default value of 50 serves as a good compromise between the two factors.
- Nframes (default: 1000): This is an optional parameter (-f/--frames) that allows you to specify the number of frames F to be read from the trajectory. The program ensures that this exact number of frames, spanning the entire trajectory, is included in the analysis: an initial number of frames is discarded, and the trajectory is then read at regular intervals. The default value for Nframes is 1000, meaning that 1000 frames will be considered for the analysis. Any integer number less than the original number of frames in the trajectory is accepted. If you set the Nframes argument to the string "all", indicated by -f all, the program will read every frame available in the trajectory. While this option allows for a comprehensive analysis of the entire trajectory, it is important to note that the calculation of the RSD (Root Square Deviation) map has a computational complexity proportional to the square of the number of frames (Nframes²). Consequently, increasing the number of frames will significantly increase the computation time. The ability to read more frames also depends on the number of cores available in a single node of your computing environment: with a higher number of cores, you can process more frames efficiently. It is therefore crucial to exercise caution when choosing the value of Nframes or selecting the "all" option, considering your specific analysis needs, the computational resources at your disposal, and the desired trade-off between computational time and analysis comprehensiveness.
- Nstep (default: 0.5%): This optional argument (-s/--step) determines the decrement in the number of retained sites during the calculation of Resolution and Relevance points. The process starts from Natoms - 1, and the Resolution and Relevance point is computed for each of the random mappings defined in input (M, default: 50) for that particular number of retained sites. Subsequently, the process is iterated by reducing the number of retained sites by Nstep until a minimum of 3 atoms is reached (as shown in the flow diagram in Figure 2). The purpose of this parameter is therefore to control the granularity of the reduction in the number of retained sites. While reducing the number of sites by 1 maximizes the exploration of Relevance and Resolution points, a value of 1 may not be optimal for large systems because of the computational time required for each RSD map computation. It is thus crucial to strike a balance between comprehensive exploration and computational efficiency. There are two ways to define this argument (a short sketch of the resulting schedule is given after this list):
  - Percentage of the total number of atoms: By specifying Nstep as a percentage, the number of retained sites will decrease by that percentage. The default value is 0.5%. For example, if the total number of atoms (Natoms) is 10000, then 0.5% of this number is 50; the number of retained sites will start from 9999 (Natoms - 1) and decrease by 50 until reaching 3 (Natoms - 1, Natoms - 51, Natoms - 101, ..., 3). When using the percentage definition, _Nstep_ should be specified as an integer or float between 0 and 100, followed by the '%' symbol, without any spaces (e.g. 0.5% and not 0.5 %). If the provided percentage results in Nstep = 0, an error will be displayed. Choosing 0.5% ensures that the number of retained sites changes 200 times, striking a good balance between computational efficiency and the exploration of different resolutions.
  - Directly specifying the step: Alternatively, you can directly define the step without calculating it as a percentage of the total number of atoms. In this case, _Nstep_ should be an integer between 1 and Natoms - 1. If a value outside this range is provided, an error will be raised. For example, if Natoms is 10000 and Nstep is set to 100, the number of retained sites will decrease by 100 in each iteration of the for-loop until reaching 3 (Natoms - 1, Natoms - 101, Natoms - 201, ..., 3).
- NumberCpu (default: maximum number available): This is an optional parameter (-n/--ncpu) that determines the number of CPUs used for parallelizing the calculation of the RSD (Root Square Deviation) map for each mapping. By default, the code will automatically utilize the maximum number of available cores in a single node of your laptop or cluster for parallelization. This means that if the -n/--ncpu option is not set, the code will distribute the computational workload across all the available cores for efficient processing. However, if you want to manually specify the number of CPUs to be used, you can provide the -n/--ncpu option followed by the desired number of CPUs, for example -n 8 to use 8 CPUs. In this case, the code will parallelize the calculation by employing the specified number of CPUs. The purpose of parallelization is to accelerate the computation process by dividing the workload among multiple processors: by utilizing multiple CPUs, you can potentially reduce the overall processing time for calculating the RSD map. Note: the actual number of CPUs available for parallelization may depend on the hardware specifications of your system or the constraints set by your cluster environment.
- Restart FILE: Because of the walltime limit on your cluster node, or for other reasons, the calculation of Relevance and Resolution points may be interrupted at any time. By passing the output file 'Hs-Hk-Nsites-${ProteinName}.txt' as a restart file, the calculation resumes from where it was interrupted.
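As anticipated in the description of Nstep, the following minimal sketch (a hypothetical helper, not part of the actual code) shows the schedule of retained sites produced by the two possible definitions of the argument:

def nstep_schedule(natoms, nstep):
    """Return the sequence of retained sites, from Natoms - 1 down to 3 (hypothetical helper)."""
    if isinstance(nstep, str) and nstep.endswith("%"):
        step = int(natoms * float(nstep[:-1]) / 100)   # e.g. 10000 atoms and "0.5%" give step = 50
    else:
        step = int(nstep)
    if step < 1 or step > natoms - 1:
        raise ValueError("Nstep out of range")
    sites = list(range(natoms - 1, 3, -step))
    if sites[-1] != 3:
        sites.append(3)                                # the sketch always ends with 3 retained atoms
    return sites

print(nstep_schedule(10000, "0.5%")[:4])               # [9999, 9949, 9899, 9849]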
The output of the code includes two files:
- trace_${ProteinName}.txt: This file serves as a log or progress report during the execution of the code. It provides updates on the number of Resolution and Relevance points that have been calculated so far. This information helps monitor the progress of the calculation, especially if it takes a long time to complete. Additionally, it includes the time required to calculate a single point at each fixed number of retained sites. This timing information can be useful for performance analysis and optimization. The file may also estimate the remaining total time based on the current progress, giving you an idea of how much time is left for the calculation to complete.
- Hs-Hk-Nsites-${ProteinName}.txt: This file contains the actual results of the Resolution and Relevance calculations for different numbers of retained sites. It provides three rows of data: the first row corresponds to the values of Resolution (Hs), the second row contains the values of Relevance (Hk), and the third one specifies the corresponding number of retained sites for each Resolution and Relevance point. This information allows you to analyze the relationship between the number of retained sites and the quality of the calculated Resolution and Relevance values. By examining these values, you can identify the optimal number of retained sites that strikes a balance between capturing structural information and essential dynamics (see Section 6). Please take into account that this file has to be used as the checkpoint file when a previously interrupted Relevance and Resolution calculation is resumed with the flag -c/--checkpoint Hs-Hk-Nsites-${ProteinName}.txt. The second example in Section 7 shows the usage of the restart file.
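For any post-processing outside the provided scripts, the three rows of "Hs-Hk-Nsites-${ProteinName}.txt" can be read, for instance, with NumPy (the file name below is only an example):

import numpy as np

hs, hk, nsites = np.loadtxt("Hs-Hk-Nsites-protein.txt")   # rows: Resolution, Relevance, retained sites
nsites = nsites.astype(int)
print(f"{hs.size} Resolution-Relevance points, retained sites from {nsites.min()} to {nsites.max()}")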
A short explanation of the arguments is provided by launching the command python3 ResRel-MPI.py -h or python3 ResRel-MPI.py --help. Alternatively, to print a short usage message, please type python3 ResRel-MPI.py or python3 ResRel-MPI.py -u.
In this study, the main goal is to find the optimal number of sites when coarsening a protein. To accomplish this, we start by simplifying the Resolution & Relevance curve. This curve represents the relationship between the Resolution (Hs) and Relevance (Hk) values for different sets of sites. Different mappings with the same number of retained sites share the same color (as shown in Figure 3).
To simplify the curve, we compute the average values for Resolution (H̅s) and Relevance (H̅k) (Figure 4c and Figure 5c). These average values provide a smoother representation of the overall trend in the data. Next, we analyze the slope between each pair of consecutive points on the average curve. The slope is calculated using the formula ΔY/ΔX, where Y represents the average Relevance points (H̅k) and X represents average Resolution (H̅s) points.
In the existing literature, it is suggested that the partition where the sum of Resolution (H̅s) and Relevance (H̅k) is the largest occurs when the slope μ = -1. This observation aligns with Zipf's law. In this context, it implies that the optimal tradeoff between the simplicity of the representation (low resolution) and its informative nature (high relevance) occurs when the slope μ = -1. This point represents the sweet spot where the protein coarsening achieves the best balance between retaining important information and minimizing complexity.
After identifying the point with a slope closest to -1, we can determine the specific Resolution (Hs) and Relevance (Hk) points falling within that chosen interval. Each point within that interval corresponds to a specific number of retained sites. Therefore, the optimal number of sites can be determined by identifying the Relevance and Resolution points that have the highest occurrence in terms of the number of retained sites.
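The selection just described can be summarized by the following minimal sketch (assumed function and variable names, not the actual Hs-Hk-plot.py implementation): it computes the slope of the averaged curve, picks the interval whose slope is closest to -1, and returns the most frequent number of retained sites among the raw points falling inside it:

import numpy as np

def optimal_nsites(hs, nsites, hs_avg, hk_avg):
    """hs, nsites: raw Resolution points and their retained-site counts;
    hs_avg, hk_avg: averaged curve sorted by increasing Hs."""
    slopes = np.diff(hk_avg) / np.diff(hs_avg)         # mu = delta(Hk) / delta(Hs)
    best = np.argmin(np.abs(slopes + 1.0))             # interval whose slope is closest to -1
    lo, hi = hs_avg[best], hs_avg[best + 1]
    inside = (hs >= lo) & (hs <= hi)                   # raw points falling in the chosen interval
    values, counts = np.unique(nsites[inside], return_counts=True)
    return int(values[np.argmax(counts)])              # most frequent number of retained sites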
The Relevance and Resolution plot consists of a total of N points, with the default value being approximately 10000 points when using the default optional arguments (NMappings = 50, Nframes = 1000, Nstep = 0.5%). One example is reported in Figure 4, where it is possible to appreciate the Resolution and Relevance points drawn with different colors according to the number of retained sites. To determine the optimal number of sites, the Resolution & Relevance curve needs to be simplified. This simplification involves computing average values for Resolution (H̅s) and Relevance (H̅k). There are two different ways to perform this calculation:
- density (recommended task): When using this option, the x-axis representing Resolution (Hs) is divided into X intervals, each containing the same number of points (D). The default value for D is 100, but you can adjust it using the -d flag if needed (see the DensityPoints argument in Section 6.3.3). It is important to note that the interval length is not fixed; instead, the goal is to maintain a consistent density of Hs-Hk points within each interval (Figure 4). In each interval, the average values of Resolution and Relevance, denoted as H̅s and H̅k, are computed. By doing so, the computation of the average values is based on an equal density of points, ensuring fairness in the calculation and providing a more accurate representation of the average behavior of the Relevance and Resolution plot across different intervals. Overall, the density option offers a fair and precise approach for computing the average values of Hs and Hk, taking into account the varying density of points along the Relevance and Resolution plot.
- bin: When using this option, the x-axis representing Resolution (Hs) is divided into a fixed number of windows or intervals, denoted as W. By default, the value of W is set to 50, but you can modify it using the -w flag if desired, as specified in Section 6.4.3 (NumberWindows argument). Each window has the same length and, since the Resolution axis ranges from 0 to 1 by definition, the length of each interval, also referred to as bin, is defined as 1/W. This means that the entire range of Resolution is divided into W equally spaced windows (Figure 5). In each window, the average value of Hs (denoted as H̅s) is taken at the midpoint of the bin, whereas the average value of Hk (denoted as H̅k) is computed as the arithmetic mean of all the Relevance values within that specific window. Unlike the density option, with the bin task the density of points along the Relevance and Resolution plot may vary significantly across the different windows: some windows may contain a high density of points, while others may have relatively few. As a result, the computation of the average values may not be as fair or precise due to the unequal density of points. Therefore, it is advisable to use the bin option with caution and to carefully consider the distribution of points along the curve before drawing conclusions from the average values of Hs and Hk computed within each window.
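For illustration only, the two averaging schemes can be sketched as follows (assumed function names, not the actual Hs-Hk-plot.py code; hs and hk are NumPy arrays of the raw points):

import numpy as np

def average_density(hs, hk, D=100):
    """'density' task: intervals of variable length, each containing D points."""
    order = np.argsort(hs)
    hs, hk = hs[order], hk[order]
    hs_avg = np.array([hs[i:i + D].mean() for i in range(0, hs.size, D)])
    hk_avg = np.array([hk[i:i + D].mean() for i in range(0, hk.size, D)])
    return hs_avg, hk_avg

def average_bin(hs, hk, W=50):
    """'bin' task: W fixed-width windows on the Resolution axis, which ranges from 0 to 1."""
    edges = np.linspace(0.0, 1.0, W + 1)
    mid = 0.5 * (edges[:-1] + edges[1:])               # H̅s is taken at the midpoint of each bin
    idx = np.clip(np.digitize(hs, edges) - 1, 0, W - 1)
    hk_avg = np.array([hk[idx == w].mean() if np.any(idx == w) else np.nan for w in range(W)])
    return mid, hk_avg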
Based on the selected option, you should refer to the appropriate section: if you choose the density option, please refer to Section 6.3; if you choose the bin option, please refer to Section 6.4.
The density task requires one mandatory file: Hs-Hk-Nsites-${ProteinName}.txt. This file contains the values of resolution (Hs), relevance (Hk), and the number of retained sites associated with each Hs and Hk point. Additionally, there are two optional arguments:
- DensityPoints: an integer that specifies the desired number of points in each variable-length interval.
- SlopeRange: specifies the range within which the best interval is determined, based on the average curve of H̅s and H̅k having a slope close to -1.
For more detailed information on these arguments, please refer to Section 6.3.3.
To run the Hs-Hk-plot.py script with the density task, the command line is the following:
python3 Hs-Hk-plot.py density -f <Hs-Hk-Nsites-${ProteinName}.txt> [-d <density>] [-s <range>]
or:
python3 Hs-Hk-plot.py density --file <Hs-Hk-Nsites-${ProteinName}.txt> [--DensityPoints <density>] [--SlopeRange <range>]
NOTE: Please note that the file "Hs-Hk-Nsites-${ProteinName}.txt" mentioned is the output of the "ResRel-MPI.py" script, as described in detail in Section 5. This file contains the values of resolution (Hs), relevance (Hk), and the number of retained sites associated with each Hs and Hk point.
To obtain further information and execute the "Hs-Hk-plot.py" script with the "density" option, please type in the terminal: python3 Hs-Hk-plot.py density
In the "density" task, there are several arguments that can be used. These arguments (one mandatory and two optional) are as follows:
- Hs-Hk-Nsites-${ProteinName}.txt: This is a mandatory argument (-f/--file) that corresponds to the input file "Hs-Hk-Nsites-${ProteinName}.txt" (output of the 'ResRel-MPI.py' code). The file contains the values of Resolution (Hs), Relevance (Hk), and the number of sites (N) associated with each Hs and Hk point. The file is organized in three rows, where each row contains the respective values separated by spaces, as shown below.
- DensityPoints: This is an optional argument (-d/--DensityPoints) that specifies the number of points D that fall within each interval of variable length. By default, the value is set to 100. The density of points within each interval is used to compute the average values of Hs (denoted as H̅s) and Hk (denoted as H̅k). Using this argument, you can easily change the default value to suit your needs.
- SlopeRange: This is an optional argument (-s/--SlopeRange) that determines how to find the best interval on the average curve with slope μ close to -1 for computing the optimal number of sites. The slope μ = -1 is associated with the point of optimal tradeoff between parsimony of the representation (low resolution) and its informativeness (high relevance). There are two ways to define this argument:
  - Percentage Range: By default, the argument is set as a range close to -1 in terms of percentage. The default range spans from -1.10 to -0.90, which corresponds to a 10% range. Within this range, the rightmost value with a higher resolution is chosen. It is generally preferred to select a range with values close to -1: selecting a higher percentage range could result in values that are too far from -1 and therefore not optimal for finding the best interval on the average curve and, consequently, the optimal number of sites.
  - Closest Value: Alternatively, you can specify the argument as "closest" (-s closest) to identify the value of the slope closest to -1. This option allows you to find the specific point on the curve that has the slope μ closest to -1.
----------------------------------------------------
| Hs-1 Hs-2 Hs-3 ..... Hs-N |
| Hk-1 Hk-2 Hk-3 ..... Hk-N |
| Nsites-1 Nsites-2 Nsites-3 ..... Nsites-N |
----------------------------------------------------
The bin task requires one mandatory file: Hs-Hk-Nsites-${ProteinName}.txt. This file contains the values of resolution (Hs), relevance (Hk), and the number of retained sites associated with each Hs and Hk point. Additionally, there are two optional arguments:
- NumberWindows: an integer value that determines the number of windows or intervals into which the x-axis (Resolution) is divided. While the number of windows is fixed, the density of points within each window may vary. Unlike the "density" option, where the density of points is kept constant, with the "bin" option different windows may contain different numbers of data points, resulting in a variable density across the intervals.
- SlopeRange: specifies the range within which the best interval is determined, based on the average curve of H̅s and H̅k having a slope μ close to -1.
All the details of the arguments just described are provided in Section 6.4.3.
To run the Hs-Hk-plot.py script with the bin task, the command line is the following:
python3 Hs-Hk-plot.py bin -f <Hs-Hk-Nsites-${ProteinName}.txt> [-w <nWindows>] [-s <range>]
or:
python3 Hs-Hk-plot.py bin --file <Hs-Hk-Nsites-${ProteinName}.txt> [--NumberWindows <nWindows>] [--SlopeRange <range>]
NOTE: Please note that the file "Hs-Hk-Nsites-${ProteinName}.txt" mentioned is the output of the "ResRel-MPI.py" script, as described in detail in Section 5. This file contains the values of resolution (Hs), relevance (Hk), and the number of retained sites associated with each Hs and Hk point.
To obtain further information and execute the "Hs-Hk-plot.py" script with the "bin" option, please type in the terminal: python3 Hs-Hk-plot.py bin
In the "bin" task, there are several arguments that can be used. These arguments (one mandatory and two optional) are as follows:
- Hs-Hk-Nsites-${ProteinName}.txt: This is a mandatory argument (-f/--file) that corresponds to the input file "Hs-Hk-Nsites-${ProteinName}.txt" (output of the 'ResRel-MPI.py' code). The file contains the values of Resolution (Hs), Relevance (Hk), and the number of sites (N) associated with each Hs and Hk point. The file is organized in three rows, where each row contains the respective values separated by spaces, as shown below.
- NumberWindows: This is an optional argument (-w/--NumberWindows) that specifies the number of intervals into which the x-axis (Resolution) is divided. The length of each interval is fixed, while the number of points falling within each interval can vary, as illustrated schematically in Figure 5. By default, the value of W is set to 50, but you can modify it using the -w flag if desired. Since the Resolution axis ranges from 0 to 1 by definition, the length of each interval, also referred to as bin, is defined as 1/W. Within intervals of the same bin size, the average value of Hs (denoted as H̅s) is taken at the midpoint of each bin, whereas the average value of Hk (denoted as H̅k) is computed for each window as the arithmetic mean of all the Relevance values within that specific window.
- SlopeRange: This is an optional argument (-s/--SlopeRange) that determines how to find the best interval on the average curve with slope μ close to -1 for computing the optimal number of sites. The slope μ = -1 is associated with the point of optimal tradeoff between parsimony of the representation (low resolution) and its informativeness (high relevance). There are two ways to define this argument:
  - Percentage Range: By default, the argument is set as a range close to -1 in terms of percentage. The default range spans from -1.10 to -0.90, which corresponds to a 10% range. Within this range, the rightmost value with a higher resolution is chosen. It is generally preferred to select a range with values close to -1: selecting a higher percentage range could result in values that are too far from -1 and therefore not optimal for finding the best interval on the average curve and, consequently, the optimal number of sites.
  - Closest Value: Alternatively, you can specify the argument as "closest" (-s closest) to identify the value of the slope closest to -1. This option allows you to find the specific point on the curve that has the slope closest to -1.
----------------------------------------------------
| Hs-1 Hs-2 Hs-3 ..... Hs-N |
| Hk-1 Hk-2 Hk-3 ..... Hk-N |
| Nsites-1 Nsites-2 Nsites-3 ..... Nsites-N |
----------------------------------------------------
This code generates 4 PDF plots and a TXT file as output:
- Reso.pdf: This plot displays the Resolution (Hs) and Relevance (Hk) points, with the same color representing points obtained from the same number of retained sites but different mappings. Additionally, a zoomed-in region of interest is shown where the slope μ is close to -1, providing a detailed view of that area.
- Zoom-Reso.pdf: This plot focuses specifically on the region where the slope μ is close to -1, providing a closer look at the relationship between Relevance and Resolution in the region of our interest.
- slope.pdf: This plot shows the slope μ values plotted against an increasing index ranging from 1 to the total number of points, making it easy to spot the interval where the slope approaches -1.
- histo_Nsites.pdf: This plot displays the frequency distribution of the number of sites with different occurrences. It provides insights into the distribution of retained sites and their frequencies.
- Opt-number-of-sites.txt: A text file that provides a summary of the arguments used and, more importantly, the optimal number of sites for a biomolecule derived from an atomistic trajectory, such that the loss of information after decimating atoms is minimized. This information is valuable for determining the appropriate number of retained sites that balances the preservation of essential structural information with the reduction in computational complexity.
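If you wish to re-plot the raw points yourself (for example with a different color map), a minimal Matplotlib sketch in the spirit of Reso.pdf could look as follows; the file name is only an example and this is not the plotting code of the script:

import numpy as np
import matplotlib.pyplot as plt

hs, hk, nsites = np.loadtxt("Hs-Hk-Nsites-protein.txt")
sc = plt.scatter(hs, hk, c=nsites, s=5, cmap="viridis")
plt.colorbar(sc, label="number of retained sites")
plt.xlabel("Resolution (Hs)")
plt.ylabel("Relevance (Hk)")
plt.savefig("Reso-sketch.pdf")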
Inside the tests/ directory there is a complete list of example files for several proteins, allowing the user to try the three codes in succession.
Hereafter, for the sake of clarity, only four examples are reported.
# The reference and trajectory files, namely 1igd_noH.gro and 1igd_noH.xtc, have already been processed
# to exclude hydrogen atoms. As a result, the script "remove_H_atoms.py" was not utilized in this context.
####
name="1igd_noH"
PYTHONDIR=../PYTHON-scripts
inputDIR=../input-files/${name}
### 1st part: ResRel
python3 $PYTHONDIR/ResRel-MPI.py -r $inputDIR/${name}.gro -t $inputDIR/${name}.xtc
rm -r test-${name}
mkdir test-${name}
mv trace_${name}.txt Hs-Hk-Nsites-${name}.txt test-${name}
### 2nd part: Hs-Hk-plot (DENSITY OPTION)
cd test-${name}
python3 ../$PYTHONDIR/Hs-Hk-plot.py density -f Hs-Hk-Nsites-${name}.txt
# The reference and trajectory files, namely 1igd_noH.gro and 1igd_noH.xtc, have already been processed
# to exclude hydrogen atoms. As a result, the script "remove_H_atoms.py" was not utilized in this context.
# A restart is also used in this case, due to interruption of calculation of Resolution and Relevance.
####
name="1igd_noH"
PYTHONDIR=../PYTHON-scripts
inputDIR=../input-files/${name}
### 1st part: ResRel with restart
python3 $PYTHONDIR/ResRel-MPI.py -r $inputDIR/${name}.gro -t $inputDIR/${name}.xtc -c Hs-Hk-Nsites-${name}.txt # RESTART
rm -r test-${name}
mkdir test-${name}
mv trace_${name}.txt Hs-Hk-Nsites-${name}.txt test-${name}
# ake.gro and ake.xtc are the reference and trajectory files with H atoms.
###
name="ake"
PYTHONDIR=../PYTHON-scripts
inputDIR=../input-files/${name}
########## 1st part: remove_H_atoms.py ###############
python3 $PYTHONDIR/remove_H_atoms.py -r $inputDIR/${name}.gro -t $inputDIR/${name}.xtc
mv Reference_noH.gro ../input-files/${name}/${name}_noH.gro
mv Trajectory_noH.xtc ../input-files/${name}/${name}_noH.xtc
########## 2nd part: ResRel.py ###############
name_noH="${name}_noH"
python3 $PYTHONDIR/ResRel-MPI.py -r $inputDIR/${name_noH}.gro -t $inputDIR/${name_noH}.xtc
rm -r test-${name_noH}
mkdir test-${name_noH}
mv trace_${name_noH}.txt Hs-Hk-Nsites-${name_noH}.txt test-${name_noH}
########## 3rd part: Hs-Hk-plot.py (DENSITY OPTION 200 POINTS) ###############
cd test-${name_noH}
python3 ../$PYTHONDIR/Hs-Hk-plot.py density -f Hs-Hk-Nsites-${name_noH}.txt -d 200
# The reference and trajectory files, namely 1knt_noH.gro and 1knt_noH.xtc, have already been processed
# to exclude hydrogen atoms. As a result, the script "remove_H_atoms.py" was not utilized in this context.
####
name="1knt_noH"
PYTHONDIR=../PYTHON-scripts
inputDIR=../input-files/${name}
### 1st part: ResRel (Nmappings = 100, Nframes = 500, Nstep = 1%, Ncpu = 16)
python3 $PYTHONDIR/ResRel-MPI.py -r $inputDIR/${name}.gro -t $inputDIR/${name}.xtc -m 100 -f 500 -s 1% -n 16
rm -r test-${name}
mkdir test-${name}
mv trace_${name}.txt Hs-Hk-Nsites-${name}.txt test-${name}
### 2nd part: Hs-Hk-plot (BIN OPTION 100 WINDOWS, SlopeRange="closest")
cd test-${name}
python3 ../$PYTHONDIR/Hs-Hk-plot.py bin -f Hs-Hk-Nsites-${name}.txt -w 100 -s closest
The output files of each test can also be found in the PrOpRe/output-files/ directory.
Raffaele Fiorentini: [email protected] or [email protected]