Skip to content

potestiolab/excogito

Repository files navigation

README

EXCOGITO is the program to investigate the mapping problem in coarse-grained modelling of biomolecules.

If you use EXCOGITO in your research please cite:

EXCOGITO, an EXtensible COarse-GraIning TOol, M Giulini, R Fiorentini, L Tubiana, R Potestio, R Menichetti, J. Chem. Inf. Model. 2024, 64, 12, 4912–4927

An Information-Theory-Based Approach for Optimal Model Reduction of Biomolecules, M Giulini, R Menichetti, MS Shell, R Potestio, Journal of chemical theory and computation 16 (11), 6795-6813

A journey through mapping space: characterising the statistical and metric properties of reduced representations of macromolecules, R Menichetti, M Giulini, R Potestio, The European Physical Journal B 94 (10), 1-26

1. Installation

1.1 General requirements on Linux systems

The only requirement is to have Openmp installed on your machine.

1.2 Additional requirements on MAC OS

  • Install argp by using homebrew. At the terminal, run this command:
brew install argp-standalone 
  • Install xcode if your version is higher than MacOs 10.7. You are not required to install the Xcode App from AppStore. At the terminal, just run this command (about 15 Gb are required free on your disk even though, at the end of installation, only 2 Gb will be consumed)
xcode-select --install 

In order to have access to OpenMP libraries you can install libomp by using homebrew. At the terminal, run this command:

brew install libomp

1.3 Compiling

The code can be compiled using CMake. A minimal installation is obtained following these steps:

  1. create a directory in excogito, such as build
mkdir build
cd build
  1. run cmake from build, calling the outer directory
cmake ..
  1. run make
make

1.3.2 Compilation options

Cmake allows to specify several options, such as the C compiler, compilation links and compilation flags. For instance, if the optimized Intel C compiler (icc) is available, step 2 may be substitued by:

cmake .. -DCMAKE_C_COMPILER=icc -DCMAKE_C_FLAGS="-Ofast -fopenmp -I./include -mkl -xSSE4.2 -parallel -ipo -mcpu=native"

On MacOs, the C compiler identification should be AppleClang (check the first line printed on terminal after launching the command cmake ..).

2. Running

The typical usage of the program consists in a call to excogito with one of the following options:

  • optimize: to optimize the coarse-grained mapping by minimising its mapping entropy;
  • random: to randomly generate coarse-grained representations and measure the associated mapping entropies;
  • measure: to measure the mapping entropy of a mapping provided by the user (in the form of a .txt file);
  • norm: to calculate the norm of a mapping (provided by the user) throughout a trajectory;
  • cosine: to calculate pairwise distance and cosine between a pair of mappings (provided by the user) throughout a trajectory;
  • distance: to calculate the distance matrix between a data set of mappings (provided by the user) over a single conformation;
  • optimize_kl: to optimize the coarse-grained mapping by minimising its mapping entropy, calculated using the original Kullback-Leibler divergence;
  • random_kl: to randomly generate coarse-grained representations and measure the associated mapping entropies, calculated using the original Kullback-Leibler divergence;
  • measure_kl: to measure the mapping entropy of a mapping provided by the user (in the form of a .txt file), calculated using the original Kullback-Leibler divergence.

Each task can require different input files, which are provided to the program in the form of command-line options.

For further information, please type on terminal ./excogito --help or ./excogito -h

Alternatively, for printing a short usage message, please type: ./excogito --usage or ./excogito -u

After selecting which task is suitable for your purposes, read carefully the documentation below according to your choice.

2.1. Optimize Task

The optimize task requires the protein code string and three input files: parameter, trajectory, and energy.

In order to launch the optimize task follow this syntax:

./excogito optimize -p $parameter_file.ini -t $trajectory_file.xyz -e $energy_file.txt -c $prot_code

or

./excogito optimize --p $parameter_file.ini --t $trajectory_file.xyz --e $energy_file.txt --code $prot_code

For further information, please type on terminal ./excogito optimize

2.2. Random Task

The random task requires the protein code string and three input files: parameter, trajectory, and energy.

In order to launch the random task follow this syntax:

./excogito random -p $parameter_file.ini -t $trajectory_file.xyz -e $energy_file.txt -c $prot_code

or

./excogito random --p $parameter_file.ini --t $trajectory_file.xyz --e $energy_file.txt --code $prot_code

For further information, please type on terminal ./excogito random

2.3. Measure Task

The measure task requires the protein code string and four input files: parameter, trajectory, energy, and mapping.

In order to launch the measure task follow this syntax:

./excogito measure -p $parameter_file.ini -t $trajectory_file.xyz -e $energy_file.txt -c $prot_code -m $mapping_file.txt

or 

./excogito measure --p $parameter_file.ini --t $trajectory_file.xyz --e $energy_file.txt --prot_code $prot_code --m1 $mapping_file.txt

For further information, please type on terminal ./excogito measure

2.4. Norm Task

The norm task requires the protein code string and three input files: parameter, trajectory, and mapping.

In order to launch the norm task follow this syntax:

./excogito norm -p $parameter_file.ini -t $trajectory_file.xyz -c $prot_code -m $mapping_file.txt

or 

./excogito norm --p $parameter_file.ini --t $trajectory_file.xyz --prot_code $prot_code --m1 $mapping_file.txt

For further information, please type on terminal ./excogito norm

2.5. Cosine Task

The cosine task requires the protein code string and four input files: parameter, trajectory, 1st mapping, and 2nd mapping.

In order to launch the cosine task follow this syntax:

./excogito cosine -p $parameter_file.ini -t $trajectory_file.xyz -c $prot_code -m $mapping_file.txt -n $mapping_file2.txt

or 

./excogito cosine --p $parameter_file.ini --t $trajectory_file.xyz --prot_code $prot_code --m1 $mapping_file.txt --m2 $mapping_file2.txt

For further information, please type on terminal ./excogito cosine

2.6. Distance Task

The distance task requires the protein code string and thre input files: parameter, trajectory, mapping matrix.

In order to launch the distance task follow this syntax:

./excogito distance -p $parameter_file.ini -t $trajectory_file.xyz -c $prot_code -x $mapping_matrix_file.txt

or

./excogito distance --p $parameter_file.ini --t $trajectory_file.xyz --prot_code $prot_code --matrix $mapping_matrix_file.txt

For further information, please type on terminal ./excogito distance

2.7. Optimize_kl Task

The optimize_kl task requires the protein code string and three input files: parameter, trajectory, and probability.

In order to launch the optimize_kl task follow this syntax:

./excogito optimize -p $parameter_file.ini -t $trajectory_file.xyz -r $probability_file.txt -c $prot_code

or

./excogito optimize --p $parameter_file.ini --t $trajectory_file.xyz --probs $probability_file.txt --code $prot_code

For further information, please type on terminal ./excogito optimize_kl

2.8. Random_kl Task

The random_kl task requires the protein code string and three input files: parameter, trajectory, and probability.

In order to launch the random_kl task follow this syntax:

./excogito random_kl -p $parameter_file.ini -t $trajectory_file.xyz -r $probability_file.txt -c $prot_code

or

./excogito random_kl --p $parameter_file.ini --t $trajectory_file.xyz --probs $probability_file.txt --code $prot_code

For further information, please type on terminal ./excogito random_kl

2.9. Measure_kl Task

The measure_kl task requires the protein code string and four input files: parameter, trajectory, probability, and mapping.

In order to launch the measure_kl task follow this syntax:

./excogito measure_kl -p $parameter_file.ini -t $trajectory_file.xyz -r $probability_file.txt -c $prot_code -m $mapping_file.txt

or 

./excogito measure_kl --p $parameter_file.ini --t $trajectory_file.xyz --probs $probability_file.txt --prot_code $prot_code --m1 $mapping_file.txt

For further information, please type on terminal ./excogito measure_kl

2.10. optimize_spins Task

The optimize_spins task requires the protein code string and three input files: parameter, trajectory, and probability.

In order to launch the optimize_spins task follow this syntax:

./excogito optimize_spins -p $parameter_file.ini -t $trajectory_file.xyz -r $probability_file.txt -c $prot_code

or 

./excogito measure_kl --p $parameter_file.ini --t $trajectory_file.xyz --probs $probability_file.txt --prot_code $prot_code

Here the trajectory is simply given by the states of a discrete system (like this).

For further information, please type on terminal ./excogito optimize_spins.

3. Which arguments are mandatory? A short explanation

As shown in Section 2.x, the protein code string and two files are always mandatory, namely the parameter file and the xyz trajectory file. The other files can be mandatory, depending on the chosen task.

What are these files?

  • $parameter_file.ini → Set of parameters in ini format for the algorithm (see 3.1). Examples are present in /examples/parameters;
  • $trajectory_file.xyz → Trajectory in xyz format (see the Section 3.2). An example is present in /examples/trajectories;
  • $energy_file.txt → File with the energies corresponding to each configuration in the trajectory (see the Section 3.3). An example is present in /examples/energies;
  • $prot_code → Unique string that identifies the structure (see 3.4). It will be used to generate the output files;
  • $mapping_file.txt → Mapping file, containing the indices of the retained atoms (see 3.5). An example is present in /examples/mappings;
  • $mapping_file2.txt → 2nd Mapping file, containing the indices of the retained atoms (see 3.5). An example is present in /examples/mappings;
  • $mapping_matrix_file.txt → Matrix with n_mappings CG mappings (see 3.6).
  • $probability_file.txt → File with the probabilities corresponding to each configuration in the trajectory (see 3.7). They must sum to 1.0. An example is present in /examples/probabilities;

3.1. Parameter FILE

The core element of EXCOGITO is the parameter file, which is employed to define the constants used in the different tasks.

A sample parameter file for each task can be found in /examples/parameters.

There exist 16 parameters, but only few of them are mandatory for the selected task. They are illustrated in the following table:

Parameter Description Type Mandatory Suggested value
atomnum number of atoms in the system int all
frames number of frames in the trajectory int all < 5000 on laptops, < 15000 if criterion != 3
cgnum number of CG sites int all between atomnum/20 and atomnum/2
criterion criterion for clustering int O-R-M 0, 1, 2, 3, or 4
nclust number of CG macrostates int C0 - C3 between frames/500, and frames/100
n_mappings number of mappings in tasks random and distance int R-D
MC_steps number of MC step in task optimize int O > 5000
rotmats_period MC steps between two full alignments in task optimize int O
t_zero starting temperature in task optimize double O
distance cophenetic distance threshold double C1
max_nclust upper number of clusters int C2 between frames/100 and frames/50
min_nclust lower number of clusters int C2 between frames/1000 and frames/500 (must be < max_nclust)
Ncores number of cores int no
decay_time governs temperature decay in task optimize double O
rsd use rsd (if 1) instead of rmsd (if 0) int no
stride number of structures between two pivot configurations int C3 ~ 10 if frames between 1000 and 10000

O-R-M-D refer to the tasks (optimize/optimize_kl, random/random_kl, measure/measure_kl, distance) in which the parameter is mandatory. C0 .. C3 indicates that the parameter is mandatory if the clustering criterion is equal to 0 .. 3, respectively.

Clustering

Four criteria for hierarchical clustering:

  • 0 Maxclust clustering: configurations are lumped into Nclust macrostates;
  • 1 Maxdist clustering: clustering with the cophenetic distance;
  • 2 Multiple maxclust: as described in Giulini et al. (JCTC, 2020);
  • 3 Fast clustering: as in criterion 0, but applied to a set of pivot configurations. Labels of intermediate structures are assigned to the closer pivot;

3.2. Trajectory FILE

The trajectory should be provided in the xyz format. The first line of each frame indicates the number of atoms, while the second can contain an arbitrary string. As an example, a trajectory with 2 frames and 3 atoms should resemble the following string:

3

X   2.53  2.09   3.55
X   2.57  1.95   3.51
X   2.45  1.87   3.46
3

X   2.69  1.96   3.40
X   2.80  1.91   3.43
X   2.67  2.03   3.28

In the python subdirectory there is a script that helps with the conversion from GROMACS XTC to the XYZ format.

3.3. Energy FILE

Energy files, mandatory for tasks optimize, random, and measure, should contain one value for each frame in the trajectory.

3.4 Protein Code

The protein code is a string that is used to create output files. Don't insert spaces or special characters in this string

3.5 Mapping FILES

A mapping file, mandatory for tasks measure, norm, and cosine is a file with an integer per line. The value correspond to the index of the atom in the xyz trajectory. As an example, a mapping with 8 sites on a peptide of 50 sites should respect the following format:

3
7
19
21
26
34
40
47

3.6. Mapping Matrix FILES

A mapping matrix is mandatory for task distance. It is simply a series of transposed mappings. If we aim at computing the distance matrix between three mappings with 8 sites on a peptide of 50 sites, we must respect the following syntax:

3 7 19 21 26 34 40 47
2 8 19 24 25 38 41 44
0 10 12 20 29 31 35 49

3.7. Probability FILE

Probability files, mandatory for tasks optimize_kl, random_kl, and measure_kl, must contain one value for each frame in the trajectory and should be properly normalized to 1. For a trajectory of 5 frames, the following file is acceptable:

0.1
0.15
0.6
0.05
0.1

4. Examples

Inside the directory examples there are example files for the 6d93 protein, allowing the user to try all the different tasks:

  • optimize: ./build/excogito optimize -p examples/parameters/parameters_optimize_6d93_N31_small.ini -t examples/trajectories/6d93_100frames.xyz -e examples/energies/6d93_energies_100frames.txt -c 6d93

  • random: ./build/excogito random -p examples/parameters/parameters_random_6d93_N31_small.ini -t examples/trajectories/6d93_100frames.xyz -e examples/energies/6d93_energies_100frames.txt -c 6d93

  • measure: ./build/excogito measure -p examples/parameters/parameters_loadca_6d93_N31.ini -t examples/trajectories/6d93_1000frames.xyz -e examples/energies/6d93_energies_1000frames.txt -c 6d93 -m examples/mappings/tamapin_ca_mapping.txt

  • norm: ./build/excogito norm -p examples/parameters/parameters_norm_6d93_N31.ini -t examples/trajectories/6d93_1000frames.xyz -e examples/energies/6d93_energies_1000frames.txt -c 6d93 -m examples/mappings/tamapin_ca_mapping.txt

  • cosine: ./build/excogito cosine -p ./examples/parameters/parameters_cosine_6d93_N31.ini -t ./examples/trajectories/6d93_1000frames.xyz -e ./examples/energies/6d93_energies_1000frames.txt -c 6d93 -m ./examples/mappings/tamapin_ca_mapping.txt --m2 ./examples/mappings/tamapin_nextca_mapping.txt

  • distance: ./build/excogito distance -p examples/parameters/parameters_distance_6d93_N31.ini -t ./examples/trajectories/6d93_1frame.xyz -x examples/mappings/6d93_mapping_matrix.txt -c 6d93

  • optimize: ./build/excogito optimize_kl -p examples/parameters/parameters_optimizekl_6d93_N31_notemp.ini -t examples/trajectories/6d93_100frames.xyz -r examples/probabilities/6d93_probs_100frames.txt -c 6d93

  • random_kl: ./build/excogito random_kl -p examples/parameters/parameters_randomkl_6d93_N31.ini -t examples/trajectories/6d93_100frames.xyz -r examples/probabilities/6d93_probs_100frames.txt -c 6d93

  • measure_kl: ./build/excogito measure_kl -p examples/parameters/parameters_measurekl_6d93_N31.ini -t examples/trajectories/6d93_100frames.xyz -r examples/probabilities/6d93_probs_100frames.txt -c 6d93 -m examples/mappings/tamapin_ca_mapping.txt

  • optimize_spins ./build/excogito optimize_spins -p examples/parameters/parameters_spins_m1.ini -t examples/trajectories/m1_spins_grouped.csv -r examples/probabilities/m1_probs.txt -c m1

5. Scaling values

The approximated mapping entropy is calculated (tasks optimize, random and measure) without the scaling factor equ (see. Giulini et al.). This factor should be computed by the user according to the temperature employed to simulate the system.

6. Documentation

File refman.pdf in the docs directory contains detailed documentation authomatically generated with doxygen version 1.8.5.

A custom documentation can be generated in html and tex format by running doxygen excogito_doxygen.conf.

7. Contacts

Marco Giulini ([email protected]) Raffaello Potestio ([email protected])