The dPUC (Domain Prediction Using Context) software improves domain prediction by considering every domain in the context of other domains, rather than independently as standard approaches do. Our framework maximized the probability of the data under an approximation, which reduces to a pairwise context problem, and we have shown that our probabilistic method is indeed more powerful than competing methods.
Here you can download our code, and learn how to use it. This page is the software's manual.
The dPUC 2.xx series is a major update from dPUC 1.0.
dPUC2 works with HMMER 3 and Pfam 24 onward, and is much faster than before.
If you were a dPUC 1.0 user, note that inputs and outputs are completely different; the new setup is simpler due to the changes that HMMER3 and the new Pfams have brought.
Lastly, there are improvements in the context network parametrization, most notably the addition of directed context.
See Changes
file for more information.
dPUC2 installation is a bit complex, broken down into the following parts below:
- Install software dependencies (easy on Fedora and Ubuntu Linux)
- Clone this repository
- Download Pfam HMM database files
- Download precomputed context network file
dPUC2 contains code written in C that requires the lp_solve
library headers to compile.
dPUC2 requires:
- Perl 5
gzip
- The Inline::C Perl package
- The HMMER3 protein hidden Markov model software
- The lp_solve 5.5 integer linear programming library
- Git versioning control software (optional, to clone repository)
Perl and gzip are part of practically all Linux distributions. Below are commands for installing the rest of these dependencies on two common Linux distributions: Fedora and Ubuntu.
On a terminal, type:
sudo dnf install perl-Inline-C lpsolve-devel hmmer git
This command should also work on Red Hat Linux too (untested).
On a terminal, type:
sudo apt install libinline-c-perl liblpsolve55-dev hmmer git
sudo ln -s /usr/lib/lp_solve/liblpsolve55.so /usr/lib/liblpsolve55.so
This worked on Ubuntu 19.04. This command may also work on other Debian Linux systems (untested).
If the above doesn't prevent an error about linking to the lp_solve library, other users reported these additional steps helping:
- Adding
/usr/lib/lp_solve/
to the LD library path in~/.bashrc
- Creating a file in
/etc/ld.so.conf.d
calledlpsolve.conf
with the content:/usr/lib/lp_solve
Thank you to Philipp Meister for helping me figure out these Ubuntu instructions.
Besides the above dependencies, dPUC2 has to be able to find lp_lib.h
.
DpucLpSolve.pm
has a hardcoded path for /usr/include/lpsolve
, which works for Fedora and Ubuntu.
On other systems, you can run this command to make sure lp_lib.h
is in the right place:
find / -name lp_lib.h
If the result includes /usr/include/lpsolve/lp_lib.h
, then no changes to DpucLpSolve.pm
are needed.
Otherwise, open DpucLpSolve.pm
with a text editor and replace /usr/include/lpsolve
with the directory that contains lp_lib.h
.
Email me if you enconter any issues installing dPUC2 and its dependencies, especially if you were able to solve them, so I can update these instructions.
You can download a ZIP copy of this repository on GitHub. However, cloning the repository has many advantages. You can clone it with this command:
git clone https://github.com/alexviiia/dpuc2.git
If there are any updates in the future, you can simply update your copy of the repository by typing, inside the dpuc2
directory:
git pull
Each script can be run easily from the directory that contains it (and its *.pm
Perl module dependencies).
For example, if you cloned this repository onto the local directory dpuc2/
, on a terminal you can run one of the help messages by typing:
cd dpuc2/
perl -w 1dpuc2.pl # <ARGS>...
To run the scripts from other directories, you have to specify the directory containing the *.pm
Perl modules using the -I
option for perl
, like so:
cd ..
perl -Idpuc2/ -w dpuc2/1dpuc2.pl # <ARGS>...
Note that the home directory shortcut ~
doesn't work with -I
, but you can use $HOME
instead.
So if the code is in ~/dpuc2/
, then this will work:
perl -I$HOME/dpuc2/ -w ~/dpuc2/1dpuc2.pl # <ARGS>...
For the HMM database, you will need to download Pfam-A.hmm.gz
and Pfam-A.hmm.dat.gz
from Pfam.
Use HMMER3's hmmpress to prepare database for searching.
A minimal set of commands is:
# download files
wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.dat.gz
# uncompress Pfam-A.hmm, while keeping compressed version around
gunzip -k Pfam-A.hmm.gz
# press the database
hmmpress Pfam-A.hmm
# remove temporary uncompressed copy (no longer needed)
rm Pfam-A.hmm
Lastly, download a precomputed context network file that corresponds with the Pfam version you want to use, such as dpucNet.pfam33.1.txt.gz
, from the dpuc2-data repository.
A quick download command is (change the pfam version accordingly):
wget https://github.com/alexviiia/dpuc2-data/raw/master/dpucNet.pfam33.1.txt.gz
Code to recompute this file from Pfam is also provided (slow and uses lots of memory, not recommended).
This repository contains the following sample files:
Sample input:
sample/sample.fa
: 10 random protein sequences from the Plasmodium falciparum proteome.
Sample outputs:
sample/sample.txt
: HMMER3/Pfam33.1 domain predictions using weak filters (produced by0runHmmscan.pl
as below).sample/sample.dpuc.txt
: Dpuc2 domain predictions using the previous sample as input (produced by1dpuc2.pl
as in the first command below).
All scripts give detailed usage instructions when executed without arguments.
The following commands can be run on the sample file placed in the same directory as the code and called from that location.
The input file may be compressed with gzip
and may be specified with or without the GZ extension.
The output file is compressed with gzip
, whether the outputs indicate a GZ extension or not.
So the commands as they are below produce and will work entirely with compressed files, without a single GZ extension indicated.
Do this the first time only, to compile the C portion of the code or to troubleshoot dependency problems. If successful, it will display the "usage" message.
perl -w 1dpuc2.pl
Create the dPUC context count network for your Pfam version (skip this step by downloading above the precomputed dPUC network you need).
perl -w dpucNet.pl Pfam-A.full.uniprot dpucNet.txt
Produce domain predictions with HMMER3 (provides hmmscan) with weak filters. This is the slowest step.
perl -w 0runHmmscan.pl hmmscan Pfam-A.hmm sample.fa sample.txt
(Optional) compress output, our scripts will read it either way
gzip sample.txt
Produce dPUC predictions with default parameters
perl -w 1dpuc2.pl Pfam-A.hmm.dat dpucNet.txt sample.txt sample.dpuc.txt
Produce dPUC predictions for more than one p-value threshold for candiate domains (produces one output for each threshold)
perl -w 1dpuc2.pl Pfam-A.hmm.dat dpucNet.txt sample.txt sample.dpuc.txt --pvalues 1e-1 1e-4 1e-9
This is the help message you get by running the script without arguments:
perl -w dpucNet.pl
# dpucNet.pl: Extract the context count network from Pfam predictions
# dPUC 2.08 - https://github.com/alexviiia/dpuc2
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Usage: perl -w dpucNet.pl <Pfam-A.full.uniprot> <dpucNet output>
The required inputs are
<Pfam-A.full.uniprot> Input "full" domain prediction file from Pfam (29-latest).
Use Pfam-A.full for Pfam 23-28.
<dpucNet output> Directed context network of domain family pair counts.
Input file may be compressed with gzip, and may be specified with or without the .gz
extension. Output file will be automatically compressed with gzip.
Run once per Pfam release. This Pfam-specific script counts every ordered domain family
pair observed in the full predictions of standard Pfam releases. The code is optimized to
parse such large files quickly and store domain predictions with minimal memory usage.
However, for the latest Pfam releases it remains a considerable amount of time and memory.
Most users should skip this step, downloading instead the desired precomputed dPUC context
networks file from:
https://github.com/alexviiia/dpuc2-data
This script is provided for completeness and transparency.
Let me further insist that, despite optimizations, this script consumes prohibitive amounts of memory (and time, to a lesser extent), so most regular users won't even be able to complete a run.
First, downloading the input Pfam file Pfam-A.full.uniprot.gz
takes a very long time (24 GB for Pfam 33.1).
Then, for Pfam 33.1, 24G of memory are needed for the script to run.
The memory problems are due to Pfam-A.full.uniprot listing domains grouped by family, whereas these family pair counts can only be calculated after predictions are grouped by protein, so all the data must be in memory before it can be rearranged.
The output file starts with one line listing all Pfam accession codes observed (sorted), followed by one line for every pair of Pfam families (also sorted), with the count of observed pairs. For Pfam 32, this file looks a bit like this (abbreviated):
PF00001 PF00002 PF00003 PF00004 PF00005 ... PF18867 PF18868 PF18869 PF18870 PF18871
PF00001 PF00001 1992
PF00001 PF00002 1
PF00001 PF00003 2
PF00001 PF00004 2
PF00001 PF00005 1
...
PF18870 PF18870 1
PF18871 PF04326 3
PF18871 PF04471 3
PF18871 PF18735 10
PF18871 PF18871 5
This is the help message you get by running the script without arguments:
perl -w 0runHmmscan.pl
# 0runHmmscan.pl: Get domain predictions from your protein sequences
# dPUC 2.08 - https://github.com/alexviiia/dpuc2
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Usage: perl -w 0runHmmscan.pl <hmmscan> <Pfam-A.hmm> <FASTA input> <output table> \
[<file stdout>]
The required inputs are
<hmmscan> the path to the HMMER3 hmmscan executable.
<Pfam-A.hmm> the path to your HMM library of choice (in HMMER3 format).
<FASTA input> the FASTA sequence file, may be compressed with gzip.
<output table> the hmmscan output is plain text table delimited by whitespace (always
uncompressed).
The optional input is
<file stdout> the file to which hmmscan's standard output goes, including alignments
(default /dev/null)
This script changes hmmscan parameters that are important for dPUC and DomStratStats to
work. In particular, it forces outputs to report p-values in place of E-values, and it
relaxes the heuristic p-value filters to allow more predictions through. This script sets
the most stringent p-value threshold at 1e-4.
Note that the p-value threshold of 1e-4 here matches the default domain candidate p-value threshold that dPUC uses, and which worked best in our benchmarks.
Run on the sample file provided here, which is a standard FASTA file that looks like this (abbreviated):
>MAL8P1.134
MSSIAKKTQYNIKVDIHEVK...
>PFA0015c
MALKKGVINESKLSARNVLE...
...
The main output is a large table that looks like this (abbreviated and split into 2 columns):
# --- full sequence ---
# target name accession tlen query name accession qlen E-value score bias
#------------------- ---------- ----- -------------------- ---------- ----- --------- ------ -----
C2 PF00168.30 103 MAL8P1.134 - 1704 2.3e-39 120.2 0.8
C2 PF00168.30 103 MAL8P1.134 - 1704 2.3e-39 120.2 0.8
C2 PF00168.30 103 MAL8P1.134 - 1704 2.3e-39 120.2 0.8
C2 PF00168.30 103 MAL8P1.134 - 1704 2.3e-39 120.2 0.8
C2 PF00168.30 103 MAL8P1.134 - 1704 2.3e-39 120.2 0.8
Ferlin_C PF16165.5 154 MAL8P1.134 - 1704 1.9e-08 20.6 0.0
ATS PF15445.6 446 PFA0015c - 1327 3.1e-194 632.5 39.3
Duffy_binding PF05424.11 182 PFA0015c - 1327 7.5e-67 211.0 19.8
Duffy_binding PF05424.11 182 PFA0015c - 1327 7.5e-67 211.0 19.8
NTS PF15447.6 37 PFA0015c - 1327 5.5e-14 38.8 0.2
ArsR PF09824.9 159 PFA0015c - 1327 3.5e-05 9.6 15.0
...
-------------- this domain ------------- hmm coord ali coord env coord
# of c-Evalue i-Evalue score bias from to from to from to acc description of target
--- --- --------- --------- ------ ----- ----- ----- ----- ----- ----- ----- ---- ---------------------
1 5 3.6e-10 3.6e-10 26.4 0.7 2 90 11 105 10 118 0.80 C2 domain
2 5 7e-08 7e-08 19.0 0.0 3 57 177 234 175 279 0.72 C2 domain
3 5 3.5e-10 3.5e-10 26.4 0.0 7 91 670 761 666 773 0.77 C2 domain
4 5 5.6e-08 5.6e-08 19.3 0.0 4 95 1333 1434 1330 1441 0.78 C2 domain
5 5 1.4e-07 1.4e-07 18.0 0.0 3 102 1494 1599 1492 1600 0.82 C2 domain
1 1 1.9e-08 1.9e-08 20.6 0.0 63 143 1616 1702 1587 1704 0.82 Ferlin C-terminus
1 1 3.1e-194 3.1e-194 632.5 39.3 2 446 895 1327 894 1327 0.95 acidic terminal segme...
1 2 2e-45 2e-45 141.2 0.7 1 156 120 267 120 289 0.91 Duffy binding domain
2 2 4.8e-25 4.8e-25 74.8 5.2 1 162 528 657 528 684 0.84 Duffy binding domain
1 1 5.5e-14 5.5e-14 38.8 0.2 1 37 14 48 14 48 0.98 N-terminal segments o...
1 1 2.8e-05 2.8e-05 9.9 1.0 68 143 756 832 743 842 0.86 ArsR transcriptional ...
...
Although the file still calls the various columns by "E-value", they are all are p-values when output by 0runHmmscan.pl
.
This is the help message you get by running the script without arguments:
perl -w 1dpuc2.pl
# 1dpuc2.pl: Produce dPUC domain predictions from raw hmmscan data
# dPUC 2.08 - https://github.com/alexviiia/dpuc2
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Usage: perl -w 1dpuc2.pl [-options] <Pfam-A.hmm.dat> <dpucNet> <input table> <output table>
The required inputs are
<Pfam-A.hmm.dat> Pfam annotation file, contains GA thresholds and nesting network.
<dpucNet> Directed context network of domain family pair counts.
<input table> The output from hmmscan (previous script).
<output table> Input with most domains removed by dPUC. Format is identical to input.
File path will be used as the base to generate multiple files, one for
each p-value thresholds, if more than one threshold is requested.
Options:
--pvalues <x...> The p-value thresholds for candidate domains [1e-4]
If multiple thresholds are provided, the output table has these
thresholds inserted as text just before the file extension, creating
separate outputs for each threshold.
--alpha <x> Pseudocount (sym-Dirichlet param) for context scores [0.01]
--cutNet <x> Keep context network counts >= cutNet [2]
--fCut <x> Permissive overlap max proportion [.50]
--lCut <x> Permissive overlap max length in amino acids [40]
--timeout <x> lp_solve timeout in seconds [1]
--noGzip Outputs will not be compressed (default is gzip compression).
If more than one p-value threshold is provided on candidate domains:
Note that dPUC predictions are not necessarily nested as candidate domain thresholds are varied,
since dPUC solves a complicated pairwise optimization problem, so these files are not redundant
and are not obtained by filtering the largest file. Each output is as if dPUC had been run anew
on the given p-value threshold, although the code does this efficiently by factoring out shared
steps between thresholds.
Input file may be compressed with gzip, and may be specified with or without the .gz
extension. Output file will be automatically compressed with gzip unless --noGzip is used.
Since my source code is not self-documented, here's a brief description of what each module does.
File | Description |
---|---|
FileGz.pm | Handles normal and compressed files transparently. |
ParsePfam.pm | Parses Pfam-A.hmm.dat and other Pfam-specific files. |
Domains.pm | Processes domains, particularly overlaps. |
Hmmer3ScanTab.pm | Runs and parses hmmscan outputs. |
Dpuc.pm | Main dPUC package connects various context domain prediction strategies. |
DpucPosElim.pm | C code that solves the dPUC "positive elimination". |
DpucLpSolve.pm | Perl-C glue for lp_solve library. |
DpucNet.pm | Counts domain family pairs in Pfam-A.full. |
DpucNetScores.pm | Turns context pair counts into bitscores for domain prediction. |
DpucOvsCompact.pm | Compacts domain overlap LP definitions by finding cliques. |
NetCC.pm | Finds connected components in a network. |
EncodeIntPair.pm | Maps non-negative integer pairs into single integers. |
This code is released under the GNU GPLv3 (GNU General Public License version 3). See LICENSE.
This code has been tested on
- Perl 5.18, 5.20, 5.22, 5.28
- HMMER 3.0, 3.1b1, 3.1b2, 3.2.1, 3.3
- Pfam 25, 27-33.1
- Inline::C 0.53, 0.62, 0.78, 0.80
- lp_solve 5.5.2.0
2017-04-12. Alejandro Ochoa, Mona Singh. Domain prediction with probabilistic directional context. Bioinf 33(16) 2471-2478. Article, bioRxiv 2016-12-14.
2015-11-17. Alejandro Ochoa, John D Storey, Manuel Llinás, and Mona Singh. Beyond the E-value: stratified statistics for protein domain prediction. PLoS Comput Biol. 11 e1004509. Article, arXiv 2014-09-23.
2011-03-31. Alejandro Ochoa, Manuel Llinás, and Mona Singh. Using context to improve protein domain identification. BMC Bioinformatics, 12:90. Article.