DiscoVista

DiscoVista (Discordance Visualization Tool) is a command-line software package for visualizing phylogenetic discordance. The package is written by Erfan Sayyari and Siavash Mirarab.

While a general description of commands is given below, a more detailed tutorial using an example plant dataset is available here.

Please cite DiscoVista using:

  • Sayyari, Erfan, James B. Whitfield, and Siavash Mirarab. 2018. “DiscoVista: Interpretable Visualizations of Gene Tree Discordance.” Molecular Phylogenetics and Evolution 122 (May): 110–15. https://doi.org/10.1016/j.ympev.2018.01.019.

INSTALLATION:

Simpler option (preferred): using docker

Since DiscoVista has several dependencies, direct installation might be difficult and time consuming; therefore, we have created a docker image automatically linked to the DiscoVista github repository.

Docker is a software container platform (almost like a virtual machine system) that greatly simplifies installation. To use the docker installation of DiscoVista, you should first install docker and then pull the docker image from the dockerhub. That is it; you will be ready to use DiscoVista.

Here is what you need to do:

  1. Install docker following the instructions for Mac, Windows, or Ubuntu. If you have another operating system, look here for more details.
  2. After installing and starting docker, pull the DiscoVista image with this command inside a terminal:
    docker pull esayyari/discovista
  3. Then you can run DiscoVista with this command:
    docker run -v <absolute path to data folder>:/data esayyari/discovista discoVista.py [OPTIONS]

By using -v we mount the data folder to the /data folder inside the container, so all the changes and figures that DiscoVista creates will be available inside this folder. Note that <absolute path to data folder> must be an absolute path, and the program assumes that the data is mounted under /data inside the container.
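
The mount convention can be wrapped in a small helper. This is a minimal sketch and not part of DiscoVista: the function name `build_docker_command` and the folder name `my_data` are hypothetical. It only assembles the argument list of the `docker run` command shown above, resolving the data folder to an absolute path first.

```python
import os

IMAGE = "esayyari/discovista"  # image name taken from the pull command above


def build_docker_command(data_dir, disco_args):
    """Return the docker argument list that mounts data_dir at /data.

    data_dir may be relative; it is made absolute because the -v option
    expects an absolute host path. disco_args are passed to discoVista.py
    unchanged.
    """
    host_path = os.path.abspath(data_dir)
    return ["docker", "run", "-v", f"{host_path}:/data", IMAGE,
            "discoVista.py", *disco_args]


if __name__ == "__main__":
    # "my_data" is a placeholder folder name used only for illustration.
    command = build_docker_command("my_data", ["-p", "/data", "-m", "2", "-o", "/data/results"])
    print(" ".join(command))
```

The returned list can be passed to `subprocess.run(command, check=True)` once docker is installed and the image has been pulled.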

Difficult option: installation from source code

The software package DiscoVista depends on several R and python packages. You could install DiscoVista in a couple of steps:

  1. Clone the DiscoVista git repository or download this zip file.
  2. Set the environment variable WS_HOME to the directory under which the DiscoVista repository is placed. For example, if you cloned DiscoVista and placed it under the /Users/Erfan/reposiotry folder, then you would export WS_HOME as /Users/Erfan/reposiotry.
  3. Install the dependencies described below.
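
Before running anything from source, it can help to confirm that WS_HOME is set as described. The following is a small sketch under the assumption that the repository folder is named `DiscoVista` directly under WS_HOME; the function name `discovista_home` is hypothetical.

```python
import os


def discovista_home(environ=None):
    """Return the expected DiscoVista folder, $WS_HOME/DiscoVista.

    environ defaults to the process environment; a plain dict can be
    passed instead, which makes the function easy to try out.
    """
    if environ is None:
        environ = os.environ
    ws_home = environ.get("WS_HOME")
    if not ws_home:
        raise RuntimeError("WS_HOME is not set; export it before running DiscoVista")
    return os.path.join(ws_home, "DiscoVista")
```

Calling `os.path.isdir(discovista_home())` then tells you whether the clone is where the variable says it is.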

R dependencies

For instructions on installing R, please see this page. After installing R, there are several R packages that you need to install. The R package dependencies are: reshape, reshape2, ggplot2, plyr, scales, ape, and optparse. Note that package names are case-sensitive in R.

To install these packages, use the following command in R:

install.packages(c("reshape","reshape2","ggplot2","plyr","scales","ape","optparse"))

Python dependency

You need to install DendroPy>=4.2.0. On Mac or Linux, you can use pip to install DendroPy. If you have root access, you could use:

sudo pip install dendropy

Otherwise, you can install DendroPy for your own user with the command:

pip install dendropy --user

How does DiscoVista work?

The main utility to run DiscoVista is discoVista.py. Its command-line usage is:

Usage: discoVista.py [options]

Options:
  -h, --help            show this help message and exit
  -a ANNOTATION, --annotation=ANNOTATION
                        The annotation file
  -c CLADES, --clades=CLADES
                        The path to the clades definition file
  -m MODE, --mode=MODE  Specifies the analysis to perform. To
                        summarize species tree use 0.  To summarize gene
                        trees use 1 . For GC stat analysis use 2. For occupancy
                        analysis use 3. For frequency analysis use 5.
  -p PATH, --path=PATH  path to the gene directory or species tree
  -r ROOT, --rooting=ROOT
                        The rooting file
  -s STYLE, --style=STYLE
                        The color style set
  -t THRESH, --threshold=THRESH
                        The bootstrap threshold
  -x MODELCOND, --modelCond=MODELCOND
                        The model condition that the occupancy map will be
                        plotted for
  -y NEWMODEL           The new order for model conditions
  -w NEWORDER           The new order for clades
  -k MISSING, --missing=MISSING
                        The missing data handling flag. If this flag set to
                        one, clades with partially missing taxa are considered
                        as complete.
  -o LABEL, --output=LABEL
                        name of the output folder for the relative frequency
                        analysis. If you are using the docker it should start
                        with '/data'.
  -g OUTG, --outgroup=OUTG
                        Name of the outgroup for the hypothesis in relative
                        frequency analysis specified in the annotation file,
                        eg. Outgroup or Base.

Input files

Several types of inputs (in addition to gene trees and species trees) can be provided to DiscoVista, although only a subset is needed for any given visualization.

  1. The annotation file (-a file). Each line of this file gives a taxon name and the name of the clade that the species belongs to. Use tab as the field separator. You would find an example of an annotation file here. This file is used with occupancy and frequency analyses.
  2. The rooting file (-r file). Let's say that you have an outgroup clade. Each line of this file lists the set of species in one outgroup clade. The first line lists the species that are the most distant from the ingroup species. The next line lists the outgroup species that are the second most distant from the ingroup species, and so on. We root at the first outgroup clade; if that is not possible, we move to the second, third, and so on. Note that most analyses do not need an outgroup. You would find an example of a rooting file here. This is only used with the
  3. Clade definitions (-c CLADE). In this file the user can easily combine taxa into groups of interest and give the groups names. Each split is a bipartition of the taxa into two groups and corresponds to an edge in an unrooted tree. The user can specify one side of a split (which would be a clade if the side that doesn't include the root is given). With careful definition of splits, alternative hypotheses of interest can be specified. You would find an example of a clade definition file here. Also, an auxiliary tool to generate this clade definition file is made available under DiscoVista/src/utils/generate_clade-defs.py. We further elaborate on the contents of this file below.
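
To make the annotation format concrete, here is a minimal sketch of a reader for the two-column, tab-separated layout described in item 1. It is an illustration only and not DiscoVista's own parser; the function name `read_annotation` and the taxon and clade names in the usage note are hypothetical.

```python
from collections import defaultdict


def read_annotation(lines):
    """Parse 'taxon<TAB>clade' lines into a dict mapping clade -> sorted taxa.

    lines is any iterable of strings, for example an open file.
    Blank lines are skipped; any other line must have exactly two fields.
    """
    clades = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                f"line {number}: expected 2 tab-separated fields, got {len(fields)}"
            )
        taxon, clade = (field.strip() for field in fields)
        clades[clade].append(taxon)
    return {clade: sorted(taxa) for clade, taxa in clades.items()}
```

For example, `read_annotation(["Zea\tMonocots\n", "Oryza\tMonocots\n"])` groups both taxa under `Monocots`, and a line that uses spaces instead of a tab raises a `ValueError`, which is a common formatting mistake worth catching early.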

The clade definition file

Each line of the clade definition file defines a clade of potential interest. The file has several columns:

  • Clade Name defines the name of a clade
  • Clade Definition is the list of species or other clades in this clade. You could use + or - signs to define a new clade based on previous ones.
  • Section Letter is an arbitrary name that you can use to group the clades together. If there is no natural grouping of clades, leave it blank. Clades of the same group will appear together in the figures.
  • Components The list of important species or clades that together define the clade. If one of these species or clades is completely missing, the clade will be considered as missing. This is useful to indicate that if all species in some part of the clade are missing, it is not meaningful to talk about that clade anymore.
  • Show is a 0/1 variable. If this is 1, that clade will be shown in the graphs, otherwise, this clade will not be shown.
  • Comments free form comments about the clade.

We have provided a Python script, generate_clade-defs.py, that can be used to generate the clade definition file from the annotation file. You can run it with the command:

generate_clade-defs.py [annotation file] [outputfile] [Other clades file]

This will create one clade for every value in the second column of the annotation file. Using the other clades file, you can define other important branches of the expected tree. Let's say that in your annotation file you have two clades, A and B, and you are interested in a clade that unites A and B. Then you would define it with A+B in this file.

  • Note: In the clade definition file, you should provide a clade with the name All, which indicates all the species names you are considering in your analysis (with Show=0).
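
The + and - notation can be read as set union and set difference over taxa. The sketch below illustrates that reading only; it is not the parser used by DiscoVista or generate_clade-defs.py, the function name `expand_definition` is hypothetical, and it assumes that names contain no + or - characters and are written without quotes.

```python
import re


def expand_definition(definition, known):
    """Expand a definition such as 'A+B-C' into a set of taxa.

    known maps already-defined clade names to sets of taxa. A name that is
    not in known is treated as a single taxon. Terms are applied from left
    to right: '+' adds members, '-' removes them.
    """
    taxa = set()
    for sign, name in re.findall(r"([+-]?)([^+-]+)", definition):
        name = name.strip()
        members = known.get(name, {name})
        if sign == "-":
            taxa -= members
        else:
            taxa |= members
    return taxa
```

With `known = {"A": {"a1", "a2"}, "B": {"b1"}}`, the definition `A+B` expands to all three taxa, and `A+B-a2` drops the single taxon `a2` from that union.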

Using DiscoVista

Throughout the descriptions below, we assume that you are using bash, and your current directory is $WS_HOME/DiscoVista/.

1. Discordance analysis on species trees

To perform discordance analysis on species trees, you need species trees with support values drawn on the branches and represented in the Newick format as node labels. For drawing bootstrap support values on branches we highly recommend using newick utilities. After rerooting with our tool, please double check the support values using any graphical viewing software, such as FigTree, to be sure that the support values are correctly drawn and the rerooting was correct.

  • Species trees should be stored following this structure: path/MODEL_CONDITION-DST/estimated_species_tree.tree. Here path points to the directory where the species trees are located. Put each estimated species tree inferred with a different method under a different directory. The names of these directories should follow model_condition-data_sequence_type. For example, if you have different filtering strategies for your nucleotide sequences and the gene trees are then inferred using RAxML, you may use RAxML_highly_filtered-NA. Please only use "-" to separate the model condition from the data sequence type.

  • Let's assume that the support values are drawn on the branches of the species trees available at path $path, and there are 3 model conditions, RAxML_highly_filtered-NA, RAxML_med_filtered-NA, and RAxML_highly_filtered-NA. Also, assume that you consider branches with support above 95 as highly supported, so that the code will contract branches below that threshold. Then you would call the software in bash using the following command:

./discoVista.py -m 0 -c clades-def.txt -p $path -t 95 -o $path/results
  • Using docker:
docker run -v <absolute path to data folder>:/data esayyari/discovista discoVista.py -m 0 -c clades-def.txt -p $path -t 95 -o $path/results
  • If you are using local posterior probabilities instead of bootstrap, and let's assume that the branches above the threshold of 0.95 should be considered as highly supported, then you can run:
./discoVista.py -m 0  -c parameter/clades-def.txt -p $path  -t 0.95 -o $path/results
  • Using docker:
docker run -v <absolute path to data folder>:/data esayyari/discovista discoVista.py -m 0 -c /data/parameter/clades-def.txt -p $path  -t 0.95 -o $path/results
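
Since the directory naming matters, a quick layout check before running mode 0 can save a failed run. This is a minimal sketch under the layout described above (path/MODEL_CONDITION-DST/estimated_species_tree.tree); the function name `find_species_trees` is hypothetical and the check is not part of DiscoVista.

```python
import os


def find_species_trees(path, tree_name="estimated_species_tree.tree"):
    """Return (model_condition, dst) pairs for well-formed subdirectories.

    A subdirectory of path qualifies when its name contains exactly one
    '-' and it holds a file called tree_name. Everything else, such as a
    results folder, is skipped.
    """
    found = []
    for entry in sorted(os.listdir(path)):
        full = os.path.join(path, entry)
        if not os.path.isdir(full) or entry.count("-") != 1:
            continue
        if os.path.isfile(os.path.join(full, tree_name)):
            model_condition, dst = entry.split("-")
            found.append((model_condition, dst))
    return found
```

Comparing the returned list with the model conditions you expect makes it easy to spot a directory whose name contains an extra "-" or is missing its tree file.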

2. Discordance analysis on gene trees

To perform discordance analysis on gene trees, you need gene trees with the MLBS values drawn on the branches and represented in the Newick format as node labels. For drawing bootstrap support on branches we highly recommend using newick utilities.

  • Gene trees should be stored using this structure: path/GENE_ID/GENE_ID-MODEL_CONDITION-DST/estimated_gene_trees.tree. Here path points to the directory where the gene trees are located. Please only use "-" to separate the gene ID, model condition, and data sequence type. Put each estimated gene tree inferred with a different method for a different gene under a different directory. The names of these directories should follow GENE_ID-model_condition-data_sequence_type.

  • Note that you should do this analysis for each model condition separately.

  • Let's assume that the MLBS values are drawn on the branches of the gene trees of model condition RAxML_highly_filtered-NA available at path $path. Also, assume that you consider branches with MLBS above 75 as highly supported, so that the code will contract branches below that threshold. Then you would call the software in bash using the following command:

./discoVista.py -m 1 -c parameter/clades-def.txt -p $path -t 75 -o $path/results
  • Using docker:
docker run -v <absolute path to data folder>:/data esayyari/discovista discoVista.py -m 1 -c /data/parameter/clades-def.txt -p $path -t 75  -o $path/results

3. GC content analysis

  • GC content analysis shows the ratio of GC content (relative to the total number of A, C, G, and T's) in the first, second, and third codon positions, and in all positions together, across different species. To satisfy the stationarity assumption of DNA sequence evolution models, we expect these ratios to be nearly identical across all species for each codon position separately. This might not hold for the third codon position, in which case removing the third codon position might help gene tree inference.
  • For GC content analysis use this structure: path/GENE_ID/DST-alignment-noFilter.fasta, where DST defines the data sequence type (e.g., FNA, NA, etc.), and DST-alignment-noFilter.fasta is the original sequence alignment without filtering. Please use the following command in bash:
./discoVista.py -p $path -m 2 -o $path/results
  • Using docker:
docker run -v <absolute path to data folder>:/data esayyari/discovista discoVista.py -p $path -m 2 -o $path/results
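
The quantity being plotted can be illustrated on a single sequence. The sketch below computes the GC ratio per codon position as described above (GC count over the count of A, C, G, and T). It is an illustration, not DiscoVista's implementation; the function name `gc_by_codon_position` is hypothetical, and it assumes the alignment is in frame, so that column index modulo 3 gives the codon position.

```python
def gc_by_codon_position(sequence):
    """Return GC ratios for codon positions 1, 2, 3 and for all positions.

    Only A, C, G and T are counted. Gaps and ambiguity codes are ignored
    but still advance the codon position. A position with no counted
    bases gets NaN.
    """
    gc = [0, 0, 0]
    total = [0, 0, 0]
    for index, base in enumerate(sequence.upper()):
        if base in ("A", "C", "G", "T"):
            total[index % 3] += 1
            if base in ("G", "C"):
                gc[index % 3] += 1
    nan = float("nan")
    ratios = {
        f"pos{i + 1}": (gc[i] / total[i] if total[i] else nan) for i in range(3)
    }
    ratios["all"] = sum(gc) / sum(total) if sum(total) else nan
    return ratios
```

For the sequence `ATGGCC`, positions 1 and 2 each hold one GC base out of two and position 3 holds two out of two. Averaging such per-sequence values over genes for each species gives the kind of per-species points that the figures summarize.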

4. Occupancy analysis

  • To see the occupancy of different species or clades in different genes you would use this analysis.
  • For this analysis, store the sequence alignments using this structure: path/GENE_ID/DST-alignment-MODEL_CONDITION.fasta, where MODEL_CONDITION defines the model condition under which the sequence alignment was generated. Then you would use this command:
 ./discoVista.py -p $path -m 3 -a parameter/annotation.txt -o $path/results
  • Using docker:
docker run -v <absolute path to data folder>:/data esayyari/discovista discoVista.py -p $path -m 3 -a /data/parameter/annotation.txt -o $path/results
  • If you want to have a tile graph that describes the occupancy of species for only one model condition you would use the option -x DST-model_condition. For example, if you are interested in the occupancy map of your data and you used FNA as your DST in your directory names, and the model condition is noFiltered, then you can use this command:
 ./discoVista.py -p $path -m 3 -a parameter/annotation.txt -x FNA-noFiltered -o $path/results
  • Using docker
docker run -v <absolute path to data folder>:/data esayyari/discovista discoVista.py -p $path -m 3 -a /data/parameter/annotation.txt -x FNA-noFiltered -o $path/results
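
As a rough illustration of what occupancy measures, the sketch below counts, for each taxon, the fraction of gene alignments in which it has at least one non-gap character. This definition is an assumption made for the example; DiscoVista's own computation may differ. The function names `fasta_taxa` and `occupancy` are hypothetical.

```python
def fasta_taxa(fasta_text):
    """Return the taxa that have at least one non-gap character."""
    present = set()
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].strip()  # the whole header is used as the taxon name
        elif name and line.replace("-", "").replace("?", ""):
            present.add(name)
    return present


def occupancy(alignments):
    """Map each taxon to the fraction of genes in which it is present.

    alignments maps a gene ID to the FASTA text of that gene's alignment.
    """
    counts = {}
    for fasta_text in alignments.values():
        for taxon in fasta_taxa(fasta_text):
            counts[taxon] = counts.get(taxon, 0) + 1
    return {taxon: count / len(alignments) for taxon, count in counts.items()}
```

A taxon whose sequence consists only of gaps in a gene is treated as absent from that gene, so a value of 0.5 means the taxon is present in half of the alignments.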

5. Branch support vs branch length analysis

  • This analysis shows the correlation between the average of average gene MLBS values and average of average and maximum gene branch lengths for analyzing the long branch attraction and the effects of different inference methods on the reliability of gene trees.
  • First, organize gene trees using this structure path/MODEL_CONDITION/DST-estimated_gene_trees.tree, where all estimated gene trees for the model condition are concatenated. Let's say that you have 3 model conditions, noFiltered, medFiltered, and highFiltered and you use FNA as your DST, then you would use the following code:
./discoVista.py -p $path -m 4  -r parameter/rootingDef.txt -o $path/results
  • Using docker:
docker run -v <absolute path to data folder>:/data esayyari/discovista discoVista.py -p $path -m 4  -r /data/parameter/rootingDef.txt -o $path/results

6. Relative frequency analysis

DiscoVista can show the frequency of all three topologies around focal branches of the inferred species tree. These figures can be used to assess the amount of ILS, as well as whether the conditions of ILS are met. Before describing the inputs and outputs of this analysis, note that it depends on the DiscoVista branch of ASTRAL; in a future version we will merge it with the master branch of ASTRAL. If you don't want to deal with installation difficulties, you can simply use the DiscoVista docker image.

To run this analysis you need a folder (“-p”) that contains your estimated species tree (named estimated_species_tree.tree) and all your gene trees in one file (named estimated_gene_trees.tree). For example, in the 1KP folder we have an estimated gene trees file that has 844 genes in it. The rooting of the gene trees is not important. You also need an output folder (“-o”) and an annotation file (“-a”) with one line per species, which assigns that species to a major split (clade), with the two fields separated by a tab. Note that all the major splits (clades) in your annotation file should be compatible with (monophyletic in) the species tree for the code to work properly. Also, your annotation file does not have to include taxa that are not in your species tree. With the optional feature (“-g”) you can name the split from your annotation file that you expect at the root of the tree, e.g. Base or Outgroup.

The output will be similar to the figures under the results folder of the examples. This analysis generates 4 different figures. One of them is named tree.pdf, which shows your species tree, summarized according to your annotation file, in 4 different ways; these are your guide trees. Another is relativeFreq.pdf, which shows the frequency of the three topologies around each focal internal branch of your summarized species tree. Here are example commands to run this analysis:

./discoVista.py -p $path -m 5 -a parameter/annotation-hypo.txt -o $path/results  -g Outgroup

Using docker:

docker run -v <absolute path to data folder>:/data esayyari/discovista discoVista.py -p $path -m 5  -a /data/parameter/annotation-hypo.txt -o $path/results -g Outgroup

Outputs

Discordance analysis on species trees

Here are the example outputs:

In this figure rows correspond to major orders and clades, and columns correspond to the results of different methods on the plant dataset. The blue-green spectrum indicates the MLBS values for monophyletic clades. Weakly rejected clades are clades that are not present in the tree, but are compatible with it if low support branches (below 90%) are contracted.

In this figure rows correspond to major orders and clades, and columns correspond to the results of different methods on the plant dataset. Weakly rejected clades are clades that are not present in the tree, but are compatible with it if low support branches (below 90%) are contracted.

Discordance analysis on gene trees

This figure shows the portion of RAxML genes for which important clades (x-axis) are highly (weakly) supported or rejected for three model conditions of the plants dataset. Weakly rejected clades are those that are not in the tree but are compatible if low support branches (below 75%) are contracted.

This figure shows the number of RAxML genes for which important clades (x-axis) are highly (weakly) supported or rejected, or are missing, for three model conditions. Weakly rejected clades are those that are not in the tree but are compatible if low support branches (below 75%) are contracted.
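
The categories used in these figures can be summarized as a small decision rule. The sketch below follows the caption wording only: it assumes that "highly supported" means support at or above the threshold given with -t, and it takes the presence and compatibility tests as inputs rather than computing them from a tree. The function name `classify_clade` and the label strings are hypothetical, and DiscoVista's own logic may differ in detail.

```python
def classify_clade(present, support, compatible_after_contraction, threshold):
    """Label one clade in one tree with one of four categories.

    present: the clade appears as a branch in the tree.
    support: support of that branch (ignored when present is False).
    compatible_after_contraction: the clade does not conflict with the
        tree once branches with support below threshold are contracted.
    threshold: the support cutoff, for example 75 for MLBS.
    """
    if present:
        if support >= threshold:
            return "strongly supported"
        return "weakly supported"
    if compatible_after_contraction:
        return "weakly rejected"
    return "strongly rejected"
```

Clades with missing component taxa fall outside this rule and are reported as missing.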

GC content analysis

Here are some example outputs of this analysis:

This figure corresponds to the GC content analysis. Each dot shows the average GC content ratio for each species in all (red), first (pink), second (light blue), and third (dark blue) codon positions.

This figure corresponds to the GC content analysis, using boxplots for first, second, third, as well as all three codon positions.

Occupancy analysis

Here are some example outputs of this analysis:

This figure shows the occupancy analysis over each individual species for two model conditions.

This figure shows the occupancy analysis on the important splits over each individual species for two model conditions.

Relative frequency analysis

Here is the example output of this analysis:

This figure corresponds to the DiscoVista relative frequency analysis considering a hypothesis. Frequency of three topologies around focal internal branches of ASTRAL species trees using the trimmed gene trees. Main topologies are shown in red, and the other two alternative topologies are shown in blue. The dotted lines indicate the 1/3 threshold. The title of each subfigure indicates the label of the corresponding branch on the tree on the right (also generated by DiscoVista). Each internal branch has four neighboring branches which could be used to represent quartet topologies. On the x-axis the exact definition of each quartet topology is shown using the neighboring branch labels separated by “#”.
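
The bar heights in such a figure are simple proportions. As a minimal sketch, assuming you already have the number of gene trees supporting each of the three topologies around one branch, the relative frequencies and the comparison with the 1/3 line can be computed as follows. The function names are hypothetical and the counts in the usage note are made up for illustration.

```python
def relative_frequencies(counts):
    """Normalize three topology counts around one branch to frequencies.

    counts lists the main topology first, then the two alternatives.
    """
    if len(counts) != 3:
        raise ValueError("expected counts for exactly three topologies")
    total = sum(counts)
    if total == 0:
        raise ValueError("no gene trees are informative for this branch")
    return [count / total for count in counts]


def main_exceeds_one_third(counts):
    """Return True when the main topology's frequency is above 1/3."""
    return relative_frequencies(counts)[0] > 1 / 3
```

For counts of 2, 1, and 1 the frequencies are 0.5, 0.25, and 0.25. Under ILS alone the two alternative topologies are expected to have roughly equal frequencies, so a strong imbalance between the two blue bars is worth a closer look.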

Bug Reports

Please contact [email protected].