Usage

Getting started

New users should start with the tutorial.

The Input page describes how to format the input files.

Cookbook

Tetracycline dataset from Scoary 1

# Dataset from Scoary 1: genes in Roary gene count format
scoary2 \
    --genes Gene_presence_absence.csv \
    --gene-data-type 'gene-count:,' \
    --traits Tetracycline_resistance.csv \
    --outdir out \
    --n-permut 1000
# If Gene_presence_absence.csv is in gene-list format, use
#   --gene-data-type 'gene-list:,'
# instead

OrthoFinder-style dataset with numeric traits

scoary2 \
    --genes N0.tsv \
    --gene-data-type 'gene-list:\t' \
    --traits traits.tsv \
    --trait-data-type 'gaussian:\t' \
    --n-permut 300 \
    --n-cpus 7 \
    --outdir out

OrthoFinder-style dataset + numeric traits + gene-info + trait-info + isolate-info

scoary2 \
    --genes N0.tsv \
    --gene-data-type 'gene-list:\t' \
    --gene-info N0_best_names.tsv \
    --traits traits.tsv \
    --trait-data-type 'gaussian:\t' \
    --trait-info trait_info.tsv \
    --isolate-info isolate_info.tsv \
    --n-permut 200 \
    --n-cpus 7 \
    --outdir out
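
The --restrict_to and --ignore flags described in the manual below take a comma-separated list of isolates. When the isolate names are kept in a file with one name per line, the list can be assembled with paste. The following is a minimal sketch and not part of Scoary2: the file name keep.txt and the isolate names are made up for illustration, and the final command is only printed, not executed.

```shell
# Hypothetical input: one isolate name per line (the names are made up).
workdir=$(mktemp -d)
printf '%s\n' isolate_A isolate_B isolate_C > "$workdir/keep.txt"

# Join the lines with commas to get the format that --restrict_to expects.
restrict_list=$(paste -s -d , "$workdir/keep.txt")

# Print the command instead of running it, so the list can be inspected first.
printf '%s\n' "scoary2 --genes Gene_presence_absence.csv --traits Tetracycline_resistance.csv --outdir out --restrict_to $restrict_list"

# Remove the temporary demo directory.
rm -r "$workdir"
```

The same list format works for --ignore. Isolate names that contain commas cannot be expressed this way.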

Manual

More info:

Input
Output

Below is the output of scoary2 --help:

POSITIONAL ARGUMENTS
    GENES
        Type: str
        Path to gene presence/absence table: columns=isolates, rows=genes
    TRAITS
        Type: str
        Path to trait presence/absence table: columns=traits, rows=isolates
    OUTDIR
        Type: str
        Directory to place output files

FLAGS
    --multiple_testing=MULTIPLE_TESTING
        Type: str
        Default: 'bonferroni:0.999'
        Apply multiple testing correction to the p-values of Fisher's test to account for the many genes/traits tested. Format: "method:cutoff". Cutoff is the significance threshold applied after correction (the FWER, or the FDR for the fdr_* methods), and method is one of [native, bonferroni, sidak, holm-sidak, holm, simes-hochberg, hommel, fdr_bh, fdr_by, fdr_tsbh, fdr_tsbky]. If method is 'native', the cutoff is applied to the uncorrected p-value from Fisher's test.
    --trait_wise_correction=TRAIT_WISE_CORRECTION
        Type: bool
        Default: False
        Apply multiple testing correction to each trait separately. Not recommended as this can lead to many false positives!
    -w, --worst_cutoff=WORST_CUTOFF
        Type: Optional[float]
        Default: None
        Drop traits if no gene has a "worst" p-value lower than the threshold. Recommended if the dataset contains multiple species.
    --max_genes=MAX_GENES
        Type: Optional[int]
        Default: None
        Keep only the n highest-scoring genes in Fisher's test. Recommended if the dataset is big and contains multiple species; avoids wasting computational resources on traits that simply correlate with phylogeny.
    --gene_info=GENE_INFO
        Type: Optional[str]
        Default: None
        Path to file that describes genes: columns=arbitrary properties, rows=genes
    --trait_info=TRAIT_INFO
        Type: Optional[str]
        Default: None
        Path to file that describes traits: columns=arbitrary properties, rows=traits
    --isolate_info=ISOLATE_INFO
        Type: Optional[str]
        Default: None
        Path to file that describes isolates: columns=arbitrary properties, rows=isolates
    --newicktree=NEWICKTREE
        Type: Optional[str]
        Default: None
        Path to a custom tree in Newick format
    -p, --pairwise=PAIRWISE
        Type: bool
        Default: True
        If False, only perform Fisher's test. If True, also run the pairwise comparisons algorithm.
    --n_permut=N_PERMUT
        Type: int
        Default: 500
        Post-hoc label-switching test: perform N permutations of the phenotype by random label switching. Low p-values suggest that the effect is not merely lineage-specific.
    --restrict_to=RESTRICT_TO
        Type: Optional[str]
        Default: None
        Comma-separated list of isolates to which to restrict this analysis
    --ignore=IGNORE
        Type: Optional[str]
        Default: None
        Comma-separated list of isolates to be ignored for this analysis
    --n_cpus=N_CPUS
        Type: int
        Default: 1
        Number of CPUs that should be used. There is overhead in multiprocessing, so if the dataset is small, use n_cpus=1
    --n_cpus_binarization=N_CPUS_BINARIZATION
        Type: Optional[int]
        Default: None
        Number of CPUs that should be used for binarization. Default: one tenth of n_cpus
    --trait_data_type=TRAIT_DATA_TYPE
        Type: str
        Default: 'binary:,'
        "<method>:<?cutoff>:<?covariance_type>:<?alternative>:<?delimiter>" How to read the traits table. Example: "gaussian:\t" for a tab-separated table of numeric traits
    --gene_data_type=GENE_DATA_TYPE
        Type: str
        Default: 'gene-count:,'
        "<data_type>:<?delimiter>" How to read the genes table. Example: "gene-list:\t" for OrthoFinder N0.tsv table
    -f, --force_binary_clustering=FORCE_BINARY_CLUSTERING
        Type: bool
        Default: False
        Force clustering of binary data even if numeric data is available
    -s, --symmetric=SYMMETRIC
        Type: bool
        Default: True
        If True, correlated and anti-correlated traits will cluster together.
    -d, --distance_metric=DISTANCE_METRIC
        Type: str
        Default: 'jaccard'
        Distance metric (binary data only). See metric in https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html
    --linkage_method=LINKAGE_METHOD
        Type: str
        Default: 'ward'
        Linkage method for clustering [single, complete, average, weighted, ward, centroid, median]
    -o, --optimal_ordering=OPTIMAL_ORDERING
        Type: bool
        Default: True
        Whether to use optimal ordering. See scipy.cluster.hierarchy.linkage.
    -c, --corr_method=CORR_METHOD
        Type: str
        Default: 'pearson'
        Correlation method (numeric data only) [pearson, kendall, spearman]
    --random_state=RANDOM_STATE
        Type: Optional[int]
        Default: None
        Set a fixed seed for the random number generator
    --limit_traits=LIMIT_TRAITS
        Type: Optional[(<class ...
        Default: None
        Limit the analysis to traits n to m. Useful for debugging. Example: "(0, 10)"
    -v, --version=VERSION
        Type: bool
        Default: False
        Print software version of Scoary2 and exit.
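
The "method:cutoff" value of --multiple_testing is easy to mistype, and a misspelled method name would otherwise only surface once scoary2 starts. The sketch below checks the method name in the shell before the argument is assembled. It is not part of Scoary2: the helper name is_valid_method is hypothetical, the list of methods is copied from the help text above, and the cutoff of 0.05 is only an example value, not a recommendation.

```shell
# Hypothetical helper: succeed only if the argument is one of the methods
# listed in the --multiple_testing help text.
is_valid_method() {
    case "$1" in
        native|bonferroni|sidak|holm-sidak|holm|simes-hochberg|hommel|fdr_bh|fdr_by|fdr_tsbh|fdr_tsbky)
            return 0 ;;
        *)
            return 1 ;;
    esac
}

method=fdr_bh
cutoff=0.05   # assumed example value

if is_valid_method "$method"; then
    # Assemble the "method:cutoff" string and print the resulting flag.
    multiple_testing="$method:$cutoff"
    printf '%s\n' "--multiple_testing $multiple_testing"
else
    printf '%s\n' "unknown multiple testing method: $method" >&2
fi
```

The printed flag can then be appended to any of the cookbook commands.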
