Thomas Roder edited this page Apr 5, 2024 · 11 revisions

Getting started

Consider having a look at the tutorial.

The Input page describes how to format the input files.

Cookbook

Tetracycline dataset from Scoary 1

# Dataset from Scoary 1: genes in Roary gene count format
scoary2 \
    --genes Gene_presence_absence.csv \
    --gene-data-type 'gene-count:,' \
    --traits Tetracycline_resistance.csv \
    --outdir out \
    --n-permut 1000
# If Gene_presence_absence.csv is in gene-list format, use
#   --gene-data-type 'gene-list:,'
# instead.
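
Before running the command above, it can help to confirm that both tables refer to the same isolates. The following Python sketch is not part of Scoary2. It assumes the orientation given in the manual below (genes table: columns are isolates; traits table: rows are isolates), assumes the first column of each table holds the row identifiers, and uses made-up toy data. Real Roary or OrthoFinder tables may carry extra annotation columns that would have to be skipped first.

```python
import csv
import io

def isolates_in_genes_table(handle, delimiter=","):
    """Isolate names are the column headers after the first (gene ID) column."""
    header = next(csv.reader(handle, delimiter=delimiter))
    return set(header[1:])

def isolates_in_traits_table(handle, delimiter=","):
    """Isolate names are the first field of every row after the header."""
    reader = csv.reader(handle, delimiter=delimiter)
    next(reader)  # skip the header row, which holds the trait names
    return {row[0] for row in reader if row}

def compare_isolates(genes_handle, traits_handle, delimiter=","):
    """Return (isolates only in the genes table, isolates only in the traits table)."""
    genes = isolates_in_genes_table(genes_handle, delimiter)
    traits = isolates_in_traits_table(traits_handle, delimiter)
    return sorted(genes - traits), sorted(traits - genes)

# Toy tables with hypothetical isolate names (not the real Scoary 1 dataset).
genes_csv = io.StringIO("Gene,iso1,iso2,iso3\ngeneA,1,0,2\ngeneB,0,0,1\n")
traits_csv = io.StringIO("Isolate,Tetracycline_resistance\niso1,1\niso2,0\niso4,1\n")

only_genes, only_traits = compare_isolates(genes_csv, traits_csv)
print("only in genes table:", only_genes)
print("only in traits table:", only_traits)
```

For real files, the `io.StringIO` objects would be replaced by `open(path, newline="")` handles, and `delimiter="\t"` would be passed for tab-separated tables.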

OrthoFinder-style dataset with numeric traits

scoary2 \
    --genes N0.tsv \
    --gene-data-type 'gene-list:\t' \
    --traits traits.tsv \
    --trait-data-type 'gaussian:\t' \
    --n-permut 300 \
    --n-cpus 7 \
    --outdir out
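
When Scoary2 is launched from a Python script rather than from a shell, the delimiter needs some care. Inside single quotes the shell passes 'gene-list:\t' as a literal backslash followed by t, so the Python string should contain those same two characters (a raw string does this). The sketch below only assembles the argument list from the flags used in the example above. `build_scoary2_command` is a hypothetical helper, not part of Scoary2, and the actual `subprocess` call is left commented out because it requires `scoary2` to be installed.

```python
import shlex

def build_scoary2_command(genes, traits, outdir, **options):
    """Assemble a scoary2 argument list; keyword names become --flag-names."""
    cmd = ["scoary2", "--genes", genes, "--traits", traits, "--outdir", outdir]
    for name, value in options.items():
        cmd += ["--" + name.replace("_", "-"), str(value)]
    return cmd

cmd = build_scoary2_command(
    "N0.tsv",
    "traits.tsv",
    "out",
    gene_data_type=r"gene-list:\t",   # raw string: backslash + t, as in the shell example
    trait_data_type=r"gaussian:\t",
    n_permut=300,
    n_cpus=7,
)

# Print a shell-quoted version of the command for inspection.
print(shlex.join(cmd))

# To actually run it (requires scoary2 on the PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids a second round of shell quoting, which is why the raw string is enough here.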

OrthoFinder-style dataset + numeric traits + gene-info + trait-info + isolate-info

scoary2 \
    --genes N0.tsv \
    --gene-data-type 'gene-list:\t' \
    --gene-info N0_best_names.tsv \
    --traits traits.tsv \
    --trait-data-type 'gaussian:\t' \
    --trait-info trait_info.tsv \
    --isolate-info isolate_info.tsv \
    --n-permut 200 \
    --n-cpus 7 \
    --outdir out
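
The --n-permut option sets how many label permutations are drawn for the post-hoc test described in the manual below. As a rough illustration of label switching in general, the following Python sketch shuffles a binary trait across isolates and asks how often the shuffled data look at least as associated as the observed data. The test statistic (a simple co-occurrence count), the function names, the (hits + 1) / (n + 1) estimator and the toy data are all assumptions made for illustration; this page does not describe the statistic Scoary2 itself uses.

```python
import random

def co_occurrence(gene, trait):
    """Number of isolates that have both the gene and the trait."""
    return sum(g and t for g, t in zip(gene, trait))

def label_switching_p_value(gene, trait, n_permut=200, seed=0):
    """Empirical p-value from random permutations of the trait labels."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = co_occurrence(gene, trait)
    hits = 0
    for _ in range(n_permut):
        shuffled = rng.sample(trait, len(trait))  # shuffled copy of the labels
        if co_occurrence(gene, shuffled) >= observed:
            hits += 1
    # Add-one estimator: the observed labelling counts as one permutation.
    return (hits + 1) / (n_permut + 1)

# Toy data: ten hypothetical isolates, gene and trait perfectly associated.
gene = [1] * 5 + [0] * 5
trait = [1] * 5 + [0] * 5
p = label_switching_p_value(gene, trait, n_permut=200)
print(p)
```

Under this estimator the smallest attainable value is 1 / (n_permut + 1), so a larger number of permutations gives finer resolution at the cost of runtime.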

Manual

More info:

Below is the output of scoary2 --help:

POSITIONAL ARGUMENTS
    GENES
        Type: str
        Path to gene presence/absence table: columns=isolates, rows=genes
    TRAITS
        Type: str
        Path to trait presence/absence table: columns=traits, rows=isolates
    OUTDIR
        Type: str
        Directory to place output files

FLAGS
    --multiple_testing=MULTIPLE_TESTING
        Type: str
        Default: 'bonferroni:0.999'
        Apply multiple testing correction to the p-values of Fisher's test to account for the many genes/traits tested. Format: "method:cutoff". Cutoff is a number that specifies the FWER (or the FDR for the fdr_* methods) and method is one of [native, bonferroni, sidak, holm-sidak, holm, simes-hochberg, hommel, fdr_bh, fdr_by, fdr_tsbh, fdr_tsbky]. If method is 'native', the cutoff applies to the uncorrected p-value from Fisher's test.
    --trait_wise_correction=TRAIT_WISE_CORRECTION
        Type: bool
        Default: False
        Apply multiple testing correction to each trait separately. Not recommended as this can lead to many false positives!
    -w, --worst_cutoff=WORST_CUTOFF
        Type: Optional[float]
        Default: None
        Drop traits for which no gene has a "worst" p-value below the threshold. Recommended if the dataset contains multiple species
    --max_genes=MAX_GENES
        Type: Optional[int]
        Default: None
        Keep only the n highest-scoring genes in Fisher's test. Recommended if the dataset is big and contains multiple species; avoids wasting computational resources on traits that simply correlate with phylogeny
    --gene_info=GENE_INFO
        Type: Optional[str]
        Default: None
        Path to file that describes genes: columns=arbitrary properties, rows=genes
    --trait_info=TRAIT_INFO
        Type: Optional[str]
        Default: None
        Path to file that describes traits: columns=arbitrary properties, rows=traits
    --isolate_info=ISOLATE_INFO
        Type: Optional[str]
        Default: None
        Path to file that describes isolates: columns=arbitrary properties, rows=isolates
    --newicktree=NEWICKTREE
        Type: Optional[str]
        Default: None
        Path to a custom tree in Newick format
    -p, --pairwise=PAIRWISE
        Type: bool
        Default: True
        If False, only perform Fisher's test. If True, also run the pairwise comparisons algorithm.
    --n_permut=N_PERMUT
        Type: int
        Default: 500
        Post-hoc label-switching test: perform N permutations of the phenotype by random label switching. Low p-values suggest that the effect is not merely lineage-specific.
    --restrict_to=RESTRICT_TO
        Type: Optional[str]
        Default: None
        Comma-separated list of isolates to which to restrict this analysis
    --ignore=IGNORE
        Type: Optional[str]
        Default: None
        Comma-separated list of isolates to be ignored for this analysis
    --n_cpus=N_CPUS
        Type: int
        Default: 1
        Number of CPUs that should be used. There is overhead in multiprocessing, so if the dataset is small, use n_cpus=1
    --n_cpus_binarization=N_CPUS_BINARIZATION
        Type: Optional[int]
        Default: None
        Number of CPUs that should be used for binarization. Default: one tenth of n_cpus
    --trait_data_type=TRAIT_DATA_TYPE
        Type: str
        Default: 'binary:,'
        "<method>:<?cutoff>:<?covariance_type>:<?alternative>:<?delimiter>" How to read the traits table. Example: "gaussian:\t" for a tab-separated table of numeric traits
    --gene_data_type=GENE_DATA_TYPE
        Type: str
        Default: 'gene-count:,'
        "<data_type>:<?delimiter>" How to read the genes table. Example: "gene-list:\t" for OrthoFinder N0.tsv table
    -f, --force_binary_clustering=FORCE_BINARY_CLUSTERING
        Type: bool
        Default: False
        Force clustering of binary data even if numeric data is available
    -s, --symmetric=SYMMETRIC
        Type: bool
        Default: True
        if True, correlated and anti-correlated traits will cluster together
    -d, --distance_metric=DISTANCE_METRIC
        Type: str
        Default: 'jaccard'
        distance metric (binary data only); See metric in https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html
    --linkage_method=LINKAGE_METHOD
        Type: str
        Default: 'ward'
        linkage method for clustering [single, complete, average, weighted, ward, centroid, median]
    -o, --optimal_ordering=OPTIMAL_ORDERING
        Type: bool
        Default: True
        whether to use optimal ordering; See scipy.cluster.hierarchy.linkage.
    -c, --corr_method=CORR_METHOD
        Type: str
        Default: 'pearson'
        correlation method (numeric data only) [pearson, kendall, spearman]
    --random_state=RANDOM_STATE
        Type: Optional[int]
        Default: None
        Set a fixed seed for the random number generator
    --limit_traits=LIMIT_TRAITS
        Type: Optional[(<class ...
        Default: None
        Limit the analysis to traits n to m. Useful for debugging. Example: "(0, 10)"
    -v, --version=VERSION
        Type: bool
        Default: False
        Print software version of Scoary2 and exit.
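
To make the "method:cutoff" format of --multiple_testing concrete, the following Python sketch splits such a string and applies a plain Bonferroni adjustment to a few made-up p-values. The helper names and the p-values are hypothetical, and this is not Scoary2's code; it only illustrates the arithmetic, in which each p-value is multiplied by the number of tests and capped at 1.

```python
def parse_multiple_testing(spec):
    """Split a 'method:cutoff' string into (method, cutoff)."""
    method, _, cutoff = spec.partition(":")
    return method, float(cutoff)

def bonferroni(p_values):
    """Bonferroni-adjusted p-values: p * number_of_tests, capped at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]

method, cutoff = parse_multiple_testing("bonferroni:0.999")

# Hypothetical Fisher's test p-values for three genes.
p_values = [0.0001, 0.02, 0.4]
adjusted = bonferroni(p_values)

# Keep the tests whose adjusted p-value does not exceed the cutoff.
kept = [p for p in adjusted if p <= cutoff]
print(method, cutoff, adjusted, kept)
```

Under this arithmetic, a Bonferroni cutoff as high as the default 0.999 removes only tests whose adjusted p-value exceeds 0.999, which in practice means those capped at 1. A lower cutoff such as 0.05 would be stricter.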