Usage
Thomas Roder edited this page Apr 5, 2024
·
11 revisions
Consider having a look at the tutorial.
The Input page describes how to format the input files.
Tetracycline dataset from Scoary 1
# Dataset from Scoary 1: genes in Roary gene count format
scoary2 \
--genes Gene_presence_absence.csv \
--gene-data-type 'gene-count:,' \
--traits Tetracycline_resistance.csv \
--outdir out \
--n-permut 1000
# If gene_presence_absence.csv is in gene-list format, use
# --gene-data-type 'gene-list:,'
# instead
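To illustrate the difference between the two gene table formats, the following Python sketch converts one gene-list row into counts and presence/absence values. It assumes that a gene-list cell holds gene identifiers separated by commas (as in OrthoFinder's N0.tsv) and that an empty cell means the gene is absent. The function name, isolate names, and gene identifiers are hypothetical, and this is not Scoary2's own parsing code.

```python
def gene_list_to_count(cell: str, separator: str = ",") -> int:
    """Count gene identifiers in one gene-list cell; an empty cell counts as 0."""
    parts = (part.strip() for part in cell.split(separator))
    return len([gene for gene in parts if gene])


# One row of a hypothetical gene-list table: isolate -> genes in this orthogroup.
row = {"isolate_1": "g1_00012, g1_00345", "isolate_2": "", "isolate_3": "g3_00007"}

# The same row in gene-count form, then reduced to presence/absence.
counts = {isolate: gene_list_to_count(cell) for isolate, cell in row.items()}
presence = {isolate: int(n > 0) for isolate, n in counts.items()}
```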
OrthoFinder-style dataset with numeric traits
scoary2 \
--genes N0.tsv \
--gene-data-type 'gene-list:\t' \
--traits traits.tsv \
--trait-data-type 'gaussian:\t' \
--n-permut 300 \
--n-cpus 7 \
--outdir out
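The --gene-data-type value used above consists of a data type, a colon, and a delimiter. Inside single quotes, the shell passes '\t' to the program as a backslash followed by the letter t, so the program has to translate it into a real tab. The Python sketch below shows one way such a string could be split. The function name and the fallback delimiter are assumptions made for illustration, not Scoary2's implementation.

```python
def parse_gene_data_type(spec: str, default_delimiter: str = ",") -> tuple[str, str]:
    """Split a '<data_type>:<delimiter>' string into its two parts."""
    data_type, _, delimiter = spec.partition(":")
    if delimiter == "":
        # Assumption: fall back to a comma when no delimiter is given.
        delimiter = default_delimiter
    if delimiter == "\\t":
        # A literal backslash + 't' from the command line becomes a tab character.
        delimiter = "\t"
    return data_type, delimiter


data_type, delimiter = parse_gene_data_type("gene-list:\\t")
```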
OrthoFinder-style dataset + numeric traits + gene-info + trait-info + isolate-info
scoary2 \
--genes N0.tsv \
--gene-data-type 'gene-list:\t' \
--gene-info N0_best_names.tsv \
--traits traits.tsv \
--trait-data-type 'gaussian:\t' \
--trait-info trait_info.tsv \
--isolate-info isolate_info.tsv \
--n-permut 200 \
--n-cpus 7 \
--outdir out
More info:
Below is the output of scoary2 --help:
POSITIONAL ARGUMENTS
GENES
Type: str
Path to gene presence/absence table: columns=isolates, rows=genes
TRAITS
Type: str
Path to trait presence/absence table: columns=traits, rows=isolates
OUTDIR
Type: str
Directory to place output files
FLAGS
--multiple_testing=MULTIPLE_TESTING
Type: str
Default: 'bonferroni:0.999'
Apply multiple testing correction to the p-values of Fisher's test to account for the many genes/traits tested. Format: "method:cutoff". Cutoff is a number that specifies the target error rate (the FWER, or the FDR for the fdr_* methods) and method is one of [native, bonferroni, sidak, holm-sidak, holm, simes-hochberg, hommel, fdr_bh, fdr_by, fdr_tsbh, fdr_tsbky]. If method is 'native', the cutoff is applied to the uncorrected p-value from Fisher's test.
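To make the "method:cutoff" format concrete, here is a minimal Python sketch that parses such a string and applies a Bonferroni correction to a few made-up p-values. The function names are hypothetical and the comparison against the cutoff (less than or equal) is an assumption. It only illustrates what the default 'bonferroni:0.999' means, not how Scoary2 implements it.

```python
def parse_multiple_testing(spec: str) -> tuple[str, float]:
    """Split a 'method:cutoff' string into the method name and a numeric cutoff."""
    method, cutoff = spec.split(":", 1)
    return method, float(cutoff)


def bonferroni(pvalues: list[float]) -> list[float]:
    """Multiply each p-value by the number of tests, capped at 1."""
    n_tests = len(pvalues)
    return [min(1.0, p * n_tests) for p in pvalues]


method, cutoff = parse_multiple_testing("bonferroni:0.999")
pvalues = [0.0001, 0.02, 0.5]  # made-up p-values for three genes
adjusted = bonferroni(pvalues)
# Keep the genes whose adjusted p-value does not exceed the cutoff.
kept = [p for p, q in zip(pvalues, adjusted) if q <= cutoff]
```

With a cutoff this close to 1, the default filter only discards genes whose adjusted p-value is capped at 1.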
--trait_wise_correction=TRAIT_WISE_CORRECTION
Type: bool
Default: False
Apply multiple testing correction to each trait separately. Not recommended as this can lead to many false positives!
-w, --worst_cutoff=WORST_CUTOFF
Type: Optional[float]
Default: None
Drop traits for which no gene has a "worst" p-value lower than the threshold. Recommended if the dataset contains multiple species
--max_genes=MAX_GENES
Type: Optional[int]
Default: None
Keep only the n highest-scoring genes in Fisher's test. Recommended if the dataset is big and contains multiple species; avoids wasting computational resources on traits that simply correlate with phylogeny
--gene_info=GENE_INFO
Type: Optional[str]
Default: None
Path to file that describes genes: columns=arbitrary properties, rows=genes
--trait_info=TRAIT_INFO
Type: Optional[str]
Default: None
Path to file that describes traits: columns=arbitrary properties, rows=traits
--isolate_info=ISOLATE_INFO
Type: Optional[str]
Default: None
Path to file that describes isolates: columns=arbitrary properties, rows=isolates
--newicktree=NEWICKTREE
Type: Optional[str]
Default: None
Path to a custom tree in Newick format
-p, --pairwise=PAIRWISE
Type: bool
Default: True
If False, only perform Fisher's test. If True, also run the pairwise comparisons algorithm.
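For readers unfamiliar with Fisher's test, the sketch below computes a two-sided Fisher's exact test for one gene and one trait using only the Python standard library. The 2x2 table layout (gene present/absent against trait present/absent) and the function name are assumptions made for illustration. Scoary2's own implementation may differ.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of a table with x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables that are at most as likely as the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_observed * (1 + 1e-7))
    return min(1.0, total)


# a: gene and trait present, b: gene only, c: trait only, d: neither.
p_value = fisher_exact_two_sided(4, 0, 0, 4)
```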
--n_permut=N_PERMUT
Type: int
Default: 500
Post-hoc label-switching test: perform N permutations of the phenotype by random label switching. Low p-values suggest that the effect is not merely lineage-specific.
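The idea of a label-switching test can be sketched in a few lines of Python. The example below shuffles the trait labels n_permut times and reports how often the shuffled data score at least as well as the observed data. The toy score (number of isolates where gene and trait agree) is a stand-in chosen for illustration, and the function name is hypothetical. Scoary2's own test statistic is not described on this page and is not reproduced here.

```python
import random


def permutation_pvalue(gene, trait, n_permut=500, seed=None):
    """Empirical p-value from random label switching of the trait vector."""
    rng = random.Random(seed)  # a fixed seed makes the result reproducible

    def score(g, t):
        # Toy statistic: number of isolates where gene and trait agree.
        return sum(1 for x, y in zip(g, t) if x == y)

    observed = score(gene, trait)
    shuffled = list(trait)
    n_as_good = 0
    for _ in range(n_permut):
        rng.shuffle(shuffled)
        if score(gene, shuffled) >= observed:
            n_as_good += 1
    # Add-one correction so the p-value is never exactly zero.
    return (n_as_good + 1) / (n_permut + 1)


gene = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
trait = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
p_value = permutation_pvalue(gene, trait, n_permut=500, seed=42)
```

More permutations give a finer-grained p-value at the cost of runtime, since the smallest attainable value is 1 / (n_permut + 1).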
--restrict_to=RESTRICT_TO
Type: Optional[str]
Default: None
Comma-separated list of isolates to which to restrict this analysis
--ignore=IGNORE
Type: Optional[str]
Default: None
Comma-separated list of isolates to be ignored for this analysis
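As an illustration of how the two comma-separated lists could act on a set of isolates, consider the Python sketch below. The function name is hypothetical, and the behaviour when both options are given, or when a name is unknown, is an assumption. Check Scoary2's actual behaviour before relying on it.

```python
def select_isolates(all_isolates, restrict_to=None, ignore=None):
    """Apply comma-separated restrict_to / ignore lists to isolate names."""
    keep = list(all_isolates)
    if restrict_to:
        wanted = set(restrict_to.split(","))
        keep = [isolate for isolate in keep if isolate in wanted]
    if ignore:
        unwanted = set(ignore.split(","))
        keep = [isolate for isolate in keep if isolate not in unwanted]
    return keep


selected = select_isolates(["A", "B", "C", "D"], restrict_to="A,B,C", ignore="B")
```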
--n_cpus=N_CPUS
Type: int
Default: 1
Number of CPUs that should be used. There is overhead in multiprocessing, so if the dataset is small, use n_cpus=1
--n_cpus_binarization=N_CPUS_BINARIZATION
Type: Optional[int]
Default: None
Number of CPUs that should be used for binarization. Default: one tenth of n_cpus
--trait_data_type=TRAIT_DATA_TYPE
Type: str
Default: 'binary:,'
"<method>:<?cutoff>:<?covariance_type>:<?alternative>:<?delimiter>" How to read the traits table. Example: "gaussian:\t" for a tab-separated table of numeric traits
--gene_data_type=GENE_DATA_TYPE
Type: str
Default: 'gene-count:,'
"<data_type>:<?delimiter>" How to read the genes table. Example: "gene-list:\t" for OrthoFinder N0.tsv table
-f, --force_binary_clustering=FORCE_BINARY_CLUSTERING
Type: bool
Default: False
Force clustering of binary data even if numeric data is available
-s, --symmetric=SYMMETRIC
Type: bool
Default: True
If True, correlated and anti-correlated traits will cluster together
-d, --distance_metric=DISTANCE_METRIC
Type: str
Default: 'jaccard'
Distance metric (binary data only); see the metric parameter in https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html
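The default 'jaccard' metric can be written out in plain Python as follows. This is a sketch of the standard Jaccard dissimilarity for binary vectors, not a copy of the scipy code, and the function name is hypothetical.

```python
def jaccard_distance(u, v):
    """Jaccard dissimilarity between two binary vectors of equal length."""
    # Positions where at least one of the two vectors is non-zero.
    either = sum(1 for x, y in zip(u, v) if x or y)
    # Positions where exactly one of the two vectors is non-zero.
    differ = sum(1 for x, y in zip(u, v) if bool(x) != bool(y))
    # Treat two all-zero vectors as identical.
    return differ / either if either else 0.0


distance = jaccard_distance([1, 1, 0, 0], [1, 0, 1, 0])
```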
--linkage_method=LINKAGE_METHOD
Type: str
Default: 'ward'
Linkage method for clustering [single, complete, average, weighted, ward, centroid, median]
-o, --optimal_ordering=OPTIMAL_ORDERING
Type: bool
Default: True
Whether to use optimal ordering; see scipy.cluster.hierarchy.linkage.
-c, --corr_method=CORR_METHOD
Type: str
Default: 'pearson'
Correlation method (numeric data only) [pearson, kendall, spearman]
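The following Python sketch shows how a correlation could be turned into a clustering distance, and how the symmetric option could make anti-correlated traits cluster together. The formulas 1 - |r| and 1 - r are assumptions chosen for illustration, and the function names are hypothetical. The page does not state which formula Scoary2 uses.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_distance(x, y, symmetric=True):
    """Distance derived from the Pearson correlation (assumed formulas)."""
    r = pearson(x, y)
    # Assumption: symmetric mode ignores the sign of the correlation.
    return 1 - abs(r) if symmetric else 1 - r


trait_a = [1, 2, 3, 4]
trait_b = [8, 6, 4, 2]  # perfectly anti-correlated with trait_a
d_symmetric = correlation_distance(trait_a, trait_b, symmetric=True)
d_asymmetric = correlation_distance(trait_a, trait_b, symmetric=False)
```

Under these assumptions, the two anti-correlated traits are at distance 0 in symmetric mode and at the maximum distance of 2 otherwise.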
--random_state=RANDOM_STATE
Type: Optional[int]
Default: None
Set a fixed seed for the random number generator
--limit_traits=LIMIT_TRAITS
Type: Optional[(<class ...
Default: None
Limit the analysis to traits n to m. Useful for debugging. Example: "(0, 10)"
-v, --version=VERSION
Type: bool
Default: False
Print software version of Scoary2 and exit.