Skip to content

Python script(s) for assigning allele codes under current PulseNet nomenclature system in BioNumerics but as a command-line executable without BioNumerics dependencies

Notifications You must be signed in to change notification settings

ncezid-biome/AlleleCodes

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

21 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

AlleleCodes

Python script(s) for assigning allele codes under current PulseNet nomenclature system in BioNumerics but as a command-line executable without BioNumerics dependencies

Installation

git

cd ~/bin
git clone [email protected]:gmaniscoo/AlleleCodes.git
export PATH=$PATH:~/AlleleCodes/scripts

Usage

Allele Code Assignment Algorithm:  Using input config file (--config) to denote names of core loci in a cgMLST scheme, assess nearest neighbor
                  distances of input allele profiles (--alleles) against legacy profiles contained in input data
                  diretory (--datadir) in a hierarchical method at organism-specific thresholds defined by input prefix (--prefix)
                  using tree file in data directory as a guide.
Required arguments:
  -a, --alleles:  new allele profiles (Key/StrainID in column 1, locus names as headers with allele numbers in rows beneath, starting in column 2)
  -c, --config:  text file containing names of core loci on separate lines with no header (i.e. all lines are considered locu names)
  -d, --datadir:  data directory (string:  folder where results will be saved it --no-save flag not given)
  -p, --prefix:  (string:  what will be appended in front of the Allele Code and subfolders in data directory where data is stored)

Optional args:
  --nosave:  if provided, tree and allele calls file(s) will not be saved, only results printed to terminal
  --verbose:  if provided, also print what is written to log file to terminal too
  -o, --output:  if provided, print results to file rather than terminal. Delimiter determined by extension (',' for csv, '\t' for tsv)

Example

Example files are included in this repository for this example.

assignAlleleCodes_py3.6.py --alleles BN.tsv  --config ExampleData/coreLoci_CAMP.txt --datadir alleleCodesSave --prefix CAMP --output BN.newallelecodes.tsv

Most output files will be saved to alleleCodesSave in this example. The main output will be saved to BN.newallelecodes.tsv in this example. If you don't want to save anything into alleleCodesSave, then you can use the --nosave option.

Output files

Some output files will have a prefix as designated by --prefix. Following our example, files with a prefix in this table with simply have CAMP as the prefix. For these files, you will also notice that the prefix is joined to the rest of the filename with an underscore.

file what Example
alleleCodes.tsv A TSV file as designated with --output. Contains two columns: sample and allele code. SRR4280276 CAMP2.1 - 3.1.15.1.1.1
CAMP_nomenclature_srcfiles source files
CAMP_nomenclature_srcfiles/allele_calls Main directory storing allele profiles written to file over time
CAMP_nomenclature_srcfiles/allele_calls/current Directory contains most recently written allele profiles in blocks of 1000
CAMP_nomenclature_srcfiles/tree Directory contains back-ups of previously rendered single-linkage tree as json dictionary
CAMP_nomenclature_srcfiles/tree/current Directory contains most recently rendered single-linkage tree as json dictionary for use in next run timestamped with last run's date-time [email protected]
CAMP_nomenclature_logs/CAMP_nomenclature_logs This directory contains the actual log files from each run, allowing you to run the main script multiple times and aggregate results. Filenames will have a timestamped format such as [email protected].
CAMP_nomenclature_logs/CAMP_nomenclature_logs/change_log This directory contains log files of how allele codes have changed over time. The files are timestamped with the current date of run (e.g., 2023-05-30.tsv) and appended to each time if run multiple times on the same day

Flowchart

flowchart LR
    A[New cgMLST Profile] --> B[Calculate Distance to Founders]
    B --> C[Filter to founders within distance threshold]
    C --> D{ }
    D --> E[Single Reference Remains]
    D --> F[Multiple References Remain]
    D --> G[No Reference Remains]
    E --> H["Retain reference number (1)"]
    F --> I["Change code with fewer members to the other,\nReassess each subsequent positions' members"]
    G --> J["Assign next integer,\nWrite retained code to database field (e.g., 1.1.2.1.5)"]
Loading

About

Python script(s) for assigning allele codes under current PulseNet nomenclature system in BioNumerics but as a command-line executable without BioNumerics dependencies

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%