Planning 'get_cazyme_annotations'
The aim is to generate a bar chart summarising the number of CAZymes annotated across all GenBank files for a given species. Specifically, a single data-frame and a single summary plot are created per species, collating the data for species that possess multiple genomic assemblies. Even if one genomic assembly is sparsely annotated, another assembly may be highly annotated, and together they can provide many potential CAZyme engineering candidates.
Inputs
- Data-frame containing at least 'genus', 'species', 'taxonomy ID', and 'accession number' columns (can be generated using get_ncbi_genomes).
- Path to directory containing GenBank files
Imports
import argparse
import gzip
import io
import logging
import re
import shutil
import sys
from pathlib import Path
from typing import List, Optional
import pandas as pd
import seaborn as sns
from Bio import SeqIO
from bioservices import UniProt
from tqdm import tqdm
from pyrewton.directory_handling.output_dir_handling_main import make_output_directory
from pyrewton.directory_handling import input_dir_get_cazyme_annotations
from pyrewton.loggers.logger_pyrewton_main import build_logger
from pyrewton.parsers.parser_get_cazyme_annotations import build_parser
Add functions for retrieving the input data-frame, creating the output directory, and retrieving GenBank files to the directory_handling module.
CMD-line args
-d, --dataframe
- optional, default dataframe taken from STDIN, if invoked will specify path to input dataframe.
-g, --genbank
- optional, default input taken from STDIN, if invoked will specify directory containing GenBank files.
-l, --log
- optional, default None, if invoked will cause the log to be written out to a file.
-o, --output
- optional, default output sent to STDOUT, if invoked will specify output directory.
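The four options above could be wired up with argparse roughly as follows. This is a minimal sketch, not the pyrewton parser: the function name build_parser_sketch and the help strings are hypothetical, and "taken from STDIN" / "sent to STDOUT" are represented here simply as a default of None that the caller would have to interpret.

```python
import argparse
from pathlib import Path
from typing import List, Optional


def build_parser_sketch(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the planned options; argv=None makes argparse read sys.argv."""
    parser = argparse.ArgumentParser(
        prog="get_cazyme_annotations",
        description="Summarise CAZyme annotations per species",
    )
    # Assumption: None stands for "not given", i.e. fall back to STDIN
    parser.add_argument(
        "-d", "--dataframe", type=Path, default=None,
        help="path to input dataframe",
    )
    parser.add_argument(
        "-g", "--genbank", type=Path, default=None,
        help="directory containing GenBank files",
    )
    # Default None: no log file is written
    parser.add_argument(
        "-l", "--log", type=Path, default=None,
        help="write the log out to this file",
    )
    # Assumption: None stands for "send output to STDOUT"
    parser.add_argument(
        "-o", "--output", type=Path, default=None,
        help="output directory",
    )
    return parser.parse_args(argv)
```

Passing an explicit list, for example build_parser_sketch(["-d", "genomes.csv"]), keeps the parser testable without touching the real command line.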
def main():
- Build parser
- Build logger
- Retrieve input data-frame
- Create data-frame
def create_annotated_cazy_df()
- Parse input data-frame, row-wise
- Extract genus, species and taxonomy ID.
- Iterate over accession numbers, retrieving GenBank file and protein data by calling get_protein_data
- Create data-frame with a unique protein per row
- Iterate over protein ID and pass to get_uniprot_data function, to retrieve UniProt data for protein
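The row-wise expansion planned above can be sketched without committing to a data-frame library. The example assumes each input row is a dictionary (for instance from pandas DataFrame.to_dict("records")), that the 'accession number' cell holds a comma-separated string, and that the protein lookup is passed in as a callable. The name build_protein_rows and all three assumptions are illustrative, not taken from pyrewton.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, str]


def build_protein_rows(
    species_rows: Iterable[Row],
    get_protein_data: Callable[[str], List[Row]],
) -> List[Row]:
    """Return one dictionary per protein, carrying the species metadata."""
    protein_rows: List[Row] = []
    for row in species_rows:
        # Assumption: several accession numbers share one cell, comma-separated
        accessions = [
            acc.strip() for acc in row["accession number"].split(",") if acc.strip()
        ]
        for accession in accessions:
            for protein in get_protein_data(accession):
                protein_rows.append(
                    {
                        "genus": row["genus"],
                        "species": row["species"],
                        "taxonomy ID": row["taxonomy ID"],
                        "accession number": accession,
                        **protein,  # e.g. protein_id, protein_name, locus_tag
                    }
                )
    return protein_rows
```

The resulting list of dictionaries can be handed straight to a data-frame constructor, giving the planned layout of a unique protein per row.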
def get_protein_data()
Use accession number to open/read respective GenBank file.
Retrieve protein_id, protein_name, locus_tag.
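A possible shape for this step, using Biopython's SeqIO.parse with the "genbank" format. This is a sketch under assumptions: the function names are hypothetical, protein_name is taken from the CDS /product qualifier, missing qualifiers are recorded as "NA", and the file path is passed in directly rather than resolved from the accession number.

```python
import gzip
from pathlib import Path
from typing import Any, Dict, Iterable, List


def _first(qualifiers: Dict[str, List[str]], key: str) -> str:
    """Return the first value stored for a qualifier, or 'NA' if absent."""
    values = qualifiers.get(key, [])
    return values[0] if values else "NA"


def extract_cds_annotations(records: Iterable[Any]) -> List[Dict[str, str]]:
    """Collect protein_id, protein_name and locus_tag from every CDS feature."""
    proteins = []
    for record in records:
        for feature in record.features:
            if feature.type != "CDS":
                continue  # only coding sequences carry protein annotations
            proteins.append(
                {
                    "protein_id": _first(feature.qualifiers, "protein_id"),
                    # Assumption: the protein name is the /product qualifier
                    "protein_name": _first(feature.qualifiers, "product"),
                    "locus_tag": _first(feature.qualifiers, "locus_tag"),
                }
            )
    return proteins


def read_genbank_proteins(path: Path) -> List[Dict[str, str]]:
    """Open a plain or gzipped GenBank file and extract its CDS annotations."""
    from Bio import SeqIO  # imported here so the helpers above need no Biopython

    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        # The list is built before the handle closes, as SeqIO.parse is lazy
        return extract_cds_annotations(SeqIO.parse(handle, "genbank"))
```

Separating the extraction from the file handling means the extraction logic can be checked with simple stand-in objects, without a GenBank file on disk.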
def get_uniprot_data()
Use Bioservices to query UniProt, searching via protein_id and/or locus_tag, and the scientific name of the host species.
Retrieve UniProt entry ID, protein names, EC number, and GO classified function.
Not every CAZyme has a UniProt entry that is linked to a CAZy entry. For example, some proteins are annotated with the function 'endoglucanase' but their respective UniProt entries are not linked to a CAZy family. Therefore, it is imperative to define a list of criteria for identifying CAZymes.
Planning CAZyme identification criteria
- UniProt entry annotated with CAZy family
- GenBank annotated function indicating CAZyme function
- EC Number
- GO recorded function
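One way the four criteria could be combined is to report which of them a protein satisfies, leaving the decision on how many are required for later. The sketch below is illustrative only: the dictionary keys, the function name, and the keyword and EC-prefix lists are placeholder assumptions rather than an agreed criteria list. The EC prefixes cover glycoside hydrolases (3.2.1.-), glycosyltransferases (2.4.-) and polysaccharide lyases (4.2.2.-).

```python
from typing import Dict, List

# Placeholder lists, to be replaced once the criteria are finalised
FUNCTION_KEYWORDS = ("endoglucanase", "glucosidase", "xylanase", "cellulase")
EC_PREFIXES = ("3.2.1.", "2.4.", "4.2.2.")
GO_KEYWORDS = ("carbohydrate", "glycosyl")


def matched_cazyme_criteria(protein: Dict[str, str]) -> List[str]:
    """Return the names of the criteria that the protein record satisfies."""
    matched = []

    # 1. UniProt entry annotated with a CAZy family
    if protein.get("cazy_family", "").strip():
        matched.append("cazy_family")

    # 2. GenBank annotated function indicating CAZyme function
    function = protein.get("genbank_function", "").lower()
    if any(keyword in function for keyword in FUNCTION_KEYWORDS):
        matched.append("genbank_function")

    # 3. EC number falling in a carbohydrate-active enzyme class
    if protein.get("ec_number", "").startswith(EC_PREFIXES):
        matched.append("ec_number")

    # 4. GO recorded function mentioning carbohydrate activity
    go_function = protein.get("go_function", "").lower()
    if any(keyword in go_function for keyword in GO_KEYWORDS):
        matched.append("go_function")

    return matched
```

Returning the matched criteria rather than a single True/False keeps the evidence visible, so a protein annotated as 'endoglucanase' but lacking a CAZy family link is still flagged.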
This page contains the initial plans and development notes for pyrewton.