Planning 'get_cazyme_annotations'

Plan and Draft for 'get_cazyme_annotations' sub-module

Script aim

To generate a bar chart summarising the number of CAZymes annotated within all GenBank files for a given species. Specifically, the script creates a single data frame and a summary graphical representation, collating the data for a species that possesses multiple genomic assemblies. This matters because even if one genomic assembly is sparsely annotated, another assembly may be highly annotated, and together they may provide many potential CAZyme engineering candidates.

Script plan and structure

Inputs

Data-frame containing at least 'genus', 'species', 'taxonomy ID', and 'accession number' columns (can be generated using get_ncbi_genomes).
Path to directory containing GenBank files

Imports
import argparse
import gzip
import io
import logging
import re
import shutil
import sys

from pathlib import Path
from typing import List, Optional

import pandas as pd
import seaborn as sns

from Bio import SeqIO
from bioservices import UniProt
from tqdm import tqdm

from pyrewton.directory_handling.output_dir_handling_main import make_output_directory
from pyrewton.directory_handling import input_dir_get_cazyme_annotations
from pyrewton.loggers.logger_pyrewton_main import build_logger
from pyrewton.parsers.parser_get_cazyme_annotations import build_parser

Add functions for retrieving input data frame, creating output directory, and retrieving GenBank files to the directory_handling module

CMD-line args
-d, --dataframe - optional, default dataframe taken from STDIN, if invoked will specify path to input dataframe.
-g, --genbank - optional, default input taken from STDIN, if invoked will specify directory containing GenBank files.
-l, --log - optional, default None, if invoked will cause the log to be written out to a file.
-o, --output - optional, default output sent to STDOUT, if invoked will specify output directory.
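
The four options listed above could be declared with argparse roughly as follows. This is only a sketch: the function name parse_cmd_args is hypothetical, and representing the STDIN/STDOUT defaults as None is an assumption, not the behaviour of the planned build_parser.

```python
import argparse
from pathlib import Path
from typing import List, Optional


def parse_cmd_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the four planned options (hypothetical sketch)."""
    parser = argparse.ArgumentParser(
        prog="get_cazyme_annotations",
        description="Summarise CAZyme annotations per species",
    )
    # Assumption: None stands for "fall back to STDIN" (inputs)
    # or "fall back to STDOUT" (output); the caller resolves it.
    parser.add_argument(
        "-d", "--dataframe", type=Path, default=None,
        help="path to input dataframe (default: STDIN)",
    )
    parser.add_argument(
        "-g", "--genbank", type=Path, default=None,
        help="directory containing GenBank files (default: STDIN)",
    )
    parser.add_argument(
        "-l", "--log", type=Path, default=None,
        help="write the log to this file (default: no log file)",
    )
    parser.add_argument(
        "-o", "--output", type=Path, default=None,
        help="output directory (default: STDOUT)",
    )
    # argv=None makes argparse read sys.argv[1:]
    return parser.parse_args(argv)
```

Passing an explicit list, for example parse_cmd_args(["-d", "genomes.csv"]), makes the parser easy to exercise in unit tests without touching sys.argv.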

def main():

Build parser
Build logger
Retrieve input data-frame
Create data-frame

def create_annotated_cazy_df()

Parse input data-frame, row-wise
Extract genus, species and taxonomy ID.
Iterate over accession numbers, retrieving GenBank file and protein data by calling get_protein_data
Create data-frame with a unique protein per row
Iterate over protein ID and pass to get_uniprot_data function, to retrieve UniProt data for protein
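
The row-wise steps above can be sketched as below. To keep the example free of third-party dependencies it works on plain dictionaries (for instance the output of df.to_dict("records")) and returns a list that pd.DataFrame can consume. The name create_annotated_cazy_rows, the comma-separated accession format, and the de-duplication key are all assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Row = Dict[str, str]


def create_annotated_cazy_rows(
    input_rows: Iterable[Row],
    get_protein_data: Callable[[str], List[Row]],
) -> List[Row]:
    """Return one row per unique protein, tagged with its source species."""
    output_rows: List[Row] = []
    seen: Set[Tuple[str, str]] = set()
    for row in input_rows:
        genus = row["genus"]
        species = row["species"]
        tax_id = row["taxonomy ID"]
        # Assumption: accession numbers are stored as one
        # comma-separated string per species
        accessions = [
            acc.strip()
            for acc in row["accession number"].split(",")
            if acc.strip()
        ]
        for accession in accessions:
            for protein in get_protein_data(accession):
                # Assumption: a protein is "unique" per taxonomy ID
                # and protein_id
                key = (tax_id, protein["protein_id"])
                if key in seen:
                    continue
                seen.add(key)
                output_rows.append({
                    "genus": genus,
                    "species": species,
                    "taxonomy ID": tax_id,
                    "accession number": accession,
                    **protein,
                })
    return output_rows
```

Injecting get_protein_data as a parameter keeps the GenBank parsing separate, so the collation logic can be tested with a stub before the parser exists.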

def get_protein_data()
Use accession number to open/read respective GenBank file.
Retrieve protein_id, protein_name, locus_tag.
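
A possible shape for the extraction step is sketched below. It accepts any iterable of feature objects exposing .type and .qualifiers, which is how Biopython presents the features of each record yielded by SeqIO.parse(handle, "genbank"); gzipped files could be opened first with gzip.open(path, "rt"). The helper names, the use of the 'product' qualifier as the protein name, and the 'NA' placeholder are assumptions.

```python
from typing import Any, Dict, Iterable, List


def first_qualifier(qualifiers: Dict[str, List[str]], key: str) -> str:
    """Return the first value stored under key, or 'NA' if absent."""
    values = qualifiers.get(key)
    return values[0] if values else "NA"


def extract_protein_data(features: Iterable[Any]) -> List[Dict[str, str]]:
    """Collect protein_id, protein_name and locus_tag from CDS features."""
    proteins: List[Dict[str, str]] = []
    for feature in features:
        # Only coding sequences carry protein annotations
        if feature.type != "CDS":
            continue
        qualifiers = feature.qualifiers
        proteins.append({
            "protein_id": first_qualifier(qualifiers, "protein_id"),
            # Assumption: the 'product' qualifier holds the protein name
            "protein_name": first_qualifier(qualifiers, "product"),
            "locus_tag": first_qualifier(qualifiers, "locus_tag"),
        })
    return proteins
```

Recording 'NA' rather than dropping a protein keeps sparsely annotated assemblies visible in the final data frame.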

def get_uniprot_data()
Use bioservices to query UniProt, searching by protein_id and/or locus_tag together with the scientific name of the host species.
Retrieve UniProt entry ID, protein names, EC number, and GO classified function.
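
The query and parsing steps might look like the sketch below. The search callable is injected so that no network access is needed; in practice it could be a thin wrapper around the bioservices UniProt search returning tab-separated text, but that wrapper, the organism_name query field, and all function names here are assumptions that should be checked against the installed bioservices version and the current UniProt query syntax.

```python
import csv
import io
from typing import Callable, Dict, List, Optional


def build_uniprot_query(
    organism: str,
    protein_id: Optional[str] = None,
    locus_tag: Optional[str] = None,
) -> str:
    """Combine the available identifiers with the host species name."""
    identifiers = [term for term in (protein_id, locus_tag) if term]
    if not identifiers:
        raise ValueError("protein_id and/or locus_tag is required")
    id_clause = " OR ".join(identifiers)
    # Assumption: 'organism_name' is the field used to restrict by species
    return f'({id_clause}) AND organism_name:"{organism}"'


def parse_uniprot_tsv(text: str) -> List[Dict[str, str]]:
    """Turn tab-separated search results into one dict per entry."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [dict(row) for row in reader]


def get_uniprot_rows(
    search: Callable[[str], str],
    organism: str,
    protein_id: Optional[str] = None,
    locus_tag: Optional[str] = None,
) -> List[Dict[str, str]]:
    """Run the injected search and parse whatever columns it returns."""
    query = build_uniprot_query(organism, protein_id, locus_tag)
    return parse_uniprot_tsv(search(query))
```

Which columns come back (entry ID, protein names, EC number, GO function) depends on how the search wrapper is configured, so the parser deliberately keeps every column it is given.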

CAZyme identification

Not every CAZyme has a UniProt entry that is linked to a CAZy entry. For example, some proteins are annotated with the function 'endoglucanase', but their respective UniProt entries are not linked to a CAZy family. Therefore, it is imperative to define a list of criteria for identifying CAZymes.

Planning CAZyme identification criteria

UniProt entry annotated with CAZy family
GenBank annotated function indicating CAZyme function
EC Number
GO recorded function
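
One way to apply the four planned criteria is to record which of them a protein satisfies, as in the sketch below. The dictionary keys, keyword lists, and EC prefixes are illustrative assumptions only; the actual criteria list is still to be defined.

```python
import re
from typing import Dict, List

# CAZy class prefixes followed by a family number, e.g. GH5 or CBM1
CAZY_FAMILY = re.compile(r"\b(GH|GT|PL|CE|AA|CBM)\d+")
# Illustrative keywords only, not an agreed list
FUNCTION_KEYWORDS = ("glucanase", "cellulase", "xylanase", "chitinase")
# Illustrative EC prefixes: glycosidases, glycosyltransferases,
# polysaccharide lyases
EC_PREFIXES = ("3.2.1.", "2.4.", "4.2.2.")
GO_KEYWORDS = (
    "carbohydrate metabolic process",
    "polysaccharide catabolic process",
)


def matched_criteria(protein: Dict[str, str]) -> List[str]:
    """Return the names of the criteria this protein record meets."""
    hits: List[str] = []
    if CAZY_FAMILY.search(protein.get("cazy_family", "")):
        hits.append("cazy_family")
    name = protein.get("protein_name", "").lower()
    if any(keyword in name for keyword in FUNCTION_KEYWORDS):
        hits.append("genbank_function")
    # Assumption: multiple EC numbers are separated by ';'
    ec_numbers = protein.get("ec_number", "").split(";")
    if any(ec.strip().startswith(EC_PREFIXES) for ec in ec_numbers):
        hits.append("ec_number")
    go_terms = protein.get("go_function", "").lower()
    if any(keyword in go_terms for keyword in GO_KEYWORDS):
        hits.append("go_function")
    return hits
```

Returning the list of matched criteria, rather than a single yes/no, leaves room to weight the criteria differently once the identification rules are settled.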

These pages contain the initial plans and development notes for pyrewton
