Planning 'get_cazyme_annotations'

Emma Hobbs edited this page Jun 17, 2020 · 1 revision

Plan and Draft for 'get_cazyme_annotations' sub-module

Script aim

To generate a bar chart summarising the number of CAZymes annotated within all GenBank files for a given species. Specifically, the script creates a single data frame and a summary graphical representation, collating data for a species that possesses multiple genomic assemblies. This is because even if one genomic assembly is sparsely annotated, another assembly may be highly annotated, and together they may provide many potential CAZyme engineering candidates.

Script plan and structure

Inputs

  • Data-frame containing at least 'genus', 'species', 'taxonomy ID', and 'accession number' columns (can be generated using get_ncbi_genomes).
  • Path to directory containing GenBank files

Imports
import argparse
import gzip
import io
import logging
import re
import shutil
import sys

from pathlib import Path
from typing import List, Optional

import pandas as pd
import seaborn as sns

from Bio import SeqIO
from bioservices import UniProt
from tqdm import tqdm

from pyrewton.directory_handling.output_dir_handling_main import make_output_directory
from pyrewton.directory_handling import input_dir_get_cazyme_annotations
from pyrewton.loggers.logger_pyrewton_main import build_logger
from pyrewton.parsers.parser_get_cazyme_annotations import build_parser

Add functions for retrieving input data frame, creating output directory, and retrieving GenBank files to the directory_handling module

CMD-line args
-d, --dataframe - optional, default dataframe taken from STDIN, if invoked will specify path to input dataframe.
-g, --genbank - optional, default input taken from STDIN, if invoked will specify directory containing GenBank files.
-l, --log - optional, default None, if invoked will cause the log to be written out to a file.
-o, --output - optional, default output sent to STDOUT, if invoked will specify output directory.
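The four options above could be wired up with argparse roughly as follows. This is a minimal sketch, not the module's actual parser: the help strings are paraphrased from the list, and using None as the sentinel for "fall back to STDIN/STDOUT" is an assumption made for illustration.

```python
import argparse
from pathlib import Path
from typing import List, Optional


def build_parser(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the planned command-line options and return the namespace."""
    parser = argparse.ArgumentParser(
        prog="get_cazyme_annotations",
        description="Summarise CAZyme annotations across GenBank files",
    )
    # Assumption: None means "use STDIN" (inputs) or "use STDOUT" (output).
    parser.add_argument(
        "-d", "--dataframe", type=Path, default=None,
        help="Path to input dataframe (default: STDIN)",
    )
    parser.add_argument(
        "-g", "--genbank", type=Path, default=None,
        help="Directory containing GenBank files (default: STDIN)",
    )
    parser.add_argument(
        "-l", "--log", type=Path, default=None,
        help="Write the log to this file (default: no log file)",
    )
    parser.add_argument(
        "-o", "--output", type=Path, default=None,
        help="Output directory (default: STDOUT)",
    )
    return parser.parse_args(argv)
```

Passing argv explicitly, as in `build_parser(["-d", "df.csv"])`, keeps the parser easy to exercise in unit tests; calling it with no argument falls back to the real command line.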

def main():

  1. Build parser
  2. Build logger
  3. Retrieve input data-frame
  4. Create data-frame

def create_annotated_cazy_df()

  1. Parse input data-frame, row-wise
  2. Extract genus, species and taxonomy ID.
  3. Iterate over accession numbers, retrieving GenBank file and protein data by calling get_protein_data
  4. Create data-frame with a unique protein per row
  5. Iterate over protein ID and pass to get_uniprot_data function, to retrieve UniProt data for protein
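Steps 1 to 3 above amount to expanding each species row into one record per accession number. The sketch below shows one way to do that on plain dictionaries. The helper name iter_accessions is hypothetical, and it assumes that multiple accession numbers are stored comma-separated in a single cell, which the plan does not state.

```python
from typing import Dict, Iterable, Iterator


def iter_accessions(rows: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Yield one record per (species, accession number) pair."""
    for row in rows:
        # Assumption: several accession numbers share one cell, comma-separated.
        for accession in row["accession number"].split(","):
            accession = accession.strip()
            if not accession:
                # Skip empty entries left by trailing or doubled commas.
                continue
            yield {
                "genus": row["genus"],
                "species": row["species"],
                "taxonomy ID": row["taxonomy ID"],
                "accession number": accession,
            }
```

With pandas, `df.to_dict("records")` produces row dictionaries of this shape, and the yielded records can be collected back into a data frame with `pd.DataFrame(list(...))`.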

def get_protein_data()
Use accession number to open/read respective GenBank file.
Retrieve protein_id, protein_name, locus_tag.
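A possible shape for the extraction step is sketched below. It assumes the records behave like Biopython SeqRecord objects, that is, each has a features list whose items carry a type string and a qualifiers mapping of lists. Taking protein_name from the "product" qualifier and using "NA" for missing values are assumptions for illustration, not decisions recorded in this plan.

```python
from typing import Any, Dict, Iterable, List


def get_protein_data(records: Iterable[Any]) -> List[Dict[str, str]]:
    """Collect protein_id, protein_name and locus_tag from CDS features."""
    proteins = []
    for record in records:
        for feature in record.features:
            if feature.type != "CDS":
                continue
            qualifiers = feature.qualifiers
            # Some CDS features (e.g. pseudogenes) carry no protein_id.
            if "protein_id" not in qualifiers:
                continue
            proteins.append(
                {
                    "protein_id": qualifiers["protein_id"][0],
                    # Assumption: the "product" qualifier holds the protein name.
                    "protein_name": qualifiers.get("product", ["NA"])[0],
                    "locus_tag": qualifiers.get("locus_tag", ["NA"])[0],
                }
            )
    return proteins
```

With Biopython, `SeqIO.parse(handle, "genbank")` yields records that fit this function; for gzipped GenBank files the handle can be opened with `gzip.open(path, "rt")`.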

def get_uniprot_data()
Use bioservices to query UniProt, searching via protein_id and/or locus_tag, and the scientific name of the host species.
Retrieve UniProt entry ID, protein names, EC number, and GO classified function.

CAZyme identification

Not every CAZyme has a UniProt entry linked to a CAZy entry; a protein without such a link may still be a CAZyme. For example, some proteins are annotated with the function 'endoglucanase' but their respective UniProt entries are not linked to a CAZy family. Therefore, it is imperative to create a list of criteria for identifying CAZymes.

Planning CAZyme identification criteria

  • UniProt entry annotated with CAZy family
  • GenBank annotated function indicating CAZyme function
  • EC Number
  • GO recorded function
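The four criteria could be combined into a single check that reports which kinds of evidence support calling a protein a CAZyme. The sketch below is only a starting point: the function name cazyme_evidence, the keyword lists, and the EC prefixes (3.2.1 glycosidases, 2.4 glycosyltransferases, 4.2.2 polysaccharide lyases) are illustrative assumptions, not a curated or agreed set of criteria.

```python
from typing import List, Optional, Sequence

# Illustrative assumptions only; a real criteria list needs curation.
FUNCTION_KEYWORDS = ("glucanase", "cellulase", "xylanase", "glycoside hydrolase",
                     "glycosyltransferase", "polysaccharide lyase")
GO_KEYWORDS = ("glycosyl", "polysaccharide", "carbohydrate")
EC_PREFIXES = ("3.2.1.", "2.4.", "4.2.2.")


def cazyme_evidence(
    cazy_family: Optional[str] = None,
    genbank_function: str = "",
    ec_numbers: Sequence[str] = (),
    go_functions: Sequence[str] = (),
) -> List[str]:
    """Return the names of the criteria that a protein satisfies."""
    evidence = []
    if cazy_family:
        evidence.append("cazy_family")
    function = genbank_function.lower()
    if any(keyword in function for keyword in FUNCTION_KEYWORDS):
        evidence.append("genbank_function")
    # str.startswith accepts a tuple of prefixes.
    if any(ec.startswith(EC_PREFIXES) for ec in ec_numbers):
        evidence.append("ec_number")
    if any(keyword in go.lower() for go in go_functions for keyword in GO_KEYWORDS):
        evidence.append("go_function")
    return evidence
```

Returning the list of matched criteria, rather than a single boolean, leaves the decision of how much evidence is enough to a later step, for example requiring a CAZy family link or at least two of the weaker signals.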