Klebsiella genome metadata scheme, plus guidance, examples and submission template.
This is a community-driven data curation effort to facilitate the use and reuse of public genome collections for maximum knowledge gain. These efforts are focussed on Klebsiella pneumoniae and closely related organisms in the K. pneumoniae Species Complex (KpSC) and are coordinated by the KlebNET-GSP project team. The data will be collated and made publicly available in this repository and the PathogenWatch website, which hosts public KpSC genome collections and reports associated genotypes.
Our goal is to collect information that has broad utility for research focused on KpSC, and that can be readily harmonised for easy and effective reuse. We aim to capture information that is not currently well represented in the public data repositories. Notably, the National Center for Biotechnology Information (NCBI) already allows submission of detailed Antimicrobial Susceptibility Testing (AST) information that is directly applicable to the KpSC, and AST data is therefore excluded from our data curation effort. If you have generated and are able to share AST data for KpSC isolates please consider submitting to NCBI.
Our scheme includes data fields divided into two sections:
1. Isolate metadata fields capture information about the individual KpSC genomes and their associated isolates, as well as the sample sources and/or hosts from which the isolates were collected.
2. Sampling fields capture information about how and why isolates were collected and/or chosen for sequencing. These data are essential to understand the underlying biases in genome collections, and to make decisions about the inclusion or exclusion of isolates for comparative and aggregate analyses.
The submission template is available here. Detailed instructions and guidance for data submission can be found below.
1. Data submission
2. Isolate metadata fields
3. Sampling fields
i. Term definitions for 'purpose of sampling'
ii. Examples of how to describe study designs using the sampling fields
4. Queries and suggestions
5. License
The data submission template is available here. Please MAKE A COPY before inputting your own data. You cannot enter data directly into the master copy of the template. Once completed, email or share your copy to [email protected].
The full list of data fields, value formats and options are shown in the tables below.
Some fields have restricted vocabularies and/or require selection from a list of predefined data values. In most cases the list of possible values can be accessed and searched via a drop-down list within the submission template (also shown in the tables below, marked 'Choose from list'), and only values matching those in the list will be accepted. In a minority of cases, however, the set of possible values is derived from an established ontology that is too large to include within the submission template. These fields are marked 'Controlled vocabulary', with a link to the appropriate ontology, e.g. the NCBI taxonomy database or the MeSH disease ontology.
In some cases it is desirable to have a restricted vocabulary to support data harmonisation, but there are no appropriate predefined ontologies and too many foreseeable options to create a definitive list. In these cases, we provide a list of suggested values that we expect to capture the vast majority of scenarios, but also provide the option to enter alternative values via free text. These fields are marked in the tables below as 'Choose common values from the list, or if none are appropriate, enter free text'. The submission template includes a drop-down list of the suggested values, but will allow other values to be entered (these free text entries will be marked with warnings).
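The difference between a closed list and a suggested list can be sketched in a few lines of Python. This is a minimal illustration only, not the validation logic of the submission template: the names `check_value`, `Result`, `CLOSED` and `SUGGESTED` are hypothetical, and fields tied to an external ontology ('Controlled vocabulary') are not covered.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical labels for two of the field behaviours described above.
CLOSED = "closed"        # 'Choose from list': only listed values are accepted
SUGGESTED = "suggested"  # listed values preferred; free text allowed with a warning


@dataclass
class Result:
    ok: bool
    warning: Optional[str] = None


def check_value(value: str, allowed: list[str], kind: str) -> Result:
    """Check one cell against a list of allowed or suggested values."""
    if value == "":
        # Whether a blank is acceptable depends on the field's status,
        # which this sketch does not model.
        return Result(ok=True)
    if value in allowed:
        return Result(ok=True)
    if kind == SUGGESTED:
        # Accepted, but flagged, mirroring the template's warning on free text.
        return Result(ok=True, warning=f"free-text value: {value!r}")
    return Result(ok=False, warning=f"{value!r} is not in the list")


# Value lists copied from the tables below (the tissue list is a subset).
INFECTION = ["Infection", "Colonisation", "Unknown"]
TISSUE = ["Blood", "Urine", "Sputum"]

print(check_value("Colonisation", INFECTION, CLOSED))
print(check_value("Synovial fluid", TISSUE, SUGGESTED))
```

Matching here is exact and case-sensitive; a real tool might normalise case or whitespace first.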
These data describe individual genome sequences and the bacterial isolates from which they were derived. Please complete one row per sequence (i.e. one set of sequence read data and/or a de novo assembly).
Variable fields, and guidance for completing them, are shown in the table below.
For text fields, please DO NOT enter 'unknown' or 'missing' unless otherwise specified. Instead, leave the field blank if you do not have any data to input for that field.
Status | Variable | Definition; Guidance | Value format |
---|---|---|---|
REQUIRED if published | References | PubMed ID for associated publication reporting genome data; DOI is acceptable for preprints only. Multiple references can be provided as a list (comma-separated). If no associated publications, leave blank. | {text} |
RECOMMENDED; REQUIRED if no Assembly accession provided | Run accession | Sequence archive run accession (sequence read accession); SRRxxx, ERRxxx. If multiple sequences for the same ISOLATE, a list of accessions can be given (comma-separated). | {text} |
REQUIRED | Project accession | BioProject accession; PRJxxx. If multiple projects for the same ISOLATE, a list of accessions can be given (comma-separated). | {text} |
REQUIRED | Sample accession | BioSample accession; SAMxxx | {text} |
RECOMMENDED; REQUIRED if no Assembly accession provided | Experiment accession | Sequence archive experiment accession; SRXxxx, ERXxxx. If multiple experiments for the same ISOLATE, a list of accessions can be given here (comma-separated). | {text} |
optional | Secondary sample accession | NCBI Biosample; ERSxxx | {text} |
optional; REQUIRED if no Run accession provided | Assembly accession | GenBank assembly accession; GCA_xxx. The accession for the entire assembly, including chromosome and plasmids. | {text} |
optional | Secondary assembly accession | GenBank WGS master record accession | {text} |
REQUIRED | Genome source | Type of sequence from which this genome was derived; Indicate if the sequence represents a single cultured isolate whole genome sequence (WGS) or is derived from a mixed sequence / metagenome-assembled genome (MAG). Choose from the list. | Isolate WGS | MAG | Unknown |
REQUIRED | Isolate name | A name that you choose for the isolate. It can have any format, but we suggest that you make it concise, unique and consistent within your lab, and as informative as possible. Every Isolate name from a single Submitter must be unique. | {text} |
REQUIRED | Collection year | The year that the isolate was collected; YYYY | {int} |
REQUIRED | Collection month | The month that the isolate was collected; MM | {int} |
REQUIRED | Collection day | The day that the isolate was collected within the month specified in 'Collection month'; DD | {int} |
REQUIRED | Country | Country of isolate collection. Controlled vocabulary, choose from the list of values as defined in https://www.insdc.org/submitting-standards/geo_loc_name-qualifier-vocabulary/ | {term} |
REQUIRED | Isolate source | Short free text description of the sample source from which the Klebsiella was isolated. E.g. ‘human blood’ , ‘animal feed’ , ‘river water grab sample’. | {text} |
REQUIRED | Source type | Controlled vocabulary describing the source of the isolate. Choose from the list. Enables high level grouping of isolates. | Human | Animal | Food | Environmental | Other | Missing | Restricted access | Not applicable | Not collected | Not provided |
REQUIRED | Host | Scientific name of the host from which the isolate was collected. Controlled vocabulary as defined in https://www.ncbi.nlm.nih.gov/taxonomy. If not host-associated, specify 'not host-associated'. Ensure the source is appropriately described under 'Isolate source' and consider submitting detailed source information to NCBI via the One Health Enteric metadata template. | {term} |
RECOMMENDED unless lat_lon given | City or region | City or region of isolate collection. | {text} |
RECOMMENDED unless City or region given | lat_lon | The geographical coordinates of the location where the sample was collected. Specify as degrees latitude and longitude in the format "d[d.dddd] N|S d[dd.dddd] W|E", e.g. 38.98 N 77.11 W. | {float}{float} |
optional | Isolate alias | Other IDs associated with this isolate. Multiple IDs can be given (comma-separated). | {text} |
optional | Travel associated | For isolates collected from human hosts, indicate if associated with recent travel. Leave blank if travel status is unknown. | Travel associated | NOT travel associated |
optional | Travel country | If travel associated, indicate the travel country. This should be one of the countries listed here: https://www.insdc.org/submitting-standards/geo_loc_name-qualifier-vocabulary/. Leave blank if unknown. | {term} |
REQUIRED if host-associated | Host tissue sampled | Name of body site or specimen type from which the sample was obtained, such as a specific organ, tissue or clinical specimen. Choose common values from the list, or if none are appropriate, enter free text. | Blood | Cerebrospinal fluid (CSF) | Urine | Sputum | Bronchoalveolar lavage (BAL) | Other respiratory | Wound | Skin | Feces | Rectal swab | Throat swab | Cecal swab | {text} |
REQUIRED if host-associated | Infection | For host-associated isolates, indicate if infecting or colonising isolate, or if the infection status is unknown. Choose from list. | Infection | Colonisation | Unknown |
REQUIRED if Infection = 'Infection' | Host disease | For host-associated infecting isolates, provide the name of the relevant disease, e.g. Pneumonia, Bacteremia. Controlled vocabulary as defined in https://meshb.nlm.nih.gov/treeView. If unknown, leave blank. | {term} |
optional | Infection outcome | For host-associated and infecting isolates, indicate the broad infection outcome at 28 days post-infection. Choose from the list. | Death within 28 days | Alive at 28 days | Restricted access | Unknown |
optional | Infection severity | For host-associated infecting isolates, if severity information could be made available (upon request), indicate the type of information here. If none available or none can be shared with the community, leave blank. | {text} |
optional | Host age group | For human-associated isolates, indicate the age range of the host. Choose from the list. | 0-30 days | 1-12 months | 1-5 years | 5-18 years | 18-60 years | >60 years | Restricted access | Not collected | Not applicable | Missing |
optional | Host sex | For host-associated isolates, indicate the biological sex of the host. Choose from the list. | Male | Female | Restricted access | Not collected | Not applicable | Missing |
REQUIRED | Repeat isolate status | If this is the only ISOLATE sequenced for this host infection or colonisation episode, select 'Primary isolate.' If more than one ISOLATE is sequenced from the same host infection or colonisation episode, indicate whether this is the primary isolate or a repeat isolate. | Primary isolate | Repeat isolate |
REQUIRED if Repeat isolate status = 'Repeat isolate' | Primary isolate name | If other ISOLATES are sequenced from the same host infection or colonisation episode, and this entry is NOT the primary isolate in the series, provide the isolate name of the primary isolate. | {text} |
REQUIRED | Repeat sequence status | If this is the only sequence record for this isolate (READS and/or ASSEMBLY), select 'Primary sequence.' If multiple sequences are provided for the same isolate, indicate whether this is the primary sequence or the repeat sequence. | Primary sequence | Repeat sequence |
REQUIRED if Repeat sequence status = 'Repeat sequence' | Primary sequence name | If multiple sequences are provided for the same isolate, and this entry is NOT the primary sequence in the series, provide the READ or ASSEMBLY accession for the primary sequence; otherwise leave blank. | {text} |
REQUIRED | Collected by | Name of persons or institute who collected the sample. | {text} |
REQUIRED | Lab contact | Contact email address for the person providing the metadata. Note this information will only be made available to the KlebNET-GSP team. | {text} |
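Some of the conditional rules in the table above can be checked mechanically before submission. The sketch below is an illustration only, not an official validator. It assumes each row is held as a Python dict keyed by the variable names in the table. The function names and the regular expression are hypothetical, and the expression is one possible reading of the documented `lat_lon` layout.

```python
import re

# One reading of the documented layout "d[d.dddd] N|S d[dd.dddd] W|E".
LAT_LON = re.compile(r"^(\d{1,2}(?:\.\d+)?) ([NS]) (\d{1,3}(?:\.\d+)?) ([WE])$")


def parse_lat_lon(text):
    """Return (latitude, longitude) as signed floats, or None if malformed."""
    m = LAT_LON.match(text.strip())
    if m is None:
        return None
    lat = float(m.group(1)) * (1 if m.group(2) == "N" else -1)
    lon = float(m.group(3)) * (1 if m.group(4) == "E" else -1)
    if abs(lat) > 90 or abs(lon) > 180:
        return None
    return lat, lon


def check_isolate_row(row):
    """Return a list of problems for one row (dict of column name -> text)."""
    def cell(column):
        return (row.get(column) or "").strip()

    problems = []
    # Run and Experiment accessions become REQUIRED without an Assembly accession.
    if not cell("Assembly accession"):
        for column in ("Run accession", "Experiment accession"):
            if not cell(column):
                problems.append(
                    f"{column} is required when no Assembly accession is given"
                )
    # A repeat isolate must point back to its primary isolate.
    if cell("Repeat isolate status") == "Repeat isolate" and not cell("Primary isolate name"):
        problems.append("Primary isolate name is required for a repeat isolate")
    # lat_lon is optional, but if present it should follow the documented layout.
    if cell("lat_lon") and parse_lat_lon(cell("lat_lon")) is None:
        problems.append("lat_lon does not match the documented format")
    return problems
```

With the example coordinates from the table, `parse_lat_lon("38.98 N 77.11 W")` gives a positive latitude and a negative longitude, since W and S are mapped to negative values.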
These contextual data describe the purpose of sampling, and the sampling strategy for the collection from which each isolate is derived. Please complete one row per isolate.
Variable fields, and guidance for completing them, are summarised in the table below. Definitions and detailed examples are also shown below the table.
Status | Variable | Definition | Guidance | Value format |
---|---|---|---|---|
REQUIRED | purpose of sampling | Primary purpose for sampling bacterial isolates | Indicate the primary purpose for the collection and sequencing of these isolates (e.g. routine diagnostics, outbreak investigation, research). Choose from the list, or if none of the values are appropriate, provide the reason as free text. Definitions are shown below this table. | Routine diagnostics and / or infection control | Routine surveillance | Outbreak investigation / outbreak-initiated surveillance | Research | {text} |
REQUIRED | study population | Population from which bacterial isolates were sampled | Give details about the population of hosts or environments represented in the sample (e.g. Hospital patients, Neonates, Hospital wastewater). This information is essential to inform the inclusion and exclusion of studies for aggregate or comparative epidemiological analyses. Choose common values from the list, or if none of the values are appropriate, enter the information as free text. Multiple values can be specified (comma-separated). | Hospital patients | Intensive Care Unit (ICU) patients | Primary care patients | Community participants | Neonates | Clinical environment: sinks and drains | Clinical environment: surfaces | Medical devices | Hospital wastewater | Wastewater (not hospital) | Fresh water | Seawater | Soil | Rhizosphere | Plants | Livestock | Companion animals | Captive animals | Wild animals | Food | {text} |
REQUIRED | target epi | Broad epidemiological category of the study | Indicate the broad epidemiological category of the study (e.g. Host colonisation, Host infection, Environmental). This information is useful to inform aggregate or comparative analyses of disease-associated vs non-disease associated isolates. Choose from the list, or if none of the values are appropriate, enter the information as free text. | Host infection | Host colonisation | Environmental | Host infection & colonisation | Host infection, colonisation & environmental | {text} |
REQUIRED if target epi includes 'Host infection' | selected by clinical phenotype | Flag to indicate whether isolates were selected for inclusion on the basis of host clinical phenotype | Indicate whether isolates were selected for inclusion on the basis of host clinical phenotype (e.g. blood stream infection, liver abscess, severe infection) or if no selection was applied. Choose from the list. This information is essential to inform studies focussed on specific infection types or disease severity. E.g. to determine serotype distributions among invasive infection isolates or compare rates of drug resistance among blood stream infections. The specific phenotype used for selection can be indicated in the 'selected clinical phenotype' field. | Selected by clinical phenotype | NOT selected by clinical phenotype |
REQUIRED if selected by clinical phenotype = 'Selected by clinical phenotype' | selected clinical phenotype | Clinical phenotype used to select isolates for inclusion | Indicate the specific clinical phenotype that was used to select samples for collection and/or sequencing. Choose common values from the list, or if none of the values are appropriate, enter the information as free text. Multiple values can be specified (comma-separated). | Liver abscess | Invasive infection | Blood stream infection | Respiratory infection | Urinary tract infection | Hospital acquired infection | Community acquired infection | Severe disease | {text} |
REQUIRED | selected by organism trait | Flag to indicate whether isolates were selected for inclusion on the basis of microbial trait | Indicate if samples were selected for inclusion on the basis of a microbial phenotype or genotype (e.g. specific drug resistance or serotype, presence of a specific gene) or if no selection was applied. Choose from the list. This information is essential to inform studies aiming to estimate the prevalence of microbial phenotypes / genotypes by study populations, geographies, etc. For example, to estimate national prevalence of ceftriaxone or carbapenem resistant isolates. The specific phenotype or genotype used for selection can be indicated in the 'selected organism trait' field. | Selected by organism trait | NOT selected by organism trait |
REQUIRED if selected by organism trait = 'Selected by organism trait' | selected organism trait | Microbial trait used to select isolates for inclusion | Indicate the specific microbial phenotype or genotype that was used to select isolates for collection and/or sequencing. Choose common values from the list, or if none of the values are appropriate, enter the information as free text. Multiple values can be specified (comma-separated). | Ceftriaxone resistance | Carbapenem resistance | Drug resistance (not ceftriaxone or carbapenem) | ESBL producers | Carbapenemase producers | OXA positive | NDM positive | KPC positive | iuc (aerobactin) positive | iro (salmochelin) positive | rmpA positive | peg-344 positive | String-test positive | Hypermucoviscous by low-speed centrifugation | Hypermucoviscous by percoll-gradient sedimentation | 7-gene multi-locus sequence type | Serotype | {text} |
RECOMMENDED | sampling period start | Start date for the sampling period | Indicate when the sample collection began (YYYY, or YYYY-MM or YYYY-MM-DD). This information is useful for understanding the temporal coverage of data to inform trend analysis. | {ISO format} |
RECOMMENDED | sampling period end | End date for the sampling period | Indicate when the sample collection ended (YYYY, or YYYY-MM or YYYY-MM-DD). This information is useful for understanding the temporal coverage of data to inform trend analysis. If collection and sequencing are on-going, leave this field blank. | {ISO format} |
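The sampling period fields accept dates at three levels of precision. The sketch below shows one way to parse them and to confirm that the period runs forwards. The function names are hypothetical and the code is not part of the submission template. A blank end date is treated as 'ongoing', as described in the table.

```python
import re
from datetime import date

# YYYY, YYYY-MM or YYYY-MM-DD
PARTIAL_DATE = re.compile(r"^(\d{4})(?:-(\d{2}))?(?:-(\d{2}))?$")


def parse_partial_date(text):
    """Return (year, month, day); parts that were not given are None."""
    m = PARTIAL_DATE.match(text.strip())
    if m is None:
        raise ValueError(f"not YYYY, YYYY-MM or YYYY-MM-DD: {text!r}")
    year, month, day = (int(g) if g else None for g in m.groups())
    # Building a real date rejects impossible values such as month 13;
    # missing parts are filled with 1 for this check only.
    date(year, month or 1, day or 1)
    return year, month, day


def period_is_ordered(start, end):
    """True if start is not after end, compared at their shared precision."""
    if not end.strip():
        return True  # blank end date: collection is ongoing
    s = [p for p in parse_partial_date(start) if p is not None]
    e = [p for p in parse_partial_date(end) if p is not None]
    n = min(len(s), len(e))
    return s[:n] <= e[:n]
```

Comparing only the shared precision means that a start of `2019-05` and an end of `2019` are treated as consistent, since the year is all the two values have in common.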
Routine diagnostics and / or infection control: Samples collected through the routine and ongoing activities of clinical or veterinary microbiology laboratories for the purposes of clinical diagnosis and/or infection control. This may include isolates confirmed as infecting agents and/or those considered asymptomatic or environmental colonisers, e.g. isolates identified from hospital sinks or patient screening swabs as part of routine infection prevention and control procedures.
Routine surveillance: Samples collected through the routine and ongoing activities of other laboratories (not clinical or veterinary microbiology laboratories) and/or collected for purposes other than clinical diagnostics and infection control, e.g. laboratories processing samples from non-healthcare environmental sources or food products.
Outbreak investigation / outbreak-initiated surveillance: Samples collected as part of a response to a specific outbreak, e.g. within a hospital or other healthcare setting (human or veterinary). This may include isolates confirmed as infecting agents and/or those considered asymptomatic colonisers (e.g. from screening swabs) and/or those from environmental sources (e.g. hospital sinks, drains etc.).
Research: Samples collected for specific research purposes (excluding outbreak investigation / outbreak-initiated surveillance) that would not have otherwise been collected via routine diagnostics, infection control or surveillance activities as described above.
Below we describe various hypothetical study designs and show how the sampling fields would be populated for each.
K. pneumoniae were isolated from the blood of neonates via routine diagnostic procedures. All isolates collected between 01 Jan 2019 and 31 Dec 2020 were stocked and subjected to whole genome sequencing.
K. pneumoniae identified via routine diagnostic procedures from hospitalised patients in a tertiary care centre between February 2016 and February 2018 were collected. Isolates resistant to ceftriaxone were selected for sequencing.
In May 2019 there was a sudden increase in CPE infections in the ICU of a large tertiary care centre. Enhanced infection prevention and control procedures were activated from 18 May 2019 until 31 August 2019 when the outbreak was declared contained. Rectal screening swabs were collected on patient admission and every three days thereafter, in addition to sink and drain screening swabs. All swabs were cultured on selective media and presumptive carbapenem-resistant K. pneumoniae were sequenced alongside all carbapenem-resistant K. pneumoniae identified from ICU patients via routine diagnostics procedures.
Carbapenem-resistant K. pneumoniae were isolated from liver abscess patients as part of a research study focussed on diabetic patients, between 01 June 2018 and 30 June 2020. Strains carrying K. pneumoniae carbapenemase genes were detected by PCR, and a string test was used to determine hypermucoidy. String test positive isolates harbouring blaKPC were subjected to whole genome sequencing.
Veterinary researchers collected 100 faecal samples from each of six pig farms in June 2017. K. pneumoniae were isolated by culture on SCAI media and subjected to whole genome sequencing as part of a One Health research project.
(Note that the specific hosts, i.e., pigs, should be indicated in the isolate metadata field 'Host', rather than in the sampling fields.)
K. pneumoniae were isolated from fresh and wastewaters in a metropolitan centre as part of routine water surveillance conducted by the Environmental Protection Authority. Since 2021, all isolates have been stocked and 100 isolates have been randomly selected for sequencing each year. Sampling and sequencing are ongoing.
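As a rough illustration, two of the designs above (the neonatal blood isolates and the routine water surveillance) might be encoded as follows. The field values are one reading of those descriptions, chosen from the value lists in the sampling fields table, and are not an authoritative mapping. The helper `to_csv` is hypothetical and simply writes the records in column order.

```python
import csv
import io

# Column names taken from the sampling fields table.
FIELDS = [
    "purpose of sampling", "study population", "target epi",
    "selected by clinical phenotype", "selected clinical phenotype",
    "selected by organism trait", "selected organism trait",
    "sampling period start", "sampling period end",
]

# Assumed encoding of the neonatal blood isolate design.
neonatal_blood = {
    "purpose of sampling": "Routine diagnostics and / or infection control",
    "study population": "Neonates",
    "target epi": "Host infection",
    "selected by clinical phenotype": "Selected by clinical phenotype",
    "selected clinical phenotype": "Blood stream infection",
    "selected by organism trait": "NOT selected by organism trait",
    "sampling period start": "2019-01-01",
    "sampling period end": "2020-12-31",
}

# Assumed encoding of the water surveillance design; the end date is
# omitted because sampling is ongoing.
water_surveillance = {
    "purpose of sampling": "Routine surveillance",
    "study population": "Fresh water, Wastewater (not hospital)",
    "target epi": "Environmental",
    "selected by organism trait": "NOT selected by organism trait",
    "sampling period start": "2021",
}


def to_csv(rows):
    """Write records as CSV text; fields missing from a record stay blank."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


print(to_csv([neonatal_blood, water_surveillance]))
```

Fields that do not apply, such as the clinical phenotype fields for an environmental study, are left blank rather than filled with 'unknown', in line with the guidance for text fields.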
We welcome queries and suggestions from the community on any aspect of this scheme. In particular, please notify us if you think we have missed key data fields or options, or if the guidance is unclear. You can contact us via the issue tracker.
These resources are freely available for reuse and adaptation under GNU general public license v3. We encourage the development of similar schemes for other organisms.