Klebsiella-genome-metadata

Klebsiella genome metadata scheme, plus guidance, examples and submission template.

This is a community-driven data curation effort to facilitate the use and reuse of public genome collections for maximum knowledge gain. These efforts are focussed on Klebsiella pneumoniae and closely related organisms in the K. pneumoniae Species Complex (KpSC) and are coordinated by the KlebNET-GSP project team. The data will be collated and made publicly available in this repository and the PathogenWatch website, which hosts public KpSC genome collections and reports associated genotypes.

Our goal is to collect information that has broad utility for research focused on KpSC, and that can be readily harmonised for easy and effective reuse. We aim to capture information that is not currently well represented in the public data repositories. Notably, the National Center for Biotechnology Information (NCBI) already allows submission of detailed Antimicrobial Susceptibility Testing (AST) information that is directly applicable to the KpSC, and AST data is therefore excluded from our data curation effort. If you have generated and are able to share AST data for KpSC isolates please consider submitting to NCBI.

Our scheme includes data fields divided into two sections:

1. Isolate metadata fields capture information about the individual KpSC genomes and their associated isolates, as well as the sample sources and/or hosts from which the isolates were collected.

2. Sampling fields capture information about how and why isolates were collected and/or chosen for sequencing. These data are essential to understand the underlying biases in genome collections, and to make decisions about the inclusion or exclusion of isolates for comparative and aggregate analyses.

The submission template is available here. Detailed instructions and guidance for data submission can be found below.

Data submission

The data submission template is available here. Please MAKE A COPY before inputting your own data. You cannot enter data directly into the master copy of the template. Once completed, email or share your copy to klebsiella.genome.metadata@gmail.com.

The full list of data fields, value formats and options is shown in the tables below.

Fields with restricted vocabularies

Some fields have restricted vocabularies and/or require selection from a list of predefined data values. In most cases the list of possible values can be accessed and searched via a drop-down list within the submission template (also shown in the tables below, marked 'Choose from list') and only values matching those in the list will be accepted. However, in a minority of cases the possible set of values is derived from an established ontology that is too large for inclusion within the submission template. These fields are marked 'Controlled vocabulary', with a link to the appropriate ontology, e.g. the NCBI taxonomy database or the MeSH disease ontology.

Fields with a list of suggested values

In some cases it is desirable to have a restricted vocabulary to support data harmonisation, but there are no appropriate predefined ontologies and too many foreseeable options to create a definitive list. In these cases, we provide a list of suggested values that we expect to capture the vast majority of scenarios, but also provide the option to enter alternative values via free text. These fields are marked in the tables below as 'Choose common values from the list, or if none are appropriate, enter free text'. The submission template includes a drop-down list of the suggested values, but will allow other values to be entered (these free text entries will be marked with warnings).
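The difference between the two kinds of field can be sketched in a few lines of Python. This is only an illustration of the behaviour described above, not the logic used by the submission template: the names `STRICT`, `SUGGESTED` and `check_value` are hypothetical, and the vocabularies are copied from the 'Infection' and 'Host tissue sampled' rows of the isolate metadata table.

```python
# Hypothetical helper: mimics 'Choose from list' fields (strict) and
# 'Choose common values from the list, or ... enter free text' fields (suggested).
STRICT = {
    "Infection": {"Infection", "Colonisation", "Unknown"},
}
SUGGESTED = {
    "Host tissue sampled": {
        "Blood", "Cerebrospinal fluid (CSF)", "Urine", "Sputum",
        "Bronchoalveolar lavage (BAL)", "Other respiratory", "Wound", "Skin",
        "Feces", "Rectal swab", "Throat swab", "Cecal swab",
    },
}


def check_value(field, value):
    """Return 'ok', 'warning' or 'error' for a single cell."""
    value = value.strip()
    if not value:
        return "ok"  # blank cells are left to the required-field checks
    if field in STRICT:
        # Restricted vocabulary: anything outside the list is rejected.
        return "ok" if value in STRICT[field] else "error"
    if field in SUGGESTED:
        # Suggested values: free text is allowed but flagged for review.
        return "ok" if value in SUGGESTED[field] else "warning"
    return "ok"  # plain free-text field


print(check_value("Infection", "Carriage"))
print(check_value("Host tissue sampled", "Nasal swab"))
```

Under these assumptions, an unlisted value is an error for a restricted field but only a warning for a field with suggested values.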

Isolate metadata

These data describe individual genome sequences and the bacterial isolates from which they were derived. Please complete one row per sequence (i.e. one set of sequence read data and/or a de novo assembly).

Variable fields, and guidance for completing them, are shown in the table below.

For text fields, please DO NOT enter 'unknown' or 'missing' unless otherwise specified. Instead, leave the field blank if you do not have any data to input for that field.

Status	Variable	Definition; Guidance	Value format
REQUIRED if published	References	PubMed ID for associated publication reporting genome data; DOI is acceptable for preprints only. Multiple references can be provided as a list (comma-separated). If no associated publications, leave blank.	{text}
RECOMMENDED; REQUIRED if no Assembly accession provided	Run accession	Sequence archive run accession (sequence read accession); SRRxxx, ERRxxx. If multiple sequences for the same ISOLATE, a list of accessions can be given (comma-separated).	{text}
REQUIRED	Project accession	BioProject accession; PRJxxx. If multiple projects for the same ISOLATE, a list of accessions can be given (comma-separated).	{text}
REQUIRED	Sample accession	BioSample accession; SAMxxx	{text}
RECOMMENDED; REQUIRED if no Assembly accession provided	Experiment accession	Sequence archive experiment accession; SRXxxx, ERXxxx. If multiple experiments for the same ISOLATE, a list of accessions can be given here (comma-separated).	{text}
optional	Secondary sample accession	NCBI Biosample; ERSxxx	{text}
optional; REQUIRED if no Run accession provided	Assembly accession	GenBank assembly accession; GCA_xxx. The accession for the entire assembly, including chromosome and plasmids.	{text}
optional	Secondary assembly accession	GenBank WGS master record accession	{text}
REQUIRED	Genome source	Type of sequence from which this genome was derived; Indicate if the sequence represents a single cultured isolate whole genome sequence (WGS) or is derived from a mixed sequence / metagenome-assembled genome (MAG). Choose from the list.	Isolate WGS \| MAG \| Unknown
REQUIRED	Isolate name	A name that you choose for the isolate. It can have any format, but we suggest that you make it concise, unique and consistent within your lab, and as informative as possible. Every Isolate name from a single Submitter must be unique.	{text}
REQUIRED	Collection year	The year that the isolate was collected; YYYY	{int}
REQUIRED	Collection month	The month that the isolate was collected; MM	{int}
REQUIRED	Collection day	The day that the isolate was collected within the month specified in 'Collection month'; DD	{int}
REQUIRED	Country	Country of isolate collection. Controlled vocabulary, choose from the list of values as defined in https://www.insdc.org/submitting-standards/geo_loc_name-qualifier-vocabulary/	{term}
REQUIRED	Isolate source	Short free text description of the sample source from which the Klebsiella was isolated. E.g. 'human blood', 'animal feed', 'river water grab sample'.	{text}
REQUIRED	Source type	Controlled vocabulary describing the source of the isolate. Choose from the list. Enables high level grouping of isolates.	Human \| Animal \| Food \| Environmental \| Other \| Missing \| Restricted access \| Not applicable \| Not collected \| Not provided
REQUIRED	Host	Scientific name of the host from which the isolate was collected. Controlled vocabulary as defined in https://www.ncbi.nlm.nih.gov/taxonomy. If not host-associated, specify 'not host-associated'. Ensure the source is appropriately described under 'Isolate source' and consider submitting detailed source information to NCBI via the One Health Enteric metadata template.	{term}
RECOMMENDED unless lat_lon given	City or region	City or region of isolate collection.	{text}
RECOMMENDED unless City or region given	lat_lon	The geographical coordinates of the location where the sample was collected. Specify as degrees latitude and longitude in the format "d[d.dddd] N\|S d[dd.dddd] W\|E", e.g. 38.98 N 77.11 W.	{float}{float}
optional	Isolate alias	Other IDs associated with this isolate. Multiple IDs can be given (comma-separated).	{text}
optional	Travel associated	For isolates collected from human hosts, indicate if associated with recent travel. Leave blank if travel status is unknown.	Travel associated \| NOT travel associated
optional	Travel country	If travel associated, indicate the travel country. This should be one of the countries listed here: https://www.insdc.org/submitting-standards/geo_loc_name-qualifier-vocabulary/. Leave blank if unknown.	{term}
REQUIRED if host-associated	Host tissue sampled	Name of body site or specimen type from which the sample was obtained, such as a specific organ, tissue or clinical specimen. Choose common values from the list, or if none are appropriate, enter free text.	Blood \| Cerebrospinal fluid (CSF) \| Urine \| Sputum \| Bronchoalveolar lavage (BAL) \| Other respiratory \| Wound \| Skin \| Feces \| Rectal swab \| Throat swab \| Cecal swab \| {text}
REQUIRED if host-associated	Infection	For host-associated isolates, indicate if infecting or colonising isolate, or if the infection status is unknown. Choose from list.	Infection \| Colonisation \| Unknown
REQUIRED if Infection = 'Infection'	Host disease	For host-associated infecting isolates, provide the name of the relevant disease, e.g. Pneumonia, Bacteremia. Controlled vocabulary as defined in https://meshb.nlm.nih.gov/treeView. If unknown, leave blank.	{term}
optional	Infection outcome	For host-associated and infecting isolates, indicate the broad infection outcome at 28 days post-infection. Choose from the list.	Death within 28 days \| Alive at 28 days \| Restricted access \| Unknown
optional	Infection severity	For host-associated infecting isolates, if severity information could be made available (upon request), indicate the type of information here. If none available or none can be shared with the community, leave blank.	{text}
optional	Host age group	For human-associated isolates, indicate the age range of the host. Choose from the list.	0-30 days \| 1-12 months \| 1-5 years \| 5-18 years \| 18-60 years \| >60 years \| Restricted access \| Not collected \| Not applicable \| Missing
optional	Host sex	For host-associated isolates, indicate the biological sex of the host. Choose from the list.	Male \| Female \| Restricted access \| Not collected \| Not applicable \| Missing
REQUIRED	Repeat isolate status	If this is the only ISOLATE sequenced for this host infection or colonisation episode, select 'Primary isolate.' If more than one ISOLATE is sequenced from the same host infection or colonisation episode, indicate whether this is the primary isolate or a repeat isolate.	Primary isolate \| Repeat isolate
REQUIRED if Repeat isolate status = 'Repeat isolate'	Primary isolate name	If other ISOLATES are sequenced from the same host infection or colonisation episode, and this entry is NOT the primary isolate in the series, provide the isolate name of the primary isolate.	{text}
REQUIRED	Repeat sequence status	If this is the only sequence record for this isolate (READS and/or ASSEMBLY), select 'Primary sequence.' If multiple sequences are provided for the same isolate, indicate whether this is the primary sequence or the repeat sequence.	Primary sequence \| Repeat sequence
REQUIRED if Repeat sequence status = 'Repeat sequence'	Primary sequence name	If multiple sequences are provided for the same isolate, and this entry is NOT the primary sequence in the series, provide the READ or ASSEMBLY accession for the primary sequence, otherwise leave blank.	{text}
REQUIRED	Collected by	Name of persons or institute who collected the sample.	{text}
REQUIRED	Lab contact	Contact email address for the person providing the metadata. Note this information will only be made available to the KlebNET-GSP team.	{text}
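
Before submitting, it may help to run a quick format check over the accession and lat_lon columns. The sketch below is a minimal example under stated assumptions: the prefixes (SRR/ERR, PRJ, SAM, GCA_) come from the table above, but the exact regular expressions, and the names `ACCESSION_PATTERNS`, `bad_accessions`, `has_sequence_accession` and `parse_lat_lon`, are our own and are not part of the scheme.

```python
import re

# Assumed patterns: only the prefixes are taken from the table; the rest is a guess.
ACCESSION_PATTERNS = {
    "Run accession": re.compile(r"^[SE]RR\d+$"),
    "Project accession": re.compile(r"^PRJ[A-Z]{2}\d+$"),
    "Sample accession": re.compile(r"^SAM[A-Z]{1,2}\d+$"),
    "Assembly accession": re.compile(r"^GCA_\d+(?:\.\d+)?$"),
}

# "d[d.dddd] N|S d[dd.dddd] W|E", e.g. "38.98 N 77.11 W"
LAT_LON = re.compile(r"^(\d{1,2}(?:\.\d+)?) ([NS]) (\d{1,3}(?:\.\d+)?) ([WE])$")


def bad_accessions(field, cell):
    """Return the entries of a comma-separated cell that do not match the pattern."""
    pattern = ACCESSION_PATTERNS[field]
    entries = [entry.strip() for entry in cell.split(",")]
    return [entry for entry in entries if entry and not pattern.match(entry)]


def has_sequence_accession(row):
    """A row needs a Run accession or an Assembly accession (or both)."""
    run = row.get("Run accession", "").strip()
    assembly = row.get("Assembly accession", "").strip()
    return bool(run or assembly)


def parse_lat_lon(text):
    """Return signed (latitude, longitude) in decimal degrees, or None if malformed."""
    match = LAT_LON.match(text.strip())
    if match is None:
        return None
    lat, lon = float(match.group(1)), float(match.group(3))
    if lat > 90 or lon > 180:
        return None
    # South and West are expressed as negative values.
    return (lat if match.group(2) == "N" else -lat,
            lon if match.group(4) == "E" else -lon)


print(bad_accessions("Run accession", "SRR123456, ERR98765, XYZ1"))
print(parse_lat_lon("38.98 N 77.11 W"))
```

The accession strings in the usage lines are made up for illustration and do not refer to real records.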

Sampling fields

These contextual data describe the purpose of sampling, and the sampling strategy for the collection from which each isolate is derived. Please complete one row per isolate.

Variable fields, and guidance for completing them, are summarised in the table below. Definitions and detailed examples are also shown below the table.

Status	Variable	Definition	Guidance	Value format
REQUIRED	purpose of sampling	Primary purpose for sampling bacterial isolates	Indicate the primary purpose for the collection and sequencing of these isolates (e.g. routine diagnostics, outbreak investigation, research). Choose from the list, or if none of the values are appropriate, provide the reason as free text. Definitions are shown below this table.	Routine diagnostics and / or infection control \| Routine surveillance \| Outbreak investigation / outbreak-initiated surveillance \| Research \| {text}
REQUIRED	study population	Population from which bacterial isolates were sampled	Give details about the population of hosts or environments represented in the sample (e.g. Hospital patients, Neonates, Hospital wastewater). This information is essential to inform the inclusion and exclusion of studies for aggregate or comparative epidemiological analyses. Choose common values from the list, or if none of the values are appropriate, enter the information as free text. Multiple values can be specified (comma-separated).	Hospital patients \| Intensive Care Unit (ICU) patients \| Primary care patients \| Community participants \| Neonates \| Clinical environment: sinks and drains \| Clinical environment: surfaces \| Medical devices \| Hospital wastewater \| Wastewater (not hospital) \| Fresh water \| Seawater \| Soil \| Rhizosphere \| Plants \| Livestock \| Companion animals \| Captive animals \| Wild animals \| Food \| {text}
REQUIRED	target epi	Broad epidemiological category of the study	Indicate the broad epidemiological category of the study (e.g. Host colonisation, Host infection, Environmental). This information is useful to inform aggregate or comparative analyses of disease-associated vs non-disease associated isolates. Choose from the list, or if none of the values are appropriate, enter the information as free text.	Host infection \| Host colonisation \| Environmental \| Host infection & colonisation \| Host infection, colonisation & environmental \| {text}
REQUIRED if target epi includes 'Host infection'	selected by clinical phenotype	Flag to indicate whether isolates were selected for inclusion on the basis of host clinical phenotype	Indicate whether isolates were selected for inclusion on the basis of host clinical phenotype (e.g. blood stream infection, liver abscess, severe infection) or if no selection was applied. Choose from the list. This information is essential to inform studies focussed on specific infection types or disease severity. E.g. to determine serotype distributions among invasive infection isolates or compare rates of drug resistance among blood stream infections. The specific phenotype used for selection can be indicated in the 'selected clinical phenotype' field.	Selected by clinical phenotype \| NOT selected by clinical phenotype
REQUIRED if selected by clinical phenotype = 'Selected by clinical phenotype'	selected clinical phenotype	Clinical phenotype used to select isolates for inclusion	Indicate the specific clinical phenotype that was used to select samples for collection and/or sequencing. Choose common values from the list, or if none of the values are appropriate, enter the information as free text. Multiple values can be specified (comma-separated).	Liver abscess \| Invasive infection \| Blood stream infection \| Respiratory infection \| Urinary tract infection \| Hospital acquired infection \| Community acquired infection \| Severe disease \| {text}
REQUIRED	selected by organism trait	Flag to indicate whether isolates were selected for inclusion on the basis of microbial trait	Indicate if samples were selected for inclusion on the basis of a microbial phenotype or genotype (e.g. specific drug resistance or serotype, presence of a specific gene) or if no selection was applied. Choose from the list. This information is essential to inform studies aiming to estimate the prevalence of microbial phenotypes / genotypes by study populations, geographies, etc. For example, to estimate national prevalence of ceftriaxone or carbapenem resistant isolates. The specific phenotype or genotype used for selection can be indicated in the 'selected organism trait' field.	Selected by organism trait \| NOT selected by organism trait
REQUIRED if selected by organism trait = 'Selected by organism trait'	selected organism trait	Microbial trait used to select isolates for inclusion	Indicate the specific microbial phenotype or genotype that was used to select isolates for collection and/or sequencing. Choose common values from the list, or if none of the values are appropriate, enter the information as free text. Multiple values can be specified (comma-separated).	Ceftriaxone resistance \| Carbapenem resistance \| Drug resistance (not ceftriaxone or carbapenem) \| ESBL producers \| Carbapenemase producers \| OXA positive \| NDM positive \| KPC positive \| iuc (aerobactin) positive \| iro (salmochelin) positive \| rmpA positive \| peg-344 positive \| String-test positive \| Hypermucoviscous by low-speed centrifugation \| Hypermucoviscous by percoll-gradient sedimentation \| 7-gene multi-locus sequence type \| Serotype \| {text}
RECOMMENDED	sampling period start	Start date for the sampling period	Indicate when the sample collection began (YYYY, or YYYY-MM or YYYY-MM-DD). This information is useful for understanding the temporal coverage of data to inform trend analysis.	{ISO format}
RECOMMENDED	sampling period end	End date for the sampling period	Indicate when the sample collection ended (YYYY, or YYYY-MM or YYYY-MM-DD). This information is useful for understanding the temporal coverage of data to inform trend analysis. If collection and sequencing are on-going, leave this field blank.	{ISO format}
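
The conditional 'REQUIRED if' rules and the partial date formats in this table can be checked programmatically. The following is a rough sketch rather than an official validator: `is_partial_date` and `missing_conditional_fields` are hypothetical names, and a row is assumed to be a dictionary keyed by the variable names above.

```python
import re
from datetime import date

# Accepts YYYY, YYYY-MM or YYYY-MM-DD.
PARTIAL_DATE = re.compile(r"^(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?$")


def is_partial_date(text):
    """True if text is a valid YYYY, YYYY-MM or YYYY-MM-DD value."""
    match = PARTIAL_DATE.match(text)
    if match is None:
        return False
    year, month, day = match.groups()
    try:
        # Missing month/day default to 1 so that the calendar check still runs.
        date(int(year), int(month or 1), int(day or 1))
    except ValueError:
        return False
    return True


def missing_conditional_fields(row):
    """List conditionally required sampling fields that are blank in this row."""
    missing = []
    if "Host infection" in row.get("target epi", "") \
            and not row.get("selected by clinical phenotype"):
        missing.append("selected by clinical phenotype")
    if row.get("selected by clinical phenotype") == "Selected by clinical phenotype" \
            and not row.get("selected clinical phenotype"):
        missing.append("selected clinical phenotype")
    if row.get("selected by organism trait") == "Selected by organism trait" \
            and not row.get("selected organism trait"):
        missing.append("selected organism trait")
    return missing


row = {
    "target epi": "Host infection & colonisation",
    "selected by organism trait": "Selected by organism trait",
    "sampling period start": "2019-05",
}
print(missing_conditional_fields(row))
print(is_partial_date(row["sampling period start"]))
```

A blank 'sampling period end' is valid for ongoing collections, so it should be skipped rather than passed to `is_partial_date`.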

Term definitions for purpose-of-sampling

Routine diagnostics and / or infection control

Samples collected through the routine and ongoing activities of clinical or veterinary microbiology laboratories for the purposes of clinical diagnosis and/or infection control. This may include isolates confirmed as infecting agents and/or those considered asymptomatic or environmental colonisers. E.g. isolates identified from hospital sinks or patient screening swabs as part of routine infection prevention and control procedures.

Routine surveillance

Samples collected through the routine and ongoing activities of other laboratories (not clinical or veterinary microbiology laboratories) and/or collected for purposes other than clinical diagnostics and infection control, e.g. laboratories processing samples from non-healthcare environmental sources or food products.

Outbreak investigation / outbreak-initiated surveillance

Samples collected as part of a response to a specific outbreak, e.g. within a hospital or other healthcare setting (human or veterinary). This may include isolates confirmed as infecting agents and/or those considered asymptomatic colonisers (e.g. from screening swabs) and/or those from environmental sources (e.g. hospital sinks, drains etc.).

Research

Samples collected for specific research purposes (excluding outbreak investigation / outbreak-initiated surveillance) that would not have otherwise been collected via routine diagnostics, infection control or surveillance activities as described above.

Examples of how to describe study designs using the sampling fields

Below we describe various hypothetical study designs and show how the sampling fields would be populated for each.

Neonatal sepsis study

K. pneumoniae were isolated from the blood of neonates via routine diagnostic procedures. All isolates collected between 01 Jan 2019 and 31 Dec 2020 were stocked and subjected to whole genome sequencing.
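
One plausible way to record this design is sketched below as a Python dictionary keyed by the sampling field names. The variable name `neonatal_sepsis` and the choice of values are our reading of the description above, not an official worked example.

```python
# Illustrative only: values are picked from the option lists in the sampling fields table.
neonatal_sepsis = {
    "purpose of sampling": "Routine diagnostics and / or infection control",
    "study population": "Neonates",
    "target epi": "Host infection",
    # Only blood isolates were kept, so we treat this as phenotype-based selection.
    "selected by clinical phenotype": "Selected by clinical phenotype",
    "selected clinical phenotype": "Blood stream infection",
    "selected by organism trait": "NOT selected by organism trait",
    "selected organism trait": "",  # blank: no trait-based selection
    "sampling period start": "2019-01-01",
    "sampling period end": "2020-12-31",
}
```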

Ceftriaxone-resistant infection study

K. pneumoniae identified via routine diagnostic procedures from hospitalised patients in a tertiary care centre between February 2016 and February 2018 were collected. Isolates resistant to ceftriaxone were selected for sequencing.
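
A possible encoding of this design, again as a Python dictionary, is shown here. It is a sketch under our own interpretation (the name `ceftriaxone_study` is hypothetical): the key point is that selection was on an organism trait, not on a clinical phenotype, and that only month-level dates are known.

```python
# Illustrative only: our reading of the study description.
ceftriaxone_study = {
    "purpose of sampling": "Routine diagnostics and / or infection control",
    "study population": "Hospital patients",
    "target epi": "Host infection",
    "selected by clinical phenotype": "NOT selected by clinical phenotype",
    "selected clinical phenotype": "",
    "selected by organism trait": "Selected by organism trait",
    "selected organism trait": "Ceftriaxone resistance",
    "sampling period start": "2016-02",  # YYYY-MM, day not stated
    "sampling period end": "2018-02",
}
```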

CPE outbreak study

In May 2019 there was a sudden increase in CPE infections in the ICU of a large tertiary care centre. Enhanced infection prevention and control procedures were activated from 18 May 2019 until 31 August 2019 when the outbreak was declared contained. Rectal screening swabs were collected on patient admission and every three days thereafter, in addition to sink and drain screening swabs. All swabs were cultured on selective media and presumptive carbapenem-resistant K. pneumoniae were sequenced alongside all carbapenem-resistant K. pneumoniae identified from ICU patients via routine diagnostics procedures.
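
This design combines patient and environmental sampling, so several fields take multiple or combined values. The dictionary below is one way it might be filled in; `cpe_outbreak` is a hypothetical name and the values reflect our interpretation of the description.

```python
# Illustrative only: multiple study populations are given comma-separated.
cpe_outbreak = {
    "purpose of sampling": "Outbreak investigation / outbreak-initiated surveillance",
    "study population": "Intensive Care Unit (ICU) patients, Clinical environment: sinks and drains",
    "target epi": "Host infection, colonisation & environmental",
    "selected by clinical phenotype": "NOT selected by clinical phenotype",
    "selected clinical phenotype": "",
    "selected by organism trait": "Selected by organism trait",
    "selected organism trait": "Carbapenem resistance",
    "sampling period start": "2019-05-18",
    "sampling period end": "2019-08-31",
}

# Splitting a multi-value cell back into its individual terms.
populations = [term.strip() for term in cpe_outbreak["study population"].split(",")]
```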

CR-hvKp study

Carbapenem-resistant K. pneumoniae were isolated from liver abscess patients as part of a research study focussed on diabetic patients, between 01 June 2018 and 30 June 2020. Strains carrying K. pneumoniae carbapenemase genes were detected by PCR and a string test was used to determine hypermucoidy. String test positive isolates harbouring bla_KPC were subjected to whole genome sequencing.
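
Here isolates were selected on both a clinical phenotype and several organism traits. A hedged sketch of how that might be captured follows; `cr_hvkp` is a hypothetical name, and the study population value is an assumption, since a free-text entry describing diabetic patients would also be permitted.

```python
# Illustrative only: several organism traits are given comma-separated.
cr_hvkp = {
    "purpose of sampling": "Research",
    "study population": "Hospital patients",  # assumption; free text is also allowed
    "target epi": "Host infection",
    "selected by clinical phenotype": "Selected by clinical phenotype",
    "selected clinical phenotype": "Liver abscess",
    "selected by organism trait": "Selected by organism trait",
    "selected organism trait": "Carbapenem resistance, KPC positive, String-test positive",
    "sampling period start": "2018-06-01",
    "sampling period end": "2020-06-30",
}

traits = [term.strip() for term in cr_hvkp["selected organism trait"].split(",")]
```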

Pig gut carriage study

Veterinary researchers collected 100 faecal samples from each of six pig farms in June 2017. K. pneumoniae were isolated by culture on SCAI media and subjected to whole genome sequencing as part of a One Health research project.

(Note that the specific hosts, i.e. pigs, should be indicated in the isolate metadata field 'Host', rather than in the sampling fields.)
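
For a carriage study like this one, the clinical phenotype fields do not apply because 'target epi' does not include 'Host infection'. One possible encoding is sketched below; `pig_carriage` is a hypothetical name and the values are our interpretation.

```python
# Illustrative only: the host species goes in the isolate metadata, not here.
pig_carriage = {
    "purpose of sampling": "Research",
    "study population": "Livestock",
    "target epi": "Host colonisation",
    "selected by clinical phenotype": "",  # not required: no 'Host infection' in target epi
    "selected clinical phenotype": "",
    "selected by organism trait": "NOT selected by organism trait",
    "selected organism trait": "",
    "sampling period start": "2017-06",
    "sampling period end": "2017-06",
}
```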

Water surveillance study

K. pneumoniae were isolated from fresh and wastewaters in a metropolitan centre as part of routine water surveillance conducted by the Environmental Protection Authority. Since 2021, all isolates have been stocked and 100 isolates have been randomly selected for sequencing each year. Sampling and sequencing are ongoing.
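
Because collection is ongoing, the end date is left blank in this case. The dictionary below is a sketch of how the fields might look; `water_surveillance` is a hypothetical name, and we assume that random selection of isolates does not count as selection by organism trait.

```python
# Illustrative only: year-level start date, blank end date for an ongoing collection.
water_surveillance = {
    "purpose of sampling": "Routine surveillance",
    "study population": "Fresh water, Wastewater (not hospital)",
    "target epi": "Environmental",
    "selected by clinical phenotype": "",  # not required for environmental studies
    "selected clinical phenotype": "",
    "selected by organism trait": "NOT selected by organism trait",
    "selected organism trait": "",
    "sampling period start": "2021",
    "sampling period end": "",  # blank: sampling and sequencing are ongoing
}
```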

Queries and suggestions

We welcome queries and suggestions from the community on any aspect of this scheme. In particular, please notify us if you think we have missed key data fields or options, or if the guidance is unclear. You can contact us via the issue tracker.

License

These resources are freely available for reuse and adaptation under the GNU General Public License v3. We encourage the development of similar schemes for other organisms.
