This repository has been archived by the owner on Dec 13, 2019. It is now read-only.

Overview of Phenotype Exchange Format

Michael Baudis edited this page Sep 12, 2016 · 2 revisions

Introduction

While great strides have been made in exchange formats for sequence and variation data (e.g. Variant Call Format; VCF1), complementary standards for phenotypes and environment are urgently needed. For individuals with rare and undiagnosed diseases, such standards could improve the speed and accuracy of diagnosis. For patients with common but hard-to-treat diseases, such standards can help us design personalized interventions and learn more about shared disease mechanisms2. The development of a clinical phenotype data exchange standard is both necessary and timely. It is necessary because study sizes of well over 100,000 patients are thought to be required to effectively assess the role of rare variation in common disease3 or to discover the genomic basis for a substantial portion of Mendelian diseases4. It is timely because studies of this power are now becoming financially and technologically tractable.

Phenotypic abnormalities of individuals are currently described in diverse places in diverse formats: publications, databases, health records, and even in social media. We propose that these descriptions a) contain a minimum set of fields and b) get transmitted alongside genomic sequence data, such as in VCF, between clinics, authors, journals, and data repositories. The structure of the data in the exchange standard will be optimized for integration from these distributed contexts. The implementation of such a system will allow the sharing of phenotype data prospectively, as well as retrospectively. Increasing the volume of computable data across a diversity of systems will support large-scale computational disease analysis using the combined genotype and phenotype data.

The terms ‘disease’ and ‘phenotype’ are often conflated. Here we use ‘phenotype’ to refer to a phenotypic feature, such as hypoglycemia, that is a component of a disease, such as diabetes mellitus type II. The Phenotype Exchange Format (PXF) proposed here is designed to support “deep phenotyping”, a process wherein individual components of each phenotype are observed and documented5. The PXF requires the use of a common ontology, a logically defined hierarchy of terms, that allows sophisticated algorithmic analysis over medically relevant abnormalities. The Human Phenotype Ontology6 (HPO) was built for this purpose and has been used for genomic diagnostics, translational research, genomic matchmaking, and systems biology applications7–14. The HPO is developed in the context of the Monarch Initiative, an international team of computer scientists, clinicians, and biologists in the United States, Europe, and Australia; the HPO is being translated into multiple languages to support international interoperability. Because its phenotypic coverage extends beyond that of other terminologies15,16, the HPO has recently been integrated into the Unified Medical Language System (UMLS) to support deep phenotyping in a variety of mainstream health care IT systems.

Figure 1. Phenopacket data exchange in the biomedical ecosystem. Multiple providers of phenotypic data include patients and clinicians, via a variety of mechanisms. Such Phenopackets can be created by a variety of tools and consumed by journals, databases, patient matchmaking services, EHR systems, and genomic analysis tools.

The online supplementary material to this article presents version 1.0 of the PXF standard proposed here. The format defines the required information expected to be transmitted about each individual, known as a “Phenopacket”; it includes items such as a patient identifier (non-PHI), age or age group, sex, and a list of one or more phenotypic abnormalities represented by ontology terms. The use of HPO is recommended; if that is not possible, an alternative terminology as represented in the International Consortium of Human Phenotype Terminologies (ICHPT), an activity of the International Rare Diseases Research Consortium (IRDiRC), is acceptable. Figure 1 provides a summary of the Phenopacket exchange ecosystem, and the online supplement provides concrete examples of PXF encoded in several exchange formats such as XML, JSON, and RDF. Note that PXF is designed to be compatible with a variety of rare disease phenotyping efforts, such as the 100,000 Genomes Project17.
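As a rough illustration of the fields listed above, the following Python sketch assembles a Phenopacket-like record and serializes it to JSON. The field names (`id`, `age`, `sex`, `phenotypes`) and the `make_phenopacket` helper are hypothetical and chosen for this example only; they are not taken from the PXF 1.0 schema in the supplement.

```python
import json


def make_phenopacket(patient_id, age, sex, phenotypes):
    """Build a minimal Phenopacket-like dict (illustrative field names only)."""
    if not phenotypes:
        # The text above asks for a list of one or more phenotypic abnormalities.
        raise ValueError("at least one phenotypic abnormality is required")
    return {
        "id": patient_id,  # assumed to be a non-PHI identifier
        "age": age,
        "sex": sex,
        "phenotypes": [
            {"id": term_id, "label": label} for term_id, label in phenotypes
        ],
    }


packet = make_phenopacket(
    "patient-001",  # hypothetical identifier
    "P43Y08M",
    "female",
    [("HP:0001943", "Hypoglycemia")],
)
packet_json = json.dumps(packet, indent=2)
print(packet_json)
```

The same dictionary could be written out as XML or RDF instead; JSON is used here only because it needs nothing beyond the standard library.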

Requirements

Historically, successful standards evolve gradually over time. They are not designed in the abstract, springing fully formed from committee, but rather are developed incrementally as they are taken into the field and proven to successfully meet real-world challenges. Level 1 of the PXF is intentionally simple in order to ease wide adoption of the standard and thereby increase the value of the network of systems (see Figure 1) aiming to share Phenopackets for computational use. The requirements for such a standard are:

  1. Computable. The standard must be both human and machine-interpretable, enabling computing operations and validation on the basis of defined relationships between diagnoses, lab measurements, genotypic information, and medications.
  2. Transferable. The standard must enable seamless transfer of data from a data source (e.g., a document describing the phenotype) to a data receiver (e.g., an application that receives and uses it). The standard can have multiple serializations, such as tab-separated-values, XML, or JSON.
  3. Utilize an ontology for phenotypes. The standard must enable “fuzzy matching”, that is, the use of algorithms that leverage the logic within an ontology to match sets of phenotypes that are related but not exact matches. This is currently mission-critical for rare disease, and we believe it will also greatly facilitate precision medicine.

Journals can aid use of the PXF standard by supporting data citation of Phenopackets. Each Phenopacket is essentially a metadata record that will be made available as a separate online document resolvable by a Digital Object Identifier (DOI)18. Phenopackets can be deposited in the journal, in a public phenotype data repository such as the Monarch Initiative, or in generic data repositories such as FigShare. This approach ensures that the phenotype data described within a manuscript are made computable outside the paywall of journals and can be cited within the original article via the DOI. The PXF has been adopted as a recommended or mandatory standard by journals including the CSH Molecular Case Studies, the Orphanet Journal of Rare Diseases, and XXX.
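The “fuzzy matching” named in requirement 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hierarchy below uses invented `TOY:` terms rather than real HPO identifiers, and Jaccard similarity over ancestor sets is one simple measure chosen for this example, not the method prescribed by PXF or used by any particular tool.

```python
# Toy is-a hierarchy (child -> parents). These terms are invented for
# illustration and are NOT real HPO identifiers.
TOY_PARENTS = {
    "TOY:root": [],
    "TOY:metabolic": ["TOY:root"],
    "TOY:glucose": ["TOY:metabolic"],
    "TOY:low_glucose": ["TOY:glucose"],
    "TOY:high_glucose": ["TOY:glucose"],
    "TOY:cardiac": ["TOY:root"],
    "TOY:valve": ["TOY:cardiac"],
}


def ancestors(term, parents=TOY_PARENTS):
    """Return the term together with all of its ancestors."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def fuzzy_similarity(profile_a, profile_b):
    """Jaccard similarity of the ancestor closures of two phenotype profiles."""
    closure_a = set().union(*(ancestors(t) for t in profile_a))
    closure_b = set().union(*(ancestors(t) for t in profile_b))
    union = closure_a | closure_b
    if not union:
        return 0.0
    return len(closure_a & closure_b) / len(union)


# Sibling terms share most ancestors, so they score higher than terms
# from unrelated branches even though neither pair is an exact match.
related = fuzzy_similarity(["TOY:low_glucose"], ["TOY:high_glucose"])
unrelated = fuzzy_similarity(["TOY:low_glucose"], ["TOY:valve"])
print(related > unrelated)
```

Two profiles with no identical terms can still receive a non-zero score because the comparison runs over shared ancestors, which is the behavior an exact string match cannot provide.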

It is hoped that public data repositories will begin to accept phenotype data provided in PXF. For example, the Monarch Initiative19 is already pulling Phenopacket data from the aforementioned journals and also provides an online editor tool for creating them. A variety of international efforts that aim to standardize genotype-phenotype data, such as the International Rare Diseases Research Consortium (IRDiRC) and the Global Alliance for Genomics and Health (GA4GH), support the use of this new PXF standard for sharing phenotypic data related to variant and other genomic health data.

What is in a Phenopacket?

Achieving a functional and community-adopted PXF standard will require addressing several critical requirements, which are only partially fulfilled by this first release of Level 1 of the standard. Here we detail the basic level 1 components, and urge the community to participate in helping extend and evaluate the PXF:

Each component is listed with its content name, a description, and an example where one is given.

Phenotypes

Representation of phenotypic features using an ontology term with a resolvable and versionable identifier.

http://purl.obolibrary.org/obo/HP_0001943 ‘Hypoglycemia’, defined as: “A decreased concentration of glucose in the blood.”

Age of onset

For each phenotype, the exact age or age range at which the phenotype first manifested can be indicated. Ages should follow the ISO 8601 duration format for years and months; alternatively, HPO ontology terms for age ranges are recommended.

P43Y08M or Adult onset (HP:0003581)
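To show how an age of onset such as `P43Y08M` might be read by software, here is a minimal Python sketch. The `parse_age` helper is hypothetical and deliberately handles only the year and month components seen in the example above; it is not a full ISO 8601 duration parser.

```python
import re

# Years and months only, matching the example "P43Y08M". Other ISO 8601
# duration components (weeks, days, time) are deliberately not handled.
_AGE_PATTERN = re.compile(r"P(?:(\d+)Y)?(?:(\d+)M)?")


def parse_age(duration):
    """Return (years, months) from a duration string such as 'P43Y08M'."""
    match = _AGE_PATTERN.fullmatch(duration)
    if match is None or duration == "P":
        raise ValueError(f"unsupported age format: {duration!r}")
    years, months = (int(group) if group else 0 for group in match.groups())
    return years, months


age_years, age_months = parse_age("P43Y08M")
print(age_years, age_months)
```

An onset recorded as an HPO term such as Adult onset (HP:0003581) would bypass this parser and be stored as an ontology identifier instead.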

Negation of phenotypes

Notable absence of a particular phenotype or phenotypic class.

NOT Aortic regurgitation (HP:0001659)
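One way to carry negation alongside an ontology term is a boolean flag on each assertion, as in the Python sketch below. The `PhenotypeAssertion` class and its `negated` field are assumptions made for illustration; the PXF supplement may encode negation differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhenotypeAssertion:
    """A phenotype term plus a flag for explicit absence (hypothetical model)."""

    term_id: str
    label: str
    negated: bool = False  # True: the feature was assessed and found absent

    def __str__(self):
        prefix = "NOT " if self.negated else ""
        return f"{prefix}{self.label} ({self.term_id})"


observations = [
    PhenotypeAssertion("HP:0001943", "Hypoglycemia"),
    PhenotypeAssertion("HP:0001659", "Aortic regurgitation", negated=True),
]

# Keep observed and explicitly excluded features apart so that matching
# algorithms do not treat an exclusion as a positive finding.
present = [o for o in observations if not o.negated]
excluded = [o for o in observations if o.negated]
print(str(excluded[0]))
```

Keeping the excluded list separate also distinguishes “assessed and absent” from “not assessed”, which a simple list of observed terms cannot express.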

Genomic data

Ability to link to a VCF file describing the patient’s genomic variants, or to give variants in HGVS notation.

Family history

Ability to link to a PED file describing familial linkage to other PXF files.

Quantitative specification

Quantitative phenotypes expressed in relative terms should be accompanied by the reference population and reference values. The format should be able to transmit not only qualitative ontology terms such as “Hyperglycemia” but also specific values such as “blood glucose 178 mg/dl”.
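A quantitative value and its reference context could travel together in one record, as in this Python sketch. The `Measurement` class, its field names, and the reference range and population shown are placeholders assumed for illustration; they are neither part of PXF nor clinical guidance.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Measurement:
    """A measured value with optional reference context (hypothetical model)."""

    name: str
    value: float
    unit: str
    reference_low: Optional[float] = None
    reference_high: Optional[float] = None
    reference_population: Optional[str] = None

    def interpret(self):
        """Classify the value against the reference range, if one is given."""
        if self.reference_low is None or self.reference_high is None:
            return "unknown"
        if self.value < self.reference_low:
            return "low"
        if self.value > self.reference_high:
            return "high"
        return "normal"


glucose = Measurement(
    name="blood glucose",
    value=178.0,
    unit="mg/dl",
    reference_low=70.0,    # placeholder range for this example only
    reference_high=100.0,  # placeholder range for this example only
    reference_population="hypothetical adult reference cohort",
)
print(glucose.interpret())
```

Without the reference fields the same value can only be reported as “unknown”, which is why the description asks for the reference population and values to accompany relative terms.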

Evidence

Any of the above elements may be linked to one or more evidence assertions. Evidence could include the following:

  • EHR record numbers
  • published papers
  • functional assays
  • computational models
  • population studies
  • clinical trials

Discussion

In many ways, the phenotype data exchange community is in a position similar to that of the genetics community in the early days of public sequence databases. Although the content of sequence descriptions has changed over the years, this evolution is a sign of success, not failure. Early descriptions played key roles both in promoting the effective use of sequence data and in understanding how that data should be recorded and communicated. The Phenopacket standard proposed here is tailored to function in the context of rare disease, and for precision medicine in cancer and other common diseases. We are currently in an exciting position: the standardization and exchange of a broad range of phenotype data can trigger a new wave of advances in medical discovery and realize the goal of precision medicine. Further, patient-centered phenotyping approaches offer the opportunity, if not the necessity, for affected individuals and their families to be involved and integrated into the wider context that is the future of precision medicine.

The documentation and use of data for patients with challenging-to-diagnose rare and genetic conditions differ from those for more common diseases. The realization of this vision and of the phenotype exchange requirements described here will require substantial effort. Given the relative immaturity of existing efforts, further research and prototypes of data capture and exchange systems will be necessary to better understand the issues. Such explorations will likely be undertaken by ongoing research efforts, many of which do not have the luxury of waiting for the completion of an emerging consensus model.

References

  1. Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–8 (2011).
  2. National Research Council. Toward Precision Medicine: Building a Knowledge Network for Biomedical Research and a New Taxonomy of Disease. (2011). at http://www.nap.edu/catalog/13284/toward-precision-medicine-building-a-knowledge-network-for-biomedical-research
  3. Zuk, O. et al. Searching for missing heritability: designing rare variant association studies. Proc. Natl. Acad. Sci. U. S. A. 111, E455–64 (2014).
  4. Krawitz, P., Buske, O., Zhu, N., Brudno, M. & Robinson, P. N. The genomic birthday paradox: how much is enough? Hum. Mutat. 36, 989–97 (2015).
  5. Robinson, P. N. Deep phenotyping for precision medicine. Hum. Mutat. 33, 777–80 (2012).
  6. Köhler, S. et al. The Human Phenotype Ontology project: linking molecular biology and disease through phenotype data. Nucleic Acids Res. 42, D966–74 (2014).
  7. Bayés, A. et al. Characterization of the proteome, diseases and evolution of the human postsynaptic density. Nat. Neurosci. 14, 19–21 (2011).
  8. Robinson, P. N. et al. Improved exome prioritization of disease genes through cross-species phenotype comparison. Genome Res. 24, 340–8 (2014).
  9. Singleton, M. V. et al. Phevor Combines Multiple Biomedical Ontologies for Accurate Identification of Disease-Causing Alleles in Single Individuals and Small Nuclear Families. Am. J. Hum. Genet. 94, 599–610 (2014).
  10. Javed, A., Agrawal, S. & Ng, P. C. Phen-Gen: combining phenotype and genotype to analyze rare disorders. Nat. Methods 11, 935–937 (2014).
  11. Sifrim, A. et al. eXtasy: variant prioritization by genomic data fusion. Nat. Methods 10, 1083–4 (2013).
  12. Soden, S. E. et al. Effectiveness of exome and genome sequencing guided by acuity of illness for diagnosis of neurodevelopmental disorders. Sci. Transl. Med. 6, 265ra168 (2014).
  13. Roadmap Epigenomics Consortium et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015).
  14. Castellano, S. et al. Patterns of coding variation in the complete exomes of three Neandertals. Proc. Natl. Acad. Sci. U. S. A. 111, 6666–71 (2014).
  15. Haendel, M. Why the Human Phenotype Ontology? 2015 at http://monarch-initiative.blogspot.com/2015/05/why-human-phenotype-ontology.html
  16. Winnenburg, R. & Bodenreider, O. Coverage of Phenotypes in Standard Terminologies. in Phenotype Day, ISMB (2014). at http://phenoday2014.bio-lark.org/pdf/5.pdf
  17. 100,000 genomes. at http://www.genomicsengland.co.uk/the-100000-genomes-project/
  18. The Digital Object Identifier system.
  19. Mungall, C. J. et al. Use of model organism and disease databases to support matchmaking for human disease gene discovery. Hum. Mutat. 36, 979–84 (2015).