iReceptor Repertoire Metadata files

The iReceptor Project maintains a Repertoire Metadata standard file format that it uses in its internal data curation process. Repertoire metadata files are UTF-8 encoded, comma-delimited text files (CSV files), consisting of a single header line followed by one line for each repertoire. The header line should consist of the names of the keys that will be stored in the AIRR Data Commons repository. If the intent is to create an AIRR-compatible repository, then the header line should consist of keys that map to AIRR fields as specified in the MiAIRR standard and defined in the MiAIRR Data Elements.

The MiAIRR Standard has a number of complex fields, including complex strings (which may include the "," field separation character), arrays of strings, and Ontology objects. Complex strings that contain special characters (including the field separation character ",") should be enclosed in quotation (") characters (e.g. "This, not that"). Fields that contain an array of strings (e.g. keywords_study) should use a set of simple strings separated by commas, with the whole set enclosed in quotation (") characters (e.g. "contains_tr, contains_paired_chain, contains_schema_rearrangement"). Ontology objects are represented in the CSV file by two fields. The column named after the base MiAIRR field (e.g. study_type) stores the Ontology label (e.g. "Case-Control Study"), while the Ontology ID is stored in a column with the same field name plus an _id suffix. For example, the study_type_id column in the CSV file should contain Ontology IDs such as "NCIT:C15197". Ontology IDs are represented as Compact URIs (CURIEs), which contain an identifier for the Ontology used (e.g. NCIT) and an ID from that Ontology (e.g. C15197), separated by a colon (:) character. See the MiAIRR Ontology documentation for more details on ontology IDs and their structure. Some fields in the MiAIRR Standard are controlled vocabularies and should contain only a limited set of string values. More information on these fields can be found on the MiAIRR Data Elements page or the Metadata Guidelines page.
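As a minimal sketch of the encoding rules above, the following Python snippet writes one row with the standard library csv module, which quotes any field containing a comma. The row values are the examples from the text, the study_title value is only a stand-in for a complex string, and split_curie is a hypothetical helper rather than part of any iReceptor tool.

```python
import csv
import io

# Hypothetical single-repertoire row built from the examples in the text.
row = {
    "study_title": "This, not that",  # complex string containing a comma
    "keywords_study": ", ".join(["contains_tr", "contains_paired_chain"]),  # array of strings
    "study_type": "Case-Control Study",  # Ontology label
    "study_type_id": "NCIT:C15197",  # Ontology ID as a CURIE
}


def split_curie(curie):
    """Split a CURIE such as 'NCIT:C15197' into (ontology prefix, local ID)."""
    prefix, _, local_id = curie.partition(":")
    if not prefix or not local_id:
        raise ValueError(f"not a CURIE: {curie!r}")
    return prefix, local_id


# Write the header line and the row; fields with commas are quoted automatically.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row), lineterminator="\n")
writer.writeheader()
writer.writerow(row)
csv_text = buffer.getvalue()
print(csv_text)
```

Reading the text back with csv.DictReader returns the quoted fields as single strings, so an array field can be recovered by splitting on ", ".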

Fields that have no data should be left blank (the CSV field should contain no data). Strings such as "NA", "None", "null" should NOT be used. Boolean fields should have values of either TRUE or FALSE.
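The blank-value and boolean rules can be checked before loading. The sketch below is an illustration only, not part of the iReceptor Data Loading scripts: find_value_problems is a hypothetical helper, the set of boolean columns must be supplied by the caller, and matching the placeholder strings case-insensitively is an assumption.

```python
import csv
import io

# Placeholder strings named in the text, compared case-insensitively (an assumption).
PLACEHOLDERS = {"na", "none", "null"}


def find_value_problems(csv_text, boolean_columns):
    """Return (row number, column, value) for each value that breaks the rules.

    Row numbers count the header line as row 1.
    """
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=2):
        for column, value in row.items():
            if column is None:  # extra cells beyond the header; not checked here
                continue
            value = (value or "").strip()
            if value.lower() in PLACEHOLDERS:
                problems.append((row_number, column, value))
            elif column in boolean_columns and value not in ("", "TRUE", "FALSE"):
                problems.append((row_number, column, value))
    return problems


sample = "subject_id,synthetic,sex\nS1,TRUE,\nS2,yes,NA\n"
for problem in find_value_problems(sample, {"synthetic"}):
    print(problem)
```

In this sample the blank sex value in the first data row is accepted, while "yes" and "NA" in the second data row are reported.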

The iReceptor Turnkey repository is designed to load data at the most granular level possible. The iReceptor team recommends that you split all data such that there is only one "type" of sequence data in each input data file. Splitting data by "type" of sequence means that data from each B-cell (IGH, IGK, IGL) or T-cell (TRA, TRB, TRD, TRG) locus should be stored in its own file. Each file would then be represented as a single row in the iReceptor Metadata file. Typically, there would be a single file with data from a single locus listed in the data_processing_files field for each row in the metadata file. The iReceptor data loader associates repertoire metadata with other data using the file name in the data_processing_files field when loading Rearrangements, Clones, and Cells. It is therefore important to store the correct file name in the data_processing_files field so that the link between Repertoires and Rearrangements/Clones/Cells can be made when these data are loaded. Although it is possible to load multi-locus Repertoires, this is not supported using the iReceptor Metadata CSV file format. Please refer to the AIRR JSON format data loading for this purpose. The iReceptor team recommends using the data_processing_protocols field to store detailed information about how the data was processed, including example shell script commands if that is possible (see examples).
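Because the link to the sequence data relies on the file name, it can help to confirm that every row names exactly one file and that no file is named twice. The sketch below is a hypothetical pre-load check, not the loader's own logic: index_rows_by_file is an invented name, and treating a blank or repeated data_processing_files value as an error is an assumption drawn from the one-file-per-row recommendation above.

```python
import csv
import io


def index_rows_by_file(csv_text):
    """Map each data_processing_files value to the CSV row number that lists it.

    Row numbers count the header line as row 1.
    """
    index = {}
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=2):
        file_name = (row.get("data_processing_files") or "").strip()
        if not file_name:
            raise ValueError(f"row {row_number}: data_processing_files is blank")
        if file_name in index:
            raise ValueError(
                f"row {row_number}: {file_name!r} is already listed on row {index[file_name]}"
            )
        index[file_name] = row_number
    return index


# Hypothetical metadata with one file per locus, one row per file.
sample = (
    "sample_id,pcr_target_locus,data_processing_files\n"
    "S1,IGH,S1_IGH.tsv\n"
    "S1,IGK,S1_IGK.tsv\n"
)
print(index_rows_by_file(sample))
```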

Repertoire metadata files are loaded into a repository using the iReceptor Data Loading scripts. The data loading scripts will store every column in the repertoire metadata CSV file that has a header in the first row, and will warn the user if the file is missing MiAIRR-compatible field names. If a column has no header, the data in that column will not be loaded. Please refer to the documentation on iReceptor Turnkey Data Loading for how to use these scripts to load repertoire data into a repository.
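The header rules can be previewed with a short check before running the loader. This is a sketch under stated assumptions: inspect_header is a hypothetical helper, and the expected field names passed to it are a small illustrative subset chosen by the caller, not the list the Data Loading scripts actually use.

```python
import csv
import io


def inspect_header(csv_text, expected_fields):
    """Return (positions of unnamed columns, expected field names not in the header).

    Column positions are 1-based.
    """
    header = next(csv.reader(io.StringIO(csv_text)))
    unnamed = [position for position, name in enumerate(header, start=1) if not name.strip()]
    present = {name.strip() for name in header}
    missing = sorted(set(expected_fields) - present)
    return unnamed, missing


# The second column has no header, so its data would not be loaded.
sample = "study_id,,sample_id\nPRJ1,x,S1\n"
unnamed, missing = inspect_header(sample, {"study_id", "subject_id", "sample_id"})
print("columns without a header:", unnamed)
print("expected fields not found:", missing)
```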

Example iReceptor Repertoire Metadata Files

To ease the use of iReceptor Repertoire Metadata files, this directory contains an annotated example Repertoire Metadata Excel file. It consists of a set of rows that describe realistic repertoire samples from an actual study curated in an iReceptor repository. The first two rows are header rows. The first defines the type of each field (AIRR fields have specific types) and is included for clarity. This row should be removed before loading the data into a repository. The second row defines the field names loaded into the repository. Headers in green are AIRR terms and should be provided if possible. Headers in yellow are internal iReceptor fields that we find useful to set but that can be ignored. In general, we recommend that you use the AIRR Minimal Standard terms for these columns (these are the default columns provided), but any columns used will be loaded into the repository, including custom, non-AIRR columns.

Documentation on the columns and their uses is provided in the Excel spreadsheet itself. To load the data into an iReceptor Turnkey repository, it is necessary to export the spreadsheet to a UTF-8 encoded, comma-separated CSV file. The first row, which contains the column types, should be removed.
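Removing the type row can also be done after the export. The sketch below assumes the spreadsheet has already been exported to CSV with the type row still in place; drop_type_row and both file paths are hypothetical. It reads with the utf-8-sig codec so that a byte order mark, which some spreadsheet exports add, does not end up in the first field name.

```python
import csv


def drop_type_row(source_path, target_path):
    """Copy a CSV file without its first row and return the new header row."""
    # utf-8-sig strips a leading byte order mark if one is present.
    with open(source_path, newline="", encoding="utf-8-sig") as source:
        rows = list(csv.reader(source))
    # Write plain UTF-8 with the csv module handling quoting.
    with open(target_path, "w", newline="", encoding="utf-8") as target:
        csv.writer(target).writerows(rows[1:])
    return rows[1] if len(rows) > 1 else []
```

A call such as drop_type_row("exported.csv", "repertoire_metadata.csv") leaves the field-name row as the first line of the output file.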