
MoBaGenetics1.5

Gutorm Høgåsen edited this page Sep 8, 2022 · 81 revisions


Background

MoBa Genetics has a long history, and projects used genotyped data-sets based on the Mother, Father and Child Cohort Study (MoBa) samples before MoBa Genetics even existed.

This page documents what will later become the official published version of MoBa Genetics (MoBa Genetics 1.5). Keep in mind that, as for Version 1.0, "published" in this context means accessible to projects that have been allowed to use MoBa Genetics.

Important warning

Due to legal/ethical aspects, MoBaGenetics datasets will not be completely static: participants can ask for their data to be deleted. It is still undecided whether we will change the version numbering when this happens, but what is important is that we will delete data as requested and without keeping a backup.

While we work on MoBa Genetics 1.5 the data-structure and naming conventions might change slightly, and data-errors might be corrected. As of June 2022, we think that the structure/format has found its form. The structure is described below.

You will have access to the same raw-data as we have. The only thing we do is to standardize the naming of samples and the structure in which they are presented.

Later, the version MoBaGenetics 1.0 will be deleted. We will first delete the raw data - plink and idat files. When the 1.5 QC is in place, 1.0 will be completely removed. This brutal deletion is necessary because when participants ask for their data to be deleted, it must be deleted from every data-set. Dynamically maintaining multiple sets (including QC!) is just too expensive for us.

Sex mismatch

8.6.22: In many sets, in particular those numbered after snp014 (snp015 onwards), we have seen problems with the sex coding in the plink files (.fam). You cannot trust that this coding follows the fam-file format: it seems that sometimes 0 is female (should be 2). We are working on resolving/documenting this further.
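As a quick way to see how a given set is affected, the sex column of a .fam file can be tabulated. The sketch below relies only on the standard plink .fam layout (six whitespace-separated fields, with sex in the fifth: 1 = male, 2 = female, 0 = unknown). The function name and the example lines are made up for illustration and are not MoBa data.

```python
from collections import Counter

# Sex codes as defined by the plink .fam format (5th column).
FAM_SEX_LABELS = {"1": "male", "2": "female", "0": "unknown"}


def count_sex_codes(fam_lines):
    """Count the values found in the sex column (5th field) of .fam lines."""
    counts = Counter()
    for line in fam_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        counts[fields[4]] += 1
    return counts


# Hypothetical .fam content: FID IID father mother sex phenotype
example = [
    "1001_R01C01 1001_R01C01 0 0 1 -9",
    "1001_R02C01 1001_R02C01 0 0 0 -9",
    "1001_R03C01 1001_R03C01 0 0 2 -9",
    "1001_R04C01 1001_R04C01 0 0 0 -9",
]
for code, n in sorted(count_sex_codes(example).items()):
    print(code, FAM_SEX_LABELS.get(code, "non-standard"), n)
```

To run this on a real set, pass an open file handle instead of the list, for example `count_sex_codes(open("snpnnn.fam"))`. A large share of 0 codes in a set where sex should be known is a hint that the coding deviates from the format and should be checked against another source.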

Status of MoBa Genetics 1.5 vs. 1.0

Version 1.0 of MoBa Genetics, also called the September 2019 interim version, contains around 100.000 samples from the Mother, Father and Child Cohort Study (MoBa). The total cohort consists of more than 250.000 samples/individuals (90.000 mothers, 70.000 fathers), and the aim of MoBa Genetics 1.5 is to publish all of these.

As of June 2021, all samples have been genotyped, but the merged QC is not done. The raw data is now available on TSD. What we plan to do is:

  • Develop a robust QC pipeline
  • Fix/mark pedigree errors due to misplaced samples or unexpected father
  • Run all the samples (including the ones part of MoBa Genetics 1.0) through the pipeline
  • Merge all the individual sets to a large/final 250.000+ set

Since this has proved more time-consuming than expected, the raw-data was published before the QC data and the merged data-set. A separate changelog will help you keep track of what happens here.

MoBa Genetics 1.5

Data is found on TSD, on project p229mobagenetics under the directory data/durable/MoBaGenetics-V1.5/snpArray. As of June 2021, all raw data gathered after version 1.0 was completed is present under 1.5. Gradually, 1.5 will contain all the datasets where we have good/complete raw data (idat-files and preferably bedsets). As of March 2022, sets snp001 to snp007 have been included (see individual datasets below).

When version 1.5 is ready, it will replace version 1.0.

Common files

Certain files are part of a common infrastructure. The subdirectory SnpCommon contains common files/directories.

Manifest files

Provided we have them, the directory SnpCommon/Manifest contains chip-manifest files suited for programs like Genome Studio. Those that have been returned to us are shared. More documentation is needed here and will be added; the chips used are already documented under the individual data-sets.

If you know where to find manifest-files we are lacking - please let us know! (As of June 2022 we are building up the directory, so it could take some time before it is complete).

The yet non-existing Merge data-set

As for Version 1.0, a merged data-set will be made available. The aim of this set is to make it easy to access the complete data about the MoBa cohort without having to worry about different chips and batch effects: these data have been genotyped over several years and with a multitude of chip-sets/batches.

We most probably will sort out duplicates, so if individuals have been genotyped in several sets, only one sample will end up in the merged set.

Datasets - Common for all datasets

Every data-set will be found on its own sub-directory under Datasets.

They are named snp001, snp002 etc., where early numbers often mean an early genotyped data-set. The data-sets will correspond to a common clustering; in rare cases batch effects have caused clustering to be slightly off. In such cases, the raw data will be split into different data-sets. Details on the individual sets are found under Individual datasets.

raw-data

These data are almost as delivered by the lab that processed the underlying biological material. We might have done some minor renaming in order to protect/standardize IDs.

If you want to do your own QC or verify what has been done, this is the place to look. The MoBa Genetics team will not be able to help you with this, but you can of course use the slack channel to discuss it with other members of the MoBa Genetics community.

In most cases you will find both idat-files and bedsets (suitable for plink).

idat files

Most users will not use these, but here is documentation for the specialists ...

The directory contains idat files (directory structure can vary) as well as a samplesheet, typically called sampleSheet_snpnnn.csv.

For each set, the "original number of samples" is supplied - due to participants withdrawing from MoBa, the number might be slightly lower.

Even though it is not yet generally implemented, our intention is that idat-files are organized in subdirectories, with one for each Sample Plate from the chip. So an idat-file 123456789_R01C01_Red.idat will be found in the sub-directory 123456789_R01C01.
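The naming rule in the example above (file 123456789_R01C01_Red.idat, sub-directory 123456789_R01C01) can be sketched as a small helper. This is only an illustration: the function name is made up, and it assumes idat names of the form `<Sentrix_ID>_<Sentrix_Position>_<channel>.idat` with the usual Illumina channels Red and Grn. Since the layout is not yet generally implemented, real sets may deviate.

```python
from pathlib import PurePosixPath


def idat_subdirectory(idat_filename):
    """Return '<Sentrix_ID>_<Sentrix_Position>' for an idat file name."""
    name = PurePosixPath(idat_filename).name
    if not name.endswith(".idat"):
        raise ValueError(f"not an idat file: {idat_filename}")
    parts = name[: -len(".idat")].split("_")
    # Expect exactly: Sentrix ID, Sentrix position, colour channel.
    if len(parts) != 3 or parts[2] not in ("Red", "Grn"):
        raise ValueError(f"unexpected idat name: {idat_filename}")
    sentrix_id, sentrix_position, _channel = parts
    return f"{sentrix_id}_{sentrix_position}"


print(idat_subdirectory("123456789_R01C01_Red.idat"))  # 123456789_R01C01
```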

We are aiming to have sample-files that contain at least the following columns: Sample_Name, Sentrix_ID, Sentrix_Position, Sample_Well, Sample_Plate. In rare cases the well/plate information might have been lost.

Sample_Name will be named independently of whatever project scanned the samples, typically as they are named in the bedsets and idat-files - using the combination of Sentrix_ID and Sentrix_Position.
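A sample sheet can be checked against these conventions with a few lines of Python. The sketch below makes two assumptions that are not stated on this page: that the column names are on the first line of the CSV (no extra header section above them), and that Sentrix_ID and Sentrix_Position are joined with an underscore, as in the idat file names. The function name and the example rows are hypothetical.

```python
import csv
import io

REQUIRED_COLUMNS = [
    "Sample_Name", "Sentrix_ID", "Sentrix_Position", "Sample_Well", "Sample_Plate",
]


def check_sample_sheet(handle):
    """Return (missing columns, Sample_Names that differ from ID_Position)."""
    reader = csv.DictReader(handle)
    columns = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in columns]
    deviating = []
    # The name check needs these three columns; skip it if any is absent.
    if all(c in columns for c in REQUIRED_COLUMNS[:3]):
        for row in reader:
            expected = f"{row['Sentrix_ID']}_{row['Sentrix_Position']}"
            if row["Sample_Name"] != expected:
                deviating.append(row["Sample_Name"])
    return missing, deviating


# Made-up sheet lacking Sample_Plate and containing one deviating name.
sheet = io.StringIO(
    "Sample_Name,Sentrix_ID,Sentrix_Position,Sample_Well\n"
    "123456789_R01C01,123456789,R01C01,A01\n"
    "odd_name,123456789,R02C01,A02\n"
)
missing, deviating = check_sample_sheet(sheet)
print(missing)    # ['Sample_Plate']
print(deviating)  # ['odd_name']
```

Because Sample_Name is only "typically" built this way, a deviating name is something to look into rather than necessarily an error.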

We also try to include optional but relevant fields such as sample plate/well. In addition, extra information the scanning lab gave us will usually be left untouched. We cannot, however, give you any support on these fields. If we get information about these, we will include it below, under Individual datasets.

If we have them, cluster files (often .egt) will be placed on the idat directory as well. Provided we have them, Manifest files are found on the SnpCommon/Manifest directory. (They are common for the chip and hence not placed under the dataset)

Genome Studio version 2 is picky on the headers, and we will try to provide examples of the needed header. If it exists, it will be called gsHeader.csv. If not, it is trivial to create. If you use Genome Studio, you will need to create a new SampleSheet where the existing header is replaced by gsHeader.csv.

bedset files

Bedsets are suitable for analysis with programs like plink.

The bedsets will normally be called snpnnn.bed/bim/fam/log. The logfile will contain traces of what the files were called at creation time.

You will always find .bed, .bim and .fam files. In addition a .log file might be present, showing when the set was generated.
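A minimal sketch for checking that a bedset is complete before handing it to plink is shown below. The function name is made up, and the demonstration runs in a temporary directory with empty placeholder files so that it can be tried anywhere; on TSD you would point it at the actual bedset directory of a set.

```python
import tempfile
from pathlib import Path

# The three files that together make up a plink bedset.
REQUIRED_EXTENSIONS = (".bed", ".bim", ".fam")


def missing_bedset_files(directory, set_name):
    """Return the required bedset extensions that are absent for set_name."""
    directory = Path(directory)
    return [
        ext for ext in REQUIRED_EXTENSIONS
        if not (directory / f"{set_name}{ext}").is_file()
    ]


# Demonstration with empty placeholder files (no .bim is created).
with tempfile.TemporaryDirectory() as tmp:
    for ext in (".bed", ".fam", ".log"):
        (Path(tmp) / f"snp001{ext}").touch()
    demo_result = missing_bedset_files(tmp, "snp001")

print(demo_result)  # ['.bim']
```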

Raw data bedsets use the sentrix ID as both sample and family ID. In rare cases, there can be several bedsets for a set, with non-overlapping samples (and that is when the standard name is not used).
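Whether a raw bedset follows this ID convention can be checked by comparing the first two fields of each .fam line (family ID and within-family ID in the plink .fam format). The sketch below is illustrative only; the function name and the example lines are hypothetical.

```python
def samples_with_differing_ids(fam_lines):
    """Return (FID, IID) pairs from .fam lines where the two IDs differ."""
    differing = []
    for line in fam_lines:
        fields = line.split()
        if len(fields) >= 2 and fields[0] != fields[1]:
            differing.append((fields[0], fields[1]))
    return differing


# Made-up .fam lines: the second one breaks the convention.
example_fam = [
    "1001_R01C01 1001_R01C01 0 0 1 -9",
    "FAM7 1001_R02C01 0 0 2 -9",
]
print(samples_with_differing_ids(example_fam))  # [('FAM7', '1001_R02C01')]
```

An empty list means every sample uses the same value for both IDs, as described above.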

We have not tampered with the data - they are delivered as the analyzing lab produced them. The sole exception is changing the sample and family IDs so that they all use sentrix IDs. Additional fields, like sex and pedigree, will be as the lab/project returned them to the NIPH. (For data-sets that were earlier part of MoBaGenetics 1.0 this might not be completely true.)

Individual datasets

The individual sets might together contain more samples than the merged set (see above) due to duplicates.

snp001

  • NIPH reference of the project: PDB315
  • NIPH Biobankretrieval id 581, 582, 583, 584, 585, 586, 588, 589, 590, 591, 592
  • Project lead: Pål R. Njølstad
  • Processed by lab: NTNU Genomics Core Facility, Trondheim, Norway
  • Chip used: Illumina HumanCoreExome12v1.1
  • Date scanned: 2014?
  • Original number of samples: 18972

This set is also part of MoBaGenetics 1.0, and was there part of a set previously called Harvest. snp002 (different batch) and snp003 (different chip) are also from Harvest. We here represent them as three sets since these data need to be QC'ed individually due to chip/batch effects.

Historically, a batch effect in the HumanCoreExome12v1.1 analyzed samples in Harvest caused two subsets. One was called good (snp001) and one called bad (snp002). Neither was good or bad - they were just different.

The triads in the project were selected randomly, however samples matching any of the following criteria were excluded:

  1. Stillborn
  2. Deceased
  3. Twins
  4. Non-existing Medical Birth Registry data
  5. Missing anthropometric measurements at birth in Medical Birth Registry
  6. Pregnancies where the mother did not answer the first questionnaire (Q1) (as a proxy for higher fallout rate)
  7. Missing parental DNA samples.

For more info on original financing, see projects that have contributed to MoBa Genetics

The directory structure might be changed later (see Important warning)

snp002

  • NIPH reference of the project: PDB315
  • NIPH Biobankretrieval id 586, 588, 589, 590, 591
  • Project lead: Pål R. Njølstad
  • Processed by lab: NTNU Genomics Core Facility, Trondheim, Norway
  • Chip used: Illumina HumanCoreExome12v1.1
  • Date scanned: 2014?
  • Original number of samples: 1692

This set is also part of MoBaGenetics 1.0, and was there part of a set previously called Harvest. snp001 (different batch) and snp003 (different chip) are also from Harvest. We here represent them as three sets since these data need to be QC'ed individually due to chip/batch effects.

Historically, a batch effect in the HumanCoreExome12v1.1 analyzed samples in Harvest caused two subsets. One was called good (snp001) and one called bad (snp002). Neither was good or bad - they were just different.

For more info on selection criteria and original financing, see snp001 above.

The directory structure might be changed later (see Important warning)

snp003

  • NIPH reference of the project: PDB1438
  • NIPH Biobankretrieval id 630, 631
  • Project lead: Per Magnus
  • Processed by lab: NTNU Genomics Core Facility, Trondheim, Norway
  • Chip used: Illumina HumanCoreExome24v1.0
  • Date scanned: 2015?
  • Original number of samples: 12874

This set is also part of MoBaGenetics 1.0, and was there part of a set previously called Harvest. snp001 and snp002 (different chip) are also from Harvest. We here represent them as three sets since these data need to be QC'ed individually due to chip/batch effects.

For more info on selection criteria and original financing, see snp001 above.

The directory structure might be changed later (see Important warning)

snp004-006

These are placeholders for either a) empty sets or b) very old sets that are hard to figure out. They were not part of MoBaGenetics 1.0 - but might be added here later.

snp007

  • NIPH reference of the project: PDB1479
  • NIPH Biobankretrieval id 654
  • Project lead: Ole Andreassen at NORMENT (Norwegian Centre for Mental Disorders Research)
  • Processed by lab: DeCODE genetics, Iceland
  • Chip used: Illumina HumanOmniExpress-24v1.0.
  • Date scanned: 2014?
  • Original number of samples: 2983

The set was previously known as Norment Jan15. All individuals are unrelated parents used as controls for a separate project in January of 2015. The only exclusion criterion was samples from a plural birth pregnancy.

Nitty gritty details: Note that the number of idat-files available for the corresponding set under MoBaGenetics 1.0 (3414) does not match the samples in snp007 (2983). The snp007 idat-files match the corresponding bedset for snp007, which is the same as the bedset published for Norment Jan15.

idat-files are found in sub-directories of idats/ matching the sentrix_id.

The directory structure might be changed later (see Important warning)

snp008

  • NIPH reference of the project: PDB1479
  • NIPH Biobankretrieval id 654
  • Project lead: Ole Andreassen at NORMENT (Norwegian Centre for Mental Disorders Research)
  • Processed by lab: DeCODE genetics, Iceland
  • Chip used: Illumina HumanOmniExpress-24v1.0
  • Date scanned: Start of 2015?
  • Original number of samples: 2976

snp007 and snp008 are closely related, as they are both from NIPH Biobankretrieval 654.

The set was previously known as Norment Jun15. All individuals are unrelated parents used as controls for a separate project in January of 2015. The only exclusion criterion was samples from a plural birth pregnancy.

idat-files are found in sub-directories of idats/ matching the sentrix_id.

The directory structure might be changed later (see Important warning)

snp009

  • NIPH reference of the project: PDB1479
  • NIPH Biobankretrieval id 738/739
  • Project lead: Ole Andreassen at NORMENT (Norwegian Centre for Mental Disorders Research)
  • Processed by lab: DeCODE genetics, Iceland
  • Chip used: Illumina InfiniumOmniExpress-24v1.2
  • Date scanned: 2016 (?)
  • Original number of samples: 17608/18436 (bedset/idats - approximately 6000 triads)

The set was previously known as Norment May16.

The triads in the project were selected randomly, however samples matching any of the following criteria were excluded (same as snp010 and almost the same as snp001):

  1. Stillborn
  2. Deceased
  3. Twins
  4. Non-existing Medical Birth Registry data
  5. Missing anthropometric measurements at birth in Medical Birth Registry
  6. Missing parental DNA samples.

The bedset/plink samples mentioned (17608) are almost the same samples as were part of MoBaGenetics 1.0, short of 122 samples that should not have been part of 1.0.

For some reason yet unknown as of 25.5.22, there are more "idat" samples than "plink" samples. As we progress with the QC, this will hopefully be sorted out.

snp010

  • NIPH reference of the project: PDB1479
  • NIPH Biobankretrieval id 887
  • Project lead: Ole Andreassen at NORMENT (Norwegian Centre for Mental Disorders Research)
  • Processed by lab: DeCODE genetics, Iceland
  • Chip used: Illumina GSA (probably 1.0) with additional 50k custom SNPs added by Decode Genetics.
  • Date scanned: End of 2017 (?)
  • Original number of samples: 9632 (approximately 3000 triads)

The set was previously known as Norment_Feb18.

The triads in the project were selected randomly, however samples matching any of the following criteria were excluded (same as snp009 and almost the same as snp001):

  1. Stillborn
  2. Deceased
  3. Twins
  4. Non-existing Medical Birth Registry data
  5. Missing anthropometric measurements at birth in Medical Birth Registry
  6. Missing parental DNA samples.

Note that the exclusion criteria are almost the same as for snp001.

idat-files are found in sub-directories of idats/ matching the sentrix_id.

The directory structure might be changed later (see Important warning)

snp011

  • NIPH reference of the project: PDB1382
  • NIPH Biobankretrieval id 766, 767
  • Project lead: Ted Reichborn-Kjennerud
  • Processed by lab: DeCODE genetics, Iceland
  • Chip used: Illumina InfiniumOmniExpress-24v1-2
  • Date scanned: 2016/2017
  • Original number of samples: 5410

The set was previously known as TED.

These are ADHD case and control triads and duos. There are overlapping samples with other sets in MoBa Genetics.

Detailed overview of samples in this project: 5818 samples were selected for genotyping.

  • Case children: 1649
  • Control children: 1651
  • Mothers (only selected for cases): 1595
  • Fathers (only selected for cases): 923

(28 wells turned out empty, 5790 samples were sent for genotyping, 5410 samples successfully genotyped).
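The counts above can be checked with a few lines of arithmetic. The sketch below only restates the numbers given in this section; the variable names are made up, and the final figure (samples sent but not successfully genotyped) is derived here rather than quoted from the project.

```python
# Numbers as listed for snp011 above.
selected = {
    "case children": 1649,
    "control children": 1651,
    "mothers": 1595,
    "fathers": 923,
}
empty_wells = 28
genotyped = 5410

total_selected = sum(selected.values())  # samples selected for genotyping
sent = total_selected - empty_wells      # samples sent for genotyping
not_genotyped = sent - genotyped         # sent but not successfully genotyped

print(total_selected, sent, not_genotyped)  # 5818 5790 380
```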

The directory structure might be changed later (see Important warning)

snp012

  • NIPH reference of the project: PDB315
  • NIPH Biobankretrieval id 867, 868
  • Project lead: Pål R. Njølstad
  • Processed by lab: Erasmus MC, Rotterdam, Netherlands
  • Chip used: Illumina GSAMDv1.0 (aka. Global Screening Array MD v.1.0)
  • Genome Studio manifest-file GSA-24v1-0_C1
  • Date scanned: 16.3.2018
  • Original number of samples: 17949

The set was previously known as Rotterdam1.

For selection criteria, see snp001

snp013

  • NIPH reference of the project: PDB315
  • NIPH Biobankretrieval id 875, 876
  • Project lead: Ted Reichborn-Kjennerud
  • Processed by lab: DeCODE genetics, Iceland
  • Chip used: DeCodeGenetics V1_v2
  • Genome Studio manifest-file DeCodeGenetics_V1_20012591_A1.bpm
  • Date scanned: 18.01.2018 - 19.01.2018
  • Original number of samples: 2426

Despite the low snp-number, these data were not returned to NIPH at the time MoBaGenetics 1.0 was produced.

There were a lot of logistical problems while receiving these data, and certain valuable information, like the well positions of the idat-files, seems to be lost.

idat-files are found in sub-directories of idats/ matching the sentrix id. The cluster-file used by deCODE is available on DeCodeGenetics_V1_v2.egt.

Samplesheet_snp013.csv contains the fields Yield and ng/ul that we got from the lab (deCODE). We cannot support these fields as we don't understand them well enough, but should you have information on them we will share it here :-)

The bedset is found in the bedset/ directory.

snp014

  • NIPH reference of the project: PDB315
  • NIPH Biobankretrieval ids 926, 928, 930
  • Project lead: Pål R. Njølstad
  • Processed by lab: Erasmus MC, Rotterdam, Netherlands
  • Chip used: Illumina GSAMDv1.0 (aka. Global Screening Array MD v.1.0)
  • Genome Studio manifest-file GSA-24v1-0_C1
  • Date scanned: November 2018
  • Original number of samples: 9041

The set was previously known as Rotterdam2.

snp015

Because the set, all with data from Biobank Retrieval id 954, has been processed on two different chips, it has been split in two. There were multiple problems during data return to NIPH, and the precise scan dates might have been lost.

Common for both sets:

  • NIPH reference of the project: PDB1479
  • NIPH Biobankretrieval id 954
  • Project lead: Ole Andreassen at NORMENT (Norwegian Centre for Mental Disorders Research)
  • Processed by lab: DeCODE genetics, Iceland

idat-files are found in sub-directories idats/ corresponding to the sentrix-ids. Bedset files are found in the bedset/ directory.

snp015a

  • Chip used: DeCodeGenetics V1_v2
  • Original number of samples 13505
  • Date scanned: 26.08.2019 - 28.8.2019

snp015b

  • Chip used: DeCodeGenetics v3 1_v2
  • Original number of samples 4418
  • Date scanned: 28.8.2019

snp016

Common for subsets:

  • NIPH reference of the project: PDB1479
  • Project lead: Ole Andreassen at NORMENT (Norwegian Centre for Mental Disorders Research)
  • Processed by lab: DeCODE genetics, Iceland
  • Chip used: DeCodeGenetics v3 1_v2

idat-files are found in sub-directories of idats/ corresponding to the biobank-retrieval ids. Bedsets are found in the same bedset/ directory, with names that include the biobank-retrieval id.

snp016a

  • NIPH Biobankretrieval id 966
  • Date called: 11.11.2019 - 28.5.2020
  • Original number of samples 24999

snp016b

  • NIPH Biobankretrieval id 1029
  • Date called: 8.1.2020 - 28.5.2020
  • Original number of samples 24980

snp017

Common for all

  • NIPH reference of the project: PDB1479
  • Project lead: Ole Andreassen at NORMENT (Norwegian Centre for Mental Disorders Research)
  • Processed by lab: DeCODE genetics, Iceland
  • Chip used: DeCodeGenetics v3 1_v2

Since DeCODE delivered idat-files and bedsets in 6 directories corresponding to the biobank-retrievals, we split the set.

snp017a

  • NIPH Biobankretrieval id 1066
  • Date called: 12.5.2020 - 9.9.2020
  • Original number of samples 24995

snp017b

  • NIPH Biobankretrieval id 1077
  • Date called: 31.7.2020 - 11.9.2020
  • Original number of samples 4699

snp017c

  • NIPH Biobankretrieval id 1108
  • Date called: 7.8.2020 - 11.9.2020
  • Original number of samples 4792

snp017d

  • NIPH Biobankretrieval id 1109
  • Date called: 11.8.2020 - 15.9.2020
  • Original number of samples 5625

snp017e

  • NIPH Biobankretrieval id 1135
  • Date called: 12.8.2020 - 15.9.2020
  • Original number of samples 4605

snp017f

  • NIPH Biobankretrieval id 1146
  • Date called: 14.5.2020 - 15.9.2020
  • Original number of samples 5256

snp018

A set of multiple rather small retrievals from the NIPH biobank were extracted to 'complete' the genotyping of MoBa. Time will probably show that we still need to genotype some individuals due to various errors, but this is the explanation for this odd set.

Since deCODE made different bedsets/plink-files, we decided to respect these sets. They have been named snp018a-e

Common for all sets:

  • NIPH reference of the project: PDB1479
  • Project lead: Ole Andreassen at NORMENT (Norwegian Centre for Mental Disorders Research)
  • Processed by lab: DeCODE genetics, Iceland
  • Chip used: DeCodeGenetics v3 1_v2

snp018a

  • NIPH Biobankretrieval id 1273
  • Date called: 19.11.2020 - 8.2.2020
  • Original number of samples 5446

snp018b

  • NIPH Biobankretrieval id 1409
  • Date called: 1.12.2020 - 19.1.2021
  • Original number of samples 2702

snp018c

  • NIPH Biobankretrieval id 1413
  • Date called: 3.12.2020 - 9.2.2021
  • Original number of samples 5637

snp018d

  • NIPH Biobankretrieval id 1531
  • Date called: 16.12.2020 - 8.2.2021
  • Original number of samples 1971

snp018e

  • NIPH Biobankretrieval id 1532
  • Date called: 16.12.2020 - 19.1.2021
  • Original number of samples 219

snp019

  • NIPH reference of the project: PDB2217
  • NIPH Biobankretrieval ids 913, 914, 915
  • Project lead: Morten Vatn
  • Processed by lab: Laboratorium Universitätsklinik Schleswig-Holstein, Kiel, Germany
  • Chip used: GSAMD-24v2-0_20024620_B1
  • Date scanned: 26.7.2019 - 5.2.2020
  • Original number of samples 1164

This is a smaller set that has almost certainly been re-genotyped in one of the sets genotyped by the Norment project (search above). It may very well be that, for general convenience, the set will never be part of the Merged set.

(The snp-number is slightly confusing, as the set was genotyped earlier than the number suggests.)

No bedset/plinkfiles were returned by the lab.

The chip used to produce the idat-files has been customized (by LifeBrain?). Two cluster (.egt) files are supplied, one standard from Illumina, and one with cluster definition added for the self-defined content of the array. See README_GSA-ClusterFile_definition.txt for a minimum of information.