- Elbay Aliyev - Lead
- Muhammad Sohail Raza - Writer
- Shangzhe Zhang, Anastasia Illarionova, Prashant Ranjan, Tiancheng Xu, Aditi Sammi - Tech support
- Bryce Kille, ChunHsuan LO - SysAdmin
After two decades of refinements, the human reference genome (GRCh38) has become more accurate and mostly complete. However, there are still hundreds of unresolved gaps persist, and no single chromosome has been finished from end to end because of the existence of highly repeated sequences (called transposable elements). Foutunatedly, by using the high-coverage & ultra-long-read technologies, several scientific groups have presented a new human genome assembly that surpasses the continuity of GRCh38, along with a gapless, telomere-to-telomere assembly of a human chromosome, which is called CHM13 reference genome.
Designing a portable pippeline which performs the following steps (the tool can be applied to any genomes in fasta format and any VCF files)
- To introduce common variants into CHM13 reference genome.
- To propose a more representative reference genome by 1000 Genomes Project.
- To screen out in-frame stop codons sites that disagree with Ribo-seq analysis for validting the annotation.
Credits: Anastasia Illarionova
- Positions of in-frame stop codons in CHRM13 reference sequence
- New reference Fasta file (for all samples in 1GP)
- (Optional) New reference Fasta files for 5 subpopulations
- ClinVar annotation of common alleles
- Install snakemake
conda install -n base -c conda-forge mamba
mamba create -c conda-forge -c bioconda -n snakemake snakemake
- Clone the Git repo
git clone https://github.com/collaborativebioinformatics/RPG_Pikachu.git
- Go inside the RPG_Pikachu
cd RPG_Pikachu
- Dry run (See what will happen)
snakemake -np
- Use snakemake to run the workfolw
snakemake --cores 10 --use-conda
-
Download CHM13 v1.0 draft fasta file.
-
Download population based VCF (CHM13 v1.0 draft). Run GATK FixVcfHeader before launching the pipeline!
-
Download gene annotation file (gff3, for CHM13 v1.0 draft).
-
Modify config.yaml file
ref_genome: data/hm13.draft_v1.0.fasta # Reference genome
gff: data/chm13.draft_v1.0.gene_annotation.v4.gff3 # Reference genome annotation
vcf: 1KG_CHM13.vcf # Variant call dataset with allele frequencies
vcf_clinvar: clinvar_liftover_tochm13_v0draft.vcf # Variant call dataset for additional annotation of filtered variants
af_flag: "AF" # Allele frequency flag in the INFO field to use for the variant filtering
1. Identification of common variants from VCF file
- Variant call filtering criteria (biallelic SNP, AAF > 5%, custom AF flag)
bcftools view --max-alleles 2 --exclude-types indels -i 'INFO/AF > 0.05' {input.vcf} > {input.vcf}_filtered.vcf
2. Insertion of common variants into the Reference genome
cat {input.reference} | vcf-consensus {input.vcf} > {input.gff}
3. Identifying stop codon sites in CHM13 fasta file (Shangzhe, Muhamad, Bryce, ChunHsuan)
- Extract the protein sequences from GFF annotation and FASTA file
gffread -g {input.reference} {input.gff} -y pep.fasta
- Custom script to extract the positions of in-frame stop codons
python3 determine.in-frame.stop-codon.py pep.fasta {input.gff} > {output_stop-codons.bed}
- Riboseq-validation (for the ORF & inframe-stop-codon sites where variants located)
(Pseudocodes)
To download paired RNASeq.fastq and RiboSeq.fastq
To quality control fastq files
(adapter trimming, remove low quality reads, etc.)
To align the reads to CHM13 refgenome
(get RNASeq.bam and RiboSeq.bam)
To peak calling for read coverage of aligned bam files
(This will reveal real ORF and stop-codon sites by comparing RNAseq & Riboseq at same time.)
To validate our targeted variant sites by the peak calling results,
and to clasify them into true ones and false ones.
Rscript ./RiboSeq_pipeline.R
(Details were described in https://github.com/collaborativebioinformatics/RPG_Pikachu/blob/main/others/RiboSeq_validation/RiboSeq_pipeline.R)
1. Statistical visualization
- 6,728,753 common alleles were identified, 17 of which are overlapping with in-frame stop codons.
- AF analysis on subpopluation
Fig.2 - AF analysis on subpopluation
2. Postions of in-frame stop codons
There are 3,166 in-frame stop codons for 603 transcripts of 334 genes.
3. Biologically annotated variants (CHM13 based).
~0.7% of the final dataset of common variants were annotated with respective hits in ClinVar database (liftover hg38 -> CHM13 v1.0)
ClinVar annotation | Percentage of common variants |
---|---|
Benign | 0.6 |
Likely benign | 8e-3 |
Likely pathogenic | 8e-5 |
Pathogenic | 3e-4 |
- Complete high-quality reference genome is crucial for improved read mapping and variant calling in resequencing data analysis. Reference panel generator (RPG) pipeline incorporate common allele information at the rare sites of high-quality genome sequences and fixing rare in-frame stop-codons. Such approach would facilitate sequencing analysis of diverse set of populations.
- However, incorporating common alleles might leads to unknown genomic changes, such as introduction of new start codons or unknown impacts on gene structure. Therefore, revising gene annotation files are also anticipated.
- Extensive benchmarking will be required in order to validate the utility of the newly generated reference panels.
- Sergey Aganezov et.al, A complete reference genome improves analysis of human genetic variation. bioRxiv 2021.07.12.452063; doi: https://doi.org/10.1101/2021.07.12.452063
- T2T Consortium: https://sites.google.com/ucsc.edu/t2tworkinggroup/technology?authuser=0
- Lifting over variants from GRCh38 to T2T-CHM13: https://github.com/mccoy-lab/t2t-variants/tree/main/liftover_vcfs
- Ribosome-Profiling: https://github.com/FDA/Ribosome-Profiling
- Fritz J. Sedlazeck
- Ben Busby