Project 1: Building a transcriptomic map of small intestine neuroendocrine tumors

Background

The panNENomics project aims to unveil the molecular pathways underlying the development of the understudied neuroendocrine neoplasms (NENs) from all body sites. Although we have recently integrated all transcriptomic studies of lung NENs (Gabriel, Mathian et al. GigaScience, 2020), a comprehensive molecular map spanning NENs from all body sites, including gastrointestinal NENs (Alvarez et al. Nat Genet 2018), has yet to be generated.

Data

RNA-seq for 81 small intestine neuroendocrine tumors (GEO identifier GSE98894, available at https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP107025&o=acc_s%3Aa)
RNA-seq for 28 normal enteroendocrine cells (GEO identifier GSE146799, available at https://www.ncbi.nlm.nih.gov/sra?term=SRP252411)

Requirements

Basic understanding of the two practicals:

Q1 of practical 1 (launching a nextflow pipeline)
analysing data with PCA and UMAP

Steps

download the data and convert them to fastq (see fastq-dump or fasterq-dump from the SRA Toolkit; https://github.com/ncbi/sra-tools/wiki/HowTo:-fasterq-dump)
process the data with pipelines from the IARCbioinfo GitHub platform to ensure smooth integration with the other data, using parameter files (see the nextflow option -params-file):

download the genome reference GRCh38_gencode_v33_CTAT_lib_Apr062020.plug-n-play (https://data.broadinstitute.org/Trinity/CTAT_RESOURCE_LIB/)
run IARCbioinfo/RNAseq-nf, adapting the parameters in file Documents_Project1/params-1-RNAseq-nf.yml and if needed the input file Documents_Project1/input_RNAseq-nf-batch1.tsv, and using the bed file at Documents_Project1/hg38_Gencode_V33.bed; be careful that some files correspond to paired-end libraries and some to single-end libraries
run the local realignment pipeline IARCbioinfo/abra-nf on the resulting BAM files, adapting the parameters in file Documents_Project1/params-2-abra-nf.yml and using the bed file at Documents_Project1/hg38_Gencode_V33_merged.bed
run the base quality score recalibration pipeline IARCbioinfo/BQSR-nf on the BAM files from abra-nf to obtain the final alignments, adapting the parameters in file Documents_Project1/params-3-BQSR-nf.yml and using the VCFs with known SNPs and indels from the GATK bundle (Homo_sapiens_assembly38.dbsnp138.vcf and its idx index, and Mills_and_1000G_gold_standard.indels.hg38.vcf.gz and its tbi index)
run IARCbioinfo/RNAseq-transcript-nf to obtain gene and transcript quantifications, adapting the parameters file at Documents_Project1/params-4-RNAseq-transcript-nf.yml and input file Documents_Project1/input_RNAseq-transcript-nf.tsv
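
The four pipeline runs above can be sketched as a small Python helper that only builds the command lines in order, without executing anything. It assumes the parameter files named above have already been adapted to the cluster; the -resume flag is an optional addition not mentioned in the steps, and the helper name nextflow_commands is hypothetical.

```python
import shlex

# (pipeline, parameter file) pairs, in the order given in the steps above.
PIPELINE_STEPS = [
    ("IARCbioinfo/RNAseq-nf", "Documents_Project1/params-1-RNAseq-nf.yml"),
    ("IARCbioinfo/abra-nf", "Documents_Project1/params-2-abra-nf.yml"),
    ("IARCbioinfo/BQSR-nf", "Documents_Project1/params-3-BQSR-nf.yml"),
    ("IARCbioinfo/RNAseq-transcript-nf",
     "Documents_Project1/params-4-RNAseq-transcript-nf.yml"),
]


def nextflow_commands(steps=PIPELINE_STEPS, resume=True):
    """Return one argument list per pipeline; nothing is executed here."""
    commands = []
    for pipeline, params_file in steps:
        cmd = ["nextflow", "run", pipeline, "-params-file", params_file]
        if resume:
            # Assumption: reuse cached results when a run is restarted.
            cmd.append("-resume")
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    # Print the commands so they can be reviewed before being launched.
    for cmd in nextflow_commands():
        print(shlex.join(cmd))
```

Each step consumes the output of the previous one, so the commands are meant to be launched one after the other once the preceding run has finished.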

perform unsupervised analyses with R (dimensionality reduction with PCA and UMAP, clustering), and assess the distribution of specific neuroendocrine markers (NEUROD1, NEUROG3, CHGA, SYP, INSM1, HES6, DDC, UCHL1, NCAM1, CALCA, SSTR2).
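
As a rough illustration of the marker assessment, the sketch below normalizes a raw count table to log2(CPM + 1) and summarizes each neuroendocrine marker by sample group. The project itself asks for R; Python is used here only to show the computation. The normalization choice, the function names, the nested-dictionary table layout and the toy counts are all assumptions made for the example, not part of the project material.

```python
import math
from statistics import median

# Marker list taken from the step above.
MARKERS = ["NEUROD1", "NEUROG3", "CHGA", "SYP", "INSM1", "HES6",
           "DDC", "UCHL1", "NCAM1", "CALCA", "SSTR2"]


def log2_cpm(counts):
    """counts: {gene: {sample: raw count}} -> {gene: {sample: log2(CPM + 1)}}."""
    samples = sorted({s for per_sample in counts.values() for s in per_sample})
    library_size = {
        s: sum(per_sample.get(s, 0) for per_sample in counts.values())
        for s in samples
    }
    empty = [s for s in samples if library_size[s] == 0]
    if empty:
        raise ValueError(f"samples with zero total counts: {empty}")
    return {
        gene: {
            s: math.log2(per_sample.get(s, 0) / library_size[s] * 1e6 + 1)
            for s in samples
        }
        for gene, per_sample in counts.items()
    }


def marker_medians(expr, groups, markers=MARKERS):
    """Median expression of each marker per group; groups: {sample: label}."""
    summary = {}
    for gene in markers:
        if gene not in expr:
            continue  # marker absent from the table
        by_group = {}
        for sample, value in expr[gene].items():
            by_group.setdefault(groups[sample], []).append(value)
        summary[gene] = {g: median(v) for g, v in by_group.items()}
    return summary


if __name__ == "__main__":
    # Hypothetical toy counts: two tumors, one normal sample, two genes.
    toy = {
        "CHGA": {"T1": 500, "T2": 300, "N1": 10},
        "ACTB": {"T1": 999500, "T2": 999700, "N1": 999990},
    }
    groups = {"T1": "tumor", "T2": "tumor", "N1": "normal"}
    print(marker_medians(log2_cpm(toy), groups))
```

The same normalized table could then be passed to PCA, UMAP and clustering functions in R.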

Expected difficulties

Storage (100+ RNA-seq samples, ~1 TB) and computation (STAR requires 40-50 GB of RAM, so an HPC cluster is required; processing times are long, so parallelization is key).

Tips:

downloading and processing in small batches is necessary; prioritize primary tumors
compressing (gzip) the fastq files from the SRA is an option if they are kept for a long time; removing intermediate files can also save space
adapt memory and CPU specifications to the cluster (see params files; also see the -qs and -bg nextflow options)
it may be worth exploring the processed read counts provided as a supplement to the dataset while the pipelines are running (available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE98894)
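
The batching tip can be sketched as follows: samples are ordered with primary tumors first, split into small batches, and one fasterq-dump command is built per accession. This is a minimal sketch under stated assumptions: the accession strings are placeholders rather than real run identifiers, the primary/non-primary flag would have to come from the study metadata, and the function names, batch size and thread count are hypothetical. Nothing is downloaded by this code.

```python
def plan_batches(samples, batch_size):
    """samples: list of (accession, is_primary) tuples.

    Returns a list of batches with primary tumors placed first.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # sorted() is stable, so the original order is kept within each group.
    ordered = sorted(samples, key=lambda sample: not sample[1])
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]


def fasterq_dump_command(accession, outdir, threads=4):
    """Build (but do not run) a fasterq-dump call for one accession.

    --split-3 writes two files for paired-end runs and one for single-end
    runs, which matters here because the dataset mixes both layouts.
    """
    return ["fasterq-dump", accession, "--split-3",
            "--outdir", outdir, "--threads", str(threads)]


if __name__ == "__main__":
    # Placeholder accessions, not real SRA runs.
    samples = [("SRR_A", False), ("SRR_B", True), ("SRR_C", True)]
    for number, batch in enumerate(plan_batches(samples, 2), start=1):
        for accession, _ in batch:
            print(number, " ".join(fasterq_dump_command(accession, "fastq")))
```

Processing one batch end to end, then compressing or deleting its fastq and intermediate files before starting the next, keeps peak storage well below the full dataset size.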

Resources

[email protected] (Nicolas Alcala)
