GTF (and other) files generated from the CLS3 data:
The master table GTF for human and mouse can be downloaded using the following links:
This is the main GTF file generated from the CLS3 dataset after merging the transcripts obtained across all the tissues/samples.
The CLS3 long-read data were processed through LyRic to obtain high-confidence transcript models per sample for both the pacBioSII and ONT technologies. To obtain a "master" annotation covering all tissues and technologies, the transcripts across samples were then merged with "anchoring": transcript ends with orthogonal data support were preserved so that they would not be merged into other transcripts. This reduced the redundancy in the dataset while preserving the supported transcript ends. The supported transcript ends were anchored using the script anchorTranscriptsEnds.pl, followed by transcript merging using tmerge.
Following is a snapshot from the master table GTF:
The attribute tags description can be found here on the main gencode-cls-master-table github page. gene_id and transcript_id both specify the anchored merged transcript identifier in the format anchTMxxxxxxxxxxxx.
The final refined master table has the following artifact models included, in addition to the genuine models: polyASJdisag, recountSlt50, spliceSiteMisalign, tRepeatOverlap.
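Because the artifact models are kept in the file, many analyses will want to filter them out first. The following is a minimal sketch of such a filter, not part of the CLS3 tooling. It assumes the standard GTF `key "value";` attribute syntax, an attribute named `artifact`, and the value `no` for genuine models; please check these against the attribute description before relying on it.

```python
import re

# Assumed GTF attribute syntax in column 9: key "value"; pairs.
_ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field):
    """Return a dict of attribute name -> value from a GTF attribute column."""
    return dict(_ATTR_RE.findall(field))


def keep_genuine(gtf_lines, artifact_key="artifact", genuine_value="no"):
    """Yield only GTF records whose artifact attribute marks a genuine model.

    The key name and the "no" value are assumptions about the master table.
    """
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            continue  # not a complete GTF record
        if parse_attributes(columns[8]).get(artifact_key) == genuine_value:
            yield line
```

Passing an open file handle to `keep_genuine` streams the file line by line, so the whole GTF does not need to fit in memory.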
For additional end support, we used proCapNet predictions to support the human CLS3 TSSs. A proCapNet score of at least 5 (moreThanEqualTo5, MTE5) within a 100 bp window of a TSS in any of the proCap datasets is taken as positive support. The anchTMs with proCapNet-supported TSSs can be found here:
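The support rule described above can be sketched as follows. This is an illustration only, not the script used for CLS3. It assumes that "within 100bp" means up to 100 bp on either side of the TSS, and that predictions are already restricted to one chromosome and strand; the function and parameter names are hypothetical.

```python
from bisect import bisect_left


def supported_tss(tss_positions, predictions, min_score=5.0, window=100):
    """Return the TSS positions that have a qualifying prediction nearby.

    tss_positions: iterable of int coordinates on one chromosome and strand.
    predictions: iterable of (position, score) pairs on the same chromosome/strand.
    window: maximum distance in bp on either side of the TSS (an assumption).
    """
    # Keep only predictions that reach the score threshold, sorted by position.
    hits = sorted(pos for pos, score in predictions if score >= min_score)
    supported = []
    for tss in tss_positions:
        # First qualifying prediction at or after the left edge of the window.
        i = bisect_left(hits, tss - window)
        if i < len(hits) and hits[i] <= tss + window:
            supported.append(tss)
    return supported
```

To combine several proCap datasets, the predictions from all of them can be concatenated before the call, since support in any one dataset is sufficient.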
The chain GTF for human and mouse can be downloaded using the following links:
- Human chain GTF - spliced
- Human chain GTF - monoexonic
- Mouse chain GTF - spliced
- Mouse chain GTF - monoexonic
This GTF was created from the main master table to further reduce the transcript redundancy, while preserving the master table transcript identifiers (anchTMxxxxxxxxxxxx) in the attributes.
Different strategies were followed for the spliced and monoexonic transcripts.
- For the spliced transcripts, those with identical intron chains were merged into a single intron chain (anchICxxxxxxxxxxxx), irrespective of the end support.
- For the monoexonic transcripts, those with 50% overlap were merged into a single chain (anchUCxxxxxxxxxxxx), again irrespective of the end support.
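The spliced strategy can be illustrated with a short sketch that groups transcripts by their intron chain. This is not the pipeline code; the input layout (a dict of transcript ID to chromosome, strand, and exon list) and the function names are assumptions made for the example. The monoexonic 50% overlap rule is not covered here.

```python
from collections import defaultdict


def intron_chain(exons):
    """Return the introns implied by a transcript's exons as (start, end) pairs.

    exons: list of (start, end) tuples, 1-based inclusive as in GTF.
    """
    ordered = sorted(exons)
    return tuple(
        (left[1] + 1, right[0] - 1) for left, right in zip(ordered, ordered[1:])
    )


def group_by_intron_chain(transcripts):
    """Group spliced transcript IDs sharing chromosome, strand and intron chain.

    transcripts: dict of transcript ID -> (chrom, strand, exons).
    """
    groups = defaultdict(list)
    for tid, (chrom, strand, exons) in transcripts.items():
        chain = intron_chain(exons)
        if chain:  # monoexonic transcripts have no introns and are skipped here
            groups[(chrom, strand, chain)].append(tid)
    return dict(groups)
```

Transcripts that differ only in their 5' or 3' ends fall into the same group, which matches the "irrespective of the end support" behaviour described above.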
A snapshot from the "intron" chain master table GTF:
gene_id and transcript_id both specify the merged chain identifier, in the format anchICxxxxxxxxxxxx for the spliced transcripts and anchUCxxxxxxxxxxxx for the monoexonic transcripts.
In addition to the master table attributes (target, spliced, sampleN, samplesMetadata, expression, artifact), a new attribute, contained_anchTMs, specifies the anchTMs contained within each chain.
The definitions of the master table GTF attributes remain the same, but each attribute now reports the respective tags for all the anchTMs contained within the chain.
An exception is the endSupport attribute, for which the highest support available among the contained anchTMs is selected for the chain. For the refCompare and currentCompare attributes, the status of the merged chain against the specific Gencode reference annotations is recalculated (using GffCompare).
The final refined "chain" master tables have the following artifact types included, in addition to the genuine models: polyASJdisag, recountSlt50, spliceSiteMisalign, tRepeatOverlap.
Please use only after completely understanding the generation process. This is an annotation with all the refined CLS loci and might harbour read-throughs. It is not the final gencode annotation, but it is useful for analyses pertaining to all the obtained CLS3 loci.
The loci GTFs can be downloaded using the following links.
- Human loci GTF - gencode v27 tagged
- Human loci GTF - gencode v43 tagged
- Mouse loci GTF - gencode vM16 tagged
The loci information remains the same across the different gencode reference versions; the only difference is in the gencode overlap attributes. For this loci-level GTF, the transcripts are clustered at the locus level. This is done by first reducing the redundancy by merging the anchTMs using tmerge. The transcripts with any overlap on the same strand (bedtools intersect) are then clustered into a single locus.
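The clustering step can be pictured with the sketch below. It is a plain-Python illustration rather than the bedtools-based procedure that was actually run, and it assumes that overlap is evaluated on whole transcript spans; the tuple layout and function name are hypothetical.

```python
from collections import defaultdict


def cluster_loci(transcripts):
    """Cluster transcripts whose spans overlap on the same chromosome and strand.

    transcripts: iterable of (transcript_id, chrom, strand, start, end),
    with inclusive coordinates. Returns a list of loci (lists of transcript IDs).
    """
    by_strand = defaultdict(list)
    for tid, chrom, strand, start, end in transcripts:
        by_strand[(chrom, strand)].append((start, end, tid))

    loci = []
    for key in sorted(by_strand):
        current, current_end = [], None
        # Sweep left to right, extending the open locus while spans overlap.
        for start, end, tid in sorted(by_strand[key]):
            if current and start <= current_end:
                current.append(tid)
                current_end = max(current_end, end)
            else:
                if current:
                    loci.append(current)
                current, current_end = [tid], end
        if current:
            loci.append(current)
    return loci
```

Because any overlap is enough to join a locus, two transcripts that do not overlap each other can still end up in the same locus through a third transcript that bridges them. This is one way read-throughs can appear in the loci GTF.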
A snapshot from the loci master table GTF:
The loci GTF has gene, transcript and exon features. gene_id specifies the CLS3 locus ID (CLS3:LOC_xxxxxxxxxxxx) while transcript_id specifies the merged transcript ID (TM_xxxxxxxxxxxx).
Each exon has the transcript level attributes obtained through tmerge, description here.
The additional attributes overlapping_gencode_gene and overlapping_gencode_transcript specify the transcript overlap with gencode genes and transcripts at the exonic level, while gLlevel_gencodeGeneOverlaps specifies the locus overlap with gencode genes at the exonic level.
Each locus can be linked to the underlying anchTMs using the contains attribute tag, accessed through the exon records for each transcript (TM_xxxxxxxxxxxx), thus linking the main master table GTF to this loci GTF.
Further, we have created a mapping file for the master table GTF (anchTMs) and the loci GTF. For each locus, there exists the following mapping information: "CLSgeneLoci - anchTMs - samples - gencode v27 exonic overlaps"
These can be found here:
- Human loci - anchTMs - gencode v27 mappings
- Human loci - anchTMs - gencode v43 mappings
- Mouse loci - anchTMs - gencode vM16 mappings
The final loci master tables have the following artifact types included, in addition to the genuine models: polyASJdisag, recountSlt50, spliceSiteMisalign, tRepeatOverlap.
Please use only after completely understanding the generation process. This is the gencode "reference" annotation enhanced with the refined novel "intergenic" CLS loci, built from ONLY the spliced, refined (additional removal of the polyASJdisag artifacts) anchTMs. It might harbour read-throughs and is not the final gencode annotation.
The aim of generating this annotation was to enhance the existing reference gencode annotations with only the reliable intergenic CLS3 transcripts/loci (i.e., spliced transcripts, tagged as "no", "recountSlt50", "spliceSiteMisalign", "tRepeatOverlap" for the attributes tag).
The enhanced annotation GTFs can be downloaded using the following links.
- Human gencode v27 enhanced annotation
- Human gencode v43 enhanced annotation
- Mouse gencode vM16 enhanced annotation
The CLS loci within these files are different from the "loci master table GTF" loci, and have been named as CLS3i:LOC_xxxxxxxxxxxx.
This is the latest, yet to be released gencode annotation, improved w.r.t. the v46 through the addition of gencode havana team approved CLS3 lncRNAs. Since these files will be public only with the next release of gencode, for now the access to these files is limited. Please reach out in case you require access.
Links for downloading the gencode annotations:
The mapping between v47 ENSTs and the CLS3 anchICs they were extended/created from. These are mostly lncRNAs, plus ~300 protein-coding transcripts.
The file has additional tags, "CLS3_anchIC_gffComparev27" for the status of the respective anchIC(s) w.r.t. gencode v27 annotation. "v47-CLS3_mappingTag" states the mapping strategy used, internal details as follows:
- direct_anchICUC_mapping (147774 ENSTs): anchICs or anchUCs were used for creating/extending the v47 ENSTs -> direct mapping to the current ICtable using the anchIC/anchUC.
- oldmTanchTM_mapping (174 ENSTs): an older version of the anchTMs was used for creating/extending the v47 ENSTs -> mapping to the older version of the anchTMs.
- readID(old)-anchIC(new)_mapping (638 ENSTs): older unsplit reads were used for creating/extending the v47 ENSTs -> mapped to the corresponding new split readID-lid-anchTM-anchIC, as the old and new readIDs refer to the same read. The old reads are therefore mapped to the current ICtable. Some (488 ENSTs) were still not found and are tagged "UNMAPPED" in the "CLS3_anchIC" and "CLS3_anchIC_gffComparev27" columns. Most probably these reads were not used by LyRic to build TMs.
- LID-anchIC_mapping (3032 ENSTs): LIDs (current version) were used for creating/extending the v47 ENSTs -> LIDs mapped to anchICs. Some may not be mapped (compmerge, alignID, etc.) because they correspond to old CLS or other datasets. In such cases, the "CLS3_anchIC" and "CLS3_anchIC_gffComparev27" columns have the value UNMAPPED. The LIDs used to create/extend an ENST could be UNMAPPED (1577), or some could be mapped as well.
lid: LyRic TM ID
rid: read ID
The mapping across v47 ENSTs and the CLS3 anchICs they were extended/created from, with added details like novelty at the transcript as well as gene level.
For each transcript (ENST) created/extended in v47 due to CLS3 (anchICs), the file lists:
| Column | Description |
| --- | --- |
| geneID_v47 | v47 gene (ENSG) that the transcript belongs to |
| transcriptID_v47 | v47 transcript ID (ENST) |
| created/extended | tag specifying whether the transcript was created or extended using TAGENE/manually |
| CLS3_anchIC | CLS3 anchIC(s) that led to the addition of the transcript to v47 |
| CLS3_anchIC_gffComparev27 | gffcompare classification for the anchIC(s) w.r.t. v27 (reference annotation) |
| v47-CLS3_mappingTag | states the mapping strategy used; details in the above section |
| v47_biotype | v47 biotype |
| transcriptClassification | transcript (ENST) novelty status taking into account the different gffcompare classifications from all the underlying anchICs |
| geneClassification | gene (ENSG) novelty status taking into account the different gffcompare classifications from all the underlying transcripts |
| CLS3_anchTM | CLS3 anchTM(s) that led to the addition of the transcript to v47. Mapped through the anchICs. |
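As an illustration of how the columns listed above might be used, the sketch below tallies ENSTs per mapping strategy and collects the anchICs behind each v47 gene. It assumes the file is tab-separated with a header row carrying exactly these column names, and that multiple anchICs in one cell are comma-separated; neither is confirmed here, so adjust to the actual file.

```python
import csv
from collections import Counter, defaultdict


def summarise_mapping(handle):
    """Count ENSTs per mapping strategy and collect anchICs per v47 gene.

    handle: an open text file, assumed tab-separated with a header row whose
    column names match the column list above.
    """
    per_tag = Counter()
    anchics_per_gene = defaultdict(set)
    for row in csv.DictReader(handle, delimiter="\t"):
        per_tag[row["v47-CLS3_mappingTag"]] += 1
        # The separator between multiple anchICs is an assumption (comma).
        for anchic in row["CLS3_anchIC"].split(","):
            if anchic and anchic != "UNMAPPED":
                anchics_per_gene[row["geneID_v47"]].add(anchic)
    return per_tag, dict(anchics_per_gene)
```

Genes whose transcripts are all UNMAPPED do not appear in the second result, which makes them easy to spot by comparing against the full list of gene IDs.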
The gffcompare novelty status definitions w.r.t. v27 for the anchICs, ENSTs and ENSGs:
The gencode v47 annotation was further enhanced by adding just the most reliable "intergenic" transcript loci (i.e., spliced transcripts, tagged as "no" or "recountSlt50" for the attributes tag). This was done specifically to support requantification in RNA-Seq analysis studies, by adding transcripts/loci that have not yet been added to the gencode annotation, either because of stringent filters or because the transcripts still need to be reviewed.
The latest gencode enhanced annotations can be downloaded here:
For any analysis related to the target catalog stats, we recommend using the merged version of the targets.
Merged targets (used for most of the analysis and for probe designing as well):
Exon-level unmerged targets:
The human-mouse mapping is provided for the targets that were originally lifted over from human to mouse for the target design, down to the exon level. Mappings for all the liftedOver targets can be found here:
This is a tab separated file, with the targetID (same for different exons of the same target element), targetID_human (with hg38 coordinates) and the corresponding targetID_mouse (with mm10 coordinates).
For the target design, 8 catalogs (CMfinderCRSs, GWAScatalog, UCE, VISTAenhancers, fantomCat, fantomEnhancers, bigTranscriptome, miTranscriptome) were liftedOver from human to mouse. When designing targets, features of <=200bp were extended by 100bp in both directions. The human and mouse IDs in the above file are the final ones (bed format coordinates included in the ID), as used in the masterTables and elsewhere.
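The extension rule for short features amounts to the small calculation sketched here. The function name and parameters are hypothetical, and the sketch assumes BED-style coordinates (0-based start, exclusive end), so the feature length is simply `end - start`. It clips at position 0 but does not know chromosome lengths, so it cannot clip at the chromosome end.

```python
def extend_short_feature(start, end, max_length=200, flank=100):
    """Extend a BED interval by `flank` bp on each side if it is short.

    start, end: BED-style coordinates (0-based start, end exclusive).
    Features longer than `max_length` are returned unchanged.
    """
    if end - start <= max_length:
        return max(0, start - flank), end + flank
    return start, end
```

For example, a 150bp feature at 1000-1150 would become 900-1250, while a 201bp feature would be left as it is.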
The file shared above has all the features that were liftedOver from human to mouse. To access only the ones selected for the final targetDesign in mouse, please use this file: