The GENCODE Primary transcript set

Long-read sequencing technologies such as those developed by Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) have driven a step-change in our ability to capture the transcriptome. As a reference gene annotation resource, GENCODE aims to capture this transcript diversity in human and mouse and to present it in an organised way that supports downstream analysis.

Historically, we have presented two categories of transcripts in GENCODE gene annotation: GENCODE Comprehensive, which captures all annotated transcripts, and GENCODE Basic, a smaller subset containing only full-length transcripts at protein-coding genes and enriched for alternative splicing at lncRNA genes. The length and depth of the reads generated in ONT and PacBio experiments will eventually overwhelm the GENCODE Basic category, for example by adding many tens or hundreds of full-length alternatively spliced transcripts at protein-coding genes. While this rich set of annotations will benefit many use cases, we recognise that a very large number of transcripts can hinder others, not least viewing GENCODE transcripts alongside other data in the Ensembl and UCSC genome browsers. We are therefore changing the way transcripts are represented by adding a new classification, GENCODE Primary: a small subset of GENCODE Comprehensive transcripts enriched for signals of functional potential.

The GENCODE Primary transcripts are a minimal set of transcripts at protein-coding genes that together capture protein-coding exons with evidence of evolutionary constraint and highly expressed alternative splicing events (including exon skips). The GENCODE Primary subset includes all MANE Select/Ensembl Canonical and MANE Plus Clinical transcripts by default.

For the human reference genome, in Ensembl 113/GENCODE 47, GENCODE Primary flags will be available on the Ensembl website and in release files, while the default annotation displayed in the Ensembl browser will be GENCODE Basic. In Ensembl 114/GENCODE 48 the default annotation displayed in the Ensembl browser will be GENCODE Primary. The annotation sets GENCODE Comprehensive and GENCODE Basic can be added to the browser.
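
As a rough illustration of how the flags in the release files might be used, the sketch below pulls the IDs of flagged transcripts out of GTF-formatted lines. It assumes that the flag appears as a `tag` attribute in the ninth GTF column and that its value is spelled `GENCODE_Primary`; both are assumptions to check against the release documentation for the file you download. The function names are hypothetical helpers, not part of any Ensembl or GENCODE tool.

```python
import re

# Assumed spelling of the flag in the GTF attribute column; verify it
# against the release notes of the file you are working with.
PRIMARY_TAG = "GENCODE_Primary"

_TAG_RE = re.compile(r'\btag "([^"]+)"')
_TRANSCRIPT_ID_RE = re.compile(r'\btranscript_id "([^"]+)"')


def transcript_tags(gtf_line):
    """Return (transcript_id, set_of_tags) for a GTF 'transcript' line.

    Comment lines, malformed lines and non-transcript features return None.
    """
    if gtf_line.startswith("#"):
        return None
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) != 9 or fields[2] != "transcript":
        return None
    match = _TRANSCRIPT_ID_RE.search(fields[8])
    if match is None:
        return None
    return match.group(1), set(_TAG_RE.findall(fields[8]))


def primary_transcript_ids(lines, tag=PRIMARY_TAG):
    """Collect IDs of transcripts carrying the given tag, in input order."""
    ids = []
    for line in lines:
        parsed = transcript_tags(line)
        if parsed is not None and tag in parsed[1]:
            ids.append(parsed[0])
    return ids
```

Passing an open file handle (for example from `gzip.open(path, "rt")`) as `lines` keeps memory use low, because the file is read one line at a time. The same function can select a different subset by passing another tag value.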

The GENCODE Primary set provides an opportunity to simplify variant interpretation by reducing the number of predicted molecular consequences (one per variant and transcript pair) that need review. On variant and transcript pages in the browser, these consequences are now displayed only for GENCODE Primary transcripts. Annotation of your own variants with the Ensembl Variant Effect Predictor (VEP) can also take advantage of this transcript set: annotation can be restricted to GENCODE Primary transcripts, restricted to GENCODE Basic transcripts, or run against all transcripts (the GENCODE Comprehensive set).

Mouse has no MANE Select transcripts annotated. However, we are developing a GENCODE Primary pipeline for mouse based on the Ensembl Canonical transcript set and we anticipate adding GENCODE Primary annotation in Ensembl 115/GENCODE M38.

Unlike MANE transcripts, which are very stable, the GENCODE Primary set is expected to change over the initial releases. This is due both to the addition of new transcript models based on long-read transcriptomic data and to active pipeline development to capture all relevant transcripts at protein-coding genes and lncRNAs.

GENCODE Primary transcripts at protein-coding genes are made with reference to the MANE Select transcripts [1]. MANE Plus Clinical transcripts are added to the GENCODE Primary set by default.

Starting with the GENCODE Comprehensive annotation:

All GENCODE Primary transcripts will be manually reviewed to validate their selection (this work is currently in progress).

For all biotypes other than protein-coding in human, the GENCODE Primary flag is added to the Ensembl Canonical transcript in Ensembl 114/GENCODE 48; for lncRNA genes, this will be the transcript with the longest genomic span. We are working on a dedicated lncRNA pipeline to enable the addition of relevant highly expressed transcripts to the set.
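
The longest-genomic-span rule for lncRNA genes can be illustrated with a minimal sketch. It is not the Ensembl Canonical pipeline itself: the `Transcript` record, the 1-based inclusive coordinates and the tie-breaking behaviour (the first transcript in input order wins) are all assumptions made for the example.

```python
from typing import Iterable, NamedTuple


class Transcript(NamedTuple):
    """Hypothetical minimal transcript record for this example."""
    transcript_id: str
    start: int  # 1-based, inclusive genomic start
    end: int    # 1-based, inclusive genomic end


def genomic_span(transcript: Transcript) -> int:
    """Genomic span in bases, counting introns as well as exons."""
    return transcript.end - transcript.start + 1


def longest_span_transcript(transcripts: Iterable[Transcript]) -> Transcript:
    """Return the transcript of one gene with the longest genomic span.

    Ties are resolved by input order (an assumption of this sketch):
    max() keeps the first of several equally long transcripts.
    """
    candidates = list(transcripts)
    if not candidates:
        raise ValueError("gene has no transcripts")
    return max(candidates, key=genomic_span)
```

Note that genomic span differs from spliced transcript length: a transcript with few short exons can still have the longest span if its first and last exons lie far apart.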

References

  1. Morales J, Pujar S, Loveland JE, et al. A joint NCBI and EMBL-EBI transcript set for clinical genomics and research. Nature. 2022 Apr;604(7905):310-315. DOI: 10.1038/s41586-022-04558-8. PMID: 35388217; PMCID: PMC9007741.
  2. Lin MF, Jungreis I, Kellis M. PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions. Bioinformatics (Oxford, England). 2011 Jul;27(13):i275-82. DOI: 10.1093/bioinformatics/btr209. PMID: 21685081; PMCID: PMC3117341.
  3. https://data.broadinstitute.org/compbio1/PhyloCSFtracks/trackHub/hub.DOC.html
  4. Wilks C, Zheng SC, Chen FY, et al. recount3: summaries and queries for large-scale RNA-seq expression and splicing. Genome Biology. 2021 Nov;22(1):323. DOI: 10.1186/s13059-021-02533-6. PMID: 34844637; PMCID: PMC8628444.
  5. Nellore A, Jaffe AE, Fortin JP, et al. Human splicing diversity and the extent of unannotated splice junctions across human RNA-seq samples on the Sequence Read Archive. Genome Biology. 2016 Dec;17(1):266. DOI: 10.1186/s13059-016-1118-6. PMID: 28038678; PMCID: PMC5203714.