Frequently asked questions

What is the difference between reference and non-reference releases?

In the release history pages, the "Reference release?" column highlights releases that were used as the reference annotation in GENCODE or ENCODE analysis publications. The latest release is also shown as a reference release by default.

What is the difference between GENCODE and Ensembl annotation?

The GENCODE annotation is made by merging the manual gene annotation produced by the Ensembl-Havana team and the Ensembl-genebuild automated gene annotation. The GENCODE annotation is the default gene annotation displayed in the Ensembl browser. The GENCODE releases coincide with the Ensembl releases, although we can skip an Ensembl release if there is no update to the annotation with respect to the previous release. In practical terms, the GENCODE annotation is essentially identical to the Ensembl annotation.

What is the difference between GENCODE GTF and Ensembl GTF?

The gene annotation is the same in both files. Until release 43 (Ensembl release 109), the only exception to this was that the GENCODE GTF included both copies of the genes that are common to the human chromosome X and Y pseudoautosomal regions (PAR), whereas the Ensembl file only contained the chromosome X PAR genes.

In addition, the GENCODE GTF contains a number of attributes not present in the Ensembl GTF, including annotation remarks, APPRIS tags and other tags highlighting transcripts experimentally validated by the GENCODE project or 3-way-consensus pseudogenes (predicted by Havana, Yale and UCSC). See our complete list of tags for more information.

Which are the reference chromosomes?

The reference chromosomes are those in the primary genome assemblies, i.e. chromosomes 1 to 22, X and Y in human; chromosomes 1 to 19, X and Y in mouse. The mitochondrial chromosome is also considered part of the reference chromosomes. Some GENCODE files contain annotation on reference chromosomes only, thus excluding other sequence regions such as unlocalized and unplaced scaffolds, assembly patches and alternate loci (haplotypes).
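As a rough illustration of the definition above, the following Python sketch tests whether a sequence name belongs to the reference chromosomes. It assumes "chr"-prefixed sequence names with "chrM" for the mitochondrial chromosome; the function names are hypothetical and the scaffold name in the usage note is only an example.

```python
# Assumption: sequence names carry a "chr" prefix and the mitochondrial
# chromosome is called "chrM". Adjust if your files use other names.
AUTOSOME_COUNT = {"human": 22, "mouse": 19}


def reference_chromosomes(species):
    """Return the set of reference chromosome names for 'human' or 'mouse'."""
    names = {f"chr{i}" for i in range(1, AUTOSOME_COUNT[species] + 1)}
    return names | {"chrX", "chrY", "chrM"}


def is_reference(seqname, species="human"):
    """True if seqname is a reference chromosome rather than a scaffold, patch or alternate locus."""
    return seqname in reference_chromosomes(species)
```

For example, is_reference("chr21") holds for human but not for mouse, and a scaffold-style name such as "GL000009.2" is rejected for both species.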

What is the "basic" annotation in the GTF/GFF3?

The transcripts tagged as "basic" form a subset of representative transcripts for each gene. This subset prioritises full-length protein coding transcripts over partial or non-protein coding transcripts within the same gene, and is intended to highlight those transcripts that will be useful to the majority of users.
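A minimal Python sketch of how the "basic" subset could be selected from a GTF file is shown here. It assumes the usual nine tab-separated GTF columns, with attributes written as key "value"; pairs and one tag attribute per tag. The helper names and the example line (including its identifiers) are made up for illustration.

```python
import re

# One `key "value";` or `key value;` pair from the ninth GTF column.
_ATTRIBUTE = re.compile(r'(\S+) (?:"([^"]*)"|([^\s;]+));')


def parse_attributes(column):
    """Map each attribute key to the list of its values (keys such as tag can repeat)."""
    attributes = {}
    for key, quoted, bare in _ATTRIBUTE.findall(column):
        attributes.setdefault(key, []).append(quoted or bare)
    return attributes


def is_basic_transcript(gtf_line):
    """True for a transcript line that carries the "basic" tag."""
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) != 9 or fields[2] != "transcript":
        return False
    return "basic" in parse_attributes(fields[8]).get("tag", [])


# Made-up line for illustration only; the identifiers are not real annotation.
example = "\t".join([
    "chr1", "HAVANA", "transcript", "100", "900", ".", "+", ".",
    'gene_id "ENSG00000000001.1"; transcript_id "ENST00000000001.1"; '
    'level 2; tag "basic"; tag "CCDS";',
])
```

Keeping every attribute as a list avoids losing tags when a transcript has several of them.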

What do "HAVANA" and "ENSEMBL" mean in the GTF/GFF3?

The second field in the GTF/GFF3 files shows the annotation source for each feature. The HAVANA team have relocated from the Wellcome Trust Sanger Institute to the EMBL-EBI to join the Ensembl team and as such the terms "HAVANA" and "ENSEMBL" are anachronistic in this context. However, in the files "HAVANA" indicates that the feature was manually annotated, although it may also be the product of the merge between Havana manual annotation and Ensembl-genebuild automated annotation. "ENSEMBL" refers exclusively to annotation provided by the automated Ensembl-genebuild pipeline.
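To see how the two sources are distributed in a file, the second column can simply be tallied. The Python sketch below assumes tab-separated lines and header lines starting with "#"; count_sources is a hypothetical helper name.

```python
from collections import Counter


def count_sources(lines):
    """Count features per annotation source (second tab-separated column)."""
    counts = Counter()
    for line in lines:
        # Skip header/comment lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        counts[line.split("\t")[1]] += 1
    return counts
```

The function accepts any iterable of lines, so an open file handle can be passed directly.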

What is the gene/transcript biotype in the GTF/GFF3?

The biotype is an indicator of biological significance of a gene or transcript. There is a large number of possible biotypes in our annotation files but these can be classified into four broad categories: protein-coding, long non-coding RNAs, pseudogenes and small RNAs. See our biotype definitions page for more information.

What is the gene/transcript status in the GTF/GFF3?

The gene_status and transcript_status fields were removed after releases 25 and M11 because they no longer served their original purpose. The KNOWN status indicated that the gene had cross references to curated cDNA and/or protein resources, so it could be used to distinguish well-supported annotation. However, the vast majority of GENCODE genes are now supported by RefSeq cDNAs or UniProt proteins: in releases 25 and M11, over 96% and 99% of genes, respectively, had the KNOWN status. Other fields in the GTF file, such as the transcript_support_level, can be used to find well-supported annotation at the transcript level.

Prior to release 25/M11 the status indicated the type of evidence supporting the annotation.

KNOWN:
Identical to known cDNAs or proteins from the same species and has an entry in species specific model databases: EntrezGene for human, MGI for mouse.
NOVEL:
Identical or homologous to cDNAs from the same species, or proteins from all species.
PUTATIVE:
Identical or homologous to spliced ESTs from the same species.
KNOWN_BY_PROJECTION:
Based on a known orthologue gene in another species.

Why do some gene and transcript ids start with ENSGR or ENSTR in the GTF/GFF3?

The Ensembl ids, by convention, are made of a species prefix ("ENS" for human and "ENSMUS" for mouse) followed by a feature type indicator ("G" for gene, "T" for transcript, "E" for exon, "P" for translation) and an 11-digit number.
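The convention can be sketched as a regular expression. The Python example below is an illustration rather than an official parser: it treats the ".N" version suffix seen in GENCODE files as optional, and split_ensembl_id is a hypothetical helper name.

```python
import re

# Species prefix, feature type letter, 11 digits and an optional ".version" suffix.
_ENSEMBL_ID = re.compile(r"^(ENS[A-Z]*?)([GTEP])(\d{11})(?:\.(\d+))?$")

FEATURE_TYPES = {"G": "gene", "T": "transcript", "E": "exon", "P": "translation"}


def split_ensembl_id(identifier):
    """Split an id into its parts, or return None if it does not follow the convention."""
    match = _ENSEMBL_ID.match(identifier)
    if match is None:
        return None
    prefix, type_code, number, version = match.groups()
    return {
        "species_prefix": prefix,
        "feature_type": FEATURE_TYPES[type_code],
        "number": number,
        "version": version,  # None when the id carries no version suffix
    }
```

Identifiers that break the convention, such as those starting with ENSGR or ENSTR, return None.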

Until release 43, the GENCODE GTF/GFF3 files made an exception to this rule in the case of the pseudoautosomal regions (PAR) of chromosome Y. The gene annotation in these regions is identical between chromosomes X and Y. Ensembl did not provide distinct feature ids for the two chromosomes until release 110 (equivalent to GENCODE release 44). Before that release, the Ensembl GTF file only included this annotation for chromosome X. However, we decided that the GENCODE GTF/GFF3 files would include the annotation in the PAR regions of both chromosomes.

Since the GTF convention dictates that feature ids have to be unique for different genome regions, we modified the Ensembl feature id by replacing the first zero with an "R". Thus, "ENSG00000182378.10" in chromosome X became "ENSGR0000182378.10" in chromosome Y. This modification was applied until release 24.

Between releases 25 and 43, the PAR genes and transcripts had the "_PAR_Y" suffix appended to their identifiers.

From release 44 onwards, the chromosome Y PAR annotation has its own identifiers.

This annotation is also labeled using the tag "PAR".
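When working with files from release 43 or earlier, it can be useful to map a chromosome Y PAR identifier back to its chromosome X counterpart. The Python sketch below covers the two older schemes described above (the "R" replacement and the "_PAR_Y" suffix). to_chrx_id is a hypothetical helper name, and the sketch handles gene and transcript ids only.

```python
import re

# Releases up to 24: the first zero of a gene or transcript id was replaced with "R".
_R_STYLE = re.compile(r"^(ENS[GT])R(\d{10})")
_PAR_SUFFIX = "_PAR_Y"


def to_chrx_id(feature_id):
    """Return the chromosome X form of a pre-release-44 chromosome Y PAR id.

    Ids that use neither scheme are returned unchanged.
    """
    # Releases 25 to 43: strip the "_PAR_Y" suffix.
    if feature_id.endswith(_PAR_SUFFIX):
        return feature_id[: -len(_PAR_SUFFIX)]
    # Releases up to 24: put the zero back in place of the "R".
    return _R_STYLE.sub(r"\g<1>0\2", feature_id)
```

From release 44 onwards the chromosome Y PAR features have separate identifiers, so no such string conversion applies to those files.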

What does level 1, 2 or 3 mean in the GTF/GFF3?

We supply genome-wide features on three different confidence levels:

Level 1 - validated
Pseudogene loci that were jointly predicted by the Yale Pseudopipe and UCSC Retrofinder pipelines as well as by Havana manual annotation; other transcripts that were verified experimentally by RT-PCR and sequencing through the GENCODE experimental pipeline.
Level 2 - manual annotation
Havana manual annotation (and Ensembl annotation where it is identical to Havana).
Level 3 - automated annotation
Ensembl loci where they are different from the Havana annotation or where no Havana annotation can be found.

Please note that not all transcripts have been tested by the GENCODE experimental pipeline and that level 2/3 transcripts may have been experimentally validated elsewhere.
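As an illustration, the level can be read from the attribute column of a GTF line with a small Python helper. This sketch assumes the attribute is written as level N; (optionally quoted) and is careful not to match transcript_support_level; level_of and LEVEL_LABELS are hypothetical names.

```python
import re

# "\b" does not match between "_" and "l", so transcript_support_level is skipped.
_LEVEL = re.compile(r'\blevel "?([123])"?;')

LEVEL_LABELS = {1: "validated", 2: "manual annotation", 3: "automated annotation"}


def level_of(attribute_column):
    """Return the confidence level (1, 2 or 3) from a GTF attribute column, or None if absent."""
    match = _LEVEL.search(attribute_column)
    return int(match.group(1)) if match else None
```

A line could then be kept only if, say, level_of(fields[8]) is 1 or 2, which excludes purely automated annotation.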

What does the transcript support level mean in the GTF/GFF3?

The transcript support level indicates how well supported a transcript model is, based on mRNA and EST alignments supplied by UCSC and Ensembl. See the Ensembl glossary for more information. Please note that this transcript support level classification is completely independent from the three-confidence-level classification described above.

What are the OTT gene/transcript ids in the GTF/GFF3?

The 'havana_gene' and 'havana_transcript' attributes indicate the internal gene and transcript stable ids used by Havana and are also the main identifiers in the (now archived) Vega genome browser. They start with 'OTTHUM' and 'OTTMUS' for human and mouse, respectively.

What is the gene name in the GTF/GFF3?

Gene names are usually HGNC or MGI-approved gene symbols mapped to the GENCODE genes by the Ensembl xref pipeline. Sometimes, when there is no official gene symbol, the Havana clone-based name is used.