Tag description in GENCODE

The following tags can be found in the GENCODE GTF/GFF3 files. Read more about the GTF file format.

3_nested_supported_extension: 3' end extended based on RNA-seq data.
3_standard_supported_extension: 3' end extended based on RNA-seq data.
454_RNA_Seq_supported: annotated based on RNA-seq data.
5_nested_supported_extension: 5' end extended based on RNA-seq data.
5_standard_supported_extension: 5' end extended based on RNA-seq data.
alternative_3_UTR: shares an identical CDS but has alternative 3' UTR with respect to a reference variant.
alternative_5_UTR: shares an identical CDS but has alternative 5' UTR with respect to a reference variant.
appris_principal_1: (This flag corresponds to the older flag "appris_principal") The transcript is expected to code for the main functional isoform based solely on the core modules in the APPRIS database. The APPRIS core modules map protein structural and functional information and cross-species conservation to the annotated variants.
appris_principal_2: (This flag corresponds to the older flag "appris_candidate_ccds") Where the APPRIS core modules are unable to choose a clear principal variant (approximately 25% of human protein coding genes), the database chooses two or more of the CDS variants as "candidates" to be the principal variant. If one (but no more than one) of these candidates has a distinct CCDS identifier it is selected as the principal variant for that gene. A CCDS identifier shows that there is consensus between RefSeq and GENCODE/Ensembl for that variant, guaranteeing that the variant has cDNA support.
appris_principal_3: Where the APPRIS core modules are unable to choose a clear principal variant and more than one of the variants has a distinct CCDS identifier, APPRIS selects the variant with the lowest CCDS identifier as the principal variant. The lower the CCDS identifier, the earlier it was annotated. Consensus CDS annotated earlier are likely to have more cDNA evidence. Consecutive CCDS identifiers are not included in this flag, since they will have been annotated in the same release of CCDS. These are distinguished with the next flag.
appris_principal_4: (This flag corresponds to the Ensembl 78 flag "appris_candidate_longest_ccds") Where the APPRIS core modules are unable to choose a clear principal CDS and there is more than one variant with distinct (but consecutive) CCDS identifiers, APPRIS selects the longest CCDS isoform as the principal variant.
appris_principal_5: (This flag corresponds to the Ensembl 78 flag "appris_candidate_longest_seq") Where the APPRIS core modules are unable to choose a clear principal variant and none of the candidate variants are annotated by CCDS, APPRIS selects the longest of the candidate isoforms as the principal variant.
appris_alternative_1: Candidate transcript model(s) that are conserved in at least three tested non-primate species.
appris_alternative_2: Candidate transcript model(s) that appear to be conserved in fewer than three tested non-primate species.
appris_principal: transcript expected to code for the main functional isoform based on a range of protein features (APPRIS pipeline).
appris_candidate: where there is no single 'appris_principal' variant the main functional isoform will be translated from one of the 'appris_candidate' transcripts.
appris_candidate_ccds: the "appris_candidate" transcript that has a unique CCDS.
appris_candidate_highest_score: where there is no 'appris_principal' variant, the candidate with highest APPRIS score is selected as the primary variant.
appris_candidate_longest: where there is no 'appris_principal' variant, the longest of the 'appris_candidate' variants is selected as the primary variant.
appris_candidate_longest_ccds: where several "appris_candidate" transcripts have a CCDS, APPRIS labels the one with the longest CCDS.
appris_candidate_longest_seq: where there is no "appris_candidate_ccds" or "appris_candidate_longest_ccds" variant, the longest protein of the "appris_candidate" variants is selected as the primary variant.
artifactual_duplication: annotated on an artifactual duplicate region of the genome assembly.
basic: identifies a subset of representative transcripts for each gene; prioritises full-length protein coding transcripts over partial or non-protein coding transcripts within the same gene, and intends to highlight those transcripts that will be useful to the majority of users.
bicistronic: transcript contains two confidently annotated CDSs. Support may come from e.g. proteomic data, cross-species conservation or published experimental work.
CAGE_supported_TSS: transcript 5' end overlaps ENCODE or FANTOM CAGE cluster.
CCDS: member of the consensus CDS gene set, confirming coding regions between ENSEMBL, UCSC, NCBI and HAVANA.
cds_end_NF: the coding region end could not be confirmed.
cds_start_NF: the coding region start could not be confirmed.
dotter_confirmed: transcript QC checked using dotplot to identify features, e.g. splice junctions, end of homology.
downstream_ATG: an upstream ATG is used where a downstream ATG seems more evolutionarily conserved.
Ensembl_canonical: most representative transcript of the gene. This will be the MANE_Select transcript if there is one, or a transcript chosen by an Ensembl algorithm otherwise.
exp_conf: transcript was tested and confirmed experimentally.
fragmented_locus: locus consists of non-overlapping transcript fragments either because of genome assembly issues (i.e., gaps or mis-assemblies), or because supporting transcripts (e.g., from another species) cannot be completely mapped, or because the supporting transcripts are non-overlapping end pairs (i.e., 5' and 3' ESTs from a single cDNA).
GENCODE_Primary: belongs to a minimal set that, for protein-coding genes, contains the MANE Select, MANE Plus Clinical and Ensembl Canonical transcripts, plus transcripts containing any conserved exons and common alternative splicing events (including exon skips) that are absent from the MANE and Ensembl Canonical transcripts. For other biotypes, the GENCODE_Primary flag is added to the Ensembl Canonical transcript; for lncRNA genes only, this will be the transcript with the longest genomic span.
inferred_exon_combination: transcript model contains all possible in-frame exons supported by homology, experimental evidence or conservation, but the exon combination is not directly supported by a single piece of evidence and may not be biological. Used for large genes with repetitive exons (e.g. titin (TTN)) to represent all the exons individual transcript variants can pool from.
inferred_transcript_model: transcript model is not supported by a single piece of transcript evidence. May be supported by multiple fragments of transcript evidence or by combining different evidence sources e.g. protein homology, RNA-seq data, published experimental data.
low_sequence_quality: transcript supported by transcript evidence that, while mapping best-in-genome, shows regions of poor sequence quality.
mRNA_end_NF: the mRNA end could not be confirmed.
mRNA_start_NF: the mRNA start could not be confirmed.
MANE_Select: the transcript belongs to the MANE Select data set. The Matched Annotation from NCBI and EMBL-EBI project (MANE) is a collaboration between Ensembl-GENCODE and RefSeq to select a default transcript per human protein coding locus that is representative of biology, well-supported, expressed and conserved. This transcript set matches GRCh38 and is 100% identical between RefSeq and Ensembl-GENCODE for 5' UTR, CDS, splicing and 3' UTR.
MANE_Plus_Clinical: the transcript belongs to the MANE Plus Clinical data set. Within the MANE project, these are additional transcripts per locus necessary to support clinical variant reporting, for example transcripts containing known pathogenic or likely pathogenic clinical variants not reportable using the MANE Select data set. This transcript set matches GRCh38 and is 100% identical between RefSeq and Ensembl-GENCODE for 5' UTR, CDS, splicing and 3' UTR.
NAGNAG_splice_site: in-frame type of variation where, at the acceptor site, some variants splice after the first AG and others after the second AG.
ncRNA_host: the locus is a host for small non-coding RNAs.
nested_454_RNA_Seq_supported: annotated based on RNA-seq data.
NMD_exception: the transcript looks like it is subject to NMD but publications, experiments or conservation support the translation of the CDS.
NMD_likely_if_extended: the transcript would likely be subject to NMD if it were longer, but it cannot currently be annotated as NMD because it does not fulfil all criteria, most commonly because it lacks an intron downstream of the stop codon.
non_ATG_start: the CDS has a non-ATG start and its validity is supported by publication or conservation.
non_canonical_conserved: the transcript has a non-canonical splice site conserved in other species.
non_canonical_genome_sequence_error: the transcript has a non-canonical splice site explained by a genomic sequencing error.
non_canonical_other: the transcript has a non-canonical splice site explained by other reasons.
non_canonical_polymorphism: the transcript has a non-canonical splice site explained by a SNP.
non_canonical_TEC: the transcript has a non-canonical splice site that needs experimental confirmation.
non_canonical_U12: the transcript has a non-canonical splice site explained by a U12 intron (i.e. AT-AC splice site).
non_submitted_evidence: a splice variant for which supporting evidence has not been submitted to databases, i.e. the model is based on literature or collaborator evidence.
not_best_in_genome_evidence: transcript is supported by evidence from paralogous loci in the same species.
not_organism_supported: evidence from other species was used to build model.
orphan: protein-coding locus with no paralogs or orthologs.
overlapping_locus: exon(s) of the locus overlap exon(s) of a readthrough transcript or a transcript belonging to another locus.
overlapping_uORF: a low confidence upstream ATG existing in another coding variant would lead to NMD in this transcript, which uses the high confidence canonical downstream ATG.
PAR: annotation in the pseudo-autosomal region, which is duplicated between chromosomes X and Y.
pseudo_consens: member of the pseudogene set predicted by YALE, UCSC and HAVANA.
readthrough_gene: protein-coding gene that has a readthrough transcript.
readthrough_transcript: a transcript that overlaps two or more independent loci but is considered to belong to a third, separate locus.
reference_genome_error: locus overlaps a sequence error or an assembly error in the reference genome that affects its annotation (e.g., 1 or 2bp insertion/deletion, substitution causing premature stop codon). The main effect is that affected transcripts that would have had a CDS are currently annotated without one.
retained_intron_CDS: internal intron of CDS portion of transcript is retained.
retained_intron_final: final intron of CDS portion of transcript is retained.
retained_intron_first: first intron of CDS portion of transcript is retained.
retrogene: protein-coding locus created via retrotransposition.
RNA_Seq_supported_only: transcript supported by RNA-seq data and not supported by mRNA or EST evidence.
RNA_Seq_supported_partial: transcript annotated based on a mixture of RNA-seq data and EST/mRNA/protein evidence.
RP_supported_TIS: transcript that contains a CDS that has a translation initiation site supported by ribosome profiling data.
seleno: contains a selenocysteine.
semi_processed: a processed pseudogene with one or more introns still present. These are likely formed through the retrotransposition of a retained intron transcript.
sequence_error: transcript contains at least 1 non-canonical splice junction that is associated with a known or novel genome sequence error.
stop_codon_readthrough: Transcript whose coding sequence contains an internal stop codon that does not cause translation termination.
TAGENE: Transcript created or extended using assembled RNA-seq long reads.
upstream_ATG: an upstream ATG exists when a downstream ATG is better supported.
upstream_uORF: a low confidence upstream ATG existing in another coding variant would lead to NMD in this transcript, which uses the high confidence canonical downstream ATG.
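
As an illustration of how the tags listed above might be read from an annotation file, here is a minimal Python sketch. It assumes that the GTF attribute column stores each tag as its own `tag "value";` pair and that the GFF3 attribute column stores all tags in a single comma-separated `tag=` attribute. The function names (gtf_tags, gff3_tags, has_complete_cds) and the identifiers in the example strings are made up for this illustration and are not part of any GENCODE tool.

```python
import re

# Assumption: in GTF, every tag is written as its own `tag "value";` pair.
_GTF_TAG = re.compile(r'\btag "([^"]+)"')


def gtf_tags(attributes):
    """Return the set of tag values found in a GTF attribute column."""
    return set(_GTF_TAG.findall(attributes))


def gff3_tags(attributes):
    """Return the set of tag values found in a GFF3 attribute column.

    Assumption: GFF3 stores all tags in one `tag=a,b,c` attribute.
    """
    for field in attributes.strip().split(";"):
        key, _, value = field.partition("=")
        if key.strip() == "tag":
            return set(value.split(","))
    return set()


def has_complete_cds(tags):
    """True if neither end of the coding region is flagged as unconfirmed."""
    return not ({"cds_start_NF", "cds_end_NF"} & tags)


# Hypothetical attribute strings; the identifiers are placeholders.
gtf_attrs = (
    'gene_id "ENSG_EXAMPLE.1"; transcript_id "ENST_EXAMPLE.1"; '
    'tag "basic"; tag "MANE_Select"; tag "CCDS";'
)
gff3_attrs = "ID=ENST_EXAMPLE.2;gene_id=ENSG_EXAMPLE.1;tag=basic,cds_start_NF"

tags_a = gtf_tags(gtf_attrs)
tags_b = gff3_tags(gff3_attrs)

print(sorted(tags_a))
print("MANE_Select" in tags_a, has_complete_cds(tags_a))
print(sorted(tags_b), has_complete_cds(tags_b))
```

Working with tags as a set keeps membership checks simple, for example keeping only transcripts tagged basic, or skipping those flagged cds_start_NF or cds_end_NF.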