Format description of GENCODE GTF
A. TAB-separated standard GTF columns
column-number | content | values/format |
---|---|---|
1 | chromosome name | chr{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y,M} or GRC accession a |
2 | annotation source | {ENSEMBL,HAVANA} |
3 | feature type | {gene,transcript,exon,CDS,UTR,start_codon,stop_codon,Selenocysteine} |
4 | genomic start location | integer-value (1-based) |
5 | genomic end location | integer-value |
6 | score (not used) | . |
7 | genomic strand | {+,-} |
8 | genomic phase (for CDS features) | {0,1,2,.} |
9 | additional information as key-value pairs | see below |
a Scaffolds, patches and haplotypes names correspond to their GRC accessions. Please note that these are different from the Ensembl names.
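The column rules in the table above can be checked mechanically. The following is a minimal sketch, not an official GENCODE validator: the function name check_gtf_columns is hypothetical, and it assumes that lines starting with "#" are header or comment lines and that a start coordinate greater than the end coordinate is an error.

```shell
# Hypothetical helper: report lines that break the column rules of section A.
# Reads the named files, or standard input when no file is given.
check_gtf_columns() {
  awk -F'\t' '
    /^#/ { next }                          # skip header/comment lines
    NF != 9 { print "line " NR ": expected 9 columns, found " NF; next }
    $4 !~ /^[0-9]+$/ || $5 !~ /^[0-9]+$/ { print "line " NR ": non-integer coordinate"; next }
    $4 + 0 > $5 + 0 { print "line " NR ": start > end"; next }
    $7 != "+" && $7 != "-" { print "line " NR ": bad strand " $7; next }
    $8 !~ /^[012.]$/ { print "line " NR ": bad phase " $8 }
  ' "$@"
}
```

Running check_gtf_columns gencode.gtf prints nothing for a file that satisfies these checks and one message per offending line otherwise.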
B. Key-value pairs in 9th column (format: key "value"; )
B.1. Mandatory fields
key name | feature type(s) | value format | release |
---|---|---|---|
gene_id | all | ENSGXXXXXXXXXXX.X b,c _Xg | all |
transcript_id d | all except gene | ENSTXXXXXXXXXXX.X b,c _Xg | all |
gene_type | all | list of biotypes | all |
gene_status e | all | {KNOWN, NOVEL, PUTATIVE} | until 25 and M11 |
gene_name | all | string | all |
transcript_type d | all except gene | list of biotypes | all |
transcript_status d,e | all except gene | {KNOWN, NOVEL, PUTATIVE} | until 25 and M11 |
transcript_name d | all except gene | string | all |
exon_number f | all except gene/transcript/Selenocysteine | integer (exon position in the transcript from its 5' end) | all |
exon_id f | all except gene/transcript/Selenocysteine | ENSEXXXXXXXXXXX.X b _Xg | all |
level | all | 1 (verified loci), 2 (manually annotated loci), 3 (automatically annotated loci) | all |
b From version 7 the gene/transcript version number was appended to gene and transcript ids (eg. ENSG00000160087.16).
c Gene and transcript ids on the chrY PAR regions have "_PAR_Y" appended (from release 25), or are in the format ENSGRXXXXXXXXXX and ENSTRXXXXXXXXXX (until release 24) to avoid redundancy.
d Until releases 21 and M4, the gene lines included transcript attributes.
e The 'gene_status' and 'transcript_status' attributes were removed after releases 25 (human) and M11 (mouse).
f Except in gene and transcript lines.
g In the annotation mapped back to GRCh37, mapping versions are appended to the identifiers (eg. ENSG00000228327.3_2).
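Notes b, c and g mean that the same gene can appear with different suffixes across releases and mappings. The following is a minimal sketch for reducing gene ids to their unversioned form. The function name gene_id_base is hypothetical, and the sketch assumes that every suffix (version, mapping version, "_PAR_Y") follows the first "." in the id, as in the examples given in the notes. It does not convert the older ENSGR ids.

```shell
# Hypothetical helper: print the gene_id of every line without its suffixes.
# Reads the named files, or standard input when no file is given.
gene_id_base() {
  awk -F'\t' '
    /^#/ { next }                                  # skip header/comment lines
    match($9, /gene_id "[^"]+"/) {
      id = substr($9, RSTART + 9, RLENGTH - 10)    # text between the quotes
      sub(/\..*$/, "", id)                         # drop everything from the first "."
      print id
    }
  ' "$@"
}
```

For example, gene_id_base gencode.gtf | sort -u lists each unversioned gene id once.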
B.2. Optional fields
key name | value format |
---|---|
tag | part of a special set [*]: list of tags |
ccdsid | official CCDS id [*]; CCDS* |
havana_gene | gene id in the havana db [0,1]; OTTHUMGXXXXXXXXXXX.X |
havana_transcript | transcript id in the havana db [0,1] ; OTTHUMTXXXXXXXXXXX.X |
protein_id | ENSPXXXXXXXXXXX.X [0,1] |
ont | pseudogene (or other) ontology ids [*]; {PGO:0000004 and others} |
transcript_support_level | {1,2,3,4,5,NA} [0,1] transcripts are scored according to how well mRNA and EST alignments match over its full length: 1 (all splice junctions of the transcript are supported by at least one non-suspect mRNA), 2 (the best supporting mRNA is flagged as suspect or the support is from multiple ESTs), 3 (the only support is from a single EST), 4 (the best supporting EST is flagged as suspect), 5 (no single transcript supports the model structure), NA (the transcript was not analyzed) |
remap_status, remap_original_id, remap_original_location, remap_num_mappings, remap_target_status, remap_substituted_missing_target | Mapping attributes [0,1] - only for GRCh38 annotation lifted back to GRCh37. |
hgnc_id | HGNC id in human [0,1]; HGNC:* |
mgi_id | MGI id in mouse [0,1]; MGI:* |
Number of occurrences: [*] - zero or multiple, [0,1] - zero or one
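Because keys marked [*], such as tag, can occur several times on one line, a parser has to inspect every key-value pair rather than stop at the first match. The following is a minimal sketch of that idea. The function name transcripts_with_tag is hypothetical, and the sketch assumes that attribute values never contain a ";" character.

```shell
# Hypothetical helper: print the transcript_id of every "transcript" line
# that carries the given tag. Usage: transcripts_with_tag TAG [file ...]
transcripts_with_tag() {
  gtf_wanted_tag=$1
  shift
  awk -F'\t' -v wanted="$gtf_wanted_tag" '
    /^#/ || $3 != "transcript" { next }
    {
      n = split($9, pairs, ";")
      tid = ""; hit = 0
      for (i = 1; i <= n; i++) {
        pair = pairs[i]
        sub(/^ +/, "", pair)               # remove the space after each ";"
        sp = index(pair, " ")
        if (sp == 0) continue              # empty piece after the final ";"
        key = substr(pair, 1, sp - 1)
        val = substr(pair, sp + 1)
        gsub(/"/, "", val)                 # remove quotes around string values
        if (key == "transcript_id") tid = val
        if (key == "tag" && val == wanted) hit = 1
      }
      if (hit) print tid
    }
  ' "$@"
}
```

For example, transcripts_with_tag CCDS gencode.gtf lists the transcripts tagged "CCDS".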
Example GTF lines:
chr19 HAVANA gene 405438 409170 . - . gene_id "ENSG00000183186.7"; gene_type "protein_coding"; gene_name "C2CD4C"; level 2; havana_gene "OTTHUMG00000180534.3";
chr19 HAVANA transcript 405438 409170 . - . gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19 HAVANA exon 409006 409170 . - . gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 1; exon_id "ENSE00001322986.5"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19 HAVANA exon 405438 408401 . - . gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 2; exon_id "ENSE00001290344.6"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19 HAVANA CDS 407099 408361 . - 0 gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 2; exon_id "ENSE00001290344.6"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19 HAVANA start_codon 408359 408361 . - 0 gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 2; exon_id "ENSE00001290344.6"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19 HAVANA stop_codon 407096 407098 . - 0 gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 2; exon_id "ENSE00001290344.6"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19 HAVANA UTR 409006 409170 . - . gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 1; exon_id "ENSE00001322986.5"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19 HAVANA UTR 405438 407098 . - . gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 2; exon_id "ENSE00001290344.6"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19 HAVANA UTR 408362 408401 . - . gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 2; exon_id "ENSE00001290344.6"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
Examples for fetching specific parts from the file [Unix command line]:
- Get all "gene" lines:
awk '{if($3=="gene"){print $0}}' gencode.gtf
- Get all "protein-coding transcript" lines:
awk -F'\t' '{if($3=="transcript" && $9~/transcript_type "protein_coding";/){print $0}}' gencode.gtf
- Get level 1 & 2 annotation (manually annotated) only:
awk '{if($0~"level (1|2);"){print $0}}' gencode.gtf
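The one-liners above select whole lines. To pull individual attribute values out of the 9th column instead, one option is a small awk function that looks an attribute up by key. The following is a minimal sketch: the names gene_table and attr are hypothetical, only the first occurrence of each key is returned, and values are assumed not to contain ";".

```shell
# Hypothetical helper: print gene_id, gene_name, gene_type and level
# (tab-separated) for every "gene" line.
gene_table() {
  awk -F'\t' '
    # Return the first value of attribute "key" in column 9, or "" if absent.
    function attr(key,    s, v) {
      s = "; " $9                          # so that every key is preceded by ";"
      if (!match(s, "; *" key " [^;]+")) return ""
      v = substr(s, RSTART, RLENGTH)
      sub(/^; */, "", v)                   # strip the leading separator
      sub(/^[^ ]+ /, "", v)                # strip the key
      gsub(/"/, "", v)                     # strip quotes around string values
      return v
    }
    /^#/ || $3 != "gene" { next }
    { print attr("gene_id") "\t" attr("gene_name") "\t" attr("gene_type") "\t" attr("level") }
  ' "$@"
}
```

For example, gene_table gencode.gtf > genes.tsv writes one row per gene.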
Example for parsing the file [Perl]
#!/usr/bin/perl
use strict;

my $gencode_file = "gencode.v23.annotation.gtf";
open(IN, "<$gencode_file") or die "Can't open $gencode_file.\n";
my %all_genes;
while(<IN>){
  next if(/^##/); #ignore header
  chomp;
  my %attribs = ();
  my ($chr, $source, $type, $start, $end, $score, $strand, $phase, $attributes) = split("\t");
  #store nine columns in hash
  my %fields = (
    chr        => $chr,
    source     => $source,
    type       => $type,
    start      => $start,
    end        => $end,
    score      => $score,
    strand     => $strand,
    phase      => $phase,
    attributes => $attributes,
  );
  my @add_attributes = split(";", $attributes);
  # store ids and additional information in second hash
  foreach my $attr ( @add_attributes ) {
     next unless $attr =~ /^\s*(.+)\s(.+)$/;
     my $c_type  = $1;
     my $c_value = $2;
     $c_value =~ s/\"//g;
     if($c_type && $c_value){
       if(!exists($attribs{$c_type})){
         $attribs{$c_type} = [];
       }
       push(@{ $attribs{$c_type} }, $c_value);
     }
  }
  #work with the information from the two hashes...
  #eg. store them in a hash of arrays by gene_id:
  if(!exists($all_genes{$attribs{'gene_id'}->[0]})){
    $all_genes{$attribs{'gene_id'}->[0]} = [];
  }
  push(@{ $all_genes{$attribs{'gene_id'}->[0]} }, \%fields);
}
print "Example entry ENSG00000183186.7: ".
  $all_genes{"ENSG00000183186.7"}->[0]->{"type"}.", ".
  $all_genes{"ENSG00000183186.7"}->[0]->{"chr"}." ".
  $all_genes{"ENSG00000183186.7"}->[0]->{"start"}."-".
  $all_genes{"ENSG00000183186.7"}->[0]->{"end"}."\n";
GTF Parsers in Other Programming Languages
Third-party libraries for a number of programming languages already provide GTF parsers. We have listed some of these below; they should be used in preference to writing your own parser.
For further questions, please contact our helpdesk.