Methods for Computational Gene Prediction
 



DATA SETS





Genomic sequences and annotations

These data sets consist of FASTA files containing entire genomic contigs or chromosomes, and GFF files containing the coordinates of coding exons within those contigs. To use these data you will have to extract the training features (e.g., exons, introns, and splice sites) yourself. For pre-extracted data, see the Sequence features table further down this page.


(NOTE: files hosted at TIGR may disappear in the near future...)

Data Set                        Description

Burge's data sets               Human data used by Chris Burge to train GENSCAN
rice100.tar.gz                  100 training genes and 100 test genes from rice
Arabidopsis.thaliana.tar.gz     GFF coordinates & FASTA file (hosted at TIGR)
Aspergillus.fumigatus.tar.gz    GFF coordinates & FASTA file (hosted at TIGR)
Aspergillus.spp.tar.gz          GFF coordinates & FASTA file (hosted at TIGR)
Homo.sapiens.tar.gz             GFF coordinates & FASTA file (hosted at TIGR)
Mus.musculus.tar.gz             GFF coordinates & FASTA file (hosted at TIGR)
Plasmodium.falciparum.tar.gz    GFF coordinates & FASTA file (hosted at TIGR)

SEE ALSO: Gene Prediction Data Consortium hosted at bioinformatics.org.

NOTE: the synthetic ("toy genome") data from Chapter 5 is available on another page.
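
For the genomic sequence and annotation data sets above, exon extraction can be sketched roughly as follows. This is a minimal illustration, not the procedure used in the book: it assumes standard tab-separated GFF (contig name in column 1, 1-based inclusive start and end in columns 4 and 5, strand in column 7) and that every record describes a coding exon, as the description above states. The function names are hypothetical, and individual files may deviate from these assumptions.

```python
# Sketch: pull exon sequences out of a FASTA file using GFF coordinates.
# Assumes standard GFF columns; function names are illustrative only.

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def read_fasta(lines):
    """Parse FASTA text lines into a dict of {sequence_id: sequence}."""
    seqs, name, parts = {}, None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(parts)
            # The identifier is the first word after '>'.
            name, parts = line[1:].split()[0], []
        else:
            parts.append(line.upper())
    if name is not None:
        seqs[name] = "".join(parts)
    return seqs


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_exons(fasta_lines, gff_lines):
    """Return a list of (contig, start, end, strand, sequence) tuples."""
    seqs = read_fasta(fasta_lines)
    exons = []
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7 or fields[0] not in seqs:
            continue  # skip malformed records and unknown contigs
        contig, strand = fields[0], fields[6]
        # Tolerate files that list the larger coordinate first.
        start, end = sorted((int(fields[3]), int(fields[4])))
        seq = seqs[contig][start - 1:end]  # 1-based inclusive -> slice
        if strand == "-":
            seq = reverse_complement(seq)
        exons.append((contig, start, end, strand, seq))
    return exons
```

Both arguments can be open file handles or lists of lines. Introns and splice sites can then be derived from the gaps between consecutive exons of the same gene, provided the GFF records carry a gene or transcript identifier.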


Sequence features

These data sets consist of FASTA files containing pre-extracted training features, such as coding exons and splice sites. Because the features have been pre-extracted, window sizes and similar parameters are fixed. To extract custom features, use the data from the Genomic sequences and annotations table above.

File                      Description

rice-features.tar.gz      Exons, introns, splice sites, start/stop codons, and intergenic regions from the rice genome. All signals have an 80bp margin on either side of the consensus sequence.
aedes-features.tar.gz     Features from the mosquito Aedes aegypti. Same format as for rice (above).
human-features.tar.gz     Features from nonredundant human genes on chromosome 1 only. Same format as the data sets above (e.g., 80bp margin on either side of signal consensuses). NCBI release 35.
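
To reproduce the 80bp margin convention described above when extracting custom signals, a window can be cut around each consensus position. The sketch below is an assumption about how such a window might be taken, not the extraction code used for these files; the function name and the 0-based `signal_start` convention are hypothetical. With a 2bp consensus such as a GT donor site, the resulting window is 80 + 2 + 80 = 162bp long.

```python
# Sketch: cut a fixed-margin window around a signal consensus.
# `signal_start` is a 0-based offset into `sequence` (an assumption).

def signal_window(sequence, signal_start, signal_length, margin=80):
    """Return the consensus plus `margin` bases on each side.

    Returns None when the window would run off either end of the
    sequence, so that all returned windows have the same length.
    """
    begin = signal_start - margin
    end = signal_start + signal_length + margin
    if begin < 0 or end > len(sequence):
        return None
    return sequence[begin:end]
```

Discarding truncated windows keeps every training example the same length, which a fixed-length signal model such as a weight matrix requires.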



Classification data sets

These data sets contain vectors of numerical features computed from sequences of both coding and non-coding ORFs.  These can be used to train a classifier for distinguishing coding from non-coding sequence. Both training and test data are provided, so that classifier accuracy can be quantified on the hold-out set.

The features are: (1) log probability of the ORF length, (2) weight-matrix score of the 5' signal of the ORF, (3) weight-matrix score of the 3' signal of the ORF, (4) log-likelihood ratio computed from hexamer frequencies in the ORF. More details are given in the README.pdf file included in each tarball. (See also section 10.17 of the book.)
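
As an illustration of feature (4), a hexamer log-likelihood ratio can be computed along the following lines. This is one common formulation and only a sketch: the exact definition used for these data sets is the one given in README.pdf, and the choices made here (overlapping hexamers in all frames, add-one smoothing, natural logarithm) are assumptions. The function names are hypothetical.

```python
# Sketch: log-likelihood ratio of an ORF under coding vs. non-coding
# hexamer frequency models. Smoothing and framing are assumptions.
import math
from collections import Counter


def hexamer_counts(sequences, k=6):
    """Count all overlapping k-mers in a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def hexamer_log_likelihood_ratio(orf, coding_counts, noncoding_counts,
                                 k=6, pseudocount=1.0):
    """Sum log(P_coding(kmer) / P_noncoding(kmer)) over the ORF."""
    n_kmers = 4 ** k
    coding_total = sum(coding_counts.values()) + pseudocount * n_kmers
    noncoding_total = sum(noncoding_counts.values()) + pseudocount * n_kmers
    score = 0.0
    for i in range(len(orf) - k + 1):
        kmer = orf[i:i + k]
        p_coding = (coding_counts.get(kmer, 0) + pseudocount) / coding_total
        p_noncoding = (noncoding_counts.get(kmer, 0) + pseudocount) / noncoding_total
        score += math.log(p_coding / p_noncoding)
    return score
```

A positive score means the ORF's hexamer content looks more like the coding training set, a negative score more like the non-coding set.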


File                           Description

human-classify.tar.gz          2200 Homo sapiens ORFs.
mouse-classify.tar.gz          2200 Mus musculus ORFs.
plasmodium-classify.tar.gz     872 Plasmodium falciparum ORFs.
toxoplasma-classify.tar.gz     2282 Toxoplasma gondii ORFs.
aspergillus-classify.tar.gz    2199 Aspergillus fumigatus ORFs.
arabidopsis-classify.tar.gz    2200 Arabidopsis thaliana ORFs.
all-classify.tar.gz            All of the above in a single file.
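
The train-then-evaluate workflow for the classification data sets can be sketched with a small Gaussian naive Bayes classifier. This is only one possible classifier, chosen here because it needs no external libraries; it is not claimed to be the method used in the book. The sketch assumes the four-feature vectors and their coding/non-coding labels have already been loaded into Python lists (the file layout is described in README.pdf and is not reproduced here), and all function names are hypothetical.

```python
# Sketch: Gaussian naive Bayes on numeric feature vectors, with
# accuracy measured on a separate hold-out set.
import math


def fit_gaussian_nb(vectors, labels):
    """Estimate a prior and per-feature mean/variance for each class."""
    model = {}
    for label in set(labels):
        rows = [v for v, y in zip(vectors, labels) if y == label]
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small constant keeps the variance strictly positive.
        variances = [sum((x - m) ** 2 for x in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(labels), means, variances)
    return model


def predict(model, vector):
    """Return the class with the highest log posterior."""
    def log_posterior(params):
        prior, means, variances = params
        total = math.log(prior)
        for x, m, var in zip(vector, means, variances):
            total += -0.5 * math.log(2 * math.pi * var)
            total -= (x - m) ** 2 / (2 * var)
        return total
    return max(model, key=lambda label: log_posterior(model[label]))


def accuracy(model, vectors, labels):
    """Fraction of vectors whose predicted class matches the label."""
    correct = sum(predict(model, v) == y for v, y in zip(vectors, labels))
    return correct / len(labels)
```

Fit the model on the training vectors only, then call `accuracy` on the test vectors, so that the reported figure reflects performance on ORFs the classifier has not seen.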






See also: