GeneMark

GeneMark is a generic name for a family of ab initio gene prediction programs developed at the Georgia Institute of Technology in Atlanta. Developed in 1993, the original GeneMark was used in 1995 as the primary gene prediction tool for annotation of the first completely sequenced bacterial genome, Haemophilus influenzae, and in 1996 for the first archaeal genome, Methanococcus jannaschii. The algorithm introduced inhomogeneous three-periodic Markov chain models of protein-coding DNA sequence, which became standard in gene prediction, as well as a Bayesian approach to predicting genes in both DNA strands simultaneously. Species-specific parameters of the models were estimated from training sets of sequences of known type (protein-coding and non-coding). The major step of the algorithm computes, for a given DNA fragment, the posterior probabilities of its being "protein-coding" (carrying genetic code) in each of six possible reading frames (including three frames in the complementary DNA strand) or being "non-coding". The original GeneMark, developed before the HMM era in bioinformatics, is an HMM-like algorithm: it can be viewed as an approximation to the posterior decoding algorithm known from HMM theory, applied to an appropriately defined HMM.
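The posterior computation described above can be illustrated with a minimal Python sketch. It is not the GeneMark implementation: the first-order chain, the pseudocounts, the prior of 0.5 for "non-coding", the toy training sequences, and all function names (`train_chain`, `log_likelihood`, `frame_posteriors`) are assumptions made for illustration only.

```python
import math

NUCS = "ACGT"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the complementary strand read in its own 5'->3' direction."""
    return seq.translate(_COMPLEMENT)[::-1]


def train_chain(seqs, period, pseudo=1.0):
    """Estimate a first-order Markov chain whose parameters depend on the
    position modulo `period` (3 = three-periodic coding model,
    1 = homogeneous non-coding model). Training sequences are assumed to
    start at codon position 0. Pseudocounts are an illustrative choice."""
    init = [{a: pseudo for a in NUCS} for _ in range(period)]
    trans = [{p: {a: pseudo for a in NUCS} for p in NUCS} for _ in range(period)]
    for s in seqs:
        for i, a in enumerate(s):
            k = i % period
            init[k][a] += 1
            if i > 0:
                trans[k][s[i - 1]][a] += 1
    # Turn counts into probabilities.
    for k in range(period):
        total = sum(init[k].values())
        init[k] = {a: c / total for a, c in init[k].items()}
        for p in NUCS:
            total = sum(trans[k][p].values())
            trans[k][p] = {a: c / total for a, c in trans[k][p].items()}
    return init, trans


def log_likelihood(seq, model, phase=0):
    """Log-probability of `seq` when its first nucleotide sits at codon
    position `phase` (ignored in effect by a period-1 model)."""
    init, trans = model
    period = len(init)
    ll = math.log(init[phase % period][seq[0]])
    for i in range(1, len(seq)):
        k = (phase + i) % period
        ll += math.log(trans[k][seq[i - 1]][seq[i]])
    return ll


def frame_posteriors(seq, coding, noncoding, prior_noncoding=0.5):
    """Bayes' rule over seven hypotheses: coding in one of three frames on
    either strand, or non-coding. The priors are assumed values."""
    prior_frame = (1.0 - prior_noncoding) / 6
    logs = {"non-coding": math.log(prior_noncoding) + log_likelihood(seq, noncoding)}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for phase in range(3):
            logs[(strand, phase)] = math.log(prior_frame) + log_likelihood(s, coding, phase)
    # Normalise in log space to avoid underflow on longer fragments.
    top = max(logs.values())
    norm = sum(math.exp(v - top) for v in logs.values())
    return {h: math.exp(v - top) / norm for h, v in logs.items()}


# Toy training data, chosen only so that the example is easy to follow.
coding_model = train_chain(["ATG" * 20], period=3)
noncoding_model = train_chain(["ACGTTGCAAGCTTCGA" * 4], period=1)
posteriors = frame_posteriors("ATGATGATGATG", coding_model, noncoding_model)
best_hypothesis = max(posteriors, key=posteriors.get)
```

In this toy setting the fragment matches the training data in frame 0 of the direct strand, so `best_hypothesis` is `("+", 0)`. Real models are trained on large sets of annotated sequences, and the original algorithm's model order and priors are not specified here.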

GeneMark
Original author(s): Bioinformatics group of Mark Borodovsky
Developer(s): Georgia Institute of Technology
Initial release: 1993
Operating system: Linux, Windows, and Mac OS
License: Free binary-only for academic, non-profit or U.S. Government use
Website: opal.biology.gatech.edu/GeneMark

Prokaryotic gene prediction

The GeneMark.hmm algorithm (1998) was designed to improve gene prediction accuracy in finding short genes and gene starts. The idea was to integrate the Markov chain models used in GeneMark into a hidden Markov model framework, with transitions between coding and non-coding regions formally interpreted as transitions between hidden states. Additionally, a ribosome binding site model was used to improve the accuracy of gene start prediction. The next step was the development of the self-training gene prediction tool GeneMarkS (2001). GeneMarkS has been in active use by the genomics community for gene identification in new prokaryotic genomic sequences. GeneMarkS+, an extension of GeneMarkS that integrates information on homologous proteins into gene prediction, is used in the NCBI pipeline for prokaryotic genome annotation; the pipeline can annotate up to 2000 genomes daily (www.ncbi.nlm.nih.gov/genome/annotation_prok/process).
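The idea of treating coding/non-coding boundaries as transitions between hidden states can be sketched with a two-state HMM decoded by the Viterbi algorithm. This is far simpler than GeneMark.hmm itself: the per-nucleotide emission probabilities stand in for the Markov chain models, no ribosome binding site model is included, and every probability below is an invented toy value.

```python
import math


def viterbi(seq, states, start, trans, emit):
    """Most probable hidden-state path for `seq`, computed in log space."""
    score = {s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for x in seq[1:]:
        new_score, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: score[p] + math.log(trans[p][s]))
            new_score[s] = score[prev] + math.log(trans[prev][s]) + math.log(emit[s][x])
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy parameters (assumptions): "C" = coding, "N" = non-coding.
STATES = ("N", "C")
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {
    "N": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},  # AT-rich background
    "C": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4},  # GC-rich "genes"
}

labels = "".join(viterbi("AAAAAAGGGGGGAAAAAA", STATES, START, TRANS, EMIT))
```

Here `labels` marks the central G-rich stretch as coding and the flanks as non-coding. In a realistic gene finder the emissions would come from trained coding and non-coding sequence models rather than single-nucleotide frequencies.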

Heuristic Models and Gene Prediction in Metagenomes and Metatranscriptomes

Accurate identification of species-specific parameters of the GeneMark and GeneMark.hmm algorithms was the key condition for making accurate gene predictions. However, studies of viral genomes raised the question of how to define parameters for gene prediction in a rather short sequence that has no large genomic context. In 1999 this question was addressed by the development of a "heuristic method" that computes the parameters as functions of the sequence G+C content. Since 2004, models built by the heuristic approach have been used to find genes in metagenomic sequences. Subsequently, analysis of several hundred prokaryotic genomes led to the development of a more advanced heuristic method in 2010, implemented in MetaGeneMark.
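As a rough illustration of deriving model parameters from G+C content alone, the Python sketch below maps the G+C fraction of a short sequence to nucleotide frequencies at each codon position. The linear form and the `PLACEHOLDER_LINES` coefficients are invented for this example and are not values from the published heuristic method; the function names are likewise hypothetical.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous nucleotides of `seq`."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return gc / total


# PLACEHOLDER (slope, intercept) pairs, one per codon position.
# These numbers are made up; real coefficients would be fitted to genomes.
PLACEHOLDER_LINES = [(0.6, 0.25), (0.4, 0.20), (1.8, -0.40)]


def heuristic_codon_position_freqs(seq):
    """Nucleotide frequencies at codon positions 1-3 as functions of the
    overall G+C content of `seq` (illustrative linear dependence)."""
    gc = gc_content(seq)
    freqs = []
    for slope, intercept in PLACEHOLDER_LINES:
        # Clamp so that every frequency stays strictly positive.
        gc_k = min(max(slope * gc + intercept, 0.01), 0.99)
        freqs.append({
            "G": gc_k / 2, "C": gc_k / 2,
            "A": (1 - gc_k) / 2, "T": (1 - gc_k) / 2,
        })
    return freqs


example_freqs = heuristic_codon_position_freqs("GGCCAATT")
```

The point of the design is that no training set is needed: a single number measured on the input fragment selects a full set of model parameters, which is what makes the approach usable on short viral or metagenomic sequences.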

Eukaryotic gene prediction

In eukaryotic genomes, modeling the borders of exons with introns and intergenic regions presents a major challenge, which is addressed by the use of HMMs. The HMM architecture of eukaryotic GeneMark.hmm includes hidden states for initial, internal, and terminal exons, introns, intergenic regions, and single-exon genes located in both DNA strands. The initial eukaryotic GeneMark.hmm needed training sets for estimation of the algorithm parameters. In 2005 the first version of the self-training algorithm GeneMark-ES was developed. In 2008 the GeneMark-ES algorithm was extended to fungal genomes by developing a special intron model and a more complex self-training strategy. Then, in 2014, GeneMark-ET, an algorithm that augments self-training with information from unassembled RNA-Seq reads mapped to the genome, was added to the family. Gene prediction in eukaryotic transcripts can be done with the GeneMarkS-T algorithm (2015).
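The hidden states listed above imply a grammar of allowed gene structures. The following Python sketch encodes a simplified, segment-level version of that grammar and checks whether a sequence of segment labels is consistent with it. The state names and the transition table are assumptions for illustration; the actual model has further detail that is not reproduced here.

```python
# Allowed successors of each segment type on the direct ("+") strand.
FORWARD = {
    "intergenic": {"initial_exon", "single_exon"},
    "initial_exon": {"intron"},
    "intron": {"internal_exon", "terminal_exon"},
    "internal_exon": {"intron"},
    "terminal_exon": {"intergenic"},
    "single_exon": {"intergenic"},
}


def strand_transitions(strand):
    """Transition table for a strand. A gene on the complementary strand,
    read left to right along the direct strand, appears in reverse order,
    so every arrow of the direct-strand table is flipped."""
    if strand == "+":
        return FORWARD
    flipped = {state: set() for state in FORWARD}
    for src, dsts in FORWARD.items():
        for dst in dsts:
            flipped[dst].add(src)
    return flipped


def is_valid_path(segments, strand="+"):
    """True if each consecutive pair of segments is an allowed transition."""
    allowed = strand_transitions(strand)
    return all(b in allowed[a] for a, b in zip(segments, segments[1:]))


multi_exon_gene = [
    "intergenic", "initial_exon", "intron", "internal_exon",
    "intron", "terminal_exon", "intergenic",
]
reverse_strand_gene = [
    "intergenic", "terminal_exon", "intron", "initial_exon", "intergenic",
]
```

With these tables, `is_valid_path(multi_exon_gene)` and `is_valid_path(reverse_strand_gene, "-")` both hold, while a path that jumps from an intergenic region directly into an internal exon is rejected.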

GeneMark Family of Gene Prediction Programs

Bacteria, Archaea

  • GeneMark
  • GeneMarkS
  • GeneMarkS+

Metagenomes and Metatranscriptomes

  • MetaGeneMark

Eukaryotes

  • GeneMark
  • GeneMark.hmm [1]
  • GeneMark-ES: gene finding algorithm for eukaryotic genomes that performs automatic training in unsupervised ab initio mode.[2]
  • GeneMark-ET: augments GeneMark-ES with a novel method that integrates RNA-Seq read alignments into the self-training procedure.[3]
  • GeneMark-EX: a fully automatic integrated tool for genome annotation that shows robust performance across input data of various sizes, structures and quality. The algorithm selects the approach to parameter estimation depending on the volume, quality and features of the input data, the size of the RNA-Seq dataset, the phylogenetic position of the species, and the degree of assembly fragmentation. It is able to automatically modify the HMM architecture to fit the features of the genome in question and to integrate transcript and protein information into the process of gene prediction.[4]

Viruses, phages and plasmids

  • Heuristic models

Transcripts assembled from RNA-Seq reads

  • GeneMarkS-T

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.