Identification of Genes in a Genomic DNA Sequence

	About this book
	4. Principles and Methods of Sequence Analysis
	4.1. Identification of Genes in a Genomic DNA Sequence
	4.2. Principles of Sequence Similarity Searches
	4.3. Algorithms for Sequence Alignment and Similarity Search
	4.4. Practical Issues: How to Get the Most Out of BLAST
	4.5. The Road to Discovery
	4.6. Protein Annotation in the Absence of Detectable Homologs
	4.7. Conclusions and Outlook
	4.8. Further Reading

4.1. Identification of Genes in a Genomic DNA Sequence

4.1.1. Prediction of protein-coding genes

Archaeal and bacterial genes typically comprise uninterrupted stretches of DNA between a start codon (usually ATG, but in a minority of genes, GTG, TTG, or CTG) and a stop codon (TAA, TGA, or TAG; alternative genetic codes of certain bacteria, such as mycoplasmas, have only two stop codons). Rare exceptions to this rule involve important but rare mechanisms, such as programmed frameshifts. There seem to be no strict limits on the length of the genes. Indeed, the gene rpmJ encoding the ribosomal protein L36 (Figure 2.1) is only 111 bp long in most bacteria, whereas the gene for B. subtilis polyketide synthase PksK is 13,343 bp long. In practice, mRNAs shorter than 30 codons are poorly translated, so protein-coding genes in prokaryotes are usually at least 100 bases in length. In prokaryotic genome-sequencing projects, open reading frames (ORFs) shorter than 100 bases are rarely taken into consideration, which does not seem to result in substantial underprediction. In contrast, in multicellular eukaryotes, most genes are interrupted by introns. The mean length of an exon is ~50 codons, but some exons are much shorter; many of the introns are extremely long, resulting in genes occupying up to several megabases of genomic DNA. This makes prediction of eukaryotic genes a far more complex (and still unsolved) problem than prediction of prokaryotic genes.

4.1.1.1. Prokaryotes

For most common purposes, a prokaryotic gene can be defined simply as the longest ORF for a given region of DNA. Translation of a DNA sequence in all six reading frames is a straightforward task, which can be performed on line using, for example, the Translate tool on the ExPASy server (http://www.expasy.org/tools/dna.html) or the ORF Finder at NCBI (http://www.ncbi.nlm.nih.gov/gorf/gorf.html).

Of course, this approach is oversimplified and may result in a certain number of incorrect gene predictions, although the error rate is rather low. Firstly, DNA sequencing errors may result in incorrectly assigned or missed start and/or stop codons, because of which a gene might be truncated, overextended, or missed altogether. Secondly, on rare occasions, among two overlapping ORFs (on the same or the opposite DNA strand), the shorter one might be the real gene. The existence of a long “shadow” ORF opposite a protein-coding sequence is more likely than in a random sequence because of the statistical properties of the coding regions. Indeed, consider the simple case where the first base in a codon is a purine and the third base is a pyrimidine (the RNY codon pattern). Obviously, the mirror frame in the complementary strand would follow the same pattern, resulting in a deficit of stop codons [235]. Figure 4.1 shows the ORFs of at least 100 bp located in a 10-kb fragment of the E. coli genome (from 3435250 to 3445250) that encodes potassium transport protein TrkA, mechanosensitive channel MscL, transcriptional regulator YhdM, RNA polymerase alpha subunit RpoA, preprotein translocase subunit SecY, and ribosomal proteins RplQ (L17), RpsD (S4), RpsK (S11), RpsM (S13), RpmJ (L36), RplO (L15), RpmD (L30), RpsE (S5), RplR (L18), RplF (L6), RpsH (S8), RpsN (S14), RplE (L5), and RplX (L24). Although the two ORFs in frame +1 (top line, on the right) are longer (207 aa and 185 aa) than the ORFs in frame −3 (bottom line, 117 aa, 177 aa, 130 aa, and 101 aa), it is the latter that encode real proteins, namely the ribosomal proteins RplR, RplF, RpsH, and RpsN.

Because of these complications, it is always desirable to have some additional evidence that a particular ORF actually encodes a protein. Such evidence comes along many different lines and can be obtained using various methods, e.g. the following ones:

The ORF in question encodes a protein that is similar to previously described ones (search the protein database for homologs of the given sequence).
The ORF has a typical GC content, codon frequency, or oligonucleotide composition (calculate the codon bias and/or other statistical features of the sequence, compare to those for known protein-coding genes from the same organism).
The ORF is preceded by a typical ribosome-binding site (search for a Shine-Dalgarno sequence in front of the predicted coding sequence).
The ORF is preceded by a typical promoter (if consensus promoter sequences for the given organism are known, check for the presence of a similar upstream region).

The most reliable of these approaches is a database search for homologs. In several useful tools, DNA translation is seamlessly bound to the database searches. In the ORF finder, for example, the user can submit the translated sequence for a BLASTP or TBLASTN (see 4.4) search against the NCBI sequence databases. In addition, there is an opportunity to compare the translated sequence to the COG database (see 3.4). A largely similar Analysis and Annotation Tool (http://genome.cs.mtu.edu/aat.html), developed by Xiaoqiu Huang at Michigan Tech [361], also compares the translated protein sequences to nr and SWISS-PROT; in addition, it checks them against two cDNA databases, the dbEST at the NCBI and Human Gene Index at TIGR.

Other methods take advantage of the statistical properties of the coding sequences. For organisms with highly biased GC content, for example, the third position in each codon has a highly biased (very high or very low) frequency of G and C. FramePlot, a program that exploits this skew for gene recognition [380], is available at the Japanese Institute of Infectious Diseases (http://www.nih.go.jp/~jun/cgi-bin/frameplot.pl) and at the TIGR web site (http://tigrblast.tigr.org/cmr-blast/GC_Skew.cgi). The most useful and popular gene prediction programs, such as GeneMark and Glimmer (see 3.1.2), build Markov models of the known coding regions for the given organism and then employ them to estimate the coding potential of uncharacterized ORFs.

Inferring genes based on the coding potential and on the similarity of the encoded protein sequences to those of other proteins represent the intrinsic and extrinsic approaches to gene prediction [110], which ideally should be combined. Two programs that implement such a combination, developed specifically for analysis of prokaryotic genomes, are ORPHEUS (http://pedant.gsf.de/orpheus [249]) and CRITICA ([67], source code at http://www.math.uwaterloo.ca/~jhbadger/). Several other algorithms that incorporate both these approaches are aimed primarily at eukaryotic genomes and are discussed further in this section.

4.1.1.2. Unicellular eukaryotes

Genomes of unicellular eukaryotes are extremely diverse in size, the proportion of the genome that is occupied by protein-encoding genes and the frequency of introns. Clearly, the smaller the intergenic regions and the fewer introns are there, the easier it is to identify genes. Fortunately, genomes of at least some simple eukaryotes are quite compact and contain very few introns. Thus, in yeast S. cerevisiae, at least 67% of the genome is protein-coding, and only 233 genes (less than 4% of the total) appear to have introns [660]. Although these include some biologically important and extensively studied genes, e.g. those for aminopeptidase APE2, ubiquitin-protein ligase UBC8, subunit 1 of the mitochondrial cytochrome oxidase COX1, and many ribosomal proteins, introns comprise less than 1% of the yeast genome. The tiny genome of the intracellular eukaryotic parasite Encephalitozoon cuniculi appears to contain introns in only 12 genes and is practically prokaryote-like in terms of the “wall-to-wall” gene arrangement [425]. Malaria parasite Plasmodium falciparum is a more complex case, with ~43% of the genes located on chromosome 2 containing one or more introns [272]. Protists with larger genomes often have fairly high intron density. In the slime mold Physarum polycephalum, for example, the average gene has 3.7 introns [851]. Given that the average exon size in this organism (165 ± 85 bp) is comparable to the length of an average intron (138 ± 103 bp), homology-based prediction of genes becomes increasingly complicated.

Because of this genome diversity, there is no single way to efficiently predict protein-coding genes in different unicellular eukaryotes. For some of them, such as yeast, gene prediction can be done by using more or less the same approaches that are routinely employed in prokaryotic genome analysis. For those with intron-rich genomes, the gene model has to include information on the intron splice sites, which can be gained from a comparison of the genomic sequence against a set of ESTs from the same organism. This necessitates creating a comprehensive library of ESTs that have to be sequenced in a separate project. Such dual EST/genomic sequencing projects are currently under way for several unicellular eukaryotes (see Appendix 2).

4.1.1.3. Multicellular eukaryotes

In most multicellular eukaryotes, gene organization is so complex that gene identification poses a major problem. Indeed, eukaryotic genes are often separated by large intergenic regions, and the genes themselves contain numerous introns, many of them long. Figure 4.2 shows a typical distribution of exons and introns in a human gene, the X chromosome-located gene encoding iduronate 2-sulfatase (IDS_HUMAN), a lysosomal enzyme responsible for removing sulfate groups from heparan sulfate and dermatan sulfate. Mutations causing iduronate sulfatase deficiency result in the lysosomal accumulation of these glycosaminoglycans, clinically known as Hunter's syndrome or type II mucopolysaccharidosis (OMIM entry 309900) [896]. A number of clinical cases have been shown to result from aberrant alternative splicing of this gene's mRNA, which emphasizes the importance of reliable prediction of gene structure [631].

Obviously, the coding regions compose only a minor portion of the gene. In this case, positions of the exons could be unequivocally determined by mapping the cDNA sequence (i.e. iduronate sulfatase mRNA) back to the chromosomal DNA. Because of the clinical phenotype of the mutations in the iduronate sulfatase gene, we already know the “correct” mRNA sequence and can identify various alternatively spliced variants as mutations. However, for many, perhaps the majority of the human genes, multiple alternative forms are part of the regular expression pattern [118,576], and correct gene prediction ideally should identify all of these forms, which immensely complicates the task.

Ideally, gene prediction should identify all exons and introns, including those in the 5′-untranslated region (5′-UTR) and the 3′-UTR of the mRNA, in order to precisely reconstruct the predominant mRNA species. For practical purposes, however, it is useful to assemble at least the coding exons correctly because this allows one to deduce the protein sequence.

Correct identification of the exon boundaries relies on the recognition of the splice sites, which is facilitated by the fact that the great majority of splice sites conform to consensus sequences that include two nearly invariant dinucleotides at the ends of each intron, a GT at the 5′ end and an AG at the 3′ end. Non-canonical splice signals are rare and come in several variants [329,582]. In the 5′ splice sites, the GC dinucleotide is sometimes found instead of GT. The second class of exceptions to the splice site consensus includes so-called “AT-AC” introns that have the highly conserved /(A,G)TATCCT(C,T) sequence at their 5′ sites. There are additional variants of non-canonical splice signals, which further complicate prediction of the gene structure.

The available assessments of the quality of eukaryotic gene prediction achieved by different programs show a rather gloomy picture of numerous errors in exon/intron recognition. Even the best tools correctly predict only ~40% of the genes [697]. The most serious errors come from genes with long introns, which may be predicted as intragenic sequences, resulting in erroneous gene fission, and pairs of genes with short intergenic regions, which may be predicted as introns, resulting in false gene fusion. Nevertheless, most of the popular gene prediction programs discussed in the next section show reasonable performance in predicting the coding regions in the sense that, even if a small exon is missed or overpredicted, the majority of exons are identified correctly.

Another important parameter that can affect ORF prediction is the fraction of sequencing errors in the analyzed sequence. Indeed, including frameshift corrections was found to substantially improve the overall quality of gene prediction [133]. Several algorithms were described that could detect frameshift errors based on the statistical properties of coding sequences [224]. On the other hand, error correction techniques should be used with caution because eukaryotic genomes contain numerous pseudogenes, and non-critical frameshift correction runs the risk of wrongly “rescuing” pseudogenes. The problem of discriminating between pseudogenes and frameshift errors is actually quite complex and will likely be solved only through whole-genome alignments of different species or, in certain cases, by direct experimentation, e.g., expression of the gene(s) in question.

4.1.2. Algorithms and software tools for gene identification

As discussed in the previous section, recognizing genes in the DNA sequences remains one of the most pressing problems in genome analysis. Several different approaches to gene prediction have been developed, and there are several popular programs that are most commonly used for this task (see Table 4.1). Some of these tools perform gene prediction ab initio, relying only on the statistical parameters in the DNA sequence for gene identification. In contrast, homology-based methods rely primarily on identifying homologous sequences in other genomes and/or in public databases using BLAST or Smith-Waterman algorithms. Many of the commonly used methods combine these two approaches.

The absence of introns and relatively high gene density in most genomes of prokaryotes and some unicellular eukaryotes provides for effective use of sequence similarity searches as the first step in genome annotation. Genes identified by homology can be used as the training set for one of the statistical methods for gene recognition, and the resulting statistical model can then be used for analyzing the remaining parts of the genome. In most eukaryotes, the abundance of introns and long intergenic regions makes it difficult to use homology-based methods as the first step unless, of course, one can rely on synteny between several closely related genomes (e.g. human, mouse, and rat). As a result, gene prediction for genome sequences of multicellular eukaryotes usually starts with ab initio methods, followed by similarity searches with the initial exon assemblies.

A detailed comparison of the algorithms and tools for gene prediction is beyond the scope of this book. We would like to emphasize only that each of these methods has its own advantages and limitations, and none of them is perfect. Therefore, it is advisable to use at least two different programs for gene prediction in a new DNA sequence, especially if it comes from a eukaryote or a poorly characterized prokaryote. A comparison of predictions generated by different programs reveals the cases where a given program performs the best and helps in achieving consistent quality of gene prediction. Such a comparison can be performed, for example, using the TIGR Combiner program (http://www.tigr.org/softlab), which employs a voting scheme to combine predictions of different gene-finding programs, such as GeneMark, GlimmerM, GRAIL, GenScan, and Fgenes.

We describe only several computational tools that are most commonly used for gene prediction in large-scale genome annotation projects.

GeneMark

GeneMark (http://opal.biology.gatech.edu/genemark, mirrored at the EBI web site http://www.ebi.ac.uk/genemark) was developed by Mark Borodovsky and James McIninch in 1993 [108]. GeneMark was the first tool for finding prokaryotic genes that employed a non-homogeneous Markov model to classify DNA regions into protein-coding, non-coding, and non-coding but complementary to coding. It has been shown previously that, by multivariate codon usage analysis, the E. coli genes could be classified into so-called typical, highly typical, and atypical gene sets, with the latter two groups apparently corresponding to highly expressed genes and horizontally transferred genes [562]. Accordingly, more than one Markov model was required to adequately describe different groups of genes in the same genome [109,110].

Like other gene prediction programs (see below), GeneMark relies on organism-specific recognition parameters to partition the DNA sequence into coding and non-coding regions and thus requires a sufficiently large training set of known genes from a given organism for best performance. The program has been repeatedly updated and modified and now exists in separate variants for gene prediction in prokaryotic, eukaryotic, and viral DNA sequences [88,528,768].

Glimmer

Gene Locator and Interpolated Markov Modeler (GLIMMER, http://www.tigr.org/softlab), developed by Steven Salzberg and colleagues at Johns Hopkins University and TIGR, is a system for finding genes in prokaryotic genomes. To identify coding regions and distinguish them from noncoding DNA, Glimmer uses interpolated Markov models, i.e. series of Markov models with the order of the model increasing at each step and the predictive power of each model separately evaluated [735]. Like GeneMark, Glimmer requires a training set, which is usually selected among known genes, genes coding for proteins with strong database hits, and/or simply long ORFs. Glimmer is used as the primary gene finder tool at TIGR, where it has been applied to the annotation of numerous microbial genomes [241,243,610,891].

Recently, Salzberg and coworkers developed GlimmerM, a modified version of Glimmer specifically designed for gene recognition in small eukaryotic genomes, such as the malaria parasite Plasmodium falciparum [736].

Grail

Gene Recognition and Assembly Internet Link (GRAIL, http://compbio.ornl.gov), developed by Ed Uberbacher and coworkers at the Oak Ridge National Laboratory, is a tool that identifies exons, polyA sites, promoters, CpG islands, repetitive elements, and frameshift errors in DNA sequences by comparing them to a database of known human and mouse sequence elements [858]. Exon and repetitive element prediction is also available for Arabidopsis and Drosophila sequences.

Grail has been recently incorporated into the Oak Ridge genome analysis pipeline (http://compbio.ornl.gov/tools/pipeline), which provides a unified web interface to a number of convenient analysis tools. For prokaryotes, it offers gene prediction using Glimmer (see above) and Generation programs, followed by BLASTP searches of predicted ORFs against SWISS-PROT and NR databases and a HMMer search against Pfam. There is also an option of BLASTN search of the submitted DNA sequence against a variety of nucleotide sequence databases.

For human and mouse sequences, the Oak Ridge pipeline offers gene prediction using GrailEXP and GenScan (see below), also followed by BLASTP searches of predicted ORFs against SWISS-PROT and NR databases and a HMMer search against Pfam. Again, the user can perform BLASTN search of the submitted DNA sequence against a variety of nucleotide sequence databases, as well as search for CpG islands, repeat fragments, tRNAs, and BAC-end pairs. As discussed above, the possibility to directly compare gene predictions made by two different programs is a valuable feature, which is available at the Oak Ridge web site.

GenScan

GenScan (http://genes.mit.edu/GENSCAN.html) was developed by Chris Burge and Samuel Karlin at Stanford University and is currently hosted in the Burge laboratory at the MIT Department of Biology. This program uses a complex probabilistic model of the gene structure that is based on actual biological information about the properties of transcriptional, translational, and splicing signals. In addition, it utilizes various statistical properties of coding and noncoding regions. To account for the heterogeneity of the human genome that affects gene structure and gene density, GenScan derives different sets of gene models for genome regions with different GC content [131,132]. Its high speed and accuracy make GenScan the method of choice for the initial analysis of large (in the megabase range) stretches of eukaryotic genomic DNA. GenScan has being used as the principal tool for gene prediction in the International Human Genome Project [488].

GeneBuilder

GeneBuilder (http://www.itba.mi.cnr.it/webgene) performs ab initio gene prediction using numerous parameters, such as GC content, dicodon frequencies, splicing site data, CpG islands, repetitive elements, and others. It also utilizes a unique approach that is based on evaluating relative frequencies of synonymous and nonsynonymous substitutions to identify likely coding sequences. In addition, it performs BLAST searches of predicted genes against protein and EST databases, which helps to refine the boundaries of predicted exons using the BLAST hits as guides. The program allows the user to change certain parameters, which permits interactive gene structure prediction. As a result, GeneBuilder is sometimes able to predict the gene structure with a good accuracy, even when the similarity of the predicted ORF to a homologous protein sequence is low [569,708].

Splice site prediction

Programs for predicting intron splice sites, which are commonly used as subroutines in the gene prediction tools, can also be used as stand-alone programs to verify positions of splice sites or predict alternative splicing sites. Such programs (Table 4.2) can be particularly useful for predicting non-coding exons, which are commonly missed in gene prediction studies.

Recognition of the splice sites by these programs usually relies on statistical properties of exons and introns and on the consensus sequences of splicing signals. A detailed study of the performance of one such program, SpliceView, showed that, although the fraction of missed splicing signals was relatively low (~5%), the false-positive rate was quite high (typically, one potential splicing signal per 150–250 bases). One should note, however, that such false-positive signals might correspond to rare alternative splice forms or cryptic splice sites (splice sites that are not active in normal genes and become activated as a result of mutations in major splicing signals) [710].

Combining various gene prediction tools

While the first step of gene identification in long genomic sequences utilizes ab initio programs that can rapidly and with reasonable accuracy predict multiple genes, the next step validates these predictions through similarity searches. Predicted genes are compared to nucleotide sequence databases, including EST databases, and protein sequences encoded by these predicted genes are compared to protein sequence databases. These data are then combined with the information about repetitive elements, CpG islands, and transcription factor binding sites and used for further refinement of gene structure. Thus, homology information is ultimately incorporated into every gene prediction pipeline (see above). There are, however, several programs that primarily rely on similarity search for gene prediction (Table 4.3). Although differing in details, they all search for the best alignment of the given piece of DNA to the homologous nucleotide or protein sequences in the database.