BLAST(Basic Local Alignment Search Tool) is heuristic?

BLAST(Basic Local Alignment Search Tool) is heuristic?

We are searching data for your request:

Forums and discussions:
Manuals and reference books:
Data from registers:
Wait the end of the search in all databases.
Upon completion, a link will appear to access the found materials.

How can we say that BLAST is based on a heuristic algorithm, as after finding one common word in the query sequence and a database sequence it performs pairwise alignment by dynamic programming - which is an exhaustive algorithm? Also, BLAST provides quantitative analysis by giving bit scores and E-values. As it gives quantitative results, why do we say it is based on a "heuristic algorithm", such as 'word algorithm'?

You might be confused about what "heuristic" means. "Heuristic" doesn't mean random or arbitrary, instead, an algorithm is termed "heuristic" if it employs a shortcut which means that it does not necessarily yield the theoretically best result.

For BLAST, this shortcut is the assumption of first matching the fixed-length words prior to extending the match. Hypothetically, there could be a query sequence and a database where the best scoring match for the query sequence doesn't contain any k-length sequences in common with the database sequence. In this case, BLAST would be unable to find that match. - So the heuristic portion of the algorithm isn't in the dynamic programming portion, it's in that first step of finding pairs which are to be aligned.

The failure of BLAST to find a match is likely to be rare, though, especially with short word lengths. If you find a decent match, it's highly likely that there is a 3-mer (5-mer, etc.) of identical sequence in the pair. But because it's not a guarantee, the use of k-mer word in the process means that BLAST is a heuristic algorithm.

The ability for BLAST to provide quality and statistical metrics is not limited by its heuristic nature. To build an e-value, you don't need to know that BLAST is able to find the absolute best match. Instead, you just need to be able to build a model of the sort of matches BLAST would find by spurious random chance, when applied to a similar database with no "true" matches. Similarly for bit-score. You don't need to know what the absolute best match would be, you just need to know how BLAST would behave on a hypothetical true-match-free database of similar properties.

Basic Local Alignment Search Tool

A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score. Recent mathematical results on the stochastic properties of MSP scores allow an analysis of the performance of this method as well as the statistical significance of alignments it generates. The basic algorithm is simple and robust it can be implemented in a number of ways and applied in a variety of contexts including straight-forward DNA and protein sequence database searches, motif searches, gene identification searches, and in the analysis of multiple regions of similarity in long DNA sequences. In addition to its flexibility and tractability to mathematical analysis, BLAST is an order of magnitude faster than existing sequence comparison tools of comparable sensitivity.

1. Introduction

The discovery of sequence homology to a known protein or family of proteins often provides the first clues about the function of a newly sequenced gene. As the DNA and amino acid sequence databases continue to grow in size they become increasingly useful in the analysis of newly sequenced genes and proteins because of the greater chance of finding such homologies. There are a number of software tools for searching sequence databases but all use some measure of similarity between sequences to distinguish biologically significant relationships from chance similarities. Perhaps the best studied measures are those used in conjunction with variations of the dynamic programming algorithm (Needleman & Wunsch, 1970 Sellers, 1974 Sankoff & Kruskal, 1983 Waterman, 1984). These methods assign scores to insertions, deletions and replacements, and compute an alignment of two sequences that corresponds to the least costly set of such mutations. Such an alignment may be thought of as minimizing the evolutionary distance or maximizing the similarity between the two sequences compared. In either case, the cost of this alignment is a measure of similarity the algorithm guarantees it is optimal, based on the given scores. Because of their computational requirements, dynamic programming algorithms are impractical for searching large databases without the use of a supercomputer (Gotoh & Tagashira, 1986) or other special purpose hardware (Coulson et al. , 1987).

Rapid heuristic algorithms that attempt to approximate the above methods have been developed (Waterman, 1984), allowing large databases to be searched on commonly available computers. In many heuristic methods the measure of similarity is not explicitly defined as a minimal cost set of mutations, but instead is implicit in the algorithm itself. For example, the FASTP program (Lipman & Pearson, 1985 Pearson & Lipman, 1988) first finds locally similar regions between two sequences based on identities but not gaps, and then rescores these regions using a measure of similarity between residues, such as a PAM matrix (Dayhoff et al. , 1978) which allows conservative replacements as well as identities to increment the similarity score. Despite their rather indirect approximation of minimal evolution measures, heuristic tools such as FASTP have been quite popular and have identified many distant but biologically significant relationships.

In this paper we describe a new method, BLAST (Basic Local Alignment Search Tool), which employs a measure based on well-defined mutation scores. It directly approximates the results that would be obtained by a dynamic programming algorithm for optimizing this measure. The method will detect weak but biologically significant sequence similarities, and is more than an order of magnitude faster than existing heuristic algorithms.

2. Methods

(a) The maximal segment pair measure

Sequence similarity measures generally can be classified as either global or local. Global similarity algorithms optimize the overall alignment of two sequences, which may include large stretches of low similarity (Needleman & Wunsch, 1970). Local similarity algorithms seek only relatively conserved subsequences, and a single comparison may yield several distinct subsequence alignments unconserved regions do not contribute to the measure of similarity (Smith & Waterman, 1981 goad & Kanehisa, 1982 Sellers, 1984). Local similarity measures are generally preferred for database searches, where cDNAs may be compared with partially sequenced genes, and where distantly related proteins may share only isolated regions of similarity, e.g. in the vicinity of an active site.

Many similarity measures, including the one we employ, begin with a matrix of similarity scores for all possible pairs of residues. Identities and conservative replacements have positive scores, while unlikely replacements have negative scores. For amino acid sequence comparisons we generally use the PAM-120 matrix (a variation of that of Dayhoff et al. , 1978), while for DNA sequence comparisons we score identities +5, and mismatches -4 other scores are of course possible. A sequence segment is a contiguous stretch of residues of any length, and the similarity score for two aligned segments of the same length is the sum of the similarity values for each pair of aligned residues.

Given these rules, we define a maximal segment pair (MSP) to be the highest scoring pair of identical length segments chosen from 2 sequences. The boundaries of an MSP are chosen to maximize its score, so an MSP may be of any length. The MSP score, which BLAST heuristically attempts to calculate, provides a measure of local similarity for any pair of sequences. A molecular biologist, however, may be interested in all conserved regions shared by 2 proteins, not only in their highest scoring pair. We therefore define a segment pair to be locally maximal if its score cannot be improved either by extending or by shortening both segments (Sellers, 1984). BLAST can seek all locally maximal segment pairs with scores above some cutoff.

Like many other similarity measures, the MSP score for 2 sequences may be computed in time proportional to the product of their lengths using a simple dynamic programming algorithm. An important advantage of the MSP measure is that recent mathematical results allow the statistical significance of MSP scores to be estimated under an appropriate random sequence model (Karlin & Altschul, 1990 Karlin et al. , 1990). Furthermore, for any particular scoring matrix ( e.g. PAM-120) one can estimate the frequencies of paired residues in maximal segments. This tractability to mathematical analysis is a crucial feature of the BLAST algorithm.

(b) Rapid approximation of MSP scores

In searching a database of thousands of sequences, generally only a handful, if any, will be homologous to the query sequence. The scientist is therefore interested in identifying only those sequence entries with MSP scores over some cutoff score S . These sequences include those sharing highly significant similarity with the query as well as some sequences with borderline scores. This latter set of sequences may include high scoring random matches as well as sequences distantly related to the query. the biological significance of the high scoring sequences may be inferred almost solely on the basis of the similarity score, while the biological context of the borderline sequences may be helpful in distinguishing biologically interesting relationships.

Recent results (Karlin & Altschul, 1990 Karlin et al. , 1990) allow us to estimate the highest MSP score S at which chance similarities are likely to appear. To accelerate database searches, BLAST minimizes the time spent on sequence regions whose similarity with the query has little chance of exceeding this score. Let a word pair be a segment pair of fixed length w . The main strategy of BLAST is to seek only segment pairs that contain a word pair with a score of at least T . Scanning through a sequence, one can determine quickly whether it contains a word of length w that can pair with the query sequence to produce a word pair with a score greater than or equal to the threshold T . Any such hit is extended to determine if it is contained within a segment pair whose score is greater than or equal to S . The lower the threshold T , the greater the chance that a segment pair with a score of at least S will contain a word pair with a score of at least T . A small value for T , however, increases the number of hits and therefore the execution time of the algorithm. Random simulation permits us to select a threshold T that balances these considerations.

In our implementation of this approach, details of the 3 algorithmic steps (namely compiling a list of high-scoring words, scanning the database for hits, and extending hits) vary somewhat depending on whether the database contains proteins or DNA sequences. For proteins, the list consists of all words ( w -mers) that score at least T when compared to some word in the query sequence. Thus, a query word may be represented by no words in the list ( e.g. for common w -mers using PAM-120 scores) or by many. (One may, of course, insist that every w -mer in the query sequence be included in the word list, irrespective of whether pairing the word with itself yields a score of a least T .) For values of w and T that we have found most useful (see below), there are typically of the order of 50 words in the list for every residue in the query sequence, e.g. 12,500 words for a sequence of length 250. If a little care is taken in programming, the list of words can be generated in time essentially proportional to the length of the list.

The scanning phase raised a classic algorithmic problem, i.e. search a long sequence for all occurrences of certain short sequences. We investigated 2 approaches. Simplified, the first works as follows. suppose that w = 4 and map each word to an integer between 1 and 20 4 , so a word can be used as an index into an array of size 20 4 = 160,000. Let the i th entry of such an array point to the list of all occurrences in the query sequence of the i th word. Thus, as we scan the database, each database word leads us immediately to the corresponding hits. Typically, only a few thousand of the 20 4 possible words will be in this table, and it is easy to modify the approach to use far fewer than 20 4 pointers.

The second approach we explored for the scanning phase was the use of a deterministic finite automaton or finite state machine (Mealy, 1955 Hopcroft & Ullman, 1979). An important feature of our construction was to signal acceptance on transitions (Mealy paradigm) as opposed to on states (Moore paradigm). In the automaton's construction, this saved a factor in space and time roughly proportional to the size of the underlying alphabet. This method yielded a program that ran faster and we prefer this approach for general use. With typical query lengths and parameter settings, this version of BLAST scans a protein database at approximately 500,00 residues/s.

Extending a hit to find a locally maximal segment pair containing that hit is straightforward. To economize time, we terminate the process of extending in one direction when we reach a segment pair whose score falls a certain distance below the best score found for shorter extensions. This introduces a further departure from the ideal of finding guaranteed MSPs, but the added inaccuracy is negligible, as can be demonstrated by both experiment and analysis ( e.g. for protein comparisons the default distance is 20, and the probability of missing a higher scoring extension is about 0.001).

For DNA, we use a simpler word list, i.e. the list of all contiguous w -mers in the query sequence, often with w = 12. Thus, a query sequence of length n yields a list of n-w+ 1 words, and again there are commonly a few thousand words in the list. It is advantageous to compress the database by packing 4 nucleotides into a single byte, using an auxiliary table to delimit the boundaries between adjacent sequences. Assuming w &ge 11, each hit must contain an 8-mer hit that lies on a byte boundary. This observation allows us to scan the database byte-wise and thereby increase speed 4-fold. For each 8-mer hit, we check for an enclosing w -mer hit if found, we extend as before. Running on a SUN4, with a query of typical length ( e.g. several thousand bases), BLAST scans at approximately 2 x 10 6 bases/s. At facilities which run many such searches a day, loading the compressed database into memory once in a shared memory scheme affords substantial saving in subsequent search times.

It should be noted that DNA sequences are highly non-random, with locally biased base composition ( e.g. A+T-rich regions), and repeated sequence elements ( e.g. Alu sequences) and this has important consequences for the design of a DNA database search tool. If a given query sequence has, for example, an A+T-rich subsequence, or a commonly occurring repetitive element, then a database search will produce a copious output of matches with little interest. We have designed a somewhat ad hoc but effective means of dealing with these 2 problems. The program that produces the compressed version of the DNA database tabulates the frequencies of all 8-tuples. Those occurring much more frequently than expected by chance (controllable by parameter) are stored and used to filter &ldquouninformative&rdquo words from the query word list. Also, preceding full database searches, a search of a sublibrary of repetitive elements is performed, and the locations in the query of significant matches are stored. Words generated by these regions are removed from the query word list for the full search. Matches to the sublibrary, however, are reported in the final output. These 2 filters allow alignments to regions with biased composition, or to regions containing repetitive elements to be reported, as long as adjacent regions not containing such features share significant similarity to the query sequence.

The BLAST strategy admits numerous variations. We implemented a version of BLAST that uses dynamic programming to extend hits so as to allow gaps in the resulting alignments. Needless to say, this greatly slows the extension process. While the sensitivity of amino acid searches was improved in some cases, the selectivity was reduced as well. Given the trade-off of speed and selectivity for sensitivity, it is questionable whether the gap version of BLAST constitutes an improvement. We also implemented the alternative of making a table of all occurrences of the w -mers in the database, then scanning the query sequence and processing hits. The disk space requirements are considerable, approximately 2 computer words for every residue in the database. More damaging was that for query sequences of typical length, the need for random access into the database (as opposed to sequential access) made the approach slower, on the computer systems we used, than scanning the entire database.

3. Results

To evaluate the utility of our method, we describe theoretical results about the statistical significance of MSP scores, study the accuracy of the algorithm for random sequences at approximating MSP scores, compare the performance of the approximation to the full calculation on a set of related protein sequences and, finally, demonstrate its performance comparing long DNA sequences.

(a) Performance of BLAST with random sequences

Theoretical results on the distribution of MSP scores from the comparison of random sequences have recently become available (Karlin & Altschul, 1990 Karlin et al. , 1990). In brief, given a set of probabilities for the occurrence of individual residues, and a set of scores for aligning pairs of residues, the theory provides two parameters &lambda and K for evaluating the statistical significance of MSP scores. When two random sequences of lengths m and n are compared, the probability of finding a segment pair with a score greater than or equal to S is:

where y = Kmn e -&lambdaS . More generally, the probability of finding c or more distinct segment pairs, all with a score of at least S , is given by the formula:

1 - e -y Sum_i=0-> c -1 ( y i /i !). (2)

Using this formula, two sequences that share several distinct regions of similarity can sometimes be detected as significantly related, even when no segment pair is statistically significant in isolation.

While finding an MSP with a p -value of 0.001 may be surprising when two specific sequences are compared, searching a database of 10,000 sequences for similarity to a query sequence is likely to turn up ten such segment pairs simply by chance. Segment pair p -values must be discounted accordingly when the similar segments are discovered through blind database searches. Using formula (1), we can calculate the approximate score an MSP must have to be distinguishable from chance similarities found in a database.

We are interested in finding only segment pairs with a score above some cutoff S . The central idea of the BLAST algorithm is to confine attention to segment pairs that contain a word pair of length w with a score of at least T . It is therefore of interest to know what proportion of segment pairs with a given score contain such a word pair. This question makes sense only in the context of some distribution of high-scoring segment pairs. For MSPs arising from the comparison of random sequences, Dembo & Karlin (1991) provide such a limiting distribution. Theory does not yet exist to calculate the probability q that such a segment pair will fail to contain a word pair with a score of at least T . However, one argument suggests that q should depend exponentially upon the score of the MSP. Because the frequencies of paired letters in MSPs approaches a limiting distribution (Karlin & Altschul, 1990), the expected length of an MSP grows linearly with its score. Therefore, the longer an MSP, the more independent chances it effectively has for containing a word with a score of at least T , implying that q should decrease exponentially with increasing MSP score S .

To test this idea, we generated one million pairs of &ldquorandom protein sequences&rdquo (using typical amino acid frequencies) of length 250, and found the MSP for each using PAM-120 scores. In Figure 1, we plot the logarithm of the fraction q of MSPs with score S that do not contain a word pair of length four with score at least 18. Since the values shown are subject to statistical variation, error bars represent one standard deviation. A regression line is plotted, allowing for heteroscedasticity (differing degrees of accuracy of the y -values). The correlation coefficient for -ln( q ) and S is 0.999, suggesting that for practical purposes our model of the exponential dependence of q upon S is valid.

We repeated this analysis for a variety of word lengths and associated values of T . Table 1 shows the regression parameters a and b found for each instance the correlation was always greater than 0.995. Table 1 also shows the implied percentage q = e -( aS+b ) of MSPs with various scores that would be missed by the BLAST algorithm. These numbers are of course properly applicable only to chance MSPs. However, using a log-odds score matrix such as the PAM-120 that is based upon empirical studies of homologous proteins, high-scoring chance MSPs should resemble MSPs that reflect true homology (Karlin & Altschul, 1990). Therefore, Table 1 should provide a rough guide to the performance of BLAST on homologous as well as chance MSPs.

Based on the results of Karlin et al. (1990), Table 1 also shows the expected number of MSPs found when searching a random database of 16,000 length 250 protein sequences with a length 250 query. (These numbers were chosen to approximate the current size of the PIR database and the length of an average protein.) As seen from Table 1, only MSPs with a score over 55 are likely to be distinguishable from chance similarities. With w = 4 and T = 17, BLAST should miss only about a fifth of the MSPs with this score, and only about a tenth of MSPs with a score near 70. We will consider below the algorithm's performance on real data.

(b) The choice of word length and threshold parameters

On what basis do we choose the particular setting of the parameters w and T for executing BLAST on real data? We begin by considering the word length w .

The time required to execute BLAST is the sum of the times required (1) to compile a list of words that can score at least T when compared with words from the query (2) to scan the database for hits ( i.e. matches to words on this list) and (3) to extend all hits to seek segment pairs with scores exceeding the cutoff. The time for the last of these tasks is proportional to the number of hits, which clearly depends on the parameters w and T . given a random protein model and a set of substitution scores, it is simple to calculate the probability that two random words of length w w will have a score of at least T , i.e. the probability of a hit arising from an arbitrary pair of words in the query and the database. Using the random model and scores of the previous section, we have calculated these probabilities for a variety of parameter choices and recorded them in Table 1. For a given level of sensitivity (chance of missing an MSP), one can ask what choice of w minimizes the chance of a hit. Examining Table 1, it is apparent that the parameter pairs ( w = 3, T = 14), ( w = 4, T = 16) and ( w = 5, T = 18) all have approximately equivalent sensitivity over the relevant range of cutoff scores. The probability of a hit yielded by these parameter pairs is seen to decrease for increasing w the same also holds for different levels of sensitivity. This makes intuitive sense, for the longer the word pair examined the more information gained about potential MSPs. Maintaining a given level of sensitivity, we can therefore decrease the time spent on step (3), above, by increasing the parameter w . However, there are complementary problems created by large w . For proteins there are 20 w possible words of length w , and for a given level of sensitivity the number of words generated by a query grows exponentially with w . (For example, using the 3 parameter pairs above, a 30 residue sequence was found to generate word lists of size 296, 3561 and 40,939 respectively.) This increases the time spent on step (1), and the amount of memory required. In practice, we have found that for protein searches the best compromise between these considerations is with a word size of four this is the parameter setting we use in all analyses that follow.

Although reducing the threshold T improves the approximation of MSP scores by BLAST, it also increases execution time because there will be more words generated by the query sequence and therefore more hits. What value of T provides a reasonable compromise between the considerations of sensitivity and time? To provide numerical data, we compared a random 250 residue sequence against the entire PIR database (Release 23.0, 14,372 entries and 3,977,903 residues) with T ranging from 20 to 13. In Figure 2 we plot the execution time (user time on a SUN4-280) versus the number of words generated for each value of T . Although there is a linear relationship between the number of words generated and execution time, the number of words generated increases exponentially with decreasing T over this range (as seen by the spacing of x values). This plot and a simple analysis reveal that the expected-time computational complexity of BLAST is approximately aW + bN + cNW/ 20 w , where W is the number of words generated, N is the number of residues in the database and a , b and c are constants. The W term accounts for compiling the word list, the N term covers the database scan, and the NW term is for extending the hits. Although the number of words generated, W , increases exponentially with decreasing T , it increases only linearly with the length of the query, so that doubling the query length doubles the number of words. We have found in practice that T = 17 is a good choice for the threshold because, as discussed below, lowering the parameter further provides little improvement in the detection of actual homologies.

BLAST's direct tradeoff between accuracy and speed is best illustrated by Table 2. Given a specific probability q of missing a chance MSP with score S , one can calculate what threshold parameter T is required, and therefore the approximate execution time. Combining the data of Table 1 and Figure 2, Table 2 shows the central processing unit times required (for various values of q and S ) to search the current PIR database with a random query sequence of length 250. To have about a 10% chance of missing an MSP with the statistically significant score of 70 requires about nine seconds of central processing unit time. To reduce the chance of missing such an MSP to 2% involves lowering T , thereby doubling the execution time. Table 2 illustrates, furthermore, that the higher scoring (and more statistically significant) an MSP, the less time is required to find it with a given degree of certainty.

(c) Performance of BLAST with homologous sequences

To study the performance of BLAST on real data, we compared a variety of proteins with other members of their respective superfamilies (Dayhoff, 1978), computing the true MSP scores as well as the BLAST approximation with word length four and various settings of the parameter T . Only with superfamilies containing many distantly related proteins could we obtain results usefully comparable with the random model of the previous section. Searching the globins with woolly monkey myoglobin (PIR code MYMQW), we found 178 sequences containing MSPs with scores between 50 and 80. Using word length four and T parameter 17, the random model suggests BLAST should miss about 24 of these MSPs in fact, it misses 43. This poorer than expected performance is due to the uniform pattern of conservation in the globins, resulting in a relatively small number of high-scoring words between distantly related proteins. A contrary example was provided by comparing the mouse immunoglobulin &kappa chain precursor V region (PIR code KVMST1) with immunoglobulin sequences, using the same parameters as previously. Of the 33 MSPs with scores between 45 and 65, BLAST missed only two the random model suggests it should have missed eight. In general, the distribution of mutations along sequences has been shown to be more clustered than predicted by a Poisson process (Uzzell & Corbin, 1971), and thus the BLAST approximation should, on average, perform better on real sequences than predicted by the random model.

BLAST's great utility is for finding high-scoring MSPs quickly. In the examples above, the algorithm found all but one of the 89 globin MSPs with a score over 80, and all of the 125 immunoglobulin MSPs with a score over 50. The overall performance of BLAST depends upon the distribution of MSP scores for those sequences related to the query. In many instances, the bulk of the MSPs that are distinguishable from chance have a high enough score to be found readily by BLAST, even using relatively high values of the T parameter. Table 3 shows the number of MSPs with a score above a given threshold found by BLAST when searching a variety of superfamilies using a variety of T parameters. In each instance, the threshold S is chosen to include scores in the borderline region, which in a full database search would include chance similarities as well as biologically significant relationships. Even with T equal to 18, virtually all the statistically significant MSPs are found in most instances.

Comparing BLAST (with parameters w = 4, T = 17) to the widely used FASTP program (Lipman & Pearson, 1985 Pearson & Lipman, 1988) in its most sensitive mode ( ktup = 1), we have found that BLAST is of comparable sensitivity, generally yields fewer false positives (high-scoring but unrelated matches to the query), and is over an order of magnitude faster.

(d) Comparison of two long DNA sequences

Sequence data exist for a 73,360 bp section of the human genome containing the &beta-like globin gene cluster and for a corresponding 44,595 bp section of the rabbit genome (Margot et al ., 1989). The pair exhibits three main classes of locally similar regions, namely genes, long interspersed repeats and certain anticipated weaker similarities, as described below. We used the BLAST algorithm to locate locally similar regions that can be aligned without introduction of gaps.

The human gene cluster contains six globin genes, denoted &epsilon, G &gamma , A &gamma , &eta , &delta and &beta . (Actually, rabbit &delta is a pseudogene.) Each of the 24 gene pairs, one human gene and one rabbit gene, constitutes a similar pair. An alignment of such a pair requires insertion and deletions, since the three exons of one gene generally differ somewhat in their lengths from the corresponding exons of the paired gene, and there are even more extensive variations among the introns. Thus, a collection of the highest scoring alignments between similar regions can be expected to have at least 24 alignments between gene pairs.

Mammalian genomes contain large numbers of long interspersed repeat sequences, abbreviated LINES . In particular, the human &beta -like globin cluster contains two overlapped L1 sequences (a type of LINE ) and the rabbit cluster has two tandem L1 sequences in the same orientation, both around 6000 bp in length. These human and rabbit L1 sequences are quite similar and their lengths make them highly visible in similarity computations. In all, eight L1 sequences have been cited in the human cluster and five in the rabbit cluster, but because of their reduced length and/or reversed orientation, the other published L1 sequences do not affect the results discussed below. Very recently, another piece of an L1 sequence has been discovered in the rabbit cluster (Huang et al. , 1990).

Evolution theory suggests that an ancestral gene cluster arranged as 5'- &epsilon-&gamma-&eta-&delta-&beta -3' may have existed before the mammalian radiation. Consistent with this hypothesis, there are inter-gene similarities within the &beta clusters. For example, there is a region between human &epsilon and G &gamma that is similar to a region between rabbit &epsilon and &gamma .

We applied a variant of the BLAST programs to these two sequences, with match score 5, mismatch score -4 and, initially, w = 12. The program found 98 alignments scoring over 200, with 1301 being the highest score. Of the 57 alignments scoring over 350, 45 paired genes (with each of the 24 possible gene pairs represented) and the remaining 12 involved L1 sequences. Below 350, inter-gene similarities (as described above) appear, along with additional alignments of genes and of L1 sequences. Two alignments with scores between 200 and 350 do not fit the anticipated pattern. One reveals the newly discovered section of L1 sequence. The other aligns a region immediately 5' from the human &beta gene with a region just 5' from rabbit &delta . This last alignment may be the result of an intrachromosomal gene conversion between &delta and &beta in the rabbit genome (Hardison & Margot, 1984).

With smaller values of w , more alignments are found. In particular, with w = 8, an additional 32 alignments are found with a score above 200. All of these fall in one of the three classes discussed above. Thus, use of a smaller w provides no essentially new information. The dependence of various values on w is given in Table 4. Time is measured in seconds on a SUN4 for a simple variant of BLAST that works with uncompressed DNA sequences.

4. Conclusion

The concept underlying BLAST is simple and robust and therefore can be implemented in a number of ways and utilized in a variety of contexts. As mentioned above, one variation is to allow for gaps in the extension step. For the applications we have had in mind, the tradeoff in speed proved unacceptable, but this may not be true for other applications. We have implemented a shared memory version of BLAST that loads the compressed DNA file into memory once, allowing subsequent searches to skip this step. We are implementing a similar algorithm for comparing a DNA sequence to the protein database, allowing translation in all six reading frames. This permits the detection of distant protein homologies even in the face of common DNA sequencing errors (replacements and frame shifts). C.B. Lawrence (personal communication) has fashioned score matrices derived from consensus pattern matching methods (Smith & Smith, 1990), and different from the PAM-120 matrix used here, which can greatly decrease the time of database searches for sequence motifs.

The BLAST approach permits the construction of extremely fast programs for database searching that have the further advantage of amenability to mathematical analysis. Variations of the basic idea as well as alternative implementations, such as those described above, can adapt the method for different contexts. Given the increasing size of sequence databases, BLAST can be a valuable tool for the molecular biologist. A version of BLAST in the C programming language is available from the authors upon request (write to W. Gish) it runs under both 4.2 BSD and AT&T System V UNIX operating systems.

W.M. is supported in part by NIH grant LM05110, and E.W.M. is supported in part by NIH grant LM04960.


Collins JF, Coulson AF, Lyall A. (1987). Comput. J. 30 :420-424.

Dayhoff MO. (1978). Editor of Atlas of Protein Sequence and Structure , vol. 5, suppl. 3, Nat. Biomed. Res. Found., Washington, DC.

dayhoff MO, Schwartz RM, Orcutt BC. (1978). In Atlas of Protein Sequence and Structure (Dayhoff MO ed.), vol. 5, suppl. 3, pp. 345-352, Nat. Biomed. Res. Found., Washington, DC.

Dembo A, Karlin S. (1991). Ann. Prob. in the press.

Goad WB, Kanehisa MI. (1982). Nucl. Acids Res. 10 :247-263.

Hardison RC, Margot JB. (1984). Mol. Biol. Evol. 1 :302-316.

Hopcroft JE, Ullman JD. (1979). In Introduction to Automata Theory, Languages, and Computation , pp. 42-45, Addison-Wesley, Reading, MA.

Huang X, Hardison RC, Miller W. (1990). Comput. Appl. Biosci. In the press.

Karlin S, Altschul SF. (1990). Proc. Natl. Acad. Sci. U.S.A. 87 :2264-2268.

Karlin S, Dembo A, Kawabata T. (1990). Ann. Stat. 18 :571-581.

Lipman DJ, Pearson WR. (1985). Science 227 :1435-1441.

Margot JB, Demers GW, Hardison RC. (1989). J. Mol. Biol. 205 :15-40.

Mealy GH. (1955). Bell System Tech. J. 34 :1045-1079.

Needleman SB, Wunsch CD. (1970). J. Mol. Biol. 48 :443-453.

Pearson WR, Lipman DJ. (1988). Proc. Natl. Acad. Sci. U.S.A. 85 :2444-2448.

Sankoff D, Kruskal JB. (1988). Time Warps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparison , Addison-Wesley, Reading, MA.

Sellers PH. (1974). SIAM J. Appl. Math. 26 :787-793.

Sellers PH. (1984). Bull. Math. Biol. 46 :501-514.

Smith RF, Smith TF. (1990). Proc. Natl. Acad. Sci. U.S.A. 87 :118-122.

Smith TF, Waterman MS. (1981). Advan. Appl. Math. 2 :482-489.

Uzzell T, Corbin KW. (1971). Science 172 :1089-1096.


Table 1. The probability of a hit at various settings of the parameters w and T , and the proportion of random MSPs missed by BLAST

Table 2. The central processing unit time required to execute BLAST as a function of the approximate probability q of missing an MSP with score S

Table 3. The number of MSPs found by BLAST when searching various protein superfamilies in the PIR database (Release 22.0)

Table 4. The time and sensitivity of BLAST on DNA sequences as a function of w


Figure 1. The probability q of BLAST missing a random maximal segment pair as a function of its score S .

Figure 2. The central processing unit time required to execute BLAST on the PIR protein database (Release 23.0) as a function of the size of the word list generated. Points correspond to values of the threshold parameter T ranging from 13 to 20. Greater values of T imply fewer words in the list.

The BLAST algorithm

  1. Split query into overlapping words of length W (the W-mers)
  2. Find a &ldquoneighborhood&rdquo of similar words for each word (see below)
  3. Lookup each word in teh neighborhood in a hash table to find the location in the database where each word occurs. Call these the seeds, and let S be the collection of seeds.
  4. Extend the seeds in S until the score of the alignment drops off below some threshold X.
  5. Report matches with overall highest scores

Figure 3.13: The BLAST Algorithm

The pre-processing step of BLAST makes sure that all substrings of W nucleotides will be included in our database (or in a hash table). These are called the W -mers of the database. As in step 1, we first split the query by looking at all substrings of W consecutive nucleotides in the query. To find the neighborhood of these W-mers, we then modify these sequences by changing them slightly and computing their similarity to the original sequence. We generate progressively more dissimilar words in our neighborhood until our similarity measure drops below some threshold T. This affords us flexibility to find matches that do not have exactly W consecutive matching characters in a row, but which do have enough matches to be considered similar, i.e. to meet a certiain threshold score.

Then, we look up all of these words in our hash table to find seeds of W consecutive matching nucleotides. We then extend these seeds to find our alignment using the Smith-Waterman algorithm for local alignment, until the score drops below a certain threshold X. Since the region we are considering is a much shorter segment, this will not be as slow as running the algorithm on the entire DNA database.

It is also interesting to note the influence of various parameters of BLAST on the performance of the algorithm vis-a-vis run-time and sensitivity:

  • W Although large W would result in fewer spurious hits/collisions, thus making it faster, there are also tradeoffs associated, namely: a large neighborhood of slightly different query sequences, a large hash table, and too few hits. On the other hand, if W is too small, we may get too many hits which pushes runtime costs to the seed extension/alignment step.
  • T If T is higher, the algorithm will be faster, but you may miss sequences that are more evolutionarily distant. If comparing two related species, you can probably set a higher T since you expect to find more matches between sequences that are quite similar.
  • X Its influence is quite similar to T in that both will control the sensitivity of the algorithm. While W and T affect the total number of hits one gets, and hence affect the runtime of the algorithm dramatically, setting a really stringent X despite less stringent W and T, will result runtime costs from trying unnecessary sequences that would not meet the stringency of X. So, it is important to match the stringency of X with that of W and T to avoid unnecessary computation time.


Low-complexity regions have fewer sequence characters in them because of repeats of the same sequence character or pattern. These sequences produce artificially high-scoring alignments that do not accurately convey sequence relationships in sequence similarity searches. Regions of low complexity or repetitive sequences may be readily visualized in a dot matrix analysis of a sequence against itself. Low-complexity regions with a repeat occurrence of the same residue can appear on the matrix as horizontal and vertical rows of dots representing repeated matches of one residue position in one copy of the sequence against a series of the same residue in the second copy. Repeats of a sequence pattern appear in the same matrix as short diagonals of identity that are offset from the main diagonal. Such sequences should be excluded from sequence similarity searches.

The BLAST programs include a feature for filtering the query sequence through programs that search for low-complexity regions. Filtering is applied only to the query sequence and not to the database sequences. Low-complexity regions are marked with an X (protein sequences) or N (nucleic acid sequences) and are then ignored by the BLAST program. Removing low-complexity and repeat sequences increases emphasis on the more significant database hits. The NCBI programs SEG and PSEG are used to mask amino acid sequences, and NSEG is used to mask nucleic acid sequences (Wootten and Federhen 1993, 1996). The SEG programs are available by anonymous FTP from, including documentation. The program DUST is also used for DNA sequences (see Filter under BLAST Search Parameters at RepeatMasker (described later) is another program for this same purpose.

The compositional complexity in a window of sequence of length L is given by (Wootten and Federhen 1996):

where N is 4 for nucleic acid sequences and 20 for protein sequences, and are the numbers of each residue in the window. K will vary from 0 for very low complexity to 1 for high complexity. Thus, complexity is given by:

Compositional complexities are sometimes calculated to produce K scores in bit units of logarithms to the base 2. A sliding window (usually 12 residues) is moved along the sequence, and the complexity is calculated at each position. Regions of low complexity are identified using Equation 1, neighboring low-complexity regions are then joined into longer regions, and the resulting region is then reduced to a single optimal segment by a minimization procedure. The SEG program is used for analysis of either proteins or nucleic acids by the above methods. PSEG and NSEG are similar to SEG but are set up for analysis of protein and nucleic acid sequences, respectively. These versatile programs may also be used for locating specific sequence patterns that are characteristic of exons or protein structural domains. In database searches involving comparisons of genomic DNA sequences with EST sequence libraries, use of repeat masking is important for filtering output to the most significant matches because of the presence of a variety of repetitive sequences ranging from mononucleotide repeats to larger repeated elements in genomes (Claverie 1996).

In addition to low-complexity regions, BLAST will also filter out repeat elements (such as human SINE and LINE retroposons). Another filtering program for repeats of periodicity <10 residues called XNU (Claverie and States 1993) is used by the BLAST stand-alone programs, but is not available on the NCBI server.

Another important Web server, RepeatMasker (, screens sequences for interdispersed repeats known to be present in mammalian genomes and also can filter out low-complexity regions (A.F.A. Smeet and P. Green, see Web site above). A dynamic programming search program, cross-match (P. Green, see Web site), performs a search of a repeat database with the query sequence (Claverie 1996). A database of repetitive elements (Repbase) maintained at by the Genetics Information Research Institute (Jurka 1998) can also be used for this purpose.

Behind the scenes of BLAST

The NCBI estimates that about 200,000 “queries” (that’s your submission of a sequence) are made every week. However, depending on how many sequences you enter and how long those sequences are, you can get results back in a few minutes, possibly a handful of seconds.

BLAST works by detecting local alignments between sequences that work the best. The BLAST computers start with a small set of three letters, which they call the “query word.” These letters will represent three amino acids or nucleotides, in a specific order (for example, the nucleotides ATC, in that order). The BLAST search then looks for the number of times (and places along the sequence) in which this three-letter “word” appears. It will also look for closely related “words” in which one letter is different. Then, each query is scored to determine which database is “in the neighborhood” of your sample.

BLAST(Basic Local Alignment Search Tool) is heuristic? - Biology

BIO306 Genetics: Lab BLAST Assignment (35 pts)

You have already obtained the DNA sequence that encodes the normal (“wild type”) version of the mutant gene (or genes) you are studying. There are several predictions that can be made using this information. The simplest is to predict the amino acids that are encoded by the sequence. Analysis of the amino acid sequence of a protein can yield information about the possible functions and structures found in the protein.

Another form of analysis is to compare the DNA sequence to other known sequences. This can indicate how conserved a gene may be between different species. Often, if we do not know the function of our gene, we can use the comparison of our DNA sequence with those previously described from other species to predict a similar gene whose function may be known, or at least more easily discovered.

In this experiment you will use your gene’s DNA sequence to do a BLASTN (BLAST, basic local alignment search tool, N, nucleotide) search of the Genbank database. The list of results will consist of sequences from other non-Drosophila organisms.

The output of the BLASTN search is a table of sequences and scores. We will only focus on two columns of information from this table. In the second column of the table, the name of the DNA sequence is listed as entered by the researchers who submitted the sequence. Further to the right is a column labeled “E value”. The E value is a score that predicts the likelihood that your sequence and the sequence in the table aligned by random chance.

Click here for instructions on how to perform the BLASTN search.

Click here for an example of a search with an analysis of the data.

Click here to download the questions that you must answer for the assignment.

Click here to access the papers needed to determine the wild-type function of your fly genes.


Reading a book such as this brings home how much BLAST-now in its teenage years-has grown, and provides an occasion for fond reflection. BLAST was born in the first months of 1989 at the National Center for Biotechnology Information (NCBI). The Center had been created at the National Institutes of Health in November 1988, by an act of the U.S. Congress, to foster the development of a field that then had no widely accepted name, but which has since come to be known as "Bioinformatics." In early 1989, David Lipman, my post-doctoral advisor, who at the time was perhaps best known as a codeveloper of the FASTA program, was appointed director of NCBI. On the first of March we moved into new offices at the National Library of Medicine.The NCBI was small, but had large ambitions, and already a number of friends. Several of these well-wishers made it a point to drop by for a visit. Gene Myers, a computer scientist then at Arizona, arrived during a week in which Science was hyping a special-purpose computer chip for sequence comparison. He and David, software partisans both, were unimpressed and over dinner resolved to do better. Their original idea was to find not subtle sequence similarities, but fairly obvious ones, and to do it in a flash. Gene pursued a rigorous approach at first, but David, with a fine Darwinian wisdom, was willing to settle for imperfection. If one were to gamble, what kind of match could one expect a strong alignment to contain? Detailed algorithmic and code development on BLAST by Webb Miller-later to be joined by Warren Gish-had hardly begun before Sam Karlin, a Stanford mathematician, came calling. I had approached him a few months earlier with a conjecture concerning the asymptotic behavior of optimal ungapped local sequence alignments. Since then, he had spun this conjecture into a beautiful theory. Now, for the first time, rigorous statistics were available for alignment scoring systems of more than academic interest, and the essential nature of amino acid substitution matrices also began to come into clear focus. This theory dovetailed perfectly with the work that had just started on BLAST: both informing the selection of its algorithmic parameters, and yielding units for the alignment scores produced.

Although David chose BLAST's name as a bit of a pun on "FASTA" (it was only later that I realized "BLAST" to be an acronym), the new program was never intended to vie with the earlier one. Rather, the idea was to turn the "threshold parameter" way up, to find undoubted homologies before you take more than one sip of coffee. It surprised us all when BLAST started returning most weak similarities as well. Thus was born a sort of friendly competition with Bill Pearson's and David's earlier creation. From the start, BLAST had two major advantages to FASTA and one major disadvantage. In the plus column, BLAST was indeed much the faster, and it also boasted Sam's new statistics, which turned raw scores into E-values. However, BLAST could only produce ungapped local alignments, thereby often eliding large regions of similarity and sometimes completely missing weak alignments that FASTA, in its most sensitive but slowest mode, was able to find. These points of comparative advantage were healthy for both programs. In time, FASTA fit its scores to the extreme value distribution, yielding reliable statistical evaluations of its output. And by the mid '90s, Warren Gish's WU-BLAST from Washington University, and NCBI's BLAST releases, introduced gapped alignments, using differing algorithmic strategies. The result, at least for protein sequence comparisons, is that BLAST and FASTA have converged in many important ways, although there still remain significant differences.

The programs comprehended by the name "BLAST" have multiplied astonishingly in the nearly 15 years since the first one was conceived. Learning the best way to use these various programs for research can be a challenge, and this book is a significant aid.While BLAST's developers have done their best to make the programs' default behavior the most generally applicable, a sophisticated user still has many issues to consider.

To achieve speed, BLAST is a heuristic program. It isn't guaranteed to find every local alignment that passes its reporting criteria, and there are an array of parameters that control the shortcuts it takes.With the introduction of gapped alignments, the programs' complexity increased, as did the number of parameters that influence BLAST's tradeoff of speed and sensitivity. In a certain sense, however, these mechanics are the least important for a user to understand because, except for the occasional appearance or disappearance of a weak similarity, they don't greatly effect the programs' output. Perhaps of more importance is an understanding of attendant matters that are relevant to the effective use of any local alignment search method, such as the filtering of "low-complexity" sequence regions, the proper choice of scoring systems, and the correct interpretation of statistical significance. This book deals with these and many other matters, and nicely combines theoretical considerations with practical advice informed by these considerations.

The BLAST programs have been the fruit of much hard work by scores of talented programmers and scientists. This work continues, linking BLAST output to other databases, improving alignment formatting options, refining the types of queries that may be performed. Newer offshoots, such as PSI-BLAST for protein profile searches, also continue under development, and BLAST is thus a moving and a growing target. This book should prove a valuable guide for those wishing to use the programs to best effect.

Primer-BLAST now designs primers for a group of related sequences

Primer-BLAST now has a “Primers common for a group of sequences” submission tab that allows you to design primers for a group of highly similar sequences. For example, you may want test for expression of any transcript of gene rather than a specific splice variant, so you want to design primers to cover all transcript variants. Or you may want to design primers that will amplify the same gene in closely related bacteria strains. To find primers for a group of related sequences, Primer-BLAST aligns the longest sequence to the rest to find common regions. It uses these to limit the locations of primers. The longest sequence is also used as the representative template sequence in the results. Figure 1 shows an example search for primers that will amplify all of the 15 splice variants for the human TP53 gene.

Figure 1. Primer-BLAST submission page and results for primers designed for the human TP53 transcripts. Top panel: The submission form with the “Primers common for a group of sequences” selected and the 15 RefSeq transcript accessions for TP53. Middle panel: The graphical results showing the longest sequence (NM_001126114.3) as the representative template, the locations of the primer pairs, and the alignment of the other template sequences. Bottom panel: An individual primer pair showing the locations on each of the template sequences.

Please try out this new feature and let us know what you think!

Detailed Introduction

An overview of the BLASTP algorithm (a protein to protein search) is as follows: [7]

    Remove low-complexity region or sequence repeats in the query sequence.

"Low-complexity region" means a region of a sequence composed of few kinds of elements. These regions might give high scores that confuse the program to find the actual significant sequences in the database, so they should be filtered out. The regions will be marked with an X (protein sequences) or N (nucleic acid sequences) and then be ignored by the BLAST program. To filter out the low-complexity regions, the SEG program is used for protein sequences and the program DUST is used for DNA sequences. On the other hand, the program XNU is used to mask off the tandem repeats in protein sequences.

Take k=3 for example, we list the words of length 3 in the query protein sequence (k is usually 11 for a DNA sequence) "sequentially", until the last letter of the query sequence is included. The method is illustrated in figure 1.Fig. 1 The method to establish the k-letter query word list.

This step is one of the main differences between BLAST and FASTA. FASTA cares about all of the common words in the database and query sequences that are listed in step 2 however, BLAST only cares about the high-scoring words. The scores are created by comparing the word in the list in step 2 with all the 3-letter words. By using the scoring matrix (substitution matrix) to score the comparison of each residue pair, there are 20^3 possible match scores for a 3-letter word. For example, the score obtained by comparing PQG with PEG and PQA is 15 and 12, respectively. For DNA words, a match is scored as +5 and a mismatch as -4, or as +2 and -3. After that, a neighborhood word score threshold Tis used to reduce the number of possible matching words. The words whose scores are greater than the threshold T will remain in the possible matching words list, while those with lower scores will be discarded. For example, PEG is kept, but PQA is abandoned when T is 13.

This allows the program to rapidly compare the high-scoring words to the database sequences.

The BLAST program scans the database sequences for the remaining high-scoring word, such as PEG, of each position. If an exact match is found, this match is used to seed a possible un-gapped alignment between the query and database sequences.

    The original version of BLAST stretches a longer alignment between the query and the database sequence in the left and right directions, from the position where the exact match occurred. The extension does not stop until the accumulated total score of the HSP begins to decrease. A simplified example is presented in figure 2.

We list the HSPs whose scores are greater than the empirically determined cutoff scoreS. By examining the distribution of the alignment scores modeled by comparing random sequences, a cutoff score S can be determined such that its value is large enough to guarantee the significance of the remaining HSPs.

BLAST next assesses the statistical significance of each HSP score by exploiting theGumbel extreme value distribution (EVD). (It is proved that the distribution of Smith-Waterman local alignment scores between two random sequences follows the Gumbel EVD. For local alignments containing gaps it is not proved.). In accordance with the Gumbel EVD, the probability p of observing a score S equal to or greater than x is given by the equationwhereThe statistical parameters and are estimated by fitting the distribution of the un-gapped local alignment scores, of the query sequence and a lot of shuffled versions (Global or local shuffling) of a database sequence, to the Gumbel extreme value distribution. Note that and depend upon the substitution matrix, gap penalties, and sequence composition (the letter frequencies). and are the effective lengths of the query and database sequences, respectively. The original sequence length is shortened to the effective length to compensate for the edge effect (an alignment start near the end of one of the query or database sequence is likely not to have enough sequence to build an optimal alignment). They can be calculated aswhere is the average expected score per aligned pair of residues in an alignment of two random sequences. Altschul and Gish gave the typical values, , , and , for un-gapped local alignment using BLOSUM62 as the substitution matrix. Using the typical values for assessing the significance is called the lookup table method it is not accurate. The expect score E of a database match is the number of times that an unrelated database sequence would obtain a score S higher than x by chance. The expectation E obtained in a search for a database of D sequences is given byFurthermore, when , E could be approximated by the Poisson distribution asThis expectation or expect value "E" (often called an E score or E-value or e-value) assessing the significance of the HSP score for un-gapped local alignment is reported in the BLAST results. The calculation shown here is modified if individual HSPs are combined, such as when producing gapped alignments (described below), due to the variation of the statistical parameters.

Sometimes, we find two or more HSP regions in one database sequence that can be made into a longer alignment. This provides additional evidence of the relation between the query and database sequence. There are two methods, the Poisson method and the sum-of-scores method, to compare the significance of the newly combined HSP regions. Suppose that there are two combined HSP regions with the pairs of scores (65, 40) and (52, 45), respectively. The Poisson method gives more significance to the set with the maximal lower score (45>40). However, the sum-of-scores method prefers the first set, because 65+40 (105) is greater than 52+45(97). The original BLAST uses the Poisson method gapped BLAST and the WU-BLAST uses the sum-of scores method.

  • The original BLAST only generates un-gapped alignments including the initially found HSPs individually, even when there is more than one HSP found in one database sequence.
  • BLAST2 produces a single alignment with gaps that can include all of the initially-found HSP regions. Note that the computation of the score and its corresponding E-value involves use of adequate gap penalties.

Researcher Tools, Services and Support

The National Center for Biotechnology Information advances science and health by providing access to biomedical and genomic information.

Popular NCBI Databases:

    • BLAST (Basic Local Alignment Search Tool) compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.
        is a searchable database of genes, from RefSeq genomes, and defined by sequence and/or located in the NCBI Map Viewer.
          is a collection of sequences from several sources, including GenBank, RefSeq, TPA and PDB. Genome, gene and transcript sequence data provide the foundation for biomedical research and discovery.
            database is a collection of sequences from several sources, including translations from annotated coding regions in GenBank, RefSeq and TPA, as well as records from SwissProt, PIR, PRF, and PDB. Protein sequences are the fundamental determinants of biological structure and function.
              is a bibliographic database of more than 19 million citations for biomedical literature from MEDLINE, life science journals, and online books.

            The power of NCBI's resources is found in their relationship to one another, as most are linked together, providing a comprehensive toolkit for researchers in biomedicine. Online tutorials and help are available at each site, and a nice collection of tutorials can be found on NCBI's YouTube channel.

            Need a brief refresher on the sciences behind biotechnology? Check out NCBI's Science Primer site covering topics from bioinformatics to microarray technology to pharmacogenomics.

            Watch the video: Basic Local Alignment Search Tool - BLAST (August 2022).