Table of Contents
- Introduction to Sequence Alignment
- Types of Sequence Alignment
- Methods of pairwise sequence alignment
- Methods of Multiple Sequence Alignment
- Applications of sequence alignment
Introduction to Sequence Alignment
- Sequence comparison is a vital aspect of bioinformatics analysis that entails comparing newly determined biological sequences with those previously stored in databases.
- Sequence alignment is regarded as the most crucial step in comparing biological sequences. It involves arranging two or more nucleotide or amino acid sequences to identify regions of similarity. These similar regions are instrumental in understanding the functional, structural, and evolutionary relationships between the sequences.
- Sequence alignments are commonly divided into two strategies: global alignment and local alignment.
- Global alignment: This method aligns two sequences along their entire lengths to maximize overall similarity. It is most appropriate when the sequences are of similar length and are expected to be related over their whole length.
- Local alignment: This method focuses on aligning only the regions with the highest density of matches rather than the entire length of the sequences. It is particularly useful for identifying short conserved regions in protein or nucleotide sequences.
Types of Sequence Alignment
A. Pairwise Alignment
- Pairwise sequence alignment is a type of sequence alignment that involves aligning two sequences to identify the optimal pairing.
- It relies on a scoring system that assigns positive scores to matching characters and negative scores to mismatches or gaps.
- The primary goal of pairwise sequence alignment is to achieve the highest possible score, reflecting the degree of similarity between the two sequences.
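The scoring idea can be illustrated with a minimal sketch. The function name `alignment_score` and the values +1 (match), -1 (mismatch), and -2 (gap) are assumptions chosen for illustration, not values taken from any particular tool. The function only scores an alignment that has already been written out; it does not search for the best one.

```python
def alignment_score(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Sum per-column scores for two already-aligned, equal-length rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            score += gap        # a residue paired with a gap
        elif x == y:
            score += match      # identical residues
        else:
            score += mismatch   # different residues
    return score


# Five matches, one mismatch, one gap: 5 - 1 - 2
print(alignment_score("GATTACA", "GA-TGCA"))  # 2
```

Comparing the scores of several candidate alignments of the same two sequences shows which one the scoring system prefers.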
B. Multiple Sequence Alignment
- Multiple Sequence Alignment involves aligning three or more biological sequences to achieve optimal sequence matching.
- These alignments are used to identify conserved sequence regions and to construct phylogenetic trees, which aid in understanding the functional and evolutionary relationships between different species or groups of organisms.
Methods of pairwise sequence alignment
There are three main methods for generating pairwise alignments:
A. Dot-matrix method
- The dot matrix method, also known as the dot plot method, is a graphical sequence alignment technique that involves comparing two sequences by plotting them in a two-dimensional matrix. In a dot matrix, the two sequences to be compared are plotted along the matrix's horizontal and vertical axes.
- The method then scans each residue of one sequence to identify similarities with all residues in the other sequence.
- When a residue in one sequence matches a residue in the other sequence, a dot is placed in the corresponding position in the matrix; otherwise, the matrix position is left blank. If the two sequences being compared are highly similar, the dot plot will display a single line along the matrix's main diagonal. In contrast, if the sequences are less similar, the dot plot will show scattered dots with fewer diagonal lines, indicating lower similarity between the sequences.
- Dot plots can also be used to find repeat elements within a single sequence by plotting the sequence against itself. Short parallel lines above and below the main diagonal suggest the presence of repeats.
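A text-only dot plot can be sketched in a few lines. The helper names `dot_plot` and `show` are hypothetical, and the sketch marks every single-residue match without the window or threshold filtering that practical dot-plot programs usually apply to reduce noise.

```python
def dot_plot(seq_x, seq_y):
    """Return the rows of the dot matrix: seq_x across the top, seq_y down the side."""
    return ["".join("*" if x == y else "." for x in seq_x) for y in seq_y]


def show(seq_x, seq_y):
    """Print the dot matrix with the two sequences as axis labels."""
    print("  " + seq_x)
    for y, row in zip(seq_y, dot_plot(seq_x, seq_y)):
        print(y + " " + row)


show("GATTACA", "GATGACA")   # similar sequences: dots along the main diagonal
show("ATATAT", "ATATAT")     # self-comparison: parallel off-diagonal lines mark repeats
```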
B. Dynamic programming
- Dynamic programming is a method used to find the optimal alignment between two protein or nucleic acid sequences by comparing all possible pairs of characters in the sequences. It can be employed to produce both global and local alignments.
- The global pairwise alignment algorithm using dynamic programming is based on the Needleman-Wunsch algorithm, whereas the local alignment algorithm is based on the Smith-Waterman algorithm.
This method works in the following three steps:
- Initialization of the scoring matrix: The first step is to create a two-dimensional matrix where the two sequences to be aligned are written along the top and left sides. The top-left cell is set to zero. For global alignment, the rest of the first row and first column hold cumulative gap penalties; for local alignment, they are set to zero.
- Matrix filling with maximum scores: The next step involves filling the matrix with scores based on a scoring system. For nucleotide sequences, a positive value is assigned for a match and a negative value for a mismatch. For amino acids, substitution matrices such as BLOSUM and PAM are used. To calculate the alignment scores, the algorithm starts at the upper left corner of the matrix and proceeds one row at a time toward the lower right corner. Each cell is filled with the maximum of three candidates: the score of the diagonal neighbor plus the match or mismatch score for the two residues, the score of the cell above plus a gap penalty, and the score of the cell to the left plus a gap penalty. In local alignment, a cell is set to zero whenever all three candidates are negative.
- Traceback to identify optimal alignment: After filling the matrix, the algorithm performs a traceback to find the optimal alignment path. For global alignment, the traceback starts at the bottom-right corner and ends at the top-left corner, moving at each step to the adjacent cell from which the current score was derived. For local alignment, it starts at the highest-scoring cell and stops when it reaches a cell with a score of zero. The optimal alignment path is the one with the maximum score.
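The three steps can be sketched as follows. This is a minimal illustration under stated assumptions: a simple match/mismatch scheme (+1/-1) with a linear gap penalty (-2) instead of a BLOSUM or PAM matrix, and hypothetical function names. `needleman_wunsch` returns one optimal global alignment (ties are broken in favor of the diagonal move), while `smith_waterman_score` returns only the best local score to show how the zero floor changes the recurrence.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    # Step 1: initialization. Cell (0, 0) is 0; the first row and column
    # hold cumulative gap penalties.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Step 2: fill row by row; each cell keeps the best of three moves.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,  # pair a[i-1] with b[j-1]
                              score[i - 1][j] + gap,    # a[i-1] against a gap
                              score[i][j - 1] + gap)    # b[j-1] against a gap
    # Step 3: traceback from the bottom-right cell to the top-left cell.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score: same recurrence, but cells never drop below zero."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


print(needleman_wunsch("ACGT", "AGT"))                   # (1, 'ACGT', 'A-GT')
print(smith_waterman_score("TTTACGTTT", "GGGACGGGG"))    # 3, from the shared ACG
```

The global version keeps the full matrix because the traceback needs it; the score-only local version keeps just two rows, which is enough when the alignment itself is not required.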
C. Word or k-tuple method
- Word or k-tuple methods are heuristic techniques best known for their use in the database search tools FASTA and BLAST.
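- The word method is a fast approach for aligning two sequences. It starts by identifying short identical subsequences, known as words or k-tuples, that occur in both sequences. It then extends and joins these word matches into a longer alignment, applying dynamic programming only to the regions around them. Because the search is restricted in this way, the method does not guarantee an optimal alignment.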
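The word-matching stage can be sketched as below. This is a simplified illustration, not the actual FASTA or BLAST implementation: the function names, the word length `k=3`, and the rule of simply counting hits per diagonal are all assumptions made for the example.

```python
from collections import defaultdict


def shared_words(query, target, k=3):
    """Return (query_pos, target_pos) pairs where identical k-letter words occur."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)      # word -> positions in the query
    hits = []
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], []):
            hits.append((i, j))
    return hits


def best_diagonal(hits):
    """Group hits by diagonal (target_pos - query_pos); return (diagonal, hit_count)."""
    counts = defaultdict(int)
    for i, j in hits:
        counts[j - i] += 1
    return max(counts.items(), key=lambda kv: kv[1]) if counts else None


hits = shared_words("ACGTACG", "TTACGTA")
print(hits)                 # [(3, 1), (0, 2), (4, 2), (1, 3), (2, 4)]
print(best_diagonal(hits))  # (2, 3): three word hits lie on diagonal +2
```

Hits that fall on the same diagonal correspond to an ungapped stretch of similarity, so a full aligner would concentrate its more expensive dynamic programming around the busiest diagonals.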
Methods of Multiple Sequence Alignment
Multiple sequence alignment can be performed using either exhaustive or heuristic approaches.
A. Exhaustive algorithms
- Exhaustive alignment involves examining all possible alignments simultaneously. To perform multiple sequence alignment using the exhaustive algorithm, a multidimensional search matrix is required, similar to the two-dimensional matrix used in dynamic programming for pairwise alignment. For aligning N sequences, an N-dimensional matrix is necessary.
- Dynamic programming is a powerful method for aligning sequences, but the size of the search matrix, and with it the computing time and memory required, grows exponentially with the number of sequences to be aligned. This makes the method computationally impractical for large data sets.
- Consequently, dynamic programming is typically used only for small data sets with fewer than ten short sequences. For larger data sets, heuristic approaches are employed to achieve a more efficient alignment.
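The growth of the search matrix can be made concrete with a short calculation. The sketch assumes one matrix axis per sequence with (length + 1) positions on each axis, as in the pairwise case; the function name `dp_cells` and the sequence length of 100 are illustrative choices.

```python
from math import prod


def dp_cells(lengths):
    """Number of cells in the full dynamic-programming matrix for the given sequence lengths."""
    return prod(n + 1 for n in lengths)


# Cell counts for 2, 3, 5, and 10 sequences of 100 residues each.
for n_seqs in (2, 3, 5, 10):
    print(n_seqs, dp_cells([100] * n_seqs))
```

Two sequences of 100 residues need 10,201 cells and three need 1,030,301; each additional sequence multiplies the count by another factor of 101.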
B. Heuristic algorithms
i. Progressive method
- The progressive method, also known as the tree-based algorithm, is a stepwise assembly of multiple alignments based on pairwise similarity. Its progressive nature stems from aligning sequences incrementally.
- First, all sequences are aligned pairwise, for example with Needleman-Wunsch global alignment, and the similarity scores are recorded. These scores are then converted into evolutionary distances, forming a distance matrix. From this matrix, a guide tree is built using the neighbor-joining method.
- Subsequently, the guide tree directs the realignment of sequences, starting with the two most closely related ones and progressively incorporating more distant sequences. This process continues until all sequences are aligned.
- Prominent programs employing the progressive alignment strategy include Clustal and T-Coffee.
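The first stage of the progressive method can be sketched as follows. Several things here are assumptions for illustration only: the function names, the +1/-1/-2 scoring, and especially the score-to-distance conversion (one minus the pairwise score divided by the smaller self-alignment score), which is just one simple normalization and not the formula used by Clustal or T-Coffee. The sketch stops before tree building and only reports which pair would be aligned first.

```python
from itertools import combinations


def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score only (two rolling rows, no traceback)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev[-1]


def distance_matrix(seqs):
    """Map each pair of names to an assumed distance; sequences must be non-empty."""
    dist = {}
    for (name_a, a), (name_b, b) in combinations(seqs.items(), 2):
        best_possible = min(global_score(a, a), global_score(b, b))
        dist[(name_a, name_b)] = 1 - global_score(a, b) / best_possible
    return dist


seqs = {"s1": "GATTACA", "s2": "GATTGCA", "s3": "CCTTAGG"}
dist = distance_matrix(seqs)
first_pair = min(dist, key=dist.get)   # the closest pair is aligned first
print(first_pair)                      # ('s1', 's2')
```

A full implementation would feed the distance matrix to a tree-building method and then follow the tree, aligning sequences and partial alignments in order of decreasing similarity.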
ii. Iterative Method
- The iterative method refines an initial suboptimal alignment through repeated modifications until the alignment score stops improving. It does not guarantee that the final alignment is optimal.
- First, pairwise alignments are used to build a tree, which provides the weights for creating an initial multiple alignment. Regions aligned with gaps are then identified and iteratively adjusted to raise the alignment score.
- The alignment with the highest score is employed in a fresh round of calculations to predict a new tree, new weights, and fresh alignments. This process iterates until no further enhancement in the alignment score is observed.
- PRRN is a web-based program that uses the iterative alignment method.
iii. Block-based method
- The progressive and iterative alignment methods rely on global alignment and might not effectively identify conserved domains and motifs in highly divergent sequences of varying lengths.
- To align such divergent sequences, a local alignment-based approach is required.
- The block-based method represents one such approach, identifying a block of ungapped alignment shared by all sequences.
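The idea of a shared ungapped block can be illustrated with a deliberately simplified sketch. It assumes the block must match exactly in every sequence, whereas practical block-based methods tolerate substitutions and rank candidate blocks statistically. The function name `shared_blocks` and the example sequences are hypothetical.

```python
def shared_blocks(seqs, width):
    """Return each word of the given width found in every sequence, with its offsets."""
    def words(seq):
        found = {}
        for i in range(len(seq) - width + 1):
            found.setdefault(seq[i:i + width], i)   # keep the first occurrence only
        return found

    tables = [words(s) for s in seqs]
    common = set(tables[0]).intersection(*tables[1:])
    return {w: [t[w] for t in tables] for w in sorted(common)}


seqs = ["MKVLAAGHWQ", "TTAAGHWPPL", "AAGHWKK"]
print(shared_blocks(seqs, 5))   # {'AAGHW': [4, 2, 0]}
```

The offsets show where the block sits in each sequence, so the sequences can be anchored on the block even though the flanking regions differ in length and content.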
Applications of sequence alignment
- Identification of unknown sequences: By comparing them with sequences in databases, alignment can help identify unknown sequences.
- Detection of conserved patterns and motifs: Alignment aids in characterizing sequence functions by identifying conserved patterns and motifs.
- Phylogenetic tree construction: By aligning sequences from different species, alignment helps in constructing phylogenetic trees, revealing evolutionary relationships.
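- Prediction of protein structures: Alignment supports the prediction of secondary and tertiary structures of proteins, for example by aligning a protein of unknown structure to related proteins whose structures are known, aiding in understanding their functions.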
- Prediction of gene locations and identification of gene families: Alignment can predict gene locations and identify new members of gene families by comparing sequences with known genes.
- Development of degenerate PCR primers: By analyzing multiple related sequences, alignment assists in developing degenerate PCR primers.
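As an illustration of the last application, the sketch below collapses each column of a short nucleotide alignment into a single IUPAC ambiguity code, which is how degenerate positions in a primer are usually written. The function names and the example alignment are hypothetical, and the sketch assumes gap-free columns containing only A, C, G, and T. Real primer design also weighs melting temperature, degeneracy limits, and 3' end stability, none of which is modeled here.

```python
# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases they cover.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M", frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def degenerate_consensus(aligned):
    """Collapse each gap-free alignment column into one IUPAC code."""
    if len({len(row) for row in aligned}) != 1:
        raise ValueError("rows must be aligned to the same length")
    return "".join(IUPAC[frozenset(col)] for col in zip(*aligned))


def degeneracy(primer):
    """Number of distinct sequences the degenerate primer stands for."""
    sizes = {code: len(bases) for bases, code in IUPAC.items()}
    n = 1
    for code in primer:
        n *= sizes[code]
    return n


aligned = ["ATGGCA", "ATGGCT", "ATAGCC"]
primer = degenerate_consensus(aligned)
print(primer, degeneracy(primer))   # ATRGCH 6
```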