Genius ALIGNment, GALIGN

About Genius ALIGNment

GALIGN is a program written in C++ to do several types of pairwise sequence alignment by dynamic programming algorithms. The program is run in command shell in text mode. There are eight alignment methods provided in the program as follows,

  1. Global Alignment,
  2. Global Alignment with Affine Gap Penalty,
  3. Local Alignment,
  4. Local Alignment with Affine Gap Penalty,
  5. Alignment with Repeated Matches,
  6. Overlap Alignment,
  7. Overlap Alignment with Affine Gap Penalty,
  8. Overlap Alignment with Repeated Matches

Essentially, each of the alignment algorithms finds the way of assigning pairs of letters of two sequences such that a scoring obtains the highest value. The best of different object score could be achieved by different dynamic programming schemes. If you look for the best matching of sequential pairs of whole two sequences, you try to find a “Global Alignment” of these two sequences. By “Local Alignment” we limit ourselves to the part of sequences which touches the highest score when ignoring other parts not hope increase the score. “Overlap Alignment” refers to a sort of “Global Alignment” which doesn’t care about the dangling heads of sequence if they would be left unaligned. In “Alignment with repeated Match”, the second sequence is used as a template and all substrings find out over the entire of the first sequence which gain at least a score “T” (as threshold score) for matching with the template sequence. “Overlap Alignment with repeated Matches” does the same but ignores possibly not matched heads of the first sequence.

All alignment methods but ones called “ungapped alignment”, consider some “gaps” to make alignment between two or more sequences. “Gaps” – usually represented by dashes – indicate insertion or deletion among letters of aligned sequences. More precisely, a dash in the first aligned sequence shows an insertion in the first sequence or a deletion in the second sequence to be aligned. Like a dash in the second sequence says a deletion in the first sequence or equally an insertion in the second sequence. Committing gap should cost some negative values for the alignment score. More gaps lower score. So to take gap penalties into the account we have applied both “Linear gap penalty” and “Affine gap penalty”. The former that is used by default in alignment algorithms, considers negative cost value linearly proportional to the length of the occurred gap. The latter gives a usually more cost to penalize opening a gap while going after fewer constant penalties for each consequence extended gap.

Letters of sequences can be of type of nucleotides or amino-acids. However it was so provided such that you can enter your sequence directly by numeric codes rather symbols of nucleotides or amino-acids. Related numeric codes for nucleotides and amino-acids along with some extensions are as follows,

  • IUPAC Symbols for Nucleotides
  • IUPAC Symbols for Amino-acids
Symbol Code Meaning Full Name
A 1 A Adenine
C 2 C Cytosine
G 3 G Guanine
T 4 T Thymine
R 5 G|A puRine
Y 6 T|C pYrimidine
K 7 G|T Keto
M 8 A|C aMino
S 9 G|C Strong interaction (3 H bonds)
W 10 A|T Weak interaction (2 H bonds)
B 11 G|T|C not-A (B follows A)
D 12 G|A|T not-C (D follows C)
H 13 A|T|C not-G (H follows G)
V 14 G|A|C not-T (or U) (V follows U)
N 15 G|A|T|C aNy nucleotide
- 16 GAP Gap -
U 4 T Uracil (eq. Thymine in DNA)
X 15 G|A|T|C aNy nucleotide
Symbol Code Meaning Full Name
A 1 Ala Alanine
R 2 Arg Arginine
N 3 Asn Asparagine
D 4 Asp Aspartic acid (Aspartate)
C 5 Cys Cysteine
Q 6 Gln Glutamine
E 7 Glu Glutamic acid (Glutamate)
G 8 Gly Glycine
H 9 His Histidine
I 10 Ile Isoleucine
L 11 Leu Leucine
K 12 Lys Lysine
M 13 Met Methionine
F 14 Phe Phenylalanine
P 15 Pro Proline
S 16 Ser Serine
T 17 Thr Threonine
W 18 Trp Tryptophan
Y 19 Tyr Tyrosine
V 20 Val Valine
B 21 Asx Aspartic acid or Asparagine
Z 22 Glx Glutamic acid or Glutamine
X 23 Xaa Any amino acid
* 24 END Termination codon UAA, UAG and UGA
- 25 GAP Gap -

In order to score individual matched letters in the alignment we use “Scoring Matrix”. Each entry in “Scoring Matrix” is score of matching letters which their types are related to corresponded row and column. For example, the score of substituting arginine by lysine would be found in the crossing row 2 and column 12 in the scoring matrix. By theory, score of substitution a by b is equal to score of substitution b by a, so scoring matrix is symmetric. Note that “Linear gap penalty”, the default rule to penalize gaps, takes the score of substituting symbol by ‘-’, the last row or column of the scoring matrix, as the constant cost per each gap. However, the program for alignment methods by “Affine gap penalty” asks for opening gap penalty and extended gap penalty directly from user instead of refer to somewhere in scoring matrix. Scoring matrices could be prepared in individual text files or in interactive text mode. In both case, scoring matrix would be defined by the following format,

<matrix type>
<size of matrix>
<list of matrix elements>

in which, matrix type is one of letters f (full matrix entries), h (half matrix entries) or I (the identity matrix), size of matrix is usually 15 for nucleotides scoring table and 25 for amino-acids scoring table. Matrix elements should be of integer numbers. Columns of matrix are separated by space and rows are separated by new line.

If you would have indicated h as matrix type, only the bottom left part of the matrix should be given. The identity matrices don't need elements declaring. If the scoring matrix information is stored in a text file, put the path of the file after a @ (e.g. @sm/blosum50.txt) in place of matrix type in order to pass the program to the file. A set of some conventional scoring matrix is prepared in folder sm as follows,

Scoring Matrix Sequence Type Default gap penalty
nuc44 Nucleotides -2
blosum50 Amino-acids -5
blosum50g8 Amino-acids -8
blosum65 Amino-acids -5
pam250 Amino-acids -8
pam50 Amino-acids -13

Installation

[ Click here to start download ]

To install the package, download the source code, unpack it, and then compile the source.

$ tar xvzf galign.tar.gz
$ cd galign/src
$ g++ galign5.cpp bioinf.cpp -o galign

Usage

The program can be invoked either in an interactive mode, by applying switch –i, or in non-interactive mode by default. Note that when you would enter nucleotides or amino-acids sequences, you should use capital letters as their symbols.

There are some sample inputs prepared before, to examine the program. Get a command shell and change to the galign containing directory. Now you can test the program by

$ galign < firstAlign.txt

For more info about the usage, invoke ga –i and look inside firstAlign.txt. In addition to the alignment, the program outputs other results which are the alignment score, figures of “identities” and “positives”. The former is the number of paired symbols of the identical similar components, and the latter is number of paired symbols have a positive substitution score.

Development Manual

Class

class SMatrix {
public:
  friend istream & operator>> ( istream &, SMatrix & );
  int operator  (  ) ( int, int ) const;

private:
  int n;
  char Stype;
  int **Stable;

public:
    SMatrix (  ) { Stype = 'I'; }
};

Description

The class SMatrix abstracts a data type for scoring matrices. Initialization performs either by default constructor as an identity matrix or via an istream object. Operator ( ) facilitates access to the matrix elements.

Functions

 void readNTSeq ( istream & is, int *&x, int &n ) 

Read a nucleotide sequence from is into x and translates to numeric codes.

 void readAASeq ( istream & is, int *&x, int &n ) 

Read a amino-acid sequence from is into x and translates to numeric codes.

 void readSeq ( istream & is, int *&x, int &n )

Read a sequence directly by numeric codes from is into x.

 void globalAlignment ( const int *x,int n, const int *y,int m, 
                            const SMatrix & s, int gap_symbol,
                            int &S, int *&path, int &L)

void exGlobalAlignment( const int *x,int n,const int *y,int m, 
                             const SMatrix & s, int gop, int gex,
                             int &S, int *&path, int &L )

void localAlignment ( const int *x,int n, const int *y,int m,
                          const SMatrix & s, int gap_symbol,
                          int &S, int *&path, int &L,
                          int &x0, int &y0 )

void exLocalAlignment( const int *x,int n, const int *y,int m, 
                            const SMatrix & s, int gop, int gex, 
                            int &S, int *&path, int &L,
                            int &x0, int &y0 )

void alignDemo ( ostream & os,
                      const int *x, int n, const int *y, int m,
                      char ( *decode ) ( int ), int *path, int L,
                      int x0 = 0, int y0 = 0,
                      bool annotate = true,
                      const SMatrix & s = SMatrix ( ), int w=64 )

Print out a formatted output of result of alignment algorithms. decode is a function to translate numerical coded sequences x and y to some suitable legible sequences in character form. annotate determines whether a middle line between x and y would be printed for showing identical and positive matches among x and y letters or not. Optional scoring matrix s is used for decide about shape of annotation. w is the number of columns in output window.

Contact

Ali Katanforoush, <a_katanforosh@sbu.ac.ir>, Department of Computer Science, Shahid Beheshti University, G.C., Tehran.

Time-stamp: "2010-09-18 20:17:53 katanforoush"