Genius ALIGNment, GALIGN

About Genius ALIGNment

GALIGN is a program written in C++ to do several types of pairwise sequence alignment by dynamic programming algorithms. The program is run in command shell in text mode. There are eight alignment methods provided in the program as follows,

Global Alignment,
Global Alignment with Affine Gap Penalty,
Local Alignment,
Local Alignment with Affine Gap Penalty,
Alignment with Repeated Matches,
Overlap Alignment,
Overlap Alignment with Affine Gap Penalty,
Overlap Alignment with Repeated Matches

Essentially, each of the alignment algorithms finds the way of assigning pairs of letters of two sequences such that a scoring obtains the highest value. The best of different object score could be achieved by different dynamic programming schemes. If you look for the best matching of sequential pairs of whole two sequences, you try to find a “Global Alignment” of these two sequences. By “Local Alignment” we limit ourselves to the part of sequences which touches the highest score when ignoring other parts not hope increase the score. “Overlap Alignment” refers to a sort of “Global Alignment” which doesn’t care about the dangling heads of sequence if they would be left unaligned. In “Alignment with repeated Match”, the second sequence is used as a template and all substrings find out over the entire of the first sequence which gain at least a score “T” (as threshold score) for matching with the template sequence. “Overlap Alignment with repeated Matches” does the same but ignores possibly not matched heads of the first sequence.

All alignment methods but ones called “ungapped alignment”, consider some “gaps” to make alignment between two or more sequences. “Gaps” – usually represented by dashes – indicate insertion or deletion among letters of aligned sequences. More precisely, a dash in the first aligned sequence shows an insertion in the first sequence or a deletion in the second sequence to be aligned. Like a dash in the second sequence says a deletion in the first sequence or equally an insertion in the second sequence. Committing gap should cost some negative values for the alignment score. More gaps lower score. So to take gap penalties into the account we have applied both “Linear gap penalty” and “Affine gap penalty”. The former that is used by default in alignment algorithms, considers negative cost value linearly proportional to the length of the occurred gap. The latter gives a usually more cost to penalize opening a gap while going after fewer constant penalties for each consequence extended gap.

Letters of sequences can be of type of nucleotides or amino-acids. However it was so provided such that you can enter your sequence directly by numeric codes rather symbols of nucleotides or amino-acids. Related numeric codes for nucleotides and amino-acids along with some extensions are as follows,

IUPAC Symbols for Nucleotides

IUPAC Symbols for Amino-acids

Symbol	Code	Meaning	Full Name
A	1	A	Adenine
C	2	C	Cytosine
G	3	G	Guanine
T	4	T	Thymine
R	5	G\|A	puRine
Y	6	T\|C	pYrimidine
K	7	G\|T	Keto
M	8	A\|C	aMino
S	9	G\|C	Strong interaction (3 H bonds)
W	10	A\|T	Weak interaction (2 H bonds)
B	11	G\|T\|C	not-A (B follows A)
D	12	G\|A\|T	not-C (D follows C)
H	13	A\|T\|C	not-G (H follows G)
V	14	G\|A\|C	not-T (or U) (V follows U)
N	15	G\|A\|T\|C	aNy nucleotide
-	16	GAP	Gap -
U	4	T	Uracil (eq. Thymine in DNA)
X	15	G\|A\|T\|C	aNy nucleotide

Symbol	Code	Meaning	Full Name
A	1	Ala	Alanine
R	2	Arg	Arginine
N	3	Asn	Asparagine
D	4	Asp	Aspartic acid (Aspartate)
C	5	Cys	Cysteine
Q	6	Gln	Glutamine
E	7	Glu	Glutamic acid (Glutamate)
G	8	Gly	Glycine
H	9	His	Histidine
I	10	Ile	Isoleucine
L	11	Leu	Leucine
K	12	Lys	Lysine
M	13	Met	Methionine
F	14	Phe	Phenylalanine
P	15	Pro	Proline
S	16	Ser	Serine
T	17	Thr	Threonine
W	18	Trp	Tryptophan
Y	19	Tyr	Tyrosine
V	20	Val	Valine
B	21	Asx	Aspartic acid or Asparagine
Z	22	Glx	Glutamic acid or Glutamine
X	23	Xaa	Any amino acid
*	24	END	Termination codon UAA, UAG and UGA
-	25	GAP	Gap -

In order to score individual matched letters in the alignment we use “Scoring Matrix”. Each entry in “Scoring Matrix” is score of matching letters which their types are related to corresponded row and column. For example, the score of substituting arginine by lysine would be found in the crossing row 2 and column 12 in the scoring matrix. By theory, score of substitution a by b is equal to score of substitution b by a, so scoring matrix is symmetric. Note that “Linear gap penalty”, the default rule to penalize gaps, takes the score of substituting symbol by ‘-’, the last row or column of the scoring matrix, as the constant cost per each gap. However, the program for alignment methods by “Affine gap penalty” asks for opening gap penalty and extended gap penalty directly from user instead of refer to somewhere in scoring matrix. Scoring matrices could be prepared in individual text files or in interactive text mode. In both case, scoring matrix would be defined by the following format,

in which, matrix type is one of letters f (full matrix entries), h (half matrix entries) or I (the identity matrix), size of matrix is usually 15 for nucleotides scoring table and 25 for amino-acids scoring table. Matrix elements should be of integer numbers. Columns of matrix are separated by space and rows are separated by new line.

If you would have indicated h as matrix type, only the bottom left part of the matrix should be given. The identity matrices don't need elements declaring. If the scoring matrix information is stored in a text file, put the path of the file after a @ (e.g. @sm/blosum50.txt) in place of matrix type in order to pass the program to the file. A set of some conventional scoring matrix is prepared in folder sm as follows,

Scoring Matrix	Sequence Type	Default gap penalty
nuc44	Nucleotides	-2
blosum50	Amino-acids	-5
blosum50g8	Amino-acids	-8
blosum65	Amino-acids	-5
pam250	Amino-acids	-8
pam50	Amino-acids	-13

Installation

[ Click here to start download ]

To install the package, download the source code, unpack it, and then compile the source.

$ tar xvzf galign.tar.gz $ cd galign/src $ g++ galign5.cpp bioinf.cpp -o galign

Usage

The program can be invoked either in an interactive mode, by applying switch –i, or in non-interactive mode by default. Note that when you would enter nucleotides or amino-acids sequences, you should use capital letters as their symbols.

There are some sample inputs prepared before, to examine the program. Get a command shell and change to the galign containing directory. Now you can test the program by

$ galign < firstAlign.txt

For more info about the usage, invoke ga –i and look inside firstAlign.txt. In addition to the alignment, the program outputs other results which are the alignment score, figures of “identities” and “positives”. The former is the number of paired symbols of the identical similar components, and the latter is number of paired symbols have a positive substitution score.

Development Manual

Class

class SMatrix {
public:
  friend istream & operator>> ( istream &, SMatrix & );
  int operator  (  ) ( int, int ) const;

private:
  int n;
  char Stype;
  int **Stable;

public:
    SMatrix (  ) { Stype = 'I'; }
};

Description

The class SMatrix abstracts a data type for scoring matrices. Initialization performs either by default constructor as an identity matrix or via an istream object. Operator ( ) facilitates access to the matrix elements.

Functions

 void readNTSeq ( istream & is, int *&x, int &n )

Read a nucleotide sequence from is into x and translates to numeric codes.

 void readAASeq ( istream & is, int *&x, int &n )

Read a amino-acid sequence from is into x and translates to numeric codes.

 void readSeq ( istream & is, int *&x, int &n )

Read a sequence directly by numeric codes from is into x.

 void globalAlignment ( const int *x,int n, const int *y,int m, 
                            const SMatrix & s, int gap_symbol,
                            int &S, int *&path, int &L)

Precondition.
- x and y sequences of length n and m, respectively.
- s scoring matrix.
- gap_symbol the symbol representing gaps.
Postcondition.
- S score of global alignment for x and y.
- path encoded information of how to done the alignment, (back-tracking path).
- L length of the path.

void exGlobalAlignment( const int *x,int n,const int *y,int m, 
                             const SMatrix & s, int gop, int gex,
                             int &S, int *&path, int &L )

Precondition.
- x and y sequences of length n and m, respectively.
- s scoring matrix.
- gop Opening gap penalty.
- gex Extended gap penalty.
Postcondition.
- S score of global alignment for x and y with affine gap penalty.
- path encoded information of how to done the alignment, (back-tracking path).
- L length of the path.

void localAlignment ( const int *x,int n, const int *y,int m,
                          const SMatrix & s, int gap_symbol,
                          int &S, int *&path, int &L,
                          int &x0, int &y0 )

Precondition.
- x and y sequences of length n and m, respectively.
- s scoring matrix.
- gap_symbol the symbol representing gaps.
Postcondition.
- S score of local alignment for x and y.
- path encoded information of how to done the alignment, (back-tracking path).
- L length of the path.
- x0 and y0 starting positions for local alignment on sequences x and y, respectively.

void exLocalAlignment( const int *x,int n, const int *y,int m, 
                            const SMatrix & s, int gop, int gex, 
                            int &S, int *&path, int &L,
                            int &x0, int &y0 )

Precondition.
- x and y sequences of length n and m, respectively.
- s scoring matrix.
- gop Opening gap penalty.
- gex Extended gap penalty.
Postcondition.
- S score of local alignment for x and y with affine gap penalty.
- path encoded information of how to done the alignment, (back-tracking path).
- L length of the path.
- x0 and y0 starting positions for local alignment on sequences x and y, respectively.

void alignDemo ( ostream & os,
                      const int *x, int n, const int *y, int m,
                      char ( *decode ) ( int ), int *path, int L,
                      int x0 = 0, int y0 = 0,
                      bool annotate = true,
                      const SMatrix & s = SMatrix ( ), int w=64 )

Print out a formatted output of result of alignment algorithms. decode is a function to translate numerical coded sequences x and y to some suitable legible sequences in character form. annotate determines whether a middle line between x and y would be printed for showing identical and positive matches among x and y letters or not. Optional scoring matrix s is used for decide about shape of annotation. w is the number of columns in output window.

Contact

Ali Katanforoush, <a_katanforosh@sbu.ac.ir>, Department of Computer Science, Shahid Beheshti University, G.C., Tehran.

Time-stamp: "2010-09-18 20:17:53 katanforoush"