Ph.D. Dissertation -- Ali Katanforoush

Computational Problems in Haplotype Recognition

October 2009

Abstract

Modern technologies now give access to large amounts of Single Nucleotide Polymorphism (SNP) data. Haplotypes, the SNP genotypes in haploid phase, provide useful material for genetic analyses. Two primary tasks in computational haplotype studies are phasing genotype data into haplotypes and partitioning a chromosome into blocks based on haplotype samples of unrelated individuals.

For the genotype phasing problem, we introduce a family of greedy procedures as a general approach for finding near-optimal solutions under the maximum parsimony model. We then develop a hybrid method that infers haplotypes from genotype data by incorporating the greedy procedure into a Genetic Algorithm.
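To make the idea of greedy phasing under maximum parsimony concrete, here is a minimal sketch. It is not the dissertation's procedure, which the abstract does not specify. It assumes a genotype is a string over 0/1/2 (2 marks a heterozygous site) and a haplotype is a string over 0/1. The names `compatible`, `complement`, and `greedy_phase`, and the ordering heuristic (fewest heterozygous sites first), are hypothetical choices for illustration.

```python
def compatible(h, g):
    """True if haplotype h could be one of the two haplotypes of genotype g."""
    return all(gs == "2" or gs == hs for hs, gs in zip(h, g))


def complement(h, g):
    """The haplotype that, paired with a compatible h, yields genotype g."""
    return "".join(
        ("1" if hs == "0" else "0") if gs == "2" else gs
        for hs, gs in zip(h, g)
    )


def greedy_phase(genotypes):
    """Greedily resolve genotypes, reusing known haplotypes when possible.

    Illustrative heuristic only: it aims to keep the pool of distinct
    haplotypes small but does not guarantee a most parsimonious solution.
    """
    pool = []    # distinct haplotypes found so far, in discovery order
    phased = {}  # genotype -> (haplotype, haplotype)
    # Assumption: handle the least ambiguous genotypes first.
    for g in sorted(genotypes, key=lambda s: s.count("2")):
        candidates = [h for h in pool if compatible(h, g)]
        # Best case: both haplotypes are already in the pool.
        pair = next(
            ((h, complement(h, g)) for h in candidates
             if complement(h, g) in pool),
            None,
        )
        # Next best: reuse one pooled haplotype and add its complement.
        if pair is None and candidates:
            pair = (candidates[0], complement(candidates[0], g))
        # Otherwise introduce an arbitrary resolving pair.
        if pair is None:
            h = g.replace("2", "0")
            pair = (h, complement(h, g))
        for h in pair:
            if h not in pool:
                pool.append(h)
        phased[g] = pair
    return phased, pool


phased, pool = greedy_phase(["22", "00", "11"])
print(pool, phased["22"])
```

In this toy input the two homozygous genotypes seed the pool, and the doubly heterozygous genotype is then resolved with haplotypes that are already present, so no new haplotype is added.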

Global partitioning based on pairwise associations of SNPs has not previously been used to define haplotype blocks within genomes. Here, we define an association index based on linkage disequilibrium (LD) between SNP pairs. We use Fisher's exact test to assess the statistical significance of the LD estimator. By this test, each SNP pair is characterized as associated, independent, or not statistically significant. We set limits on the maximum acceptable proportion of independent pairs within all blocks and search for the partitioning with the maximal proportion of associated SNP pairs. The model thus reduces to a constrained optimization problem, which is solved by iterating a dynamic programming algorithm.
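The two ingredients above can be sketched in a few lines, under stated assumptions. `fisher_exact_p` computes a two-sided Fisher's exact p-value for the 2x2 table of haplotype counts at a SNP pair. `partition` is a simplified dynamic program: it takes the pair labels as given ("A" associated, "I" independent, "N" not significant), requires each block's proportion of independent pairs to stay within a hypothetical `max_indep` limit, and maximizes the count of associated within-block pairs. The abstract does not state the rule that separates independent from non-significant pairs, and its objective is a proportion solved by iterating the dynamic program, so this is an illustration rather than the dissertation's algorithm. All function names are hypothetical.

```python
from math import comb


def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # Sum over all tables at most as probable as the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))


def pair_p_value(haplotypes, i, j):
    """p-value for association between SNPs i and j in 0/1 haplotype strings."""
    n = [[0, 0], [0, 0]]
    for h in haplotypes:
        n[int(h[i])][int(h[j])] += 1
    return fisher_exact_p(n[0][0], n[0][1], n[1][0], n[1][1])


def partition(labels, max_indep):
    """Split SNPs 0..m-1 into contiguous blocks (inclusive index pairs).

    labels[i][j] (i < j) is "A", "I", or "N". A block is feasible when its
    share of "I" pairs is at most max_indep; single-SNP blocks are always
    feasible. The score is the number of "A" pairs inside blocks.
    """
    m = len(labels)
    best = [0] + [-1] * m  # best[k]: best score for the first k SNPs
    back = [0] * (m + 1)   # back[k]: start index of the last block
    for end in range(1, m + 1):
        for start in range(end):
            pairs = [labels[i][j]
                     for i in range(start, end)
                     for j in range(i + 1, end)]
            if pairs and pairs.count("I") / len(pairs) > max_indep:
                continue
            score = best[start] + pairs.count("A")
            if score > best[end]:
                best[end], back[end] = score, start
    blocks, end = [], m
    while end > 0:
        blocks.append((back[end], end - 1))
        end = back[end]
    return blocks[::-1]


# Toy input: SNPs 0-1 and 2-3 are associated; all cross pairs are independent.
labels = [
    ["-", "A", "I", "I"],
    ["-", "-", "I", "I"],
    ["-", "-", "-", "A"],
    ["-", "-", "-", "-"],
]
print(partition(labels, 0.2))
```

The block enumeration here is deliberately naive and recounts pairs for every candidate block. It is meant to show the recurrence, not to scale to chromosome-length data.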

We also introduce new assessment protocols to evaluate the performance of haplotype block partitioning methods across different aspects and applications: a definition of similarity between two block partitionings, a simulation process to assess the robustness of a block definition, an application to hotspot detection, and an application to disease association studies.

The proposed Genetic Algorithm cannot compete with previously developed algorithms in either the number of inferred haplotypes or inference accuracy, although for small SNP samples its results are quite close to the parsimonious haplotypes. In contrast, the performance of the proposed haplotype block partitioning algorithm is comparable with, and even better than, that of other methods.

In comparison with other block partitioning methods, our algorithm reports blocks of larger average size. Nevertheless, the haplotype diversity within the blocks is captured by a small number of tagSNPs. Resampling HapMap haplotypes under a block-based model of recombination shows that our algorithm is robust in reproducing the same partitioning for recombinant samples. In simulations of a case-control association study aimed at mapping a single-locus trait, evaluated by a block-based statistical test, our algorithm performed better than previously reported models. Compared to other haplotype block partitioning methods, our algorithm also performs best at detecting recombination hotspots.


Full dissertation:

Presentation slides:


Last Update: Sep, 2010