About the Event
Recent advances in DNA sequencing technologies have made it possible to sequence the entire genomes of several individuals and to catalogue the complete spectrum of genetic variation in large populations. The large amounts of raw sequence data generated by high-throughput sequencing instruments and the complexity of human genetic variation pose significant computational challenges and demand the development of new algorithmic (combinatorial and probabilistic) methods.

In this talk, I will describe combinatorial and stochastic algorithms for haplotype assembly, i.e., reconstructing the two haplotypes of an individual from sequence reads generated by whole-genome sequencing. These algorithms are based on computing cuts in variant-haplotype graphs and have been used to assemble long and accurate haplotypes for the first diploid individual genome. The ability to reconstruct haplotypes from sequence reads depends on the length of the reads, the insert lengths for paired-end sequencing, and the polymorphism rate of the genome being sequenced. The 'design' aspect of haplotype assembly can be modeled using distance-random graphs. I will describe analytical and empirical results for the design of sequencing experiments to assemble long haplotypes from short sequence reads.

Sequencing large populations of individuals (healthy and diseased) is a powerful approach for identifying DNA sequence variants and finding the variant(s) associated with disease risk. Combining DNA pooling with high-throughput sequencing enables cost-efficient sequencing of genomic regions in large numbers of individuals. I will describe a probabilistic method for detecting rare sequence variants in pooled sequencing experiments, along with combinatorial problems that arise in the design and analysis of these experiments.