Graph-based Haplotype Assembly




Haplotypes are increasingly used in genetic studies, and the analysis of variation among haplotypes has many applications in population genetics and biomedical research. Although genome assembly has advanced significantly, especially with the emergence of second-generation (next-generation sequencing, NGS) and third-generation (TGS) sequencing technologies, haplotype assembly remains a major challenge. Most existing de novo genome assemblers collapse the two haplotypes into a mosaic consensus that corresponds to neither haplotype. The few reference-free tools that can assemble long reads directly into haplotypes suffer from switch errors and phasing gaps caused by varying heterozygosity rates. Recently, the idea of separating haplotype-specific reads prior to assembly was introduced; it resolves haplotypes by making use of the two parental genomes and has produced very promising results for haplotype assembly. In this project, we will investigate how to develop a reference-free haplotype assembly pipeline that uses NGS and TGS reads from the same individual. We aim to use graphs to model the single nucleotide polymorphisms (SNPs) between the two haplotypes and to design algorithms that find a bipartition of the NGS and TGS reads, with each part corresponding to one of the two original haplotypes.
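
To make the graph formulation concrete, the following is a minimal sketch of one possible model, not the project's prescribed method. It assumes each read has already been reduced to the alleles (0 or 1) it carries at heterozygous SNP sites, that reads are error-free, and that two reads conflict when they disagree at any shared site. The names `build_conflict_graph` and `bipartition` and the toy reads are hypothetical, introduced only for illustration.

```python
from collections import deque


def build_conflict_graph(reads):
    """Connect two reads when they disagree at a SNP site they both cover.

    reads: list of dicts mapping SNP position -> allele (0 or 1).
    Returns an adjacency dict: read index -> set of conflicting read indices.
    """
    graph = {i: set() for i in range(len(reads))}
    for i in range(len(reads)):
        for j in range(i + 1, len(reads)):
            shared = reads[i].keys() & reads[j].keys()
            if any(reads[i][p] != reads[j][p] for p in shared):
                graph[i].add(j)
                graph[j].add(i)
    return graph


def bipartition(graph):
    """Two-colour the graph by breadth-first search.

    Returns a dict read index -> side (0 or 1), or None when the graph
    contains an odd cycle and therefore has no consistent bipartition.
    """
    side = {}
    for start in graph:
        if start in side:
            continue
        side[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in side:
                    side[v] = 1 - side[u]  # conflicting reads go to opposite sides
                    queue.append(v)
                elif side[v] == side[u]:
                    return None
    return side


# Toy data (assumed): haplotype A = 0,1,0,1 and haplotype B = 1,0,1,0 at SNPs 0..3.
reads = [
    {0: 0, 1: 1},  # read 0, from A
    {1: 1, 2: 0},  # read 1, from A
    {0: 1, 1: 0},  # read 2, from B
    {1: 0, 2: 1},  # read 3, from B
    {2: 0, 3: 1},  # read 4, from A
    {2: 1, 3: 0},  # read 5, from B
]
side = bipartition(build_conflict_graph(reads))
group_a = sorted(r for r, s in side.items() if s == 0)
group_b = sorted(r for r, s in side.items() if s == 1)
```

On this toy input the reads separate into the groups `[0, 1, 4]` and `[2, 3, 5]`. With real data, sequencing errors typically make the conflict graph non-bipartite (the function then returns `None`), so a practical pipeline would need an error-tolerant objective rather than exact two-colouring.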

Background Literature

  1. Ke, Z., & Vikalo, H. (2020). A Graph Auto-Encoder for Haplotype Assembly and Viral Quasispecies Reconstruction. Proceedings of the AAAI Conference on Artificial Intelligence, 34(01), 719-726.
  2. Li, Y., Patel, H., & Lin, Y. (2020). Kmer2SNP: reference-free SNP calling from raw reads based on matching. In 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) (pp. 208-212). IEEE.


This can be a 12cp or 24cp project.


Genome Assembly, Haplotype Phasing, Single Nucleotide Polymorphism (SNP)

Updated: 10 August 2021 / Responsible Officer: Dean, CECS / Page Contact: CECS Marketing