Abstract
Background: De novo haplotype-resolved genome assembly plays a critical role in studies of novel species
and is also the most comprehensive way to discover structural variants and understand repeat-rich regions of the human
genome. Moreover, haplotype-resolved assemblies are the foundation of pangenome references.
Recent advances in accurate long-read sequencing technologies open the opportunity to faithfully build high-quality haplotype-resolved
assemblies, but most assembly algorithms cannot take full advantage of the emerging accurate long-read data.
To this end, I have developed a graph-based haplotype-resolved genome assembly algorithm, called hifiasm, which combines
accurate long reads with additional data that provide long-range phasing information. Hifiasm has been widely used by
multiple large-scale sequencing projects, such as the Human Pangenome Reference Consortium (HPRC), the Genome in a
Bottle (GIAB), the Vertebrate Genomes Project (VGP), and the Darwin Tree of Life project. Based on hifiasm, this proposal
focuses on developing a set of new haplotype-resolved assembly algorithms to further improve the assembly quality for
complex regions and genomes, as well as substantially reduce the assembly cost.
Research: My first aim is to develop a hybrid algorithm to produce high-quality haplotype-resolved assemblies for diploid
genomes, especially focusing on resolving highly repetitive regions like centromeres. The proposed algorithm will combine
the advantages of length and accuracy from different types of long-read data to automatically reconstruct the last unexplored
repeat-rich regions of the genome. In the second aim, I will develop a haplotype-aware scaffolding algorithm to achieve
chromosome-level haplotype-resolved assemblies for diploid genomes. In the third aim, I will propose different strategies to
reduce the sequencing cost and the computational cost of haplotype-resolved assembly, making it feasible for population-scale
studies. I will also develop assembly algorithms to resolve complex genomes that contain more than two haplotypes. Upon
completion, the proposed studies will offer efficient assembly tools for large-scale sequencing projects, and will pave the way
to personal genome assembly for genomic research and clinical applications.
Career development and training: My long-term career goal is to lead an independent research group focusing on
developing novel computational methods for haplotype-resolved assemblies and the relevant applications. In addition to
further enhancing my training in computational method development with my mentor Dr. Heng Li, I will obtain systematic
training in biomedical research from the advisory committee (Dr. Erich D. Jarvis and Dr. Scott V. Edwards for human and
non-human genomes, Dr. Evan E. Eichler and Dr. Karen H. Miga for repeats and structural variations, as well as Dr. Matthew
Meyerson for complex genomes that contain more than two haplotypes). The training in career development, including laboratory
management, grant-writing and leadership, will also be carried out during the K99 phase. My experience in computational
method development, especially in genome assembly, as well as the rigorous support from my mentoring and
advisory team, puts me in a unique position to establish an independent lab studying haplotype-resolved genome assembly
algorithms.