Project Summary/Abstract
Occurring more frequently than point mutations, recombination events in HIV have resulted in
epidemiologically important founder strains in various geographic regions and contribute to at least 20-30% of
global HIV infections. However, 33 years after HIV recombination was first recognized, two critical questions
remain unanswered: how HIV recombination events occur, and how to predict emerging recombinant
clusters to better inform public health decision making. Based on nearly 20 years of HIV research experience, we believe
that both questions relate to the flaws inherent in the current classification system for HIV recombinant families
(CRFs), which does not adequately account for the rapid viral evolution within and between recombinant families.
As a result, this static definition of CRFs not only creates a small-sample-size problem for most CRFs, but also
makes it difficult to track viral evolution dynamically. Resolving both issues is essential for public health
surveillance and for improved vaccine and antiviral design targeting HIV recombinants. In this proposed study,
our objectives are to: 1) strategically supplement the inadequate and missing CRF information to overcome the
small-sample-size problem created by the CRF definition, and 2) provide a global profile, along the HIV genome, of
HIV recombination occurrence and evolution by pinpointing regions predisposed to recombination, based on
the enriched CRF information. Our central hypothesis is that the inadequate and missing CRF information, which
accounts for the limited sample size for most CRFs and the lack of a dynamic view of CRFs, can be strategically
supplemented by adding information from HIV fragment sequences (i.e., non-full-length sequences), which constitute over
90% of all published HIV data deposited in GenBank. Our hypothesis is based on several important lines of
evidence, including results from our studies. Our rationale for this proposed study is that supplementing the inadequate
and missing CRF information will strengthen the surveillance and tracking of existing HIV
recombinants and improve the prediction of emerging HIV recombinant clusters. Leveraging our nearly 20
years of experience in HIV data mining, statistical method development, statistical machine learning, and
statistical genetics, we will develop and validate two new methods in three Aims. By the end of this proposed
project, we expect to obtain new methods and findings that advance our understanding of CRFs (e.g.,
recombination mechanisms and evolution) and improve public health surveillance of existing and emerging
CRF clusters. Finally, our methods and results will be released freely to facilitate research on recombination in
other viruses.