A novel approach to pinpoint predisposed recombination regions in HIV for a global profile of HIV recombinants' occurrence and evolution - Project Summary/Abstract

Occurring more frequently than point mutations, recombination events in HIV have produced epidemiologically important founder strains in various geographic regions and contribute to at least 20-30% of global HIV infections. However, 33 years after HIV recombination was first recognized, two critical questions remain unanswered: how HIV recombination events occur, and how to predict emerging recombinant clusters to better inform public health decision making. Based on nearly 20 years of experience with HIV, we believe that both questions relate to flaws inherent in the current classification system for HIV recombinant families (circulating recombinant forms, CRFs), which does not adequately account for rapid virus evolution within and between recombinant families. As a result, this static definition of CRFs not only creates a small-sample-size problem for most CRFs, but also makes it difficult to track viral evolution dynamically. Addressing both issues is essential for public health surveillance and for improved vaccine and antiviral design targeting HIV recombinants. In this proposed study, our objectives are to: 1) strategically fill in inadequate and missing CRF information to overcome the small-sample-size problem created by the CRF definition, and 2) provide a genome-wide global profile of the occurrence and evolution of HIV recombination by pinpointing regions of the HIV genome that are predisposed to recombination, based on the enriched CRF information. Our central hypothesis is that the inadequate and missing CRF information, which accounts for the limited sample size for most CRFs and the lack of a dynamic view of CRFs, can be strategically filled in by adding information from HIV fragment sequences (i.e., non-full-length sequences), which constitute over 90% of all published HIV data deposited in GenBank.
Our hypothesis is based on several important lines of evidence, including results from our own studies. Our rationale for this proposed study is that filling in inadequate and missing CRF information will strengthen the surveillance and tracking of existing HIV recombinants and improve the prediction of emerging HIV recombinant clusters. Leveraging our nearly 20 years of experience in HIV data mining, statistical method development, statistical machine learning, and statistical genetics, we will develop and validate two new methods in three Aims. By the end of this proposed project, we expect to obtain new methods and new findings that advance our understanding of CRFs (e.g., recombination mechanisms and evolution) and improve public health surveillance of existing and emerging CRF clusters. Finally, our methods and results will be released freely to facilitate recombination research in other viruses.