Tackling Big Data problems in biomedical sciences with extended similarity methods

PROJECT SUMMARY/ABSTRACT

The overall goal of our research program is to develop new multi-purpose similarity-based tools to extract and analyze information from very large datasets in the biomedical sciences. A central aspect of our work will be the determination of the distance (or similarity) between different objects, a fundamental notion that pervades many aspects of modern data science. Similarity searches are at the core of high-throughput virtual screening, an essential task in medicinal chemistry and drug design. Comparisons also play a key role in rationalizing the results of Molecular Dynamics (MD) simulations by helping us identify the most important conformations of a system and how they contribute to its dynamic behavior. Similarity-based techniques are also essential in spectral studies, where they are the foundation of the post-processing machinery in Imaging Mass Spectrometry (IMS). However, these applications currently rely on metrics that can only compare two objects at a time, so the cost of comparing N objects scales quadratically with N. This makes them fundamentally ill-equipped to handle the amount of data generated by state-of-the-art simulations and experiments. We recently generalized these pairwise comparisons by proposing extended similarity indices that allow us to compare an arbitrary number of objects simultaneously. Our indices offer unprecedented efficiency, while also outperforming their binary counterparts in diversity picking, feature selection, and clustering. We will leverage these advantages in three main research directions.

(1) We will develop protocols to improve the drug design process via careful exploration of the chemical space. The extended indices will allow us to study the relations among various very large molecular libraries, which will be key in polypharmacology and drug repurposing.
They will also lead to better measures of chemical diversity and a deeper understanding of structure-activity relationships. This will serve as a guide in generative molecular models, resulting in more robust identification of new drug leads.

(2) We will present new workflows to efficiently analyze biological ensembles. Our medoid algorithm will identify conformations close to the folded state of a protein, while our clustering will classify the structures corresponding to other metastable states. In addition, we will implement sampling techniques that will allow us to analyze very long MD simulations. These tools can then be combined to gain a deeper understanding of various dynamical processes, including the detailed exploration of protein folding landscapes.

(3) We will develop new post-processing techniques to aid the interpretation of IMS data. Our similarity indices can be used to identify spatially and molecularly correlated domains in tissues, without the unphysical artifacts present in other techniques. This will allow us to track the spatial heterogeneity of metabolic processes, which is critical to the validation of IMS data and to establishing new diagnostic tools. The application of our framework to the study of lipid expression in pancreatic tissue will lead to a better understanding of type 1 diabetes metabolism and pathophysiology.