A novel platform for synthetic generation and statistical obfuscation of tabular clinical data, simulated images, and machine-generated text - PROJECT SUMMARY Data is a critical and highly valuable commodity that drives meaningful change in our society, especially in patient care and biomedical research. Currently, institutions pay inordinate sums to increase, regain, and complement their data panels. In addition, data legislation and privacy protection regulations create barriers to forming effective partnerships among business, clinical, research, and educational organizations. As a result, approximately 80% of medical data today cannot be readily shared because it contains personal, protected, or sensitive information, and it remains unstructured and untapped after it is created. There is a growing and urgent unmet need for technology solutions that balance the interests of research and commercial organizations by supporting flexible, general-purpose analytics while guaranteeing privacy protection. There are no effective mechanisms to enable data sharing without risking either inappropriate release of sensitive information or degradation of the information content. The few currently available protocols and algorithms for modeling, processing, interrogating, and ultimately sharing large sensitive data (e.g., thousands to millions of records with thousands of heterogeneous features) all share significant limitations, and their practical use still lags behind research progress. Two major unmet needs in the data sharing industry are i) the ability to return de-identified clones of the raw data, and ii) the scalability required for production deployments.
GrayRain, LLC is an early-stage Software-as-a-Service company developing a novel platform for statistical obfuscation and de-identification of sensitive structured information (numerical and categorical tabular data) and unstructured information (e.g., clinical text, doctors’ and nurses’ notes, and clinical images such as MRI and PET). The core of GrayRain’s technology is DataSifter, a novel patented statistical obfuscation algorithm. The technology proposed in this STTR Phase I application will significantly increase the number of secure data transactions in the healthcare sector and beyond by enabling data sharing with a fully controllable risk of identifying any sensitive information, including, but not limited to, protected health information (PHI), demographic information, and socioeconomic status. GrayRain’s technology can produce de-identified clones of raw tabular data, addressing a major limitation encountered across existing data anonymization protocols. With respect to scalability, the main goal of this STTR Phase I project is to establish the feasibility of using GrayRain’s platform to accurately and efficiently de-identify and share large-scale, complex EHR data repositories with a controlled risk of disclosing protected or personal health information.