The main goal of this project is to build a person-centric, privacy-conscious system that generates high-quality health data sets in a common data model (CDM) for use in Artificial Intelligence (AI) training and research. Currently, many challenges prevent the efficient use and exchange of Electronic Health Records (EHRs), including poor interoperability, a lack of relevant metadata, and incompatible or insufficient use of data standards. Healthcare systems and research organizations remain largely disconnected, unable to efficiently access, process, or exchange patient data during the delivery of care or research. Even when data are shared, extracting, transforming, testing, and refining data sets until they are ready for clinical quality reports and research use is a labor-intensive and costly process. This project will be a collaboration between DARTNet, a research and clinical support organization focused on the aggregation, standardization, and reuse of existing electronic health data, and Cloud Privacy Labs, a privacy technology company, guided by a Technical Expert Panel (TEP). The short-term objective of this project is to build and evaluate a data processing framework that enables semantic harmonization of health data collected from multiple small and large health care providers that use different EHR systems and coding conventions. This framework will be built using the Layered Schemas Architecture (LSA), an open-source technology developed by Cloud Privacy Labs. The LSA enables semantic interoperability of data collected from heterogeneous sources by using schemas to define a common base model and overlays to address data variations related to different EHR vendors, conventions, or jurisdictions. The LSA translates incoming data into a property graph enriched with semantic annotations, which is then processed using semantic web tools. The first year of the project will focus on developing the data processing framework using synthetic data in the form of FHIR messages, CCDAs, and CSV files. This phase will include the development of a schema repository, the integration of a terminology database, and the development of a translation library. After the development phase, DARTNet will collect health data from multiple small and large providers with at least four disparate EHR systems: Epic, Medent, eMDS, and one other EHR chosen by the TEP and the Office of the National Coordinator for Health Information Technology (ONC). These data will be processed using the new framework to produce high-quality outputs in the Observational Medical Outcomes Partnership (OMOP) Common Data Model. With guidance and feedback from the TEP and the ONC, the research team will systematically evaluate process efficiency and data quality (completeness, conformance, and plausibility) by comparing the LSA approach to a gold standard (DARTNet's traditional ETL) from source EHR data. Long-term objectives of this project include: 1) improving AI and Natural Language Processing algorithms using enriched semantics, and 2) building a privacy-conscious data commons for the research community. The findings of this project will be disseminated through academic venues and technical workgroups. The software and documentation will be published in a GitHub repository for open access.
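
As an illustration of the schema-and-overlay idea described above, the following minimal Python sketch composes a base schema with a vendor-specific overlay and translates one flat record into annotated graph nodes. It is not the LSA implementation or its API: the schema layout, the field names, the semantic terms, and the functions `compose` and `to_graph` are all hypothetical and chosen only to show the layering pattern.

```python
from copy import deepcopy
from datetime import datetime

# Hypothetical base schema: each field carries a semantic annotation ("term").
BASE_SCHEMA = {
    "patient_id": {"term": "Person.identifier"},
    "birth_date": {"term": "Person.birthDate", "format": "%Y-%m-%d"},
    "sex": {"term": "Person.gender"},
}

# Hypothetical overlay for one source system: it renames fields, changes the
# date format, and maps local codes, without modifying the base schema.
VENDOR_OVERLAY = {
    "patient_id": {"source_field": "MRN"},
    "birth_date": {"source_field": "DOB", "format": "%m/%d/%Y"},
    "sex": {"source_field": "Gender", "value_map": {"M": "male", "F": "female"}},
}


def compose(base, overlay):
    """Return a new schema in which overlay attributes override or extend the base."""
    layered = deepcopy(base)
    for field, attrs in overlay.items():
        layered.setdefault(field, {}).update(attrs)
    return layered


def to_graph(record, schema):
    """Translate one flat record into annotated nodes and edges (a toy property graph)."""
    nodes = [{"id": 0, "label": "Record", "properties": {}}]
    edges = []
    for field, attrs in schema.items():
        source_field = attrs.get("source_field", field)
        if source_field not in record:
            continue  # field absent in this source; leave it out of the graph
        raw = record[source_field]
        # Map local codes to common values when the overlay provides a mapping.
        value = attrs.get("value_map", {}).get(raw, raw)
        # Normalize dates to ISO 8601 when a source format is declared.
        if "format" in attrs:
            value = datetime.strptime(raw, attrs["format"]).date().isoformat()
        node_id = len(nodes)
        nodes.append({
            "id": node_id,
            "label": field,
            "properties": {"term": attrs["term"], "value": value},
        })
        edges.append((0, "has", node_id))
    return nodes, edges


schema = compose(BASE_SCHEMA, VENDOR_OVERLAY)
nodes, edges = to_graph({"MRN": "123", "DOB": "02/29/2000", "Gender": "F"}, schema)
```

In this sketch, supporting another source system means writing another overlay, while the base schema and the translation code stay unchanged.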