Real-time estimation of influenza vaccine effectiveness through social media mining

Project Summary

Influenza has a significant global impact on public health each year, even after the COVID-19 pandemic. In the United States, it causes more than 35 million illnesses, 710,000 hospitalizations, and about 47,000 deaths per year. The strongest public health response is the annual vaccine, which is planned and manufactured several months ahead of the flu season. Globally, the World Health Organization makes recommendations for vaccine content, and national agencies select the most appropriate strain. The effectiveness of the vaccine varies from year to year and even within the same season. Vaccine effectiveness (VE) data guide the response of state and local public health agencies to influenza epidemics and pandemics. VE estimates affect the success of vaccination campaigns, allow agencies to estimate the number of illnesses, hospitalizations, and deaths caused by influenza, and support targeted public health control measures and outreach campaigns when VE is low. The CDC estimates VE annually through the Influenza Vaccine Effectiveness Network (US Flu VE Network). While the CDC's efforts to track flu cases (FluView) and vaccination rates (FluVaxView) collect data continuously from hundreds of sites, the US Flu VE Network runs only at participating clinics in a limited number of states, with each site enrolling around 1,000 participants with influenza-like illness (ILI) each year as part of a test-negative design. The CDC estimates are the US gold standard but have two main limitations: (i) they include only a limited number of states and subjects, and (ii) the interim report is not published until late in the flu season. Here, we propose to use social media (SM) data to address these limitations. SM data are abundantly available across the US in near real time and can serve as complementary data for calculating VE.
Based on separately funded work, we have already collected suitable Twitter user datasets, and we propose to develop automated methods to identify users who report taking a flu test or receiving a flu diagnosis. For these individuals, we will analyze their tweets over time to determine vaccination status, test results, and demographic information. We have shown that SM data collected using our systematic Natural Language Processing (NLP) approach can be used for epidemiology and, per the latest Pew report, are representative of the population. The specific aims of this project are to: (1) develop and evaluate an NLP framework to calculate influenza VE, including analysis of user timelines to extract concepts relevant to VE, and (2) develop and evaluate a real-time VE estimation system that uses longitudinal SM data and accounts for biases, uncertainty, and missing data in vaccination status or influenza diagnosis. Real-time, early VE estimation as we propose could aid preparedness among public health authorities and clinicians, potentially reducing influenza morbidity and mortality. If successful, this will be the first automated approach to near real-time estimation of VE in the United States, providing a viable, relatively low-cost alternative solution to a significant problem.