THS: Using Twitter and Big Data Analytics to Track and Predict Health Conditions - Project Summary/Abstract
U.S. health officials are struggling to keep up with information and misinformation related to health conditions,
natural disasters, and disease outbreaks affecting communities nationwide. Early warnings about such events
can be found in public postings made by citizens using social networks like Twitter. However, the sheer volume
of messages posted each day and the real possibility of false content make it very difficult to rely on these data
for guidance, education, and decision making. Deep learning and big data solutions could be used to tackle this
problem by providing the means to collect, classify, and validate these messages, sorting out actionable data
from noise. But deep learning models are hard to train and tune, requiring data sets with thousands of examples.
Our long-term goal is to understand how to build, deploy, and maintain an integrated and scalable platform to
search social media posts and analyze their contents for clues about health conditions. In this project,
our overall objective is to develop the technology needed to integrate search queries and deep learning models
that are run against social media data to detect conversations that can offer clues about emerging topics,
determine the intent of the messages (e.g., opinion, advice), and find and group together individual messages
that are similar in content. Our central hypothesis is that we can reduce query time for message search, increase
classifier accuracy and precision for health topic detection, and simplify model training and deployment through
the use of transfer learning, Generative Adversarial Networks (GANs), and similarity-search models based on
neural networks. In Specific Aim 1 we will develop supervised methods to support accurate message similarity
search. We shall use Siamese neural networks to compute a similarity score between tweets and rank them
according to this score. In Specific Aim 2 we will implement data augmentation via GANs to reduce model
training time and improve accuracy. Our GANs will generate synthetic tweets that are realistic enough to let users
assemble good training data with less manual effort and still obtain well-trained models. As a proof of concept,
we shall harden our existing open-source THS system by adding these capabilities. Our project is novel because
THS is the first system of its kind, providing a “social data warehouse” to collect, store, integrate, index, and
analyze Twitter data in an open-source platform. Its significance stems from the ability to work as a tool to help
health officials analyze tweets, visualize data along disease and spatio-temporal attributes, and perform predictive
analytics, all under one roof. This could have a significant impact on public health disease tracking and response.
UPRM is a Hispanic-serving institution, with the second-largest Hispanic-serving engineering school in the U.S.
and with 35% female enrollment. This project provides a unique opportunity to train students in social media
analysis, big data systems, and machine learning. The success of this project could open new opportunities for
UPRM researchers to participate in collaborative NIH proposals with other institutions.