
UNDERSTANDING THE PERCEPTIONS OF COVID-19


OVERVIEW
Twitter’s real time coverage of crisis events serves as both a mode of information sharing and a platform for engaging in topical discourse. These content sharing practices encapsulate the views and opinions of users providing a meaningful avenue for analyzing public sentiment at scale during crises. In addition, important public announcements and policies made by government authorities - ranging from the President to county health departments - are shared and debated on Twitter.
Given the divergence in perception of COVID-19, it is vital to understand to what extent mitigation efforts during health crises are influenced by misinformation and institutional trust. This work attempts to examine the linkages between perceptions of COVID-19 on Twitter and real world behaviors.
ROLE
Data Scientist
CONTRIBUTIONS
Data Cleaning
EDA
Sentiment Analysis
Topic Modeling
Machine Learning
Insights

“To what degree do mitigation practices reflect social attitudes toward policies aimed at slowing the spread of COVID-19, such as social distancing and quarantines?”
Limiting our scope to two states (i.e. Washington and Florida) in the country that has the largest number of recorded infections, we hypothesize that there are regional differences in how the population perceives COVID-19 mitigation strategies and that these perceptions and viewpoints will be shared on Twitter and be reflected in the practices and mobility patterns of different states.
The objective of the project is to identify public sentiment towards COVID-19 mitigation policies in these states using Twitter data. We also use this research to understand the degree to which viewpoints on mitigation policies expressed on Twitter reflect social attitudes and practices, and the resulting impact on the spread of COVID-19 in each state.
DATA COLLECTION

We gathered over 100 million Tweet IDs - 19-digit numbers that uniquely identify individual tweets - from the GitHub repository of two researchers at the University of Southern California (Chen, 2020). The researchers tracked a variety of terms associated with COVID-19.
The final curated dataset contained 52,437 observations and three variables: the timestamp of the tweet, the full text of each tweet, and whether the tweet was associated with Washington or Florida.

MOBILITY DATA
Google’s Community Mobility Reports (Google, 2020) track how visits to and length of stay at different places (including retail, grocery stores, parks, and more) change compared to a baseline: the median value for the corresponding day of the week over the 5-week period January 3 through February 6, 2020. The data exist at the county level, and Google also aggregates them to the state level using county population estimates.
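To make the baseline concrete, the sketch below (a rough illustration, not Google's actual pipeline) computes a weekday-matched median baseline from hypothetical daily visit counts and expresses each day as a percent change from it; the file and column names are placeholders.

```python
import pandas as pd

# Hypothetical daily visit counts per place category; Google's published
# reports already contain the percent-change values, so this only sketches
# the idea of a median baseline computed over Jan 3 - Feb 6, 2020.
visits = pd.read_csv("daily_visits.csv", parse_dates=["date"])  # columns: date, category, visits
visits["weekday"] = visits["date"].dt.dayofweek

baseline_window = visits[(visits["date"] >= "2020-01-03") & (visits["date"] <= "2020-02-06")]
baseline = (baseline_window.groupby(["category", "weekday"])["visits"]
            .median()
            .reset_index(name="baseline"))

mobility = visits.merge(baseline, on=["category", "weekday"])
mobility["pct_change_from_baseline"] = (
    100 * (mobility["visits"] - mobility["baseline"]) / mobility["baseline"]
)
```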

JOHNS HOPKINS
The COVID-19 confirmed case count data from the CSSE at Johns Hopkins University (Johns Hopkins, 2020) contains the total confirmed cases of COVID-19 at the county level for each day from January 22, 2020 through May 17, 2020. For our purposes, we filtered for counties in Florida and Washington, leaving 117 days' worth of data for 106 counties (39 in Washington and 67 in Florida). Johns Hopkins aggregated these data from a variety of sources.
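A minimal sketch of that filtering step, assuming the JHU CSSE US time-series layout (metadata columns such as Admin2 and Province_State followed by one column per day of cumulative confirmed cases); the file name is a placeholder for the downloaded CSV.

```python
import pandas as pd

# Assumed layout of the JHU CSSE US confirmed-cases time series: metadata
# columns (including Admin2 = county and Province_State), then one column
# per day of cumulative confirmed counts.
cases = pd.read_csv("time_series_covid19_confirmed_US.csv")

fl_wa = cases[cases["Province_State"].isin(["Florida", "Washington"])]

# Reshape the wide date columns into a long (county, state, date, confirmed)
# table and keep January 22 through May 17, 2020.
date_cols = [c for c in fl_wa.columns if "/20" in c]  # date headers like '1/22/20'
long = fl_wa.melt(id_vars=["Admin2", "Province_State"], value_vars=date_cols,
                  var_name="date", value_name="confirmed")
long["date"] = pd.to_datetime(long["date"])
long = long[long["date"] <= "2020-05-17"]
```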
We randomly sampled 15.75% of the Tweet IDs - over 16 million - and used a tool called Twarc to convert each ID into a JSON object containing all of the information associated with the tweet (hundreds of variables). Next, we filtered for user-defined locations and used reverse geocoding to capture tweets from users based in Florida or Washington.
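The sketch below illustrates the hydration and location-filtering steps with twarc's Python API; the credentials and file names are placeholders, and the simple keyword match is a crude stand-in for the reverse-geocoding step we actually used.

```python
from twarc import Twarc

# Placeholder credentials; twarc (v1) hydrates an iterator of Tweet IDs into
# full tweet JSON objects via the Twitter API.
t = Twarc(consumer_key="...", consumer_secret="...",
          access_token="...", access_token_secret="...")

def guess_state(location):
    """Crude stand-in for reverse geocoding: map the free-text,
    user-defined location field to FL or WA where possible."""
    loc = (location or "").lower()
    if "florida" in loc or loc.endswith(", fl"):
        return "FL"
    if "washington" in loc or loc.endswith(", wa"):
        return "WA"
    return None

kept = []
with open("sampled_tweet_ids.txt") as ids:
    for tweet in t.hydrate(line.strip() for line in ids):
        state = guess_state(tweet.get("user", {}).get("location"))
        if state:
            kept.append({"created_at": tweet["created_at"],
                         "full_text": tweet.get("full_text", tweet.get("text", "")),
                         "state": state})
```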


TOPIC MODELING

To identify abstract topics in our collection of tweets, we used a Latent Dirichlet Allocation (LDA) model. This generative probabilistic model, a staple of Natural Language Processing, surfaces similarities across documents - in this case tweets - through overlapping language. The distribution and root meaning of the words in the corpus are used to determine the individual topics. LDA models work best when users remove all punctuation and other non-letter characters, standardize capitalization, and reduce words to their root or base form (e.g., running becomes run) using stemming or lemmatization. In addition, most workflows advise removing “stop words”: commonly occurring words that can skew the distribution of language in a document and displace more topically relevant language.
All LDA models were built with the Scikit-learn library in Python. Following these best practices, we pre-processed the 52,437 tweets to create a uniform starting point and tokenized the data at the word level using term frequency-inverse document frequency (TF-IDF).
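A minimal sketch of that pre-processing and TF-IDF step follows; the exact cleaning rules and tokenizer settings used in the project may differ.

```python
import re

import nltk
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("wordnet")  # needed once for the lemmatizer
lemmatizer = WordNetLemmatizer()

def clean(tweet):
    """Lowercase, strip URLs/mentions/non-letters, and lemmatize each token."""
    tweet = tweet.lower()
    tweet = re.sub(r"http\S+|@\w+|[^a-z\s]", " ", tweet)
    return " ".join(lemmatizer.lemmatize(tok) for tok in tweet.split())

cleaned = [clean(t) for t in tweets]  # `tweets` holds the 52,437 raw tweet texts

# Word-level TF-IDF tokenization with scikit-learn's built-in English stop words.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(cleaned)
```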
Using TF-IDF in our tokenization process let us weigh the importance of a word both within an individual tweet and across the corpus, which supports a more distinct separation of topics. After computing the TF-IDF representation, we created four different LDA models to determine the best fit for our data. Two of the models used bigrams, while the other two used unigrams, and within each pair we applied two different sets of stop words.
The first set contained the basic stop words that come pre-packaged with Scikit-learn, while the second set included additional words identified while exploring word frequencies. Using perplexity to tune hyper-parameters such as the number of topics, we found that the ideal number of topics across all four models was three. After running all four models, we loaded the results into pyLDAvis to better gauge the relevance of each topic across models. Model 1, which used bigrams and the initial set of stop words, displayed the most clearly defined topic areas; its results are discussed in the Findings & Limitations section.
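The sketch below shows how one of the four variants could be fit with Scikit-learn and inspected with pyLDAvis; the n-gram range, stop-word list, and topic sweep shown here are illustrative rather than the project's exact settings, and pyLDAvis.sklearn is the adapter module as named in the versions current at the time.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer
import pyLDAvis
import pyLDAvis.sklearn

def fit_lda(texts, ngram_range, stop_words, n_topics):
    """Vectorize with TF-IDF and fit an LDA model with the given settings."""
    vectorizer = TfidfVectorizer(ngram_range=ngram_range, stop_words=stop_words)
    X = vectorizer.fit_transform(texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=42)
    lda.fit(X)
    return lda, vectorizer, X

# Perplexity sweep over candidate topic counts (lower is better).
for k in range(2, 8):
    lda, vec, X = fit_lda(cleaned, ngram_range=(1, 2), stop_words="english", n_topics=k)
    print(k, lda.perplexity(X))

# Fit the chosen configuration and inspect topic relevance interactively.
lda, vec, X = fit_lda(cleaned, ngram_range=(1, 2), stop_words="english", n_topics=3)
panel = pyLDAvis.sklearn.prepare(lda, X, vec)
pyLDAvis.save_html(panel, "lda_model1.html")
```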
SENTIMENT ANALYSIS

Our first pass at sentiment analysis used several off-the-shelf models, but the results labeled the tweets as overwhelmingly positive, which seemed problematic.
For the second attempt, we used VADER (Valence Aware Dictionary and Sentiment Reasoner), a lexicon- and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. A sentiment lexicon is a list of lexical features (e.g., words) labeled according to their semantic orientation as positive or negative. VADER not only labels a tweet as positive or negative but also indicates how positive or negative it is. The tool combines capitalization, punctuation, word meaning, and emoji interpretation (emojis appear frequently in tweets) to give each tweet percentage-based scores for its positive, negative, and neutral components.
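A minimal example of scoring a tweet with the vaderSentiment package follows, using the conventional ±0.05 compound-score cutoffs to assign an overall label; the example tweet is invented.

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def label(tweet):
    """Return an overall sentiment label from VADER's compound score."""
    scores = analyzer.polarity_scores(tweet)  # {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
    if scores["compound"] >= 0.05:
        return "positive"
    if scores["compound"] <= -0.05:
        return "negative"
    return "neutral"

print(label("Social distancing is working, stay home!"))
```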
Using these sentiment scores alongside the perplexity scores across the LDA models, we confirmed that the ideal number of topics was three. We then met as a group, worked through each topic to determine its overarching theme, and decided that model 1 was the best fit for our analysis.
The graphs show the percentage of total tweets corresponding to each overall sentiment or topic for each state (Appendix I). In the VADER-based sentiment analysis, the overall percentage of positive tweets is higher for Washington than for Florida, but in both states the majority of tweets are negative. Later in our findings we map these sentiments and topics against the dates when closures and other mitigation policies were enforced, to get a deeper sense of the sentiment of people in these states.
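The aggregation behind those graphs is a simple group-by; in the sketch below, `df` is assumed to hold one row per tweet with `state` and `sentiment` columns produced by the steps above.

```python
import pandas as pd

# Share of tweets per sentiment label within each state.
counts = df.groupby(["state", "sentiment"]).size()
shares = (100 * counts / counts.groupby(level="state").transform("sum")).rename("pct_of_tweets")
print(shares.reset_index())
```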

FINDINGS


TOPIC 1
WARNINGS: Health, Symptoms, Testing, Guidelines, General Info
Both Washington and Florida show similar keywords in topic 1, which represents roughly 19% of tokens for Washington and 37% of tokens for Florida.

TOPIC 2
GUIDELINES: Political, Misinformation, Hoax, Experts, Health Guidelines
Topic 2 centers on guidelines and misinformation and is referenced almost equally in Washington and Florida.

TOPIC 3
MITIGATION: Social Distancing, Science, Politics (VP, administration, Pence), Mitigation Efforts
Topic 3 focuses on mitigation, and Washington’s tweets reflect this more explicitly than Florida’s tweets do.

TOPIC 2: Guidelines
We have to acknowledge that an overall or average sentiment across the three topics identified above is difficult to interpret, because the majority of tweets are not strictly about social distancing. Even for individual tweets that were about social distancing, the sentiment score was difficult to interpret; someone could support social distancing with either a positive or negative tone ('I love social distancing' versus 'people who don't like social distancing are idiots'), so a very negative sentiment for the social distancing topic does not mean that people were against social distancing.
With that said, to understand public perception of COVID-19 policies, we present the sentiment trends for two topics: Guidelines and Mitigation. The visualization shows the daily average sentiment by state. To understand how users responded to the guidelines imposed by each state government, we marked important dates such as school closures, restrictions on gatherings, and the stay-at-home orders in both states.
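A sketch of that timeline view, assuming a per-tweet DataFrame `df` with `created_at`, `state`, and VADER `compound` columns; the policy dates shown are placeholders, not the actual announcement dates used in the analysis.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Daily mean VADER compound score per state, with dashed markers at policy
# dates (placeholder dates, purely for illustration).
policy_dates = {"Schools close": "2020-03-15", "Stay-at-home order": "2020-04-01"}

daily = (df.assign(day=pd.to_datetime(df["created_at"]).dt.floor("D"))
           .groupby(["day", "state"])["compound"].mean()
           .unstack("state"))

ax = daily.plot(figsize=(10, 4))
for name, day in policy_dates.items():
    ax.axvline(pd.to_datetime(day), linestyle="--", alpha=0.5, label=name)
ax.set_ylabel("Mean VADER compound score")
ax.legend()
plt.show()
```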

TOPIC 3: Mitigation
While sentiment was largely negative in both states before the guidelines were imposed, the trend changes once they were announced.
Topic 2 captures Twitter users’ sentiment in response to the announced guidelines, and the two states show polar opposite reactions: Washington shows positive sentiment peaks after each announcement, while Florida shows negative dips. For mitigation practices, shown in the figure below, the trends differ in magnitude rather than direction, with Florida following a pattern similar to the one it showed for the guidelines topic.
For Washington, the highly negative sentiment seen before the guidelines were imposed dropped drastically afterward, with even a few peaks of positive sentiment. Overall, Florida shows negative sentiment toward guidelines and mitigation practices (Topics 2 and 3), while sentiment trends for Washington reflect positive sentiment toward both.

Florida’s total confirmed COVID-19 cases were actually more than twice Washington’s, so we assumed that Florida’s population movement patterns would be drastically different from Washington’s. To our surprise, urban mobility trends for Florida and Washington were rather similar in the data we analyzed.