Program evaluation using sentiment analysis
For a client, I conducted an evaluation of one of their high-profile programs using a wide range of data science methods. The project involved wrangling data from a variety of sources, and it posed several challenges related to data quality and compatibility, which necessitated extensive cleaning, validation, and various merging techniques. I also had to generate several control variables for the analyses, which involved using GIS methods to generate legislative district-level covariates from county- and precinct-level data (code).
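To illustrate the aggregation step, here is a minimal sketch of population-weighted aggregation from precincts to districts. It assumes a spatial overlay has already produced a crosswalk giving the share of each precinct's population that falls in each district. The precinct names, the crosswalk, the `rate` covariate, and the function `district_rates` are all hypothetical and are not taken from the project code.

```python
from collections import defaultdict

# Hypothetical crosswalk rows: (precinct, district, share of the precinct's
# population that falls inside that district). Shares per precinct sum to 1.
crosswalk = [
    ("P1", "D1", 1.0),
    ("P2", "D1", 0.4),
    ("P2", "D2", 0.6),
    ("P3", "D2", 1.0),
]

# Hypothetical precinct-level data: a population count and a rate covariate.
precincts = {
    "P1": {"population": 1000, "rate": 0.50},
    "P2": {"population": 2000, "rate": 0.25},
    "P3": {"population": 500, "rate": 0.80},
}


def district_rates(crosswalk, precincts):
    """Return the population-weighted mean of `rate` for each district."""
    population = defaultdict(float)
    weighted_sum = defaultdict(float)
    for precinct, district, share in crosswalk:
        record = precincts[precinct]
        # Allocate the precinct's population to the district by its share.
        allocated = record["population"] * share
        population[district] += allocated
        weighted_sum[district] += allocated * record["rate"]
    return {d: weighted_sum[d] / population[d] for d in population}


rates = district_rates(crosswalk, precincts)
```

Weighting by allocated population rather than by land area is a design choice. It suits rate-type covariates, while count-type covariates would instead be summed after allocation.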
The analysis was divided into two parts. The first part involved building a predictive model of joining the program. I used logistic regression, random forest, and SHAP values to evaluate the size, direction, and significance of numerous features. The analysis showed which variables mattered most for predicting the outcome and helped identify several non-linear relationships.
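SHAP values are built on Shapley values, which split a single prediction into per-feature contributions. The sketch below computes exact Shapley values by brute-force enumeration for a small toy model, measured against one baseline point. It illustrates the idea only. It does not use the SHAP library, and the `model` function, its coefficients, and the inputs are hypothetical.

```python
from itertools import combinations
from math import exp, factorial


def model(x):
    """Hypothetical fitted model: a logistic score with one interaction term."""
    z = -1.0 + 0.8 * x[0] - 0.5 * x[1] + 0.6 * x[0] * x[2]
    return 1.0 / (1.0 + exp(-z))


def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x, relative to a single baseline point."""
    n = len(x)

    def value(subset):
        # Features in `subset` take their observed value; the rest stay at
        # the baseline value.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return f(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude feature i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for combo in combinations(others, k):
                s = set(combo)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


x = [1.0, 2.0, 1.5]          # hypothetical observation
baseline = [0.0, 0.0, 0.0]   # hypothetical reference point
phi = shapley_values(model, x, baseline)

# The contributions sum to the gap between the prediction and the baseline.
gap = model(x) - model(baseline)
```

The sign of each contribution gives direction and its magnitude gives size. Enumeration grows exponentially with the number of features, which is why practical tools rely on approximations or model-specific algorithms.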
The second part focused on modeling individual behavior on social media, using a sample of thousands of individuals and roughly 1.15 million social media posts. I used off-the-shelf sentiment analysis modules (VADER) to capture general tone, but I also built a custom dictionary with context markers to capture opposition to or support for specific values.
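A dictionary with context markers can be sketched roughly as follows: find a value term, then scan a short window of preceding words for support or opposition markers, flipping the sign when a marker is directly negated. Every word list here, the window size, and the function `stance_score` are hypothetical stand-ins, not the dictionary built for the project.

```python
import re

# Hypothetical lexicon, for illustration only.
VALUE_TERMS = {"transparency", "accountability"}
SUPPORT_MARKERS = {"support", "defend", "need", "protect"}
OPPOSE_MARKERS = {"oppose", "reject", "against", "end"}
NEGATIONS = {"not", "never", "no", "don't", "won't"}
WINDOW = 4  # number of tokens before a value term to scan for markers


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def stance_score(text):
    """Return a net stance count: positive for support, negative for opposition."""
    tokens = tokenize(text)
    score = 0
    for i, token in enumerate(tokens):
        if token not in VALUE_TERMS:
            continue
        context = tokens[max(0, i - WINDOW):i]
        for j, word in enumerate(context):
            if word in SUPPORT_MARKERS:
                polarity = 1
            elif word in OPPOSE_MARKERS:
                polarity = -1
            else:
                continue
            # A negation directly before the marker (and inside the window)
            # flips its polarity.
            if j > 0 and context[j - 1] in NEGATIONS:
                polarity = -polarity
            score += polarity
    return score


examples = [
    "We support transparency in government",
    "I do not support accountability",
    "They reject accountability",
    "Nice weather today",
]
scores = [stance_score(text) for text in examples]
```

A score like this measures stance toward a specific value, which a general tone score may not capture. A post can be angry in tone while supporting the value in question.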
The modeling combined several methods: OLS with fixed effects, clustered balanced sampling to handle class imbalance, weighted regression, and bootstrapping.
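Two of these pieces, fixed effects and bootstrapping, fit together as in the sketch below. It estimates a single slope with individual fixed effects by demeaning within each individual, then resamples whole individuals to get an approximate percentile interval. The data are synthetic, and the single-regressor setup, function names, and parameters are assumptions made for illustration rather than the project's specification.

```python
import random
from collections import defaultdict


def within_slope(rows):
    """OLS slope of y on x with individual fixed effects (within estimator).

    rows: list of (individual_id, x, y) tuples.
    """
    groups = defaultdict(list)
    for gid, x, y in rows:
        groups[gid].append((x, y))
    sxy = sxx = 0.0
    for obs in groups.values():
        # Demeaning within each individual removes that individual's intercept.
        mean_x = sum(x for x, _ in obs) / len(obs)
        mean_y = sum(y for _, y in obs) / len(obs)
        for x, y in obs:
            sxy += (x - mean_x) * (y - mean_y)
            sxx += (x - mean_x) ** 2
    return sxy / sxx


def cluster_bootstrap(rows, n_boot=500, seed=0):
    """Approximate 95% percentile interval, resampling whole individuals."""
    rng = random.Random(seed)
    by_id = defaultdict(list)
    for gid, x, y in rows:
        by_id[gid].append((x, y))
    ids = sorted(by_id)
    slopes = []
    for _ in range(n_boot):
        sample = []
        for new_id, gid in enumerate(rng.choices(ids, k=len(ids))):
            # Relabel draws so an individual drawn twice forms two clusters.
            sample.extend((new_id, x, y) for x, y in by_id[gid])
        slopes.append(within_slope(sample))
    slopes.sort()
    return slopes[int(0.025 * n_boot)], slopes[int(0.975 * n_boot) - 1]


# Synthetic panel: 30 individuals, 5 posts each, true slope of 2.0.
data_rng = random.Random(42)
rows = []
for gid in range(30):
    alpha = data_rng.gauss(0, 2)  # individual-specific intercept
    for _ in range(5):
        x = data_rng.gauss(0, 1)
        rows.append((gid, x, alpha + 2.0 * x + data_rng.gauss(0, 0.5)))

slope = within_slope(rows)
low, high = cluster_bootstrap(rows)
```

Resampling individuals rather than single posts keeps each person's posts together, so the interval reflects the correlation among posts by the same person.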

