In recent years, a socio-political shift across rural and urban areas has produced greater divides at both the local and national level. Our project models economic, demographic, and social changes at the county level to predict the potential course of future elections, so that we can discern the major factors contributing to political divide.
Given data collected by the US Census Bureau [1], we want to understand the influence that shifts in demographic, economic, and social factors have on the political landscape, and whether forecasting future outcomes [2-4] can be accomplished with a machine learning model. The current state of the art looks at a relatively limited set of features, leaving many potentially useful variables unexplored. Our proposed solution accommodates a large number of Census data attributes, applying a series of unsupervised and supervised techniques to gain a deeper understanding of what drives these outcomes.
We define our supervised learning problem as binary classification: for a particular county and election, we predict whether the winning party (Democrat or Republican) flips relative to the previous election of the same type. A “flip” is the positive class.
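For illustration, a minimal sketch of this label construction under our definition (the column names and values below are hypothetical stand-ins, not our actual data):

```python
# A county's label for election year t is 1 ("flip") if the winning party
# differs from the same race's previous winner in that county.
import pandas as pd

results = pd.DataFrame({
    "fips":   ["13001", "13001", "13003", "13003"],
    "year":   [2012, 2016, 2012, 2016],
    "winner": ["R", "D", "R", "R"],
})
results = results.sort_values(["fips", "year"])
results["prev_winner"] = results.groupby("fips")["winner"].shift(1)
results["flip"] = (
    (results["winner"] != results["prev_winner"]) & results["prev_winner"].notna()
)
print(results)  # 13001 flips in 2016; 13003 does not
```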
Our dataset is assembled from several Census datasets, each of which captures a different set of statistical measures. For this project, we include features from the following datasets:
The American Community Survey (ACS) [5] covers a broad range of variables regarding social, economic, demographic, and housing information across geographical areas in the United States. Each year, comprehensive 5-Year Estimates are released covering the five years leading up to that year. We utilize 175 unique variables from its Data Profile tables (DP02 through DP05).
County Business Patterns (CBP) [6] provides subnational economic data by industry on an annual basis. From this dataset, we take four variables: first-quarter payroll, annual payroll, number of employees, and number of establishments.
ACS Migration Flows (AMF) [7] provides period estimates that measure changes in residence. From this dataset, we take six variables.
We used a set of Census API keys to pull data for each of the Census datasets into CSV format, and we have included a script to download and collate the data. The primary keys for each dataset are:

* year
* state (FIPS code)
* county (FIPS code)
Every data record has a unique combination of these keys. The year ranges from 2009 to 2019 (inclusive), the time range of data we were able to obtain from the Census website. The state and county are designated numeric identifiers known as FIPS codes: a state is denoted by two digits and a county by three digits, together forming a unique five-digit FIPS code. The Census data is available for all counties in all states.
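A minimal sketch of the kind of request the download script issues, assuming the standard Census API URL format; the variable shown (DP03_0062E, median household income) and the key placeholder are illustrative, not our actual configuration:

```python
import csv
import requests

API_KEY = "YOUR_CENSUS_API_KEY"  # hypothetical placeholder
BASE_URL = "https://api.census.gov/data/{year}/acs/acs5/profile"

def pull_acs_profile(year, state_fips, variables):
    """Fetch ACS 5-Year Data Profile variables for every county in a state."""
    params = {
        "get": ",".join(["NAME"] + variables),
        "for": "county:*",
        "in": f"state:{state_fips}",
        "key": API_KEY,
    }
    resp = requests.get(BASE_URL.format(year=year), params=params)
    resp.raise_for_status()
    return resp.json()  # list of rows; the first row is the header

rows = pull_acs_profile(2019, "13", ["DP03_0062E"])
with open("acs_2019_13.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```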
We pull county-level data for presidential and senatorial races for the years within the range of available Census data; specifically, the 2012 and 2016 presidential elections and all senatorial races from 2012 to 2018. This data is publicly available from Dave Leip’s Atlas of U.S. Elections [8]. We implemented a script that uses Selenium, a browser automation tool, to browse the site’s pages by URL and pull the data from the HTML into .csv files. The results, available for each county in a given election year, serve as the basis for our labels.
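A hedged sketch of the scraping approach; the CSS selector and the per-race URL construction are hypothetical stand-ins for what our script actually targets on the Atlas site:

```python
import csv
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://uselectionatlas.org/RESULTS/")  # per-race URLs are built in our script

# Collect every populated table row on the results page.
rows = []
for tr in driver.find_elements(By.CSS_SELECTOR, "table tr"):
    cells = [td.text for td in tr.find_elements(By.TAG_NAME, "td")]
    if cells:  # skip header and empty rows
        rows.append(cells)

with open("election_results.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
driver.quit()
```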
| | Data points (county/state/year) | Positive labels | Non-positive labels |
|---|---|---|---|
| Presidential data | 5860 | 509 | 5351 |
| Senatorial data | 7434 | 1234 | 6200 |
After preprocessing the data, we applied Principal Component Analysis (PCA) to identify the most important principal components, retaining enough components to preserve 99% of the variance. PCA reduces the dimensionality of the dataset while revealing which features are most informative for a model predicting election results. We implemented PCA using the scikit-learn library; results and the selected values of k are shown in the plots in the “Results” section.
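A minimal sketch of this step; standardizing the features first is our assumption about reasonable preprocessing, and the random matrix stands in for the real one-year-change feature matrix:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 185))  # stand-in: samples x census variables

X_scaled = StandardScaler().fit_transform(X_train)
pca = PCA(n_components=0.99)  # keep the smallest k explaining 99% of variance
X_reduced = pca.fit_transform(X_scaled)
print("selected k =", pca.n_components_)
```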
For classification, we utilized three scikit-learn models, along with an ensemble model that combines the three. These models take in the vectors from PCA, generated from one-year changes in the raw census variables. The models are:

* AdaBoost
* Random Forest
* Bagging
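A minimal sketch of this lineup; the hyperparameters shown are scikit-learn defaults, and the soft-voting combiner is our assumption about how the ensemble combines the three:

```python
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier, VotingClassifier)

ada = AdaBoostClassifier(random_state=0)
rf = RandomForestClassifier(random_state=0)
bag = BaggingClassifier(random_state=0)

ensemble = VotingClassifier(
    estimators=[("ada", ada), ("rf", rf), ("bag", bag)],
    voting="soft",  # average predicted probabilities across the three models
)
# Each model is fit on the PCA vectors of one-year feature changes, e.g.:
# ensemble.fit(X_reduced, y_train)
```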
In addition to the above models, we implemented a recurrent neural network (RNN) using Keras. The model is designed to take in sequences of length 3, representing the cumulative change in the raw census variables for the three years leading up to an election year.
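A hedged sketch of such a network; the LSTM cell and its width are our assumptions about the architecture, and `n_features` stands in for the census-variable count:

```python
from tensorflow import keras

n_features = 185  # stand-in for the number of census variables

model = keras.Sequential([
    keras.Input(shape=(3, n_features)),  # 3-year sequence of cumulative changes
    keras.layers.LSTM(64),
    keras.layers.Dense(1, activation="sigmoid"),  # P(county flips)
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[keras.metrics.Precision(), keras.metrics.Recall()])
model.summary()
```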
For all the above approaches, in order to address the large class imbalance in our data, we utilized SMOTE [9] over-sampling to create synthetic data for the minority class, targeting a 1:1 class balance in our training data. In addition, we use an under-sampler to reduce the number of samples in the majority class. Both come from the Python library `imblearn`.
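A sketch of this resampling step; the intermediate SMOTE ratio is an illustrative assumption, with the under-sampler finishing at the 1:1 balance:

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 20))           # stand-in features
y_train = (rng.random(1000) < 0.1).astype(int)  # ~10% minority ("flip") labels

resampler = Pipeline(steps=[
    ("smote", SMOTE(sampling_strategy=0.5, random_state=0)),       # grow minority
    ("under", RandomUnderSampler(sampling_strategy=1.0, random_state=0)),  # trim majority
])
X_res, y_res = resampler.fit_resample(X_train, y_train)
print(np.bincount(y_res))  # balanced classes
```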
We performed hyperparameter tuning for each of the models, and arrived at the following values:
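The tuning itself followed a standard cross-validated grid search; the sketch below illustrates the procedure only, with a hypothetical grid and stand-in data rather than the grid we actually searched:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))        # stand-in features
y = rng.integers(0, 2, size=300)      # stand-in labels

param_grid = {"n_estimators": [50, 100, 200], "learning_rate": [0.1, 0.5, 1.0]}
search = GridSearchCV(AdaBoostClassifier(random_state=0), param_grid,
                      scoring="f1", cv=5)
search.fit(X, y)
print(search.best_params_)
```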
*Figure: PCA results for the presidential data (left) and the senatorial data (right).*
This table shows the F1 scores of each model for each dataset:
| | AdaBoost | Random Forest | Bagging | Ensemble | RNN |
|---|---|---|---|---|---|
| Presidential data | 0.179 | 0.112 | 0.167 | 0.118 | 0.456 |
| Senatorial data | 0.350 | 0.258 | 0.267 | 0.284 | 0.476 |
The following confusion matrices correspond to the F1-scores for AdaBoost; rows are actual labels and columns are predicted labels:
AdaBoost for Presidential:
| | Non-flip | Flip |
|---|---|---|
| Non-flip | 838 | 234 |
| Flip | 68 | 33 |
AdaBoost for Senatorial:
| | Non-flip | Flip |
|---|---|---|
| Non-flip | 988 | 518 |
| Flip | 176 | 187 |
The following plot shows the degree to which the F1-score, precision, and recall of the AdaBoost senatorial classifier vary with the variable W: the ratio by which the minority class (“Flip”) is weighted relative to the majority class (“Non-flip”) during training. We varied this value in order to explore whether modifying it would improve performance in the face of class imbalance.
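A minimal sketch of how such a weighting can be applied; since `AdaBoostClassifier` exposes no `class_weight` argument, per-sample weights are one natural mechanism, and this choice is our illustration rather than a confirmed detail of the experiment:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))            # stand-in features
y = (rng.random(1000) < 0.15).astype(int)  # 1 = "Flip"

W = 3.0  # minority-to-majority weight ratio, swept to produce the plot
sample_weight = np.where(y == 1, W, 1.0)

clf = AdaBoostClassifier(random_state=0)
clf.fit(X, y, sample_weight=sample_weight)
```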
One of our key observations is the drastic improvement in performance of the RNN over the sklearn models. We believe this is because the data we pass into the RNN is more information-rich than what we pass into the sklearn models: a sequence of yearly changes over the three years prior to an election, rather than a change over a single year. This captures more detail about how the economic, demographic, and social variables of a particular county have changed over the election-relevant period.
Another one of our key observations is that the presidential models consistently performed worse than the senatorial models. We hypothesize that this is because presidential races are more strongly influenced by attitudes of political partisanship, information that the Census data cannot encode. In addition, senate races are state-level while presidential races are national-level, possibly suggesting that the candidate’s proximity to home has an added effect as well.
Out of the sklearn models, AdaBoost performed the best for both the presidential and senatorial data. We believe this is because AdaBoost is more robust to the curse of dimensionality than the other models, allowing it to perform better on high-dimensional data.
In addition, according to the plot, the F1-score stays roughly constant as the training weight of the minority class increases: a gradual increase in recall is offset by a gradual decrease in precision. In other words, the classifier becomes more likely to classify minority-class samples correctly, but at the cost of misclassifying more majority-class samples. This leads us to conclude that altering this variable did not significantly improve the classifier’s performance.
Moreover, we make the following broad observations:
We were able to produce at least one classifier that does a decent job of predicting whether a county will flip in a presidential or senatorial election. This demonstrates that the Census features we selected provide a measurable amount of information about when such flips might occur.
For future work:
https://github.com/V-TERM/cs7641_census_project/blob/668bfdc1adaf488440158ca84231d255c03efdf6/Project_Proposal.mp4
[1] United States Census. “Explore Census Data.” Retrieved from https://data.census.gov/cedsci/.
[2] Caballero, Michael. “Predicting the 2020 US Presidential Election with Twitter.” arXiv preprint arXiv:2107.09640 (2021). Retrieved from https://arxiv.org/abs/2107.09640.
[3] Colladon, Andrea Fronzetti. “Forecasting election results by studying brand importance in online news.” International Journal of Forecasting 36.2 (2020): 414-427. Retrieved from https://arxiv.org/abs/2105.05762.
[4] Sethi, Rajiv, et al. “Models, Markets, and the Forecasting of Elections.” arXiv preprint arXiv:2102.04936 (2021). Retrieved from https://arxiv.org/abs/2102.04936.
[5] United States Census. “American Community Survey 5-Year Data (2009-2019).” Retrieved from https://www.census.gov/data/developers/data-sets/acs-5year.html.
[6] United States Census. “County Business Patterns (CBP).” Retrieved from https://www.census.gov/programs-surveys/cbp.html.
[7] United States Census. “American Community Survey Migration Flows.” Retrieved from https://www.census.gov/data/developers/data-sets/acs-migration-flows.html.
[8] Dave Leip’s Atlas of U.S. Elections. “United States Presidential Election Results.” Retrieved from https://uselectionatlas.org/RESULTS/.
[9] Brownlee, Jason. “SMOTE for Imbalanced Classification with Python.” Retrieved from https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/.