Analyzing Labor Action Events: Predicting Strike Outcomes with R and Tidymodels

Authors: Putra Farrel Azhar, Lauryn Edwards, Meilin Chen, Yanji Wang
Published: March 18, 2024


Table of Contents

  1. Introduction
  2. Data
  3. Model
  4. Conclusion
  5. License

Introduction

This project predicts whether a labor action results in a strike. After evaluating several approaches, we selected LASSO logistic regression because it performed best. The model achieved 92.79% accuracy on the training set and a ROC AUC of 96.74%, and it held 88.13% accuracy when tested on the latest data. This report covers our data handling, modeling process, and performance evaluation. The final analysis script can be found here. The following are the .html and .pdf versions of the project's memo.


Data

We utilized the Labor Action Tracker (LAT) dataset, supplemented with data from the American Community Survey (ACS) to enrich our analysis. Here’s an overview of our data processing steps:

  1. Data Cleaning:

    • Filtered LAT dataset for single-location labor actions and extracted longitude and latitude.
    • Converted U.S. county boundary shapefile to match our dataset’s coordinate reference system.
    • Spatially joined the datasets to associate labor actions with respective counties.
  2. Data Standardization:

    • Created a binary variable for strikes (1 for "Strike", 0 for "Non-strike").
    • Standardized measurement units for labor action durations.
    • Removed redundant columns and ensured consistency in naming and formatting.
  3. Handling Missing Values:

    • Removed columns with excessive missing values.
    • Filled missing values in categorical variables with "Missing" and imputed the median for numeric variables with few missing entries.
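
The standardization and missing-value steps above can be sketched in base R. This is a minimal illustration on made-up records, not the project's script: the data frame `actions` and its column names (`action_type`, `duration`, `duration_unit`, `industry`) are hypothetical and are not taken from the LAT schema.

```r
# Hypothetical toy records; column names are illustrative only.
actions <- data.frame(
  action_type   = c("Strike", "Protest", "Strike", "Protest"),
  duration      = c(2, 48, NA, 1),
  duration_unit = c("Days", "Hours", "Days", "Days"),
  industry      = c("Education", NA, "Health Care", "Retail"),
  stringsAsFactors = FALSE
)

# Binary outcome: 1 for "Strike", 0 for anything else.
actions$strike <- as.integer(actions$action_type == "Strike")

# Standardize durations to days (assumes only "Days" and "Hours" occur).
in_hours <- actions$duration_unit == "Hours"
actions$duration_days <- ifelse(in_hours, actions$duration / 24, actions$duration)

# Categorical gaps become an explicit "Missing" level.
actions$industry[is.na(actions$industry)] <- "Missing"

# Numeric gaps are filled with the column median.
med <- median(actions$duration_days, na.rm = TRUE)
actions$duration_days[is.na(actions$duration_days)] <- med

# Drop columns made redundant by the standardized versions.
actions <- actions[, c("strike", "duration_days", "industry")]
```

Keeping "Missing" as its own category lets the model learn from the fact that a value was absent, rather than discarding those rows.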

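The spatial join in the data cleaning step can be sketched with the `sf` package. This is a rough illustration under stated assumptions: the two square "counties" are toy polygons standing in for the real boundary shapefile, and the column names (`id`, `longitude`, `latitude`, `county`) are hypothetical.

```r
library(sf)

# Hypothetical single-location labor actions with lon/lat coordinates.
actions <- data.frame(
  id        = 1:3,
  longitude = c(0.5, 1.5, 5.0),
  latitude  = c(0.5, 0.5, 5.0)
)
pts <- st_as_sf(actions, coords = c("longitude", "latitude"), crs = 4326)

# Helper that builds a 1 x 1 degree square with its lower-left corner at (x0, 0).
square <- function(x0) {
  st_polygon(list(rbind(
    c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1), c(x0, 1), c(x0, 0)
  )))
}

# Toy county layer, deliberately stored in a different CRS (EPSG:3857)
# to mimic a shapefile that does not match the point data.
counties <- st_sf(
  county   = c("A", "B"),
  geometry = st_sfc(square(0), square(1), crs = 4326)
)
counties <- st_transform(counties, 3857)

# Convert the counties to the CRS of the labor action points.
counties <- st_transform(counties, st_crs(pts))

# Left join: each action gets the county it falls within, or NA if none.
joined <- st_join(pts, counties, join = st_within)
```

Actions that fall outside every polygon keep an `NA` county, which makes unmatched records easy to find and inspect.
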
Model

After comparing several models (linear regression, KNN, and Random Forest), we found that logistic regression with LASSO performed best for our binary outcome prediction. Key aspects of our modeling process include:

  • Feature Selection: LASSO's ability to shrink coefficients of less important variables to zero helps identify the most impactful predictors.
  • Parameter Tuning: We ran 5-fold cross-validation to select the penalty value; the best cross-validated accuracy was 91.3%.
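
The tuning procedure described above might look roughly like the following in tidymodels. This is a sketch on simulated data, not the project's analysis script: the predictors (`x1`, `x2`, `noise`), the grid of 20 penalty values, and the stratified folds are assumptions made for illustration. It requires the `glmnet` package to be installed.

```r
library(tidymodels)

set.seed(42)

# Simulated stand-in for the cleaned data: two informative predictors
# and one pure-noise predictor.
n <- 300
toy <- tibble(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n)) %>%
  mutate(
    p = plogis(1.5 * x1 - 1.0 * x2),
    strike = factor(ifelse(runif(n) < p, "Strike", "Non-strike"),
                    levels = c("Strike", "Non-strike"))
  ) %>%
  select(-p)

# Normalize predictors so the penalty treats them on a common scale.
rec <- recipe(strike ~ ., data = toy) %>%
  step_normalize(all_numeric_predictors())

# mixture = 1 selects the pure LASSO (L1) penalty; penalty is tuned.
lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(lasso_spec)

# 5-fold cross-validation over a regular grid of penalty values.
folds <- vfold_cv(toy, v = 5, strata = strike)
grid  <- grid_regular(penalty(), levels = 20)

tuned <- tune_grid(
  wf,
  resamples = folds,
  grid      = grid,
  metrics   = metric_set(accuracy, roc_auc)
)

# Pick the penalty with the best mean accuracy and refit on all the data.
best      <- select_best(tuned, metric = "accuracy")
final_fit <- finalize_workflow(wf, best) %>% fit(data = toy)

# Coefficients at the chosen penalty; terms shrunk to zero are dropped features.
coefs <- tidy(extract_fit_parsnip(final_fit))
```

Calling `collect_metrics(tuned)` returns the mean and standard error of each metric per penalty value, which is the information an accuracy-versus-penalty plot with error bars is built from.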

Performance Evaluation

We assessed the model's performance with a confusion matrix, which showed a training accuracy of 92.8%. When validated on the latest LAT dataset, accuracy fell to 88.1%, a drop of about 4.7 percentage points that suggests the model generalizes reasonably well.
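
As a small worked example of how accuracy is read off a confusion matrix, here is a base R sketch. The eight labels are invented for illustration and are not the project's predictions.

```r
# Invented true and predicted classes for eight labor actions.
lv    <- c("Strike", "Non-strike")
truth <- factor(c("Strike", "Strike", "Strike", "Non-strike",
                  "Non-strike", "Non-strike", "Non-strike", "Strike"),
                levels = lv)
pred  <- factor(c("Strike", "Strike", "Non-strike", "Non-strike",
                  "Non-strike", "Non-strike", "Strike", "Strike"),
                levels = lv)

# Rows are the true class, columns the predicted class.
cm <- table(Truth = truth, Predicted = pred)

# Accuracy is the share of cases on the diagonal (correct predictions).
accuracy <- sum(diag(cm)) / sum(cm)
```

Here six of the eight cases sit on the diagonal, so `accuracy` is 0.75. The off-diagonal cells separate missed strikes from false alarms, which accuracy alone does not distinguish.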


Results

Model Accuracy vs. Penalty

Figure: Model Accuracy vs. Penalty. The figure illustrates the relationship between the penalty and model accuracy, with error bars representing variability.

Variable Importance

Figure: Variables with Greatest Impact on Prediction Model. The figure highlights the variables with the greatest impact on the prediction model, categorized by their positive (POS) and negative (NEG) contributions.


Conclusion

Despite challenges such as missing data and complex feature representations, our model identifies the strongest predictors of whether a labor action becomes a strike. This information can help employers and policymakers take proactive measures that address labor concerns before they escalate to strikes.


License

This project is licensed under the MIT License. See the LICENSE file for details.


Contributors

For inquiries or collaboration opportunities, please reach out to: