Project Overview
The goal of this project was to develop a robust text classification system capable of determining whether a comment is neutral, positive, or negative. A key focus was evaluating the impact of training data quality by comparing models trained on "crowdsourced" data versus "gold-standard" (trusted) data.

Technical Implementation
- Data Pre-processing: Implemented a custom normalization pipeline to handle misspellings ("netural", "negtaive") and formatting issues common in crowdsourced datasets.
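A minimal sketch of such a normalization step, assuming a hand-built correction map (the specific rules and function names here are illustrative, not the project's actual pipeline):

```python
import re

# Hypothetical correction map for misspellings observed in the crowdsourced
# labels; the real pipeline's rules are assumptions here.
CORRECTIONS = {"netural": "neutral", "negtaive": "negative"}

def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, and fix known misspellings."""
    text = text.lower().strip()
    text = re.sub(r"\s+", " ", text)
    tokens = [CORRECTIONS.get(tok, tok) for tok in text.split(" ")]
    return " ".join(tokens)

print(normalize("  This is  NETURAL "))  # → "this is neutral"
```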
- Feature Engineering: Utilized TF-IDF (Term Frequency-Inverse Document Frequency) with an n-gram range of (1, 2) to capture both individual words and two-word phrases.
- Model Selection: Evaluated multiple models including Multinomial Naive Bayes, LinearSVC, and Logistic Regression. Logistic Regression was selected as the final model for its balanced performance and sensitivity to regularization.
- Hyperparameter Tuning: Conducted a manual grid search for the optimal regularization strength (C=10) and used "balanced" class weights to compensate for class imbalance in the training data.
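Putting the selected configuration together, the final model can be sketched as a scikit-learn pipeline (the toy corpus below stands in for the real training data, which is not part of this summary):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Illustrative stand-in for the real training set.
texts = [
    "i love it", "great and happy", "worst thing ever",
    "awful, hated it", "it was okay", "nothing special",
]
labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

# Final configuration from the grid search: C=10, balanced class weights.
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(C=10, class_weight="balanced", max_iter=1000),
)
clf.fit(texts, labels)
print(clf.predict(["i love it so much"]))
```

`class_weight="balanced"` rescales each class's loss contribution inversely to its frequency, which matters when neutral comments dominate the data.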
Key Findings
- Data Quality Impact: Switching from crowdsourced to gold-standard training data improved model accuracy by approximately 19 percentage points, reaching a final accuracy of 81.5%.
- Agreement Analysis: Calculated Cohen's Kappa to measure statistical agreement between annotators, revealing only moderate agreement (κ = 0.446) and highlighting the subjectivity of sentiment labeling.
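Cohen's Kappa corrects raw agreement for the agreement expected by chance. A small sketch with two hypothetical annotators (the labels below are invented for illustration):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators on the same ten comments.
annotator_a = ["pos", "neg", "neu", "pos", "neg", "neu", "pos", "neg", "pos", "neu"]
annotator_b = ["pos", "neg", "neu", "neg", "neg", "pos", "pos", "neu", "pos", "neu"]

# Raw agreement here is 0.7, but kappa discounts chance agreement,
# so the score lands lower.
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(round(kappa, 3))
```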
- Feature Importance: Identified strong linguistic markers for sentiment, such as "fuck" and "worst" for negative sentiment, and "happy" and "love" for positive sentiment.