Project Overview
The goal of this project was to develop a robust text classification system capable of determining whether a comment is neutral, positive, or negative. A key focus was evaluating the impact of training data quality by comparing models trained on "crowdsourced" data versus "gold-standard" (trusted) data.

Technical Implementation
- Data Pre-processing: Implemented a custom normalization pipeline to handle misspellings ("netural", "negtaive") and formatting issues common in crowdsourced datasets.
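A minimal sketch of such a normalization step, assuming a hand-built correction map (the specific rules and function names here are illustrative, not the project's actual pipeline):

```python
import re

# Hypothetical correction map for misspellings observed in the crowdsourced
# labels; the real pipeline's rules are assumptions here.
CORRECTIONS = {"netural": "neutral", "negtaive": "negative"}

def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, and fix known misspellings."""
    text = text.lower().strip()
    text = re.sub(r"\s+", " ", text)
    tokens = [CORRECTIONS.get(tok, tok) for tok in text.split(" ")]
    return " ".join(tokens)

print(normalize("  This is  NETURAL "))  # → "this is neutral"
```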
- Feature Engineering: Utilized TF-IDF (Term Frequency-Inverse Document Frequency) with an n-gram range of (1, 2) to capture both individual words and two-word phrases.
- Model Selection: Evaluated multiple models including Multinomial Naive Bayes, LinearSVC, and Logistic Regression. Logistic Regression was selected as the final model for its balanced performance and sensitivity to regularization.
- Hyperparameter Tuning: Conducted a manual grid search for the optimal regularization strength (C=10) and used "balanced" class weights to compensate for class imbalance in the training data.
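Putting the selected configuration together, the final model can be sketched as a scikit-learn pipeline (the toy corpus below stands in for the real training data, which is not part of this summary):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Illustrative stand-in for the real training set.
texts = [
    "i love it", "great and happy", "worst thing ever",
    "awful, hated it", "it was okay", "nothing special",
]
labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

# Final configuration from the grid search: C=10, balanced class weights.
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(C=10, class_weight="balanced", max_iter=1000),
)
clf.fit(texts, labels)
print(clf.predict(["i love it so much"]))
```

`class_weight="balanced"` rescales each class's loss contribution inversely to its frequency, which matters when neutral comments dominate the data.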
Key Findings
- Data Quality Impact: Switching from crowdsourced to gold-standard training data improved model accuracy by approximately 19 percentage points, reaching a final accuracy of 81.5%.
- Agreement Analysis: Calculated Cohen's Kappa to measure statistical agreement between annotators, revealing only moderate agreement (κ = 0.446) and highlighting the subjectivity of sentiment labeling.
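Cohen's Kappa corrects raw agreement for the agreement expected by chance. A small sketch with two hypothetical annotators (the labels below are invented for illustration):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators on the same ten comments.
annotator_a = ["pos", "neg", "neu", "pos", "neg", "neu", "pos", "neg", "pos", "neu"]
annotator_b = ["pos", "neg", "neu", "neg", "neg", "pos", "pos", "neu", "pos", "neu"]

# Raw agreement here is 0.7, but kappa discounts chance agreement,
# so the score lands lower.
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(round(kappa, 3))
```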
- Feature Importance: Identified strong linguistic markers for sentiment, such as "fuck" and "worst" for negative sentiment, and "happy" and "love" for positive sentiment.