Technical Laboratory Report

Applied Machine Learning

Sentiment Classification System

Python
Scikit-Learn
Pandas

Developed a text classification system for sentiment analysis using Logistic Regression and TF-IDF, comparing crowdsourced vs. gold-standard data quality.

Machine Learning Pipeline

Data Quality Analysis: Messy Crowdsourced Input

[Pipeline diagram] Raw Input (messy crowdsourced labels: "netural", "positie", "negtaive", "nedat") → Cleaner (regex / dictionary) → Vectorization (n-gram range (1, 2)) → Logistic Regression (C=10, solver: liblinear). Results: crowdsourced input, 62.4% accuracy; gold (trusted) input, 81.5% accuracy.

Integrated NLP pipeline showing the transition from noisy crowdsourced data to high-fidelity feature engineering. Custom normalization strategies successfully resolved linguistic inconsistencies, enabling Logistic Regression to identify critical sentiment markers.
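The pipeline above can be sketched end-to-end with scikit-learn. The tiny corpus here is illustrative only; the hyperparameters (n-gram range, C=10, liblinear solver, balanced class weights) are the ones stated in the diagram and the implementation notes below.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus standing in for the real sentiment data (labels: pos / neg / neu)
texts = [
    "i love this", "worst thing ever", "it is okay i guess",
    "really happy with it", "absolutely terrible", "nothing special",
]
labels = ["pos", "neg", "neu", "pos", "neg", "neu"]

# TF-IDF over unigrams and bigrams feeding a regularized linear classifier,
# matching the configuration shown in the diagram
clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("lr", LogisticRegression(C=10, solver="liblinear", class_weight="balanced")),
])
clf.fit(texts, labels)
print(clf.predict(["i love it", "worst ever"]))
```

Wrapping the vectorizer and classifier in a single `Pipeline` keeps the TF-IDF vocabulary fitted only on training data, which avoids leakage during evaluation.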

Project Overview

The goal of this project was to develop a robust text classification system capable of determining whether a comment is neutral, positive, or negative. A key focus was evaluating the impact of training data quality by comparing models trained on "crowdsourced" data versus "gold-standard" (trusted) data.

Technical Implementation

  • Data Pre-processing: Implemented a custom normalization pipeline to handle misspellings ("netural", "negtaive") and formatting issues common in crowdsourced datasets.
  • Feature Engineering: Utilized TF-IDF (Term Frequency-Inverse Document Frequency) with an n-gram range of (1, 2) to capture both individual words and two-word phrases.
  • Model Selection: Evaluated multiple models including Multinomial Naive Bayes, LinearSVC, and Logistic Regression. Logistic Regression was selected as the final model for its balanced performance and sensitivity to regularization.
  • Hyperparameter Tuning: Conducted manual grid search for the optimal regularization parameter (C=10) and utilized "balanced" class weights to handle distribution imbalances.
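The dictionary-based normalization step described above can be sketched as follows. The mapping entries are illustrative (taken from the misspellings shown in the diagram); the project's full dictionary is not reproduced here, and the "nedat" entry is an assumption.

```python
import re

# Illustrative misspelling dictionary; the real pipeline used a larger mapping
LABEL_FIXES = {
    "netural": "neutral",
    "positie": "positive",
    "negtaive": "negative",
    "nedat": "negative",  # assumption: interpreting this variant as "negative"
}

def normalize_label(raw: str) -> str:
    """Lowercase, strip non-letter characters, then apply the fix dictionary."""
    cleaned = re.sub(r"[^a-z]", "", raw.strip().lower())
    return LABEL_FIXES.get(cleaned, cleaned)

print(normalize_label(" Netural! "))  # prints "neutral"
```

Normalizing to a small closed label set before training prevents misspelled variants from being treated as separate classes.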

Key Findings

  • Data Quality Impact: Switching from crowdsourced to gold-standard training data improved model accuracy by approximately 19 percentage points, reaching a final accuracy of 81.5%.
  • Agreement Analysis: Calculated Cohen’s Kappa to measure statistical agreement between annotators, revealing moderate agreement (κ = 0.446) and highlighting the subjectivity of sentiment labeling.
  • Feature Importance: Identified strong linguistic markers for sentiment, such as "fuck" and "worst" for negative sentiment, and "happy" and "love" for positive sentiment.
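An agreement analysis like the one reported above can be run with scikit-learn's `cohen_kappa_score`; the two annotator label lists here are made up for illustration.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators over the same ten comments
annotator_a = ["pos", "neg", "neu", "pos", "neg", "neu", "pos", "neg", "pos", "neu"]
annotator_b = ["pos", "neg", "pos", "pos", "neu", "neu", "pos", "neg", "neg", "neu"]

# Kappa corrects raw percent agreement for the agreement expected by chance,
# so it is a stricter measure than simple accuracy between annotators
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"kappa = {kappa:.3f}")
```

Values around 0.4 to 0.6 are conventionally read as "moderate" agreement, which is why a κ of 0.446 signals genuinely subjective labels rather than annotator error alone.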
© 2026 Ahsan Javed. All rights reserved.