Natural Language Processing
Platform: Personal Project

NLP Sentiment Analysis

1. Context & Objective

A text classification pipeline that determines whether a movie review carries a positive or negative sentiment. Built as a foundational NLP project to learn core text preprocessing and feature extraction techniques.

2. Methodology

1. Cleaned raw text: removed HTML tags, punctuation, and stopwords.
2. Applied stemming using NLTK's PorterStemmer.
3. Vectorized text with TF-IDF (unigrams + bigrams).
4. Trained a Logistic Regression classifier and evaluated it with precision, recall, and F1.
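Steps 1 and 2 above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `preprocess` helper and the tiny `STOPWORDS` set are hypothetical stand-ins, since the project does not say which stopword list it used. One thing to watch: NLTK's stock English stopword list contains "not", so using it unmodified would drop negations before the bigram step ever sees them.

```python
import re

from nltk.stem import PorterStemmer

# Hypothetical, deliberately tiny stopword list for illustration only.
# Negations such as "not" are left out on purpose so they stay in the text.
STOPWORDS = {"a", "an", "the", "is", "was", "it", "this", "and", "of", "to", "in"}

stemmer = PorterStemmer()


def preprocess(text):
    """Strip HTML and punctuation, drop stopwords, and stem what remains."""
    text = re.sub(r"<.*?>", " ", text)       # step 1: remove HTML tags
    text = re.sub(r"[^a-zA-Z]", " ", text)   # step 1: remove punctuation and digits
    tokens = text.lower().split()
    tokens = [t for t in tokens if t not in STOPWORDS]  # step 1: remove stopwords
    return " ".join(stemmer.stem(t) for t in tokens)    # step 2: Porter stemming


cleaned = preprocess("This movie was <br />not good, and the acting is terrible.")
print(cleaned)
```

The returned string can be passed straight to a TF-IDF vectorizer, which does its own tokenization on whitespace-separated words.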

In [1]:

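import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def clean_text(text):
    """Strip HTML tags and non-letter characters, then lowercase."""
    # Replace tags with a space so words on either side of a tag do not fuse
    text = re.sub(r'<.*?>', ' ', text)
    # Replace punctuation, digits, and other non-letters with spaces
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    return text.lower()

# df is assumed to be a pandas DataFrame with 'review' and 'sentiment' columns
# TF-IDF over unigrams and bigrams, keeping the 10,000 most frequent terms
tfidf = TfidfVectorizer(max_features=10000, ngram_range=(1, 2))
X = tfidf.fit_transform(df['review'].apply(clean_text))
y = df['sentiment']

# Fit a Logistic Regression classifier on the TF-IDF features
model = LogisticRegression()
model.fit(X, y)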
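The cell above fits on the full dataset and does not show the evaluation step from the methodology. A held-out evaluation might look roughly like the sketch below. The toy `reviews` and `labels` are hypothetical stand-ins for the real data, the "positive"/"negative" label strings are an assumption, and the 75/25 split is an arbitrary choice rather than the project's setting. Wrapping the vectorizer and classifier in a pipeline means TF-IDF statistics are learned from the training split only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy reviews standing in for the real dataset (assumption).
reviews = [
    "a wonderful film with great acting",
    "loved every minute of this movie",
    "brilliant story and a great cast",
    "one of the best films this year",
    "a touching and beautiful movie",
    "great fun from start to finish",
    "terrible plot and awful acting",
    "a boring waste of time",
    "the worst movie this year",
    "not good at all and far too long",
    "dull story and a bad script",
    "awful film with a boring cast",
]
labels = ["positive"] * 6 + ["negative"] * 6

# Stratified split keeps both classes present in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.25, stratify=labels, random_state=0
)

# The pipeline fits TF-IDF on the training text only, then the classifier.
pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(),
)
pipeline.fit(X_train, y_train)

# Precision, recall, and F1 for the positive class on the held-out reviews.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test,
    pipeline.predict(X_test),
    average="binary",
    pos_label="positive",
    zero_division=0,
)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

With only a dozen toy reviews the printed scores mean little; the point is the shape of the split, fit, and scoring calls.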

3. Final Learnings

Text preprocessing is half the battle: noisy text, such as leftover HTML tags and punctuation, severely degrades accuracy. Bigrams captured phrases like 'not good' that unigrams miss, boosting F1 by ~4%.
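To make the bigram point concrete, the short sketch below compares the features that scikit-learn's `TfidfVectorizer` would extract from one sentence with and without bigrams. The sample sentence is made up for illustration; `build_analyzer()` returns the same tokenize-and-n-gram function the vectorizer applies internally.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sample = "The plot was not good"  # made-up sentence for illustration

# Analyzer for unigrams only (the default ngram_range)
unigram_analyzer = TfidfVectorizer(ngram_range=(1, 1)).build_analyzer()
# Analyzer for unigrams plus bigrams, as used in this project
bigram_analyzer = TfidfVectorizer(ngram_range=(1, 2)).build_analyzer()

unigram_tokens = unigram_analyzer(sample)
bigram_tokens = bigram_analyzer(sample)

print(unigram_tokens)  # ['the', 'plot', 'was', 'not', 'good']
print(bigram_tokens)   # the unigrams followed by 'the plot', 'plot was', 'was not', 'not good'
```

With unigrams alone, "not" and "good" are separate features, so the model sees a positive word and a generic negation. The bigram 'not good' gives the classifier a single feature it can weight as negative.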

Dataset details

Language: Python

Size: ~50k IMDB reviews

Libraries Used: NLTK, Scikit-Learn, Pandas