Natural Language Processing
Platform: Personal Project
NLP Sentiment Analysis

1. Context & Objective
A text classification pipeline that determines whether a movie review carries a positive or negative sentiment. Built as a foundational NLP project to learn core text preprocessing and feature extraction techniques.
2. Methodology
1. Cleaned raw text: removed HTML tags, punctuation, and stopwords.
2. Applied stemming using NLTK's PorterStemmer.
3. Vectorized text with TF-IDF (unigrams + bigrams).
4. Trained a Logistic Regression classifier and evaluated it with precision, recall, and F1.
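Steps 1 and 2 might look like the sketch below. It is an illustration under stated assumptions, not the project's exact code: the names `preprocess`, `NEGATIONS`, and `STOP_WORDS` are hypothetical, and scikit-learn's built-in English stop list stands in for whichever stop list the project used (the write-up does not say). Keeping negation words out of the stop list is also an assumption, made so that phrases such as 'not good' are still available to the bigram step.

```python
import re

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Assumption: negations are kept so that bigrams such as "not good" survive.
NEGATIONS = {"not", "no", "nor", "never"}
STOP_WORDS = set(ENGLISH_STOP_WORDS) - NEGATIONS

stemmer = PorterStemmer()


def preprocess(text):
    """Strip HTML tags and non-letters, drop stopwords, then stem each token."""
    text = re.sub(r"<.*?>", " ", text)      # remove HTML tags such as <br />
    text = re.sub(r"[^a-zA-Z]", " ", text)  # replace punctuation and digits with spaces
    tokens = text.lower().split()
    kept = [t for t in tokens if t not in STOP_WORDS]
    return " ".join(stemmer.stem(t) for t in kept)


print(preprocess("<br />The acting was not good, but I loved the soundtrack!"))
```

The function returns a single space-joined string, so it could be passed to `df['review'].apply(...)` in the same way as a plain cleaning function before vectorization.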
In [1]:
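import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def clean_text(text):
    """Strip HTML tags and non-letter characters, then lowercase."""
    text = re.sub(r'<.*?>', ' ', text)      # drop HTML tags; the space keeps neighbouring words apart
    text = re.sub(r'[^a-zA-Z]', ' ', text)  # replace digits and punctuation with spaces
    return text.lower()

# df is assumed to be a pandas DataFrame with 'review' and 'sentiment' columns
tfidf = TfidfVectorizer(max_features=10000, ngram_range=(1, 2))  # unigrams + bigrams
X = tfidf.fit_transform(df['review'].apply(clean_text))
y = df['sentiment']

model = LogisticRegression()
model.fit(X, y)
3. Final Learnings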
Text preprocessing is half the battle: dirty text severely degrades accuracy. Bigrams captured phrases like 'not good' that unigrams miss, boosting F1 by ~4%.
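The bigram effect can be illustrated with a small sketch. The toy reviews and the helper `held_out_f1` are hypothetical and exist only for illustration; they are not the IMDB data or the project's evaluation code. The first part shows that 'not good' only exists as a feature once bigrams are enabled; the second shows one way such an F1 comparison could be measured on a held-out split, with only `ngram_range` changing between runs.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Hypothetical toy reviews for illustration only (not the IMDB data).
texts = [
    "the plot was good", "the acting was good",
    "a good and moving film", "good fun from start to finish",
    "the plot was not good", "the acting was not good",
    "not good at all", "a dull film and not good",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = positive, 0 = negative

# Part 1: which features appear only when bigrams are enabled?
uni_vocab = set(TfidfVectorizer(ngram_range=(1, 1)).fit(texts).vocabulary_)
bi_vocab = set(TfidfVectorizer(ngram_range=(1, 2)).fit(texts).vocabulary_)
bigram_only = bi_vocab - uni_vocab
print("not good" in uni_vocab, "not good" in bigram_only)  # False True


def held_out_f1(ngram_range, texts, labels, seed=0):
    """Fit TF-IDF + Logistic Regression on a train split; return F1 on the test split."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.5, stratify=labels, random_state=seed
    )
    vec = TfidfVectorizer(ngram_range=ngram_range)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vec.fit_transform(X_train), y_train)  # vectorizer is fitted on train only
    preds = clf.predict(vec.transform(X_test))
    return f1_score(y_test, preds, zero_division=0)


# Part 2: same split and model, only the n-gram range differs.
for name, rng in [("unigrams", (1, 1)), ("unigrams + bigrams", (1, 2))]:
    print(name, round(held_out_f1(rng, texts, labels), 3))
```

Scores on eight toy sentences are not meaningful in themselves; the point is the comparison setup, which would need the full review set to say anything about the size of the gain.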
Dataset details
Language: Python
Size: ~50k IMDB reviews
Libraries Used: NLTK, Scikit-Learn, Pandas