Data Science
Platform: Kaggle
Titanic Survival Prediction

1. Context & Objective
The sinking of the Titanic is one of the most infamous shipwrecks in history. This project builds a predictive model that answers the question: what sorts of people were more likely to survive? It uses passenger data such as name, age, gender, and socio-economic class.
2. Methodology
1. Handled missing data in 'Age' and 'Embarked' columns.
2. Engineered new features like 'FamilySize' from 'SibSp' and 'Parch'.
3. One-hot encoded categorical variables ('Sex', 'Embarked').
4. Trained a Random Forest Classifier and optimized hyperparameters via GridSearchCV.
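Steps 1–3 above follow a standard pandas preprocessing pattern. A minimal sketch on a few illustrative rows (the column names match the Kaggle Titanic dataset; the values are made up for demonstration, and median/mode imputation is one common choice, not necessarily the exact strategy used here):

```python
import pandas as pd

# Illustrative rows shaped like the Titanic training data (not the real file)
df = pd.DataFrame({
    'Age': [22.0, None, 26.0],
    'Embarked': ['S', 'C', None],
    'SibSp': [1, 1, 0],
    'Parch': [0, 0, 0],
})

# Step 1: impute missing values — median for numeric 'Age',
# most frequent port for categorical 'Embarked'
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Step 2: FamilySize = siblings/spouses + parents/children + the passenger
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

# Step 3: one-hot encode the categorical column
df = pd.get_dummies(df, columns=['Embarked'])
```

After this, the frame contains no missing values and the 'Embarked' column has been replaced by indicator columns such as 'Embarked_S'.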
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train_df = pd.read_csv('train.csv')

# Impute missing 'Age' values; RandomForestClassifier cannot handle NaN
train_df['Age'] = train_df['Age'].fillna(train_df['Age'].median())

# FamilySize = siblings/spouses + parents/children + the passenger themselves
train_df['FamilySize'] = train_df['SibSp'] + train_df['Parch'] + 1

features = ['Pclass', 'Sex', 'Age', 'Fare', 'FamilySize']
X = pd.get_dummies(train_df[features])  # one-hot encodes 'Sex'
y = train_df['Survived']

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
model.fit(X, y)

3. Final Learnings
The model achieved 82% accuracy on the validation set. Women and first-class passengers had significantly higher survival rates, and the engineered 'FamilySize' feature proved crucial to the improvement.
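The GridSearchCV tuning from step 4 of the methodology is not shown in the cell above. A minimal sketch of the pattern on synthetic data (the grid values and scoring choice here are illustrative assumptions, not the search space actually used):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared Titanic feature matrix
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# Illustrative search space
param_grid = {'n_estimators': [50, 100], 'max_depth': [3, 5]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,                 # 5-fold cross-validation per candidate
    scoring='accuracy',
)
search.fit(X, y)

# Best combination and its mean cross-validated accuracy
print(search.best_params_)
print(search.best_score_)
```

`best_estimator_` is then refit on the full training data and can be used for the final validation-set evaluation.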
Dataset details
Language
Python
Size
891 rows (Training), 418 rows (Test)
Libraries Used
Pandas, Scikit-Learn, Matplotlib, Seaborn