Data Science
Platform: Kaggle
Titanic Survival Prediction

1. Context & Objective
The sinking of the Titanic is one of the most infamous shipwrecks in history. This project builds a predictive model that answers the question: what sorts of people were more likely to survive? It uses passenger data such as name, age, gender, and socio-economic class.
2. Methodology
1. Handled missing data in 'Age' and 'Embarked' columns.
2. Engineered new features like 'FamilySize' from 'SibSp' and 'Parch'.
3. One-hot encoded categorical variables ('Sex', 'Embarked').
4. Trained a Random Forest Classifier and optimized hyperparameters via GridSearchCV.
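Steps 1–3 above follow a standard pandas preprocessing pattern. A minimal sketch on a few illustrative rows (the column names match the Kaggle Titanic dataset; the values are made up for demonstration, and median/mode imputation is one common choice, not necessarily the exact strategy used here):

```python
import pandas as pd

# Illustrative rows shaped like the Titanic training data (not the real file)
df = pd.DataFrame({
    'Age': [22.0, None, 26.0],
    'Embarked': ['S', 'C', None],
    'SibSp': [1, 1, 0],
    'Parch': [0, 0, 0],
})

# Step 1: impute missing values — median for numeric 'Age',
# most frequent port for categorical 'Embarked'
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Step 2: FamilySize = siblings/spouses + parents/children + the passenger
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

# Step 3: one-hot encode the categorical column
df = pd.get_dummies(df, columns=['Embarked'])
```

After this, the frame contains no missing values and the 'Embarked' column has been replaced by indicator columns such as 'Embarked_S'.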
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train_df = pd.read_csv('train.csv')

# Impute missing 'Age' values; RandomForestClassifier cannot handle NaN
train_df['Age'] = train_df['Age'].fillna(train_df['Age'].median())

# FamilySize = siblings/spouses + parents/children + the passenger themselves
train_df['FamilySize'] = train_df['SibSp'] + train_df['Parch'] + 1

features = ['Pclass', 'Sex', 'Age', 'Fare', 'FamilySize']
X = pd.get_dummies(train_df[features])  # one-hot encodes 'Sex'
y = train_df['Survived']

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
model.fit(X, y)

3. Final Learnings
The model achieved 82% accuracy on the validation set. Women and first-class passengers had significantly higher survival rates, and the engineered 'FamilySize' feature proved crucial to the improvement.
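The GridSearchCV tuning from step 4 of the methodology is not shown in the cell above. A minimal sketch of the pattern on synthetic data (the grid values and scoring choice here are illustrative assumptions, not the search space actually used):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared Titanic feature matrix
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# Illustrative search space
param_grid = {'n_estimators': [50, 100], 'max_depth': [3, 5]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,                 # 5-fold cross-validation per candidate
    scoring='accuracy',
)
search.fit(X, y)

# Best combination and its mean cross-validated accuracy
print(search.best_params_)
print(search.best_score_)
```

`best_estimator_` is then refit on the full training data and can be used for the final validation-set evaluation.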
Dataset details
Language
Python
Size
891 rows (Training), 418 rows (Test)
Libraries Used
Pandas, Scikit-Learn, Matplotlib, Seaborn