
Disaster Relief - Titanic

After a disaster, researchers often collect data on the event and publish a review. This is done to help disaster relief agencies better manage future events. For this project, I will be doing something similar. I want to create and train a logistic regression model that can help predict whether or not victims of a disaster will survive, given the information we are able to gather about them.

Data

I collected my data from an AWS PostgreSQL instance, but the data is also available on Kaggle. The dataset contains information on 891 passengers from the 1912 Titanic disaster. More information on the variables is provided in the data dictionary below:

Exploratory Data Analysis

The next step in the process is exploring the data to understand it better. From the first violin plot, we can see the distribution of age across classes for both those who survived and those who didn't. In class 1, people who survived were, on average, younger than people who didn't. In classes 2 and 3, the average age of those who survived was similar to that of those who didn't. Also, there were more children in classes 2 and 3, and most of them survived.

The next plot shows the distribution of males and females who survived across the three classes. There are several things to observe from this plot. First, the survival rate is lower in classes 2 and 3. Women in all classes survived at a higher rate than men. More women survived than didn't in classes 1 and 2; in fact, very few women in those classes died. In class 3, the proportion of women who survived appears similar to the proportion who didn't. Finally, the lower the class, the more men died.

The last plot shows the distribution of males and females who survived, based on their port of embarkation. Fewer men from Queenstown survived and fewer women from Queenstown died. In general, more women survived than men, and more men died than women.

The histogram below shows the age distribution of passengers on the Titanic. We can see from the chart that over 30% of the passengers were about 30 years old. We can also infer that there were more passengers below the age of 40 than above. The histogram is quite similar to a normal distribution, centered around a mean age of 30.

Preparing the data

After my analysis, I found that there were missing values for the variables Age, Cabin, and Embarked. Instead of dropping the 177 rows with missing Age values, I chose to replace them with the overall mean age (which is 30). Since there were only two empty Embarked values, I replaced them with the most popular port of embarkation (Southampton). I dropped the Cabin variable because I did not need it for any further analysis. Then I created dummy variables for the Sex, Embarked, and PClass variables and dropped all unnecessary variables. I also placed the ages of passengers into bins and turned the bins into dummy variables. In the end, I had a dataset with 891 data points and 13 variables. The heat map below shows correlations between the variables I will be using in my regression. Even without any regression analysis, we can already see from the map that being female is the variable most positively correlated with survival.
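The preparation steps above can be sketched without any libraries. This is only an illustration on four made-up rows, not my actual code: the bin edges and the "Age_over 30" column name are assumptions, and in practice pandas' `fillna` and `get_dummies` would do the same job in fewer lines.

```python
from collections import Counter
from statistics import mean

# Toy rows standing in for the real passenger table (values are made up).
rows = [
    {"Age": 22.0, "Embarked": "S", "Sex": "male", "Pclass": 3},
    {"Age": None, "Embarked": "C", "Sex": "female", "Pclass": 1},
    {"Age": 38.0, "Embarked": None, "Sex": "female", "Pclass": 1},
    {"Age": 4.0, "Embarked": "S", "Sex": "male", "Pclass": 2},
]

# Mean imputation for Age: average the known ages, then fill the gaps.
mean_age = mean(r["Age"] for r in rows if r["Age"] is not None)
# Mode imputation for Embarked: the most frequent known port.
top_port = Counter(r["Embarked"] for r in rows if r["Embarked"]).most_common(1)[0][0]

def age_group(age):
    # Assumed bin edges; the post only names "16 and under" and "16-30".
    if age <= 16:
        return "16 and under"
    if age <= 30:
        return "16-30"
    return "over 30"

def prepare(row):
    """Fill missing values and one-hot encode a single passenger row."""
    age = row["Age"] if row["Age"] is not None else mean_age
    port = row["Embarked"] or top_port
    group = age_group(age)
    # Male, Southampton, 3rd class, and ages 16-30 are the omitted baselines.
    return {
        "Female": int(row["Sex"] == "female"),
        "C": int(port == "C"),
        "Q": int(port == "Q"),
        "Class_1": int(row["Pclass"] == 1),
        "Class_2": int(row["Pclass"] == 2),
        "Age_16 and under": int(group == "16 and under"),
        "Age_over 30": int(group == "over 30"),  # hypothetical column name
    }

prepared = [prepare(r) for r in rows]
```

Leaving one level out of each set of dummies is what makes the baseline passenger well defined: every dummy equal to zero describes exactly that baseline.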

The Model - Logistic Regression

I used a logistic regression to estimate the likelihood of survival for the Titanic passengers based on the variables below. The values below are the exponentiated coefficients, that is, odds ratios. From the results, we can see that SibSp, Q, and all age groups except "Age_16 and under" are statistically insignificant at the 5% significance level. For this model, my baseline is a male from Southampton in 3rd class between the ages of 16 and 30. Similar to the first model, SibSp and Parch generally decrease the odds of survival, while Fare, Female, C, Q, Class_1, Class_2, and Age_16 and under generally increase the odds of survival. From this model, we find that among the age groups, only children 16 and under had a significantly higher chance of survival.
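To make the "exponents of the coefficients" concrete, here is a minimal sketch of how a log-odds coefficient turns into an odds ratio and then into a change in probability. The coefficient values and the 10% baseline probability are made up for illustration; they are not the fitted values from my model.

```python
import math

# Hypothetical log-odds coefficients, NOT the fitted values from the model.
coefficients = {"Female": 2.5, "Class_1": 1.8, "SibSp": -0.3}

# exp(coefficient) is the odds ratio for a one-unit increase in that variable.
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}

def shift_probability(baseline_prob, odds_ratio):
    """Apply an odds ratio to a baseline probability and return the new probability."""
    odds = baseline_prob / (1 - baseline_prob)
    new_odds = odds * odds_ratio
    return new_odds / (1 + new_odds)

# Illustrative only: a baseline passenger with an assumed 10% chance of survival.
p_female = shift_probability(0.10, odds_ratios["Female"])
```

An odds ratio above 1 (a positive coefficient) raises the odds of survival relative to the baseline passenger, and one below 1 (a negative coefficient) lowers them.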

Next, I split my dataset into a training set (70%) and a test set (30%). I fit a logistic regression model to the training set and then used the model to predict survival in the test set. The diagram is the ROC curve for my model. It shows the relationship between the true positive and false positive rates for my model at different thresholds. For a good model, I want to be as close as possible to the top left corner of the graph. In that case, the model correctly identifies most survivors while raising few false alarms. My AUC score is 0.86, indicating a pretty good model (if I do say so myself).
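For readers curious how an ROC curve and its AUC come about, the sketch below sweeps the threshold over a handful of made-up predictions and integrates the resulting curve with the trapezoid rule. The labels and scores are toy values, not my test set, and the function names are my own for this illustration.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold one distinct score at a time, from strictest to loosest.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Toy labels (1 = survived) and predicted probabilities, not the post's data.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
curve = roc_points(labels, scores)
area = auc(curve)
```

In this toy example, 8 of the 9 survivor/non-survivor pairs are ranked correctly, so the area comes out to 8/9. That is another way to read an AUC: the chance that a randomly chosen survivor gets a higher score than a randomly chosen non-survivor.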

The confusion matrix below describes in detail how well my model's predictions compare with what actually happened. These results were generated at a probability threshold of 0.5. The matrix gives the number of true positives (TP, 84), false positives (FP, 23), true negatives (TN, 129), and false negatives (FN, 32). It also helps in calculating my model's precision, recall, and accuracy.
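As a rough sketch of how the four cells are counted at a given threshold, the snippet below labels each toy passenger as "survived" when the predicted probability is at least the threshold. The data is made up, and treating a probability exactly equal to the threshold as a positive is an assumption of this sketch.

```python
def confusion_counts(labels, probabilities, threshold=0.5):
    """Count TP, FP, TN, FN when probabilities >= threshold are predicted as survived."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for actual, prob in zip(labels, probabilities):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and actual == 1:
            counts["TP"] += 1
        elif predicted == 1 and actual == 0:
            counts["FP"] += 1
        elif predicted == 0 and actual == 0:
            counts["TN"] += 1
        else:
            counts["FN"] += 1
    return counts

# Toy data (1 = survived), not the post's test set.
labels = [1, 0, 1, 1, 0, 0]
probabilities = [0.9, 0.6, 0.4, 0.7, 0.2, 0.5]
counts = confusion_counts(labels, probabilities)
```

Raising the threshold trades false positives for false negatives, which is exactly the trade-off the ROC curve traces out.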

The classification report below is derived from the confusion matrix. Precision, TP/(TP+FP), is the proportion of people who actually survived out of everyone our model predicted would survive. Recall, TP/(TP+FN), is the proportion of people our model predicted would survive out of everyone who actually survived. The F1-score is the harmonic mean of precision and recall, so it is high only when both are high. For this model, the precision, recall, and F1-score are all 0.79.
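The report's numbers can be checked by hand from the four counts in the confusion matrix above. The sketch below computes precision, recall, and F1 for each class and then averages them weighted by class size. Reading the 0.79 figures as support-weighted averages is my inference from the arithmetic, not something the report states outright.

```python
# Counts reported in the confusion matrix above.
TP, FP, TN, FN = 84, 23, 129, 32
total = TP + FP + TN + FN

def prf(tp, fp, fn):
    """Precision, recall, and F1 for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

survived = prf(TP, FP, FN)  # "survived" treated as the positive class
died = prf(TN, FN, FP)      # "did not survive" treated as the positive class

# Support = number of passengers who actually belong to each class.
support_survived, support_died = TP + FN, TN + FP
weighted = tuple(
    (s * support_survived + d * support_died) / total
    for s, d in zip(survived, died)
)
accuracy = (TP + TN) / total
```

With these counts, the weighted precision, recall, and F1 all round to 0.79, as does the accuracy. For the "survived" class alone, precision is about 0.79 but recall is closer to 0.72, so the model misses a fair share of actual survivors.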

Conclusion and Future Work:

From the analysis above, I can infer that people on the Titanic were more likely to survive if they were female, aged 16 or below, and in 1st class. These results rest on certain assumptions that I will need to review in future work. For instance, I assumed the age of 177 passengers (20% of the dataset) to be the average age of the remaining 714, which is a huge share of the dataset. Another problem is that I assumed everyone was in their assigned class while the boat was sinking. It may have been the case that some people from 1st class (whose cabins were vertically further away from the ocean) were actually at the lower levels of the boat. I really don't see the 1st class passengers chilling at the lower level with the "commoners," but who's to say? Still, from the data we collected, I was able to develop some insights about the chances of survival. With pretty decent precision and recall, I could apply this model to help manage situations such as a plane crash. Imagine a scenario where a plane has crashed and we are unable to find all the passengers and crew members. If we collect information from each passenger's boarding pass and driver's license or passport, we can use this model to predict (to a good extent) the chance of survival. Obviously, certain other factors will have to be considered, given that the disaster is a plane crash and not a ship sinking.


© 2018 by Dami Lasisi
