From a beginner’s perspective!
This was my first data analysis project as a newbie in data science. Its goal was to identify the factors that affected survival rates among passengers aboard ‘The Titanic’.
This project focuses mainly on how survival rates varied with passengers’ sex, age, socio-economic status and a few other factors. I mainly used the pandas library and integrated a few SQL techniques along the way. I also included various coding practices I came across.
Understanding the variables in the dataset. What does each column represent?
- PassengerId = the unique number that identifies each passenger
- Survived = Value of “1” indicates the passenger survived and “0” indicates otherwise
- Pclass = Passenger class (1 = 1st class, 2 = 2nd, 3 = 3rd)
- Name = Name of passenger
- Sex = Sex of Passenger
- Age = Age of Passenger
- SibSp = Number of Siblings/Spouses of the passenger aboard
- Parch = Number of Parents/Children of the passenger aboard
- Ticket = Ticket number of Passenger
- Fare = Passenger ticket fare
- Cabin = Cabin passenger travelled in
- Embarked = Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
1. Pclass: proxy for the passenger’s socio-economic status (1 = Upper, 2 = Middle, 3 = Lower)
2. Sibling: brother, sister, stepbrother, or stepsister of passenger aboard
3. Spouse: husband or wife of passenger aboard (mistresses and fiancees ignored)
4. Parent: mother or father of passenger aboard
5. Child: son, daughter, stepson, or stepdaughter of passenger aboard
*Additional Potential Questions*
1. Did having relatives aboard increase chance of survival?
2. Was there any correlation between survival and the port of embarkation?
3. Did the same sex always get higher preference regardless of the socio-economic class they belonged to?
To check for duplicate entries, I created a set containing all unique Passenger Id values. It shows there aren’t any duplicates, since the set has the same number of entries as the original data.
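A minimal sketch of that duplicate check. The five-row table is a made-up stand-in for the real data, and the variable names are my own:

```python
import pandas as pd

# Tiny made-up sample standing in for the real passenger data.
passengers = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived": [0, 1, 1, 1, 0],
})

# A set keeps only distinct values, so equal lengths mean no duplicate IDs.
unique_ids = set(passengers["PassengerId"])
has_duplicates = len(unique_ids) != len(passengers)

# pandas offers the same check directly on the column.
ids_are_unique = passengers["PassengerId"].is_unique

print(has_duplicates, ids_are_unique)
```

Both checks answer the same question. The set version mirrors what I described above, while `is_unique` is a shortcut pandas provides.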
Before carrying out the analysis, I used the pandas library to load the data from the csv file into a data frame, which makes it easier to access and inspect. I also removed unnecessary columns and imported the matplotlib library to represent the results as diagrams. The ‘isnull’ function helps spot missing values in each column of the data frame.
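Here is a rough sketch of those preparation steps. The three rows are invented so the example runs on its own (the real project would pass the path of the csv file to `pd.read_csv`), and the choice of which columns count as unnecessary is an assumption:

```python
import io
import pandas as pd

# Made-up rows in the same column layout as the dataset.
csv_text = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,Passenger A,male,22,1,0,T1,7.25,,S
2,1,1,Passenger B,female,38,1,0,T2,71.28,C85,C
3,1,3,Passenger C,female,,0,0,T3,7.92,,S
"""
titanic = pd.read_csv(io.StringIO(csv_text))

# Assumption: these are the columns not needed for the analysis.
titanic = titanic.drop(columns=["Name", "Ticket", "Cabin"])

# isnull() marks missing cells; summing per column counts them.
missing_per_column = titanic.isnull().sum()
print(missing_per_column)
```

In this sample only one Age value is missing, so the Age row of the printed counts is the only non-zero one.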
This phase of the project looks at the code used to identify the factors that contributed to higher survival rates, as mentioned in the introduction. It also picks out some interesting statistical facts from the data, for example: who was the youngest survivor?
I went on to analyze how many passengers survived out of the total aboard and which factors showed some correlation with survival rates.
Did the sex of the passenger affect survival?
There are two ways to go about this, as in the accompanying image. One uses the group by function together with the size command, and the other uses the sum command. Either way, we can conclude that more females survived overall.
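The two approaches could look something like this. It is a sketch on a made-up six-row sample, and my reading of how size and sum were combined is an assumption:

```python
import pandas as pd

# Made-up sample; the real data has many more rows.
passengers = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Survived": [0, 1, 1, 1, 0, 0],
})

# Method 1: keep only survivors, then count rows per group with size().
survivors_by_size = passengers[passengers["Survived"] == 1].groupby("Sex").size()

# Method 2: Survived is 0/1, so summing it per group counts survivors.
survivors_by_sum = passengers.groupby("Sex")["Survived"].sum()

print(survivors_by_size)
print(survivors_by_sum)
```

One difference worth knowing: if a group had no survivors at all, the first method would leave it out of the result, while the second would report it with a count of 0.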
How can we identify the youngest survivor and how old they were?
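One possible way to answer this, sketched on a made-up sample (the approach with `idxmin` is my assumption, not necessarily the only option):

```python
import pandas as pd

# Made-up sample; one survivor has no recorded age.
passengers = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [1, 0, 1, 1],
    "Age": [30.0, 0.5, 2.0, None],
})

survivors = passengers[passengers["Survived"] == 1]

# idxmin() skips missing ages and returns the row label of the smallest one.
youngest = survivors.loc[survivors["Age"].idxmin()]

print(youngest["PassengerId"], youngest["Age"])
```

Filtering to survivors first matters here: the youngest passenger in this sample (aged 0.5) did not survive, so they must not be picked.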
Which age group had the largest number of survivors?
As per the results, survivors consisted mostly of adults, but which age group had the highest survival rate?
To find out, I counted the total number of passengers in each age group using a function similar to the one above, this time with a condition on age only, and computed the percentage of survivors as shown in the plot. Children had the highest survival rate of all.
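A sketch of the count-versus-rate comparison using `pd.cut`. The sample rows, the age boundaries and the group labels are all assumptions made for illustration, not the ones from my project:

```python
import pandas as pd

# Made-up sample chosen so adults dominate the counts.
passengers = pd.DataFrame({
    "Age": [4, 10, 15, 30, 45, 25, 35, 50, 28, 70],
    "Survived": [1, 1, 0, 0, 1, 1, 1, 0, 1, 0],
})

# Assumed age boundaries: (0, 12], (12, 19], (19, 59], (59, 120].
bins = [0, 12, 19, 59, 120]
labels = ["child", "teen", "adult", "senior"]
passengers["AgeGroup"] = pd.cut(passengers["Age"], bins=bins, labels=labels)

grouped = passengers.groupby("AgeGroup", observed=False)["Survived"]
survivor_counts = grouped.sum()           # number of survivors per group
survival_rate_pct = grouped.mean() * 100  # survivors as a % of the group

print(survivor_counts)
print(survival_rate_pct)
```

The sample shows why both numbers are needed: the largest count of survivors can belong to one group while the highest rate belongs to another.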
Was there any strong correlation between the socio-economic group the passengers belonged to and their survival rates?
This time I improved my code and realized that using functions reduced repetition and made mistakes less likely.
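To illustrate the idea, here is a small reusable helper. The name `survival_rate_by` and the sample rows are hypothetical, written just for this sketch:

```python
import pandas as pd

def survival_rate_by(df, column):
    """Return the percentage of survivors for each value of `column`."""
    return df.groupby(column)["Survived"].mean() * 100

# Made-up sample data.
passengers = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "Sex": ["female", "male", "female", "male",
            "male", "male", "female", "male"],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# The same function answers the question for any column.
by_class = survival_rate_by(passengers, "Pclass")
by_sex = survival_rate_by(passengers, "Sex")

print(by_class)
print(by_sex)
```

Writing the calculation once means a fix or change only has to be made in one place, instead of in every copy of the same few lines.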
I was also curious to find out whether females had a higher chance of survival in every socio-economic group.
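Grouping by two columns at once is one way to check this. The sketch below uses made-up rows and is not the exact code from my project:

```python
import pandas as pd

# Made-up sample data.
passengers = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Sex": ["female", "male", "male", "female",
            "male", "female", "female", "male"],
    "Survived": [1, 1, 0, 1, 0, 1, 0, 0],
})

# Survival rate (%) per class and sex, with one column per sex.
rates = (
    passengers.groupby(["Pclass", "Sex"])["Survived"].mean().unstack("Sex") * 100
)

# True only if females beat males in every class.
females_higher_everywhere = (rates["female"] > rates["male"]).all()

print(rates)
print(females_higher_everywhere)
```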
Then I wondered whether having any relatives aboard helped passengers survive.
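A sketch of one way to test this, assuming "having relatives aboard" means SibSp plus Parch is greater than zero. The rows and the `Group` labels are made up:

```python
import pandas as pd

# Made-up sample data.
passengers = pd.DataFrame({
    "SibSp": [1, 0, 0, 2, 0, 0],
    "Parch": [0, 0, 1, 1, 0, 0],
    "Survived": [1, 0, 1, 0, 0, 1],
})

# Assumption: any sibling, spouse, parent or child counts as a relative.
has_relatives = (passengers["SibSp"] + passengers["Parch"]) > 0
passengers["Group"] = has_relatives.map({True: "with relatives", False: "alone"})

rate_by_group = passengers.groupby("Group")["Survived"].mean() * 100
print(rate_by_group)
```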
I also thought I should investigate whether all children had parents accompanying them or whether some travelled with nannies, and whether children with nannies had a higher rate of survival.
The data provided is not sufficient for a thorough analysis, but the results I reached showed that children with nannies did not have a significantly higher rate of survival.
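Since the dataset has no nanny column, any check here rests on assumptions. In this sketch a child is anyone aged 12 or under, and a child with Parch equal to 0 is treated as possibly travelling with a nanny. The cut-off, the labels and the sample rows are all made up for illustration:

```python
import pandas as pd

# Made-up sample data.
passengers = pd.DataFrame({
    "Age": [3, 5, 9, 12, 40, 7],
    "Parch": [1, 0, 2, 0, 0, 1],
    "Survived": [1, 1, 0, 0, 1, 1],
})

CHILD_MAX_AGE = 12  # assumed cut-off for "child"
children = passengers[passengers["Age"] <= CHILD_MAX_AGE].copy()

# Assumption: no parent aboard (Parch == 0) suggests a nanny or other carer.
children["Companion"] = children["Parch"].gt(0).map(
    {True: "parent", False: "possibly nanny"}
)

# Group size and survival rate side by side, since small groups are unreliable.
rates = children.groupby("Companion")["Survived"].agg(["count", "mean"])
print(rates)
```

Showing the count next to the rate is a reminder of how few children fall into each group, which is part of why the data cannot support a firm conclusion.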
This project analyzed data about the passengers aboard the Titanic and their survival statistics. As it was my first data analysis project, I have to say it was an exciting learning experience, and it answered my question as to how coding makes analysis much easier. At first I was confused about where to start, but Lesson 1 of the Intro to Data Analysis course on Udacity introduced me to the different stages of data analysis, which I then followed.
The same course also gives an insight into numpy and pandas, which help to obtain and represent results from the data much faster and came in very handy for me. Storing my data in a data frame helped me visualize it, and from there I went on to check whether any data was missing.
Now it was time to answer my questions about the data. I started by seeing how many passengers survived in total and then looked at different factors that could have affected survival rates. The results showed that more females survived overall. Looking at the different age groups among passengers, we can also clearly see that children had the highest survival rate and that rates fell as age increased.
Looking at the socio-economic class each passenger belonged to, it was evident that upper class passengers had higher survival rates. An interesting observation was that within each socio-economic class, females still had higher survival rates than males. Another factor I looked into was whether having relatives or nannies aboard increased a passenger’s chance of survival, and the results did show that passengers with relatives had higher survival rates.
Overall, it was amazing to code and see it work in practice, although I did face a few challenges. Multiple errors at once can be quite frustrating, but understanding what each error meant, using peer review and going through the documentation available online guided me through. Extended hours of coding can also be tiring, but switching to other work for a while and taking sufficient breaks from time to time made me more efficient. Joining relevant meetups also showed me various perspectives on how to analyze data and introduced me to easier and more efficient coding practices.
Finally, I would like to mention that peer review was a major part of my project. Listening to feedback from a different point of view and implementing it only makes your work better and helps you overcome your challenges, and I learnt a lot more this way too.