How can we better wrangle, analyze, and visualize data?
A few techniques that could come in handy :]
This article briefly explains SQL queries, statistical tests, and visualization methods using the New York subway weather data.
These are covered in detail in Udacity's course “Intro to Data Science”, which I highly recommend.
Wrangling subway data
- Using SQL queries: Pandasql lets us run SQL queries directly on data frames, which makes accessing, reading, and interpreting the data much easier, especially for someone who is new to Python or pandas. It helps us pick out the columns and rows we need to make predictions or arrive at conclusions. To gain a better intuition for how SQL works, please refer to this course.
Does the day of the week affect hourly ridership in the subway? A query that averages hourly entries by day of the week can help answer this.
First, create a new data frame with only the columns you need. This is especially helpful when the original data frame is large, since it makes the relevant columns easier to refer to. Then write your query, and finally run it with pandasql to view the results.
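The steps above can be sketched roughly as follows. This is a minimal illustration, not the exact query from the course: the column names `DATEn` and `ENTRIESn_hourly` and the toy values are assumptions. Pandasql runs queries through SQLite, so this sketch uses Python's built-in `sqlite3` module directly so that it runs without extra installs; with pandasql, the same query string would instead be passed to `pandasql.sqldf(query, locals())`.

```python
import sqlite3

import pandas as pd

# Toy stand-in for the subway weather data.
# Column names and values are assumptions made for illustration only.
weather_data = pd.DataFrame({
    "DATEn": ["2011-05-01", "2011-05-01", "2011-05-02", "2011-05-02", "2011-05-03"],
    "ENTRIESn_hourly": [100.0, 300.0, 500.0, 700.0, 400.0],
    "rain": [0, 0, 1, 1, 0],
})

# Step 1: keep only the columns the question needs.
subset = weather_data[["DATEn", "ENTRIESn_hourly"]]

# Step 2: write the query. strftime('%w', ...) is SQLite's day-of-week
# function (0 = Sunday, ..., 6 = Saturday).
query = """
SELECT CAST(strftime('%w', DATEn) AS INTEGER) AS weekday,
       AVG(ENTRIESn_hourly) AS avg_entries
FROM subset
GROUP BY weekday
ORDER BY weekday
"""

# Step 3: run the query against an in-memory SQLite database and view the result.
conn = sqlite3.connect(":memory:")
try:
    subset.to_sql("subset", conn, index=False)
    result = pd.read_sql_query(query, conn)
finally:
    conn.close()

print(result)
```

Each row of `result` holds one day of the week and its average hourly entries, so comparing the rows gives a first impression of how ridership varies across the week.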
Applying statistics to analyze
- Mann-Whitney U-test: A non-parametric statistical test that does not assume our data is drawn from any particular underlying probability distribution. It lets us check whether the two data sets being compared come from the same distribution, that is, whether values in one tend to be larger than in the other; it does not compare means directly. Please check this link to understand the code for the test.
Let’s use the test to find out whether rain affected ridership in the subway, at a significance level of 0.05.
As observed, the means of the two data sets are close to each other, so comparing them by eye is inconclusive; using the p-value to reach a conclusion would be more accurate.
Let's perform a one-sided test on the two data sets.
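A one-sided test along these lines could be sketched with SciPy's `mannwhitneyu`. The two samples here are synthetic stand-ins generated with NumPy, not the real subway data, and the names `entries_rain` and `entries_no_rain` are assumptions, so the printed numbers say nothing about actual ridership.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Synthetic hourly entries for rainy and non-rainy periods.
# These are made-up samples for illustration, not the subway data.
rng = np.random.default_rng(0)
entries_rain = rng.exponential(scale=1100.0, size=500)
entries_no_rain = rng.exponential(scale=1000.0, size=500)

alpha = 0.05  # significance level used in the article

# Compare the sample means first, as a rough sanity check.
print("mean with rain:   ", entries_rain.mean())
print("mean without rain:", entries_no_rain.mean())

# One-sided test. Null hypothesis: both samples come from the same
# distribution. Alternative: entries on rainy periods tend to be larger.
u_stat, p_value = mannwhitneyu(entries_rain, entries_no_rain, alternative="greater")
print("U statistic:", u_stat)
print("p-value:    ", p_value)

# Reject the null hypothesis only if the p-value falls below alpha.
reject_null = p_value < alpha
print("reject null hypothesis:", reject_null)
```

Passing `alternative="greater"` makes the direction of the test explicit; `alternative="two-sided"` would instead test for a difference in either direction.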
Since the p-value is less than half of 0.05, we can reject the null hypothesis, and…