A few techniques that could come in handy :]
This article briefly explains SQL queries, statistical tests, and visualization methods using the New York subway weather data.
These are covered in detail in the Udacity course “Intro to Data Science”, which I highly recommend.
Wrangling subway data
- Using SQL queries: Pandasql makes it much easier to access, read, and interpret data stored in data frames, especially for someone who is new to Python or pandas. It helps us select the columns and rows we need to make predictions or reach conclusions. To gain a better intuition for how SQL works, please refer to this course.
Does the day of the week affect hourly ridership in the subway? A SQL query can help answer this.
First, create a new data frame containing only the columns you need. This is especially helpful when the original data frame is large, since it makes the columns easier to refer to. Then write your query, and finally run it with pandasql to view the results.
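The three steps above could look roughly like the sketch below. The data frame is a tiny made-up sample, and the column names (`day_week`, `entries_hourly`) are assumptions rather than the real column names of the subway data set. `sqldf` is pandasql's query function; the `run_query` helper and its fallback to Python's built-in `sqlite3` are my own addition, so the sketch still runs where pandasql is missing or does not work with the installed libraries.

```python
import sqlite3

import pandas as pd

# Made-up sample data; the real data set is much larger and its
# column names may differ (these names are assumptions).
turnstile = pd.DataFrame({
    "day_week": [0, 0, 1, 1, 2, 2],          # 0 = Monday, 1 = Tuesday, ...
    "hour": [8, 17, 8, 17, 8, 17],
    "entries_hourly": [1200, 1500, 1100, 1700, 900, 1300],
    "rain": [0, 1, 0, 0, 1, 1],
})

# Step 1: keep only the columns the question needs.
ridership = turnstile[["day_week", "entries_hourly"]]

# Step 2: write the query (average hourly entries per day of the week).
query = """
    SELECT day_week, AVG(entries_hourly) AS avg_entries
    FROM ridership
    GROUP BY day_week
    ORDER BY day_week
"""


def run_query(sql, frames):
    """Run `sql` against the data frames in `frames` (name -> DataFrame)."""
    try:
        from pandasql import sqldf
        return sqldf(sql, frames)
    except Exception:
        # Fallback (not part of pandasql): load the frames into an
        # in-memory SQLite database and run the same query there.
        conn = sqlite3.connect(":memory:")
        try:
            for name, frame in frames.items():
                frame.to_sql(name, conn, index=False)
            return pd.read_sql_query(sql, conn)
        finally:
            conn.close()


# Step 3: run the query and look at the result.
result = run_query(query, {"ridership": ridership})
print(result)
```

With real data, a clear difference between the per-day averages would suggest that the day of the week matters.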
Applying statistics to analyze
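- Mann-Whitney U-test: A non-parametric statistical test that does not assume our data is drawn from any particular underlying probability distribution. It tests the null hypothesis that the two data sets being compared come from the same distribution, that is, that values from one are not systematically larger than values from the other. Note that it compares the distributions through their ranks, not the means directly. Please check this link to understand the code for the test.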
Let’s use the test to find out whether rain affected ridership in the subway, with a significance level of 0.05.
The means of the two data sets do not differ by much, so the means alone cannot tell us whether rain matters; the p-value from the test is a more reliable basis for a conclusion.
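As a rough sketch of this step, the snippet below compares the two means and then runs a two-sided Mann-Whitney U-test with `scipy.stats.mannwhitneyu`. The ridership numbers are randomly generated stand-ins (an assumption made only so the example runs on its own), so the printed values will not match the ones from the real subway data.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Synthetic, right-skewed "hourly entries" standing in for the real data.
rng = np.random.default_rng(seed=0)
entries_rain = rng.exponential(scale=1100, size=500)
entries_no_rain = rng.exponential(scale=1090, size=500)

# Comparing means alone says little when the data is this skewed.
mean_rain = entries_rain.mean()
mean_no_rain = entries_no_rain.mean()

# Two-sided test: is there a difference in either direction?
u_stat, p_two_sided = mannwhitneyu(
    entries_rain, entries_no_rain, alternative="two-sided"
)

alpha = 0.05  # significance level used in this article
reject_null = p_two_sided < alpha

print(f"mean with rain:    {mean_rain:.1f}")
print(f"mean without rain: {mean_no_rain:.1f}")
print(f"U = {u_stat:.1f}, two-sided p-value = {p_two_sided:.4f}")
print("reject null hypothesis:", reject_null)
```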
Let's perform a one-sided test on the two data sets.
Since the one-sided p-value is below 0.025, which is half of our 0.05 significance level, we can reject the null hypothesis and conclude that ridership without rain is greater.
I prefer a one-sided test to a two-sided one: a two-sided test tells you that the test statistic fell into one of the critical regions, but not which one.
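A one-sided test can be sketched with the `alternative="greater"` option of `scipy.stats.mannwhitneyu`, which asks whether values in the first sample tend to be larger than values in the second. The helper name `first_sample_tends_larger` and the two short lists of numbers are hypothetical and chosen only to illustrate the direction of the test; they are not the article's data.

```python
from scipy.stats import mannwhitneyu


def first_sample_tends_larger(x, y, alpha=0.05):
    """One-sided Mann-Whitney U-test.

    Returns (reject, p_value), where reject is True when we can reject
    the null hypothesis in favour of "x tends to be larger than y".
    """
    _, p_value = mannwhitneyu(x, y, alternative="greater")
    return bool(p_value < alpha), p_value


# Made-up hourly entries; the "no rain" values are mostly higher.
no_rain = [5, 7, 9, 11, 13, 15, 17, 19, 21, 23]
rain = [1, 2, 3, 4, 6, 8, 10, 12, 14, 16]

reject, p_greater = first_sample_tends_larger(no_rain, rain)
print(f"no rain > rain: p = {p_greater:.4f}, reject null: {reject}")

# Swapping the samples tests the opposite direction, which is exactly
# the information a two-sided test does not give us.
reject_reverse, p_reverse = first_sample_tends_larger(rain, no_rain)
print(f"rain > no rain: p = {p_reverse:.4f}, reject null: {reject_reverse}")
```

Stating the direction up front is what makes the conclusion "ridership without rain is greater" possible, rather than only "ridership differs".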
Python has a number of plotting libraries to choose from. One of the most popular is matplotlib. We will also look into using plotnine, which is a Python implementation of the ggplot2 package from R.
- Using matplotlib: matplotlib can be used to plot a histogram that compares ridership with and without rain.
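A minimal sketch of such a histogram is shown here. The two arrays are randomly generated placeholders for the hourly entries with and without rain (an assumption, not the real data), and the bin range, labels, and title are illustrative choices.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for hourly entries on rainy and non-rainy days.
rng = np.random.default_rng(seed=1)
entries_no_rain = rng.exponential(scale=1090, size=1000)
entries_rain = rng.exponential(scale=1100, size=600)

fig, ax = plt.subplots(figsize=(8, 5))

# Shared bin edges so the two histograms are directly comparable.
bins = np.linspace(0, 6000, 31)
ax.hist(entries_no_rain, bins=bins, alpha=0.6, label="No rain")
ax.hist(entries_rain, bins=bins, alpha=0.6, label="Rain")

ax.set_xlabel("Hourly entries")
ax.set_ylabel("Frequency")
ax.set_title("Hourly ridership with and without rain")
ax.legend()
```

Call `plt.show()` to display the figure, or `fig.savefig("ridership_hist.png")` to write it to a file. A histogram like this also makes it easy to see that the data is heavily skewed, which is why a non-parametric test is a reasonable choice.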
- Using plotnine (aka ggplot):
To produce a plot with plotnine we must provide three things:
- A data frame containing our data.
- Aesthetics: which columns of the data frame should be translated into positions, colors, sizes, and shapes of graphical elements.
- Geometric objects: the actual graphical elements to display.
Let's use this to plot a line chart with the data frame obtained using the SQL query explained above.
The aesthetics name the columns that hold the data, and for a line chart the geometric object to use is geom_line. The color of the line can also be specified.
Titles for the chart and the axes can be included in the code as well.
If you found this article useful please do clap :)