How can we better wrangle, analyze, and visualize data?

A few techniques that could come in handy :]

This article briefly explains SQL queries, statistical tests, and visualization methods, using subway ridership data as the running example.

These are covered in detail in a course by Udacity, which I highly recommend.

Wrangling subway data

  • Using SQL queries: SQL makes accessing, reading, and interpreting data stored in data frames much easier, especially for someone who is new to Python or pandas. It helps us select the columns and rows we need to make predictions or arrive at conclusions. To gain a better intuition for how SQL works, please do refer to this.

Does the day of the week affect hourly ridership in the subway? A SQL query can help you find the answer.

First, create a new data frame with only the columns you need. This is especially useful when the original data frame is large, since it makes the data in those columns easier to refer to. Then write your query, and finally use pandasql to run it and view the results.
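Here is a minimal sketch of those three steps. The data is a tiny made-up frame, and the column names (`day_week`, `entries_hourly`) are hypothetical, not the real dataset's. pandasql runs queries against an in-memory SQLite database by default, so the sketch does the same thing directly with `sqlite3` and pandas to avoid an extra dependency; with pandasql installed, `pandasql.sqldf(query, locals())` should be the equivalent call.

```python
import sqlite3

import pandas as pd

# Toy data, made up for illustration only; column names are hypothetical.
turnstile = pd.DataFrame({
    "station": ["A", "A", "B", "B", "C", "C"],
    "day_week": [0, 0, 1, 1, 5, 5],          # 0 = Monday ... 6 = Sunday
    "entries_hourly": [1200, 1000, 1300, 1100, 600, 400],
    "rain": [0, 1, 0, 1, 0, 1],
})

# Step 1: keep only the columns the question needs.
ridership = turnstile[["day_week", "entries_hourly"]]

# Step 2: write the query (SQLite syntax, the same dialect pandasql uses by default).
query = """
    SELECT day_week, AVG(entries_hourly) AS avg_entries
    FROM ridership
    GROUP BY day_week
    ORDER BY day_week
"""

# Step 3: load the frame into an in-memory SQLite database and run the query.
conn = sqlite3.connect(":memory:")
ridership.to_sql("ridership", conn, index=False)
by_day = pd.read_sql_query(query, conn)
conn.close()

print(by_day)
```

The result is an ordinary data frame with one row per day of the week, which makes it easy to pass on to a plotting library later.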

Applying statistics to analyze

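  • Mann-Whitney U-test: A non-parametric statistical test that does not assume our data is drawn from any particular underlying probability distribution. Its null hypothesis is that the two data sets being compared come from the same distribution, so that a value picked at random from one is equally likely to be larger or smaller than a value picked from the other. It works on ranks rather than on the means themselves. Please check this to understand the code for the test.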

Let’s use the test to find out if the rain did affect ridership in the subway, with a significance level of 0.05.

As observed, the means of the two data sets are close to each other, so comparing the means alone does not settle the question; using the p-value to reach a conclusion would be more accurate.
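A two-sided test of this kind could be run as sketched below with `scipy.stats.mannwhitneyu`. The two samples are synthetic numbers generated just for this example (they are not the subway data), and the variable names are my own.

```python
import numpy as np
from scipy.stats import mannwhitneyu

ALPHA = 0.05  # significance level used in the article

# Synthetic, made-up hourly entries (NOT the real data); skewed on purpose,
# since ridership counts are usually far from normally distributed.
rng = np.random.default_rng(0)
entries_no_rain = rng.exponential(scale=1100, size=400)
entries_rain = rng.exponential(scale=1000, size=250)

print("mean without rain:", entries_no_rain.mean())
print("mean with rain:   ", entries_rain.mean())

# Two-sided test: is there a difference in either direction?
result = mannwhitneyu(entries_no_rain, entries_rain, alternative="two-sided")
u_statistic, p_value = result.statistic, result.pvalue

reject_null = p_value < ALPHA
print("U =", u_statistic, "p =", p_value, "reject null:", reject_null)
```

Passing `alternative` explicitly is a deliberate choice here: it makes clear which hypothesis the p-value belongs to, regardless of the SciPy version's default.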

Let's perform a one-sided test on the two data sets.

Since the one-sided p-value is less than 0.025 (half of 0.05), it is also below the 0.05 significance level, so we can reject the null hypothesis in favor of the alternative that ridership without rain is greater.
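A one-sided version could look like the sketch below. Again the samples are synthetic, and here they are deliberately shifted so that the direction of the effect is easy to see; this says nothing about the real subway data. With `alternative="greater"`, SciPy tests whether the first sample tends to be larger than the second.

```python
import numpy as np
from scipy.stats import mannwhitneyu

ALPHA = 0.05

# Synthetic, made-up samples (NOT the real data), with a deliberately large shift.
rng = np.random.default_rng(1)
entries_no_rain = rng.exponential(scale=2000, size=300)
entries_rain = rng.exponential(scale=1000, size=300)

# Alternative hypothesis: entries without rain tend to be greater than with rain.
p_greater = mannwhitneyu(entries_no_rain, entries_rain, alternative="greater").pvalue

# The opposite direction, shown only for comparison.
p_less = mannwhitneyu(entries_no_rain, entries_rain, alternative="less").pvalue

print("p (no rain > rain):", p_greater, "reject null:", p_greater < ALPHA)
print("p (no rain < rain):", p_less, "reject null:", p_less < ALPHA)
```

Only the direction chosen before looking at the data should be used for the conclusion; running both directions here just shows how differently they behave.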

I prefer a one-sided test to a two-sided one: with a two-sided test, you know that the test statistic fell into one of the critical regions, but you cannot tell which one.

To understand the Mann-Whitney U-test better, do try watching this; and this would help you understand how one-sided and two-sided tests are normally performed.

Visualizing data

Python has a number of plotting libraries to choose from. One of the most popular is matplotlib. We will also look into using plotnine which is a Python implementation of the ggplot2 package from R.

  • Using matplotlib: matplotlib can be used to plot a histogram that compares ridership with and without rain.
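A rough sketch of such a histogram is shown here. The two samples are synthetic stand-ins rather than the real ridership numbers, and the bin range, labels, and output file name are assumptions of mine.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic, made-up hourly entries (NOT the real data).
rng = np.random.default_rng(0)
entries_no_rain = rng.exponential(scale=1100, size=2000)
entries_rain = rng.exponential(scale=1000, size=1000)

# Shared bin edges so the two histograms are directly comparable.
bins = np.linspace(0, 6000, 31)

fig, ax = plt.subplots(figsize=(8, 5))
ax.hist(entries_no_rain, bins=bins, alpha=0.6, label="No rain")
ax.hist(entries_rain, bins=bins, alpha=0.6, label="Rain")
ax.set_xlabel("Hourly entries")
ax.set_ylabel("Frequency")
ax.set_title("Hourly ridership with and without rain")
ax.legend()

fig.savefig("ridership_histogram.png")
```

In a notebook you would drop the `matplotlib.use("Agg")` line and call `plt.show()` instead of saving to a file.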
  • Using plotnine (aka ggplot):

To produce a plot with plotnine we must provide three things:

  1. A data frame containing our data.
  2. Aesthetics: which columns of the data frame should be translated into positions, colors, sizes, and shapes of graphical elements.
  3. Geometric objects: the actual graphical elements to display.

Let's use this to plot a line chart with the data frame obtained using the SQL query explained above.

The aesthetics name the columns that contain the data, and for a line chart it is geom_line that has to be used. The color of the line can also be specified.

The titles for the chart and axes can also be included in the code.

To know more about plotnine, please do refer to its documentation. For more examples, do visit my GitHub profile.

If you found this article useful please do clap :)

Sri Lankan living in Berlin. Mathematics master's student at Freie Universität. Interested in data science & machine learning.
