# How can we better wrangle, analyze, and visualize data?

**A few techniques that could come in handy :]**

This article briefly explains SQL queries, statistical tests, and visualization methods using the New York subway weather data.

These are covered in detail in the course “Intro to Data Science” by Udacity, which I highly recommend.

**Wrangling subway data**

**Using SQL queries**: pandasql makes accessing, reading, and interpreting data stored in data frames much easier, especially for someone who is new to Python or pandas. It helps us choose the columns and rows we need to make predictions or arrive at conclusions. To gain a better intuition for how SQL works, please refer to this course.

Does the day of the week affect hourly ridership in the subway? A SQL query can help answer that.

First, create a new data frame with only the columns you need. This is especially helpful when the original data frame is large, since it makes the columns easier to refer to. Then write your query, and finally use pandasql to run it and view the results.
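The three steps above can be sketched roughly as follows. The rows are made up, and the column names (`day_week`, `ENTRIESn_hourly`, `rain`) are assumptions for illustration rather than the exact schema of the subway data. pandasql runs SQLite-style queries, so this sketch uses Python's built-in `sqlite3` module to stay dependency-light; with pandasql installed, the same query string would typically be passed to `pandasql.sqldf(query, locals())` instead.

```python
import sqlite3

import pandas as pd

# Made-up rows standing in for the subway weather data.
# Column names are assumptions for illustration.
weather = pd.DataFrame({
    "day_week": [0, 0, 1, 1, 5, 5, 6],
    "ENTRIESn_hourly": [1200, 1000, 1300, 1100, 700, 500, 400],
    "rain": [0, 1, 0, 0, 1, 0, 0],
})

# Step 1: keep only the columns the question needs.
ridership = weather[["day_week", "ENTRIESn_hourly"]]

# Step 2: write the query (average hourly entries per day of the week).
query = """
    SELECT day_week, AVG(ENTRIESn_hourly) AS avg_entries
    FROM ridership
    GROUP BY day_week
    ORDER BY day_week
"""

# Step 3: run it against an in-memory SQLite database and view the result.
conn = sqlite3.connect(":memory:")
ridership.to_sql("ridership", conn, index=False)
result = pd.read_sql_query(query, conn)
conn.close()

print(result)
```

If the averages differ a lot between days, that is a hint (not yet a proof) that the day of the week matters.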

**Applying statistics to analyze**

**Mann-Whitney U-test**: A non-parametric statistical test that does not assume our data is drawn from any particular underlying probability distribution. It tests whether two samples come from the same distribution, that is, whether values in one sample tend to be larger than values in the other. Strictly speaking, it is not a comparison of means. Please check this link to understand the code for the test.
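To make the idea concrete, here is a minimal sketch of the U statistic itself, computed straight from its pairwise definition: count how often a value from one sample beats a value from the other, with ties counting as half. The function name `mann_whitney_u` and the sample numbers are made up for illustration; in practice a library routine such as `scipy.stats.mannwhitneyu` does this for you and also returns a p-value.

```python
def mann_whitney_u(x, y):
    """Return (U_x, U_y) for two samples using the pairwise definition."""
    u_x = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u_x += 1.0   # x wins this pair
            elif xi == yj:
                u_x += 0.5   # a tie counts as half a win
    # The two statistics always sum to len(x) * len(y).
    u_y = len(x) * len(y) - u_x
    return u_x, u_y


# Made-up samples, not the subway data.
sample_a = [3, 5, 8]
sample_b = [1, 2, 4, 5]
u_a, u_b = mann_whitney_u(sample_a, sample_b)
print(u_a, u_b)  # 9.5 2.5
```

If both samples came from the same distribution, each statistic would be expected to land near half of the number of pairs (6 here). A lopsided split suggests that one sample tends to hold larger values.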

Let’s use the test to find out **whether rain affected ridership** in the subway, at a significance level of 0.05.
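A two-sided version of this test could look roughly like the sketch below. It uses `scipy.stats.mannwhitneyu` on randomly generated numbers, not the real subway data, so the printed values say nothing about actual ridership. The variable names are assumptions for illustration.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Made-up hourly entries, NOT the real subway data.
rng = np.random.default_rng(42)
entries_rain = rng.exponential(scale=1000, size=300)
entries_no_rain = rng.exponential(scale=1100, size=300)

alpha = 0.05  # significance level used in this article

# Two-sided test: are the two distributions different in either direction?
result = mannwhitneyu(entries_no_rain, entries_rain, alternative="two-sided")

print(f"mean with rain:    {entries_rain.mean():.1f}")
print(f"mean without rain: {entries_no_rain.mean():.1f}")
print(f"U = {result.statistic:.1f}, p = {result.pvalue:.4f}")

# Reject the null hypothesis only if the p-value is below alpha.
reject_null = bool(result.pvalue < alpha)
print("reject null hypothesis:", reject_null)
```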

The means of the two data sets look close, but a difference in means alone does not tell us whether that difference is significant, so the p-value is the more reliable basis for a conclusion.

Let's perform a **one-sided test** on the two data sets.
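As a rough sketch, a one-sided test only needs a different `alternative` argument. With `alternative="greater"`, `scipy.stats.mannwhitneyu` asks whether the first sample tends to be larger than the second. The data here is again randomly generated for illustration and is not the subway data.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Made-up hourly entries, NOT the real subway data.
rng = np.random.default_rng(42)
entries_rain = rng.exponential(scale=1000, size=300)
entries_no_rain = rng.exponential(scale=1100, size=300)

# One-sided test. Alternative hypothesis: ridership without rain
# tends to be greater than ridership with rain.
one_sided = mannwhitneyu(entries_no_rain, entries_rain, alternative="greater")

print(f"U = {one_sided.statistic:.1f}, one-sided p = {one_sided.pvalue:.4f}")
```

The order of the two arguments matters here: swapping them tests the opposite direction.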

Since the p-value is less than half of 0.05, we can reject the null hypothesis, and therefore ridership without rain is greater.

I prefer a one-sided test to a two-sided one: with a two-sided test you learn that the test statistic fell into one of the critical regions, but not which one.

To understand the Mann-Whitney U-test better, try watching this video. This article explains how one-sided and two-sided tests are normally performed.

**Visualizing data**

Python has a number of plotting libraries to choose from. One of the most popular is **matplotlib**. We will also look at **plotnine**, which is a Python implementation of the ggplot2 package from R.

**Using matplotlib**: matplotlib can be used to plot a histogram showing ridership with and without rain.
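A minimal sketch of such a histogram is shown here. The data is randomly generated, and the column names `ENTRIESn_hourly` and `rain` are assumptions for illustration, so the shape of the plot will not match the real subway data.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up data, NOT the real subway data.
rng = np.random.default_rng(0)
weather = pd.DataFrame({
    "ENTRIESn_hourly": rng.exponential(scale=1000, size=500),
    "rain": rng.integers(0, 2, size=500),  # 0 = no rain, 1 = rain
})

dry = weather.loc[weather["rain"] == 0, "ENTRIESn_hourly"]
wet = weather.loc[weather["rain"] == 1, "ENTRIESn_hourly"]

# Shared bin edges so the two histograms are directly comparable.
bins = np.linspace(0, weather["ENTRIESn_hourly"].max(), 31)

fig, ax = plt.subplots()
counts_dry, _, _ = ax.hist(dry, bins=bins, alpha=0.6, label="No rain")
counts_wet, _, _ = ax.hist(wet, bins=bins, alpha=0.6, label="Rain")
ax.set_xlabel("Hourly entries")
ax.set_ylabel("Frequency")
ax.set_title("Hourly ridership with and without rain")
ax.legend()
# In an interactive session, plt.show() displays the figure.
```

Using the same bin edges and some transparency (`alpha`) keeps both distributions visible when they overlap.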

**Using plotnine (aka ggplot)**:

To produce a plot with plotnine we must provide three things:

- A data frame containing our data.
- Aesthetics: which columns of the data frame should be translated into positions, colors, sizes, and shapes of graphical elements.
- Geometric objects: the actual graphical elements to display.

Let's use this to plot a **line chart** with the data frame obtained using the SQL query explained above.

The aesthetics name the columns that hold the data, and for a line chart the geometric object to use is geom_line. The color of the line can also be specified.

The titles for the chart and axes can also be included in the code.

To learn more about plotnine, please refer to this link. For more examples, visit this page on my GitHub profile.

If you found this article useful please do clap :)