
The Data Analysis Process - Case Study 1

Jupyter Notebook for Case Study 1

In this first case study, you’ll perform the entire data analysis process to investigate a dataset on wine quality. Along the way, you’ll explore new ways of manipulating data with NumPy and Pandas, as well as powerful visualization tools with Matplotlib.

We’re going to investigate this dataset on physicochemical properties and quality ratings of red and white wine samples. Let’s take a closer look at its attributes and pose some questions for our analysis!

Reading CSV files isn’t always the same process, and you won’t always know what to expect. A file might use a different delimiter, lack column labels, or contain blank lines, comments, or header text. Most of the time, quick trial and error with Pandas does the trick. Alternatively, you can inspect the file with a text editor or a spreadsheet program such as Google Sheets. This is not recommended for large files, though, because they can slow down or crash the program. A better way to inspect large files is with your terminal. (You don’t need to know about this for this lesson.)
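As a rough illustration of that trial and error, the sketch below reads a small in-memory sample rather than a real file. The sample text, its column names, and the semicolon delimiter are all assumptions made for the example; check what your own files actually contain.

```python
import io
import pandas as pd

# Hypothetical sample standing in for a file on disk. The header and
# values are made up for illustration only.
raw = """# a comment line the parser should skip
fixed acidity;pH;quality

7.4;3.51;5
7.8;3.20;5
"""

# sep sets the delimiter, comment drops lines that start with '#',
# and blank lines are skipped by default (skip_blank_lines=True).
df = pd.read_csv(io.StringIO(raw), sep=';', comment='#')

print(df.shape)
print(df.columns.tolist())
```

If the shape shows a single column whose name contains the whole header line, the delimiter is probably wrong, which is usually the first thing to adjust.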

Assessing Data

Using Pandas, explore winequality-red.csv and winequality-white.csv in the Jupyter notebook below. Then answer the quiz questions that follow the notebook about these characteristics of the datasets:

  • number of samples in each dataset
  • number of columns in each dataset
  • features with missing values
  • duplicate rows in the white wine dataset
  • number of unique values for quality in each dataset
  • mean density of the red wine dataset
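Each item in the list above maps onto a short Pandas call. The sketch below uses a tiny made-up dataframe in place of the wine data, so the values are not answers to the quiz; only the method calls carry over.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one of the wine dataframes.
df = pd.DataFrame({
    'pH': [3.51, 3.20, 3.20, np.nan],
    'density': [0.9978, 0.9968, 0.9968, 0.9970],
    'quality': [5, 5, 5, 6],
})

n_samples, n_columns = df.shape             # rows and columns
missing_per_column = df.isnull().sum()      # missing values per feature
n_duplicates = df.duplicated().sum()        # rows repeating an earlier row
n_quality_values = df['quality'].nunique()  # distinct quality ratings
mean_density = df['density'].mean()         # mean of a single column

print(n_samples, n_columns, n_duplicates, n_quality_values, mean_density)
print(missing_per_column)
```

Calling df.info() is another quick way to see the column count and the non-null count of every column at once.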

This data was originally taken from here.

Appending and NumPy

Why is C so fast?

Here are nice, short readings on how NumPy was created.

Renaming columns

  • A Pandas Index does not support mutable operations, so individual column labels cannot be changed in place. In the video, all of the column names are copied into a new list, the list is modified, and the list is then reassigned as the dataframe’s columns.
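A minimal sketch of that renaming pattern follows. The dataframe and the column name total_sulfur-dioxide are hypothetical, chosen only to give the example something to rename.

```python
import pandas as pd

# Hypothetical dataframe with one label we want to change.
df = pd.DataFrame({'total_sulfur-dioxide': [34.0], 'pH': [3.51]})

# Assigning to one position of the Index raises a TypeError,
# because an Index is immutable.
try:
    df.columns[0] = 'total_sulfur_dioxide'
except TypeError as err:
    print('TypeError:', err)

# Copy the labels to a list, edit the list, then reassign it.
new_labels = list(df.columns)
new_labels[0] = 'total_sulfur_dioxide'
df.columns = new_labels

print(df.columns.tolist())
```

An alternative that avoids the intermediate list is df.rename(columns={'old': 'new'}), which returns a new dataframe with the mapped labels.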

Pandas Groupby

Learn more about pandas Groupby and view its documentation here.
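As a quick reminder of the basic pattern, here is a small sketch of groupby on made-up data. The color column and its values are assumptions for the example, not the contents of the wine files.

```python
import pandas as pd

# Hypothetical data: two groups with two quality ratings each.
df = pd.DataFrame({
    'color': ['red', 'red', 'white', 'white'],
    'quality': [5, 7, 6, 8],
})

# Split the rows by color, then take the mean quality within each group.
mean_quality = df.groupby('color')['quality'].mean()

print(mean_quality)
```

The result is a Series indexed by the group labels, so mean_quality['red'] looks up a single group's mean.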

Pandas cut

You can create a categorical variable from a quantitative variable by defining your own categories. The pandas cut function lets you “cut” data into groups.

Using this, create a new column called acidity_levels with these categories:

Acidity Levels:

  • High: Lowest 25% of pH values
  • Moderately High: 25% - 50% of pH values
  • Medium: 50% - 75% of pH values
  • Low: 75% - max pH value

Here, the data is being split at the 25th, 50th, and 75th percentile. Remember, you can get these numbers with pandas’ describe()! After you create these four categories, you’ll be able to use groupby to get the mean quality rating for each acidity level.
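The steps above can be sketched as follows, using a small made-up dataframe instead of the wine data. The pH and quality values are hypothetical; the bin edges are read from describe() as suggested.

```python
import pandas as pd

# Hypothetical pH and quality values for illustration.
df = pd.DataFrame({
    'pH': [2.9, 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6],
    'quality': [7, 6, 6, 5, 6, 5, 5, 4],
})

# describe() reports min, 25%, 50%, 75%, and max, which are the bin edges.
stats = df['pH'].describe()
bin_edges = [stats['min'], stats['25%'], stats['50%'], stats['75%'], stats['max']]

# One label per bin, ordered from lowest pH (most acidic) to highest.
bin_names = ['High', 'Moderately High', 'Medium', 'Low']

# include_lowest=True keeps the minimum pH value inside the first bin.
df['acidity_levels'] = pd.cut(df['pH'], bin_edges, labels=bin_names,
                              include_lowest=True)

# Mean quality rating for each acidity level.
mean_quality = df.groupby('acidity_levels', observed=True)['quality'].mean()

print(df)
print(mean_quality)
```

By default cut builds intervals that are open on the left, so without include_lowest=True the sample with the minimum pH would fall outside every bin and receive NaN.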

image

Pandas Query

Another useful function that we’re going to use is pandas’ query function.

In the previous lesson, we selected rows in a dataframe by indexing with a mask. Here are those same examples, along with equivalent statements that use query().

# selecting malignant records in cancer data
df_m = df[df['diagnosis'] == 'M']
df_m = df.query('diagnosis == "M"')

# selecting records of people making over $50K
df_a = df[df['income'] == ' >50K']
df_a = df.query('income == " >50K"')

The examples above filtered rows based on columns containing strings. You can also use query to filter rows based on columns containing numeric data, like this.

# selecting records in cancer data with radius greater than the median
df_h = df[df['radius'] > 13.375]
df_h = df.query('radius > 13.375')
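The threshold above is hard-coded. As a small sketch of an alternative, query can reference a local Python variable by prefixing its name with @, so the median can be computed rather than typed in. The dataframe here is made up and is not the cancer data.

```python
import pandas as pd

# Hypothetical stand-in for the cancer data.
df = pd.DataFrame({
    'radius': [10.0, 12.5, 14.25, 20.0],
    'diagnosis': ['B', 'B', 'M', 'M'],
})

# Compute the threshold instead of hard-coding it.
median_radius = df['radius'].median()

# Mask-based selection and the equivalent query() call.
df_h_mask = df[df['radius'] > median_radius]
df_h_query = df.query('radius > @median_radius')

print(df_h_query)
```

Both statements select the same rows; which one reads better is largely a matter of taste.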

Type & Quality Plot - Part 1

You can make aesthetically pleasing data visualizations with seaborn. Here are some cool examples.

Type & Quality Plot - Part 2

Matplotlib Example

Below is the Type and Quality Plot created with Matplotlib. As you can see, Matplotlib gives us much more control over our visualizations.

image

Before we jump into the making of this plot, let’s walk through a simple example using Matplotlib to create a bar chart. We can use pyplot’s bar function for this.
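A minimal sketch of such a bar chart is shown here. The bar heights, labels, and titles are placeholder values invented for the example, and the Agg backend line is only there so the code also runs outside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt

# Placeholder data: x positions, bar heights, and tick labels.
locations = [1, 2, 3]
heights = [224, 620, 425]
labels = ['a', 'b', 'c']

# plt.bar draws one bar per (location, height) pair;
# tick_label names the bars along the x axis.
bars = plt.bar(locations, heights, tick_label=labels)

plt.title('Some Title')
plt.xlabel('Some X Label')
plt.ylabel('Some Y Label')
```

In a Jupyter notebook the figure renders inline after the cell runs; in a script, plt.savefig with a file name writes it to disk.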

Quiz-Visualizing in Matplotlib

Type & Quality Plot with Matplotlib

Below is the code used to create this plot with Matplotlib.

https://video.udacity-data.com/topher/2017/September/59ca8967_matplotlib-preview-plot/matplotlib-preview-plot.png

Example Plot

Jupyter Notebooks

  1. Assessing

  2. Appending and NumPy

  3. EDA Visuals

  4. Pandas cut and groupby

  5. Pandas query and groupby

  6. Matplotlib Example

  7. Wine Visualizations Matplotlib

  8. Type and Quality Visualization Matplotlib
