Why Do We Use Data Visualizations?
Areas of Focus
There are two major areas of focus in this lesson:
- Why are data visualizations more useful for delivering insight than just using summary statistics?
- What plot do you build in a given situation?
Summary Statistics vs. Visualizations
Summary statistics like the mean and standard deviation can be great for attempting to quickly understand aspects of a dataset, but they can also be misleading if you make too many assumptions about how the data distribution looks.
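To make this concrete, the sketch below compares the first two datasets of Anscombe's quartet using only the Python standard library. The helper names (pearson_r, summarize) are illustrative choices, not part of the lesson. It is a minimal illustration of the idea, not a full analysis.

```python
import math
from statistics import mean, stdev

# First two datasets of Anscombe's quartet (they share the same x values).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]


def pearson_r(a, b):
    """Pearson's correlation coefficient of two equal-length sequences."""
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    ss_a = sum((ai - mean_a) ** 2 for ai in a)
    ss_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(ss_a * ss_b)


def summarize(y):
    """Mean, sample standard deviation, and correlation with x, rounded to 2 places."""
    return {
        "mean": round(mean(y), 2),
        "stdev": round(stdev(y), 2),
        "r": round(pearson_r(x, y), 2),
    }


summary_1 = summarize(y1)
summary_2 = summarize(y2)
print(summary_1)
print(summary_2)
```

The two summaries agree to two decimal places even though the individual y values differ. Plotted, the first dataset looks roughly linear and the second follows a curve, which is the kind of difference only a visualization reveals.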
Beyond Anscombe’s Quartet
More recently, Alberto Cairo created the Datasaurus dataset, which is both insightful and artistic and is built on the same idea that you just discovered. You can find the full dataset and the visualizations on the Datasaurus link.
Now take a look at the visualization below, and complete the quiz below by identifying each of the data types.
Univariate Plots
Recommended charts to use
For quantitative data, if we are looking at just one column’s worth of data, we have four common visuals:
- Histogram
- Normal Quantile Plot
- Stem and Leaf Plot
- Box and Whisker Plot
In most cases, you will want to use a histogram.
For categorical data, if we are looking at just one variable (column), we have three common visuals:
- Bar Chart
- Pie Chart
- Pareto Chart
In most cases, you will want to use a bar chart.
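The counts behind both recommended charts can be sketched in a few lines of standard-library Python. The data, the bin width, and the helper name histogram_counts below are all hypothetical, invented purely for illustration. A plotting library would normally draw the bars, so the text bars here are only a stand-in.

```python
from collections import Counter

BIN_WIDTH = 10  # hypothetical bin width chosen for this example


def histogram_counts(values, bin_width):
    """Count how many values fall into each bin [start, start + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical data, invented purely for illustration.
ages = [23, 27, 31, 35, 36, 38, 41, 44, 45, 52]          # quantitative
colors = ["red", "blue", "red", "green", "blue", "red"]  # categorical

age_bins = histogram_counts(ages, BIN_WIDTH)  # histogram: counts per numeric bin
color_counts = Counter(colors)                # bar chart: counts per category

for start, count in age_bins.items():
    print(f"{start}-{start + BIN_WIDTH - 1}: {'#' * count}")
for color, count in color_counts.most_common():
    print(f"{color}: {'#' * count}")
```

The difference between the two is in how the groups are formed: a histogram bins a numeric range, so the bin width is a choice you make, while a bar chart simply counts each category as it appears.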
Scatter Plots
Scatter plots are a common visual for comparing two quantitative variables. A common summary statistic that relates to a scatter plot is the correlation coefficient, commonly denoted by r.
Though there are a few different ways to measure correlation between two variables, the most common way is with Pearson’s correlation coefficient. Pearson’s correlation coefficient provides the:
- Strength
- Direction
of a linear relationship. Spearman’s correlation coefficient does not measure linear relationships specifically. It measures how consistently one variable increases or decreases as the other increases, so it might be more appropriate when two variables are associated in a non-linear way.
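One way to see the difference between the two coefficients is to compute both on data that always increases but not along a straight line. The sketch below assumes that Spearman's coefficient can be computed as Pearson's coefficient applied to ranks, and it uses a simplified ranking helper that assumes there are no tied values. The function names are illustrative, and the exponential data are made up for the example.

```python
import math


def pearson_r(a, b):
    """Pearson's correlation coefficient of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    ss_a = sum((ai - mean_a) ** 2 for ai in a)
    ss_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(ss_a * ss_b)


def ranks(values):
    """Rank values from 1 (smallest) upward. Assumes no ties, for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_r(a, b):
    """Spearman's coefficient as Pearson's coefficient of the ranks."""
    return pearson_r(ranks(a), ranks(b))


# Made-up data: y always increases with x, but exponentially rather than linearly.
x = list(range(1, 11))
y = [math.exp(v) for v in x]

pearson = pearson_r(x, y)
spearman = spearman_r(x, y)
print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

Spearman's coefficient comes out at 1 because the ordering of the points is perfectly consistent, while Pearson's coefficient is noticeably lower because the points do not fall on a line.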
Exercises
A positive, strong relationship. The above data actually has a correlation of almost 0.96.
A moderate, negative relationship. The above data actually has a correlation of almost -0.67.
The points don’t follow the trend as closely as in the scatter plot above, so the correlation coefficient is closer to 0. The coefficient is negative whenever there is a negative relationship between the variables.
The first two are examples of a perfect positive and a perfect negative relationship. In the final plot, there is clearly a relationship, but it is quadratic. Because Pearson’s correlation coefficient only assesses linear relationships, its value here is 0.
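The three cases in the last answer can be reproduced numerically. The data below are made up for illustration (two exact straight lines and a parabola centered on zero), and pearson_r is an illustrative helper rather than a library function.

```python
import math


def pearson_r(a, b):
    """Pearson's correlation coefficient of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    ss_a = sum((ai - mean_a) ** 2 for ai in a)
    ss_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(ss_a * ss_b)


# Made-up x values, symmetric around zero.
x = list(range(-5, 6))

r_positive = pearson_r(x, [2 * v + 1 for v in x])    # points on a rising line
r_negative = pearson_r(x, [-3 * v + 4 for v in x])   # points on a falling line
r_quadratic = pearson_r(x, [v ** 2 for v in x])      # points on a parabola

print(f"perfect positive: {r_positive:.2f}")
print(f"perfect negative: {r_negative:.2f}")
print(f"quadratic:        {r_quadratic:.2f}")
```

The parabola gives a coefficient of 0 because the falling left half and the rising right half cancel out, even though y is completely determined by x. A quadratic pattern that is not symmetric around the mean of x would generally give a non-zero value.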
Correlation Coefficients
Correlation coefficients provide a measure of the strength and direction of a linear relationship.
We can tell the direction based on whether the correlation is positive or negative.
A rule of thumb for judging the strength:
The correlation coefficient can also be calculated in Excel and other spreadsheet applications using CORREL(col1, col2), where col1 and col2 are the two columns you want to compare.
The closer the correlation coefficient is to 1 or -1, the stronger the linear relationship.
A correlation coefficient of 0 doesn’t necessarily mean that there is no relationship between two variables. It only means that there isn’t a linear relationship.
Line Plots
Line plots are a common way to view data over time. They allow us to quickly identify overall trends, seasonal patterns, peaks, and valleys in the data. You will commonly see them used for stock prices, but anything tracked over time can be viewed easily with these plots.
What is the Question?
The key to building great data visualizations is to aim them at the questions you want answered. This presentation showed a number of ways to display the exact same data, depending on the question you want to answer.
What About with More Than Two Variables?
Here you were able to see a number of different visuals for comparing more than two variables - there isn’t a right answer in choosing these plots - and this is where the science can become more of an art. In the next quiz, you will see why we might choose one plotting method vs. another using some of the same data from the video.
Multiple Variables Quiz
Which Plot Is Best?
Below, the same data are shown in four different ways. The data are also available to download and in a Google Sheet. Use the plots to answer the following quiz questions.
Notice that there isn’t one plot that is best for all questions. You need to show the data in a way that makes the insight easy for your audience to see. Depending on the insight you are trying to highlight, you may choose a different way to display the data.
- Which product had the fastest growth in sales from January to July?
Product 3
- In which month did Product 1 have more than 50% of the sales?
February
- Did total sales ever exceed 1000 units?
Yes, in November
Why Data Dashboards
When We Have Lots of Variables
Hans Rosling shows an amazing visualization that incorporates many variables all at once. Take a look for yourself!
Introduction to Data Dashboards
Hans Rosling’s YouTube video summary
Introduction to Visual Encodings
Use the video from the previous concept to assist with this quiz. An image is provided below which may also help with answering the questions if you just need a reminder!
Recap
In this lesson:
- You motivated the need for data visualization by showing that summary statistics don’t tell the full story. You saw datasets where the summary statistics were the same, but the actual data were very different!
- You did a review of data types. In general there are quantitative and categorical variables. Quantitative variables can be either discrete or continuous, while categorical variables are either ordinal or nominal.
- You looked at univariate plots. In most cases a histogram should be used for quantitative data, while a bar chart should be used for categorical data. There are some cases where you might use one of the other plots.
- You then looked at bivariate plots, where you were comparing two variables to one another. Scatter plots are the most common way to visualize two quantitative variables, while a line chart is common for data that you are watching over time. If you are comparing two categorical variables, the best choice is probably a side-by-side bar chart.
- You learned about correlation coefficients, which provide the strength and direction of linear relationships. You learned a rule of thumb for determining whether the relationship between two quantitative variables is strong, moderate, or weak.
- You then looked at cases where we had more than two variables. You learned that using these plots effectively is about building the plot that helps you see the insight that answers the question you have.
- You gained some insight into visual encodings and data dashboards, which will be a part of the next lessons!