
Chapter 53 In Practice Using Stats
In the upcoming chapters we will introduce you to a number of statistical terms and test types. In general, we wouldn’t stress memorizing the specifics of these tests; instead, try to get an intuitive sense for how to use and interpret the stats in the context of your statistical questions.
In this chapter we are going to discuss some general strategies for how to start applying the power of stats to your dataset.
53.1 Tip number 1: Always be looking at your data
Data in real life is messy. You know this because we had a whole section about cleaning data! But even after your data is “tidy” in the sense that it is ready to be used, it still may have other oddities.
We’ve discussed that data science is all about questions! This means that a new data science endeavor often starts with a lot of unknowns. There are:
- Knowns! - These are the things you know.
- Known unknowns - These are the things that you know you don’t know, like the answers to your data science question.
- Unknown unknowns - These are the things that you don’t know that you don’t know. These can be very scary if they affect your data!
So how do we make unknown unknowns a little more known? We have to do lots of exploring!
There are a few initial explorations that are always a great idea to do with your data. As you work on a dataset, you may want to repeat these whenever you apply a new step or transformation to your data.
Practical tips for looking at your data:
- Look at your data in the Environment tab or by printing it out.
- Make a density plot or histogram to look at the shape of your data - more on this in a second!
- Run the summary() function on your data (a short sketch of these first looks follows this list).
- Make sure to look into the documentation for functions that you use to transform your data so you understand what you are doing to it.
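As a concrete sketch of what those first looks might be (using the mpg fuel economy dataset from ggplot2, which we also use later in this chapter - substitute your own data frame):
library(ggplot2)

cars_df <- ggplot2::mpg

# Print the first few rows to see the variables and their types
head(cars_df)

# Get ranges, quartiles, and NA counts for every variable
summary(cars_df)

# Quick look at the shape of one numeric variable
hist(cars_df$hwy)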
As you do these things with your data, you may find some “weirdness” which brings us to our next tip.
53.2 Tip number 2: Dig into weirdness
Often after you have done some initial exploration of your data, or as you are generally working on your analysis, you may see things that make you go “huh?” or are generally just “weird”. One of the most valuable skills a data scientist can have is digging into that weirdness. As you continue to work in the field of data science you will hone your skills of identifying and digging into weirdness. Over time you will get a “spidey sense” and learn what weirdness is worth digging into and what weirdness is actually quite normal.
Digging into peculiar things can lead to making unknown unknowns more known. It may sometimes shift the entire strategy of your analysis! And while this may sound like it will add to your work, it is how the best data science is done!
At this point, we should let you know we’ve been somewhat lying to you at the beginning of each section when we’ve shown you this map:
In reality, good data science doesn’t happen in such a neat, stepwise way.
The best data science is a bit messier than this. Original questions lead to new questions; some pursuits are dead ends; and whole new findings and cool things we never dreamed of sometimes pop up out of nowhere! This means data science is never “write code once and done”; it is “write code, run it, rewrite it, look at your data, learn something, rewrite it again, have new questions”, and so on.
This leads us to another point: the best data science is done iteratively. You won’t write your best analysis in one sitting and never come back to it. Your best analyses will be things you work on over time, incrementally improving and revisiting them.
Practical tips for digging into weirdness:
- Look into weird density plot shapes and points in your data (outliers)
- Look into results and outcomes that are unexpected - are these real or a byproduct of a mistake somewhere?
- Look out for results and outcomes that are too perfect – real life and real data are rarely perfect
- Never assume why something is the way it is without proving it to yourself by digging into it more!
Ultimately, when you see weird things, prove to yourself that the results are what they seem.
53.3 Tip number 3: Let the data inform you
After following our first two tips, you should have a generally good sense of what your data look like and what kind of weirdness exists. Now you will be ready to use what you’ve learned about your data to inform you how to properly handle them!
We discussed in the previous chapter how to translate your data science questions into a stats test. There’s a second part of this consideration, which is taking into account how your data are behaving to know what kinds of stats and questions are appropriate.
Here are some common things your data might tell you and how you might find that out.
For the upcoming examples we are going to use datasets that are included in the ggplot2 package.
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
53.4 How to find out how much data is missing
Datasets often have missing data points because the collection process can be messier than we hope. There can be very good and appropriate reasons to have NAs in the data. Indeed, it can often be the case that NA is a more appropriate way to note a data point than putting in some other value. But before you start doing any analyses, you should get a sense of how much of your data is missing.
Code example for finding missing data (this is not the only way to do this)
Let’s use and set up the Texas Housing sales dataset from ggplot2.
tx_df <- ggplot2::txhousing
head(tx_df)
## # A tibble: 6 × 9
## city year month sales volume median listings inventory date
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Abilene 2000 1 72 5380000 71400 701 6.3 2000
## 2 Abilene 2000 2 98 6505000 58700 746 6.6 2000.
## 3 Abilene 2000 3 130 9285000 58100 784 6.8 2000.
## 4 Abilene 2000 4 98 9730000 68600 785 6.9 2000.
## 5 Abilene 2000 5 141 10590000 67300 794 6.8 2000.
## 6 Abilene 2000 6 156 13910000 66900 780 6.6 2000.
If you know that all your missing data are appropriately labeled as NA, you can just use something like:
summary(tx_df)
## city year month sales
## Length:8602 Min. :2000 Min. : 1.000 Min. : 6.0
## Class :character 1st Qu.:2003 1st Qu.: 3.000 1st Qu.: 86.0
## Mode :character Median :2007 Median : 6.000 Median : 169.0
## Mean :2007 Mean : 6.406 Mean : 549.6
## 3rd Qu.:2011 3rd Qu.: 9.000 3rd Qu.: 467.0
## Max. :2015 Max. :12.000 Max. :8945.0
## NA's :568
## volume median listings inventory
## Min. :8.350e+05 Min. : 50000 Min. : 0 Min. : 0.000
## 1st Qu.:1.084e+07 1st Qu.:100000 1st Qu.: 682 1st Qu.: 4.900
## Median :2.299e+07 Median :123800 Median : 1283 Median : 6.200
## Mean :1.069e+08 Mean :128131 Mean : 3217 Mean : 7.175
## 3rd Qu.:7.512e+07 3rd Qu.:150000 3rd Qu.: 2954 3rd Qu.: 8.150
## Max. :2.568e+09 Max. :304200 Max. :43107 Max. :55.900
## NA's :568 NA's :616 NA's :1424 NA's :1467
## date
## Min. :2000
## 1st Qu.:2004
## Median :2008
## Mean :2008
## 3rd Qu.:2012
## Max. :2016
##
You’ll see this prints out the summary for each variable in this data frame, including the number of NAs.
- However, if your missing data are not appropriately labeled as NA, then you will want to convert them using code described in this article (a minimal sketch follows this list).
- For more on finding missing values.
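As a minimal sketch of both ideas (the -999 placeholder below is a made-up example, not something that appears in the txhousing data, which already uses NA):
# Count the NAs in each column of the Texas housing data
colSums(is.na(tx_df))

# Hypothetical example: if missing values had been recorded as -999,
# dplyr::na_if() converts them to proper NAs
sales_with_placeholder <- c(72, 98, -999, 130)
dplyr::na_if(sales_with_placeholder, -999)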
53.4.1 How to find if you have outliers
Outliers are data points that are extreme. They can throw off whole analyses and lead you to the wrong conclusions. If you have outliers or weird samples in your data, you may want to try removing them, if appropriate, and re-running the test you were using.
Code example for finding outliers (this is not the only way to do this)
Let’s use and set up the Fuel economy dataset from ggplot2.
cars_df <- ggplot2::mpg
head(cars_df)
## # A tibble: 6 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa…
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa…
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa…
## 4 audi a4 2 2008 4 auto(av) f 21 30 p compa…
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa…
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa…
We can make a boxplot with base R.
boxplot(cars_df$hwy, ylab = "hwy")
The points in this boxplot are points you would want to look into as being outliers! You could see what these points are for sure by using dplyr::arrange() or any number of other ways.
cars_df %>%
  dplyr::arrange(dplyr::desc(hwy))
## # A tibble: 234 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 volkswagen jetta 1.9 1999 4 manu… f 33 44 d comp…
## 2 volkswagen new beetle 1.9 1999 4 manu… f 35 44 d subc…
## 3 volkswagen new beetle 1.9 1999 4 auto… f 29 41 d subc…
## 4 toyota corolla 1.8 2008 4 manu… f 28 37 r comp…
## 5 honda civic 1.8 2008 4 auto… f 25 36 r subc…
## 6 honda civic 1.8 2008 4 auto… f 24 36 c subc…
## 7 toyota corolla 1.8 1999 4 manu… f 26 35 r comp…
## 8 toyota corolla 1.8 2008 4 auto… f 26 35 r comp…
## 9 honda civic 1.8 2008 4 manu… f 26 34 r subc…
## 10 honda civic 1.6 1999 4 manu… f 28 33 r subc…
## # ℹ 224 more rows
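If you would rather flag these points programmatically than eyeball the boxplot, here is a minimal sketch using boxplot.stats(), which reports the values that fall outside the whiskers (the same 1.5 x IQR rule that boxplot() uses):
# Values beyond the boxplot whiskers
outlier_values <- boxplot.stats(cars_df$hwy)$out
outlier_values

# Pull out the full rows containing those values
cars_df %>%
  dplyr::filter(hwy %in% outlier_values)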
53.4.2 How to know if your data is underpowered for your question
In order to answer particular questions with a dataset, you need to have enough data in the first place! If you don’t have enough data that means you are underpowered. This may happen if you have a lot of missing data, a painfully small dataset, or the effect you are looking for is very small. In these cases you may need to find another dataset or see if the data collector can collect more data to add to this set. So how do you know if your dataset is underpowered?
Code example exploring power (this is not the only way to do this)
Let’s use and set up the Diamonds dataset from ggplot2.
diamonds_df <- ggplot2::diamonds
head(diamonds_df)
## # A tibble: 6 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Let’s say we are interested in seeing whether the carat is correlated with the price of the diamond. Before we test this, we may want to test the power of our dataset to detect this correlation. For this we will use the pwr.r.test() function from the pwr library.
install.packages('pwr', repos = 'http://cran.us.r-project.org')
## Installing package into '/usr/local/lib/R/site-library'
## (as 'lib' is unspecified)
library(pwr)
We have to tell it a few pieces of info:
1) How many samples do we have?
2) What correlation do we expect?
3) What significance level do we want? (The standard is to use 0.05 or 0.01.)
pwr.r.test(n = nrow(diamonds_df), # How many cases do we have
r = cor(diamonds_df$carat, diamonds_df$price),
sig.level = 0.01)
##
## approximate correlation power calculation (arctangh transformation)
##
## n = 53940
## r = 0.9215913
## sig.level = 0.01
## power = 1
## alternative = two.sided
You’ll see this prints out a 1 for power. This dataset is not underpowered at all. Power is on a scale of 0 to 1, where 0 means you don’t have the power to detect anything and 1 means you will absolutely see a significant result if there is one to be seen.
But let’s look at a different hypothetical situation. Let’s say instead we only had 10 rows of data and the correlation we expected would be more like 0.3.
pwr.r.test(n = 10, # How many cases do we have
r = 0.3,
sig.level = 0.01)
##
## approximate correlation power calculation (arctangh transformation)
##
## n = 10
## r = 0.3
## sig.level = 0.01
## power = 0.03600302
## alternative = two.sided
Now this is telling us our power is very low – meaning even if our hypothesis is true, we don’t have enough data to see this correlation.
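You can also flip the question around and ask pwr.r.test() how many cases we would need to detect that correlation with a reasonable amount of power. Here is a sketch assuming we want 80% power (a common, but not universal, target) along with the same expected correlation and significance level as above:
pwr.r.test(r = 0.3,           # The correlation we expect
           sig.level = 0.01,
           power = 0.8)       # The power we would like to have
Because we left n out, the function solves for it and reports the number of cases we would need.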
53.4.3 How to know how your data are distributed
Perhaps you want to use a particular stats test, but when you read about this stats test, it has an assumption that the data are normally distributed. Stats assumptions are really just requirements for using a method, so if something “assumes a normal distribution” it means that in order to use the test, your data have to be normally distributed.
If you will be using a numeric variable for anything in your analysis it’s a very good idea to plot its density so you know what you are working with!
Code example of looking at distributions
Let’s return to the cars_df dataset we were looking at earlier. To make a density plot, we can use the geom_density() function.
ggplot(cars_df, aes(x = cty)) +
geom_density()
We can see this looks like a fairly normal distribution. What does that mean? Let’s discuss.
53.4.3.1 What does it mean to be “normally distributed?”
How a dataset is distributed has to do with the frequency of the data. So in the example below, we’ve made a probability density plot using ggplot2. See this article for details on making density plots. The higher the line is, the more probable it is that a value in that range will occur.
If your data plot looks like that normal bell-shaped curve, then you have “normally distributed” data. You will want to know what your data distribution looks like so you know what tests are appropriate.
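As a quick sketch of that classic bell shape, simulated normally distributed values produce it when plotted with geom_density() (the rnorm() simulation here is purely for illustration):
set.seed(1234)
normal_df <- data.frame(values = rnorm(1000))

ggplot(normal_df, aes(x = values)) +
  geom_density()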
Let’s look at the distribution of a different variable, sales in the tx_df data:
ggplot(tx_df, aes(x = sales)) +
geom_density()
## Warning: Removed 568 rows containing non-finite values (stat_density).
Is this normally distributed? There are a lot of values that are lower and only some that are higher. Looks pretty skewed. In this case, we probably don’t need to test these data, but we know they aren’t really normally distributed so we should keep that in mind.
If we want a test rather than just using our eyes, we can use shapiro.test(), which asks “are my data normally distributed?” for us. In this instance, we’ll use the iris dataset to test the normality of the variable Sepal.Width.
shapiro.test(iris$Sepal.Width)
##
## Shapiro-Wilk normality test
##
## data: iris$Sepal.Width
## W = 0.98492, p-value = 0.1012
Because the p value reported is bigger than 0.05, we can treat Sepal.Width as a normally distributed variable.
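For contrast, here is a quick sketch of what the same test reports for a clearly non-normal variable, using some simulated right-skewed values (again, simulated purely for illustration). The p value should come out well below 0.05, so we would not treat these data as normally distributed.
set.seed(1234)
skewed_values <- rexp(100)   # Simulated right-skewed data

shapiro.test(skewed_values)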
One other important thing to note: if you have a very small dataset (say 30 observations or fewer), normality tests have very little power, so you can’t really use them to establish that your data are normally distributed.
- Read more about normal distributions here.
- There are formal ways to test for normality and these methods are described in this article that has code examples: How to Assess Normality in R
In conclusion, density plots are a super handy tool to see what your data look like before you dive in to any other tests and analyses. These can inform you about what to do with your data.
Practical tips for figuring out what to do with your data:
- Make density plots to visualize your data’s distribution.
- Google and find sources that discuss the problems you are seeing with your data. It is unlikely that the dataset you are working with is the only dataset that has this weirdness and others online may weigh in.
- Consult a more senior data scientist or statistician. Show them the weirdness you see and ask them what they think of it and what they would do.
- Look for other data analysis examples online that resemble your data and its weirdness. Try to see what others did and if it makes sense to apply to your situation.
53.5 How do I know what test to use?
We’re going to tell you about some common tests, but here’s a flow chart to help you get started. We advise generally having an idea of the tests out there but not getting caught up in the minutiae of every test. If you end up using a particular test on a set of data, that might be a good time to get a practical understanding of the test and how to interpret it, but knowing everything about all statistical tests is just not practical and not a good use of your time.
The important point about choosing a test is realizing that not all tests are appropriate. Indeed, you could use a lot of these tests for a particular dataset, but some tests may lead you to erroneous conclusions if the test is not meant to be used on data like what you are working with.
And as we mentioned with the tips above, don’t be afraid to reach out to a statistician or more senior data scientist to help you choose what is appropriate!
For this section, we’re going to borrow the handy cheatsheets and tables from this article by Rebecca Bevans. Don’t worry about memorizing the specifics of these tests, just generally understand this guide and how to use it and know you can come back to it for a reference.
53.5.3 Correlation tests
Highly similar to regression tests because they use a lot of the same math, correlation tests ask if two variables are related.
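As a minimal sketch, picking up the diamonds example from earlier in this chapter, cor.test() in base R runs a correlation test and reports both the estimated correlation and a p value:
# Pearson correlation test between carat and price
cor.test(diamonds_df$carat, diamonds_df$price)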
53.5.4 Nonparametric tests
Remember how we briefly talked about data being normally distributed or not? It turns out all the tests we just mentioned above ONLY work for data that are normally distributed. Or, as statisticians like to say, those tests “assume normality”. But if your data aren’t normal, you can still do things with them! There are nonparametric equivalents (tests that don’t assume normality) you can use, as sketched below.
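For example, the Pearson correlation test shown above can be switched to a Spearman rank correlation, which does not assume normality, just by changing the method argument (whether Spearman is the right choice for your data is still your call):
# Spearman rank correlation does not assume normally distributed data
cor.test(diamonds_df$carat, diamonds_df$price,
         method = "spearman")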
53.5.5 Additional Resources
- Which Statistical Test to Use? Follow This Cheat Sheet
- Demystifying Statistical Analysis 1: A Handy Cheat Sheet
- Open Case Studies, by Pei-Lun Kuo, Leah Jager, Margaret Taub, and Stephanie Hicks
- Health Expenditures Case Study
- Case Study on GitHub