
Chapter 5 Reshaping Data
5.0.1 Data Formats
Tidy data generally exist in two forms: wide data and long data. Both types of data are used and needed in data analysis, and fortunately, there are tools that can take you from wide-to-long and from long-to-wide. This makes it easy to work with any tidy data set. We’ll discuss the basics of what wide and long data are and how to go back and forth between the two in R. Getting data into the right format will be crucial later when summarizing data and visualizing it.
5.0.1.1 Wide Data
Wide data has a column for each variable and a row for each observation. Data are often entered and stored in this manner. This is because wide data are often easy to understand at a glance. For example, this is a wide data set:
Wide dataset
This is a dataset we’ve looked at in a previous lesson. As discussed previously, it’s a rectangular and tidy dataset. Now, we can also state that it is a wide dataset. Here you can clearly see what measurements were taken for each individual and can get a sense of how many individuals are contained in the dataset.
Specifically, each individual is in a different row with each variable in a different column. At a glance we can quickly see that we have information about four different people and that each person was measured in four different ways.
5.0.1.2 Long Data
Long data, on the other hand, has one column indicating the type of variable contained in that row and then a separate row for the value for that variable. Each row contains a single observation for a single variable. Below is an example of a long data set:
Long dataset
This long dataset includes the exact same information as the previous wide dataset; it is just stored differently. It’s harder to see visually how many different measurements were taken and on how many different people, but the same information is there.
While long data formats are less readable than wide data at a glance, they are a lot easier to work with during analysis. Most of the tools we’ll be working with use long data. Thus, to go from how data are often stored (wide) to working with the data during analysis (long), we’ll need to understand what tools are needed to do this and how to work with them.
5.0.2 R Packages
Converting your data from wide-to-long or from long-to-wide data formats is referred to as reshaping your data. There are two primary packages in R that will help you reshape your data: tidyr and reshape2. We’ll walk through the important functions of these two packages and work through a few examples using the functions in each. However, as with most helpful packages in R, there is more functionality than what is discussed here, so feel free to explore the additional resources at the bottom to learn even more.
Reshaping data
For these examples, we’ll work with the airquality
dataset available in R. The data in this dataset includes “Daily air quality measurements in New York, May to September 1973.” This is a wide dataset because each day is in a separate row and there are multiple columns with each including information about a different variable (ozone, solar.r, wind, temp, month, and day).
We can see the first few lines of this dataset using the following code:
head(airquality)
Air quality dataset
Again, wide data are easy to decipher at a glance. We can see that we have six different variables for each day, with each one of these variables (measurements) being stored in a separate column.
5.0.2.1 tidyr
Within tidyr, there are two functions to help you reshape your data.
gather()
: go from wide data to long dataspread()
: go from long data to wide data
To get started, you’ll need to be sure that the tidyr
package is installed and loaded into your RStudio session.
install.packages("tidyr")
library(tidyr)
5.0.2.1.1 gather()
As data are often stored in wide formats, you’ll likely use gather()
a lot more frequently than you’ll use spread()
. This will allow you to get the data into a long format that will be easy to use for analysis.
In tidyr
, gather()
will take the airquality
dataset from wide to long, putting each column name into the first column and each corresponding value into the second column. Here, the first column will be called key
. The second column will still be value
.
## use gather() to reshape from wide to long
<- gather(airquality)
gathered
## take a look at first few rows of long data
head(gathered)
gather dataset
However, it’s very easy to change the names of these columns within gather()
. To do so you define what the key and value columns names should be within gather()
:
## to rename the column names that gather provides,
## change key and value to what you want those column names to be
<- gather(airquality, key="variable", value="value")
gathered
## take a look at first few rows of long data
head(gathered)
gather column names changed
However, you’re likely not interested in your day and month variable being separated out into their own variables within the key
column. In fact, knowing the day and month associated with a particular data point helps identify that particular data point. To account for this, you can exclude day
and month
from the variables being included in the key
column by specifying all the variables that you do want included in the key
column. Here, that means specifying ozone
, solar.r
, wind
, and temp
. This will keep day
and month
in their own columns, allowing each row to be identified by the specific day and month being discussed.
## in gather(), after key and value, you can specify which variables
## you want included in the long format
## it will leave the other variables as is
<- gather(airquality, key="variable", value="value", ozone, solar.r, wind, temp)
gathered
## take a look at first few rows of long data
head(gathered)
gather specifying which variables to include in long format
Now, when you look at the top of this object, you’ll see that month
and day
remain in the data frame and that variable combines information from the other columns in airquality (ozone
, solar.r
, wind
, temp
). This is still a long format dataset; however, it has used month
and day
as IDs when reshaping the data frame.
5.0.2.1.2 spread()
To return your long data back to its original form, you can use spread()
. Here you specify two columns: the column that contains the names of what your wide data columns should be (key=variable
) and the column that contains the values that should go in these columns (value=value
). The data frame resulting from spread()
will have the original information back in the wide format (again, the columns will be in a different order). But, we’ll discuss how to rearrange data in the next lesson!
## use gather() to reshape from wide to long
<- spread(gathered, key=variable, value=value)
spread_data
## take a look at the spread data
head(spread_data)
## compare that back to the original
head(airquality)
spread data
5.0.2.2 reshape2
As with many things in R, there is more than one way to solve a problem. While tidyr
provides a more general solution for reshaping data, reshape2
was specifically designed for reshaping data. The details aren’t particularly important yet, but later on as you carry out your own analyses it will be good to know about both packages. To get started using reshape2
, you’ll have to first install the library and load it into your R session:
## install the package
install.packages('reshape2')
## load the package into R Session
library(reshape2)
There are two main functions within the reshape2
package:
melt()
: go from wide data to long datadcast()
: go from long data to wide data
5.0.2.2.1 melt()
The melt()
function will allow you to get the data into a long format that will be easy to use for analysis. When you melt a dataset with the default options, melt()
will take every column, put the column name into a variable
column, and then put the values of those variables into a value
column. For the airquality
data set, below we first assign the melted data frame to the object melted
. Then we take a look at the top (head()
) and bottom(tail()
) of this melted data frame (melted
).
## puts each column name into the 'variable' column
## puts corresponding variable's value in 'value' column
<- melt(airquality)
melted
## let's take a look at the top of the melted data frame
head(melted)
## and at the bottom of that melted data frame
tail(melted)
When you run this code you see that each column from the original data frame (ozone
, solar.r
, wind
, temp
,month
, and day
) are now repeated in the variables column, and each days’ value for that variable is now in the value
column. This is now a long format dataset!
melted data
Now, to use month and day as identifiers as we did with tidyr
above, the approach is slightly different. With gather()
, you specified the column names that you wanted to gather and omitted the column names that you wanted to retain as identifiers. With melt()
you will do the opposite. You will specify day
and month
as identifiers for the dataset and omit the remaining variables. You’ll want to use the following syntax:
## melt the data frame
## specify each row using month and day
<- melt(airquality, id.vars = c("month","day"))
melted
## look at the first few rows of the melted data frame
head(melted)
melted data frame using IDs
Despite the slight change in how the code was specified, the result here using melt()
is the same as what was achieved above using gather()
.
5.0.2.2.2 cast
You’ll likely have to go from long-to-short format less frequently; however, it’s good to know there are two approaches to accomplishing this within reshape2
whenever it is necessary.
acast()
: taking a long frame and returning a matrix/array/vectordcast()
: taking a long frame and returning a data frame
To return our melted data back into its original wide form, we’ll use dcast()
. The syntax here separates what should be used as an identifier for each row in the resulting wide format (month + day
below) and which column includes the values that should be the column headers (variable
below). These two pieces of information are separated by a tilde (~
).
## to get our data back to its original form
## specify which columns should be combined to use as identifiers
## and which column should be used to specify the columns
<- dcast(melted, month + day ~ variable)
original
head(original)
head(airquality)
dcast to obtain original data
As you can see, aside from the column order changing, the information in original
is the same as what was in the data frame we started with (airquality
).
While reshaping data may not be the most exciting topic, having this skill will be indispensable as you start working with data. It’s best to get these skills down early!
5.0.3 Additional Resources
- tidyr, part of the tidyverse and developed by Hadley Wickham and Lionel Henry
- reshape2, developed by Hadley Wickham
- tidyr tutorial by Hadley Wickham
- reshape2 tutorial by Sean C. Anderson
- tidyr vs reshape2 by Alberto Giudici