Chapter 7 Reshaping Data

7.1 Data Formats

Tidy data generally exist in two forms: wide data and long data. Both types of data are used and needed in data analysis. Fortunately, there are tools that can take you from wide-to-long and from long-to-wide. This makes it easy to work with any tidy data set. We’ll discuss the basics of what wide and long data are and how to go back and forth between the two in Python. Getting data into the right format will be crucial later when summarizing data and visualizing it. To help you visualize data as its being reshaped, we recommend you play around with the PandasTidyDataTutor.

7.1.1 Wide Data

Wide data has a column for each variable and a row for each observation. Data are often entered and stored in this manner. This is because wide data are often easy to understand at a glance. For example, this is a wide data set.

ID LastName FirstName HeightInches WeightLbs Insulin Glucose
1004 Smith Jane 65 180 0.60 163
4587 Nayef Mohammed 75 215 1.46 150
1727 Doe Janice 62 124 0.72 177
6879 Jordan Alex 77 160 1.23 205

This is a dataset we’ve looked at in a previous lesson. As discussed previously, it’s a rectangular and tidy dataset. Now, we can also state that it is a wide dataset. Here you can clearly see what measurements were taken for each individual and can get a sense of how many individuals are contained in the dataset.

Specifically, each individual is in a different row with each variable in a different column. At a glance we can quickly see that we have information about four different people and that each person was measured in four different ways.

7.1.2 Long Data

Long data, on the other hand, has one column indicating the type of variable contained in that row and then a separate row for the value for that variable. Each row contains a single observation for a single variable. Below is an example of a long data set.

ID Variable Value
1004 LastName Smith
1004 FirstName Jane
1004 HeightInches 65
1004 WeightLbs 180
1004 Insulin 0.60
1004 Glucose 163
4587 LastName Nayef
4587 FirstName Mohammed
4587 HeightInches 75
4587 WeightLbs 215
4587 Insulin 1.46
4587 Glucose 150
1727 LastName Doe
1727 FirstName Janice
1727 HeightInches 62
1727 WeightLbs 124
1727 Insulin 0.72
1727 Glucose 177
6879 LastName Jordan
6879 FirstName Alex
6879 HeightInches 77
6879 WeightLbs 160
6879 Insulin 1.23
6879 Glucose 205

This long dataset includes the exact same information as the previous wide dataset; it is just stored differently. It’s harder to see visually how many different measurements were taken and on how many different people, but the same information is there.

While long data formats are less readable than wide data at a glance, they are a lot easier to work with during analysis. Most of the tools we’ll be working with use long data. Thus, to go from how data are often stored (wide) to working with the data during analysis (long), we’ll need to understand what tools are needed to do this and how to work with them.

7.1.3 Reshaping data

Converting your data from wide-to-long or from long-to-wide data formats is referred to as reshaping your data. You can do this with the pandas package. In fact, we will use the pandas package for all of our data tidying, wrangling, and reshaping. As with most helpful Python packages, there is more functionality than what is discussed here, so feel free to explore the additional resources at the bottom to learn even more.

For these examples, we’ll work with the billboard dataset available in tidyr. The data in this dataset includes the song rankings for Billboard top 100 in the year 2000. The following code downloads the data from GitHub and displays the first few rows using the head method of the dataframe.

import pandas as pd


url = "https://raw.githubusercontent.com/tidyverse/tidyr/main/data-raw/billboard.csv"
billboard = pd.read_csv(url)
billboard.head()
year artist track time date.entered wk1 wk2 wk75 wk76
2000 2 Pac Baby Don’t Cry (Keep… 4:22 2000-02-26 87 82 nan nan
2000 2Ge+her The Hardest Part Of … 3:15 2000-09-02 91 87 nan nan
2000 3 Doors Down Kryptonite 3:53 2000-04-08 81 70 nan nan
2000 3 Doors Down Loser 4:24 2000-10-21 76 76 nan nan
2000 504 Boyz Wobble Wobble 3:35 2000-04-15 57 34 nan nan

Again, wide data are easy to decipher at a glance. We can see that we have different weeks for each song, but many of those columns contain the same variable (chart rank) so these data are not tidy.

There are two methods to help you reshape your data.

  • melt to go from wide data to long data.
  • pivot to go from long data to wide data.

7.1.4 melt

As data are often stored in wide formats, you’ll likely use melt a lot more frequently than you’ll use pivot. This will allow you to get the data into a long format that will be easy to use for analysis. Here is an example for the billboard data.

long_df = billboard.melt(
    id_vars=["year", "artist", "track", "time", "date.entered"],
    var_name="week",
    value_name="rank",
)
long_df.head()
year artist track time date.entered week rank
2000 2 Pac Baby Don’t Cry (Keep… 4:22 2000-02-26 wk1 87
2000 2Ge+her The Hardest Part Of … 3:15 2000-09-02 wk1 91
2000 3 Doors Down Kryptonite 3:53 2000-04-08 wk1 81
2000 3 Doors Down Loser 4:24 2000-10-21 wk1 76
2000 504 Boyz Wobble Wobble 3:35 2000-04-15 wk1 57

We set the argument id_var to ["year", "artist", "track", "time", "date.entered"] which tells pandas that all these variables identify each observation and should not be included in the transformation from wide to long.

7.1.5 pivot

To return your long data back to its original form, you can use pivot. Here you specify two columns: the column that contains the names of what your wide data columns should be (columns) and the column that contains the identifier for each observation (index). The data frame resulting from pivot will have the original information back in the wide format. However, the columns may be in a different order, and the identifiers (e.g., the track name) will appear in the index of the data frame. We’ll discuss indices and how to rearrange data in a coming lesson.

# Use pivot to reshape from long to wide.
wide_df = long_df.pivot(
    index=["year", "artist", "track", "time", "date.entered"],
    columns="week",
    values="rank",
)
# Take a look at the wide data.
wide_df.head()
year artist track time date.entered wk1 wk2 wk75 wk76
2000 2 Pac Baby Don’t Cry (Keep… 4:22 2000-02-26 87 82 nan nan
2000 2Ge+her The Hardest Part Of … 3:15 2000-09-02 91 87 nan nan
2000 3 Doors Down Kryptonite 3:53 2000-04-08 81 70 nan nan
2000 3 Doors Down Loser 4:24 2000-10-21 76 76 nan nan
2000 504 Boyz Wobble Wobble 3:35 2000-04-15 57 34 nan nan

7.2 Transposing data

You may find you need your rows to be columns and your columns to be rows. In other words, you need to transpose your data frame. You can do this by using the .T property of the dataframe or the transpose method. Let’s have a look at our billboard dataset again.

# Use .T to transpose the data (swap rows and columns).
transposed = billboard.T
# Take a look at the transposed data.
transposed.head()
0 1 2
year 2000 2000 2000
artist 2 Pac 2Ge+her 3 Doors Down
track Baby Don’t Cry (Keep… The Hardest Part Of … Kryptonite
time 4:22 3:15 3:53
date.entered 2000-02-26 2000-09-02 2000-04-08

While billboard data had 317 rows and 81 columns, transposed has 81 rows and 317 columns. The information in both billboard and transposed is the same, it’s just in a different shape.

7.3 Additional Resources