Chapter 8 How to Tidy Data
So far we’ve discussed what tidy and untidy data are. We’ve (hopefully) convinced you that tidy data are the right type of data to work with. And, more than that, hopefully we’ve explained that data are not always the tidiest when they come to you at the start of a project. An incredibly important skill of a data scientist is to be able to take data from an untidy format and get it into a tidy format. We’ve started to discuss how to do this in the last lesson where we learned to reshape data. In this lesson, we’ll discuss a number of other ways in which data can be tidied and the necessary tools to tidy data.
These skills are often referred to as data wrangling. They are skills that allow you to wrangle data from the format they’re currently in into the format you actually want them in.
As this is an incredibly important topic, this will be a long lesson covering a number of packages and topics. Take your time working through it and be sure to understand all of the examples!
In R, you used a range of packages like dplyr
, tidyr
, janitor
, and skimr
. In Python, we’ll use only one library: pandas
.
8.1 Exploring Data
We’ll be using a dataset from the ggplot2
package called msleep
. This dataset includes sleep times and weights from a number of different mammals. It has 83 rows, with each row including information about a different type of animal, and 11 variables. As each row is a different animal and each column includes information about that animal, this is a wide dataset.
To get an idea of what variables are included in this data frame, you can use the info
method. This method summarizes how many rows there are (observations) and how many columns there are (variables). Additionally, it gives you a glimpse into the type of data contained in each column. Specifically, in this data set, we know that the first column is name
and that it contains objects (strings, in this case). info
also tells you how many non-null (or non-missing) entries there are in each column. For example, there are no missing values in the name
columns (83 non-null out of a total of 83 rows), but there are 51 missing values (32 non-null out of a total of 83 rows) in the sleep_cycle
column.
import pandas as pd
# Download and read the dataset into a pandas data frame.
= "https://raw.githubusercontent.com/tidyverse/ggplot2/main/data-raw/msleep.csv"
url = pd.read_csv(url)
msleep msleep.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83 entries, 0 to 82
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 83 non-null object
1 genus 83 non-null object
2 vore 76 non-null object
3 order 83 non-null object
4 conservation 54 non-null object
5 sleep_total 83 non-null float64
6 sleep_rem 61 non-null float64
7 sleep_cycle 32 non-null float64
8 awake 83 non-null float64
9 brainwt 56 non-null float64
10 bodywt 83 non-null float64
dtypes: float64(6), object(5)
memory usage: 7.3+ KB
Another method to get a glimpse of your data is head
which returns the first few rows of the data frame.
msleep.head()
name | genus | vore | order | conservation | sleep_total | sleep_rem | sleep_cycle | awake | brainwt | bodywt | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Cheetah | Acinonyx | carni | Carnivora | lc | 12.1 | nan | nan | 11.9 | nan | 50 |
1 | Owl monkey | Aotus | omni | Primates | nan | 17 | 1.8 | nan | 7 | 0.0155 | 0.48 |
2 | Mountain beaver | Aplodontia | herbi | Rodentia | nt | 14.4 | 2.4 | nan | 9.6 | nan | 1.35 |
3 | Greater short-tailed shrew | Blarina | omni | Soricomorpha | lc | 14.9 | 2.3 | 0.133333 | 9.1 | 0.00029 | 0.019 |
4 | Cow | Bos | herbi | Artiodactyla | domesticated | 4 | 0.7 | 0.666667 | 20 | 0.423 | 600 |
Similarly, tail
returns the last few rows. You can pass an integer k
to both methods to show the first or last k
rows.
8.2 Filtering Data
When working with a large dataset, you’re often interested in only working with a portion of the data at any one time. For example, if you had data on people from ages 0 to 100 years old, but you wanted to ask a question that only pertained to children, you would likely want to only work with data from those individuals who were less than 18 years old. To do this, you would want to filter your dataset to only include data from these select individuals. Filtering can be done by row or by column, and we’ll discuss the syntax for both1.
8.2.1 Filtering Rows
If you were only interested in learning more about the sleep times of Primates
, we could filter this dataset to include only data about those mammals that are also Primates. As we can see from head
, this information is contained within the order
variable. Using pandas
, we filter the data as follows.
= msleep[msleep.order == "Primates"]
primates primates.head()
name | genus | vore | order | conservation | sleep_total | sleep_rem | sleep_cycle | awake | brainwt | bodywt | |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | Owl monkey | Aotus | omni | Primates | nan | 17 | 1.8 | nan | 7 | 0.0155 | 0.48 |
12 | Grivet | Cercopithecus | omni | Primates | lc | 10 | 0.7 | nan | 14 | nan | 4.75 |
25 | Patas monkey | Erythrocebus | omni | Primates | lc | 10.9 | 1.1 | nan | 13.1 | 0.115 | 10 |
28 | Galago | Galago | omni | Primates | nan | 9.8 | 1.1 | 0.55 | 14.2 | 0.005 | 0.2 |
33 | Human | Homo | omni | Primates | nan | 8 | 1.9 | 1.5 | 16 | 1.32 | 62 |
Note that we are using the equality ==
comparison operator that you learned about in the previous course rather than the assignment operator =
. This is an important distinction. So what’s happening here? Remember than indexing an object with [...]
selects some of its items. Here, we select all items that satisfy the condition that order
is Primates
.
We now have a smaller dataset of only 12 mammals (as opposed to the original 83) and we can see that the order
variable column only includes Primates
.
But, what if we were only interested in Primates who sleep more than 10 hours total per night? This information is in the sleep_total
column. To accomplish this, you would use the following syntax, combining the two conditions using the &
operator. This operator is the logical AND
. In other words, both order == "Primates"
and sleep_total > 10
need to be satisfied for the animal to be included.
= msleep[(msleep.order == "Primates") & (msleep.sleep_total > 10)]
sleepy_primates sleepy_primates
name | genus | vore | order | conservation | sleep_total | sleep_rem | sleep_cycle | awake | brainwt | bodywt | |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | Owl monkey | Aotus | omni | Primates | nan | 17 | 1.8 | nan | 7 | 0.0155 | 0.48 |
25 | Patas monkey | Erythrocebus | omni | Primates | lc | 10.9 | 1.1 | nan | 13.1 | 0.115 | 10 |
37 | Macaque | Macaca | omni | Primates | nan | 10.1 | 1.2 | 0.75 | 13.9 | 0.179 | 6.8 |
44 | Slow loris | Nyctibeus | carni | Primates | nan | 11 | nan | nan | 13 | 0.0125 | 1.4 |
55 | Potto | Perodicticus | omni | Primates | lc | 11 | nan | nan | 13 | nan | 1.1 |
Now, we have a dataset focused in on only 5 mammals, all of which are primates who sleep for more than 10 hours a night total. The number of columns hasn’t changed. All 11 variables are still shown in columns because the function filter()
filters on rows, not columns.
If you wanted to select animals that are primates or sleep more than ten hours, you’d use the |
operator to combine the two conditions like so.
= msleep[(msleep.order == "Primates") | (msleep.sleep_total > 10)]
sleepy_or_primates sleepy_or_primates.head()
name | genus | vore | order | conservation | sleep_total | sleep_rem | sleep_cycle | awake | brainwt | bodywt | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Cheetah | Acinonyx | carni | Carnivora | lc | 12.1 | nan | nan | 11.9 | nan | 50 |
1 | Owl monkey | Aotus | omni | Primates | nan | 17 | 1.8 | nan | 7 | 0.0155 | 0.48 |
2 | Mountain beaver | Aplodontia | herbi | Rodentia | nt | 14.4 | 2.4 | nan | 9.6 | nan | 1.35 |
3 | Greater short-tailed shrew | Blarina | omni | Soricomorpha | lc | 14.9 | 2.3 | 0.133333 | 9.1 | 0.00029 | 0.019 |
5 | Three-toed sloth | Bradypus | herbi | Pilosa | nan | 14.4 | 2.2 | 0.766667 | 9.6 | nan | 3.85 |
8.2.2 Selecting Columns
While indexing with [...]
typically operates on rows, it can also filter your dataset to only include the columns you’re interested in. To select columns so that your dataset only includes variables you’re interested in, you simply index the data frame with a list of column names.
Let’s start with the code we just wrote to only include primates who sleep a lot. What if we only want to include the first column (the name of the mammal) and the sleep information (included in the columns sleep_total
, sleep_rem
, and sleep_cycle
)?
= sleepy_primates[["name", "sleep_total", "sleep_rem", "sleep_cycle"]]
subset subset.head()
name | sleep_total | sleep_rem | sleep_cycle | |
---|---|---|---|---|
1 | Owl monkey | 17 | 1.8 | nan |
25 | Patas monkey | 10.9 | 1.1 | nan |
37 | Macaque | 10.1 | 1.2 | 0.75 |
44 | Slow loris | 11 | nan | nan |
55 | Potto | 11 | nan | nan |
After indexing with the column names, we see that we still have the five rows we filtered to before, but we only have the four desired columns.
8.2.3 Renaming Columns
You can use the rename
method to change column names. In the following example, we remove the sleep_
from the beginning of column names.
"total": "sleep_total", "rem": "sleep_rem", "cycle": "sleep_cycle"}) subset.rename({
name | sleep_total | sleep_rem | sleep_cycle | |
---|---|---|---|---|
1 | Owl monkey | 17 | 1.8 | nan |
25 | Patas monkey | 10.9 | 1.1 | nan |
37 | Macaque | 10.1 | 1.2 | 0.75 |
44 | Slow loris | 11 | nan | nan |
55 | Potto | 11 | nan | nan |
The data remain unchanged, however. Treat renaming columns with caution because having different names for the same data in different parts of your code can be confusing. Using rename
will only change the column names, but it will not filter out any columns.
8.3 Reordering
In addition to filtering rows and columns, often, you’ll want the data arranged in a particular order. It may order the columns in a logical way, or it could be to sort the data so that the data are sorted by value, with those having the smallest value in the first row and the largest value in the last row. All of this can be achieved with a few simple functions.
8.3.1 Reordering Columns
Using our example from above, if you wanted sleep_rem
to be the first sleep column and sleep_total
to be the last column, all you have to do is reorder the column names in the list.
= sleepy_primates[["name", "sleep_rem", "sleep_cycle", "sleep_total"]]
reordered_columns reordered_columns.head()
name | sleep_rem | sleep_cycle | sleep_total | |
---|---|---|---|---|
1 | Owl monkey | 1.8 | nan | 17 |
25 | Patas monkey | 1.1 | nan | 10.9 |
37 | Macaque | 1.2 | 0.75 | 10.1 |
44 | Slow loris | nan | nan | 11 |
55 | Potto | nan | nan | 11 |
Here we see that name
is displayed first followed by sleep_rem
, sleep_cycle
, and sleep_total
, just as it was specified in the list of column names.
8.3.2 Reordering Rows
Rows can also be reordered. To reorder a variable in ascending order (from smallest to largest), you’ll want to use the sort_values
method. Continuing on from our example above, to now sort our rows by the amount of total sleep each mammal gets, we would use the following syntax.
"sleep_total") reordered_columns.sort_values(
name | sleep_rem | sleep_cycle | sleep_total | |
---|---|---|---|---|
37 | Macaque | 1.2 | 0.75 | 10.1 |
25 | Patas monkey | 1.1 | nan | 10.9 |
44 | Slow loris | nan | nan | 11 |
55 | Potto | nan | nan | 11 |
1 | Owl monkey | 1.8 | nan | 17 |
To instead sort the data in descending (largest to smallest) order, simply pass the argument ascending=False
to the sort_values
method like so.
"sleep_total", ascending=False) reordered_columns.sort_values(
name | sleep_rem | sleep_cycle | sleep_total | |
---|---|---|---|---|
1 | Owl monkey | 1.8 | nan | 17 |
44 | Slow loris | nan | nan | 11 |
55 | Potto | nan | nan | 11 |
25 | Patas monkey | 1.1 | nan | 10.9 |
37 | Macaque | 1.2 | 0.75 | 10.1 |
sort_values
can also be used to order non-numeric variables. For example, sort_values
will sort character vectors alphabetically.
"name") reordered_columns.sort_values(
name | sleep_rem | sleep_cycle | sleep_total | |
---|---|---|---|---|
37 | Macaque | 1.2 | 0.75 | 10.1 |
1 | Owl monkey | 1.8 | nan | 17 |
25 | Patas monkey | 1.1 | nan | 10.9 |
55 | Potto | nan | nan | 11 |
44 | Slow loris | nan | nan | 11 |
If you would like to reorder rows based on information in multiple columns, you can specify them as a list of column names. This is useful if you have repeated labels in one column and want to sort within a category based on information in another column. In the example here, if there were repeated primates, this would sort the repeats based on their total sleep.
"name", "sleep_total"]) reordered_columns.sort_values([
name | sleep_rem | sleep_cycle | sleep_total | |
---|---|---|---|---|
37 | Macaque | 1.2 | 0.75 | 10.1 |
1 | Owl monkey | 1.8 | nan | 17 |
25 | Patas monkey | 1.1 | nan | 10.9 |
55 | Potto | nan | nan | 11 |
44 | Slow loris | nan | nan | 11 |
8.4 Creating new columns
You will often find when working with data that you need an additional column. For example, if you had two datasets you wanted to combine, you may want to make a new column in each dataset called dataset
. In one dataset you may put datasetA
in each row. In the second dataset, you could put datasetB
. This way, once you combined the data, you would be able to keep track of which dataset each row came from originally. More often, however, you’ll likely want to create a new column that calculates a new variable based on information in a column you already have. For example, in our mammal sleep dataset, sleep_total
is in hours. What if you wanted to have that information in minutes? You could create a new column with this very information!
Returning to our msleep
dataset, after filtering and re-ordering, we can create a new column. We will calculate the number of minutes each mammal sleeps by multiplying the number of hours each animal sleeps by 60 minutes. A new column can be created by indexing the data frame with a column name and assigning to it.
"sleep_total_min"] = reordered_columns.sleep_total * 60
reordered_columns[ reordered_columns
name | sleep_rem | sleep_cycle | sleep_total | sleep_total_min | |
---|---|---|---|---|---|
1 | Owl monkey | 1.8 | nan | 17 | 1020 |
25 | Patas monkey | 1.1 | nan | 10.9 | 654 |
37 | Macaque | 1.2 | 0.75 | 10.1 | 606 |
44 | Slow loris | nan | nan | 11 | 660 |
55 | Potto | nan | nan | 11 | 660 |
Importantly, creating a new column in this way will change or mutate the data frame. In contrast, all the previous methods like sort_values
, rename
, and indexing with column names to select a subset created a new data frame each time. The original data frame was unchanged. For example, the data frame msleep
we started with still contains 83 rows and 11 columns.
8.5 Separating Columns
Sometimes multiple pieces of information are merged within a single column even though it would be more useful during analysis to have those pieces of information in separate columns. To demonstrate, we’ll now move from the msleep
dataset to talking about another dataset conservation_explanation.csv
that includes information about conservation abbreviations in a single column.
# Download and read the data.
= "https://raw.githubusercontent.com/suzanbaert/Dplyr_Tutorials/master/conservation_explanation.csv"
url = pd.read_csv(url)
conservation
# Inspect the first few rows.
conservation.head()
conservation abbreviation | |
---|---|
0 | EX = Extinct |
1 | EW = Extinct in the wild |
2 | CR = Critically Endangered |
3 | EN = Endangered |
4 | VU = Vulnerable |
In this dataset, we see that there is a single column that includes both the abbreviation for the conservation term as well as what that abbreviation means. Recall that this violates one of the tidy data principles covered in the first lesson: Put just one thing in a cell. To work with these data, you could imagine that you may want these two pieces of information (the abbreviation and the description) in two different columns. We’ll use the str.split
method of the column to separate the abbreviation and description.
= conservation["conservation abbreviation"].str.split(" = ", expand=True)
separated separated.head()
0 | 1 | |
---|---|---|
0 | EX | Extinct |
1 | EW | Extinct in the wild |
2 | CR | Critically Endangered |
3 | EN | Endangered |
4 | VU | Vulnerable |
We selected the column using indexing notation [...]
instead of simply writing conservation.conservation abbreviation
because the column name contains a space. This is another violation of tidy data! Variable names should have underscores, not spaces! The str.split
method requires the characters that currently separate the pieces of information =
as its first argument. Setting expand=True
tells str.split
to expand the results of splitting into separate columns. You’ll learn more about working with strings in a later section.
However, the separated data frame does not have sensible column names. We can fix that by explicitly setting the column names.
= ["abbreviation", "conservation"]
separated.columns separated.head()
abbreviation | conservation | |
---|---|---|
0 | EX | Extinct |
1 | EW | Extinct in the wild |
2 | CR | Critically Endangered |
3 | EN | Endangered |
4 | VU | Vulnerable |
8.6 Merging Columns
The opposite of split
is to concatenate values from different columns which you can achieve using the +
operator. Using the code forming the two separate columns above, we can add on an extra line of code to re-join these separate columns, returning what we started with.
"conservation_abbreviation"] = separated.abbreviation + " = " + separated.conservation
separated[ separated.head()
abbreviation | conservation | conservation_abbreviation | |
---|---|---|---|
0 | EX | Extinct | EX = Extinct |
1 | EW | Extinct in the wild | EW = Extinct in the wild |
2 | CR | Critically Endangered | CR = Critically Endangered |
3 | EN | Endangered | EN = Endangered |
4 | VU | Vulnerable | VU = Vulnerable |
Because one of the principles of tidy data is to only ever have one thing in each column, you’ll rarely use concatenation in practice.
print(_.to_markdown())
8.7 Combining data across data frames
There is often information stored in two separate data frames that you’ll want in a single data frame. There are many different ways to join separate data frames. They are discussed in more detail in this tutorial. Here, we’ll demonstrate how to use the merge
function to perform a “left join”, as this is used frequently.
Let’s try to combine the information from the two different datasets we’ve used in this lesson. We have msleep
and separated
. msleep
contains a column called conservation
. This column includes lowercase abbreviations that overlap with the uppercase abbreviations in the abbreviation
column in the separated
dataset. To handle the fact that in one dataset the abbreviations are lowercase and the other they are uppercase, we’ll use str.upper
to take all the lowercase abbreviations to uppercase abbreviations.
We’ll then use merge
to perform a left join which takes all of the rows in the first dataset msleep
and incorporates information from the second dataset separated
if information in the second dataset is available. The left_on
and right_on
arguments specify the columns to use in the left
(msleep
) and right
datasets (separated
) to merge or join the two datasets. Setting how="left"
tells merge
to perform a left
join. This means if there is no information in the right
dataset, the result will have NA
where data are not available. Specifically, for rows where conservation is “DOMESTICATED” below, the description
column will have NA because “DOMESTICATED”” is not an abbreviation in the conserve
dataset.
# Convert conservation abbreviation to upper case.
= msleep.conservation.str.upper()
msleep.conservation # Merge the data.
= pd.merge(msleep, separated, left_on="conservation", right_on="abbreviation", how="left")
merged merged.head()
name | genus | vore | order | conservation_x | sleep_total | sleep_rem | sleep_cycle | awake | brainwt | bodywt | abbreviation | conservation_y | conservation_abbreviation | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Cheetah | Acinonyx | carni | Carnivora | LC | 12.1 | nan | nan | 11.9 | nan | 50 | LC | Least Concern | LC = Least Concern |
1 | Owl monkey | Aotus | omni | Primates | nan | 17 | 1.8 | nan | 7 | 0.0155 | 0.48 | nan | nan | nan |
2 | Mountain beaver | Aplodontia | herbi | Rodentia | NT | 14.4 | 2.4 | nan | 9.6 | nan | 1.35 | NT | Near Threatened | NT = Near Threatened |
3 | Greater short-tailed shrew | Blarina | omni | Soricomorpha | LC | 14.9 | 2.3 | 0.133333 | 9.1 | 0.00029 | 0.019 | LC | Least Concern | LC = Least Concern |
4 | Cow | Bos | herbi | Artiodactyla | DOMESTICATED | 4 | 0.7 | 0.666667 | 20 | 0.423 | 600 | nan | nan | nan |
It’s important to note that there are many other ways to join data, which are covered in more detail in this tutorial. For now, it’s important to know that joining datasets is done easily using merge
. As you join data frames in your own work, it’s a good idea to refer back to this tutorial for assistance.
8.8 Summarizing Data
8.8.1 describe
The describe
method is similar to info
. But instead of telling you about the type of data in the data frame, it provides summary statistics for columns containing numbers.
msleep.describe()
sleep_total | sleep_rem | sleep_cycle | awake | brainwt | bodywt | |
---|---|---|---|---|---|---|
count | 83 | 61 | 32 | 83 | 56 | 83 |
mean | 10.4337 | 1.87541 | 0.439583 | 13.5675 | 0.281581 | 166.136 |
std | 4.45036 | 1.29829 | 0.35868 | 4.45209 | 0.976414 | 786.84 |
min | 1.9 | 0.1 | 0.116667 | 4.1 | 0.00014 | 0.005 |
25% | 7.85 | 0.9 | 0.183333 | 10.25 | 0.0029 | 0.174 |
50% | 10.1 | 1.5 | 0.333333 | 13.9 | 0.0124 | 1.67 |
75% | 13.75 | 2.4 | 0.579167 | 16.15 | 0.1255 | 41.75 |
max | 19.9 | 6.6 | 1.5 | 22.1 | 5.712 | 6654 |
If you want to learn about non-numeric data, pass include="object"
to the describe method.
="object") msleep.describe(include
name | genus | vore | order | conservation | |
---|---|---|---|---|---|
count | 83 | 83 | 76 | 83 | 54 |
unique | 83 | 77 | 4 | 19 | 6 |
top | Cheetah | Panthera | herbi | Rodentia | LC |
freq | 1 | 3 | 32 | 22 | 27 |
The output will tell you about the number of non-null values (count
), the number of unique values (unique
), the most frequent value (top
), and the number of times the most frequent value appeared in the dataset (freq
). If there is a tie between the most frequent values, a random value will be returned (Cheetah
in this case for the name
column).
If you are interested in the frequency of values in a particular column, you can use value_counts
to learn more.
msleep.order.value_counts()
order | count |
---|---|
Rodentia | 22 |
Carnivora | 12 |
Primates | 12 |
Artiodactyla | 6 |
Soricomorpha | 5 |
Perissodactyla | 3 |
Cetacea | 3 |
Hyracoidea | 3 |
Diprotodontia | 2 |
Erinaceomorpha | 2 |
Proboscidea | 2 |
Chiroptera | 2 |
Didelphimorphia | 2 |
Cingulata | 2 |
Lagomorpha | 1 |
Pilosa | 1 |
Monotremata | 1 |
Afrosoricida | 1 |
Scandentia | 1 |
In this example, the order Rodentia
appears 22 times and is the most frequently occurring order in the dataset. If you want percentages rather than counts, pass normalize=True
to value_counts
.
8.9 Grouping Data
Often, data scientists will want to summarize information in their dataset. You may want to know how many animals are in a dataset. However, more often, you’ll want to know how many animals there are within a group in your dataset. For example, you may want to know how many primates and how many carnivores there are. To do this, grouping your data is necessary. Rather than looking at the total number of individuals, to accomplish this, you first have to group the data by the order of the animals. Then, you count within those groups. Grouping by variables is straightforward using the groupby
method.
8.9.1 groupby
group_by
groups a dataset by one or more variables.
= msleep.groupby("order")
groups groups
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x1356e1600>
This doesn’t seem very useful at first sight, but we can extract information about the different groups using further methods.
groups.order.count()
order
Afrosoricida 1
Artiodactyla 6
Carnivora 12
Cetacea 3
Chiroptera 2
Cingulata 2
Didelphimorphia 2
Diprotodontia 2
Erinaceomorpha 2
Hyracoidea 3
Lagomorpha 1
Monotremata 1
Perissodactyla 3
Pilosa 1
Primates 12
Proboscidea 2
Rodentia 22
Scandentia 1
Soricomorpha 5
Name: order, dtype: int64
This code counts the number of non-null values in the order
column within each group. The results may differ slightly if you use a different column. For example, there are quite a few values missing in the sleep_cycle
column.
groups.sleep_cycle.count()
order
Afrosoricida 0
Artiodactyla 2
Carnivora 4
Cetacea 0
Chiroptera 2
Cingulata 1
Didelphimorphia 1
Diprotodontia 0
Erinaceomorpha 1
Hyracoidea 0
Lagomorpha 1
Monotremata 0
Perissodactyla 2
Pilosa 1
Primates 5
Proboscidea 0
Rodentia 7
Scandentia 1
Soricomorpha 4
Name: sleep_cycle, dtype: int64
There are other ways in which the data can be summarized using groupby
. In addition to using to count
the number of samples within a group, you can also summarize using other helpful functions, such as mean
, median
, min
, or max
. For example, if we wanted to calculate the average (mean) total sleep each order of mammal got, we could use the following code.
groups.sleep_total.mean()
order
Afrosoricida 15.600000
Artiodactyla 4.516667
Carnivora 10.116667
Cetacea 4.500000
Chiroptera 19.800000
Cingulata 17.750000
Didelphimorphia 18.700000
Diprotodontia 12.400000
Erinaceomorpha 10.200000
Hyracoidea 5.666667
Lagomorpha 8.400000
Monotremata 8.600000
Perissodactyla 3.466667
Pilosa 14.400000
Primates 10.500000
Proboscidea 3.600000
Rodentia 12.468182
Scandentia 8.900000
Soricomorpha 11.100000
Name: sleep_total, dtype: float64
8.10 Conclusion
We have gone through a number of ways to work with data in this lesson. Mastering the skills in this lesson will provide you with a number of critical data science skills. Thus, running the examples in this lesson and practicing on your own with other data sets will be essential to succeeding as a data scientist.
The examples in this lesson and the organization for this lesson were adapted from Suzan Baert’s wonderful
dplyr
tutorials.↩︎