
Chapter 9 Data
We noted in the last lesson that at this point we have a pretty complete package. We have functions, unit tests, documentation, and a vignette. However, what we haven’t talked about yet is the fact that sometimes, packages require data. This generally takes one of three forms:
- Data Available to the User
- Data Your Functions Need
- Raw Data
We’ll walk through each of these in this lesson, explain how it relates to your R package, and demonstrate how to utilize the usethis package to make your life easier when adding data.
9.0.1 Data Available to the User
As you start developing packages, making data available to the user is likely what you’ll find yourself doing most frequently when it comes to data. This is because datasets that demonstrate the functionality of your package are helpful for users to have. For example, think about the datasets available from ggplot2, such as diamonds or msleep. They’re great for demonstrating the various types of plots that can be made using ggplot2! It can be really helpful to include these types of datasets with your package for users to utilize.
Datasets you want to be available to your user should be stored in the data/ directory. These are referred to as external or exported datasets.
You may notice at this point, however, that in your cbds package, there isn’t yet a data/ directory. This is where usethis comes into play.
To store data in the data/ directory, you’ll use usethis::use_data() and specify the objects that you want to be saved. In the example below, if you had a dataset about trees stored in the object df_trees, the following would create the .rda file appropriate for your package and put it in the right directory.
# for external use
usethis::use_data(df_trees)
Note that you get a readout of the fact that the data/ directory has been created for you and that your dataset has been saved within that directory.
[Figure: use_data() does all the legwork for you to make data available to users]
To double check this, you would look at your Files tab, where you would now see that the data/ directory is there for you.
[Figure: the data/ directory has been created]
Within that directory, you’ll find that your object df_trees is now stored within data/ as a .rda file: df_trees.rda. Unlike CSV, the R Data (.rda) format is something we haven’t used much so far, but it is a good way to store data within an R package: the files are compressed, they load quickly, and they preserve R object types exactly.
[Figure: df_trees.rda is stored within data/]
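To get a feel for what an .rda file is, here is a minimal sketch using only base R. It assumes the built-in trees dataset as a stand-in for df_trees and a temporary folder as a stand-in for your package’s data/ directory; in a real package you would let usethis::use_data() handle these steps for you.

```r
# Assumption: the built-in `trees` dataset stands in for df_trees
df_trees <- datasets::trees

# Assumption: a temporary folder stands in for the package's data/ directory
data_dir <- file.path(tempdir(), "data")
dir.create(data_dir, showWarnings = FALSE, recursive = TRUE)

# Save the object, under its own name, as an .rda file
rda_path <- file.path(data_dir, "df_trees.rda")
save(df_trees, file = rda_path)

# Remove the object, then restore it from disk
rm(df_trees)
restored <- load(rda_path)  # load() returns the names of the restored objects

restored        # the object comes back under its original name
dim(df_trees)   # the data frame is available again
```

The point to notice is that an .rda file stores the object together with its name, which is why the file is called df_trees.rda and why users of your package can refer to the dataset simply as df_trees.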
9.0.2 Data Required By Your Functions
In addition to data that you want your users to have access to, sometimes your functions require data that should be available to your functions but not to all package users. Such data can’t be stored in data/. Data used only by your package’s functions is referred to as internal data. Internal data are generated again using the use_data() function from usethis; however, you’ll need to change an input argument: internal = TRUE.
# for internal use
usethis::use_data(df_trees, internal = TRUE)
[Figure: specify internal = TRUE to store internal data]
Notice that these data are not stored in data/ and instead are stored in the R/ directory in a file named sysdata.rda.
[Figure: internal data are stored in R/sysdata.rda]
As a reminder, these data are only available for use inside your package.
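Because all internal data live in the single file R/sysdata.rda, every internal object has to be saved together. The sketch below illustrates this with base R only. The object species_lookup is hypothetical, the built-in trees dataset stands in for df_trees, and a temporary folder stands in for your package’s R/ directory.

```r
# Hypothetical internal objects (names and contents are illustrative only)
species_lookup <- c(BC = "black cherry", RO = "red oak")
df_trees <- datasets::trees

# Assumption: a temporary folder stands in for the package's R/ directory
r_dir <- file.path(tempdir(), "R")
dir.create(r_dir, showWarnings = FALSE, recursive = TRUE)
sysdata_path <- file.path(r_dir, "sysdata.rda")

# Both objects go into the same file in a single call
save(species_lookup, df_trees, file = sysdata_path)

# Load into a separate environment to inspect what the file contains
internal_env <- new.env()
load(sysdata_path, envir = internal_env)
ls(internal_env)
```

With usethis, the corresponding call would be usethis::use_data(species_lookup, df_trees, internal = TRUE). If you later add another internal object, include the existing ones in that call as well, since the file is rewritten as a whole.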
9.0.3 Raw Data
Less frequently than providing users with an example dataset (that you’ve made available to them as discussed earlier in the lesson), you may find a need to provide raw data or original files with your package. These raw data files are often stored in inst/extdata/.
To automatically generate the script you’ll use to prepare your data from its raw form, you would use the usethis::use_data_raw() function:
# for raw data
usethis::use_data_raw("trees_raw")
[Figure: use_data_raw() gets you started with external data]
As before, you get a helpful readout of what has been done for you as well as what still remains for you to do.
Specifically, the example above creates a data-raw/ directory for you.
[Figure: data-raw/ is generated for you]
It also generates a script file of the name you specified within the data-raw/ directory, and opens that script file.
[Figure: use_data_raw() helps you to get set up to add raw data]
Within the script file, there is a comment prompting you to add your code. This is where you would put the code that takes your raw dataset from its raw form into the cleaner form shared with your users (in your data/ directory).
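As a rough illustration, a script in data-raw/ might look something like the sketch below. Everything here is hypothetical: the raw file is written to a temporary folder so the example runs on its own, and the cleaning steps (dropping incomplete rows, lower-casing column names) are placeholders for whatever your data actually need.

```r
# Assumption: a small raw CSV is created in a temporary folder for illustration;
# in a real package the raw file would already exist
raw_path <- file.path(tempdir(), "trees_raw.csv")
writeLines(
  c("Girth,Height,Volume",
    "8.3,70,10.3",
    "8.6,65,10.3",
    "NA,63,10.2"),
  raw_path
)

# Read the raw data
trees_raw <- read.csv(raw_path)

# Clean: drop rows with missing values and use lower-case column names
df_trees <- trees_raw[complete.cases(trees_raw), ]
names(df_trees) <- tolower(names(df_trees))

df_trees

# In a real package, the script would end by saving the clean data to data/:
# usethis::use_data(df_trees, overwrite = TRUE)
```

Keeping the cleaning code in a script like this means the dataset in data/ can be regenerated at any time by rerunning it.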
One example of this approach being used comes from Hadley Wickham’s babynames package. Within that package, cleaned-up data from the US Social Security Administration are made available to users (in the data/ directory). However, to make those data usable by others from the package, Hadley Wickham had to gather and clean the data. The code to accomplish this is stored within the data-raw/ directory.
[Figure: example of an .R script that generates clean data from raw]
At the end of this script, he saves the data file as a CSV into the data-raw/ directory. This way his workflow is completely reproducible - anyone looking at the code in this package is completely clear on what data were used (from the information in data-raw/) and can get to using the dataset (because the clean version is stored in data/).
Note that to access a raw data file from a package, you use system.file() with the following syntax:
system.file("extdata", "births.csv", package = "babynames")
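Here, “extdata” refers to the directory where the data are stored in the package, “births.csv” refers to the name of the raw data file, and “babynames” refers to the package from which you want to retrieve this data. The inst/ prefix does not appear in the call because the contents of inst/ are copied to the top level of the package when it is installed.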
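To see how system.file() behaves without needing any extra packages installed, here is a small sketch that uses the stats package, which ships with every R installation. The file name no_such_file.csv is made up on purpose to show what happens when a file does not exist.

```r
# A file that exists in every installed package: its DESCRIPTION
desc_path <- system.file("DESCRIPTION", package = "stats")
file.exists(desc_path)   # the returned value is a full path on your machine

# A file that does not exist: system.file() returns an empty string
missing_path <- system.file("extdata", "no_such_file.csv", package = "stats")
missing_path == ""

# With mustWork = TRUE, a missing file raises an error instead
result <- tryCatch(
  system.file("extdata", "no_such_file.csv", package = "stats", mustWork = TRUE),
  error = function(e) "error raised"
)
result
```

Because a missing file quietly gives you an empty string by default, it is worth either checking the result or setting mustWork = TRUE before passing the path on to a function like read.csv().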
9.0.4 Documenting Datasets
Now that we’ve discussed the main ways of including data within a package, we have to take a second to talk about documenting data.
Datasets stored in data/ (so, those external datasets you want your users to be able to use) are made available to users of your packages, just like functions! So, just like functions, external data must be documented. While we won’t go into too many specifics here, roxygen2 blocks are also used to document datasets, making the process familiar. These roxygen2 blocks should be saved in the R/ directory in a file named data.R.
Here is what the documentation looks like for the births dataset from the babynames package we were just talking about:
#' Births
#'
#' Live births data from census.gov.
#'
#' @format A data frame:
#' \describe{
#' \item{year}{Year}
#' \item{births}{Number of live births, rounded to nearest 1000}}
"births"
Within this roxygen2 block, we see the name of the dataset and a brief description. However, the @format tag is new. This should describe the dataset. For a data frame, you must provide users with a list that describes each variable in the dataset you’re sharing. Here, we see that this dataset has two variables: year and births. The block ends with the name of the dataset as a quoted string ("births") rather than with a function definition.
Note that @export is not included (as you would for an exported function). Be sure not to include @export when documenting a dataset.
Finally, only data being made available to users have to be documented. You don’t need roxygen2 blocks for internal data or raw data.
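Following the same pattern as the births block, documentation for the df_trees example might look something like the sketch below, which would go in R/data.R. The title, description, and variable names are assumptions: they suppose df_trees has the same three columns as R’s built-in trees dataset.

```r
#' Tree measurements
#'
#' A hypothetical example dataset of tree measurements.
#'
#' @format A data frame with three variables:
#' \describe{
#'   \item{Girth}{Tree diameter, in inches}
#'   \item{Height}{Tree height, in feet}
#'   \item{Volume}{Volume of timber, in cubic feet}
#' }
"df_trees"
```

As with function documentation, running devtools::document() turns this block into a help page, so that users can type ?df_trees to read it.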
9.0.5 Summary
In this lesson we covered the final piece of what you’ll typically add to a package: data! Now, there are other topics surrounding creating R packages that we have not covered in the lessons here. In the next lesson we’ll talk about how to get your package onto GitHub and use Travis CI, but we don’t cover in detail what you’ll need to do to get your package onto CRAN, the NAMESPACE, or compiled code. So, if you want to learn even more, the R Packages book by Hadley Wickham and Jenny Bryan is a great resource and is where much of the information in these lessons has come from. We also link to the appropriate section in that book after each lesson in this course.
9.0.6 Additional Resources
- External Data, by Hadley Wickham & Jenny Bryan
- Including Datasets, by Karl Broman