Chapter 19 Basic Commands in R

Now that we’ve covered some essentials about R objects, we’ll go over some basic commands that will be helpful in working with data.

19.0.1 Functions

In working with data, we will be making substantial use of functions. Functions in R carry out some task. We’ve already introduced you to a handful of functions like class(), head(), str(), getwd(), to name a few.

Functions are always a word (or set of words connected by underscores or periods followed by a set of parentheses, so the general structure of a function in R would look something like this:

function(input)

function_name(input)

Input to a function in R is known as an argument. Functions require at least one argument, but can require multiple different arguments, depending on the function.

At this point you may be saying, “but when we used getwd() we didn’t specify an argument!” - And you’d be correct! We didn’t specify one, but the sneaky thing about arguments is that often if you don’t specify one, that means one is automatically being chosen for you (called a default). To see all about a function’s arguments and its defaults, see its help page by using ?function.

To visually understand the anatomy of a function call (a term that describes the using of a function), let’s look at the following example:

mean(x, trim = 0.1)

We have an object x that should contain numbers, and we want to compute the mean of these numbers with the mean function. As stated above, all of the information inside the parentheses are function inputs (also called arguments), and they are separated by commas. In this command, I have supplied the object x and an additional argument trim that I set to be 0.1. The trim argument calls for a number between 0 and 0.5 and specifies the fraction of the observations in x to trim from the upper and lower ends of the data. Here, by including the trim argument, I am specifying that I want to take the mean of the middle 80% of the data.

19.0.2 What is this object?

If someone were to write down a mystery noun for us to guess, our first question would likely be: “Is it a person, place, or thing?” When working with R objects, we will initially want similar types of information. Here we will go over some functions that can help in this regard.

As we covered before, the class() function returns the class of an R object. This is useful for determining if an object is an atomic vector, list, or some other type of object. If it is an atomic vector, this function tells you the type.

> x <- 1:10
> class(x)
[1] "integer"
> y <- c(1.1,2.2)
> class(y)
[1] "numeric"
> class(mtcars)
[1] "data.frame"

As discussed, The str() function stands for “structure”, and it returns a description of the structure of an object. It tells you the class of an object, its size, and a preview of different components of the object. For example, when we call the str() function on a data frame object (mtcars), we see that its class is data.frame(), it has 32 rows and 11 columns, and a preview of each of the 11 columns, including the class of each column. In this example, all of the columns are numeric variables relating to features of different models of cars.

> str(mtcars)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

19.0.3 How big is this object?

After we determine generally what an object is, it is useful to know how much information it contains, how big it is.

The dim() function returns the dimensions of a rectangular object, such as a matrix or a data frame. The output is an integer vector with two components: first is the number of rows (which can also be obtained with nrow()), and second is the number of columns (which can also be obtained with ncol()). We saw previously that the str() function provides the same information and more, so why would we use these functions instead? The str() function provides this information by printing it to the screen for us to visually see, but it does not extract this information directly. If we need to use the dimensions later in the analysis as a variable, these functions provide a direct way to store this information.

> dim(mtcars)
[1] 32 11
> nrow(mtcars)
[1] 32
> ncol(mtcars)
[1] 11

The length() function returns the number of items in a vector object. We talked about this briefly last lesson that the number of things in your object is referred to as its length. Here, we can quickly calculate the length of an object by calling the length() function.

> x <- c(1, 10, 3)
> length(x)
[1] 3

19.0.4 Are there named features of this object?

Another way to explore an object in R is to see what components it has. In R, these components are designated with names.

The names() function can be used to get and set the names of an R object, most often an atomic vector or a list. For example, we can create an R object called prize_money that contains the prize money for first, second, and third places:

prize_money <- c(1000, 500, 250)

If we want to label this vector with the prizes, we can use names() combined with the assignment operator <- and a character vector of labels:

names(prize_money) <- c("first", "second", "third")

Later in our work, if we want to remind ourselves of the labels, we can use the names() function by itself, which will print the names for the object.

> names(prize_money)
[1] "first"  "second" "third"

Note that in many situations, it will be better practice to encapsulate the above information in a two-column data frame instead of a named vector as below.

prize_info <- data.frame(
    money = c(1000,500,250),
    place = c("first", "second", "third")
)

This is more convenient for further work if you have other objects that have information on first, second, or third placing, but not prize money information. You’ll learn more about these concepts when you learn about “tidy data” in a later chapter.

The colnames() and rownames() functions act analogously to the names() function but are used for the column labels and row labels of a matrix or data frame. The numbers in square brackets at the beginning of the lines of printed output indicate the index of the first observation on the line. So for the row names, we can see that “Duster 360” is the seventh element.

> colnames(mtcars)
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
[11] "carb"
> rownames(mtcars)
 [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"         
 [4] "Hornet 4 Drive"      "Hornet Sportabout"   "Valiant"            
 [7] "Duster 360"          "Merc 240D"           "Merc 230"           
[10] "Merc 280"            "Merc 280C"           "Merc 450SE"         
[13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood"
[16] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"           
[19] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"      
[22] "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"         
[25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"      
[28] "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"       
[31] "Maserati Bora"       "Volvo 142E"

colnames() and rownames() functions

19.0.5 What does this object look like?

Sometimes we may just want to see the information contained in an object. Here we will discuss functions that allow you to see parts of objects.

The print() function displays the entire contents of an object.

print(mtcars)

Recall that in R, the Console is where commands can be typed and entered for R to run. When R is ready to accept a command, a greater than sign will be displayed. An alternative to calling the print function is to simply type the name of the object in the Console and press enter. In general printing an entire object is not advisable just in case the object is quite large. In this case your screen would overflow with text!

mtcars

printing objects’ contents to the screen

Safer alternatives to printing are the head and tail functions. The head function displays the beginning of an object. By default, it shows the first 6 items. If the object is a vector, head shows the first 6 entries. If the object is a rectangle, such as a matrix or a data frame, head shows the first 6 rows. The tail function is analogous to head but for the end of the object.

head() and tail() can be used to see a portion of the data

The summary function computes summary statistics for numeric data and performs tabulations for categorical data, which are called factors in R.

The summary() function summarizes data

The unique() function shows only the unique elements of an object. For vectors, this returns the set of unique elements. For rectangles such as matrices and data frames, this returns the unique rows. This function is useful if we want to check the coding of our data. If we have sex information, then we expect the result of unique to be two elements. If not, there is likely some data cleaning that must be done. The unique() function is also useful for simply exploring the values that a variable can take. In the example below, we can see that in the mtcars data frame, there are only cars with 6, 4, and 8 cylinders. Note that to extract the column corresponding to cylinders, we used a dollar sign followed by the column name: $cyl. This is an example of subsetting that you will learn in later lessons.

The unique() shows unique elements of an object

19.0.6 Errors, Warnings, and Messages

In R, there are three types information that R may return to you to your screen to provide you with additional information. These come in the form of errors, warnings, and messages. While they will often look similar to one another, it’s important to understand the difference between them.

The most serious of these messages is an error message. Errors indicate that the code you tried to run did not run successfully. If you receive an error message, you should carefully look back at your code to see what went wrong. Error messages cannot be ignored as they indicate that there was no way for the code to run. Something has to be fixed before moving forward. For example, the code here produces an error, since mtca is not a data frame or object in R.

errors

Warnings are generally less serious than error messages. They are generated when the code executes (meaning, it runs without producing an error and stopping), but produces something unexpected. Warning messages should always be read, and then you, the person writing the code, has the option to decide whether or not the code that has generated the warning needs to be re-written. For example, the log function is only defined for numbers greater than zero. If, in R, you try to take the log of a negative number, you get an output (NaN):

log(-1)

This output means the code executed (there was no error), but you also get a warning letting you know that NaNs were produced. If you meant to take the log of a negative number, you would leave the code as is. However, if you did not intend to do this, the warning message helps clue you into the fact that you may want to revisit your code.

warnings

Last but not least, messages, in general, are simply there to provide you with more information. They do not indicate that you have done anything wrong. For example, if you were to run a function that creates a directory if it does not yet exist, the function may provide you a message informing you whenever a new directory has been created. This message would just be there to provide you with more information. No further action is generally necessary when a message is provided.

messages

Note that all three are in the same font and same color, so they’ll look similar in your Posit Cloud console. Over time, you’ll get more comfortable dealing with and understanding the difference between the three. For now, be sure that if you get an error, your code did not execute successfully. Go back and find what caused the error.

19.1 A brief intro to the `apply` family of functions

*This section is repurposed from the Childhood Cancer Data Lab’s section on apply functions..

In base R, the apply family of functions can be an alternative methods for performing transformations across a data frame, matrix or other object structures.

One of this family is (shockingly) the function apply(), which operates on matrices.

A matrix is similar to a data frame in that it is a rectangular table of data, but it has an additional constraint: rather than each column having a type, ALL data in a matrix has the same type.

Let’s make a data frame with some data.

df <- data.frame(x = 1:4, y = 5:8, z = 10:13)

Converting it to a matrix will require us to make them all the same type. We can coerce it into a matrix using as.matrix(), in which case R will pick a type that it can convert everything to. What does it choose?

a_matrix <- as.matrix(df)

# Explore the structure of the `gene_matrix` object
str(a_matrix)

##  int [1:4, 1:3] 1 2 3 4 5 6 7 8 10 11 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:3] "x" "y" "z"

Now that the matrix is all numbers, we can do things like calculate the column or row statistics using apply(). The first argument to apply() is the data object we want to work on. The third argument is the function we will apply to each row or column of the data object. The second argument specifies whether we are applying the function across rows or across columns (1 for rows, 2 for columns).

# Calculate row means
row_means <- apply(a_matrix, 1, mean) # Notice we are using 1 here

# How long will `gene_means` be?
length(row_means)

## [1] 4

Now let’s investigate the same set up, but use 2 to apply over the columns of our matrix.

# Calculate column means
col_means <- apply(a_matrix, 2, mean) # Notice we use 2 here

# How long will `col_means` be?
length(col_means)

## [1] 3

Although the apply functions may not be as easy to use as the tidyverse functions, for some applications, apply methods can be better suited. In this course, we will not delve too deeply into the various other apply functions (tapply(), lapply(), etc.) but you can read more information about them here.