Lesson 7 Base subsetting

teaching: 50
exercises: 30
adapted from: https://datacarpentry.org/r-socialsci/01-intro-to-r/index.html

questions:

How can subsets be extracted from vectors and data frames?
How does R treat missing values?
How can we deal with missing values in R?

objectives:

Subset and extract values from vectors.
Analyze vectors with missing data.

keypoints:

Access individual values by location using [].
Access arbitrary sets of data using [c(...)].
Use logical operations and logical vectors to access subsets of data.

7.1 Tidyverse and base R

As a programming language, R provides a lot of flexibility. Where some other programming languages (e.g., Python) favor one preferred way to do a given task, R lets developers choose among several syntaxes. This has led to a proliferation of R syntaxes, that is, many ways to “say” the same thing. A cheatsheet shows how to do the same tasks many ways in R.

As you develop as an R user and programmer, you will learn to mix and match these approaches in your work. Usually, there is one way that is easier (at least for a particular person) and one that is more challenging. So far, we have focused on Tidyverse approaches, but we also want you to see the more general base R approach to some tasks.

In particular, we want to come back to subsetting vectors and data frames. When we discussed dplyr, we used the filter() function to retrieve certain rows from a data frame, and the select() function to retrieve certain columns. We are going to move to a slightly more matrix-oriented approach, using the square brackets [ , ].
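As a preview of the bracket notation, here is a minimal sketch using only base R. The households data frame and its values are made up for illustration and are not the lesson's interviews data; the comments name the dplyr calls that each line roughly corresponds to.

```r
# Hypothetical toy data, not the lesson's interviews data set
households <- data.frame(
  village = c("God", "God", "Chirodzo", "Ruaca"),
  rooms   = c(1, 3, 2, 1),
  stringsAsFactors = FALSE
)

# Rows where rooms > 1, all columns
# (roughly what filter(households, rooms > 1) does in dplyr)
many_rooms <- households[households$rooms > 1, ]

# All rows, only the village column, kept as a data frame
# (roughly what select(households, village) does in dplyr)
villages <- households["village"]

many_rooms
villages
```

The position before the comma selects rows and the position after it selects columns; leaving one side empty keeps everything along that dimension.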

7.2 Subsetting vectors

If we want to extract one or several values from a vector, we must provide one or several indices in square brackets. For instance:

respondent_wall_type <- c("muddaub", "burntbricks", "sunbricks")
respondent_wall_type[2]

## [1] "burntbricks"

respondent_wall_type[c(3, 2)]

## [1] "sunbricks"   "burntbricks"

We can also repeat the indices to create an object with more elements than the original one:

more_respondent_wall_type <- respondent_wall_type[c(1, 2, 3, 2, 1, 3)]
more_respondent_wall_type

## [1] "muddaub"     "burntbricks" "sunbricks"   "burntbricks" "muddaub"    
## [6] "sunbricks"

R indices start at 1. Programming languages like Fortran, MATLAB, Julia, and R start counting at 1, because that’s what human beings typically do. Languages in the C family (including C++, Java, Perl, and Python) count from 0 because that’s simpler for computers to do.
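A few edge cases of 1-based indexing can be checked directly. The sketch below reuses the respondent_wall_type vector from above; the other object names are only illustrative.

```r
respondent_wall_type <- c("muddaub", "burntbricks", "sunbricks")

first_item  <- respondent_wall_type[1]    # index 1 is the first element
without_2nd <- respondent_wall_type[-2]   # a negative index drops that position
beyond_end  <- respondent_wall_type[4]    # past the end gives NA, not an error
index_zero  <- respondent_wall_type[0]    # index 0 selects nothing

first_item    # "muddaub"
without_2nd   # "muddaub" "sunbricks"
beyond_end    # NA
index_zero    # character(0)
```

Because an out-of-range index returns NA silently, it can be worth checking length() before indexing by position.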

7.2.1 Conditional subsetting

Another common way of subsetting is by using a logical vector. TRUE will select the element with the same index, while FALSE will not:

hh_members <- c(3, 7, 10, 6)
hh_members[c(TRUE, FALSE, TRUE, TRUE)]

## [1]  3 10  6

Typically, these logical vectors are not typed by hand, but are the output of other functions or logical tests. For instance, if you wanted to select only the values above 5:

hh_members > 5    # will return logicals with TRUE for the indices that meet the condition

## [1] FALSE  TRUE  TRUE  TRUE

## so we can use this to select only the values above 5
hh_members[hh_members > 5]

## [1]  7 10  6

You can combine multiple tests using & (both conditions are true, AND) or | (at least one of the conditions is true, OR):

hh_members[hh_members < 3 | hh_members > 5]

## [1]  7 10  6

hh_members[hh_members >= 7 & hh_members == 3]

## numeric(0)

Here, < stands for “less than”, > for “greater than”, >= for “greater than or equal to”, and == for “equal to”. The double equal sign == is a test for equality between the left- and right-hand sides, and should not be confused with the single = sign, which performs variable assignment (similar to <-). In the last example no value is both at least 7 and equal to 3, so the result is an empty numeric vector, numeric(0).
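To see how a combined test is built, it can help to look at the intermediate logical vectors. This sketch reuses hh_members from above; the names at_least_7, exactly_3, both, and either are only illustrative.

```r
hh_members <- c(3, 7, 10, 6)

at_least_7 <- hh_members >= 7   # FALSE  TRUE  TRUE FALSE
exactly_3  <- hh_members == 3   #  TRUE FALSE FALSE FALSE

both   <- at_least_7 & exactly_3   # all FALSE: no value passes both tests
either <- at_least_7 | exactly_3   # TRUE TRUE TRUE FALSE

hh_members[both]     # numeric(0)
hh_members[either]   # 3 7 10

# Logical vectors can also be counted or turned into positions
sum(hh_members > 5)     # 3, because TRUE counts as 1
which(hh_members > 5)   # 2 3 4, the positions that meet the condition
```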

A common task is to search for certain strings in a vector. One could use the “or” operator | to test for equality to multiple values, but this can quickly become tedious. The operator %in% tests, for each element of the vector on its left, whether that element is found in the search vector on its right:

possessions <- c("car", "bicycle", "radio", "television", "mobile_phone")
possessions[possessions == "car" | possessions == "bicycle"] # returns both car and bicycle

## [1] "car"     "bicycle"

possessions %in% c("car", "bicycle", "motorcycle", "truck", "boat")

## [1]  TRUE  TRUE FALSE FALSE FALSE

possessions[possessions %in% c("car", "bicycle", "motorcycle", "truck", "boat")]

## [1] "car"     "bicycle"
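The result of %in% always has one value per element of the left-hand vector, so the order of the two operands matters. A small sketch, where the wanted vector is a made-up example:

```r
possessions <- c("car", "bicycle", "radio", "television", "mobile_phone")
wanted <- c("car", "boat", "radio")   # hypothetical search vector

# One TRUE/FALSE per element of possessions (length 5)
possessions %in% wanted

# One TRUE/FALSE per element of wanted (length 3)
found <- wanted %in% possessions
found                      # TRUE FALSE TRUE

# Negate the test to list the items that were not found
missing_items <- wanted[!(wanted %in% possessions)]
missing_items              # "boat"
```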

7.3 Missing data

As R was designed to analyze datasets, it includes the concept of missing data (which is uncommon in other programming languages). Missing data are represented in vectors as NA.

When doing operations on numbers, most functions will return NA if the data you are working with include missing values. This feature makes it harder to overlook the cases where you are dealing with missing data. You can add the argument na.rm=TRUE to calculate the result while ignoring the missing values.

rooms <- c(2, 1, 1, NA, 4)
mean(rooms)

## [1] NA

max(rooms)

## [1] NA

mean(rooms, na.rm = TRUE)

## [1] 2

max(rooms, na.rm = TRUE)

## [1] 4
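Missing values also propagate through comparisons, which affects conditional subsetting. The sketch below reuses the rooms vector; wrapping the test in which() is one possible way to drop the NA, not the only one.

```r
rooms <- c(2, 1, 1, NA, 4)

rooms > 1                  # TRUE FALSE FALSE NA TRUE
rooms[rooms > 1]           # 2 NA 4: the NA in the test yields an NA in the result

# which() keeps only the positions that are TRUE, so the NA is dropped
rooms[which(rooms > 1)]    # 2 4

# NA cannot be tested with ==, use is.na() instead
NA == NA                   # NA
sum(is.na(rooms))          # 1 missing value
```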

If your data include missing values, you may want to become familiar with the functions is.na(), na.omit(), and complete.cases(). See below for examples.

## Extract those elements which are not missing values.
rooms[!is.na(rooms)]

## [1] 2 1 1 4

## Returns the object with incomplete cases removed. The returned object is an atomic vector of type `"numeric"` (or `"double"`).
na.omit(rooms)

## [1] 2 1 1 4
## attr(,"na.action")
## [1] 4
## attr(,"class")
## [1] "omit"

## Extract those elements which are complete cases. The returned object is an atomic vector of type `"numeric"` (or `"double"`).
rooms[complete.cases(rooms)]

## [1] 2 1 1 4

Recall that you can use the typeof() function to find the type of your atomic vector.
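For example, the result of na.omit() is still a double vector, but it carries the extra attributes shown in the printed output above. A minimal sketch; the names cleaned and plain are illustrative.

```r
rooms <- c(2, 1, 1, NA, 4)

cleaned <- na.omit(rooms)
typeof(cleaned)               # "double"
names(attributes(cleaned))    # includes "na.action" and "class"

# as.vector() strips the attributes and leaves a plain vector
plain <- as.vector(cleaned)
attributes(plain)             # NULL

# The values themselves are the same either way
mean(cleaned)                 # 2
mean(plain)                   # 2
```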

7.3 Exercise

1. Using this vector of rooms, create a new vector with the NAs removed.

rooms <- c(1, 2, 1, 1, NA, 3, 1, 3, 2, 1, 1, 8, 3, 1, NA, 1)

2. Use the function median() to calculate the median of the rooms vector.

3. Use R to figure out how many households in the set use more than 2 rooms for sleeping.

7.3 Solution
# 1.
rooms <- c(1, 2, 1, 1, NA, 3, 1, 3, 2, 1, 1, 8, 3, 1, NA, 1)
rooms_no_na <- rooms[!is.na(rooms)]
# or
rooms_no_na <- na.omit(rooms)
# 2.
median(rooms, na.rm = TRUE)
## [1] 1
# 3.
rooms_above_2 <- rooms_no_na[rooms_no_na > 2]
length(rooms_above_2)
## [1] 4
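
As an alternative sketch for the last step, the households can be counted without building an intermediate vector, because sum() treats TRUE as 1. The name n_above_2 is only illustrative.

```r
rooms <- c(1, 2, 1, 1, NA, 3, 1, 3, 2, 1, 1, 8, 3, 1, NA, 1)

# rooms > 2 is NA where rooms is NA; na.rm = TRUE skips those entries
n_above_2 <- sum(rooms > 2, na.rm = TRUE)
n_above_2   # 4
```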

7.4 Indexing and subsetting data frames

We have seen how to use square brackets to index vectors. We can extend the same concept to data frames.

Consider our interviews data frame. It has rows and columns, so it has 2 dimensions. To extract specific data from it, we specify the “coordinates” we want: row numbers come first, followed by column numbers. However, note that different ways of specifying these coordinates lead to results with different classes. Because interviews is a tibble, single square brackets return a tibble in the outputs below, while double square brackets return a vector.

## first element in the first column of the data frame (returned as a 1 x 1 tibble)
interviews[1, 1]

## # A tibble: 1 x 1
##   key_ID
##    <dbl>
## 1      1

## first element in the 6th column (returned as a 1 x 1 tibble)
interviews[1, 6]

## # A tibble: 1 x 1
##   respondent_wall_type
##   <chr>               
## 1 muddaub

## first column of the data frame (as a vector)
interviews[[1]]

##   [1]   1   1   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
##  [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
##  [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  21  54
##  [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71 127
##  [73] 133 152 153 155 178 177 180 181 182 186 187 195 196 197 198 201 202  72
##  [91]  73  76  83  85  89 101 103 102  78  80 104 105 106 109 110 113 118 125
## [109] 119 115 108 116 117 144 143 150 159 160 165 166 167 174 175 189 191 192
## [127] 126 193 194 199 200

## first column of the data frame (as a data.frame)
interviews[1]

## # A tibble: 131 x 1
##    key_ID
##     <dbl>
##  1      1
##  2      1
##  3      3
##  4      4
##  5      5
##  6      6
##  7      7
##  8      8
##  9      9
## 10     10
## # … with 121 more rows

## first three elements in the 7th column (returned as a 3 x 1 tibble)
interviews[1:3, 7]

## # A tibble: 3 x 1
##   rooms
##   <dbl>
## 1     1
## 2     1
## 3     1

## the 3rd row of the data frame (as a data.frame)
interviews[3, ]

## # A tibble: 1 x 14
##   key_ID village interview_date      no_membrs years_liv respondent_wall… rooms
##    <dbl> <chr>   <dttm>                  <dbl>     <dbl> <chr>            <dbl>
## 1      3 God     2016-11-17 00:00:00        10        15 burntbricks          1
## # … with 7 more variables: memb_assoc <chr>, affect_conflicts <chr>,
## #   liv_count <dbl>, items_owned <chr>, no_meals <dbl>, months_lack_food <chr>,
## #   instanceID <chr>

## equivalent to head_interviews <- head(interviews)
head_interviews <- interviews[1:6, ]

: is a special operator that creates vectors of integers in increasing or decreasing order. Try 1:10 and 10:1, for instance.
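
A short sketch of how : behaves, including one case that often surprises people. The variable n is hypothetical and stands for a count that happens to be zero.

```r
1:10          # 1 2 3 4 5 6 7 8 9 10
10:1          # 10 9 8 7 6 5 4 3 2 1
typeof(1:10)  # "integer"

# When the end point is smaller than the start, : counts down
n <- 0
1:n           # 1 0, not an empty vector

# seq_len() gives an empty vector for a zero count
seq_len(n)    # integer(0)
```

When the upper bound comes from nrow() or length() and could be zero, seq_len() is usually the safer choice.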

You can also exclude certain indices of a data frame using the “-” sign:

interviews[, -1]          # The whole data frame, except the first column

## # A tibble: 131 x 13
##    village interview_date      no_membrs years_liv respondent_wall… rooms
##    <chr>   <dttm>                  <dbl>     <dbl> <chr>            <dbl>
##  1 God     2016-11-17 00:00:00         3         4 muddaub              1
##  2 God     2016-11-17 00:00:00         7         9 muddaub              1
##  3 God     2016-11-17 00:00:00        10        15 burntbricks          1
##  4 God     2016-11-17 00:00:00         7         6 burntbricks          1
##  5 God     2016-11-17 00:00:00         7        40 burntbricks          1
##  6 God     2016-11-17 00:00:00         3         3 muddaub              1
##  7 God     2016-11-17 00:00:00         6        38 muddaub              1
##  8 Chirod… 2016-11-16 00:00:00        12        70 burntbricks          3
##  9 Chirod… 2016-11-16 00:00:00         8         6 burntbricks          1
## 10 Chirod… 2016-12-16 00:00:00        12        23 burntbricks          5
## # … with 121 more rows, and 7 more variables: memb_assoc <chr>,
## #   affect_conflicts <chr>, liv_count <dbl>, items_owned <chr>, no_meals <dbl>,
## #   months_lack_food <chr>, instanceID <chr>

interviews[-c(7:131), ]   # Equivalent to head(interviews)

## # A tibble: 6 x 14
##   key_ID village interview_date      no_membrs years_liv respondent_wall… rooms
##    <dbl> <chr>   <dttm>                  <dbl>     <dbl> <chr>            <dbl>
## 1      1 God     2016-11-17 00:00:00         3         4 muddaub              1
## 2      1 God     2016-11-17 00:00:00         7         9 muddaub              1
## 3      3 God     2016-11-17 00:00:00        10        15 burntbricks          1
## 4      4 God     2016-11-17 00:00:00         7         6 burntbricks          1
## 5      5 God     2016-11-17 00:00:00         7        40 burntbricks          1
## 6      6 God     2016-11-17 00:00:00         3         3 muddaub              1
## # … with 7 more variables: memb_assoc <chr>, affect_conflicts <chr>,
## #   liv_count <dbl>, items_owned <chr>, no_meals <dbl>, months_lack_food <chr>,
## #   instanceID <chr>
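
The parentheses (or c()) around the range matter, because the minus sign binds more tightly than the : operator. The sketch below shows this on a plain vector x, which is a made-up example; the same rule applies to row indices of a data frame.

```r
x <- 1:10

x[-(7:10)]    # 1 2 3 4 5 6
x[-c(7:10)]   # same result

# Without parentheses, -7:10 means (-7):10, i.e. -7, -6, ..., 9, 10
head(-7:10, 3)   # -7 -6 -5

# Mixing negative and positive indices is an error, caught here for display
result <- tryCatch(x[-7:10], error = function(e) "error")
result           # "error"
```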

Data frames can be subset by calling indices (as shown previously), but also by calling their column names directly:

interviews["village"]       # Result is a data frame
interviews[, "village"]     # Result is a data frame
interviews[["village"]]     # Result is a vector
interviews$village          # Result is a vector

In RStudio, you can use the autocompletion feature to get the full and correct names of the columns.
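
The class of each result can be checked with class(). The sketch below uses a small base data.frame named households (made-up data) rather than the interviews tibble, which highlights one difference: with a base data.frame, selecting a single column with [ , "name"] drops to a vector unless drop = FALSE is given, whereas a tibble keeps the result as a tibble, as noted in the comments above.

```r
# Hypothetical toy data stored as a base data.frame, not a tibble
households <- data.frame(
  village = c("God", "Chirodzo", "Ruaca"),
  rooms   = c(1, 3, 2),
  stringsAsFactors = FALSE
)

class(households["village"])                   # "data.frame"
class(households[["village"]])                 # "character"
class(households$village)                      # "character"

# Base data.frame: a single column is simplified to a vector by default
class(households[, "village"])                 # "character"
class(households[, "village", drop = FALSE])   # "data.frame"
```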

7.4 Exercise

1. Create a data frame (interviews_100) containing only the data in row 100 of the interviews dataset.

2. Notice how nrow() gave you the number of rows in a data frame?

   Use that number to pull out just that last row in the data frame.

   Compare that with what you see as the last row using tail() to make sure it’s meeting expectations.

   Pull out that last row using nrow() instead of the row number.

   Create a new data frame (interviews_last) from that last row.

3. Use nrow() to extract the row that is in the middle of the data frame. Store the content of this row in an object named interviews_middle.

4. Combine nrow() with the - notation above to reproduce the behavior of head(interviews), keeping just the first through 6th rows of the interviews dataset.

7.4 Solution
## 1.
interviews_100 <- interviews[100, ]
## 2.
# Saving `n_rows` to improve readability and reduce duplication
n_rows <- nrow(interviews)
interviews_last <- interviews[n_rows, ]
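## 3.
# n_rows is odd (131), so n_rows / 2 is not a whole number;
# ceiling() rounds it up to the middle row (66)
interviews_middle <- interviews[ceiling(n_rows / 2), ]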
## 4.
interviews_head <- interviews[-(7:n_rows), ]
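
The same steps can be checked against head() and tail() on a small stand-in data frame. This is a sketch under the assumption that a base data.frame named toy (made-up data with 9 rows) behaves like interviews for these operations.

```r
# Hypothetical stand-in for the interviews data
toy <- data.frame(id = 1:9, rooms = c(1, 2, 1, 3, 1, 2, 4, 1, 2))
n_rows <- nrow(toy)

toy_last   <- toy[n_rows, ]                 # last row
toy_middle <- toy[ceiling(n_rows / 2), ]    # middle row (row 5 of 9)
toy_head   <- toy[-(7:n_rows), ]            # drop rows 7 to the end

# Compare with the built-in helpers
last_matches <- isTRUE(all.equal(toy_last, tail(toy, 1)))
head_matches <- isTRUE(all.equal(toy_head, head(toy)))

toy_middle$id   # 5
last_matches
head_matches
```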