C-Path R Training

class: title-slide, left, top
background-image: url(img/sam-balye-k5RD4dl8Y1o-unsplash_blue.jpg)
background-position: 75% 75%
background-size: cover

# C-Path R Training
### Advanced Data Wrangling Part II

**Kelsey Gonzalez**<br>
May 27, 2021 &#8212; Day 2

---
name: about-me
layout: false
class: about-me-slide, inverse, middle, center

# About me

## Kelsey Gonzalez

[<svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg">  <path d="M326.612 185.391c59.747 59.809 58.927 155.698.36 214.59-.11.12-.24.25-.36.37l-67.2 67.2c-59.27 59.27-155.699 59.262-214.96 0-59.27-59.26-59.27-155.7 0-214.96l37.106-37.106c9.84-9.84 26.786-3.3 27.294 10.606.648 17.722 3.826 35.527 9.69 52.721 1.986 5.822.567 12.262-3.783 16.612l-13.087 13.087c-28.026 28.026-28.905 73.66-1.155 101.96 28.024 28.579 74.086 28.749 102.325.51l67.2-67.19c28.191-28.191 28.073-73.757 0-101.83-3.701-3.694-7.429-6.564-10.341-8.569a16.037 16.037 0 0 1-6.947-12.606c-.396-10.567 3.348-21.456 11.698-29.806l21.054-21.055c5.521-5.521 14.182-6.199 20.584-1.731a152.482 152.482 0 0 1 20.522 17.197zM467.547 44.449c-59.261-59.262-155.69-59.27-214.96 0l-67.2 67.2c-.12.12-.25.25-.36.37-58.566 58.892-59.387 154.781.36 214.59a152.454 152.454 0 0 0 20.521 17.196c6.402 4.468 15.064 3.789 20.584-1.731l21.054-21.055c8.35-8.35 12.094-19.239 11.698-29.806a16.037 16.037 0 0 0-6.947-12.606c-2.912-2.005-6.64-4.875-10.341-8.569-28.073-28.073-28.191-73.639 0-101.83l67.2-67.19c28.239-28.239 74.3-28.069 102.325.51 27.75 28.3 26.872 73.934-1.155 101.96l-13.087 13.087c-4.35 4.35-5.769 10.79-3.783 16.612 5.864 17.194 9.042 34.999 9.69 52.721.509 13.906 17.454 20.446 27.294 10.606l37.106-37.106c59.271-59.259 59.271-155.699.001-214.959z"></path></svg> kelseygonzalez.github.io](https://kelseygonzalez.github.io/)
[<svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg">  <path d="M459.37 151.716c.325 4.548.325 9.097.325 13.645 0 138.72-105.583 298.558-298.558 298.558-59.452 0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292 20.791-32.161 39.308-52.628 54.253z"></path></svg> @KelseyEGonzalez](https://twitter.com/kelseyegonzalez)
[<svg viewBox="0 0 496 512" style="position:relative;display:inline-block;top:.1em;height:1em;" xmlns="http://www.w3.org/2000/svg">  <path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"></path></svg> @KelseyGonzalez](https://github.com/KelseyGonzalez)

---
class: left

# About you

.pull-left-narrow[.center[
<img src="https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/tidyverse.png" width="25%"/>]]
.pull-right-wide[### You're pretty good at data wrangling]

.pull-left-narrow[
.center[<svg viewBox="0 0 512 512" style="position:relative;display:inline-block;top:.1em;height:2em;" xmlns="http://www.w3.org/2000/svg">  <path d="M224 96l16-32 32-16-32-16-16-32-16 32-32 16 32 16 16 32zM80 160l26.66-53.33L160 80l-53.34-26.67L80 0 53.34 53.33 0 80l53.34 26.67L80 160zm352 128l-26.66 53.33L352 368l53.34 26.67L432 448l26.66-53.33L512 368l-53.34-26.67L432 288zm70.62-193.77L417.77 9.38C411.53 3.12 403.34 0 395.15 0c-8.19 0-16.38 3.12-22.63 9.38L9.38 372.52c-12.5 12.5-12.5 32.76 0 45.25l84.85 84.85c6.25 6.25 14.44 9.37 22.62 9.37 8.19 0 16.38-3.12 22.63-9.37l363.14-363.15c12.5-12.48 12.5-32.75 0-45.24zM359.45 203.46l-50.91-50.91 86.6-86.6 50.91 50.91-86.6 86.6z"></path></svg>]]
.pull-right-wide[### You don't have much exposure to special variable types]

.pull-left-narrow[
.center[<svg viewBox="0 0 512 512" style="position:relative;display:inline-block;top:.1em;height:2em;" xmlns="http://www.w3.org/2000/svg">  <path d="M223.75 130.75L154.62 15.54A31.997 31.997 0 0 0 127.18 0H16.03C3.08 0-4.5 14.57 2.92 25.18l111.27 158.96c29.72-27.77 67.52-46.83 109.56-53.39zM495.97 0H384.82c-11.24 0-21.66 5.9-27.44 15.54l-69.13 115.21c42.04 6.56 79.84 25.62 109.56 53.38L509.08 25.18C516.5 14.57 508.92 0 495.97 0zM256 160c-97.2 0-176 78.8-176 176s78.8 176 176 176 176-78.8 176-176-78.8-176-176-176zm92.52 157.26l-37.93 36.96 8.97 52.22c1.6 9.36-8.26 16.51-16.65 12.09L256 393.88l-46.9 24.65c-8.4 4.45-18.25-2.74-16.65-12.09l8.97-52.22-37.93-36.96c-6.82-6.64-3.05-18.23 6.35-19.59l52.43-7.64 23.43-47.52c2.11-4.28 6.19-6.39 10.28-6.39 4.11 0 8.22 2.14 10.33 6.39l23.43 47.52 52.43 7.64c9.4 1.36 13.17 12.95 6.35 19.59z"></path></svg>]]
.pull-right-wide[### .my-gold[**You want to master data wrangling**]]

---
# Learning Objectives

- Use conditional functions (`ifelse()`, `case_when()`) to create new variables.
- Manipulate strings with the `stringr` package.
- Employ regular expressions (regex) to manipulate strings.
- Use functions from the `forcats` package to manipulate factors.
- Manipulate dates and times using the `lubridate` package.
---

{{content}}
---
template: question

<img src="img/tidyverse.png" width="25%"/>
# How can we create new categorical variables?

---

# ifelse()
----
`ifelse()` is a vectorised function: it checks `test` for every element of the vector you pass in and returns the `yes` value where the test is `TRUE` and the `no` value where it is `FALSE`.

```r
ifelse(test = vector == condition, "result if true", "result if false")
```
---
# ifelse()
----

```r
x <- 1:6
ifelse(test = x < 3, yes = "small", no = "large")
## [1] "small" "small" "large" "large" "large" "large"
```

```r
nhanes %>% 
  mutate(pulse_type = ifelse(Pulse > 70, "fast", "slow")) %>% 
  select(Pulse, pulse_type)
## # A tibble: 20,293 x 2
##    Pulse pulse_type
##    <int> <chr>     
##  1    70 slow      
##  2    NA <NA>      
##  3    68 slow      
##  4    68 slow      
##  5    72 fast      
##  6    72 fast      
##  7    86 fast      
##  8    NA <NA>      
##  9    70 slow      
## 10    88 fast      
## # ... with 20,283 more rows
```

---

# case_when() <img src="img/dplyr.png" class="title-hex">
----
This function allows you to vectorize (and replace) multiple `if_else()` statements in a succinct and clear manner. The syntax is

```r
case_when(condition1 ~ "result if condition1 is true",
          condition2 ~ "result if condition2 is true",
          TRUE ~ "catch for everything else")
```

+ The left-hand side (LHS) is the condition that determines which values match a given case; it must return a logical vector.
+ The right-hand side (RHS) provides the new or replacement value; all RHS values must be of the same type.
+ Cases are checked in order, so you usually end with a `TRUE` case to catch everything the earlier cases did not match.
+ `case_when()` is particularly useful inside `mutate()` when you want to create a *new variable* that relies on a **complex combination** of existing variables.
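
---
# case_when() vs. nested ifelse() <img src="img/dplyr.png" class="title-hex">
----
To make "replacing multiple `if_else()` statements" concrete, here is a minimal sketch on a made-up vector (`pulse` and its cut-offs are invented for illustration, not clinical guidance). Both versions should produce the same labels; `case_when()` simply reads top to bottom, and the first matching condition wins.

```r
library(dplyr)

# Hypothetical pulse values, invented for this illustration
pulse <- c(55, 68, 72, 95, 110)

# Nested ifelse(): each extra category adds another level of nesting
nested <- ifelse(pulse < 60, "low",
                 ifelse(pulse < 100, "normal", "high"))

# case_when(): conditions are checked in order, the first TRUE wins
flat <- case_when(
  pulse < 60  ~ "low",
  pulse < 100 ~ "normal",
  TRUE        ~ "high"
)

identical(nested, flat)
## [1] TRUE
```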
  
---
# case_when()<img src="img/dplyr.png" class="title-hex">
----

```r
x <- 1:16
x
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
case_when(
  x < 5 ~ "< 5",
  x < 10 ~ "< 10",
  TRUE ~ as.character(x)
)
##  [1] "< 5"  "< 5"  "< 5"  "< 5"  "< 10" "< 10" "< 10" "< 10" "< 10" "10"  
## [11] "11"   "12"   "13"   "14"   "15"   "16"
```

---
# case_when()<img src="img/dplyr.png" class="title-hex">
----

```r
nhanes %>% 
  mutate(health_status = case_when(Diabetes == "Yes" ~ "at-risk",
                                   Age > 70 ~ "at-risk",
                                   BPDiaAve < 80 & BPSysAve > 120 ~ "at-risk",
                                   TRUE ~ "okay")) %>% 
  select(health_status) %>% head()
## # A tibble: 6 x 1
##   health_status
##   <chr>        
## 1 okay         
## 2 okay         
## 3 okay         
## 4 okay         
## 5 at-risk      
## 6 okay
```
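---
# case_when() and missing values <img src="img/dplyr.png" class="title-hex">
----
One thing worth checking in examples like the one above: a condition that evaluates to `NA` does not match, so missing values fall through to the `TRUE` catch-all. The sketch below uses a small made-up vector (`pulse` is hypothetical, not the NHANES column) to show one way to keep them missing instead.

```r
library(dplyr)

# Hypothetical values, including a missing one
pulse <- c(60, 88, NA)

# The NA falls through to the catch-all and is labelled "slow"
catch_all <- case_when(
  pulse > 70 ~ "fast",
  TRUE       ~ "slow"
)
catch_all  # "slow" "fast" "slow"

# Handling NA first keeps it missing; NA_character_ matches the RHS type
keep_na <- case_when(
  is.na(pulse) ~ NA_character_,
  pulse > 70   ~ "fast",
  TRUE         ~ "slow"
)
keep_na    # "slow" "fast" NA
```

Whether missing values belong in the catch-all category is an analysis decision; the point is only to make that choice on purpose.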
---

---

.left-column[
## Your turn<br><svg viewBox="0 0 576 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg">  <path d="M402.3 344.9l32-32c5-5 13.7-1.5 13.7 5.7V464c0 26.5-21.5 48-48 48H48c-26.5 0-48-21.5-48-48V112c0-26.5 21.5-48 48-48h273.5c7.1 0 10.7 8.6 5.7 13.7l-32 32c-1.5 1.5-3.5 2.3-5.7 2.3H48v352h352V350.5c0-2.1.8-4.1 2.3-5.6zm156.6-201.8L296.3 405.7l-90.4 10c-26.2 2.9-48.5-19.2-45.6-45.6l10-90.4L432.9 17.1c22.9-22.9 59.9-22.9 82.7 0l43.2 43.2c22.9 22.9 22.9 60 .1 82.8zM460.1 174L402 115.9 216.2 301.8l-7.3 65.3 65.3-7.3L460.1 174zm64.8-79.7l-43.2-43.2c-4.1-4.1-10.8-4.1-14.8 0L436 82l58.1 58.1 30.9-30.9c4-4.2 4-10.8-.1-14.9z"></path></svg><br>
]

.right-column[
### Let's practice using case_when()
----
Look up the variable `AgeDecade` in the NHANES documentation (`?NHANES`). It is missing in our raw version of the dataset. Recreate `AgeDecade` using `case_when()` and find how many cases are in each age category.

]
<div class="countdown blink-colon noupdate-15" id="timer_60b02cb6" style="bottom:0;left:0;" data-warnwhen="0">
<code class="countdown-time"><span class="countdown-digits minutes">03</span><span class="countdown-digits colon">:</span><span class="countdown-digits seconds">00</span></code>
</div>
---

---

# Strings
----
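Strings (character vectors) can contain special characters, such as quotes and line breaks, that must be escaped with a backslash. Printing a string shows the escapes; `cat()` shows the text as it will be displayed.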

```r
a <- "Grace Hopper said, \"If it's a good idea, go ahead and do it.\nIt's much easier to apologize than it is to get permission.\""
a
## [1] "Grace Hopper said, \"If it's a good idea, go ahead and do it.\nIt's much easier to apologize than it is to get permission.\""
cat(a)
## Grace Hopper said, "If it's a good idea, go ahead and do it.
## It's much easier to apologize than it is to get permission."
```
-   `"\n"` represents a new line
-   `"\t"` represents a tab

---
# stringr <img src="img/stringr.png" class="title-hex">
----
Use `str_length()` to get the number of characters in a
string.

```r
beyonce <- "I am Beyoncé, always."
str_length(beyonce)
## [1] 21
```

What about spaces and punctuation marks? They count! What about escaped
characters? The backslash `\` does not count, but the escaped character itself does.

```r
str_length("I am Beyoncé, \nalways.")
## [1] 22
```

---
## extracting substrings with `str_sub()` <img src="img/stringr.png" class="title-hex">

----
`str_sub()` extracts the substring between two character positions, `start` and `end` (both inclusive).

```r
hopper <- "A ship in port is safe, but that's not what ships are built for."
str_sub(hopper, start = 3, end = 6)
## [1] "ship"
```

You often want to use `str_sub()` when the data is highly structured.

```r
phone <- "800-800-8553"
#first three
str_sub(phone, end = 3)
## [1] "800"
# last four
str_sub(phone, start = -4)
## [1] "8553"
```

---
## replacing words with `str_replace()` <img src="img/stringr.png" class="title-hex">
----
.panelset[
.panel[.panel-name[Hopper]
If I want to replace a specific pattern of text with another pattern,
`str_replace()` or `str_replace_all()` are very useful.

```r
hopper
## [1] "A ship in port is safe, but that's not what ships are built for."
str_replace(hopper, "ship", "car")
## [1] "A car in port is safe, but that's not what ships are built for."
str_replace_all(hopper, "ship", "car")
## [1] "A car in port is safe, but that's not what cars are built for."
```
]
.panel[.panel-name[Phone Number]
Back to our phone number example, you'll see there's a difference
between `str_replace()` and `str_replace_all()`. The first only replaces
the first instance.

```r
phone
## [1] "800-800-8553"
str_replace(phone, "800-", "")
## [1] "800-8553"
str_replace_all(phone, "800-", "")
## [1] "8553"
```

If I only want to change the *second* instance of "800-", I'll need to
use a more complicated pattern match. This would require a regular
expression.
]
]
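
---
## replacing only the second match <img src="img/stringr.png" class="title-hex">
----
As a preview of what such a regular expression could look like, here is one possible sketch (not the only way to do it). It assumes the number always starts with "800-800-"; the replacement value "555" is made up for illustration. The parentheses capture the first "800", and `\\1` in the replacement puts it back unchanged.

```r
library(stringr)

phone <- "800-800-8553"

# ^(800) anchors to and captures the first "800";
# "-800-" then matches the second instance, which is the part we change
second_replaced <- str_replace(phone, "^(800)-800-", "\\1-555-")
second_replaced
## [1] "800-555-8553"
```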

---

```r
z <- "You don't manage people, you manage things.\nYou lead people."
```
- How many characters does this string have? 
- Using string subsetting by indexes, can you extract the word "people"? 
- Replace the `\n` with a space

]
<div class="countdown blink-colon noupdate-15" id="timer_60b02dba" style="bottom:0;left:0;" data-warnwhen="0">
<code class="countdown-time"><span class="countdown-digits minutes">03</span><span class="countdown-digits colon">:</span><span class="countdown-digits seconds">00</span></code>
</div>

---

---
# Regular Expressions 
----
-   Regular expressions (regex or regexp) are a syntax for **pattern
    matching** in strings.

-   Regex structure is used in many different computer languages

-   Wherever there is a `pattern` argument in a stringr function, you
    can use regex (to extract strings, get a logical if there is a
    match, etc...).

-   Regex includes special characters, e.g., `.` and `*`. These must be
    escaped using `\\` if you want to match their normal literal value.

---
# Regular Expressions  <img src="img/stringr.png" class="title-hex">
----

-   A period "`.`" matches any character.
-   A `[:alpha:]` matches any alphabetical character.
-   You can "escape" a symbol with two backslashes, e.g. "`\\.`" matches a literal period. If you don't escape it, the symbol will be interpreted as a regular expression command, not as a literal character.

By default, regular expressions will match any part of a string. It’s often useful to anchor the regular expression so that it matches from the start or end of the string. You can use:

- `^` to match the start of the string.
- `$` to match the end of the string.

```r
fruit <- c("Apple", "Strawberry", "Banana", "Pear", "Blackberry", "*berry")
```
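
---
# Regular Expressions  <img src="img/stringr.png" class="title-hex">
----
A short sketch of how anchors and escaping behave on the `fruit` vector defined above. `str_subset()` is used here simply because it returns the matching strings; note that in stringr the `[:alpha:]` class has to sit inside another pair of brackets.

```r
library(stringr)

fruit <- c("Apple", "Strawberry", "Banana", "Pear", "Blackberry", "*berry")

starts_b   <- str_subset(fruit, "^B")            # "Banana" "Blackberry"
ends_berry <- str_subset(fruit, "berry$")        # "Strawberry" "Blackberry" "*berry"
second_e   <- str_subset(fruit, "^.e")           # "Pear" (any character, then "e")
has_star   <- str_subset(fruit, "\\*")           # "*berry" (escaped, so a literal asterisk)
alpha_1st  <- str_subset(fruit, "^[[:alpha:]]")  # everything except "*berry"
```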

---
template: live-coding
---

## Symbols <img src="img/stringr.png" class="title-hex">
There are a lot of regular expression character matches in R and I don't
expect you to memorize them all - I often have [the
cheatsheet](https://github.com/rstudio/cheatsheets/blob/master/strings.pdf)
open next to me while working. Some important ones you should however be
able to recognize:

| type this: | to mean this: |
|------------|---------------|
| \\\\n                   | new line                        |
| \\\\s or \[:space:\]    | any whitespace                  |
| \\\\d or \[:digit:\]    | any digit                       |
| \\\\w                   | any word character (letter, digit, or underscore) |
| \[:alpha:\]             | any letter                      |
| \[:punct:\]             | any punctuation                 |
| .                       | every character except new line |
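
---
## Symbols in action <img src="img/stringr.png" class="title-hex">

A minimal sketch of a few symbols from the table, run on a made-up vector (`x` is invented for illustration). Each call asks whether the pattern occurs anywhere in the string.

```r
library(stringr)

# Hypothetical strings; the last one contains a tab
x <- c("apple pie", "room 101", "wait...", "tab\there")

has_digit  <- str_detect(x, "\\d")          # FALSE  TRUE FALSE FALSE
has_space  <- str_detect(x, "\\s")          #  TRUE  TRUE FALSE  TRUE (a tab is whitespace)
has_punct  <- str_detect(x, "[[:punct:]]")  # FALSE FALSE  TRUE FALSE
first_word <- str_extract(x, "\\w+")        # "apple" "room" "wait" "tab"
```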

---
## Alternates <img src="img/stringr.png" class="title-hex">

| type this: | to mean this: |
|------------|---------------|
| [abc]      | one of a, b or c |
| [^abc]     | anything but a, b or c |

---
template: live-coding
---

## Quantifiers <img src="img/stringr.png" class="title-hex">

| Expression 	| matches                  	|
|------------	|--------------------------	|
| a?         	| a, zero or one times     	|
| a*         	| a, zero or more times    	|
| a+         	| a, one or more times     	|
| a\{n\}       	| a, n times               	|
| a\{n,\}      	| a, n or more times       	|
| a\{n,m\}     	| a, between n and m times 	|
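
---
## Quantifiers and alternates in action <img src="img/stringr.png" class="title-hex">

The sketch below applies each quantifier to the letter "a" in a made-up vector (`x` and `y` are invented for illustration). The `^` and `$` anchors force the pattern to describe the whole string, which makes the differences easier to see.

```r
library(stringr)

x <- c("ct", "cat", "caat", "caaat")

opt      <- str_detect(x, "^ca?t$")      #  TRUE  TRUE FALSE FALSE
star     <- str_detect(x, "^ca*t$")      #  TRUE  TRUE  TRUE  TRUE
plus     <- str_detect(x, "^ca+t$")      # FALSE  TRUE  TRUE  TRUE
exact2   <- str_detect(x, "^ca{2}t$")    # FALSE FALSE  TRUE FALSE
two_plus <- str_detect(x, "^ca{2,}t$")   # FALSE FALSE  TRUE  TRUE
one_two  <- str_detect(x, "^ca{1,2}t$")  # FALSE  TRUE  TRUE FALSE

# Alternates: [bc] is "b or c", [^bc] is "anything but b or c"
y <- c("bat", "cat", "rat")
set_in  <- str_detect(y, "^[bc]at$")     #  TRUE  TRUE FALSE
set_out <- str_detect(y, "^[^bc]at$")    # FALSE FALSE  TRUE
```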
---
template: live-coding
---
name: your-turn
background-color: var(--my-red)
class: inverse

.right-column[
### Let's practice using stringr functions
----
Given the corpus of common words in stringr::words, use `str_view()` and create regular expressions that find all words that:

- Start with “y”
- End with “x”
- Are exactly three letters long
- Have seven letters or more
- Start with a vowel

]
<div class="countdown blink-colon noupdate-15" id="timer_60b02edb" style="bottom:0;left:0;" data-warnwhen="0">
<code class="countdown-time"><span class="countdown-digits minutes">03</span><span class="countdown-digits colon">:</span><span class="countdown-digits seconds">00</span></code>
</div>
---
# Regular Expressions <img src="img/stringr.png" class="title-hex">

There is a lot more to learn about regular expressions that we won't cover here,
like groups and look arounds. Groups allow you to define which part of the
expression you want to extract or replace, and look arounds allow you to define
what follows or precedes the expression. When you need to learn more, there are
many tools online like [https://regex101.com/](https://regex101.com/) to help
you learn. The only important thing to remember with online regular expression
tools is that R needs an extra `\` preceding each `\` used in other coding
languages.
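---
# Regular Expressions <img src="img/stringr.png" class="title-hex">

To illustrate the extra backslash, here is a small sketch. The pattern is an arbitrary example (three digits, a hyphen, four digits at the end of the string); `writeLines()` shows what the regex engine actually receives once R has processed the string.

```r
library(stringr)

# As typed in an online tester:  \d{3}-\d{4}$
# As an R string, every backslash is doubled:
pattern <- "\\d{3}-\\d{4}$"

writeLines(pattern)
## \d{3}-\d{4}$

last_seven <- str_extract("800-800-8553", pattern)
last_seven
## [1] "800-8553"
```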
---
## Changing Case with stringr <img src="img/stringr.png" class="title-hex">

-   `str_to_lower()` and `str_to_upper()` convert all letters to lower
    or upper case.
-   `str_to_sentence()` capitalizes only the first letter of the string and
    lowercases everything else, including acronyms and proper nouns.
-   `str_to_title()` converts the first letter of every word to upper case.

```r
mission <- "Critical Path Institute is a catalyst in the development of new approaches to advance medical innovation and regulatory science."
str_to_lower(mission)
## [1] "critical path institute is a catalyst in the development of new approaches to advance medical innovation and regulatory science."
str_to_upper(mission)
## [1] "CRITICAL PATH INSTITUTE IS A CATALYST IN THE DEVELOPMENT OF NEW APPROACHES TO ADVANCE MEDICAL INNOVATION AND REGULATORY SCIENCE."
str_to_sentence(mission)
## [1] "Critical path institute is a catalyst in the development of new approaches to advance medical innovation and regulatory science."
str_to_title(mission)
## [1] "Critical Path Institute Is A Catalyst In The Development Of New Approaches To Advance Medical Innovation And Regulatory Science."
```

---
## Detecting matches with stringr <img src="img/stringr.png" class="title-hex">

`str_detect()`: Returns `TRUE` if a regex pattern matches a string and `FALSE` if it does not. Very useful for filters.

To ignore case, place a `(?i)` before the regex.

```r
nhanes %>% 
  select(HealthGen) %>% 
  filter(str_detect(HealthGen, "(?i)good")) %>% 
  head()
## # A tibble: 6 x 1
##   HealthGen
##   <fct>    
## 1 Good     
## 2 Vgood    
## 3 Good     
## 4 Good     
## 5 Good     
## 6 Good
```
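
---
## Detecting matches with stringr <img src="img/stringr.png" class="title-hex">

An alternative to the inline `(?i)` flag is stringr's `regex()` helper with `ignore_case = TRUE`. The sketch below uses a small made-up vector (`health` is hypothetical, loosely modelled on the `HealthGen` levels) to show that the two spellings agree.

```r
library(stringr)

# Hypothetical values for illustration
health <- c("Good", "Vgood", "Fair", "Excellent", "good")

inline_flag <- str_detect(health, "(?i)good")
with_helper <- str_detect(health, regex("good", ignore_case = TRUE))

inline_flag
## [1]  TRUE  TRUE FALSE FALSE  TRUE
identical(inline_flag, with_helper)
## [1] TRUE
```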
    
---
# Combining strings with glue <img src="img/glue.png" class="title-hex">

```r
# install.packages("glue")
library(glue)
## 
## Attaching package: 'glue'
## The following object is masked from 'package:dplyr':
## 
##     collapse

name <- "Fred"
age <- 50
anniversary <- as.Date("1991-10-12")

glue('My name is {name}, my age next year is {age + 1}, and my anniversary is {format(anniversary, "%A, %B %d, %Y")}.') 
## My name is Fred, my age next year is 51, and my anniversary is Saturday, October 12, 1991.
```
---
template: live-coding
---

.right-column[
### Let's practice glue!
----
With the nhanes dataset, create a new character column that looks something like:
"`ID` is a `Age` year old `Gender` who `"smokes" or "doesn't smoke"`"

You'll either need to create a new variable using `ifelse()` or `case_when()` before you write your glue statement or use `ifelse()` or `case_when()` inside of `{}`

]
<div class="countdown blink-colon noupdate-15" id="timer_60b02c71" style="bottom:0;left:0;" data-warnwhen="0">
<code class="countdown-time"><span class="countdown-digits minutes">03</span><span class="countdown-digits colon">:</span><span class="countdown-digits seconds">00</span></code>
</div>
---
<img src="img/pipe.png" class="title-hex"><img src="img/glue.png" class="title-hex"><img src="img/dplyr.png" class="title-hex">

```r
library(glue)

nhanes %>% 
  select(ID, Age, Gender, SmokeNow) %>% 
  mutate(smoke_text = ifelse(SmokeNow == "Yes", "smokes", "doesn't smoke"),
         text = glue("{ID} is a {Age} year old {Gender} who {smoke_text}")) %>% 
  head()
## # A tibble: 6 x 6
##      ID   Age Gender SmokeNow smoke_text    text                                
##   <int> <int> <fct>  <fct>    <chr>         <glue>                              
## 1 51624    34 male   No       doesn't smoke 51624 is a 34 year old male who doe~
## 2 51625     4 male   <NA>     <NA>          51625 is a 4 year old male who NA   
## 3 51626    16 male   <NA>     <NA>          51626 is a 16 year old male who NA  
## 4 51627    10 male   <NA>     <NA>          51627 is a 10 year old male who NA  
## 5 51628    60 female Yes      smokes        51628 is a 60 year old female who s~
## 6 51629    26 male   No       doesn't smoke 51629 is a 26 year old male who doe~
```

---
<img src="img/pipe.png" class="title-hex"><img src="img/glue.png" class="title-hex"><img src="img/dplyr.png" class="title-hex">
A little more elegant...

```r
library(glue)

nhanes %>% 
  select(ID, Age, Gender, SmokeNow) %>% 
  mutate(text = glue("{ID} is a {Age} year old {Gender} who {case_when(SmokeNow == 'Yes'~'smokes',SmokeNow == 'No'~'doesnt smoke',Age < 20 ~ 'is too young to smoke',TRUE~'we dont know the smoking status for')}")) %>% 
  head()
## # A tibble: 6 x 5
##      ID   Age Gender SmokeNow text                                              
##   <int> <int> <fct>  <fct>    <glue>                                            
## 1 51624    34 male   No       51624 is a 34 year old male who doesnt smoke      
## 2 51625     4 male   <NA>     51625 is a 4 year old male who is too young to sm~
## 3 51626    16 male   <NA>     51626 is a 16 year old male who is too young to s~
## 4 51627    10 male   <NA>     51627 is a 10 year old male who is too young to s~
## 5 51628    60 female Yes      51628 is a 60 year old female who smokes          
## 6 51629    26 male   No       51629 is a 26 year old male who doesnt smoke
```

---
name: break
background-color: var(--my-yellow)
class: middle, center

<div class="countdown" id="timer_60b02dce" style="right:0;bottom:0;left:0;" data-warnwhen="0">
<code class="countdown-time"><span class="countdown-digits minutes">03</span><span class="countdown-digits colon">:</span><span class="countdown-digits seconds">00</span></code>
</div>
 
---

---
# Factors  
----
R has a special data class, called factor, to deal with categorical data that
you may encounter when creating plots or doing statistical analyses. They are
stored as integers associated with labels and they can be ordered or unordered.
While factors look (and often behave) like character vectors, they are actually
treated as integer vectors by R. So you need to be very careful when treating
them as strings.

To work with factors, we will practice with the `gapminder` dataset.

```r
# install.packages("gapminder")  # run once if the package is not installed
library(gapminder)
```

---
# Factors 
----

Once created, factors can only contain a pre-defined set of values, known as
*levels*. By default, base R always sorts levels in alphabetical order. For
instance, if you have a factor with 2 levels:

```r
feeling <- factor(c("sad", "happy", "happy", "sad"))
feeling
## [1] sad   happy happy sad  
## Levels: happy sad
```

R will assign `1` to the level `"happy"` and `2` to the level `"sad"`
(because `h` comes before `s` in the alphabet, even though the first element in
this vector is `"sad"`).

In R's memory, factors are represented by integers (1, 2), but they are more
informative than integers because factors are self-describing: `"happy"` and
`"sad"` are more descriptive than `1` and `2`.
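---
# Factors 
----
A minimal sketch of why the integer storage matters. The `years` factor is made up for illustration: converting a factor of number-like labels straight to numeric returns the level codes, not the labels.

```r
feeling <- factor(c("sad", "happy", "happy", "sad"))

codes  <- as.integer(feeling)    # 2 1 1 2  (the underlying level codes)
labels <- as.character(feeling)  # "sad" "happy" "happy" "sad"

# Hypothetical factor whose labels look like numbers
years <- factor(c("2010", "2005", "2010"))

wrong <- as.numeric(years)                # 2 1 2  (codes, not years!)
right <- as.numeric(as.character(years))  # 2010 2005 2010
```

Going through `as.character()` first is the usual way to recover the numeric labels.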
---
# Factors 
----

To see the levels of a factor and count them, we can use `levels()` and `nlevels()`:

```r
levels(feeling)
## [1] "happy" "sad"
nlevels(feeling)
## [1] 2
```
---
# The continent factor
----

Let's get to know the factor we'll be working with today: `continent`.

```r
glimpse(gapminder$continent)
##  Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
levels(gapminder$continent)
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"
nlevels(gapminder$continent)
## [1] 5
class(gapminder$continent)
## [1] "factor"
```

---
## reordering factors with forcats <img src="img/forcats.png" class="title-hex">

Sometimes the order of the factor levels does not matter. Other times you might want
to specify the order because it is meaningful (e.g., "low", "medium", "high"),
it improves your visualization, or it is required by a particular type of
analysis.

By default, factor levels are ordered alphabetically, which might as well be
random when you think about it! It is preferable to order the levels according
to some principle:

* Frequency. Make the most common level the first and so on.
* Another variable. Order factor levels according to a summary statistic for another variable. Example: order Gapminder countries by life expectancy.
  
---
## Manually reorder with fct_relevel <img src="img/forcats.png" class="title-hex">

```r
levels(gapminder$continent) 
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"
```

```r
gapminder <- gapminder %>% 
  mutate(continent_reordered = fct_relevel(continent, "Asia", "Africa", 
                                           "Americas", "Europe", "Oceania"))
```

```r
levels(gapminder$continent_reordered) 
## [1] "Asia"     "Africa"   "Americas" "Europe"   "Oceania"
```
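
---
## Manually reorder with fct_relevel <img src="img/forcats.png" class="title-hex">

You do not have to list every level: `fct_relevel()` moves the levels you name and leaves the rest in their existing order. The sketch below uses a small stand-in factor (`continent` here is built by hand, not taken from `gapminder`) and the `after` argument to move a level to the end.

```r
library(forcats)

# Hand-built stand-in for the continent factor
continent <- factor(c("Africa", "Americas", "Asia", "Europe", "Oceania"))

# Name only the level to move; it goes to the front by default
asia_first <- fct_relevel(continent, "Asia")
levels(asia_first)
## [1] "Asia"     "Africa"   "Americas" "Europe"   "Oceania"

# after = Inf moves the named level to the end instead
africa_last <- fct_relevel(continent, "Africa", after = Inf)
levels(africa_last)
## [1] "Americas" "Asia"     "Europe"   "Oceania"  "Africa"
```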
---
## Automatically reorder with fct_infreq <img src="img/forcats.png" class="title-hex">

.small[Another way to re-order your factor levels is by frequency, so the most common
factor levels come first, and the less common come later. (This is often useful
for plotting!) In this case, it is the frequency of how often each level occurs
in the variable, as seen in `fct_count(gapminder$continent)`]

```r
# Current levels
levels(gapminder$continent)
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"

# Reorder by frequency in dataset:
gapminder <- gapminder %>% 
  mutate(continent_infreq = fct_infreq(continent, ordered = TRUE))

# New levels
levels(gapminder$continent_infreq) 
## [1] "Africa"   "Asia"     "Europe"   "Americas" "Oceania"
```
---
## Automatically reorder based on another variable with fct_reorder <img src="img/forcats.png" class="title-hex">
.panelset[
.panel[.panel-name[fct_reorder]
What if we want to order the continent factor based on the values of another
variable? This other variable is usually quantitative and you will order the
factor according to a grouped summary. The factor is the grouping variable and
the default summarizing function is `median()` but you can specify something
else.
]
.panel[.panel-name[median life expectancy]

```r
head(levels(gapminder$country))
## [1] "Afghanistan" "Albania"     "Algeria"     "Angola"      "Argentina"  
## [6] "Australia"
## order countries by median life expectancy
gapminder <- gapminder %>% 
  mutate(country_med_lifexp = fct_reorder(country, lifeExp))
head(levels(gapminder$country_med_lifexp))
## [1] "Sierra Leone"  "Guinea-Bissau" "Afghanistan"   "Angola"       
## [5] "Somalia"       "Guinea"
```
]
.panel[.panel-name[max population]

```r
head(levels(gapminder$country))
## [1] "Afghanistan" "Albania"     "Algeria"     "Angola"      "Argentina"  
## [6] "Australia"
## order according to max population instead of median life expectancy
gapminder <- gapminder %>% 
  mutate(country_max_pop = fct_reorder(country, pop, .fun = max))
head(levels(gapminder$country_max_pop))
## [1] "Sao Tome and Principe" "Iceland"               "Djibouti"             
## [4] "Equatorial Guinea"     "Bahrain"               "Comoros"
```
]
]
---
template: live-coding
---

.right-column[
### Let's practice forcats!
----
With the nhanes dataset, reorder the levels of the `MaritalStatus` variable in three ways:

- with `fct_relevel()`, reorder with "NeverMarried", "LivePartner", "Married","Separated", "Divorced", "Widowed"
- with `fct_infreq()`, reorder based on the frequency. 
- with `fct_reorder()`, reorder based on the median `PhysActiveDays`

]
<div class="countdown blink-colon noupdate-15" id="timer_60b02dd1" style="bottom:0;left:0;" data-warnwhen="0">
<code class="countdown-time"><span class="countdown-digits minutes">03</span><span class="countdown-digits colon">:</span><span class="countdown-digits seconds">00</span></code>
</div>

---
## renaming factor levels with fct_recode <img src="img/forcats.png" class="title-hex">

`forcats` makes it easy to rename factor levels. Let's say we made a mistake and
need to recode "Oceania" to actually be "Australia". We'd use the `fct_recode()` function to
do this. The new name goes on the left and the old level on the right.

```r
levels(gapminder$continent)
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"

gapminder <- gapminder %>% 
  mutate(continent_recode = fct_recode(continent, Australia = "Oceania"))

levels(gapminder$continent_recode)
## [1] "Africa"    "Americas"  "Asia"      "Europe"    "Australia"
```

---
## collapsing factor levels into "other" with fct_lump <img src="img/forcats.png" class="title-hex">

There are many other `forcats` functions for very specific uses, like making an
"other" level for rare occurrences with `fct_lump()`. You can also use
`fct_other()` to manually set levels to "Other" or `fct_collapse()` to
collapse levels into manually defined groups.

Explore [the cheatsheet](https://github.com/rstudio/cheatsheets/blob/master/factors.pdf) so you know what is available!
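
---
## collapsing factor levels: a quick sketch <img src="img/forcats.png" class="title-hex">

A minimal sketch of the three functions on a made-up factor (`f` and the group names `ab` and `cd` are invented for illustration). Only the resulting levels are shown in the comments.

```r
library(forcats)

# Hypothetical factor: "a" appears 3 times, "b" twice, "c" and "d" once
f <- factor(c("a", "a", "a", "b", "b", "c", "d"))

# Keep the 2 most common levels, lump the rest into "Other"
lumped <- fct_lump(f, n = 2)
levels(lumped)     # "a" "b" "Other"

# Keep only the levels you name, everything else becomes "Other"
othered <- fct_other(f, keep = "a")
levels(othered)    # "a" "Other"

# Collapse levels into manually defined groups
collapsed <- fct_collapse(f, ab = c("a", "b"), cd = c("c", "d"))
levels(collapsed)  # "ab" "cd"
```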

---
template: break

---

---
# Dates

If you've ever worked with dates in R before, you may know they're somewhat of a mess. Dates / times can be the `Date` class, `POSIXct` class, or `hms` class and it's all sort of confusing. That's where the lubridate package comes in.

---
## Parsing Dates Using lubridate <img src="img/lubridate.png" class="title-hex">
----
lubridate has many "helper" functions which parse dates/times more
automatically. The helper *function name specifies the order of the
components*: year, month, day, hours, minutes, and seconds. The help page for
`ymd` shows multiple functions to parse **dates** with different sequences of
**y**ear, **m**onth and **d**ay.

Only the order of year, month, and day matters; separators and extra text are handled for you.

```r
library(lubridate)

ymd(c("2011/01-10", "2011-01/10", "20110110"))
## [1] "2011-01-10" "2011-01-10" "2011-01-10"
mdy(c("01/10/2011", "01 adsl; 10 df 2011", "January 10, 2011"))
## [1] "2011-01-10" "2011-01-10" "2011-01-10"
```
---
## Parsing Times Using lubridate<img src="img/lubridate.png" class="title-hex">
----
For times, only the order of hours, minutes, and seconds matters.

```r
hms(c("10:40:10", "10 40 10"))
## [1] "10H 40M 10S" "10H 40M 10S"
```
---
## Parsing Date-Times Using lubridate<img src="img/lubridate.png" class="title-hex">
----
Let's parse the following date-times.

```r
t1 <- "05/26/2004 UTC 11:11:11.444" #mdy, hms
t2 <- "26 2004 05 UTC 11/11/11.444" #dym, hms

mdy_hms(t1)
## [1] "2004-05-26 11:11:11 UTC"

# No dym_hms() function is defined, so we need to use parse_date_time()
parse_date_time(t2, "d y m H M S")
## [1] "2004-05-26 11:11:11 UTC"
```

---

.right-column[
### Let's practice parsing dates!
----
Use the appropriate lubridate function to parse the following dates/times:

```r
d1 <- "January 1, 2010"
d2 <- "2015-Mar-07"
d3 <- "06-Jun-2017"
d4 <- c("August 19 (2015)", "July 1 (2015)")
d5 <- "12/30/14 23:14:11" 
```

]
<div class="countdown blink-colon noupdate-15" id="timer_60b02e01" style="bottom:0;left:0;" data-warnwhen="0">
<code class="countdown-time"><span class="countdown-digits minutes">02</span><span class="countdown-digits colon">:</span><span class="countdown-digits seconds">30</span></code>
</div>

---
# Extracting date components <img src="img/lubridate.png" class="title-hex">

.panelset[
.panel[.panel-name[About]
To extract a component of a date-time, use one of the following: 
- `year()` extracts the year
- `month()` extracts the month
- `week()` extracts the week
- `mday()` extracts the day of the month (1, 2, 3, ...)
- `wday()` extracts the day of the week (Sunday, Monday, Tuesday, ...)
- `yday()` extracts the day of the year (1, 2, 3, ...)
- `hour()` extracts the hour
- `minute()` extracts the minute
- `second()` extracts the second
]
.panel[.panel-name[Examples]
.pull-left[

```r
ddat <- mdy_hms("01/02/1970 03:51:44")
ddat
## [1] "1970-01-02 03:51:44 UTC"
year(ddat)
## [1] 1970
month(ddat, label = TRUE)
## [1] Jan
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
week(ddat)
## [1] 1
mday(ddat)
## [1] 2
```
]
.pull-right[

```r
wday(ddat, label = TRUE)
## [1] Fri
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
yday(ddat)
## [1] 2
hour(ddat)
## [1] 3
minute(ddat)
## [1] 51
second(ddat)
## [1] 44
```
]
]
]
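
The labels in the examples come from `label = TRUE`. As a rough sketch of what the same functions return without it, reusing the `ddat` value from the examples:

```r
library(lubridate)

ddat <- mdy_hms("01/02/1970 03:51:44")

# Without label = TRUE, wday() returns a number; by default 1 = Sunday
wday(ddat)
## [1] 6

# week_start = 1 counts from Monday instead
wday(ddat, week_start = 1)
## [1] 5

# month() also returns a number unless label = TRUE
month(ddat)
## [1] 1
```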

---
## Doing Math with Time <img src="img/lubridate.png" class="title-hex">
----
Humans manipulate "clock time" with policies such as [Daylight
Saving Time](https://en.wikipedia.org/wiki/Daylight_saving_time), which create
irregularities in "physical time". `lubridate` provides three classes of
time spans to facilitate math with dates and date-times.
--
- **Periods**: track changes in "clock time", and *ignore irregularities* in
    "physical time". 
--
- **Durations**: track the passage of "physical time", which deviates from
    "clock time" when irregularities occur.
--
- **Intervals**: represent specific spans of the timeline, bounded by start
    and end date-times. We won't cover them in this workshop because I find them
    to be the least useful, but you can learn more with `?interval`.

---
## Periods<img src="img/lubridate.png" class="title-hex">
**Periods**: track changes in "clock time", and *ignore irregularities* in "physical time".

Make a period with the name of a time unit pluralized, e.g.

```r
p <- months(3) + days(12)
p
## [1] "3m 12d 0H 0M 0S"
```

And calculate the time elapsed between two dates with basic operators (plain subtraction returns a base R `difftime`, not a period):

```r
d1 <- mdy("June 13, 2013")
d2 <- today()
d2-d1
## Time difference of 2905 days
```
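
Periods can also be added to a date. A minimal sketch, with dates picked for illustration, that also shows what happens when the resulting calendar day does not exist:

```r
library(lubridate)

d1 <- mdy("June 13, 2013")

# Adding a period moves the calendar date by that many months and days
d1 + months(3) + days(12)
## [1] "2013-09-25"

# Adding a month to January 31 would land on February 31, which does not exist
ymd("2021-01-31") + months(1)
## [1] NA

# %m+% rolls back to the last valid day of the month instead
ymd("2021-01-31") %m+% months(1)
## [1] "2021-02-28"
```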

---
## Durations<img src="img/lubridate.png" class="title-hex">

**Durations**: track the passage of "physical time", which deviates from "clock time" when irregularities occur.

Durations are stored as seconds, the only time unit with a consistent length.
Add or subtract durations to model *physical processes*, like travel or
lifespan. You can create durations from years with `dyears()`, from days with
`ddays()`, etc.

```r
dyears(1)
## [1] "31557600s (~1 years)"
ddays(1)
## [1] "86400s (~1 days)"
dhours(1)
## [1] "3600s (~1 hours)"
```
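
To see where periods and durations differ, here is a sketch around the 2016 US switch to daylight saving time (the date and the `America/New_York` time zone are chosen for illustration):

```r
library(lubridate)

# 1 pm on the day before clocks spring forward in New York
one_pm <- ymd_hms("2016-03-12 13:00:00", tz = "America/New_York")

# Period: same clock time on the next calendar day
one_pm + days(1)
## [1] "2016-03-13 13:00:00 EDT"

# Duration: exactly 86400 seconds later, so the clock reads an hour ahead
one_pm + ddays(1)
## [1] "2016-03-13 14:00:00 EDT"
```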

---
.left-column[

```r
bike_trips <- read_csv("http://bit.ly/capital_trips", n_max = 1000)
# n_max = 1000 reads only the first 1,000 rows, which helps on slower computers;
# the full file has 552,399 rows
```
]
.right-column[
### Day 2 Case Study
----
.small[Use the data from the capital_trips_2016.csv below. These data are from a bikesharing program.

- Review the variables with `glimpse()`.
- Rename variables to conform to “best practices” for variable names, i.e., no spaces in the names. Feel free to experiment with the handy `janitor::clean_names()` function here if you’d like.
- Convert the start date and end date variables to be date-times.
- Create a new variable `weekday` where you extract the day of the week based on the start date (use the `label = TRUE` option to get actual days of the week).
- Use the start date and end date variables to calculate the duration of each trip.
- Reorder the weekday factor by the median trip duration.
- How much time elapsed between the start of the first trip and the end of the last trip?]
]
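
If you get stuck, the steps above can be sketched on a tiny made-up table. Everything in `toy_trips` is hypothetical: the rows are invented, and the `"m/d/Y H:M"` timestamp format is an assumption, so check the real columns before reusing `mdy_hm()`.

```r
library(dplyr)
library(lubridate)
library(forcats)

# Hypothetical toy data; only the column names mimic the case study file
toy_trips <- tibble(
  `Start date` = c("3/31/2016 23:59", "4/1/2016 8:00", "4/2/2016 9:30"),
  `End date`   = c("4/1/2016 0:04",   "4/1/2016 8:30", "4/2/2016 9:40")
)

toy_clean <- toy_trips %>%
  # Names without spaces (janitor::clean_names() is another option)
  rename(start_date = `Start date`, end_date = `End date`) %>%
  mutate(
    # Assumed format: month/day/year hour:minute
    start_date = mdy_hm(start_date),
    end_date   = mdy_hm(end_date),
    weekday    = wday(start_date, label = TRUE),
    # Trip length as a plain number of minutes
    trip_mins  = as.numeric(end_date - start_date, units = "mins"),
    # Reorder weekday levels by median trip length
    weekday    = fct_reorder(weekday, trip_mins, .fun = median)
  )

# Time elapsed between the first start and the last end
elapsed <- max(toy_clean$end_date) - min(toy_clean$start_date)
elapsed
```

Storing the trip length as a number of minutes keeps the reordering step simple; working with `difftime` values directly is another reasonable choice.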

---
.panel1-case-study-auto[

```r
*bike_trips
```
]
 
.panel2-case-study-auto[

```
## # A tibble: 552,399 x 9
##    `Duration (ms)` `Start date`  `End date`  `Start station n~ `Start station`  
##              <dbl> <chr>         <chr>                   <dbl> <chr>            
##  1          301295 3/31/2016 23~ 4/1/2016 0~             31280 11th & S St NW   
##  2          557887 3/31/2016 23~ 4/1/2016 0~             31275 New Hampshire Av~
##  3          555944 3/31/2016 23~ 4/1/2016 0~             31101 14th & V St NW   
##  4          766916 3/31/2016 23~ 4/1/2016 0~             31226 34th St & Wiscon~
##  5          139656 3/31/2016 23~ 3/31/2016 ~             31011 23rd & Crystal Dr
##  6          967713 3/31/2016 23~ 4/1/2016 0~             31266 11th & M St NW   
##  7          534836 3/31/2016 23~ 4/1/2016 0~             31222 New York Ave & 1~
##  8          243864 3/31/2016 23~ 4/1/2016 0~             31228 8th & H St NW    
##  9          372524 3/31/2016 23~ 4/1/2016 0~             31113 Columbia Rd & Be~
## 10          215194 3/31/2016 23~ 3/31/2016 ~             31263 10th & K St NW   
## # ... with 552,389 more rows, and 4 more variables: End station number <dbl>,
## #   End station <chr>, Bike number <chr>, Member Type <chr>
```
]