Introduction to Data Science with R

background-image: url(http://fusionanalyticsworld.com/wp-content/uploads/2016/08/Data-Science-with-R1.jpg?a73fae)
background-position: center
background-size: cover

class: title-slide

.bg-text[
# Introduction to Data Science with R
### week.7

<hr />

4月  8, 2021  
謝舒凱
]

---
## Data Manipulation using `dplyr` and `tidyr`

- .large[data wrangling (= munging)] using `dplyr`

> the process of transforming and mapping data from one "raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics.

- .large[data reshaping] : change the layout of data set using `tidyr`

> about changing the way data is organized into rows and columns.

---
## `tidyr`

- a package that **reshape** the layout of tables (into **tidy** format),  particularly designed to work in combination with `magrittr and dplyr` to build a solid data analysis pipeline.

- Four main functions

---
## tidy data principle

Recall that a data set is called tidy when:

- each column represents a variable
- each row represents an observation (or case)

---
## Tidy form

Having your data in tidy format is crucial for facilitating the tasks of data analysis including data manipulation, modeling and visualization.

---
## `tidyr`, `dplyr` 都已經被整併入 `tidyverse`

```r
#install.packages("tidyr")
library("tidyverse")
```

```
## ─ Attaching packages ──────────────────── tidyverse 1.3.0 ─
```

```
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.1.0     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
```

```
## ─ Conflicts ───────────────────── tidyverse_conflicts() ─
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
```

```r
# use built-in dataset ""USArrest"
str(USArrests)
```

```
## 'data.frame':	50 obs. of  4 variables:
##  $ Murder  : num  13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
##  $ Assault : int  236 263 294 190 276 204 110 238 335 211 ...
##  $ UrbanPop: int  58 48 80 50 91 78 77 72 80 60 ...
##  $ Rape    : num  21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
```

---
## Data wrangling

- Start by subsetting a small data set,

```r
small <- USArrests[c(1, 10, 20, 30,40),]; small
```

```
##                Murder Assault UrbanPop Rape
## Alabama          13.2     236       58 21.2
## Georgia          17.4     211       60 25.8
## Maryland         11.3     300       67 27.8
## New Jersey        7.4     159       89 18.8
## South Carolina   14.4     279       48 22.5
```

- Row names are states, use `cbind()` to add a column named 'state' in the data. *which will make the data tidy and the analysis easier*.

```r
#my_data <- cbind(state = rownames(small), small); my_data
my_data <- mutate(small, state = rownames(small)); my_data
```

```
##   Murder Assault UrbanPop Rape          state
## 1   13.2     236       58 21.2        Alabama
## 2   17.4     211       60 25.8        Georgia
## 3   11.3     300       67 27.8       Maryland
## 4    7.4     159       89 18.8     New Jersey
## 5   14.4     279       48 22.5 South Carolina
```

---
## `gather()` collapse columns into rows

- collapses multiple columns into key-value pairs. 寬格式轉長格式 It produces a 'long' data format from a 'wide' one.
  - `gather(dataframe, key, value, ...)`

```r
# gather only Murder, Assault and Rape
my_data1 <- gather(my_data,
                   key = "arrest_attribute",
                   value = "arrest_estimate",
                   Murder, Assault, Rape)
# all selected column names have been collapsed into a single key column (here “arrest_attribute”). Their values have been put into a value column (here “arrest_estimate”).
```

---
## `spread()`: spread two columns into multiple columns

相反動作

- does the reverse of `gather()`. It takes two columns (key and value) and spreads into multiple columns. It produces a 'wide' data format from a 'long' one.

```r
my_data2 <- spread(my_data1, 
                   key = "arrest_attribute",
                   value = "arrest_estimate"
                   )
```

---
## `unite()` : Unite multiple columns into one

- takes multiple columns and paste them together into one.
  - `unite(data, col, ..., sep = "_")`

```r
my_data3 <- unite(my_data,
                  col = "Murder_Assault",
                  Murder, Assault,
                  sep = "_")
```

---
## `separate()`: separate one column into multiple

相反動作

-  the reverse of `unite()`. It takes values inside a single character column and separates them into multiple columns.

```r
# Separate the column “Murder_Assault” [in my_data4] into two columns Murder and Assault
separate(my_data3,
         col = "Murder_Assault",
         into = c("Murder", "Assault"),
         sep = "_")
```

---
background-image: url(../img/emo/boredom-small.png)
---
## Chaining together

- recall that `x %>% f` is equivalent to `$f(x)$`

```r
my_data %>% gather(key = "arrest_attribute",
                   value = "arrest_estimate",
                   Murder:UrbanPop) %>%
            unite(col = "attribute_estimate",
                  arrest_attribute, arrest_estimate)
```

```
##    Rape          state attribute_estimate
## 1  21.2        Alabama        Murder_13.2
## 2  25.8        Georgia        Murder_17.4
## 3  27.8       Maryland        Murder_11.3
## 4  18.8     New Jersey         Murder_7.4
## 5  22.5 South Carolina        Murder_14.4
## 6  21.2        Alabama        Assault_236
## 7  25.8        Georgia        Assault_211
## 8  27.8       Maryland        Assault_300
## 9  18.8     New Jersey        Assault_159
## 10 22.5 South Carolina        Assault_279
## 11 21.2        Alabama        UrbanPop_58
## 12 25.8        Georgia        UrbanPop_60
## 13 27.8       Maryland        UrbanPop_67
## 14 18.8     New Jersey        UrbanPop_89
## 15 22.5 South Carolina        UrbanPop_48
```

---
# 小抄速記 Cheatsheet

.large[
- [可以善用 cheatsheet](https://rstudio.com/resources/cheatsheets/)

- 徹底弄懂這個 [Data Wrangling with dplyr and tidyr](https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)

]