background-image: url(http://fusionanalyticsworld.com/wp-content/uploads/2016/08/Data-Science-with-R1.jpg?a73fae) background-position: center background-size: cover class: title-slide .bg-text[ # Introduction to Data Science with R ### week.7 <hr /> 4月 8, 2021 謝舒凱 ] --- ## Data Manipulation using `dplyr` and `tidyr` - .large[data wrangling (= munging)] using `dplyr` > the process of transforming and mapping data from one "raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. - .large[data reshaping] : change the layout of data set using `tidyr` > about changing the way data is organized into rows and columns. --- ## `tidyr` - a package that **reshape** the layout of tables (into **tidy** format), particularly designed to work in combination with `magrittr and dplyr` to build a solid data analysis pipeline. - Four main functions <img src="tidyr.png", width=450 height=380 align=right> --- ## tidy data principle Recall that a data set is called tidy when: - each column represents a variable - each row represents an observation (or case) --- ## Tidy form Having your data in tidy format is crucial for facilitating the tasks of data analysis including data manipulation, modeling and visualization. <img src="tidy-data.png", width=500 height=380 align=center> --- ## `tidyr`, `dplyr` 都已經被整併入 `tidyverse` ```r #install.packages("tidyr") library("tidyverse") ``` ``` ## ─ Attaching packages ──────────────────── tidyverse 1.3.0 ─ ``` ``` ## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4 ## ✓ tibble 3.1.0 ✓ dplyr 1.0.2 ## ✓ tidyr 1.1.2 ✓ stringr 1.4.0 ## ✓ readr 1.3.1 ✓ forcats 0.5.0 ``` ``` ## ─ Conflicts ───────────────────── tidyverse_conflicts() ─ ## x dplyr::filter() masks stats::filter() ## x dplyr::lag() masks stats::lag() ``` ```r # use built-in dataset ""USArrest" str(USArrests) ``` ``` ## 'data.frame': 50 obs. of 4 variables: ## $ Murder : num 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ... ## $ Assault : int 236 263 294 190 276 204 110 238 335 211 ... ## $ UrbanPop: int 58 48 80 50 91 78 77 72 80 60 ... ## $ Rape : num 21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ... ``` --- ## Data wrangling - Start by subsetting a small data set, ```r small <- USArrests[c(1, 10, 20, 30,40),]; small ``` ``` ## Murder Assault UrbanPop Rape ## Alabama 13.2 236 58 21.2 ## Georgia 17.4 211 60 25.8 ## Maryland 11.3 300 67 27.8 ## New Jersey 7.4 159 89 18.8 ## South Carolina 14.4 279 48 22.5 ``` - Row names are states, use `cbind()` to add a column named 'state' in the data. *which will make the data tidy and the analysis easier*. ```r #my_data <- cbind(state = rownames(small), small); my_data my_data <- mutate(small, state = rownames(small)); my_data ``` ``` ## Murder Assault UrbanPop Rape state ## 1 13.2 236 58 21.2 Alabama ## 2 17.4 211 60 25.8 Georgia ## 3 11.3 300 67 27.8 Maryland ## 4 7.4 159 89 18.8 New Jersey ## 5 14.4 279 48 22.5 South Carolina ``` --- ## `gather()` collapse columns into rows - collapses multiple columns into key-value pairs. 寬格式轉長格式 It produces a 'long' data format from a 'wide' one. - `gather(dataframe, key, value, ...)` <img src="gather.png", width=150 height=100 align = center> ```r # gather only Murder, Assault and Rape my_data1 <- gather(my_data, key = "arrest_attribute", value = "arrest_estimate", Murder, Assault, Rape) # all selected column names have been collapsed into a single key column (here “arrest_attribute”). Their values have been put into a value column (here “arrest_estimate”). ``` --- ## `spread()`: spread two columns into multiple columns 相反動作 - does the reverse of `gather()`. It takes two columns (key and value) and spreads into multiple columns. It produces a 'wide' data format from a 'long' one. <img src="spread.png", width=150 height=100 align = center> ```r my_data2 <- spread(my_data1, key = "arrest_attribute", value = "arrest_estimate" ) ``` --- ## `unite()` : Unite multiple columns into one - takes multiple columns and paste them together into one. - `unite(data, col, ..., sep = "_")` <img src="unite.png", width=180 height=100 align = center> ```r my_data3 <- unite(my_data, col = "Murder_Assault", Murder, Assault, sep = "_") ``` --- ## `separate()`: separate one column into multiple 相反動作 - the reverse of `unite()`. It takes values inside a single character column and separates them into multiple columns. <img src="separate.png", width=180 height=100 align = center> ```r # Separate the column “Murder_Assault” [in my_data4] into two columns Murder and Assault separate(my_data3, col = "Murder_Assault", into = c("Murder", "Assault"), sep = "_") ``` ``` ## Murder Assault UrbanPop Rape state ## 1 13.2 236 58 21.2 Alabama ## 2 17.4 211 60 25.8 Georgia ## 3 11.3 300 67 27.8 Maryland ## 4 7.4 159 89 18.8 New Jersey ## 5 14.4 279 48 22.5 South Carolina ``` --- background-image: url(../img/emo/boredom-small.png) --- ## Chaining together - recall that `x %>% f` is equivalent to `\(f(x)\)` ```r my_data %>% gather(key = "arrest_attribute", value = "arrest_estimate", Murder:UrbanPop) %>% unite(col = "attribute_estimate", arrest_attribute, arrest_estimate) ``` ``` ## Rape state attribute_estimate ## 1 21.2 Alabama Murder_13.2 ## 2 25.8 Georgia Murder_17.4 ## 3 27.8 Maryland Murder_11.3 ## 4 18.8 New Jersey Murder_7.4 ## 5 22.5 South Carolina Murder_14.4 ## 6 21.2 Alabama Assault_236 ## 7 25.8 Georgia Assault_211 ## 8 27.8 Maryland Assault_300 ## 9 18.8 New Jersey Assault_159 ## 10 22.5 South Carolina Assault_279 ## 11 21.2 Alabama UrbanPop_58 ## 12 25.8 Georgia UrbanPop_60 ## 13 27.8 Maryland UrbanPop_67 ## 14 18.8 New Jersey UrbanPop_89 ## 15 22.5 South Carolina UrbanPop_48 ``` --- # 小抄速記 Cheatsheet .large[ - [可以善用 cheatsheet](https://rstudio.com/resources/cheatsheets/) - 徹底弄懂這個 [Data Wrangling with dplyr and tidyr](https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) ]