Tidy data
Tidy data processing methods
dplyr : a grammar for data wrangling
ggplot2
- Basic grammar of ggplot2

Introduction to Programming and Data Science with R
week 6 Shu-Kai Hsieh
2021-03-25

Now that we know the basics of R syntax, we move on to data wrangling (transformation), which is also how we come to understand the data.

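Happy families are all alike; every unhappy family is unhappy in its own way. – Лев Николаевич Толстой (Leo Tolstoy)

First you get the data, but non-tidy data wastes your life.

"Tidy datasets are all alike, but every messy dataset is messy in its own way." – Hadley Wickham

So what does tidy data mean? (Untidy is not the same as unclean.)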

Tidy data

Three principles

Each variable forms a column.
Each observation forms a row.
Each value must have its own cell.
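The three principles can be illustrated with a minimal sketch. The `messy` table below is made up for illustration, and `pivot_longer()` from tidyr is one possible way to reshape it, not necessarily the method used in the examples linked from these notes.

```r
library(tidyr)

# Hypothetical messy table: the variable "year" is hidden in the column names
messy <- data.frame(
  country = c("A", "B"),
  `2019`  = c(10, 20),
  `2020`  = c(11, 21),
  check.names = FALSE
)

# Reshape so that each variable has its own column
# and each observation (country-year pair) has its own row
tidy <- pivot_longer(
  messy,
  cols      = c(`2019`, `2020`),
  names_to  = "year",
  values_to = "cases"
)

tidy
```

After reshaping, `country`, `year`, and `cases` are each a column, and every country-year combination is one row.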

See the examples. Incidentally, you can start reading this book now.
Those with an academic interest can try reading Wickham's original paper.

Tidy data processing methods

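So far we have covered the basic syntax of R: variables, data types, data structures, control flow, loops and user-defined functions, input and output, and so on. From here you can go deeper into data wrangling, visualization, statistics, and machine learning, depending on the problem you want to solve.
In recent years, a team led mainly by RStudio developers has released tidyverse, an overarching framework intended to form an ecosystem of data science packages that share a philosophy, grammar, and data structures for working with data. One of its goals is to let beginners get into data processing and visualization projects in the shortest possible time, so that R becomes useful right away. The idea has even sparked a debate over how R should be taught (base R first vs. tidyverse first).
For now, mastering two core packages is enough.
- dplyr: a Grammar of Data Manipulation
- ggplot2: a Grammar of Graphics
The DataCamp course Introduction to the Tidyverse takes about four hours; please find time to practice with it.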

`dplyr` : a grammar for data wrangling

First take a look at the introduction on the website and install the package.

Great for data exploration and transformation
Intuitive to write and easy to read, especially when using the chaining syntax
Fast on data frames

You can start using the cheatsheet as a reference (contributing a high-quality text-processing cheatsheet or R package could also count as a final project).

Basic Five Verbs

Data manipulation with five verbs: filter(), select(), arrange(), mutate(), summarise(). Note: columns are variables (VAR) and rows are observations (OBS).

filter(): take a subset of the rows (i.e., observations, OBS)
Keeps the OBS that satisfy the given logical conditions. Similar to subset().
select(): take a subset of the columns (i.e., variables, VAR)
Takes variable names as arguments and keeps those VAR.
arrange(): sort the rows
Sorts the OBS by the given VAR, in the order they are listed. Similar to order().
mutate(): add new columns or modify existing ones
Computes on existing VAR and adds the result as a new VAR. Similar to transform().
summarise(): aggregate the data across rows
Applies other functions to the data frame to summarise it, returning one row of results (one row per group when the data is grouped).

Usage: each of these functions takes a data frame as its first argument and returns a data frame. The subsequent arguments say what to do with that data frame.
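As a minimal sketch of the "data frame in, data frame out" convention, the calls below apply each verb to `mtcars`, a dataset that ships with base R and is used here only as a stand-in. The object names (`heavy`, `cols`, and so on) are arbitrary.

```r
library(dplyr)

# mtcars stands in for any data frame
heavy  <- filter(mtcars, wt > 3.5)              # keep a subset of rows
cols   <- select(mtcars, mpg, cyl, wt)          # keep a subset of columns
sorted <- arrange(mtcars, desc(mpg))            # sort rows, highest mpg first
added  <- mutate(mtcars, wt_kg = wt * 453.592)  # wt is recorded in 1000 lbs
summ   <- summarise(mtcars, mean_mpg = mean(mpg), n = n())

# Every result is still a data frame
sapply(list(heavy, cols, sorted, added, summ), is.data.frame)
```

Because the output type matches the input type, the result of one verb can be passed straight to the next, which is what makes chaining possible.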

Chaining/Pipelining

The usual way to perform multiple operations in one line is to nest function calls.
With the %>% infix operator (which can be pronounced as "then") you can write the commands in their natural order instead.
Chaining increases readability significantly.

# The easiest way to get dplyr is to install the whole tidyverse:
#install.packages("tidyverse")
library(tidyverse)

# play with the starwars data
# head(starwars)

# filter
starwars %>% 
  filter(species == "Droid")

# select
starwars %>% 
  select(name, ends_with("color"))

# mutate then select (height is in cm, so convert to metres for BMI)
starwars %>% 
  mutate(bmi = mass / ((height / 100) ^ 2)) %>%
  select(name:mass, bmi)

# arrange
starwars %>% 
  arrange(desc(mass))

Grouping Data

These verbs become very powerful when combined with the grouping operation group_by()!

# group_by then summarise then filter
starwars %>%
  group_by(species) %>%
  summarise(
    n = n(), # number of rows in each group
    mass = mean(mass, na.rm = TRUE)
  ) %>%
  filter(
    n > 1,
    mass > 50
  )

Exercise

Are there more Droids or Humans in the Star Wars movies? (Answer: there are 5 Droids and 35 Humans in the version of the dataset used here, so more Humans.)

starwars %>% select(species) %>%
  filter(species=="Droid" | species=="Human") %>%
  group_by(species) %>%
  summarize(n=n())

How many films are in the dataset? (hint: also check unlist(),unique())

starwars %>% 
  select(films) %>%
  unlist() %>%
  unique()

Which of the Star Wars movies was Luke Skywalker in?

starwars %>% 
  filter(name=="Luke Skywalker") %>%
  select(films) %>%
  unlist()

Comparing the `base R` approach and the `tidyverse` approach

filter (keep rows matching criteria): filtering observations
- The base R approach to filtering forces you to repeat the data frame's name; the dplyr approach is simpler to write and read: filter(df, conditions) returns the rows that satisfy the logical conditions.

# base R approach 
#starwars[starwars$height > 160 & starwars$sex == "female", ]

# dplyr approach
# note: you can use comma or ampersand to represent AND condition
filter(starwars, height > 160 & sex == "female")
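One practical difference between the two approaches is how missing values are handled. The sketch below uses a made-up data frame `df` (not part of the course data) to show it: base R bracket indexing keeps a row of NAs when the condition evaluates to NA, while filter() drops such rows.

```r
library(dplyr)

# Toy data (hypothetical), with one missing height
df <- data.frame(name = c("a", "b", "c"), height = c(150, NA, 180))

base_res  <- df[df$height > 160, ]     # the NA condition yields an all-NA row
dplyr_res <- filter(df, height > 160)  # rows where the condition is NA are dropped

nrow(base_res)
nrow(dplyr_res)
```

This matters for starwars, where height and sex both contain missing values, so the commented-out base R line and the filter() call above do not return exactly the same rows.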

select: Pick columns by name 選取變量
- Base R approach is awkward to type and to read, dplyr approach uses similar syntax to filter.

# base R approach to select name, height, and gender columns
#starwars[, c("name", "height", "gender")]

# dplyr approach
select(starwars, name, height, gender)

# use colon to select multiple contiguous columns, and use `contains` to match columns by name
# note: `starts_with`, `ends_with`, and `matches` (for regular expressions) can also be used to match columns by name
# or use - to exclude a column
select(starwars, name:gender, contains("color"))

# nesting method to select name and height columns and filter for height > 90
#filter(select(starwars, name, height), height > 90)

# chaining method
starwars %>%
    select(name, height) %>%
    filter(height > 90)

Others
- rename() renames variables: rename(tbl, newname = oldname, ...)
- Summarise function takes n inputs and returns 1 value
- Window function takes n inputs and returns n values. Includes ranking and ordering functions (like min_rank()), offset functions (lead() and lag()), and cumulative aggregates (like cummean()).
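The difference between summary and window functions can be seen on a small vector. This is a minimal sketch; the vector `x` is made up for illustration.

```r
library(dplyr)

x <- c(10, 30, 20)

mean(x)      # summary function: 3 inputs, 1 output
min_rank(x)  # window function: 3 inputs, 3 outputs (the rank of each value)
lag(x)       # offset function: the previous value, NA for the first element
cummean(x)   # cumulative aggregate: the running mean

# Inside mutate(), a window function yields one value per row
tibble(x = x) %>%
  mutate(rank = min_rank(x), prev = lag(x), running_mean = cummean(x))
```

Summary functions pair naturally with summarise(), while window functions pair with mutate() and filter(), because they keep the number of rows unchanged.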

Data wrangling with Multiple Tables

What if the data you need to process is spread across many tables?

left_join(), right_join(), inner_join()
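A minimal sketch of the three joins follows. The tables `heights` and `homes` and their values are hypothetical, chosen only so that each table has one name the other lacks.

```r
library(dplyr)

# Two small hypothetical tables sharing the key column "name"
heights <- tibble(name = c("Luke", "Leia", "Han"),
                  height = c(172, 150, 180))
homes   <- tibble(name = c("Luke", "Leia", "Yoda"),
                  home = c("Tatooine", "Alderaan", NA))

left  <- left_join(heights, homes, by = "name")   # all rows of heights; home is NA for Han
right <- right_join(heights, homes, by = "name")  # all rows of homes; height is NA for Yoda
inner <- inner_join(heights, homes, by = "name")  # only names present in both tables

left
right
inner
```

Like the single-table verbs, each join takes data frames as its first arguments and returns a data frame, so joins can be used inside a %>% chain.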

`ggplot2`

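Plotting is the Swiss Army knife of EDA. Before applying any technique, first ask yourself:
- Which kind of plot suits which kind of data?
- Which plotting package (library) is appropriate?
- How can the plot be made lively and interactive?
First take a look at these plots to avoid.
Learning path: plot() - qplot() - ggplot() >> interactive plots with rCharts, plotly, networkD3, dygraphs… (depending on what your application needs). Some packages are enough on their own.
- e.g., gather and display Google trend information.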

#install.packages('gtrendsR')
library(gtrendsR)
trends <- gtrends(c("Nerds", "Smarties"), geo ="CA")
plot(trends)

Make it more interactive

#install.packages('plotly')
library(plotly)
p <-plot(trends)

ggplotly(p)

Basic grammar of `ggplot2`

gg stands for grammar of graphics
- (data, aesthetics) + geometry
  - data: a data frame
  - aesthetics: used to indicate x and y variables, also used to control the color, size, shape of points, heights of bars, etc.
  - geometry: corresponds to the type of graphics (histogram, box plot,…)

library(ggplot2)
gg <- ggplot(diamonds, aes(price, carat)) +
  geom_point(color = "brown4") # scatter plot; size=1.5, shape=18

gg

Build plots step by step

gg <- gg + ggtitle("Diamond carat and price")
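The same step-by-step idea can be sketched from an empty plot. The object name `p`, the axis labels, and the choice of carat on the x axis are assumptions made for this illustration; only `diamonds` and the ggplot2 functions come from the package itself.

```r
library(ggplot2)

# Start from data + aesthetics only; no geometry is drawn yet
p <- ggplot(diamonds, aes(x = carat, y = price))

# Add one component at a time with +
p <- p + geom_point(alpha = 0.2)               # geometry layer
p <- p + labs(x = "Carat", y = "Price (USD)")  # axis labels
p <- p + theme_minimal()                       # overall look

p
```

Because a plot is an ordinary R object, each intermediate `p` can be printed to check the effect of the component just added.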

Rapid Data Exploration with dplyr and ggplot2

diamonds %>%                      # Start with the 'diamonds' dataset
  filter(cut == "Ideal") %>%      # Then, filter down to rows where cut == Ideal
  ggplot(aes(x=color,y=price)) +  # Then, plot using ggplot (layers are added with +, not %>%)
    geom_boxplot()                #  and draw a boxplot

Back to gtrendsR

library(gtrendsR)

# https://rdrr.io/cran/gtrendsR/man/gtrends.html

# define the keywords
keywords <- c("Paris", "New York", "Barcelona")
# set the geographic area: TW = Taiwan
country <- "TW"
# set the time window
time <- "2010-01-01 2018-08-27"
# set the channel
channel <- "web"

trends <- gtrends(keywords, gprop = channel, geo = country, time = time)
# select only interest over time
time_trend <- trends$interest_over_time
head(time_trend)

# named p rather than plot, to avoid masking base R's plot()
p <- ggplot(data = time_trend, aes(x = date, y = hits, group = keyword, col = keyword)) +
  geom_line() +
  xlab("Time") +
  ylab("Relative Interest") +
  theme_bw() +
  theme(legend.title = element_blank(), legend.position = "bottom", legend.text = element_text(size = 12)) +
  ggtitle("Google Search Volume")
p

Chinese-language references

http://molecular-service-science.com/2013/11/27/r-ggplot-tutorial-1/

http://molecular-service-science.com/2014/01/23/r-ggplot-tutorial-2/

Exploratory Data Analysis (1) : Data wrangling