Introduction to Programming and Data Science with R
week 6 Shu-Kai Hsieh
2021-03-25


在了解基本的 R語法之後,我們要進入整頓資料 (data wrangling/transformation)(進而理解資料)的學習階段。


幸福的人都很類似,不幸的人則各有各的不幸。 – Лев Николаевич Толстой

首先你會拿到資料,但是不整齊 (non-tidy) 的資料讓妳的人生被浪費。

Tidy datasets are all alike, but every messy dataset is messy in its own way.” –– Hadley Wickham

那整齊的資料是什麼意思?(不是不乾淨)

Tidy data

三原則

  1. 每個變數都必須有自己的資料欄 Each variable forms a column.
  2. 每個觀察都必須有自己的資料列 Each observation forms a row.
  3. 每個值都必須有自己的格子 Each value must have its own cell.


Tidy data processing methods


dplyr : a grammar for data wrangling

先看一下網站簡介 並安裝

可以開始利用 cheatsheet 來參考 (上傳品質很好的文本處理 cheatsheet 或 R package 也可以當成期末專案)

五個基本動作 Basic Five Verbs

Data manipulation with five verbs :filter(), select(), arrange(), mutate(), summarise()。 注意:直行(VAR)橫列 (OBS)

  • 篩選 filter(): take a subset of the rows (i.e., observations, OBS)
    按給定的邏輯判斷,篩選出符合要求的 OBS, 類似於 subset()

  • 選擇 select() : take a subset of the columns (i.e., variables, VAR)
    用 VAR 作參數來選擇 OBS。

  • 排列 arrange(): sort the rows
    按給定的 VAR 依次對 OBS 進行排序。類似於 order()

  • 增行 mutate(): add or modify existing columns
    對已有 VAR 進行運算並添加為新的 VAR。類似於 transform()

  • 摘要 summarise(): aggregate the data across rows
    對data frame 調用其它函數進行 summarise, 並回傳一維結果。

使用方法:Each of these functions takes a data frame as its first argument, and returns a data frame. First argument is a data frame, and subsequent arguments say what to do with data frame.

管線 Chaining/Pipelining

  • Usual way to perform multiple operations in one line is by nesting.
  • Can write commands in a natural order by using the %>% infix operator (which can be pronounced as 「then」)
  • Chaining increases readability significantly
# The easiest way to get dplyr is to install the whole tidyverse:
#install.packages("tidyverse")
library(tidyverse)

# play with the starwars data
# head(starswar)

# filter
starwars %>% 
  filter(species == "Droid")
# select
starwars %>% 
  select(name, ends_with("color"))
# mutate then select
starwars %>% 
  mutate(name, bmi = mass / ((height / 100)  ^ 2)) %>%
  select(name:mass, bmi)
# arrange
starwars %>% 
  arrange(desc(mass))

Grouping Data

  • 以上功能加上分組操作group_by()這個概念結合起來時非常強大!
# group_by then summarise then filter
starwars %>%
  group_by(species) %>%
  summarise(
    n = n(), # number of values in a vector
    mass = mean(mass, na.rm = TRUE)
  ) %>%
  filter(
    n > 1,
    mass > 50
  )

Exercise

  1. Are there more Droids or humans in the Star Wars movies? (anw: There are 5 Droids and 35 Humans. So more Humans.)
starwars %>% select(species) %>%
  filter(species=="Droid" | species=="Human") %>%
  group_by(species) %>%
  summarize(n=n())
  1. How many films are in the dataset? (hint: also check unlist(),unique())
starwars %>% 
  select(films) %>%
  unlist() %>%
  unique()
  1. Which of the Star Wars movies was Luke Skywalker in?
starwars %>% 
  filter(name=="Luke Skywalker") %>%
  select(films) %>%
  unlist()

Comparing base R approch 和 tidyverse approach

  • filter (keep rows marching criteria) : 篩選觀察
    • Base R approach to filtering forces you to repeat the data frame’s name, and dplyr approach is simpler to write and read: filter(df, 回傳符合邏輯條件的 rows)
# base R approach 
#starwars[starwars$height > 160 & starwars$sex == "female", ]

# dplyr approach
# note: you can use comma or ampersand to represent AND condition
filter(starwars, height > 160 & sex == "female")
  • select: Pick columns by name 選取變量
    • Base R approach is awkward to type and to read, dplyr approach uses similar syntax to filter.
# base R approach to select DepTime, ArrTime, and FlightNum columns
#starwars[, c("name", "height", "gender")]

# dplyr approach
select(starwars, name, height, gender)
# use colon to select multiple contiguous columns, and use `contains` to match columns by name
# note: `starts_with`, `ends_with`, and `matches` (for regular expressions) can also be used to match columns by name
# 或者使用 - 來排除某列
select(starwars, name:gender, contains("color"))
# nesting method to select name and height columns and filter for height > 90
#filter(select(starwars, name, height), height > 90)

# chaining method
starwars %>%
    select(name, height) %>%
    filter(height > 90)
  • Others
    • rename() 重命名變量 variable names: rename(tbl, newname = oldname,...)
    • Summarise function takes n inputs and returns 1 value
    • Window function takes n inputs and returns n values. Includes ranking and ordering functions (like min_rank()), offset functions (lead() and lag()), and cumulative aggregates (like cummean()).

Data wrangling with Multiple Tables

如果要處理的數據包含許多的表格怎麼辦?

left_join(), right_join(), inner_join()


ggplot2

#install.packages('gtrendsR')
library(gtrendsR)
trends <- gtrends(c("Nerds", "Smarties"), geo ="CA")
plot(trends)

#install.packages('plotly')
library(plotly)
p <-plot(trends)

ggplotly(p)

ggplot2 的基本文法

  • gg 代表 grammar of graphics

    • (data, aesthetics) + geometry

      • data: a data frame
      • aesthetics: used to indicate x and y variables, also used to control the color, size, shape of points, heights of bars, etc.
      • geometry: corresponds to the type of graphics (histogram, box plot,…)
library(ggplot2)
gg <- ggplot(diamonds, aes(price, carat)) +
  geom_point(color = "brown4") # scatter plot; size=1.5, shape=18

gg

  • Build plots steps by steps
gg <- gg + ggtitle("Diamond carat and price")

Rapid Data Exploration with dplyr and ggplot2

diamonds %>%                      # Start with the 'diamonds' dataset
  filter(cut == "Ideal") %>%      # Then, filter down to rows where cut == Ideal
  ggplot(aes(x=color,y=price)) +  # Then, plot using ggplot
    geom_boxplot()                #  with and create a boxplot

再回來玩 gtrendsR

library(gtrendsR)

# https://rdrr.io/cran/gtrendsR/man/gtrends.html

#define the keywords
keywords=c("Paris","New York","Barcelona")
#set the geographic area: DE = Germany
country=c('TW')
#set the time window
time=("2010-01-01 2018-08-27")
#set channels 
channel='web'



trends = gtrends(keywords, gprop =channel,geo=country, time = time )
#select only interst over time 
time_trend=trends$interest_over_time
head(time_trend)


plot<-ggplot(data=time_trend, aes(x=date, y=hits,group=keyword,col=keyword))+
  geom_line()+xlab('Time')+ylab('Relative Interest')+ theme_bw()+
  theme(legend.title = element_blank(),legend.position="bottom",legend.text=element_text(size=12))+ggtitle("Google Search Volume")
plot

中文資料參考

http://molecular-service-science.com/2013/11/27/r-ggplot-tutorial-1/

http://molecular-service-science.com/2014/01/23/r-ggplot-tutorial-2/