Introduction to Data Science with R

background-image: url(https://www.technotification.com/wp-content/uploads/2018/06/R-prograamming-for-data-science.jpg)
background-position: center
background-size: cover

class: title-slide

.bg-text[
# Introduction to Programming and Data Science with R
### week.4

<hr />

March 18, 2021  
謝舒凱
]

---
# Course Announcements
| Administrivia

.large[
- [Data Scientist with R: career track](https://learn.datacamp.com/career-tracks/data-scientist-with-r)

- Students in the top 20% on `DataCamp` earn bonus points toward their participation grade (current ranking)
  
  - Mid-term exam (Base R)
]

---
## The 統神 Hot-Pot Meme and Data Science

- After repeated prodding from students, the instructor decided to look into it as well.

- An 8-second clip spawned [abundant video memes](https://www.youtube.com/watch?v=QHROwZHvLR4)

---
## Internet Meme and Participatory Culture (`#TikTokization`)

- The **spreadability (virality)** and **popularity** of data products and phenomena are also topics in data science.

> What makes content go viral? Why do some videos, memes, songs, or movies become popular while others don't?

- .small[Computational social media science, a recently emerging field]

---
## Project Literacy Practice
.small[Motivation and goals, data processing pipeline, models and presentation]<br>
Explaining and Predicting the Popularity of (YouTube Videos)
--
.small[
- Identifying prospective popular videos
- Simulating video reaction to promotion schedules
- Comparing videos from different YouTubers / YouTube channels
- Visualizing the popularity series fitted and predicted by the proposed algorithm
]
--

Currently the strongest algorithm: `Hawkes Intensity Processes` for Social Media Popularity 
<img src="img/hip.png" alt="drawing" width="400"/>

--
[R implementation](https://github.com/computationalmedia/hipie)
<img style="float: right;" src="img/hipAPP.png" alt="drawing" width="350"/>

---
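## Suggestions on How to Learn

- Give me six hours to chop down a tree and I will spend the first four sharpening the axe.

- Get into the `$<g,t>$` state early.

- A good data scientist is not just a strong programmer; digital literacy is key.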

???
typeof() , class(), vs  mode()

---
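## Numeracy: Examples
`Number sense` and `logical sense` keep you from being manipulated by data, and from manipulating others with it.

- `Ratio bias` (ratio bias effect): people choose the lower probability of winning instead of the higher one, simply because of the way the ratios are expressed. Suppose there are two bowls holding different mixes of marbles:

  1. 10 marbles: 9 white, 1 red
  2. 100 marbles: 92 white, 8 red

Blindfolded, which bowl would you pick for the best chance of drawing a red marble? <br>
--
(53% chose bowl 2!)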
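
The comparison becomes easy once both ratios are on the same scale. A minimal sketch in R (the variable names are only for illustration):

```r
# Probability of drawing a red marble from each bowl
p_bowl1 <- 1 / 10    # bowl 1: 1 red out of 10 marbles
p_bowl2 <- 8 / 100   # bowl 2: 8 red out of 100 marbles

c(bowl1 = p_bowl1, bowl2 = p_bowl2)
p_bowl1 > p_bowl2    # TRUE: bowl 1 gives the better chance
```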

---
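## Numeracy: Examples

- `The lure of big numbers` <br>
--
"100 people die of cancer every day" `$<$` "36,500 people die of cancer every year"?

- `Denominator neglect` <br> 
--
"8 out of every 100 people win a prize!" `$>$` "1 out of every 10 people wins prize money"?

--
- `Manipulating foreground and background context` <br>

- Under the DPP government, 30% of people are worse off <br>
--
(.small[= compared with before, 70% of people are at least as well off])

- 30 out of every 50 residents live past the age of 70 <br>
--
(.small[= 60% of residents live past 70])

- `Isolating the background` <br>
--

"20 students were expelled for drug use" vs "This school has 2,000 students, and 99% of them have not used drugs"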

---
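## Numeracy: Another Example

What is wrong with this passage, and how would you fix it?

> Two out of ten people can lower their risk of heart disease through sufficient exercise. Another one third of people can reduce their risk of heart disease by 15 percent through sufficient exercise.

--
(Convert everything to percentages)

> Sufficient exercise lets 20% of people lower their risk by 30%, and lets 33% of people lower their risk by 15%. (We also learn that 47% of people fail to keep up sufficient exercise.) 
100 - (20 + 33) = 47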
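
The conversion can be checked in R. This is only a sketch of the arithmetic; the variable names are made up for illustration:

```r
share_group1 <- 2 / 10   # "two out of ten people"
share_group2 <- 1 / 3    # "another one third"

pct <- round(100 * c(share_group1, share_group2))
pct              # 20 33
100 - sum(pct)   # 47
```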

---
# Base R

Always keep a bird's-eye view of what you are learning.

---
## Data Structure

- Why do we need data structures?

- `$<g,t>$`: **data type** vs **data structure**?

- .large[R provides 6 basic data structures]

<span style="color:green; font-weight:bold">vector, matrix, array, factor, list, and data frame.</span>

- .large[The key: how to create, check, convert, access, manipulate, and compute with them]
    - **create, convert, access, manipulate, calculate**

???
list() vs as.list()

---
## Diagram

![](https://miro.medium.com/proxy/1*JjZYjvyBurwgQa1RBRtzAA.png)

---
## Vector

Review

- All vectors are one-dimensional and each element is of the same type.
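
A quick illustration of the "same type" rule: when values of different types are combined, R coerces them to one common type. The variable names are only for illustration.

```r
v_num <- c(1, 2, 3)
v_mix <- c(1, "a", TRUE)   # mixed input is coerced to character

class(v_num)   # "numeric"
class(v_mix)   # "character"
v_mix          # "1" "a" "TRUE"
```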

---
## Matrix

- A collection of elements with a two-dimensional layout (i.e., rows and columns).

- A matrix can contain elements of the *same* data type only (`character`, `numeric`, or `logical`).
- **create, convert, access, manipulate, calculation**

```r
m0 <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)
m1 <- matrix(1:25, nrow = 5, ncol = 5) # filled column by column; check byrow=
# rnames <- c("R1", "R2", "R3", "R4", "R5")
# cnames <- c("C1", "C2", "C3", "C4", "C5")
# m1 <- matrix(1:25, nrow = 5, ncol = 5, dimnames = list(rnames, cnames))
# class(m1); mode(m1)
```

---
## Matrix

```r
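# access
m1[3, 4]       # element in row 3, column 4
m1[, 3]        # all of column 3
m1[c(1:3), ]   # rows 1 to 3
# convert
v <- as.vector(m1); v   # flatten the matrix column by column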
```

---
## Matrix

- Another way is to bind rows or columns using `rbind()` and `cbind()`.
- You can also use the `byrow` argument to specify how the matrix is filled.

```r
# manipulate: merge and delete
(y <- c(1:10))
m2 <- matrix(y, nrow = 5, ncol = 2); m2
# (m2 <- matrix(y, nrow = 5, ncol = 2, byrow = FALSE))
(m3 <- rbind(m2, c(11, 12)))   # append a row
(m4 <- cbind(m3, c(13:18)))    # append a column
(m4 <- m4[-2, ])               # delete row 2 (m4[2, ] would keep only row 2)
```
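
To see what `byrow` changes, here is a small sketch comparing the two fill orders (the object names are only for illustration):

```r
by_col <- matrix(1:6, nrow = 2)                 # default: fill column by column
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)   # fill row by row

by_col[1, ]   # 1 3 5
by_row[1, ]   # 1 2 3
```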

---
background-image: url(../img/emo/boredom-small.png)
---
## Matrix
### Matrix operations

```r
# Transpose the whole matrix
t(m2)

# Matrix multiplication
m2 %*% t(m2)
```
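
A common source of confusion is `*` versus `%*%`. The sketch below uses a small square matrix (`sq` is a made-up name) to show that `*` multiplies element by element, while `%*%` is the matrix product.

```r
sq <- matrix(1:4, nrow = 2)   # columns are (1, 2) and (3, 4)

elementwise <- sq * sq        # each entry squared
product     <- sq %*% sq      # rows times columns

elementwise
product
```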

---
## Array

An array extends the matrix: a matrix is a 2-dimensional array, while an array can have more than 2 dimensions.

```r
# array(data = NA, dim = length(data), dimnames = NULL)
z <- c(1:30)
dim1 <- c("a1", "a2","a3")
dim2 <- c("b1","b2","b3", "b4", "b5")
dim3 <- c("c1","c2")
a <- array(z, dim = c(3,5,2), dimnames = list(dim1,dim2,dim3))
```
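
One way to see the link between arrays and matrices: fixing one index of a 3-dimensional array leaves a matrix. A minimal sketch (`arr` and `layer1` are illustrative names):

```r
arr <- array(1:30, dim = c(3, 5, 2))

layer1 <- arr[, , 1]   # fix the third index
is.matrix(layer1)      # TRUE
dim(layer1)            # 3 5
```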

---
## Array

```r
a[2,4,1]
```

```
## [1] 11
```

```r
a['a1','b4','c1']
```

```
## [1] 10
```

```r
dim(a)
```

```
## [1] 3 5 2
```

---
## Data Frame

.large[The data structure you will handle most often]

- A data frame is similar to a matrix, but its columns can hold elements of different types.

- It is the most commonly used data structure for analysis. The number of columns equals the number of observed variables; the number of rows equals the number of observations.

```r
# create, manipulate, access
# iris
(iris.simple <- data.frame(Sepal.Length = c(5.1, 4.7, 5.0), 
                           Sepal.Width = c(3.5, 3.2, 3.6), 
                           Petal.Length = c(1.4, 1.3, 1.4)))
```

```
##   Sepal.Length Sepal.Width Petal.Length
## 1          5.1         3.5          1.4
## 2          4.7         3.2          1.3
## 3          5.0         3.6          1.4
```

```r
# str(); dim(); summary()
```

---
## Data Frame

- `[]`, `$`, `subset()`

```r
iris.simple[,1]
iris.simple$Sepal.Width
iris.simple$Sepal.Width[2]
subset(iris.simple, Sepal.Length < 5)
```

---
## Data Frame

```r
## cbind(), rbind()
names(iris.simple)
names(iris.simple)[1] <- "sepal.length"
```

---
## Data Frame

- Basic arithmetic 
- Basic statistics: `mean(), median(), sum(), min(), max(), sd(), ...`

```r
# Practice: build a data frame yourself
students <- data.frame(c("Cedric","Fred","George","Cho","Draco","Ginny"),
                       c(3,2,2,1,0,-1),
                       c("H", "G", "G", "R", "S", "G"))
names(students) <- c("name", "year", "house") # name the columns
class(students)        # "data.frame"
class(students$year)   # "numeric"
class(students[,3])    # "character" in R >= 4.0; "factor" in R < 4.0 (stringsAsFactors default)
# find the dimensions
nrow(students)	
ncol(students)	
dim(students)	
```
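
The arithmetic and summary functions listed above work column by column. A minimal sketch with made-up scores (the `scores` data frame and its values are hypothetical):

```r
# Hypothetical scores, made up for illustration
scores <- data.frame(name   = c("A", "B", "C"),
                     R      = c(60, 90, 92),
                     PYTHON = c(60, 95, 93))

scores$total <- scores$R + scores$PYTHON             # arithmetic on whole columns
mean(scores$R)                                       # mean of one column
col_max <- sapply(scores[, c("R", "PYTHON")], max)   # max of each numeric column
col_max
summary(scores$total)
```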

---
## In-class Exercise

`mtcars` is a good dataset to practice on. (Post on `NTU cool` to let me know.)

```r
#mtcars             # The built-in data frame
#help(mtcars)
dim(mtcars)         # The dimensions(rows and columns)
nrow(mtcars)        # Number of rows
ncol(mtcars)        # Number of columns
names(mtcars)       # The column names
rownames(mtcars)    # The row names
summary(mtcars)     # A summary of each column
```

---
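## Factor

- A quick review of how statistics classifies "variables"
<img style='border: 1px solid;' width=40% src='./img/var.png'></img>

- In R, categorical variables (e.g., male / female) and ordinal variables (e.g., good / fair / poor) are called factors. You will often see them in data frames.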

Factors are variables which take on a limited number of values, aka categorical variables. In R, factors are stored as a vector of integer values with the corresponding set of character values you’ll see when displayed (colloquially, labels; in R, levels).

---
## Factor

- A factor can be seen as a special kind of vector whose elements are qualitative values.
Create one with `factor()` and get its levels (the values the categorical data can take) with `levels()`.

```r
gender <- c("female", "female", "male", "female", "male", "female")
gender.2 <- factor(gender)
levels(gender.2)
```
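
Factors are stored as integer codes plus a set of levels. The sketch below makes that visible on a tiny example (`g` is an illustrative name):

```r
g <- factor(c("female", "female", "male"))

levels(g)       # "female" "male"
as.integer(g)   # 1 1 2: the integer codes behind the labels
table(g)        # counts per level
```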

---
## Factor

```r
# convert to an ordered factor
honor <- c("cum laude", "summa cum laude", "cum laude", 
           "summa cum laude", "magna cum laude", "cum laude")
honor.fac <- factor(honor, 
                    levels = c("cum laude", "magna cum laude", "summa cum laude"), 
                    ordered = TRUE); honor.fac
```
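
What an ordered factor buys you is comparison: levels can be ranked with `<`, `>`, `max()`, and so on. A small sketch with made-up levels (`sizes` is hypothetical):

```r
sizes <- factor(c("small", "large", "medium"),
                levels = c("small", "medium", "large"),
                ordered = TRUE)

sizes[1] < sizes[2]   # TRUE: "small" ranks below "large"
max(sizes)            # large
```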

---
## List

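- A grab bag of data structures: its elements can be vectors, matrices, arrays, data frames, or even other lists.

- The elements of a list can also have different lengths.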

---
## List

- **create, access, manipulate**

```r
# create
v1 <- c(1:10)
v2 <- c("life", "is", "short")
m1 <- matrix(c(1:9), nrow=3)
f1 <- factor(c("positive", "negative", "negative", "neutral", "positive"))
name <- c("jessy", "jessica", "jessie")
R <- c(60, 90, 92)
PYTHON <- c(60, 95, 93)
piano <- c("great", "ok","ok")
df1 <- data.frame(name, R, PYTHON, piano)
mylist <- list(v1,v2,m1,f1, df1)
# name the elements (mind the syntax!)
mylist <- list(num = v1, char = v2, mat = m1, fac = f1, daframe = df1)
```

- `list()` vs. `as.list()`: create vs coerce
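
The difference shows up in the length of the result. A minimal sketch (`x` is an illustrative name):

```r
x <- 1:3

length(list(x))      # 1: one element that holds the whole vector
length(as.list(x))   # 3: each value becomes its own element
```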

---
## List

```r
## access: three ways: [[index]], [[element.name]], list$element.name
mylist[[1]]
mylist[["num"]]
mylist$num
```
- Use `table()` to build a contingency table; `prop.table()` converts the counts to proportions.

```r
table(mylist$fac)
```

```
## 
## negative  neutral positive 
##        2        1        2
```
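
Continuing the idea, `prop.table()` turns those counts into proportions. The sketch rebuilds the same five values under a new name (`sentiment`) so it runs on its own:

```r
sentiment <- factor(c("positive", "negative", "negative", "neutral", "positive"))

counts <- table(sentiment)
props  <- prop.table(counts)   # counts divided by the total
props                          # negative 0.4, neutral 0.2, positive 0.4
```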

---
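## Logical Flow: Conditionals and Loops

- The TA lab sessions will teach the R syntax; here we cover the important background first.
> The logic of a conditional is to make a partition that is exhaustive and mutually exclusive.

- `Exhaustive`: the rules are guaranteed to cover every case.
- `Exclusive`: the rules are guaranteed not to contradict each other.

For example, how would we write a ticket-pricing system?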
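
One possible sketch of such a system. The age cut-offs and prices are invented for illustration; the point is that every age falls into exactly one branch (exhaustive and exclusive).

```r
# Hypothetical fare rules: cut-offs and prices are made up for illustration
ticket_price <- function(age) {
  stopifnot(is.numeric(age), length(age) == 1, age >= 0)
  if (age < 12) {
    100   # child
  } else if (age < 65) {
    200   # adult
  } else {
    150   # senior: the final else catches everything left over
  }
}

ticket_price(8)    # 100
ticket_price(30)   # 200
ticket_price(70)   # 150
```

Because each `else if` is only reached when the earlier conditions failed, the branches cannot overlap, and the final `else` guarantees that no case is left out.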

---
## Logical Operations in R
Boolean value, truth table and Venn diagram
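
A truth table can be built directly in R. This is a minimal sketch; the column names are chosen only for illustration.

```r
p <- c(TRUE, TRUE, FALSE, FALSE)
q <- c(TRUE, FALSE, TRUE, FALSE)

truth <- data.frame(p, q,
                    p_and_q = p & q,      # TRUE only when both are TRUE
                    p_or_q  = p | q,      # TRUE when at least one is TRUE
                    not_p   = !p,         # negation
                    p_xor_q = xor(p, q))  # TRUE when exactly one is TRUE
truth
```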

---
## Operators in R

[Reference](https://www.datamentor.io/r-programming/operator/)

---
## Basic plotting

- `plot()` is the basic plotting function.

```r
#plot(iris)
#plot(iris$Sepal.Length, iris$Petal.Length)
```

- `qplot()` is a basic plotting function from the `ggplot2` package; the usage is similar, but is the result prettier?

![](index_files/figure-html/unnamed-chunk-19-1.png)

---
## In-class Exercise

- Combine the data above to build a data frame (unordered, categorical variables).
- Use `table()` to build a contingency table; `prop.table()` converts the counts to proportions.
- Plot it.

---
## Preparing/cleaning data

- In many cases, getting our data in the rectangular arrangement of a matrix or data frame is the first step in preparing it for analysis. 
- Data scientists are often said to spend as much as 60%-80% of their analysis time preparing the data.
  
  - (numerical data) : **handling missing data and outliers**
  - (textual data) : **tokenization**/**word segmentation**

---
## Missing values

> Missing values are values that should have been recorded but were not.

- A missing value is represented by `NA` (Not Available) for every type; in character or factor columns of a printed data frame it is shown as `<NA>`.

- Use `is.na()` to flag each missing element; 
`anyNA()` returns TRUE if the object contains any missing values.

```r
(missing_dat <- data.frame(col.1=c(1,NA,0,1),col.2=c("M","F",NA,"M")))

is.na(missing_dat$col.1)

anyNA(missing_dat)
# extract the non-missing values (one column at a time, so the type is preserved)
missing_dat$col.1[!is.na(missing_dat$col.1)]
```

---
## Missing values

- We can replace the NA with the mean value or we can **remove these NA rows**.

```r
(newdata <- na.omit(missing_dat))
```
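
The other option, replacing `NA` with the mean, can be sketched as follows. The vector `x` is a made-up example, and mean imputation is shown only as an illustration, not as a recommendation.

```r
x <- c(1, NA, 0, 1)

x_filled <- x
x_filled[is.na(x_filled)] <- mean(x, na.rm = TRUE)   # mean of the observed values
x_filled
anyNA(x_filled)   # FALSE
```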

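- Many functions take an `na.rm` argument; when it is set to TRUE, all NAs are dropped before the computation, otherwise `NA + [anything] = NA`. Note, however, that substituting or removing missing values is not necessarily sound from a methodological point of view.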

```r
sum(c(NA, 1,44,23,NA,99), na.rm = TRUE)
```

```
## [1] 167
```

???
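Try `is.na()` on NaN, NULL, and Inf: `is.na(NaN)` is TRUE, `is.na(Inf)` is FALSE, and `is.na(NULL)` returns `logical(0)`. The dedicated checks are `is.nan()`, `is.infinite()`, and `is.null()`.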

---
## Basic I/O

Know the defaults

- `read.table(file, header = FALSE, sep = "")`

- `write.table(x, file = "", append = FALSE, sep = " ",
     row.names = TRUE, col.names = TRUE)`

---
## Data input

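- `read.table()` is the most basic data input function. At a minimum, understand these arguments: `file, header, sep, stringsAsFactors`

  - **file**: a relative or absolute path, written with `/` or `\\`. (e.g., macOS `"~/dsR/data"`, Windows `"C:\\dsR\\data"`)
  - **header**: logical. If TRUE, the first row is used as the variable names.
  - **sep**: the field separator. The default `""` means any whitespace (spaces, tabs, newlines).
  - **stringsAsFactors**: whether strings are converted to factors. The default was TRUE before R 4.0.0 and is FALSE since then; set it explicitly if you depend on either behavior.
  - For data exported from Excel, use `na.strings = c("", "#N/A", "#DIV/0!", "#NUM!")`.
  - **fill**: load a data file whose rows have unequal numbers of fields. With `fill = TRUE`, the short rows are padded with blanks.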
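
To try these arguments without a file on disk, `read.table()` can read from a string via its `text` argument. The data below are made up for illustration; the last row is deliberately one field short.

```r
# Hypothetical data: the last row is missing its third field
raw <- "name score group
amy 90 A
bob NA B
cat 75"

dat <- read.table(text = raw, header = TRUE, sep = "",
                  stringsAsFactors = FALSE, fill = TRUE)
str(dat)            # inspect the column types
is.na(dat$score)    # the literal NA in the input is read as a missing value
```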
  
  
---
## For those not yet comfortable with file paths

```r
data <- read.table(file.choose())   # interactive file picker; works on macOS, Linux, and Windows
data <- read.table(choose.files())  # Windows only
```

---
## Data I/O: Output

- `row.names` and `col.names` are logical values. When set to TRUE, the row or column names are written out along with the data.

```r
write.csv(data, "~/dsR/data.csv",
          row.names    = FALSE,
          fileEncoding = "utf8"
)
```
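
A round trip is a simple way to check what was written. The sketch below uses a temporary file and a made-up data frame (`out`) so that nothing on disk is overwritten.

```r
out  <- data.frame(id = 1:3, score = c(60, 90, 92))   # hypothetical data
path <- tempfile(fileext = ".csv")                    # temporary file path

write.csv(out, path, row.names = FALSE)   # no extra column of row names
back <- read.csv(path)

names(back)   # "id" "score"
nrow(back)    # 3
unlink(path)  # clean up
```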

---
## In-class Exercise: Reading an External File

[Personality](http://personality-project.org/r/#getdata)

```r
personality <- read.table(
  "http://personality-project.org/r/datasets/maps.mixx.epi.bfi.data", 
  header = TRUE) # or: header = T
```

---
## Review

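The data science process:
- Define (operationally) a question that data can answer (the type of question determines the type of answer!)
- Collect and clean the data
- Explore and analyze the data (what if the data are not suited to the question?)
- Communicate (transfer your findings to action!!)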

---
## Group Exercise

<span style="color:green; font-weight:bold">Play with your own data</span>

```r
dsr <- read.csv("data/week3.in.class.csv", header = TRUE, stringsAsFactors = FALSE)
dsr.clean <- na.omit(dsr)
dsr.clean$gender <- factor(dsr.clean$gender)
dsr.clean$grade <- factor(dsr.clean$grade)
str(dsr.clean)
```

```
## 'data.frame':	80 obs. of  6 variables:
##  $ nickname : chr  "iakuhs" "iewob" "KingInWorld " "Sam" ...
##  $ gender   : Factor w/ 3 levels "0","1","2": 3 3 3 3 2 2 3 2 3 2 ...
##  $ grade    : Factor w/ 5 levels "1","2","3","4",..: 5 5 5 4 4 2 4 1 4 3 ...
##  $ q_self   : int  80 100 83 80 80 100 87 60 100 85 ...
##  $ q_teacher: int  100 100 88 90 100 0 100 100 100 100 ...
##  $ GPA      : num  4.5 4.3 3.63 3.7 3.7 4.3 4.4 0 3.5 3.9 ...
##  - attr(*, "na.action")= 'omit' Named int [1:6] 15 29 77 80 84 85
##   ..- attr(*, "names")= chr [1:6] "15" "29" "77" "80" ...
```

```r
table(dsr.clean$gender)
```

```
## 
##  0  1  2 
##  1 44 35
```

---
## Advanced Correlation Plots

Practice using the packages first; we will cover the theory in the statistics week.

```r
library(correlation)
library(see)
result <- correlation(dsr.clean); result
```

```
## # Correlation table (pearson-method)
## 
## Parameter1 | Parameter2 |     r |   CI |        95% CI | t(78) |      p
## -----------------------------------------------------------------------
## q_self     |  q_teacher | -0.05 | 0.95 | [-0.27, 0.17] | -0.48 | > .999
## q_self     |        GPA |  0.10 | 0.95 | [-0.12, 0.31] |  0.91 | > .999
## q_teacher  |        GPA | -0.09 | 0.95 | [-0.30, 0.13] | -0.80 | > .999
## 
## p-value adjustment method: Holm (1979)
## Observations: 80
```

---
##

```r
s <- summary(result)
plot(s, size_point = 2)
```

```
## Warning: Removed 1 rows containing missing values (geom_point).
```

![](index_files/figure-html/unnamed-chunk-28-1.png)

---

```r
plot(s, type = "tile")
```

![](index_files/figure-html/unnamed-chunk-29-1.png)

```r
plot(s, show_values = TRUE, show_p = TRUE, show_legend = FALSE)
```

```
## Warning: Removed 1 rows containing missing values (geom_point).
```

![](index_files/figure-html/unnamed-chunk-29-2.png)

---
## Gaussian Graphical Models (GGMs)

```r
library(ggraph)
result.2 <- correlation(dsr.clean, partial = TRUE); result.2
```

```
## # Correlation table (pearson-method)
## 
## Parameter1 | Parameter2 |     r |   CI |        95% CI | t(78) |      p
## -----------------------------------------------------------------------
## q_self     |  q_teacher | -0.05 | 0.95 | [-0.26, 0.18] | -0.40 | > .999
## q_self     |        GPA |  0.10 | 0.95 | [-0.12, 0.31] |  0.87 | > .999
## q_teacher  |        GPA | -0.09 | 0.95 | [-0.30, 0.14] | -0.76 | > .999
## 
## p-value adjustment method: Holm (1979)
## Observations: 80
```

---
## Gaussian Graphical Models (GGMs)

```r
plot(result.2)
```

![](index_files/figure-html/unnamed-chunk-31-1.png)