background-image: url(https://www.technotification.com/wp-content/uploads/2018/06/R-prograamming-for-data-science.jpg)
background-position: center
background-size: cover
class: title-slide

.bg-text[
# Introduction to Programming and Data Science with R
### week.4
<hr />
March 18, 2021

謝舒凱
]

---

# Course Announcements | Administrivia

.large[
- [Data Scientist with R: career track](https://learn.datacamp.com/career-tracks/data-scientist-with-r)

- Students in the top 20% on `DataCamp` earn bonus points toward their in-term grade (current ranking)

- Mid-term exam (Base R)
]

---

## The "統神端火鍋" Meme and Data Science

- After repeated prodding from students, the instructor decided to look into it too.
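- An 8-second clip spawned [abundant video memes](https://www.youtube.com/watch?v=QHROwZHvLR4)

---

## Internet Meme and Participatory Culture (`#TikTokization`)

- The **spreadability (virality)** and **popularity** of data products and phenomena are also topics in data science.

--

> What makes content go viral? Which (videos/memes/songs/movies/...) become popular, and why don't others?

--

- .small[Computational social media science, a recently emerging field (computational social media)]

<center>
<img src="img/viral.png" alt="drawing" width="400"/>
</center>

---

## Project Literacy Practice

.small[Motivation and goals, data-processing workflow, model and presentation]</br>

Explaining and Predicting the Popularity of (Youtube Videos)

--

.small[
- Identifying prospective popular videos
- Simulating video reaction to promotion schedules
- Comparing videos from different Youtubers / Youtube channels
- Visualizing the popularity series fitted and predicted by the proposed algorithm.
]

--

The current state-of-the-art algorithm: `Hawkes Intensity Processes` for Social Media Popularity

<img src="img/hip.png" alt="drawing" width="400"/>

--

[R implementation](https://github.com/computationalmedia/hipie)
<img style="float: right;" src="img/hipAPP.png" alt="drawing" width="350"/>

---

## Suggestions on How to Learn

- Give me six hours to chop down a tree, and I will spend the first four sharpening the axe.
- Get into the `\(<g,t>\)` state as early as possible
- A good data scientist is not just good at programming; digital literacy is the key.

???
typeof() , class(), vs mode()

---

## Numeracy: Examples

A `sense of numbers` and a `sense of logic` keep you from being manipulated by data, and from manipulating others.

--

- `Ratio bias` (ratio bias effect): people choose the lower probability of winning instead of the higher one, simply because of the way the ratios are expressed.

There are two bowls holding different sets of glass marbles:

1. 10 marbles: 9 white, 1 red
2. 100 marbles: 92 white, 8 red

If you had to draw blindfolded, which bowl would you pick for the best chance of drawing a red marble?

<br>

--

(53% chose bowl 2!)

---

## Numeracy: Examples

- `Big-number illusion` </br>

--

【100 people die of cancer every day】 `\(<\)` 【36,500 people die of cancer every year】?

--

- `Denominator neglect` (denominator neglect) <br>

--

【8 out of every 100 people win a prize!】 `\(>\)` 【1 out of every 10 people wins a prize】?

--

- `Framing: manipulating foreground and background` <br>

  - Under the DPP government, 30% of people are worse off <br>

--

(.small[= compared with before, 70% of people are at least as well off as they used to be])

--

  - Out of every 50 residents, 30 live past the age of 70 <br>

--

(.small[= 60% of residents live past 70])

--

- `Isolating the context` <br>

--

【20 students were expelled for drug use】 vs 【This school has 2,000 students, and 99% of them do not use drugs】

---

## Numeracy: Another Example

What is wrong with this passage? How would you rewrite it?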
> Two out of every ten people can lower their risk of heart disease through sufficient exercise. Another one third of people can reduce their risk of heart disease by fifteen percent through sufficient exercise.

--

(Convert everything to percentages)

> Sufficient exercise lets 20% of people lower their risk by 30%, and lets 33% of people lower their risk by 15%. (We also learn that 47% of people do not keep up sufficient exercise.)

100-(20+33) = 47

---

# Base R

Keep a bird's-eye view of what you are learning, and revisit it often.

---

## Data Structure

- Why do we need data structures?

- `\(<g,t>\)`: **data type** vs **data structure** ?

- .large[R provides 6 basic data structures] <span style="color:green; font-weight:bold">vector, matrix, array, factor, list and data frame.</span>

- .large[The key points: how to create, check, convert, access, manipulate and compute]

- **create, convert, access, manipulate, calculation**

???
list() vs as.list()

---

## Diagram

![](https://miro.medium.com/proxy/1*JjZYjvyBurwgQa1RBRtzAA.png)

---

## Vector: Review

- All vectors are one-dimensional and each element is of the same type.

---

## Matrix

- A collection of elements that has a two-dimensional representation (i.e., columns and rows).
- A matrix can contain elements of the *same* data type only. (`character`, `numeric`, `logical`)
- **create, convert, access, manipulate, calculation**

```r
m0 <- matrix(c(1,2,3,4,5,6), nrow = 2, ncol =3)
m1 <- matrix(1:25, nrow = 5, ncol = 5) # check byrow=
#rnames <- c("R1", "R2", "R3", "R4", "R5")
#cnames <- c("C1", "C2", "C3", "C4", "C5")
#m1 <- matrix(1:25, nrow = 5, ncol = 5, dimnames = list(rnames, cnames))
# class(m1); mode(m1)
```

---

## Matrix

```r
# access
m1[3,4]
m1[,3]
m1[c(1:3),]
# convert
v <- as.vector(m1);v
```

---

## Matrix

- Another way to build a matrix is to bind rows or columns using `rbind()` and `cbind()`
- You can also use the `byrow` argument to specify how the matrix is filled.
```r
# manipulate: merge and delete
(y <- c(1:10))
m2 <- matrix(y, nrow = 5, ncol = 2);m2
#(m2 <- matrix(y, nrow = 5, ncol = 2, byrow = F))
(m3 <- rbind(m2, c(11,12)))   # append a row
(m4 <- cbind(m3, c(13:18)))   # append a column
(m4 <- m4[2,])                # keeps only row 2; use m4[-2,] to delete row 2 instead
```

---

background-image: url(../img/emo/boredom-small.png)

---

## Matrix

### Matrix operations

```r
# Transpose the whole matrix
t(m2)
# Matrix multiplication
m2 %*% t(m2)
```

---

## Array

An array generalizes the matrix: a matrix is a 2-dimensional array, while an array can have more than 2 dimensions.

```r
# array(data = NA, dim = length(data), dimnames = NULL)
z <- c(1:30)
dim1 <- c("a1", "a2","a3")
dim2 <- c("b1","b2","b3", "b4", "b5")
dim3 <- c("c1","c2")
a <- array(z, dim = c(3,5,2), dimnames = list(dim1,dim2,dim3))
```

---

## Array

```r
a[2,4,1]
```

```
## [1] 11
```

```r
a['a1','b4','c1']
```

```
## [1] 10
```

```r
dim(a)
```

```
## [1] 3 5 2
```

---

## Data Frame

.large[The data structure you will work with most often]

- A data frame is similar to a matrix, but its columns can hold data elements of different types.
- It is the most commonly used data structure in most analyses. The number of columns equals the number of observed variables; the number of rows equals the number of observations.
```r
# create, manipulate, access
# iris
(iris.simple <- data.frame(Sepal.Length = c(5.1, 4.7,5.0),
                           Sepal.Width = c(3.5, 3.2, 3.6),
                           Petal.Length = c(1.4, 1.3,1.4)))
```

```
##   Sepal.Length Sepal.Width Petal.Length
## 1          5.1         3.5          1.4
## 2          4.7         3.2          1.3
## 3          5.0         3.6          1.4
```

```r
# str(); dim(); summary()
```

---

## Data Frame

- `[]`, `$`, `subset()`

```r
iris.simple[,1]
iris.simple$Sepal.Width
iris.simple$Sepal.Width[2]
subset(iris.simple, Sepal.Length < 5)
```

---

## Data Frame

```r
## cbind(), rbind()
names(iris.simple)
names(iris.simple)[1] <- "sepal.length"
```

---

## Data Frame

- Basic operations
- Basic statistics: `mean(), median(), sum(), min(), max(), sd(), ...`

```r
# Practice: build a data frame yourself
students <- data.frame(c("Cedric","Fred","George","Cho","Draco","Ginny"),
                       c(3,2,2,1,0,-1),
                       c("H", "G", "G", "R", "S", "G"))
names(students) <- c("name", "year", "house") # name the columns
class(students)       # "data.frame"
class(students$year)  # "numeric"
class(students[,3])   # "character" in R >= 4.0.0; "factor" in older versions
# find the dimensions
nrow(students)
ncol(students)
dim(students)
```

---

## In-class Exercise

`mtcars` is a good dataset to practice on. (Post on `NTU cool` to let me know.)

```r
#mtcars # The built-in data frame
#help(mtcars)
dim(mtcars)      # The dimensions (rows and columns)
nrow(mtcars)     # Number of rows
ncol(mtcars)     # Number of columns
names(mtcars)    # The column names
rownames(mtcars) # The row names
summary(mtcars)  # A summary of each column
```

---

## Factor

- A quick review of how "variables" are classified in statistics

<img style='border: 1px solid;' width=40% src='./img/var.png'></img>

- In R, categorical variables (e.g. 【male, female】) and ordinal variables (e.g. 【good, average, poor】) are called factors. You will often see them in data frames.

Factors are variables which take on a limited number of values, aka categorical variables. In R, factors are stored as a vector of integer values with a corresponding set of character values that are shown when the factor is displayed (colloquially, labels; in R, levels).
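
---

## Sketch: what a factor stores

The description of factors as integer codes plus a set of levels can be checked directly. This is a minimal sketch; the vector `sizes` is a made-up example, not data from the course.

```r
# Hypothetical example vector (not from the slides)
sizes <- factor(c("small", "large", "medium", "small"))

levels(sizes)      # the labels, sorted alphabetically by default
## [1] "large"  "medium" "small"

as.integer(sizes)  # the integer codes stored underneath
## [1] 3 1 2 3
```

Each code is the position of that element's label within `levels(sizes)`, which is why `"small"` maps to 3.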
---

## Factor

- A factor can be seen as a special kind of vector whose elements are values of a qualitative variable. Use `factor()` to create one and `levels()` to get its levels (values the categorical data can take).

```r
gender <- c("female", "female", "male", "female", "male", "female")
gender.2 <- factor(gender)
levels(gender.2)
```

---

## Factor

```r
# convert to an ordered factor
honor <- c("cum laude","summa cum laude", "cum laude", "summa cum laude", "magna cum laude","cum laude")
honor.fac <- factor(honor, levels =c("cum laude", "magna cum laude", "summa cum laude"), ordered = TRUE); honor.fac
```

---

## List

- The grab bag of data structures: its elements can be vectors, matrices, arrays, data frames, or even other lists.
- The elements of a list can also have different lengths.

---

## List

- **create, access, manipulate**

```r
# create
v1 <- c(1:10)
v2 <- c("life", "is", "short")
m1 <- matrix(c(1:9), nrow=3)
f1 <- factor(c("positive", "negative", "negative", "neutral", "positive"))
name <- c("jessy", "jessica", "jessie")
R <- c(60, 90, 92)
PYTHON <- c(60, 95, 93)
piano <- c("great", "ok","ok")
df1 <- data.frame(name, R, PYTHON, piano)
mylist <- list(v1,v2,m1,f1, df1)
# name the elements (mind the syntax!)
mylist <- list(num = v1, char = v2, mat = m1, fac = f1, daframe = df1)
```

- `list()` vs. `as.list()`: create vs coerce

---

## List

```r
## access: three ways: [[index]], [[element.name]], list$element.name
mylist[[1]]
mylist[["num"]]
mylist$num
```

- Use `table()` to build a contingency table; `prop.table()` converts the counts to proportions.

```r
table(mylist$fac)
```

```
## 
## negative  neutral positive 
##        2        1        2 
```

---

## Logical flow: conditionals and loops

- The TA lab sessions will teach the R syntax; here we cover the important background first.

> The logic of conditionals is to make a 【partition that is exhaustive and mutually exclusive】

- `exhaustive`: guarantees that the rules cover every case.
- `mutually exclusive`: guarantees that the rules do not contradict each other.

For example, how would we write a ticket-pricing system?

---

## Logical operations in R

Boolean value, truth table and Venn diagram

---

## Operators in R

[Reference](https://www.datamentor.io/r-programming/operator/)

---

## Basic plotting

- `plot()` is the basic plotting function.

```r
#plot(iris)
#plot(iris$Sepal.Length, iris$Petal.Length)
```

- `qplot()` is a basic plotting function from the `ggplot2` package; its usage is similar, but the output arguably looks nicer?
![](index_files/figure-html/unnamed-chunk-19-1.png)<!-- -->

---

## In-class Exercise

- Combine the data above into a data frame (unordered, categorical variables).
- Use `table()` to build a contingency table; `prop.table()` converts the counts to proportions.
- Plot the result.

---

## Preparing/cleaning data

- In many cases, getting our data into the rectangular arrangement of a matrix or data frame is the first step in preparing it for analysis.
- As much as 60%-80% of the time data scientists spend on data analysis goes into preparing the data.

- (numerical data) : **handling missing data and outliers**
- (textual data) : **tokenization**/**word segmentation**

---

## Missing values

> Missing values are values that should have been recorded but were not.

- A missing value is represented by `NA` (Not Available); in factor or character columns of a printed data frame it is displayed as `<NA>`.

- Use `is.na()` to flag which elements are NA; `anyNA()` returns TRUE if the object contains any missing values.

```r
(missing_dat <- data.frame(col.1=c(1,NA,0,1),col.2=c("M","F",NA,"M")))
is.na(missing_dat$col.1)
anyNA(missing_dat)
# extract the non-missing values
missing_dat[!is.na(missing_dat)]
```

---

## Missing values

- We can replace the NA with the mean value or we can **remove these NA rows**.

```r
(newdata <- na.omit(missing_dat))
```

- Many functions take an `na.rm` argument. When it is set to TRUE, NAs are dropped before the computation; otherwise `NA+[anything]=NA`. Note, however, that substituting or removing values is not always sound methodologically.

```r
sum(c(NA, 1,44,23,NA,99), na.rm = TRUE)
```

```
## [1] 167
```

???
NaN, NULL, Inf: check them with is.na()

<!-- --- -->
<!-- ## Reading big files with `data.table` -->
<!-- The `data.table` package is extremely useful, and much, much faster than `read.table`, for larger files.
-->
<!-- ```{r, echo=TRUE, results='hide'} -->
<!-- require(data.table) -->
<!-- students <- as.data.table(students) -->
<!-- students # note the slightly different print-out -->
<!-- students[name=="Ginny"] # get rows with name == "Ginny" -->
<!-- students[year==2] # get rows with year == 2 -->
<!-- ``` -->

---

## Basic I/O

Know the defaults

- `read.table(file, header = FALSE, sep = "")`
- `write.table(x, file = "", append = FALSE, sep = " ", row.names = TRUE, col.names = TRUE)`

---

## Data input

- `read.table()` is the most basic data-input function. At minimum, understand these arguments: `file, header, sep, stringsAsFactors`
    - **file**: a relative or absolute path, written with `/` or `\\`. (e.g., OSX `"~/dsR/data"`, Windows `"C:\\dsR\\data"`)
    - **header**: logical. If TRUE, the first row is used as the variable names.
    - **sep**: the field separator. The default `""` means any whitespace.
    - **stringsAsFactors**: logical. Before R 4.0.0 the default converted strings to factors; since R 4.0.0 the default is FALSE, so strings stay strings. Set it explicitly if the code must behave the same across versions.

- For data exported from Excel, use `na.strings = c("", "#N/A", "#DIV/0!", "#NUM!")`.

- **fill**: Load data file with columns of unequal length. If rows in the source file have unequal numbers of fields, use `fill=TRUE` to pad the short rows with blanks.

---

## For those not yet used to file paths

```r
data <- read.table(file.choose())   # works on macOS, Linux and Windows
data <- read.table(choose.files())  # Windows only
```

---

## Data I/O: output

- `row.names` and `col.names` can be logical values. If TRUE, the row or column names are written out as well.

```r
write.csv(data, "~/dsR/data.csv", row.names = FALSE, fileEncoding = "UTF-8")
```

---

## In-class Exercise

Practice reading an external file: [Personality](http://personality-project.org/r/#getdata)

```r
personality <- read.table(
  "http://personality-project.org/r/datasets/maps.mixx.epi.bfi.data",
  header = TRUE) # or: header = T
```

---

## Review

<img style='border: 1px solid;' width=50% src='./img/data-science.png'></img>

The data science process:

- Define (operationally) a question that can be answered with data (the type of question determines the type of answer!)
- Collect and clean the data
- Explore and analyze the data (what if the data are not suited to answering the question?)
- Communicate (transfer your findings to action!!)
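
---

## Sketch: a write/read round trip

The I/O and missing-value slides can be tied together with a small round trip. This is only a sketch under stated assumptions: the data frame `toy` and the temporary file path are invented for illustration and are not part of the course data.

```r
# Hypothetical toy data with one missing score (not course data)
toy <- data.frame(name = c("a", "b", "c"), score = c(90, NA, 75))

path <- tempfile(fileext = ".csv")        # temporary file, so nothing is overwritten
write.csv(toy, path, row.names = FALSE)   # NA is written as the string "NA"

back <- read.csv(path, header = TRUE, stringsAsFactors = FALSE)
str(back)                                 # check the column types after reading
anyNA(back)                               # the missing value survives the round trip
mean(back$score, na.rm = TRUE)            # drop the NA before averaging
```

Writing to `tempfile()` keeps the example self-contained; in real work the path would point into the project's data folder.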
---

## Group Practice

<span style="color:green; font-weight:bold">Play with your own data</span>

```r
dsr <- read.csv("data/week3.in.class.csv", header = TRUE, stringsAsFactors = FALSE)
dsr.clean <- na.omit(dsr)
dsr.clean$gender <-factor(dsr.clean$gender)
dsr.clean$grade <-factor(dsr.clean$grade)
str(dsr.clean)
```

```
## 'data.frame':	80 obs. of  6 variables:
##  $ nickname : chr  "iakuhs" "iewob" "KingInWorld " "Sam" ...
##  $ gender   : Factor w/ 3 levels "0","1","2": 3 3 3 3 2 2 3 2 3 2 ...
##  $ grade    : Factor w/ 5 levels "1","2","3","4",..: 5 5 5 4 4 2 4 1 4 3 ...
##  $ q_self   : int  80 100 83 80 80 100 87 60 100 85 ...
##  $ q_teacher: int  100 100 88 90 100 0 100 100 100 100 ...
##  $ GPA      : num  4.5 4.3 3.63 3.7 3.7 4.3 4.4 0 3.5 3.9 ...
##  - attr(*, "na.action")= 'omit' Named int [1:6] 15 29 77 80 84 85
##   ..- attr(*, "names")= chr [1:6] "15" "29" "77" "80" ...
```

```r
table(dsr.clean$gender)
```

```
## 
##  0  1  2 
##  1 44 35 
```

---

## Advanced Correlation Plots

Practice using the packages first; we will cover the theory in the statistics week.

```r
library(correlation)
library(see)
result <- correlation(dsr.clean); result
```

```
## # Correlation table (pearson-method)
## 
## Parameter1 | Parameter2 |     r |   CI |        95% CI | t(78) |      p
## -----------------------------------------------------------------------
## q_self     |  q_teacher | -0.05 | 0.95 | [-0.27, 0.17] | -0.48 | > .999
## q_self     |        GPA |  0.10 | 0.95 | [-0.12, 0.31] |  0.91 | > .999
## q_teacher  |        GPA | -0.09 | 0.95 | [-0.30, 0.13] | -0.80 | > .999
## 
## p-value adjustment method: Holm (1979)
## Observations: 80
```

---

##

```r
s <- summary(result)
plot(s, size_point = 2)
```

```
## Warning: Removed 1 rows containing missing values (geom_point).
```

![](index_files/figure-html/unnamed-chunk-28-1.png)<!-- -->

---

```r
plot(s, type = "tile")
```

![](index_files/figure-html/unnamed-chunk-29-1.png)<!-- -->

```r
plot(s, show_values = TRUE, show_p = TRUE, show_legend = FALSE)
```

```
## Warning: Removed 1 rows containing missing values (geom_point).
```

![](index_files/figure-html/unnamed-chunk-29-2.png)<!-- -->

---

## Gaussian Graphical Models (GGMs)

```r
library(ggraph)
result.2 <- correlation(dsr.clean, partial = TRUE); result.2
```

```
## # Correlation table (pearson-method)
## 
## Parameter1 | Parameter2 |     r |   CI |        95% CI | t(78) |      p
## -----------------------------------------------------------------------
## q_self     |  q_teacher | -0.05 | 0.95 | [-0.26, 0.18] | -0.40 | > .999
## q_self     |        GPA |  0.10 | 0.95 | [-0.12, 0.31] |  0.87 | > .999
## q_teacher  |        GPA | -0.09 | 0.95 | [-0.30, 0.14] | -0.76 | > .999
## 
## p-value adjustment method: Holm (1979)
## Observations: 80
```

---

## Gaussian Graphical Models (GGMs)

```r
plot(result.2)
```

![](index_files/figure-html/unnamed-chunk-31-1.png)<!-- -->
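
---

## Sketch: correlation in base R

The correlation tables in the preceding slides come from the `correlation` package. As a rough cross-check, and only as a sketch, a single Pearson coefficient can also be computed with base R's `cor()` and `cor.test()`. The class survey file is not available here, so the built-in `mtcars` data stands in for it; the choice of `mpg` and `wt` is purely illustrative.

```r
# Stand-in data: mtcars replaces the class survey (assumption for illustration)
r  <- cor(mtcars$mpg, mtcars$wt, method = "pearson")  # Pearson's r
ct <- cor.test(mtcars$mpg, mtcars$wt)                 # r with a t-test and 95% CI

r
ct$estimate   # the same coefficient as r
ct$conf.int   # 95% confidence interval
ct$p.value    # unadjusted p-value (no Holm correction applied)
```

`cor.test()` reports an unadjusted p-value for one pair of variables; when several pairs are tested, `p.adjust(p, method = "holm")` is the base R way to apply a Holm adjustment.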