class: title-slide

.bg-text[
# Introduction to Data Science with R
### week.3
<hr />
March 12, 2021

謝舒凱
]

---

# Course Information

- [The syllabus on the course website has been updated](https://lopentu.github.io/rlads2021/)

---

# Flipped Self-Study and Peer Learning

- `DataCamp` reading and practice assignments: 20% bonus toward the regular coursework grade
- You can also learn on your own with `swirl`; a [Chinese translation](https://datascienceandr.org/) is available
- [Language Analysis and Data Science Facebook group](https://www.facebook.com/groups/652099794893097/)

---

## Tool Reminder

You do not have to use RStudio (e.g. you can run `Rscript test.R` from the command line), but RStudio can do much more (to be continued...)

- Write notebooks and reports that are ([reproducible](http://rmarkdown.rstudio.com/lesson-1.html), [parameterized](http://rmarkdown.rstudio.com/developer_parameterized_reports.html), [interactive](http://rmarkdown.rstudio.com/lesson-14.html))
- Make presentations
- Build websites and web applications (using `shiny`)
- Build dashboards
- Write professional scientific documents (using `\(\LaTeX\)`)

---

## Play to Each Tool's Strengths

<img src="https://i2.wp.com/www.business-science.io/assets/2018-10-08-python-and-r/python_r_workflow.png?zoom=2&w=456" scale="50%" />

---

## Learning DS with R

`Mapping-based` learning mindset | R Programming Skills for Data Science: Writing Code to Wrangle, Analyze, and Visualize Data

**R syntax** `\(< >\)`

- Obtaining data
- Scrubbing data
- Exploring data
- Modeling data
- Interpreting (and reporting) data

---

## First Experience with R

```r
data()          # browse pre-loaded data sets
data(rivers)    # get this one: "Lengths of Major North American Rivers"
?rivers
head(rivers,10) # peek at the data set
```

```
##  [1]  735  320  325  392  524  450 1459  135  465  600
```

```r
length(rivers)  # how many rivers were measured?
```

```
## [1] 141
```

```r
summary(rivers) # what are some summary statistics?
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   135.0   310.0   425.0   591.2   680.0  3710.0
```

---

## First Experience with R

```r
# make a histogram; play around with these parameters
hist(rivers, col="blue", border="white", breaks=25)
```

![](index_files/figure-html/unnamed-chunk-2-1.png)<!-- -->

---

## In-class Exercise.1

- Try changing the color (use `colors()` to see which color names R recognizes)

```r
hist(log(rivers), col="sienna", border="white", breaks=25)
```

![](index_files/figure-html/unnamed-chunk-3-1.png)<!-- -->

---

## Base R and Package

An R package bundles `data sets` and `functions` designed to solve specific problems.

- Install a package published on CRAN: `install.packages("ggplot2")`
- Install a development package hosted on GitHub:

```r
#install.packages("devtools")
#devtools::install_github("shukai/coolR")
```

---

## Comparing Graphics Packages: `plot()` and Others

```r
library("ggplot2")
# qplot: the most basic plotting function in ggplot2
# (note: qplot() is deprecated as of ggplot2 3.4.0; ggplot() + geom_point() is the current form)
qplot(data = iris, Sepal.Length, Petal.Length, color = Species, size = Petal.Width)
```

![](index_files/figure-html/unnamed-chunk-5-1.png)<!-- -->

---

## Tips for Beginners

- `ctrl + l` clears the console display.
- `rm(list=ls())` removes the variables in the workspace.
- Note, however, that R can also run in a terminal. This matters for anyone who will later work on cloud servers, especially in combination with the **command line**.
- Always know where you are: `getwd()` and `Set Working Directory`

---

## Variables and Assignment

- R conventionally assigns values to variables with `<-` rather than the `=` common in other programming languages. (`=` also works for top-level assignment, but according to the official R documentation it can cause problems in some situations.)
- Variable names are case sensitive, so a and A are different variables.

```r
a <- 19
a
```

```
## [1] 19
```

---

## Modes and classes of R objects

- Variable naming rules, for example: a name cannot start with a number or an underscore; it must start with a letter (or a dot not followed by a digit); special characters such as @, #, $, and * are not allowed.
- Once a value is stored in a variable, it is an object. The two most important properties of an object are its `class` and its `mode` (*numeric*, *character*, *logical*, *function*).
- An object is simply something stored in R's memory.
- `mode()` returns the storage mode of an R object.
  It indicates the type under which the object is stored in memory; the concept of class is covered later.

```r
# use attributes() to display the attributes associated with the object
attributes(mtcars)
```

```
## $names
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
## 
## $row.names
##  [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"         
##  [4] "Hornet 4 Drive"      "Hornet Sportabout"   "Valiant"            
##  [7] "Duster 360"          "Merc 240D"           "Merc 230"           
## [10] "Merc 280"            "Merc 280C"           "Merc 450SE"         
## [13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
## [16] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"           
## [19] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"      
## [22] "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"         
## [25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"      
## [28] "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"       
## [31] "Maserati Bora"       "Volvo 142E"         
## 
## $class
## [1] "data.frame"
```

```r
mode(rivers)
```

```
## [1] "numeric"
```

```r
# cf. typeof(rivers)  # provides info about the underlying data structure of objects
```

---

## Data Types and Basic Arithmetic

The data types include the following; use the `mode` function to check them.

- **Numeric**: real numbers (can be written as integers, floating-point numbers, or in scientific notation)

```r
b <- 8.31
mode(b)
```

```
## [1] "numeric"
```

- **Character**: text strings, enclosed in "" or ''

```r
c <- 'coding'
mode(c)
```

```
## [1] "character"
```

---

## Data Types and Basic Arithmetic

- **Logical**: two values, `TRUE` (T) and `FALSE` (F)

```r
d <- F
mode(d)
```

```
## [1] "logical"
```

- **Complex**: values that include an imaginary part, `\(a+bi\)`

```r
e <- 2+3i
mode(e)
```

```
## [1] "complex"
```

---

## NA and NULL

- NA (*missing*)
- NULL (*undefined*)

---

## Type Coercion

- If a vector contains both numeric and logical elements, its mode is numeric, and the logical elements are automatically converted to numeric (`TRUE` becomes 1, `FALSE` becomes 0).
- If a vector contains a character element alongside numeric and logical elements, all elements are automatically converted to character mode.
```r
# R object containing both numeric and logical elements
x <- c(2, 4, TRUE, 6, FALSE, 8); mode(x)
```

```
## [1] "numeric"
```

```r
# R object containing character, numeric, and logical elements
y <- c(1,2,TRUE,FALSE,"a"); mode(y)
```

```
## [1] "character"
```

???

Rushing to finish homework

---

## Checking and Converting Data Types

| Type      | Meaning   | Check          | Convert        |
|-----------|-----------|----------------|----------------|
| numeric   | number    | is.numeric()   | as.numeric()   |
| character | string    | is.character() | as.character() |
| logical   | logical   | is.logical()   | as.logical()   |
| complex   | complex   | is.complex()   | as.complex()   |
| NA        | missing   | is.na()        | (none; base R has no `as.na()`, assign `NA` directly) |

```r
is.character(b)
```

```
## [1] FALSE
```

```r
as.character(b)
```

```
## [1] "8.31"
```

---

## Data Structures

- A set of (two or more) data elements, of the same or different **data types**, combined together forms a **data structure**.
- R provides 6 basic data structures: `vector`, `matrix`, `array`, `factor`, `list`, `data frame`.
- The learning focus is on **how to create and access** them.

---

## Vector

A combination of multiple values (`numeric`, `character`, or `logical`).

### Create

- `c()` ('concatenate'; to concatenate scalars)
- `:` generates an arithmetic sequence with a step of 1.
- `seq()` generates an arithmetic sequence with a step of your choosing.
- `rep()` generates a vector of repeated values.

```r
g <- c(1,2,3)
h <- c('me','you')
i <- 1:6
j <- seq(from=1, to=10, by=2)
k <- rep(1:4, times=3, each=2)
```

---

background-image: url(../img/emo/boredom-small.png)

---

## Vector

### Access

- Get a subset of a vector: `my_vec[i]` returns the `i`th element.
- Calculations with vectors: `max(), min(), length(), sum(), mean(), sd(), var()`, etc.

```r
m <- c(2:10)
m[1]
```

```
## [1] 2
```

```r
m[1:3]
```

```
## [1] 2 3 4
```

---

# Literacy and Project Simulation

### Global Citizenship

- Measuring national strength
- Thinking about the quality and inequality of human life (e.g.
  Gini index, Human Development Index (HDI))
- War and natural disasters

---

# World Happiness Report

- Published every year on the International Day of Happiness (3/20).
- Taiwan has ranked 25th for two consecutive years, again placing first in Asia.

--

- A sample survey covering several dimensions (0-10 scale ladder survey)
  - GDP per capita
  - Healthy life expectancy
  - Social support
  - Perceptions of corruption (in the public sector)
  - Freedom to make life choices
  - Generosity

---

# Project Reading

[沒傘的孩子,有著想跑得比雨還快的倔強](https://www.taisounds.com/Taiwan/Society/uid4616131661)

---

# Questions:

- Money cannot buy happiness?
- ....

> Practice answering [these questions](https://worldhappiness.report/faq/) with data

---

# References

- [data and analysis on Kaggle](https://www.kaggle.com/mathurinache/world-happiness-report)
- [The UN's World Happiness Report Data Analysis with R, 2020](https://rpubs.com/LeonaAnn/645318)

---

## In-class Exercise.2

Preparing/Obtaining Data

- Data formats
  - Comma separated values (`*.csv`)
  - Tab-delimited text file (`*.txt` or `*.tbl`)
  - MS Excel file (`*.xls` or `*.xlsx`)
  - R data object (`*.RData`)
- Data sources
  - Web (downloading; scraping and parsing data from the **web** (raw HTML sources); interacting with APIs)
  - Database

---

## Data Preparation Rules

- Use the first row as **column names** (which represent *variables*).
- Use the first column as **row names** (which represent *observations*).
- Avoid names with blank spaces. Good example: `person_look` or `person.look`. Bad example: `person look`.
- Avoid names with special symbols (except `_`).
- Avoid beginning variable names with a number. Good: `run_100m`, Bad: `100m`.
- R is case sensitive, and row and column names should be unique.
- Avoid blank rows in your data; delete any comments in your file.
- Replace missing values with **NA**.
- Use a four-digit year in dates. Good: `01/06/1970`, Bad: `01/06/70`.
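
A minimal sketch of how some of these rules could be checked from within R. The column names below are hypothetical, invented for illustration, and the day/month/year reading of the date is an assumption (the slide does not say which order is intended).

```r
# Hypothetical column names, invented for illustration only
raw_names <- c("person look", "100m", "run_100m", "GPA")

# make.names() converts arbitrary strings into syntactically valid R names
clean_names <- make.names(raw_names, unique = TRUE)

# A name that make.names() leaves unchanged already follows the naming rules
is_valid <- raw_names == clean_names
print(data.frame(raw_names, clean_names, is_valid))

# A four-digit year parses unambiguously once the format is stated explicitly
# (assumption: the date is written as day/month/year)
parsed <- as.Date("01/06/1970", format = "%d/%m/%Y")
print(format(parsed, "%Y"))
```

Running a check like this before analysis shows which headers R would silently rename on import, so they can be fixed in the spreadsheet instead.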

---

## In-class Exercise.2

Collaboration: the baby step

- Go to the [shared doc](https://docs.google.com/spreadsheets/d/1uJ5bwY1YraZ2-anbaS6zVRMJSq3MdyBjIQ7-gEWrR3E/edit?usp=sharing)
- Type in your data
- Download it as `csv`, and read the file into R
- Take a quick look at the data and do some preliminary analysis (in groups)

---

## File Column Specification

The link is posted on this week's sli.do

- `gender`: 0 (trans), 1 (female), 2 (male)
- `grade`: 1-2-3-4-5
- `q.self`: 0-100
- `GPA`: previous average GPA
- `q.teacher`: 0-100

---

## Reading the File after Downloading as csv

- Save the file as `Rclass.csv`, then import the data into R.

```r
# csv  (comma separated values);
# csv2 (semicolon separated values)
# my_Rclass <- read.csv(file.choose())
# or
# my_Rclass <- read.csv("https://bit.ly/3bAXigD")[1:88, 1:6]
```

- Explore the data and see what you can find
- [For advanced R users] Use the `correlation` and `see` packages ([here](https://easystats.github.io/see/articles/correlation.html)) for prettier plots.
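
One possible way to start exploring, sketched on a small stand-in data frame. The values below are invented for illustration and are not the class data; only the column names and the gender coding follow the specification above.

```r
# Hypothetical stand-in for the class survey; all values are invented
my_Rclass <- data.frame(
  gender    = c(1, 2, 2, 1, 0, 2),
  grade     = c(1, 2, 3, 4, 5, 2),
  q.self    = c(60, 75, 80, 55, 90, NA),
  GPA       = c(3.2, 3.8, 3.5, 2.9, 4.0, 3.6),
  q.teacher = c(70, 85, 80, 65, 95, 88)
)

str(my_Rclass)      # column types and a preview of the values
summary(my_Rclass)  # per-column summary statistics, including NA counts

# Recode gender as a labelled factor, following the column specification
my_Rclass$gender <- factor(my_Rclass$gender,
                           levels = c(0, 1, 2),
                           labels = c("trans", "female", "male"))
gender_counts <- table(my_Rclass$gender)
print(gender_counts)

# Mean self-rating, skipping the missing value
mean_self <- mean(my_Rclass$q.self, na.rm = TRUE)

# Correlation between the two ratings, using only complete pairs
r <- cor(my_Rclass$q.self, my_Rclass$q.teacher, use = "complete.obs")
print(c(mean_self = mean_self, r = r))
```

With the real file, replace the stand-in data frame with the result of `read.csv()`; the remaining steps should carry over as long as the column names match.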