Introduction to Data Science with R

class: title-slide

.bg-text[
# Introduction to Data Science with R
### week.3

<hr />

3月 12, 2021  
謝舒凱
]

---
# 課程資訊

- [課程網頁課綱已更新](https://lopentu.github.io/rlads2021/)

---
# 自學翻轉與互學精神

- `DataCamp` 閱讀與練習作業: 20% 同學平時成績加分

- 自己也可以利用 `swirl` 學習，也有
  [中文翻譯](https://datascienceandr.org/)

- [語言分析與資料科學臉書社團](https://www.facebook.com/groups/652099794893097/)
---
## 工具提醒

不一定要用 RStudio (e.g. 指令列 `Rscript test.R`)， 但它還可以做很多事 (to be continued...)

- 做 ([可重製 reproducible](http://rmarkdown.rstudio.com/lesson-1.html)、[可調參數 Parameterized](http://rmarkdown.rstudio.com/developer_parameterized_reports.html)、[互動型 interactive](http://rmarkdown.rstudio.com/lesson-14.html)) 筆記(notebook) 與報告 (report)

- 做投影片 (presentation)

- 做網站 (website) 與 web application (using `shiny`)

- 做「數位報表」(dashboard)

- 做專業科學文件 (using `$\LaTeX$`)

---
## 各取強項

---
## Learning DS with R

`對應式`學習意識 | R Programming Skills for Data Science: Writing Code to Wrangle, Analyze, and Visualize Data

**R syntax** `$< >$`

- 【獲取】Obtaining data
- 【整理】Scrubbing data
- 【探索】Exploring data
- 【建模】Modeling data
- 【詮解】Interpreting (and reporting) data

---
## R 的初體驗

```r
data()	      # browse pre-loaded data sets
data(rivers)	# get this one: "Lengths of Major North American Rivers"
?rivers
head(rivers,10)	# peek at the data set
```

```
##  [1]  735  320  325  392  524  450 1459  135  465  600
```

```r
length(rivers)	# how many rivers were measured?
```

```
## [1] 141
```

```r
summary(rivers) # what are some summary statistics?
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   135.0   310.0   425.0   591.2   680.0  3710.0
```

---
## R 的初體驗

```r
# make a histogram; play around with these parameters
hist(rivers, col="blue", border="white", breaks=25) 
```

![](index_files/figure-html/unnamed-chunk-2-1.png)

---
## In-class Exercise.1  
- 換看看顏色 (用 `colors()`  看 R 認識什麼顏色)

```r
hist(log(rivers), col="sienna", border="white", breaks=25) 
```

![](index_files/figure-html/unnamed-chunk-3-1.png)

---
## Base R and Package
An R package contains `data sets` and specific `functions` to solve specific question.

- 安裝發佈在 CRAN 上的套件:  `install.packages("ggplot2")`
- 安裝在 Github 上的開發套件：

```r
#install.packages("devtools")
#devtools::install_github("shukai/coolR")
```

---
## 比較圖形套件
`plot()` and others

```r
library("ggplot2")
# qplot: ggplot2 中最基本的繪圖函數
qplot(data = iris, Sepal.Length, Petal.Length, color = Species, size = Petal.Width)
```

![](index_files/figure-html/unnamed-chunk-5-1.png)

---
## 入門小秘訣

- `ctrl + l` 清除 console 的顯示內容。
- `rm(list=ls())` 清除 workspace 中的變數。
- 但請注意：R 也可以在終端機執行：對於日後在雲端伺服器工作者，特別是結合**指令列 (command line)** 很重要。
- 隨時知道妳在那裡：`getwd()` and `Set Working Directory`

---
## 變數 (variable)、賦值 (assignment)

- R 在給予變數值時是利用`<-` 而不是其他程式語言中常見的 `=`。（根據 R 官方文件解釋因為在某些狀況是會出問題）。
- 變數命名中，大小寫有所區別。所以 a 與 A 是不同的變數。

```r
a <- 19
a
```

```
## [1] 19
```

---
## Modes and classes of R objects

- 變數命名規則舉例：cannot start with numbers; it will start with a character or underscore; no special character allowed, such as @, #, $, and *.

- 存入變數後，它就是個物件 (object)。兩種最重要的物件屬性 (attribute) 是 `class` 與 `mode` (*numeric, character*, *logical*, *function*).
  - An object is simply something stored in R's memory.
  - The `mode()` returns the storage mode of R objects. 表示物件在記憶體中是何種類型存儲的；類別概念以後再談。

```r
# use attributes() to display the attributes associated with the object
attributes(mtcars)
```

```
## $names
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
## 
## $row.names
##  [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"         
##  [4] "Hornet 4 Drive"      "Hornet Sportabout"   "Valiant"            
##  [7] "Duster 360"          "Merc 240D"           "Merc 230"           
## [10] "Merc 280"            "Merc 280C"           "Merc 450SE"         
## [13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
## [16] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"           
## [19] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"      
## [22] "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"         
## [25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"      
## [28] "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"       
## [31] "Maserati Bora"       "Volvo 142E"         
## 
## $class
## [1] "data.frame"
```

```r
mode(rivers)
```

```
## [1] "numeric"
```

```r
# cf. typeof(rivers) # provide info about the underlying data structure of objects
```

---
## 資料類型 (Data type) 與基本運算 (basic arithmetic)

資料類型包含以下幾種，可用 `mode` 函數判斷

- **數值型 (numeric)**：實數（可以寫成整數 integers，小數 floating numners，或 科學記述 scientific notations）

```r
b <- 8.31
mode(b)
```

```
## [1] "numeric"
```

- **字符型 (character)**：文字字串，放入 "" 或 '' 中

```r
c <- 'coding'
mode(c)
```

```
## [1] "character"
```

---
## 資料類型 (Data type) 與基本運算 (basic arithmetic)

- **邏輯型 (logical)**：`TRUE`(T) 和 `FALSE`(F) 兩個值

```r
d <- F
mode(d)
```

```
## [1] "logical"
```

- **複數型 (complex)** :取值包含虛數 `$a+bi$`

```r
e <- 2+3i
mode(e)
```

```
## [1] "complex"
```

---
## NA and NULL

- NA (*missing*)
- NULL (*undefined*)

---
## 資料類型強制轉換 (type coercion)：
  
  - If an R object contains both numeric and logical elements, the mode of that object will be numeric and in that case the logical element automatically gets converted to numeric. 
  - if any R object contains a character element along with both numeric and logical elements, it automatically converts to the character mode.

```r
# R object containing both numeric and logical element
x <- c(2, 4, TRUE, 6, FALSE, 8); mode(x)
```

```
## [1] "numeric"
```

```r
# R object containing character, numeric, and logical elements
y <- c(1,2,TRUE,FALSE,"a"); mode(y)
```

```
## [1] "character"
```

???
趕作業

---
## 資料類型的判斷與轉換

| 類型      | 意義    | 判斷         | 轉換         |
|-----------|---------|--------------|--------------|
| numeric   | 數值    | is.numeric()   | as.numeric()   |
| character | 字符    | is.character() | as.character() |
| logical   | 邏輯    | is.logical()   | as.logical()   |
| complex   | 複數    | is.complex()   | as.complex()   |
| NA        | 缺失    | is.na()        | as.na()        |

```r
is.character(b)
```

```
## [1] FALSE
```

```r
as.character(b)
```

```
## [1] "8.31"
```

---
## 資料結構 Data structure

- 一組（2 個以上）相同或不同**資料類型**的資料元素組合在一起形成**資料結構**.

- R 提供 6 個基本的資料結構：`vector`, `matrix`, `array`, `factor`, `list`, `data frame`.

- 學習重點在於**如何建立 create 與檢索 access**

---
## 向量 Vector
a combination of multiple values (`numeric, character` or `logical`)

### 建立
- `c()` ('concatenate'; to concatenate scalars)
- `:`  可產生差距為 1 的等差數列向量。
- `seq()` 可產生等差數列向量，差距值可以自行決定。
- `rep()` 可產生重複數值的向量。

```r
g <- c(1,2,3)
h <- c('me','you')
i <- 1:6
j <- seq(from=1, to=10, by=2)
k <- rep(1:4, times=3, each=2)
```

---
background-image: url(../img/emo/boredom-small.png)
---
## 向量 Vector
### 檢索 access

- Get a subset of a vector: `my_vec[i]` to get the `ith` element.
- Calculations with vectors: `max(), min(), length(), sum(), mean(),sd(),var()`, etc.

```r
m <- c(2:10)
m[1]
```

```
## [1] 2
```

```r
m[1:3]
```

```
## [1] 2 3 4
```

---
# 素養與專案模擬

### 國際公民

- 國力衡量

- 思考人命的品質與不均 （e.g. 基尼指數 Gini index、人類發展指數 Human Development Index (HDI)）

- 戰爭與自然災害

---
# 全球幸福指數報告 | World Happiness Report

- 每年國際幸福日 (3/20) 公布。

- 台灣連續兩年排名 25 名，蟬聯亞洲第一。

- 幾個面向的抽樣問卷調查 (0-10 scale ladder survey)
  - 人均國內生產毛額 GDP
  - 預期健康壽命 Healthy life expectancy 
  - 社會救助 Social support 
  - 感覺公部門貪汙的程度 Perceptions of corruption
  - 人生選擇的自由度 Freedom to make life choices 
  - 慷慨程度 Generosity

---
# 專案閱讀

[沒傘的孩子，有著想跑得比雨還快的倔強](https://www.taisounds.com/Taiwan/Society/uid4616131661)

---
# 問題：

- Money cannot buy happiness?
- ....

> 練習用數據回答[這些問題](https://worldhappiness.report/faq/)

---
# 參考

- [data and analysis on Kaggle](https://www.kaggle.com/mathurinache/world-happiness-report)

- [The UN's World Happiness Report Data Analysis with R, 2020](https://rpubs.com/LeonaAnn/645318)

---
## 課堂練習 In-class Exercise.2
Preparing/Obtaining Data

- 資料格式 
 - Comma separated values (`*.csv`)
 - Text file with Tab delimited (`*.txt` or `*.tbl`)
 - MS Excel file (`*.xls` or `*.xlsx`)
 - R data object (`*.RData`)

- 資料來源 
  - Web (下載；網路爬蟲 Scraping and parsing data from the **web** (raw HTML sources)； Interacting with APIs)
  - 資料庫 database

---
## 基本備檔
Data preparation rule

- use the first row as **column names** (which represent *variables*).

- Use the first column as **row names** (which represent *observations*).

- Avoid names with blank spaces. Good example：`person_look`, or `person.look`. Bad example: `person look`

- Avoid names with special symbols (excpet `_`)

- Avoid beggining variable names with a number. Good: `run_100m`, Bad: `100m`.

- R is case sensitive; and row/column name should be unique.

- Avoid blank rows in your data; delete any comments in your file.

- Replace missing values by **NA**.

- Use the four digit format for data. Good: `01/06/1970`, Bad: `01/06/70`.

---
## 課堂練習 In-class Exercise.2
collaboration: the baby step

- go to [shared doc](https://docs.google.com/spreadsheets/d/1uJ5bwY1YraZ2-anbaS6zVRMJSq3MdyBjIQ7-gEWrR3E/edit?usp=sharing)

- type in your data
- download it as `csv`, and read the file into R

- quick look at the data and do some preliminary analysis (in groups)

---
## 檔案欄位說明
specification

連結放在本週 sli.do

-【gender】: 0(trans)-1(female)-2(male)

-【grade】: 1-2-3-4-5

-【q.self】: 0-100

-【GPA】: previous average of GPA

-【q.teacher】: 0-100

---
## 下載成 csv 後讀檔

- Save as `Rclass.csv`, Importing data into R.

```r
# csv (逗號分隔 comma separated value file); 
# csv2 (分號分隔 semicolon separated values)

# my_Rclass <- read.csv(file.choose())
# or
# my_Rclass <- read.csv("https://bit.ly/3bAXigD")[1:88, 1:6]
```

- Explore the data and see what you can find

- [For advanced R user] Use `correlation` and `see` packages [here](https://easystats.github.io/see/articles/correlation.html) to have prettier plotting.