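Read titanic.csv from the assignment folder as a tibble (recommended) or a data.frame and store it in the variable titanic. Note that titanic.csv is a csv file that uses semicolons as the delimiter, so readr::read_csv() (which expects commas as the delimiter) will not read the file correctly.
Hint: see readr::read_delim(), or choose the appropriate "Delimiter" in the RStudio Import Dataset dialog.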
# write your code here
titanic <- readr::read_delim("titanic.csv", delim = ";",
                             escape_double = FALSE, trim_ws = TRUE)
# Do not modify the code below
head(titanic)
# should print out:
#> # A tibble: 6 x 12
#>   PassengerId Survived Pclass Name  Sex     Age SibSp Parch Ticket  Fare Cabin
#>         <dbl> <chr>     <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr>  <dbl> <chr>
#> 1         343 No            2 Coll… male     28     0     0 248740 13    <NA>
#> 2          76 No            3 Moen… male     25     0     0 348123  7.65 F G73
#> 3         641 No            3 Jens… male     20     0     0 350050  7.85 <NA>
#> 4         568 No            3 Pals… fema…    29     0     4 349909 21.1  <NA>
#> 5         672 No            1 Davi… male     31     1     0 F.C. … 52    B71
#> 6         105 No            3 Gust… male     37     2     0 31012…  7.92 <NA>
#> # … with 1 more variable: Embarked <chr>
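As a rough illustration of why the delimiter matters, the sketch below writes a tiny semicolon-delimited file to a temporary path and reads it both ways. The columns id, name and fare are made up for the example and are not the real titanic.csv; it assumes the readr package is installed.

```r
library(readr)

# Hypothetical three-column file that uses ";" as its delimiter
tmp <- tempfile(fileext = ".csv")
writeLines(c("id;name;fare", "1;Ann;7.65", "2;Bob;13"), tmp)

# read_csv() splits on commas, so every row collapses into a single column
wrong <- suppressMessages(read_csv(tmp))

# read_delim() with delim = ";" recovers the three columns
right <- suppressMessages(read_delim(tmp, delim = ";"))

ncol(wrong)
ncol(right)
```

Comparing ncol() of the two results is a quick way to notice that a file was read with the wrong delimiter.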
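titanic is the list of passengers who were aboard the Titanic in the famous sinking. The data also contain further information about these passengers:
- Pclass: the ticket class the passenger bought. There are 3 classes; 1 is the highest (most expensive) and 3 is the lowest.
- Survived: whether the passenger survived the sinking.
- Sex: the passenger's sex.
Use R's data-wrangling functions to:
- group the passengers by sex (Sex) and ticket class (Pclass)
- compute the survival rate of each group (percent_survived)
# Write your code here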
library(dplyr)
titanic %>%
  group_by(Sex, Pclass) %>%
  summarise(percent_survived = mean(Survived == "Yes"))
# Should print out:
#> # A tibble: 6 x 3
#> # Groups:   Sex [2]
#>   Sex    Pclass percent_survived
#>   <chr>   <dbl>            <dbl>
#> 1 female      1            0.968
#> 2 female      2            0.921
#> 3 female      3            0.5
#> 4 male        1            0.369
#> 5 male        2            0.157
#> 6 male        3            0.135
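The survival rate above relies on the fact that R treats TRUE as 1 and FALSE as 0, so the mean of a logical vector is the share of TRUE values. A minimal sketch with a made-up vector (not taken from the data):

```r
# Hypothetical survival labels for four passengers
survived <- c("Yes", "No", "No", "Yes")

# The comparison yields a logical vector: TRUE FALSE FALSE TRUE
is_yes <- survived == "Yes"

# TRUE counts as 1 and FALSE as 0, so the mean is the proportion of "Yes"
rate <- mean(is_yes)
rate  # 0.5
```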
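Next, we look at whether passenger survival is related to age group. One way to explore this is to first split the passengers into age groups and then compare the survival rates of those groups. Your tasks are:
- In titanic, create a new variable age_group with 3 categories, young, middle and old:
  - young: the passenger is younger than 18
  - middle: the passenger is at least 18 and younger than 60
  - old: the passenger is 60 or older
- Group the passengers by Pclass and age_group
When wrangling data with dplyr, you will often need to write your own functions. For such a function to fit into a dplyr workflow (especially together with mutate()), it has to be a vectorized function. The (unfinished) code below defines one such vectorized function, age_group(), whose purpose is to assign ages to age groups.
For example, age_group(10) returns [1] "young";
age_group(c(NA, 18, 60)) returns [1] NA "middle" "old".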
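Hint:
- Use sapply(); do not use a for loop.
- sapply(<vector>, <function>) is a fairly abstract function. It passes each element of the <vector> given as its first argument to <function>, one at a time. When <function> returns a single value for every element, the result is a vector of the same length as <vector> (see the documentation for details). For example, the code below re-expresses the numeric vector vec as English words:
vec <- c(2, 1, 3, 2)
atom_func <- function(x) {
  if (x == 1) return("One")
  if (x == 2) return("Two")
  if (x == 3) return("Three")
}
vec
sapply(vec, atom_func)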
#> [1] 2 1 3 2
#> [1] "Two" "One" "Three" "Two"
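For comparison, here is a small sketch of two base R alternatives to the sapply() call in the hint. It is not part of the assignment: vapply() additionally checks that every result is a single string, and indexing a lookup vector avoids the helper function altogether.

```r
vec <- c(2, 1, 3, 2)

atom_func <- function(x) {
  if (x == 1) return("One")
  if (x == 2) return("Two")
  if (x == 3) return("Three")
}

# vapply() works like sapply() but fails loudly if any result
# is not a character vector of length 1
words_vapply <- vapply(vec, atom_func, character(1))

# Indexing a lookup vector by position gives the same answer
words_lookup <- c("One", "Two", "Three")[vec]

words_vapply
words_lookup
```

With an input such as 4, atom_func() returns NULL; sapply() would then quietly return a list, while vapply() stops with an error.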
Your task is to extend the code below so that age_group() works correctly:
age_group <- function(ages) {
  ages <- sapply(ages, function(x) {
    # Modify the code below
    if (is.na(x)) return(NA)  # keep
    if (x < 18) return("young")
    if (18 <= x && x < 60) return("middle")
    return('old')  # keep
  })
  return(ages)  # keep
}
# Do not modify the code below
age_group(NA)
age_group(c(17, 18, 19, NA, 59, 60))
# should print out:
#> [1] NA
#> [1] "young" "middle" "middle" NA "middle" "old"
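The same binning can also be done without sapply(). The sketch below is an alternative, not the solution the exercise asks for: it uses base R's cut() with left-closed intervals, and the name age_group_cut() is made up for this example.

```r
# Hypothetical alternative to age_group(), built on cut()
age_group_cut <- function(ages) {
  ages <- as.numeric(ages)  # cut() needs a numeric vector, even for a bare NA
  groups <- cut(ages,
                breaks = c(-Inf, 18, 60, Inf),
                labels = c("young", "middle", "old"),
                right = FALSE)  # intervals are [a, b): 18 is middle, 60 is old
  as.character(groups)          # cut() returns a factor; convert to strings
}

binned <- age_group_cut(c(17, 18, 19, NA, 59, 60))
binned
```

One difference from the sapply() version: for a bare NA input this sketch returns a character NA rather than a logical NA, although both print as [1] NA.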
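Use the age_group() function completed above together with the dplyr functions mutate(), filter(), group_by() and summarise() to produce a summary table. For each of the 9 groups defined by Pclass and age_group, the table should give the number of passengers (count) and the survival rate (percent_survived).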
# Write your code here
titanic %>%
  mutate(age_group = age_group(Age)) %>%
  filter(!is.na(age_group)) %>%
  group_by(Pclass, age_group) %>%
  summarise(percent_survived = mean(Survived == "Yes"),
            count = n())
# Should print out:
#> # A tibble: 9 x 4
#> # Groups:   Pclass [3]
#>   Pclass age_group percent_survived count
#>    <dbl> <chr>                <dbl> <int>
#> 1      1 middle               0.675   157
#> 2      1 old                  0.294    17
#> 3      1 young                0.917    12
#> 4      2 middle               0.418   146
#> 5      2 old                  0.25      4
#> 6      2 young                0.913    23
#> 7      3 middle               0.202   272
#> 8      3 old                  0.2       5
#> 9      3 young                0.372    78
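The filter() step matters because passengers with a missing age would otherwise form extra NA groups and change the counts. A minimal sketch on a made-up five-row table (the values are invented, and dplyr is assumed to be installed):

```r
library(dplyr)

# Hypothetical toy data; one passenger has no recorded age
toy <- tibble(
  Pclass   = c(1, 1, 2, 2, 2),
  Age      = c(10, 40, NA, 65, 30),
  Survived = c("Yes", "No", "Yes", "No", "Yes")
)

# Drop rows with a missing age first, then count and average per class
toy_summary <- toy %>%
  filter(!is.na(Age)) %>%
  group_by(Pclass) %>%
  summarise(count = n(),
            percent_survived = mean(Survived == "Yes"))

toy_summary
```

Without the filter() call, class 2 would have a count of 3 instead of 2, because n() counts every row in the group.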
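Here I am again! Read week3Rclass.csv from the folder and name it Q_Q.
Complete the following questions using functions from the dplyr package.
In the gender variable, change 2 to male, 1 to female and 0 to other, and name the result O_O. (5 points)
# Write your code here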
Q_Q <- readr::read_csv('week3Rclass.csv')
O_O <- Q_Q %>%
  mutate(gender = sapply(gender, function(x) {
    if (x==2) return('male')
    if (x==1) return('female')
    return('other')
  }))
# Do not modify the code below
O_O[c(21:25, 46:50),]
# Should print out:
# A tibble: 10 x 6
#    nickname  gender grade q_self q_teacher   GPA
#    <chr>     <chr>  <dbl>  <dbl>     <dbl> <dbl>
#  1 na        male       1    100       100  3.7
#  2 OhYah     male       3     60       100  4.23
#  3 TAT       female     4     75        70  3
#  4 QQQ       other      1    100       100  3.3
#  5 Mictu     male       4     70       100  4.3
#  6 trumpy    other      4     87        99  4
#  7 bolee     male       1     70        90  3.13
#  8 mm        female     1     70        60  3.33
#  9 shawn     female     1     80       100  3.8
# 10 KaiSquare male       3     90        90  3.5
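The recoding can also be written without sapply(), using dplyr::case_when(), which is already vectorized. This is a sketch of an alternative on a made-up vector of codes, not the graded solution; it assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical gender codes, using the 2 / 1 / 0 scheme from the question
gender_code <- c(2, 1, 0, 2)

# Conditions are checked in order; TRUE acts as the catch-all branch
gender_label <- case_when(
  gender_code == 2 ~ "male",
  gender_code == 1 ~ "female",
  TRUE             ~ "other"
)

gender_label
```

Inside mutate(), the same expression could replace the sapply() call. Note that the catch-all branch also labels any unexpected code, including NA, as "other".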
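Compute the number of people of each gender within each grade (n) and the proportion each count represents within its grade (average), and store the result in A_A. (5 points)
# Write your code here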
A_A <- O_O %>%
  count(grade, gender) %>%
  group_by(grade) %>%
  mutate(average = n/sum(n))
# Do not modify the code below
A_A
# Should print out:
# A tibble: 12 x 4
# Groups:   grade [5]
#    grade gender     n average
#    <dbl> <chr>  <int>   <dbl>
#  1     1 female    10  0.476
#  2     1 male      10  0.476
#  3     1 other      1  0.0476
#  4     2 female     8  0.4
#  5     2 male      12  0.6
#  6     3 female     1  0.143
#  7     3 male       6  0.857
#  8     4 female    13  0.481
#  9     4 male      13  0.481
# 10     4 other      1  0.0370
# 11     5 female     5  0.5
# 12     5 male       5  0.5
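The key step above is that mutate() after group_by(grade) evaluates sum(n) separately within each grade, so the proportions add up to 1 per grade rather than over the whole table. A small sketch on made-up data (assumes dplyr is installed):

```r
library(dplyr)

# Hypothetical toy data: three students in grade 1, one in grade 2
toy <- tibble(
  grade  = c(1, 1, 1, 2),
  gender = c("female", "male", "male", "female")
)

shares <- toy %>%
  count(grade, gender) %>%       # one row per grade/gender pair, count in n
  group_by(grade) %>%
  mutate(average = n / sum(n)) %>%  # sum(n) is computed within each grade
  ungroup()

shares
```

Here grade 1 is split into 1/3 female and 2/3 male, while the single grade 2 row gets a share of 1.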
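Compute the number of people in each grade (N) as well as each grade's GPA mean (GPA_mean) and standard deviation (GPA_sd), sort the result by GPA mean in descending order, and store it in T_T. (5 points)
# Write your code here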
T_T <- O_O %>%
  group_by(grade) %>%
  summarize(N = n(), GPA_mean = mean(GPA), GPA_sd = sd(GPA)) %>%
  arrange(desc(GPA_mean))
# Do not modify the code below
T_T
# Should print out:
# A tibble: 5 x 4
#   grade     N GPA_mean GPA_sd
#   <dbl> <int>    <dbl>  <dbl>
# 1     4    27     3.97  0.289
# 2     2    20     3.81  0.469
# 3     1    21     3.73  0.517
# 4     5    10     3.62  0.785
# 5     3     7     3.40  1.53
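When reading GPA_sd, it may help to remember that R's sd() is the sample standard deviation, which divides by n - 1 rather than n. The sketch below checks this by hand on a made-up vector (the numbers are not from the class data):

```r
# Hypothetical scores, chosen so that the mean is exactly 5
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Sample standard deviation computed by hand: divide by n - 1
manual_sd <- sqrt(sum((x - mean(x))^2) / (length(x) - 1))

# Built-in sd() uses the same n - 1 denominator
builtin_sd <- sd(x)

all.equal(manual_sd, builtin_sd)
```

This is also why a group with a single observation would get NA as its standard deviation.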
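Write a function compute_mean() that computes, for each grade in O_O from the first question, the mean of q_self, the mean of q_teacher and the mean of GPA. The arguments of compute_mean() are:
- df: a tibble with the same structure as O_O (required)
- grades: the grades to include (defaults to all grades)
- columns: the names of the variables to average (defaults to q_self, q_teacher and GPA)
# Modify the code below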
compute_mean <- function(df, grades=df$grade, columns=c('q_self', 'q_teacher', 'GPA')) {
  # Names of the summary columns to keep, e.g. 'GPA' becomes 'GPA_mean'
  selected_variables <- paste0(columns, '_mean')
  df %>%
    group_by(grade) %>%
    summarize(q_self_mean = mean(q_self), q_teacher_mean = mean(q_teacher), GPA_mean = mean(GPA)) %>%
    filter(grade %in% grades) %>%
    # all_of() selects columns named in a character vector (needs dplyr 1.0 or later)
    select(grade, all_of(selected_variables))
}
# Do not modify the code below
compute_mean(O_O)
cat('\n\n')
compute_mean(O_O, c(1, 3, 5), 'GPA')
# Should print out:
# A tibble: 5 x 4
#   grade q_self_mean q_teacher_mean GPA_mean
#   <dbl>       <dbl>          <dbl>    <dbl>
# 1     1        77.5           91.1     3.73
# 2     2        72.8           87.2     3.81
# 3     3        78.6           95.1     3.40
# 4     4        76.3           89.9     3.97
# 5     5        67.8           80.8     3.62
#
#
# A tibble: 3 x 2
#   grade GPA_mean
#   <dbl>    <dbl>
# 1     1     3.73
# 2     3     3.40
# 3     5     3.62
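The solution above always computes all three means and then selects the requested ones. A possible variation, sketched here under the assumption of a recent dplyr release (1.0 or later), uses across() so that only the requested columns are averaged. The name compute_mean_across() and the toy data are made up for this example.

```r
library(dplyr)

# Hypothetical variant of compute_mean() built on across()
compute_mean_across <- function(df, grades = unique(df$grade),
                                columns = c("q_self", "q_teacher", "GPA")) {
  df %>%
    filter(grade %in% grades) %>%
    group_by(grade) %>%
    # Average only the requested columns and append "_mean" to each name
    summarise(across(all_of(columns), mean, .names = "{.col}_mean"))
}

# Made-up data with the same column names as O_O
toy <- tibble(
  grade     = c(1, 1, 2),
  q_self    = c(80, 60, 90),
  q_teacher = c(100, 90, 70),
  GPA       = c(4, 3, 3.5)
)

result <- compute_mean_across(toy, grades = 1, columns = "GPA")
result
```

Because the column names are passed as strings through all_of(), this version would also work for columns other than the three defaults, as long as they exist in df.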