1 文本資料前處理 (共 100 分)

你的任務是撰寫一份程式碼，將 samesex_marriage/txt/ 內的所有的 .txt 檔斷詞處理後，整理成一個 document data frame，並將結果儲存於變數 docs_df。

評分標準與提示

斷詞請使用自訂辭典 (辭典位於 samesex_marriage/user_dict.txt) (10 分)
docs_df 需有 4 個變項：

id: 文章檔名 (15 分)
topic: 文章來源 (詳見課程影片) (25 分)
每篇文章的 topic 資訊藏在文字檔的檔名中，例如 samesex_marriage/txt/anti_7.txt 的 topic 為 anti；samesex_marriage/txt/pro_108.txt 的 topic 為 pro。你可以透過 basename() 與 strsplit() 取得檔名中的這項資訊。
title: 文章標題 (25 分)
每篇文章的標題儲存於文字檔的第一行 (第二行為空行)，例如 samesex_marriage/txt/anti_7.txt 的第一行為 同性婚姻與多人婚姻。
content: 文章內文 (25 分)
需斷好詞且以 \u3000 作為詞彙分隔字元。內文不包含文章標題 (i.e. 忽略文字檔第一行)

請在程式碼中加入適量註解 (沒有註解以致程式碼閱讀困難者，酌量扣分)，例如：
```
# Read post from file
post <- readLines(fps[i], encoding = "UTF-8")

# Extract title
title <- ...
```

# Your code goes here
# please save the results to `docs_df`

###### 請勿更動下方程式碼 ######
# 輸出結果請參考 `samesex_marriage.rds`
knitr::kable(docs_df[234,], align = "c")

#> Error in knitr::kable(docs_df[234, ], align = "c"): object 'docs_df' not found

2 Type-token ratio (20 分)

請運用上方整理出來的 data frame (或直接讀取 samesex_marriage.rds)，將文章依據 topic 分成兩組，計算出這兩個 topic 的詞彙豐富度 (以 type-token ratio 衡量，定義見下)。

# Your code goes here

###### 請勿更動下方程式碼 ######
# Should print out:
#> # A tibble: 2 x 2
#>   topic   TTR
#>   <chr> <dbl>
#> 1 anti  0.115
#> 2 pro   0.173

TTR 定義

type-token ratio (TTR) 是一種衡量文本詞彙豐富度的指標，其定義為：

A type-token ratio (TTR) is the total number of UNIQUE words (types) divided by the total number of words (tokens) in a given segment of language.

例如，每天天剛亮時我母親便把我喊醒 這句話含有 9 個 tokens。其中，我 這個 token 出現了兩次，其它的 token 皆只出現一次，因此總共有 8 種 types。這句話的 type-token ratio 因此為 0.8889 (8/9)。

HW 9: 中文文本資料處理

未命名 B01201001 一般系

2021-05-06

Rmd Source (for TAs)

1 文本資料前處理 (共 100 分)

評分標準與提示

2 Type-token ratio (20 分)

TTR 定義

HW 9: 中文文本資料處理

未命名 B01201001 一般系

2021-05-06 Rmd Source (for TAs)

1 文本資料前處理 (共 100 分)

評分標準與提示

2 Type-token ratio (20 分)

TTR 定義

2021-05-06

Rmd Source (for TAs)