Introduction to Programming and Data Science with R

background-image: url(http://fusionanalyticsworld.com/wp-content/uploads/2016/08/Data-Science-with-R1.jpg?a73fae)
background-position: center
background-size: cover

class: title-slide

.bg-text[
# Introduction to Programming and Data Science with R
### Text analytics.1

<hr />

May 15, 2021  
謝舒凱
]

---
## Some administrivia

.large[

- Machine Learning and NLP (I)

- Machine Learning and NLP (II)

- Sentiment analysis
]

---
## The power of emoji

(credit: 林欣誼, a student in 台大言迷社)

---
background-image: url(../img/emo/boredom-small.png)
---
## Representing text data

Text representation: from surface forms to deep representations

- Character string: a character vector in R. Text data in this form is usually read into memory first.

- Corpus: usually contains the raw strings together with additional metadata and annotations.

- Term-document matrix: a sparse matrix with one row per term and one column per document. Cells typically hold term counts or `tf-idf` weights.

- Word vectors: vector representations learned by neural networks.
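
As a rough illustration of the first three representations, the sketch below builds a tiny term-document matrix from a character vector using only base R. The two toy sentences and all variable names are made up for illustration; a real corpus would normally go through a dedicated package and a sparse matrix class.

```r
# Two toy documents held as a named character vector (hypothetical data)
docs <- c(d1 = "the cat sat on the mat", d2 = "the dog sat")

# Tokenize on single spaces: one character vector of tokens per document
tokens <- strsplit(docs, " ", fixed = TRUE)

# Vocabulary: the sorted set of distinct terms
vocab <- sort(unique(unlist(tokens)))

# Term-document matrix: one row per term, one column per document,
# each cell holding the raw count of that term in that document
tdm <- sapply(tokens, function(tok) table(factor(tok, levels = vocab)))
tdm
```

Most cells are zero even in this tiny case, which is why such matrices are stored in sparse form once the vocabulary grows.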

---
# Start by processing a single text

---
## Common string functions: character manipulation

- In text mining, character (string) manipulation is the most fundamental skill.
- `nchar()`, `substr()`, `grep()`, `grepl()`, `gsub()`, `strsplit()`, `paste()`
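
A minimal sketch of the functions listed above, applied to one made-up example string (the string and variable names are illustrative only):

```r
x <- "Text mining with R"

nchar(x)                              # number of characters: 18
substr(x, 1, 4)                       # characters 1 to 4: "Text"
grepl("mining", x)                    # does the pattern match? TRUE
grep("R", c("Python", "R", "Rust"))   # indices of matching elements: 2 3
gsub("R", "tidytext", x)              # replace every match of "R"
words <- strsplit(x, " ")[[1]]        # split into a character vector of words
paste(words, collapse = "_")          # join back: "Text_mining_with_R"
```

Note that `grep()`, `grepl()`, and `gsub()` treat the pattern as a regular expression unless `fixed = TRUE` is set.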

---
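## Visualizing text data

- We want to use visualization to explore the information, trends, and changing patterns in text. For example:
  - Netizen behavior and social networks in the PTT (批踢踢) corpus
  - Writing styles of different authors
  - Comparing political views, positions, and values (before and after an election)

- Basic options
  - Word clouds and comparisons between them
  - Correlation plots and phrase trees
  - Custom fonts and styling

- [See here for a wider range of visual forms](https://textvis.lnu.se/)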
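
A word cloud starts from a word-frequency table. The sketch below computes one in base R from a made-up sentence and, only if the `wordcloud` package happens to be installed, draws the cloud; the text and variable names are assumptions for illustration.

```r
# Toy text (hypothetical); in practice this would be a tokenized corpus
text <- "data science with r makes text data fun and text data useful"
words <- strsplit(text, " ", fixed = TRUE)[[1]]

# Word frequencies, most frequent first
freq <- sort(table(words), decreasing = TRUE)
head(freq, 3)

# Draw the cloud only when the wordcloud package is available
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(names(freq), as.numeric(freq), min.freq = 1)
}
```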

---
## [Exercise] The rise and fall of "lover" and "wife"

```r
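library(ngramr)    # ggram(): plot Google Books Ngram frequencies
library(ggplot2)   # labs()
library(showtext)  # render CJK fonts in plots
showtext_auto()
# Stacked area chart of the two terms, 1500-2000, in the Chinese corpus
ggram(c("情人", "太太"), year_start = 1500, year_end = 2000,
      corpus = "chi_sim_2012", ignore_case = TRUE,
      geom = "area", geom_options = list(position = "stack")) +
  labs(y = NULL)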
```

---
# `tidytext`
.large[Text mining using tidy data principles]

A good book on text analysis: [Text Mining with R: A Tidy Approach](http://tidytextmining.com/)

- A simpler approach to analysis: use the `tidyverse` package ecosystem.
.small[Develop your text mining skills using the `tidytext` package, along with other tidyverse tools.]

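- Recall the spirit of `tidy data`: `One observation per row`. In text data, the observation is a **token** (in many cases a single word), and the `unnest_tokens()` function performs the `tokenization`. (Note that by default it does more than tokenize: it also lowercases the text and strips punctuation.)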

- Let's work through the [online exercises](https://juliasilge.shinyapps.io/learntidytext/) together
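
To see what `unnest_tokens()` does by default, here is a minimal sketch on two made-up sentences. It assumes the `tibble` and `tidytext` packages are installed; the data and column names are illustrative only.

```r
library(tibble)
library(tidytext)

# Two toy documents (hypothetical data)
docs <- tibble(
  doc_id = 1:2,
  text   = c("Text mining is FUN!", "Tidy data: one token per row.")
)

# Default behaviour: word tokens, lowercased, punctuation stripped
tokens <- unnest_tokens(docs, output = word, input = text)
tokens

# Keep the original case by switching off lowercasing
tokens_raw <- unnest_tokens(docs, output = word, input = text, to_lower = FALSE)
```

Each row of `tokens` is one token together with the `doc_id` it came from, which is the tidy "one observation per row" layout.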

---
## Exploratory analysis of Chinese text data with `tidyflow`

Pain points:

- **Tokenization** during preprocessing
- Stopwords
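
One possible way to handle both points is sketched below with the `jiebaR` segmenter. This is an assumption for illustration, not necessarily the toolchain used in this course; the example sentence and the tiny stopword list are made up.

```r
library(jiebaR)

# Default segmenter using the dictionary bundled with jiebaR
seg <- worker()

text <- "我喜歡用R做中文文本分析"
tokens <- segment(text, seg)   # character vector of word tokens
tokens

# Hypothetical stopword list, for illustration only
stopwords_zh <- c("我", "用", "的")
tokens_clean <- tokens[!tokens %in% stopwords_zh]
tokens_clean
```

The segmentation depends on the dictionary, so results on Traditional Chinese text may differ from what a custom dictionary would produce.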

---
# Chinese text preprocessing

---
# Text representation

- [Count-based] one-hot encoding, bag-of-words
- [Prediction-based] neural-network-based representations
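
The two count-based representations are closely related: a bag-of-words vector is the sum of the one-hot vectors of a document's tokens. A minimal base R sketch, with a made-up three-word vocabulary:

```r
# Toy vocabulary (hypothetical)
vocab <- c("cat", "dog", "sat")

# One-hot encoding: each term is a vector with a single 1
one_hot <- diag(length(vocab))
dimnames(one_hot) <- list(vocab, vocab)
one_hot["dog", ]

# Bag-of-words: sum the one-hot rows of the document's tokens
doc <- c("cat", "sat", "cat")
bow <- colSums(one_hot[doc, , drop = FALSE])
bow
```

Word order is lost in `bow`, which is one motivation for the prediction-based representations.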

---
# Project ideas

Religious Text Analytics