9 文本與詞彙的向量表徵

9.1 Representing Documents

由於資料科學以及統計學方法上的限制，要對文本進行量化分析之前，常常需要將原本以符碼 (文字) 去表徵的文本轉換成數值的表徵，如此我們才有辦法對文本去進行一些資料科學中常見的分析，例如相似度計算、分群、分類等。

9.1.1 Document-Term Matrix: A Toy Example

將文本轉換成數值的表徵方式相當多，其中一種最簡單的方式，即是使用 document-term matrix 將文本以數值向量去表徵。document-term matrix 裡面記錄著各個文本中的各種詞彙出現的次數。以這 3 篇文本為例：

doc1: I baked the cake and the muffin
doc2: I loved the cake
doc3: I wrote the book

我們可以使用下方的矩陣 dtm 去表示這 3 篇文本。在這個矩陣中，每個 row 即是一篇文本的向量表徵 (由上至下依序為 doc1, doc2, doc3)；每個 column 是某個特定的詞彙；矩陣中的數值則是該種詞彙出現在該篇文本的次數，例如 cell (2, 1) 代表 I 這個詞彙在 doc2 中出現了 1 次。

#' doc1:    "I baked the cake and the muffin"
#' doc2:    "I loved the cake"
#' doc3:    "I wrote the book"
#' TERMS:        I  baked loved wrote the and cake muffin book
dtm <- matrix(c( 1,   1,    0,    0,   2,  1,   1,    1,   0 ,
                 1,   0,    1,    0,   1,  0,   1,    0,   0 , 
                 1,   0,    0,    1,   1,  0,   0,    0,   1 ), 
              nrow = 3, ncol = 9, byrow = TRUE)
dtm

#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
#> [1,]    1    1    0    0    2    1    1    1    0
#> [2,]    1    0    1    0    1    0    1    0    0
#> [3,]    1    0    0    1    1    0    0    0    1

有了文本的向量表徵之後，我們就能去量化比較文本之間的相似度，方法是直接利用向量之間的距離公式 \(d(\overrightarrow{p}, \overrightarrow{q})\) 以及相似度公式 \(cos(\theta)\)：

\[ d(\overrightarrow{p}, \overrightarrow{q}) = \sqrt{ (p_1 - q_1)^2 + (p_2 - q_2)^2 + ... + (p_n - q_n)^2 } \]

\[ cos(\theta) = \frac{\overrightarrow{p} \cdot \overrightarrow{q}}{\lVert p \rVert \lVert q \rVert } \]

為了避免每篇文本長度不同造成的文本向量長度不同，在使用距離公式時，我們通常會多一個將向量常規化的動作，讓兩個文本向量的長度變得一樣 (i.e., 皆變成單位向量)

#### Distance / Similarity Measures ####
cossim <- function(x1, x2)
  sum(x1 * x2) / sqrt( sum(x1^2) * sum(x2^2) )

eudist <- function(x1, x2) 
  sqrt( sum( (as_unit_vec(x1) - as_unit_vec(x2))^2 ) )

as_unit_vec <- function(x) x / sqrt(sum(x^2))  # Normalize vector length

eudist(dtm[1, ], dtm[2, ])
eudist(dtm[1, ], dtm[3, ])
eudist(dtm[2, ], dtm[3, ])

#> [1] 0.8164966
#> [1] 1
#> [1] 1

cossim(dtm[1, ], dtm[2, ])
cossim(dtm[1, ], dtm[3, ])
cossim(dtm[2, ], dtm[3, ])

#> [1] 0.6666667
#> [1] 0.5
#> [1] 0.5

9.1.2 Creating Document-Term Matrix with quanteda

在上方的例子，我們是自己透過手刻的方式去製作 document-term matrix。quanteda 則提供了將 tokens object 轉換成 document-term matrix 的函數：

library(dplyr)
library(quanteda)
library(quanteda.textstats)
library(quanteda.textmodels)

# Document data frame
docs_df <- readRDS("samesex_marriage.rds")

# Token object
q_tokens <- corpus(docs_df, docid_field = "id", text_field = "content") %>%
  tokenizers::tokenize_regex(pattern = "\u3000") %>%
  tokens()

# Document-term matrix (feature selection)
q_dfm <- dfm(q_tokens) %>% 
  dfm_remove(pattern = readLines("stopwords.txt"), valuetype = "fixed") %>%
  dfm_select(pattern = "[\u4E00-\u9FFF]", valuetype = "regex") %>%
  dfm_trim(min_termfreq = 5) %>%
  dfm_tfidf()
q_dfm

#> Document-feature matrix of: 300 documents, 4,781 features (95.54% sparse) and 0 docvars.
#>               features
#> docs               去年   全國性      是否      同意     民法      婚姻
#>   anti_1.txt   1.457866 1.273001 0.4240428 3.2552323 4.300816 1.6543448
#>   anti_10.txt  0        0        0         0         0        0        
#>   anti_100.txt 0        0        2.1202141 0.6510465 0        2.7572413
#>   anti_101.txt 0        0        0.4240428 0         0        0        
#>   anti_102.txt 0        0        0.4240428 1.9531394 1.075204 0.5514483
#>   anti_103.txt 0        0        0         0         0        0        
#>               features
#> docs                規定        應     限定  一男一女
#>   anti_1.txt   3.9804537 1.4183996 4.704365 3.7921393
#>   anti_10.txt  0         0         0        0        
#>   anti_100.txt 0         0.9455998 0        0.6320232
#>   anti_101.txt 0         0         0        0        
#>   anti_102.txt 0.5686362 0.4727999 0        0        
#>   anti_103.txt 0         1.8911995 0        0        
#> [ reached max_ndoc ... 294 more documents, reached max_nfeat ... 4,771 more features ]

9.1.3 Pairwise Document Similarity by raw Document-Term Matrix

quanteda.textstats::textstat_simil() 能夠計算 document-term matrix 內文本間的兩兩相似度。如此我們便可透過回傳的矩陣去取得與某篇文章 (e.g., anti_1.txt) 最相似的幾篇文章：

doc_sim <- textstat_simil(q_dfm, method = "cosine") %>% as.matrix()
dim(doc_sim)

#> [1] 300 300

sort(doc_sim["anti_1.txt", ], decreasing = T)[1:8]

#>   anti_1.txt  anti_95.txt anti_122.txt   pro_18.txt  anti_54.txt  anti_55.txt 
#>    1.0000000    0.3846162    0.2652598    0.2584759    0.2577859    0.2548753 
#> anti_132.txt anti_106.txt 
#>    0.2422925    0.2371465

9.1.3.1 Clustering Using Pairwise Similarity

quanteda.textstats::textstat_simil() 所回傳的相似度矩陣也可作為分群演算法的輸入：

doc_name <- function(name, end = 8) 
  paste0(name, '\n', substr(lookup[name], 1, end))

idx <- c(1:10, 151:160)
lookup <- docs_df$title
names(lookup) <- docs_df$id

doc_sim2 <- doc_sim[idx, idx]
row.names(doc_sim2) <- doc_name(row.names(doc_sim2))
colnames(doc_sim2) <- doc_name(colnames(doc_sim2))

## create hclust
clust <- (1 - doc_sim2) %>% as.dist %>% hclust

## plot dendrogram
plot(clust, main = "20 Selected Posts", cex = 0.7,
     xlab="", sub="")

9.1.3.2 Network Plot Using Pairwise Similarity

library(quanteda.textplots)
set.seed(10)
textplot_network(as.dfm(doc_sim2),
                 vertex_labelsize = 2)

9.1.4 Latent Semantic Anlysis (Dimensionality Reduction)

由於 document-term matrix 通常很稀疏 (i.e., 很多值是 0)，使文本向量可能無法抓到某些文本之間的語意關聯。例如，在下圖的例子中，doc2 與 doc4 雖然語意相近，但此二文本的向量的相似度 (cosine similarity) 為零，因為這兩篇文本並未使用到相同的詞彙。

面對這種情形，我們可以將高維的 document-term matrix 透過數學方式轉換成維度比較小的矩陣。在這個過程中，document-term matrix 中一些語意相近的詞彙會被壓縮到某個或是某些維度中，讓這個維度比較小的矩陣反而比較能表徵文本之間的語意關聯。這種方式稱為 Latent Semantic Analysis (LSA)，而用來將矩陣分解降維的數學方法稱為 Singular Value Decomposition (SVD)。

lsa_model <- quanteda.textmodels::textmodel_lsa(q_dfm, nd = 15)
dim(lsa_model$docs)

#> [1] 300  15

# Document similarity
doc_sim2 <- textstat_simil(as.dfm(lsa_model$docs), method = "cosine")
sort(doc_sim2["anti_1.txt", ], decreasing = T)[1:8]

#>   anti_1.txt    pro_2.txt anti_102.txt anti_146.txt   pro_94.txt   pro_84.txt 
#>    1.0000000    0.9963039    0.9959590    0.9936474    0.9933724    0.9933206 
#>  anti_94.txt  pro_133.txt 
#>    0.9923490    0.9921601

9.1.5 Converting unseen documents to vectors

現在我們已經知道如何將一個語料庫內的所有文本轉換成向量表徵。現在要處理的是新的資料：若今天有一篇新的文本，我們要如何將此篇文本轉換成向量，好讓我們可以去將這篇文本與語料庫中的其它文本進行比較？

要達成這件事，在將新文本轉換成 document-term matrix 時，需要讓新文本的 document-term matrix 在維度上 (詞彙種類以及其在矩陣中順序) 能夠與語料庫的 document-term matrix 對應起來。這可以透過 quanteda::dfm_match() 達成。確保了 document-term matrix 的維度相同之後，接著就可以將這個 document-term matrix 餵到 LSA 模型，讓它為這個 document-term matrix 進行降維，進而得到新文本的向量表徵：

#### Converting new texts to vector representation ####

# New document
doc <- readLines("sample_post.txt") %>% 
  paste(collapse = "\n")

# Convert raw text to document term matrix
library(jiebaR)
seg <- worker(user = "user_dict.txt")
new_doc_dtm <- list(segment(doc, seg)) %>%
  tokens() %>%
  dfm() %>%
  dfm_match(features = featnames(q_dfm))

# Dimensionality reduction with LSA
p <- predict(lsa_model, newdata = new_doc_dtm)
doc_vec_lsa <- p$docs_newspace[1, ]
doc_vec_lsa

#>  [1]  0.010827972 -0.005043434  0.019064686  0.021167702 -0.012646628
#>  [6]  0.017416307  0.030359762  0.024489521 -0.010629365 -0.005384180
#> [11] -0.033654985  0.005100957 -0.015062376 -0.007130964  0.001976622

如此，我們便可使用這個新文本的向量去和語料庫中的其它文本進行比較：

cossim(doc_vec_lsa, lsa_model$docs["pro_18.txt", ])
cossim(doc_vec_lsa, lsa_model$docs["anti_18.txt", ])

#> [1] 0.7625148
#> [1] -0.1026607

9.2 Using Word Vectors

詞彙與文本類似，一樣可以透過數值性的向量去表徵。在近幾年表徵詞彙的詞向量技術進展相當快速，且能抓到一些相當細微的語意。但這些詞向量通常需要透過相對大量的資料訓練，其表現才會相對穩定。通常我們也不必自行蒐集語料訓練這些詞向量，因為網路上已有相當多公開的詞向量資源，可直接下載使用。

下方的例子即是使用預先訓練好的詞向量 (儲存於 ppmi_embeddings_50dim.txt)。這份詞向量²³是透過中研院平衡語料庫訓練而來的，可以透過 functions.R 中的 read_ppmi() 以 matrix 的格式讀進 R 裡：

source("functions.R")
wv <- read_ppmi("ppmi_embeddings_50dim.txt")

讀進來後，就可以透過 wv["{詞彙}", ] 去取得詞向量 (length == 50 的 numeric vector)：

wv["爸爸", ]

#>  [1] -0.046565545  0.460913664 -0.053094442 -0.057553476 -0.024442142
#>  [6]  0.001117499 -0.179143899 -0.213770824 -0.087744568  0.044913735
#> [11] -0.011217009  0.081547154  0.127092145  0.067408753  0.036906909
#> [16]  0.085890520  0.102822128 -0.199562705  0.040325577  0.150103819
#> [21]  0.232702185  0.007350324  0.131141177  0.146634070 -0.049885084
#> [26] -0.192042121  0.027501879 -0.238921004  0.113918191  0.001545429
#> [31] -0.027848669 -0.098988912 -0.079275493  0.074673585  0.113090980
#> [36]  0.222950818  0.096059215 -0.089295350 -0.157370032 -0.089475721
#> [41]  0.025158373  0.236198917 -0.023956936  0.169465113  0.228609064
#> [46]  0.038412816  0.037344191  0.184070583  0.224648376 -0.131128332

並可以套用 cosine similarity 的公式去計算兩個詞彙間的相似度：

cossim(wv["媽媽", ], wv["爸爸", ])
cossim(wv["老師", ], wv["爸爸", ])

#> [1] 0.9721566
#> [1] 0.3958473

9.2.1 Finding Most Similar Words

由於詞向量的檔案通常非常大 (因為詞彙的種類通常遠比文本的數量多很多)，我們無法透過 quanteda.textstats::textstat_simil() 去計算詞彙間的兩兩相似度 (運算時間太長)。因此，若想找出與某個特定詞彙 (e.g., 媽媽) 語意最相似的詞彙，我們可以透過 base R 的 apply() 去將 媽媽 的詞向量去跟所有的詞向量算出相似度之後再進行排序：

most_simil <- function(word_vec, topn = 10) {
  siml <- apply(wv, 1, function(row) cossim(row, word_vec))
  return(sort(siml, decreasing = T)[1:topn])
}

most_simil(wv["媽媽", ])

#>      媽媽      爸爸      妹妹      哥哥      小孩      高興      弟弟      孩子 
#> 1.0000000 0.9721566 0.8704723 0.8639341 0.8603998 0.8585911 0.8536466 0.8528005 
#>      回家      回來 
#> 0.8485474 0.8395401

9.2.2 Word Analogy

近幾年的詞向量技術令人驚奇的其中一個地方來自它們抓到詞彙語意上類比關係的能力。透過詞向量的運算，我們可以要電腦為我們找出類似這種問題的答案：

父親之於兒子相當於母親之於？
弟弟之於哥哥相當於妹妹之於？

父親 : 兒子 == 母親 :  ？
弟弟 : 哥哥 == 妹妹 :  ？
v1  -  v2   =  v3  -  v4
v4  =  v3   -  v1  +  v2

v1 <- wv["父親", ]
v2 <- wv["兒子", ]
v3 <- wv["母親", ]
v4 <- v3 - v1 + v2
most_simil(v4)

#>      兒子      母親      女兒      小孩      父母      父親      孩子      丈夫 
#> 0.9487272 0.9425165 0.9305924 0.8863384 0.8806144 0.8620591 0.8385362 0.8338454 
#>      長大      媽媽 
#> 0.8174763 0.8096411

v1 <- wv["弟弟", ]
v2 <- wv["哥哥", ]
v3 <- wv["妹妹", ]
v4 <- v3 - v1 + v2
most_simil(v4)

#>      哥哥      妹妹      爸爸      弟弟      回去      姊姊      媽媽      回家 
#> 0.9663716 0.9476825 0.8894544 0.8650535 0.8603728 0.8535357 0.8396181 0.8242290 
#>      高興      回來 
#> 0.8238468 0.8150439

8 中文文本資料處理

10 擷取網路資料：Web API