Topic Modeling

The video that accompanies this notebook is available at https://ucdavis.box.com/v/sts-205-notebook-7.

Topic modeling is a way to imagine the shared vocabularies (topics) underlying the documents in the corpus. Running a topic model is quite easy. The hard parts are preparing your text, choosing model parameters and hyperparameters, and interpreting the model.

Topic modeling should be reserved for large corpora (more documents than unique words). Long documents should be chunked into smaller documents, as topic modeling treats a document as a bag of words. If two words are in the same document, the model doesn’t care if they are right next to each other or very far apart. Therefore, working with short documents or short segments of longer documents will ensure that the words the model associates with one another actually occur near one another in the original documents.

Start by loading packages, sourcing functions, and building the sotu data frame.

library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
  method         from
  print.tbl_lazy     
  print.tbl_sql      
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
✓ ggplot2 3.3.3     ✓ purrr   0.3.4
✓ tibble  3.0.6     ✓ dplyr   1.0.4
✓ tidyr   1.1.2     ✓ stringr 1.4.0
✓ readr   1.4.0     ✓ forcats 0.5.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
library(tidytext)
#install.packages("textstem")
library(textstem)
Loading required package: koRpus.lang.en
Loading required package: koRpus
Loading required package: sylly
Registered S3 method overwritten by 'data.table':
  method           from
  print.data.table     
For information on available language packages for 'koRpus', run

  available.koRpus.lang()

and see ?install.koRpus.lang()


Attaching package: ‘koRpus’

The following object is masked from ‘package:readr’:

    tokenize
#install.packages("topicmodels")
library(topicmodels)
source("functions.r")
sotu <- make_sotu()

── Column specification ────────────────────────────────────────────────────────
cols(
  year = col_double(),
  pres = col_character(),
  use_last = col_logical()
)
To prepare the text, we will chunk it up into paragraphs, using the "

" tags. What we want is a data frame where each row is one paragraph of one SOTU address. We will create a unique identifier for each paragraph.

sotu_paragraphs <- data.frame()
for(i in 1:nrow(sotu)) {
  text <- str_split(sotu$text[i], " <p> ") %>% unlist
  paragraphs <- tibble(text) %>% 
                mutate(id = str_c(str_replace_all(sotu$pres[i], " ", "_"), 
                            "_", sotu$year[i], "_", 1:length(text)))
  sotu_paragraphs <- rbind(sotu_paragraphs, paragraphs)
}
head(sotu_paragraphs)

Next we will unnest tokens and remove stopwords and tokens containing digits.

sotu_words <- sotu_paragraphs %>% unnest_tokens(word, text) %>%
                anti_join(stop_words) %>% filter(!str_detect(word, "[:digit:]"))
Joining, by = "word"
head(sotu_words)

Next we will lemmatize the text. Lemmatization converts all forms of a word to the most basic form. For example, “children” becomes “child” and “men” becomes “man.” Lemmatization is not necessary for topic modeling, and you can also use it for other types of text analysis, including types we have discussed in class. I am introducing it here mainly so that you are aware of it and know how to do it.

The first step is to create a data frame with a column of all unique words in the corpus.

sotu_lemmas <- tibble(word = unique(sotu_words$word))
head(sotu_lemmas)

Now we will use the lemmatize_words function in the textstem package to create a new column of lemmas. The lemmatize_words function takes as its argument the vector of words you want to lemmatize and returns the lemma for each one.

sotu_lemmas$lemma <- lemmatize_words(sotu_lemmas$word)
head(sotu_lemmas)

Now we will join our list of lemmas to the sotu_words data frame

sotu_words <- left_join(sotu_words, sotu_lemmas) 
Joining, by = "word"
head(sotu_words)

We will use the topicmodels package for topic modeling. It requires that our corpus be in a document-term matrix, so we use the cast_dtm() function from the tidytext package to convert our token data frame into a document-term matrix.

sotu_dtm <- sotu_words %>% group_by(id) %>% count(lemma) %>% cast_dtm(id, lemma, n)

Now we can use the LDA() function in the topicmodels package to build a topic model. For our first attempt, we will ask for 15 topics. This process takes a lot of computational time. To save you that time, I have already run the model and saved it as sotu_lda_15.RDS. Read it into working memory with the code below.

 sotu_lda <- LDA(sotu_dtm, k = 15)
 saveRDS(sotu_lda, "sotu_lda_15.RDS")
sotu_lda_15 <- readRDS("sotu_lda_15.RDS")
sotu_lda_15
A LDA_VEM topic model with 15 topics.

The following code creates two data frames of results: sotu_topics indicates the probability distribution of words over topics; sotu_docs indicates the proportion of each document (paragraph) attributed to each topic.

sotu_topics <- tidy(sotu_lda_15, matrix = "beta")
head(sotu_topics)
sotu_topics %>% group_by(topic) %>% summarize(total = sum(beta))
sotu_docs <- tidy(sotu_lda_15, matrix = "gamma") 
head(sotu_docs)
sotu_docs %>% group_by(document) %>% summarize(total = sum(gamma))

One way to visualize topics is with bar charts of the five most frequent words in each topic.

topics_5 <- function(lda) {
  tidy(lda, matrix = "beta") %>% 
        group_by(topic) %>% top_n(5, beta) %>%
        ggplot(aes(x = reorder(term, beta), y = beta, fill = term)) + 
          geom_col() + coord_flip() + guides(fill = FALSE) +
          facet_wrap(vars(topic), ncol = 5, scales = "free") +
          theme_minimal(base_size = 20) +
          scale_y_continuous(labels = c()) +
          labs(title = "Topic Model - Top Five Words By Topic", 
               x = "Top Five Words", y = "Beta")
}
topics_5(sotu_lda_15)

One way to visualize the distribution of topics over documents is to calculate the proportion of each document attributed to each topic. Here I am aggregating over years.

topics_years <- function(lda) {
  tidy(lda, matrix = "gamma") %>%
    mutate(year = str_replace(document, "_[0-9]+$", "")) %>% 
    mutate(year = str_replace(year, "[A-Za-z_\\.]+_", "")) %>%
    mutate(year = as.numeric(year)) %>% 
    group_by(year, topic) %>% summarize(paras = sum(gamma)) %>%
    ungroup %>% group_by(year) %>% mutate(p = paras/sum(paras)) %>%
    ggplot(aes(x = year, y = p, fill = factor(topic))) + geom_area() +
      scale_y_continuous(labels = scales::percent) +
      labs(title = "Topic Distribution by Year - SOTU Topic Model",
           x = "Year", y = "Percent of Address (by paragraph)", fill = "Topic")
}
topics_years(sotu_lda_15)
`summarise()` has grouped output by 'year'. You can override using the `.groups` argument.

This graph indicates that the model very evenly distributed the topics among documents. We can see that by exploring the gamma column of the sotu_docs data frame.

summary(sotu_docs$gamma)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.03843 0.06624 0.06661 0.06667 0.06699 0.38449 
qplot(gamma, data = sotu_docs, binwidth = 0.00001)

This distribution is controlled by the alpha hyperparameter. We can see what value of alpha was used with the following code:

attr(sotu_lda_15, "alpha")
[1] 32.93211

This is very high! Let’s try a smaller alpha.

sotu_lda <- LDA(sotu_dtm, k = 15, control = list(alpha = 1))
saveRDS(sotu_lda, "sotu_lda_15_a1.RDS")

sotu_lda_15a1 <- readRDS("sotu_lda_15_a1.RDS")
topics_5(sotu_lda_15a1)

topics_years(sotu_lda_15a1)
`summarise()` has grouped output by 'year'. You can override using the `.groups` argument.

Now we are seeing some variation and change over time! We are still in the early stages of building this model, but one way we can evaluate and think with it is by looking at the words in each document that were assigned to each topic. The augment() function shows the frequency of each word in each document, and the topic to which it was assigned by the model.

We can now see what proportion of each presidents’ words was assigned to which topic.

We have not yet explored different values of the k parameter, which specifies the number of topics. One package that can help with that is ldatuning. This package allows you to try a variety of topic numbers, and generates four metrics that indicate fit.

It looks like more topics is better, so let’s try it with 50.

---
title: "Notebook 7 - Topic Modeling Part 1"
author: "Emily Klancher Merchant"
date: "STS 205"
output: html_notebook
---

# Topic Modeling
The video that accompanies this notebook is available at https://ucdavis.box.com/v/sts-205-notebook-7.

Topic modeling is a way to imagine the shared vocabularies (topics) underlying the documents in the corpus. Running a topic model is quite easy. The hard parts are preparing your text, choosing model parameters and hyperparameters, and interpreting the model.

Topic modeling should be reserved for large corpora (more documents than unique words). Long documents should be *chunked* into smaller documents, as topic modeling treats a document as a *bag of words*. If two words are in the same document, the model doesn't care if they are right next to each other or very far apart. Therefore, working with short documents or short segments of longer documents will ensure that the words the model associates with one another actually occur near one another in the original documents.

Start by loading packages, sourcing functions, and building the `sotu` data frame.
```{r}
library(tidyverse)
library(tidytext)
#install.packages("textstem")
library(textstem)
#install.packages("topicmodels")
library(topicmodels)
source("functions.r")
sotu <- make_sotu()
```
To prepare the text, we will chunk it up into paragraphs, using the "<p>" tags. What we want is a data frame where each row is one paragraph of one SOTU address. We will create a unique identifier for each paragraph.
```{r}
sotu_paragraphs <- data.frame()
for(i in 1:nrow(sotu)) {
  text <- str_split(sotu$text[i], " <p> ") %>% unlist
  paragraphs <- tibble(text) %>% 
                mutate(id = str_c(str_replace_all(sotu$pres[i], " ", "_"), 
                            "_", sotu$year[i], "_", 1:length(text)))
  sotu_paragraphs <- rbind(sotu_paragraphs, paragraphs)
}
head(sotu_paragraphs)
```
Next we will unnest tokens and remove stopwords and tokens containing digits.
```{r}
sotu_words <- sotu_paragraphs %>% unnest_tokens(word, text) %>%
                anti_join(stop_words) %>% filter(!str_detect(word, "[:digit:]"))
head(sotu_words)
```
Next we will lemmatize the text. Lemmatization converts all forms of a word to the most basic form. For example, "children" becomes "child" and "men" becomes "man." Lemmatization is not necessary for topic modeling, and you can also use it for other types of text analysis, including types we have discussed in class. I am introducing it here mainly so that you are aware of it and know how to do it.

The first step is to create a data frame with a column of all unique words in the corpus.
```{r}
sotu_lemmas <- tibble(word = unique(sotu_words$word))
head(sotu_lemmas)
```
Now we will use the `lemmatize_words` function in the `textstem` package to create a new column of lemmas. The `lemmatize_words` function takes as its argument the vector of words you want to lemmatize and returns the lemma for each one.
```{r}
sotu_lemmas$lemma <- lemmatize_words(sotu_lemmas$word)
head(sotu_lemmas)
```
Now we will join our list of lemmas to the `sotu_words` data frame
```{r}
sotu_words <- left_join(sotu_words, sotu_lemmas) 
head(sotu_words)
```
We will use the `topicmodels` package for topic modeling. It requires that our corpus be in a document-term matrix, so we use the `cast_dtm()` function from the `tidytext` package to convert our token data frame into a document-term matrix.
```{r}
sotu_dtm <- sotu_words %>% group_by(id) %>% count(lemma) %>% cast_dtm(id, lemma, n)
```
Now we can use the `LDA()` function in the `topicmodels` package to build a topic model. For our first attempt, we will ask for 15 topics. This process takes a lot of computational time. To save you that time, I have already run the model and saved it as `sotu_lda_15.RDS`. Read it into working memory with the code below.
```{r}
# sotu_lda <- LDA(sotu_dtm, k = 15)
# saveRDS(sotu_lda, "sotu_lda_15.RDS")
sotu_lda_15 <- readRDS("sotu_lda_15.RDS")
sotu_lda_15
```
The following code creates two data frames of results: `sotu_topics` indicates the probability distribution of words over topics; `sotu_docs` indicates the proportion of each document (paragraph) attributed to each topic.
```{r}
sotu_topics <- tidy(sotu_lda_15, matrix = "beta")
head(sotu_topics)
sotu_topics %>% group_by(topic) %>% summarize(total = sum(beta))
sotu_docs <- tidy(sotu_lda_15, matrix = "gamma") 
head(sotu_docs)
sotu_docs %>% group_by(document) %>% summarize(total = sum(gamma))
```

One way to visualize topics is with bar charts of the five most frequent words in each topic.
```{r fig.height = 10}
topics_5 <- function(lda) {
  tidy(lda, matrix = "beta") %>% 
        group_by(topic) %>% top_n(5, beta) %>%
        ggplot(aes(x = reorder(term, beta), y = beta, fill = term)) + 
          geom_col() + coord_flip() + guides(fill = FALSE) +
          facet_wrap(vars(topic), ncol = 5, scales = "free") +
          theme_minimal(base_size = 20) +
          scale_y_continuous(labels = c()) +
          labs(title = "Topic Model - Top Five Words By Topic", 
               x = "Top Five Words", y = "Beta")
}
topics_5(sotu_lda_15)
```
One way to visualize the distribution of topics over documents is to calculate the proportion of each document attributed to each topic. Here I am aggregating over years.
```{r}
topics_years <- function(lda) {
  tidy(lda, matrix = "gamma") %>%
    mutate(year = str_replace(document, "_[0-9]+$", "")) %>% 
    mutate(year = str_replace(year, "[A-Za-z_\\.]+_", "")) %>%
    mutate(year = as.numeric(year)) %>% 
    group_by(year, topic) %>% summarize(paras = sum(gamma)) %>%
    ungroup %>% group_by(year) %>% mutate(p = paras/sum(paras)) %>%
    ggplot(aes(x = year, y = p, fill = factor(topic))) + geom_area() +
      scale_y_continuous(labels = scales::percent) +
      labs(title = "Topic Distribution by Year - SOTU Topic Model",
           x = "Year", y = "Percent of Address (by paragraph)", fill = "Topic")
}
topics_years(sotu_lda_15)
```
This graph indicates that the model very evenly distributed the topics among documents. We can see that by exploring the `gamma` column of the `sotu_docs` data frame.
```{r}
summary(sotu_docs$gamma)
qplot(gamma, data = sotu_docs, binwidth = 0.00001)
```
This distribution is controlled by the alpha **hyperparameter**. We can see what value of alpha was used with the following code:
```{r}
attr(sotu_lda_15, "alpha")
```
This is very high! Let's try a smaller alpha.
```{r fig.height = 10}
# sotu_lda <- LDA(sotu_dtm, k = 15, control = list(alpha = 1))
# saveRDS(sotu_lda, "sotu_lda_15_a1.RDS")

sotu_lda_15a1 <- readRDS("sotu_lda_15_a1.RDS")
topics_5(sotu_lda_15a1)
topics_years(sotu_lda_15a1)
```
Now we are seeing some variation and change over time! We are still in the early stages of building this model, but one way we can evaluate and think with it is by looking at the words in each document that were assigned to each topic. The `augment()` function shows the frequency of each word in each document, and the topic to which it was assigned by the model.
```{r}
sotu_assignments <- augment(sotu_lda_15a1, data = sotu_dtm)
head(sotu_assignments)
tail(sotu_assignments)
```
We can now see what proportion of each presidents' words was assigned to which topic. 
```{r}
hmap <- function(lda, dtm, t) {
  pres_words <- augment(lda, data = dtm) %>% 
                  mutate(pres = str_replace_all(document, "[_0-9\\.]+$", "")) %>% 
                  mutate(pres = str_replace_all(pres, "_", " ")) %>%
                  mutate(pres = fct_relevel(factor(pres), unique(sotu$pres))) %>%
                  group_by(pres, .topic) %>% 
                  summarize(n = sum(count)) %>% ungroup %>%
                  group_by(pres) %>% mutate(p = n/sum(n))
  t_words <- tidy(lda, matrix = "beta") %>% 
              arrange(topic, -beta) %>%
              group_by(topic) %>% top_n(5, beta) %>% 
              summarize(words = paste(term, collapse = " "))
  left_join(pres_words, t_words, by = c(".topic" = "topic")) %>%
    ggplot(aes(x = pres, y = words, fill = p)) + geom_tile() +
      scale_fill_gradient(low = "white", high = "steelblue") +
      theme_minimal() + 
      scale_x_discrete(expand = c(0, 0)) +
      scale_y_discrete(expand = c(0, 0)) +
      theme(legend.position = "none", axis.text.x = element_text(angle = 270, hjust = 0)) +
      labs(title = paste0("State of the Union Addresses by President, ", t, "-Topic Model"), 
          x = "President", y = "Topic")
}
png("hmap_15_topics.png", height = 5, width = 15, units = "in", res = 100)
  hmap(sotu_lda_15a1, sotu_dtm, 15)
dev.off()
```



We have not yet explored different values of the k parameter, which specifies the number of topics. One package that can help with that is `ldatuning`. This package allows you to try a variety of topic numbers, and generates four metrics that indicate fit.
```{r}
library(ldatuning)
control_list <- list(alpha = 1)
  topic_number <- FindTopicsNumber(sotu_dtm, 
                                 topics = c(seq(from = 10, to = 20, by = 1), 
                                            seq(20, 30, 2), seq(30, 50, 5)),
                                 metrics = c("Griffiths2004", "CaoJuan2009", 
                                             "Arun2010", "Deveaud2014"),
                                 method = "Gibbs",
                                 control = control_list,
                                 mc.cores = 4L,
                                 verbose = TRUE)
saveRDS(topic_number, "topic_number.RDS")
topic_number <- readRDS("topic_number.RDS")
FindTopicsNumber_plot(topic_number)
```
It looks like more topics is better, so let's try it with 50.
```{r}
sotu_lda_50a1 <- LDA(sotu_dtm, k = 50, control = list(alpha = 1))
saveRDS(sotu_lda_50a1, "sotu_lda_50_a1.RDS")
```
