The video that accompanies this notebook is available at https://ucdavis.box.com/v/sts-205-notebook-6.
In this notebook, we will use classification algorithms to consider the party of U.S. presidents. Since the early nineteenth century, most presidents (and all presidents after Andrew Johnson) have been either Democrats or Republicans, though as we know, the parties have stood for different things at different times. We will use classification algorithms to assign earlier presidents to one or the other of these parties, on the basis of word usage in their State of the Union (SOTU) addresses.
Begin by loading the packages we will use, sourcing your functions, and making the sotu data frame.
library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
method from
print.tbl_lazy
print.tbl_sql
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
✓ ggplot2 3.3.3 ✓ purrr 0.3.4
✓ tibble 3.0.6 ✓ dplyr 1.0.4
✓ tidyr 1.1.2 ✓ stringr 1.4.0
✓ readr 1.4.0 ✓ forcats 0.5.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
library(tidytext)
#install.packages("class")
library(class)
#install.packages("e1071")
library(e1071)
source("functions.r")
sotu <- make_sotu()
── Column specification ────────────────────────────────────────────────────────
cols(
year = col_double(),
pres = col_character(),
use_last = col_logical()
)
We will need to add a column to sotu indicating whether each president was a Democrat, a Republican, or Other. We will begin by making a vector of the presidents in each party. It is important to spell each president’s name exactly the way it is spelled in the sotu data frame.
republicans <- c("Abraham Lincoln", "Ulysses S. Grant", "Rutherford B. Hayes", "James Garfield",
"Chester A. Arthur", "Benjamin Harrison", "William McKinley", "Theodore Roosevelt",
"William H. Taft", "Warren Harding", "Calvin Coolidge", "Herbert Hoover",
"Dwight D. Eisenhower", "Richard Nixon", "Gerald R. Ford", "Ronald Reagan",
"George H.W. Bush", "George W. Bush", "Donald J. Trump")
democrats <- c("Andrew Jackson", "Martin van Buren", "James Polk", "Franklin Pierce",
"James Buchanan", "Grover Cleveland", "Woodrow Wilson", "Franklin D. Roosevelt",
"Harry S. Truman", "John F. Kennedy", "Lyndon B. Johnson", "Jimmy Carter",
"William J. Clinton", "Barack Obama")
federalists <- c("George Washington", "John Adams")
democratic_republicans <- c("Thomas Jefferson", "James Madison", "James Monroe", "John Quincy Adams")
whigs <- c("William Henry Harrison", "John Tyler", "Zachary Taylor", "Millard Fillmore")
union <- c("Andrew Johnson")
Now add the party column to sotu.
sotu <- sotu %>% mutate(party = ifelse(pres %in% republicans, "Republican",
ifelse(pres %in% democrats, "Democratic", "Other")))
#Check which presidents are being classified as "Other"
unique(sotu$pres[sotu$party == "Other"])
[1] "George Washington" "John Adams" "Thomas Jefferson"
[4] "James Madison" "James Monroe" "John Quincy Adams"
[7] "John Tyler" "Zachary Taylor" "Millard Fillmore"
[10] "Andrew Johnson"
For classification, as for clustering, we need to decide on the criteria we will use. For this notebook, let’s use the 500 most frequently used words. The first classification algorithm we will use is a Support Vector Machine (SVM). The SVM algorithm identifies a hyperplane in n-dimensional space, where n is the number of features used for classification (in this case, 500). Unknown data points are assigned to a category based on which side of the hyperplane they fall on. In two dimensions, the hyperplane is simply a line: imagine a line separating the Republican addresses from the Democratic addresses.
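As a toy illustration of the idea (separate from the SOTU pipeline), here is a minimal sketch using the e1071 package on two invented two-dimensional clusters. The data, labels, and new point are all made up for this example.

```r
library(e1071)

# Two invented, well-separated clusters in two dimensions
set.seed(1)
toy <- data.frame(x = c(rnorm(10, mean = 0), rnorm(10, mean = 5)),
                  y = c(rnorm(10, mean = 0), rnorm(10, mean = 5)))
group <- factor(rep(c("A", "B"), each = 10))

# The SVM finds a boundary separating the two clusters
toy_model <- svm(toy, group)

# A new point near the second cluster should land on the "B" side
predict(toy_model, data.frame(x = 4.8, y = 5.2))
```

With only two features, the "hyperplane" the model learns really is (approximately) a line between the clusters; with our 500 word frequencies, the same logic plays out in 500 dimensions.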
Start by identifying the top 500 words and making a document-term matrix. We will build it manually rather than using the cast_dtm() function; this gives us a data frame that looks like a DTM but still has the year and party columns.
#Identify top 500 words
top_words <- sotu_tokenize_words() %>% count(gram) %>% top_n(500)
Selecting by n
#Make a document-term matrix of those words (rename year and party to avoid confusion)
dtm <- sotu_tokenize_words() %>% filter(gram %in% top_words$gram) %>%
group_by(year, party) %>% count(gram) %>% mutate(tf = n/sum(n)) %>%
rename(year_ = year, party_ = party) %>% select(-n) %>% spread(gram, tf)
head(dtm)
#Substitute zero for missing values
dtm[is.na(dtm)] <- 0
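The NAs appear because spread() leaves a cell empty whenever a given word never occurs in a given address. A toy illustration with invented counts (the years, words, and frequencies here are made up):

```r
library(tidyr)

# Invented term frequencies: "peace" never occurs in the 1902 address
toy_counts <- data.frame(year = c(1901, 1901, 1902),
                         gram = c("war", "peace", "war"),
                         tf   = c(0.4, 0.6, 1.0))

# Spreading produces one column per word; the peace/1902 cell is NA
wide <- spread(toy_counts, gram, tf)
wide

# Replace the missing values with zero, as above
wide[is.na(wide)] <- 0
```

Zero is the right substitution here because a missing cell means the word was never counted for that address, i.e., its term frequency is zero.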
Now we will divide the data frame into two groups: a training set (Republicans and Democrats) and a set to be classified (Others)
#Divide the dtm into a training set and a classification set
train <- dtm[dtm$party_ != "Other", ]
class <- dtm[dtm$party_ == "Other", ]
head(train)
head(class)
The svm() function from the e1071 package trains the SVM model. It takes two arguments: the features to be used for classification (columns 3 and onward of the train data frame) and the category of each known document (the party_ column of the train data frame).
#Train the model on the training set
#svm(location of training documents, classification of training documents)
model <- svm(train[ , 3:ncol(train)], factor(train$party_))
To classify the addresses given by presidents who were neither Democrats nor Republicans, we use the predict() function. It takes two arguments: the model (which we have saved as “model”) and the locations of the documents we want to classify.
predict(model, class[ , 3:ncol(class)])
1 2 3 4 5 6 7
Democratic Democratic Democratic Democratic Democratic Democratic Democratic
8 9 10 11 12 13 14
Democratic Democratic Democratic Democratic Democratic Democratic Democratic
15 16 17 18 19 20 21
Democratic Democratic Democratic Democratic Democratic Democratic Democratic
22 23 24 25 26 27 28
Democratic Democratic Democratic Democratic Democratic Democratic Democratic
29 30 31 32 33 34 35
Democratic Democratic Democratic Democratic Democratic Democratic Democratic
36 37 38 39 40 41 42
Democratic Democratic Democratic Democratic Democratic Democratic Democratic
43 44 45 46 47 48 49
Democratic Democratic Democratic Democratic Democratic Democratic Democratic
50 51 52
Democratic Democratic Democratic
Levels: Democratic Republican
The result is a vector of predictions, one for each row of the class data frame. To see which presidents these correspond to, we can use the vector to create a new column in a version of the sotu data frame that has been filtered to include only presidents who were neither Republicans nor Democrats.
predicted <- sotu %>% select(-use_last, -text) %>% filter(party == "Other") %>%
mutate(prediction = predict(model, class[ , 3:ncol(class)]))
Now we can tally up the results by president using a two-way table.
table(predicted$pres, predicted$prediction)
Democratic Republican
Andrew Johnson 4 0
George Washington 8 0
James Madison 8 0
James Monroe 8 0
John Adams 4 0
John Quincy Adams 4 0
John Tyler 4 0
Millard Fillmore 3 0
Thomas Jefferson 8 0
Zachary Taylor 1 0
We don’t know how accurate this classification is (in a sense, it is not at all accurate, since none of these presidents were either Democrats or Republicans), but it suggests that, when our current parties emerged, the Democratic Party offered more consistency with previous politics and the Republican Party offered something new (remember that the Republican party at that time was the pro-business and anti-slavery party).
We will get to the problem of accuracy later. For now, let’s see what happens when we use the top 100 bigrams rather than the top 500 words as our feature set for classification.
top_bigrams <- sotu_tokenize_bigrams() %>% count(gram) %>% top_n(100)
Selecting by n
dtm <- sotu_tokenize_bigrams() %>% filter(gram %in% top_bigrams$gram) %>%
group_by(year, party) %>% count(gram) %>%
mutate(tf = n/sum(n), gram = str_replace(gram, " ", "_")) %>%
rename(year_ = year, party_ = party) %>% select(-n) %>% spread(gram, tf)
dtm[is.na(dtm)] <- 0
train <- dtm[dtm$party_ != "Other", ]
class <- dtm[dtm$party_ == "Other", ]
model <- svm(train[ , 3:ncol(train)], factor(train$party_))
predicted <- sotu %>% select(-use_last, -text) %>% filter(party == "Other") %>%
mutate(prediction = predict(model, class[ , 3:ncol(class)]))
svm_bigrams <- table(predicted$pres, predicted$prediction)
svm_bigrams
Democratic Republican
Andrew Johnson 3 1
George Washington 8 0
James Madison 8 0
James Monroe 8 0
John Adams 4 0
John Quincy Adams 4 0
John Tyler 4 0
Millard Fillmore 3 0
Thomas Jefferson 8 0
Zachary Taylor 1 0
We get very similar results with this model.
The next algorithm we will use is k-nearest neighbors. As you may recall from the reading, KNN assigns an unknown object to a category based on the categories of its nearest neighbors in n-dimensional space, where n is the number of features being used for classification. For KNN, we will use the knn() function from the class package. We will run it with k = 3 and k = 11, but you can use any odd number for k. Our training and classification data sets are the same as the ones we used for the bigram SVM above, so we are classifying according to the top 100 bigrams (we could also do it with individual words, trigrams, etc.). With KNN, training and prediction happen in a single step. The knn() function takes four arguments: the training data set, the classification data set, the category of each training document, and k. It returns a vector of predictions for the classification data set.
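Before running it on the SOTU data, here is a minimal sketch of how knn() behaves, using invented one-dimensional data:

```r
library(class)

# Invented training data: three points near 0, three near 10
train_toy  <- data.frame(x = c(0, 1, 2, 9, 10, 11))
labels_toy <- factor(c("low", "low", "low", "high", "high", "high"))

# Classify two new points by majority vote among their 3 nearest neighbors
knn(train_toy, data.frame(x = c(1.5, 9.5)), labels_toy, k = 3)
# the first point is classified "low", the second "high"
```

The point at 1.5 has neighbors 0, 1, and 2 (all "low"); the point at 9.5 has neighbors 9, 10, and 11 (all "high"). Using an odd k avoids ties in the vote.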
#knn(location of training documents, location of classification documents,
# classification of training documents, k)
pred3 <- knn(train[, 3:ncol(train)], class[, 3:ncol(class)], factor(train$party_), k = 3)
We can convert these predictions into the same kind of table we made before.
knn3_bigrams <- sotu %>% filter(party == "Other") %>% mutate(prediction = pred3) %>%
group_by(pres) %>% count(prediction) %>% spread(prediction, n)
knn3_bigrams
We can write a function to do the whole process for any value of k.
predict_party <- function(k) {
prediction <- knn(train[, 3:ncol(train)], class[, 3:ncol(class)], factor(train$party_), k = k)
pred_words <- sotu %>% filter(party == "Other") %>% mutate(prediction = prediction) %>%
group_by(pres) %>% count(prediction) %>% spread(prediction, n)
return(pred_words)
}
predict_party(11)
knn11_bigrams <- predict_party(11)
As you can see, we get similar, but not identical results with k = 11. We can look at the predictions generated by all of the bigram models side-by-side.
data.frame(svm_bigrams) %>% rename(pres = Var1) %>% spread(Var2, Freq) %>%
rename(Dem_SVM = Democratic, Rep_SVM = Republican) %>%
inner_join(knn3_bigrams) %>% rename(Dem_3NN = Democratic, Rep_3NN = Republican) %>%
inner_join(knn11_bigrams) %>% rename(Dem_11NN = Democratic, Rep_11NN = Republican)
Joining, by = "pres"
Joining, by = "pres"
From this, we learn a few things. The SOTU addresses of most of the earliest presidents are consistently more similar (when considering the top 100 bigrams) to those of later Democrats than to those of later Republicans. Others (Monroe, Adams, Tyler, Taylor) are either less consistent or less easily classified.
However, we have no way of deciding which of the three models gives a better prediction. We know that all of them are wrong in the sense that none of these presidents was actually a Democrat or Republican.
We can test the models by examining how well they classify the SOTU addresses of presidents who really were Democrats or Republicans. One way to do this is with leave-one-out cross-validation. The idea is that, for as many known data points as we have (in this case, 181 SOTU addresses given by presidents who were either Republicans or Democrats), we fit the model that many times, each time leaving out one data point. Each time we fit the model, we use it to predict the category of the one document we left out. Once we have finished, we have one prediction for each document in our data set, which we can then use to build a confusion matrix.
Here is how we would do it for the SVM model.
#Make an empty data frame to hold predictions
predictions <- data.frame()
#For each row in the "train" data frame
for(i in 1:nrow(train)) {
#Build a model using all addresses except i
model <- svm(train[-i, 3:ncol(train)], factor(train[-i, ]$party_))
#Use the model to classify address i
pred <- predict(model, train[i, 3:ncol(train)])
#Attach the prediction for address i to its actual party
predict <- train[i, 1:2] %>% ungroup %>% mutate(prediction = pred)
#Add this prediction to the "predictions" data frame
predictions <- rbind(predictions, predict)
}
#Create confusion matrix
table(predictions$party_, predictions$prediction)
Democratic Republican
Democratic 72 17
Republican 25 67
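A confusion matrix like this can be reduced to a single accuracy figure by dividing the correctly classified documents (the diagonal) by the total. A sketch using the counts printed above:

```r
# Confusion matrix from the leave-one-out run (rows = actual, cols = predicted)
cm <- matrix(c(72, 17,
               25, 67),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("Democratic", "Republican"),
                             c("Democratic", "Republican")))

# Accuracy = correct predictions / all predictions
accuracy <- sum(diag(cm)) / sum(cm)
accuracy   # (72 + 67) / 181, about 0.77
```

Since an address has roughly even odds of being Democratic or Republican, classifying correctly about 77% of the time is substantially better than chance, though far from perfect.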
Here is how we would do leave-one-out cross-validation for the KNN model.
loo_knn <- function(k) {
predictions <- data.frame()
for(i in 1:nrow(train)) {
pred <- knn(train[-i, 3:ncol(train)],
train[i, 3:ncol(train)],
factor(train[-i, ]$party_), k = k)
predict <- train[i, 1:2] %>% ungroup %>% mutate(prediction = pred)
predictions <- rbind(predictions, predict)
}
print(table(predictions$party_, predictions$prediction))
}
loo_knn(1)
Democratic Republican
Democratic 68 21
Republican 26 66
loo_knn(3)
Democratic Republican
Democratic 63 26
Republican 23 69
Leave-one-out cross-validation is a special case of a more general concept called k-fold cross-validation. In k-fold cross-validation, you split your training data into k groups and fit the model k times, each time leaving out a different group and using that group as the prediction set. This means that each document in the data set is predicted once and is used for training k - 1 times. Leave-one-out cross-validation is k-fold cross-validation where k is equal to the number of documents in the training set. Most commonly, k = 10, giving us 10-fold cross-validation. (Note that this k is unrelated to the k in KNN.) The coding is a bit more cumbersome, but the general idea is the same. For this notebook, we will only do 10-fold cross-validation for the SVM model.
The first thing we need to do is shuffle our documents randomly. We will do this by creating a vector of the numbers from 1 to 181 in random order.
test <- sample(1:181, 181, replace = FALSE)
Now we will make a for loop that leaves out approximately 10% of the documents each time it runs.
for(i in 1:10) {
start <- (i - 1) * 18 + 1
end <- ifelse(i < 10, start + 17, 181)
testing <- start:end
print(testing)
print(test[testing])
}
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
[1] 141 73 108 30 127 118 9 179 6 92 132 140 157 60 35 26 56 116
[1] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
[1] 125 22 180 166 96 53 65 151 62 107 178 64 34 3 123 87 144 11
[1] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
[1] 143 59 82 50 97 21 10 38 119 75 175 147 115 137 41 54 79 100
[1] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
[1] 103 45 110 174 149 162 48 51 128 42 169 98 63 181 131 49 109 15
[1] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
[1] 17 90 105 13 20 12 163 99 156 164 95 37 146 168 89 160 68 114
[1] 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108
[1] 32 24 85 172 124 25 154 133 31 83 44 139 43 122 8 36 93 29
[1] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
[1] 71 2 176 46 161 94 102 112 104 148 5 86 117 72 134 58 19 18
[1] 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
[1] 159 74 120 111 142 145 152 77 91 4 55 129 67 47 80 101 130 40
[1] 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
[1] 69 170 16 81 177 106 76 155 113 88 7 84 23 153 167 138 52 135
[1] 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181
[1] 61 78 165 28 1 39 158 173 57 14 121 126 136 66 150 33 27 70 171
Each time through this for loop, we want to fit the SVM model using all observations except those we are testing in that iteration. The observations being tested will always be train[test[testing], ].
for(i in 1:10) {
start <- (i - 1) * 18 + 1
end <- ifelse(i < 10, start + 17, 181)
testing <- start:end
model <- svm(train[-test[testing], 3:ncol(train)], factor(train[-test[testing], ]$party_))
pred <- predict(model, train[test[testing], 3:ncol(train)])
predict <- train[test[testing], 1:2] %>% ungroup %>% mutate(prediction = pred)
print(predict)
}
Now we just add a data frame to collect the results, and then we can make the confusion matrix.
predictions <- data.frame()
test <- sample(1:181, 181, replace = FALSE)
for(i in 1:10) {
start <- (i - 1) * 18 + 1
end <- ifelse(i < 10, start + 17, 181)
testing <- start:end
model <- svm(train[-test[testing], 3:ncol(train)], factor(train[-test[testing], ]$party_))
pred <- predict(model, train[test[testing], 3:ncol(train)])
predict <- train[test[testing], 1:2] %>% ungroup %>% mutate(prediction = pred)
predictions <- rbind(predictions, predict)
}
table(predictions$party_, predictions$prediction)
Democratic Republican
Democratic 70 19
Republican 24 68