The video that accompanies this notebook is available at https://ucdavis.box.com/v/sts-205-notebook-2.
In this notebook we will be analyzing and visualizing word frequencies in the State of the Union addresses. Let’s begin by looking at the metadata table we created last week.
library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
method from
print.tbl_lazy
print.tbl_sql
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
✓ ggplot2 3.3.3 ✓ purrr 0.3.4
✓ tibble 3.0.6 ✓ dplyr 1.0.4
✓ tidyr 1.1.2 ✓ stringr 1.4.0
✓ readr 1.4.0 ✓ forcats 0.5.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
sotu_meta <- read_csv("sotu_metadata.csv")
── Column specification ────────────────────────────────────────────────────────
cols(
year = col_double(),
pres = col_character(),
use_last = col_logical()
)
head(sotu_meta)
In general, there are two ways to work with data frames in R. Base R allows you to access any column of a data frame by specifying the name of the data frame followed by a dollar sign and the name of the column.
sotu_meta$year
[1] 1790.0 1790.5 1791.0 1792.0 1793.0 1794.0 1795.0 1796.0 1797.0 1798.0
[11] 1799.0 1800.0 1801.0 1802.0 1803.0 1804.0 1805.0 1806.0 1807.0 1808.0
[21] 1809.0 1810.0 1811.0 1812.0 1813.0 1814.0 1815.0 1816.0 1817.0 1818.0
[31] 1819.0 1820.0 1821.0 1822.0 1823.0 1824.0 1825.0 1826.0 1827.0 1828.0
[41] 1829.0 1830.0 1831.0 1832.0 1833.0 1834.0 1835.0 1836.0 1837.0 1838.0
[51] 1839.0 1840.0 1841.0 1842.0 1843.0 1844.0 1845.0 1846.0 1847.0 1848.0
[61] 1849.0 1850.0 1851.0 1852.0 1853.0 1854.0 1855.0 1856.0 1857.0 1858.0
[71] 1859.0 1860.0 1861.0 1862.0 1863.0 1864.0 1865.0 1866.0 1867.0 1868.0
[81] 1869.0 1870.0 1871.0 1872.0 1873.0 1874.0 1875.0 1876.0 1877.0 1878.0
[91] 1879.0 1880.0 1881.0 1882.0 1883.0 1884.0 1885.0 1886.0 1887.0 1888.0
[101] 1889.0 1890.0 1891.0 1892.0 1893.0 1894.0 1895.0 1896.0 1897.0 1898.0
[111] 1899.0 1900.0 1901.0 1902.0 1903.0 1904.0 1905.0 1906.0 1907.0 1908.0
[121] 1909.0 1910.0 1911.0 1912.0 1913.0 1914.0 1915.0 1916.0 1917.0 1918.0
[131] 1919.0 1920.0 1921.0 1922.0 1923.0 1924.0 1925.0 1926.0 1927.0 1928.0
[141] 1929.0 1930.0 1931.0 1932.0 1934.0 1935.0 1936.0 1937.0 1938.0 1939.0
[151] 1940.0 1941.0 1942.0 1943.0 1944.0 1945.0 1946.0 1947.0 1948.0 1949.0
[161] 1950.0 1951.0 1952.0 1953.0 1953.5 1954.0 1955.0 1956.0 1957.0 1958.0
[171] 1959.0 1960.0 1961.0 1961.5 1962.0 1963.0 1964.0 1965.0 1966.0 1967.0
[181] 1968.0 1969.0 1970.0 1971.0 1972.0 1973.0 1974.0 1975.0 1976.0 1977.0
[191] 1978.0 1979.0 1980.0 1981.0 1982.0 1983.0 1984.0 1985.0 1986.0 1987.0
[201] 1988.0 1989.0 1990.0 1991.0 1992.0 1993.0 1994.0 1995.0 1996.0 1997.0
[211] 1998.0 1999.0 2000.0 2001.0 2001.5 2002.0 2003.0 2004.0 2005.0 2006.0
[221] 2007.0 2008.0 2009.0 2010.0 2011.0 2012.0 2013.0 2014.0 2015.0 2016.0
[231] 2017.0 2018.0 2019.0
You can also use functions to summarize the columns. Here we use the table() function to see how many addresses were given by each president.
table(sotu_meta$pres)
Abraham Lincoln Andrew Jackson Andrew Johnson
4 8 4
Barack Obama Benjamin Harrison Calvin Coolidge
8 4 6
Chester A. Arthur Donald J. Trump Dwight D. Eisenhower
4 3 9
Franklin D. Roosevelt Franklin Pierce George H.W. Bush
12 4 4
George W. Bush George Washington Gerald R. Ford
9 8 3
Grover Cleveland Harry S. Truman Herbert Hoover
8 8 4
James Buchanan James Madison James Monroe
4 8 8
James Polk Jimmy Carter John Adams
4 4 4
John F. Kennedy John Quincy Adams John Tyler
3 4 4
Lyndon B. Johnson Martin van Buren Millard Fillmore
6 4 3
Richard Nixon Ronald Reagan Rutherford B. Hayes
5 7 4
Theodore Roosevelt Thomas Jefferson Ulysses S. Grant
8 8 8
Warren Harding William H. Taft William J. Clinton
2 4 8
William McKinley Woodrow Wilson Zachary Taylor
4 8 1
You can subset a data frame by rows, by columns, or both using square brackets.
sotu_meta[1:2, ] #First two rows
sotu_meta[, 1:2] #First two columns
sotu_meta[1:2, 1:2] #First two rows and columns
sotu_meta[sotu_meta$pres == "George Washington", ] #Rows where pres is George Washington
sotu_meta[sotu_meta$use_last, ] #Rows where use_last is TRUE
sotu_meta[sotu_meta$year %in% 1800:1899, ] #Addresses given between 1800 and 1899
You can add a column by naming it and assigning it a value. Here we will add a column with the president’s last name.
sotu_meta$pres_last <- str_extract(sotu_meta$pres, "[A-Z][a-z]+$")
head(sotu_meta)
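The regular expression "[A-Z][a-z]+$" matches a capital letter followed by one or more lowercase letters at the end of the string, which here is the last name. As a quick illustration (a minimal sketch on a few sample names, not part of the pipeline):
str_extract(c("George Washington", "John Quincy Adams", "Martin van Buren"), "[A-Z][a-z]+$")
#Returns "Washington" "Adams" "Buren"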
The dplyr package allows you to manipulate data frames without using the dollar sign notation, which often makes your code cleaner. The dplyr package is part of the tidyverse, so we don’t need to install it separately. The dplyr function select() allows you to select columns by name.
select(sotu_meta, year, pres)
You can also use pipes with dplyr functions:
sotu_meta <- sotu_meta %>% select(-pres_last)
head(sotu_meta)
In the code above, I changed the sotu_meta data frame to eliminate the pres_last column. The dplyr function filter() allows you to select rows that meet given criteria.
sotu_meta %>% filter(pres == "George Washington")
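filter() can also take several conditions at once, separated by commas, which are combined with a logical AND. For instance, a hypothetical query for Washington’s addresses after 1792 might look like this:
sotu_meta %>% filter(pres == "George Washington", year > 1792)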
The dplyr function arrange() lets you sort the data frame by the value of a given column.
arrange(sotu_meta, -year)
sotu_meta %>% arrange(-year)
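The minus sign works for sorting numeric columns in descending order; dplyr’s desc() helper does the same thing and also works for character columns, so the following would be an equivalent (and arguably clearer) way to write the sort above:
sotu_meta %>% arrange(desc(year))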
The dplyr function mutate() lets you change the value of a column or create a new column.
sotu_meta %>% mutate(pres_last = str_extract(pres, "[A-Z][a-z]+$"))
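The example above creates a new column. To illustrate changing an existing column instead, we could round the half-years (which mark a second address in the same calendar year) down to whole years; this is just a throwaway illustration, so we won’t save the result.
sotu_meta %>% mutate(year = floor(year))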
Now we are going to read in the text of the addresses and convert them to a vector where each element is a separate address, just as we did last week.
sotu <- readLines("stateoftheunion1790-2019.txt") %>%
str_c(collapse = " ") %>%
str_split("\\*\\*\\* ") %>% unlist
The sotu_meta table has 233 rows and the sotu vector has 233 elements, so we can make the sotu vector a column of the sotu_meta table.
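Before attaching the vector, it is worth confirming that the dimensions really do line up; both of the following should return 233:
nrow(sotu_meta) #number of rows in the metadata table
length(sotu)    #number of addresses in the text vector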
sotu_meta$text <- sotu
head(sotu_meta)
Now the text of each address is in the text column of the sotu_meta table. We want to get rid of the first paragraph of every address, since that is only metadata, and we want to get rid of the last paragraph of every address where use_last is FALSE. The code below replaces the first paragraph with nothing, using the same regular expression we used last week to extract the first paragraph.
sotu_meta <- sotu_meta %>% mutate(text = str_replace(text, "[^<]+<p> ", ""))
head(sotu_meta)
sotu_meta <- sotu_meta %>% mutate(remove = ifelse(use_last, "", str_extract(text, " <p>[^<]+$")),
text = ifelse(use_last, text, str_replace(text, " <p>[^<]+$", "")))
sotu_meta$remove
[1] ""
[2] " <p> GO. WASHINGTON "
[3] " <p> GO. WASHINGTON "
[4] " <p> GO. WASHINGTON "
[5] " <p> GO. WASHINGTON "
[6] " <p> GO. WASHINGTON "
[7] " <p> GO. WASHINGTON "
[8] " <p> GO. WASHINGTON "
[9] ""
[10] ""
[11] ""
[12] ""
[13] ""
[14] " <p> TH. JEFFERSON "
[15] " <p> TH. JEFFERSON "
[16] " <p> TH. JEFFERSON "
[17] " <p> TH. JEFFERSON "
[18] " <p> TH. JEFFERSON "
[19] " <p> TH. JEFFERSON "
[20] " <p> TH. JEFFERSON "
[21] ""
[22] ""
[23] ""
[24] ""
[25] ""
[26] ""
[27] ""
[28] ""
[29] ""
[30] ""
[31] ""
[32] ""
[33] ""
[34] ""
[35] ""
[36] ""
[37] " <p> JOHN QUINCY ADAMS "
[38] " <p> JOHN QUINCY ADAMS "
[39] " <p> JOHN QUINCY ADAMS "
[40] " <p> JOHN QUINCY ADAMS "
[41] ""
[42] ""
[43] ""
[44] ""
[45] ""
[46] ""
[47] ""
[48] ""
[49] ""
[50] " <p> M. VAN BUREN "
[51] " <p> M. VAN BUREN "
[52] " <p> M. VAN BUREN "
[53] ""
[54] ""
[55] ""
[56] ""
[57] " <p> JAMES K. POLK "
[58] " <p> JAMES K. POLK "
[59] " <p> JAMES K. POLK "
[60] " <p> JAMES K. POLK "
[61] " <p> Z. TAYLOR. "
[62] ""
[63] ""
[64] ""
[65] ""
[66] ""
[67] ""
[68] ""
[69] ""
[70] ""
[71] ""
[72] ""
[73] ""
[74] ""
[75] ""
[76] ""
[77] ""
[78] ""
[79] ""
[80] ""
[81] " <p> U. S. GRANT "
[82] " <p> U. S. GRANT "
[83] " <p> U. S. GRANT "
[84] ""
[85] " <p> U. S. GRANT "
[86] ""
[87] " <p> U. S. GRANT "
[88] " <p> U. S. GRANT "
[89] ""
[90] " <p> R. B. HAYES "
[91] ""
[92] ""
[93] ""
[94] ""
[95] ""
[96] ""
[97] ""
[98] ""
[99] ""
[100] ""
[101] " <p> BENJ. HARRISON "
[102] ""
[103] " <p> BENJ. HARRISON "
[104] " <p> BENJ. HARRISON "
[105] " <p> GROVER CLEVELAND "
[106] " <p> GROVER CLEVELAND "
[107] " <p> GROVER CLEVELAND "
[108] " <p> GROVER CLEVELAND "
[109] ""
[110] ""
[111] ""
[112] ""
[113] ""
[114] ""
[115] ""
[116] ""
[117] ""
[118] ""
[119] ""
[120] " <p> Tuesday, December 8, 1908. "
[121] ""
[122] ""
[123] ""
[124] ""
[125] ""
[126] ""
[127] ""
[128] ""
[129] ""
[130] ""
[131] ""
[132] ""
[133] ""
[134] ""
[135] ""
[136] ""
[137] ""
[138] ""
[139] ""
[140] ""
[141] " <p> December 3, 1929 "
[142] " <p> December 2, 1930 "
[143] " <p> December 8, 1931 "
[144] " <p> HERBERT HOOVER The White House, December 6, 1932. "
[145] ""
[146] ""
[147] ""
[148] ""
[149] ""
[150] ""
[151] ""
[152] ""
[153] ""
[154] ""
[155] ""
[156] ""
[157] ""
[158] ""
[159] ""
[160] ""
[161] ""
[162] ""
[163] ""
[164] ""
[165] ""
[166] ""
[167] ""
[168] ""
[169] ""
[170] ""
[171] ""
[172] ""
[173] ""
[174] ""
[175] ""
[176] ""
[177] ""
[178] ""
[179] ""
[180] ""
[181] ""
[182] ""
[183] ""
[184] ""
[185] ""
[186] " <p> February 2, 1973. "
[187] ""
[188] ""
[189] ""
[190] ""
[191] ""
[192] ""
[193] ""
[194] ""
[195] " <p> NOTE: The President spoke at 9 p.m. in the House Chamber at the Capitol. He was introduced by Thomas P. O'Neill, Jr., Speaker of the House of Representatives. The address was broadcast live on nationwide radio and television. "
[196] " <p> NOTE: The President spoke at 9:03 p.m. in the House Chamber of the Capitol. He was introduced by Thomas P. O'Neill, Jr., Speaker of the House of Representatives. The address was broadcast live on nationwide radio and television. "
[197] " <p> NOTE: The President spoke at 9:02 p.m. in the House Chamber of the Capitol. He was introduced by Thomas P. O'Neill, Jr., Speaker of the House of Representatives. The address was broadcast live on nationwide radio and television. "
[198] " <p> NOTE: The President spoke at 9:05 p.m. in the House Chamber of the Capitol. He was introduced by Thomas P. O'Neill, Jr., Speaker of the House of Representatives. The address was broadcast live on nationwide radio and television. "
[199] " <p> NOTE: The President spoke at 8:04 p.m. in the House Chamber of the Capitol. He was introduced by Thomas P. O'Neill, Jr., Speaker of the House of Representatives. The address was broadcast live on nationwide radio and television. "
[200] " <p> NOTE: The President spoke at 9:03 p.m. in the House Chamber of the Capitol. He was introduced by Jim Wright, Speaker of the House of Representatives. The address was broadcast live on nationwide radio and television. "
[201] " <p> NOTE: The President spoke at 9:07 p.m. in the House Chamber of the Capitol. He was introduced by Jim Wright, Speaker of the House of Representatives. The address was broadcast live on nationwide radio and television. "
[202] ""
[203] ""
[204] ""
[205] ""
[206] ""
[207] ""
[208] ""
[209] ""
[210] ""
[211] ""
[212] ""
[213] ""
[214] " <p> GEORGE W. BUSH. THE WHITE HOUSE, February 27, 2001. "
[215] " <p> GEORGE W. BUSH. THE WHITE HOUSE, September 20, 2001. "
[216] ""
[217] ""
[218] ""
[219] ""
[220] " <p> [Applause, the Members rising.] "
[221] " <p> [Applause, the Members rising.] "
[222] ""
[223] ""
[224] ""
[225] ""
[226] ""
[227] ""
[228] ""
[229] ""
[230] ""
[231] ""
[232] " <p> THE WHITE HOUSE, January 30, 2018. "
[233] ""
Now that we have what we want (a data frame with the text of the State of the Union addresses and the metadata), we are going to write a function so that we can do this all in one step any time we want.
A function is just an algorithm that we make up ourselves and want to re-use. To give an example, we can make a function called do_math() that will take any number, multiply it by two, and then add five.
do_math <- function(donuts) {
donuts*2 + 5
}
do_math(5)
[1] 15
#Because arithmetic operations in R are vectorized, we can also pass a vector to the function
do_math(c(1, 3, 5, 7))
[1] 7 11 15 19
The function to make the State of the Union table won’t take any arguments; it will just make the table of State of the Union addresses and return it.
make_sotu <- function() {
sotu_meta <- read_csv("sotu_metadata.csv")
sotu_meta$text <- unlist(str_split(str_c(readLines("stateoftheunion1790-2019.txt"), collapse = " "), "\\*\\*\\* "))
sotu_meta <- sotu_meta %>% mutate(text = str_replace(text, "[^<]+<p> ", "")) %>%
mutate(text = ifelse(use_last, text, str_replace(text, " <p>[^<]+$", "")))
return(sotu_meta)
}
Copy this function into a new R script called functions.r. The code below clears out your working environment, runs the functions.r script, and makes the sotu data frame.
rm(list = ls())
source("functions.r")
sotu <- make_sotu()
── Column specification ────────────────────────────────────────────────────────
cols(
year = col_double(),
pres = col_character(),
use_last = col_logical()
)
head(sotu)
We won’t be using the paragraph tags today, so we can remove them from the text column of sotu.
#Eliminate paragraph tags
sotu$text <- str_replace_all(sotu$text, "<p>", "")
The first step in calculating word frequencies is tokenizing our text, which means splitting it up into units. The tidytext package offers useful tools for tokenizing.
#install.packages("tidytext")
library(tidytext)
For word frequencies, our units will be words. We will use the unnest_tokens() function. The default unit of unnest_tokens() is the word, so we only need to specify two arguments: the name of the new column that will contain the words (“gram”) and the name of the column that contains the text we want to tokenize (“text”).
sotu_words <- sotu %>% unnest_tokens(gram, text)
head(sotu_words)
As you can see, the unnest_tokens() function also removed all punctuation and converted all letters to lower case. We now have one row for each word in each address.
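To see the lower-casing and punctuation-stripping on a small scale, here is a minimal sketch using a made-up one-row tibble (the column names mirror the ones used above):
tibble(text = "The Congress SHALL assemble, annually!") %>% unnest_tokens(gram, text)
#Produces one row per word: "the", "congress", "shall", "assemble", "annually"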
We can visualize the number of words in each year with the ggplot() function. This is a very powerful and flexible function that we can use for all kinds of visualizations, so it is worth taking some time to understand how it works. You may also want to consult the ggplot2 cheat sheet. The ggplot2 package is part of the tidyverse, so you don’t need to install it separately.
At its most basic, ggplot() takes two arguments: the data frame you want to visualize, and a mapping of aesthetics (aes()). Here we are saying that we want to graph the sotu_words data frame with the year column on the x-axis.
ggplot(sotu_words, aes(x = year))
When we run this, we get a blank graph. Now we need to add a layer telling ggplot() what kind of geometry we want to use to visualize our data. In this case, we will use geom_bar(), which draws a vertical bar. The height of each bar is the number of rows with the given x-axis value (year), so what we will see here is the number of words in each year.
ggplot(sotu_words, aes(x = year)) + geom_bar()
We can use color to indicate which president gave each speech. Since color will map onto an element of the data (the pres column), it goes in the aes() function.
ggplot(sotu_words, aes(x = year, fill = pres)) + geom_bar()
The legend is quite large and really unnecessary, since there was only one president in each year, so we can use a theme() layer to remove it.
ggplot(sotu_words, aes(x = year, fill = pres)) + geom_bar() + theme(legend.position = "none")
We can use the layers scale_x_continuous() and scale_y_continuous() to control the appearance of the x- and y-axes, and the labs() layer to label the axes and add a title to the graph.
ggplot(sotu_words, aes(x = year, fill = pres)) +
geom_bar() + theme(legend.position = "none") +
scale_x_continuous(breaks = seq(1790, 2030, 20)) +
scale_y_continuous(labels = scales::comma) +
labs(x = "Year", y = "Words", title = "State of the Union Addresses, 1790-2019: Number of Words")
Now let’s get a sense of what the words are. We can count the number of times each word appears with the count() function and view the top words with the sort = TRUE argument.
sotu_words %>% count(gram, sort = TRUE)
It may not be surprising to find that the most common words are short words that don’t carry much meaning. In text mining, these are known as stopwords. Depending on what you are doing, you may or may not want to remove them. For example, these words are often very useful for author attribution. In this case, however, we will remove them. The tidytext package provides a list of stopwords from three lexicons (SMART, snowball, and onix). The code below prints each lexicon.
data(stop_words)
head(stop_words)
for (lex in unique(stop_words$lexicon)) {
  print(lex)
  print(stop_words$word[stop_words$lexicon == lex])
}
[1] "SMART"
[1] "a" "a's" "able" "about"
[5] "above" "according" "accordingly" "across"
[9] "actually" "after" "afterwards" "again"
[13] "against" "ain't" "all" "allow"
[17] "allows" "almost" "alone" "along"
[21] "already" "also" "although" "always"
[25] "am" "among" "amongst" "an"
[29] "and" "another" "any" "anybody"
[33] "anyhow" "anyone" "anything" "anyway"
[37] "anyways" "anywhere" "apart" "appear"
[41] "appreciate" "appropriate" "are" "aren't"
[45] "around" "as" "aside" "ask"
[49] "asking" "associated" "at" "available"
[53] "away" "awfully" "b" "be"
[57] "became" "because" "become" "becomes"
[61] "becoming" "been" "before" "beforehand"
[65] "behind" "being" "believe" "below"
[69] "beside" "besides" "best" "better"
[73] "between" "beyond" "both" "brief"
[77] "but" "by" "c" "c'mon"
[81] "c's" "came" "can" "can't"
[85] "cannot" "cant" "cause" "causes"
[89] "certain" "certainly" "changes" "clearly"
[93] "co" "com" "come" "comes"
[97] "concerning" "consequently" "consider" "considering"
[101] "contain" "containing" "contains" "corresponding"
[105] "could" "couldn't" "course" "currently"
[109] "d" "definitely" "described" "despite"
[113] "did" "didn't" "different" "do"
[117] "does" "doesn't" "doing" "don't"
[121] "done" "down" "downwards" "during"
[125] "e" "each" "edu" "eg"
[129] "eight" "either" "else" "elsewhere"
[133] "enough" "entirely" "especially" "et"
[137] "etc" "even" "ever" "every"
[141] "everybody" "everyone" "everything" "everywhere"
[145] "ex" "exactly" "example" "except"
[149] "f" "far" "few" "fifth"
[153] "first" "five" "followed" "following"
[157] "follows" "for" "former" "formerly"
[161] "forth" "four" "from" "further"
[165] "furthermore" "g" "get" "gets"
[169] "getting" "given" "gives" "go"
[173] "goes" "going" "gone" "got"
[177] "gotten" "greetings" "h" "had"
[181] "hadn't" "happens" "hardly" "has"
[185] "hasn't" "have" "haven't" "having"
[189] "he" "he's" "hello" "help"
[193] "hence" "her" "here" "here's"
[197] "hereafter" "hereby" "herein" "hereupon"
[201] "hers" "herself" "hi" "him"
[205] "himself" "his" "hither" "hopefully"
[209] "how" "howbeit" "however" "i"
[213] "i'd" "i'll" "i'm" "i've"
[217] "ie" "if" "ignored" "immediate"
[221] "in" "inasmuch" "inc" "indeed"
[225] "indicate" "indicated" "indicates" "inner"
[229] "insofar" "instead" "into" "inward"
[233] "is" "isn't" "it" "it'd"
[237] "it'll" "it's" "its" "itself"
[241] "j" "just" "k" "keep"
[245] "keeps" "kept" "know" "knows"
[249] "known" "l" "last" "lately"
[253] "later" "latter" "latterly" "least"
[257] "less" "lest" "let" "let's"
[261] "like" "liked" "likely" "little"
[265] "look" "looking" "looks" "ltd"
[269] "m" "mainly" "many" "may"
[273] "maybe" "me" "mean" "meanwhile"
[277] "merely" "might" "more" "moreover"
[281] "most" "mostly" "much" "must"
[285] "my" "myself" "n" "name"
[289] "namely" "nd" "near" "nearly"
[293] "necessary" "need" "needs" "neither"
[297] "never" "nevertheless" "new" "next"
[301] "nine" "no" "nobody" "non"
[305] "none" "noone" "nor" "normally"
[309] "not" "nothing" "novel" "now"
[313] "nowhere" "o" "obviously" "of"
[317] "off" "often" "oh" "ok"
[321] "okay" "old" "on" "once"
[325] "one" "ones" "only" "onto"
[329] "or" "other" "others" "otherwise"
[333] "ought" "our" "ours" "ourselves"
[337] "out" "outside" "over" "overall"
[341] "own" "p" "particular" "particularly"
[345] "per" "perhaps" "placed" "please"
[349] "plus" "possible" "presumably" "probably"
[353] "provides" "q" "que" "quite"
[357] "qv" "r" "rather" "rd"
[361] "re" "really" "reasonably" "regarding"
[365] "regardless" "regards" "relatively" "respectively"
[369] "right" "s" "said" "same"
[373] "saw" "say" "saying" "says"
[377] "second" "secondly" "see" "seeing"
[381] "seem" "seemed" "seeming" "seems"
[385] "seen" "self" "selves" "sensible"
[389] "sent" "serious" "seriously" "seven"
[393] "several" "shall" "she" "should"
[397] "shouldn't" "since" "six" "so"
[401] "some" "somebody" "somehow" "someone"
[405] "something" "sometime" "sometimes" "somewhat"
[409] "somewhere" "soon" "sorry" "specified"
[413] "specify" "specifying" "still" "sub"
[417] "such" "sup" "sure" "t"
[421] "t's" "take" "taken" "tell"
[425] "tends" "th" "than" "thank"
[429] "thanks" "thanx" "that" "that's"
[433] "thats" "the" "their" "theirs"
[437] "them" "themselves" "then" "thence"
[441] "there" "there's" "thereafter" "thereby"
[445] "therefore" "therein" "theres" "thereupon"
[449] "these" "they" "they'd" "they'll"
[453] "they're" "they've" "think" "third"
[457] "this" "thorough" "thoroughly" "those"
[461] "though" "three" "through" "throughout"
[465] "thru" "thus" "to" "together"
[469] "too" "took" "toward" "towards"
[473] "tried" "tries" "truly" "try"
[477] "trying" "twice" "two" "u"
[481] "un" "under" "unfortunately" "unless"
[485] "unlikely" "until" "unto" "up"
[489] "upon" "us" "use" "used"
[493] "useful" "uses" "using" "usually"
[497] "uucp" "v" "value" "various"
[501] "very" "via" "viz" "vs"
[505] "w" "want" "wants" "was"
[509] "wasn't" "way" "we" "we'd"
[513] "we'll" "we're" "we've" "welcome"
[517] "well" "went" "were" "weren't"
[521] "what" "what's" "whatever" "when"
[525] "whence" "whenever" "where" "where's"
[529] "whereafter" "whereas" "whereby" "wherein"
[533] "whereupon" "wherever" "whether" "which"
[537] "while" "whither" "who" "who's"
[541] "whoever" "whole" "whom" "whose"
[545] "why" "will" "willing" "wish"
[549] "with" "within" "without" "won't"
[553] "wonder" "would" "would" "wouldn't"
[557] "x" "y" "yes" "yet"
[561] "you" "you'd" "you'll" "you're"
[565] "you've" "your" "yours" "yourself"
[569] "yourselves" "z" "zero"
[1] "snowball"
[1] "i" "me" "my" "myself" "we"
[6] "our" "ours" "ourselves" "you" "your"
[11] "yours" "yourself" "yourselves" "he" "him"
[16] "his" "himself" "she" "her" "hers"
[21] "herself" "it" "its" "itself" "they"
[26] "them" "their" "theirs" "themselves" "what"
[31] "which" "who" "whom" "this" "that"
[36] "these" "those" "am" "is" "are"
[41] "was" "were" "be" "been" "being"
[46] "have" "has" "had" "having" "do"
[51] "does" "did" "doing" "would" "should"
[56] "could" "ought" "i'm" "you're" "he's"
[61] "she's" "it's" "we're" "they're" "i've"
[66] "you've" "we've" "they've" "i'd" "you'd"
[71] "he'd" "she'd" "we'd" "they'd" "i'll"
[76] "you'll" "he'll" "she'll" "we'll" "they'll"
[81] "isn't" "aren't" "wasn't" "weren't" "hasn't"
[86] "haven't" "hadn't" "doesn't" "don't" "didn't"
[91] "won't" "wouldn't" "shan't" "shouldn't" "can't"
[96] "cannot" "couldn't" "mustn't" "let's" "that's"
[101] "who's" "what's" "here's" "there's" "when's"
[106] "where's" "why's" "how's" "a" "an"
[111] "the" "and" "but" "if" "or"
[116] "because" "as" "until" "while" "of"
[121] "at" "by" "for" "with" "about"
[126] "against" "between" "into" "through" "during"
[131] "before" "after" "above" "below" "to"
[136] "from" "up" "down" "in" "out"
[141] "on" "off" "over" "under" "again"
[146] "further" "then" "once" "here" "there"
[151] "when" "where" "why" "how" "all"
[156] "any" "both" "each" "few" "more"
[161] "most" "other" "some" "such" "no"
[166] "nor" "not" "only" "own" "same"
[171] "so" "than" "too" "very"
[1] "onix"
[1] "a" "about" "above" "across" "after"
[6] "again" "against" "all" "almost" "alone"
[11] "along" "already" "also" "although" "always"
[16] "among" "an" "and" "another" "any"
[21] "anybody" "anyone" "anything" "anywhere" "are"
[26] "area" "areas" "around" "as" "ask"
[31] "asked" "asking" "asks" "at" "away"
[36] "back" "backed" "backing" "backs" "be"
[41] "became" "because" "become" "becomes" "been"
[46] "before" "began" "behind" "being" "beings"
[51] "best" "better" "between" "big" "both"
[56] "but" "by" "came" "can" "cannot"
[61] "case" "cases" "certain" "certainly" "clear"
[66] "clearly" "come" "could" "did" "differ"
[71] "different" "differently" "do" "does" "done"
[76] "down" "down" "downed" "downing" "downs"
[81] "during" "each" "early" "either" "end"
[86] "ended" "ending" "ends" "enough" "even"
[91] "evenly" "ever" "every" "everybody" "everyone"
[96] "everything" "everywhere" "face" "faces" "fact"
[101] "facts" "far" "felt" "few" "find"
[106] "finds" "first" "for" "four" "from"
[111] "full" "fully" "further" "furthered" "furthering"
[116] "furthers" "gave" "general" "generally" "get"
[121] "gets" "give" "given" "gives" "go"
[126] "going" "good" "goods" "got" "great"
[131] "greater" "greatest" "group" "grouped" "grouping"
[136] "groups" "had" "has" "have" "having"
[141] "he" "her" "here" "herself" "high"
[146] "high" "high" "higher" "highest" "him"
[151] "himself" "his" "how" "however" "i"
[156] "if" "important" "in" "interest" "interested"
[161] "interesting" "interests" "into" "is" "it"
[166] "its" "itself" "just" "keep" "keeps"
[171] "kind" "knew" "know" "known" "knows"
[176] "large" "largely" "last" "later" "latest"
[181] "least" "less" "let" "lets" "like"
[186] "likely" "long" "longer" "longest" "made"
[191] "make" "making" "man" "many" "may"
[196] "me" "member" "members" "men" "might"
[201] "more" "most" "mostly" "mr" "mrs"
[206] "much" "must" "my" "myself" "necessary"
[211] "need" "needed" "needing" "needs" "never"
[216] "new" "new" "newer" "newest" "next"
[221] "no" "nobody" "non" "noone" "not"
[226] "nothing" "now" "nowhere" "number" "numbers"
[231] "of" "off" "often" "old" "older"
[236] "oldest" "on" "once" "one" "only"
[241] "open" "opened" "opening" "opens" "or"
[246] "order" "ordered" "ordering" "orders" "other"
[251] "others" "our" "out" "over" "part"
[256] "parted" "parting" "parts" "per" "perhaps"
[261] "place" "places" "point" "pointed" "pointing"
[266] "points" "possible" "present" "presented" "presenting"
[271] "presents" "problem" "problems" "put" "puts"
[276] "quite" "rather" "really" "right" "right"
[281] "room" "rooms" "said" "same" "saw"
[286] "say" "says" "second" "seconds" "see"
[291] "seem" "seemed" "seeming" "seems" "sees"
[296] "several" "shall" "she" "should" "show"
[301] "showed" "showing" "shows" "side" "sides"
[306] "since" "small" "smaller" "smallest" "some"
[311] "somebody" "someone" "something" "somewhere" "state"
[316] "states" "still" "still" "such" "sure"
[321] "take" "taken" "than" "that" "the"
[326] "their" "them" "then" "there" "therefore"
[331] "these" "they" "thing" "things" "think"
[336] "thinks" "this" "those" "though" "thought"
[341] "thoughts" "three" "through" "thus" "to"
[346] "today" "together" "too" "took" "toward"
[351] "turn" "turned" "turning" "turns" "two"
[356] "under" "until" "up" "upon" "us"
[361] "use" "used" "uses" "very" "want"
[366] "wanted" "wanting" "wants" "was" "way"
[371] "ways" "we" "well" "wells" "went"
[376] "were" "what" "when" "where" "whether"
[381] "which" "while" "who" "whole" "whose"
[386] "why" "will" "with" "within" "without"
[391] "work" "worked" "working" "works" "would"
[396] "year" "years" "yet" "you" "young"
[401] "younger" "youngest" "your" "yours"
You can use the filter() function to remove stopwords. I also want to take out tokens with digits in them. Since tokenizing and removing stopwords is something I will want to do again in the future, I’m going to write a function for it.
#Note: this function uses the sotu data frame created by make_sotu(), which must already exist
sotu_tokenize_words <- function() {
sotu_words <- sotu %>% mutate(text = str_replace_all(text, "<p>", "")) %>% unnest_tokens(gram, text) %>%
filter(!gram %in% stop_words$word & !str_detect(gram, "[:digit:]"))
return(sotu_words)
}
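As an aside, an equivalent way to drop stopwords is dplyr’s anti_join(), which keeps only the rows whose gram has no match in the word column of stop_words. This is offered as an alternative sketch; the function above sticks with filter():
sotu %>% mutate(text = str_replace_all(text, "<p>", "")) %>%
  unnest_tokens(gram, text) %>%
  anti_join(stop_words, by = c("gram" = "word")) %>%
  filter(!str_detect(gram, "[:digit:]"))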
rm(list = ls())
source("functions.r")
sotu <- make_sotu()
── Column specification ────────────────────────────────────────────────────────
cols(
year = col_double(),
pres = col_character(),
use_last = col_logical()
)
sotu_words <- sotu_tokenize_words()
Now we can see the most frequent non-stop words across all of the State of the Union addresses.
sotu_words %>% count(gram, sort = TRUE)
There are many different ways that we can analyze word frequency over time. One way is to graph the frequency of the overall top 25 (or any other number of) words in each year.
We can get the overall top 25 words with the top_n() function:
tops <- (sotu_words %>% count(gram) %>% top_n(25, n))$gram
tops
[1] "act" "america" "american" "citizens" "congress"
[6] "country" "department" "federal" "foreign" "government"
[11] "law" "nation" "national" "nations" "peace"
[16] "people" "power" "public" "service" "subject"
[21] "system" "time" "united" "war" "world"
Now we select only the rows of sotu_words that refer to these words.
sotu_words %>% filter(gram %in% tops)
We can use the group_by() function together with the count() function to see how many times each of these words appears in each State of the Union address.
sotu_words %>% filter(gram %in% tops) %>% group_by(year) %>% count(gram)
We can easily compare from year to year by graphing the results.
sotu_words %>% filter(gram %in% tops) %>% group_by(year) %>% count(gram) %>%
ggplot(aes(x = year, y = gram, size = n)) + geom_point() + theme(legend.position = "none")
If we want to color the dots by president, we need to add pres to our group_by() function and to the aes() function.
sotu_words %>% filter(gram %in% tops) %>% group_by(year, pres) %>% count(gram) %>%
ggplot(aes(x = year, y = gram, size = n, color = pres)) +
geom_point() + theme(legend.position = "none")
Now we can generalize everything we just did into a function.
all_top_grams <- function(dataset, number) {
tops <- (dataset %>% count(gram) %>% top_n(number))$gram
dataset %>% filter(gram %in% tops) %>% group_by(year, pres) %>% count(gram) %>%
ggplot(aes(x = year, y = gram, size = n, color = pres)) +
geom_point() + theme(legend.position = "none")
}
all_top_grams(sotu_words, 10)
Selecting by n
The “Selecting by n” message just tells us that top_n(), called without an explicit weighting column, defaulted to the count column n. When we graph raw counts, we find that the longer addresses use all of the words more often (see Carter, 1980). We can standardize the counts by dividing each one by the total number of (non-stop) words in each address. Below I’m going to revise the all_top_grams() function to standardize the word counts.
all_top_grams <- function(dataset, number) {
tops <- (dataset %>% count(gram) %>% top_n(number))$gram
dataset %>% group_by(year, pres) %>% count(gram) %>%
mutate(pct = n/sum(n)) %>% filter(gram %in% tops) %>%
ggplot(aes(x = year, y = gram, size = pct, color = pres)) +
geom_point() + theme(legend.position = "none")
}
all_top_grams(sotu_words, 10)
Selecting by n
all_top_grams(sotu_words, 25)
Selecting by n
Another option would be to consider the top 5 (or any other number of) words used by each president.
top_grams_pres <- function(dataset, number) {
tops <- (dataset %>% group_by(pres) %>% count(gram) %>% top_n(number))$gram
dataset %>% group_by(year, pres) %>% count(gram) %>%
mutate(pct = n/sum(n)) %>% filter(gram %in% tops) %>%
ggplot(aes(x = year, y = gram, size = pct, color = pres)) +
geom_point() + theme(legend.position = "none")
}
top_grams_pres(sotu_words, 5)
Selecting by n
Another way to visualize changes in word usage over time is to make separate graphs of the top words for each president.
sotu_words %>% group_by(pres) %>% count(gram) %>% mutate(pct = n/sum(n)) %>% top_n(5, pct) %>%
ggplot(aes(x = gram, y = pct, fill = pres)) + geom_col() +
coord_flip() + facet_wrap(vars(pres), scale = "free_y") + theme(legend.position = "none")
This graph is too big to see in our notebook, but we can export it to a .png file.
png("words_by_pres.png", height = 20, width = 30, units = "in", res = 200)
sotu_words %>% group_by(pres) %>% count(gram) %>% mutate(pct = n/sum(n)) %>% top_n(5) %>%
ggplot(aes(x = gram, y = pct, fill = pres)) + geom_col() +
coord_flip() + facet_wrap(vars(pres), scale = "free_y") + theme(legend.position = "none")
Selecting by pct
dev.off()
null device
1
This is pretty, but the alphabetical ordering of presidents is not conducive to analyzing change over time. What we need to do is convert the pres variable to a factor that preserves the chronological order. A factor is a special data type that stores strings as numbers (integer codes with a set of levels). We can’t manipulate factors the way we would manipulate regular character variables, and we also can’t do math with them. By default, factor() puts the levels in alphabetical order; the forcats function fct_inorder() reorders the levels to match the order in which the values first appear, which here is chronological.
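A small made-up vector makes the difference concrete (purely illustrative):
x <- c("Washington", "Adams", "Jefferson")
levels(factor(x))      #"Adams" "Jefferson" "Washington"
levels(fct_inorder(x)) #"Washington" "Adams" "Jefferson"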
table(factor(sotu_words$pres))
Abraham Lincoln Andrew Jackson Andrew Johnson
10066 30626 14491
Barack Obama Benjamin Harrison Calvin Coolidge
19777 20290 20092
Chester A. Arthur Donald J. Trump Dwight D. Eisenhower
7999 6313 23634
Franklin D. Roosevelt Franklin Pierce George H.W. Bush
17934 16010 6428
George W. Bush George Washington Gerald R. Ford
18101 6180 5626
Grover Cleveland Harry S. Truman Herbert Hoover
42412 26853 10480
James Buchanan James Madison James Monroe
21020 8277 14925
James Polk Jimmy Carter John Adams
27014 20584 2739
John F. Kennedy John Quincy Adams John Tyler
7147 11333 12442
Lyndon B. Johnson Martin van Buren Millard Fillmore
10725 16934 11752
Richard Nixon Ronald Reagan Rutherford B. Hayes
6874 12819 13413
Theodore Roosevelt Thomas Jefferson Ulysses S. Grant
58242 7360 24428
Warren Harding William H. Taft William J. Clinton
4529 27370 21834
William McKinley Woodrow Wilson Zachary Taylor
26986 11902 2909
table(fct_inorder(factor(sotu_words$pres)))
George Washington John Adams Thomas Jefferson
6180 2739 7360
James Madison James Monroe John Quincy Adams
8277 14925 11333
Andrew Jackson Martin van Buren John Tyler
30626 16934 12442
James Polk Zachary Taylor Millard Fillmore
27014 2909 11752
Franklin Pierce James Buchanan Abraham Lincoln
16010 21020 10066
Andrew Johnson Ulysses S. Grant Rutherford B. Hayes
14491 24428 13413
Chester A. Arthur Grover Cleveland Benjamin Harrison
7999 42412 20290
William McKinley Theodore Roosevelt William H. Taft
26986 58242 27370
Woodrow Wilson Warren Harding Calvin Coolidge
11902 4529 20092
Herbert Hoover Franklin D. Roosevelt Harry S. Truman
10480 17934 26853
Dwight D. Eisenhower John F. Kennedy Lyndon B. Johnson
23634 7147 10725
Richard Nixon Gerald R. Ford Jimmy Carter
6874 5626 20584
Ronald Reagan George H.W. Bush William J. Clinton
12819 6428 21834
George W. Bush Barack Obama Donald J. Trump
18101 19777 6313
We can see from the second table that fct_inorder() lists the presidents in chronological order rather than alphabetical order. Now let’s make a general function to create the words_by_pres.png chart; it stores the reordered factor in a new president column.
grams_by_pres <- function(dataset, filename, number) {
graph <- dataset %>% mutate(president = fct_inorder(factor(pres))) %>%
group_by(president) %>% count(gram) %>% mutate(pct = n/sum(n)) %>% top_n(number) %>%
ggplot(aes(x = gram, y = pct, fill = president)) + geom_col() +
coord_flip() + facet_wrap(vars(president), scale = "free_y") + theme(legend.position = "none")
png(filename, height = 20, width = 30, units = "in", res = 200)
print(graph)
dev.off()
}
grams_by_pres(sotu_words, "words_by_pres2.png", 10)
Selecting by pct
null device
1