Notes for “Text Mining with R: A Tidy Approach”

Document-term matrices are often stored as sparse matrix objects. These objects can be treated as though they were ordinary matrices (for example, accessing particular rows and columns), but are stored in a more memory-efficient format, since most entries are zero.
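As a minimal sketch of the idea (using the Matrix package, which is not part of these notes), a sparse matrix keeps only its non-zero entries and their positions, yet still supports ordinary matrix operations:

```r
library(Matrix)

# a 1000 x 1000 matrix with only two non-zero entries; only those
# entries and their positions are stored in memory
m <- sparseMatrix(i = c(1, 3), j = c(2, 4), x = c(5, 7),
                  dims = c(1000, 1000))

dim(m)          # behaves like an ordinary 1000 x 1000 matrix
m[3, 4]         # normal indexing works
object.size(m)  # a fraction of the ~8 MB a dense version would need
```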

tidytext provides ways of converting between this matrix format and the tidy one-term-per-row format:

The DocumentTermMatrix class is built into the tm package. Notice that this DTM is 99% sparse (99% of document-word pairs are zero).

```r
library(tm)
library(topicmodels)

data("AssociatedPress", package = "topicmodels")
AssociatedPress
#> <<DocumentTermMatrix (documents: 2246, terms: 10473)>>
#> Non-/sparse entries: 302031/23220327
#> Sparsity           : 99%
#> Maximal term length: 18
#> Weighting          : term frequency (tf)
```

Terms() is an accessor function that extracts the vector of distinct terms:

```r
Terms(AssociatedPress) %>% head()
#> [1] "aaron"      "abandon"    "abandoned"  "abandoning" "abbott"
#> [6] "abboud"
```

tidy() converts it into a tidy data frame with one row per document-term pair:

```r
library(tidytext)

# convert to a tidy data frame with counts
ap_tidy <- tidy(AssociatedPress)
ap_tidy
#> # A tibble: 302,031 x 3
#>    document term      count
#>       <int> <chr>     <dbl>
#>  1        1 adding        1
#>  2        1 adult         2
#>  3        1 ago           1
#>  4        1 alcohol       1
#>  5        1 allegedly     1
#>  6        1 allen         1
#> # ... with 3.02e+05 more rows
```
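The conversion also runs in the other direction: tidytext's cast_ family (cast_dtm(), cast_dfm(), cast_sparse()) turns a tidy data frame back into a matrix. A sketch, assuming ap_tidy as above:

```r
library(tidytext)

# round-trip: tidy data frame back to a DocumentTermMatrix,
# naming the document, term, and value columns
ap_tidy %>%
  cast_dtm(document, term, count)
```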

quanteda uses the dfm (document-feature matrix) as a common data structure for text data. For example, the quanteda package comes with a corpus of presidential inauguration speeches, which can be converted to a dfm using the appropriate function.

```r
data("data_corpus_inaugural", package = "quanteda")
quanteda::dfm(data_corpus_inaugural)
#> Document-feature matrix of: 58 documents, 9,360 features (91.8% sparse) and 4 docvars.
#>                  features
#> docs              fellow-citizens  of the senate and house representatives :
#>   1789-Washington               1  71 116      1  48     2               2 1
#>   1793-Washington               0  11  13      0   2     0               0 1
#>   1797-Adams                    3 140 163      1 130     0               2 0
#>   1801-Jefferson                2 104 130      0  81     0               0 1
#>   1805-Jefferson                0 101 143      0  93     0               0 0
#>   1809-Madison                  1  69 104      0  43     0               0 0
#>                  features
#> docs              among vicissitudes
#>   1789-Washington     1            1
#>   1793-Washington     0            0
#>   1797-Adams          4            0
#>   1801-Jefferson      1            0
#>   1805-Jefferson      7            0
#>   1809-Madison        0            0
#> [ reached max_ndoc ... 52 more documents, reached max_nfeat ... 9,350 more features ]
```

We, of course, want to tidy it:

```r
inaugural <- quanteda::dfm(data_corpus_inaugural) %>%
  tidy()
inaugural
#> # A tibble: 44,710 x 3
#>    document        term            count
#>    <chr>           <chr>           <dbl>
#>  1 1789-Washington fellow-citizens     1
#>  2 1797-Adams      fellow-citizens     3
#>  3 1801-Jefferson  fellow-citizens     2
#>  4 1809-Madison    fellow-citizens     1
#>  5 1813-Madison    fellow-citizens     1
#>  6 1817-Monroe     fellow-citizens     5
#> # ... with 4.47e+04 more rows
```

Suppose we would like to see how the usage of some user-specified words changes over time. We extract() the year from each document name, complete() the data frame so that every year-term combination appears (filling missing counts with zero), and then compute the total words per speech:

```r
library(dplyr)
library(tidyr)

year_term_counts <- inaugural %>%
  extract(document, into = "year", regex = "(\\d+)", convert = TRUE) %>%
  complete(year, term, fill = list(count = 0)) %>%
  add_count(year, wt = count, name = "year_total")
year_term_counts
#> # A tibble: 542,880 x 4
#>     year term  count year_total
#>    <int> <chr> <dbl>      <dbl>
#>  1  1789 "'"       0       1537
#>  2  1789 "-"       1       1537
#>  3  1789 "!"       0       1537
#>  4  1789 "\""      2       1537
#>  5  1789 "$"       0       1537
#>  6  1789 "("       1       1537
#> # ... with 5.429e+05 more rows
```
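As a side note, complete() fills in every missing combination of the listed columns; a tiny standalone illustration (toy data, not from the notes):

```r
library(tidyr)

# two of the four possible year-term combinations are present
df <- tibble::tibble(
  year  = c(1789, 1793),
  term  = c("union", "god"),
  count = c(2, 1)
)

df %>%
  complete(year, term, fill = list(count = 0))
# all four year-term combinations now appear; the two that were
# missing get count = 0
```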
```r
library(ggplot2)

year_term_counts %>%
  filter(term %in% c("god", "america", "foreign",
                     "union", "constitution", "freedom")) %>%
  ggplot(aes(year, count / year_total)) +
  geom_point() +
  geom_smooth() +
  facet_wrap(~ term, scales = "free_y") +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(title = "% frequency of word in inaugural address")
```