In this section, we use text mining methods to derive information from the text of the keywords. We explore the frequency and coverage of keywords and the relationships between them, thereby identifying keywords that are important for our further analysis and model building.
library(knitr)
library(readxl)
library(tidyverse)
library(tidytext)
library(igraph)
library(ggraph)
library(textstem)
library(RSQLite)
library(plotly)
data <- read_excel("~/Downloads//paper_keyword.xlsx")
We transform the data into a tibble (tibbles are a modern take on data frames) and add the row number as a column named ‘document’. We then clean the data by removing certain special characters and irrelevant information from the ‘keyword’ column.
data2 <- as_tibble(data) %>%
mutate(keyword = tolower(str_trim(sub("((Pages|Size|\\().*)|(Â|®)", "", keyword), "left")), document = row_number()) %>%
select(document, title, keyword)
kable(data2[1:10,])
document | title | keyword |
---|---|---|
1 | MARKUP: The Power of Choice and Change | code from github eric gebhart, sas institute inc. |
2 | A Tutorial on Reduced Error Logistic Regression | updated jsm 2008 paper |
3 | The SAS Supervisor | sas communities page |
4 | Introduction to the Macro Language | macro |
5 | Unlimiting a Limited Macro Environment | macro |
6 | The Right Approach to Learning PROC TABULATE | tabulate |
7 | SAS Macro Environments: Local and Global | macro |
8 | Introduction to the Macro Language | macro |
9 | The Right Approach to Learning PROC TABULATE | tabulate |
10 | Conquering the Dreaded Macro Error | macro |
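To see what the cleaning regular expression above removes, here is a toy illustration; the strings are made up and not taken from the data:
# toy strings: the regex drops everything from the first "Pages", "Size" or "(" onward,
# and strips a stray "Â" or "®"
sub("((Pages|Size|\\().*)|(Â|®)", "", c("macro (beginner) Pages 1-5", "SAS® ODS tips"))
# returns "macro " and "SAS ODS tips"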
We need to split the text into individual words (a process called tokenization) and transform it into a tidy data structure (i.e. one word per row). To do this, we use tidytext’s unnest_tokens() function. We remove duplicates so that a word appearing in a paper multiple times is counted only once, and we also remove keywords that consist of numbers only (e.g. years).
Often in text analysis we want to remove stop words: words that are not useful for an analysis, typically extremely common words such as “the”, “of”, and “to”. We remove stop words from our data using tidytext’s get_stopwords() function together with an anti_join() from the dplyr package.
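As a quick toy example (the tokens below are made up, not from the data), anti_join() keeps only the rows whose word is not in the stop-word list:
# "the" and "of" are stop words and get dropped; "macro" and "tabulate" remain
tibble(word = c("the", "macro", "of", "tabulate")) %>%
anti_join(get_stopwords(), by = "word")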
In our data, there are words such as “macro” and “macros” that mean the same thing but are in different inflected forms. In order to analyze them as a single item, we need to reduce words to their lemma form. Below is an example of lemmatizing forms of “be”, using textstem’s lemmatize_words().
bw <- c('are', 'am', 'being', 'been', 'be')
lemmatize_words(bw)
## [1] "be" "be" "be" "be" "be"
# tokenization
kw <- data2 %>%
unnest_tokens(oldword, keyword) %>%
# lemmatizing words
mutate(word = case_when(nchar(oldword) < 6 | oldword %in% c('ods','data','mining','learning') ~ oldword,
TRUE ~ lemmatize_words(oldword))) %>%
distinct() %>% # remove duplicates
anti_join(get_stopwords()) %>% # stop words
filter(is.na(as.numeric(word))) # remove numbers
kw %>%
filter(document==1|document==2) %>%
kable()
document | title | oldword | word |
---|---|---|---|
1 | MARKUP: The Power of Choice and Change | code | code |
1 | MARKUP: The Power of Choice and Change | github | github |
1 | MARKUP: The Power of Choice and Change | eric | eric |
1 | MARKUP: The Power of Choice and Change | gebhart | gebhart |
1 | MARKUP: The Power of Choice and Change | sas | sas |
1 | MARKUP: The Power of Choice and Change | institute | institute |
1 | MARKUP: The Power of Choice and Change | inc | inc |
2 | A Tutorial on Reduced Error Logistic Regression | updated | update |
2 | A Tutorial on Reduced Error Logistic Regression | jsm | jsm |
2 | A Tutorial on Reduced Error Logistic Regression | paper | paper |
One measure of how important a word may be is its term frequency. We plot the words that appear more than 400 times:
kw %>%
count(word, sort = TRUE) %>%
filter(n > 400) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col() +
xlab(NULL) +
scale_y_continuous(expand = c(0, 0)) +
coord_flip() +
theme_classic(base_size = 12) +
labs(title="Word frequency", subtitle="n > 400")+
theme(plot.title = element_text(lineheight=.8, face="bold"))
We now analyze how well the keywords cover the articles. First, we count the keywords (using a separately cleaned keyword file, kw_clean.xlsx), sort them in descending order of frequency, and add a rank column keyword_count. There are 1,821 cleaned keywords in total.
kw_clean <- read_excel("~/Downloads//kw_clean.xlsx")
keyword <- count(kw_clean, word, sort = TRUE)
keyword$keyword_count <- seq_len(nrow(keyword))
nrow(keyword)
## [1] 1821
Secondly, for the top i keywords (taken cumulatively, in order of frequency), we calculate how many distinct articles they cover.
conn <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(conn,"Title",kw_clean)
final <- keyword
for (i in 1:nrow(keyword)){
each <- keyword[i,]
dbWriteTable(conn, "aa", each, append = TRUE)
final[i,'paper_count'] <- dbGetQuery(conn, "SELECT count (distinct title) from Title where word in (select word from aa)")
}
dbDisconnect(conn)
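As a cross-check, the same cumulative coverage can be computed without SQLite. Below is a minimal tidyverse sketch (an alternative, not the code used here), assuming kw_clean has the columns word and title:
# for each rank i, count the distinct titles covered by the top i keywords
coverage_check <- keyword %>%
  mutate(paper_count = purrr::map_int(keyword_count, function(i) {
    kw_clean %>%
      filter(word %in% keyword$word[seq_len(i)]) %>%
      distinct(title) %>%
      nrow()
  }))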
Thirdly, generate the keyword coverage plot.
a <- ggplot(final) +
geom_point(aes(x=keyword_count, y=paper_count)) +
annotate(
"label", label = "(200,10685)", x = 200, y = 10800,
label.padding = unit(0.55, "lines"),
label.size = 0.30, vjust = 0
) +
labs(
x = "keyword count",
y = "paper count",
title = "Keyword Coverage") +
theme_classic(base_size = 12)
ggplotly(a)
We can see that as the number of keywords increases, more articles are covered. However, beyond a certain point, adding more keywords yields only a limited gain in coverage.
The results show that the 50 most frequent keywords cover 87% (9826/11222) of the articles, the top 100 cover 92% (10365/11222), and the top 200 cover more than 95% (10685/11222).
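These figures can be read off final directly; a small sketch, assuming 11222 is the total number of distinct article titles in the data:
# coverage at ranks 50, 100 and 200 as a share of all 11222 titles
final %>%
filter(keyword_count %in% c(50, 100, 200)) %>%
mutate(coverage = round(paper_count / 11222, 2))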
We’ve been using the unnest_tokens function to tokenize by word, but we can also use the function to tokenize into consecutive sequences of words, called n-grams. By seeing how often word X is followed by word Y, we can then build a model of the relationships between them.
# tokenizing by n-gram
bigram <- data2 %>%
unnest_tokens(bigram, keyword, token = "ngrams", n = 2) %>%
distinct()
Now we use tidyr’s separate(), which splits a column into multiple columns based on a delimiter. This lets us split each bigram into two columns, “word1” and “word2”, at which point we can remove cases where either word is a stop word.
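Before applying it to the real bigrams, here is a toy illustration (a made-up string, not from the data) of bigram tokenization followed by separate():
# "sas macro language" yields the bigrams "sas macro" and "macro language",
# which separate() splits into word1 and word2
tibble(txt = "sas macro language") %>%
unnest_tokens(bigram, txt, token = "ngrams", n = 2) %>%
separate(bigram, c("word1", "word2"), sep = " ")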
# separate words
bigram_separated <- bigram %>%
separate(bigram, c("word1", "word2"), sep = " ") %>%
# lemmatizing words
mutate(word1 = case_when(nchar(word1) < 6 | word1 %in% c('ods','data','mining','learning') ~ word1,
TRUE ~ lemmatize_words(word1)),
word2 = case_when(nchar(word2) < 6 | word2 %in% c('ods','data','mining','learning') ~ word2,
TRUE ~ lemmatize_words(word2)))
# filter stop words and NA
stopword <- stopwords::stopwords("en")
bigram_filtered <- bigram_separated %>%
filter(!word1 %in% stopword & is.na(as.numeric(word1))) %>%
filter(!word2 %in% stopword & is.na(as.numeric(word2))) %>%
filter(word1 != word2)
# new bigram counts
bigram_count <- bigram_filtered %>%
count(word1, word2, sort = TRUE)
kable(bigram_count[1:10,])
word1 | word2 | n |
---|---|---|
enterprise | guide | 380 |
proc | report | 375 |
sas | macro | 164 |
sas | graph | 157 |
data | step | 144 |
data | management | 116 |
proc | sql | 113 |
data | warehouse | 112 |
clinical | trial | 110 |
We may be interested in visualizing all of the relationships among words simultaneously. One common visualization is to arrange the words into a network graph. A graph can be constructed from a tidy object since it has three variables: from (the first word), to (the second word), and a weight (here n, the number of times the bigram appears).
We use the graph_from_data_frame() function from the package igraph, which takes a data frame of edges with columns for “from”, “to”, and edge attributes (in this case n). Then we use the ggraph package to convert the igraph object into a ggraph with the ggraph() function.
# filter for only relatively common combinations
bigram_graph <- bigram_count %>%
filter(n > 35) %>%
graph_from_data_frame()
# network graph
set.seed(999)
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
ggraph(bigram_graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()
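As a quick sanity check on what is being plotted (a sketch; the exact counts depend on the data), we can inspect the size of the graph object directly:
# number of distinct words (nodes) and bigram links with n > 35 (edges)
igraph::vcount(bigram_graph)
igraph::ecount(bigram_graph)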