Keyword Analysis

2.1.1 Purpose

In this section, we use text mining methods to derive information from the text of the keywords. We explore keyword frequency, coverage, and the relationships between keywords, in order to identify the keywords that are most important for our further analysis and model building.

2.1.2 Preparation: load R packages and import data

library(knitr)
library(readxl)
library(tidyverse)
library(tidytext)
library(igraph)
library(ggraph)
library(textstem)
library(RSQLite)
library(plotly)

data <- read_excel("~/Downloads//paper_keyword.xlsx")

2.1.3 Data transformation

2.1.3.1 Keyword cleaning

We transform the data into a tibble (tibbles are a modern take on data frames) and add the row number as a column named ‘document’. We then clean the ‘keyword’ column by removing trailing page/size information, parenthesised text, and stray special characters (such as “Â” and “®”), trimming leading whitespace, and converting everything to lower case.

data2 <- as_tibble(data) %>%
  mutate(keyword = tolower(str_trim(sub("((Pages|Size|\\().*)|(Â|®)", "", keyword), "left")), document = row_number()) %>%
  select(document, title, keyword)
kable(data2[1:10,])
| document | title | keyword |
|---:|:---|:---|
| 1 | MARKUP: The Power of Choice and Change | code from github eric gebhart, sas institute inc. |
| 2 | A Tutorial on Reduced Error Logistic Regression | updated jsm 2008 paper |
| 3 | The SAS Supervisor | sas communities page |
| 4 | Introduction to the Macro Language | macro |
| 5 | Unlimiting a Limited Macro Environment | macro |
| 6 | The Right Approach to Learning PROC TABULATE | tabulate |
| 7 | SAS Macro Environments: Local and Global | macro |
| 8 | Introduction to the Macro Language | macro |
| 9 | The Right Approach to Learning PROC TABULATE | tabulate |
| 10 | Conquering the Dreaded Macro Error | macro |
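
To make the cleaning step more concrete, here is a small sketch on made-up raw keyword strings (hypothetical examples, not rows from the data): everything from “Pages”, “Size”, or an opening parenthesis onwards is dropped, stray “Â”/“®” characters are removed, leading whitespace is trimmed, and the result is lower-cased.

# hypothetical raw keyword strings, for illustration only
raw <- c("macro Pages 210-215", "Enterprise Guide (tutorial)", "ODS Graphics®")
tolower(str_trim(sub("((Pages|Size|\\().*)|(Â|®)", "", raw), "left"))

Any trailing whitespace left behind is harmless here, because the later tokenization step splits on whitespace anyway.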

2.1.3.2 Tokenization

We split the text into individual words (a process called tokenization) and transform it into a tidy data structure (i.e. each word has its own row). To do this, we use tidytext’s unnest_tokens() function. We remove duplicates so that if a word appears in a paper multiple times, it is counted only once. We also remove keywords that consist of numbers only (e.g. years).
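
As a minimal sketch of what unnest_tokens() produces, applied to a small hypothetical tibble rather than our data:

# hypothetical input, for illustration only
toy <- tibble(document = 1:2, keyword = c("proc sql", "macro"))
toy %>%
  unnest_tokens(word, keyword)
# returns one row per word: (1, "proc"), (1, "sql"), (2, "macro")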

2.1.3.3 Stop words

Often in text analysis, we want to remove stop words: words that are not useful for an analysis, typically extremely common words such as “the”, “of”, and “to”. We remove the stop words in our data using tidytext’s get_stopwords() function together with an anti_join() from the dplyr package.
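
As a minimal sketch of the stop-word removal, on a small hypothetical set of tokens:

# hypothetical tokens, for illustration only
tokens <- tibble(word = c("the", "macro", "of", "sql"))
tokens %>%
  anti_join(get_stopwords(), by = "word")
# keeps only "macro" and "sql"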

2.1.3.4 Lemmatization

In our data, there are words such as “macro” and “macros” that mean the same thing but appear in different inflected forms. In order to analyze them as a single item, we need to reduce each word to its lemma. Below is an example of lemmatizing forms of “be”, using textstem’s lemmatize_words().

bw <- c('are', 'am', 'being', 'been', 'be')
lemmatize_words(bw)

## [1] "be" "be" "be" "be" "be"

# tokenization
kw <- data2 %>%
  unnest_tokens(oldword, keyword) %>%
  # lemmatize words, leaving short words and a few domain terms unchanged
  mutate(word = case_when(nchar(oldword) < 6 | oldword %in% c('ods','data','mining','learning') ~ oldword,
                          TRUE ~ lemmatize_words(oldword))) %>%
  distinct() %>% # remove duplicates
  anti_join(get_stopwords())  %>% # stop words
  filter(is.na(as.numeric(word))) # remove numbers

kw %>%
  filter(document==1|document==2) %>%
  kable()
| document | title | oldword | word |
|---:|:---|:---|:---|
| 1 | MARKUP: The Power of Choice and Change | code | code |
| 1 | MARKUP: The Power of Choice and Change | github | github |
| 1 | MARKUP: The Power of Choice and Change | eric | eric |
| 1 | MARKUP: The Power of Choice and Change | gebhart | gebhart |
| 1 | MARKUP: The Power of Choice and Change | sas | sas |
| 1 | MARKUP: The Power of Choice and Change | institute | institute |
| 1 | MARKUP: The Power of Choice and Change | inc | inc |
| 2 | A Tutorial on Reduced Error Logistic Regression | updated | update |
| 2 | A Tutorial on Reduced Error Logistic Regression | jsm | jsm |
| 2 | A Tutorial on Reduced Error Logistic Regression | paper | paper |

2.1.4 Exploratory analysis

2.1.4.1 Term frequency

One measure of how important a word may be is its term frequency, i.e. how often it appears across the keywords. Below we plot the words with a frequency greater than 400:

kw %>%
  count(word, sort = TRUE) %>%
  filter(n > 400) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  scale_y_continuous(expand = c(0, 0)) +
  coord_flip() +
  theme_classic(base_size = 12) +
  labs(title="Word frequency", subtitle="n > 400")+
  theme(plot.title = element_text(lineheight=.8, face="bold"))
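
The same counts can also be inspected directly as a table, for example the ten most frequent words:

kw %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 10)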

2.1.4.2 Keyword Coverage Analysis

We want to analyze how well the keywords cover the articles, to guide further analysis. First, sort the keywords in descending order of frequency and count them. There are 1821 cleaned keywords in total.

kw_clean <- read_excel("~/Downloads//kw_clean.xlsx")
keyword <- count(kw_clean, word, sort = TRUE)
keyword$keyword_count <- seq_len(nrow(keyword))
nrow(keyword)

## [1] 1821

Second, calculate the cumulative coverage: adding keywords one at a time in descending order of frequency, count how many distinct articles the set covers so far.

conn <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(conn,"Title",kw_clean)
final <- keyword
for (i in 1:nrow(keyword)){
  each <- keyword[i,]
  # add the i-th most frequent keyword to the running set in table "aa"
  dbWriteTable(conn, "aa", each, append = TRUE)
  # count the distinct papers covered by the top-i keywords so far
  final[i,'paper_count'] <- dbGetQuery(conn, "SELECT count (distinct title) from Title where word in (select word from aa)")
}
dbDisconnect(conn)
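
For reference, the same cumulative coverage can be computed without a database, purely in dplyr/tidyr. The following is a sketch of the equivalent logic, using the keyword and kw_clean objects defined above:

# rank of each keyword by frequency (1 = most frequent)
ranked <- keyword %>% select(word, keyword_count)

# for every paper, the rank of the most frequent keyword it contains
first_hit <- kw_clean %>%
  distinct(title, word) %>%
  inner_join(ranked, by = "word") %>%
  group_by(title) %>%
  summarise(first_rank = min(keyword_count), .groups = "drop")

# cumulative number of distinct papers covered by the top-k keywords
coverage <- first_hit %>%
  count(first_rank, name = "new_papers") %>%
  complete(first_rank = seq_len(nrow(ranked)), fill = list(new_papers = 0)) %>%
  arrange(first_rank) %>%
  mutate(paper_count = cumsum(new_papers))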

Third, generate the keyword coverage plot.

a <- ggplot(final) +
  geom_point(aes(x=keyword_count, y=paper_count)) + 
  geom_label(
    label="(200,10685)", x=200, y=10800,
    label.padding = unit(0.55, "lines"),
    label.size = 0.30, vjust = 0,
  ) + 
  labs(
    x = "keyword count",
    y = "paper count",
    title = "Keyword Coverage") +
  theme_classic(base_size = 12)

ggplotly(a)

We can see that as the number of keywords increases, more articles are covered. However, once the number of keywords passes a certain point, the gain in coverage becomes limited.

The results show that the 50 most frequent keywords cover 87% (9826/11222) of the articles, the top 100 cover 92% (10365/11222), and the top 200 cover more than 95% (10685/11222) of the articles.
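
These figures can be read directly from the final table built above, for example:

final %>%
  filter(keyword_count %in% c(50, 100, 200)) %>%
  select(keyword_count, paper_count)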

2.1.4.3 Tokenization by n-gram

We’ve been using the unnest_tokens function to tokenize by word, but we can also use the function to tokenize into consecutive sequences of words, called n-grams. By seeing how often word X is followed by word Y, we can then build a model of the relationships between them.
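
As a small sketch of bigram tokenization, on a single hypothetical keyword string:

# hypothetical input, for illustration only
tibble(keyword = "proc report output") %>%
  unnest_tokens(bigram, keyword, token = "ngrams", n = 2)
# returns two bigrams: "proc report" and "report output"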

# tokenizing by n-gram
bigram <- data2 %>%
  unnest_tokens(bigram, keyword, token = "ngrams", n = 2) %>%
  distinct()

Now we use tidyr’s separate(), which splits a column into multiple columns based on a delimiter. This lets us split each bigram into two columns, “word1” and “word2”, so that we can lemmatize both words, remove cases where either word is a stop word or a number, and drop bigrams whose two words are identical.

# separate words
bigram_separated <- bigram %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  # lemmatize words, leaving short words and a few domain terms unchanged
  mutate(word1 = case_when(nchar(word1) < 6 | word1 %in% c('ods','data','mining','learning') ~ word1,
                           TRUE ~ lemmatize_words(word1)),
         word2 = case_when(nchar(word2) < 6 | word2 %in% c('ods','data','mining','learning') ~ word2,
                           TRUE ~ lemmatize_words(word2)))

# filter stop words and NA
stopword <- stopwords::stopwords("en")
bigram_filtered <- bigram_separated %>%
  filter(!word1 %in% stopword & is.na(as.numeric(word1))) %>%
  filter(!word2 %in% stopword & is.na(as.numeric(word2))) %>% 
  filter(word1 != word2)

# new bigram counts
bigram_count <- bigram_filtered %>% 
  count(word1, word2, sort = TRUE)
kable(bigram_count[1:10,])
| word1 | word2 | n |
|:---|:---|---:|
| enterprise | guide | 380 |
| proc | report | 375 |
| sas | macro | 164 |
| sas | graph | 157 |
| data | step | 144 |
| data | management | 116 |
| proc | sql | 113 |
| data | warehouse | 112 |
| clinical | trial | 110 |

2.1.4.4 Network analysis

We may be interested in visualizing all of the relationships among words simultaneously. As one common visualization, we can arrange the words into a network graph. A graph can be constructed from a tidy object since it has three variables:

  • from: the node an edge is coming from
  • to: the node an edge is going towards
  • weight: a numeric value associated with each edge

We use the graph_from_data_frame() function from the package igraph, which takes a data frame of edges with columns for “from”, “to”, and edge attributes (in this case n). Then we use the ggraph package to convert the igraph object into a ggraph with the ggraph() function.

# filter for only relatively common combinations
bigram_graph <- bigram_count %>%
  filter(n > 35) %>%
  graph_from_data_frame()

# network graph
set.seed(999)

a <- grid::arrow(type = "closed", length = unit(.15, "inches"))

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()