In this section, we use text mining methods to derive information from the text of the keywords. We explore the frequency and coverage of keywords and the relationships between them, thereby identifying keywords that are important for our further analysis and model building.
library(knitr)
library(readxl)
library(tidyverse)
library(tidytext)
library(igraph)
library(ggraph)
library(textstem)
library(RSQLite)
library(plotly)
data <- read_excel("~/Downloads//paper_keyword.xlsx")
We transform the data into a tibble (tibbles are a modern take on data frames) and add the row number as a column named ‘document’. We then clean the data by removing certain special characters and irrelevant information from the ‘keyword’ column.
data2 <- as_tibble(data) %>%
mutate(keyword = tolower(str_trim(sub("((Pages|Size|\\().*)|(Â|®)", "", keyword), "left")), document = row_number()) %>%
select(document, title, keyword)
kable(data2[1:10,])
document | title | keyword |
---|---|---|
1 | MARKUP: The Power of Choice and Change | code from github eric gebhart, sas institute inc. |
2 | A Tutorial on Reduced Error Logistic Regression | updated jsm 2008 paper |
3 | The SAS Supervisor | sas communities page |
4 | Introduction to the Macro Language | macro |
5 | Unlimiting a Limited Macro Environment | macro |
6 | The Right Approach to Learning PROC TABULATE | tabulate |
7 | SAS Macro Environments: Local and Global | macro |
8 | Introduction to the Macro Language | macro |
9 | The Right Approach to Learning PROC TABULATE | tabulate |
10 | Conquering the Dreaded Macro Error | macro |
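To see what the cleaning regular expression above removes, here is a toy illustration; the strings are made up and not taken from the data:
# toy strings: the regex drops everything from the first "Pages", "Size" or "(" onward,
# and strips a stray "Â" or "®"
sub("((Pages|Size|\\().*)|(Â|®)", "", c("macro (beginner) Pages 1-5", "SAS® ODS tips"))
# returns "macro " and "SAS ODS tips"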
We need to split the text into individual words (a process called tokenization) and transform it into a tidy data structure (i.e. one word per row). To do this, we use tidytext’s unnest_tokens() function. We remove duplicates so that a word appearing in a paper multiple times is counted only once, and we also remove keywords that consist of numbers only (e.g. years).
Often in text analysis we want to remove stop words: words that are not useful for an analysis, typically extremely common words such as “the”, “of”, and “to”. We remove stop words from our data using tidytext’s get_stopwords() function together with an anti_join() from the dplyr package.
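As a quick toy example (the tokens below are made up, not from the data), anti_join() keeps only the rows whose word is not in the stop-word list:
# "the" and "of" are stop words and get dropped; "macro" and "tabulate" remain
tibble(word = c("the", "macro", "of", "tabulate")) %>%
anti_join(get_stopwords(), by = "word")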
In our data, there are words such as “macro” and “macros” that mean the same thing but are in different inflected forms. In order to analyze them as a single item, we need to reduce words to their lemma form. Below is an example of lemmatizing forms of “be”, using textstem’s lemmatize_words().
bw <- c('are', 'am', 'being', 'been', 'be')
lemmatize_words(bw)
## [1] "be" "be" "be" "be" "be"
# tokenization
kw <- data2 %>%
unnest_tokens(oldword, keyword) %>%
# lemmatizing words
mutate(word = case_when(nchar(oldword) < 6 | oldword %in% c('ods','data','mining','learning') ~ oldword,
TRUE ~ lemmatize_words(oldword))) %>%
distinct() %>% # remove duplicates
anti_join(get_stopwords()) %>% # stop words
filter(is.na(as.numeric(word))) # remove numbers
kw %>%
filter(document==1|document==2) %>%
kable()
document | title | oldword | word |
---|---|---|---|
1 | MARKUP: The Power of Choice and Change | code | code |
1 | MARKUP: The Power of Choice and Change | github | github |
1 | MARKUP: The Power of Choice and Change | eric | eric |
1 | MARKUP: The Power of Choice and Change | gebhart | gebhart |
1 | MARKUP: The Power of Choice and Change | sas | sas |
1 | MARKUP: The Power of Choice and Change | institute | institute |
1 | MARKUP: The Power of Choice and Change | inc | inc |
2 | A Tutorial on Reduced Error Logistic Regression | updated | update |
2 | A Tutorial on Reduced Error Logistic Regression | jsm | jsm |
2 | A Tutorial on Reduced Error Logistic Regression | paper | paper |
One measure of how important a word may be is its term frequency. We plot the words that appear more than 400 times:
kw %>%
count(word, sort = TRUE) %>%
filter(n > 400) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col() +
xlab(NULL) +
scale_y_continuous(expand = c(0, 0)) +
coord_flip() +
theme_classic(base_size = 12) +
labs(title="Word frequency", subtitle="n > 400")+
theme(plot.title = element_text(lineheight=.8, face="bold"))
We now analyze how well the keywords cover the articles. First, we count the keywords (using a separately cleaned keyword file, kw_clean.xlsx), sort them in descending order of frequency, and add a rank column keyword_count. There are 1,821 cleaned keywords in total.
kw_clean <- read_excel("~/Downloads//kw_clean.xlsx")
keyword <- count(kw_clean, word, sort = TRUE)
keyword$keyword_count <- seq_len(nrow(keyword))
nrow(keyword)
## [1] 1821
Secondly, for the top i keywords (taken cumulatively, in order of frequency), we calculate how many distinct articles they cover.
conn <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(conn,"Title",kw_clean)
final <- keyword
for (i in 1:nrow(keyword)){
each <- keyword[i,]
dbWriteTable(conn, "aa", each, append = TRUE)
final[i,'paper_count'] <- dbGetQuery(conn, "SELECT count (distinct title) from Title where word in (select word from aa)")
}
dbDisconnect(conn)
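As a cross-check, the same cumulative coverage can be computed without SQLite. Below is a minimal tidyverse sketch (an alternative, not the code used here), assuming kw_clean has the columns word and title:
# for each rank i, count the distinct titles covered by the top i keywords
coverage_check <- keyword %>%
  mutate(paper_count = purrr::map_int(keyword_count, function(i) {
    kw_clean %>%
      filter(word %in% keyword$word[seq_len(i)]) %>%
      distinct(title) %>%
      nrow()
  }))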
Thirdly, generate the keyword coverage plot.
a <- ggplot(final) +
geom_point(aes(x=keyword_count, y=paper_count)) +
annotate(
"label", label = "(200,10685)", x = 200, y = 10800,
label.padding = unit(0.55, "lines"),
label.size = 0.30, vjust = 0
) +
labs(
x = "keyword count",
y = "paper count",
title = "Keyword Coverage") +
theme_classic(base_size = 12)
ggplotly(a)
We can see that as the number of keywords increases, more articles are covered. However, beyond a certain point, adding more keywords yields only a limited gain in coverage.
The results show that the 50 most frequent keywords cover 87% (9826/11222) of the articles, the top 100 cover 92% (10365/11222), and the top 200 cover more than 95% (10685/11222).
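These figures can be read off final directly; a small sketch, assuming 11222 is the total number of distinct article titles in the data:
# coverage at ranks 50, 100 and 200 as a share of all 11222 titles
final %>%
filter(keyword_count %in% c(50, 100, 200)) %>%
mutate(coverage = round(paper_count / 11222, 2))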
We’ve been using the unnest_tokens function to tokenize by word, but we can also use the function to tokenize into consecutive sequences of words, called n-grams. By seeing how often word X is followed by word Y, we can then build a model of the relationships between them.
# tokenizing by n-gram
bigram <- data2 %>%
unnest_tokens(bigram, keyword, token = "ngrams", n = 2) %>%
distinct()
Now we use tidyr’s separate(), which splits a column into multiple columns based on a delimiter. This lets us split each bigram into two columns, “word1” and “word2”, at which point we can remove cases where either word is a stop word.
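Before applying it to the real bigrams, here is a toy illustration (a made-up string, not from the data) of bigram tokenization followed by separate():
# "sas macro language" yields the bigrams "sas macro" and "macro language",
# which separate() splits into word1 and word2
tibble(txt = "sas macro language") %>%
unnest_tokens(bigram, txt, token = "ngrams", n = 2) %>%
separate(bigram, c("word1", "word2"), sep = " ")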
# separate words
bigram_separated <- bigram %>%
separate(bigram, c("word1", "word2"), sep = " ") %>%
# lemmatizing words
mutate(word1 = case_when(nchar(word1) < 6 | word1 %in% c('ods','data','mining','learning') ~ word1,
TRUE ~ lemmatize_words(word1)),
word2 = case_when(nchar(word2) < 6 | word2 %in% c('ods','data','mining','learning') ~ word2,
TRUE ~ lemmatize_words(word2)))
# filter stop words and NA
stopword <- stopwords::stopwords("en")
bigram_filtered <- bigram_separated %>%
filter(!word1 %in% stopword & is.na(as.numeric(word1))) %>%
filter(!word2 %in% stopword & is.na(as.numeric(word2))) %>%
filter(word1 != word2)
# new bigram counts
bigram_count <- bigram_filtered %>%
count(word1, word2, sort = TRUE)
kable(bigram_count[1:10,])
word1 | word2 | n |
---|---|---|
enterprise | guide | 380 |
proc | report | 375 |
sas | macro | 164 |
sas | graph | 157 |
data | step | 144 |
data | management | 116 |
proc | sql | 113 |
data | warehouse | 112 |
clinical | trial | 110 |
We may be interested in visualizing all of the relationships among words simultaneously. One common visualization is to arrange the words into a network graph. A graph can be constructed from a tidy object since it has three variables: from (the first word), to (the second word), and a weight (here n, the number of times the bigram appears).
We use the graph_from_data_frame() function from the package igraph, which takes a data frame of edges with columns for “from”, “to”, and edge attributes (in this case n). Then we use the ggraph package to convert the igraph object into a ggraph with the ggraph() function.
# filter for only relatively common combinations
bigram_graph <- bigram_count %>%
filter(n > 35) %>%
graph_from_data_frame()
# network graph
set.seed(999)
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
ggraph(bigram_graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()
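As a quick sanity check on what is being plotted (a sketch; the exact counts depend on the data), we can inspect the size of the graph object directly:
# number of distinct words (nodes) and bigram links with n > 35 (edges)
igraph::vcount(bigram_graph)
igraph::ecount(bigram_graph)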