I use several visualization and text mining techniques to explore the NeurIPS paper dataset.
The dataset consists of 7 variables: one is the primary key, another is an integer, and the remaining 5 are strings. It contains 7241 papers from 1987 to 2017. After validating the reliability of the data, 33 papers are dropped. I extract abstracts from the original paper text with regular expressions and concentrate on analyzing the abstracts. After some data manipulation, I carry out some initial exploration such as term frequency, term frequency–inverse document frequency, word co-occurrence, and word correlation. Then I build my topic model with Structural Topic Models and finally select a model with 40 topics. Topic prevalence and topic trends over time are discussed. I find that research topics in NeurIPS become more diverse as time goes on. Researchers’ interests have shifted a lot during the past 31 years. It takes about 10 to 15 years for a topic to thrive or diminish.
Most of the interactive tables and plots in this report are built with htmlwidgets packages, so readers can explore the data by themselves.
Validate the reliability of the dataset
Clean and transform the data into a tidy structure
Explore the papers with basic text mining techniques
Detect the latent topic structure, model topic trends in the NeurIPS papers, and understand how research fields evolve
NeurIPS, the acronym of the Conference on Neural Information Processing Systems (formerly abbreviated as NIPS), is a machine learning and computational neuroscience conference held annually. The conference, attended by machine learning researchers and statisticians, aims to foster the exchange of research on neural information processing systems in their theoretical and applied aspects.
knitr::include_graphics("nips intro.png")
This dataset is collected from Kaggle and includes the title, authors, and extracted text of all NeurIPS papers, from the first conference in 1987 to the one in 2017. The author of the dataset used Python to scrape the papers and released the result as an SQLite database. Due to the limitation of my time and energy, I only extract the papers table from the database, and then verify, clean, and explore it.
From the summary of the papers table, I know there are 7241 papers recorded. Each paper is assigned a unique id, which can serve as the primary key later, along with 6 other variables named year, title, event_type, pdf_name, abstract, and paper_text. Except for id and year, which are encoded as integer, the other 5 variables are encoded as character.
library(RSQLite)
library(tidyverse)
setwd("~/uwaterloo-DESKTOP-DS9EK3H-3/2019 Winter/STAT 847/finalproj")
db <- dbConnect(dbDriver("SQLite"), "database.sqlite")
papers<-dbGetQuery(db,"SELECT * FROM papers")
dbDisconnect(db)
summary(papers)
## id year title event_type
## Min. : 1 Min. :1987 Length:7241 Length:7241
## 1st Qu.:1849 1st Qu.:2000 Class :character Class :character
## Median :3659 Median :2009 Mode :character Mode :character
## Mean :3656 Mean :2006
## 3rd Qu.:5473 3rd Qu.:2014
## Max. :7284 Max. :2017
## pdf_name abstract paper_text
## Length:7241 Length:7241 Length:7241
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
str(papers)
## 'data.frame': 7241 obs. of 7 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ year : int 1987 1987 1987 1987 1987 1987 1987 1987 1987 1987 ...
## $ title : chr "Self-Organization of Associative Database and Its Applications" "The Capacity of the Kanerva Associative Memory is Exponential" "Supervised Learning of Probability Distributions by Neural Networks" "Constrained Differential Optimization" ...
## $ event_type: chr "" "" "" "" ...
## $ pdf_name : chr "1-self-organization-of-associative-database-and-its-applications.pdf" "2-the-capacity-of-the-kanerva-associative-memory-is-exponential.pdf" "3-supervised-learning-of-probability-distributions-by-neural-networks.pdf" "4-constrained-differential-optimization.pdf" ...
## $ abstract : chr "Abstract Missing" "Abstract Missing" "Abstract Missing" "Abstract Missing" ...
## $ paper_text: chr "767\n\nSELF-ORGANIZATION OF ASSOCIATIVE DATABASE\nAND ITS APPLICATIONS\nHisashi Suzuki and Suguru Arimoto\nOsak"| __truncated__ "184\n\nTHE CAPACITY OF THE KANERVA ASSOCIATIVE MEMORY IS EXPONENTIAL\nP. A. Choul\nStanford University. Stanfor"| __truncated__ "52\n\nSupervised Learning of Probability Distributions\nby Neural Networks\nEric B. Baum\nJet Propulsion Labora"| __truncated__ "612\n\nConstrained Differential Optimization\nJohn C. Platt\nAlan H. Barr\nCalifornia Institute of Technology, "| __truncated__ ...
Around 45.81 % of the papers do not have an abstract available. I will deal with them later.
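This share can be computed directly from the abstract column, using the same "Abstract Missing" placeholder that the dataset uses:
# Fraction of papers whose abstract is recorded as "Abstract Missing"
# (3317 / 7241, i.e. about 45.81%).
mean(papers$abstract == "Abstract Missing")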
The bar chart below illustrates the number of accepted papers between \(1987\) and \(2017\). The number of papers grows gradually before the 2000s; after 2000, there is a sharp increase. This leads to a total of \(679\) accepted papers in 2017, nearly 8 times more than in 1987. Overall, there is a clear upward trend in the number of accepted papers, which indicates that more researchers are dedicating themselves to machine learning and computational neuroscience.
library(ggplot2)
library(plotly)
papers %>%
ggplot(aes(year)) +
geom_bar(fill=colorspace::rainbow_hcl(31))+
scale_x_continuous(breaks = seq(1987,2017,1)) +
theme(axis.text.x = element_text(angle = 90)) +
ggtitle("Number of Papers over time") ->
g_paper_time
ggplotly(g_paper_time)
event_type is a categorical variable with \(4\) unique values: Oral, Spotlight, Poster, and a blank string standing for missing data. Missing values account for nearly 66.55 % of the records. They arise because some events took place so long ago that the information was lost or was never uploaded to the website (so the author of the dataset could not scrape it). The second largest group is Poster, which takes up 29.64 % of the whole dataset. Oral and Spotlight only make up a very small part of the data. Spotlight presentations are “brief summaries of the context and main results of your paper with perhaps a mention of the novelty of the approach taken”, and each spotlight is limited to \(4\) minutes. ref An Oral session lets the speaker deliver a more detailed presentation than a Spotlight, so the time is extended to \(20\) minutes or even one hour. A Poster event lets researchers display their paper on a poster and introduce their findings to whoever is interested. Because many posters can be shown simultaneously, Poster is a common event_type in the dataset.
papers %>%
ggplot(aes(event_type)) +
geom_bar(aes(fill=event_type)) +
ggtitle("Number of Papers over different event type") ->
g_paper_event
#papers %>%
# ggplot(aes(x=factor(1),fill=factor(event_type,labels = c("Missing","Oral","Spotlight","Poster"),levels = c("","Oral","Spotlight","Poster")))) +
# geom_bar(width=1) +
# coord_polar(theta="y") +
# labs(fill="Event_type",x=NULL) ->
# g_paper_event
ggplotly(g_paper_event)
As I am curious about the length of all the titles and accepted papers, I also take a look at the distributions of title_length and paper_length. Here the length is counted as the number of characters in the string.
The histogram of title_length seems to be bell-curved, with the mode at \(50\). There is an outlier in the plot: a paper published in 1992 has a 156-character title, Perceiving Complex Visual Scenes: An Oscillator Neural Network Model that Integrates Selective Attention, Perceptual Organisation, and Invariant Recognition, which is quite unusual (a quick check after the histogram below confirms this).
library(stringr)
library(DT)
paper_length<-papers %>%
mutate(text_length=str_length(paper_text),
abstract_length=str_length(abstract),
title_length=str_length(title))
paper_length %>%
ggplot(aes(title_length)) +
geom_histogram(binwidth = 10,fill="#F8766D") +
ggtitle("Histogram of Title Length") ->
g_title_length
ggplotly(g_title_length)
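To pin down the outlier mentioned above, here is a small check (not part of the main pipeline) that simply sorts by title length:
# Show the paper with the longest title; it should be the 1992 oscillator
# network paper with a roughly 156-character title.
paper_length %>%
arrange(desc(title_length)) %>%
select(year, title, title_length) %>%
head(1)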
The paper_length distribution looks bimodal. One peak is near \(33,700\) characters, and the other is around \(20,000\) characters.
There is also one outlier in paper_length. Learning Multiagent Communication with Backpropagation, published in 2016, has \(123727\) characters in the paper_text column. I checked the pdf version and there are only \(9\) pages, even including the References section. It turns out that there are many garbled symbols in the paper_text, which results in the unexpectedly large paper_length.
paper_length %>%
ggplot(aes(text_length)) +
geom_histogram(binwidth = 1000, fill="#00BA38") +
ggtitle("Histogram of paper length") ->
g_paper_length
ggplotly(g_paper_length)
paper_length %>%
filter(text_length==123727) %>%
mutate(sample_paper_text=paste0(substr(paper_text, 59500, 60000), "..."),
abstract=paste0(substr(abstract, 1, 200), "...")) %>%
select(id,year,title,abstract,sample_paper_text,text_length) %>%
datatable(options = list(pageLength=1))
To measure the reliability of the data, I first check whether the dataset collects all the NeurIPS papers year by year. I compare the number of papers in the dataset with the number of papers published in the NIPS Proceedings archive. For example, I select year 1990 and find that the number of papers matches the number listed online. The dataset correctly collects all the papers published online at NeurIPS.
papers %>%
group_by(year) %>%
summarise(n=n()) %>%
filter(year==1990)
## # A tibble: 1 x 2
## year n
## <int> <int>
## 1 1990 143
knitr::include_graphics("nips1990.png")
Then I verify the content of the papers. I find that one paper, titled Analog LSI Implementation of an Auto-Adaptive Network for Real-Time Separation of Independent Signals, has garbled symbols in the paper_text column. It turns out that the author of the dataset failed to extract the content of this paper properly. I searched for the title of this paper and found it on NeurIPS Proceedings.
papers %>%
filter(title=="Analog LSI Implementation of an Auto-Adaptive Network for Real-Time Separation of Independent Signals") %>%
mutate(sample_paper_text=paste0(substr(.$paper_text, 1, 500), "...")) %>%
select(year,title,sample_paper_text) %>%
datatable(options = list(
pageLength=4))
knitr::include_graphics("nips messy code.png")
Similarly, I find 10 more papers with garbled symbols in the paper_text column. All of these papers were published before 2000. The reason might be Optical Character Recognition errors. ref These garbled symbols do not provide any useful information for my purposes, so I drop these 11 papers from the dataset.
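For transparency, below is one possible way to flag such garbled texts programmatically; this is only a sketch of the idea (ranking papers by the share of unusual characters), not necessarily how these 11 papers were originally identified.
# Rank papers by the share of characters that are neither alphanumeric,
# punctuation, nor whitespace; heavily garbled texts float to the top.
papers %>%
mutate(garbled_ratio = str_count(paper_text, "[^[:alnum:][:punct:][:space:]]") /
str_length(paper_text)) %>%
arrange(desc(garbled_ratio)) %>%
select(id, year, title, garbled_ratio) %>%
head(15)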
papers %>%
filter(row_number() %in% c(544,630,853,
885,918,
1062,1067,
1122,1148,
1581,1828)) %>%
mutate(sample_paper_text=paste0(substr(.$paper_text, 1, 500), "...")) %>%
select(year,title,sample_paper_text) %>%
datatable(options = list(
pageLength=4))
papers_dropped<-papers[-c(544,630,853,
885,918,
1062,1067,
1122,1148,
1581,1828),]
From the Validation part above, I know that although almost all the paper texts are present in the dataset, there is a lot of garbled information in paper_text due to improper extraction from pdf files to txt files, as well as complicated math equations, figures, and tables. These confusing and mysterious symbols would not help me explore the dataset further.
Therefore, I decide to focus on the abstract of each paper. An abstract is a self-contained, short, but powerful summary of a conference proceedings paper, which is used to help the reader quickly ascertain the paper’s purpose. ref In other words, the abstract captures the essence of the whole paper without reproducing the details, so I am no longer bothered by strange symbols but can still grasp sufficient knowledge of the NeurIPS paper dataset.
Here several text mining techniques, including co-occurrences and correlations, tf-idf, and topic modeling, are implemented to explore the abstracts from the NeurIPS papers of the past 31 years.
The first obstacle I encounter is that there are 3317 papers without an abstract; most of the papers published before 2008 are recorded as Abstract Missing. After visiting NIPS Proceedings, I find that this huge number of papers without a separately displayed abstract is due to the current website interface.
I have to find a way to deal with this large amount of missing data. After carefully observing the paper_text column, I use regular expressions to extract the abstract from it.
sum(papers$abstract=="Abstract Missing")
## [1] 3317
missing.astract.labs<-c("Not Missing","Missing")
names(missing.astract.labs) <- c("FALSE","TRUE")
g_missing_papers<-papers_dropped %>%
mutate(missing.abstract=abstract=="Abstract Missing") %>%
group_by(missing.abstract,year) %>%
summarise(n=n()) %>%
ggplot(aes(x=year,y=n)) +
geom_bar(stat = 'identity') +
facet_grid(rows=vars(missing.abstract),
labeller = labeller(missing.abstract = missing.astract.labs)) +
scale_x_continuous(breaks = 1987:2017) +
theme(axis.text.x = element_text(angle = 90,hjust = 1,vjust = 0.5,colour = 'gray50')) +
ggtitle("Number of Papers of Missing or not Missing Abstract") %>%
labs(x=NULL)
ggplotly(g_missing_papers)
It is widely accepted that an abstract of a conference proceedings paper begins with the word Abstract, which is the section name, and ends right before the word Introduction, which is the name of the next section. Therefore, I use regular expressions to extract all the characters between these two words.
# For papers whose abstract is recorded as "Abstract Missing" (16 characters),
# pull the text between the Abstract and Introduction section headers,
# trying the different capitalizations that appear in the pdfs.
for (i in 1:nrow(papers_dropped)) {
if (str_length(papers_dropped$abstract[i])<=16) {
papers_dropped$abstract[i]<-papers_dropped$paper_text[i] %>%
str_replace_all("\\n"," ") %>%
str_extract("(?<=ABSTRACT\\s).+(?=INTRODUCTION)")
}
if (is.na(papers_dropped$abstract[i])==T) {
papers_dropped$abstract[i]<-papers_dropped$paper_text[i] %>%
str_replace_all("\\n"," ") %>%
str_extract("(?<=Abstract\\s).+(?=INTRODUCTION)")
}
if (is.na(papers_dropped$abstract[i])==T) {
papers_dropped$abstract[i]<-papers_dropped$paper_text[i] %>%
str_replace_all("\\n"," ") %>%
str_extract("(?<=Abstract\\s).+(?=Introduction)")
}
if (is.na(papers_dropped$abstract[i])==T) {
papers_dropped$abstract[i]<-papers_dropped$paper_text[i] %>%
str_replace_all("\\n"," ") %>%
str_extract("(?<=ABSTRACT\\s).+(?=Introduction)")
}
}
With this procedure, I reduce the number of missing abstracts to 545.
sum(is.na(papers_dropped$abstract))
## [1] 545
papers_dropped %>%
filter(is.na(abstract) == T) %>%
mutate(sample_paper_text=paste0(substr(.$paper_text, 1, 500), "...")) %>%
select(id,year,title,sample_paper_text) %>%
head(n=3) %>%
datatable(options = list(
pageLength=3))
Since 545 is still a large number of abstracts to fill in manually, I attempt to find other patterns to extract content. I find that some papers contain only an Abstract section without mentioning an Introduction section, while others contain only an Introduction section without mentioning an Abstract. I randomly read several papers and found that the average length of an abstract is about 500 characters. Therefore, I extract the first 500 characters after Abstract, or the last 500 characters before Introduction.
# For the remaining NA abstracts, fall back to taking the 500 characters
# after an Abstract header, or the last 500 characters before an
# Introduction header.
for (i in 1:nrow(papers_dropped)) {
if (is.na(papers_dropped$abstract[i])==T) {
papers_dropped$abstract[i]<-papers_dropped$paper_text[i] %>%
str_replace_all("\\n"," ") %>%
str_extract("(?<=ABSTRACT\\s).+") %>%
substr(1, 500)
}
if (is.na(papers_dropped$abstract[i])) {
papers_dropped$abstract[i]<-papers_dropped$paper_text[i] %>%
str_replace_all("\\n"," ") %>%
str_extract("(?<=Abstract).+") %>%
substr(1, 500)
}
if (is.na(papers_dropped$abstract[i])) {
papers_dropped$abstract[i]<-papers_dropped$paper_text[i] %>%
str_replace_all("\\n"," ") %>%
str_extract("(?<=ABSTRACT).+") %>%
substr(1, 500)
}
if (is.na(papers_dropped$abstract[i])) {
papers_dropped$abstract[i]<-papers_dropped$paper_text[i] %>%
str_replace_all("\\n"," ") %>%
str_extract(".+(?=INTRODUCTION)") %>%
str_sub(-500,-1)
}
if (is.na(papers_dropped$abstract[i])) {
papers_dropped$abstract[i]<-papers_dropped$paper_text[i] %>%
str_replace_all("\\n"," ") %>%
str_extract(".+(?=Introduction)") %>%
str_sub(-500,-1)
}
}
sum(is.na(papers_dropped$abstract))
## [1] 22
g_missing_papers2<-papers_dropped %>%
mutate(missing.abstract=is.na(abstract)) %>%
group_by(missing.abstract,year) %>%
summarise(n=n()) %>%
ggplot(aes(x=year,y=n)) +
geom_bar(stat = 'identity') +
facet_grid(rows=vars(missing.abstract),
labeller = labeller(missing.abstract = missing.astract.labs)) +
scale_x_continuous(breaks = 1987:2017) +
theme(axis.text.x = element_text(angle = 90,hjust = 1,vjust = 0.5,colour = 'gray50')) +
ggtitle("Number of Papers of Missing or not Missing Abstract") %>%
labs(x=NULL)
ggplotly(g_missing_papers2)
After this procedure, the number of missing abstracts is reduced to 22: 2 papers in 1987, 7 papers in 1988, 12 papers in 1993, and 1 paper in 2002 are still without an abstract.
Here I conduct some data manipulation to prepare the abstracts. I discard all the papers without extracted abstracts. Then all words are transformed into lowercase, and only the columns id, year, title, and abstract are kept. After that, all words are lemmatized so that inflected forms are grouped together under a single base form. ref Lemmatization is closely related to stemming; however, stemming usually just chops off the ends of words or removes affixes, whereas lemmatization uses a vocabulary and morphological analysis of words to normalize the text. ref
library(tidytext)
library(textstem)
papers_clean<-
papers_dropped %>%
filter(is.na(abstract) == F) %>%
# transform all the characters into lowercase
mutate(abstract=tolower(abstract)) %>%
# remove anything that is not a letter
# mutate(abstract=str_replace_all(abstract,"([^A-Za-z0-9 ])+", "")) %>%
mutate(abstract=str_replace_all(abstract,"([^A-Za-z ])+", "")) %>%
# trimming leading and tailing space
mutate(abstract=str_replace_all(abstract,"^\\s+|\\s+$","")) %>%
# trimming leading comma :
mutate(abstract=str_replace_all(abstract,"^\\:","")) %>%
# shrink to one white space
mutate(abstract=str_replace_all(abstract,"[\\s]+", " ")) %>%
select(id,year, title, abstract) %>%
# lemmatize abstract
mutate(abstract=lemmatize_strings(abstract))
papers_clean %>%
head(10) %>%
mutate(abstract=paste0(substr(abstract, 1, 500), "...")) %>%
datatable(options = list(pageLength=5))
Next, the text data is tokenized into a tidy data structure, with one row for each word in each paper. Stop words, such as the, a, and an, are removed as well because they would not help during the exploration process. The resulting tidy word list is displayed below.
tidy_abstract<-
papers_clean %>%
# tokenize the abstract
unnest_tokens(word,abstract) %>%
# remove stop words from abstract
anti_join(stop_words) %>%
# remove numbers
mutate(word=str_match(word,"[A-Za-z ]+")) %>%
# discard NA rows created by str_match()
na.omit()
tidy_abstract %>%
head(10) %>%
datatable(options = list(pageLength=5))
Here I calculate the frequency of each word appearing in the papers. The abstract text consists of 14,793 different words. The top 10 most frequently used words are reasonable; it is not surprising to find that the word model appears nearly 9,000 times in the dataset. Word frequency decays rapidly: the 50th most frequently used word, apply, appears just over 1,200 times.
A word cloud then visually presents the top 200 words. Words like model, algorithm, learn, and network stand out since they are used more frequently by researchers.
tidy_frequency<-tidy_abstract %>%
count(word,sort = T)
tidy_frequency%>%
head(n=50) %>%
datatable(options = list(pageLength=5))
library(wordcloud)
set.seed(847)
wordcloud(words = tidy_frequency$word, freq = tidy_frequency$n, min.freq = 900,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
Then I analyze the term frequency–inverse document frequency (tf-idf) score for each word. tf-idf identifies the words that are important to the content of each document by decreasing the weight of commonly used words and increasing the weight of words that are rarely used in other documents. ref
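For reference, the score computed below by bind_tf_idf() is the product of the within-abstract term frequency and the (natural-log) inverse document frequency:
\[
\text{tf-idf}(w, d) \;=\; \frac{n_{w,d}}{\sum_{w'} n_{w',d}} \;\times\; \ln\!\left(\frac{N}{\lvert \{\, d' : w \in d' \,\} \rvert}\right),
\]
where \(n_{w,d}\) is the number of times word \(w\) appears in abstract \(d\) and \(N\) is the total number of abstracts.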
The most frequently used words do not show up in the tf-idf table. The top 10 words with the highest tf-idf give some insight into the content of the papers. For example, I did not understand the meaning of kfd or knng at first. Are they acronyms of some machine learning techniques? After searching online, I find that kfd stands for kernel Fisher discriminant analysis, and knng represents k-nearest neighbour graph.
I also plot a word cloud of the words with the highest tf-idf. Compared to the word cloud in the previous section, the tf-idf word cloud is less well organized and is composed of many acronyms or abbreviations.
tidy_tf_idf<-tidy_abstract %>%
count(id,word,sort = T) %>%
bind_tf_idf(word,id,n) %>%
arrange(-tf_idf)
tidy_tf_idf%>%
head(n=50) %>%
datatable(options = list(pageLength=5))
set.seed(847)
wordcloud(words = tidy_tf_idf$word[1:100], freq = tidy_tf_idf$tf_idf*1000,
max.words=500, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
To reduce the computational burden, I select the words appearing more than 100 times in the data. Then I count how often each pair of words occurs together in the abstracts.
I also remove some words with extremely high frequency across all papers, such as method, propose, model, etc., which would not help me detect the latent topic structure. There are two reasons why I drop these words. One is that there would be too many redundant pairs: the top 5 pairs would consist of only 5 unique words if I did not discard these high-frequency words. The other reason is discussed later.
removed_words<-data.frame(word=c("method","propose","datum","model","learn","solution","algorithm","result","technique","provide","dataset","set","base","approach","performance","numb","paper"))
knitr::include_graphics("redundant pairs.png")
Top 5 frequent word pairs without dropping the high-frequency words
Here are my top 10 most frequent word pairs. It is not surprising that the pair neural network ranks first on the most frequently co-occurring table. Moreover, 8 out of the top 10 pairs involve either neural or network, which indicates that neural network is a main topic at NeurIPS.
library(widyr)
tidy_abstract_100<-tidy_abstract %>%
add_count(word) %>%
mutate(n=as.integer(n)) %>%
filter(n>100) %>%
select(-n) %>%
mutate(word=word[,1])
removed_words<-data.frame(word=c("method","propose","datum","model","learn","solution","algorithm","result","technique","provide","dataset","set","base","approach","performance","numb","paper"))
removed_words$word<-as.character(removed_words$word)
tidy_abstract_100<-tidy_abstract_100%>%
anti_join(removed_words)
tidy_word_pairs <- tidy_abstract_100 %>%
pairwise_count(word, title, sort = TRUE, upper = FALSE)
tidy_word_pairs %>%
head(10) %>%
datatable(options = list(pageLength=5))
Let’s plot the Word Network of these words. Only the word pairs that co-occur more than 250 times are shown here. The edge width indicates the frequency of the word pair.
I do not see any clear clustering structure in the word co-occurrence network. Most words are connected to others in one large community, except that bayesian and inference are isolated in a corner. network and neural are strongly connected with each other, while network is connected to many other words but neural is not. real and world are closely bound as well.
library(igraph)
library(ggraph)
set.seed(847)
tidy_graph<-tidy_word_pairs %>%
filter(n >= 250) %>%
graph_from_data_frame(directed = F)
degree_graph<-igraph::degree(tidy_graph, v = igraph::V(tidy_graph), mode = "all")
tidy_graph %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "darkred") +
geom_node_point(size = degree_graph/4) + #size = degree_graph/4
geom_node_text(aes(label = name), repel = TRUE,
point.padding = unit(0.2, "lines")) +
theme_void()
# library(networkD3)
# # Convert to object suitable for networkD3
# tidy_graph_d3<-igraph_to_networkD3(tidy_graph)
#
#
#
# # Create force directed network plot
# MyClickScript <-
# ' d3.select(this).select("circle").transition()
# .duration(750)
# .attr("r", 30)'
#
# bet<-igraph::betweenness(tidy_graph,v=V(tidy_graph),directed = F)
#
# forceNetwork(Links = tidy_graph_d3$links,
# Nodes = tidy_graph_d3$nodes,
# NodeID = 'name',Group = 1,
# linkWidth = networkD3::JS("function(d) { return d.value/5; }"),
# opacity = 0.9, zoom = F, bounded = T,
# clickAction = MyClickScript,
# opacityNoHover = 0.6,fontSize = 15)
Then I look into the correlation between words. Different from the word co-occurrence I discussed before, word correlation focuses on how likely two words are to appear in the same paper relative to how often they appear separately ref.
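Concretely, the pairwise_cor() call used below computes the phi coefficient between the presence/absence indicators of the two words across papers:
\[
\phi \;=\; \frac{n_{11}\,n_{00} - n_{10}\,n_{01}}{\sqrt{n_{1\cdot}\; n_{0\cdot}\; n_{\cdot 1}\; n_{\cdot 0}}},
\]
where \(n_{11}\) counts the papers containing both words, \(n_{10}\) and \(n_{01}\) the papers containing only one of them, \(n_{00}\) the papers containing neither, and the dotted subscripts denote the corresponding row and column totals.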
The following word pairs are sorted by their correlation. It is not surprising that monte and carlo always occur together, with a correlation coefficient equal to 1, followed by many other set phrases popular in machine learning, such as neural network, gradient descent, cross validation, and computer vision.
tidy_cor<-tidy_abstract_100 %>%
group_by(word) %>%
pairwise_cor(word,title,sort=T,upper=F)
tidy_cor %>%
head(10) %>%
datatable(options = list(pageLength=5))
Then I plot the Word Correlation Network. I manually filter to keep pairs with correlation higher than \(35\%\).
The clustering structure here is not obvious; many pairs are only adjacent to each other. There are also some three-word chains, such as principle component pca and real world synthetic. There are even three triangle cycles: monte carlo chain, chip vlsi analog, and spike neuron fire.
set.seed(847)
tidy_cor %>%
filter(correlation > 0.35) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = correlation, edge_width = correlation), edge_colour = "royalblue") +
geom_node_point(size = 5) +
geom_node_text(aes(label = name), repel = TRUE,
point.padding = unit(0.2, "lines")) +
theme_void()
Here I build my topic model with Structural Topic Models. This allows me to discover topics and estimate their relationship to paper metadata. Different from other traditional probabilistic topic models, such as Latent Dirichlet Allocation or the Correlated Topic Model, the Structural Topic Model lets users incorporate arbitrary metadata. ref
First I create a document-term matrix from my tidy-structure dataset.
papers_dfm_100<-tidy_abstract_100 %>%
count(id,word) %>%
cast_dfm(id,word,n)
library(stm)
Choosing the number of topics in STM is much like choosing k in k-means clustering: I do not know in advance how many topics I should use, but I have to specify the number in the algorithm. There is no “correct” answer to this question, but I can select K by fitting models with different numbers of topics and evaluating them.
Although I have already dropped some useless information from the dataset, the modeling procedure still takes a while to run. Therefore, I use parallel processing to speed it up.
library(furrr)
plan(multiprocess)
many_models <- data_frame(K=seq(20,100,20)) %>%
mutate(topic_model = future_map(K,~stm(papers_dfm_100,K=.,verbose = F,init.type = "Spectral")))
heldout<-make.heldout(papers_dfm_100)
k_result<-many_models %>%
mutate(exclusivity = map(topic_model,exclusivity),
semantic_coherence = map(topic_model,semanticCoherence,papers_dfm_100),
eval_heldout=map(topic_model,eval.heldout,heldout$missing),
residual=map(topic_model,checkResiduals,papers_dfm_100),
bound=map_dbl(topic_model,function(x) max(x$convergence$bound)),
lfact = map_dbl(topic_model, function(x) lfactorial(x$settings$dim$K)),
lbound=bound+lfact,
iterations=map_dbl(topic_model,function(x) length(x$convergence$bound)))
I evaluate the models based on their residuals, the semantic coherence of the topics, the held-out likelihood, and exclusivity.
The residuals are lowest at \(K=100\), and the held-out likelihood is highest at \(K=100\), so perhaps a good number of topics would be 100.
k_result %>%
transmute(K,
`Lower bound` = lbound,
Residuals = map_dbl(residual, "dispersion"),
`Semantic coherence` = map_dbl(semantic_coherence, mean),
`Held-out likelihood` = map_dbl(eval_heldout, "expected.heldout")) %>%
gather(Metric, Value, -K) %>%
ggplot(aes(K, Value, color = Metric)) +
geom_line(size = 1.5, alpha = 0.7, show.legend = FALSE) +
facet_wrap(~Metric, scales = "free_y") +
labs(x = "K (number of topics)",
y = NULL,
title = "Model diagnostics by residuals over number of topics")
However, let’s look at the first 9 topics generated by the model with \(K=100\). More specific topics are produced now, but it is getting harder to interpret them. For example, it is difficult to decode the main content of Topic 9 from its top five key words: weight, bp, version, similar, and sum. bp might stand for backpropagation, but I cannot understand what the other words have in common. A large number of topics might be plausible, but it sacrifices clarity.
topic_model_100_100<-k_result %>%
filter(K == 100) %>%
pull(topic_model) %>%
.[[1]]
tidy(topic_model_100_100) %>%
group_by(topic) %>%
top_n(5,beta) %>%
ungroup() %>%
arrange(topic,-beta) %>%
filter(topic<=12) %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
facet_wrap(~ topic, scales = "free", ncol = 3) +
coord_flip()
Then I use another method to determine the proper number of topics: exclusivity vs. semantic coherence. Semantic coherence is maximized when the most probable words in a given topic frequently co-occur together, which correlates well with human judgments of topic quality ref. Exclusivity measures how exclusive a topic’s word usage is relative to the other topics ref. Having high semantic coherence is relatively easy when there are only a few topics dominated by very common words, so I also have to take the exclusivity of words into account. There is a trade-off between semantic coherence and exclusivity: models with fewer topics have higher semantic coherence but lower exclusivity.
Here I choose the model with 40 topics.
g_k_result<-k_result %>%
select(K, exclusivity, semantic_coherence) %>%
filter(K %in% c(20,40, 60, 80, 100)) %>%
unnest() %>%
mutate(K = as.factor(K)) %>%
ggplot(aes(semantic_coherence, exclusivity, color = K)) +
geom_point(size = 2, alpha = 0.5,aes(shape=K)) +
labs(x = "Semantic coherence",
y = "Exclusivity",
title = "Comparing exclusivity and semantic coherence")
ggplotly(g_k_result)
After picking the model with \(K=40\), I start to evaluate it.
topic_model_100<-k_result %>%
filter(K == 40) %>%
pull(topic_model) %>%
.[[1]]
sum_topic<-summary(topic_model_100)
## A topic model with 40 topics, 7208 documents and a 892 word dictionary.
I plot a bubble chart to illustrate the probability (beta) that a word is generated from a topic under the chosen topic model. A larger bubble for a topic indicates a higher probability that papers from this topic contain the word. For example, the beta of the word network in Topic 29 equals \(0.29\), which clearly exceeds the beta of network in Topic 7, \(\beta_{network,7}=0.08\).
td_beta_100<-tidy(topic_model_100)
word_bubble<-td_beta_100 %>%
group_by(topic) %>%
top_n(5,beta) %>%
ungroup() %>%
mutate(topic = paste0("Topic ", topic)) %>%
group_by(topic) %>%
arrange(-beta, .by_group=T) %>%
ungroup()
g_bubble<-word_bubble %>%
left_join(td_beta_100,by=c("term"="term")) %>%
select(topic.y,term,beta.y) %>%
filter(beta.y>0.005) %>%
mutate(topic=topic.y,beta=beta.y,
bin = cut(beta, breaks = c(-Inf, quantile(beta,c(0.25,0.5,0.75)), Inf), labels = c(1,2,3,4))) %>%
ggplot(aes(x=topic,y=term,fill=bin)) +
geom_point(aes(size=beta,color=bin),alpha=0.5)+
labs(title = "Betas against Topics",
y=NULL) +
scale_x_continuous(breaks = seq(1,40,1)) +
theme(legend.position = "none")
gg_bubble<-ggplotly(g_bubble)
gg_bubble
Let’s come back to a previous point: why I removed some words from the dataset before starting the topic modeling. The second reason is that, if I did not drop them, these words would cause granularity in the bubble chart; in other words, they would not help the algorithm detect latent topics because they appear evenly across most of the topics. Therefore, I discard them. The following screenshot displays this granularity: it is clear that the bubbles for performance and numb form two nearly continuous lines.
knitr::include_graphics("granularity2.png")
I am interested in the probability with which each paper is assigned to each topic. I am also curious about the words most frequently used by researchers. In the bar chart below, the x-axis, \(\gamma\), is the weight that a topic takes in the paper-topic classification, and the y-axis lists the different topics.
I find that the top words selected from each topic help me easily distinguish the content of the topic. For example, Topic 5, which ranks first on the prevalence chart, seems to be associated with estimation. Topic 36, which takes 6th place on the chart, is more likely to discuss stochastic gradient optimization methods.
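A quick way to sanity-check such an interpretation is to list the papers that load most heavily on a topic; the sketch below does this for Topic 5 (chosen here only as an example):
# Papers with the highest gamma for Topic 5, joined back to year and title.
tidy(topic_model_100, matrix = "gamma",
document_names = rownames(papers_dfm_100)) %>%
filter(topic == 5) %>%
arrange(desc(gamma)) %>%
head(5) %>%
mutate(id = as.integer(document)) %>%
left_join(select(papers_clean, id, year, title), by = "id")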
library(reshape2)
top_terms <- tidy(topic_model_100) %>%
arrange(beta) %>%
group_by(topic) %>%
top_n(7, beta) %>%
arrange(-beta) %>%
select(topic, term) %>%
summarise(terms = list(term)) %>%
mutate(terms = map(terms, paste, collapse = ", ")) %>%
unnest()
gamma_terms <- tidy(topic_model_100,matrix = "gamma",
document_names = rownames(papers_dfm_100)) %>%
group_by(topic) %>%
summarise(gamma = mean(gamma)) %>%
arrange(desc(gamma)) %>%
left_join(top_terms, by = "topic") %>%
mutate(topic = paste0("Topic ", topic),
topic = reorder(topic, gamma))
gamma_terms %>%
top_n(40, gamma) %>%
ggplot(aes(topic, gamma, label = terms, fill = colorspace::rainbow_hcl(40))) +
geom_col(show.legend = FALSE) +
geom_text(hjust = 0, nudge_y = 0.0005, size = 3) +
coord_flip() +
scale_y_continuous(expand = c(0,0),
limits = c(0, 0.09)) +
theme(plot.title = element_text(size = 16),
plot.subtitle = element_text(size = 13)) +
labs(x = NULL, y = paste('Gamma'),
title = "Top 40 topics by prevalence in NeurIPS 1987 - 2017",
subtitle = "With the top words that contribute to each topic")
Then I analyze the 40 topics against time, so I can explore the trends of these topics, for example which ones dominated the research interest in a given year. I create a stacked area chart to display the evolution of the different topics. The x-axis is year and the y-axis is the proportion of each topic in a given year. For example, in 1991, papers related to Topic 29, recurrent neural networks, amount to \(15.38\%\) of NeurIPS papers, an overwhelming advantage over the other topics.
I also add four statistics for each topic: Highest Prob, FREX, Lift, and Score. Highest Prob gives the highest-probability words. FREX is the weighted harmonic mean of a word’s rank in terms of exclusivity and frequency, which measures exclusivity in a way that balances word frequency ref. Lift weights words by dividing by their frequency in other topics ref, which is similar to tf-idf. Score is similar to tf-idf as well: it divides the log frequency of the word in the topic by the log frequency of the word in other topics. ref
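These four statistics can also be printed directly with stm’s labelTopics(); for instance, for the recurrent-neural-network topic discussed above (Topic 29):
# Highest-probability, FREX, Lift, and Score words for Topic 29.
labelTopics(topic_model_100, topics = 29, n = 5)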
From the stacked area chart below, I can see that there were far fewer research topics around 1990, at the beginning of NeurIPS, compared to 2017. Topic 7, deep neural networks, Topic 29, recurrent neural networks, and Topic 14, Very-Large-Scale-Integration (VLSI) system memory implementation, accounted for the main research interests in 1991. As time went by, more topics on machine learning or computational neuroscience popped up, such as Topic 13, Regret Multi-Armed Bandit, Topic 36, Batch Gradient Stochastic Optimization, and Topic 17, Latent Topic Probability, in 2017. The research topics appearing at NeurIPS are getting richer and more diverse.
td_gamma_100 <- topic_model_100 %>%
tidy(matrix = 'gamma',
document_names = rownames(papers_dfm_100)) %>%
dcast(document~topic,value.var = "gamma") %>%
arrange(as.numeric(document))
doc_dict<-td_gamma_100$document
rownames(td_gamma_100)<-doc_dict
td_gamma_100$document<-NULL
doc_group<-td_gamma_100 %>%
max.col(ties.method = "random")
papers_group<-data.frame(document=as.integer(doc_dict),group=doc_group)
colnames(papers_group) <- c("document","group")
year_group<-papers_clean %>%
select(id,year) %>%
inner_join(papers_group,by=c("id"="document")) %>%
group_by(year,group) %>%
summarise(n=n()) %>%
ungroup()
year_total<-year_group %>%
group_by(year) %>%
summarise(total=sum(n)) %>%
ungroup()
year_group_all<-expand.grid(year=1987:2017,group=1:40) %>%
left_join(year_group,by=c("year"="year","group"="group")) %>%
left_join(year_total,by=c("year"="year"))
year_group_all[is.na(year_group_all)] <-0
g_year_group<-year_group_all %>%
mutate(portion=n/total) %>%
ggplot(aes(x=year,y=portion,fill=factor(group))) +
geom_area(color="black",alpha=0.8) +
labs(title = "Topic Distribution given Time",
y=NULL)
gg_year_group<-ggplotly(g_year_group)
for (x in 1:40){
gg_year_group$x$data[[x]]$text<-paste0(gg_year_group$x$data[[x]]$text,
paste("","Highest Prob:", paste(sum_topic$prob[x,1:3],collapse = " "),sep = "<br />"),
paste("","FREX:", paste(sum_topic$frex[x,1:3],collapse = " "),sep = "<br />"),
paste("","Lift:", paste(sum_topic$lift[x,1:3],collapse = " "),sep = "<br />"),
paste("","Score:", paste(sum_topic$score[x,1:3],collapse = " "),sep = "<br />"))
}
gg_year_group
The next line chart displays the trends of the different topics. Each line represents one year. The x-axis shows all 40 topics, and the y-axis indicates the number of papers in a given year.
During 1987-1991, researchers focused on Topic 7, Neural Networks, Topic 11, Shape Recognition, Topic 25, Fisher Kernel, and Topic 27, Speech Recognition. This enthusiasm continued into 1992-1996, while more attention was paid to Topic 34, Dynamic Reinforcement Learning. During 1997-2001, researchers’ enthusiasm for Topic 25, Fisher Kernel, and Topic 27, Speech Recognition, subsided. On the other hand, a growing interest in Topic 39, Support Vector Machine Classification, showed up, which lasted for 15 years, from 1997 to 2011. Starting from 2002, researchers’ interests expanded to more topics, and the papers appearing at NeurIPS became more diverse. 2007-2011 witnessed sharp fluctuations in research interest, since the lines do not overlap much with each other. Topic 2, Distance Metric Learning, Topic 6, Monte Carlo Posterior, and Topic 32, Sparse Matrix Feature Selection, remained hot topics during those 5 years. After 2012, researchers started to dedicate more energy to Topic 21, Low Rank Matrix Approximation, and Topic 36, Stochastic Gradient Descent Optimization.
By summarizing the trend of spikes across the different panels, I conclude that it takes about 10 to 15 years for a topic to become popular or to fade away. For example, Topic 39, Support Vector Machine Classification, gained more attention from academia from 1992-1996 to 1997-2001. The trend remained in 2002-2006 and 2007-2011, and then the heat vanished in 2012-2016.
year_group_all %>%
mutate(portion=n/total,
bin=cut(year,breaks = c(-Inf,1991.5,1996.5,2001.5,2006.5,2011.5,2016.5,Inf),labels = c("1987-1991","1992-1996","1997-2001","2002-2006","2007-2011","2012-2016","2017"))) %>%
#filter(bin==2) %>%
ggplot(aes(x=group,y=n,color=factor(year))) +
geom_line() +
scale_x_continuous(breaks = seq(1,40,1)) +
#coord_polar() +
facet_wrap(vars(bin),nrow = 7,scales = "free") +
labs(color="year") +
theme(legend.position = "none") ->g_group_year
g_group_year
#ggplotly(g_group_year)
I process the NeurIPS data by checking the reliability of all the papers contained in the table: whether the dataset collected all the papers accepted by NeurIPS, and whether the content of each paper was scraped properly. I notice that some papers contain garbled symbols, and I discuss the possible causes of these errors (mathematical equations, figures, and Optical Character Recognition errors).
After ruling out these papers, I decide to focus on analyzing the abstract of each paper, because the abstract contains the essence of the researchers’ thoughts, and I can avoid the meaningless symbols created by OCR errors. Because the dataset does not provide every abstract before 2012, and many abstracts before 2006 are not available, regular expressions are used to extract the abstract from the whole text.
I then conduct my analysis on all papers with existing or extracted abstracts. I prepare the data for further exploration by transforming characters into lowercase, lemmatizing words, tokenizing the data into a tidy structure, and removing stop words. I calculate the term frequency and term frequency–inverse document frequency for each word. The top 10 most frequently used words are reasonable, such as model, learn, algorithm, etc. However, the top 10 words with the highest tf-idf are not common, at least to me. Word cloud plots are created to visualize the top tf and top tf-idf words.
After exploring the unigram patterns in the data, I move on to bigrams. Word co-occurrence and word correlation are calculated based on the words appearing more than 100 times, to reduce the computational burden. Words with extremely high frequency are removed because they would create too many redundant pairs and would not help me detect the latent topic structure. The word co-occurrence result indicates that neural network is a main topic at NeurIPS, and that researchers also focus on real world problems and applications. The network plot of word correlations even reveals some three-word structures, such as monte carlo chain, chip vlsi analog, and spike neuron fire.
A Structural Topic Model is used to detect the latent topic patterns, and a model with 40 topics is selected. The probability of a word being generated from a certain topic is illustrated by a bubble chart. I removed words with similar probability across almost all topics to make the topics more specific. Estimation, Support Vector Machine Classification, and Function Estimation are the top 3 most popular topics over the 31 years. Researchers’ interests have been changing dramatically, from Very-Large-Scale-Integration systems to Batch Gradient Stochastic Optimization, and the topics presented at NeurIPS have become more diverse as well. It usually takes 10 to 15 years for a topic to thrive or diminish.
It must be remembered that this project is limited by my understanding of NeurIPS and my exploration of the data. Further studies could identify the institutions that contribute the most, the most popular datasets, or the most influential researchers in machine learning. I hope the abstract acts as something like a sufficient statistic for the paper; however, if papers with fewer OCR errors were provided, better conclusions might be obtained. Last but not least, although I validate the data and the model with both quantitative and qualitative techniques, the results of topic models should not be over-interpreted.
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=Chinese (Simplified)_China.936
## [2] LC_CTYPE=Chinese (Simplified)_China.936
## [3] LC_MONETARY=Chinese (Simplified)_China.936
## [4] LC_NUMERIC=C
## [5] LC_TIME=Chinese (Simplified)_China.936
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] reshape2_1.4.3 stm_1.3.3 ggraph_1.0.2
## [4] igraph_1.2.2 widyr_0.1.1 wordcloud_2.6
## [7] RColorBrewer_1.1-2 textstem_0.1.4 koRpus.lang.en_0.1-2
## [10] koRpus_0.11-5 sylly_0.1-5 tidytext_0.2.0
## [13] bindrcpp_0.2.2 DT_0.5 plotly_4.8.0
## [16] forcats_0.3.0 stringr_1.3.1 dplyr_0.7.8
## [19] purrr_0.3.0 readr_1.3.1 tidyr_0.8.2
## [22] tibble_2.0.1 ggplot2_3.1.0 tidyverse_1.2.1
## [25] RSQLite_2.1.1
##
## loaded via a namespace (and not attached):
## [1] colorspace_1.4-0 qdapRegex_0.7.2 rstudioapi_0.9.0
## [4] farver_1.1.0 SnowballC_0.6.0 ggrepel_0.8.0
## [7] bit64_0.9-7 fansi_0.4.0 lubridate_1.7.4
## [10] textclean_0.9.3 xml2_1.2.0 knitr_1.21
## [13] polyclip_1.10-0 quanteda_1.4.3 jsonlite_1.6
## [16] broom_0.5.1 ggforce_0.2.1 shiny_1.2.0
## [19] compiler_3.5.2 httr_1.4.0 backports_1.1.3
## [22] assertthat_0.2.0 Matrix_1.2-15 lazyeval_0.2.1
## [25] cli_1.0.1 later_0.8.0 tweenr_1.0.1
## [28] htmltools_0.3.6 tools_3.5.2 gtable_0.2.0
## [31] glue_1.3.0 fastmatch_1.1-0 Rcpp_1.0.0
## [34] cellranger_1.1.0 lexicon_1.2.1 nlme_3.1-137
## [37] crosstalk_1.0.0 xfun_0.4 stopwords_0.9.0
## [40] rvest_0.3.2 mime_0.6 MASS_7.3-51.1
## [43] scales_1.0.0 hms_0.4.2 promises_1.0.1
## [46] yaml_2.2.0 memoise_1.1.0 gridExtra_2.3
## [49] stringi_1.2.4 highr_0.7 tokenizers_0.2.1
## [52] sylly.en_0.1-3 matrixStats_0.54.0 rlang_0.3.1
## [55] pkgconfig_2.0.2 evaluate_0.12 lattice_0.20-38
## [58] bindr_0.1.1 htmlwidgets_1.3 labeling_0.3
## [61] bit_1.1-14 tidyselect_0.2.5 plyr_1.8.4
## [64] magrittr_1.5 R6_2.3.0 generics_0.0.2
## [67] DBI_1.0.0 pillar_1.3.1 haven_2.0.0
## [70] withr_2.1.2 textshape_1.6.0 janeaustenr_0.1.5
## [73] modelr_0.1.2 crayon_1.3.4 utf8_1.1.4
## [76] rmarkdown_1.11 viridis_0.5.1 syuzhet_1.0.4
## [79] grid_3.5.2 readxl_1.2.0 data.table_1.12.0
## [82] blob_1.1.1 digest_0.6.18 xtable_1.8-3
## [85] spacyr_1.0 httpuv_1.5.0 RcppParallel_4.4.2
## [88] munsell_0.5.0 viridisLite_0.3.0