---
title: "Topic Modeling with nmfkc"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Topic Modeling with nmfkc}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
  %\VignetteDepends{quanteda}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5
)
```

## Introduction

Non-negative Matrix Factorization (NMF) is a powerful technique for topic modeling. By decomposing a document-term matrix, we can simultaneously discover latent **Topics** (clusters of words) and their **Trends** (their proportions across documents).

This vignette demonstrates how to use the `nmfkc` package to analyze the U.S. presidential inaugural addresses provided by the `quanteda` package. We will cover:

1. **Data Preparation**: Converting text data into a matrix format suitable for `nmfkc`.
2. **Rank Selection**: Determining the optimal number of topics using robust cross-validation.
3. **Standard NMF**: Extracting basic topics and interpreting their key words.
4. **Kernel NMF**: Modeling the **temporal evolution** of topics using the year as a covariate.

First, let's load the necessary packages.

```{r load-packages}
library(nmfkc)
library(quanteda)
```

## 1. Data Preparation

We create a document-feature matrix (DFM) where rows represent documents and columns represent words.

```{r data-prep}
# Load the corpus from quanteda
corp <- corpus(data_corpus_inaugural)

# Preprocessing: tokenize, then remove punctuation and stopwords
tok <- tokens(corp, remove_punct = TRUE)
tok <- tokens_remove(tok, pattern = stopwords("en", source = "snowball"))

# Create the DFM and filter it
df <- dfm(tok)
df <- dfm_select(df, min_nchar = 3)    # Remove short words (<= 2 characters)
df <- dfm_trim(df, min_termfreq = 100) # Remove rare words (< 100 occurrences)

# --- CRITICAL STEP ---
# quanteda's DFM is (Documents x Words).
# nmfkc expects (Words x Documents).
# We must transpose the matrix.
d <- as.matrix(df)

# Sort features by frequency
index <- order(colSums(d), decreasing = TRUE)
d <- d[, index]

Y <- t(d)
dim(Y) # Features (Words) x Samples (Documents)
```

## 2. Rank Selection (Number of Topics)

Before fitting the model, we need to decide on the number of topics ($rank$). The `nmfkc.rank()` function helps us choose an appropriate rank. The default `detail = "full"` performs **element-wise cross-validation (Wold's CV)**: it randomly holds out individual matrix elements and evaluates how well the model predicts them, providing a robust measure for rank selection (at the cost of extra computation time).

```{r rank-selection, fig.width=7, fig.height=6}
# Evaluate ranks from 2 to 6
nmfkc.rank(Y, rank = 2:6)
```

Looking at the diagnostics (e.g., the minimum ECV sigma, the elbow of the R-squared curve, or a high cophenetic correlation), let's assume **rank = 3** is a reasonable choice for this overview.

## 3. Standard NMF

We fit the standard NMF model ($Y \approx XB$) with `rank = 3`. In the context of topic modeling:

* **$X$ (basis matrix):** represents the **Topics** (distributions of words).
* **$B$ (coefficient matrix):** represents the **Trends** (distribution of topics across documents).

```{r model-fit}
rank <- 3

# Set seed for reproducibility
res_std <- nmfkc(Y, rank = rank, seed = 123, prefix = "Topic")

# Check goodness of fit (R-squared)
res_std$r.squared
```

### Interpreting Topics (Keywords)

We can identify the meaning of each topic by looking at the words with the highest weights in the basis matrix `X`.

```{r interpret-topics}
# For each topic, print the words whose normalized weight in X.prob
# (row-normalized X) is at least 0.5, with their overall corpus frequency
Xp <- res_std$X.prob
for (q in 1:rank) {
  message(paste0("----- Featured words on Topic [", q, "] -----"))
  print(paste0(rownames(Xp), "(", rowSums(Y), ") ",
               round(100 * Xp[, q], 1), "%")[Xp[, q] >= 0.5])
}
```

*(Note: Interpretation depends on the result.
For example, one topic might contain words like "government, people, states" (political themes), while another might have "world, peace, freedom" (international themes).)*

## 4. Kernel NMF: Temporal Topic Evolution

One of the unique features of `nmfkc` is **Kernel NMF**. In standard NMF, the order of documents is ignored; each speech is treated independently. However, inaugural addresses have a strong **temporal component**. By using the year as a covariate, we can smooth the topic proportions over time and reveal historical shifts.

### Optimizing the Kernel Parameter

We construct a covariate matrix `U` from the year of each address, then find the optimal kernel bandwidth (`beta`) by cross-validation.

```{r kernel-prep}
# Covariate: year of the address
years <- as.numeric(substring(names(data_corpus_inaugural), 1, 4))
U <- t(as.matrix(years))

# Candidate values for beta (Gaussian kernel width);
# we test a specific range to locate the minimum CV error
beta_candidates <- c(0.2, 0.5, 1, 2, 5) / 10000

# Run CV to find the best beta
# Note: we use the same rank (rank = 3) as selected above
cv_res <- nmfkc.kernel.beta.cv(Y, rank = rank, U = U,
                               beta = beta_candidates, plot = FALSE)
best_beta <- cv_res$beta
print(best_beta)
```

### Fitting Kernel NMF

Now we fit the model using the kernel matrix `A`. This enforces that documents close in time (similar years) have similar topic distributions.

```{r kernel-fit}
# Create the kernel matrix
A <- nmfkc.kernel(U, beta = best_beta)

# Fit NMF with kernel covariates
res_ker <- nmfkc(Y, A = A, rank = rank, seed = 123, prefix = "Topic")
```

### Visualization: Standard vs Kernel NMF

Let's compare how the topic proportions change over time.

* **Standard NMF (top):** shows noisy fluctuations. It captures the specific content of each speech but misses the larger historical context.
* **Kernel NMF (bottom):** reveals smooth historical trends, showing how themes such as "nation building" vs "global affairs" have evolved over the centuries.
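Beyond the visual comparison, a quick numeric check (using the fitted objects above) can quantify the smoothing: the average jump in topic proportions between consecutive speeches should be smaller for the kernel fit than for the standard fit.

```r
# Average absolute change in topic proportions between consecutive speeches.
# B.prob is (rank x documents); diff() on t(B.prob) takes differences
# between consecutive documents, so smaller values mean smoother trends.
roughness <- function(B) mean(abs(diff(t(B))))
c(standard = roughness(res_std$B.prob),
  kernel   = roughness(res_ker$B.prob))
```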
```{r plot-comparison, fig.width=8, fig.height=8}
oldpar <- par(mfrow = c(2, 1), mar = c(4, 4, 2, 1))

# Prepare axis labels (years rounded to integers).
# Note: barplot() returns the bar midpoints, which we must use for axis()
# because the bars are NOT located at x = 1, 2, ..., ncol(Y).
idx <- round(seq(1, ncol(Y), length.out = 10))
labels_years <- round(seq(min(years), max(years), length.out = 10))

# 1. Standard NMF (noisy)
bp <- barplot(res_std$B.prob, col = 2:(rank + 1), border = NA, xaxt = "n",
              main = "Standard NMF: Topic Proportions (Noisy)",
              ylab = "Probability")
axis(1, at = bp[idx], labels = labels_years)

# 2. Kernel NMF (smooth trend)
bp <- barplot(res_ker$B.prob, col = 2:(rank + 1), border = NA, xaxt = "n",
              main = "Kernel NMF: Temporal Topic Evolution (Smooth)",
              ylab = "Probability")
axis(1, at = bp[idx], labels = labels_years)

# Legend
legend("topright", legend = paste("Topic", 1:rank),
       fill = 2:(rank + 1), bg = "white", cex = 0.8)

par(oldpar)
```
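### A Note on the Kernel Matrix

For intuition about what the kernel matrix `A` encodes, here is a minimal base-R sketch, assuming a Gaussian kernel of the form $K(u, v) = \exp(-\beta (u - v)^2)$ (the exact formula implemented by `nmfkc.kernel()` may differ; this is for illustration only). Entries near 1 mark pairs of addresses whose topic proportions are pulled toward each other.

```r
# Illustrative sketch (assumed Gaussian form, not necessarily the exact
# implementation inside nmfkc.kernel): similarity between two years
# decays with their squared distance, scaled by beta.
gaussian_kernel <- function(u, beta) {
  exp(-beta * outer(u, u, function(a, b) (a - b)^2))
}

# Nearby years (1789 vs 1793) get a similarity close to 1;
# distant years (1789 vs 2021) get a similarity close to 0.
A_sketch <- gaussian_kernel(c(1789, 1793, 1901, 2021), beta = 2e-4)
round(A_sketch, 3)
```

Larger `beta` values shrink the effective bandwidth, so only very close years influence each other; smaller values produce heavier smoothing across the whole timeline.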