---
title: "Classification with NMF-LAB"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Classification with NMF-LAB}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
  %\VignetteDepends{palmerpenguins}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5
)
```

## Introduction

This vignette demonstrates how to use the `nmfkc` package for **supervised classification**, a technique referred to as **NMF-LAB** (Label-based NMF).

### How it works

The NMF-LAB approach treats multi-class classification as a matrix factorization task:

$$Y \approx X C A$$

* **$Y$ (Target):** One-hot encoded matrix of class labels (Classes $\times$ Samples).
* **$A$ (Input):** Feature matrix, or a kernel matrix constructed from the features.
* **$X$ (Learned basis):** A mapping matrix between latent factors and class labels.
* **$C$ (Coefficient/parameter):** Captures the relationship between the input $A$ and the classes.

**Key trade-off:**

1. **Linear model ($A = Features$):** Highly interpretable. $C$ tells you exactly which feature contributes to which class.
2. **Kernel model ($A = Kernel$):** Typically more accurate. It handles non-linear boundaries, but is harder to interpret.

We will explore this trade-off using the `iris` dataset. First, let's load the required packages.

```{r load-packages}
library(nmfkc)
library(palmerpenguins)
```

-----

## Example 1: The Iris Dataset (Linear vs. Kernel)

Our goal is to classify three species of iris flowers (*setosa*, *versicolor*, *virginica*) from four measurements.

### 1. Data Preparation

We convert the species labels into a binary class matrix $Y$ and normalize the features $U$.

```{r iris-data-prep}
# 1. Prepare labels (Y)
label_iris <- iris$Species
Y_iris <- nmfkc.class(label_iris)  # One-hot encoding (3 classes x 150 samples)
rank_iris <- length(unique(label_iris))  # Number of classes

# 2. Prepare features (U)
# Normalize features to the [0, 1] range and transpose to (Features x Samples)
U_iris <- t(nmfkc.normalize(iris[, -5]))
```

### Step 1: Linear NMF-LAB (Focus on Interpretability)

First, we fit a **linear model** by using the features $U$ directly as the input matrix $A$. This lets us answer **"which feature drives which class?"** by inspecting the matrix $C$.

```{r iris-linear}
# Use the normalized features directly as A
A_linear <- U_iris

# Fit linear NMF-LAB
res_linear <- nmfkc(Y = Y_iris, A = A_linear, rank = rank_iris,
                    seed = 123, prefix = "Class")

# --- Interpretability check ---
# The matrix C (Q x R) shows the weight of each feature (columns)
# for each class (rows). Let's look at the estimated weights:
round(res_linear$C, 2)
```

**Interpretation:** Looking at the matrix $C$ above:

* **Class1 (setosa):** High weights on `Sepal.Width` (2nd column).
* **Class3 (virginica):** High weights on `Petal.Length` (3rd column) and `Petal.Width` (4th column).

This "white-box" transparency is the main advantage of the linear model.

**Accuracy check:** We evaluate the model using a confusion matrix. *(Note: linear models often struggle with overlapping classes such as versicolor and virginica.)*

```{r iris-linear-acc}
pred_linear <- predict(res_linear, type = "class")
(f_linear <- table(fitted.label = pred_linear, label = label_iris))

# Calculate accuracy (assuming diagonal correspondence)
acc_linear <- sum(diag(f_linear)) / sum(f_linear)
cat(paste0("Linear Model Accuracy: ", round(acc_linear * 100, 2), "%\n"))
```

### Step 2: Kernel NMF-LAB (Focus on Performance)

To improve accuracy, we switch to the **kernel model**, mapping the features into a high-dimensional space with a Gaussian kernel.

```{r iris-kernel}
# 1. Optimize the kernel width (beta)
# Heuristic estimation of beta
res_beta <- nmfkc.kernel.beta.nearest.med(U_iris)

# Cross-validation for fine-tuning (using the generated candidates)
cv_res <- nmfkc.kernel.beta.cv(Y_iris, rank = rank_iris, U = U_iris,
                               beta = res_beta$beta_candidates, plot = FALSE)
best_beta <- cv_res$beta

# 2. Fit kernel NMF-LAB
A_kernel <- nmfkc.kernel(U_iris, beta = best_beta)
res_kernel <- nmfkc(Y = Y_iris, A = A_kernel, rank = rank_iris,
                    seed = 123, prefix = "Class")

# 3. Prediction and evaluation
fitted_label <- predict(res_kernel, type = "class")
(f_kernel <- table(fitted.label = fitted_label, label = label_iris))

# Calculate accuracy
acc_kernel <- sum(diag(f_kernel)) / sum(f_kernel)
cat(paste0("Kernel Model Accuracy: ", round(acc_kernel * 100, 2), "%\n"))
```

**Result:** The accuracy rises noticeably (often above 96%): the kernel separates the overlapping class boundary that the linear model could not.

### Visualization: Class Prototypes (Basis X)

Finally, let's visualize the basis matrix $X$ of the successful kernel model. Ideally, it should look like a diagonal matrix, mapping each latent factor to a specific species.

```{r iris-plot-X, fig.height=5}
image(t(res_kernel$X)[, nrow(res_kernel$X):1],
      main = "Basis Matrix X (Kernel Model)\nMapping Factors to Species",
      axes = FALSE, col = hcl.colors(12, "YlOrRd", rev = TRUE))
axis(1, at = seq(0, 1, length.out = rank_iris),
     labels = colnames(res_kernel$X))
axis(2, at = seq(0, 1, length.out = rank_iris),
     labels = rev(rownames(res_kernel$X)), las = 2)
box()
```

-----

## Example 2: The Palmer Penguins Dataset

Let's apply the **kernel NMF-LAB** workflow to classify penguin species (*Adelie*, *Chinstrap*, *Gentoo*), this time focusing on the **probabilistic (soft)** nature of NMF classification.

### 1. Data Preparation

We must remove rows with missing values (`NA`), as the kernel matrix $A$ cannot handle missing entries in the input features.
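Before cleaning, it can be worth checking which columns actually contain missing values; a quick base-R tabulation (an optional aside, not required for the workflow) looks like this:

```{r penguins-na-check}
# Count missing values per column of the raw penguins data;
# na.omit() in the next chunk drops every row containing at least one NA
colSums(is.na(palmerpenguins::penguins))
```

Most of the `NA`s sit in the `sex` column, which we do not use as a feature, but `na.omit()` drops those rows all the same; keep this in mind if every row matters.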
```{r penguins-data-prep}
# Load and clean the data (remove rows with NAs)
d_penguins <- na.omit(palmerpenguins::penguins)

# Prepare Y (labels)
label_penguins <- d_penguins$species
Y_penguins <- nmfkc.class(label_penguins)

# Prepare U (features)
U_penguins <- t(nmfkc.normalize(d_penguins[, 3:6]))
```

### 2. Model Fitting

We use the heuristic $\beta$ directly for a quick demonstration.

```{r penguins-model-fit}
rank_penguins <- length(unique(label_penguins))

# 1. Heuristic beta estimation
best_beta_penguins <- nmfkc.kernel.beta.nearest.med(U_penguins)$beta

# 2. Fit kernel NMF-LAB
A_penguins <- nmfkc.kernel(U_penguins, beta = best_beta_penguins)
res_penguins <- nmfkc(Y = Y_penguins, A = A_penguins, rank = rank_penguins,
                      seed = 123, prefix = "Class")
```

### 3. Visualizing "Soft" Classification

Unlike many classifiers that output only a final label, NMF provides a **probability distribution** over the classes. The plot below shows the predicted class probabilities for each penguin.

* **Solid blocks of color:** high-confidence predictions.
* **Mixed colors:** samples where the model is uncertain (often near the boundary between two species).

```{r penguins-soft-vis, fig.width=8, fig.height=4}
# Get probabilistic predictions
probs <- predict(res_penguins, type = "prob")

# Visualize as a stacked bar chart, one bar per penguin
barplot(probs, col = c("#FF8C00", "#9932CC", "#008B8B"), border = NA,
        main = "Soft Classification Probabilities (Penguins)",
        xlab = "Sample Index", ylab = "Probability")
legend("topright", legend = levels(label_penguins),
       fill = c("#FF8C00", "#9932CC", "#008B8B"), bg = "white", cex = 0.8)
```

### 4. Evaluation

Finally, we calculate the accuracy.

```{r penguins-acc}
fitted_label_p <- predict(res_penguins, type = "class")
(f_penguins <- table(Predicted = fitted_label_p, Actual = label_penguins))

acc_p <- sum(diag(f_penguins)) / sum(f_penguins)
cat(paste0("Penguins Accuracy: ", round(acc_p * 100, 2), "%\n"))
```
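A single overall accuracy can hide weak performance on a minority class (Chinstrap is the smallest group in this dataset). Per-class recall and balanced accuracy fall straight out of a confusion matrix of this shape; below is a base-R sketch in which `cm` is an illustrative stand-in for `f_penguins`:

```{r balanced-accuracy-sketch}
# 'cm' is an illustrative (Predicted x Actual) confusion matrix,
# not an object created above -- substitute f_penguins in practice
cm <- matrix(c(50,  0,  0,
                1, 45,  2,
                0,  3, 48),
             nrow = 3, byrow = TRUE,
             dimnames = list(Predicted = c("A", "B", "C"),
                             Actual    = c("A", "B", "C")))
recall <- diag(cm) / colSums(cm)  # correct / actual count, per class
balanced_acc <- mean(recall)      # average recall across classes
round(c(recall, balanced = balanced_acc), 3)
```

Because balanced accuracy averages the per-class recalls, a model that does well on the two large species but poorly on the small one scores lower here than under plain accuracy.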