--- title: "Usage Tutorial" author: "Chen Yang" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Usage Tutorial} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( echo = TRUE, message = FALSE, warning = FALSE, eval = FALSE ) ``` # Usage Tutorial This tutorial provides detailed instructions for using mLLMCelltype for cell type annotation in single-cell RNA sequencing data. We'll cover various usage scenarios, parameter configurations, and integration with Seurat. ## Comprehensive Function Parameters ### annotate_cell_types() The main function for cell type annotation with a single model: ```{r} library(mLLMCelltype) results <- annotate_cell_types( input, # Marker gene data (data frame, list, or file path) tissue_name, # Tissue name (e.g., "human PBMC", "mouse brain") model, # LLM model to use api_key = NA, # API key (if not set in environment, NA returns prompt only) top_gene_count = 10, # Number of top genes per cluster to use debug = FALSE # Whether to print debugging information ) ``` ### interactive_consensus_annotation() Function for creating consensus annotations from multiple models through interactive discussion: ```{r eval=FALSE} consensus_results <- interactive_consensus_annotation( input, # Original marker gene data (Seurat FindAllMarkers result or list of genes) tissue_name = NULL, # Optional tissue name models = c("claude-sonnet-4-5-20250929", "gpt-5", "gemini-1.5-pro"), # Models to use api_keys, # Named list of API keys top_gene_count = 10, # Number of top genes to use controversy_threshold = 0.7, # Threshold for identifying controversial clusters entropy_threshold = 1.0, # Entropy threshold for controversial clusters max_discussion_rounds = 3, # Maximum discussion rounds consensus_check_model = NULL, # Model to use for consensus checking (see recommendations below) log_dir = "logs", # Directory for logs cache_dir = NULL, # Uses default system cache directory use_cache = TRUE # Whether to use cache ) ``` **Important Note on `consensus_check_model`**: This parameter is used for evaluating semantic similarity, calculating consensus metrics, and moderating discussions. We recommend using capable models such as: - `claude-sonnet-4-5-20250929` (Anthropic) - `claude-opus-4-1-20250805` (Anthropic) - `o1` (OpenAI) - `gpt-5` (OpenAI) - `gemini-2.5-pro` (Google) See the main README for detailed recommendations and examples. ## Detailed Usage Scenarios ### Scenario 1: Basic Annotation with a Single Model For quick exploration or when API usage is a concern: ```{r} # Load example data library(Seurat) data("pbmc_small") # Find markers pbmc_markers <- FindAllMarkers(pbmc_small, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25) # Run annotation with a single model results <- annotate_cell_types( input = pbmc_markers, tissue_name = "human PBMC", model = "claude-sonnet-4-5-20250929", api_key = Sys.getenv("ANTHROPIC_API_KEY"), top_gene_count = 10 ) # Add annotations to Seurat object pbmc_small$cell_type_claude <- plyr::mapvalues( x = as.character(Idents(pbmc_small)), from = names(results), to = results ) # Visualize DimPlot(pbmc_small, group.by = "cell_type_claude", label = TRUE) ``` ### Scenario 2: Multi-Model Consensus for High Accuracy For publication-quality annotations with uncertainty quantification: ```{r} # Define multiple models to use models <- c( "claude-sonnet-4-5-20250929", # Anthropic "gpt-5", # OpenAI "gemini-1.5-pro", # Google "grok-3" # X.AI ) # API keys for different providers api_keys <- list( anthropic = Sys.getenv("ANTHROPIC_API_KEY"), openai = Sys.getenv("OPENAI_API_KEY"), gemini = Sys.getenv("GEMINI_API_KEY"), grok = Sys.getenv("GROK_API_KEY") ) # Run annotation with multiple models results <- list() for (model in models) { provider <- get_provider(model) api_key <- api_keys[[provider]] results[[model]] <- annotate_cell_types( input = pbmc_markers, tissue_name = "human PBMC", model = model, api_key = api_key, top_gene_count = 10 ) } # Create consensus consensus_results <- interactive_consensus_annotation( input = pbmc_markers, tissue_name = "human PBMC", models = models, # Use all the models defined above api_keys = api_keys, controversy_threshold = 0.7, entropy_threshold = 1.0, consensus_check_model = "claude-sonnet-4-5-20250929" ) # View consensus results # You can access the final annotations with consensus_results$final_annotations # Add consensus annotations and metrics to Seurat object pbmc_small$cell_type_consensus <- plyr::mapvalues( x = as.character(Idents(pbmc_small)), from = names(consensus_results$final_annotations), to = consensus_results$final_annotations ) # Extract consensus metrics from the consensus results consensus_metrics <- lapply(names(consensus_results$initial_results$consensus_results), function(cluster_id) { metrics <- consensus_results$initial_results$consensus_results[[cluster_id]] return(list( cluster = cluster_id, consensus_proportion = metrics$consensus_proportion, entropy = metrics$entropy )) }) # Convert to data frame for easier handling metrics_df <- do.call(rbind, lapply(consensus_metrics, data.frame)) # Add consensus proportion to Seurat object pbmc_small$consensus_proportion <- plyr::mapvalues( x = as.character(Idents(pbmc_small)), from = metrics_df$cluster, to = metrics_df$consensus_proportion ) # Add entropy to Seurat object pbmc_small$shannon_entropy <- plyr::mapvalues( x = as.character(Idents(pbmc_small)), from = metrics_df$cluster, to = metrics_df$entropy ) ``` ### Scenario 2b: Using Free OpenRouter Models For users with limited API credits or budget constraints: ```{r} # Set OpenRouter API key openrouter_api_key <- Sys.getenv("OPENROUTER_API_KEY") # Define free OpenRouter models to use free_models <- c( "meta-llama/llama-4-maverick:free", # Meta Llama 4 Maverick (free) "nvidia/llama-3.1-nemotron-ultra-253b-v1:free", # NVIDIA Nemotron Ultra 253B (free) "deepseek/deepseek-r1:free", # DeepSeek R1 (free, advanced reasoning) "meta-llama/llama-3.3-70b-instruct:free" # Meta Llama 3.3 70B (free) ) # Run annotation with free OpenRouter models free_results <- list() for (model in free_models) { free_results[[model]] <- annotate_cell_types( input = pbmc_markers, tissue_name = "human PBMC", model = model, # OpenRouter models are automatically detected by format: 'provider/model-name:free' api_key = openrouter_api_key, top_gene_count = 10 ) } # Create consensus with free models free_consensus_results <- interactive_consensus_annotation( input = pbmc_markers, tissue_name = "human PBMC", models = free_models, # Use all the free models defined above api_keys = list("openrouter" = openrouter_api_key), controversy_threshold = 0.7, entropy_threshold = 1.0, consensus_check_model = "meta-llama/llama-4-maverick:free" # Use a free model for consensus checking ) # View free model consensus results # You can access the final annotations with free_consensus_results$final_annotations # Add free model consensus annotations to Seurat object pbmc_small$free_model_consensus <- plyr::mapvalues( x = as.character(Idents(pbmc_small)), from = names(free_consensus_results$final_annotations), to = free_consensus_results$final_annotations ) # Compare paid vs. free model results comparison <- data.frame( cluster = names(consensus_results$final_annotations), paid_models = consensus_results$final_annotations, free_models = free_consensus_results$final_annotations, agreement = consensus_results$final_annotations == free_consensus_results$final_annotations ) print(comparison) ``` ### Scenario 3: Working with CSV Files For users who prefer working with files: ```{r} # Save markers to CSV write.csv(pbmc_markers, "pbmc_markers.csv", row.names = FALSE) # Run annotation using the CSV file results <- annotate_cell_types( input = "pbmc_markers.csv", tissue_name = "human PBMC", model = "claude-sonnet-4-5-20250929", api_key = Sys.getenv("ANTHROPIC_API_KEY") ) ``` ### Scenario 4: Custom Caching For better control over caching behavior: ```{r} # Note: The annotate_cell_types function does not have built-in caching. # If you need caching, you can implement it separately. # Run annotation results <- annotate_cell_types( input = pbmc_markers, tissue_name = "human PBMC", model = "claude-sonnet-4-5-20250929", api_key = Sys.getenv("ANTHROPIC_API_KEY"), top_gene_count = 10, debug = FALSE ) # If you need custom caching, you can implement it using your own cache manager # This is just a conceptual example and not part of the actual package # cache_manager <- YourCacheManager$new(cache_dir = "path/to/cache") # cache_manager$clear_cache() ``` ## Model Selection Guide mLLMCelltype supports a wide range of LLM models. Here's a guide to help you choose: ### High Performance Models For the most accurate annotations: - **Anthropic Claude Sonnet 4.5** (`claude-sonnet-4-5-20250929`) - **Anthropic Claude Opus 4.1** (`claude-opus-4-1-20250805`) - **OpenAI GPT-5** (`gpt-5`) - **Google Gemini 2.5 Pro** (`gemini-2.5-pro`) ### Balanced Performance/Cost Models For good results with lower API costs: - **X.AI Grok-3** (`grok-3`): Competitive performance at lower cost - **DeepSeek V3** (`deepseek-v3`): Good performance for specialized tissues ### Economy Models For preliminary exploration or large datasets: - **Qwen 2.5** (`qwen-max-2025-01-25`): Good performance for the cost - **Zhipu GLM-4** (`glm-4`): Economical option with decent performance - **MiniMax** (`minimax`): Cost-effective for initial exploration ### Free Models via OpenRouter For users with limited API credits or budget constraints: - **Meta Llama 4 Maverick** (`meta-llama/llama-4-maverick:free`): Most reliable and fast, recommended for consensus checking - **NVIDIA Nemotron Ultra 253B** (`nvidia/llama-3.1-nemotron-ultra-253b-v1:free`): Good performance with consistent formatting - **DeepSeek R1** (`deepseek/deepseek-r1:free`): Advanced reasoning model (free) Based on our testing, we recommend the following free models: - `meta-llama/llama-4-maverick:free`: Most reliable and fast, recommended for consensus checking - `nvidia/llama-3.1-nemotron-ultra-253b-v1:free`: Good performance with consistent formatting - `deepseek/deepseek-r1:free`: Reliable with good response time Some models may have limitations: - `deepseek/deepseek-r1:free`: May occasionally return empty results - `thudm/glm-z1-9b:free`: May return localized error messages ("non-character parameter") when used for consensus checking These free models are accessed through OpenRouter and don't consume credits, but may have limitations compared to paid models. Use the `:free` suffix in the model name to access them. ```{r} # Example of using a free model via OpenRouter # First, set your OpenRouter API key Sys.setenv(OPENROUTER_API_KEY = "your-openrouter-api-key") # Then use a free model with the :free suffix free_model_results <- annotate_cell_types( input = pbmc_markers, tissue_name = "human PBMC", model = "meta-llama/llama-4-maverick:free", # Note the :free suffix api_key = Sys.getenv("OPENROUTER_API_KEY") # No need to specify provider - it's automatically detected from the model name format ) ``` ## Integration with Seurat Workflow Here's a complete example of integrating mLLMCelltype with a Seurat workflow: ```{r} library(Seurat) library(mLLMCelltype) library(ggplot2) # Load data data("pbmc_small") # Standard Seurat preprocessing pbmc_small <- NormalizeData(pbmc_small) pbmc_small <- FindVariableFeatures(pbmc_small) pbmc_small <- ScaleData(pbmc_small) pbmc_small <- RunPCA(pbmc_small) pbmc_small <- FindNeighbors(pbmc_small) pbmc_small <- FindClusters(pbmc_small, resolution = 0.5) pbmc_small <- RunUMAP(pbmc_small, dims = 1:10) # Find markers for each cluster pbmc_markers <- FindAllMarkers(pbmc_small, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25) # Define models to use models <- c( "claude-sonnet-4-5-20250929", "gpt-5", "gemini-1.5-pro" ) # API keys api_keys <- list( anthropic = Sys.getenv("ANTHROPIC_API_KEY"), openai = Sys.getenv("OPENAI_API_KEY"), gemini = Sys.getenv("GEMINI_API_KEY") ) # Run annotation with multiple models results <- list() for (model in models) { provider <- get_provider(model) api_key <- api_keys[[provider]] results[[model]] <- annotate_cell_types( input = pbmc_markers, tissue_name = "human PBMC", model = model, api_key = api_key, top_gene_count = 10 ) # Add individual model results to Seurat object column_name <- paste0("cell_type_", gsub("[^a-zA-Z0-9]", "_", model)) pbmc_small[[column_name]] <- plyr::mapvalues( x = as.character(Idents(pbmc_small)), from = names(results[[model]]), to = results[[model]] ) } # Create consensus consensus_results <- interactive_consensus_annotation( input = pbmc_markers, tissue_name = "human PBMC", models = models, # Use all the models defined above api_keys = api_keys, controversy_threshold = 0.7, entropy_threshold = 1.0, consensus_check_model = "claude-sonnet-4-5-20250929" ) # Add consensus results to Seurat object pbmc_small$cell_type_consensus <- plyr::mapvalues( x = as.character(Idents(pbmc_small)), from = names(consensus_results$final_annotations), to = consensus_results$final_annotations ) # Extract consensus metrics from the consensus results consensus_metrics <- lapply(names(consensus_results$initial_results$consensus_results), function(cluster_id) { metrics <- consensus_results$initial_results$consensus_results[[cluster_id]] return(list( cluster = cluster_id, consensus_proportion = metrics$consensus_proportion, entropy = metrics$entropy )) }) # Convert to data frame for easier handling metrics_df <- do.call(rbind, lapply(consensus_metrics, data.frame)) # Add consensus proportion to Seurat object pbmc_small$consensus_proportion <- as.numeric(plyr::mapvalues( x = as.character(Idents(pbmc_small)), from = metrics_df$cluster, to = metrics_df$consensus_proportion )) # Add entropy to Seurat object pbmc_small$shannon_entropy <- as.numeric(plyr::mapvalues( x = as.character(Idents(pbmc_small)), from = metrics_df$cluster, to = metrics_df$entropy )) # Visualize results p1 <- DimPlot(pbmc_small, group.by = "cell_type_consensus", label = TRUE, repel = TRUE) + ggtitle("Cell Type Annotations") + theme(plot.title = element_text(hjust = 0.5)) p2 <- FeaturePlot(pbmc_small, features = "consensus_proportion", cols = c("yellow", "green", "blue")) + ggtitle("Consensus Proportion") + theme(plot.title = element_text(hjust = 0.5)) p3 <- FeaturePlot(pbmc_small, features = "shannon_entropy", cols = c("red", "orange")) + ggtitle("Shannon Entropy") + theme(plot.title = element_text(hjust = 0.5)) # Combine plots p1 | p2 | p3 ``` ## Advanced Parameter Tuning ### Adjusting top_gene_count The `top_gene_count` parameter controls how many top marker genes per cluster are used for annotation: ```{r} # Using more genes (better for well-characterized tissues) results_more_genes <- annotate_cell_types( input = pbmc_markers, tissue_name = "human PBMC", model = "claude-sonnet-4-5-20250929", api_key = Sys.getenv("ANTHROPIC_API_KEY"), top_gene_count = 20 # Using more genes ) # Using fewer genes (better for noisy data) results_fewer_genes <- annotate_cell_types( input = pbmc_markers, tissue_name = "human PBMC", model = "claude-sonnet-4-5-20250929", api_key = Sys.getenv("ANTHROPIC_API_KEY"), top_gene_count = 5 # Using fewer genes ) ``` ### Adjusting controversy_threshold The `controversy_threshold` parameter in the `interactive_consensus_annotation` function controls which clusters are considered controversial and require discussion: ```{r eval=FALSE} # Example of using interactive_consensus_annotation with different controversy thresholds # Lower threshold (more clusters will be discussed) consensus_results_low_threshold <- interactive_consensus_annotation( input = pbmc_markers, tissue_name = "human PBMC", models = c("claude-sonnet-4-5-20250929", "gpt-5", "gemini-2.0-flash"), api_keys = list( "anthropic" = Sys.getenv("ANTHROPIC_API_KEY"), "openai" = Sys.getenv("OPENAI_API_KEY"), "gemini" = Sys.getenv("GEMINI_API_KEY") ), controversy_threshold = 0.3 # Lower threshold - more clusters will be discussed ) # Higher threshold (fewer clusters will be discussed) consensus_results_high_threshold <- interactive_consensus_annotation( input = pbmc_markers, tissue_name = "human PBMC", models = c("claude-sonnet-4-5-20250929", "gpt-5", "gemini-2.0-flash"), api_keys = list( "anthropic" = Sys.getenv("ANTHROPIC_API_KEY"), "openai" = Sys.getenv("OPENAI_API_KEY"), "gemini" = Sys.getenv("GEMINI_API_KEY") ), controversy_threshold = 0.7 # Higher threshold - fewer clusters will be discussed ) ``` ## Performance Considerations ### API Rate Limits and Costs Different LLM providers have different rate limits and pricing: - **OpenAI**: Higher rate limits but can be more expensive - **Anthropic**: Good balance of rate limits and cost - **Google**: Competitive pricing with good rate limits - **Others**: Generally lower cost but may have stricter rate limits To manage costs and rate limits: 1. Use caching to avoid redundant API calls 2. Start with a single model for exploration 3. Use more economical models for initial testing 4. Reserve multi-model consensus for final analysis ### Execution Time Typical execution times: - Single model annotation: 5-30 seconds per cluster - Multi-model consensus: 1-5 minutes for a typical dataset - Discussion process: Additional 1-3 minutes per controversial cluster To improve performance: 1. Use a smaller `top_gene_count` for faster execution 2. Enable caching to reuse results 3. Use a higher `controversy_threshold` to reduce the number of clusters that require discussion ## Troubleshooting ### Common Issues with OpenRouter If you encounter "No auth credentials found" errors with OpenRouter: - Verify your API key is correct and active - Ensure you're using the correct format for the model name (e.g., `provider/model-name:free`) - Try a different free model, as some models may have access restrictions ### Non-English Model Error Messages If you see "non-character parameter" errors: - This typically occurs when using non-English language models like `thudm/glm-z1-9b:free` for consensus checking - Switch to an English-optimized model like `meta-llama/llama-4-maverick:free` for consensus checking ### Empty Results If a model returns empty results: - Try increasing the timeout or retry with the same model - Some models like `deepseek/deepseek-r1:free` may occasionally return empty results - Switch to a more reliable model if the problem persists ## Next Steps Now that you understand the detailed usage of mLLMCelltype, you can explore: - [Consensus Annotation Principles](https://cafferyang.com/mLLMCelltype/articles/consensus-principles.html): Learn about the technical principles - [Visualization Guide](https://cafferyang.com/mLLMCelltype/articles/visualization-guide.html): Create publication-ready visualizations - [FAQ](https://cafferyang.com/mLLMCelltype/articles/faq.html): Find answers to common questions - [Advanced Features](https://cafferyang.com/mLLMCelltype/articles/advanced-features.html): Explore hierarchical annotation and other advanced features