model_metadata(model): returns all GGUF key-value metadata as a named character vector. Useful for inspecting model architecture, quantization type, and the embedded chat template. Example: model_metadata(model)["tokenizer.chat_template"].
Fixed apply_chat_template() failing for Gemma 4 models: Gemma 4 uses a <|turn> / <turn|> chat format that llama_chat_apply_template() does not recognize (it returns -1). The fallback now calls common_chat_templates_apply() from common/chat.cpp, which executes the Jinja2 template embedded in the model's GGUF directly. This works for any model with a valid Jinja2 template, regardless of the C API whitelist. enable_thinking defaults to true, so Gemma 4 generates thinking content naturally without a pre-closed thought block. Tool calls and multimodal content are not handled.
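Because the chat template fallback relies on a Jinja2 template being embedded in the GGUF, it can help to check the metadata vector for one before calling apply_chat_template(). The following is a minimal sketch, not part of localLLM: has_chat_template() is a hypothetical helper, and the meta vector is mock data shaped like the return value of model_metadata().

```r
# Hypothetical helper (not part of localLLM): TRUE if a named character
# vector of metadata contains a non-empty value under the template key.
has_chat_template <- function(meta, key = "tokenizer.chat_template") {
  key %in% names(meta) && !is.na(meta[[key]]) && nzchar(meta[[key]])
}

# Mock metadata for illustration only; in a real session this would be
# meta <- model_metadata(model).
meta <- c(
  "general.architecture"    = "llama",
  "tokenizer.chat_template" = "{% for m in messages %}{{ m.content }}{% endfor %}"
)

with_template    <- has_chat_template(meta)
without_template <- has_chat_template(meta["general.architecture"])
```

Here with_template is TRUE and without_template is FALSE, since the second vector has no template key at all.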
Fixed stop token leaking in generate() and generate_parallel() for ChatML models (OLMo, Llama 3): two separate issues were fixed.
<|im_end|> is emitted as 6 separate pieces, with the last piece being >\n (merging > and the newline). The previous exact-suffix check failed because the response ended with <|im_end|>\n instead of exactly <|im_end|>. Changed to a windowed find() that searches within the last stop.size() + 4 bytes and truncates at the match position. Applied to both generate() and generate_parallel().
<|start_header_id|> loop: Llama 3.2 3B sometimes omits <|eot_id|> and jumps directly to <|start_header_id|> to begin a new turn, causing infinite repetition. Added <|start_header_id|> to text_stop_strings in both functions.
generate()'s stop list was also expanded from {"<turn|>", "<end_of_turn>"} to {"<turn|>", "<end_of_turn>", "<|eot_id|>", "<|im_end|>", "<|start_header_id|>"} to match generate_parallel().
Fixed verbosity not forwarded in quick_llama(): the verbosity parameter was accepted but silently dropped when passed through to .generate_single() and .generate_multiple(), so the backend logging level had no effect during quick_llama() calls. It is now correctly forwarded to generate() and generate_parallel().
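The windowed stop-string search described above is implemented in the C++ backend. As a rough illustration of the idea only, here is an R sketch. truncate_at_stop() is a hypothetical helper, and it counts characters rather than bytes, which is equivalent only for ASCII text.

```r
# Hypothetical R re-statement of the windowed search (the real code is C++).
# For each stop string, look only at the trailing nchar(stop) + slack
# characters of the text and cut at the earliest match found there.
truncate_at_stop <- function(text, stops, slack = 4L) {
  n <- nchar(text)
  keep <- n                                   # characters to keep
  for (s in stops) {
    window <- nchar(s) + slack
    start <- max(1L, n - window + 1L)         # first index of the window
    pos <- as.integer(regexpr(s, substring(text, start, n), fixed = TRUE))
    if (pos > 0L) keep <- min(keep, start + pos - 2L)
  }
  substr(text, 1L, keep)
}

stops <- c("<|im_end|>", "<|eot_id|>", "<|start_header_id|>")

# A trailing newline after the stop token no longer defeats the check.
trimmed   <- truncate_at_stop("Hello<|im_end|>\n", stops)
# Text without a stop string near the end is returned unchanged.
untouched <- truncate_at_stop("Hello world", stops)
```

An exact-suffix test would miss the first case because the text ends in "\n". The slack of 4 mirrors the stop.size() + 4 window mentioned in the entry.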
Fixed backend errors crashing R instead of being catchable — All Rcpp::stop() calls in src/interface.cpp replaced with Rf_error(). stop() throws a C++ exception which crosses the C boundary (.Call() registration) and triggers std::terminate(), killing the R process. Rf_error() uses longjmp which R's condition system can intercept, so tryCatch() now works correctly for all backend errors including the OOM guard.
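With backend errors now raised as ordinary R conditions, they can be handled with standard tryCatch(). A minimal sketch follows, assuming nothing about localLLM beyond that: try_backend() is a hypothetical wrapper, and stop() is used as a stand-in for a failing backend call.

```r
# Hypothetical wrapper (not part of localLLM): evaluate an expression and
# turn any R error into a message plus a NULL result.
try_backend <- function(expr) {
  tryCatch(
    expr,
    error = function(e) {
      message("Backend call failed: ", conditionMessage(e))
      NULL
    }
  )
}

# Stand-in for a failing backend call. With localLLM installed, the argument
# could instead be a call such as model_load("model.gguf").
failed <- try_backend(stop("simulated backend error"))
ok     <- try_backend(1 + 1)
```

The R session continues after the failure, with failed set to NULL and ok set to 2.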
Fixed model-loading progress dots leaking to stderr with verbosity = 0 — llama_model_load_from_file() has its own progress_callback that prints dots to stderr independently of the log callback system. Now set to a no-op when verbosity < 2 in localllm_model_load_safe(). Model loading is fully silent at the default generation verbosity.
generate_parallel(progress) now defaults to interactive(): it previously defaulted to TRUE, which printed carriage-return-based progress bars to log files and R CMD check output. The new default shows the progress bar only in interactive R sessions and suppresses it in scripts and automated checks.
quick_llama(progress) now defaults to interactive(): same rationale as above; no effect on single-prompt calls.
Removed the quick_llama(stream) parameter: the stream argument was present in the function signature but was never passed to any downstream function (it was placeholder code with a comment "available for future use"). Removed to avoid user confusion.
New localllm_set_verbosity() C API: added to the backend binary and wired through the proxy layer (proxy.h/cpp, interface.cpp, init.cpp). Enables per-call verbosity control at the C level (integer 0–3, negative = fully silent). Called automatically by generate(), generate_parallel(), model_load(), and context_create() before each C invocation.
C-layer OOM crash guard in localllm_model_load_safe() — added a last-resort memory check that fires even when check_memory = FALSE. If the model file is larger than total physical RAM, the function now returns a clean error ("Model file (X.X GB) exceeds total physical RAM (Y.Y GB)...") instead of proceeding to llama_model_load_from_file() and letting macOS OOM-kill the R process silently. The guard only blocks provably-impossible loads (file size > total RAM) and does not interfere with the existing R-layer check. Supported on macOS (sysctl hw.memsize), Linux (/proc/meminfo MemTotal), and Windows (GlobalMemoryStatusEx).
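The guard's rule (block only when file size exceeds total RAM) can be restated in a few lines of R. This is an illustrative sketch, not the C implementation: check_model_fits() is a hypothetical function, the sizes are made up, and the 1024^3 divisor used to print GB is an assumption.

```r
# Hypothetical R re-statement of the guard's rule; the real check lives in C.
# Both sizes are in bytes.
check_model_fits <- function(file_bytes, total_ram_bytes) {
  if (file_bytes > total_ram_bytes) {
    stop(sprintf(
      "Model file (%.1f GB) exceeds total physical RAM (%.1f GB)",
      file_bytes / 1024^3, total_ram_bytes / 1024^3
    ), call. = FALSE)
  }
  invisible(TRUE)
}

gib <- 1024^3

# Made-up sizes: a 20 GiB file on a 16 GiB machine is blocked with an error...
msg <- tryCatch(check_model_fits(20 * gib, 16 * gib),
                error = function(e) conditionMessage(e))

# ...while any file no larger than total RAM passes this last-resort check.
fits <- check_model_fits(8 * gib, 16 * gib)
```

In a real session the first argument could come from file.size(path). Free memory is deliberately not considered here, matching the entry's statement that only provably impossible loads are blocked.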
model_load() messages not suppressed by verbosity = 0: two R-level message() calls in api.R ("Using cached model: ..." and the GPU/unified-memory info line) print unconditionally regardless of verbosity. The verbosity parameter controls only the C backend log level; these R-layer informational messages are a separate code path not yet gated on verbosity. Confirmed against Gemma 4 26B-A4B (IQ2_XXS) on 2026-04-12.
The verbosity roxygen entries for generate() and generate_parallel() now explain why the default is 0L (these functions are called in loops, so per-call logs would be noisy) and cross-reference model_load()/context_create() (default 1L; these run once per session, so warnings should be visible).
Fixed generate_parallel() performance regression introduced by llama.cpp b7825's new memory API:
The llama_memory_seq_cp() call was dropped during the b7825 migration, causing every parallel slot to re-decode the full prompt instead of sharing the prefix.
Prefix sharing uses the full-sequence range (p0=-1, p1=-1), which is compatible with the new API.
generate() and generate_parallel() now work on Intel Macs (x86_64). GPU acceleration is not available on Intel Macs, so CPU inference is used.
Fixed hardware_profile() crash on Linux and Windows when GPU diagnostic tools (nvidia-smi, rocm-smi, clinfo) are not installed.
Added vendor/cpp-httplib dependency (required by the updated common/ library).
cmake/license.cmake
No changes to R-level API - All existing R code continues to work without modification.
tempdir() is used during automated checks so that R CMD check no longer creates ~/.cache/R/localLLM in the home directory (CRAN policy violation).
hardware_profile() example now uses \donttest{} instead of an if (interactive()) guard, per CRAN best practices.
Breaking changes in backend (transparent to R users):
Migrated from the llama_kv_self_* API to the llama_memory_* API
Key improvements:
llama_batch_get_one() replaced with llama_batch_init() + common_batch_add() + llama_batch_free()
Each generate() call starts from a clean state
n_threads_batch parameter for batch processing
No changes to R-level API - All existing R code continues to work without modification:
library(localLLM)
backend_init()
model <- model_load("model.gguf")
ctx <- context_create(model, n_ctx = 512)
result <- generate(ctx, "Hello", max_tokens = 10)
# All existing code works exactly the same
backend/llama.cpp/build_localllm.sh
Updated files:
custom_files/localllm_capi.cpp (10 locations modified)
Unchanged:
custom_files/localllm_capi.h (C API interface)
R/*.R
src/proxy.cpp
tests/testthat/*.R
install.packages("localLLM_1.2.0.tar.gz", repos = NULL, type = "source")
library(localLLM)
install_localLLM() # Will download the new b7825 backend
remove.packages("localLLM")
install.packages("localLLM_1.2.0.tar.gz", repos = NULL, type = "source")
library(localLLM)
install_localLLM(force = TRUE) # Force reinstall backend
New technical documentation:
UPGRADE_COMPLETE.md - Complete upgrade report
CRITICAL_CHANGES_REQUIRED.md - Detailed change checklist
MIGRATION_ANALYSIS_b5421_to_b7785.md - Full migration analysis
Potential optimizations for future releases:
For more information about llama.cpp, see: