Get Started with localLLM

localLLM provides an easy-to-use interface for running large language models (LLMs) directly in R. It uses the high-performance llama.cpp library as its backend and lets you generate text and analyze data with LLMs. Everything runs locally on your own machine at no cost, and output is reproducible by default.

Installation

Getting started requires two simple steps: installing the R package and downloading the backend C++ library.

Step 1: Install the R package

# Install from CRAN
install.packages("localLLM")

Step 2: Install the backend library

The install_localLLM() function automatically detects your platform and downloads the appropriate pre-compiled library. On macOS, Metal acceleration is always enabled; on Windows and Linux, the GPU build is selected automatically when a compatible GPU driver is detected:

Platform	GPU backend	Detection method
macOS (Apple Silicon)	Metal	always enabled
macOS (Intel)	Metal	always enabled
Windows (x86-64)	Vulkan	`vulkan-1.dll` present in System32
Linux (x86-64)	Vulkan	Vulkan loader + hardware ICD file present

On Windows and Linux, if no GPU driver is found, the CPU build is installed automatically.

library(localLLM)
install_localLLM()

# Force CPU build even when a GPU is detected
install_localLLM(force_cpu = TRUE)

# Reinstall after adding a GPU driver (re-runs detection)
install_localLLM(force_reinstall = TRUE)
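
To preview which row of the table above likely applies to your machine before installing, the checks can be approximated in base R. This is a rough sketch, not the package's actual detection code: guess_gpu_backend() is a hypothetical helper, and the Linux directories are typical default locations for the Vulkan loader and ICD manifests that may differ on your distribution.

```r
# Hypothetical helper: approximates the detection table with base R only.
# install_localLLM() performs its own detection; this is just a preview.
guess_gpu_backend <- function() {
  sysname <- Sys.info()[["sysname"]]

  if (sysname == "Darwin") {
    return("Metal")  # per the table, always enabled on macOS
  }

  if (sysname == "Windows") {
    dll <- file.path(Sys.getenv("SystemRoot", "C:/Windows"), "System32", "vulkan-1.dll")
    return(if (file.exists(dll)) "Vulkan" else "CPU")
  }

  # Everything else is treated like Linux: look for a Vulkan loader and at
  # least one ICD manifest. The directories below are assumptions, and this
  # check does not distinguish hardware ICDs from software ones.
  lib_dirs <- c("/usr/lib", "/usr/lib64", "/usr/lib/x86_64-linux-gnu", "/usr/local/lib")
  icd_dirs <- c("/usr/share/vulkan/icd.d", "/etc/vulkan/icd.d", "/usr/local/share/vulkan/icd.d")
  has_loader <- any(file.exists(file.path(lib_dirs, "libvulkan.so.1")))
  has_icd <- length(list.files(icd_dirs, pattern = "\\.json$")) > 0
  if (has_loader && has_icd) "Vulkan" else "CPU"
}

guess_gpu_backend()
```

If the result is unexpected, install_localLLM(force_cpu = TRUE) remains available as a fallback.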

Your First LLM Query

The simplest way to get started is with quick_llama():

library(localLLM)

response <- quick_llama("What is the capital of France?")
cat(response)

#> The capital of France is Paris.

quick_llama() is a high-level wrapper designed for convenience. On first run, it automatically downloads and caches the default model (Llama-3.2-3B-Instruct-Q5_K_M.gguf).

Text Classification Example

A common use case is classifying text. Here’s a sentiment analysis example:

response <- quick_llama(
  'Classify the sentiment of the following tweet into one of two
   categories: Positive or Negative.

   Tweet: "This paper is amazing! I really like it."'
)

cat(response)

#> The sentiment of this tweet is Positive.
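
Because the model answers in a full sentence, downstream analysis usually needs the bare label. The following is a minimal base-R sketch, assuming the reply mentions one of the allowed labels; extract_label() is a hypothetical helper, not part of localLLM.

```r
# Hypothetical helper: return the allowed label that appears first in the reply,
# or NA if none of the labels is mentioned.
extract_label <- function(text, labels = c("Positive", "Negative")) {
  # Position of each label in the text (case-insensitive); -1 when absent
  pos <- vapply(labels, function(lbl) {
    as.integer(regexpr(lbl, text, ignore.case = TRUE))
  }, integer(1))

  if (all(pos < 0)) {
    return(NA_character_)
  }
  labels[which.min(ifelse(pos < 0, Inf, pos))]
}

extract_label("The sentiment of this tweet is Positive.")
#> [1] "Positive"
```

Simple matching like this will misread replies such as "not Positive", so asking the model to answer with a single word makes the parsing more robust.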

Processing Multiple Prompts

quick_llama() can handle different types of input:

Single string: Performs a single generation
Vector of strings: Automatically switches to parallel generation mode

# Process multiple prompts at once
prompts <- c(
  "What is 2 + 2?",
  "Name one planet in our solar system.",
  "What color is the sky?"
)

responses <- quick_llama(prompts)
print(responses)

#> [1] "2 + 2 equals 4."
#> [2] "One planet in our solar system is Mars."
#> [3] "The sky is typically blue during the day."
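
For a real dataset, the prompt vector is usually built from a text column rather than typed by hand. The sketch below uses a hypothetical two-row data frame and reuses the classification wording from the earlier example; it assumes that responses come back in the same order as the prompts, as the output above suggests.

```r
# Hypothetical example data; replace with your own text column
tweets <- data.frame(
  id = 1:2,
  text = c(
    "This paper is amazing! I really like it.",
    "The results section was confusing."
  ),
  stringsAsFactors = FALSE
)

# Prompt template with a single %s placeholder for the tweet text
template <- paste(
  "Classify the sentiment of the following tweet into one of two",
  "categories: Positive or Negative.",
  "",
  'Tweet: "%s"',
  sep = "\n"
)

# sprintf() is vectorized, so this yields one prompt per row
prompts <- sprintf(template, tweets$text)

# With localLLM installed and a model available, pass the vector directly:
# tweets$response <- quick_llama(prompts)
cat(prompts[1])
```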

Finding and Using Models

GGUF Format

The localLLM backend only supports models in the GGUF format. You can find thousands of GGUF models on Hugging Face:

Search for “gguf” on Hugging Face
Filter by model family (e.g., “gemma gguf”, “llama gguf”)
Copy the direct URL to the .gguf file
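
Direct download URLs on Hugging Face follow a regular pattern: repository ID, then resolve, a revision (usually main), and the file name. As an illustration, a small hypothetical helper (not part of localLLM) can assemble such a URL:

```r
# Hypothetical helper: assemble a direct-download URL from a repository ID and file name
hf_gguf_url <- function(repo, file, revision = "main") {
  sprintf("https://huggingface.co/%s/resolve/%s/%s", repo, revision, file)
}

url <- hf_gguf_url("unsloth/gemma-3-4b-it-qat-GGUF", "gemma-3-4b-it-qat-Q5_K_M.gguf")
print(url)
#> [1] "https://huggingface.co/unsloth/gemma-3-4b-it-qat-GGUF/resolve/main/gemma-3-4b-it-qat-Q5_K_M.gguf"
```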

Loading Different Models

# From Hugging Face URL
response <- quick_llama(
  "Explain quantum physics simply",
  model_path = "https://huggingface.co/unsloth/gemma-3-4b-it-qat-GGUF/resolve/main/gemma-3-4b-it-qat-Q5_K_M.gguf"
)

# From local file
response <- quick_llama(
  "Explain quantum physics simply",
  model_path = "/path/to/your/model.gguf"
)

# From cache (name fragment)
response <- quick_llama(
  "Explain quantum physics simply",
  model_path = "Llama-3.2"
)
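
When loading from a local file, it can help to validate the path before starting a generation, so that a typo fails fast with a clear message. This is a minimal sketch; check_local_model() is a hypothetical helper, not a localLLM function.

```r
# Hypothetical helper: fail early if a local model path is missing or not a .gguf file
check_local_model <- function(path) {
  if (!file.exists(path)) {
    stop("Model file not found: ", path)
  }
  if (tolower(tools::file_ext(path)) != "gguf") {
    stop("Expected a .gguf file, got: ", basename(path))
  }
  # Return the absolute path invisibly so it can be passed on
  invisible(normalizePath(path))
}

# Usage (requires localLLM and a real model file):
# model <- check_local_model("/path/to/your/model.gguf")
# response <- quick_llama("Explain quantum physics simply", model_path = model)
```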

Managing Cached Models

# List all cached models
cached <- list_cached_models()
print(cached)

#>                                name size_bytes            modified
#> 1 Llama-3.2-3B-Instruct-Q5_K_M.gguf 2322153920 2025-12-05 20:01:18
#> 2     gemma-3-4b-it-qat-Q5_K_M.gguf 2829698176 2025-12-14 19:21:11

# Delete a cached model by name
file.remove(cached$path[cached$name == "Llama-3.2-3B-Instruct-Q5_K_M.gguf"])
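
Model files are large, so it is worth checking how much disk space the cache uses before deleting anything. The sketch below works on a stand-in data frame copied from the listing above so that it runs without localLLM; with the package installed, the result of list_cached_models() could be used in its place.

```r
# Stand-in for list_cached_models(), copied from the listing above
cached <- data.frame(
  name = c("Llama-3.2-3B-Instruct-Q5_K_M.gguf", "gemma-3-4b-it-qat-Q5_K_M.gguf"),
  size_bytes = c(2322153920, 2829698176),
  stringsAsFactors = FALSE
)

# Convert bytes to GiB (1 GiB = 1024^3 bytes) and sort largest first
cached$size_gib <- round(cached$size_bytes / 1024^3, 2)
cached <- cached[order(cached$size_bytes, decreasing = TRUE), ]

print(cached[, c("name", "size_gib")], row.names = FALSE)
cat(sprintf("Total cache size: %.2f GiB\n", sum(cached$size_bytes) / 1024^3))
```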

Customizing Generation

Control the output with various parameters:

response <- quick_llama(
  prompt = "Write a haiku about programming",
  temperature = 0.8,      # Higher = more creative (default: 0)
  max_tokens = 100,       # Maximum response length
  seed = 42,              # For reproducibility
  n_gpu_layers = 999      # Offload as many layers as possible to the GPU, if available
)
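
To confirm that a given combination of settings is repeatable on your machine, run the same prompt twice and compare the results. This is a minimal sketch: is_reproducible() is a hypothetical helper that accepts any generation function, so it can be tried with a stand-in before a model is downloaded.

```r
# Hypothetical helper: call a generation function twice with identical
# arguments and report whether both results match exactly.
is_reproducible <- function(generate, prompt, ...) {
  first <- generate(prompt, ...)
  second <- generate(prompt, ...)
  identical(first, second)
}

# Stand-in generator so the sketch runs without a model
fake_generate <- function(prompt, ...) toupper(prompt)
is_reproducible(fake_generate, "Write a haiku about programming")
#> [1] TRUE

# With localLLM installed and a model available:
# is_reproducible(quick_llama, "Write a haiku about programming",
#                 temperature = 0.8, seed = 42)
```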

Next Steps

Reproducible Output: Learn about deterministic generation and audit trails
Basic Text Generation: Master the lower-level API for full control
Parallel Processing: Efficiently process large datasets
Model Comparison: Compare multiple LLMs systematically