Remote OpenClaw Blog
Best Llama Models for Hermes Agent — Local and Cloud Setup
Llama 4 Scout is the best local model for Hermes Agent in 2026, running on just 12GB VRAM via Ollama with a 10 million token context window at zero API cost. For users who need stronger reasoning, Llama 4 Maverick (400B total / 17B active parameters) delivers tool calling quality approaching cloud models but requires 24GB VRAM. Hermes Agent auto-detects models installed through Ollama and includes per-model tool call parsers specifically optimized for local inference, making Meta's Llama 4 family the most practical path to a fully self-hosted Hermes Agent deployment with no external API dependency.
Llama Model Comparison for Hermes Agent
Meta's Llama 4 collection uses a Mixture-of-Experts (MoE) architecture with native multimodality, accepting both text and image input. The table below compares Llama models relevant to Hermes Agent deployment as of April 2026, covering both local and cloud pricing.
| Model | Total / Active Params | VRAM (Local) | Cloud API Cost (per 1M tokens) | Context Window | Best For |
|---|---|---|---|---|---|
| Llama 4 Scout | 109B / 17B | 12GB | $0.08 / $0.30 | 10M tokens | Local agent on consumer GPU |
| Llama 4 Maverick | 400B / 17B | 24GB | $0.20 / $0.60 | 1M tokens | High-quality local agent |
Llama 4 Scout is the practical choice for most self-hosted Hermes Agent deployments. At 12GB VRAM with Q4 quantization, it fits on an RTX 4070 or an M2 MacBook Pro. The 10 million token context window is the longest of any locally runnable model — far exceeding what Hermes Agent typically needs for conversation history, skills, and tool definitions combined.
Llama 4 Maverick delivers significantly stronger reasoning and tool calling but requires an RTX 3090/4090 or a high-end Mac Studio with 24GB+ VRAM. For Hermes Agent users who already own that hardware, Maverick offers cloud-competitive quality at zero marginal cost. For a comparison with all supported models, see our best models for Hermes Agent guide.
Local Setup with Ollama
Ollama is the recommended way to run Llama models locally with Hermes Agent. As of 2026, Ollama's Llama 4 library includes both Scout and Maverick with automatic Q4 quantization that compresses models to fit consumer hardware with only minor quality loss.
Step 1: Install Ollama
Download Ollama from ollama.com for macOS or Linux. On Linux:
curl -fsSL https://ollama.com/install.sh | sh
Step 2: Pull a Llama 4 Model
# For Scout (12GB VRAM):
ollama pull llama4:scout
# For Maverick (24GB VRAM):
ollama pull llama4:maverick
Ollama automatically applies Q4_K_M quantization during download, compressing the model to fit available hardware.
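Once the pull finishes, you can confirm the model is registered by querying Ollama's local REST API, which listens on port 11434 by default. The sketch below is illustrative: it hits the `/api/tags` endpoint (the same data `ollama list` shows) and filters for Llama 4 models. The `installed_llama4_models` helper is a hypothetical name, not part of any Ollama or Hermes Agent tooling.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default local port


def installed_llama4_models(tags_response: dict) -> list[str]:
    """Filter an /api/tags response down to Llama 4 model names.

    The response shape is {"models": [{"name": "llama4:scout", ...}, ...]}.
    """
    models = tags_response.get("models", [])
    return [m["name"] for m in models if m.get("name", "").startswith("llama4")]


if __name__ == "__main__":
    with urllib.request.urlopen(f"{OLLAMA_URL}/api/tags") as resp:
        names = installed_llama4_models(json.load(resp))
    print("Llama 4 models found:", names or "none - run `ollama pull llama4:scout`")
```

This is also a quick way to confirm the Ollama server itself is up before launching Hermes Agent: if the request fails, Ollama is not running.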
Step 3: Raise the Context Length
This is the most common mistake when using Ollama with Hermes Agent. Ollama defaults to a 4,096 token context length under 24GB VRAM — far too short for agent use. Hermes Agent needs at least 16K-32K tokens to fit system prompts, tool definitions, skills, and conversation history.
Set the context length via environment variable before starting Ollama:
export OLLAMA_CONTEXT_LENGTH=32768
Or create a Modelfile for persistent configuration:
FROM llama4:scout
PARAMETER num_ctx 32768
Then build it: ollama create llama4-hermes -f Modelfile
Step 4: Verify the Model is Running
ollama list
You should see llama4:scout or llama4:maverick in the output. Hermes Agent will auto-detect it on the next launch. If Hermes Agent is not installed yet, follow the setup guide first.
Hermes Agent Configuration
Hermes Agent auto-detects models installed through Ollama and connects to the local Ollama server on its default port (11434). No API key is needed. Configuration in ~/.hermes/config.yaml is minimal.
config.yaml for Ollama
provider: ollama
model: llama4:scout
For a custom Modelfile with extended context:
provider: ollama
model: llama4-hermes
Alternatively, use the interactive hermes model command and select your Ollama model from the auto-detected list.
Tool Call Parsers
Hermes Agent includes per-model tool call parsers that handle format differences in how each local model generates function calls. For Llama 4 models, the parser uses the llama3_json format. These parsers reduce malformed tool calls and retries, saving compute time on local hardware where every wasted inference is noticeable.
If you are using vLLM instead of Ollama for serving, enable tool calling with the --enable-auto-tool-choice and --tool-call-parser llama3_json flags. See the Hermes Agent providers documentation for all parser options.
Cloud API Options for Llama with Hermes Agent
Running Llama through a cloud API eliminates hardware requirements while keeping per-token costs lower than proprietary models. Multiple providers host Llama 4 models with OpenAI-compatible API endpoints that Hermes Agent can connect to via the custom provider.
| Provider | Scout (per 1M tokens) | Maverick (per 1M tokens) | Notes |
|---|---|---|---|
| Together AI | $0.08 / $0.30 | $0.24 / $0.72 | Serverless, batch discounts |
| Fireworks AI | $0.08 / $0.30 | $0.20 / $0.60 | Custom inference kernels, 50% batch discount |
| Groq | Varies | Varies | Ultra-fast inference on LPU hardware |
| OpenRouter | Varies | Varies | Routes to cheapest available provider |
Hermes Config for Cloud Llama
# Via Together AI
provider: custom
model: meta-llama/Llama-4-Scout-17B-16E-Instruct
base_url: https://api.together.xyz/v1
api_key: ${TOGETHER_API_KEY}
# Via OpenRouter (simplest)
provider: openrouter
model: meta-llama/llama-4-scout
api_key: ${OPENROUTER_API_KEY}
Fireworks AI typically offers the lowest Maverick pricing because of custom inference kernels that squeeze more performance per GPU. Together AI is competitive on Scout pricing. OpenRouter is the simplest option — it routes to the cheapest available provider automatically.
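A rough sketch of that "route to the cheapest provider" logic, using the Maverick rates from the table above: the `cheapest_provider` function and the blending assumption are illustrative only, not how OpenRouter actually routes.

```python
# Rates per 1M tokens (input, output), copied from the table above.
MAVERICK_RATES = {
    "Together AI": (0.24, 0.72),
    "Fireworks AI": (0.20, 0.60),
}


def cheapest_provider(rates: dict, input_share: float = 0.8) -> str:
    """Pick the provider with the lowest blended per-token rate.

    input_share is the assumed fraction of traffic that is input tokens;
    agent workloads are usually input-heavy (large prompts, short replies).
    """
    def blended(r):
        inp, out = r
        return input_share * inp + (1 - input_share) * out

    return min(rates, key=lambda p: blended(rates[p]))


print(cheapest_provider(MAVERICK_RATES))  # Fireworks AI at these rates
```

Note that the ranking can flip as the input/output mix changes, which is one reason published per-1M prices alone do not settle which provider is cheapest for your workload.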
Self-Hosted Agent Economics
Running Llama locally eliminates per-token costs but introduces hardware and electricity costs. The economics depend on how much you use the agent. For a full cost analysis across all deployment options, see our Hermes Agent cost breakdown.
When Local Beats Cloud
At cloud API rates of $0.08/$0.30 per million tokens for Llama 4 Scout, a user processing 10 million tokens per month pays roughly $3.80. The hardware to run Scout locally — an RTX 4070 (around $550) or an M2 MacBook Pro — only pays for itself if monthly usage significantly exceeds that level, or if the primary motivation is data privacy rather than cost.
The economic case for local inference is stronger with Maverick. At $0.20/$0.60 per million tokens, a heavy user processing 50 million tokens per month pays $40 in cloud API costs. An RTX 4090 (around $1,600) amortized over 12 months costs $133/month plus electricity — the break-even point is roughly 200+ million tokens per month.
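The arithmetic behind these figures can be reproduced with a few lines. Note the convention the numbers above use: the cloud bill is computed by billing every token at the combined input + output rate, a deliberately conservative worst case. The function names below are illustrative, and electricity is left as a parameter since it varies by region and duty cycle.

```python
def cloud_cost(tokens_millions: float, input_rate: float, output_rate: float) -> float:
    """Worst-case monthly cloud bill: every token billed at the combined
    input + output rate, matching the conservative convention used above."""
    return tokens_millions * (input_rate + output_rate)


def local_monthly(hardware_usd: float, amortize_months: int = 12,
                  electricity: float = 0.0) -> float:
    """Monthly cost of local inference: hardware amortization plus power."""
    return hardware_usd / amortize_months + electricity


def break_even_tokens(hardware_usd: float, input_rate: float,
                      output_rate: float, amortize_months: int = 12) -> float:
    """Monthly token volume (in millions) at which local hardware pays for
    itself, before electricity."""
    return local_monthly(hardware_usd, amortize_months) / (input_rate + output_rate)


# Maverick on an RTX 4090, matching the numbers in the text:
print(cloud_cost(50, 0.20, 0.60))                   # $40 for 50M tokens/month
print(round(local_monthly(1600), 2))                # ~$133.33/month amortized
print(round(break_even_tokens(1600, 0.20, 0.60)))   # ~167M tokens/month
```

Adding electricity pushes the break-even point past the raw ~167M figure, which is how the text arrives at "roughly 200+ million tokens per month".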
When Cloud Beats Local
- Light usage (under 20M tokens/month): Cloud APIs are cheaper than hardware amortization.
- Need for multiple models: OpenRouter lets you switch between Llama, Claude, and GPT without managing hardware.
- Uptime requirements: Cloud APIs have better availability than a local machine that needs reboots, updates, and cooling.
The Privacy Argument
For many Hermes Agent users, running Llama locally is not about cost — it is about keeping all data on their own hardware. No API calls leave the network, no conversation data reaches external servers, and the agent's persistent memory stays entirely under local control. This is the primary reason to choose Llama over cloud-only models like Claude or GPT, regardless of the cost comparison. For a detailed self-hosting walkthrough, see our Hermes Agent self-hosted guide.
Limitations and Tradeoffs
Running Llama with Hermes Agent has genuine limitations compared to cloud API models.
- Reasoning quality is lower than top cloud models. Llama 4 Scout and Maverick are strong open-weight models, but Claude Sonnet 4.6 and GPT-4.1 produce more reliable results on complex multi-step reasoning tasks. For agent workflows with many chained tool calls, cloud models typically need fewer retries.
- Ollama context length defaults are too low. The 4,096 token default context length is a recurring gotcha that breaks Hermes Agent. You must manually raise it to at least 16K-32K tokens — this is the single most common setup mistake for local Hermes deployments.
- Local inference is slower. Even on an RTX 4090, local Llama inference is slower than cloud API calls to Anthropic or OpenAI. For interactive agent use where response latency matters, cloud APIs provide a better experience.
- Hardware requirements are non-trivial. Scout needs 12GB VRAM; Maverick needs 24GB. Users without a capable GPU or a modern MacBook cannot run these models locally.
- Tool calling is less reliable than with Claude. Despite Hermes Agent's per-model tool call parsers, Llama models produce more malformed tool calls than Claude Sonnet 4.6. Each malformed call wastes local compute and adds latency to the agent response.
- No native multimodal in Ollama yet. While Llama 4 supports image input natively, Ollama's multimodal support for Llama 4 may still be limited depending on your version. Verify with ollama --version before relying on vision features.
Related Guides
- How to Install and Set Up Hermes Agent
- Best AI Models for Hermes Agent in 2026
- Hermes Agent Self-Hosted Guide
- Hermes Agent Skills Guide
FAQ
Can Hermes Agent run Llama 4 locally without an API key?
Yes. Install Ollama, pull a Llama 4 model (ollama pull llama4:scout), set provider: ollama in your Hermes config.yaml, and the agent connects to the local Ollama server automatically. No API key, no account, no external network calls required.
Which Llama 4 model should I use with Hermes Agent?
Llama 4 Scout if you have 12GB VRAM or less — it runs on an RTX 4070 or M2 MacBook Pro and supports a 10 million token context window. Llama 4 Maverick if you have 24GB+ VRAM (RTX 3090/4090 or Mac Studio) and want stronger reasoning and tool calling quality closer to cloud models.
Why does Hermes Agent fail with Ollama's default context length?
Ollama defaults to 4,096 tokens of context under 24GB VRAM. Hermes Agent needs at least 16K-32K tokens to fit its system prompt, tool definitions, loaded skills, and conversation history. Set OLLAMA_CONTEXT_LENGTH=32768 as an environment variable or create a custom Modelfile with the num_ctx parameter raised.
Is it cheaper to run Llama locally or through a cloud API?
Cloud APIs are cheaper for light usage (under 20 million tokens per month). Local inference is cheaper for heavy usage (200+ million tokens per month with Maverick-class hardware). The primary reason to run Llama locally is data privacy — all data stays on your hardware — not cost savings. See our cost breakdown for detailed calculations.
How does this compare to using Llama with OpenClaw?
This guide covers Llama configuration specifically for Hermes Agent's Ollama provider, tool call parsers, and self-hosted economics. For Llama setup in OpenClaw, see our best Llama models for OpenClaw guide. For a general Llama model review, see best Llama models 2026.