SHIP · Jun 16, 2026

I tested 7 local LLMs on real agent work. Two survived.

Dense Gemma 4 and Qwen 3.5 passed. Every MoE variant failed. Ollama beats LM Studio on the same weights. BFCL V4 data confirms the pattern: size doesn't rescue you.

Agent-ready: drop this post into Claude Code or Codex

TL;DR: I tested 7 local LLMs on multi-step agentic work with MCP tools. Dense models beat every MoE variant. Gemma 4 dense (8B) and Qwen 3.5 dense passed. Ollama beats LM Studio on the same weights. BFCL V4 data confirms: API models score 85-95% on tool calling, 14B local scores 50-70%, 7-9B scores 30-50%. Size doesn’t rescue you. 120B failed where 8B passed.

Key takeaways:

  • Dense models (8B-27B) beat MoE and elastic variants on every agent tool-calling test.
  • BFCL V4: API models 85-95%, 14B local 50-70%, 7-9B local 30-50%. The gap is real.
  • 120B failed the same suite as 20B. Size doesn’t rescue a bad tool-call translation layer.
  • Ollama beats LM Studio on identical weights. The server matters as much as the model.
  • Three failure modes: tool-call loops, hallucinated success, weak-model edit-as-replace.

I set aside a weekend to test local LLMs on real agent work. I thought it would be a config change.

It was three weeks.

Small models are winning chat benchmarks. Qwen 3.5, Gemma 4, Llama 3.3. The quality gap with frontier models has collapsed for code edits, structured extraction, commit messages, short-form reasoning. An 8B dense model on a consumer laptop handles the majority of what a modern coding assistant asks of it (SitePoint: Best Local LLM Models 2026).

But chat is not agentic. Agentic is multi-turn tool calling: the model reads state, calls a tool, inspects the result, decides whether to continue or stop. That last step. recognizing “I’m done”. is the one that corrupts files on disk when it breaks. A chat model that gets a fact wrong is irritating. An agent that doesn’t know when to stop calling update_record flattens your document and walks off grinning.

What does a kitchen have to do with LLM testing?

Prep work is single-shot. Dice the onions. Mince the garlic. One task, one output. Any competent cook handles it. In agent terms: rewrite this paragraph, generate a commit message, clean up this YAML. Any 8B+ instruction-tuned model handles these fine.

Service is multi-turn. The head chef reads the ticket, fires the risotto, checks the pass, coordinates with the grill station, decides when the plate is ready. In agent terms: the model reads the codebase, proposes a structure, writes findings to disk through typed tool calls, and recognizes when the job is done.

Almost every local LLM benchmark measures prep work. I wanted to test service.

How did you test the models?

I picked a deliberately representative workload. A multi-step loop with real MCP tools: the agent walks a codebase, proposes a hierarchical structure, and writes findings to disk through tool calls like update_record and create_section. The same harness I use with Claude Sonnet, GPT-5, and Gemini every day. All three handle it without drama (Thomas Landgraf: I Gave Seven Local LLMs a Real Job).

Swap the model. Hold everything else constant. See what happens.

What failure modes emerged in weak models?

The tool-call loop. Same tool, same target, same arguments. Sixteen times in a row until the agent ran out of turns. The file on disk was garbage. the description jammed into the title field, the body still the template placeholder. Google’s own Gemma 4 documentation acknowledges Gemma can emit multiple tool calls per turn with no built-in loop termination. MoE variants are particularly prone.

You can paper over this with application-layer loop detection. Track the last N (tool_name, args_hash) tuples, interrupt after K repeats. But the underlying model will not stop on its own.

Hallucinated success. The trace looked clean. One tool call, clean arguments, a final answer narrating what it did. The file on disk was unchanged. Either the arguments were malformed in a way the MCP server silently ignored, or the model narrated its plan as a completion. It passes the “did any tools get called?” check but the work never happened. Your test suite goes green. Your user doesn’t know.

Weak-model edit-as-replace. I ran /add Examples on a five-section document and got back three sentences. The naive replace-on-disk logic clobbered the entire document. With Claude or GPT-5, the model echoes the full document with the transformation. The weak model did exactly what the prompt said. It failed to infer the invariant.

I fixed it at the prompt layer. a DOCUMENT COMPLETENESS RULE in all-caps. Prompt engineering fixing a data-loss bug. With frontier models the prompt is a hint. With local models the prompt is a contract.

Which models passed and which failed?

ModelTypeSizeHeavy workflowLight workflow
Gemma 4 denseDense8BPassPass
Qwen 3.5 denseDense9BPassPass
Llama 3.3Dense8BPassPass
Gemma 4 MoEMoE26B (4B active)Fail (tool loop)Pass
gpt-oss-20bMoE20BFail (hallucinated success)Pass
gpt-oss-120bMoE120BFail (hallucinated success)Pass
Gemma 4 elasticMatFormerVariableFail (edit-as-replace)Pass

Three patterns:

Dense beats MoE. Every MoE variant failed the heavy workflow. Every dense variant passed. The intuition: MoE routing fragments the reasoning needed to recognize “I’m done.” The jdhodges 2026 tool-calling benchmark of thirteen local models reports the same pattern. A separate Reddit benchmark testing 11 small LLMs (0.5B to 3.8B) on tool-calling judgment. run entirely on CPU via Ollama. found the same dense-over-MoE pattern at every size point (Berkeley Function Calling Leaderboard V4).

Size doesn’t rescue you. gpt-oss-120b failed in the same category as its 20B sibling. You cannot out-parameter a chat-template or tool-call-format mismatch. The 8B dense Gemma 4 that runs on a laptop passes the suite that the 120B model on an H100 cluster fails.

Ollama beats LM Studio on the same weights. Same weights, opposite outcomes. The difference is the tool-call translation layer. the code between the model’s native output format and the OpenAI-compatible API. Ollama maps this faithfully. LM Studio’s translation loses fidelity for agentic workflows specifically. Chat and single-shot prompts are unaffected.

What do the BFCL V4 benchmarks say?

The Berkeley Function Calling Leaderboard V4, updated April 2026, tests holistic agentic evaluation. not single-turn function calling. It measures multi-step planning, tool selection accuracy, parameter extraction, error recovery, and context maintenance across sequential tool calls (BFCL V4 methodology).

The typical accuracy ranges tell the story:

Model classBFCL V4 accuracy
GPT-4 / Claude (API)85-95%
14B local (Qwen, Llama)50-70%
7-9B local30-50%
3B and below10-25%

V4 is the first version to test true agentic capability, not AST metrics (V1) or enterprise functions (V2) or multi-turn interactions (V3). The agentic gap is explicit: local models lose 35-55 percentage points on the tasks that matter for production agent work.

On the context window front, local models trail significantly. Qwen 3.5 9B supports 32K-128K tokens. Llama 4 Scout supports 128K. GPT-4 and Claude support 128K-200K. For multi-turn agent loops that accumulate tool results, every 10x in context costs approximately 25% in accuracy (per Mem0’s BEAM benchmark).

What would you tell someone starting today?

Heavy agentic workflows: Ollama, dense model. Gemma 4 dense (8B) and Qwen 3.5 dense pass real agent tool-calling suites. Avoid MoE unless explicitly tool-tuned (Qwen 3.6-35B-A3B is the first MoE that might break this pattern. testing it this week). Avoid elastic/MatFormer entirely.

Light workloads: almost any 8B+ instruction-tuned model, either server. Pick on speed.

If you have 16GB RAM (MacBook Air): Qwen 3.5 9B fits comfortably. Use Q4_K_M quantization for best quality/size balance. Limit tool calls to 3-5 per task for reliability. Simplify tool definitions. fewer parameters, clearer descriptions. A Reddit benchmark of 11 models running entirely on CPU via Ollama confirmed that even tiny models (3.8B) can handle basic tool selection, but only models above 7B maintain coherence across 5+ tool calls.

LM Studio trap: the default 4096-token context is too small for real agentic flows. You’ll see “Model did not produce a final response” and conclude the model is broken. Bump to 16K+ and reload the model. The setting change does not apply retroactively.

“Listed but not loaded”: LM Studio’s /v1/models endpoint returns models registered in the UI even when they haven’t been loaded into memory. Inference calls to unloaded models 404 instantly. Test with a real one-token inference probe, not a /v1/models check.

Are local LLMs ready for agent work?

The local-LLM story for agentic flows in 2026 is where cloud LLM tool calling was in early 2024. The capability exists. The demos look great. Making it reliable for multi-step workflows requires specific model-harness combinations and significant integration polish.

What surprised me how much of the “local LLM doesn’t work for agentic flows” narrative is a stack-trace mismatch. Dense Gemma 4 on Ollama runs a multi-step tool workflow that I, a few weeks ago, would have told you required Sonnet. You don’t need frontier models for this. You need the right model, the right harness, and a prompt you’ve hardened against weak-model literal-mindedness.

I run LFM 2.5 8B (MoE, 1.5B active) and Qwopus 3.5 9B (Qwen fine-tune) on a 16GB MacBook Air. The hardware bill goes down every month. The model quality goes up every month. The gap is closing.

It hasn’t closed.

FAQ

Can local LLMs handle agent tool-calling workflows? Dense Gemma 4 (8B) and Qwen 3.5 (9B) pass. Every MoE variant failed.

Why do dense models beat MoE for agent work? MoE routing fragments the reasoning needed to recognize when to stop calling tools.

Does model size matter? No. 120B failed where 8B passed. Size doesn’t rescue a bad translation layer.

Ollama vs LM Studio? Ollama wins on the same weights. The tool-call translation layer is the difference.

What are typical local model BFCL scores? 14B: 50-70%, 7-9B: 30-50%, 3B and below: 10-25%. API models score 85-95%.


This article was published on Agentic Up (https://agenticup.dev): practical guides for developers and founders building with AI agents. Reach me at [email protected]

Newsletter

Get the brief on AI agents

Practical posts on shipping agents, automating work, and building in public. No hype, no fluff.

Contact: [email protected]