Local LLMs hallucinate tool names and generate malformed JSON in function calling

When you use a local model (Llama 3, Mistral, Qwen) via Ollama or llama.cpp with function calling, the model hallucinates tool names that don't exist, generates malformed JSON arguments (missing quotes, trailing commas, wrong types), and ignores the tool schema you provided. This happens 20-40% of the time even with the best open-source models.

So what? Function calling is the foundation of every agentic workflow — without reliable tool use, a local LLM cannot be an agent at all. It can chat, but it cannot act. This means anyone who wants to run agents locally for privacy, cost, or latency reasons is stuck: the models that can do reliable function calling (Claude, GPT-4) are cloud-only, and the models you can run locally cannot reliably call a single tool.

Why does this persist in the first place? Function calling was bolted onto open-source models after the fact via fine-tuning on synthetic tool-call datasets. The training data is small, the JSON grammar is not enforced at the decoding level (most inference engines just sample tokens and hope they form valid JSON), and there is no standardized tool-call format across model families — Llama uses one format, Mistral another, Qwen another.
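Since the JSON is not guaranteed valid, a practical workaround is to validate every tool call before executing it and attempt a lenient repair pass first. Below is a minimal sketch; the tool names (`get_weather`, `search_web`), the registry structure, and the repair heuristics are all illustrative assumptions, not any library's API:

```python
import json
import re

# Hypothetical tool registry: tool name -> required argument keys.
TOOLS = {
    "get_weather": {"city"},
    "search_web": {"query"},
}

def repair_json(text: str) -> str:
    """Best-effort fixes for common local-model JSON mistakes."""
    # Drop trailing commas before a closing brace or bracket.
    text = re.sub(r",\s*([}\]])", r"\1", text)
    # Naively convert single-quoted strings to double-quoted ones.
    text = re.sub(r"'([^']*)'", r'"\1"', text)
    return text

def validate_tool_call(raw: str):
    """Return (name, args) if the call is usable, else raise ValueError."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        # Second chance: parse again after the repair pass.
        call = json.loads(repair_json(raw))
    name = call.get("name")
    if name not in TOOLS:
        raise ValueError(f"hallucinated tool: {name!r}")
    args = call.get("arguments", {})
    missing = TOOLS[name] - set(args)
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    return name, args
```

This catches both failure modes described above: a hallucinated tool name raises instead of silently dispatching to nothing, and a trailing comma gets repaired instead of crashing the agent loop. The single-quote regex is deliberately naive and will mangle strings containing apostrophes, so treat it as a sketch, not production code.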

Evidence

- Ollama function calling docs acknowledge limited model support.
- llama.cpp grammar-constrained sampling helps but slows inference 2-3x.
- Gorilla LLM benchmark shows open-source models at 40-60% tool-call accuracy vs 90%+ for GPT-4.
- GitHub issues on Ollama function calling failures are common: https://github.com/ollama/ollama/issues
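Short of grammar-constrained decoding (and its 2-3x slowdown), the cheapest mitigation is a validate-and-retry loop: parse the model's output, and on failure feed the parse error back into the prompt so the model can self-correct. A minimal sketch, with a stub `generate` function standing in for the actual local-model call (the stub's canned replies are illustrative, not real model output):

```python
import json

def generate(prompt: str, attempt: int) -> str:
    """Stub standing in for a local model call (e.g. via Ollama's HTTP API).
    Simulates a truncated first reply and a corrected retry."""
    if attempt == 0:
        return '{"name": "get_weather", "arguments": {"city": "Oslo"'
    return '{"name": "get_weather", "arguments": {"city": "Oslo"}}'

def call_tool_with_retry(prompt: str, max_attempts: int = 3):
    """Ask the model for a tool call; on bad output, re-prompt with the error."""
    for attempt in range(max_attempts):
        raw = generate(prompt, attempt)
        try:
            call = json.loads(raw)
            return call["name"], call["arguments"]
        except (json.JSONDecodeError, KeyError) as err:
            # Feed the failure back so the next attempt can self-correct.
            prompt += f"\nYour last output was invalid ({err}). Emit valid JSON only."
    raise RuntimeError("model never produced a valid tool call")
```

In practice this recovers a fair share of the malformed-JSON failures at the cost of extra round trips, but it cannot fix hallucinated tool names by itself — you still need schema validation on top, and grammar constraints remain the only way to prevent invalid output rather than retry after it.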
