# Don't Blame the Model
Large language models have earned a reputation for unreliability. Tiny input changes can trigger wildly different outputs. Run the same prompt twice, and you might get contradictory answers. But attributing these failures to the models themselves misses the real problem.
The issue lies in how we deploy and use LLMs, not in their fundamental design. Models like GPT-4 or Claude behave consistently within their training distribution. The problems emerge when organizations treat them as deterministic systems or ignore the context in which they operate.
Several factors drive perceived unreliability. Temperature settings control randomness in generation: lower temperatures produce more predictable outputs, while higher temperatures increase variation. Most deployments never tune this parameter for their specific use case.

Prompt engineering matters just as much. Vague instructions produce scattered results; precise, well-structured prompts yield far more consistent performance.
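The effect of temperature can be sketched with a toy softmax sampler over made-up next-token scores (the logits and token count here are illustrative, not from any real model):

```python
import math
import random

def sample_with_temperature(logits, temperature, rng):
    """Scale logits by 1/temperature, softmax them, then sample one index."""
    if temperature == 0:
        # Greedy decoding: always pick the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5]  # toy next-token scores
rng = random.Random(0)

# Temperature 0 is deterministic: same input, same output, every run.
greedy = {sample_with_temperature(logits, 0, rng) for _ in range(100)}

# Higher temperature flattens the distribution, so repeated runs diverge.
sampled = {sample_with_temperature(logits, 1.5, rng) for _ in range(100)}
```

At temperature 0 the set `greedy` collapses to a single token, while `sampled` contains several: the variation users notice is the sampler working as designed.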
The systemic issue is expectations management. Users expect LLMs to behave like traditional software, where identical inputs guarantee identical outputs. LLMs don't work that way. They generate probabilistic outputs based on learned patterns. This is a feature, not a bug, when harnessed properly.
Context windows also play a role. Longer contexts introduce more variables. Retrieval-augmented generation (RAG) systems add external data that can vary between runs. Without careful architecture, this compounds unpredictability.
Organizations deploying LLMs successfully understand these dynamics. They establish guardrails. They test extensively. They set appropriate expectations with end users. They use deterministic components where consistency matters most.
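One common guardrail pattern is validating model output before trusting it, with a deterministic fallback when validation fails. A minimal sketch, assuming the model was asked to reply in JSON with an `"answer"` field (the field name and fallback are illustrative):

```python
import json

def parse_or_fallback(raw_reply, fallback):
    """Accept the model's reply only if it parses as a JSON object with
    the required field; otherwise return a deterministic fallback."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(data, dict) or "answer" not in data:
        return fallback
    return data

good = parse_or_fallback('{"answer": "42"}', {"answer": "unknown"})
bad = parse_or_fallback('Sure! The answer is 42.', {"answer": "unknown"})
```

Well-formed replies pass through untouched; free-text replies that ignore the format hit the fallback instead of propagating downstream.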
The path forward requires shifting blame from the models to deployment decisions. LLMs are tools with known properties. Using them effectively means respecting those properties rather than fighting them. The reliability problem is solvable through better engineering and clearer understanding of what these systems actually do.
