GPT and Claude failed Bridgewater's finance tests because the right answers were never public

Bridgewater Associates and Thinking Machines Lab, the AI startup founded by former OpenAI CTO Mira Murati, claim to have fine-tuned Alibaba's Qwen3-235B model to outperform GPT-4, Claude, and Gemini on financial analysis tasks. The custom model reportedly achieves 84.7 percent accuracy while costing roughly one-fourteenth as much as competing systems.

The headline reveals the core issue: GPT and Claude performed poorly on Bridgewater's tests because the correct answers were never made public. This suggests Bridgewater used proprietary financial datasets or internal benchmarks that neither OpenAI nor Anthropic could have seen during training. The models cannot solve problems with answers they've never encountered in their training data.

The fine-tuning approach matters here. Rather than building from scratch, Bridgewater and Thinking Machines took an existing open model and adapted it specifically for finance. This strategy makes economic sense for specialized domains where general-purpose AI struggles. Qwen3-235B serves as the foundation, but domain-specific training on Bridgewater's datasets likely accounts for the performance gains.

Cost efficiency deserves attention. If the 84.7 percent accuracy claim holds under scrutiny, a model one-fourteenth the price of GPT-4 would reshape financial AI economics. Banks and trading firms spend enormous sums on AI infrastructure. A cheaper alternative could shift how they deploy machine learning.

The critical caveat: these numbers lack independent verification. Bridgewater and Thinking Machines conducted the testing themselves. No peer review, no third-party validation, no public benchmark exists to confirm these results. Companies routinely report impressive metrics on proprietary tests, then fail to replicate those results in real-world conditions.

The broader takeaway centers on specialization. General

GPT and Claude failed Bridgewater's finance tests because the right answers were never public

Trunk Tools' stack cut document review from 60 days to 10 by ditching general-purpose models

Linear Thinking, Nonlinear Costs

Cloudflare’s new policy pushes AI companies to pay for publishers’ content

Get Daily AIWireDaily