Skip to main content
SolutionsApproachCase StudiesInsightsContact

Open-Weight Models

←Back to Insights

17 February 2026·11 min read

February 2026: Open-weight models reach 80-85% of proprietary performance. At typical enterprise scale (1-10B tokens), they're 40-70% cheaper after break-even. At extreme optimised scales, savings can reach 20-100×.


Executive Summary

Something fundamental shifted in AI economics this year. OpenAI releasing GPT-OSS 120B as an open-weight model wasn't just another product launch—it was an admission that the old pricing model is breaking down.

The numbers tell the story: open-weight models now score within 6-7 percentage points of proprietary leaders on benchmarks like GPQA. The cost differential varies by scale—at typical enterprise volumes (1-5B tokens/month), expect 40-70% savings after break-even. At 10B+ tokens monthly, savings can reach 80-90%. At extreme scales with optimised infrastructure (tens of billions of tokens monthly), the difference can reach 20-100×, though such deployments require significant expertise.

This analysis examines September 2025's benchmark data and deployment costs to help you determine whether your organisation should keep paying API premiums or join the migration to open-weight infrastructure.

Where We Stand Today

The Proprietary Leaders (as of Sept 20, 2025)

Grok-4 currently leads the pack at 88.4% on GPQA, charging $3 per million input tokens and $15 for output. GPT-5 follows at 85.7% GPQA with its $1.25/$10 pricing structure, while Gemini 2.5 Pro and Claude 3.7 Sonnet cluster around the 84-86% range with similar premium pricing. OpenAI's reasoning models, o3 and o4-mini, sit at 83.3% and 81.4% respectively.

These are genuinely impressive systems. GPT-5's 92.5% MMLU score represents real progress. But here's what's interesting: the gap between best and "good enough" has narrowed to single digits.

Open-Weight Performance

OpenAI's GPT-OSS 120B, released under Apache-2.0 licence, achieves 80.1% on GPQA — 90% of their flagship's performance, now available under an open licence. Qwen3-235B-Thinking pushes slightly higher at 81.1%, costing $0.30/$3 per million tokens through hosted providers (prices vary by provider). Even DeepSeek R1 Zero manages 73.3% while being completely free to self-host.

Meta's Llama 4 Scout offers a 10-million-token context window—10× what GPT-5 provides and far beyond what most proprietary models support. When it comes to working with large documents or extensive codebases, context matters more than marginal accuracy improvements.

The cost implications deserve emphasis upfront: At extreme scales (tens of billions of tokens monthly with optimised infrastructure), savings can reach 20-100×. At typical enterprise scales (1-5B tokens/month), 40-70% savings is the realistic range after break-even. At 10B+ tokens monthly, savings reach 80-90%.

The Narrowing Performance Gap

The performance differential deserves scrutiny. On GPQA, Grok-4 tops out at 88.4% while Qwen3-235B-Thinking achieves 81.1%—a 7.3 percentage point gap. For graduate-level physics problems, that might matter. For generating marketing copy or answering customer queries? Not so much.

AIME 2025 results are even more interesting. Top proprietary models report between 92-100% depending on sampling methodology, while GPT-OSS 120B hits ~98% on some boards. We're arguing about margins of error at this point.

Then there's context length. Llama 4 Scout processes 10 million tokens, Gemini 2.5 Pro handles 1 million, and GPT-5 manages 400,000. If you're working with large codebases or document archives, would you rather have 7% better accuracy or 25× more context? (Note: advertised context windows are upper bounds; effective usable context depends on provider, quantisation, and toolchain.)

The Economics Transform at Scale


The Scale Equation - Simple Math for Executives

Small Scale (<100M tokens/month) (assumes ~70/30 input/output)

  • GPT-5 API cost: ~$400/month
  • Self-hosting cost: ~$2,000/month
  • Winner: APIs ✅

Medium Scale (1B tokens/month)

  • GPT-5 API cost: ~$3,875/month
  • Self-hosting cost: ~$2,000/month
  • Winner: Self-hosting after ~27 months ✅

Large Scale (10B tokens/month)

  • GPT-5 API cost: ~$38,750/month
  • Self-hosting cost: ~$4,000/month
  • Winner: Self-hosting immediately ✅✅

Rule of thumb: If you're spending >$5,000/month on AI APIs, evaluate self-hosting.


API Pricing Reality (as of Sept 20, 2025; prices vary by provider and subject to change)

Premium Tier (Proprietary)

  • GPT-5: $1.25 input, $10 output per million tokens
  • Claude 3.7 Sonnet: $3 input, $15 output
  • Gemini 2.5 Pro: $1.25 input, $10 output

Open-Weight (Hosted/Self-Hosted)

  • Qwen3-235B: $0.15-$0.30 input, $0.80-$3.00 output
  • Self-hosted marginal cost: Low single-cent to tens-of-cents per million tokens at high utilisation; materially below API rates

Real Cost Calculation

For an enterprise-scale deployment processing 10 billion tokens monthly (7B input, 3B output - typical 70/30 ratio):

GPT-5 via API:

  • Input: 7,000M tokens × $1.25/million = $8,750
  • Output: 3,000M tokens × $10.00/million = $30,000
  • Monthly total: $38,750
  • Annual: $465,000

Qwen3-235B Self-Hosted:

  • Infrastructure: $50,000 one-time
  • Operating cost: ~$4,000/month (scaled infrastructure)
  • Annual total Year 1: $98,000
  • Annual ongoing: $48,000
  • Savings: 79% Year 1, 90% ongoing

Plain English: Once you're processing billions of tokens monthly (think enterprise customer service, document analysis, or code generation at scale), self-hosting saves hundreds of thousands annually. Below that threshold, APIs remain cost-effective.

The 1B Token Threshold

For moderate-scale deployments (1B tokens/month with 70/30 split):

GPT-5 via API:

  • Input: 700M tokens × $1.25/million = $875
  • Output: 300M tokens × $10.00/million = $3,000
  • Monthly total: $3,875
  • Annual: $46,500

Qwen3-235B Self-Hosted:

  • Infrastructure: $50,000 one-time
  • Operating cost: ~$2,000/month
  • Annual total Year 1: $74,000
  • Annual ongoing: $24,000
  • Break-even: ~Month 27
  • Savings after Year 2: 48%

Key insight: Self-hosting makes economic sense at 1B+ tokens/month, or when data sovereignty is critical.

Important: Even below the break-even point, organisations choose self-hosting for:

  • Complete data privacy and compliance
  • Unlimited customisation and fine-tuning
  • No rate limits or service interruptions
  • Predictable latency and performance
  • Independence from vendor lock-in

What OpenAI's Move Really Means

OpenAI releasing GPT-OSS 120B as open-weight wasn't charity. It was strategy. They've seen the writing on the wall: the era of charging $10 per million output tokens is ending. Better to cannibalise yourself than let someone else do it.

This validates what many suspected. Most business applications don't need cutting-edge performance. They need good-enough performance at sustainable costs. Customer service bots, document summarisation, code completion—these tasks run fine at 80% of peak performance. The remaining 20% matters for research papers and complex reasoning chains, not for answering "Where's my order?"

Consider data sovereignty. Financial services firms processing customer data, healthcare systems handling patient records, government agencies managing citizen information—they can't send this to external APIs, regardless of price. For them, open-weight models aren't about cost. They're about compliance.

FrontierMath Scores in Context

Want to know where AI actually struggles? Epoch AI released FrontierMath, a benchmark designed by Fields Medalists. Every model—GPT-5, Claude 3.7, o3, Gemini 2.5—scores below 2%. These are problems that take mathematics PhDs hours or days to solve.

This is useful perspective. At the true frontier of reasoning, there's no meaningful difference between open and proprietary models because they all fail equally. The performance gaps we obsess over exist in the middle ground—useful tasks where 7% improvement might matter but probably doesn't.

For businesses, this should be liberating. The difference between 81% and 88% on GPQA won't make or break your customer service automation. And despite what API vendors might suggest, neither GPT-5 nor Qwen3 knows who won last week's football match—both models' training data stops around September 2024. The supposed advantage of "always-current" API models is marketing fiction.

The Infrastructure Question

"But what about support?" is the first question every CTO asks. Fair concern. Two years ago, deploying open-weight models meant wrestling with dependencies and debugging CUDA errors at 2 AM. Today, Ollama gives you one-line deployment. vLLM handles production serving. Hugging Face offers managed hosting if you want a middle ground between full self-hosting and API dependence.

The ecosystem has matured dramatically. Unless you specifically need someone external to blame when things break (a legitimate enterprise requirement, especially in regulated industries), the support argument has largely evaporated. Most open-weight models now support OpenAI's API format anyway—switching often means changing one URL and adjusting for model-specific quirks. The hard part isn't technical integration. It's getting procurement to approve the infrastructure budget. A well-scoped ML implementation plan that quantifies the break-even timeline makes that conversation much easier.

Competitive Implications

Organisations with high-volume AI workloads (1B+ tokens/month) still paying proprietary API prices face a harsh truth: competitors using open-weight models achieve:

  • 40-70% cost reduction at scale (break-even at ~750M-1B tokens/month)
  • Complete data control
  • Unlimited customisation potential
  • No rate limits or quotas
  • Predictable, controllable latency

For lower-volume users, the value proposition shifts from pure cost to control, customisation, and compliance.

Every month of delay represents unnecessary expense and lost competitive advantage.

Practical Migration Path

Week 1-2: Benchmark Current Usage

  • Document API costs and volume
  • Identify use cases and performance requirements
  • Calculate potential savings

Week 3-4: Pilot Deployment

  • Deploy Qwen3-235B or GPT OSS 120B
  • Run parallel testing
  • Validate performance metrics

Week 5-8: Production Rollout

  • Migrate non-critical workloads first
  • Implement gradual transition
  • Monitor performance and costs

Ongoing: Optimisation

  • Fine-tune on proprietary data
  • Implement model routing for optimal cost/performance
  • Explore smaller specialised models

The Router Strategy

Leading organisations implement intelligent routing:

  • 80% of queries → Qwen3 or similar (cost: ~$0.20/1M tokens)
  • 15% of queries → Mid-tier models
  • 5% of queries → GPT-5/Claude for complex reasoning

Result: 85% cost reduction while maintaining peak performance where needed.

Illustrative Scenario: Fortune 500 Customer Service

Consider a major retailer or financial services firm running customer support for 10 million monthly interactions. Each interaction averages 500 input tokens (customer query plus context) and generates 200 output tokens. That's 5 billion input tokens and 2 billion output tokens monthly—standard for any company in the Fortune 500.

With GPT-5: $6,250 for input, $20,000 for output. Monthly bill: $26,250. Annual: $315,000.

Self-hosted alternative: Maybe $10,000 monthly for cloud infrastructure (or $200,000 upfront for on-premise hardware), plus half an FTE to manage it. Call it $16,500 all-in monthly, $198,000 annually.

The CFO perspective: That $117,000 annual difference equals two full-time employees in most markets, or three in lower-cost regions. It's not a line item—it's headcount. And that's assuming zero growth. Most Fortune 500 companies see token usage growing 50-100% annually as they expand AI applications.

What's compelling is the scaling curve. Double the volume to 20 million interactions and API costs double to $630,000. Infrastructure costs? Maybe increase 30% to $260,000. The gap widens to $370,000 annually—enough to fund an entire team. This is why every Fortune 500 company we've spoken with is at least piloting self-hosted models, even if they're not ready to migrate production workloads.

What the Benchmarks Miss

September 2025's data reveals something more nuanced than "open-weight models win." They win at scale—typically above 1 billion tokens monthly. Below that, APIs remain the rational choice for most organisations. But scale isn't the only factor.

OpenAI releasing GPT-OSS 120B signals their recognition of market segmentation. Not every use case needs premium performance. Not every company can afford premium pricing. And increasingly, not every organisation is willing to trade data control for convenience.

The 7-point performance gap between Grok-4 and Qwen3 on GPQA? It's real. It also doesn't matter for 95% of business applications. Your customers won't notice the difference between 81% and 88% accuracy when asking about their order status. They will notice if your service is always available, responds quickly, and doesn't randomly hit rate limits during peak hours.

Here's what the next year likely holds: Companies currently spending millions on API calls will migrate to hybrid deployments. They'll keep proprietary APIs for genuinely complex tasks while routing routine queries to self-hosted models. The smart ones are already doing this. They're not talking about it because it's a competitive advantage.

The infrastructure investment—that intimidating $50,000-$200,000 upfront cost—becomes manageable when amortised over multiple use cases and years of operation. The technical complexity becomes tractable when you realise you're not pioneering; thousands of companies have already mapped this territory.

By late 2026, the market will likely have sorted itself into three tiers: small companies using APIs because simplicity matters more than cost, large enterprises self-hosting because they process billions of tokens monthly, and everyone in between running hybrid deployments optimised for their specific needs.

The question isn't whether to switch. It's when the economics make it compelling for your organisation. For many enterprises processing billions of tokens monthly, that moment is no longer hypothetical — it's a current reality. To see what production AI deployment looks like for a mid-market business, Spartan Waterproofing's case study shows the practical side of moving from API dependence to self-hosted intelligence.


Calculate Your Break-Even Point

If you're spending more than $5,000 monthly on AI APIs, we can help you evaluate whether self-hosting makes sense. In 30 minutes, we'll calculate your specific break-even timeline, design an optimal model routing strategy, and provide a realistic migration roadmap.

Most clients discover they can reduce AI infrastructure costs by 40-70% at typical enterprise scales while gaining complete control over their data and models. For extreme high-volume deployments, savings can be even more dramatic.

Schedule your assessment


Note: Cost estimates assume 60-70% GPU utilisation, batch processing, and typical enterprise workloads. The "20-100×" cost reduction applies primarily to multi-billion-token workloads (tens of billions/month) with highly optimised infrastructure. Individual results vary based on workload characteristics and operational efficiency. Pricing current as of publication date and subject to change.


References

  1. LMSYS Chatbot Arena Leaderboard - LMArena (2025)
    Live benchmark rankings for LLM performance including GPQA scores
    https://lmarena.ai/

  2. FrontierMath Benchmark - Epoch AI (2025)
    Mathematics benchmark designed by Fields Medalists
    https://epochai.org/frontiermath

  3. GPT-OSS 120B Model Card - Hugging Face (2025)
    Apache-2.0 licensed open-weight model from OpenAI
    https://huggingface.co/openai/gpt-oss-120b

  4. Qwen3-235B-A22B-Thinking Model Card - Alibaba Qwen Team (2025)
    Open-weight reasoning model specifications and benchmarks
    https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking

  5. Llama 4 Scout Announcement - Meta AI (2025)
    10-million token context multimodal model release
    https://ai.meta.com/blog/llama-4-multimodal-intelligence/

  6. GPT-5 Release - OpenAI (2025)
    Flagship proprietary model announcement and capabilities
    https://openai.com/index/introducing-gpt-5/

  7. Grok-4 Announcement - xAI (2025)
    High-performance proprietary model release
    https://x.ai/news/grok-4/

  8. vLLM: Easy, Fast, and Cheap LLM Serving - vLLM Project (2025)
    Production serving framework for open-weight models
    https://github.com/vllm-project/vllm

  9. Ollama Documentation - Ollama (2025)
    One-line deployment for local LLM inference
    https://ollama.ai/

Frequently Asked Questions

What are open-weight AI models?

Open-weight models are large language models whose trained parameters are publicly released, allowing organisations to download, host, and fine-tune them on their own infrastructure. Unlike proprietary APIs, you control the data, the deployment, and the cost.

How much cheaper are open-weight models than proprietary AI APIs?

At typical enterprise scale of one to ten billion tokens, open-weight models are 40 to 70 percent cheaper after the initial infrastructure investment breaks even. At optimised scale, savings can reach 20 to 100 times lower cost per token.

Can open-weight models match proprietary AI performance?

Current open-weight models reach 80 to 85 percent of proprietary model performance on most enterprise tasks. For many production use cases like classification, extraction, and summarisation, this performance gap is negligible.

What hardware is needed to self-host an open-weight AI model?

Requirements depend on model size. A 7-billion-parameter model runs on a single consumer GPU. Models in the 70 to 120 billion parameter range typically need one to four data-centre GPUs like the NVIDIA H100 with quantisation applied.

What are the risks of migrating from proprietary to open-weight AI?

Key risks include the operational burden of managing inference infrastructure, potential quality gaps on specialised tasks, and the need for in-house expertise to fine-tune and monitor models. A staged migration with parallel evaluation reduces these risks.

See live systemsStart the build

Related Insights

Agentic AI in Production: Where Autonomy Holds Up

Agents fail when they are sold as magic colleagues. They work when scope, tools, approval, and recovery paths are engineered first.

Open Article

LLM Hallucinations: Accuracy Is an Operating Control

Hallucinations become expensive when AI output reaches customers, regulators, or operators without grounding and review.

Open Article
sync

Site footer

Company

Entity
Ryder AI Pty Ltd
ABN
24 681 083 983
Base
Brisbane, Queensland
Data boundary
Australian data boundary

Explore

  • Solutions
  • Approach
  • Case Studies
  • Insights
  • Contact
  • About
  • Brand

Contact

[email protected]0424 384 916

© 2026 Ryder AI Pty Ltd

LinkedIn (opens in a new tab)Privacy Policy

RyderAI is distinct from Ryder System Inc (NYSE: R, US logistics multinational) and unrelated to the actress Winona Ryder. More about the brand →