The short version. NVIDIA's Nemotron 3 Super 120B-A12B and OpenAI's GPT-OSS 120B sit at the 120-billion-parameter open-weight tier with opposite architectural bets. Nemotron's hybrid Mamba-MoE-attention design trades a heavier hardware footprint for a million-token context window and up to 2.2× the inference throughput of GPT-OSS on NVIDIA hardware.[3] GPT-OSS's pure Transformer MoE fits inference on a single 80 GB GPU under MXFP4 quantisation and benefits from a broader open-source fine-tuning and quantisation toolchain. For an Australian regulated-vertical buyer the model choice is workload-driven: long-context document and multi-agent work points to Nemotron; single-GPU first deployments and a need for broader open-source tooling point to GPT-OSS. This article works through the architecture, the benchmarks, the hardware footprint, and the deployment economics so the model is matched to the workload rather than the other way around.
Architecture: Hybrid Design vs Pure Transformer
The significant architectural difference is not parameter count but how the computation is structured.
Nemotron 3 Super: Three Architectures in One
Nemotron 3 Super employs what NVIDIA calls a hybrid Latent Mixture-of-Experts (LatentMoE) architecture.[1] This interleaves three distinct layer types:
- Mamba-2 layers handle the majority of sequence processing, providing linear-time complexity with respect to sequence length. This is what enables the million-token context window without the quadratic memory explosion that pure attention suffers from.
- Mixture-of-Experts layers use a novel latent routing mechanism where tokens are projected into a smaller dimension before expert selection, improving accuracy per computational byte.
- Attention layers are used selectively where global context aggregation matters most.
The result is 120.6 billion total parameters with approximately 12.7 billion active per forward pass (12.1 billion excluding embeddings).[1] The model also incorporates Multi-Token Prediction (MTP) heads, which predict multiple future tokens simultaneously to accelerate generation.
GPT-OSS 120B: Proven Transformer Efficiency
GPT-OSS 120B takes a more conventional but highly refined approach: a pure Transformer with mixture-of-experts routing.[2] It uses 36 layers with 128 experts and top-4 routing, activating just 5.1 billion parameters per token out of 117 billion total.
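To make the top-4 routing concrete, here is a minimal sketch of one common top-k routing formulation: pick the k highest router scores, then softmax-normalise over just those. It is a generic illustration, not OpenAI's implementation; the function name `top_k_route` and the synthetic router scores are assumptions for the example.

```python
import math

def top_k_route(router_logits, k=4):
    """Pick the k highest-scoring experts and softmax-normalise their weights."""
    if not 0 < k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest logits, highest first.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only (subtract the max for stability).
    peak = router_logits[top[0]]
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Hypothetical router scores for one token across 128 experts.
logits = [math.sin(i) for i in range(128)]
selected = top_k_route(logits, k=4)
```

Only the four selected experts run for that token, which is why the active parameter count is a small fraction of the total.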
The architecture employs alternating dense and locally banded sparse attention patterns, grouped multi-query attention with a group size of 8, and Rotary Positional Embedding (RoPE) for positional encoding.[2] This supports context lengths of up to 128,000 tokens natively.
OpenAI trained the model using reinforcement learning techniques informed by their most advanced internal systems, including o3 and other frontier models.[2]
What the Architecture Difference Means in Practice
Nemotron's hybrid approach trades deployment simplicity for throughput and context length. The Mamba-2 layers give it linear scaling for long sequences — processing a 500,000-token document does not require quadratically more memory than processing a 50,000-token one. GPT-OSS 120B's pure Transformer approach is more established, better supported by existing tooling, and activates fewer parameters per token (5.1B vs 12.7B), which directly translates to lower per-token compute cost.
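The two comparisons in this paragraph can be worked through as arithmetic. The sketch below assumes idealised scaling laws (cost proportional to token count for state-space layers, and to its square for a naive full-attention score matrix); `growth_factor` is a hypothetical helper, and real kernels and hybrid layer mixes will not follow these curves exactly.

```python
def growth_factor(short_tokens, long_tokens, exponent):
    """How much a cost that scales as tokens**exponent grows between two lengths."""
    return (long_tokens / short_tokens) ** exponent

# Going from a 50,000-token document to a 500,000-token one.
linear = growth_factor(50_000, 500_000, 1)     # idealised state-space scaling
quadratic = growth_factor(50_000, 500_000, 2)  # naive full-attention score matrix

# Active parameters per token, in billions, as quoted in the text.
active_ratio = 12.7 / 5.1
```

Under these assumptions a tenfold longer input costs ten times more with linear scaling and a hundred times more with quadratic scaling, while Nemotron activates about 2.5 times as many parameters per token as GPT-OSS 120B. Active parameter count is only a first-order proxy for per-token compute and does not by itself determine measured throughput.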
Context Window: 1M vs 128K Tokens
This is where the architectural differences produce the most visible operational impact.
Nemotron 3 Super supports up to one million tokens of native context, configurable via deployment parameters.[1] On the RULER benchmark at one million tokens, it scores 91.75%, demonstrating strong retrieval and reasoning at extreme lengths.[3]
GPT-OSS 120B supports 128,000 tokens. That is generous by historical standards, but roughly one-eighth of Nemotron's ceiling.
When this matters. For workloads processing lengthy regulatory documents, codebases with hundreds of files, or multi-turn agent conversations that accumulate substantial history, the million-token window is a functional requirement rather than a luxury. Retrieval pipelines become simpler when the model can hold more context natively, reducing the engineering burden.
When it does not. For standard question-answering, code generation, and structured data extraction tasks where inputs rarely exceed 32,000 tokens, the 128K limit of GPT-OSS 120B is more than sufficient.
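A quick way to apply this test to a workload is to check the prompt size, plus whatever output budget you reserve, against each window. The sketch below is illustrative: the model labels and the `models_that_fit` helper are hypothetical, and it assumes generated tokens count against the same window as the prompt.

```python
# Context windows in tokens, as quoted in the text.
CONTEXT_WINDOWS = {
    "nemotron-3-super": 1_000_000,
    "gpt-oss-120b": 128_000,
}

def models_that_fit(prompt_tokens, reserved_output_tokens=0):
    """Return the models whose window holds the prompt plus reserved output."""
    needed = prompt_tokens + reserved_output_tokens
    return [name for name, window in CONTEXT_WINDOWS.items() if needed <= window]

typical = models_that_fit(32_000)               # both models fit
borderline = models_that_fit(126_000, 4_000)    # only the larger window fits
```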
Quantisation and Hardware Requirements
The GPU requirements for each model reveal starkly different deployment profiles.
Nemotron 3 Super
NVIDIA provides the model in multiple precision formats:
- BF16: Requires 4× H100-80GB minimum for inference (241 GB weights); 8× H100 recommended for high-concurrency production serving
- FP8: Reduces memory requirements significantly while maintaining accuracy
- NVFP4: The native training precision, optimised for NVIDIA Blackwell GPUs, maximising throughput on next-generation hardware
The model was pretrained on 25 trillion tokens using NVFP4, NVIDIA's 4-bit floating-point format.[1] This means it is designed from the ground up to run efficiently on Blackwell architecture (B200, B100). On older Hopper hardware (H100), FP8 is the practical sweet spot.
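The weight figures above follow from parameter count times bits per parameter. The sketch below reproduces that arithmetic for Nemotron's 120.6 billion parameters; `weight_memory_gb` is a hypothetical helper, and it deliberately ignores KV cache, activations, quantisation scale factors, and any layers a mixed-precision checkpoint keeps at higher precision.

```python
def weight_memory_gb(total_params_billions, bits_per_param):
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    return total_params_billions * bits_per_param / 8

NEMOTRON_PARAMS_B = 120.6  # total parameters in billions, from the text

bf16 = weight_memory_gb(NEMOTRON_PARAMS_B, 16)     # about 241 GB
fp8 = weight_memory_gb(NEMOTRON_PARAMS_B, 8)       # about 121 GB
four_bit = weight_memory_gb(NEMOTRON_PARAMS_B, 4)  # about 60 GB
```

The BF16 result matches the 241 GB quoted in the list; the lower-precision figures are nominal lower bounds rather than measured checkpoint sizes.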
GPT-OSS 120B
OpenAI designed GPT-OSS 120B with a clear deployment target: a single 80 GB GPU.[2]
With MXFP4 quantisation applied to the MoE projection weights, the model fits comfortably on one H100 or AMD MI300X. The full model in FP16 occupies approximately 240 GB, because every expert's weights must stay resident in memory even though only 5.1B parameters activate per token. Sparse activation reduces per-token compute and memory traffic, not the weight footprint.
For organisations running consumer-grade GPUs like the RTX 5090 (32 GB GDDR7), multi-card configurations with llama.cpp or vLLM can serve GPT-OSS 120B at practical throughput levels, though not at the speeds achievable on data-centre hardware.
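A rough lower bound on GPU count comes from dividing weight size by per-card memory. The sketch below applies that to GPT-OSS 120B's 117 billion parameters at a nominal 4 and 16 bits per parameter; `min_gpus_for_weights` is a hypothetical helper, and the bound covers weights only, with no KV cache or runtime overhead.

```python
import math

def min_gpus_for_weights(weight_gb, gpu_memory_gb):
    """Lower bound on GPU count: weights only, no KV cache or runtime overhead."""
    return math.ceil(weight_gb / gpu_memory_gb)

gpt_oss_4bit_gb = 117 * 4 / 8    # about 58.5 GB at a nominal 4 bits per parameter
gpt_oss_16bit_gb = 117 * 16 / 8  # about 234 GB at 16 bits per parameter

on_h100_4bit = min_gpus_for_weights(gpt_oss_4bit_gb, 80)    # one 80 GB card
on_5090_4bit = min_gpus_for_weights(gpt_oss_4bit_gb, 32)    # two 32 GB cards
on_h100_16bit = min_gpus_for_weights(gpt_oss_16bit_gb, 80)  # three 80 GB cards
```

Real deployments land above this bound once KV cache and serving headroom are included, and multi-GPU setups typically use GPU counts such as 2, 4, or 8, which is why a 16-bit deployment is usually sized at four H100s rather than three.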
| Deployment Profile | Nemotron 3 Super | GPT-OSS 120B |
|---|---|---|
| Standard enterprise | 2× B200 (384 GB, FP8/NVFP4) or 4× H100 (320 GB, FP8) | 1× H100 (80 GB, MXFP4) or 2× B200 (384 GB, BF16 / high concurrency) |
| Full precision (BF16) | 4× H100-80GB (320 GB) / 8× H100 for high concurrency | 4× H100-80GB (320 GB) |
| Consumer / lab | 2× RTX 5090 (64 GB, Q4/GGUF, reduced context) | 2× RTX 5090 (64 GB, Q4/GGUF) |
| Optimal hardware | Blackwell B200 / B100 | H100 / MI300X |
Quantisation Quality: What You Actually Lose
A common concern with quantised deployment is quality degradation. For Nemotron 3 Super, the numbers are unusually strong because the model was pretrained in NVFP4 from the start — low precision is native, not retrofitted.
NVIDIA's AutoQuantize pipeline assigns each layer to FP4, FP8, or BF16 based on sensitivity analysis. The result: the mixed-precision deployment checkpoint retains 99.8% of BF16 median benchmark accuracy, a relative drop of roughly 0.2%.[7] A naive post-training quantisation pass loses over 1%, but AutoQuantize recovers most of that gap.
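A retention figure is relative to the baseline, so the loss in absolute benchmark points depends on the BF16 score it is applied to. The sketch below shows the conversion; `absolute_drop` is a hypothetical helper and the two baseline scores are invented purely to illustrate the arithmetic, not taken from any benchmark.

```python
def absolute_drop(baseline_score, retention):
    """Percentage points lost when a model keeps `retention` of its baseline score."""
    return baseline_score * (1 - retention)

# Hypothetical BF16 scores, chosen only to illustrate the arithmetic.
drop_at_60 = absolute_drop(60.0, 0.998)  # about 0.12 points
drop_at_90 = absolute_drop(90.0, 0.998)  # about 0.18 points
```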
For context, generic post-training 4-bit quantisation on large models typically incurs materially larger accuracy loss than 8-bit schemes. Nemotron's near-flat median drop under its hardware-co-designed AutoQuantize pipeline is at the near-lossless end of what the field has demonstrated, and lower than what a naive 4-bit pass on a model not designed for low precision would produce.
Community GGUF conversions for llama.cpp use different quantisation recipes and lack NVIDIA's per-layer sensitivity tuning. Expect a small but non-zero additional degradation — typically a few percentage points on harder benchmarks, depending on the specific format and calibration quality.
Benchmark Performance
Both models deliver strong performance across standard evaluation suites, but their strengths diverge.
| Benchmark | Nemotron 3 Super | GPT-OSS 120B | Notes |
|---|---|---|---|
| SWE-Bench Verified | 60.47% | — | Software engineering |
| RULER 1M Context | 91.75% | N/A (128K max) | Long-context retrieval |
| GPQA (with tools) | Comparable | 80.9% | PhD-level science |
| MMLU-Pro | Comparable | 90.0% | Broad knowledge |
| HumanEval+ | ~91-92% | Comparable | Code generation |
| PinchBench (Agentic) | 85.6% | — | Agent orchestration |
Nemotron 3 Super leads on software engineering (SWE-Bench), agentic reasoning (PinchBench), and long-context tasks.[3] GPT-OSS 120B holds advantages on mathematics, broad knowledge (MMLU-Pro at 90.0%), and PhD-level science reasoning (GPQA at 80.9%).[4]
The throughput story is decisive: Nemotron 3 Super achieves up to 2.2× higher inference throughput than GPT-OSS 120B and up to 7.5× higher than Qwen3.5-122B on standard workloads, per NVIDIA's published comparisons.[3][7] In optimised configurations the per-GPU-hour work delivered is materially higher than either open comparator.
Deployment Economics
The total cost of ownership calculation depends on your workload profile.
Scenario 1: High-Volume Agent Orchestration
Multi-agent systems generate up to 15× the token volume of standard chat interactions.[5] At this scale, Nemotron's 2.2× throughput advantage means you need roughly half the GPU capacity to serve the same request volume. The upfront infrastructure cost is higher, but the per-token cost drops rapidly with volume.
Consider a document-processing pipeline handling 10,000 requests per day at an average of 4,000 tokens per request. GPT-OSS 120B on a single H100 can serve this comfortably. But scale to 50,000 requests per day with multi-step agent reasoning — where each request generates 15,000-30,000 tokens across planning, execution, and verification steps — and the throughput ceiling becomes the constraint. At that point, Nemotron's 2.2× throughput advantage[3] means two H100s can deliver what would require four or five with GPT-OSS 120B.
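The scenario's numbers can be turned into average token rates and a GPU-equivalence estimate. This is a back-of-envelope sketch: the helper names are hypothetical, the rates are daily means rather than peaks, and it treats NVIDIA's "up to 2.2×" figure as a constant per-GPU ratio, which real workloads may not achieve.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60

def average_tokens_per_second(requests_per_day, tokens_per_request):
    """Mean token rate over a day; real traffic peaks will be higher."""
    return requests_per_day * tokens_per_request / SECONDS_PER_DAY

baseline = average_tokens_per_second(10_000, 4_000)     # about 463 tokens/s
agent_low = average_tokens_per_second(50_000, 15_000)   # about 8,681 tokens/s
agent_high = average_tokens_per_second(50_000, 30_000)  # about 17,361 tokens/s

def equivalent_gpus(gpus, throughput_ratio):
    """Whole GPUs needed on the slower model to match `gpus` on the faster one."""
    return math.ceil(gpus * throughput_ratio)

gpt_oss_equivalent = equivalent_gpus(2, 2.2)  # 4.4 rounds up to 5
```

The agent workload is roughly 19 to 38 times the baseline token rate, which is where a throughput multiplier starts to change the GPU count.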
Scenario 2: Cost-Constrained First Deployment
For organisations deploying their first self-hosted model, GPT-OSS 120B on a single leased H100 is the lowest-barrier entry point. One GPU, one model, one deployment. The Apache 2.0 licence carries no commercial restrictions. The total infrastructure cost can be as low as a single cloud GPU instance.
This matters more than benchmark differences for an Australian enterprise evaluating local AI for the first time. The operational learning curve — running inference servers, monitoring GPU utilisation, handling model updates, building prompt-engineering workflows — is the real investment. Starting with the simpler deployment lets the operating team build those capabilities before committing to multi-GPU infrastructure.
Scenario 3: Hybrid Fleet
The most sophisticated approach is running both models for different workload types. GPT-OSS 120B handles general reasoning, customer-facing chat, and structured extraction tasks where its broad knowledge excels. Nemotron 3 Super handles long-context document processing, code analysis, and multi-agent orchestration where its throughput and context window provide decisive advantages.
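In a hybrid fleet, the split usually sits in a small routing layer in front of both inference servers. The sketch below shows one possible rule under stated assumptions: the `Request` fields, the model labels, and the `pick_model` function are all hypothetical, and a production router would also weigh latency, cost, and queue depth.

```python
from dataclasses import dataclass

GPT_OSS_WINDOW = 128_000  # tokens, from the text

@dataclass
class Request:
    prompt_tokens: int
    multi_agent: bool = False

def pick_model(request: Request) -> str:
    """Route long-context and multi-agent work to Nemotron, the rest to GPT-OSS."""
    if request.prompt_tokens > GPT_OSS_WINDOW or request.multi_agent:
        return "nemotron-3-super"
    return "gpt-oss-120b"

chat = pick_model(Request(prompt_tokens=4_000))
contract = pick_model(Request(prompt_tokens=300_000))
agents = pick_model(Request(prompt_tokens=4_000, multi_agent=True))
```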
Scenario 4: Document-Heavy Workflows
If your primary use case involves processing documents exceeding 128K tokens (lengthy contracts, regulatory filings, multi-file codebases), GPT-OSS 120B cannot hold the full document in context. Nemotron's million-token window removes the need for chunking and retrieval pipelines for any input that fits within it, and those pipelines carry engineering and accuracy costs of their own.
Decision matrix: which model fits which workload
| Your Priority | Choose | Why |
|---|---|---|
| Single-GPU simplicity | GPT-OSS 120B | Runs on one H100 with MXFP4 |
| Million-token context | Nemotron 3 Super | Native 1M context, 91.75% RULER[3] |
| Maximum throughput | Nemotron 3 Super | Up to 2.2× the throughput of GPT-OSS[3] |
| Agent orchestration | Nemotron 3 Super | PinchBench 85.6%[3], throughput for high-token workloads |
| Broad reasoning + maths | GPT-OSS 120B | MMLU-Pro 90.0%, GPQA 80.9% |
| Minimal licensing friction | GPT-OSS 120B | Apache 2.0 (fully permissive) |
| NVIDIA hardware ecosystem | Nemotron 3 Super | Native NVFP4, Blackwell-optimised |
| Existing llama.cpp stack | GPT-OSS 120B | Better tooling support, proven quant formats |
Fine-tuning and open-source tooling
Post-deployment customisation matters as much as inference performance for enterprise teams that need to adapt models to proprietary data.
GPT-OSS 120B benefits from OpenAI's decision to release under Apache 2.0 with full weight access. Community quantisations (GGUF, AWQ, GPTQ) appeared within days of release, and LoRA fine-tuning recipes are well-documented across Hugging Face, Axolotl, and Unsloth. The pure Transformer architecture means existing fine-tuning tooling works without modification.
Nemotron 3 Super's hybrid architecture (Mamba-2 + MoE + attention) is newer, and the fine-tuning ecosystem is correspondingly thinner. NVIDIA provides NeMo-based training recipes, but community-maintained LoRA adapters and quantisation variants lag behind GPT-OSS. Organisations planning significant fine-tuning should factor in the additional engineering effort required for the hybrid layer types. That said, NVIDIA's NeMo framework is production-grade and well-supported for enterprises already in the NVIDIA ecosystem.
Trade-offs and Limitations
Neither model is universally superior. Nemotron 3 Super is the more ambitious architecture — it combines three distinct computational paradigms and delivers genuinely novel context-length and throughput advantages. The trade is more infrastructure and dependence on NVIDIA's latest hardware.
GPT-OSS 120B is the more pragmatic choice. It achieves strong capability for its activation footprint, runs on hardware most organisations already own or can lease, and benefits from the broadest open-source fine-tuning and quantisation tooling.
For Australian enterprises evaluating self-hosted AI infrastructure, the practical advice is straightforward: start with GPT-OSS 120B to prove the deployment pipeline and workload patterns, then evaluate Nemotron 3 Super when throughput or context requirements outgrow what a single-GPU deployment can deliver. For workflows in APRA-regulated entities specifically, the model choice is downstream of the hosting decision; both options work under an in-the-loop approval architecture.
Frequently Asked Questions
What is the main architectural difference between Nemotron 3 Super and GPT-OSS 120B?
Nemotron 3 Super uses a hybrid LatentMoE architecture combining Mamba-2 state-space layers, mixture-of-experts layers, and attention layers, activating roughly 12.7 billion parameters per token. GPT-OSS 120B is a pure Transformer mixture-of-experts model activating 5.1 billion parameters per token.
Can GPT-OSS 120B run on a single GPU?
Yes. With MXFP4 quantisation of the MoE weights, GPT-OSS 120B can run inference on a single 80 GB GPU such as the NVIDIA H100 or AMD MI300X. This is one of its strongest advantages for cost-constrained deployments.
Which open-weight model is better for processing long documents?
Nemotron 3 Super supports up to one million tokens of context and scores 91.75% on the RULER benchmark at that length. GPT-OSS 120B is limited to 128,000 tokens, making Nemotron the clear choice for large-document workflows.
How much faster is Nemotron 3 Super than GPT-OSS 120B?
NVIDIA reports that Nemotron 3 Super achieves up to 2.2 times higher inference throughput than GPT-OSS 120B on standard workloads. For multi-agent systems that generate high token volumes, this throughput advantage compounds significantly.
What licence does each model use for enterprise deployment?
GPT-OSS 120B is released under the Apache 2.0 licence, which is broadly permissive. Nemotron 3 Super is released under the NVIDIA Open Model Licence Agreement, which is also commercially permissive but includes NVIDIA-specific terms.
Which model performs better on coding and software engineering tasks?
Nemotron 3 Super scores 60.47% on SWE-Bench Verified. GPT-OSS 120B edges ahead on some mathematics and coding benchmarks. The choice depends on whether your workload prioritises software engineering breadth or targeted mathematical reasoning.
What GPU hardware is needed to self-host either model?
GPT-OSS 120B can run on a single 80 GB GPU with MXFP4 quantisation. Nemotron 3 Super in full BF16 precision needs roughly 240 GB of VRAM for weights alone, which means at least 4× H100-80GB (8× H100 for high-concurrency serving) or 2× B200 with tensor parallelism. FP8 quantisation halves that to about 120 GB (2× H100-80GB at minimum, with 4× H100 giving production headroom), and NVFP4 quantisation, the precision Nemotron 3 Super was trained in natively, fits the model on a single GPU with at least 80 GB of memory (an H100 or a B200) with headroom for the KV cache.[7]
References
1. NVIDIA. "Nemotron 3 Super 120B-A12B Model Card." NVIDIA NIM. https://build.nvidia.com/nvidia/nemotron-3-super-120b-a12b/modelcard
2. OpenAI. "Introducing GPT-OSS." August 2025. https://openai.com/index/introducing-gpt-oss/
3. NVIDIA. "Introducing Nemotron 3 Super: An Open Hybrid Mamba-Transformer MoE for Agentic Reasoning." NVIDIA Developer Blog. https://developer.nvidia.com/blog/introducing-nemotron-3-super-an-open-hybrid-mamba-transformer-moe-for-agentic-reasoning/
4. Clarifai. "OpenAI GPT-OSS Benchmarks: How It Compares." https://www.clarifai.com/blog/openai-gpt-oss-benchmarks-how-it-compares-to-glm-4.5-qwen3-deepseek-and-kimi-k2
5. VentureBeat. "Nvidia's new open weights Nemotron 3 Super combines three different architectures." https://venturebeat.com/technology/nvidias-new-open-weights-nemotron-3-super-combines-three-different
6. OpenAI. "GPT-OSS-120B & GPT-OSS-20B Model Card." https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7637/oai_gpt-oss_model_card.pdf
7. NVIDIA. "Nemotron 3 Super Technical Report." https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Super-Technical-Report.pdf
8. NVIDIA. "Nemotron 3 Super Quantization." https://docs.nvidia.com/nemotron/nightly/nemotron/super3/quantization.html