Model check - NVIDIA Nemotron Nano 2: An Efficient Hybrid LLM That Beats Reasoning Benchmarks
How NVIDIA's latest LLM achieves competitive performance through architectural innovation, model compression, and strategic training at the 20-trillion-token scale
The race for more capable AI models has largely focused on scaling parameters and compute. But what if the path to better performance lies not in making models bigger, but in making them fundamentally smarter? NVIDIA's Nemotron Nano 2 explores this paradigm with a hybrid architecture that achieves competitive performance through architectural innovation rather than brute-force scaling.
Available in 12B and 9B parameter variants, Nemotron Nano 2 introduces a hybrid architecture combining Mamba2 state-space models with traditional Transformer attention mechanisms. This approach delivers competitive performance with models twice its size while maintaining practical deployment characteristics.
Hybrid Architecture: Beyond Traditional Design
Nemotron Nano 2 uses a pattern-based hybrid architecture that strategically combines Mamba2, Attention, and MLP layers based on computational requirements:
# Configuration for 12B model
"hybrid_override_pattern": "M-M-M-M*-M-M-M-M*-M-M-M-M*-M-M-M-M*-M-M-M-M*-M-M-M-M*-M-M-M-M-"
# Configuration for 9B model
"hybrid_override_pattern": "M-M-M-MM-M-M-M*-M-M-M*-M-M-M-M*-M-M-M-M*-M-MM-M-M-M-M-M-"
Pattern Notation:
M = Mamba2 layer (efficient for sequential processing)
* = Attention layer (powerful for complex reasoning)
- = MLP layer (feedforward processing)
The implementation dynamically determines layer types during model initialization:
# Actual implementation from the model code
@property
def layers_block_type(self):
    return [
        "mamba" if self.hybrid_override_pattern[i] == "M" else
        "attention" if self.hybrid_override_pattern[i] == "*" else "mlp"
        for i in range(self.num_hidden_layers)
    ]
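The mapping can be sanity-checked standalone. The snippet below (illustrative, not model code) decodes the published 12B pattern the same way and tallies the layer types:

```python
from collections import Counter

# Published 12B hybrid pattern (from the config shown above)
pattern = ("M-M-M-M*-M-M-M-M*-M-M-M-M*-"
           "M-M-M-M*-M-M-M-M*-M-M-M-M*-M-M-M-M-")

# Decode each character the same way layers_block_type does
decode = {"M": "mamba", "*": "attention", "-": "mlp"}
layer_types = [decode[c] for c in pattern]
counts = Counter(layer_types)

print(len(layer_types), counts["mamba"], counts["attention"], counts["mlp"])
# → 62 28 6 28: only 6 of the 62 layers are attention layers
```

Attention is used sparingly; the bulk of the stack alternates Mamba2 and MLP layers, which is where the efficiency gains come from.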
Performance: Outpacing Larger Models
Before diving into the technical implementation, here are selected benchmark results. Nemotron Nano 2 consistently outperforms larger competing models across diverse benchmarks:
Mathematical Reasoning Excellence
Nemotron Nano 2 exhibits particularly strong mathematical capabilities:
GSM8K Chain-of-Thought (grade school math):
Nemotron Nano 12B: 91.66%
Nemotron Nano 9B: 91.36% (close to full model performance)
Qwen3 8B: 84.00%
Gemma3 12B: 74.45%
MATH Benchmark (competition mathematics):
Nemotron Nano 12B: 83.54%
Nemotron Nano 9B: 80.50%
Qwen3 8B: 55.40%
Gemma3 12B: 42.40%
MATH Level 5 (most difficult problems):
Nemotron Nano 12B: 67.61%
Nemotron Nano 9B: 63.64%
Qwen3 8B: 29.91%
Gemma3 12B: 17.71%
General Understanding and Reasoning
MMLU (Massive Multitask Language Understanding):
Nemotron Nano 12B: 78.24% (best performance)
Nemotron Nano 9B: 74.53%
Qwen3 8B: 76.44%
Gemma3 12B: 73.61%
MMLU-Pro 5-shot (more challenging version):
Nemotron Nano 12B: 63.98% (ahead of competitors)
Nemotron Nano 9B: 59.43%
Qwen3 8B: 56.27%
Gemma3 12B: 45.12%
Code Generation and Long Context
HumanEval+ Pass@1 (code completion):
Nemotron Nano 12B: 61.03%
Qwen3 8B: 57.55%
Gemma3 12B: 36.68%
RULER-128K (128,000 token context):
Nemotron Nano 12B: 84.74%
Nemotron Nano 9B: 82.22%
Gemma3 12B: 80.70%
Key Performance Insights
Mathematical Performance: Nemotron Nano 2 shows strong mathematical reasoning, with the 12B model achieving 91.66% on GSM8K compared to Qwen3's 84.00%—a 7.66 percentage point advantage.
Efficient Compression: The Minitron-compressed 9B model scores 91.36% on GSM8K, only 0.3 percentage points below the 12B model, despite a 25% parameter reduction.
Architectural Advantage: Despite comparable or smaller size, Nemotron Nano 2 consistently outperforms both Qwen3-8B and Gemma3-12B across diverse reasoning tasks, indicating the effectiveness of its hybrid approach.
Implementation Details
The model architecture reveals several design choices from examining the source code:
Unified Block Structure
# Actual implementation from modeling code (abridged)
class NemotronHBlock(nn.Module):
    def __init__(self, config, layer_idx):
        super().__init__()
        self.block_type = config.layers_block_type[layer_idx]
        if self.block_type == "mamba":
            self.mixer = NemotronHMamba2Mixer(config, layer_idx=layer_idx)
        elif self.block_type == "attention":
            self.mixer = ATTENTION_CLASSES[config._attn_implementation](config)
        elif self.block_type == "mlp":
            self.mixer = NemotronHMLP(config, layer_idx=layer_idx)
Each layer follows a consistent structure with RMSNorm pre-normalization and residual connections, but swaps the core computational "mixer" based on the hybrid pattern.
Specialized Normalization
The model uses different normalization strategies for different components:
Standard RMSNorm for Transformer blocks (similar to LLaMA)
Gated RMSNorm for Mamba2 components with group-wise normalization
Layer-wise epsilon: 1e-5 for numerical stability
Multi-Head Mamba2 Design
# Mamba2 mixer configuration
mamba_config = {
    "mamba_num_heads": 128,  # Multi-head structure
    "mamba_head_dim": 80,    # Per-head dimension
    "mamba_n_groups": 8,     # Group-wise processing
    "ssm_state_size": 128,   # State space dimension
    "conv_kernel_size": 4,   # Convolution kernel size
    "chunk_size": 128,       # Processing chunk size
}
The Mamba2 layers use a multi-head design similar to attention mechanisms, enabling parallel processing while maintaining the linear complexity benefits of state-space models.
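A quick derived check (the expansion-factor reading is an inference from the config values, not a statement from the report): the mixer's inner dimension works out to exactly twice the 12B model's hidden size, matching Mamba2's conventional expand factor of 2.

```python
mamba_num_heads = 128
mamba_head_dim = 80
hidden_size = 5120  # 12B model hidden size

# Inner (expanded) dimension of the Mamba2 mixer = heads x per-head dim
inner_dim = mamba_num_heads * mamba_head_dim
print(inner_dim, inner_dim // hidden_size)  # → 10240 2
```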
Grouped Query Attention (GQA)
attention_config = {
    "num_attention_heads": 40,  # Query heads
    "num_key_value_heads": 8,   # Key/Value heads (5:1 ratio)
    "head_dim": 128,            # Dimension per head
}
The 5:1 query-to-key-value head ratio shrinks the KV cache during inference while maintaining performance.
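To see the effect on memory, here is a back-of-envelope KV-cache estimate. The bf16 cache dtype, the 128K sequence length, and the 6 attention layers (counted from the 12B hybrid pattern) are assumptions for illustration:

```python
# Back-of-envelope KV-cache size, assuming a bf16 cache (2 bytes/value)
# and the 6 attention layers implied by the 12B hybrid pattern.
def kv_cache_bytes(num_kv_heads, head_dim, seq_len, num_attn_layers,
                   bytes_per_value=2):
    # Factor of 2: both K and V are cached for each attention layer
    return 2 * num_kv_heads * head_dim * seq_len * num_attn_layers * bytes_per_value

seq_len = 128 * 1024  # 128K context
gqa = kv_cache_bytes(num_kv_heads=8,  head_dim=128, seq_len=seq_len, num_attn_layers=6)
mha = kv_cache_bytes(num_kv_heads=40, head_dim=128, seq_len=seq_len, num_attn_layers=6)

print(f"GQA: {gqa / 2**30:.1f} GiB, full MHA: {mha / 2**30:.1f} GiB, ratio {mha / gqa:.0f}x")
# → GQA: 3.0 GiB, full MHA: 15.0 GiB, ratio 5x
```

Under these assumptions, GQA cuts the 128K-context KV cache by 5x, on top of the savings from having so few attention layers in the first place.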
Mixed-Precision Architecture
The model implements a specific precision management:
# Precision handling in different components
precision_strategy = {
    "residual_connections": "fp32",       # Optional fp32 residuals for stability
    "attention_computation": "bfloat16",  # Standard computation precision
    "rms_norm": "fp32",                   # Normalization in fp32, cast back to input dtype
    "state_space_computation": "fp32",    # Mamba2 A matrix always in fp32
}
This mixed-precision approach balances computational efficiency with numerical stability across the hybrid architecture.
Position-Free Architecture
Unlike most Transformer models, Nemotron Nano 2 does not use any explicit position encoding. Instead, the model relies on three mechanisms for sequence understanding:
Mamba2 Temporal Dynamics: State-space models inherently capture sequential dependencies through their recurrent formulation
Causal Attention Masks: Attention layers use masks to maintain autoregressive order without position encoding
Architectural Inductive Bias: The hybrid pattern itself provides structural sequence information
This position-free design contributes to the model's efficiency—there's no computational overhead for position encoding and no length limitations imposed by fixed position representations.
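The causal-mask mechanism can be made concrete with a minimal, purely illustrative sketch:

```python
# Query position i may attend to position j only when j <= i,
# which enforces autoregressive order with no position embedding.
def causal_mask(seq_len):
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

for row in causal_mask(4):
    print("".join("x" if allowed else "." for allowed in row))
# x...
# xx..
# xxx.
# xxxx
```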
Minitron Model Compression: Intelligent Model Reduction
The 9B variant is produced with the Minitron compression pipeline, which goes beyond simple parameter pruning.
Understanding Minitron Compression
Minitron compression employs a neural architecture search (NAS) approach that strategically reduces model size while preserving performance through:
Width Pruning: Reducing hidden dimensions and intermediate sizes
Depth Pruning: Removing entire layers from the network
Knowledge Distillation: Transferring knowledge from the larger 12B model
Compression Effects in Practice
Comparing the architectures shows the structured nature of Minitron compression:
# 12B Model (Original)
{
    "num_hidden_layers": 62,
    "hidden_size": 5120,
    "intermediate_size": 20480,
    "hybrid_override_pattern": "M-M-M-M*-M-M-M-M*-M-M-M-M*-M-M-M-M*-M-M-M-M*-M-M-M-M*-M-M-M-M-"
}

# 9B Model (Minitron Compressed)
{
    "num_hidden_layers": 56,     # 6 layers removed (9.7% depth reduction)
    "hidden_size": 4480,         # 12.5% width reduction
    "intermediate_size": 15680,  # 23.4% intermediate size reduction
    "hybrid_override_pattern": "M-M-M-MM-M-M-M*-M-M-M*-M-M-M-M*-M-M-M-M*-M-MM-M-M-M-M-M-"
}
Notice that compression isn't uniform—the hybrid pattern adapts, with some positions combining Mamba layers (MM) to maintain efficiency while reducing overall parameter count.
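The stated reductions can be reproduced directly from the two configs above:

```python
# Per-dimension reduction from the 12B config to the 9B Minitron config
dims_12b = {"layers": 62, "hidden": 5120, "intermediate": 20480}
dims_9b  = {"layers": 56, "hidden": 4480, "intermediate": 15680}

for name in dims_12b:
    reduction = 1 - dims_9b[name] / dims_12b[name]
    print(f"{name}: {reduction:.1%} reduction")
# layers: 9.7%, hidden: 12.5%, intermediate: 23.4%
```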
Layer Scoring Methodology
The layer importance scoring uses an iterative Mean Squared Error (MSE) approach:
Forward Pass Analysis: Compute model output with all layers present
Candidate Removal: Temporarily remove a candidate layer and recalculate the output
Impact Measurement: MSE between original and modified outputs quantifies layer importance
Iterative Selection: Layers with the lowest impact scores become removal candidates
This approach proves computationally efficient because it requires only forward passes rather than expensive gradient computations, enabling informed decisions about which components can be safely compressed.
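The steps above can be sketched as a toy scoring loop (illustrative only; this is not NVIDIA's implementation, and the random residual stack stands in for the real network):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": a stack of random linear layers with residual connections
hidden, n_layers = 16, 8
weights = [rng.normal(scale=0.1, size=(hidden, hidden)) for _ in range(n_layers)]

def forward(x, skip=None):
    for i, w in enumerate(weights):
        if i == skip:
            continue  # candidate removal: drop this layer entirely
        x = x + x @ w
    return x

x = rng.normal(size=(4, hidden))  # small batch of calibration inputs
reference = forward(x)            # forward pass with all layers present

# Impact of each layer = MSE between full output and output without it.
# Only forward passes are needed; no gradients are computed.
scores = [float(np.mean((forward(x, skip=i) - reference) ** 2))
          for i in range(n_layers)]
prune_order = np.argsort(scores)  # lowest-impact layers first
print("removal candidates, least important first:", prune_order.tolist())
```

Layers at the front of `prune_order` are the cheapest to remove; iterating this loop after each removal gives the iterative selection described above.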
Training at Scale: Data and Methodology
Massive Training Corpus
Nemotron Nano 2 was trained on an extensive corpus, with the final pretraining involving 20 trillion tokens processed through data curation and synthetic generation pipelines.
Data Composition and Sources
The training data reflects a strategic approach to capability building:
# Training dataset breakdown (major components in tokens)
dataset_composition = {
    "English Common Crawl": 3.36e12,    # 31.5%
    "English Synthetic CC": 1.95e12,    # 18.3%
    "Multilingual": 2.17e12,            # 20.4%
    "Synthetic Multilingual": 0.99e12,  # 9.3%
    "Code": 747e9,                      # 7.0%
    "Papers": 192e9,                    # 1.8%
    "Math": 125e9,                      # 1.2%
    "STEM SFT": 273e9,                  # 2.6%
    # Additional specialized datasets contribute the remaining ~8%
}
Synthetic Data Strategy
A notable aspect is the extensive use of synthetic data generation. Out of the total training corpus, 3.5 trillion tokens (33%) are synthetically generated using advanced models:
DeepSeek-R1: Mathematical reasoning and problem-solving data
Qwen2.5-72B: Question-answering from academic papers and books
Mixtral-8x22B: Social sciences and moral reasoning datasets
phi-4: Common Crawl enhancement and mathematical content
This synthetic data approach allows targeted capability enhancement in specific domains while maintaining data quality and control.
Multilingual and Code Coverage
Languages (16 total): English, Spanish, French, German, Japanese, Italian, Portuguese, Chinese, Arabic, Danish, Korean, Dutch, Polish, Russian, Swedish, Thai
Programming Languages (43 total): Including Python (127B tokens), Java (131B tokens), C++ (68B tokens), JavaScript (76B tokens), and specialized languages like CUDA and SystemVerilog
Training Methodology
Advanced Learning Rate Scheduling
Nemotron Nano 2 uses a Warmup-Stable-Decay (WSD) learning rate schedule optimized for large-scale training:
# Actual training configuration
training_config = {
    "total_tokens": 20e12,
    "batch_size": 736,
    "warmup_tokens": 8e9,  # 8B token warmup
    "peak_learning_rate": 4.5e-4,
    "min_learning_rate": 4.5e-6,
    "schedule": "warmup_stable_decay",
}
This schedule provides stable training dynamics across the extended training run, crucial for hybrid architectures combining different computational paradigms.
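A minimal sketch of such a WSD schedule using the configuration above. The linear warmup/decay shapes and the 10% decay window are illustrative assumptions; the report's exact schedule shape may differ.

```python
def wsd_lr(tokens_seen, total=20e12, warmup=8e9,
           peak=4.5e-4, floor=4.5e-6, decay_frac=0.1):
    # decay_frac (final 10% of tokens) and the linear shapes are assumptions
    decay_start = total * (1 - decay_frac)
    if tokens_seen < warmup:            # linear warmup to the peak
        return peak * tokens_seen / warmup
    if tokens_seen < decay_start:       # long stable phase at the peak
        return peak
    frac = (tokens_seen - decay_start) / (total - decay_start)
    return peak + frac * (floor - peak)  # linear decay to the floor

# warmup midpoint, stable phase, end of training
print(wsd_lr(4e9), wsd_lr(1e12), wsd_lr(20e12))
```

The long stable phase is what distinguishes WSD from cosine schedules: the learning rate stays at its peak for most of the run and only decays near the end.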
FP8 Precision Training
The models leverage FP8 (8-bit floating point) precision training:
Memory Efficiency: ~40% reduction in training memory compared to FP16
Computational Speed: Faster training on modern accelerators
Numerical Stability: Maintained through careful gradient scaling and loss monitoring
Computational Investment
The training represents significant computational investment:
training_resources = {
    "cumulative_compute": "1.45E+24 FLOPs",
    "estimated_energy": "708.3 MWh",
    "training_period": "June 2025 - August 2025",
    "data_cutoff": "May 1, 2025",
}
Multi-Phase Training Pipeline
Nemotron Nano 2 is trained in three phases:
Phase 1: Base Model Pretraining
Scope: 12B base model trained on 20 trillion tokens
Duration: June 2025 - August 2025
Compute: 1.45E+24 FLOPs
Energy: 708.3 MWh
Focus: Foundation language understanding across 16 languages and 43 programming languages
Phase 2: Post-Training Refinement
Scope: Additional training on 1 trillion tokens
Compute: 7.25E+22 FLOPs
Energy: 35.6 MWh
Focus: Alignment, instruction following, and specialized capabilities
Data: Enhanced with reasoning traces from SOTA models (DeepSeek-R1, Qwen3-235B-A22B, Nemotron-4-340B)
Model merging: Checkpoint interpolation for DPO and GRPO models to balance model performance.
Phase 3: Minitron Compression & Distillation
Scope: 142 billion tokens for compression training
Compute: 7.72E+21 FLOPs
Energy: 3.7 MWh
Process: Layer pruning, width reduction, and knowledge distillation from 12B to 9B model
This phased approach allows for systematic capability building: foundational learning, specialized refinement, and efficient compression while preserving performance.
Deployment and Inference Advantages
Runtime Efficiency Benefits
The hybrid architecture delivers significant efficiency gains during inference:
Memory Efficiency: 40% lower peak memory usage compared to pure Transformer models
Speed Optimization: Up to 3x faster inference on long sequences due to Mamba2's linear scaling
Context Length: Supports 128K tokens with efficient processing
Framework Integration
The models integrate seamlessly with standard deployment frameworks:
# Supported inference engines
inference_engines = [
    "HuggingFace Transformers",  # Standard PyTorch implementation
    "vLLM",                      # High-throughput serving
    "TensorRT-LLM",              # NVIDIA optimized inference
]

# Hardware requirements
hardware_specs = {
    "12B model": "A100 80GB, H100 80GB",
    "9B model": "A10G 24GB, H100 80GB",  # More accessible hardware
}
Dynamic Budget Control
The 9B model includes runtime budget control for reasoning tokens:
# Conceptual API for runtime thinking budget control
response = model.generate(
    prompt="Solve this complex problem step by step",
    thinking_budget=150,  # Allow extra reasoning tokens
    max_tokens=500,
)
This feature enables dynamic resource allocation, allowing the model to "think longer" on complex problems while maintaining efficiency for simpler queries.
Thanks for reading through this technical deep dive - hopefully it gave you a good sense of how Nemotron Nano 2 achieves competitive performance through hybrid Mamba-Transformer architecture, sophisticated compression techniques, and strategic training methodologies at massive scale.
Relevant Links
Model Access:
NVIDIA Nemotron Nano 12B v2 Base - Base 12B model on Hugging Face
NVIDIA Nemotron Nano 9B v2 - Compressed 9B model on Hugging Face
Technical Documentation:
NVIDIA Nemotron Nano 2 Technical Report - Comprehensive technical paper
Mamba: Linear-Time Sequence Modeling - Original Mamba paper
Mamba-2: Structured State Space Duality - Mamba2 architecture paper
Training Datasets:
Nemotron Pretraining Dataset Collection - Released training data
Nemotron-CC v2 - Common Crawl dataset
Nemotron-PrismMath - Synthetic math dataset
Related Research:
Minitron: Structured Pruning and Knowledge Distillation - Minitron compression methodology
Neural Architecture Search (NAS) - Foundation for compression approach
Implementation Resources:
Transformers Library - For HuggingFace integration
vLLM - High-throughput inference engine
TensorRT-LLM - NVIDIA optimized inference