Model Check - MiMo-Audio: Scaling Speech Pre-Training to 100M Hours
Going over the code and the technical report of the new speech LM from Xiaomi that rivals GPT-4o-audio and Gemini
MiMo-Audio is Xiaomi's 7B-parameter model that processes speech and text through a unified architecture. It is trained on 100+ million hours of audio data (10x more than existing open-source models), which yields emergent capabilities like voice conversion, speech translation, and cross-modal reasoning through few-shot learning, demonstrating speech scaling laws similar to those of text language models.
TL;DR:
Scale Unlocks Emergent Abilities: 100M+ hours (10x more than existing models) produces a phase transition at ~0.7T training tokens, with genuine few-shot abilities
Unified Architecture: Patch-based audio representation (25Hz→6.25Hz) bridges text-speech mismatch; two-stage training preserves text capabilities while adding audio generation
SOTA Open-Source Performance: Best modality consistency on SpeechMMLU benchmark; strong across audio understanding benchmarks.
Emergent Abilities at Scale
Evidence for "Phase Transition":
Performance jumps after ~0.7 trillion tokens across multiple benchmarks
5-shot SpeechMMLU (T2S, S2S), 16-shot voice conversion, S2S translation
Emergent Capabilities (not explicitly trained):
Few-shot voice conversion
Emotion and speaking rate modification
Speech denoising
Speech-to-speech translation
Core Contribution: Proving that text scaling paradigms also work for speech.
Architecture Deep Dive
MiMo-Audio-Tokenizer (1.2B Parameters)
Core Specs:
8-layer Residual Vector Quantization (RVQ)
25Hz frame rate → 25 × 8 = 200 tokens/second (see the sketch below)
Preserves both semantic + acoustic information
K-means based vocabulary initialization.
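To make those numbers concrete, here is a minimal sketch of 8-layer residual vector quantization at a 25Hz frame rate. The codebook size, feature dimension, and function name are illustrative assumptions, not the tokenizer's actual configuration.

import numpy as np

# Toy 8-layer RVQ: each layer quantizes the residual left by the previous layers.
# Codebook size and dimension are illustrative, not MiMo-Audio-Tokenizer's real config.
rng = np.random.default_rng(0)
NUM_LAYERS, CODEBOOK_SIZE, DIM = 8, 1024, 512
codebooks = rng.standard_normal((NUM_LAYERS, CODEBOOK_SIZE, DIM))

def rvq_encode(frame):
    """Quantize one 40 ms encoder frame into 8 codes, one per RVQ layer."""
    residual, codes = frame, []
    for layer in codebooks:
        idx = int(np.argmin(np.linalg.norm(layer - residual, axis=-1)))
        codes.append(idx)
        residual = residual - layer[idx]   # pass the remaining error to the next layer
    return codes

codes = rvq_encode(rng.standard_normal(DIM))
print(len(codes) * 25)                     # 8 codes/frame * 25 frames/sec = 200 tokens/sec

This is why the raw token rate (200/second) is so much higher than the 6.25Hz patch rate the LLM eventually sees (covered further down).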
Two-Stage Training:
Stage 1: 11M+ hours, joint audio reconstruction + A2T objectives
Loss weights: λ_A2T = 10.0, λ_recon = 1.0, λ_commit = 1.0 (combined as shown below)
Stage 2: Adversarial fine-tuning with Multi-Period + Multi-Scale STFT discriminators
Frozen encoder/discretization, train decoder/vocoder
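Assuming the reported weights enter a standard weighted sum (my reading of the listed λ values, not a quote from the report), the Stage-1 objective is roughly:

L_stage1 = λ_A2T · L_A2T + λ_recon · L_recon + λ_commit · L_commit = 10.0 · L_A2T + L_recon + L_commit

The 10x weight on the A2T term presumably pushes the codes toward linguistic content while the reconstruction and commitment terms keep them decodable back to audio.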
Approach: From-scratch training at massive scale vs. building on existing semantic models
Solving Semantic vs Acoustic Tokens Conflict
The Core Trade-off:
Semantic tokens: Capture linguistic content, lose acoustic details (speaker identity, prosody)
Acoustic tokens: Preserve audio quality, struggle with language understanding
Traditional solution: Choose one path—semantic for ASR, acoustic for voice cloning
MiMo-Audio's Approach:
Add layer-3 hidden states to the final-layer output via element-wise summation (see the sketch below)
Theory: early layers (L3) capture acoustics; final layers (L32) capture semantics
Implementation: a simple element-wise sum combines both representations
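A minimal sketch of that fusion, assuming the encoder exposes per-layer hidden states; the indexing convention and names here are mine, not the released code.

import torch

def fuse_encoder_features(hidden_states):
    """Element-wise sum of an early (acoustic) layer and the final (semantic) layer.
    Each entry has shape (batch, frames, dim)."""
    acoustic = hidden_states[3]    # layer-3 output: speaker identity, prosody, timbre
    semantic = hidden_states[-1]   # final-layer output: linguistic content
    return acoustic + semantic

layers = [torch.randn(1, 50, 512) for _ in range(33)]  # dummy per-layer encoder states
fused = fuse_encoder_features(layers)                  # (1, 50, 512)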
❓ No ablation in the paper; the improvements might simply be the result of the larger-scale training. ❓
Audio Language Model
The Challenge: audio at 200 tokens/second vs. text at ~4 words/second creates a large sequence-length mismatch
Solution: patching audio tokens (25Hz → 6.25Hz) before the LLM, as sketched below
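A sketch of that patching step, grouping 4 consecutive 25Hz frames (each carrying 8 RVQ codes) into one 6.25Hz patch; shapes and names are illustrative.

import torch

PATCH_SIZE = 4   # 25Hz / 4 = 6.25Hz

def patch_audio_tokens(rvq_tokens):
    """(batch, frames, 8 RVQ layers) -> (batch, frames // 4, 4, 8)."""
    b, t, q = rvq_tokens.shape
    t = (t // PATCH_SIZE) * PATCH_SIZE              # drop any ragged tail
    return rvq_tokens[:, :t].reshape(b, t // PATCH_SIZE, PATCH_SIZE, q)

tokens = torch.randint(0, 1024, (1, 100, 8))        # 4 seconds of audio at 25Hz
print(patch_audio_tokens(tokens).shape)             # torch.Size([1, 25, 4, 8])

Each patch is then embedded into a single LLM position, which is what cuts the effective sequence length by 4x.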
Three-Component Architecture:
Patch Encoder (6 Layers): Groups 4 consecutive 25Hz RVQ frames (each holding 8 RVQ codes) into one patch, with bidirectional attention
LLM Backbone: MiMo-7B-Base with unified next-token/patch prediction
Patch Decoder (16 Layers): Generates 25Hz RVQ sequence with delayed pattern output.
Layer-specific delays (0-1-2-3-4-5-6-7) prevent simultaneous prediction of all eight RVQ codebooks for the same frame (sketched below)
Result: Efficient cross-modal transfer while maintaining fine-grained audio generation.
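The delayed pattern in the patch decoder can be pictured like this; the padding value and tensor layout are illustrative, not the repository's actual implementation.

import torch

DELAYS = list(range(8))   # per-layer delays 0..7
PAD = -1                  # placeholder for positions not yet generated

def apply_delay_pattern(codes):
    """(8 RVQ layers, frames) -> (8, frames + 7), layer k shifted right by k steps,
    so at any decoding step the 8 codebooks refer to 8 different frames."""
    num_layers, frames = codes.shape
    out = torch.full((num_layers, frames + max(DELAYS)), PAD, dtype=codes.dtype)
    for k, d in enumerate(DELAYS):
        out[k, d:d + frames] = codes[k]
    return out

print(apply_delay_pattern(torch.arange(16).reshape(8, 2)))   # toy: 8 layers, 2 frames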
Implementation Insights
Training Strategy
Two-Stage Progressive Approach:
# Stage 1: Understanding Only
loss_weights = [1, 0, 0, 0, 0, 0, 0, 0, 0] # Text only
learning_rates = {'patch_encoder': 2e-4, 'llm': 3e-5}
# Stage 2: Understanding + Generation
loss_weights = [100, 12, 8, 6, 4, 2, 2, 1, 1] # Text + RVQ layers
learning_rates = {'patch_encoder': 2e-4, 'llm': 3e-5, 'patch_decoder': 2e-4}
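A hedged sketch of how those per-stream weights could be applied: one cross-entropy term for the text stream plus one per RVQ layer. The function signature and shapes are mine; the report only lists the weights.

import torch.nn.functional as F

def weighted_lm_loss(text_logits, text_targets, rvq_logits, rvq_targets, weights):
    """weights[0] scales the text loss; weights[1:] scale the 8 RVQ-layer losses."""
    loss = weights[0] * F.cross_entropy(text_logits, text_targets)
    for w, logits, targets in zip(weights[1:], rvq_logits, rvq_targets):
        loss = loss + w * F.cross_entropy(logits, targets)
    return loss

With the Stage-1 weights [1, 0, ..., 0] this reduces to a text-only loss; with the Stage-2 weights the text term still dominates (100 vs. 1-12), which presumably helps keep text capabilities intact while audio generation is learned.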
Key Architecture Specs:
Patch Encoder: 1024 dim, 64 heads, 6 layers
LLM Backbone: 4096 dim, 32 heads, 36 layers
Patch Decoder: 1024 dim, 64 heads, 16 layers
Context: 8192 tokens, 4-patch audio chunks
Delay Pattern: [0,1,2,3,4,5,6,7] for 8 RVQ layers
Shared embedding tables between encoder/decoder for efficiency
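Collected into one place (field names are mine; the values are the ones listed above):

from dataclasses import dataclass, field

@dataclass
class MiMoAudioConfig:
    patch_encoder: dict = field(default_factory=lambda: dict(dim=1024, heads=64, layers=6))
    llm_backbone: dict = field(default_factory=lambda: dict(dim=4096, heads=32, layers=36))
    patch_decoder: dict = field(default_factory=lambda: dict(dim=1024, heads=64, layers=16))
    context_len: int = 8192
    patch_size: int = 4                       # 4 RVQ frames per audio patch (25Hz -> 6.25Hz)
    rvq_delays: tuple = (0, 1, 2, 3, 4, 5, 6, 7)
    tie_audio_embeddings: bool = True         # shared encoder/decoder embedding tables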
Training at Unprecedented Scale
Scale Specifications
Data: 100+ million hours (10x larger than existing open-source)
Pipeline: Automated processing, multi-dimensional annotation, quality control
Sources: Podcasts, audiobooks, news, interviews, conference recordings
Content: Daily communication, entertainment, business, arts, research
Two-Stage Progressive Training
Stage 1 - Understanding (2.6T tokens):
1.2T text + 1.4T audio tokens (6.25Hz)
Tasks: Speech-text interleaved, ASR, audio captioning, text pre-training
Loss computed only on text tokens (preserves text capabilities)
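A sketch of the "loss on text tokens only" idea: positions holding audio tokens are masked out of the next-token objective. The mask construction and names are illustrative, not the actual training code.

import torch
import torch.nn.functional as F

def stage1_text_only_loss(logits, targets, is_text):
    """logits: (tokens, vocab); targets: (tokens,); is_text: (tokens,) bool mask."""
    masked_targets = targets.clone()
    masked_targets[~is_text] = -100                       # ignored by cross_entropy
    return F.cross_entropy(logits, masked_targets, ignore_index=-100)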
Stage 2 - Understanding + Generation (5T tokens):
2.6T text + 2.4T audio tokens
Adds: Speech continuation, TTS, instruction-following TTS
All parameters trained with weighted losses
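A quick sanity check (my arithmetic, not from the report) that these audio-token budgets are consistent with ~100M hours of audio at the 6.25Hz patch rate:

HOURS = 100e6
PATCH_RATE_HZ = 6.25                       # audio patch rate seen by the LLM

tokens_per_pass = HOURS * 3600 * PATCH_RATE_HZ
print(f"{tokens_per_pass / 1e12:.2f}T")    # ~2.25T audio tokens for one full pass, so the
                                           # 1.4T (Stage 1) and 2.4T (Stage 2) audio budgets
                                           # are each on the order of a single pass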
They use an internal TTS model to generate training data for spoken dialogue
Performance Analysis
Speech Intelligence (SpeechMMLU)
MiMo-Audio S2S: 69.1% (best among evaluated models)
Step-Audio2-mini S2S: 51.8%
MiMo-Audio Modality Gap: 3.4 points (T2T: 72.5%, S2S: 69.1%)
Competitors: 22.3+ point gaps
Key Finding: Consistent reasoning across text/speech modalities
Audio Understanding (MMAU)
MiMo-Audio: 66.0% overall
Step-Audio2-mini: 60.3%
Balance: Speech 67.6%, sound 65.2%, music 65.3%
Few-Shot Learning Evidence
Voice Conversion: 16-shot in-context learning without parameter updates
S2S Translation: Cross-lingual generation maintaining speaker characteristics
Style Transfer: Emotion/rate conversion across prosodic dimensions
Limitation: Heavy reliance on automatic metrics; perceptual quality gaps unclear
Thanks for reading… 👋
Resources
Model Weights: MiMo-Audio Collection on HuggingFace
Source Code: MiMo-Audio GitHub Repository
Technical Report: MiMo-Audio Technical Report
Evaluation Suite: MiMo-Audio-Eval
Demo Interface: Interactive MiMo-Audio Demos