Machine Learns #72

Latest open-weight frontier models, speech synthesis everywhere, diffusion for text, sparse attention that actually works, RNNs that train with O(1) gradient paths, and agents that still can’t do real jobs.

Jun 17, 2026

🤖 Model Releases

🌐 Gemma 4 12B - Google’s 12B dense multimodal model with no separate vision or audio encoder. Image patches, audio waveforms, and text go through one transformer via linear projections. 256K context, configurable thinking mode, native function calling, Apache 2.0. Fits on a single 24GB GPU.
🔀 DiffusionGemma 26B-A4B - Gemma 4 26B rebuilt to use discrete diffusion instead of autoregressive decoding. Denoises 256-token blocks in parallel, >1,100 tok/s on a single H100. Simpler tasks auto-use fewer denoising steps. Meaningful accuracy gap vs the AR version (5-20 pts across benchmarks), but the speed is the point. Apache 2.0.

📝 GLM-5.2 - Zhipu’s 744B MoE (40B active) with a stable 1M context window. Introduces IndexShare (sparse attention sharing every 4 layers), cutting FLOPs by 2.9x at long context. Terminal-Bench 81.0, SWE-bench Pro 62.1. Apache 2.0.
⚡ Step 3.7 Flash - 198B sparse MoE vision-language model with 11B active params, 256K context, up to 400 tok/s. SWE-Bench Pro 56.3, ClawEval 67.1. Apache 2.0. Aggressive API pricing at $0.20/M input.
🧠 Ling 2.6 & Ring 2.6 - Two 1T-param MoE models from InclusionAI. Ling handles fast/direct responses, Ring does deep reasoning. Hybrid MLA + Linear Attention, 256K context, speculative decoding via EAGLE. SWE-bench Verified 72.2%. MIT license.
🏋️ Nemotron 3 Ultra 550B - NVIDIA’s 550B MoE (55B active) with hybrid Mamba-2 + attention. FP4-native, trained in FP4, not post-hoc quantized. 1M token context. GPQA Diamond 87.9, SWE-Bench Verified 69.7. Requires 4xB200 or 8xH100 minimum. OpenMDW v1.1 license.
💻 Kimi K2.7 Code - Moonshot’s 1T MoE (32B active) for long-horizon software engineering. Always-on thinking mode with 30% fewer thinking tokens than K2.6. MCPMark-Verified 81.1 (vs Claude Opus 4.8 at 76.4). 256K context, vision support. Modified MIT license.
🧩 VibeThinker-3B - 3B reasoning model fine-tuned from Qwen2.5-Coder-3B. 4-stage training pipeline: curriculum SFT, multi-domain RL, self-distillation, and instruct RL. IMO-AnswerBench 76.4, competitive with models 100x its size on math and coding. MIT license.
🤖 Macaron-V1-Preview-749B - 749B agent model using Mixture-of-LoRA: 5 specialist 1B adapters (chat, personal-life, coding, generative UI, multi-step workflows) on a frozen GLM5.1 base. Adding a new capability = train one LoRA. Preview release, license TBD.
🗣️ dots.tts - RedNote’s 2.2B end-to-end TTS. Fully continuous (no discrete codec tokens), 48kHz stereo, zero-shot voice cloning, 24 languages, built on Qwen2.5-1.5B. Three variants: base, soar (quality-optimized), mf (distilled, 2-4 step inference). Current best open speaker similarity score (79.2 SIM on Seed-TTS-Eval). Apache 2.0.
🎤 ZONOS2 - Zyphra’s 8B MoE TTS (900M active) for zero-shot voice cloning across 30+ languages. First open-source MoE TTS. 44.1kHz output, no phonemizer needed (raw UTF-8 input), two modes: stable (studio-clean) and expressive (max speaker fidelity). 4x throughput over their dense model. Apache 2.0.
🎙️ Ming Omni TTS 16.8B - Unified audio generation from Ant Group. 16.8B total, 3B active (MoE). Generates speech, environmental sound, and music from one model. 100+ built-in voices, zero-shot cloning, emotion/pitch/dialect control. Apache 2.0.
🔊 Miso TTS 8B - 8B TTS on Llama 3.2 backbone with Mimi codec. Zero-shot voice cloning, built-in SilentCipher watermarking. English only. Modified MIT (free under 50M MAU / $10M/mo).
👂 Higgs Audio v3 STT - 2.68B ASR model that replaces Whisper-Large-v3’s decoder with a Qwen3-1.7B LLM decoder. Supports a think mode before transcription. English only. Apache 2.0.
🔉 VoiceCLAP-Large-v2 - CLIP for voice. 8.9B contrastive audio-text embedding model from LAION, built on Qwen2.5-Omni with LoRA. Aligns speech and text descriptions for emotion recognition, voice search, and retrieval. CC-BY-NC 4.0.
🎵 Magenta RealTime 2 - Google’s 2.4B model (also a 230M small variant) for real-time continuous music generation. Steerable by text, audio, and MIDI simultaneously at 200ms latency. Frame-wise autoregressive at 40ms frames. 48kHz stereo. Code: Apache 2.0, Weights: CC-BY 4.0.
🎨 Ideogram 4 - 9.3B single-stream DiT for text-to-image with Qwen3-VL-8B text encoder. Best open-weight image model by multiple third-party benchmarks. Best-in-class text rendering. Up to 2K resolution. Non-commercial license.
👁️ Zamba2-VL-2.7B - 2.7B VLM on Zyphra’s hybrid Mamba2 + Transformer architecture with Qwen2.5-VL vision encoder. SSM backbone means lower latency and memory than transformer VLMs. DocVQA 90.9 at 2.7B params. Apache 2.0.
🎬 NAVA - Baidu’s 6.3B joint audio-video generation model. Text or image+text prompt produces synchronized 720p video + stereo audio in one pass, no separate audio pipeline. Runs on a single RTX 4090 with fp8 + offload. Multi-speaker voice cloning. Apache 2.0.

📎 Papers

🔍 MiniMax Sparse Attention

What’s new

Blockwise sparse attention that delivers real wall-clock speedups, not just theoretical FLOP savings. Two-branch system where a tiny Index Branch decides what matters, then the Main Branch only attends to those blocks.

How it works

Every attention head gets two additional small projection matrices that form the Index Branch. For each query, the Index Branch scores all KV blocks (size 128) by max-pooling key projections within each block, then picks the top-k=16 blocks.
The Main Branch runs standard exact attention, but only over the selected ~2,048 tokens (16 blocks x 128 tokens) instead of the full sequence. No approximation in the attention itself, just content-aware block skipping.
Top-k selection is non-differentiable, so the Index Branch is trained with a KL-divergence loss that aligns the index scores with the actual full-attention distributions computed during training. The index learns to predict where the model would attend if it could see everything.
Works for both pretraining from scratch and retrofitting: a dense 109B MoE model can be converted by initializing the index projections from existing QK weights and fine-tuning on 200B tokens.
Built into MiniMax’s production 109B MoE (trained on 3T tokens) and tested up to 4M context.
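To make the two-branch flow concrete, here is a minimal pure-Python sketch for a single query. It is not MiniMax's implementation: the function names are hypothetical, and it assumes a block's score is the maximum index-space dot product between the query and any key in that block, which is only one reading of "max-pooling key projections".

```python
import math

BLOCK_SIZE = 128  # KV block size quoted in the text
TOP_K = 16        # blocks kept per query, quoted in the text


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def select_blocks(q_idx, k_idx, block_size=BLOCK_SIZE, top_k=TOP_K):
    """Index Branch sketch: return sorted ids of the top_k KV blocks.

    Assumption: block score = max over the block's keys of q_idx . k_idx.
    """
    scores = []
    for start in range(0, len(k_idx), block_size):
        block = k_idx[start:start + block_size]
        scores.append(max(dot(q_idx, k) for k in block))
    ranked = sorted(range(len(scores)), key=lambda b: scores[b], reverse=True)
    return sorted(ranked[:top_k])


def sparse_attention(q, keys, values, blocks, block_size=BLOCK_SIZE):
    """Main Branch sketch: exact softmax attention over the chosen blocks only."""
    token_ids = [t for b in blocks
                 for t in range(b * block_size,
                                min((b + 1) * block_size, len(keys)))]
    logits = [dot(q, keys[t]) / math.sqrt(len(q)) for t in token_ids]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    out = [sum(w * values[t][d] for w, t in zip(weights, token_ids)) / total
           for d in range(len(values[0]))]
    return out, token_ids
```

With the defaults, each query attends to at most TOP_K * BLOCK_SIZE = 2,048 tokens no matter how long the sequence is. The KL-based training of the index projections is not shown.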

Results

14.2x faster prefill and 7.6x faster decoding on H800 at 1M context, real wall-clock, not theoretical.
Quality stays on par with full GQA across standard benchmarks, so the 28.4x attention compute reduction is essentially free.

📐 SubQ 1.1 Small

What’s new

Content-dependent sparse key selection that routes each query to semantically relevant keys regardless of position. Unlike fixed-pattern sparsity (decides where to look before seeing content) and SSMs (lossy fixed-capacity compression), this selects dynamically based on what the query actually needs.

How it works

For each query, a lightweight scoring function evaluates all keys based on content, selects the highest-signal positions, then computes exact attention only over those. The selection is position-independent, so relevant information gets attended to whether it was 10 tokens ago or 1M tokens ago.
Training starts with standard pretraining, then supervised fine-tuning, then RL specifically targeting long-context retrieval failures. The RL stage matters: it teaches the selection function to find scattered evidence across huge documents, not just attend to nearby or salient tokens.
Context extension uses about 1T tokens of naturally long artifacts (full books, entire codebases, long documents) with staged extension up to 2M tokens. The training data is genuinely long, not synthetically padded.
The architecture is designed so the sparse selection degrades gracefully: at short contexts where everything fits in the selection budget, it behaves identically to dense attention.
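The graceful-degradation property is easy to see in a toy version. The sketch below is an illustration under assumptions, not SubQ's method: `dot_score` stands in for the unspecified lightweight scoring function, and `budget` is a hypothetical name for the selection budget.

```python
import math


def dot_score(q, k):
    """Hypothetical content score; the real scoring function is not described."""
    return sum(a * b for a, b in zip(q, k))


def softmax_attend(q, keys, values, positions):
    """Exact softmax attention of q over the given key positions."""
    logits = [dot_score(q, keys[p]) / math.sqrt(len(q)) for p in positions]
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [sum(w * values[p][d] for w, p in zip(weights, positions)) / total
            for d in range(len(values[0]))]


def select_keys(q, keys, budget, score_fn=dot_score):
    """Pick up to `budget` positions by content score; distance plays no role."""
    ranked = sorted(range(len(keys)), key=lambda p: score_fn(q, keys[p]),
                    reverse=True)
    return sorted(ranked[:budget])


def sparse_attend(q, keys, values, budget, score_fn=dot_score):
    return softmax_attend(q, keys, values, select_keys(q, keys, budget, score_fn))
```

When `budget` is at least the number of keys, `select_keys` returns every position and the result matches dense attention exactly. With a small budget, a high-scoring key is picked whether it sits at the start or the end of the context.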

Results

56x faster than FlashAttention-2 at 1M context. On multi-evidence retrieval (MRCR v2, which requires finding and combining facts scattered across a huge document), scores 86.2% vs Claude Opus 4.6 at 78.3%.
No short-context tradeoff: GPQA Diamond 85.4%, LiveCodeBench 89.7%, SWE-Bench 81.8%.

🗣️ CTC-TTS: LLM-based Dual-Streaming TTS with CTC Alignment

What’s new

A TTS model that streams both its text input and its audio output. It replaces the Montreal Forced Aligner (MFA) used in LLM-based TTS with a CTC neural aligner and a bi-word interleaving strategy. Two variants: L for quality, F for low latency.

How it works

Standard LLM-based TTS needs word-level alignment between text and audio to interleave phoneme tokens with speech tokens during training. MFA requires a pronunciation dictionary and language-specific rules, and it often fails on names, rare words, and non-standard speech.
A CTC ASR model (115M params, trained separately) produces frame-level phoneme posteriors, then Viterbi decoding gives alignment boundaries. This replaces the entire MFA pipeline with a single forward pass.
Training sequences use bi-word interleaving: phonemes of current word, phonemes of next word, then speech tokens for the current word. The one-word lookahead gives the model context about what is coming before it has to produce the audio, improving prosody and reducing hallucination.
The F-variant (for streaming) stacks the current and next word’s phoneme embeddings in the feature dimension rather than sequentially, eliminating the lookahead latency. First packet comes out in 159ms.
Models stay small: 34M params for single-speaker, 160M for multi-speaker. The CTC aligner is a one-time cost at data preprocessing, not an inference dependency.
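The bi-word interleaving layout can be sketched in a few lines. This follows the description above literally (current word's phonemes, next word's phonemes, current word's speech tokens), so each word's phonemes appear twice. The paper's exact layout may differ, and the function name and token strings are made up for illustration.

```python
def interleave_bi_word(words):
    """Build one training sequence from per-word (phonemes, speech_tokens) pairs.

    Literal reading of the description: for each word, emit its phonemes,
    then the next word's phonemes (one-word lookahead), then its speech tokens.
    """
    sequence = []
    for i, (phonemes, speech) in enumerate(words):
        sequence.extend(phonemes)
        if i + 1 < len(words):
            sequence.extend(words[i + 1][0])  # lookahead; absent for the last word
        sequence.extend(speech)
    return sequence


# Two words with made-up phoneme and speech-token ids.
example = interleave_bi_word([
    (["HH", "AY"], ["s1", "s2"]),
    (["DH", "EH", "R"], ["s3"]),
])
```

The word boundaries that split phonemes and speech tokens per word are what the CTC aligner provides at preprocessing time.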

Results

CTC-TTS-L cuts word error rate by more than half vs the MFA baseline (4.82% vs 10.98%), meaning far fewer skipped or garbled words. Listeners rated it higher than ground truth recordings (MOS 4.33 vs 4.28).
The F-variant hits 159ms first-packet latency, fast enough for real-time conversation, at a small quality tradeoff.

🏗️ NAG: Norm-AGnostic Residual Networks

What’s new

Standard Transformers have a depth problem: residual stream norm grows with each layer, so deeper layers’ updates become proportionally smaller rotations of the hidden state. NAG fixes this by separating direction from magnitude in the residual stream.

How it works

Splits the residual into two lanes: a normalized phase lane (unit-norm direction vector) and a scalar norm lane (magnitude). Each layer updates both independently.
Before adding a layer’s output to the residual, the update vector is centered (zero-mean), stripped of any component parallel to the current residual direction, normalized to unit length, and then combined with the phase via a controlled rotation. This means each layer contributes a fixed-size angular change regardless of how large the accumulated norm has grown.
The norm lane is updated separately by a simple scalar function. Direction and scale flow on independent tracks, so norm inflation in the magnitude track never limits the directional updates.
This geometry produces a natural skip signal: if a token’s expected rotation angle is small (the layer has nothing useful to add), the block can be skipped entirely. Mixture-of-Depths falls out as a free architectural side effect rather than requiring a separate learned router.
Implementation is a drop-in replacement for standard residual connections. The centering, projection, and normalization ops are cheap relative to the attention and FFN compute they surround.
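A rough sketch of the phase-lane update helps show why the angular step does not depend on the update's size. It assumes the "controlled rotation" is a plain rotation by a given angle in the plane spanned by the phase and the cleaned update. How NAG actually chooses the angle, and how the norm lane is updated, are not covered, and the function name is hypothetical.

```python
import math


def nag_phase_update(phase, update, angle):
    """Rotate the unit-norm `phase` by `angle` radians toward the cleaned update.

    Steps from the text: center the update, remove its component parallel
    to the phase, normalize it, then rotate. `angle` is an assumed input.
    """
    mean = sum(update) / len(update)
    centered = [u - mean for u in update]                        # zero-mean
    parallel = sum(c * p for c, p in zip(centered, phase))
    ortho = [c - parallel * p for c, p in zip(centered, phase)]  # orthogonal part
    norm = math.sqrt(sum(o * o for o in ortho))
    if norm == 0.0:
        return list(phase)  # nothing orthogonal to add, direction unchanged
    unit = [o / norm for o in ortho]
    return [math.cos(angle) * p + math.sin(angle) * u
            for p, u in zip(phase, unit)]
```

Because the update is normalized before the rotation, scaling it by 1,000 gives the same new direction, and the result stays unit-norm.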

Results

Outperforms baseline Transformers at every width-depth ratio tested, with gains increasing for deeper models where norm inflation normally causes the worst degradation.
Makes Mixture-of-Depths practical as a pretraining technique, not just a fine-tuning trick.

🔄 Pretraining RNNs without Recurrence

What’s new

Supervised Memory Training (SMT) eliminates backpropagation through time for nonlinear RNNs entirely. Gradients travel one step instead of T steps, making training fully parallelizable with O(1) gradient paths.

How it works

The core problem: training nonlinear RNNs with BPTT means gradients must travel back through every timestep. That gives an O(T) gradient path that vanishes or explodes and is inherently sequential. Linear RNNs (Mamba, etc.) sidestep this but sacrifice expressiveness.
SMT breaks training into two subproblems. First, a Transformer encoder-decoder learns “memory labels”: what should the RNN’s hidden state be at each timestep? The encoder sees full past context and compresses it into a memory vector. A decoder predicts future tokens from that vector, forcing the encoder to keep only causally useful information.
Second, the RNN learns one-step transitions: given current memory m_t and next input x_{t+1}, predict m_{t+1}. Each transition is independent, so training parallelizes across all timesteps. No unrolling.
A DAgger phase fine-tunes the RNN on its own predicted states (not the Transformer labels) to correct compounding drift from the one-step approximation. This does unroll the RNN, but starts from a strong initialization so it converges fast.
The Transformer encoder is a training-time scaffold, discarded at inference. You end up with a pure RNN that is cheaper to run than the Transformer that trained it.
The predictive state objective naturally discovers structured memory: on finite-state tasks it collapses equivalent histories into the same state vector, recovering the minimal sufficient statistic. The RNN then learns transitions over this structured space rather than discovering representation from scratch.
Architecture-agnostic on the RNN side. Tested with vanilla nonlinear RNNs, GRUs, and others.
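The second subproblem is the part that removes recurrence from training, and a toy version shows why. In the sketch below the memory labels are assumed to be given (in SMT they come from the Transformer encoder), and the scalar tanh cell is a hypothetical stand-in for whatever RNN is being trained.

```python
import math


def one_step_examples(memory_labels, inputs):
    """Build independent ((m_t, x_{t+1}), m_{t+1}) pairs from teacher labels.

    memory_labels[t] is assumed to be the teacher's memory after inputs[:t + 1].
    """
    return [((memory_labels[t], inputs[t + 1]), memory_labels[t + 1])
            for t in range(len(memory_labels) - 1)]


def rnn_cell(m, x, w_m, w_x):
    """Toy scalar nonlinear transition (hypothetical, for illustration only)."""
    return math.tanh(w_m * m + w_x * x)


def one_step_loss(examples, w_m, w_x):
    """Mean squared error over transitions; no term depends on another."""
    errors = [(rnn_cell(m, x, w_m, w_x) - target) ** 2
              for (m, x), target in examples]
    return sum(errors) / len(errors)
```

Every example feeds the cell a teacher-provided state instead of the cell's own previous output, so the loss terms can be computed in any order or all at once. The DAgger phase, which does feed the cell its own states, is not shown.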

Results

Beats BPTT on all 5 synthetic tasks, with the gap widening on longer sequences where BPTT’s gradient flow fails. On pixel sequence generation BPTT collapses entirely while SMT stays stable.
Matches BPTT on language modeling (TinyStories) and outperforms the Transformer teacher on stack tracking at sequences longer than training length, showing genuine extrapolation.

🎯 Your Latent Reasoning is Secretly a Policy Improvement Operator

What’s new

Proves that each recursive step in Tiny Recursive Models is a KL-regularized policy improvement operation. Reframes test-time compute scaling as reinforcement learning and uses this to eliminate dead compute.

How it works

In a Tiny Recursive Model (TRM), the same transformer block runs in a loop, feeding output back as input. Standard explanation: more iterations = more effective depth. But the authors show many iterations are “dead compute”: the answer does not improve.
They prove that pre-reasoning output is a reference policy and post-reasoning output is an improved policy. Their log-ratio gives an advantage signal without needing an explicit value function. A recursive step is useful if and only if the ground-truth token gets above-average improvement (the Advantage Margin condition).
Deep Improvement Supervision (DIS): instead of training all recursive steps toward the same final target (which gives zero gradient to intermediate steps), generate progressively better intermediate targets. Step 1 targets a noisy/rough version, step 2 a slightly cleaner one, up to the ground truth.
The intermediate targets come from a discrete diffusion corruption schedule. Start from ground truth, apply decreasing noise at each step. Each recursive pass trains against its own appropriately corrupted target. Like denoising diffusion but for reasoning depth.
This guarantees every recursive step has positive expected advantage over the previous one, eliminating dead compute by construction.
Standard TRM uses T=3 external cycles with n=6 internal iterations (18 forward passes). DIS achieves the same accuracy with T=1, n=2 (2 passes total).
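The intermediate targets are the least obvious piece, so here is a small sketch of how such a schedule could look. It assumes a linear noise schedule and uniform random token replacement. The paper's actual corruption process may differ (masking, for instance), and `dis_targets` is a hypothetical name.

```python
import random


def dis_targets(ground_truth, num_steps, vocab_size, seed=0):
    """Return one target per recursive step, noisiest first, clean answer last.

    Assumed schedule: step s (1-based) replaces each token with a random one
    with probability 1 - s / num_steps, so the last step is the ground truth.
    """
    rng = random.Random(seed)
    targets = []
    for step in range(1, num_steps + 1):
        noise_rate = 1.0 - step / num_steps
        target = [rng.randrange(vocab_size) if rng.random() < noise_rate else tok
                  for tok in ground_truth]
        targets.append(target)
    return targets


# Four recursive steps over an 8-token answer with a 10-symbol vocabulary.
targets = dis_targets([1, 2, 3, 4, 5, 6, 7, 8], num_steps=4, vocab_size=10)
```

Each recursive pass is trained against its own entry in `targets`, so early passes are asked for a rough answer and only the last pass is asked for the exact one.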

Results

A 0.8M parameter model hits 24% on ARC-AGI-1, a visual reasoning benchmark where most billion-parameter LLMs score lower. 18x fewer forward passes than standard TRM training for the same accuracy.

🧪 Agents’ Last Exam

What’s new

A benchmark for whether AI agents can do real professional work, not just answer questions. 1,490 tasks across all 55 O*NET/SOC occupational subfields: CAM toolpaths, financial models, video edits, architectural renders, network configs.

How it works

Tasks come from 250+ domain experts submitting actual projects from professional practice that originally took days to weeks. Each task has a natural language description, input files, target software, an expected deliverable, and an evaluation spec.
Agents get a real VM with pre-installed software, read-only inputs, and a writable output directory. They must produce deliverables, not answers.
Evaluation is deterministic and automated: file diffs, numeric tolerance checks, geometric surface distance, behavioral state verification. LLM-as-judge only when unavoidable. A gate-and-score pattern is common: a binary precondition (file parses, no collision) must pass before quality is assessed.
90% of tasks are private with periodic rotation to prevent contamination.
Three difficulty tiers: Near-Term (feasible with current tools), Full-Spectrum (integrated multi-tool workflows), Last-Exam (hardest, long-horizon professional projects).
Agents need vision, CLI, tool use, and orchestration. Tested mainstream harnesses (Claude Code, Codex, Cursor, Gemini CLI) paired with frontier LLMs.
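The gate-and-score pattern is simple enough to sketch. This is an illustration only: the function names, the dictionary-shaped deliverable, and the choice to average the quality scores are assumptions, not the benchmark's actual evaluator.

```python
def gate_and_score(deliverable, gates, scorers):
    """Return 0.0 if any binary gate fails, else the mean of the quality scores."""
    if not all(gate(deliverable) for gate in gates):
        return 0.0
    return sum(scorer(deliverable) for scorer in scorers) / len(scorers)


def within_tolerance(key, expected, rel_tol):
    """Build a scorer: 1.0 if a numeric field is within relative tolerance."""
    def scorer(deliverable):
        actual = deliverable[key]
        return 1.0 if abs(actual - expected) <= rel_tol * abs(expected) else 0.0
    return scorer


# Hypothetical financial-model task: the file must parse, and NPV must be
# within 2% of the reference value.
gates = [lambda d: d["parses"]]
scorers = [within_tolerance("npv", 100.0, 0.02)]
score = gate_and_score({"parses": True, "npv": 101.0}, gates, scorers)
```

A deliverable that fails a gate scores zero regardless of how good the rest looks, which is what keeps partially working but unusable outputs from earning credit.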

Results

Best overall: Codex + GPT-5.5 at 24% full-pass rate. On the hardest tier, every agent scores 0-2.6%. The gap between beating coding benchmarks and doing professional work is enormous.
Harness choice matters as much as model: the same LLM with different agent wrappers produces substantially different scores. The agent scaffolding is a major variable, not just the base model.

🔧 Self-Harness: Harnesses That Improve Themselves

What’s new

An agent iteratively improves its own system prompt and scaffolding without a human engineer or a stronger external model. Fully bootstrapped self-improvement.

How it works

The “harness” is everything around the base model: system prompt, tool definitions, orchestration logic, failure recovery rules. Different models need different harnesses, and manual engineering does not scale.
Three-stage iterative loop. Stage 1 (Weakness Mining): run tasks under the current harness, analyze execution traces of failures to identify specific behavioral pathologies. Not “the task failed” but “I kept retrying the same command instead of trying an alternative.”
Stage 2 (Harness Proposal): for each weakness, generate targeted minimal edits. Key constraint is minimality: small changes tied to specific failures, not sweeping rewrites. Multiple candidate fixes per weakness for diversity.
Stage 3 (Proposal Validation): each edit is accepted only if it passes regression testing. Run on both failed tasks and previously passing tasks. Fixes that break existing successes are rejected. This prevents improvement on one task from degrading others.
Accepted edits become the new harness for the next iteration. The process discovers model-specific adaptations: MiniMax learns to create output artifacts early and cap tool messages. Qwen learns dependency prechecking. GLM learns persistent shell sessions. These are precise behavioral patches, not generic “try harder.”
Tested on three very different base models starting from the same minimal initial harness.
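Stage 3 is the part that keeps the loop from drifting, and it fits in a short function. In this sketch a harness is just a list of rule strings and `run_task` is a hypothetical evaluator. Requiring that an edit fix at least one previously failed task is an assumption about what "passes" means.

```python
def validate_edit(harness, edit, run_task, failed_tasks, passing_tasks):
    """Accept `edit` only if it breaks no passing task and fixes a failed one.

    run_task(harness, task) -> bool is a hypothetical evaluator.
    Returns (harness_to_use_next, accepted).
    """
    candidate = harness + [edit]
    if not all(run_task(candidate, task) for task in passing_tasks):
        return harness, False  # regression on an existing success: reject
    if not any(run_task(candidate, task) for task in failed_tasks):
        return harness, False  # fixes nothing: reject (assumed criterion)
    return candidate, True


def toy_run_task(harness, task):
    """Toy evaluator: a task passes if its needed rules are present and no
    rule it is allergic to has been added."""
    rules = set(harness)
    return task["needs"] <= rules and not (task["breaks_with"] & rules)
```

Accepted edits accumulate into the harness for the next iteration, and rejected ones leave it untouched.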

Results

Double-digit gains across all three models on held-out tasks: MiniMax 40.5% to 61.9%, Qwen 23.8% to 38.1%, GLM 42.9% to 57.1%. All from self-improvement, no human prompt tuning.
Each model discovers different harness changes, confirming that harness optimization is model-specific and automatable.

🎞️ A Frame is Worth One Token: DeltaTok

What’s new

Compresses the difference between consecutive video frames, measured in the feature space of a visual foundation model (VFM), into a single continuous token. 1,024x token reduction per frame.

How it works

Operates entirely in the frozen feature space of DINOv3 (ViT-B). No pixel reconstruction at all. The insight: consecutive frames in a VFM’s feature space are so redundant that their difference compresses to one token.
The tokenizer is a continuous autoencoder (no VAE, no quantization). The encoder takes patch tokens from two consecutive frames, adds learned per-frame embeddings, appends a single learnable query, and self-attention compresses everything into that one “delta token.”
The decoder takes the previous frame’s full patch tokens plus the delta token, uses zero-initialized queries, and reconstructs the current frame’s features. Initialization trick: no final LayerNorm + tiny LayerScale (10^-5) so at init the decoder approximates identity (predict no change). Free prior that nothing moved.
First frame is handled by prepending a black frame, so z_1 encodes absolute features. After that, autoregressive rollout is purely in delta-token space, and the decoder is invoked only when spatial features are actually needed downstream.
DeltaWorld (the generative world model) uses Best-of-Many training: sample K=256 noise queries per timestep, predict K candidate deltas, backprop only through the best one. At inference, different noise queries yield diverse futures in a single forward pass, no iterative denoising.
The predictor is trivially cheap because the sequence is 1D: one token per frame, standard causal mask, 1D RoPE.
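Best-of-Many training amounts to taking a minimum over candidate losses. The sketch below shows only that selection step, with a hypothetical `predict` callable and scalar noise standing in for the learned predictor and its noise queries. In real training, gradients would flow only through the winning candidate.

```python
import random


def best_of_many(predict, context, target, k, seed=0):
    """Return (best_loss, best_index) among k candidates from random noise.

    predict(context, noise) -> list of floats is a hypothetical predictor.
    """
    rng = random.Random(seed)
    best_loss, best_index = float("inf"), -1
    for i in range(k):
        noise = rng.gauss(0.0, 1.0)
        candidate = predict(context, noise)
        loss = sum((c - t) ** 2 for c, t in zip(candidate, target)) / len(target)
        if loss < best_loss:
            best_loss, best_index = loss, i
    return best_loss, best_index


def toy_predict(context, noise):
    """Toy predictor: shifts the context by the noise value."""
    return [c + noise for c in context]
```

Only the best candidate is penalized, so the other noise queries are free to cover different plausible futures. That is where the diversity at inference comes from.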

Results

Beats Cosmos-12B on scene understanding (Cityscapes 55.4 vs 53.3 mIoU, KITTI depth 3.88 vs 4.01 RMSE) using 35x fewer params and 2,000x fewer FLOPs. The inter-frame delta in feature space is genuinely low-dimensional enough for one token.

🖼️ Native-Resolution Image Synthesis (NiT)

What’s new

A diffusion transformer that generates images at any resolution and aspect ratio from a single model. No resizing, no cropping, no per-resolution fine-tuning.

How it works

Uses DC-AE (32x spatial downsampling, 32 latent channels) instead of SD-VAE (8x, 4 channels). Aggressive compression makes patch-size-1 practical, so each latent spatial position is one token. A 512x512 image becomes 256 tokens.
Multiple variable-length token sequences are packed into one flat sequence up to 131,072 tokens using a longest-pack-first algorithm. Small images pack densely, large ones take more space. FlashAttention-2’s cu_seqlens gives block-diagonal attention with zero padding overhead.
The key to resolution generalization: replaces DiT’s learned absolute positional embeddings with axial 2D RoPE. Height and width are encoded independently in separate halves of the embedding dimension. Since RoPE encodes relative positions, the model extrapolates to larger grids the same way LLMs extrapolate to longer contexts.
Packed-AdaLN handles conditioning: each image’s timestep+class vector is broadcast across all its tokens. Training mixes native-resolution images with fixed 256x256 and 512x512 copies. Flow matching with logit-normal time distribution.
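The token arithmetic and the packing step can be checked with a short sketch. The function names are hypothetical, and "longest-pack-first" is interpreted here as sorting by length and placing each sequence in the first pack with room, which may not match the paper's exact algorithm.

```python
def latent_tokens(height, width, downsample=32, patch=1):
    """Tokens for one image after the autoencoder and patchification."""
    return (height // (downsample * patch)) * (width // (downsample * patch))


def pack_longest_first(lengths, budget):
    """Greedy packing: take sequences longest first, first pack that fits.

    Returns a list of packs, each a list of sequence lengths.
    """
    packs = []
    for length in sorted(lengths, reverse=True):
        for pack in packs:
            if sum(pack) + length <= budget:
                pack.append(length)
                break
        else:
            packs.append([length])  # no existing pack had room
    return packs


def cu_seqlens(pack):
    """Cumulative sequence boundaries, the layout variable-length attention
    kernels use to keep packed sequences from attending to each other."""
    bounds = [0]
    for length in pack:
        bounds.append(bounds[-1] + length)
    return bounds
```

`latent_tokens(512, 512)` gives the 256 tokens quoted above (a 16x16 latent grid), and a 1024x1024 image gives 1,024. For comparison, an 8x autoencoder with patch size 2 (an assumed setting) would already need 1,024 tokens at 512x512.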

Results

Single model sets new best FID on ImageNet at both 256x256 (2.03) and 512x512 (1.45), first model to be SOTA at both resolutions simultaneously.
At resolutions never seen in training, NiT produces coherent images (1024x1024 FID 4.52) while competitors like EDM2-L collapse to FID 80+. Same story for extreme aspect ratios.

📐 NaViT: Patch n’ Pack

What’s new

Brings NLP-style sequence packing to vision transformers. Multiple images of different sizes and aspect ratios packed into a single sequence at native resolution. Foundational 2023 paper that influenced PaLI, Gemini, and most current multimodal models.

How it works

A greedy packing algorithm fills sequences with images until the token budget is full. Masked self-attention (block-diagonal) prevents cross-image interaction. Masked pooling extracts per-image representations. Padding waste under 2%.
Resolution is sampled per image during training from a truncated normal biased toward smaller sizes. NaViT sees about 5x more images per compute budget than fixed-resolution ViT because many small images are cheap. Variable resolution at train time is what gives generalization at test time.
Factorized positional embeddings replace ViT’s 1D learned positions: separate x and y embeddings combined by addition. Fractional variant normalizes to [0,1] by image dimension, handling arbitrary resolutions without interpolation artifacts.
Token dropping becomes more flexible: each image gets a different drop rate instead of one fixed rate for the batch. A sigmoid schedule over training starts with high drop rates (see more images, less detail) and decreases over time. Resolution-dependent rates also work: high-res images tolerate more dropping.
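The masking and pooling that make packing safe can be sketched without any framework. The helper names are hypothetical, and plain mean pooling is used as a simplified stand-in for NaViT's masked pooling.

```python
def image_ids(patch_counts, seq_len):
    """Per-token image id for a packed sequence; -1 marks padding."""
    ids = []
    for image, count in enumerate(patch_counts):
        ids.extend([image] * count)
    ids.extend([-1] * (seq_len - len(ids)))
    return ids


def block_diagonal_mask(ids):
    """mask[i][j] is True when tokens i and j belong to the same real image."""
    return [[a == b and a != -1 for b in ids] for a in ids]


def masked_mean_pool(tokens, ids, num_images):
    """Average token vectors per image, ignoring padding and other images."""
    dim = len(tokens[0])
    pooled = []
    for image in range(num_images):
        rows = [tok for tok, i in zip(tokens, ids) if i == image]
        pooled.append([sum(r[d] for r in rows) / len(rows) for d in range(dim)])
    return pooled


# Two images (2 and 3 patches) packed into a length-6 sequence with 1 pad token.
ids = image_ids([2, 3], 6)
mask = block_diagonal_mask(ids)
```

The mask is what lets unrelated images share one sequence: attention never crosses an image boundary, and padding tokens attend to nothing.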

Results

Matches the best ViT on JFT-4B pretraining with 4x less compute by eliminating padding waste and processing ~5x more images in the same FLOP budget.
+20 points on ImageNet-A (dominated by extreme aspect ratios that squashing to squares destroys) and +5 AP on LVIS object detection. Preserving native aspect ratio directly helps fine-grained recognition.

🛠️ Repos and Tools

🔎 turbovec - Rust/Python vector index with extreme compression. 31GB float32 corpora fit in about 4GB via 2-4-bit TurboQuant. Zero training phase, no k-means fitting. 10-19% faster than FAISS FastScan on ARM. Drop-in integrations for LangChain, LlamaIndex, Haystack. MIT license.
🥔 Kartoffelphon-2.5M - 2.59M German speech clips, about 7,000 hours, from CC/CC-BY podcasts, LibriVox, and lectures. Emilia-style pipeline with Whisper transcription and DNSMOS quality scoring. Broad stylistic diversity for robust German ASR/TTS training. CC-BY 4.0.

That is it for this issue. If you found this useful, share it with someone who thinks the only open-source voice model worth using is Whisper.
