Machine Learns #63

New model releases and papers focusing on training tricks, scaling laws, dLLMs, post-training, and more...

Feb 04, 2026

🤖 Model Releases

💻 Qwen3-Coder-Next — 80B-parameter text generation model optimized for coding and language tasks.

🎭 MiniCPM-o 4.5 — 9B-parameter multimodal model for real-time, full-duplex audio and video processing with bilingual speech conversation and advanced visual capabilities.

🔊 MMS-300M Forced Aligner — Python package for efficient forced alignment of text and audio using Hugging Face pretrained models, with improved memory usage over TorchAudio.

🎬 LingBot-World — Open-source world simulator for video generation featuring high-fidelity environments, long-term memory, and real-time interactivity.

💻 Step 3.5 Flash — Open-source sparse MoE foundation model for efficient reasoning and agentic tasks, processing 100–300 tokens/sec with 256K context window.

🗣️ KugelAudio-0-Open — Open-source TTS model for European languages with voice cloning, using a 7B-parameter AR + Diffusion architecture trained on ~200K hours of speech.

👂 Qwen3-ASR — Open-source ASR series from Alibaba Cloud supporting multilingual speech, music, and song recognition with language detection and timestamp prediction.

🖼️ Z-Image — 6B-parameter single-stream diffusion transformer for efficient image generation, editing, and bilingual text rendering.

🎭 Kimi-K2.5 — Image-text-to-text model from Moonshot AI built on the Transformers library.

💻 Stable-DiffCoder-8B-Instruct — Code diffusion LLM built on Seed-Coder architecture with block diffusion continual pretraining for improved code generation, reasoning, and editing.

🗣️ LuxTTS — Lightweight TTS model for voice cloning achieving 150x+ realtime speed while fitting in 1GB VRAM.

🎬 Linum v2 — Open-weight 2B-parameter text-to-video model generating 2–5 second clips at up to 720p for experimentation in generative media.

👂 VibeVoice-ASR — Unified speech-to-text model processing up to 60 minutes of long-form audio in a single pass with speaker identification, timestamps, and user-customized context support.

📎 Papers

🧠 A Unified View of Attention and Residual Sinks: Outlier-Driven Rescaling is Essential for Transformer Training

What’s new

A framework explaining outlier-driven rescaling in transformers, identifying attention sinks and residual sinks as functional components rather than artifacts. Proposes GatedNorm and PreAffine methods to mitigate outliers while preserving performance.

How it works

Outliers interact with normalization mechanisms (softmax and RMSNorm) to rescale non-outlier components
Attention sinks: specific tokens receive disproportionately high attention scores
Residual sinks: fixed dimensions exhibit consistently high activations across tokens
GatedNorm: element-wise low-rank self-gating after normalization layers
PreAffine: learnable scaling vector before normalization to enable outlier-driven rescaling without large residual values
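
The rescaling effect described above can be seen with a few numbers. This is a minimal sketch, not the paper's code: the vectors, the sink magnitude of 50, and the hand-set (not learned) `pre_affine_weight` are all assumptions chosen for illustration.

```python
import math

def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned gain: divide by the root-mean-square."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Residual sink: one fixed dimension carries a very large activation.
plain = [1.0, -1.0, 0.5, 1.0]
sinked = [1.0, -1.0, 0.5, 50.0]
norm_plain = rms_norm(plain)
norm_sinked = rms_norm(sinked)   # the first three components shrink sharply

# PreAffine-style alternative (assumption: weights set by hand here):
# keep the residual small and apply the large scale just before the norm.
pre_affine_weight = [1.0, 1.0, 1.0, 50.0]
norm_pre_affine = rms_norm([v * w for v, w in zip(plain, pre_affine_weight)])

# Attention sink: a large logit on the first token drains attention mass
# from every other token through the softmax denominator.
attn_sinked = softmax([6.0, 1.0, 2.0, 0.5])
mass_on_content = sum(attn_sinked[1:])

print(norm_plain[0], norm_sinked[0], norm_pre_affine[0], mass_on_content)
```

In this toy setup the pre-scaled vector normalizes to the same values as the sinked one, while the residual itself never holds the value 50.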

Results

Improved training stability and performance across various models
Enhanced quantization robustness under aggressive low-bit settings
Gains in knowledge, reasoning, STEM, code generation, and multilingual tasks

🌫️ Causal Autoregressive Diffusion Language Model

What’s new

Unifying the training efficiency of autoregressive models with high-throughput inference of diffusion models.

How it works

Strictly causal attention mask reformulates the diffusion process
Shifted causal attention where each position predicts its original token from preceding noised context
Dense supervision for entire sequences in a single forward pass
Soft tail masking concentrates noise at the sequence tail
Context-aware reweighting adjusts loss weights based on local ambiguity
Dynamic parallel decoding with KV-caching generates variable-length sequences based on confidence
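
A rough sketch of how soft tail masking and shifted causal targets could fit together. The linear ramp in `soft_tail_mask_probs`, the `<mask>` token, and all function names are assumptions for illustration; the summary above does not give the paper's actual noise schedule.

```python
import random

MASK = "<mask>"

def soft_tail_mask_probs(length, max_prob=0.9):
    """Assumed schedule: mask probability ramps linearly toward the tail."""
    if length == 1:
        return [max_prob]
    return [max_prob * i / (length - 1) for i in range(length)]

def noise_sequence(tokens, rng, max_prob=0.9):
    probs = soft_tail_mask_probs(len(tokens), max_prob)
    return [MASK if rng.random() < p else tok for tok, p in zip(tokens, probs)]

def shifted_causal_examples(tokens, noised):
    """Position i predicts the clean token i from the noised tokens before i."""
    return [(noised[:i], tokens[i]) for i in range(len(tokens))]

rng = random.Random(0)
tokens = ["the", "cat", "sat", "on", "the", "mat"]
noised = noise_sequence(tokens, rng)
examples = shifted_causal_examples(tokens, noised)
for context, target in examples:
    print(context, "->", target)
```

Every position yields a training target from one noised sequence, which is the dense supervision the list refers to.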

Results

Outperforms existing discrete diffusion models by 5.7+ points in zero-shot accuracy
ARM-level data efficiency with 3× reduced training latency vs. block diffusion methods
Lowest zero-shot perplexity across multiple domains

🧠 Residual Context Diffusion Language Models

What’s new

A mechanism that recycles the computation spent on low-confidence tokens, instead of discarding it at every denoising step, to improve dLLM accuracy.

How it works

Transforms discarded token representations into contextual residuals for the next denoising step
Two-stage training: lightweight reference model generates reliable probability distributions; target model incorporates residuals using reference model as stable guide
Entropy-based embedding aggregation selects and aggregates context
Dynamically adjusts residual contribution based on normalized Shannon entropy
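
The entropy-weighted residual can be sketched as below. This is a toy under stated assumptions: the probability-weighted embedding, the blend with a mask embedding, and using the normalized entropy directly as the blend weight are illustrative choices, and the paper's exact mapping may differ.

```python
import math

def normalized_entropy(probs):
    """Shannon entropy divided by log(vocab size), so the result lies in [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))

def soft_embedding(probs, embedding_table):
    """Probability-weighted average of the vocabulary embeddings."""
    dim = len(embedding_table[0])
    return [sum(p * row[d] for p, row in zip(probs, embedding_table)) for d in range(dim)]

def residual_input(probs, embedding_table, mask_embedding, alpha):
    """Blend the mask embedding with the soft embedding of a discarded token.
    `alpha` is supplied by the caller because the mapping from entropy to
    blend weight is not specified in the summary above."""
    soft = soft_embedding(probs, embedding_table)
    return [(1.0 - alpha) * m + alpha * s for m, s in zip(mask_embedding, soft)]

embedding_table = [[0.0, 2.0], [2.0, 0.0]]   # toy 2-token vocabulary
mask_embedding = [0.0, 0.0]
probs = [0.5, 0.5]                            # maximally uncertain prediction
alpha = normalized_entropy(probs)             # assumed mapping: weight = entropy
mixed = residual_input(probs, embedding_table, mask_embedding, alpha)
print(alpha, mixed)
```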

Results

5–10 point accuracy improvement on various benchmarks with minimal extra computation
Nearly doubles baseline accuracy on challenging AIME tasks
Reduces denoising steps by 4–5× at equivalent accuracy levels

🧷 Scaling Embedding Layers in Language Models

What’s new

Scone (Scalable, Contextualized, Offloaded, N-gram Embedding), a method enhancing input embeddings without increasing decoding costs.

How it works

Retains original vocabulary while adding embeddings for frequent n-grams
Separate transformer (f-gram model) learns contextualized representations
Embeddings precomputed and stored in off-accelerator memory
Avoids sparse update problem by parameterizing embeddings with f-gram model
F-gram layer can be offloaded, maintaining fixed accelerator resources during inference
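
At inference time the idea reduces to a table lookup. The sketch below assumes the f-gram embeddings were already precomputed and uses a plain dict as the off-accelerator store; the function name, the `max_n` limit, and the choice to let the n-gram embedding replace (rather than augment) the token embedding are assumptions.

```python
def lookup_input_embeddings(tokens, token_embeddings, fgram_embeddings, max_n=3):
    """For each position, use the embedding of the longest frequent n-gram
    that ends there, falling back to the plain token embedding."""
    outputs = []
    for i in range(len(tokens)):
        chosen = token_embeddings[tokens[i]]
        for n in range(max_n, 1, -1):              # try the longest n-gram first
            if i + 1 >= n:
                key = tuple(tokens[i + 1 - n : i + 1])
                if key in fgram_embeddings:
                    chosen = fgram_embeddings[key]
                    break
        outputs.append(chosen)
    return outputs

# Toy tables. In the described setup the f-gram table would be produced by
# the f-gram model ahead of time and kept outside accelerator memory.
token_embeddings = {"new": [1.0, 0.0], "york": [0.0, 1.0], "city": [0.5, 0.5]}
fgram_embeddings = {("new", "york"): [2.0, 2.0], ("new", "york", "city"): [3.0, 3.0]}

embedded = lookup_input_embeddings(["new", "york", "city"], token_embeddings, fgram_embeddings)
print(embedded)
```

Because the lookup is a dictionary read, growing the n-gram table adds host memory but no accelerator compute per decoded token.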

Results

1B accelerator-resident parameter model outperforms 1.9B baseline
Uses approximately half the FLOPs and accelerator memory during inference

🏗️ ConceptMoE: Adaptive Token-to-Concept Compression

What’s new

Adaptive token-to-concept compression for implicit compute allocation, dynamically merging semantically similar tokens within MoE architectures.

How it works

Learnable chunk module identifies boundaries based on inter-token similarity
Consecutive high-similarity tokens merge into concept representations
MoE architecture enables controlled evaluation by reallocating saved computation
Minimal architectural changes for straightforward integration
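
The merging step can be illustrated with a fixed similarity threshold. Note the swap: the paper describes a learnable chunk module, while this sketch uses a hand-picked cosine threshold and mean pooling, both assumptions.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def merge_into_concepts(token_vectors, threshold=0.9):
    """Group consecutive tokens whose adjacent cosine similarity meets the
    threshold, then mean-pool each group into one concept vector."""
    chunks = [[token_vectors[0]]]
    for prev, cur in zip(token_vectors, token_vectors[1:]):
        if cosine(prev, cur) >= threshold:
            chunks[-1].append(cur)      # same concept: extend the chunk
        else:
            chunks.append([cur])        # boundary: start a new chunk
    dim = len(token_vectors[0])
    return [[sum(v[d] for v in chunk) / len(chunk) for d in range(dim)] for chunk in chunks]

tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.0, 1.0]]
concepts = merge_into_concepts(tokens)
print(len(tokens), "tokens ->", len(concepts), "concepts")
```

Downstream layers then process the shorter concept sequence, which is where the saved compute comes from.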

Results

+0.9 points on language pretraining, +2.3 on long context understanding, +0.6 on multimodal benchmarks
+5.5 points when converting pretrained MoE during continual training
Prefill speedups up to 175%, decoding speedups up to 117%

🏗️ TEON: Tensorized Orthonormalization Beyond Layer-Wise Muon

What’s new

A generalization of the Muon optimizer extending orthogonalization beyond individual layers for LLM pre-training.

How it works

Models gradients as structured higher-order tensors
Performs matrix-level gradient orthogonalization across layers simultaneously
Provides improved convergence guarantees over layer-wise Muon
Robust under different approximate SVD schemes
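
One way to picture the cross-layer step: stack same-shaped layer gradients, unfold the stack into a single wide matrix, orthogonalize once, and fold back. This is a sketch under assumptions: the column-wise unfolding and the helper names are illustrative, and it uses the classical cubic Newton-Schulz iteration in pure Python rather than the tuned variant that Muon implementations typically use.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def orthogonalize(g, steps=30):
    """Approximate the orthogonal polar factor of `g` with the classical
    cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X."""
    norm = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / norm for v in row] for row in g]       # singular values now <= 1
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(ra, rb)] for ra, rb in zip(x, xxtx)]
    return x

def joint_orthogonalize(layer_grads):
    """Unfold a stack of same-shaped layer gradients into one wide matrix
    (assumed unfolding: concatenate along columns), orthogonalize it once,
    and fold the result back into per-layer updates."""
    rows, cols = len(layer_grads[0]), len(layer_grads[0][0])
    wide = [sum((g[r] for g in layer_grads), []) for r in range(rows)]
    ortho = orthogonalize(wide)
    return [[row[k * cols:(k + 1) * cols] for row in ortho] for k in range(len(layer_grads))]

grads = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [-1.0, 0.0]]]   # two toy layers
updates = joint_orthogonalize(grads)
print(updates)
```

Layer-wise Muon would instead call `orthogonalize` on each gradient separately; here the rows of the concatenated update are orthonormal across both layers at once.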

Results

Evaluated on GPT-style (130M–774M) and LLaMA-style (60M–1B) models
Consistently improves training and validation perplexity across scales

🔊 High-Fidelity Generative Audio Compression at 0.275kbps

What’s new

Achieving high-fidelity audio at ultra-low bitrates (0.275kbps), shifting from signal fidelity to task-oriented effectiveness.

How it works

Integrates semantic understanding at the transmitter with generative synthesis at the receiver
Two-stage process: learns compressed semantic representation aligned with linguistic supervision, then recovers high-fidelity audio via large generative model
Encoder filters redundancy, transmitting only semantic essence
Decoder reconstructs details from model priors

Results

High-fidelity 32kHz audio reconstruction at 0.275kbps with 3000× compression ratio
Outperforms SOTA neural codecs in perceptual quality and semantic consistency
Maintains intelligible transmission even at 0.175kbps

🔊 SemanticAudio: Audio Generation and Editing in Semantic Space

What’s new

A two-stage Flow Matching framework (Semantic Planner + Acoustic Synthesizer) for audio generation and editing in high-level semantic space with training-free text-guided editing.

How it works

Semantic Planner: generates compact semantic features from text using dual inputs (global sentence embedding + token-level embeddings)
Acoustic Synthesizer: produces high-fidelity acoustic latents conditioned on semantic features
Editing: delta velocity fields from source/target prompts enable semantic-level modifications without retraining
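
The editing idea can be sketched with a toy velocity field. Everything here is hypothetical: `toy_velocity` stands in for the learned Flow Matching model, the prompt embeddings are made-up 2-D vectors, and plain Euler integration of the velocity difference is an assumed reading of "delta velocity fields".

```python
def toy_velocity(x, prompt_embedding):
    """Hypothetical stand-in for a learned velocity field: flow toward the prompt."""
    return [p - v for v, p in zip(x, prompt_embedding)]

def edit_with_delta_velocity(x, source_prompt, target_prompt, steps=10):
    """Euler-integrate the difference between target- and source-conditioned
    velocities, starting from the existing semantic features `x`."""
    dt = 1.0 / steps
    for _ in range(steps):
        v_src = toy_velocity(x, source_prompt)
        v_tgt = toy_velocity(x, target_prompt)
        x = [xi + dt * (vt - vs) for xi, vt, vs in zip(x, v_tgt, v_src)]
    return x

features = [0.2, 0.8]   # semantic features of the audio being edited
edited = edit_with_delta_velocity(features, source_prompt=[0.0, 1.0], target_prompt=[1.0, 0.0])
print(edited)
```

No weights change during the edit; only the conditioning differs between the two velocity evaluations, which is what makes the procedure training-free.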

Results

Superior semantic alignment (CLAP score 0.354)
High reconstruction fidelity with low Mel and STFT loss
Robust editing capabilities even without source text

🎬 Memory-V2V: Augmenting Video-to-Video Diffusion Models with Memory

What’s new

A framework for multi-turn video editing with cross-consistency, using explicit memory to enhance video-to-video diffusion models.

How it works

Lightweight memory modules integrate into V2V models
Dynamic tokenization with varying kernel sizes based on edit relevance
Retrieval mechanism identifies relevant past edits from external cache
Learnable token compressor reduces redundancy while preserving visual cues (30% speedup)

Results

Strong cross-iteration consistency in novel view synthesis and text-guided long video editing
Outperforms SOTA baselines in visual quality and computational efficiency

🔁 Self-Distillation Enables Continual Learning

What’s new

Self-Distillation Fine-Tuning for continual learning, enabling on-policy learning directly from expert demonstrations.

How it works

Single model serves as both teacher and student
Teacher is conditioned on the task prompt plus an example demonstration; student sees only the prompt
Training minimizes reverse KL divergence between teacher and student outputs
Model learns from its own generated trajectories while preserving prior capabilities
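
For reference, the reverse KL term for a single next-token distribution looks like this. The logits are invented for illustration; in the described setup both distributions would come from the same model under two different contexts.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher): the expectation is taken under the student's
    own distribution, which is what makes the objective on-policy."""
    return sum(s * math.log(s / t) for s, t in zip(student_probs, teacher_probs) if s > 0.0)

# Same model, two contexts (made-up logits): the teacher also sees a demonstration.
student = softmax([2.0, 1.0, 0.1])    # prompt only
teacher = softmax([3.0, 0.5, 0.1])    # prompt + demonstration
loss = reverse_kl(student, teacher)
print(loss)
```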

Results

Higher new-task accuracy and better retention than SFT
Single model acquires multiple skills sequentially without performance degradation
Outperforms baselines on both in-distribution and out-of-distribution tasks

🔁 Teaching Models to Teach Themselves

What’s new

SOAR, a self-improvement framework using meta-RL where a teacher model generates automated curricula for problems the student cannot yet solve.

How it works

Bi-level optimization with an outer Teacher loop and an inner Student loop
Teacher proposes synthetic problems that the Student trains on in the inner loop
After N inner-loop steps, Teacher is rewarded based on the Student’s improvement, so the Teacher learns to teach in the outer loop
Both stages use RLOO for optimization
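
RLOO (REINFORCE Leave-One-Out) scores each sample against the mean reward of the other samples for the same prompt. A minimal sketch follows; `teacher_reward` is an assumed form of the outer-loop signal, since the summary only says the Teacher is rewarded based on Student improvement.

```python
def rloo_advantages(rewards):
    """Leave-one-out baseline: each sample is compared with the mean
    reward of the other samples drawn for the same prompt."""
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

def teacher_reward(student_score_before, student_score_after):
    """Assumed outer-loop signal: how much the Student improved."""
    return student_score_after - student_score_before

advantages = rloo_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

The baseline needs no learned value function, and it requires at least two samples per prompt.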

Results

Bi-level meta-RL facilitates learning despite sparse rewards
Structural quality of generated questions more important than solution correctness

🔁 Recursive Self-Aggregation Unlocks Deep Thinking in LLMs

What’s new

Combining parallel and sequential scaling methods to improve LLM reasoning.

How it works

Maintains population of candidate solutions at each step
Aggregates subsets to iteratively produce improved solutions
Inspired by evolutionary algorithms, enables model to revisit and correct reasoning
RL training teaches effective solution aggregation
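
The population loop can be sketched with numbers standing in for candidate solutions. `aggregate` is a hypothetical stand-in for the LLM call that merges several candidates into one; here it simply takes the median, and the population, subset size, and step count are arbitrary.

```python
import random
import statistics

def aggregate(candidates):
    """Hypothetical stand-in for an LLM call that merges several candidate
    solutions into one improved solution; here it takes the median."""
    return statistics.median(candidates)

def recursive_self_aggregation(population, subset_size, steps, rng):
    """Each step rebuilds the whole population from aggregated random subsets."""
    for _ in range(steps):
        population = [
            aggregate(rng.sample(population, subset_size))
            for _ in range(len(population))
        ]
    return population

rng = random.Random(0)
initial = [3.0, 41.0, 42.0, 42.0, 43.0, 90.0, 42.0, 40.0]   # noisy numeric "answers"
final = recursive_self_aggregation(initial, subset_size=3, steps=5, rng=rng)
print(final)
```

Sampling many candidates is the parallel axis and repeating the aggregation is the sequential axis, which is how the method combines both kinds of test-time scaling.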

Results

Significant performance improvements across tasks vs. traditional methods
Bridges gap between smaller and larger reasoning models

🔁 Training-Free Group Relative Policy Optimization

What’s new

A method enhancing LLM performance without parameter updates by leveraging experiential knowledge as token priors.

How it works

Uses group-based rollouts to distill semantic advantages from multiple outputs
Maintains frozen model; updates external experiential knowledge library
Each step: generate outputs, score them, extract semantic advantages based on performance
Adapts to new scenarios with minimal training data
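
One step of the loop might look like the sketch below. All names are hypothetical: `frozen_model` returns canned answers instead of calling an LLM, and `summarize_advantage` stands in for the LLM step that turns a scored group into a natural-language lesson. Skipping groups where every output scores the same is an assumption borrowed from how group-relative advantages vanish in standard GRPO.

```python
def frozen_model(prompt, experiences, sample_id):
    """Hypothetical stand-in for a frozen LLM; returns a canned answer."""
    answers = ["x = 4", "x = 5", "x = 4", "x = -4"]
    return answers[sample_id % len(answers)]

def score(output, reference):
    return 1.0 if output == reference else 0.0

def summarize_advantage(outputs, rewards):
    """Stand-in for the step that writes a lesson from a scored group."""
    best = outputs[rewards.index(max(rewards))]
    worst = outputs[rewards.index(min(rewards))]
    return f"Prefer reasoning that leads to '{best}' over '{worst}'."

def training_free_step(prompt, reference, experiences, group_size=4):
    """Roll out a group, score it, and grow the experience library.
    The model's parameters are never touched."""
    outputs = [frozen_model(prompt, experiences, i) for i in range(group_size)]
    rewards = [score(o, reference) for o in outputs]
    if max(rewards) > min(rewards):     # assumed: a tied group carries no signal
        experiences = experiences + [summarize_advantage(outputs, rewards)]
    return experiences

library = training_free_step("Solve 2x = 8", "x = 4", experiences=[])
print(library)
```

On later prompts the library would be placed in the model's context, so the "policy update" lives entirely in text.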

Results

Significant improvements in mathematical reasoning and web searching
Outperforms fine-tuned models with fewer samples and lower costs
Strong performance on AIME24, AIME25, and WebWalkerQA

🧑‍💻 Open Source

FastGen — NVIDIA library for fast generation from diffusion models.

Enjoyed this issue? Send it to a friend who’d appreciate it.
