Machine Learns #54
🤖 Voice models, long-context tricks, and a token-order loss worth trying
Flashy audio releases + 5 papers (MoC, TOP, FELLE, M2N2, Motif TR)
🤖 Model Releases
xingchensong/FlashCosyVoice
Lightweight vLLM implementation for CosyVoice aimed at efficient offline inference. Clean Python codebase with built-in speed optimizations and easy modification.
meituan-longcat/LongCat-Flash-Chat · Hugging Face
A 560B Mixture-of-Experts language model that activates only a context-relevant subset of parameters for more efficient training and inference in agentic tasks.
Marvis-Labs/marvis-tts
Conversational speech model for real-time voice cloning and TTS on consumer hardware—works from ~10 seconds of reference audio.
HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation
Multimodal diffusion framework for high-fidelity Foley generation via aligned video–audio representations.
microsoft/VibeVoice-1.5B · Hugging Face (technical report)
Framework for long-form, multi-speaker conversational audio from text, using continuous speech tokenizers and a next-token diffusion setup for scale and natural flow.
Motif-Technologies/Motif-2.6B · Hugging Face
A 2.6B-parameter language model trained from scratch on AMD Instinct™ MI250 GPUs, built for knowledge representation and creative workflows.
xai-org/grok-2 · Hugging Face
xAI's 2024 model, with weights now available; built to run on the SGLang engine, and the reference setup calls for 8 GPUs.
📎 Papers
Mixture of Contexts for Long Video Generation (MoC)
What’s new
Treats long-context video generation as internal retrieval.
Adds a learnable sparse attention router for long-term memory.
How it works
Each query token pulls only a few informative chunks plus mandatory anchors (e.g., captions).
Causal routing avoids loop closures; chunk granularity and selectivity anneal during training.
Mean-pooled descriptors enable efficient attention; drop-off/drop-in improve robustness; intra-shot links preserve continuity (see the sketch below).
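To make the routing concrete, here is a minimal PyTorch sketch of the core idea: score each query against mean-pooled chunk descriptors, keep the top-k chunks plus mandatory anchors, and attend only to tokens in the selected chunks. The chunk size, k, the anchor handling, and the chunk-level causal mask are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def mixture_of_contexts_attention(q, k, v, chunk_size=64, top_k=4, n_anchor_chunks=1):
    """Sketch of chunk-level sparse routing: each query attends only to its top-k
    chunks (scored against mean-pooled key descriptors) plus mandatory anchor chunks."""
    B, H, T, D = q.shape
    n_chunks = T // chunk_size  # assume T divisible by chunk_size for simplicity
    S = n_chunks * chunk_size

    # Mean-pooled descriptor per chunk: (B, H, n_chunks, D)
    descriptors = k[:, :, :S].view(B, H, n_chunks, chunk_size, D).mean(dim=3)

    # Router scores: query vs. chunk descriptors -> (B, H, T, n_chunks)
    scores = torch.einsum("bhtd,bhcd->bhtc", q, descriptors) / D**0.5

    # Chunk-level causal routing: a query may not select chunks that start after it.
    chunk_start = torch.arange(n_chunks, device=q.device) * chunk_size
    pos = torch.arange(T, device=q.device)
    causal_ok = chunk_start[None, None, None, :] <= pos[None, None, :, None]
    scores = scores.masked_fill(~causal_ok, float("-inf"))

    # Mandatory anchor chunks (e.g., the caption) are always selected.
    scores[..., :n_anchor_chunks] = float("inf")

    # Keep the top-k chunks per query; drop everything else and any non-causal pick.
    topk = scores.topk(top_k, dim=-1).indices
    chunk_mask = torch.zeros_like(scores, dtype=torch.bool).scatter_(-1, topk, True)
    chunk_mask &= causal_ok
    token_mask = chunk_mask.repeat_interleave(chunk_size, dim=-1)   # (B, H, T, S)

    # Standard attention restricted to the routed tokens.
    attn = torch.einsum("bhtd,bhsd->bhts", q, k[:, :, :S]) / D**0.5
    attn = attn.masked_fill(~token_mask, float("-inf"))
    return F.softmax(attn, dim=-1) @ v[:, :, :S]
```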
Results
Cuts attention FLOPs by >85% while maintaining or improving fidelity and consistency.
Stronger long multi-shot videos with more motion diversity.
Predicting the Order of Upcoming Tokens Improves Language Modeling (TOP)
What’s new
Token Order Prediction (TOP) adds a ranking-style auxiliary loss to standard next-token prediction (NTP).
How it works
Learns to order upcoming tokens within a window using a modified ListNet loss.
Needs only an extra unembedding (vs. multi-layer heads in MTP).
Final objective = NTP loss + TOP loss; the next token should receive the highest rank score (see the sketch below).
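For intuition, here is a rough PyTorch sketch of the auxiliary objective under my own assumptions about the target construction: the extra unembedding scores the vocabulary at each position, and a ListNet-style target puts more mass on tokens that appear sooner within the next `window` positions. The paper's exact relevance weighting may differ.

```python
import torch
import torch.nn.functional as F

def top_loss(hidden, top_unembed, input_ids, window=8):
    """ListNet-style Token Order Prediction loss (sketch). `top_unembed` is the extra
    unembedding layer; the (window - offset + 1) relevance weighting is an assumption."""
    B, T, _ = hidden.shape
    scores = top_unembed(hidden)                                  # (B, T, V) ranking scores
    V = scores.shape[-1]

    # Target relevance: a token appearing o steps ahead gets weight (window - o + 1),
    # so the immediate next token receives the highest relevance.
    relevance = torch.zeros(B, T, V, device=hidden.device)
    for o in range(1, window + 1):
        idx = input_ids[:, o:].unsqueeze(-1)                      # tokens o steps ahead
        src = torch.full_like(idx, float(window - o + 1), dtype=relevance.dtype)
        relevance[:, : T - o].scatter_reduce_(-1, idx, src, reduce="amax")

    # ListNet: cross-entropy between softmaxed relevance and softmaxed predicted scores.
    target = F.softmax(relevance, dim=-1)
    return -(target * F.log_softmax(scores, dim=-1)).sum(-1).mean()

# Final objective (sketch): total = standard next-token cross-entropy + top_loss(...)
```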
Results
Beats NTP and MTP on benchmarks (e.g., LAMBADA, HellaSwag) across 340M/1.8B/7B scales, lowering perplexity and improving accuracy.
FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching
What’s new
Blends language modeling with flow matching for continuous tokens (mel-spectrograms).
Introduces a coarse-to-fine flow-matching path.
How it works
A dynamic prior conditions each frame on preceding context.
C2F-FM generates low-res mel features, then refines details; accepts text + mel prompts.
Uses classifier-free guidance during training and inference (see the sketch below).
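Here is a rough idea of what one token-wise training step could look like, assuming a standard conditional flow-matching setup with a linear path; `model`, its call signature, and the way the dynamic prior is injected are my own placeholders rather than FELLE's exact design.

```python
import torch
import torch.nn.functional as F

def flow_matching_step(model, x1, prev_frame, cond, cfg_drop_p=0.1):
    """One token-wise flow-matching training step (sketch). The noise source is centred
    on the previous mel frame as a stand-in for FELLE's dynamic prior; `model` and its
    call signature are placeholders."""
    B = x1.shape[0]
    x0 = prev_frame + torch.randn_like(x1)        # dynamic prior: start near the preceding frame
    t = torch.rand(B, 1, device=x1.device)
    xt = (1 - t) * x0 + t * x1                    # linear interpolation path
    v_target = x1 - x0                            # target velocity field

    # Classifier-free guidance: drop the LM conditioning for a fraction of the batch.
    keep = (torch.rand(B, 1, device=x1.device) > cfg_drop_p).float()
    v_pred = model(xt, t, cond * keep)
    return F.mse_loss(v_pred, v_target)

# Coarse-to-fine: the same step would first run on a downsampled (coarse) mel target,
# whose output then conditions a second, fine-grained refinement pass (not shown).
```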
Results
Competitive WER vs. MELLE, with better similarity metrics.
Stronger naturalness and speaker similarity for continuation and cross-sentence tasks.
Competition and Attraction Improve Model Fusion (M2N2)
What’s new
Evolutionary model-merging method with dynamic boundaries and a diversity-preserving mechanism.
Adds an attraction score to pick complementary model pairs.
How it works
Iteratively merges two models with flexible split points; keeps an archive to widen exploration.
Competition promotes high performers; attraction steers toward promising pairs.
Removes hand-crafted parameter groupings to expand the search space; a simplified sketch of the loop follows.
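A toy version of the merge-and-select loop might look like the sketch below; the fitness and attraction functions, the interpolation-based merge, and the replacement rule are simplified placeholders rather than M2N2's actual formulation.

```python
import random
import torch

def merge_at_split(a, b, split_frac, mix):
    """Merge two flattened parameter vectors with a flexible boundary: parameters before
    the split point lean toward model `a`, parameters after it toward model `b`."""
    s = int(len(a) * split_frac)
    return torch.cat([mix * a[:s] + (1 - mix) * b[:s],
                      (1 - mix) * a[s:] + mix * b[s:]])

def evolve(archive, fitness_fn, attraction_fn, steps=100):
    """Minimal loop: the fittest archive member is paired with the partner it is most
    'attracted' to (most complementary), and the child replaces the weakest member if
    it is fitter. Placeholder logic, not the paper's exact rules."""
    for _ in range(steps):
        order = sorted(range(len(archive)), key=lambda i: fitness_fn(archive[i]), reverse=True)
        parent = archive[order[0]]
        partner = max(order[1:], key=lambda i: attraction_fn(parent, archive[i]))
        child = merge_at_split(parent, archive[partner],
                               split_frac=random.random(), mix=random.random())
        worst = order[-1]
        if fitness_fn(child) > fitness_fn(archive[worst]):
            archive[worst] = child
    return max(archive, key=fitness_fn)
```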
Results
Reaches high test accuracy when evolving models from scratch; also fuses specialized vision and language models effectively.
Merged models preserve capabilities that a single fitness function wouldn't capture.
Motif-2.6B Technical Report
What’s new
Combines Differential Attention (the difference of two attention maps) and PolyNorm activations in a decoder-only transformer (sketched below).
Targets strong performance with compute efficiency.
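Differential Attention is the easiest piece to picture in code: two attention maps computed from separate query/key projections are subtracted, which cancels noise common to both. A minimal single-head sketch follows; in the full DIFF Transformer formulation λ is learnable and there is extra per-head normalization, both simplified away here.

```python
import torch
import torch.nn.functional as F

def differential_attention(q1, k1, q2, k2, v, lam=0.5):
    """Single-head sketch of Differential Attention: the output is the difference of
    two softmax attention maps, which suppresses attention patterns shared by both.
    `lam` is a fixed scalar here; in the original formulation it is learnable."""
    d = q1.shape[-1]
    a1 = F.softmax(q1 @ k1.transpose(-2, -1) / d**0.5, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-2, -1) / d**0.5, dim=-1)
    return (a1 - lam * a2) @ v
```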
How it works
Pretrained on ~2.5T tokens with gradual mixture scheduling: general English → domain-specific (code, math).
Post-trained with data refinement (synthetic + quality filtering).
Alignment via DPO.
Results
Competitive or better than similarly sized models, especially on long-context and reasoning.
Reduces hallucinations; improves in-context learning.
My 2¢
I benchmarked PolyNorm and Differential Attention with BlaGPT; at NanoGPT scale they didn’t move the needle.
🧑‍💻 Open Source
narcotic-sh/senko — Very fast speaker diarization pipeline.
LukeGus/Termix — Web-based server management with SSH terminal, tunneling, and file editing.
j05u3/VTS — Lightning-fast transcription for macOS via OpenAI, Groq, and Deepgram.

