Machine Learns #54
🤖 Voice models, long-context tricks, and a token-order loss worth trying
Flashy audio releases + 5 papers (MoC, TOP, FELLE, M2N2, Motif TR)
🤖 Model Releases
xingchensong/FlashCosyVoice
Lightweight vLLM implementation for CosyVoice aimed at efficient offline inference. Clean Python codebase with built-in speed optimizations and easy modification.
meituan-longcat/LongCat-Flash-Chat · Hugging Face
A 560B Mixture-of-Experts language model that activates only a context-relevant subset of parameters for more efficient training and inference in agentic tasks.
Marvis-Labs/marvis-tts
Conversational speech model for real-time voice cloning and TTS on consumer hardware—works from ~10 seconds of reference audio.
HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation
Multimodal diffusion framework for high-fidelity Foley generation via aligned video–audio representations.
microsoft/VibeVoice-1.5B · Hugging Face (technical report)
Framework for long-form, multi-speaker conversational audio from text, using continuous speech tokenizers and a next-token diffusion setup for scale and natural flow.
Motif-Technologies/Motif-2.6B · Hugging Face
A 2.6B-parameter language model trained from scratch on AMD Instinct™ MI250 GPUs, built for knowledge representation and creative workflows.
xai-org/grok-2 · Hugging Face
xAI's 2024 model, with weights now available; built to run on the SGLang engine, and the reference setup calls for 8 GPUs.
📎 Papers
Mixture of Contexts for Long Video Generation (MoC)
What’s new
Treats long-context video generation as internal retrieval.
Adds a learnable sparse attention router for long-term memory.
How it works
Each query token pulls only a few informative chunks plus mandatory anchors (e.g., captions).
Causal routing avoids loop closures; chunk granularity and selectivity anneal during training.
Mean-pooled descriptors enable efficient attention; drop-off/drop-in improve robustness; intra-shot links preserve continuity (see the sketch below).
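To make the routing concrete, here is a minimal PyTorch sketch of the core idea: score each query against mean-pooled chunk descriptors, keep the top-k chunks plus mandatory anchors, and attend only to tokens in the selected chunks. The chunk size, k, the anchor handling, and the chunk-level causal mask are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def mixture_of_contexts_attention(q, k, v, chunk_size=64, top_k=4, n_anchor_chunks=1):
    """Sketch of chunk-level sparse routing: each query attends only to its top-k
    chunks (scored against mean-pooled key descriptors) plus mandatory anchor chunks."""
    B, H, T, D = q.shape
    n_chunks = T // chunk_size  # assume T divisible by chunk_size for simplicity
    S = n_chunks * chunk_size

    # Mean-pooled descriptor per chunk: (B, H, n_chunks, D)
    descriptors = k[:, :, :S].view(B, H, n_chunks, chunk_size, D).mean(dim=3)

    # Router scores: query vs. chunk descriptors -> (B, H, T, n_chunks)
    scores = torch.einsum("bhtd,bhcd->bhtc", q, descriptors) / D**0.5

    # Chunk-level causal routing: a query may not select chunks that start after it.
    chunk_start = torch.arange(n_chunks, device=q.device) * chunk_size
    pos = torch.arange(T, device=q.device)
    causal_ok = chunk_start[None, None, None, :] <= pos[None, None, :, None]
    scores = scores.masked_fill(~causal_ok, float("-inf"))

    # Mandatory anchor chunks (e.g., the caption) are always selected.
    scores[..., :n_anchor_chunks] = float("inf")

    # Keep the top-k chunks per query; drop everything else and any non-causal pick.
    topk = scores.topk(top_k, dim=-1).indices
    chunk_mask = torch.zeros_like(scores, dtype=torch.bool).scatter_(-1, topk, True)
    chunk_mask &= causal_ok
    token_mask = chunk_mask.repeat_interleave(chunk_size, dim=-1)   # (B, H, T, S)

    # Standard attention restricted to the routed tokens.
    attn = torch.einsum("bhtd,bhsd->bhts", q, k[:, :, :S]) / D**0.5
    attn = attn.masked_fill(~token_mask, float("-inf"))
    return F.softmax(attn, dim=-1) @ v[:, :, :S]
```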
Results
Cuts attention FLOPs by >85% while maintaining or improving fidelity and consistency.
Stronger long multi-shot videos with more motion diversity.
Predicting the Order of Upcoming Tokens Improves Language Modeling (TOP)
What’s new
Token Order Prediction (TOP) adds a ranking-style auxiliary loss to standard next-token prediction (NTP).
How it works
Learns to order upcoming tokens within a window using a modified ListNet loss.
Needs only an extra unembedding (vs. multi-layer heads in MTP).
Final objective = NTP loss + TOP loss; the next token should receive the highest rank score (see the sketch below).
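For intuition, here is a rough PyTorch sketch of the auxiliary objective under my own assumptions about the target construction: the extra unembedding scores the vocabulary at each position, and a ListNet-style target puts more mass on tokens that appear sooner within the next `window` positions. The paper's exact relevance weighting may differ.

```python
import torch
import torch.nn.functional as F

def top_loss(hidden, top_unembed, input_ids, window=8):
    """ListNet-style Token Order Prediction loss (sketch). `top_unembed` is the extra
    unembedding layer; the (window - offset + 1) relevance weighting is an assumption."""
    B, T, _ = hidden.shape
    scores = top_unembed(hidden)                                  # (B, T, V) ranking scores
    V = scores.shape[-1]

    # Target relevance: a token appearing o steps ahead gets weight (window - o + 1),
    # so the immediate next token receives the highest relevance.
    relevance = torch.zeros(B, T, V, device=hidden.device)
    for o in range(1, window + 1):
        idx = input_ids[:, o:].unsqueeze(-1)                      # tokens o steps ahead
        src = torch.full_like(idx, float(window - o + 1), dtype=relevance.dtype)
        relevance[:, : T - o].scatter_reduce_(-1, idx, src, reduce="amax")

    # ListNet: cross-entropy between softmaxed relevance and softmaxed predicted scores.
    target = F.softmax(relevance, dim=-1)
    return -(target * F.log_softmax(scores, dim=-1)).sum(-1).mean()

# Final objective (sketch): total = standard next-token cross-entropy + top_loss(...)
```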
Results
Beats NTP and MTP on benchmarks (e.g., LAMBADA, HellaSwag) across 340M/1.8B/7B scales, lowering perplexity and improving accuracy.
FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching
What’s new
Blends language modeling with flow matching for continuous tokens (mel-spectrograms).
Introduces a coarse-to-fine flow-matching path.
How it works
A dynamic prior conditions each frame on preceding context.
C2F-FM generates low-res mel features, then refines details; accepts text + mel prompts.
Uses classifier-free guidance during training and inference (see the sketch below).
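Here is a rough idea of what one token-wise training step could look like, assuming a standard conditional flow-matching setup with a linear path; `model`, its call signature, and the way the dynamic prior is injected are my own placeholders rather than FELLE's exact design.

```python
import torch
import torch.nn.functional as F

def flow_matching_step(model, x1, prev_frame, cond, cfg_drop_p=0.1):
    """One token-wise flow-matching training step (sketch). The noise source is centred
    on the previous mel frame as a stand-in for FELLE's dynamic prior; `model` and its
    call signature are placeholders."""
    B = x1.shape[0]
    x0 = prev_frame + torch.randn_like(x1)        # dynamic prior: start near the preceding frame
    t = torch.rand(B, 1, device=x1.device)
    xt = (1 - t) * x0 + t * x1                    # linear interpolation path
    v_target = x1 - x0                            # target velocity field

    # Classifier-free guidance: drop the LM conditioning for a fraction of the batch.
    keep = (torch.rand(B, 1, device=x1.device) > cfg_drop_p).float()
    v_pred = model(xt, t, cond * keep)
    return F.mse_loss(v_pred, v_target)

# Coarse-to-fine: the same step would first run on a downsampled (coarse) mel target,
# whose output then conditions a second, fine-grained refinement pass (not shown).
```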
Results
Competitive WER vs. MELLE, with better similarity metrics.
Stronger naturalness and speaker similarity for continuation and cross-sentence tasks.
Competition and Attraction Improve Model Fusion (M2N2)
What’s new
Evolutionary model-merging method with dynamic boundaries and a diversity-preserving mechanism.
Adds an attraction score to pick complementary model pairs.
How it works
Iteratively merges two models with flexible split points; keeps an archive to widen exploration.
Competition promotes high performers; attraction steers toward promising pairs.
Removes hand-crafted parameter groupings to expand the search space; a simplified sketch of the loop follows.
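A toy version of the merge-and-select loop might look like the sketch below; the fitness and attraction functions, the interpolation-based merge, and the replacement rule are simplified placeholders rather than M2N2's actual formulation.

```python
import random
import torch

def merge_at_split(a, b, split_frac, mix):
    """Merge two flattened parameter vectors with a flexible boundary: parameters before
    the split point lean toward model `a`, parameters after it toward model `b`."""
    s = int(len(a) * split_frac)
    return torch.cat([mix * a[:s] + (1 - mix) * b[:s],
                      (1 - mix) * a[s:] + mix * b[s:]])

def evolve(archive, fitness_fn, attraction_fn, steps=100):
    """Minimal loop: the fittest archive member is paired with the partner it is most
    'attracted' to (most complementary), and the child replaces the weakest member if
    it is fitter. Placeholder logic, not the paper's exact rules."""
    for _ in range(steps):
        order = sorted(range(len(archive)), key=lambda i: fitness_fn(archive[i]), reverse=True)
        parent = archive[order[0]]
        partner = max(order[1:], key=lambda i: attraction_fn(parent, archive[i]))
        child = merge_at_split(parent, archive[partner],
                               split_frac=random.random(), mix=random.random())
        worst = order[-1]
        if fitness_fn(child) > fitness_fn(archive[worst]):
            archive[worst] = child
    return max(archive, key=fitness_fn)
```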
Results
Reaches high test accuracy when evolving models from scratch; also fuses specialized vision and language models effectively.
Merged models preserve capabilities that a single fitness function wouldn't capture.
Motif-2.6B Technical Report
What’s new
Combines Differential Attention (the difference of two attention maps) and PolyNorm activations in a decoder-only transformer (sketched below).
Targets strong performance with compute efficiency.
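Differential Attention is the easiest piece to picture in code: two attention maps computed from separate query/key projections are subtracted, which cancels noise common to both. A minimal single-head sketch follows; in the full DIFF Transformer formulation λ is learnable and there is extra per-head normalization, both simplified away here.

```python
import torch
import torch.nn.functional as F

def differential_attention(q1, k1, q2, k2, v, lam=0.5):
    """Single-head sketch of Differential Attention: the output is the difference of
    two softmax attention maps, which suppresses attention patterns shared by both.
    `lam` is a fixed scalar here; in the original formulation it is learnable."""
    d = q1.shape[-1]
    a1 = F.softmax(q1 @ k1.transpose(-2, -1) / d**0.5, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-2, -1) / d**0.5, dim=-1)
    return (a1 - lam * a2) @ v
```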
How it works
Pretrained on ~2.5T tokens with gradual mixture scheduling: general English → domain-specific (code, math).
Post-trained with data refinement (synthetic + quality filtering).
Alignment via DPO.
Results
Competitive or better than similarly sized models, especially on long-context and reasoning.
Reduces hallucinations; improves in-context learning.
My 2¢
I benchmarked PolyNorm and Differential Attention with BlaGPT; at NanoGPT scale they didn’t move the needle.
🧑‍💻 Open Source
narcotic-sh/senko — Very fast speaker diarization pipeline.
LukeGus/Termix — Web-based server management with SSH terminal, tunneling, and file editing.
j05u3/VTS — Lightning-fast transcription for macOS via OpenAI, Groq, and Deepgram.

