Machine Learns #60
TurboDiffusion hits 205× faster, Ultravox v0.7's 355B speech model, Live Avatar goes real-time, and papers on dual-stream speech codecs, adversarial flow models, and more...
🤖 Model Releases
🌐 browser-use/bu-30b-a3b-preview — BU-30B-A3B-Preview is a 30B MoE browser agent model with enhanced DOM understanding and visual reasoning, designed to run on a single GPU with native browser-use OSS library integration.
🎭 meituan-longcat/LongCat-Video-Avatar — LongCat-Video-Avatar is a unified model for audio-driven character animation, supporting audio-text-to-video and video continuation with single and multi-stream audio inputs.
🧊 microsoft/TRELLIS.2-4B — TRELLIS.2-4B is a 4B-parameter 3D generative model for image-to-3D generation using sparse voxel structures and flow-matching transformers.
🎵 SAM Audio — Meta’s SAM Audio uses text prompts to separate target sounds from audio or audiovisual sources, enabling isolation of specific audio elements through span prompting and interactive selection.
💬 allenai/Bolmo-7B — Bolmo is a byte-level autoregressive language model from AI2, available in 1B and 7B parameter scales for research and educational use.
🎬 thu-ml/TurboDiffusion — TurboDiffusion accelerates video diffusion models 100–205× on a single RTX 5090 with minimal quality loss.
🖼️ apple/Sharp — SHARP enables photorealistic view synthesis from a single image using 3D Gaussian representation with real-time rendering on standard GPUs.
🗣️ Chatterbox Turbo — A 350M-parameter TTS model designed for low-latency voice applications, featuring a distilled audio diffusion decoder and paralinguistic tag support.
🗣️ FunAudioLLM/Fun-CosyVoice3-0.5B — Fun-CosyVoice 3.0 is an LLM-based TTS system for zero-shot multilingual speech synthesis with pronunciation inpainting and bi-streaming capabilities.
💬 nvidia/gpt-oss-120b-Eagle3 — An EAGLE-3 speculative-decoding release from NVIDIA for the 120B-parameter gpt-oss mixture-of-experts model, optimized for high-concurrency inference.
🧮 NousResearch/nomos-1 — Nomos 1 specializes in mathematical problem-solving and proof-writing in natural language, developed in collaboration with Hillclimb AI.
🗣️ zai-org/GLM-TTS — GLM-TTS is a two-stage TTS system using LLMs for speech token generation and Flow Matching for waveform synthesis, supporting zero-shot voice cloning.
🗣️ openbmb/VoxCPM1.5 — VoxCPM1.5 is an end-to-end diffusion autoregressive TTS system for continuous speech generation with high-quality voice cloning.
🖼️ ByteDance-Seed/Adversarial-Flow-Models — Pretrained ImageNet-256px models unifying adversarial and flow-based generative approaches.
👁️ zai-org/GLM-4.6V Collection — GLM-4.6V is a vision-language model collection for image-text tasks, available in 10B and 108B parameter configurations.
🎬 vita-video-gen/svi-model — Stable-Video-Infinity (SVI) generates arbitrary-length videos with controllable storylines and high temporal consistency.
🖼️ meituan-longcat/LongCat-Image — LongCat-Image is a 6B-parameter open-source bilingual foundation model for image generation with multilingual text rendering and photorealism.
🎤 Ultravox v0.7 — Ultravox v0.7 is a 355B-parameter speech understanding model built on GLM 4.6 that processes speech directly in noisy environments without separate transcription.
📎 Papers
🎵 SAC: Neural Speech Codec with Semantic-Acoustic Dual-Stream Quantization
What’s new
A neural speech codec that disentangles semantic and acoustic modeling into two independent streams with dual-stream quantization.
How it works
Dual-Stream Design: Semantic stream uses a pre-trained speech tokenizer for linguistic content; acoustic stream employs a neural audio codec for timbre and emotional attributes.
Tokenization: Semantic tokens from self-supervised models; acoustic tokens from codecs trained with reconstruction objectives.
Training: Streams optimized separately with speaker feature supervision for enhanced timbre modeling.
Decoding: ConvNeXt-based prenet fuses both streams before codec decoder reconstructs the waveform.
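The dual-stream idea can be sketched with toy vector quantizers. Everything below (frame shapes, codebook sizes, the linear "prenet" fusion) is an illustrative stand-in, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(frames, codebook):
    """Nearest-neighbour vector quantization: map each frame to its closest codebook entry."""
    # frames: (T, D), codebook: (K, D)
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (T, K)
    idx = dists.argmin(axis=1)
    return idx, codebook[idx]

T, D, K = 50, 16, 64
frames = rng.normal(size=(T, D))

# Two independent streams with their own codebooks: one stands in for the
# pre-trained semantic tokenizer, the other for the reconstruction-trained codec.
semantic_codebook = rng.normal(size=(K, D))
acoustic_codebook = rng.normal(size=(K, D))

sem_ids, sem_vecs = quantize(frames, semantic_codebook)
aco_ids, aco_vecs = quantize(frames, acoustic_codebook)

# A prenet (here just a random linear fusion) merges both streams
# before a decoder would reconstruct the waveform.
W = rng.normal(size=(2 * D, D)) / np.sqrt(2 * D)
fused = np.concatenate([sem_vecs, aco_vecs], axis=1) @ W  # (T, D)

assert sem_ids.shape == (T,) and aco_ids.shape == (T,)
assert fused.shape == (T, D)
```

The key property the sketch preserves is that the two token streams are produced independently and only meet at the fusion step before decoding.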
Results
Achieves state-of-the-art speech reconstruction and semantic representation.
Outperforms existing codecs in intelligibility (WER) and naturalness (UTMOS).
Demonstrates effective disentanglement for superior downstream task performance.
💬 Bolmo: Byteifying the Next Generation of Language Models
What’s new
A byte-level language model that operates directly on raw bytes rather than tokenized text.
How it works
Operates directly on raw UTF-8 bytes, removing the reliance on a fixed subword vocabulary.
Targets byte-level modeling at a computational cost competitive with subword-tokenized models.
Released in 1B and 7B parameter variants for research use.
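The input representation is the simplest part to see concretely: a byte-level model consumes the raw UTF-8 byte sequence, so any string fits in a fixed 256-symbol alphabet (a minimal illustration, independent of Bolmo's internals):

```python
# Subword tokenizers map text to vocabulary IDs; a byte-level model instead
# consumes the raw UTF-8 byte sequence, so any language, emoji, or typo
# fits in a fixed 256-symbol alphabet.
text = "Bolmo 🤖"
byte_ids = list(text.encode("utf-8"))

assert all(0 <= b < 256 for b in byte_ids)
assert len(byte_ids) > len(text)                 # the emoji expands to 4 bytes
assert bytes(byte_ids).decode("utf-8") == text   # lossless round trip
```

The trade-off is longer sequences (10 byte positions here for 7 characters), which is why byte-level models need efficiency tricks to stay competitive.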
Results
Results pending full evaluation.
🖼️ Distribution Matching VAE
What’s new
A framework that explicitly aligns the encoder’s latent distribution with arbitrary reference distributions beyond conventional Gaussian priors.
How it works
Uses Distribution Matching Distillation (DMD) to train a diffusion model on the reference distribution, learning its score function.
VAE posterior trained to match reference distribution’s score for flexible latent space shaping.
Joint training with a fake score model captures evolving latent distribution.
Composite objective combines reconstruction loss, fake score model loss, and distribution matching loss.
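The structure of the composite objective can be sketched with Gaussian stand-ins for the two score models (the closed-form scores, weights, and the omission of the fake score model's own regression loss are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the two learned score models:
# - real_score: distilled score of the reference latent distribution, here N(0, I)
# - fake_score: score of the current, still-misaligned latent distribution, here N(mu, I)
real_score = lambda z: -z
mu = 0.5
fake_score = lambda z: -(z - mu)

x = rng.normal(size=(128, 8))             # data batch
z = x + 0.1 * rng.normal(size=x.shape)    # toy encoder posterior sample
x_hat = z                                 # toy (identity) decoder

recon = np.mean((x - x_hat) ** 2)
# DMD-style matching signal: the posterior is pushed along the score
# difference; its squared magnitude serves here as a mismatch surrogate.
dm = np.mean((real_score(z) - fake_score(z)) ** 2)

total = recon + 1.0 * dm                  # composite objective (weights assumed)
assert np.isclose(dm, mu ** 2)            # mismatch reflects the mean shift
assert np.isfinite(total)
```

In this toy setup the score difference is exactly the mean shift `mu`, which is the sense in which matching scores matches distributions.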
Results
Achieves state-of-the-art gFID of 3.22 on ImageNet with only 64 training epochs.
SSL-derived features provide best balance of reconstruction quality and generative performance.
🖼️ Bidirectional Normalizing Flow (BiFlow)
What’s new
Removing the need for exact analytic inverses in normalizing flow models and enabling flexible architectures and loss functions.
How it works
Forward Process: Transforms data to noise using any tractable NF model (e.g., improved TARFlow).
Reverse Process: Learns to approximate inverse mapping from noise to data via separate model.
Two-Stage Training: Train forward model with Maximum Likelihood Estimation; train reverse model with fixed forward weights.
Hidden Alignment: Uses intermediate forward states for supervision.
1-NFE Generation: Produces samples in a single forward pass.
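In a toy linear setting, the two-stage recipe reduces to fitting a reverse map against a frozen forward flow (the linear models and least-squares fit below are illustrative stand-ins; hidden-state alignment is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 4, 2000

# Stage 1: a pretrained forward flow f: data -> noise. Here an invertible
# linear map stands in for a trained TARFlow-style model; it is then frozen.
A = rng.normal(size=(D, D)) + 3 * np.eye(D)
x = rng.normal(size=(N, D))          # "data"
eps = x @ A.T                        # forward pass: data mapped to "noise"

# Stage 2: fit a reverse model g: noise -> data by regression, with the
# forward weights fixed. In the linear case this is ordinary least squares;
# no analytic inverse of the forward model is ever required.
B, *_ = np.linalg.lstsq(eps, x, rcond=None)

x_hat = eps @ B                      # 1-NFE generation: a single reverse pass
assert np.allclose(x_hat, x, atol=1e-5)
```

The point the sketch captures: the reverse model is learned, not derived, so the forward architecture is free to be anything tractable.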
Results
Achieves state-of-the-art FID of 2.39 on ImageNet 256×256.
Up to two orders of magnitude faster sampling than improved TARFlow.
🎤 Vevo2: Unified Framework for Speech and Singing Voice Generation
What’s new
A unified controllable framework for speech and singing voice generation with novel prosody and content-style tokenizers.
How it works
Two-Stage Architecture:
AR Content-Style Modeling: Takes text and prosodic source, generates content-style tokens.
Flow-Matching Acoustic Modeling: Converts tokens to Mel spectrograms guided by timbre reference.
Prosody Tokenizer: Captures melody from speech, singing, and instrumentals at 6.25 Hz without expert annotations.
Content-Style Tokenizer: Encodes linguistic content, melody, and style at 12.5 Hz with timbre disentanglement.
Multi-objective Post-training: Aligns intelligibility and prosody similarity.
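The low token rates are what make the AR stage cheap; the arithmetic is worth spelling out (the 24 kHz sample rate is an assumption, only the 6.25/12.5 Hz rates come from the paper):

```python
# Token-rate arithmetic for the two tokenizers.
sr = 24_000
prosody_hz, content_hz = 6.25, 12.5

prosody_hop = int(sr / prosody_hz)    # audio samples covered by one prosody token
content_hop = int(sr / content_hz)    # audio samples covered by one content-style token

assert prosody_hop == 3840
assert content_hop == 1920

# A 10-second utterance therefore needs very short token sequences:
dur = 10.0
assert dur * prosody_hz == 62.5       # ~62 prosody tokens
assert dur * content_hz == 125.0      # 125 content-style tokens
```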
Results
Superior zero-shot TTS and singing voice synthesis performance.
Outperforms baselines in intelligibility, naturalness, and subjective evaluations.
⚙️ Soft Adaptive Policy Optimization (SAPO)
What’s new
Replacing hard clipping in policy optimization with smooth, temperature-controlled gating.
How it works
Adaptively attenuates off-policy updates while preserving useful learning signals.
Maintains sequence-level coherence with soft gating forming a continuous trust region.
Selectively down-weights highly off-policy tokens instead of suppressing all sequence gradients.
Smooth temperature-controlled scaling replaces hard token-level clipping.
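The contrast with hard clipping can be shown with a toy gate. The exact gating function below is an assumption (a sigmoid of the importance ratio's distance from 1, with temperature `tau`), meant only to illustrate "smooth attenuation instead of a hard cutoff":

```python
import numpy as np

def hard_clip_weight(r, eps=0.2):
    """PPO-style hard clipping: the gradient is cut to zero outside [1-eps, 1+eps]."""
    return ((r >= 1 - eps) & (r <= 1 + eps)).astype(float)

def soft_gate_weight(r, eps=0.2, tau=0.05):
    """Illustrative SAPO-style gate (exact form is an assumption): a smooth,
    temperature-controlled sigmoid that decays as the importance ratio r
    drifts off-policy, instead of cutting off abruptly."""
    return 1.0 / (1.0 + np.exp((np.abs(r - 1.0) - eps) / tau))

r = np.array([1.0, 1.1, 1.5, 3.0])   # importance ratios, increasingly off-policy
hard = hard_clip_weight(r)
soft = soft_gate_weight(r)

assert hard.tolist() == [1.0, 1.0, 0.0, 0.0]   # all-or-nothing
assert soft[0] > 0.95                          # near-on-policy passes almost untouched
assert soft[1] > soft[2] > soft[3]             # attenuation grows smoothly
```

The smooth gate keeps a nonzero (but shrinking) signal from moderately off-policy tokens, which is the stability-vs-signal trade-off the paper targets.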
Results
Improved training stability and higher Pass@1 on mathematical reasoning benchmarks.
Consistent gains across diverse tasks and model sizes on Qwen3-VL series.
🧠 Gated Attention for LLMs
What’s new
Introduces gating mechanisms in softmax attention with comprehensive comparison of 30 variants across 15B MoE and 1.7B dense models.
How it works
Applies head-specific sigmoid gate after Scaled Dot-Product Attention.
Introduces non-linearity in low-rank mapping within softmax attention.
Uses query-dependent sparse gating scores to modulate SDPA output.
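A single-head numpy sketch of the mechanism (the gate projection `W_g` and all shapes are illustrative; the paper's best variant applies a head-specific, query-dependent sigmoid gate after SDPA):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

T, d = 6, 8
q = rng.normal(size=(T, d))
k = rng.normal(size=(T, d))
v = rng.normal(size=(T, d))

# Standard scaled dot-product attention for one head.
attn = softmax(q @ k.T / np.sqrt(d)) @ v

# Head-specific, query-dependent sigmoid gate applied to the SDPA output.
# W_g is this head's gate projection (randomly initialized here for illustration).
W_g = rng.normal(size=(d, d)) / np.sqrt(d)
gate = 1.0 / (1.0 + np.exp(-(q @ W_g)))        # elementwise in (0, 1)
out = gate * attn

assert out.shape == (T, d)
assert np.all((gate > 0) & (gate < 1))
assert np.all(np.abs(out) <= np.abs(attn))     # the gate can only attenuate
```

Because the gate depends on the query rather than on the attention scores, a head can suppress its own output entirely, which is one proposed explanation for why gating mitigates attention sinks.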
Results
Simple modification consistently improves performance and training stability.
Tolerates larger learning rates with improved scaling properties.
Sparse gating mitigates “attention sink” and enhances long-context extrapolation.
💬 CALM: Continuous Autoregressive Language Models
What’s new
Shifting from discrete token prediction to continuous vector prediction with a likelihood-free training framework.
How it works
Autoencoder: Compresses K tokens into single continuous vector with >99.9% reconstruction accuracy.
Next-Vector Prediction: Predicts continuous vectors instead of tokens, reducing generative steps by factor K.
Energy Transformer Head: Single-step generation avoiding iterative sampling bottlenecks.
Likelihood-Free Training: Uses energy loss as training objective.
BrierLM Metric: Novel evaluation metric based on Brier score for likelihood-free contexts.
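BrierLM builds on the classical multi-class Brier score, which is simple to state (shown below; BrierLM's sample-based estimation of it for likelihood-free models is not reproduced here):

```python
import numpy as np

def brier_score(probs, target):
    """Multi-class Brier score: squared error between the predicted
    distribution and the one-hot outcome (lower is better)."""
    onehot = np.zeros_like(probs)
    onehot[target] = 1.0
    return float(np.sum((probs - onehot) ** 2))

# Perfectly confident and correct -> 0; uniform over 4 classes -> 0.75.
assert brier_score(np.array([1.0, 0.0, 0.0, 0.0]), 0) == 0.0
assert brier_score(np.full(4, 0.25), 0) == 0.75
```

Unlike perplexity, this score can be estimated from model samples alone, which is what makes it usable when the model assigns no explicit likelihoods.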
Results
Superior performance-compute trade-offs vs. traditional Transformers.
Matches or exceeds discrete baselines at lower computational costs.
🎭 Live Avatar: Real-time Audio-Driven Avatar Generation
What’s new
Introduces Live Avatar, a framework for real-time, infinite-length audio-driven avatar generation using a 14B-parameter diffusion model.
How it works
Timestep-forcing Pipeline Parallelism (TPP): Distributes denoising steps across multiple GPUs, breaking autoregressive bottleneck for low-latency streaming.
Rolling Sink Frame Mechanism (RSFM): Maintains sequence fidelity by dynamically recalibrating appearance using cached reference images.
Self-Forcing Distribution Matching Distillation: Enables streamable adaptation without sacrificing visual quality.
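The throughput win from TPP is back-of-envelope arithmetic: splitting denoising steps across GPUs as pipeline stages means one chunk finishes per stage time, not per full denoising pass. The step count and per-step time below are assumptions; only the 5-GPU and 20 FPS figures come from the paper:

```python
steps = 20                # assumed denoising steps per video chunk
gpus = 5                  # pipeline stages (Live Avatar reports 5 H800s)
step_ms = 10              # assumed milliseconds per denoising step on one GPU

sequential_ms = steps * step_ms          # latency of one chunk on a single GPU
stage_ms = (steps // gpus) * step_ms     # work per pipeline stage
throughput = 1000 / stage_ms             # chunks/sec once the pipeline is full

assert sequential_ms == 200
assert stage_ms == 40
assert throughput == 25.0                # 5x the single-GPU rate
```

Latency per chunk is unchanged, but steady-state throughput scales with the number of stages, which is what makes streaming at a fixed frame rate feasible.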
Results
Achieves 20 FPS end-to-end generation on 5 H800 GPUs.
First framework enabling practical, real-time, high-fidelity avatar generation at this scale.
🖼️ Adversarial Flow Models
What’s new
Unifies adversarial and flow generative models, supporting single-step and multi-step generation with improved training stability.
How it works
Deterministic Mapping: Generator learns deterministic noise-to-data mapping, stabilizing adversarial training.
Adversarial Objective: Trained using adversarial objective enabling single-step training without intermediate timesteps.
Flow Matching: Minimizes squared Wasserstein-2 distance between prior and data distributions.
Gradient Normalization: Improves optimization across model sizes.
Classifier Guidance: Enhances generation quality using learned classifier gradients.
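Of the ingredients above, gradient normalization is the simplest to show concretely (a generic unit-norm rescaling; the paper's exact placement in the training loop is not reproduced):

```python
import numpy as np

def normalize_grad(g, eps=1e-8):
    """Gradient normalization: rescale the gradient to unit norm so the update
    magnitude is decoupled from the (often badly scaled) adversarial loss."""
    return g / (np.linalg.norm(g) + eps)

g = np.array([3.0, 4.0])                 # ||g|| = 5
gn = normalize_grad(g)

assert np.allclose(gn, [0.6, 0.8])
assert abs(np.linalg.norm(gn) - 1.0) < 1e-6
```

Decoupling step size from loss scale is a common stabilizer for adversarial training, where the discriminator's loss magnitude drifts as training progresses.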
Results
B/2 model approaches consistency-based XL/2 models.
XL/2 achieves new best FID of 2.38 on ImageNet.
Superior few-step generation compared to existing models.
🧑💻 Open Source
🗣️ sarulab-speech/Sidon — Training code and dataset-cleansing tools for Sidon, a speech restoration model from UTokyo's Saruwatari lab.
🗣️ CorentinJ/TorchStream — A library for making PyTorch audio models streamable.
🗣️ microsoft/VibeVoice — Microsoft’s frontier open-source TTS framework for expressive, long-form multi-speaker audio (up to 90 mins, 4 speakers), with a real-time 0.5B variant achieving ~300ms latency.
🤖 block/goose — Block’s extensible AI agent platform (24K+ stars), a founding project of the Agentic AI Foundation alongside Anthropic’s MCP and OpenAI’s AGENTS.md.
⚙️ activepieces/activepieces — AI-powered workflow automation with 400+ MCP server integrations and a no-code builder for enterprise automation.
📚 tmgthb/Autonomous-Agents — Continuously updated curated list of autonomous AI agents research, covering multi-agent systems and agentic pipelines.
Thanks for reading … Enjoyed this issue? Share it with a friend.

