Machine Learns #44
Praxis, Sam Altman-backed tech utopia; Amazon launches Nova Sonic voice AI; Midjourney returns with V7; Llama 4 models debut amid controversy; a new brain-to-voice model; NoProp learning ...
📌 Bookmarks
The Anthropic Economic Index reveals increased usage of Claude 3.7 Sonnet in coding, education, and science, with a focus on technical tasks using the new "extended thinking" mode. A new taxonomy categorizes 630 usage patterns, highlighting the prevalence of augmentation over automation across occupations.
Praxis, backed by Altman, is exploring Kyiv and Athens as potential bases for a self-sovereign tech utopia, aiming to leverage local talent and resources in the tech sector.
The AI race in 2025 is intensifying, with competition growing among top models and notable advancements from both large and small developers.
Amazon launches Nova Sonic, a new AI voice model that offers competitive performance in speech recognition and natural-sounding dialogue.
Discussion on the misconceptions of open-source software and the entitlement some users feel towards developers and their contributions.
OpenAI reverses course on its o3 model release and delays GPT-5, aiming for a unified next-gen model with enhanced capabilities.
Trump's TikTok plan faces challenges due to Chinese objections over tariffs.
Midjourney launches V7, its first new AI image model in nearly a year, featuring personalization and improved image quality.
China's open-source AI boom may face rapid constraints as government control and tech restrictions loom.
AI-powered therapy chatbot Therabot shows significant improvements in mental health symptoms, according to a groundbreaking study from Dartmouth.
Alibaba's chairman warns of a potential AI bubble as massive spending on data centers continues without clear customer demand.
DeepMind is delaying the release of AI research to provide Google with a competitive advantage.
A brain-to-voice neuroprosthesis enables individuals with paralysis to communicate naturally by synthesizing speech from neural signals, achieving real-time decoding with high accuracy and personalized voice synthesis.
Hacker News: Do you still use search engines?
🤖 Model Releases
Cogito LLMs are instruction-tuned generative models optimized for coding, STEM, and multilingual capabilities, supporting both standard and reasoning modes.
MT-LLM is a model designed for instruction-based multi-talker overlapped speech recognition, capable of transcribing speech in various scenarios with versatile instructions.
Llama 4 Scout and Llama 4 Maverick are the new series of Llama multimodal models. They improve on the Llama 3 models and extend the context length. However, there is some controversy about the performance benchmarks.
RolmOCR is a faster, memory-efficient open-source OCR model that improves upon olmOCR, utilizing the Qwen2.5-VL-7B vision language model for enhanced document parsing.
Nomic Embed Multimodal is a suite of state-of-the-art models for embedding PDFs, images, papers, and charts, achieving superior performance in multimodal retrieval.
AnimeGamer is a model for infinite anime life simulation that generates dynamic animation shots and high-quality video clips using Multimodal Large Language Models.
MegaTTS3 is a new TTS model released by ByteDance. It's a diffusion model that generates high-quality speech and excels in voice cloning.
OpenHands LM is a powerful open coding agent model that achieves strong performance on software engineering tasks and can be run locally on standard hardware.
Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner.
📎 Papers
NoProp: Training Neural Networks without Back-propagation or Forward-propagation
What's new
NoProp, a back-propagation-free method for training neural networks.
Inspired by diffusion and flow matching methods, it trains each layer individually.
How it works
Each layer learns to denoise a noisy target without forward or backward propagation.
Utilizes a local denoising process at inference, taking noisy labels from previous layers.
Similar to unfolding a diffusion process, where each layer learns to denoise the output of the previous layer.
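The per-layer objective above can be sketched as follows. This is a minimal toy version, not the paper's implementation: layer names, dimensions, and the plain MSE denoising loss are my assumptions. The key point is that each layer is optimized in isolation, so no gradients flow between layers.

```python
import torch
import torch.nn as nn

class DenoiseLayer(nn.Module):
    """One independently trained layer: predicts the clean label
    embedding from the input and a noised label embedding."""
    def __init__(self, in_dim, label_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim + label_dim, 256), nn.ReLU(),
            nn.Linear(256, label_dim),
        )

    def forward(self, x, z_noisy):
        return self.net(torch.cat([x, z_noisy], dim=-1))

def local_step(layer, opt, x, z_clean, noise_std):
    # Noise the target, then regress back to the clean embedding.
    z_noisy = z_clean + noise_std * torch.randn_like(z_clean)
    loss = nn.functional.mse_loss(layer(x, z_noisy), z_clean)
    opt.zero_grad()
    loss.backward()   # gradients stay local to this layer
    opt.step()
    return loss.item()

layer = DenoiseLayer(in_dim=784, label_dim=10)
opt = torch.optim.Adam(layer.parameters(), lr=1e-3)
x = torch.randn(32, 784)
z = nn.functional.one_hot(torch.randint(0, 10, (32,)), 10).float()
loss = local_step(layer, opt, x, z, noise_std=0.5)
```

At inference, the layers would be chained, each denoising the previous layer's output, which is the "unfolded diffusion" view described above.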
Results
Demonstrated effectiveness on MNIST, CIFAR-10, and CIFAR-100 benchmarks.
NoProp outperforms existing back-propagation-free methods in accuracy and computational efficiency.
InfiniteICL: Breaking the Limit of Context Window Size via Long Short-term Memory Transformation
What's new
Addresses limitations of finite context windows in LLMs.
How it works
Parallels context and parameters in LLMs with human short and long-term memory systems
Transforms context knowledge into model parameters.
Creates a synthetic dataset using the pre-trained LLM as teacher, prompting it with the target context.
Uses perplexity to pick the samples most informative for the model.
Fine-tunes a new student model on the selected samples.
Internalizes the context knowledge into the model parameters.
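The perplexity-based selection step above can be sketched like this. Everything here is illustrative: `select_informative` and the toy log-probabilities are my own placeholders, standing in for real teacher generations scored under the un-prompted base model. The idea is that samples the base model finds surprising (high perplexity) carry the most context-specific knowledge.

```python
import math

def perplexity(logprobs):
    # perplexity = exp(-mean token log-probability)
    return math.exp(-sum(logprobs) / len(logprobs))

def select_informative(samples, top_k):
    # samples: list of (text, token log-probs under the base model)
    scored = [(perplexity(lps), text) for text, lps in samples]
    scored.sort(reverse=True)          # most surprising first
    return [text for _, text in scored[:top_k]]

samples = [
    ("fact the context teaches", [-2.1, -3.0, -2.5]),  # surprising to base model
    ("generic filler sentence", [-0.2, -0.1, -0.3]),   # already well modeled
]
keep = select_informative(samples, top_k=1)
# → ["fact the context teaches"]
```

The selected samples would then be used to fine-tune the student, baking the context into its weights.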
Results
Reduces context length by 90%
Achieves 103% of full-context prompting's average performance across various tasks.
Surpasses full-context prompting using only 0.4% of original contexts in complex scenarios
My 2 cents
This is especially useful when you want a certain behaviour from the model but don't want to prompt for it every time.
F5R-TTS: Improving Flow Matching based Text-to-Speech with Group Relative Policy Optimization
What's new
Integration of Group Relative Policy Optimization (GRPO) into the flow-matching architecture.
Reformulation of deterministic outputs into probabilistic Gaussian distributions.
How it works
Pretraining phase with flow matching loss.
GRPO-driven enhancement stage using dual reward metrics: word error rate (WER) and speaker similarity (SIM).
Utilizes reinforcement learning for improved performance.
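The group-relative part of GRPO with the dual reward above can be sketched as follows. Reward weights and values are illustrative, not the paper's: the point is that each sampled utterance is scored by negative WER plus speaker similarity, then normalized against its own sampled group to form the advantage.

```python
import statistics

def reward(wer, sim, w_wer=1.0, w_sim=1.0):
    # Lower WER is better, higher speaker similarity is better.
    return -w_wer * wer + w_sim * sim

def group_relative_advantages(rewards):
    # GRPO normalizes each sample's reward against its sampled group.
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]

# Four sampled utterances for the same text prompt: (WER, SIM) pairs.
group = [reward(w, s) for w, s in [(0.10, 0.80), (0.30, 0.75),
                                   (0.05, 0.90), (0.20, 0.70)]]
adv = group_relative_advantages(group)
# The lowest-WER / highest-SIM sample gets the largest positive advantage.
```

These advantages would then weight the policy-gradient update on the flow-matching model's (now probabilistic) outputs.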
Results
F5R-TTS achieves 29.5% reduction in WER and 4.6% increase in SIM compared to conventional TTS systems.
Demonstrated effectiveness in zero-shot voice cloning experiments.
Experimental validation confirms improvements in speech intelligibility and speaker consistency.
My 2 cents
RL in speech synthesis is the right direction.
One-Minute Video Generation with Test-Time Training
What's new
Test-Time Training (TTT) layers for video generation.
TTT layers allow hidden states to be neural networks, increasing expressiveness.
How it works
TTT converts RNN hidden states into neural networks that update their parameters during inference.
It alleviates the representation bottleneck of RNNs.
TTT layers added to pre-trained Transformer models.
Generates one-minute videos from text storyboards.
Dataset curated from Tom and Jerry cartoons.
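The core TTT idea can be sketched with a toy linear version (my simplification, far from the paper's layer): the "hidden state" is a weight matrix that takes one gradient step on a self-supervised reconstruction loss for every incoming token, then produces that token's output.

```python
import torch

def ttt_linear(tokens, dim, lr=0.1):
    W = torch.zeros(dim, dim)             # hidden state = model weights
    outputs = []
    for x in tokens:                      # x: (dim,)
        # Self-supervised inner loss: reconstruct x from its projection.
        pred = W @ x
        grad = torch.outer(pred - x, x)   # d/dW of 0.5 * ||W x - x||^2
        W = W - lr * grad                 # update the hidden state at test time
        outputs.append(W @ x)             # output with the updated state
    return torch.stack(outputs), W

tokens = [torch.randn(8) for _ in range(16)]
ys, W = ttt_linear(tokens, dim=8)
```

Because the state is a full weight matrix updated by gradient descent rather than a fixed-size vector, it can absorb far more of the sequence, which is how TTT layers ease the RNN representation bottleneck mentioned above.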
Results
TTT layers outperform baselines (Mamba 2, Gated DeltaNet) in coherence.
Achieved a 34-point Elo lead in a human evaluation of 100 videos.
Results still contain artifacts; limited by pre-trained 5B model.
My 2 cents
TTT layers are a promising approach to other generative tasks by merging meta-learning and generative models.
👨‍💻 Open-Source
bluorion-com/ZClip: Official implementation of the paper "ZClip: Adaptive Spike Mitigation for LLM Pre-Training". Uses z-score statistics of the gradient norm to detect abnormal spikes.
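The z-score spike detection ZClip describes can be sketched like this. This is a hedged illustration, not the repo's code: the warmup scheme, EMA coefficient, and threshold are my own choices.

```python
import statistics

class SpikeDetector:
    """Flags gradient-norm spikes via z-score against running statistics."""
    def __init__(self, threshold=2.5, alpha=0.97, warmup=4):
        self.history = []
        self.mean = 0.0
        self.var = 0.0
        self.alpha = alpha
        self.threshold = threshold
        self.warmup = warmup

    def check(self, grad_norm):
        if len(self.history) < self.warmup:
            # Accumulate a few steps before trusting the statistics.
            self.history.append(grad_norm)
            if len(self.history) == self.warmup:
                self.mean = statistics.mean(self.history)
                self.var = statistics.pvariance(self.history)
            return False
        z = (grad_norm - self.mean) / (self.var ** 0.5 + 1e-6)
        spike = z > self.threshold
        if not spike:
            # EMA update of running statistics with normal steps only.
            self.var = self.alpha * self.var + (1 - self.alpha) * (grad_norm - self.mean) ** 2
            self.mean = self.alpha * self.mean + (1 - self.alpha) * grad_norm
        return spike

det = SpikeDetector()
norms = [1.0, 1.1, 0.9, 1.0, 25.0]   # last step is an abnormal spike
flags = [det.check(n) for n in norms]
# → [False, False, False, False, True]
```

On a flagged step, ZClip-style mitigation would then clip or rescale the gradient instead of applying it raw.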
OpenBB-finance/OpenBB: Investment Research for Everyone, Everywhere. A fully open-source financial platform offering comprehensive access to various investment sectors.