Machine Learns #47
No-fluff bi-weekly AI newsletter: OpenAI/Microsoft renegotiations, AI leaderboards are broken, new models from Mistral & Microsoft, boosting long-context recurrent LLMs, and the latest papers and models.
👋 Join us on Discord for the latest updates on AI.
📌 Bookmarks
AI technology was used to create a video of a deceased victim delivering a powerful impact statement in court, marking a novel application of artificial intelligence in the legal system.
Study claims LM Arena favored top AI labs like Meta and OpenAI, allowing them to game its benchmark for better leaderboard scores.
Chinese scientists at Peking University have developed the world's fastest transistor using bismuth, bypassing silicon and achieving superior speed and energy efficiency.
OpenAI and Microsoft are renegotiating their partnership, potentially allowing Microsoft to take a smaller equity stake in exchange for extended access to OpenAI's technology, which may facilitate a future IPO for OpenAI.
Apple plans to introduce AI search options in Safari, challenging Google's dominance in online search and advertising.
Trump plans to rescind Biden-era AI chip curbs, allowing countries like the UAE and Saudi Arabia to negotiate better terms while maintaining restrictions on China.
Google introduces 'implicit caching' in its Gemini API, promising up to 75% cost savings for developers using its latest AI models.
Google is integrating ads into chatbot conversations, adapting its digital advertising business to the rise of generative AI.
Mistral AI announces Mistral Medium 3, a new model offering state-of-the-art performance at 8X lower cost, designed for enterprise use and coding tasks.
Mastercard is enabling AI agents to shop online and make payments on behalf of consumers, streamlining the e-commerce experience.
Amazon introduces Vulcan, its first robot with a sense of touch, enhancing efficiency and safety in fulfillment centers.
Anthropic introduces web search on its API, enabling Claude to access current information and enhance AI applications with real-time data.
Japan's NTT group claims to have developed the world's first drone capable of inducing and guiding lightning strikes to protect infrastructure.
Eric Schmidt's acquisition of Relativity Space aims to develop data centers in orbit to meet the growing energy demands of AI applications.
🤖 Model Releases
Latent Bridge Matching (LBM) relighting is a fast image-to-image translation model that relights foreground objects based on a provided background.
The Byte Latent Transformer (BLT) is a new byte-level LLM architecture that improves inference efficiency and robustness by encoding bytes into dynamically sized patches, achieving performance comparable to tokenization-based models at scale (a minimal patching sketch follows this list).
A collection of Foundation Vision Models that combine multiple models (CLIP, DINOv2, SAM, etc.) for image feature extraction.
Kevin-32B is a multi-turn reinforcement learning model designed for generating optimized CUDA kernels, demonstrating superior performance in self-refinement and task completion compared to traditional single-turn models.
SpeciesNet is an open source AI model that identifies animal species by analyzing photos from camera traps, aiding biodiversity monitoring and conservation efforts.
Phi-4-reasoning-plus is a state-of-the-art open-weight reasoning model from Microsoft Research, fine-tuned for advanced reasoning tasks using supervised fine-tuning and reinforcement learning.
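A hedged sketch of the dynamic byte-patching idea behind BLT (illustrative only, not the released code; `next_byte_probs` is an assumed helper standing in for a small byte-level LM): a new patch starts whenever next-byte entropy spikes, so unpredictable regions get shorter patches.

```python
import math

def dynamic_patches(byte_seq: bytes, next_byte_probs, threshold: float = 2.0):
    """next_byte_probs(prefix) -> 256 probabilities; assumed small byte-LM helper."""
    patches, current = [], bytearray()
    for i, b in enumerate(byte_seq):
        probs = next_byte_probs(byte_seq[:i])
        # Entropy of the next-byte distribution given everything seen so far.
        entropy = -sum(p * math.log2(p) for p in probs if p > 0)
        if entropy > threshold and current:
            patches.append(bytes(current))   # close the patch before a "surprising" byte
            current = bytearray()
        current.append(b)
    if current:
        patches.append(bytes(current))
    return patches
```

In practice the entropies would come from a single forward pass of a small causal byte LM rather than one call per prefix; the per-prefix loop here is just to keep the sketch self-contained.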
📎 Papers
Overflow Prevention Enhances Long-Context Recurrent LLMs
What's new
OPRM (Overflow Prevention for Recurrent Models).
Addresses memory overflow in recurrent LLMs, whose fixed-size recurrent state degrades on long contexts.
Enhances long-context processing efficiency.
How it works
Utilizes a chunk-based inference strategy.
Segments input context into manageable chunks for parallel processing.
Selects the most relevant chunk for decoding based on entropy or probability criteria (see the sketch below).
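A minimal sketch of the chunked-inference idea, assuming a Hugging Face-style causal LM interface (`model`, `tokenizer`, and `generate` are placeholders, not the authors' code): each chunk is scored by the entropy of the next-token distribution after the query, and the most confident chunk is used for decoding.

```python
import torch

def oprm_style_decode(model, tokenizer, context: str, query: str, chunk_len: int = 2048):
    # Split the long context into fixed-size chunks (independent, so they
    # could be processed in parallel in practice).
    ids = tokenizer(context, return_tensors="pt").input_ids[0]
    chunks = [ids[i:i + chunk_len] for i in range(0, len(ids), chunk_len)]
    query_ids = tokenizer(query, return_tensors="pt").input_ids[0]

    best_chunk, best_entropy = None, float("inf")
    for chunk in chunks:
        inp = torch.cat([chunk, query_ids]).unsqueeze(0)
        with torch.no_grad():
            logits = model(inp).logits[0, -1]            # next-token distribution after the query
        probs = torch.softmax(logits, dim=-1)
        entropy = -(probs * torch.log(probs + 1e-12)).sum().item()
        if entropy < best_entropy:                        # lower entropy = more confident chunk
            best_entropy, best_chunk = entropy, chunk

    # Decode the answer from the selected chunk plus the query only, so the
    # recurrent state never has to absorb the full long input.
    prompt = torch.cat([best_chunk, query_ids]).unsqueeze(0)
    return model.generate(prompt, max_new_tokens=128)
```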
Results
Performance improvements on LongBench tasks:
Falcon3-Mamba-Inst-7B: +14%
Falcon-Mamba-Inst-7B: +28%
RecurrentGemma-IT-9B: +50%
RWKV6-Finch-7B: +51%
Achieves state-of-the-art results in LongBench v2 benchmark.
Continuous Thought Machines
What's new
Continuous Thought Machine (CTM).
Incorporates neuron-level temporal processing and between-neuron synchronization.
Focuses on introducing neural timing as a foundational element.
How it works
Neuron-level temporal processing: Each neuron uses unique weight parameters to process a history of incoming signals.
Neural synchronization: the alignment of neuron activations over internal ticks serves as the latent representation of the network's momentary state (see the toy sketch after this list).
Balances computational efficiency with biological realism.
Captures essential temporal dynamics while remaining computationally tractable for deep learning.
Its recursive nature allows for implicit reasoning and thought generation.
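A toy sketch of the two mechanisms above, with made-up shapes and names (not the authors' implementation): each neuron applies its own private weights to a rolling history of its pre-activations, and the pairwise synchronization of neuron traces over internal ticks is flattened into the latent representation.

```python
import torch
import torch.nn as nn

class NeuronLevelModel(nn.Module):
    """Each neuron has its own weights, applied to its recent pre-activation history."""
    def __init__(self, n_neurons: int, history: int, hidden: int = 16):
        super().__init__()
        self.w1 = nn.Parameter(torch.randn(n_neurons, history, hidden) * 0.02)
        self.w2 = nn.Parameter(torch.randn(n_neurons, hidden) * 0.02)

    def forward(self, pre_act_history):                  # (batch, n_neurons, history)
        h = torch.einsum("bnh,nhk->bnk", pre_act_history, self.w1).relu()
        return torch.einsum("bnk,nk->bn", h, self.w2)    # post-activations (batch, n_neurons)

def synchronization_latent(activations):                 # (batch, ticks, n_neurons)
    """Pairwise alignment of neuron traces over internal ticks, used as the latent state."""
    z = activations - activations.mean(dim=1, keepdim=True)
    sync = torch.einsum("btn,btm->bnm", z, z) / activations.shape[1]
    return sync.flatten(1)                                # (batch, n_neurons * n_neurons)
```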
Results
Strong performance across various tasks: ImageNet-1K classification, solving 2D mazes, sorting, parity computation, question-answering, and reinforcement learning tasks.
Displays rich internal representations and offers interpretability due to its internal processes.
Capable of complex sequential reasoning.
The Leaderboard Illusion
What's new
Systematic issues in Chatbot Arena leaderboard.
Distortion in ranking due to undisclosed private testing practices.
How it works
Providers can test multiple variants before public release.
Selective disclosure of performance results biases leaderboard scores upward (illustrated in the simulation below).
Proprietary closed models are sampled at higher rates than open-weight models.
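A toy simulation of that selection effect (illustrative numbers only, not the paper's methodology): if a lab privately tests many variants of the same underlying model and reports only the best run, sampling noise alone pushes the reported score above the true win rate.

```python
import random

def observed_win_rate(true_skill: float, n_battles: int = 300) -> float:
    # Empirical win rate against equally skilled opponents, with sampling noise.
    wins = sum(random.random() < true_skill for _ in range(n_battles))
    return wins / n_battles

random.seed(0)
true_skill = 0.50
single_submission = observed_win_rate(true_skill)                   # one public model
best_of_27 = max(observed_win_rate(true_skill) for _ in range(27))  # keep best of 27 private variants

print(f"single submission : {single_submission:.3f}")  # hovers around 0.50
print(f"best of 27 variants: {best_of_27:.3f}")        # noticeably above 0.50 from selection alone
```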
Results
Meta tested 27 private LLM variants before Llama-4 release.
Google and OpenAI received 19.2% and 20.4% of all arena data, respectively.
A combined 83 open-weight models received only 29.7% of the total data.
Why it matters
Access to Chatbot Arena data provides substantial performance benefits.
Even a limited amount of additional Arena data can yield relative performance gains of up to 112%.
Highlights need for reform in evaluation framework for fairer benchmarking.
Inductive Moment Matching
What's new
Inductive Moment Matching (IMM), a new class of generative models for one- or few-step sampling.
How it works
IMM offers a single-stage training procedure.
Does not require pre-training initialization or optimization of two networks.
Guarantees distribution-level convergence.
Remains stable under various hyperparameters and standard model architectures.
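For context on the moment-matching part, here is a generic RBF-kernel MMD loss, the distribution-level matching primitive this family of methods builds on; it is background only, not the IMM objective itself.

```python
import torch

def rbf_kernel(x: torch.Tensor, y: torch.Tensor, bandwidth: float = 1.0) -> torch.Tensor:
    # Gaussian kernel on pairwise squared distances between sample sets.
    d2 = torch.cdist(x, y) ** 2
    return torch.exp(-d2 / (2 * bandwidth ** 2))

def mmd_squared(gen: torch.Tensor, target: torch.Tensor, bandwidth: float = 1.0) -> torch.Tensor:
    # V-statistic estimate of squared MMD: E[k(g,g')] + E[k(t,t')] - 2 E[k(g,t)].
    # Driving this to zero matches the two sample distributions under the chosen kernel.
    k_gg = rbf_kernel(gen, gen, bandwidth).mean()
    k_tt = rbf_kernel(target, target, bandwidth).mean()
    k_gt = rbf_kernel(gen, target, bandwidth).mean()
    return k_gg + k_tt - 2 * k_gt
```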
Results
Surpasses diffusion models on ImageNet-256x256 with 1.99 FID using only 8 inference steps.
Achieves state-of-the-art 2-step FID of 1.98 on CIFAR-10 for a model trained from scratch.
👨‍💻 Open-Source
GitHub - astral-sh/ty: An extremely fast Python type checker and language server, written in Rust.
GitHub - bytedance/deer-flow: DeerFlow is a community-driven framework for deep research, combining language models with tools like web search, crawling, and Python execution, while contributing back to the open-source community.
GitHub - roboflow/sports: Computer vision and sports.
GitHub - BUTSpeechFIT/DiariZen: A toolkit for speaker diarization.
GitHub - SandAI-org/MagiAttention: A Distributed Attention Towards Linear Scalability for Ultra-Long Context, Heterogeneous Data Training