Machine Learns #43
Gemini's real-time features launch, major tech acquisitions including Google's $32B Wiz deal, a new DeepSeek model plus 8 more open models, and the latest in robotics from Boston Dynamics
📌 Bookmarks
Google is rolling out new AI features for Gemini Live that allow real-time screen sharing and camera interpretation, enhancing its capabilities to answer user questions. These features are part of the Google One AI Premium plan and showcase Google's lead in AI assistants amid competition from Amazon and Apple.
A look at why Anthropic's AI model Claude struggles to play Pokémon, covering both its reasoning capabilities and the obstacles to reaching human-level performance.
New CRISPR tool enhances gene editing and disease modeling capabilities.
Nvidia introduced Dynamo, a software framework for optimizing AI inference across GPUs, enhancing performance and throughput by intelligently managing prefill and decode processes. It integrates with existing libraries and supports various Nvidia hardware, aiming to improve efficiency in AI applications.
M&A activity is surging in the cloud software sector, with significant acquisitions like Wiz by Google for $32B and Ampere Computing by SoftBank for $6.5B. The trend suggests a potential shift back to inorganic growth strategies as companies seek expansion amid slowing organic growth. SaaS valuations are primarily based on revenue multiples, with median growth rates and margins indicating a competitive landscape.
Ant Group is utilizing both Chinese and U.S. semiconductors to enhance AI model efficiency, reducing training costs by 20% and decreasing reliance on Nvidia. The company has also upgraded its AI solutions for healthcare, now implemented in several major hospitals in China.
Google rolls out Gemini's real-time AI video features, enabling screen sharing and live video interpretation for enhanced user interaction.
FuriosaAI, a South Korean AI chip startup, rejects an $800M acquisition offer from Meta to focus on its chip development.
Meta reveals revenue-sharing agreements with Llama AI model hosts amid copyright lawsuit allegations.
Boston Dynamics' Atlas robot showcases impressive breakdance moves, highlighting advancements in humanoid mobility and AI-driven movement.
Google announces a $32 billion agreement to acquire Wiz, enhancing cloud security and multicloud capabilities.
Sakana claims its AI-generated paper passed peer review, but the reality is more complex, raising questions about AI's role in scientific research.
🤖 Model Releases
distil-large-v3.5 is a Distil-Whisper model for automatic speech recognition, released as a collection with support for popular Whisper libraries.
StarVector is a multimodal LLM for generating structured SVG code directly from images and text.
The Nemotron-H family of hybrid Mamba-Transformer models offers improved accuracy and inference speed, designed for efficient reasoning and advanced AI applications.
DeepSeek-V3-0324 is a new 685-billion-parameter model designed for text generation.
MambaVision: A Hybrid Mamba-Transformer Vision Backbone with both 1K and 21K pretrained models for image feature extraction.
NVIDIA's magpie-tts-multilingual model offers natural and expressive text-to-speech voices in multiple languages for voice agents and brand ambassadors.
Stable Virtual Camera is a 1.3B generalist diffusion model for Novel View Synthesis, generating 3D consistent novel views of a scene from multiple input views and target cameras.
NVIDIA NeMo Canary Flash is a multilingual, multi-tasking speech model with 883 million parameters, offering state-of-the-art performance in automatic speech recognition and translation across four languages.
Orpheus TTS: A model for generating human-sounding speech from text with customizable voices.
📎 Papers
Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models
What's new
Block diffusion language models
Interpolation between discrete denoising diffusion and autoregressive models
Flexible-length generation and improved inference efficiency
How it works
Utilizes KV caching and parallel token sampling
Includes efficient training algorithm and data-driven noise schedules
Minimizes gradient variance
Results
Sets new state-of-the-art performance among diffusion models on language modeling benchmarks
Enables generation of arbitrary-length sequences
Why it matters
Potential for parallelized generation and controllability
Addresses limitations of both autoregressive and diffusion approaches
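A toy sketch of the control flow summarized above, assuming a masking-style discrete diffusion inside each block: blocks are produced left to right (the autoregressive part), and within a block several positions are filled in parallel per denoising step. `toy_denoiser`, `generate`, and the tiny vocabulary are hypothetical names for illustration; the denoiser guesses tokens at random and is not the paper's model, and the KV cache is only indicated in a comment.

```python
import random

MASK = None  # marks a position that has not been denoised yet


def toy_denoiser(context, block, rng):
    """Hypothetical stand-in for a trained model.

    Proposes a token for every masked position in the current block.
    A real model would condition on `context` (the finished blocks).
    """
    vocab = ["a", "b", "c", "d"]
    return {i: rng.choice(vocab) for i, tok in enumerate(block) if tok is MASK}


def generate(num_blocks, block_size, steps_per_block, seed=0):
    rng = random.Random(seed)
    sequence = []  # finished blocks; a real model would reuse their cached keys/values
    per_step = -(-block_size // steps_per_block)  # ceiling division
    for _ in range(num_blocks):
        # Each new block starts fully masked (pure noise for a masking diffusion).
        block = [MASK] * block_size
        while any(tok is MASK for tok in block):
            proposals = toy_denoiser(sequence, block, rng)
            # Commit several positions at once: parallel token sampling.
            chosen = rng.sample(sorted(proposals), min(per_step, len(proposals)))
            for i in chosen:
                block[i] = proposals[i]
        # Autoregressive step: the finished block becomes context for the next one.
        sequence.extend(block)
    return sequence


if __name__ == "__main__":
    print(generate(num_blocks=3, block_size=4, steps_per_block=2))
```

Because `num_blocks` is a free parameter, the loop can keep appending blocks, which is the sense in which this style of generation is not tied to a fixed sequence length.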
Measuring AI Ability to Complete Long Tasks
What's new
Proposal of a new metric: 50%-task-completion time horizon.
Frontier AI models have a 50% time horizon of around 50 minutes.
AI time horizon has been doubling approximately every seven months since 2019.
How it works
Methodology involves timing humans with domain expertise on various tasks.
AI agents' performance is evaluated against human completion times.
Uses a diverse suite of 170 tasks, drawn from sources including HCAST and RE-Bench.
Results
AI capabilities have improved significantly, with a trend of increasing time horizons.
Current models can autonomously execute complex tasks that previously required human expertise.
Extrapolating the trend suggests that, within five years, AI could automate many software tasks that currently take humans a month.
Why it matters
Understanding AI capabilities is crucial for developing safety measures and governance.
Provides a continuous metric for tracking AI progress over time.
Highlights the potential risks associated with increasing AI autonomy in complex tasks.
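The arithmetic behind the extrapolation can be sketched as follows, using the two figures quoted above (a roughly 50-minute horizon and a seven-month doubling time). The function names and the 167-hour working month are assumptions for illustration, not values taken from the paper.

```python
import math


def extrapolate_horizon(current_minutes, doubling_months, months_ahead):
    """Time horizon after `months_ahead` months under steady exponential growth."""
    return current_minutes * 2 ** (months_ahead / doubling_months)


def months_until(target_minutes, current_minutes, doubling_months):
    """Months needed for the horizon to grow from current to target."""
    return doubling_months * math.log2(target_minutes / current_minutes)


if __name__ == "__main__":
    CURRENT_MINUTES = 50        # from the summary above
    DOUBLING_MONTHS = 7         # from the summary above
    WORK_MONTH_HOURS = 167      # assumption: about one month of full-time work

    target = WORK_MONTH_HOURS * 60
    months = months_until(target, CURRENT_MINUTES, DOUBLING_MONTHS)
    print(f"Months until a one-work-month horizon: {months:.1f}")

    in_five_years = extrapolate_horizon(CURRENT_MINUTES, DOUBLING_MONTHS, 60)
    print(f"Horizon after 60 months: {in_five_years / 60:.0f} hours")
```

Under these assumptions the one-work-month mark is reached in under 60 months, which is consistent with the "within five years" projection. The result is only as good as the assumption that the doubling trend continues unchanged.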
KBLaM: Knowledge Base augmented Language Model
What's new
Introduces the Knowledge Base augmented Language Model (KBLaM), a method for augmenting Large Language Models (LLMs) with external knowledge.
Eliminates external retrieval modules and reduces computational overhead.
How it works
Transforms knowledge into continuous key-value vector pairs using pre-trained sentence encoders.
Integrates knowledge into pre-trained LLMs via a specialized rectangular attention mechanism.
Allows dynamic updates without model fine-tuning or retraining.
Results
Effective in various tasks including question-answering and open-ended reasoning.
Provides interpretable insights into the use of augmented knowledge.
Why it matters
Integrates a large knowledge base of over 10K triples into an 8B pre-trained LLM on a single A100 80GB GPU.
Scales linearly with knowledge base size.
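A minimal sketch of what a rectangular attention step might look like, assuming each prompt token attends to every knowledge-base entry plus the prompt tokens up to its own position, while knowledge-base entries never attend to each other. With M entries and N prompt tokens the score matrix is N by (M + N) rather than square, which is where the linear scaling in M comes from. All names here are hypothetical, and the vectors are plain Python lists rather than real model activations.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rectangular_attention(queries, prompt_keys, prompt_values, kb_keys, kb_values):
    """Attention for N prompt tokens over M knowledge entries plus the prompt.

    Returns (outputs, weights). Row t of `weights` has M + t + 1 entries:
    the first M are the weights placed on knowledge-base entries.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs, weights = [], []
    for t, q in enumerate(queries):
        # All KB entries are visible; prompt tokens are visible causally.
        keys = kb_keys + prompt_keys[: t + 1]
        values = kb_values + prompt_values[: t + 1]
        w = softmax([dot(q, k) * scale for k in keys])
        dim = len(values[0])
        out = [sum(wi * v[d] for wi, v in zip(w, values)) for d in range(dim)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


if __name__ == "__main__":
    kb_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # M = 3 encoded knowledge keys
    kb_values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    prompt = [[1.0, 0.0], [0.0, 2.0]]                 # N = 2 prompt tokens
    outputs, weights = rectangular_attention(prompt, prompt, prompt, kb_keys, kb_values)
    for t, row in enumerate(weights):
        print(f"token {t}: KB weights = {[round(x, 3) for x in row[:3]]}")
```

Since the knowledge base enters only through `kb_keys` and `kb_values`, swapping entries in or out changes the lists without touching model weights, and the per-entry attention weights give a direct view of which entries a token relied on.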
👨‍💻 Open-Source
OpenBB-finance/OpenBB: OpenBB, the fully open-source financial platform offering comprehensive access to various investment sectors, designed for everyone, everywhere.
ai-dynamo/dynamo: A Datacenter Scale Distributed Inference Serving Framework - NVIDIA Dynamo is an open-source, high-throughput, low-latency inference framework designed for serving generative AI models in distributed environments.
hhy-huang/HiRAG: A comprehensive exploration of Retrieval-Augmented Generation with Hierarchical Knowledge (HiRAG), detailing its model pipeline, installation, evaluation, and results in enhancing information retrieval.