Machine Learns #45
OpenAI's social network & GPT-4.1, China launches $8.2B AI fund, NVIDIA's US manufacturing push, new GLM-4 & MineWorld models, C3PO expert pathways optimization, GigaTok's 3B visual tokenizer...
📌 Bookmarks
Adobe invests in Synthesia, the AI video startup that has surpassed $100 million in annual recurring revenue, highlighting its growing focus on enterprise AI tools.
Anthropic's AI chatbot, Claude, now integrates with Google Workspace, enhancing email and calendar management for users.
Google rolls out its AI video generator to Gemini Advanced subscribers.
OpenAI is developing a social network prototype that may integrate with ChatGPT, intensifying competition with Elon Musk and Meta.
China launches an $8.2 billion AI fund to boost its domestic ecosystem and reduce dependence on U.S. chip manufacturers like Nvidia and Broadcom.
NVIDIA announces plans to manufacture AI supercomputers entirely in the U.S., partnering with major companies to boost domestic production and job creation.
OpenAI launches GPT-4.1, an advanced AI model with improved context processing and reduced costs, alongside smaller versions for developers.
Hugging Face acquires Pollen Robotics to enhance open-source AI with the humanoid robot Reachy 2.
Meta’s AI research lab is ‘dying a slow death,’ some insiders say.
OpenAI co-founder Ilya Sutskever's AI startup Safe Superintelligence raises $2 billion, reaching a valuation of $32 billion.
A speculative timeline exploring the evolution of AI, from generative breakthroughs in 2026 to the emergence of artificial general intelligence by 2030.
Apple faces innovation challenges and market losses due to Trump tariffs and internal struggles with AI development.
🤖 Model Releases
Liquid is a multimodal large language model from ByteDance that integrates visual comprehension and generation by tokenizing images into discrete codes and learning the embeddings of those codes alongside text tokens in a shared feature space.
GLM-4-32B-0414 is a state-of-the-art multilingual multimodal chat model with 32 billion parameters, offering advanced capabilities in dialogue, reasoning, and creative tasks.
MineWorld is a real-time interactive world model designed for Minecraft, enabling users to generate scenes based on selected actions within the game.
KITPose is a novel keypoints-interactive model designed for general mammal pose estimation, utilizing structure-supporting dependencies among keypoints and body parts.
Rimecaster is an open source speaker representation model that enhances voice AI by using a diverse dataset of everyday conversations to produce more natural and human-like speech.
Beta release of multilingual text-to-speech models from Canopy Labs, with pretrained and fine-tuned checkpoints rolling out across several languages.
📎 Papers
C3PO: Critical-Layer, Core-Expert, Collaborative Pathway Optimization for Test-Time Expert Re-Mixing
What's new
C3PO: Critical-Layer, Core-Expert, Collaborative Pathway Optimization.
Focus on optimizing expert pathways in Mixture-of-Experts (MoE) Large Language Models (LLMs) during test time.
How it works
Creates a reference set of samples to use at test time for optimizing expert pathways.
C3PO dynamically re-weights experts in critical layers for each test sample by picking the nearest neighbors from the reference set.
Utilizes surrogate objectives based on successful neighbors from the reference set.
Implements three optimization methods: Mode Finding, Kernel Regression, and Neighborhood Gradient Descent (NGD).
The general idea: for each test sample, use kNN to find similar reference samples and re-mix the experts toward those neighbors' optimal pathways (a minimal sketch follows this list).
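Below is a minimal numpy sketch of the kernel-regression variant of this idea. It assumes each reference sample stores an embedding plus an optimized routing-weight tensor over critical layers and experts; the Gaussian kernel and all names are illustrative, not the paper's exact formulation.

```python
import numpy as np

def kernel_regression_pathways(test_emb, ref_embs, ref_pathways, k=8, bandwidth=1.0):
    """Re-mix expert routing weights for one test sample (kernel-regression variant).

    test_emb:     (d,) embedding of the test sample
    ref_embs:     (N, d) embeddings of the reference samples
    ref_pathways: (N, L, E) optimized (non-negative) routing weights of the
                  reference samples, for L critical layers and E experts
    Returns:      (L, E) re-mixed routing weights for the test sample
    """
    # Find the k nearest reference samples by Euclidean distance.
    dists = np.linalg.norm(ref_embs - test_emb, axis=1)   # (N,)
    nn = np.argsort(dists)[:k]                            # indices of the kNN

    # Gaussian kernel weights over the neighbor distances.
    w = np.exp(-(dists[nn] ** 2) / (2 * bandwidth ** 2))
    w /= w.sum()

    # Surrogate objective: kernel-weighted average of the neighbors' pathways.
    mixed = np.einsum("n,nle->le", w, ref_pathways[nn])   # (L, E)

    # Renormalize per critical layer so expert weights still sum to 1.
    return mixed / mixed.sum(axis=-1, keepdims=True)
```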
Results
C3PO improves accuracy by 7-15% over base models across six benchmarks.
MoE LLMs with only 1-3B active parameters, equipped with C3PO, outperform larger 7-9B-parameter models.
NGD achieves 85-95% of the theoretical maximum performance without needing ground truth labels.
GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation
What's new
GigaTok, a visual tokenizer scaled to 3 billion parameters.
Addresses the trade-off traditional tokenizers face between image reconstruction quality and downstream generation performance.
First approach to improve image reconstruction, generation, and representation learning simultaneously.
How it works
They find that increasing tokenizer size improves image reconstruction but degrades downstream generation performance.
Generation performance decreases due to the increased complexity of the tokenizer's latent space.
Utilizes semantic regularization to align tokenizer features with those of a pre-trained visual encoder (see the sketch after this list).
This regularization reduces latent-space complexity and makes the tokens easier for the downstream autoregressive (AR) generator to model.
Employs strategies for scaling tokenizers: 1D tokenizers, asymmetric encoder-decoder scaling, and entropy loss for training stability.
With all the above, they are able to scale the tokenizer to 3 billion parameters.
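A minimal PyTorch sketch of the semantic-regularization term: intermediate tokenizer-decoder features are aligned, through a learned linear projection, to the patch features of a frozen pre-trained encoder (e.g. DINOv2) with a cosine-similarity loss. The exact feature locations, projection, and loss weighting here are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def semantic_regularization_loss(tok_feats, sem_feats, proj):
    """Align tokenizer features with a frozen semantic encoder's features.

    tok_feats: (B, N, d_tok) intermediate features from the tokenizer decoder
    sem_feats: (B, N, d_sem) patch features from a frozen encoder (e.g. DINOv2)
    proj:      learned linear map from d_tok to d_sem
    """
    pred = proj(tok_feats)
    # Negative cosine similarity, averaged over all patch positions.
    return 1 - F.cosine_similarity(pred, sem_feats, dim=-1).mean()

# Toy usage with random tensors; in training this term would be added to the
# reconstruction loss with a weighting coefficient.
B, N, d_tok, d_sem = 2, 256, 512, 768
proj = torch.nn.Linear(d_tok, d_sem)
loss = semantic_regularization_loss(
    torch.randn(B, N, d_tok), torch.randn(B, N, d_sem), proj
)
```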
Results
GigaTok achieves state-of-the-art performance in reconstruction and downstream autoregressive generation.
Demonstrates improved learnability of tokens and better representation quality in downstream models.
Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model
What's new
Introduction of Seaweed-7B, a cost-effective video generation foundation model with 7 billion parameters.
Achieves competitive performance with moderate computational resources (665,000 H100 GPU hours).
How it works
Utilizes a variational autoencoder (VAE) to compress videos into latents and a latent diffusion transformer (DiT) to generate them (a training-step sketch follows this list).
Employs a hybrid-stream architecture for efficient processing of video and text tokens.
Implements multi-stage training from low to high resolution, optimizing GPU resource allocation.
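A sketch of one training step under assumed interfaces: vae, dit, and text_enc are placeholder modules, and the rectified-flow-style velocity target is a common parameterization for latent video DiTs rather than a confirmed detail of Seaweed-7B.

```python
import torch
import torch.nn.functional as F

def latent_dit_training_step(vae, dit, text_enc, video, prompt_tokens):
    """One latent-diffusion training step (placeholder module interfaces).

    vae.encode compresses video to latents of shape (B, C, T, H, W); dit
    predicts a velocity from noisy latents, time, and text conditioning.
    """
    with torch.no_grad():
        z = vae.encode(video)                 # compress video into latents
        cond = text_enc(prompt_tokens)        # text conditioning tokens

    t = torch.rand(z.shape[0], device=z.device)   # per-sample diffusion times
    noise = torch.randn_like(z)
    tb = t.view(-1, 1, 1, 1, 1)
    z_t = (1 - tb) * z + tb * noise           # interpolate data and noise

    target = noise - z                        # rectified-flow velocity target
    pred = dit(z_t, t, cond)                  # hybrid-stream DiT forward pass
    return F.mse_loss(pred, target)
```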
Results
Seaweed-7B ranks second in Elo comparison for image-to-video generation, outperforming several larger models.
Demonstrates superior inference efficiency, requiring only 12 neural function evaluations (NFEs) compared to 100 for Wan-2.1.
ALMTokenizer: A Low-bitrate and Semantic-rich Audio Codec Tokenizer for Audio Language Modeling
What's new
ALMTokenizer: low-bitrate, semantically rich audio codec tokenizer.
Novel query-based compression strategy for audio data.
Enhanced semantic information through multiple training losses (MAE loss, AR loss).
How it works
Converts audio signals into discrete tokens using a transformer-based framework.
Utilizes learnable query tokens to capture holistic context information across audio frames (see the sketch after this list).
Employs a two-stage training strategy to balance reconstruction performance and semantic richness.
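A minimal PyTorch sketch of the query-based compression idea: a handful of learnable query tokens cross-attend over a window of frame-level features and summarize it into far fewer tokens, which would then go to the quantizer. Dimensions and module layout are illustrative assumptions.

```python
import torch
import torch.nn as nn

class QueryCompressor(nn.Module):
    """Learnable query tokens summarize many audio frames into a few tokens."""

    def __init__(self, d_model=512, n_queries=4, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d_model))
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, frame_feats):           # (B, T, d_model) frame features
        B = frame_feats.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)  # (B, n_queries, d)
        # Queries cross-attend to every frame in the window and pool context.
        out, _ = self.attn(q, frame_feats, frame_feats)
        return out                             # (B, n_queries, d) -> quantizer

# Example: compress 100 frames into 4 holistic tokens per window.
tokens = QueryCompressor()(torch.randn(2, 100, 512))
print(tokens.shape)  # torch.Size([2, 4, 512])
```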
Results
ALMTokenizer outperforms previous state-of-the-art models in audio understanding and generation tasks.
Achieves competitive reconstruction performance at lower bitrates.
Demonstrates superior performance in tasks like text-to-speech, speech-to-text, and audio captioning.
👨‍💻 Open-Source
GitHub - BasedHardware/omi: open-source AI wearable that effortlessly captures and transcribes conversations while providing summaries and action items.
GitHub - GuijiAI/HeyGem.ai: an open-source alternative to HeyGen, featuring its latest version, deployment options, and opportunities for community co-creation.
GitHub - crestalnetwork/intentkit: An open and fair framework for everyone to build AI agents equipped with powerful skills. Launch your agent, improve the world, your wallet, or both!