👋 Everyone
Join our Discord channel for real-time AI news and updates, and take part in the conversation.
📌 Bookmarks
Bitcoin reached an all-time high above $82,000, fueled by post-election optimism and market expectations for further gains.
Near Protocol plans to build the world's largest open-source AI model with 1.4 trillion parameters, leveraging crowdsourced research and decentralized technology.
OpenAI is reportedly developing new strategies to address the slowdown in AI improvement, as its upcoming model, Orion, shows a smaller improvement over previous versions than earlier generational jumps did.
Palantir partners with Anthropic and AWS to integrate Claude AI models for U.S. intelligence and defense operations.
Uber optimizes LLM training by leveraging open-source and in-house models to enhance generative AI applications across its services.
📰 Blog Posts
Insights on the complexities of shipping projects in big tech, emphasizing leadership alignment and proactive problem-solving.
Explores why developers prefer clean code over documentation and how AI can alleviate the burden of maintaining effective documentation.
Key habits of top engineers that enhance code quality and collaboration.
🤖 Model Releases
RMBG v2.0 is a background removal model by BRIA AI, designed for foreground-background separation across various image types, suitable for commercial use.
SpeechBrain released an end-to-end automatic speech recognition system based on a Conformer model pre-trained on GigaSpeech, optimized for both streaming and full-context transcription.
Tencent-Hunyuan-Large is the largest open-source Transformer-based MoE model, with 389 billion total parameters and 52 billion active parameters.
OS-ATLAS is a foundation action model designed for generalist GUI agents.
Hertz-dev is an audio-only transformer model with 8.5 billion parameters, designed for real-time voice interaction and open-sourced for research and fine-tuning.
SmolLM2 is a collection of compact LLMs for on-device applications with sizes of 1.7B, 360M, and 135M parameters.
📎 Papers
[2411.05663] | [Code] Online-LoRA: Task-free Online Continual Learning via Low-Rank Adaptation
What's new
Introduction of Online-LoRA, a novel framework for task-free online continual learning (OCL).
Addresses catastrophic forgetting in non-stationary data streams without task boundaries.
How it works
Fine-tunes pre-trained Vision Transformer (ViT) models in real time.
Features a novel online weight regularization strategy to identify and consolidate important model parameters.
Uses the training dynamics of loss values to automatically detect data distribution shifts (see the sketch below).
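A minimal PyTorch sketch of the idea, not the authors' implementation: LoRA adapters are trained on top of a frozen layer, an importance-weighted penalty anchors previously consolidated parameters, and a simple loss-plateau test stands in for the paper's loss-dynamics shift detector. All names, thresholds, and the `lam` weight are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + B A x."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        return self.base(x) + x @ self.A.T @ self.B.T

def loss_plateau(losses, window=10, tol=1e-3):
    """Heuristic: the running loss has stopped changing, suggesting the model
    has settled on the current (unknown) data distribution."""
    if len(losses) < 2 * window:
        return False
    prev = sum(losses[-2 * window:-window]) / window
    curr = sum(losses[-window:]) / window
    return abs(prev - curr) < tol

def online_step(model, lora_params, optimizer, batch, losses, anchors, omegas, lam=1.0):
    """One task-free online update: task loss + importance-weighted anchor penalty."""
    x, y = batch
    loss = F.cross_entropy(model(x), y)
    for p, p_star, omega in zip(lora_params, anchors, omegas):
        loss = loss + lam * (omega * (p - p_star) ** 2).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
    if loss_plateau(losses):  # consolidate: update importances and anchor points
        for p, p_star, omega in zip(lora_params, anchors, omegas):
            omega += p.grad.detach() ** 2
            p_star.copy_(p.detach())
    return loss.item()
```

Here `anchors` would start as detached copies of the LoRA parameters and `omegas` as zero tensors of the same shapes.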
Results
Extensive experiments on benchmark datasets (CIFAR-100, ImageNet-R, ImageNet-S, CUB-200, CORe50).
Online-LoRA outperforms state-of-the-art (SOTA) methods.
BitNet a4.8: 4-bit Activations for 1-bit LLMs
What's new
Introduction of BitNet a4.8 for 4-bit activations in 1-bit LLMs.
Hybrid quantization and sparsification strategy to reduce quantization errors.
How it works
Utilizes 4-bit activations for attention and feed-forward layers.
Sparsifies intermediate states, which are then quantized to 8 bits.
Trained from 8-bit to 4-bit activations using a two-stage recipe (see the quantization sketch below).
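A rough sketch of the two activation paths described above, assuming per-token absmax scaling; the `keep_ratio`, the exact sparsification rule, and the scaling granularity are illustrative choices, not the paper's recipe.

```python
import torch

def act_quant_4bit(x: torch.Tensor):
    """Per-token absmax quantization of activations to signed 4-bit values [-8, 7]
    (the path used for attention and feed-forward inputs)."""
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-5) / 7.0
    q = (x / scale).round().clamp(-8, 7)
    return q, scale

def sparsify_quant_8bit(x: torch.Tensor, keep_ratio: float = 0.5):
    """Keep only the largest-magnitude intermediate activations, then quantize
    the survivors to 8-bit."""
    k = max(1, int(keep_ratio * x.shape[-1]))
    thresh = x.abs().topk(k, dim=-1).values[..., -1:]
    mask = x.abs() >= thresh
    x = x * mask
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-5) / 127.0
    q = (x / scale).round().clamp(-128, 127)
    return q, scale, mask

# Dequantization is q * scale; during training a straight-through estimator
# would let gradients pass through the rounding.
x = torch.randn(2, 16)
q4, s4 = act_quant_4bit(x)
print((q4 * s4 - x).abs().mean())   # mean quantization error
```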
Results
Comparable performance to BitNet b1.58 at the same training cost.
Delivers significantly more efficient inference.
Maintains performance parity, with negligible accuracy degradation as the model scales.
Cosmos Tokenizer: A suite of image and video neural tokenizers
What's new
Cosmos Tokenizer is an encoder-decoder model for learning discrete video/image representations.
How it works
Uses a 3D causal convolution block architecture to process spatiotemporal information and leverages causal temporal attention to capture long-range dependencies within the data.
For efficiency, the input data is downsampled using 3D wavelets, allowing the tokenizer's encoder-decoder modules to focus on significant features instead of redundant pixel details.
Uses FSQ (Finite Scalar Quantization) for discretization with a vocabulary size of 64k tokens (see the sketch below).
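A simplified sketch of how FSQ yields a 64k implicit codebook: each latent dimension is snapped to a small fixed grid, and the per-dimension grid positions combine into one token id. The level configuration (8, 8, 8, 5, 5, 5), whose product is 64,000, is an assumption for illustration, and the bounding scheme here is simplified relative to reference FSQ implementations.

```python
import torch
import torch.nn as nn

class FSQ(nn.Module):
    """Finite Scalar Quantization: no learned codebook; each latent dimension is
    rounded to one of L_i uniformly spaced levels, so vocab size = prod(L_i)."""
    def __init__(self, levels=(8, 8, 8, 5, 5, 5)):   # 8*8*8*5*5*5 = 64,000 codes
        super().__init__()
        self.register_buffer("levels", torch.tensor(levels, dtype=torch.float32))

    def forward(self, z):                    # z: (..., len(levels)) encoder output
        z = torch.tanh(z)                    # squash each dimension into (-1, 1)
        L = self.levels
        q = torch.round((z + 1) / 2 * (L - 1)) / (L - 1) * 2 - 1   # snap to the grid
        return z + (q - z).detach()          # straight-through estimator

    def to_indices(self, q):
        """Combine per-dimension grid positions into a single integer token id."""
        L = self.levels
        digits = torch.round((q + 1) / 2 * (L - 1)).long()         # 0 .. L_i - 1
        base = torch.cumprod(torch.cat([torch.ones(1), L[:-1]]), dim=0).long()
        return (digits * base).sum(dim=-1)

fsq = FSQ()
tokens = fsq.to_indices(fsq(torch.randn(4, 6)))
print(tokens)    # integer ids in [0, 63999]
```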
Results
SOTA results, with reconstruction up to 12 times faster than other leading open-weight tokenizers.
[2411.04996] Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models
What's new
Introduction of Mixture-of-Transformers (MoT), a sparse multi-modal transformer architecture.
Designed for processing text, images, and speech within a unified framework.
How it works
Decouples non-embedding parameters by modality (feed-forward networks, attention matrices, layer normalization).
Enables modality-specific processing with global self-attention over the full input sequence (see the sketch below).
Reduces pretraining computational costs significantly.
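A minimal PyTorch sketch of that decoupling, not the authors' code: every non-embedding weight, here the QKV/output projections, the feed-forward network, and the layer norms, is a per-modality copy selected by each token's modality id, while the attention itself is computed once over the full mixed sequence. Dimensions and names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoTBlock(nn.Module):
    """Per-modality copies of all non-embedding weights (QKV/output projections,
    FFN, layer norms); the self-attention computation is shared and global."""
    def __init__(self, d=512, n_heads=8, modalities=("text", "image", "speech")):
        super().__init__()
        self.d, self.h, self.modalities = d, n_heads, modalities
        mk = lambda f: nn.ModuleDict({m: f() for m in modalities})
        self.qkv = mk(lambda: nn.Linear(d, 3 * d))
        self.out = mk(lambda: nn.Linear(d, d))
        self.ffn = mk(lambda: nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(),
                                            nn.Linear(4 * d, d)))
        self.norm1 = mk(lambda: nn.LayerNorm(d))
        self.norm2 = mk(lambda: nn.LayerNorm(d))

    def route(self, mods, x, mod_ids, out_dim=None):
        """Run each token through its own modality's copy of a module."""
        out = x.new_zeros(*x.shape[:-1], out_dim or x.shape[-1])
        for i, m in enumerate(self.modalities):
            mask = mod_ids == i
            if mask.any():
                out[mask] = mods[m](x[mask])
        return out

    def forward(self, x, mod_ids):           # x: (B, T, D), mod_ids: (B, T) ints
        B, T, D = x.shape
        h = self.route(self.norm1, x, mod_ids)
        qkv = self.route(self.qkv, h, mod_ids, 3 * D)
        q, k, v = qkv.view(B, T, 3, self.h, D // self.h).permute(2, 0, 3, 1, 4)
        attn = F.scaled_dot_product_attention(q, k, v)   # global over all modalities
        x = x + self.route(self.out, attn.transpose(1, 2).reshape(B, T, D), mod_ids)
        h = self.route(self.norm2, x, mod_ids)
        return x + self.route(self.ffn, h, mod_ids)

block = MoTBlock()
x, mod_ids = torch.randn(2, 10, 512), torch.randint(0, 3, (2, 10))
print(block(x, mod_ids).shape)               # torch.Size([2, 10, 512])
```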
Results
In the Chameleon 7B setting, MoT matches the dense baseline's performance using only 55.8% of the FLOPs.
MoT achieves comparable speech performance to the dense baseline with only 37.2% of the FLOPs.
In the Transfusion setting, a 7B MoT model matches the dense baseline's image-modality performance with one-third of the FLOPs.
A 760M MoT model outperforms a 1.4B dense baseline in key image generation metrics.
👨‍💻 Open-Source
GitHub - bytedance/Protenix: A trainable PyTorch reproduction of AlphaFold 3.
GitHub - xdit-project/mochi-xdit: Faster parallel inference of the Mochi video generation model
GitHub - varungodbole/prompt-tuning-playbook: A playbook for effectively prompting post-trained LLMs