arXiv, 2024

Paper citation: Deitke, Matt, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi et al. “Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models.” arXiv preprint arXiv:2409.17146 (2024).

Two Sentence Summary

  • The paper introduces Molmo, a family of open-weight vision-language models.
  • Molmo achieves state-of-the-art performance among open models without relying on synthetic data distilled from proprietary systems.

Dataset

  • The models are trained on PixMo, a newly collected dataset.
  • PixMo's core is a set of highly detailed image captions gathered through a speech-based annotation protocol.
  • Asking annotators to speak rather than type yields longer, more comprehensive image descriptions.

Performance

  • Molmo models achieve state-of-the-art results among open-weight, open-data models.
  • They perform strongly on both academic benchmarks and human preference evaluations.
  • The best model compares favorably against proprietary systems such as GPT-4o and Claude 3.5, outperforming several of them.

Approach

Implementation Components

Data Collection:

  • Annotators described each image aloud for 60–90 seconds.
  • The resulting dense spoken captions were transcribed into text for processing (a possible transcription step is sketched below).
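
The paper does not prescribe transcription tooling; the sketch below shows one minimal way to turn the recorded descriptions into text, assuming a Whisper checkpoint served through the transformers speech-recognition pipeline. The model name, chunking setting, and audio path are illustrative assumptions, not the authors' setup.

from transformers import pipeline

# Hypothetical transcription step: Whisper is an assumption, not the authors' tooling.
# chunk_length_s=30 lets the pipeline process the 60-90 s clips in windows.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small", chunk_length_s=30)

def transcribe_description(audio_path: str) -> str:
    """Turn one annotator's recorded spoken description into a caption draft."""
    return asr(audio_path)["text"]

caption_text = transcribe_description("annotations/clip_0001.wav")  # illustrative path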

Model Training:

  • Training involved two major stages (a schematic sketch follows this list):
  • Stage 1: multimodal pre-training on caption generation using the PixMo captions.
  • Stage 2: supervised fine-tuning on a diverse mixture of datasets.
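
Schematically, both stages can be viewed as the same next-token-prediction loop run over different data. The sketch below is illustrative only: stage_texts is a hypothetical data loader, the learning rates are made up, and model/tokenizer are assumed to be a Hugging Face causal-LM pair loaded as in Step 1 of the code section below.

from torch.optim import AdamW

def run_stage(model, tokenizer, texts, lr=1e-5):
    """One training stage: plain next-token prediction over the stage's texts."""
    optimizer = AdamW(model.parameters(), lr=lr)
    for text in texts:
        batch = tokenizer(text, return_tensors="pt")
        loss = model(**batch, labels=batch["input_ids"]).loss  # language-modeling loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Stage 1: caption pre-training on PixMo; Stage 2: fine-tuning on a task mixture.
# stage_texts is a hypothetical loader; the dataset names are illustrative.
run_stage(model, tokenizer, stage_texts("pixmo_captions"))
run_stage(model, tokenizer, stage_texts("sft_mixture"), lr=2e-6)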

Python Code Sketch for the Molmo Pipeline

The snippets below are an illustrative sketch, not the authors' released code: the model ID, file paths, and column names are placeholders, and the exact processor call for a given checkpoint may differ.

Step 1: Preparing the inputs

import pandas as pd
from transformers import AutoProcessor, AutoModelForCausalLM

MODEL_ID = "model-name"  # placeholder; the released Molmo checkpoints are published under the "allenai" org on Hugging Face
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
tokenizer = processor.tokenizer  # text tokenizer bundled with the processor
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, trust_remote_code=True)
image_data = pd.read_csv('path_to_image_data.csv')  # placeholder CSV with an 'image_path' column
image_descriptions = []

Step 2: Generate captions for images

from PIL import Image

for _, row in image_data.iterrows():
    image = Image.open(row['image_path'])  # load the image from disk
    inputs = processor(images=image, text="Describe this image.", return_tensors="pt")  # exact call varies by checkpoint
    output_ids = model.generate(**inputs, max_new_tokens=128)  # autoregressive caption generation
    caption = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    image_descriptions.append(caption)

Step 3: Fine-tuning the model on the annotated data

from torch.optim import AdamW

fine_tune_data = pd.read_csv('path_to_fine_tuning_data.csv')  # placeholder CSV with a 'text' column
inputs = tokenizer(list(fine_tune_data['text']), return_tensors="pt", padding=True, truncation=True)
optimizer = AdamW(model.parameters(), lr=1e-5)  # learning rate is illustrative
outputs = model(**inputs, labels=inputs["input_ids"])  # causal-LM loss over the fine-tuning text
outputs.loss.backward()  # gradients are required here, so no torch.no_grad()
optimizer.step()
model.save_pretrained('path_to_save_model_weights/')  # save the updated weights

In the code snippets above:

  • The first block loads the processor/model pair and the image metadata.
  • The second block generates a caption for each image with the model.
  • The third block runs one illustrative gradient step on text data and saves the updated weights.

Summary of the Evaluation

  • The Molmo models were evaluated on 11 academic benchmarks and in a large human preference study.
  • They showed clear advantages in both benchmark performance and human evaluations.
  • Molmo-72B achieved the highest average score on the academic benchmarks.
  • It placed second in the human preference ranking, just behind GPT-4o, showing close agreement between the automated and human metrics (an Elo-style update is sketched below).
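
For context on the human evaluation, pairwise preference votes are typically converted into a ranking with an Elo-style update. The sketch below shows the standard Elo formula; the K-factor, starting ratings, and function name are illustrative assumptions, not the paper's exact procedure.

def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Standard Elo update for one pairwise vote.
    score_a: 1.0 if model A is preferred, 0.0 if model B is, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b

# Example: both models start at 1000 and model A wins one comparison.
r_a, r_b = elo_update(1000.0, 1000.0, score_a=1.0)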

Important Findings from the Evaluation Results

  • Despite relying only on openly collected, high-quality data rather than synthetic output from proprietary systems, Molmo models perform close to systems like GPT-4o.
  • This highlights the potential for open models to compete with proprietary solutions in AI.

Analysis: Pros

  • Open-Weight Models: Encourage transparency and accessibility for researchers.
  • Innovative Data Collection Approach: Speech-based image descriptions enhance dataset richness.
  • Strong Performance: Outperforms several proprietary competitors through an effective, fully open training methodology.

Analysis: Cons

  • Reliance on High-Quality Input Data: Performance hinges on carefully curated human annotations, which are costly to scale.
  • Potential for Overfitting: Training on a highly specific caption style may limit generalization to unseen image distributions.
  • Computational Resource Usage: Training requires significant resources, limiting access for smaller entities.

The article is also available on Medium.

