Try Multimodal Search with ColQwen2!

Published: 1/4/2025
Categories: multimodal, llm, rag, python
Author: m_sea_bass

In this article, we walk through how to use ColQwen2.

ColQwen2 is based on Qwen2-VL-2B and generates ColBERT-style multi-vector representations, enabling highly accurate searches across text and image inputs.

We will test ColQwen2 using Google Colab with an A100 GPU.
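
If you want to confirm which GPU the Colab runtime has assigned before running anything heavy, a quick check:

!nvidia-smi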

Library Installation

First, install the necessary libraries:

!pip install git+https://github.com/illuin-tech/colpali
!pip install pymupdf

Preparing Image Data

Next, prepare the image data. For this tutorial, we’ll use the ColPali paper.

Using pymupdf, we’ll extract images from the PDF file:

import pymupdf
import os

# Constants
DPI = 350  # Can be modified as needed

def convert_pdf_to_images(pdf_path, output_dir):
    """
    Convert PDF pages to images.
    Args:
        pdf_path (str): Path to the PDF file.
        output_dir (str): Directory to save images.
    """
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    pdf_document = pymupdf.open(pdf_path)

    for page_number in range(pdf_document.page_count):
        page = pdf_document[page_number]
        pix = page.get_pixmap(dpi=DPI)
        output_file = os.path.join(output_dir, f'page_{page_number + 1:02}.png')
        pix.save(output_file)

    pdf_document.close()

pdf_path = "/content/2407.01449v3.pdf"
output_dir = "output_images"
convert_pdf_to_images(pdf_path, output_dir)

Images will be saved in the "output_images" folder.
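
A quick sanity check that the pages were written out. The zero-padded filenames keep the pages in order when sorted, which we rely on later to map score indices back to page numbers:

import glob, os

image_paths = sorted(glob.glob(os.path.join(output_dir, "*.png")))
print(len(image_paths), "pages rendered")
print(image_paths[:3])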

Searching the Images

Now, let’s use ColQwen2. Sample code is available on the model’s Hugging Face page.

After downloading and uploading the paper PDF to Google Colab, execute the following code:

import glob, os

import torch
from PIL import Image

from colpali_engine.models import ColQwen2, ColQwen2Processor

device = "cuda:0" if torch.cuda.is_available() else "cpu"

print(f"cuda available: {torch.cuda.is_available()}")

model = ColQwen2.from_pretrained(
        "vidore/colqwen2-v0.1",
        torch_dtype=torch.bfloat16,
        device_map=device,  # or "mps" if on Apple Silicon
    ).eval()
processor = ColQwen2Processor.from_pretrained("vidore/colqwen2-v0.1")

# Your inputs
images = [Image.open(filepath) for filepath in sorted(glob.glob(os.path.join(output_dir, "*.png")))]  # sort so index i corresponds to page i+1
queries = [
    "What is the architecture of ColPali?",
    "How does it differ from previous studies?",
]

# Process the queries (the images are processed page by page in the loop below,
# so we avoid encoding all pages in a single batch)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass: embed the pages one at a time to keep GPU memory usage low
batch_size = 1
image_embeddings = []
for i in range(0, len(images), batch_size):
    batch = images[i : i + batch_size]
    resized_batch = [img.resize((512, 512)) for img in batch]  # resize so every page yields the same number of embeddings (and fits in memory)
    batch_images = processor.process_images(resized_batch).to(model.device)
    with torch.no_grad():
        embeddings = model(**batch_images)
    image_embeddings.extend(embeddings)
with torch.no_grad():
    query_embeddings = model(**batch_queries)

image_embeddings = torch.stack(image_embeddings)

scores = processor.score_multi_vector(query_embeddings, image_embeddings)

print(scores)

The scores are returned as a tensor with one row per query and one column per page:

tensor([[13.2500,  8.4375, 11.3750, 11.1875, 13.8125, 12.0000,  8.3125,  9.0000,
         10.4375,  8.7500, 10.4375, 11.6250,  7.8438,  7.4375,  9.9375,  8.0625,
          7.5000, 10.9375,  9.7500,  7.8750],
        [ 8.3750,  7.5000,  9.6250,  8.3125,  7.5625,  8.1250,  7.9688,  8.4375,
          8.5000,  9.0625,  7.7812,  8.3125,  7.5000,  7.9062,  8.6875,  7.9688,
          7.9062,  7.9688,  8.7500,  7.5000]])
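
Under the hood, score_multi_vector performs ColBERT-style late interaction: each query-token embedding is matched against its most similar image-patch embedding, and those maxima are summed. A minimal sketch of that scoring for a single query/page pair (ignoring the padding and masking that the processor handles for you):

def late_interaction_score(query_emb, image_emb):
    # query_emb: (num_query_tokens, dim), image_emb: (num_image_tokens, dim)
    sim = query_emb @ image_emb.T          # token-to-token similarities
    return sim.max(dim=1).values.sum()     # best match per query token, then sum

# e.g. late_interaction_score(query_embeddings[0], image_embeddings[0])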

Visualizing Scores

Let’s visualize the scores:

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np


scores_df = pd.DataFrame(scores.cpu().numpy(), columns=[f'Image {i+1}' for i in range(scores.shape[1])]).T
scores_df.index.name = 'Images'

# Create two separate bar plots side-by-side
plt.figure(figsize=(12, 6))

# First bar plot
plt.subplot(1, 2, 1)
sns.barplot(x=scores_df.index, y=scores_df[0], color="skyblue")
plt.title("Query: " + queries[0])
plt.xticks(rotation=45, ha="right")
plt.ylabel('Score')

# Second bar plot
plt.subplot(1, 2, 2)
sns.barplot(x=scores_df.index, y=scores_df[1], color="lightcoral")
plt.title("Query: " + queries[1])
plt.xticks(rotation=45, ha="right")
plt.ylabel('Score')

plt.tight_layout()
plt.show()

(Bar plots of the per-page scores for the two queries.)

Inspecting the Top Results

Let’s check the top 2 results:

# Top-2 pages per query
top_scores, top_indices = scores.topk(2, dim=1)

for query, page_indices in zip(queries, top_indices.tolist()):
    print(f"{query}: {page_indices}")
    for idx in page_indices:
        # Filenames are zero-padded, matching convert_pdf_to_images
        image_path = os.path.join(output_dir, f"page_{idx + 1:02}.png")
        display(Image.open(image_path))

Query 1: What is the architecture of ColPali?

Top 2 results:

1st: Page 5


2nd: Page 1


Page 5, which received the highest relevance score, contains the word “Architecture.” However, the architecture diagram on page 2 received a lower score.

Query 2: How does it differ from previous studies?

Top 2 results:

1st: Page 3


2nd: Page 10


Page 3 contains the “Related Work” content, but the start of the related-work section on page 2 scored lower. Page 10, which lists the references, also scored high, as might be expected given that references point to previous studies.

Conclusion

We tested image search using ColQwen2. Searching entire PDF pages proved challenging; for practical use, extracting figures as standalone images might improve results.
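
As a rough sketch of that idea (only an assumption about how the PDF stores its figures; plots drawn as vector graphics would not be captured this way), pymupdf can also pull out the images embedded in each page instead of rendering whole pages:

import os
import pymupdf

def extract_embedded_images(pdf_path, output_dir):
    """Save the images embedded in each PDF page as standalone files."""
    os.makedirs(output_dir, exist_ok=True)
    doc = pymupdf.open(pdf_path)
    for page_number in range(doc.page_count):
        for img_index, img in enumerate(doc[page_number].get_images(full=True)):
            xref = img[0]  # cross-reference number of the embedded image
            info = doc.extract_image(xref)
            output_file = os.path.join(
                output_dir, f"page_{page_number + 1:02}_img_{img_index}.{info['ext']}"
            )
            with open(output_file, "wb") as f:
                f.write(info["image"])
    doc.close()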

To extract text, images, and tables more effectively from PDFs, consider tools like pymupdf4llm.
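
For example, a minimal sketch (the package is installed separately with pip install pymupdf4llm):

import pymupdf4llm

# Convert the whole PDF to Markdown; tables and layout are handled
# better than with plain-text extraction
md_text = pymupdf4llm.to_markdown("/content/2407.01449v3.pdf")
print(md_text[:500])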
