Try Multimodal Search with ColQwen2!
This article shows how to use ColQwen2 for multimodal search.
ColQwen2 is based on Qwen2-VL-2B and generates ColBERT-style multi-vector representations, enabling highly accurate searches across text and image inputs.
We will test ColQwen2 using Google Colab with an A100 GPU.
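To make "ColBERT-style multi-vector representations" more concrete, here is a minimal pure-Python sketch of late-interaction (MaxSim) scoring: each query vector is matched to its most similar page vector, and the best matches are summed. The toy vectors and the helper names `dot` and `maxsim_score` are made up for illustration; the library's own scoring is batched and runs on tensors, so treat this as a picture of the idea rather than its implementation.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Vector], page_vecs: List[Vector]) -> float:
    """Late-interaction score: for every query vector, take its best
    match among the page vectors, then sum those best matches."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


# Toy data (hypothetical): 2 query vectors, pages with 3 and 2 vectors
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
page_b = [[0.1, 0.1], [0.3, 0.2]]

score_a = maxsim_score(query, page_a)  # best matches: 0.9 and 0.8
score_b = maxsim_score(query, page_b)  # best matches: 0.3 and 0.2
print(score_a > score_b)
```

Because every query vector looks for its own best match, a page can score well when different regions of it answer different parts of the query.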
Library Installation
First, install the necessary libraries:
!pip install git+https://github.com/illuin-tech/colpali
!pip install pymupdf
Preparing Image Data
Next, prepare the image data. For this tutorial, we’ll use the ColPali paper.
Using pymupdf, we’ll render each page of the PDF file as a PNG image:
import pymupdf
import os

# Constants
DPI = 350  # Can be modified as needed


def convert_pdf_to_images(pdf_path, output_dir):
    """
    Convert PDF pages to images.

    Args:
        pdf_path (str): Path to the PDF file.
        output_dir (str): Directory to save images.
    """
    # Create the output directory if it does not exist yet
    os.makedirs(output_dir, exist_ok=True)

    pdf_document = pymupdf.open(pdf_path)
    for page_number in range(pdf_document.page_count):
        page = pdf_document[page_number]
        # Render the page to a raster image at the chosen resolution
        pix = page.get_pixmap(dpi=DPI)
        # Zero-padded, 1-based file names: page_01.png, page_02.png, ...
        output_file = os.path.join(output_dir, f'page_{page_number + 1:02}.png')
        pix.save(output_file)
    pdf_document.close()


pdf_path = "/content/2407.01449v3.pdf"
output_dir = "output_images"
convert_pdf_to_images(pdf_path, output_dir)
Images will be saved in the "output_images" folder.
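The DPI setting determines how large each saved image is. PDF page sizes are expressed in points (72 per inch), so the rendered size can be estimated with simple arithmetic. The sketch below assumes a US Letter page (612 x 792 points) purely for illustration; the actual page size of the paper, and the exact rounding pymupdf applies, may differ. `rendered_size` is a hypothetical helper, not part of pymupdf.

```python
POINTS_PER_INCH = 72  # PDF page dimensions are measured in points


def rendered_size(width_pt: float, height_pt: float, dpi: int) -> tuple:
    """Estimate the pixel dimensions of a page rendered at a given DPI."""
    width_px = round(width_pt * dpi / POINTS_PER_INCH)
    height_px = round(height_pt * dpi / POINTS_PER_INCH)
    return width_px, height_px


# Assumed page size: US Letter, 612 x 792 pt (8.5 x 11 inches)
width_px, height_px = rendered_size(612, 792, dpi=350)
print(width_px, height_px)  # 2975 3850
print(f"{width_px * height_px / 1e6:.1f} megapixels per page")
```

Since the search code later in this article resizes every page to 512 x 512 before embedding, a lower DPI may be sufficient and would make the conversion faster and the files smaller.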
Searching the Images
Now, let’s use ColQwen2. Refer to the Hugging Face model page for sample code.
After downloading the paper PDF and uploading it to Google Colab, run the following code:
import glob, os
import torch
from PIL import Image
from colpali_engine.models import ColQwen2, ColQwen2Processor

device = "cuda:0" if torch.cuda.is_available() else "cpu"
print(f"cuda available: {torch.cuda.is_available()}")

model = ColQwen2.from_pretrained(
    "vidore/colqwen2-v0.1",
    torch_dtype=torch.bfloat16,
    device_map=device,  # or "mps" if on Apple Silicon
).eval()
processor = ColQwen2Processor.from_pretrained("vidore/colqwen2-v0.1")

# Your inputs. glob.glob does not guarantee any order, so sort the paths
# to make image i correspond to page i + 1.
image_paths = sorted(glob.glob(os.path.join(output_dir, "*.png")))
images = [Image.open(filepath) for filepath in image_paths]
queries = [
    "What is the architecture of ColPali?",
    "How does it differ from previous studies?",
]

# Process the queries (images are processed batch by batch below,
# rather than all at once, to keep GPU memory use low)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass
batch_size = 1  # Reduced batch size to 1
image_embeddings = []
for i in range(0, len(images), batch_size):
    batch = images[i : i + batch_size]
    resized_batch = [img.resize((512, 512)) for img in batch]  # Resize before processing
    batch_images = processor.process_images(resized_batch).to(model.device)
    with torch.no_grad():
        embeddings = model(**batch_images)
    image_embeddings.extend(embeddings)  # adds one tensor per image

with torch.no_grad():
    query_embeddings = model(**batch_queries)

# Combine the per-image tensors into a single tensor before scoring
image_embeddings = torch.stack(image_embeddings)
scores = processor.score_multi_vector(query_embeddings, image_embeddings)
print(scores)
The scores are returned as a 2D tensor (a matrix) with one row per query and one column per image:
tensor([[13.2500, 8.4375, 11.3750, 11.1875, 13.8125, 12.0000, 8.3125, 9.0000,
10.4375, 8.7500, 10.4375, 11.6250, 7.8438, 7.4375, 9.9375, 8.0625,
7.5000, 10.9375, 9.7500, 7.8750],
[ 8.3750, 7.5000, 9.6250, 8.3125, 7.5625, 8.1250, 7.9688, 8.4375,
8.5000, 9.0625, 7.7812, 8.3125, 7.5000, 7.9062, 8.6875, 7.9688,
7.9062, 7.9688, 8.7500, 7.5000]])
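Reading this matrix correctly depends on column i corresponding to page i + 1. `glob.glob` does not guarantee any particular order, so it is worth sorting the image paths before embedding them. The sketch below sorts by the page number embedded in the file name; `page_number` is a hypothetical helper, and the example paths are made up to follow the `page_NN.png` pattern used during conversion.

```python
import re


def page_number(path: str) -> int:
    """Extract the page number from a path that ends in page_<number>.png."""
    match = re.search(r"page_(\d+)\.png$", path)
    if match is None:
        raise ValueError(f"Unexpected file name: {path}")
    return int(match.group(1))


# Example paths in the arbitrary order a directory listing might return
paths = [
    "output_images/page_10.png",
    "output_images/page_02.png",
    "output_images/page_01.png",
]

ordered = sorted(paths, key=page_number)
print(ordered)
```

With zero-padded names a plain `sorted(paths)` gives the same order for up to 99 pages; sorting by the parsed number also works when the names are not padded.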
Visualizing Scores
Let’s visualize the scores:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
scores_df = pd.DataFrame(scores.cpu().numpy(), columns=[f'Image {i+1}' for i in range(scores.shape[1])]).T
scores_df.index.name = 'Images'
# Create two separate bar plots side-by-side
plt.figure(figsize=(12, 6))
# First bar plot
plt.subplot(1, 2, 1)
sns.barplot(x=scores_df.index, y=scores_df[0], color="skyblue")
plt.title("Query: " + queries[0])
plt.xticks(rotation=45, ha="right")
plt.ylabel('Score')
# Second bar plot
plt.subplot(1, 2, 2)
sns.barplot(x=scores_df.index, y=scores_df[1], color="lightcoral")
plt.title("Query: " + queries[1])
plt.xticks(rotation=45, ha="right")
plt.ylabel('Score')
plt.tight_layout()
plt.show()
Inspecting the Top Results
Let’s check the top 2 results:
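# Indices of the two highest-scoring images for each query
top_indices = scores.topk(2, dim=1).indices

for query, page_indices in zip(queries, top_indices.tolist()):
    print(f"{query}: pages {[idx + 1 for idx in page_indices]}")
    for idx in page_indices:
        # File names are zero-padded (page_01.png, ...), as written during conversion
        image_path = os.path.join(output_dir, f"page_{idx + 1:02}.png")
        display(Image.open(image_path))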
Query 1: What is the architecture of ColPali?
Top 2 results:
1st: Page 5
2nd: Page 1
Page 5, which has the highest score, includes the word “Architecture.” However, page 2, which contains the architecture diagram, received a lower score.
Query 2: How does it differ from previous studies?
Top 2 results:
1st: Page 3
2nd: Page 10
Page 3 contains the “Related Work” section, but page 2, where that section begins, scored lower. Page 10, which includes references, scored higher, as expected.
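These rankings can be cross-checked against the printed score matrix without a GPU. The sketch below copies the scores shown earlier into plain Python lists and ranks the columns of each row. It assumes that column i corresponds to page i + 1; `top_k_pages` is a hypothetical helper name.

```python
# Scores copied from the printed tensor (row 0: query 1, row 1: query 2)
printed_scores = [
    [13.2500, 8.4375, 11.3750, 11.1875, 13.8125, 12.0000, 8.3125, 9.0000,
     10.4375, 8.7500, 10.4375, 11.6250, 7.8438, 7.4375, 9.9375, 8.0625,
     7.5000, 10.9375, 9.7500, 7.8750],
    [8.3750, 7.5000, 9.6250, 8.3125, 7.5625, 8.1250, 7.9688, 8.4375,
     8.5000, 9.0625, 7.7812, 8.3125, 7.5000, 7.9062, 8.6875, 7.9688,
     7.9062, 7.9688, 8.7500, 7.5000],
]
example_queries = [
    "What is the architecture of ColPali?",
    "How does it differ from previous studies?",
]


def top_k_pages(row, k=2):
    """Return the 1-based page numbers of the k highest scores in a row."""
    order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
    return [i + 1 for i in order[:k]]


for query, row in zip(example_queries, printed_scores):
    print(f"{query} -> pages {top_k_pages(row)}")
```

For the first query the two best scores are 13.8125 (page 5) and 13.2500 (page 1); for the second they are 9.6250 (page 3) and 9.0625 (page 10), which matches the results listed above.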
Conclusion
We tested image search using ColQwen2. Searching entire PDF pages proved challenging; for practical use, extracting figures as standalone images might improve results.
To extract text, images, and tables more effectively from PDFs, consider tools like pymupdf4llm.