Stress Testing VLMs: Multi QnA and Description Tasks

Published: 10/14/2024
Categories: tutorial, streamlit, vlm, benchmarking
Author: aryankargwal

Video Link: https://youtu.be/pwW9zwVQ4L8
Repository Link: https://github.com/aryankargwal/genai-tutorials/tree/main


In the fast-evolving world of AI, Vision-Language Models (VLMs) have garnered attention for their ability to understand and generate responses based on visual and textual inputs. However, testing these models in a structured environment and comparing their performance across scenarios remains challenging. This blog walks through an experiment in which we used a custom-built Streamlit web application to stress test multiple VLMs, including Llama 3.2, Qwen2-VL, and GPT-4o, across a range of tasks, analyzing their response tokens, latency, and accuracy on complex, multimodal questions.

Please note, however, that most of the findings are not shared here, as this application is part of my ongoing work on a VLM benchmark; the first release, SynCap-Flickr8K, is already available on Hugging Face!

Why Compare Vision-Language Models?

The ability to compare the performance of different VLMs across domains is critical for:

  1. Understanding model efficiency (tokens used, latency).
  2. Measuring how well models can generate coherent responses based on image inputs and textual prompts.
  3. Creating benchmark datasets to further improve and fine-tune VLMs.

To achieve this, we built a VLM Stress Testing Web App in Python, utilizing Streamlit for a user-friendly interface. This allowed us to upload images, input textual prompts, and obtain model-generated responses in real time. The app also calculated and logged critical metrics such as the number of tokens used in responses and latency.
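To give a sense of how the pieces fit together, here is a minimal sketch of the Streamlit front end. The widget layout and model names are illustrative placeholders rather than the exact code from app.py, and the actual query call (shown in the next section) is stubbed out:

import streamlit as st

st.title("VLM Stress Test")

# Placeholder model names; the real app maps display names to provider model IDs.
model_id = st.selectbox("Model", ["llama-3.2-vision", "qwen2-vl", "gpt-4o"])
uploaded_image = st.file_uploader("Upload an image", type=["jpg", "jpeg", "png"])
question = st.text_input("Ask a question about the image")

if uploaded_image and question and st.button("Run query"):
    # In the real app, query_model (defined below) is called here, and the answer,
    # latency, and token count come from its response.
    answer, latency, tokens = "(model answer)", 0.0, 0  # stubbed values
    st.write(answer)
    col1, col2 = st.columns(2)
    col1.metric("Latency (s)", f"{latency:.2f}")
    col2.metric("Response tokens", tokens)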

Project Setup

Our main application file, app.py, uses Streamlit for the frontend and makes API requests to the different VLMs. Each query to a model includes:

  • Image: Encoded in Base64 format (a short encoding sketch follows this list).
  • Question: A text input by the user.
  • Model ID: We allow users to choose between multiple VLMs.
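For reference, producing the Base64 string that goes into the request's data URL takes only a few lines; the sketch below assumes a local JPEG file, with the filename chosen purely as an example:

import base64

def encode_image(image_path):
    # Read the raw image bytes and return them as a UTF-8 Base64 string.
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

base64_image = encode_image("sample.jpg")  # example filename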

The API response includes:

  • Answer: The model-generated text.
  • Latency: Time taken for the model to generate the answer.
  • Token Count: Number of tokens used by the model in generating the response.

Below is the code structure for querying the models:

import requests

# The API endpoint `url` and auth `headers` (including the provider API key) are
# configured elsewhere in app.py, e.g.:
# url = "<provider chat-completions endpoint>"
# headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

def query_model(base64_image, question, model_id, max_tokens=300, temperature=0.9,
                stream=False, frequency_penalty=0.2):
    # The image is sent as a Base64-encoded data URL, following the
    # OpenAI-style multimodal message format.
    image_content = {
        "type": "image_url",
        "image_url": {
            "url": f"data:image/jpeg;base64,{base64_image}"
        }
    }

    prompt = question

    # A single user message carries both the text prompt and the image.
    data = {
        "model": model_id,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    image_content
                ]
            }
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stream": stream,
        "frequency_penalty": frequency_penalty
    }

    response = requests.post(url, headers=headers, json=data)
    return response.json()
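As a rough usage sketch, a single measurement could look like the following. The response parsing assumes an OpenAI-style chat-completions schema (answer under choices[0].message.content, token count under usage.completion_tokens), and the question and model ID are illustrative:

import time

base64_image = encode_image("sample.jpg")  # example image, see the encoding sketch above
question = "Describe any abnormality visible in this scan."

start = time.perf_counter()
result = query_model(base64_image, question, model_id="gpt-4o")
latency = time.perf_counter() - start

# Assumed OpenAI-style response fields.
answer = result["choices"][0]["message"]["content"]
tokens = result.get("usage", {}).get("completion_tokens", 0)

print(f"Answer: {answer}")
print(f"Latency: {latency:.2f}s, tokens: {tokens}")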

Task Definitions and Experiments

We tested four different tasks across multiple domains using the following models:

  1. Llama 3.2
  2. Qwen2-VL
  3. GPT-4o

Domains:

  • Medical: Questions related to complex medical scenarios.
  • Retail: Product-related queries.
  • CCTV: Surveillance footage analysis.
  • Art: Generating artistic interpretations and descriptions.

The experiment involved five queries per task for each model, and we recorded the following metrics:

  • Tokens: The number of tokens used by the model to generate a response.
  • Latency: Time taken to return the response.
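The aggregation itself is straightforward; a sketch of the per-task logging loop, with the question list and model ID as placeholders, could look like this:

import statistics
import time

def run_task(questions, base64_image, model_id):
    # Run every question for one task, collecting token counts and latencies.
    tokens, latencies = [], []
    for question in questions:
        start = time.perf_counter()
        result = query_model(base64_image, question, model_id)
        latencies.append(time.perf_counter() - start)
        tokens.append(result.get("usage", {}).get("completion_tokens", 0))

    return {
        "mean_tokens": statistics.mean(tokens),
        "std_tokens": statistics.stdev(tokens),
        "mean_latency": statistics.mean(latencies),
        "std_latency": statistics.stdev(latencies),
    }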

Results

Token Usage Comparison

The tables below highlight the token usage across the four domains for both Llama and GPT models.

| Task | Q1 Tokens | Q2 Tokens | Q3 Tokens | Q4 Tokens | Q5 Tokens | Mean Tokens | Standard Deviation (Tokens) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Medical (Llama) | 1 | 12 | 1 | 1 | 1 | 3.2 | 4.81 |
| Retail (Llama) | 18 | 39 | 83 | 40 | 124 | 60.8 | 32.77 |
| CCTV (Llama) | 18 | 81 | 83 | 40 | 124 | 69.2 | 37.29 |
| Art (Llama) | 11 | 71 | 88 | 154 | 40 | 72.2 | 51.21 |

| Task | Q1 Tokens | Q2 Tokens | Q3 Tokens | Q4 Tokens | Q5 Tokens | Mean Tokens | Standard Deviation (Tokens) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Medical (GPT) | 1 | 10 | 1 | 1 | 1 | 2.4 | 4.04 |
| Retail (GPT) | 7 | 13 | 26 | 14 | 29 | 17.8 | 8.53 |
| CCTV (GPT) | 7 | 8 | 26 | 14 | 29 | 16.8 | 7.69 |
| Art (GPT) | 10 | 13 | 102 | 43 | 35 | 40.6 | 35.73 |

Latency Comparison

Latency, measured in seconds, is another critical factor in evaluating model performance, especially for real-time applications. The following tables display latency results for the same set of tasks.

| Task | Q1 Latency (s) | Q2 Latency (s) | Q3 Latency (s) | Q4 Latency (s) | Q5 Latency (s) | Mean Latency (s) | Standard Deviation (s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Medical (Llama) | 0.74 | 0.97 | 0.78 | 0.98 | 1.19 | 0.73 | 0.19 |
| Retail (Llama) | 1.63 | 3.00 | 3.02 | 1.67 | 3.14 | 2.09 | 0.74 |
| CCTV (Llama) | 1.63 | 3.00 | 3.02 | 1.67 | 3.14 | 2.09 | 0.74 |
| Art (Llama) | 1.35 | 2.46 | 2.91 | 4.45 | 2.09 | 2.46 | 1.06 |

| Task | Q1 Latency (s) | Q2 Latency (s) | Q3 Latency (s) | Q4 Latency (s) | Q5 Latency (s) | Mean Latency (s) | Standard Deviation (s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Medical (GPT) | 1.35 | 1.50 | 1.21 | 1.50 | 1.23 | 1.38 | 0.10 |
| Retail (GPT) | 1.24 | 1.77 | 2.12 | 1.35 | 1.83 | 1.63 | 0.29 |
| CCTV (GPT) | 1.20 | 2.12 | 1.80 | 1.35 | 1.83 | 1.68 | 0.32 |
| Art (GPT) | 1.24 | 1.77 | 7.69 | 3.94 | 2.41 | 3.61 | 2.29 |

Observations

  1. Token Efficiency: Llama models generally use fewer tokens in response generation for simpler tasks like Medical compared to more complex domains like Art.
  2. Latency: Latency is higher for more complex images, especially for tasks like Retail and Art, indicating that these models take more time when generating in-depth descriptions or analyzing images.
  3. GPT vs. Llama: GPT models generally had lower token counts across the tasks, but the latency was comparable, with GPT showing slightly more variability in complex tasks like Art.

Conclusion and Future Work

This experiment highlights the importance of evaluating both token efficiency and latency when stress testing VLMs. The VLM Stress Test App allows us to quickly compare multiple models and analyze their performance across a variety of real-world tasks.

Future Plans:

  • Additional Models: We plan to add more models like Mistral and Claude to the comparison.
  • Expanded Dataset: New tasks in domains like Legal and Education will be added to challenge the models further.
  • Accuracy Metrics: We'll also integrate accuracy metrics like BLEU and ROUGE scores in the next iteration (a small illustrative sketch follows this list).
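Those metrics are not wired into the app yet; as a hedged sketch of what the scoring step could look like, here is one way to compute BLEU and ROUGE-L with the nltk and rouge-score packages (the package choice is an assumption on my part, not the final design):

# pip install nltk rouge-score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

def score_answer(reference, candidate):
    # BLEU over whitespace tokens, smoothed so short answers don't collapse to 0.
    bleu = sentence_bleu(
        [reference.split()], candidate.split(),
        smoothing_function=SmoothingFunction().method1,
    )
    # ROUGE-L longest-common-subsequence F-measure.
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = scorer.score(reference, candidate)["rougeL"].fmeasure
    return {"bleu": bleu, "rougeL": rouge_l}

print(score_answer("a man riding a horse", "a person riding a horse"))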

Check out our GitHub repository for the code and further instructions on how to set up and run your own VLM experiments.
