
Multi-Modality and Image Gen in a 1.3B Model!🔮

Published: 10/24/2024
Categories: streamlit, transformers, computervision, tutorial
Author: aryankargwal

Code: Click Me
Youtube: Click Me

Today, we’re diving into something exciting: Janus 1.3B, one of the smallest truly capable multimodal LLMs. What sets Janus apart is that, despite its small size, it delivers powerful results in both natural language processing and image generation. This is a perfect example of where AI is heading: smaller models that remain versatile and multimodal.


Janus 1.3B

So, what exactly is Janus 1.3B? At its core, Janus is a vision-language model (VLM) designed to handle textual and visual data. With just 1.3 billion parameters, Janus is significantly smaller than some of the other LLMs and multimodal models we’ve discussed on the channel. But don’t let its size fool you: it handles both text and image generation, making it a powerful tool in a remarkably compact package.

Unlike most models, which specialise in one area or need large architectures to function effectively in multiple domains, Janus achieves this multimodal functionality at a much smaller scale. This is a massive step in making AI more efficient, accessible, and, most importantly, scalable.


How Does Janus Work?

Let’s start with its architecture. Janus processes text understanding, multimodal understanding, and visual generation through independent encoding methods that eventually feed into a unified autoregressive transformer. This design allows it to handle different types of input—text, images, or a combination of both—in a highly efficient manner.

[Figure: Janus architecture, with independent encoders for text, multimodal understanding, and visual generation feeding a unified autoregressive transformer]

Here’s the breakdown of how it all works:

  1. Text Understanding: Janus employs a built-in tokenizer from its underlying LLM. This tokenizer converts text into discrete IDs (tokens), which are transformed into feature representations. The LLM processes these features in the same way as any other text-based model.

  2. Multimodal Understanding: For image understanding, Janus integrates SigLIP, a powerful vision encoder that extracts high-dimensional semantic features from images. These features are flattened from a 2D grid into a 1D sequence and passed through an understanding adaptor. This adaptor maps the image features into the input space of the LLM, ensuring that both image and text data are represented in a way that the model can understand together.

  3. Image Generation: Janus utilizes a Vector Quantization (VQ) tokenizer to generate images. This tokenizer converts images into a sequence of discrete IDs. These ID sequences are flattened and passed through a generation adaptor, which maps them into the LLM’s input space. This allows Janus to generate image content from a text description. A specialized image prediction head is trained for this task, while Janus relies on the LLM’s existing text prediction head for text-based tasks.

Once the inputs, whether text, image, or both, are converted into feature sequences, Janus concatenates them into a unified multimodal feature sequence. This sequence is then fed into the LLM for processing, making it capable of generating text and images based on the input it receives.
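To make the unified-sequence idea concrete, here is a minimal PyTorch sketch of the understanding path. The dimensions and module names below are invented for illustration; they are not Janus’s actual internals.

import torch
import torch.nn as nn

d_llm, d_vision = 2048, 1024        # hypothetical LLM and SigLIP feature sizes

# The understanding adaptor projects vision features into the LLM's input space
understanding_adaptor = nn.Linear(d_vision, d_llm)

text_embeds = torch.randn(1, 12, d_llm)          # 12 text tokens, already embedded
image_feats = torch.randn(1, 24, 24, d_vision)   # SigLIP 2D feature grid

# Flatten the 2D grid into a 1D sequence, then map it into the LLM's space
image_seq = understanding_adaptor(image_feats.flatten(1, 2))   # (1, 576, d_llm)

# Concatenate into one multimodal sequence for the autoregressive transformer
multimodal_seq = torch.cat([image_seq, text_embeds], dim=1)    # (1, 588, d_llm)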


Janus Multi-Modal Performance

Now, let’s talk performance. Despite its relatively small size of 1.3 billion parameters, Janus is competitive across several multimodal tasks. It excels in Visual Question Answering (VQA) benchmarks, COCO Captioning, and Image-Text Retrieval.

[Figure: Janus multimodal benchmark results]

Janus is designed to handle real-world multimodal applications where parameter efficiency is critical. While larger models might outperform Janus on tasks that require deep reasoning over complex text or high-resolution images, Janus hits a sweet spot by balancing efficiency and performance for general-purpose multimodal applications.


How to Use Janus for Multi-Modal Integration

Now, let’s see how to use the model for multimodal inference. Below is a generate_answer function that wraps the inference flow from the official deepseek-ai/Janus repository; the class and method names follow that repo, so double-check them against the version you have installed.

import torch
from transformers import AutoModelForCausalLM
from janus.models import VLChatProcessor
from janus.utils.io import load_pil_images

# Load the model and processor once (names follow the official
# deepseek-ai/Janus repository)
model_path = "deepseek-ai/Janus-1.3B"
vl_chat_processor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer
vl_gpt = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()

def generate_answer(image_path, question):
    # Define the conversation structure; the placeholder marks where
    # the image features are spliced into the prompt
    conversation = [
        {"role": "User",
         "content": f"<image_placeholder>\n{question}",
         "images": [image_path]},
        {"role": "Assistant", "content": ""},
    ]

    # Load the image and batch image + text inputs together
    pil_images = load_pil_images(conversation)
    prepare_inputs = vl_chat_processor(
        conversations=conversation, images=pil_images, force_batchify=True
    ).to(vl_gpt.device)

    # Map both modalities into a unified input embedding sequence
    inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

    # Generate the answer with the underlying language model
    outputs = vl_gpt.language_model.generate(
        inputs_embeds=inputs_embeds,
        attention_mask=prepare_inputs.attention_mask,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        max_new_tokens=512,
        do_sample=False,
        use_cache=True,
    )
    return tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)

In this code, we load the processor and model once, build a conversation that pairs the image with the question, map both modalities into a single embedding sequence, and generate a response that combines visual context with the posed question.
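Assuming the setup above, a call looks like this (cat.jpg is a hypothetical local image path):

answer = generate_answer("cat.jpg", "What is the cat doing in this picture?")
print(answer)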


Janus Image Generation

Finally, let’s examine Janus’ image generation capabilities. While it can’t match dedicated image models like DALL-E 2 or Stable Diffusion, Janus still creates high-quality images from textual inputs in an incredibly compact form.

[Figure: Sample images generated by Janus]

As mentioned, Janus uses the VQ tokenizer to represent images as sequences of discrete tokens. At generation time, the LLM predicts these tokens autoregressively through its dedicated image prediction head, and the VQ decoder then turns the completed token sequence back into pixels. The result? Images that are highly coherent and contextually accurate, especially when dealing with more straightforward or abstract prompts.

How to Use Janus for Image Generation

The process starts with wrapping the prompt in the chat template and tokenizing it with the vl_chat_processor, which converts the text into token IDs the model can understand. The sketch below reuses the model loaded earlier and follows the official generation example, with two simplifications: greedy sampling instead of temperature sampling, and the classifier-free guidance step omitted for brevity.

import torch
import numpy as np
from PIL import Image

@torch.inference_mode()
def generate_image(prompt, num_image_tokens=576, img_size=384):
    # Wrap the prompt in the chat template and append the image-start tag
    sft_prompt = vl_chat_processor.apply_sft_template_for_multi_turn_prompts(
        conversations=[{"role": "User", "content": prompt},
                       {"role": "Assistant", "content": ""}],
        sft_format=vl_chat_processor.sft_format,
        system_prompt="",
    ) + vl_chat_processor.image_start_tag

    # Create initial embeddings from the text tokens
    input_ids = torch.LongTensor(tokenizer.encode(sft_prompt)).cuda()
    inputs_embeds = vl_gpt.language_model.get_input_embeddings()(input_ids).unsqueeze(0)

    # Generate discrete image tokens autoregressively via the image head
    generated, past = [], None
    for _ in range(num_image_tokens):
        out = vl_gpt.language_model.model(
            inputs_embeds=inputs_embeds, past_key_values=past, use_cache=True
        )
        past = out.past_key_values
        logits = vl_gpt.gen_head(out.last_hidden_state[:, -1, :])
        token = torch.argmax(logits, dim=-1)   # greedy sampling for simplicity
        generated.append(token)
        # Feed the new token back in through the generation adaptor
        inputs_embeds = vl_gpt.prepare_gen_img_embeds(token).unsqueeze(1)

    # Decode the VQ token sequence back into pixels and save it
    tokens = torch.stack(generated, dim=1).to(torch.int)
    side = img_size // 16                      # 16x16 patch grid
    img = vl_gpt.gen_vision_model.decode_code(tokens, shape=[1, 8, side, side])
    img = img.to(torch.float32).cpu().numpy().transpose(0, 2, 3, 1)
    img = np.clip((img + 1) / 2 * 255, 0, 255).astype(np.uint8)
    Image.fromarray(img[0]).save("output_image.jpg")

This code illustrates generating an image from a text prompt using Janus. It showcases the iterative process of predicting image tokens one at a time, each conditioned on the prompt and on every token generated so far.
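With the model loaded, generating an image is then a single call; the result is written to output_image.jpg (the prompt below is just an example):

generate_image("a minimalist watercolor painting of a lighthouse at dusk")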


Conclusion

So there you have it—Janus 1.3B, a small but compelling multimodal model that punches well above its weight. Its ability to handle text understanding, multimodal reasoning, and image generation in such a compact framework is a testament to the efficiency of its design.

For those interested in multimodal AI that can be deployed in real-world applications without massive computational power, Janus is a model you should watch.
