Unlock deeper insights from complex user manuals with ColPali’s Vision Retrieval Augmented Generation. By leveraging Deep Lake, ColPali’s large multi-vector embeddings are offloaded to scalable object storage while advanced operations like MaxSim run natively. This synergy makes high-speed, visually aware retrieval of complex manuals possible without hitting memory or engineering bottlenecks.
In this article, we’ll introduce an innovative Vision Retrieval Augmented Generation approach using Deep Lake and ColPali—a vision language model (VLM) that processes page images directly, capturing both visual and textual cues. Technical manuals often combine complex layouts, images, and text, making them challenging to search and navigate using traditional methods.
Key Benefits
- Enhanced Efficiency: Deep Lake scales ColPali’s embeddings without memory limitations.
- Advanced Features: Native support for MaxSim allows precise retrieval based on both visual and textual context.
- Unparalleled Accuracy: Directly processing page images captures nuances missed by text-only approaches.
ColPali’s Vision Retrieval Augmented Generation with MaxSim and Deep Lake tackles these challenges by combining powerful visual processing, robust scalability, and advanced search operations to deliver more accurate and efficient document retrieval.
Processing and Querying User Manuals with ColPali and Deep Lake
Product user manuals are an essential resource for customers and support teams. Many companies have amassed hundreds or even thousands of them, containing valuable product information such as setup steps, technical specifications, troubleshooting guides, and much more.
However, these manuals are rich in data and layout complexity, making them challenging to manage with conventional text-based search or question-answering systems. A lot of contextual information gets lost when attempting to extract just plain text from them. Text-based embedding methods overlook visual elements such as images, diagrams, tables, and other non-textual cues that are crucial for clarity and support.
The impact is inefficient product support, longer troubleshooting times, and overall diminished user satisfaction due to incomplete or suboptimal search results. Solving this problem delivers concrete benefits:
- Faster, More Accurate Support: Customer support teams need relevant documentation right away.
- Cost Savings: Reducing the time spent manually searching, scanning, or re-creating instructions.
- Better User Experience: Empowering customers with a self-service portal that finds the right page instantly, complete with text and visual context.
In this tutorial, we demonstrate the process of pre-processing PDF files with ColPali and uploading them to Deep Lake for efficient storage and retrieval. By the end of this guide, you’ll have a complete understanding of how to leverage this powerful combination to handle large-scale document collections.
For this example, we process 1,000 user manuals containing approximately 64,033 pages. These user manuals include complex layouts with text, diagrams, tables, and other visual elements, making them ideal candidates for ColPali’s advanced vision-language capabilities.
Once the dataset is uploaded to Deep Lake, we will showcase how to:
- Perform high-speed queries across the entire dataset.
- Retrieve relevant document pages with both textual and visual context.
- Demonstrate how the combination of ColPali and Deep Lake enables faster, more accurate document search and retrieval at scale.
What is ColPali?
ColPali is a novel document retrieval model that combines Vision Language Models (VLMs) with late interaction mechanisms to process and understand complex documents. It’s revolutionary because it abstracts away the need for standard OCR pipelines, processing entire documents as images and creating multi-vector embeddings that capture both textual and visual content.
Why use Deep Lake And ColPali for Multi-Modal AI Search?
ColPali introduces a revolutionary vision-language approach to document retrieval by capturing fine-grained contextual and visual cues. However, this advancement comes at a cost: storage scalability. As detailed in its research, ColPali’s embeddings require 256 KB per page, significantly exceeding the storage requirements of traditional methods like BM25 Sparse (1.56 KB per page) or BM25 Dense (3.00 KB per page). This far larger memory footprint poses challenges when scaling to vast document collections. Additionally, ColPali relies on a multi-vector retrieval mechanism, inspired by ColBERT’s late interaction, which is not natively supported by many vector retrieval frameworks, further increasing the engineering complexity of deployment.
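To put that footprint in concrete terms, here is a back-of-the-envelope calculation for the roughly 64,033-page collection used later in this tutorial (the 256 KB/page figure comes from the ColPali paper; the total is simple arithmetic):

```python
pages = 64_033
kb_per_page = 256                              # ColPali multi-vector embedding per page
total_gb = pages * kb_per_page / 1024 / 1024
print(f"~{total_gb:.1f} GB of embeddings")     # roughly 15.6 GB, before images or OCR text
```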
How Deep Lake Solves This Limitation
Deep Lake integrates seamlessly with ColPali to provide a scalable, high-performance solution that tackles both the storage and retrieval challenges of ColPali:
- Efficient Storage with Object Storage Offloading:
- ColPali’s multi-vector embeddings are offloaded to object storage (e.g., Amazon S3) instead of relying on costly in-memory storage (a brief sketch follows this list).
- This ensures storage needs scale effectively while minimizing operational complexity.
- Native Multi-Vector Retrieval Support:
- Deep Lake supports multi-vector retrieval mechanisms like ColPali’s late interaction model, including the computationally intensive MaxSim operator. MaxSim computes maximum similarity scores across tokens or patches, a critical feature for ColPali’s retrieval accuracy.
- By natively supporting such advanced operations, Deep Lake eliminates the need for extensive infrastructure engineering, enabling organizations to deploy ColPali seamlessly.
- Advanced Query Performance:
- Deep Lake’s indexing and streaming capabilities allow high-speed queries directly from object storage, enabling ColPali to retrieve relevant information without being bottlenecked by its larger embedding size.
- Scalability and Accessibility:
- With support for multi-modal data types like embeddings, images, and text, Deep Lake ensures ColPali’s embeddings are managed efficiently across thousands of documents, enabling organizations to handle even the most demanding workloads.
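As a brief, hedged illustration of the object-storage offloading point above (the bucket name is hypothetical, and AWS credentials are assumed to be available in your environment), a Deep Lake dataset can be created directly against S3 rather than held in memory:

```python
import deeplake

# Hypothetical bucket/prefix; credentials are assumed to come from the environment
# (e.g. AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY).
ds_s3 = deeplake.create("s3://my-bucket/colpali-user-manuals")

# Later in this tutorial we use Deep Lake's managed storage instead:
# ds = deeplake.create(f"al://{org_id}/{user_manual_dataset_name}")
```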
The combination of ColPali’s cutting-edge vision-language model and Deep Lake’s robust data infrastructure unlocks new possibilities for organizations:
- Unparalleled Retrieval Quality: Leveraging ColPali’s embeddings ensures retrieval accuracy even in visually rich and complex manuals.
- Optimized Scalability: Deep Lake reduces storage constraints, enabling ColPali to scale seamlessly to handle large datasets.
- Future-Proof Performance: By offloading storage while maintaining high-speed retrieval, organizations can confidently scale their operations without compromising performance.
- Efficient Deployment: Deep Lake’s support for multi-vector retrieval and MaxSim significantly reduces the engineering overhead traditionally required to adapt ColPali for production use.
This integration isn’t just about solving a technical limitation—it’s about making ColPali’s revolutionary technology scalable, practical, and future-proof for real-world, enterprise-scale document retrieval. Together, Deep Lake and ColPali empower organizations to utilize the full potential of vision-language retrieval at scale.
Standard PDF Image Processing
Typically, companies rely on a sequence of steps to make PDFs searchable and indexable.
This visual breaks down the traditional document retrieval pipeline, highlighting the numerous steps involved in indexing and querying data from structured documents like PDFs.
It starts with OCR systems or PDF parsers to extract text from pages. Then, layout detection models identify key components like paragraphs, tables, titles, and figures. A chunking strategy groups text passages into semantically meaningful segments, and in some cases, captioning models describe visual elements in natural language to make them more embedding-friendly.
While effective, this process is slow, requiring 7.22 seconds per page during indexing, and involves significant complexity to ensure both text and visual content are properly captured.
User manuals can lose crucial context in the transformation from page to text-only chunks. Visual elements (images, diagrams, call-outs) and their relationships to text are often separated or lost entirely.
This page below illustrates why traditional retrieval methods fail to capture the complexity of user manuals:
- Loss of Context: Separating text from symbols, diagrams, or tables destroys the relationships critical for accurate interpretation.
- Visual Cues: Symbols and diagrams provide essential context that text-based embeddings alone cannot capture.
- Non-Linear Layout: The multi-modal and hierarchical structure requires advanced vision-language models like ColPali to integrate spatial, visual, and textual data into a unified embedding.
By highlighting these elements, we show why integrating ColPali and Deep Lake is crucial for enabling sophisticated retrieval systems capable of understanding the full scope of user manuals like this.
ColPali’s Vision-Language Approach
ColPali is designed to handle visual+text data holistically. It’s a Vision Language Model (VLM) that encodes an entire page image into a high-dimensional embedding space—without depending on OCR or elaborate layout analysis.
The diagram showcases ColPali’s architecture, a vision-language model that excels at combining visual and textual cues for document retrieval. Here’s a breakdown of its core components and how they work together efficiently:
Offline Document Encoding
On the left side, a document is passed into ColPali’s Vision Language Model (VLM) through a dedicated offline pipeline:
- Vision & Language Encoders: ColPali processes each document with a vision encoder (to handle images and layout) and a language model (for textual content), generating multidimensional embeddings that capture the document’s visual and textual elements.
- Pre-Indexing: These embeddings are then stored in a pre-indexed format, making them readily accessible for quick lookups during the query phase.
Online Query Processing
On the right side, the online pipeline manages user searches (e.g., “What are ViTs?”):
- Query Embedding: The user’s question is transformed into an embedding using the same language model.
- Late Interaction with MaxSim: ColPali compares each component of the query embedding against the previously generated document embeddings. It uses a MaxSim operation to pinpoint the most similar regions—whether they’re textual passages or sections of the page layout.
Similarity Scoring
Based on the MaxSim comparisons, ColPali produces a similarity score indicating which document segments (or entire documents) are most relevant. By simultaneously leveraging the document’s visual layout and textual content, this approach captures critical nuances that might be missed by traditional text-only methods.
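To make the late-interaction scoring concrete, here is a minimal sketch of the MaxSim operator (an illustration, not ColPali’s internal implementation): each query token keeps only its best-matching page patch, and those maxima are summed into a single page score.

```python
import torch

def maxsim_score(query_emb: torch.Tensor, page_emb: torch.Tensor) -> float:
    """Late-interaction MaxSim score.

    query_emb: (n_query_tokens, dim) multi-vector query embedding
    page_emb:  (n_patches, dim) multi-vector page embedding
    """
    sim = query_emb @ page_emb.T                    # (n_query_tokens, n_patches) similarities
    return sim.max(dim=1).values.sum().item()       # best patch per query token, summed

# Toy usage with random vectors of ColPali's 128-dim size
q = torch.randn(20, 128)
p = torch.randn(1024, 128)
print(maxsim_score(q, p))
```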
!pip install colpali-engine
import torch
from PIL import Image

from colpali_engine.models import ColPali, ColPaliProcessor

model_name = "vidore/colpali-v1.2"

model = ColPali.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",  # or "mps" if on Apple Silicon
).eval()

processor = ColPaliProcessor.from_pretrained(model_name)
ColPali’s late interaction design also keeps retrieval fast and efficient, even across large-scale collections with complex, visually rich documents—tables, figures, infographics, and more. By tightly integrating vision and language, ColPali outperforms standard solutions in scenarios where visual context is as important as text.
How ColPali Works
1. Input Transformation
Each user-manual page is converted to an image. The model divides it into a grid—e.g., 32×32 patches—to capture localized features.
2. Vision Feature Extraction
Each image patch undergoes multiple transformations to yield a 128-dimensional representation that captures both local (character-level) and global (layout-level) patterns.
3. Semantic Context Integration
ColPali then aligns visual cues with any textual semantics (if text is partially detected in the image) to build a deeper understanding of the page’s content.
4. Representation Refinement
These intermediate vectors are further refined through attention mechanisms or transformers, ensuring that relationships among patches, text blocks, and layout elements are represented holistically.
5. Contextualized Data Embedding
Finally, the model outputs a unified embedding that encodes the entire page’s structure, text, and visuals. This embedding is used for indexing and retrieval.
Result: A vector representation that truly captures the visual and textual context of the page.
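To see what this produces in practice, here is a small, hedged sketch that embeds a single page image using the `model` and `processor` loaded above (the image path is hypothetical, and the exact patch count depends on the model configuration):

```python
import torch
from PIL import Image

# Hypothetical page image exported from a user manual
page = Image.open("manual_page.png")

# Prepare the image batch for the vision encoder
batch = processor.process_images([page]).to(model.device)

with torch.no_grad():
    page_embeddings = model(**batch)

# One 128-dimensional vector per image patch/token, e.g. roughly (1, ~1030, 128)
print(page_embeddings.shape)
```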
We start by installing the necessary packages.
!pip install --quiet deeplake colpali-engine accelerate pytesseract pymupdf pillow
!sudo apt-get install -y poppler-utils
!apt-get update
!apt-get install -y tesseract-ocr
Create a Deep Lake dataset for ColPali’s visual question answering. Stored in `ds`, it includes an `embedding` column for 2D float arrays, a `title` column for the PDF name, a `text` column for the OCR-extracted page text, and an `image` column for the page images. After defining the structure, `ds.commit()` saves the setup, optimizing it for ColPali’s multi-modal retrieval.
import deeplake
from deeplake import types

org_id = "<your_org_name>"
user_manual_dataset_name = "<dataset_name>"

ds = deeplake.create(f"al://{org_id}/{user_manual_dataset_name}")

# Force columns to be 2D for embedding, 3D for image
ds.add_column("title", dtype=types.Text())
ds.add_column("text", dtype=types.Text())
ds.add_column("embedding", dtype=types.Array(types.Float32(), dimensions=2))
ds.add_column("image", dtype=types.Image())

ds.commit()
ds.summary()
Dataset Overview
The script below processes PDFs, generates embeddings, and stores the results efficiently:
- `convert_pdfs_to_images()` turns PDF pages into images using PyMuPDF.
- `ocr_single_image()` extracts text from an image with Tesseract OCR.
- `extract_text_parallel()` runs OCR on multiple images using multiprocessing.
- `process_batch()` generates embeddings and processes image batches with the model.
- `batch_process_multiple_pdfs()` converts PDFs to images, extracts text, creates embeddings, and saves everything to Deep Lake, catching errors during data saving.
- `process_pdfs_in_batches()` processes PDFs in smaller groups and moves completed files.
- The `__main__` block sets paths and processes PDFs in batches.
import os
import time
import shutil
import fitz  # PyMuPDF
import pytesseract
import deeplake
import numpy as np
import torch
from PIL import Image
from concurrent.futures import ProcessPoolExecutor

# Helper function: Convert PDFs to images
def convert_pdfs_to_images(pdf_path, zoom=1.5):
    """
    Converts a single PDF file (pdf_path) into a list of images (one per page)
    using PyMuPDF.
    """
    document = fitz.open(pdf_path)
    zoom_matrix = fitz.Matrix(zoom, zoom)  # Scale for high-resolution images
    images = []

    for page_number, page in enumerate(document):
        pix = page.get_pixmap(matrix=zoom_matrix, alpha=False)
        image = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
        images.append(image)

    print(f"Converted {len(images)} pages from {pdf_path} to images.")
    return images

# Helper function: Parallel OCR
def ocr_single_image(image):
    """
    Extracts text from a single image using Tesseract OCR.
    """
    return pytesseract.image_to_string(image)

def extract_text_parallel(images, num_workers=4):
    """
    Extract text from images using multiprocessing.
    """
    print("Starting parallel OCR...")
    start_time = time.time()
    with ProcessPoolExecutor(max_workers=num_workers) as executor:
        texts = list(executor.map(ocr_single_image, images))
    print(f"Finished OCR in {time.time() - start_time:.2f} seconds.")
    return texts

# Helper function: Process a batch of images
def process_batch(batch_images, batch_texts, batch_titles, model, processor):
    """
    Process a batch of images to generate embeddings.
    """
    batch_text_prompts = ["<image> <bos>" for _ in batch_images]

    inputs = processor(
        images=batch_images, text=batch_text_prompts,
        return_tensors="pt", truncation=True
    ).to(model.device)

    with torch.no_grad():
        outputs = model(**inputs)
    embeddings = list(torch.unbind(outputs.to("cpu")))  # Convert to list of tensors

    embeddings_list = [embedding.tolist() for embedding in embeddings]
    numpy_images = [np.array(img).astype(np.uint8) for img in batch_images]

    return {
        "embedding": embeddings_list,
        "text": batch_texts,
        "image": numpy_images,
        "title": batch_titles
    }

# Batch Process and Store Multiple PDFs
def batch_process_multiple_pdfs(
    dataset_path,
    pdf_files,
    pdf_folder,
    model,
    processor,
    batch_size=8
):
    """
    Processes multiple PDFs together, appending their data to Deep Lake and committing
    in bulk.
    """
    temp_data = []  # Temporary buffer to store results before committing
    total_pdfs = len(pdf_files)

    for idx, pdf_file in enumerate(pdf_files):
        pdf_path = os.path.join(pdf_folder, pdf_file)
        print(f"Processing: {pdf_file} ({idx + 1}/{total_pdfs})")
        start_time = time.time()

        # Convert PDF to images
        images = convert_pdfs_to_images(pdf_path)
        num_pages = len(images)
        print(f"Number of pages: {num_pages}")

        # Perform OCR and batch processing
        extracted_texts = extract_text_parallel(images)
        total_batches = len(images) // batch_size + int(len(images) % batch_size != 0)

        for i in range(total_batches):
            start_idx = i * batch_size
            end_idx = min((i + 1) * batch_size, len(images))
            batch_images = images[start_idx:end_idx]
            batch_texts = extracted_texts[start_idx:end_idx]
            batch_titles = [pdf_file] * len(batch_images)

            # Process the batch and store results in the temporary buffer
            processed_data = process_batch(
                batch_images,
                batch_texts,
                batch_titles,
                model,
                processor
            )
            temp_data.append(processed_data)

        elapsed_time = time.time() - start_time
        print(f"Time taken for {pdf_file}: {elapsed_time:.2f} seconds")
        print("-" * 50)

    # Append all data in bulk and commit
    print(f"Appending data for {len(pdf_files)} PDFs to Deep Lake...")
    try:
        ds = deeplake.open(dataset_path)
        for data in temp_data:
            ds.append(data)

        # Commit the data
        ds.commit(
            f"Stored embeddings, images, texts, and titles for {len(pdf_files)} PDFs."
        )
        print(f"Committed data for {len(pdf_files)} PDFs.")
    except Exception as e:
        print(f"Error while appending/committing data: {e}")

# Process PDFs in Batches
def process_pdfs_in_batches(
    pdf_folder,
    processed_folder,
    dataset_path,
    model,
    processor,
    batch_size=8,
    pdf_batch_size=5
):
    """
    Processes PDFs in batches (e.g., 5 PDFs at a time),
    appending their data in bulk to Deep Lake.
    """
    if not os.path.exists(processed_folder):
        os.makedirs(processed_folder)

    pdf_files = [f for f in os.listdir(pdf_folder) if f.lower().endswith(".pdf")]
    total_files = len(pdf_files)

    # Process files in batches
    for i in range(0, total_files, pdf_batch_size):
        batch_files = pdf_files[i : i + pdf_batch_size]
        print(f"Processing batch {i // pdf_batch_size + 1} with {len(batch_files)} PDFs")
        batch_process_multiple_pdfs(
            dataset_path,
            batch_files,
            pdf_folder,
            model,
            processor,
            batch_size
        )

        # Move processed PDFs
        for pdf_file in batch_files:
            processed_path = os.path.join(processed_folder, pdf_file)
            shutil.move(os.path.join(pdf_folder, pdf_file), processed_path)
            print(f"Moved {pdf_file} to {processed_folder}")
        print("=" * 50)

# Example usage
if __name__ == "__main__":
    pdf_folder = "/content"  # Upload your PDF files and point to the directory
    processed_folder = "/content/processed"  # Folder to store the processed PDFs
    org_id = "<your_org_name>"
    user_manual_dataset_name = "<dataset_name>"

    dataset_path = f"al://{org_id}/{user_manual_dataset_name}"

    # `model` and `processor` are the ColPali model and processor loaded earlier
    process_pdfs_in_batches(
        pdf_folder,
        processed_folder,
        dataset_path,
        model,
        processor,
        batch_size=8,        # adjust based on your requirements
        pdf_batch_size=10    # adjust based on your requirements
    )
Processing batch 1 with 10 PDFs…
Processing: Black _ Decker_Black-And-Decker-Ks531.pdf (1/10)
Number of pages: 12
Starting parallel OCR…
Finished OCR in 6.64 seconds.
Time taken for Black _ Decker_Black-And-Decker-Ks531.pdf: 11.75 seconds
Processing: Black _ Decker_Black-And-Decker-Bdcdd12c.pdf (2/10)
Number of pages: 20
Starting parallel OCR…
Finished OCR in 10.63 seconds.
Time taken for Black _ Decker_Black-And-Decker-Bdcdd12c.pdf: 19.06 seconds
Retrieval: Querying With MaxSim
When a user or support agent has a query—e.g., “How do I reset the device to factory settings?”—the ColPali pipeline does the following:
- Query Breakdown
- The text query is tokenized into smaller units (tokens).
- Semantic Processing
- A language model interprets these tokens, establishing contextual relationships.
- Context Projection
- The semantic output is aligned with the visual embeddings of the pages, ensuring consistent dimensionality.
- Unified Embedding Generation
- A multi-vector embedding of the query is produced, matching the format of each page’s embedding.
- MaxSim Computation
- For each page’s embedding, ColPali computes similarity with the query embedding at a fine-grained level, picking the maximum similarity (similar to ColBERT).
- This approach better captures nuances of each page’s layout, text, and images. The most relevant pages are then returned.
The result: faster, more accurate retrieval for user manuals, where a single page (with both visuals and text) can fully address a user’s query.
If you don’t want to upload and process your own PDFs, no problem! You can test the retrieval capabilities directly on a pre-processed dataset of 1,000 user manuals, containing approximately 64,033 pages, hosted on Deep Lake.
This dataset has already been processed with ColPali, meaning it includes rich, multi-vector embeddings that capture both textual and visual context. By accessing this dataset, you can:
- Experiment with querying real-world, complex user manuals.
- Explore how ColPali’s advanced vision-language capabilities enhance retrieval accuracy.
- Test retrieval speeds and gain insights into how the pipeline scales for large document collections.
To get started, simply connect to the Deep Lake dataset using the following code:
import deeplake

org_id = "genai360"
user_manual_dataset_name = "user_manual_dataset_colpali"

ds = deeplake.open_read_only(f"al://{org_id}/{user_manual_dataset_name}")
ds.summary()
Chat with Images
We enter a question and process it with `processor`, sending it to the model’s device. Embeddings are generated without gradients, converted to a list format, and stored in `query_embeddings`.
# Prepare the query
queries = ["What can you tell me about the Rated voltage Un?"]

# Generate query embeddings
batch_queries = processor.process_queries(queries).to(model.device)
with torch.no_grad():
    query_embeddings = model(**batch_queries)
query_embeddings = query_embeddings.tolist()
Retrieve The Most Similar Images
For each embedding in `query_embeddings`, we format it as a nested array string for querying. The innermost lists (`q_substrs`) are converted to `ARRAY[...]` format and then combined into a single string, `q_str`. This formatted string is used in a TQL query against the dataset, computing the `maxsim` similarity between `q_str` and the `embedding` column. The query returns the top `n_res` results, ordered by similarity score (`score`). This loop performs a similarity search for each query embedding.
# Query the dataset using MaxSim
colpali_results = []
n_res = 1  # Number of results to retrieve

for query_embedding in query_embeddings:
    # Convert query embedding into a formatted TQL array string
    q_substrs = [f"ARRAY[{','.join(str(x) for x in row)}]" for row in query_embedding]
    q_str = f"ARRAY[{','.join(q_substrs)}]"

    # Construct and execute the TQL query
    tql_colpali = f"""
        SELECT *, maxsim({q_str}, embedding) AS score
        ORDER BY maxsim({q_str}, embedding) DESC
        LIMIT {n_res}
    """
    try:
        result = ds.query(tql_colpali)
        colpali_results.append(result)
    except Exception as e:
        print(f"Error during query execution: {e}")
        colpali_results.append([])
For each result in `colpali_results`, the code below displays the retrieved page image alongside the query and its similarity `score`. It converts the stored image data back to an image with `Image.fromarray(el["image"])`, renders it in a matplotlib subplot, and then prints the first 300 characters of the retrieved text. This loop visually presents each query’s closest matches alongside their similarity scores.
import matplotlib.pyplot as plt
from PIL import Image

# Visualize the results
num_columns = n_res
num_rows = len(colpali_results)

fig, axes = plt.subplots(num_rows, num_columns, figsize=(15, 5 * num_rows))
axes = axes.flatten() if num_rows * num_columns > 1 else [axes]  # Handle single subplot case

idx_plot = 0
for res, query in zip(colpali_results, queries):
    for el in res:
        img = Image.fromarray(el["image"])
        axes[idx_plot].imshow(img)
        axes[idx_plot].set_title(f"Query: {query}\nSimilarity: {el['score']:.4f}")
        axes[idx_plot].axis('off')  # Turn off axes for a cleaner look
        idx_plot += 1

# Turn off remaining unused axes
for ax in axes[idx_plot:]:
    ax.axis('off')

plt.tight_layout()
plt.show()

# Print retrieved text for review
for i, res in enumerate(colpali_results):
    print(f"Query: {queries[i]}")
    for j, el in enumerate(res):
        print(f"Result {j + 1}: {el['text'][:300]}")  # Display first 300 characters of text
VQA: Visual Question Answering
The following function, `generate_VQA`, creates a visual question-answering (VQA) system that takes an image and a question, then analyzes the image to provide an answer based on visual cues.
- Convert Image to Base64: The image (`img`) is encoded to a base64 string, allowing it to be embedded in the API request.
- System Prompt: A structured prompt instructs the model to analyze the image, focusing on visual details that can answer the question.
- Request Payload: The payload includes the model (`gpt-4o-mini`), the system prompt, and the base64-encoded image. The model is expected to respond in JSON format, specifically returning an `answer` field with insights based on the image.
- Send API Request: Using the OpenAI client, the function sends the request. If successful, it parses and returns the answer; otherwise, it returns `False`.
This approach enables an AI-powered visual analysis of images to generate contextually relevant answers.
import json
import os
import openai
from getpass import getpass

os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")

client = openai.OpenAI()

def generate_VQA(base64_image: str, question: str):
    system_prompt = f"""You are a visual language model specialized in analyzing images. Below is an image provided by the user along with a question. Analyze the image carefully, paying attention to details relevant to the question. Construct a clear and informative answer that directly addresses the user's question, based on visual cues.

    The output must be in JSON format with the following structure:
    {{
        "answer": "The answer to the question based on visual analysis."
    }}

    Here is the question: {question}
    """

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": system_prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                    },
                ],
            }
        ],
        response_format={"type": "json_object"},
    )

    try:
        response = response.choices[0].message.content
        response = json.loads(response)
        answer = response["answer"]
        return answer
    except Exception as e:
        print(f"Error: {e}")
        return False
This code sets `question` to the first item in `queries`, converts the first image in `colpali_results` to an image format, and saves it as `image.jpg`.
question = queries[0]
output_image = "image.jpg"
img = Image.fromarray(colpali_results[0]["image"][0])
img.save(output_image)
The following code opens `image.jpg` in binary mode, encodes it to a base64 string, and passes it along with the question to the `generate_VQA` function, which returns an answer based on the image.
import base64

with open(output_image, "rb") as image_file:
    base64_image = base64.b64encode(image_file.read()).decode('utf-8')

answer = generate_VQA(base64_image, question)
print(answer)
Query: What are all the parts on the Back Panel of the Cisco 2911 Router?
Answer: The parts on the Back Panel of the Cisco 2911 Router are:
EHWIC slots (0, 1, 2, 3),
USB serial port,
AUX,
RJ-45 serial console port,
10/100/1000 Ethernet port (GE0/0),
10/100/1000 Ethernet port (GE0/1),
10/100/1000 Ethernet port (GE0/2)
USB 0,
USB 1,
Ground,
AC or DC or AC-POE Power Module
CompactFlash 1,
Service module slot 1.
File: Cisco_Cisco-Cisco-2900-Series.pdf
Use Case Examples
- Customer Support Portal: A user logs a ticket: “My printer shows error code E01—how do I clear it?” By leveraging ColPali embeddings, the system retrieves the exact page with the printer’s diagnostic table and relevant instructions.
- Field Technicians: On a mobile app, a technician searches for “Remove jam from feed roller.” The system shows a page with diagrams highlighting roller compartments, bypassing the noise of extra text or PDF scanning.
- Sales & Training: Trainers can quickly find the relevant slide or image from a 300-page user manual for a quick demonstration.
Conclusion: Transforming AI Search on Complex Documents With ColPali, MaxSim, and Deep Lake
ColPali provides an innovative way to leverage Vision Retrieval Augmented Generation, addressing the unique challenges posed by visually dense, richly formatted documents like user manuals. By directly processing pages as images and embedding their layout, text, and graphical elements, ColPali:
- Minimizes or eliminates OCR pitfalls, ensuring no valuable context is lost.
- Preserves visual and spatial context, critical for understanding tables, diagrams, and symbols often overlooked by text-only methods.
- Enhances retrieval accuracy, offering faster, more relevant results that improve support resolution times and user experiences.
When combined with Deep Lake, the solution becomes even more powerful. Deep Lake’s support for multi-vector retrieval and advanced mechanisms like MaxSim ensures that ColPali’s embeddings can scale efficiently, overcoming the memory and storage challenges traditionally associated with multi-vector models. Its ability to offload embeddings to object storage while maintaining high-speed queries unlocks a scalable, high-performance pipeline for even the largest document collections.
As you evaluate how to handle large-scale user manual ingestion and retrieval, consider the synergy of ColPali + Deep Lake for a comprehensive, accurate, and scalable approach. Together, they enable your support teams to leverage the full visual-textual context, delivering richer insights, faster solutions, and higher user satisfaction—empowering your organization to redefine customer support.
Appendix
Deep Lake: a Lakehouse for Deep Learning
ColPali: Efficient Document Retrieval with Vision Language Models
ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT
Vision Retrieval Augmented Generation With ColPali FAQs
What Is Vision Retrieval Augmented Generation (VRAG)?
VRAG combines vision-language models (VLMs) like ColPali with retrieval systems to extract and contextualize information from visually dense documents such as user manuals. Unlike traditional retrieval systems, VRAG captures both textual and visual cues to enhance document understanding.
How Does ColPali Handle Complex Document Layouts?
ColPali processes entire document pages as images, preserving the spatial relationships between text, tables, and visuals. By generating unified embeddings for layout, text, and visual elements, it ensures no critical context is lost, even for complex layouts.
Why Is Multi-Vector Retrieval Important for ColPali?
Multi-vector retrieval allows ColPali to create embeddings for each document patch (e.g., text blocks or image regions). This enables fine-grained matching between queries and document content, ensuring higher accuracy when retrieving relevant sections.
How Does Deep Lake Enhance ColPali’s Performance?
Deep Lake addresses the storage and scalability challenges of ColPali’s large embeddings by offloading them to cost-effective object storage. Its advanced querying capabilities, including MaxSim, ensure high-speed retrieval while managing vast datasets efficiently.
What Makes ColPali Better Than Traditional OCR-Based Systems?
Traditional OCR systems extract only text and often lose visual and spatial context. ColPali processes both visual and textual elements in tandem, enabling retrieval that considers the full richness of documents, including diagrams, tables, and visual cues.
How Does MaxSim Improve Retrieval in ColPali?
MaxSim computes the similarity between each query token and document patch, selecting the maximum similarity score for efficient retrieval. This late-interaction mechanism ensures ColPali can retrieve highly relevant information, even in complex document layouts.
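In notation, this is the standard ColBERT-style late-interaction formula the article references, where E_q denotes the query token vectors and E_d the document patch vectors:

$$\mathrm{score}(q, d) \;=\; \sum_{i=1}^{|E_q|} \max_{1 \le j \le |E_d|} \; E_q^{(i)} \cdot E_d^{(j)}$$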
What Storage Challenges Does ColPali Address With Deep Lake?
ColPali’s embeddings require significant storage (256 KB per page), which can be a bottleneck at scale. Deep Lake mitigates this by offloading data to object storage while maintaining retrieval performance, allowing organizations to scale without prohibitive memory costs.
Can ColPali Handle Visually Rich Elements Like Diagrams and Tables?
Yes, ColPali’s ability to process images directly allows it to capture and contextualize visually rich elements such as diagrams, tables, and figures. This ensures accurate retrieval for technical documents where visuals play a critical role.
How Does ColPali + Deep Lake Improve Support Teams’ Workflows?
The combination provides faster, more accurate retrieval from user manuals, enabling support teams to resolve queries efficiently. By leveraging ColPali’s advanced embeddings and Deep Lake’s scalable storage, teams can access relevant information in seconds.
What Are the Main Advantages of ColPali for Enterprise-Scale Retrieval?
ColPali offers unparalleled retrieval accuracy by combining vision and language understanding. When paired with Deep Lake, it ensures:
- Scalability for large document collections.
- High-speed retrieval with advanced multi-vector support.
- Cost-effective storage solutions for massive embeddings.