Unlock deeper insights from complex user manuals with ColPali’s Vision Retrieval Augmented Generation. By leveraging Deep Lake, ColPali’s large multi-vector embeddings are offloaded to scalable object storage while advanced operations like MaxSim run natively. This synergy makes high-speed, visually aware retrieval of complex manuals possible without hitting memory or engineering bottlenecks.
In this article, we’ll introduce an innovative Vision Retrieval Augmented Generation approach using Deep Lake and ColPali—a vision language model (VLM) that processes page images directly, capturing both visual and textual cues. Technical manuals often combine complex layouts, images, and text, making them challenging to search and navigate using traditional methods.
Key Benefits
- Enhanced Efficiency: Deep Lake scales ColPali’s embeddings without memory limitations.
- Advanced Features: Native support for MaxSim allows precise retrieval based on both visual and textual context.
- Unparalleled Accuracy: Directly processing page images captures nuances missed by text-only approaches.
ColPali’s Vision Retrieval Augmented Generation with MaxSim and Deep Lake tackles these challenges by combining powerful visual processing, robust scalability, and advanced search operations to deliver more accurate and efficient document retrieval.
Processing and Querying User Manuals with ColPali and Deep Lake
Product user manuals are an essential resource for customers and support teams. Many companies have amassed hundreds or even thousands of them, containing valuable product information such as setup steps, technical specifications, troubleshooting guides, and much more.
However, these manuals are rich in data and layout complexity, making them challenging to manage with conventional text-based search or question-answering systems. A lot of contextual information gets lost when attempting to extract just plain text from them. Text-based embedding methods overlook visual elements such as images, diagrams, tables, and other non-textual cues that are crucial for clarity and support.
The impact is inefficient product support, longer troubleshooting times, and overall diminished user satisfaction due to incomplete or suboptimal search results. Solving this problem delivers concrete benefits:
- Faster, More Accurate Support: Customer support teams need relevant documentation right away.
- Cost Savings: Reducing the time spent manually searching, scanning, or re-creating instructions.
- Better User Experience: Empowering customers with a self-service portal that finds the right page instantly, complete with text and visual context.
In this tutorial, we demonstrate the process of pre-processing PDF files with ColPali and uploading them to Deep Lake for efficient storage and retrieval. By the end of this guide, you’ll have a complete understanding of how to leverage this powerful combination to handle large-scale document collections.
For this example, we process 1,000 user manuals containing approximately 64,033 pages. These user manuals include complex layouts with text, diagrams, tables, and other visual elements, making them ideal candidates for ColPali’s advanced vision-language capabilities.
Once the dataset is uploaded to Deep Lake, we will showcase how to:
- Perform high-speed queries across the entire dataset.
- Retrieve relevant document pages with both textual and visual context.
- Demonstrate how the combination of ColPali and Deep Lake enables faster, more accurate document search and retrieval at scale.
What is ColPali?
ColPali is a novel document retrieval model that combines Vision Language Models (VLMs) with late interaction mechanisms to process and understand complex documents. It’s revolutionary because it abstracts away the need for standard OCR pipelines, processing entire documents as images and creating multi-vector embeddings that capture both textual and visual content.
Why use Deep Lake And ColPali for Multi-Modal AI Search?
ColPali introduces a revolutionary vision-language approach to document retrieval by capturing fine-grained contextual and visual cues. However, this advancement comes at a cost: storage scalability. As detailed in its research, ColPali’s embeddings require 256 KB per page, significantly exceeding the storage requirements of traditional methods like BM25 Sparse (1.56 KB per page) or BM25 Dense (3.00 KB per page). This far larger memory footprint poses challenges when scaling to vast document collections. Additionally, ColPali relies on a multi-vector retrieval mechanism, inspired by ColBERT’s late interaction, which is not natively supported by many vector retrieval frameworks, further increasing the engineering complexity of deployment.
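To put that footprint in concrete terms, here is a back-of-the-envelope calculation for the roughly 64,033-page collection used later in this tutorial (the 256 KB/page figure comes from the ColPali paper; the total is simple arithmetic):

```python
pages = 64_033
kb_per_page = 256                              # ColPali multi-vector embedding per page
total_gb = pages * kb_per_page / 1024 / 1024
print(f"~{total_gb:.1f} GB of embeddings")     # roughly 15.6 GB, before images or OCR text
```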
How Deep Lake Solves This Limitation
Deep Lake integrates seamlessly with ColPali to provide a scalable, high-performance solution that tackles both the storage and retrieval challenges of ColPali:
- Efficient Storage with Object Storage Offloading:
- ColPali’s multi-vector embeddings are offloaded to object storage (e.g., Amazon S3) instead of relying on costly in-memory storage (a brief sketch follows this list).
- This ensures storage needs scale effectively while minimizing operational complexity.
- Native Multi-Vector Retrieval Support:
- Deep Lake supports multi-vector retrieval mechanisms like ColPali’s late interaction model, including the computationally intensive MaxSim operator. MaxSim computes maximum similarity scores across tokens or patches, a critical feature for ColPali’s retrieval accuracy.
- By natively supporting such advanced operations, Deep Lake eliminates the need for extensive infrastructure engineering, enabling organizations to deploy ColPali seamlessly.
- Advanced Query Performance:
- Deep Lake’s indexing and streaming capabilities allow high-speed queries directly from object storage, enabling ColPali to retrieve relevant information without being bottlenecked by its larger embedding size.
- Scalability and Accessibility:
- With support for multi-modal data types like embeddings, images, and text, Deep Lake ensures ColPali’s embeddings are managed efficiently across thousands of documents, enabling organizations to handle even the most demanding workloads.
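As a brief, hedged illustration of the object-storage offloading point above (the bucket name is hypothetical, and AWS credentials are assumed to be available in your environment), a Deep Lake dataset can be created directly against S3 rather than held in memory:

```python
import deeplake

# Hypothetical bucket/prefix; credentials are assumed to come from the environment
# (e.g. AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY).
ds_s3 = deeplake.create("s3://my-bucket/colpali-user-manuals")

# Later in this tutorial we use Deep Lake's managed storage instead:
# ds = deeplake.create(f"al://{org_id}/{user_manual_dataset_name}")
```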
The combination of ColPali’s cutting-edge vision-language model and Deep Lake’s robust data infrastructure unlocks new possibilities for organizations:
- Unparalleled Retrieval Quality: Leveraging ColPali’s embeddings ensures retrieval accuracy even in visually rich and complex manuals.
- Optimized Scalability: Deep Lake reduces storage constraints, enabling ColPali to scale seamlessly to handle large datasets.
- Future-Proof Performance: By offloading storage while maintaining high-speed retrieval, organizations can confidently scale their operations without compromising performance.
- Efficient Deployment: Deep Lake’s support for multi-vector retrieval and MaxSim significantly reduces the engineering overhead traditionally required to adapt ColPali for production use.
This integration isn’t just about solving a technical limitation—it’s about making ColPali’s revolutionary technology scalable, practical, and future-proof for real-world, enterprise-scale document retrieval. Together, Deep Lake and ColPali empower organizations to utilize the full potential of vision-language retrieval at scale.
Standard PDF Image Processing
Typically, companies rely on a sequence of steps to make PDFs searchable and indexable.
This visual breaks down the traditional document retrieval pipeline, highlighting the numerous steps involved in indexing and querying data from structured documents like PDFs.
It starts with OCR systems or PDF parsers to extract text from pages. Then, layout detection models identify key components like paragraphs, tables, titles, and figures. A chunking strategy groups text passages into semantically meaningful segments, and in some cases, captioning models describe visual elements in natural language to make them more embedding-friendly.
While effective, this process is slow, requiring 7.22 seconds per page during indexing, and involves significant complexity to ensure both text and visual content are properly captured.
User manuals can lose crucial context in the transformation from page to text-only chunks. Visual elements (images, diagrams, call-outs) and their relationships to text are often separated or lost entirely.
This page below illustrates why traditional retrieval methods fail to capture the complexity of user manuals:
- Loss of Context: Separating text from symbols, diagrams, or tables destroys the relationships critical for accurate interpretation.
- Visual Cues: Symbols and diagrams provide essential context that text-based embeddings alone cannot capture.
- Non-Linear Layout: The multi-modal and hierarchical structure requires advanced vision-language models like ColPali to integrate spatial, visual, and textual data into a unified embedding.
By highlighting these elements, we show why integrating ColPali and Deep Lake is crucial for enabling sophisticated retrieval systems capable of understanding the full scope of user manuals like this.
ColPali’s Vision-Language Approach
ColPali is designed to handle visual+text data holistically. It’s a Vision Language Model (VLM) that encodes an entire page image into a high-dimensional embedding space—without depending on OCR or elaborate layout analysis.
The diagram showcases ColPali’s architecture, a vision-language model that excels at combining visual and textual cues for document retrieval. Here’s a breakdown of its core components and how they work together efficiently:
Offline Document Encoding
On the left side, a document is passed into ColPali’s Vision Language Model (VLM) through a dedicated offline pipeline:
- Vision & Language Encoders: ColPali processes each document with a vision encoder (to handle images and layout) and a language model (for textual content), generating multidimensional embeddings that capture the document’s visual and textual elements.
- Pre-Indexing: These embeddings are then stored in a pre-indexed format, making them readily accessible for quick lookups during the query phase.
Online Query Processing
On the right side, the online pipeline manages user searches (e.g., “What are ViTs?”):
- Query Embedding: The user’s question is transformed into an embedding using the same language model.
- Late Interaction with MaxSim: ColPali compares each component of the query embedding against the previously generated document embeddings. It uses a MaxSim operation to pinpoint the most similar regions—whether they’re textual passages or sections of the page layout.
Similarity Scoring
Based on the MaxSim comparisons, ColPali produces a similarity score indicating which document segments (or entire documents) are most relevant. By simultaneously leveraging the document’s visual layout and textual content, this approach captures critical nuances that might be missed by traditional text-only methods.
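To make the late-interaction scoring concrete, here is a minimal sketch of the MaxSim operator (an illustration, not ColPali’s internal implementation): each query token keeps only its best-matching page patch, and those maxima are summed into a single page score.

```python
import torch

def maxsim_score(query_emb: torch.Tensor, page_emb: torch.Tensor) -> float:
    """Late-interaction MaxSim score.

    query_emb: (n_query_tokens, dim) multi-vector query embedding
    page_emb:  (n_patches, dim) multi-vector page embedding
    """
    sim = query_emb @ page_emb.T                    # (n_query_tokens, n_patches) similarities
    return sim.max(dim=1).values.sum().item()       # best patch per query token, summed

# Toy usage with random vectors of ColPali's 128-dim size
q = torch.randn(20, 128)
p = torch.randn(1024, 128)
print(maxsim_score(q, p))
```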
!pip install colpali-engine
import torch
from PIL import Image

from colpali_engine.models import ColPali, ColPaliProcessor

model_name = "vidore/colpali-v1.2"

model = ColPali.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",  # or "mps" if on Apple Silicon
).eval()

processor = ColPaliProcessor.from_pretrained(model_name)
ColPali’s late interaction design also keeps retrieval fast and efficient, even across large-scale collections with complex, visually rich documents—tables, figures, infographics, and more. By tightly integrating vision and language, ColPali outperforms standard solutions in scenarios where visual context is as important as text.
How ColPali Works
1. Input Transformation
Each user-manual page is converted to an image. The model divides it into a grid—e.g., 32×32 patches—to capture localized features.
2. Vision Feature Extraction
Each image patch undergoes multiple transformations to yield a 128-dimensional representation that captures both local (character-level) and global (layout-level) patterns.
3. Semantic Context Integration
ColPali then aligns visual cues with any textual semantics (if text is partially detected in the image) to build a deeper understanding of the page’s content.
4. Representation Refinement
These intermediate vectors are further refined through attention mechanisms or transformers, ensuring that relationships among patches, text blocks, and layout elements are represented holistically.
5. Contextualized Data Embedding
Finally, the model outputs a unified embedding that encodes the entire page’s structure, text, and visuals. This embedding is used for indexing and retrieval.
Result: A vector representation that truly captures the visual and textual context of the page.
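To see what this produces in practice, here is a small, hedged sketch that embeds a single page image using the `model` and `processor` loaded above (the image path is hypothetical, and the exact patch count depends on the model configuration):

```python
import torch
from PIL import Image

# Hypothetical page image exported from a user manual
page = Image.open("manual_page.png")

# Prepare the image batch for the vision encoder
batch = processor.process_images([page]).to(model.device)

with torch.no_grad():
    page_embeddings = model(**batch)

# One 128-dimensional vector per image patch/token, e.g. roughly (1, ~1030, 128)
print(page_embeddings.shape)
```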
We start by installing the necessary packages.
!pip install --quiet deeplake colpali-engine accelerate pytesseract pymupdf pillow
!sudo apt-get install -y poppler-utils
!apt-get update
!apt-get install -y tesseract-ocr
Create a Deep Lake dataset for ColPali’s visual question answering. Stored in `ds`, it includes an `embedding` column for 2D float arrays, a `title` column for the PDF name, a `text` column for the OCR-extracted page text, and an `image` column for the page images. After defining the structure, `ds.commit()` saves the setup, optimizing it for ColPali’s multi-modal retrieval.
import deeplake
from deeplake import types

org_id = "<your_org_name>"
user_manual_dataset_name = "<dataset_name>"

ds = deeplake.create(f"al://{org_id}/{user_manual_dataset_name}")

# Force columns to be 2D for embedding, 3D for image
ds.add_column("title", dtype=types.Text())
ds.add_column("text", dtype=types.Text())
ds.add_column("embedding", dtype=types.Array(types.Float32(), dimensions=2))
ds.add_column("image", dtype=types.Image())

ds.commit()
ds.summary()
Dataset Overview
The script below processes PDFs, generates embeddings, and stores the results efficiently:
- `convert_pdfs_to_images()` turns PDF pages into images using PyMuPDF.
- `ocr_single_image()` extracts text from an image with Tesseract OCR.
- `extract_text_parallel()` runs OCR on multiple images using multiprocessing.
- `process_batch()` generates embeddings and processes image batches with the model.
- `batch_process_multiple_pdfs()` converts PDFs to images, extracts text, creates embeddings, and saves everything to Deep Lake, catching errors during data saving.
- `process_pdfs_in_batches()` processes PDFs in smaller groups and moves completed files.
- The `__main__` block sets paths and processes PDFs in batches.
import os
import time
import shutil
import fitz  # PyMuPDF
import pytesseract
import deeplake
import numpy as np
import torch
from PIL import Image
from concurrent.futures import ProcessPoolExecutor

# Helper function: Convert PDFs to images
def convert_pdfs_to_images(pdf_path, zoom=1.5):
    """
    Converts a single PDF file (pdf_path) into a list of images (one per page)
    using PyMuPDF.
    """
    document = fitz.open(pdf_path)
    zoom_matrix = fitz.Matrix(zoom, zoom)  # Scale for high-resolution images
    images = []

    for page_number, page in enumerate(document):
        pix = page.get_pixmap(matrix=zoom_matrix, alpha=False)
        image = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
        images.append(image)

    print(f"Converted {len(images)} pages from {pdf_path} to images.")
    return images

# Helper function: Parallel OCR
def ocr_single_image(image):
    """
    Extracts text from a single image using Tesseract OCR.
    """
    return pytesseract.image_to_string(image)

def extract_text_parallel(images, num_workers=4):
    """
    Extract text from images using multiprocessing.
    """
    print("Starting parallel OCR...")
    start_time = time.time()
    with ProcessPoolExecutor(max_workers=num_workers) as executor:
        texts = list(executor.map(ocr_single_image, images))
    print(f"Finished OCR in {time.time() - start_time:.2f} seconds.")
    return texts

# Helper function: Process a batch of images
def process_batch(batch_images, batch_texts, batch_titles, model, processor):
    """
    Process a batch of images to generate embeddings.
    """
    batch_text_prompts = ["<image> <bos>" for _ in batch_images]

    inputs = processor(
        images=batch_images, text=batch_text_prompts,
        return_tensors="pt", truncation=True
    ).to(model.device)

    with torch.no_grad():
        outputs = model(**inputs)
    embeddings = list(torch.unbind(outputs.to("cpu")))  # Convert to list of tensors

    embeddings_list = [embedding.tolist() for embedding in embeddings]
    numpy_images = [np.array(img).astype(np.uint8) for img in batch_images]

    return {
        "embedding": embeddings_list,
        "text": batch_texts,
        "image": numpy_images,
        "title": batch_titles
    }

# Batch Process and Store Multiple PDFs
def batch_process_multiple_pdfs(
    dataset_path,
    pdf_files,
    pdf_folder,
    model,
    processor,
    batch_size=8
):
    """
    Processes multiple PDFs together, appending their data to Deep Lake and committing
    in bulk.
    """
    temp_data = []  # Temporary buffer to store results before committing
    total_pdfs = len(pdf_files)

    for idx, pdf_file in enumerate(pdf_files):
        pdf_path = os.path.join(pdf_folder, pdf_file)
        print(f"Processing: {pdf_file} ({idx + 1}/{total_pdfs})")
        start_time = time.time()

        # Convert PDF to images
        images = convert_pdfs_to_images(pdf_path)
        num_pages = len(images)
        print(f"Number of pages: {num_pages}")

        # Perform OCR and batch processing
        extracted_texts = extract_text_parallel(images)
        total_batches = len(images) // batch_size + int(len(images) % batch_size != 0)

        for i in range(total_batches):
            start_idx = i * batch_size
            end_idx = min((i + 1) * batch_size, len(images))
            batch_images = images[start_idx:end_idx]
            batch_texts = extracted_texts[start_idx:end_idx]
            batch_titles = [pdf_file] * len(batch_images)

            # Process the batch and store results in the temporary buffer
            processed_data = process_batch(
                batch_images,
                batch_texts,
                batch_titles,
                model,
                processor
            )
            temp_data.append(processed_data)

        elapsed_time = time.time() - start_time
        print(f"Time taken for {pdf_file}: {elapsed_time:.2f} seconds")
        print("-" * 50)

    # Append all data in bulk and commit
    print(f"Appending data for {len(pdf_files)} PDFs to Deep Lake...")
    try:
        ds = deeplake.open(dataset_path)
        for data in temp_data:
            ds.append(data)

        # Commit the data
        ds.commit(
            f"Stored embeddings, images, texts, and titles for {len(pdf_files)} PDFs."
        )
        print(f"Committed data for {len(pdf_files)} PDFs.")
    except Exception as e:
        print(f"Error while appending/committing data: {e}")

# Process PDFs in Batches
def process_pdfs_in_batches(
    pdf_folder,
    processed_folder,
    dataset_path,
    model,
    processor,
    batch_size=8,
    pdf_batch_size=5
):
    """
    Processes PDFs in batches (e.g., 5 PDFs at a time),
    appending their data in bulk to Deep Lake.
    """
    if not os.path.exists(processed_folder):
        os.makedirs(processed_folder)

    pdf_files = [f for f in os.listdir(pdf_folder) if f.lower().endswith(".pdf")]
    total_files = len(pdf_files)

    # Process files in batches
    for i in range(0, total_files, pdf_batch_size):
        batch_files = pdf_files[i : i + pdf_batch_size]
        print(f"Processing batch {i // pdf_batch_size + 1} with {len(batch_files)} PDFs")
        batch_process_multiple_pdfs(
            dataset_path,
            batch_files,
            pdf_folder,
            model,
            processor,
            batch_size
        )

        # Move processed PDFs
        for pdf_file in batch_files:
            processed_path = os.path.join(processed_folder, pdf_file)
            shutil.move(os.path.join(pdf_folder, pdf_file), processed_path)
            print(f"Moved {pdf_file} to {processed_folder}")
        print("=" * 50)

# Example usage
if __name__ == "__main__":
    pdf_folder = "/content"  # Upload your PDF files and point to the directory
    processed_folder = "/content/processed"  # Folder to store the processed PDFs
    org_id = "<your_org_name>"
    user_manual_dataset_name = "<dataset_name>"

    dataset_path = f"al://{org_id}/{user_manual_dataset_name}"

    # `model` and `processor` are the ColPali model and processor loaded earlier
    process_pdfs_in_batches(
        pdf_folder,
        processed_folder,
        dataset_path,
        model,
        processor,
        batch_size=8,        # adjust based on your requirements
        pdf_batch_size=10    # adjust based on your requirements
    )
Processing batch 1 with 10 PDFs…
Processing: Black _ Decker_Black-And-Decker-Ks531.pdf (1/10)
Number of pages: 12
Starting parallel OCR…
Finished OCR in 6.64 seconds.
Time taken for Black _ Decker_Black-And-Decker-Ks531.pdf: 11.75 seconds
Processing: Black _ Decker_Black-And-Decker-Bdcdd12c.pdf (2/10)
Number of pages: 20
Starting parallel OCR…
Finished OCR in 10.63 seconds.
Time taken for Black _ Decker_Black-And-Decker-Bdcdd12c.pdf: 19.06 seconds
Retrieval: Querying With MaxSim
When a user or support agent has a query—e.g., “How do I reset the device to factory settings?”—the ColPali pipeline does the following:
- Query Breakdown
- The text query is tokenized into smaller units (tokens).
- Semantic Processing
- A language model interprets these tokens, establishing contextual relationships.
- Context Projection
- The semantic output is aligned with the visual embeddings of the pages, ensuring consistent dimensionality.
- Unified Embedding Generation
- A multi-vector embedding of the query is produced, matching the format of each page’s embedding.
- MaxSim Computation
- For each page’s embedding, ColPali computes similarity with the query embedding at a fine-grained level, picking the maximum similarity (similar to ColBERT).
- This approach better captures nuances of each page’s layout, text, and images. The most relevant pages are then returned.
The result: faster, more accurate retrieval for user manuals, where a single page (with both visuals and text) can fully address a user’s query.
If you don’t want to upload and process your own PDFs, no problem! You can test the retrieval capabilities directly on a pre-processed dataset of 1,000 user manuals, containing approximately 64,033 pages, hosted on Deep Lake.
This dataset has already been processed with ColPali, meaning it includes rich, multi-vector embeddings that capture both textual and visual context. By accessing this dataset, you can:
- Experiment with querying real-world, complex user manuals.
- Explore how ColPali’s advanced vision-language capabilities enhance retrieval accuracy.
- Test retrieval speeds and gain insights into how the pipeline scales for large document collections.
To get started, simply connect to the Deep Lake dataset using the following code:
import deeplake

org_id = "genai360"
user_manual_dataset_name = "user_manual_dataset_colpali"

ds = deeplake.open_read_only(f"al://{org_id}/{user_manual_dataset_name}")
ds.summary()
Chat with Images
We enter a question and process it with `processor`, sending it to the model’s device. Embeddings are generated without gradients, converted to a list format, and stored in `query_embeddings`.
# Prepare the query
queries = ["What can you tell me about the Rated voltage Un?"]

# Generate query embeddings
batch_queries = processor.process_queries(queries).to(model.device)
with torch.no_grad():
    query_embeddings = model(**batch_queries)
query_embeddings = query_embeddings.tolist()
Retrieve The Most Similar Images
For each embedding in `query_embeddings`, we format it as a nested array string for querying. The innermost lists (`q_substrs`) are converted to `ARRAY[...]` format and then combined into a single string, `q_str`. This formatted string is used in a TQL query against the dataset, computing the `maxsim` similarity between `q_str` and the `embedding` column. The query returns the top `n_res` results, ordered by similarity score (`score`). This loop performs a similarity search for each query embedding.
# Query the dataset using MaxSim
colpali_results = []
n_res = 1  # Number of results to retrieve

for query_embedding in query_embeddings:
    # Convert query embedding into a formatted TQL array string
    q_substrs = [f"ARRAY[{','.join(str(x) for x in row)}]" for row in query_embedding]
    q_str = f"ARRAY[{','.join(q_substrs)}]"

    # Construct and execute the TQL query
    tql_colpali = f"""
        SELECT *, maxsim({q_str}, embedding) AS score
        ORDER BY maxsim({q_str}, embedding) DESC
        LIMIT {n_res}
    """
    try:
        result = ds.query(tql_colpali)
        colpali_results.append(result)
    except Exception as e:
        print(f"Error during query execution: {e}")
        colpali_results.append([])
For each result in `colpali_results`, the code below displays the retrieved page image alongside the query and its similarity `score`. It converts the stored image data back to an image with `Image.fromarray(el["image"])`, renders it in a matplotlib subplot, and then prints the first 300 characters of the retrieved text. This loop visually presents each query’s closest matches alongside their similarity scores.
import matplotlib.pyplot as plt
from PIL import Image

# Visualize the results
num_columns = n_res
num_rows = len(colpali_results)

fig, axes = plt.subplots(num_rows, num_columns, figsize=(15, 5 * num_rows))
axes = axes.flatten() if num_rows * num_columns > 1 else [axes]  # Handle single subplot case

idx_plot = 0
for res, query in zip(colpali_results, queries):
    for el in res:
        img = Image.fromarray(el["image"])
        axes[idx_plot].imshow(img)
        axes[idx_plot].set_title(f"Query: {query}\nSimilarity: {el['score']:.4f}")
        axes[idx_plot].axis('off')  # Turn off axes for a cleaner look
        idx_plot += 1

# Turn off remaining unused axes
for ax in axes[idx_plot:]:
    ax.axis('off')

plt.tight_layout()
plt.show()

# Print retrieved text for review
for i, res in enumerate(colpali_results):
    print(f"Query: {queries[i]}")
    for j, el in enumerate(res):
        print(f"Result {j + 1}: {el['text'][:300]}")  # Display first 300 characters of text
VQA: Visual Question Answering
The following function, `generate_VQA`, creates a visual question-answering (VQA) system that takes an image and a question, then analyzes the image to provide an answer based on visual cues.
- Convert Image to Base64: The image (`img`) is encoded to a base64 string, allowing it to be embedded in the API request.
- System Prompt: A structured prompt instructs the model to analyze the image, focusing on visual details that can answer the question.
- Request Payload: The payload includes the model (`gpt-4o-mini`), the system prompt, and the base64-encoded image. The model is expected to respond in JSON format, specifically returning an `answer` field with insights based on the image.
- Send API Request: Using the OpenAI client, the function sends the request. If successful, it parses and returns the answer; otherwise, it returns `False`.
This approach enables an AI-powered visual analysis of images to generate contextually relevant answers.
import json
import os
import openai
from getpass import getpass

os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")

client = openai.OpenAI()

def generate_VQA(base64_image: str, question: str):
    system_prompt = f"""You are a visual language model specialized in analyzing images. Below is an image provided by the user along with a question. Analyze the image carefully, paying attention to details relevant to the question. Construct a clear and informative answer that directly addresses the user's question, based on visual cues.

    The output must be in JSON format with the following structure:
    {{
        "answer": "The answer to the question based on visual analysis."
    }}

    Here is the question: {question}
    """

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": system_prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                    },
                ],
            }
        ],
        response_format={"type": "json_object"},
    )

    try:
        response = response.choices[0].message.content
        response = json.loads(response)
        answer = response["answer"]
        return answer
    except Exception as e:
        print(f"Error: {e}")
        return False
This code sets `question` to the first item in `queries`, converts the first image in `colpali_results` to an image format, and saves it as `image.jpg`.
question = queries[0]
output_image = "image.jpg"
img = Image.fromarray(colpali_results[0]["image"][0])
img.save(output_image)
The following code opens `image.jpg` in binary mode, encodes it to a base64 string, and passes it along with the question to the `generate_VQA` function, which returns an answer based on the image.
import base64

with open(output_image, "rb") as image_file:
    base64_image = base64.b64encode(image_file.read()).decode('utf-8')

answer = generate_VQA(base64_image, question)
print(answer)
Query: What are all the parts on the Back Panel of the Cisco 2911 Router?
Answer: The parts on the Back Panel of the Cisco 2911 Router are:
EHWIC slots (0, 1, 2, 3),
USB serial port,
AUX,
RJ-45 serial console port,
10/100/1000 Ethernet port (GE0/0),
10/100/1000 Ethernet port (GE0/1),
10/100/1000 Ethernet port (GE0/2)
USB 0,
USB 1,
Ground,
AC or DC or AC-POE Power Module
CompactFlash 1,
Service module slot 1.
File: Cisco_Cisco-Cisco-2900-Series.pdf
Use Case Examples
- Customer Support Portal: A user logs a ticket: “My printer shows error code E01—how do I clear it?” By leveraging ColPali embeddings, the system retrieves the exact page with the printer’s diagnostic table and relevant instructions.
- Field Technicians: On a mobile app, a technician searches for “Remove jam from feed roller.” The system shows a page with diagrams highlighting roller compartments, bypassing the noise of extra text or PDF scanning.
- Sales & Training: Trainers can quickly find the relevant slide or image from a 300-page user manual for a quick demonstration.
Conclusion: Transforming AI Search on Complex Documents With ColPali, MaxSim, and Deep Lake
ColPali provides an innovative way to leverage Vision Retrieval Augmented Generation, addressing the unique challenges posed by visually dense, richly formatted documents like user manuals. By directly processing pages as images and embedding their layout, text, and graphical elements, ColPali:
- Minimizes or eliminates OCR pitfalls, ensuring no valuable context is lost.
- Preserves visual and spatial context, critical for understanding tables, diagrams, and symbols often overlooked by text-only methods.
- Enhances retrieval accuracy, offering faster, more relevant results that improve support resolution times and user experiences.
When combined with Deep Lake, the solution becomes even more powerful. Deep Lake’s support for multi-vector retrieval and advanced mechanisms like MaxSim ensures that ColPali’s embeddings can scale efficiently, overcoming the memory and storage challenges traditionally associated with multi-vector models. Its ability to offload embeddings to object storage while maintaining high-speed queries unlocks a scalable, high-performance pipeline for even the largest document collections.
As you evaluate how to handle large-scale user manual ingestion and retrieval, consider the synergy of ColPali + Deep Lake for a comprehensive, accurate, and scalable approach. Together, they enable your support teams to leverage the full visual-textual context, delivering richer insights, faster solutions, and higher user satisfaction—empowering your organization to redefine customer support.
Appendix
Deep Lake: a Lakehouse for Deep Learning
ColPali: Efficient Document Retrieval with Vision Language Models
ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT
Vision Retrieval Augmented Generation With ColPali FAQs
What Is Vision Retrieval Augmented Generation (VRAG)?
VRAG combines vision-language models (VLMs) like ColPali with retrieval systems to extract and contextualize information from visually dense documents such as user manuals. Unlike traditional retrieval systems, VRAG captures both textual and visual cues to enhance document understanding.
How Does ColPali Handle Complex Document Layouts?
ColPali processes entire document pages as images, preserving the spatial relationships between text, tables, and visuals. By generating unified embeddings for layout, text, and visual elements, it ensures no critical context is lost, even for complex layouts.
Why Is Multi-Vector Retrieval Important for ColPali?
Multi-vector retrieval allows ColPali to create embeddings for each document patch (e.g., text blocks or image regions). This enables fine-grained matching between queries and document content, ensuring higher accuracy when retrieving relevant sections.
How Does Deep Lake Enhance ColPali’s Performance?
Deep Lake addresses the storage and scalability challenges of ColPali’s large embeddings by offloading them to cost-effective object storage. Its advanced querying capabilities, including MaxSim, ensure high-speed retrieval while managing vast datasets efficiently.
What Makes ColPali Better Than Traditional OCR-Based Systems?
Traditional OCR systems extract only text and often lose visual and spatial context. ColPali processes both visual and textual elements in tandem, enabling retrieval that considers the full richness of documents, including diagrams, tables, and visual cues.
How Does MaxSim Improve Retrieval in ColPali?
MaxSim computes the similarity between each query token and document patch, selecting the maximum similarity score for efficient retrieval. This late-interaction mechanism ensures ColPali can retrieve highly relevant information, even in complex document layouts.
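In notation, this is the standard ColBERT-style late-interaction formula the article references, where E_q denotes the query token vectors and E_d the document patch vectors:

$$\mathrm{score}(q, d) \;=\; \sum_{i=1}^{|E_q|} \max_{1 \le j \le |E_d|} \; E_q^{(i)} \cdot E_d^{(j)}$$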
What Storage Challenges Does ColPali Address With Deep Lake?
ColPali’s embeddings require significant storage (256 KB per page), which can be a bottleneck at scale. Deep Lake mitigates this by offloading data to object storage while maintaining retrieval performance, allowing organizations to scale without prohibitive memory costs.
Can ColPali Handle Visually Rich Elements Like Diagrams and Tables?
Yes, ColPali’s ability to process images directly allows it to capture and contextualize visually rich elements such as diagrams, tables, and figures. This ensures accurate retrieval for technical documents where visuals play a critical role.
How Does ColPali + Deep Lake Improve Support Teams’ Workflows?
The combination provides faster, more accurate retrieval from user manuals, enabling support teams to resolve queries efficiently. By leveraging ColPali’s advanced embeddings and Deep Lake’s scalable storage, teams can access relevant information in seconds.
What Are the Main Advantages of ColPali for Enterprise-Scale Retrieval?
ColPali offers unparalleled retrieval accuracy by combining vision and language understanding. When paired with Deep Lake, it ensures:
- Scalability for large document collections.
- High-speed retrieval with advanced multi-vector support.
- Cost-effective storage solutions for massive embeddings.