How to Conduct Multimodal Search with ImageBind & Deep Lake?
If you’ve ever thought that multimodality in AI is limited to generating images with Midjourney or DALL-E, think again. Multimodal use cases will only become more prevalent, with each additional modality unlocking incremental business value.
In this guide, we’ll explore the creation of a search engine that retrieves AI-generated images using text, audio, or visual inputs, opening new doors for accessibility, user experience, and business intelligence.
To achieve this, we will leverage ImageBind by Meta AI, a game-changer for multimodal AI applications. It captures diverse data modalities and maps them into a common vector space, making our search more powerful. This unlocks novel use cases beyond a vanilla image similarity search.
Unlike anything else, Deep Lake by Activeloop enables the storage and querying of multimodal data (not only the embeddings but also the raw data!). With Deep Lake and ImageBind, the potential applications of this technology are vast. Whether improving product discovery in eCommerce, streamlining digital media libraries, enhancing accessibility in tech products, or powering intuitive search in digital archives, this innovation can drive user satisfaction and business growth.
Let’s build the AI Image Search App!
We need four things: data, a way to generate embeddings, a vector database to store them, and an interactive app. Let’s start with the data. You can also take a look at the companion video below and fork the GitHub repo.
Gathering AI-Generated Images from Lexica for AI Image Search
We were looking for interesting data to search over and came across the Lexica dataset, which contains images from Lexica - a website where you can search for AI-generated (mostly Stable Diffusion) images. Lexica’s own image search works by exact match on the prompt, while we will build a semantic search.
Since we want to display the images in the web app, we need to fetch them and store them somewhere (in our case, an S3 bucket - but since Deep Lake is serverless, you can store them wherever you like). So, first, we load the Hugging Face dataset.
# pip install datasets
from datasets import load_dataset

dataset = load_dataset("xfh/lexica_6k", split="train")
Then we store each image on disk.
for row in dataset:
    image = row["image"]
    name = row["text"]
    image.save(f"{name}.jpg")
This is a simplified version. For speed, we used a thread pool over dataset batches, and then stored the images in an S3 bucket so we can show them later in the app.
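For reference, here is a minimal sketch of what the threaded version could look like (save_row is a hypothetical helper, and max_workers is an arbitrary choice; the actual batching code lives in the repo):

from concurrent.futures import ThreadPoolExecutor

def save_row(row):
    # each row holds a PIL image and the prompt that generated it
    row["image"].save(f"{row['text']}.jpg")

# run the saves in parallel over the whole dataset
with ThreadPoolExecutor(max_workers=16) as pool:
    list(pool.map(save_row, dataset))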
Create Image Embeddings for Multimodal Retrieval
To search the images given a query, we need to encode the images into embeddings, then encode the query and use cosine similarity to find the “closest”, aka “most similar”, images. We want to search using multiple modalities: text, images, or audio. For this reason, we decided to use the new Meta model called ImageBind. If you want, you can learn more about generating image embeddings.
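As a quick refresher, this is the basic idea behind cosine-similarity retrieval (a toy sketch with random vectors, not the actual query we will later run on Deep Lake):

import torch
import torch.nn.functional as F

# toy setup: 1000 image embeddings and a single query embedding, both 1024-dimensional
image_embeddings = F.normalize(torch.randn(1000, 1024), dim=-1)
query_embedding = F.normalize(torch.randn(1, 1024), dim=-1)

# on normalized vectors, cosine similarity is just a dot product
scores = image_embeddings @ query_embedding.T  # shape (1000, 1)
top5 = scores.squeeze(-1).topk(k=5).indices    # indices of the 5 most similar images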
What is ImageBind?
In a nutshell, ImageBind is a transformer-based model trained on multiple pairs of modalities, e.g., text-image, that learns to map all of them into the same vector space. This means that a text query “dog” will be mapped close to a dog image, allowing us to search that space seamlessly. The main advantage is that we don’t need one model per modality, like in CLIP where you have one encoder for text and one for images; we can use the same weights for all of them. The following image, taken from the paper, shows the idea.
The model supports images, text, audio, depth, thermal, and IMU data. We will limit ourselves to the first three. The task of learning similar embeddings for similar concepts in different modalities, e.g., the word “dog” and an image of a dog, is called alignment. The ImageBind authors used a Vision Transformer (ViT), a typical architecture these days. Because the modalities are so different, the preprocessing step differs for each of them: for videos we need to consider the time dimension, audio needs to be converted to a spectrogram, and so on, but the main weights are shared.
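For example, the audio preprocessing roughly boils down to this (a simplified sketch using torchaudio; ImageBind’s own load_and_transform_audio_data takes care of clipping, resampling, and normalization):

import torch
import torchaudio

waveform, sample_rate = torchaudio.load(".assets/dog_audio.wav")
# turn the waveform into a mel spectrogram, which is then treated as a 2D "image" by the ViT
mel = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate, n_mels=128)(waveform)
log_mel = torch.log(mel + 1e-6)  # log scaling before feeding the encoder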
To learn to align pairs of modalities such as (text, image) and (audio, text), the authors used contrastive learning, specifically the InfoNCE loss. With InfoNCE, the model is trained to identify the positive example in a batch of negatives by maximizing the similarity between positive pairs and minimizing the similarity between negative ones.
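To make the idea concrete, here is a minimal sketch of an InfoNCE-style loss for a batch of paired (image, text) embeddings, where each row’s matching pair is the positive and every other element in the batch acts as a negative (this is not ImageBind’s actual training code):

import torch
import torch.nn.functional as F

def info_nce(image_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07):
    # normalize so the dot product becomes a cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # (batch, batch) similarity matrix: the diagonal holds the positive pairs
    logits = image_emb @ text_emb.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # cross-entropy pulls each image towards its own caption and away from the others
    return F.cross_entropy(logits, targets)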
The most exciting thing is that even though the model was trained on (text, image) and (audio, text) pairs, it also learns the (image, audio) alignment. This is what the authors call the “emergent alignment of unseen pairs of modalities”.
Moreover, we can do embedding space arithmetic, adding (or subtracting) embeddings from multiple modalities to capture different semantic information. We’ll play with this later on.
For the most curious reader, you can learn more by reading the paper.
Okay, let’s get the image embeddings. We need to load the model and store the embeddings for all the images, so we can later read them and dump them into the vector database.
Getting the embeddings is quite easy with the ImageBind code.
import data
import torch
from models import imagebind_model
from models.imagebind_model import ModalityType

text_list = ["A dog.", "A car", "A bird"]
image_paths = [".assets/dog_image.jpg", ".assets/car_image.jpg", ".assets/bird_image.jpg"]
audio_paths = [".assets/dog_audio.wav", ".assets/car_audio.wav", ".assets/bird_audio.wav"]

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Instantiate model
model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)

# Load data
inputs = {
    ModalityType.TEXT: data.load_and_transform_text(text_list, device),
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
}

with torch.no_grad():
    embeddings = model(inputs)

print(embeddings[ModalityType.VISION])
print(embeddings[ModalityType.AUDIO])
print(embeddings[ModalityType.TEXT])
We first store all the image embeddings as .pth files on disk, using a simple function that batches the images. Note that we store a dictionary so we can attach metadata; we are interested in the image path and will use it later.
from pathlib import Path

import torch
from tqdm import tqdm

@torch.no_grad()
def encode_images(
    images_root: Path,
    model: torch.nn.Module,
    embeddings_out_dir: Path,
    batch_size: int = 64,
):
    # not the best way, but the quickest to write; a better way would be a torch Dataset + DataLoader
    # `chunks` and `get_images_embeddings` are small helpers defined in the repo
    images = images_root.glob("*.jpeg")
    embeddings_out_dir.mkdir(exist_ok=True)
    for batch_idx, chunk in tqdm(enumerate(chunks(images, batch_size))):
        images_paths_str = [str(el) for el in chunk]
        images_embeddings = get_images_embeddings(model, images_paths_str)
        torch.save(
            [
                {"metadata": {"path": image_path}, "embedding": embedding}
                for image_path, embedding in zip(images_paths_str, images_embeddings)
            ],
            f"{str(embeddings_out_dir)}/{batch_idx}.pth",
        )
Note that a better solution would have been to use a torch Dataset + DataLoader; we dive into this in this image embedding tutorial.
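If you are curious, a Dataset-based version could look roughly like this (ImageDataset is a hypothetical class for illustration; the transform would reuse ImageBind’s vision preprocessing):

from pathlib import Path
from torch.utils.data import Dataset, DataLoader

class ImageDataset(Dataset):
    def __init__(self, images_root: Path, transform):
        self.paths = sorted(images_root.glob("*.jpeg"))
        self.transform = transform

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        path = self.paths[idx]
        # return the preprocessed image together with its path for the metadata
        return self.transform(str(path)), str(path)

# loader = DataLoader(ImageDataset(images_root, transform), batch_size=64, num_workers=4)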
How to Store Embeddings in a Vector Database?
After we have obtained our embeddings, we load them into Deep Lake. You can learn more about Deep Lake in the Deep Lake docs.
To start, we need to define the vector database.
import deeplake

ds = deeplake.empty(
    path="hub://<YOUR_ACTIVELOOP_ORG_ID>/<DATASET_NAME>",
    runtime={"db_engine": True},
    token="<YOUR_TOKEN>",
    overwrite=True,  # recreate the dataset if it already exists
)
We are setting db_engine=True, meaning we won’t store the data on our disk, but we will use the managed Deep Lake database to store the data and run our queries. This comes in handy when developing applications where you need to have compute and data storage separation while keeping data where it matters to you. You can deploy the same setup entirely locally and not send your sensitive data anywhere it’s not supposed to be.
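For instance, a fully local dataset is just a different path and no managed runtime (a minimal sketch; queries would then run locally instead of through the tensor DB engine):

import deeplake

# store everything on the local filesystem (or any S3/GCS path you control)
ds = deeplake.empty(path="./lexica-local", overwrite=True)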
Next, we need to define the shape of the data.
import numpy as np

MB = 1024 * 1024  # helper constant used for the chunk size

with ds:
    ds.create_tensor(
        "metadata",
        htype="json",
        create_id_tensor=False,
        create_sample_info_tensor=False,
        create_shape_tensor=False,
        chunk_compression="lz4",
    )
    ds.create_tensor("images", htype="image", sample_compression="jpg")
    ds.create_tensor(
        "embeddings",
        htype="embedding",
        dtype=np.float32,
        create_id_tensor=False,
        create_sample_info_tensor=False,
        max_chunk_size=64 * MB,
        create_shape_tensor=True,
    )
Here we create three tensors: one to hold the metadata of each embedding, one to store the images (optional in our case, but cool to showcase), and one to store the actual embedding tensors. Storing the raw data next to the embeddings is where Deep Lake stands out from the crowd.
Then it’s time to add our data. If you recall, we stored the batched embeddings on disk as .pth files.
from functools import partial
from pathlib import Path

import deeplake
import torch
from torchvision.io import read_image
from tqdm import tqdm

def add_torch_embeddings(ds: deeplake.Dataset, embeddings_data_path: Path):
    embeddings_data = torch.load(embeddings_data_path)
    for embedding_data in embeddings_data:
        metadata = embedding_data["metadata"]
        embedding = embedding_data["embedding"].cpu().float().numpy()
        # read the image back from disk and convert it to HWC for the "images" tensor
        image = read_image(metadata["path"]).permute(1, 2, 0).numpy()
        # keep only the filename in the metadata
        metadata["path"] = Path(metadata["path"]).name
        ds.append({"embeddings": embedding, "metadata": metadata, "images": image})

embeddings_data_paths = embeddings_root.glob("*.pth")
list(
    tqdm(
        map(
            partial(add_torch_embeddings, ds),
            embeddings_data_paths,
        )
    )
)
Here we are just iterating over all the embedding files and adding every record within each one. We can have a look at the data from the Activeloop dashboard - spoiler alert: it is quite cool. You can also visualize the 3D embedding space (and pick your preferred clustering algorithm).
Cool!
To run a query on Deep Lake, we can do the following:
embedding = ...  # the query embedding from ImageBind
dataset_path = ...  # the path to our Activeloop dataset
limit = ...  # the number of results we want
query = f'select * from (select metadata, cosine_similarity(embeddings, ARRAY{embedding.tolist()}) as score from "{dataset_path}") order by score desc limit {limit}'
query_res = ds.query(query, runtime={"tensor_db": True})
# query_res = Dataset(path='hub://zuppif/lexica-6k', read_only=True, index=Index([(1331, 1551)]), tensors=['embeddings', 'images', 'metadata'])
We can access the metadata with:
query_res.metadata.data(aslist=True)["value"]
# [{'path': '5e3a7c9b-e890-4975-9342-4b6898fed2c6.jpeg'}, {'path': '7a961855-25af-4359-b869-5ae1cc8a4b95.jpeg'}]
If you remember, this is the metadata we stored previously, i.e., the image filenames. We wrapped all the vector-store-related code into a VectorStore class inside vector_store.py.
class VectorStore():
    ...
    def retrieve(self, embedding: torch.Tensor, limit: int = 15) -> Tuple[List[str], deeplake.Dataset]:
        query = f'select * from (select metadata, cosine_similarity(embeddings, ARRAY{embedding.tolist()}) as score from "{self.dataset_path}") order by score desc limit {limit}'
        query_res = self._ds.query(query, runtime={"tensor_db": True})
        # strip the file extension, keeping only the image id
        images = [
            el["path"].split(".")[0]
            for el in query_res.metadata.data(aslist=True)["value"]
        ]
        return images, query_res
So, since the model supports text, images, and audio, we can also create a utility function to make our life easier.
@torch.no_grad()
def get_embeddings(
    model: torch.nn.Module,
    texts: Optional[List[str]] = None,
    images: Optional[List[ImageLike]] = None,
    audio: Optional[List[str]] = None,
    dtype: torch.dtype = torch.float16,
) -> Dict[str, torch.Tensor]:
    inputs = {}
    if texts is not None:
        # texts are tokenized into integer ids by the transform
        inputs[ModalityType.TEXT] = load_and_transform_text(texts, device)
    if images is not None:
        inputs[ModalityType.VISION] = load_and_transform_vision_data(images, device, dtype)
    if audio is not None:
        inputs[ModalityType.AUDIO] = load_and_transform_audio_data(audio, device, dtype)
    embeddings = model(inputs)
    return embeddings
Always remember the torch.no_grad
decorator :) Next, we can easily do
vs = VectorStore(...)
vs.retrieve(get_embeddings(model, texts=["A Dog"]))
Query: "A Dog" → Results: a gallery of dog images from the dataset.
vs = VectorStore(...)
vs.retrieve(get_embeddings(model, images=["car.jpeg"]))
Query: an image of a car (car.jpeg) → Results: a gallery of similar car images.
Developing an AI Image Search App with Gradio
We’ll use Gradio to create a sleek UI for the app. First, we need to define the inputs and outputs of the app.
with gr.Blocks() as demo:
    # a little description
    with Path("docs/APP_README.md").open() as f:
        gr.Markdown(f.read())
    # text input
    text_query = gr.Text(label="Text")
    with gr.Row():
        # image input
        image_query = gr.Image(label="Image", type="pil")
        with gr.Column():
            # audio input
            audio_query = gr.Audio(label="Audio", source="microphone", type="filepath")
            search_button = gr.Button("Search", label="Search", variant="primary")
    # and a little section to change the settings
    with gr.Accordion("Settings", open=False):
        limit = gr.Slider(
            minimum=1,
            maximum=30,
            value=15,
            step=1,
            label="search limit",
            interactive=True,
        )
    # This will show the images
    gallery = gr.Gallery().style(columns=[3], object_fit="contain", height="auto")
This results in the following UI.
Then we need to link the search button to the actual search code.
...
search_button.click(
    search_button_handler, [text_query, image_query, audio_query, limit], [gallery]
)
This means text_query, image_query, audio_query, and limit are the inputs to search_button_handler, and gallery is the output, where search_button_handler is:
...
vs = VectorStore.from_env()
model = get_model()
...
def search_button_handler(
    text_query: Optional[str],
    image_query: Optional[Image.Image],
    audio_query: Optional[str],
    limit: int = 15,
):
    if not text_query and not image_query and not audio_query:
        logger.info("No inputs!")
        return
    # we have to pass a list for each query
    if text_query == "":
        text_query = None
    if text_query is not None:
        text_query = [text_query]
    if image_query is not None:
        image_query = [image_query]
    if audio_query is not None:
        audio_query = [audio_query]
    start = perf_counter()
    logger.info("Searching ...")
    embeddings = get_embeddings(model, text_query, image_query, audio_query).values()
    # if multiple inputs, we sum them
    embedding = torch.stack(list(embeddings), dim=0).sum(0).squeeze()
    logger.info(f"Model took {(perf_counter() - start) * 1000:.2f}ms")
    images_paths, query_res = vs.retrieve(embedding.cpu().float(), limit)
    return [f"{BUCKET_LINK}{image_path}" for image_path in images_paths]
So, for each input, we check whether it exists; if it does, we wrap it in a list, which is what our internal implementation expects. vs.retrieve is a method of VectorStore, a utility class that wraps all the vector-store code in one place. Inside the handler, we first compute the embeddings using the get_embeddings function shown before, and then we run a query against the vector DB. We have stored all the images in S3, so we return a list of links to the images there; this is the input to gr.Gallery.
And that’s it! Let’s see it in action.
Normal single modality works as expected.
If we receive more than one input, we sum the embeddings up. Basically:
embedding = torch.vstack(list(embeddings)).sum(0)
For example, we can pass an image of a car and an audio clip of an F1 race, or text + image. We can also do text + image + audio. Feel free to test it out!
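Under the hood, a combined query is just the sum of the individual embeddings before retrieval; here is a sketch reusing the helpers above (f1_race.wav is a hypothetical audio file):

embeddings = get_embeddings(model, texts=None, images=["car.jpeg"], audio=["f1_race.wav"])
# sum the image and audio embeddings into a single query vector
embedding = torch.vstack(list(embeddings.values())).sum(0)
images_paths, query_res = vs.retrieve(embedding.cpu().float(), limit=15)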
Some of the results were not that great, e.g., “cartoon” + a cat image.
In our experiments, we’ve noticed that text has a stronger influence than image and audio when modalities are combined.
The Future of Multimodal Search
In conclusion, the fusion of ImageBind and Deep Lake provides a robust framework for developing an AI-based multimodal search engine. By leveraging AI-generated images and text, audio, and visual inputs, we’ve shown how it’s possible to make strides toward more intuitive, efficient, and inclusive search experiences.
For machine learning engineers, this exploration opens up new avenues for creating more user-centric applications. At the same time, business executives can see the transformative impact of AI on a wide array of industries.
The future of search is here, and it’s multimodal. Now it’s your turn. Try the demo and share the results!
ImageBind & Multimodal Search FAQs
What Modalities are Supported by ImageBind?
ImageBind supports six different modalities - images, text, audio, depth, thermal, and IMU data. The modality encoders are based on a Transformer architecture, including the Vision Transformer (ViT) for images and videos. Audio is encoded by converting a 2-second audio sample into spectrograms, which are treated as 2D signals and encoded using a ViT. Thermal images and depth images are treated as one-channel images and also encoded using a ViT.
Is ImageBind Open Source?
ImageBind is an open-source project with a PyTorch implementation and pretrained models available. The model and accompanying weights are available for download and can be used to feed text, image, and audio data into ImageBind.
What Are ImageBind’s Applications?
ImageBind enables novel applications ‘out-of-the-box’ including cross-modal retrieval, composing modalities with arithmetic, cross-modal detection and generation. It has several potential use cases, including information retrieval, zero-shot classification, and connecting the output of ImageBind to other models. ImageBind could play a crucial role in developing autonomous vehicles, helping them to perceive and interpret their surroundings more effectively.
Who Developed ImageBind?
ImageBind has been developed and open-sourced by researchers at Meta.
What is multimodal AI?
Multimodal AI is an AI category that integrates various types, or modalities, of data to reach more precise conclusions, make insightful deductions, or provide more accurate real-world problem predictions. Multimodal AI platforms use and learn from a variety of data including video, audio, speech, images, text, and numerous traditional structured datasets.
What are the benefits of multimodal AI?
Multimodal AI generally surpasses single modal AI in many real-world situations. Through the combination of different data types, multimodal AI can produce more accurate, human-like responses, thereby enhancing its versatility and adaptability in varying scenarios. Industries like healthcare, finance, and retail could significantly benefit from multimodal AI due to its ability to provide precise and customized responses.
What are the challenges of multimodal AI?
Despite its potential and advantages, multimodal AI does have associated challenges, specifically related to data quality and interpretation for developers. Certain modalities may be excessively noisy, complicating the AI system’s learning process. The complexity of multimodal AI systems necessitates substantial computational resources. Lastly, for these systems to be trusted, they need to be explainable.
What is multimodal machine learning?
Multimodal machine learning is an evolving multidisciplinary research domain aimed at developing computer agents with intelligent capabilities to process and connect information from multiple modalities. There has been considerable progress in this emerging field over recent years.
What are the applications of multimodal AI?
Multimodal AI has extensive applications across various sectors including healthcare, finance, and retail. Multimodal conversational AI systems can answer queries, complete tasks, and mimic human conversations by comprehending and conveying information from multiple modalities. Complex recipe generation from images is another potential application for multimodal AI.
What is the difference between multimodal AI and single modal AI?
The key distinction between multimodal AI and conventional single modal AI lies in the data. Single modal AI is typically designed to handle a singular data source or type. For instance, a financial AI leverages business financial data, along with wider economic and industrial sector data, to conduct analyses, make financial predictions, or identify potential financial issues for the company. In other words, the single modal AI is specialized for a specific task.