Computer vision is one of the biggest challenges of Machine Learning. Humans are very good at distinguishing visual representations, but teaching this skill to a computer is no easy task.
For instance, we can easily tell the difference between a dog and a cat, or even between different breeds of dog. A chihuahua does not look like a golden retriever, yet it can be hard for a computer to learn to distinguish between these two breeds. We will leave out contemplating which one’s cuter (as it’s much, much harder to answer this question than generating image embeddings in Python using a pre-trained CNN and storing them in Activeloop Deep Lake).
In this article, we will thus study image embeddings: what they are, how they are generated, and why they are so useful in Computer Vision. Before we start though, if you’re reading this right now, chances are you’re considering training your own Large Language Model (LLM), fine-tuning it, or connecting an LLM to LangChain. If so, these resources may help:
- Training A CLiP model from scratch with Deep Lake: code example.
- Generative AI Data Infrastructure: How to Train Large Language Models (LLMs) with Deep Lake - a practical example showing high GPU utilization with Deep Lake + Lambda Labs.
- LangChain & GPT-4 for Code Understanding: Twitter Algorithm
- Ultimate Guide to LangChain & Deep Lake: Build ChatGPT to Answer Questions on Your Financial Data
- How we integrated GPT-4 into our product to create Text to SQL (or TQL - Tensor Query Language in our case)
What are image embeddings?
An image embedding is a lower-dimensional representation of the image. In other words, it is a dense vector representation of the image which can be used for many tasks such as classification.
A convolutional neural network (CNN) can be used to create the image embedding.
For instance, these deep learning representations are often used to build image search engines, since such engines rely on image similarity. Indeed, to find images of a given class (for example, dogs), we only need to find the embedding vectors closest to a dog image’s vector.
A good way to find those is by calculating the cosine similarity between the embeddings. Similar images will have a high cosine similarity between embeddings.
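As a minimal sketch of that idea (the vectors below are made up purely for illustration, not real CNN embeddings), cosine similarity can be computed with NumPy:
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy 4-dimensional "embeddings" (real CNN embeddings are much longer, e.g. 512-D)
dog_1 = [0.9, 0.1, 0.3, 0.7]
dog_2 = [0.8, 0.2, 0.4, 0.6]
cat_1 = [0.1, 0.9, 0.8, 0.2]

print(cosine_similarity(dog_1, dog_2))  # high similarity: visually similar images
print(cosine_similarity(dog_1, cat_1))  # lower similarity: dissimilar images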
Dog Breed Images Dataset from Kaggle
For this example, we will use one of my favorite datasets: the Kaggle Dog Breed Images 🐶
First, we need to download this dataset:
!export KAGGLE_USERNAME="xxxx" && export KAGGLE_KEY="xxxx" \
  && mkdir -p data && cd data \
  && kaggle datasets download -d eward96/dog-breed-images \
  && unzip -n dog-breed-images.zip && rm dog-breed-images.zip
Let’s see what is in this data folder:
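For example, one quick way to check from Python (assuming the dataset was unzipped into the data folder as above):
import os

# list the breed sub-folders created by the Kaggle download
print(sorted(os.listdir("data")))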
So here we have 10 different breeds of dog: *bernese_mountain_dog, chihuahua, dachshund, jack_russell, pug, border_collie, corgi, golden_retriever, labrador, siberian_husky*.
import glob
data_dir = 'data'
list_imgs = glob.glob(data_dir + "/**/*.jpg")
print(f"There are {len(list_imgs)} images in the dataset {data_dir}")
=> There are 918 images in the dataset data.
Here is an example of how to create a Deep Lake dataset from the dog breeds folder and store it in Deep Lake cloud.
To create the image dataset, we use the torchvision modules datasets and transforms, along with torch.utils.data.DataLoader:
from torchvision import datasets, transforms
import torch

# create dataloader with required transforms
tc = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor()
])

image_datasets = datasets.ImageFolder(data_dir, transform=tc)
dloader = torch.utils.data.DataLoader(image_datasets, batch_size=10, shuffle=False)

print(len(image_datasets))  # returns 918
We now have a resized and batched DataLoader, dloader, ready to be used.
NB: PyTorch’s default backend for images is Pillow, and when you use the ToTensor() class, PyTorch automatically scales all pixel values into [0, 1], so there is no need to normalize the images here.
If we want to visualize the first image in this dataset:
import numpy as np
import matplotlib.pyplot as plt

for img, label in dloader:
    # shape is (channels, height, width); move channels last for display
    img_np = img[0].detach().numpy().transpose(1, 2, 0)
    print(img_np.shape)  # (256, 256, 3)
    plt.imshow((img_np * 255).astype(np.uint8))
    plt.show()
    break
We can see that the image was resized to 256x256 and is normalized.
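If you want to double-check this programmatically, a small optional check (reusing the img batch from the loop above) confirms that ToTensor() rescaled the pixel values to [0, 1]:
# pixel values should lie in [0, 1] after ToTensor()
print(img[0].min().item(), img[0].max().item())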
Generate image embeddings from the Dog Breed Images Dataset
To generate the image embeddings, we will use a pre-trained model up to the last layer before classification, also called the penultimate layer.
The first layers of a CNN (Convolutional Neural Network) extract the features of the input image; the fully-connected layers then handle the classification, returning class scores that are passed through a softmax, for example, to determine which class has the highest probability.
In our case, we will use a pre-trained **ResNet-18** model.
It can easily be downloaded with torch.hub.load:
# fetch pretrained model
model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet18', pretrained=True)
Now we need to select the layer we want to extract features from. If we look at the architecture of ResNet-18 again:
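A quick way to inspect the architecture from code is to list the model’s top-level layers (printing the full model with print(model) also works):
# print the top-level layers of ResNet-18
for name, module in model.named_children():
    print(name, "->", module.__class__.__name__)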
We can see that the last layer is the fc (fully-connected) layer, where the features are classified. We want the features before the classification part of the CNN, so we take the output of the layer just before fc: the avgpool layer.
We can select this layer using the model object:
# Select the desired layer
layer = model._modules.get('avgpool')
Then, we use the register_forward_hook method to capture the embeddings:
outputs = []

def copy_embeddings(m, i, o):
    """Copy embeddings from the penultimate layer."""
    o = o[:, :, 0, 0].detach().numpy().tolist()
    outputs.append(o)

# attach the hook to the penultimate layer
_ = layer.register_forward_hook(copy_embeddings)
NB: The function copy_embeddings will be called every time after forward() has computed an output, and will save that output in the list outputs.
Then, we need to set the model to inference mode:
model.eval() # Inference mode
Let’s use this model to generate embeddings for our dog breed images:
# Generate image embeddings for all images in dloader and save
# them in the list outputs
for X, y in dloader:
    _ = model(X)

print(len(outputs))  # returns 92, i.e. one entry per batch
Since dloader is batched, we need to flatten the outputs:
# flatten list of embeddings to remove batches
list_embeddings = [item for sublist in outputs for item in sublist]
print(len(list_embeddings)) # returns 918
print(np.array(list_embeddings[0]).shape)  # returns (512,)
As expected, the length of the new flattened list list_embeddings is equal to 918 which is the number of images we have in this dog breed dataset. Plus, the shape of the first item in the list list_embeddings is (512,) which corresponds to the shape of the output of the avgpool layer.
Send images and image embeddings to Deep Lake
Once the embeddings of all images are generated, we do not need to generate them again and can use them directly to perform diverse tasks such as classification, as explained previously. This is one of the reasons why embeddings in computer vision are so popular as they are very easy to re-use once generated.
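As a quick, illustrative sketch of that re-use (not part of the original pipeline), the embeddings computed above could be fed to a simple scikit-learn classifier, with the breed labels taken from image_datasets.targets:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = np.array(list_embeddings)         # (918, 512) embedding matrix
y = np.array(image_datasets.targets)  # breed index for each image

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))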
Therefore, we will send our freshly generated embeddings and their images to Activeloop Deep Lake.
First, we need to log into our Activeloop account with this command:
!activeloop login -u username -p password
You can alternatively use a Deep Lake API token to authenticate. Then, we choose the name of the canine dataset we are about to create from the dog breed images dataset:
hub_dogs_path = "hub://margauxmforsythe/dogs_breeds_embeddings"
Now, we can send our doggie data into this dataset that will be easily accessible using the path “hub://margauxmforsythe/dogs_breeds_embeddings”. In this example, we use the “with” syntax for better performance (see more about it here):
import deeplake
import numpy as np
from tqdm import tqdm

with deeplake.empty(hub_dogs_path) as ds:
    # Create the tensors
    ds.create_tensor('images', htype='image', sample_compression='jpeg')
    ds.create_tensor('embeddings')

    # Add arbitrary metadata - Optional
    ds.info.update(description='Dog breeds embeddings dataset')
    ds.images.info.update(camera_type='SLR')

    # Iterate through the images and their corresponding embeddings,
    # and append them to the Deep Lake dataset
    for i in tqdm(range(len(image_datasets))):
        img = image_datasets[i][0].detach().numpy().transpose(1, 2, 0)
        img = img * 255  # images are normalized to [0, 1]
        img = img.astype(np.uint8)

        # Append to Deep Lake Dataset
        ds.images.append(img)
        ds.embeddings.append(list_embeddings[i])
Our dog breed embeddings dataset is now available in Hub. Paw-some! This means we can load these images and their embeddings easily with this line:
ds_from_hub = deeplake.dataset(hub_dogs_path)
Let’s visualize some of the images and their embeddings:
def show_image_in_ds(ds, idx=1):
    image = ds.images[idx].numpy()
    embedding = ds.embeddings[idx].numpy()
    print("Image:")
    print(image.shape)
    plt.imshow(image)
    plt.show()
    print(embedding[0:10])  # show only the first 10 values of the image embedding

for i in range(4):
    show_image_in_ds(ds_from_hub, i)
Alternatively, you can visualize the dataset by calling the following function:
ds_from_hub.visualize()
We can now easily get an image and its embedding from our Hub dataset, and start finding similar images using the similarities between embeddings! On a side note, those doggos are so beautiful, they could’ve easily been on the cover of… Vanity Fur.
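As a final sketch of that idea (the helper below is illustrative, not part of the notebook), here is one way to find the images most similar to a query image using the embeddings stored in the Deep Lake dataset:
import numpy as np

# load all embeddings back into a (918, 512) matrix
embeddings = ds_from_hub.embeddings.numpy()

def most_similar(query_idx, top_k=5):
    """Return indices of the top_k images most similar to the query image."""
    query = embeddings[query_idx]
    # cosine similarity between the query embedding and every embedding
    sims = embeddings @ query / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query) + 1e-12)
    sims[query_idx] = -np.inf  # exclude the query image itself
    return np.argsort(sims)[::-1][:top_k]

print(most_similar(0))  # indices of the 5 images closest to image 0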
Embeddings are routinely used across industries such as AgriTech, Autonomous Vehicles & Robotics, and Audio Processing & Enhancement.
Here is the link to the notebook with all the steps demonstrated in this article. If you have more questions about the notebook, feel free to ask in the #community channel of team Activeloop’s Slack.