This article was motivated by the scarcity of comprehensive examples of data augmentation with TensorFlow, especially for object detection tasks, where the bounding box positions make the augmentation pipeline more complex.
What is Object Detection?
Object detection is a common supervised learning technique in computer vision that locates objects in an image. It involves fitting a bounding box around the object and estimating its class label.
Slightly simplified, an object detection model is both a regressor and a classifier: the regressor head predicts the bounding box coordinates of an object, and the classifier head predicts the object’s class.
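To make that split concrete, here is a minimal, hypothetical Keras sketch with one classification head and one box-regression head that predicts a single box per image. The layer sizes, loss choices, and single-box assumption are illustrative only; real detectors such as SSD or YOLO predict many boxes per image.

import tensorflow as tf

num_classes = 80  # illustrative; matches COCO's number of categories

# Shared convolutional backbone (any feature extractor would do)
inputs = tf.keras.Input(shape=(244, 244, 3))
x = tf.keras.layers.Conv2D(32, 3, strides=2, activation="relu")(inputs)
x = tf.keras.layers.Conv2D(64, 3, strides=2, activation="relu")(x)
x = tf.keras.layers.GlobalAveragePooling2D()(x)

# Classifier head: which class the object belongs to
class_out = tf.keras.layers.Dense(num_classes, activation="softmax", name="class")(x)
# Regressor head: where the box is, e.g. (x_min, y_min, width, height)
box_out = tf.keras.layers.Dense(4, name="box")(x)

model = tf.keras.Model(inputs, [class_out, box_out])
model.compile(
    optimizer="adam",
    loss={"class": "sparse_categorical_crossentropy", "box": "mse"},
)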
What is Image Augmentation?
Image Augmentation is a Machine Learning technique that creates robust models and reduces overfitting by training them on modified copies of existing data.
This technique is especially useful in cases where the training data is scarce. It increases the variation in the existing images and thus technically creates new training instances, which are slight variations of the original training images.
What We Will Do
In this blog post, we dive into a case study from a recent project centered on developing a compact object detection model. The project highlights a common problem deep learning practitioners face: being bound to a specific training framework while striving to leverage the most powerful modern tools. We chose TensorFlow because our camera’s accelerator hardware was specifically designed to support TensorFlow Lite, making TensorFlow the optimal choice for avoiding the complexities and potential pitfalls of converting between frameworks. Our exploration demonstrates not only how TensorFlow can be used effectively for real-time object detection, but also how Deep Lake can unify ML data management and connect your data to any framework of choice, whether that is PyTorch, TensorFlow, or MMDetection.
What is Albumentations?
Albumentations is a fast and flexible library for image augmentation and transformation that handles bounding boxes and image segments.
Why Should You Use Albumentations in ML?
Albumentations offers efficient image preprocessing and advanced augmentation capabilities. It efficiently resizes and pads images to specific dimensions, maintaining proportions without losing important visual information. Additionally, its diverse range of augmentation techniques, such as affine transformations, blurring, and random cropping, enhances the variability of a limited dataset, improving the robustness and performance of computer vision models.
In our case, the use of the Albumentations library had three purposes:
Preprocessing: Due to the limitations of our tiny model, we needed to resize the images from a filtered COCO dataset (filtered before training an initial model) to 244x244 pixels in resolution. We opted to use padding to keep the images’ proportions. Image normalization (of the intensity values) was also a part of the preprocessing.
Augmentation: Due to the small dataset size we had after filtering the COCO dataset, we needed to use data augmentation to get a more diverse training dataset and improve the model’s robustness. Besides the basic rotation, blur, and flip, we also used affine transformations, padding, and resizing to achieve a zoom effect and to randomly position the image on the 244x244 canvas.
Improvement in Model Performance: The two steps above help standardize the input data and introduce variability, which is crucial for training robust and high-performing machine learning models, no matter their size. This leads to better generalization to unseen data and more accurate object detection.
What Datasets Can You Use for Object Detection?
Several well-known benchmark datasets are commonly used in the ML community for object detection. The COCO (Common Objects in Context) dataset is highly popular for its diversity and large number of images across multiple categories. The Pascal VOC 2012 dataset is another choice, known for its annotated images, which allow both object detection and image segmentation tasks. Additionally, the ImageNet dataset, although primarily used for image classification, contains bounding box annotations for hundreds of object categories, making it also suitable for object detection.
All the datasets above are available for streaming with one line of code from Deep Lake. With Deep Lake, you can stream, query, version control, and visualize any data. We will create a preprocessing pipeline and train a model using TensorFlow without the need to download the whole dataset locally.
This article will go through the data filtering, preprocessing, and augmentation process for object detection datasets using Albumentations, TensorFlow, and a Deep Lake dataset.
The High-Level Overview
The steps we will perform in this article are:
- Loading and filtering a Deep Lake dataset
- Converting the dataset into a TensorFlow-compatible format
- Defining desired augmentations
- Creating the augmentation pipeline
- Visualizing the results
To keep the article focused, we will not train a model. Instead, we will end the article with a dataloader implementation that you can use in your projects.
Designing a Data Pipeline with Deep Lake and TensorFlow
We decided to combine the Deep Lake and TensorFlow data pipelines, taking advantage of each implementation’s benefits. We filter samples in the Deep Lake data loader and then convert them to a TensorFlow dataset, where we apply the image and annotation transformations.
A similar workflow for applying the transformations could also have been done using the Deep Lake C++ dataloader, which builds the whole preprocessing and data loading pipeline using only Deep Lake.
We can filter with Deep Lake using metadata to exclude unnecessary image data, such as irrelevant or grayscale images, from the COCO dataset. With Deep Lake, we won’t even need to duplicate this filtered view; we can just reference it later with a specific ID.
We will transform the image data using the TensorFlow dataset functionality so we can transform on the fly and apply the augmentations differently at each epoch. At this step, the image data will be loaded, and the transformation functions will be called for each image (or for a batch of images) each time the image is accessed. This is well suited for the augmentations since it will ensure that the randomized augmentations are different in every iteration of the dataset, increasing the data’s variety.
Getting started
Let’s start by installing the libraries and importing the modules we need.
!pip install deeplake -qq
!pip install albumentations -qq
!pip install tensorflow -qq
import deeplake
import cv2 as cv
import numpy as np
import tensorflow as tf
import albumentations as A
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from functools import partial
import matplotlib.colors as mcolors
import random
Log in to the Deep Lake Account
Even without an account, you can access datasets from Activeloop. However, this article will explore advanced features such as querying datasets using the Tensor Query Language (TQL) and creating your own private cloud dataset—capabilities requiring logging in (sign-up is free!).
Once you have registered and logged in via the browser, you can create an Activeloop token by clicking your avatar and then API tokens. Use this token to log in to your account so that the datasets saved under it are accessible to you.
import getpass
import os
os.environ['ACTIVELOOP_TOKEN'] = getpass.getpass('Activeloop token:')
Loading a Deep Lake Dataset
We will use the COCO dataset for this example. The COCO dataset is a large-scale object detection dataset created by Microsoft, and it is widely used in training and evaluating Computer Vision models. The COCO dataset hosted by the Deep Lake community has 80 categories and 118,286 images. We will use the dataset’s images, categories, and bounding boxes.
The first step is to load a dataset from Deep Lake Cloud. Loading the dataset means downloading the metadata so that the Python object is aware of the content in the dataset, but each image will be lazily downloaded when we access it. This keeps performance good and memory usage low, and it also makes prototyping and experimentation much easier.
ds = deeplake.load('hub://activeloop/coco-train')
To get an initial understanding of the dataset, we can interactively explore it using the visualization method.
ds.visualize()
Here is an image of how the dataset is displayed via the Deep Lake visualizer:
Querying the COCO Dataset
In our project, we wanted to train an object detector model to find persons and chairs. We are only interested in the images containing these two classes.
One of the powers of Deep Lake is the ability to use the Tensor Query Language to filter the dataset using the metadata (think SQL for images). Getting images with these two classes is thus easy.
chair_query = "SELECT * WHERE CONTAINS(categories, 'chair')"
person_query = "SELECT * WHERE CONTAINS(categories, 'person')"
Since training a model with an unbalanced dataset is hard, we will also subsample the dataset to get a new dataset view with the two classes balanced (i.e., an equal number of samples in both classes). We need to know the number of samples in the two classes to do this. We can do this by executing the query to get a new dataset view and then checking the length of this new view.
num_chairs = len(ds.query(chair_query))
num_persons = len(ds.query(person_query))

class_size = min(num_chairs, num_persons)

print(f"There are {num_chairs} chairs and {num_persons} persons in the dataset")
Now, let’s use the queries to get a new dataset view containing only images where chairs and/or persons are present. To achieve a random distribution of person and chair samples in the dataset, we use the RANDOM() function in the ORDER BY TQL expression.
The dataset length should be 25548 (class_size*2), but we get 23911 samples instead. This is because 1637 samples were in both query results, and the UNION operation added them only once to the dataset.
new_ds = ds.query(f"({chair_query} LIMIT {class_size}) UNION ({person_query} LIMIT {class_size}) ORDER BY RANDOM()")

print(f"Length of dataset: {len(new_ds)}")
For this example, we create a smaller dataset using array slicing. This will create a view of the dataset that only contains the first one hundred samples of the original dataset.
new_ds = new_ds[:100]
We now have a view of the dataset, which means that our Python object has a list of indexes in the original dataset that should be accessed when iterating the view. We could use this view directly to train our model, but we prefer to save the view as a new dataset on our own account. This will create a persistent version of our smaller dataset, which we could, if we wanted, version control with the advanced versioning and branching features of Deep Lake.
Filtering Out Grayscale Images in Microsoft COCO with Deep Lake
Some images in COCO are grayscale, while most are in color. We want to remove the grayscale images.
Another way that Deep Lake allows you to filter dataset views is with the .filter() method. Compared to filtering later in the data pipeline, this is very efficient since the filtering is done on the dataset’s metadata layer before even downloading the image’s actual content.
The method allows you to apply a filter function on each sample in the dataset; if the condition is met, the sample is included in the returned dataset view. Otherwise, it is ignored. This allows us to easily create custom filtering logic to decide which images we want to use in our training without modifying the stored dataset.
def is_color(instance):
    # Keep only samples whose images have more than one channel, i.e. color images
    return instance.images.shape[-1] > 1

print(f'Length before filtering: {len(new_ds)}')
new_ds = new_ds.filter(is_color, progressbar=False)
print(f'Length after filtering: {len(new_ds)}\n')
Now we can save the filtered view as a new dataset under our own account; replace the placeholder with your Activeloop username or organization name.

username = "<USERNAME_OR_ORG>"

new_ds.copy(f"hub://{username}/obj_det_article2")
To visualize the dataset, we load the new dataset we just saved.

ds = deeplake.load(f"hub://{username}/obj_det_article2")

ds.visualize()
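Since the copy is a full dataset rather than a view, it also supports the versioning and branching features mentioned earlier. As a minimal sketch, assuming Deep Lake’s commit/checkout API (check the documentation for the exact calls in your version), creating a commit and an experimental branch could look like this:

# Hedged sketch: assumes Deep Lake's dataset versioning API (ds.commit / ds.checkout)
commit_id = ds.commit("Balanced person/chair subset with grayscale images removed")
print(f"Created commit {commit_id}")

# Branch off so experiments do not touch the main history
ds.checkout("augmentation-experiments", create=True)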
Collecting Dataset Information
Now that we have our dataset, we should gather some information we will need later in the pipeline.
The max_objects variable stores the number of objects in the sample with the most objects. We use this number later to create output tensors of uniform shape. If you plan to use this augmentation code to train your own model, you might need to adjust this logic according to your needs. We will discuss this topic more when we use this variable.
# Number of unique classes
num_classes = len(ds.categories.info.class_names)
# Names of the categories
classes = ds.categories.info.class_names
# Max objects on any image
max_objects = max([len(e) for e in ds.categories.numpy(aslist=True)])
# Setting the batch size
batch_size = 32
Conversion from Deep Lake to TensorFlow
Now, we are ready to transform the dataset into a TensorFlow-compatible dataset. The COCO dataset has many tensors with information, but we only need the tensors for images, categories, and boxes. When calling the .tensorflow() method, we can specify the names of the tensors that the dataset should generate. As we will see later, these tensors will be included in a dict object that the TensorFlow dataset will generate for each sample image.
Creating a TensorFlow adaptor for the Deep Lake dataset is quickly done in one line:
tf_ds = ds.tensorflow(tensors=["images", "categories", "boxes"])
Image Transformations with Albumentations
The next step is to define the transformations we want to apply to our dataset using the Albumentations library. We define:
- preprocessing: resizing, normalizing
- augmentations: random rotation, blur, flip, etc.
In a real project, we would use the preprocessing pipeline on all the datasets, while the augmentation pipeline would only be used on the training dataset. This way, the model learns from a diverse dataset but is evaluated and tested on non-augmented data, which is more similar to the real-world data on which we want to know the model’s accuracy.
Defining the Preprocessing with Albumentations
The LongestMaxSize and PadIfNeeded methods in Albumentations helped us to rescale the images into a square format while keeping the original proportions of the image content. The resulting unused space was filled with black. Image normalization is also done in this step.
Image Normalization with Albumentations
The images are normalized in this step using the A.Normalize() method. A.Normalize uses the following formula: img = (img - mean * max_pixel_value) / (std * max_pixel_value). We have several choices on how to normalize the images:
- Scaling between [0, 1]: A.Normalize(mean=0.0, std=1.0, max_pixel_value=255), which is essentially equal to image/255
- Scaling between [-1, 1]: A.Normalize(mean=0.5, std=0.5, max_pixel_value=255), which is essentially equal to (image - 127.5) / 127.5
- Scaling using ImageNet statistics: By not specifying the parameters, the mean intensity and standard deviation for the images in the ImageNet dataset will be used. This would result in roughly unit variance and a mean of zero.
We choose to scale the images between -1 and 1 because zero-centered normalization is often the recommended method when feeding samples into a model.
NOTE: when displaying the images later on, we need to rescale the pixel values back to the [0, 1] range, as matplotlib.pyplot.imshow() expects float values in [0, 1] or integers in [0, 255] for RGB images.
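As a quick numeric sanity check (a small sketch, not part of the pipeline), this is what the [-1, 1] scaling does to the extreme pixel values, and how to map the result back to [0, 1] for display; the 127.5 constant follows directly from mean=0.5, std=0.5, max_pixel_value=255:

import numpy as np

# A.Normalize(mean=0.5, std=0.5, max_pixel_value=255) computes:
# img = (img - 0.5 * 255) / (0.5 * 255) = (img - 127.5) / 127.5
pixels = np.array([0.0, 127.5, 255.0], dtype=np.float32)
normalized = (pixels - 127.5) / 127.5
print(normalized)      # [-1.  0.  1.]

# Rescale back to [0, 1] before calling plt.imshow()
for_display = (normalized + 1) / 2
print(for_display)     # [0.  0.5 1. ]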
Instead of handing the arguments directly to the augmentation methods, we choose to store all the augmentation-related parameters in a dictionary.
This makes it easy to log the parameters used into your experiment tracking tool, such as the Weights & Biases platform in our case. Doing so allows us to see all the differences between two training runs when we compare the accuracy of the resulting models.
preprocess_dict = {
    "A.LongestMaxSize": {
        "max_size": 244,
        "p": 1
    },
    "A.PadIfNeeded": {
        "min_height": 244,
        "min_width": 244,
        "position": 'center',
        "border_mode": cv.BORDER_CONSTANT,
        "value": 0,
        "p": 1
    },
    "A.Normalize": {
        "mean": 0.5,
        "std": 0.5,
        "max_pixel_value": 255,
        "p": 1
    }
}
Composing Preprocessing Transformations with Albumentations
Now that we have defined our transformations for the preprocessing step as a dictionary, we can compose the preprocess transformations.
preprocess_transform = A.Compose(
    [
        A.LongestMaxSize(**preprocess_dict["A.LongestMaxSize"]),
        A.PadIfNeeded(**preprocess_dict["A.PadIfNeeded"]),
        A.Normalize(**preprocess_dict["A.Normalize"]),
    ]
)
The position argument to the A.PadIfNeeded method decides where the padding is added (or rather, where the inserted image is placed).
For the training dataset, we used random for the position argument, which causes the placement of the image to vary between iterations and adds more diversity to the dataset.
For the validation and test datasets, the center or top_left option was used to make the preprocessing deterministic when testing and evaluating models.
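For reference, here is a small sketch of how the padding step could be defined for each split, mirroring the parameters in preprocess_dict above; the random variant is how we used it for our training data:

import cv2 as cv
import albumentations as A

# Training: random placement of the image on the 244x244 canvas,
# so the position varies between iterations
train_pad = A.PadIfNeeded(
    min_height=244,
    min_width=244,
    position="random",
    border_mode=cv.BORDER_CONSTANT,
    value=0,
    p=1,
)

# Validation/test: deterministic placement
eval_pad = A.PadIfNeeded(
    min_height=244,
    min_width=244,
    position="center",  # or "top_left"
    border_mode=cv.BORDER_CONSTANT,
    value=0,
    p=1,
)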
Defining the Augmentations in Albumentations
In this step, we define the augmentations we want to apply to the training dataset. This is done in a similar fashion to defining the preprocessing transformation steps.
We are using 3 different augmentations:
- Rotate
- HorizontalFlip
- Blur
augmentation_dict = {
    "A.Rotate": {
        "limit": 15,
        "border_mode": cv.BORDER_CONSTANT,
        "p": 0.9
    },
    "A.HorizontalFlip": {
        "p": 0.5
    },
    "A.Blur": {
        "blur_limit": 3,
        "p": 0.9
    }
}
Composing augmentation transformations
In the previous step, we only defined a dictionary containing all the transformation parameters; now, we will compose the transformations for the training dataset.
augmentation_transform = A.Compose(
    [
        A.Rotate(**augmentation_dict["A.Rotate"]),
        A.HorizontalFlip(**augmentation_dict["A.HorizontalFlip"]),
        A.Blur(**augmentation_dict["A.Blur"]),
    ]
)
Putting all transformations together with Albumentations
The A.Compose class can accept parameters for bounding box transformation. If we include them, the bounding boxes will “follow” all the image transformations: if the image is shifted or rotated, the bounding box coordinates are transformed as well. More specifically, the resulting bounding boxes are not rotated; they are resized to enclose the rotated box while staying axis-aligned. If an object ends up outside the visible image, its annotation should be dropped; therefore, we must also forward the labels to the augmentation flow.
A.BboxParams specifies settings for working with bounding boxes.
We use the COCO format, which expects the bounding box position and size as (x_min, y_min, width, height); for example, a box whose top-left corner is at (10, 20) and that is 50 pixels wide and 80 pixels tall is encoded as [10, 20, 50, 80].
The label_fields list the per-box annotation fields (here, the class labels and the box indices) that should follow the bounding boxes through the transformation.
The minimum pixel area for a bounding box after transformation is 2 pixels; otherwise, the object category and bounding box are dropped. This can, for example, happen when zooming out on an image.
If less than 60% of the object’s original area remains visible after the transformation, the object is dropped; this can happen, for example, when part of the image is shifted outside the canvas.
Now that we have defined the augmentations we want to use, we can build the augmentation transformation pipeline.
tform = A.Compose(
    [preprocess_transform, augmentation_transform],
    bbox_params=A.BboxParams(
        format="coco",
        label_fields=["class_labels", "bbox_ids"],
        min_area=2,
        min_visibility=0.6,
    ),
)
Defining the helper functions
So far, we have created a TensorFlow dataset and an Albumentations transform object. The next step is to create the transformation functions that can be called for each sample in the TensorFlow dataset.
Unpacking the dict
As noted before, each sample that the TensorFlow dataset adaptor generates from the Deep Lake dataset will be a dict with the different tensor names we specified as keys when converting it.
This simple function can unpack them into a tuple of 3 tensors.
The same functionality could be implemented directly into the next function, but we preferred to create small functions that do a specific transformation. This is also an illustrative example of defining a step in a data transformation pipeline.
def _unpack_dict(data_dict):
    return data_dict["images"], data_dict["categories"], data_dict["boxes"]
Augmentation function
The real magic of the augmentations happens here.
The inputs to the _augment_data function are three tf.Tensor objects passed by the map method: images, categories, and boxes (as returned by the _unpack_dict function).
The Albumentations methods expect NumPy arrays, not tf.Tensor objects, so we use tf.numpy_function to wrap our Python function into something that works as a TensorFlow mapping function. The wrapper converts the tf.Tensor arguments into NumPy arrays and converts the returned NumPy arrays back into tf.Tensor objects.
The catch is that the image tensors lose their shape information after passing through tf.numpy_function, so we need to restore the tensor shapes with the .set_shape() method.
Creating uniformly sized tensors
The COCO dataset has 80 categories in total, and most images contain more than one object, so the shape of the annotation tensors varies from sample to sample. Earlier, we calculated the maximum number of objects in any image in the dataset (max_objects). We will now use that to create fixed-size output tensors. You might want to do this differently depending on your model architecture.
Our TensorFlow model only works with uniformly shaped tensors, so we must solve this issue. Our solution:
- categories tensor: each object’s category ID is kept as an integer, and the tensor is padded to a fixed length, giving shape (max_objects,). If your model expects one-hot encoded labels (with a depth of 80 for COCO), you can add that encoding downstream.
- boxes tensor: each box has four values; therefore, the boxes tensor has shape (max_objects, 4)
The unused rows (i.e., in images with fewer than max_objects objects) are filled with zeros.
Creating a single function for augmentation
aug_fn takes the NumPy arrays handed over by tf.numpy_function and applies the augmentations.
The padded_ variables ensure that we create equal-sized tensors for the categories and bounding boxes: the samples fill the arrays from the beginning, and the leftover slots are filled with zeros.
The preprocessor will be the tform instance of Albumentations’ Compose class that we created earlier. It contains all the augmentation blueprints and applies them.
def _augment_data(
    images,
    categories,
    boxes,
    new_height,
    new_width,
    num_classes,
    classes,
    preprocessor,
):
    def aug_fn(images, categories, boxes):
        aug_data = preprocessor(
            image=images.astype(np.float32),
            bboxes=boxes,
            bbox_ids=np.arange(boxes.shape[0]),
            class_labels=categories,
        )

        images = np.array(aug_data["image"], dtype=np.float32)

        # Pad the categories into a uniformly sized array
        categories = np.array(aug_data["class_labels"], dtype=np.int64)
        padded_categories = np.zeros(max_objects, dtype=np.int64)
        padded_categories[:len(categories)] = categories

        # Pad the boxes into a uniformly sized array
        boxes = np.array(aug_data["bboxes"], dtype=np.float32)
        padded_boxes = np.zeros([max_objects, 4], dtype=np.float32)
        padded_boxes[:len(boxes)] = boxes

        return images, padded_categories, padded_boxes

    # Applying the transformations
    aug_img, aug_categories, aug_boxes = tf.numpy_function(
        func=aug_fn,
        inp=[images, categories, boxes],
        Tout=[tf.float32, tf.int64, tf.float32],
    )
    # Restore the tensors' shape info
    aug_img.set_shape((new_width, new_height, 3))
    aug_categories.set_shape(max_objects)
    aug_boxes.set_shape((max_objects, 4))

    return aug_img, (aug_categories, aug_boxes)
TensorFlow Dataset Transformation Pipeline
We have now created all the transformation functions needed to transform our data samples. The next step is to use the TensorFlow dataset’s .map() method, which calls the function for each sample on the fly and forwards the returned tensors downstream. We use Python’s functools.partial to create a new function with the extra arguments of the augmentation transform function bound to the correct values.
tf_ds_unpacked = tf_ds.map(_unpack_dict)
tfds_transformed = tf_ds_unpacked.map(
    partial(
        _augment_data,
        new_height=244,
        new_width=244,
        num_classes=num_classes,
        classes=classes,
        preprocessor=tform,
    )
)
Next, we will set some dataset properties, such as shuffling, repetition, and batching.
We are using the .shuffle() method, which will fill up a buffer with the size five times the batch size. This buffer is then shuffled. When streaming the data, we take one sample at a time from this buffer until the batch size is reached. Meanwhile, new samples are streamed from Activeloop to refill the shuffling buffer. This results in a random order of samples within a few batches but does not shuffle the dataset over its full size. A global shuffling was, however, done before saving the filtered Deep Lake dataset to Activeloop. The main difference is that the global shuffling is the same over all epochs, while the shuffle buffer is randomly shuffled each epoch.
The .repeat() is called to ensure that the generator is recreated once it runs out of images. Note that repeat does not cache the images and replay the same samples; instead, it iterates over the dataset again once it is exhausted. This way, the shuffle order and the augmentations will differ in each epoch.
Finally, we call .batch() to create tensors with batches of images.
tfds_transformed = tfds_transformed.shuffle(buffer_size=batch_size*5)
tfds_transformed = tfds_transformed.repeat()
tfds_transformed = tfds_transformed.batch(batch_size)
Inspecting the results of the augmentation
We can now visualize the TensorFlow dataset to see how the augmentations look. We will do this using matplotlib this time (a better option would be to upload the generated images and bounding boxes to a new dataset in Deep Lake for visualization).
named_colors = dict(mcolors.CSS4_COLORS)
all_colors = list(named_colors.keys())
random.shuffle(all_colors)
def display_tfds_images(tfds, nbr_images, title=None, value_range=(-1, 1)):
    assert nbr_images % 4 == 0, "In this example we expect that the number of requested images is divisible by 4"
    plt.figure(figsize=(30, 15))
    for i, item in enumerate(tfds.unbatch().take(nbr_images)):
        ax = plt.subplot(nbr_images // 4, 4, i + 1)
        image, (labels, boxes) = item

        # Rescale image values to [0, 1] since matplotlib expects that,
        # then plot it.
        rescaled_image = (image - value_range[0]) / (value_range[1] - value_range[0])
        ax.imshow(rescaled_image)
        # Draw bounding boxes (skip rows that are only zeros, i.e. padding)
        for c, b in zip(labels, boxes):
            if np.all(b == 0):
                continue
            color = all_colors[c]
            rect = patches.Rectangle((b[0], b[1]), b[2], b[3], linewidth=2, edgecolor=color, facecolor="none")
            ax.add_patch(rect)
            ax.text(b[0], b[1], ds.categories.info.class_names[c], color=color, fontsize="xx-large")
        ax.axis("off")

    plt.axis("off")
    plt.show()
display_tfds_images(tfds_transformed, 8)
The images show that transformations such as rotation and padding have been applied, and we can see that the bounding boxes’ positions are still correct.
Beyond the Tutorial: What to Do Next?
As a next exercise, you could investigate what happens if you filter the dataset to be so small that you can visualize more images than there are in the dataset (as mentioned in the .shuffle() and .repeat() methods).
- Do they repeat?
- Are they in a different order?
- Are the augmentations different?
Create a notebook and experiment with the different parts to ensure you fully understand every step. Once you understand this tutorial in detail, it will be much easier to implement your new projects.
Regardless of the task at hand, here are some nuggets of wisdom on the types of object detection algorithms and some of the most widely used object detection models.
What are the most popular object detection models?
Some of the most popular object detection models for computer vision tasks include:
- YOLO (You Only Look Once): The YOLO object detection model is a one-stage model known for its speed and real-time performance. It is hands down the most popular, in our opinion. Recent versions, such as YOLOv7, offer improved accuracy and efficiency.
- SSD (Single Shot MultiBox Detector): Another efficient and accurate one-stage model that detects objects using a set of default boxes over various aspect ratios and scales.
- R-CNN Family: Includes deep learning algorithms like Faster R-CNN (faster than the original R-CNN and Fast R-CNN), Mask R-CNN, and Cascade R-CNN. These two-stage detectors focus on high accuracy and are suitable for applications where precision is more critical than speed.
- RetinaNet: A one-stage detector that employs a feature pyramid network and a focal loss function to address class imbalance issues.
- EfficientDet: Known for its optimal balance between accuracy and computational efficiency among deep learning algorithms, it’s a state-of-the-art choice in object detection.
- CenterNet: Distinguishes itself by predicting the centers of objects rather than using bounding boxes, offering a unique approach to object detection.
These models are widely recognized for their effectiveness and are commonly used across various applications in the field.
One-stage vs Two-stage Object Detectors in Deep Learning
One-stage and two-stage object detectors represent two distinct approaches in computer vision, each with its trade-offs between speed and accuracy. Here’s a quick summary of both.
One-stage detectors
- Approach object detection as a direct regression task, predicting class probabilities and bounding box coordinates in a single network pass.
- Examples include YOLO and SSD.
- Typically faster, offering real-time performance, but usually less accurate than two-stage models.
Two-stage detectors
- In the first stage, they generate region proposals (regions of interest), which they then classify and refine in the second stage.
- Examples include Faster R-CNN and Mask R-CNN.
- Provide higher accuracy by methodically focusing on potential objects before classification but are slower due to the two-step process.
Summary: Using Deep Lake, Albumentations, and TensorFlow for More Accurate Object Detection with Data Augmentation
In this article, we explored integrating the Deep Lake, Albumentations, and TensorFlow libraries to enhance object detection tasks with data augmentation.
Among other things, we took a look at:
- Querying the database with the .query() method using Activeloop’s TQL language to return samples with relevant classes to our use case.
- Filtering data samples at the metadata level using the Deep Lake .filter() method, based on sample properties such as being grayscale or containing a certain class only once.
- Creating a new dataset from an existing subset of a Deep Lake dataset using the .copy() method.
- Augmenting image samples with dynamic bounding box updates using Albumentations in the TensorFlow ecosystem to improve model performance.
The article detailed the process of using the COCO dataset from Activeloop, leveraging Deep Lake for efficient dataset filtering at the metadata level, and implementing composed transformations with Albumentations for both image preprocessing and data augmentation. Doing this gives us significantly more variation in the datasets, so we can train larger models to better accuracy using smaller datasets.
We have seen how Deep Lake image pipelines differ from those in TensorFlow datasets. One of the real powers of Deep Lake is that we don’t need to store all the data locally; instead, we stream each sample as we use it. With Deep Lake filtering, implemented using either Python functions or the powerful Tensor Query Language, we can skip images before our data pipeline even downloads them. With the Activeloop visualizations, we can show our dataset in an interactive view with a single line of code. Even so, we have only touched on the real power of Activeloop and Deep Lake; we leave it to you to explore the data versioning, the branches, and much more.
We also demonstrated how TensorFlow is used for image and annotation transformations, focusing on practical implementation without delving into model training. The next step for you should be to adapt the format of the images and the labels to suit your model; then, you can get started training your models on your dataset with TensorFlow.
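As a hedged sketch of that next step, here is how the dataloader could be plugged into Keras training. The model below is a throwaway placeholder whose two outputs merely match the (categories, boxes) label structure produced by tfds_transformed; it is not the compact model from our project, and it does not mask the zero-padded rows. Note the explicit steps_per_epoch, which is required because .repeat() makes the dataset infinite.

# Placeholder model: outputs shaped (max_objects, num_classes) and (max_objects, 4)
# to match the labels generated by the pipeline above.
inputs = tf.keras.Input(shape=(244, 244, 3))
x = tf.keras.layers.Conv2D(16, 3, strides=4, activation="relu")(inputs)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
class_out = tf.keras.layers.Reshape((max_objects, num_classes))(
    tf.keras.layers.Dense(max_objects * num_classes)(x)
)
box_out = tf.keras.layers.Reshape((max_objects, 4))(
    tf.keras.layers.Dense(max_objects * 4)(x)
)
model = tf.keras.Model(inputs, [class_out, box_out])
model.compile(
    optimizer="adam",
    loss=[
        tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        "mse",
    ],
)

# .repeat() makes the dataset infinite, so Keras needs to know when an epoch ends
model.fit(
    tfds_transformed,
    epochs=2,
    steps_per_epoch=len(ds) // batch_size,
)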
Object Detection and Image Augmentation FAQs
What is Machine Learning?
Machine learning is a branch of artificial intelligence in which algorithms learn patterns from data and improve their performance on a task with experience, rather than being explicitly programmed with fixed rules.
What is Computer Vision?
Computer vision is a field of artificial intelligence that trains computers to interpret and understand the visual world. Using digital images from cameras and videos and deep learning models, machines can accurately identify and classify objects and then react to what they see.
What is Supervised Learning?
Supervised learning is a type of machine learning in which an algorithm learns from labeled training data and makes predictions based on that data. The algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples.
What is TensorFlow?
TensorFlow is an open-source software library for machine learning and artificial intelligence. It’s flexible for defining and running machine learning algorithms and is particularly adept at numerical tasks.
What is TensorFlow Used For?
TensorFlow is primarily used for building and training machine learning models, particularly deep learning models. It provides a comprehensive ecosystem of tools, libraries, and community resources to facilitate the development of various machine learning applications, including image recognition, natural language processing, and predictive analytics.
What is TensorFlow Lite?
TensorFlow Lite is a set of tools provided by TensorFlow to help developers run TensorFlow models on mobile, embedded, and IoT devices. It enables on-device machine learning inference with low latency and small binary size.
What is the Difference Between PyTorch and TensorFlow?
PyTorch and TensorFlow are both popular deep learning frameworks. They offer similar functionalities for building and training neural networks but differ in design philosophies and syntax. PyTorch emphasizes dynamic computation graphs and is favored for its simplicity, flexibility, and ‘Pythonic’ approach, while TensorFlow is known for its scalability, extensive tooling, and deployment capabilities.
What is TorchVision?
Torchvision is a package in the PyTorch library that contains computer vision models, datasets, and image transformations. It also provides utilities for visualizing images, bounding boxes, and segmentation masks.
What is Object Detection?
Object detection is a computer vision task that identifies and locates multiple objects within a single image or video frame. It extends beyond recognizing what objects are present (as in image classification) by also determining their boundaries through bounding boxes. This capability makes object detection pivotal for applications like autonomous driving, security surveillance, and augmented reality, where understanding the context and position of objects in real time is crucial.
What is the Difference Between Image Classification and Object Detection?
Image classification involves assigning a single label to an entire image, identifying what objects are present without their locations. Object detection, on the other hand, identifies and locates multiple objects within the image by predicting bounding boxes around each one. Thus, while image classification categorizes the entire image, object detection specifies what and where objects are. In conclusion, image classification and object detection are fundamental computer vision tasks, but they serve different purposes and involve different processes.
What is the Difference Between One-Stage and Two-Stage Object Detectors?
One-stage detectors prioritize speed and are suited for applications requiring real-time processing. In contrast, two-stage detectors focus on achieving higher accuracy, which is beneficial in scenarios where precision is critical. The choice between them depends on the specific needs and constraints of the application.
What is a Bounding Box?
In the context of object detection in machine learning and computer vision, a bounding box is a box drawn around the object of interest in an image. The bounding box provides the spatial context and location of the object.
What is Image Segmentation?
Image segmentation is the process of dividing or partitioning an image into multiple segments or sets of pixels. In machine learning and computer vision, image segmentation is used to identify objects, boundaries, and features in an image.
What is Image Scaling?
Image scaling refers to the resizing of a digital image. In computer vision and machine learning, scaling is a non-trivial process that involves a trade-off between efficiency, smoothness, and sharpness.
What is Data Augmentation?
Data augmentation is a machine learning technique that increases the diversity of a training set by applying random (but realistic) transformations such as image rotation. This helps improve the model’s performance and can lead to better prediction accuracy.
What is the COCO Dataset in ML?
The COCO (Common Objects in Context) dataset is a large-scale object detection dataset. COCO has about 200,000 labeled images across 80 object categories. For the most detailed documentation, see the COCO Dataset.
How to Install Albumentations?
Albumentations can be installed using pip, a package manager for Python. To do so, run the following command in your terminal: pip install albumentations.