The TensorFlow data API, tf.data, is a well-known tool for building complex Machine Learning (ML) input-data pipelines when training a model with TensorFlow. It is a very useful and powerful tool, for example when applying transformations to an entire dataset.
In this tutorial, we will show how to use Hub instead of tf.data for several cases:
How to load a common dataset: CIFAR10
How to create a dataset from a directory: Flower Photos dataset
How to conduct Data Augmentation
How to work with segmentation datasets
Before starting, we need to install and import the packages required for this tutorial:
!pip install hub==2.0.7 # restart runtime after this
Imports:
import hub
import tensorflow as tf
import pathlib
import os
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
from tqdm import tqdm
1) How to load the CIFAR10 dataset
Let’s start with a simple task: loading the CIFAR10 dataset. The CIFAR10 dataset comprises 60,000 32x32 colour images in 10 classes, with 6,000 images per class. In total, there are 50,000 training images and 10,000 test images.
with tf.data:
train, test = tf.keras.datasets.cifar10.load_data()
images, labels = train
images = images/255 # normalize
dataset_cifar10_tf_data = tf.data.Dataset.from_tensor_slices((images, labels))
➡️ <TensorSliceDataset shapes: ((32, 32, 3), (1,)), types: (tf.float64, tf.uint8)>
with Hub:
ds_cifar10_hub = hub.load('hub://activeloop/cifar10-train')

def to_model_fit(item):
    x = item['images']/255 # normalize
    y = item['labels']
    return (x, y)

ds_cifar10_hub_tf = ds_cifar10_hub.tensorflow()
ds_cifar10_hub_tf = ds_cifar10_hub_tf.map(lambda x: to_model_fit(x))
➡️ <MapDataset shapes: ((32, 32, 3), (1,)), types: (tf.float32, tf.uint32)>
2) How to create a dataset from a directory: Flower Photos dataset
We are using the Flower Photos dataset from Kaggle to demonstrate how to create a Tensorflow dataset from a local directory.
We first download the Flower Photos dataset:
!export KAGGLE_USERNAME="xxxxx" && export KAGGLE_KEY="xxxxx" && kaggle datasets download -d batoolabbas91/flower-photos-by-the-tensorflow-team && unzip -n flower-photos-by-the-tensorflow-team.zip
We need to take a look at what we have in this folder:
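For example, in a notebook we can list the folder contents directly (assuming the archive was unzipped into ./flower_photos):

!ls flower_photos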
Now we can gather more information:
dataset_flowers_path = 'flower_photos'
from imutils import paths
files_list = sorted(list(paths.list_images(dataset_flowers_path)))
classes_flowers = sorted(os.listdir(dataset_flowers_path))
print(f'There are {len(classes_flowers)} classes of flowers in the dataset: {classes_flowers}')
➡️ There are 6 classes of flowers in the dataset: ['LICENSE.txt', 'daisy', 'dandelion', 'roses', 'sunflowers', 'tulips']
This is incorrect: we do not have 6 classes but 5, because 'LICENSE.txt' is not a class. Let’s fix this:
# Removing the 'LICENSE.txt'
classes_flowers.remove('LICENSE.txt')
print(f'There are {len(classes_flowers)} classes of flowers in the dataset: {classes_flowers}')
➡️ There are 5 classes of flowers in the dataset: ['daisy', 'dandelion', 'roses', 'sunflowers', 'tulips']
That’s better! Now let’s take a look at some of the images:
for i in range(4):
    image = Image.open(files_list[i])
    image_size = image.size
    print(image_size)
    image.show()
We can see that the images all have different sizes, so we need to keep this in mind for the rest of our coding.
- with tf.data:
We use the list files_list defined previously to create a Tensorflow dataset ds_flowers_tf_data:
ds_flowers_tf_data = tf.data.Dataset.from_tensor_slices(files_list)
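The snippets below rely on a few shared variables. A minimal setup might look like this (batch_size=10 and resize_size=(256,256) are confirmed later in the text; the exact seed value is an assumption, any fixed integer works):

resize_size = (256, 256)   # target size for all images
batch_size = 10            # confirmed by the 367 iterations per epoch observed later
shuffle_common_seed = 42   # assumption: any fixed seed shared between the pipelines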
We then implement a function parse_image that takes the path to the file, reads the image, gets the label, and returns the normalized, resized image (since we saw that the images all have different sizes, we choose resize_size=(256,256)) and the encoded label:
def parse_image(file_name):
    # read the image, decode it, resize it, and normalize it
    image = tf.io.read_file(file_name)
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.resize(image, resize_size) / 255.0
    # find the label and encode it
    label = tf.strings.split(file_name, os.path.sep)[-2]
    one_hot = label == classes_flowers
    encoded_label = tf.argmax(one_hot)
    # return the image and the integer-encoded label
    return (image, encoded_label)
Now we can use this function to map the file paths to image/label pairs and apply it to ds_flowers_tf_data, which we then batch (batch_size=10), shuffle using a common seed shuffle_common_seed, and prefetch:
ds_flowers_tf_data = (ds_flowers_tf_data
    # Calling parse_image
    .map(parse_image, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(batch_size)
    .shuffle(len(ds_flowers_tf_data), seed=shuffle_common_seed)
    .prefetch(tf.data.AUTOTUNE))
We implement a function called visualize_img_label_in_first_batch_TF_ds that takes as inputs a batched dataset ds along with the batch_size used, and displays each image of the first batch together with its shape and its label:
def visualize_img_label_in_first_batch_TF_ds(ds, batch_size):
    for image, label in ds:
        for b in range(batch_size):
            print(f'Image size: {image.numpy()[b].shape}')
            print(label.numpy()[b])
            plt.imshow(image.numpy()[b])
            plt.show()
        break
Now we can use this function on the batched dataset ds_flowers_tf_data we created with tf.data:
visualize_img_label_in_first_batch_TF_ds(ds_flowers_tf_data, batch_size)
- with Hub: Now we want to do the exact same thing, but using Hub instead of tf.data. We still want to process the paths to the images, read them, get their labels, and return the normalized, resized images (resize_size=(256,256)) and the encoded labels.
First, we create the Hub dataset structure using the same list files_list as we did for the tf.data dataset and using the classes_flowers collected previously:
with hub.empty('./flowers_hub') as ds_flowers_hub:
    # Create the tensors with names of your choice.
    ds_flowers_hub.create_tensor('images', htype = 'image', sample_compression = 'jpg')
    ds_flowers_hub.create_tensor('labels', htype = 'class_label', class_names = classes_flowers)

    # Iterate through the files and append to the hub dataset
    for file in tqdm(files_list):
        label_text = os.path.basename(os.path.dirname(file))
        label_num = classes_flowers.index(label_text)
        # Append to the images tensor using hub.read
        ds_flowers_hub.images.append(hub.read(file))
        # Append to the labels tensor
        ds_flowers_hub.labels.append(np.uint32(label_num))
We have created a Hub dataset called ds_flowers_hub with an images tensor (htype image) and a labels tensor (htype class_label).
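As a quick sanity check (a hypothetical snippet, not from the original notebook), we can inspect the new dataset:

print(len(ds_flowers_hub))  # expect 3670 samples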
We resize the dataset using the Hub transformation feature:
# Resize op
@hub.compute
def resize(sample_in, sample_out, new_size):
    # Append the label and image to the output sample
    sample_out.labels.append(sample_in.labels.numpy())
    sample_out.images.append(np.array(Image.fromarray(sample_in.images.numpy()).resize(new_size)))
    return sample_out
# Path of the resized dataset
path_dataset_resized = './flowers-dataset-resized-256x256'
# hub.like is used to create an empty dataset with the same tensor structure
ds_flowers_hub_resized = hub.like(path_dataset_resized, ds_flowers_hub, overwrite = True)
# Resize the dataset ds_flowers_hub; the result is stored in ds_flowers_hub_resized
resize(new_size=resize_size).eval(ds_flowers_hub, ds_flowers_hub_resized, num_workers = 2)
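To verify the transform worked (again a hypothetical check), we can inspect the shape of a resized sample:

print(ds_flowers_hub_resized.images[0].numpy().shape)  # expect (256, 256, 3)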
Now we can create the resized, batched, shuffled, and prefetched Tensorflow dataset from ds_flowers_hub_resized:
def to_model_fit(item):
    x = item['images']/255 # normalize
    y = item['labels']
    return (x, y)
ds_flowers_hub_tf = ds_flowers_hub_resized.tensorflow()
ds_flowers_hub_tf = (ds_flowers_hub_tf
    # calling to_model_fit
    .map(lambda x: to_model_fit(x))
    .batch(batch_size)
    .shuffle(len(ds_flowers_hub_resized), seed=shuffle_common_seed)
    .prefetch(tf.data.AUTOTUNE))
Using the same function visualize_img_label_in_first_batch_TF_ds as previously, we visualize the first batch of ds_flowers_hub_tf. It should contain exactly the same images as the first batch of ds_flowers_tf_data, because we use the same shuffling seed (shuffle_common_seed) and the same original dataset:
visualize_img_label_in_first_batch_TF_ds(ds_flowers_hub_tf, batch_size)
➡️ The images in the first batch of ds_flowers_hub_tf and ds_flowers_tf_data are indeed identical. 🌸
Finally, we can start a simple training with these datasets to check that they behave correctly and can be used for image classification training. First, we implement a function train_with_simple_CNN_function that defines the model to use (a very basic CNN), compiles it, and starts the training:
def train_with_simple_CNN_function(ds):
    model = tf.keras.Sequential([
        tf.keras.layers.InputLayer(input_shape=(resize_size[0], resize_size[1], 3)),
        tf.keras.layers.Conv2D(16, 3, padding='same', activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(32, 3, padding='same', activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, padding='same', activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(len(classes_flowers), activation='softmax')
    ])
    # Compile the model with the Adam optimizer, the SparseCategoricalCrossentropy loss,
    # and SparseCategoricalAccuracy, because our labels are integer-encoded, not one-hot
    model.compile(
        optimizer='adam',
        loss=tf.keras.losses.SparseCategoricalCrossentropy(),
        metrics=[tf.keras.metrics.SparseCategoricalAccuracy()]
    )
    # Start training over 2 epochs
    history = model.fit(ds, epochs=2)
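We can then call this function on both datasets to compare their behaviour:

train_with_simple_CNN_function(ds_flowers_tf_data)
train_with_simple_CNN_function(ds_flowers_hub_tf)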
Results:
We see that both trainings go through the same number of iterations per epoch: 367 (which makes sense because len(dataset)=3670 and batch_size=10). In this tutorial, we do not care about the metrics; we only focus on whether each dataset is usable in a training or not.
So, everything looks good! We can now work on training the best image classification model to differentiate the 5 classes of flowers we have with these datasets 🌻
3) Data Augmentation 🌻🌻🌻
Data augmentation is a common method used to avoid overfitting of the model. Let’s see how we can implement it using tf.data and Hub:
- tf.data:
We use the same files_list and tf.data.Dataset.from_tensor_slices, but this time we add another mapping that uses the function augment_using_ops to augment our dataset:
def augment_using_ops(images, labels):
    images = tf.image.random_flip_left_right(images)
    images = tf.image.random_flip_up_down(images)
    images = tf.image.rot90(images)
    return (images, labels)
ds_flowers_tf_data = tf.data.Dataset.from_tensor_slices(files_list)

# We map, shuffle, cache, batch, augment and prefetch
ds_directory_tf_data_data_aug = (ds_flowers_tf_data
    .map(parse_image, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(len(ds_flowers_tf_data), seed=shuffle_common_seed)
    .cache()
    .batch(batch_size)
    .map(augment_using_ops, num_parallel_calls=tf.data.AUTOTUNE)
    .prefetch(tf.data.AUTOTUNE)
)
Then we want to see if the augmentation was successful (we should now have flipped and rotated images):
visualize_img_label_in_first_batch_TF_ds(ds_directory_tf_data_data_aug, batch_size)
The data augmentation worked fine!
- Hub:
We do the exact same augmentation, but this time we use Hub to create the TF dataset. We re-use the Hub dataset ds_flowers_hub_resized constructed previously. We create the augmented TF dataset ds_flowers_hub_data_aug by replacing the function to_model_fit previously used in the mapping with the function normalize_and_augment, which normalizes and augments our dataset:
def normalize_and_augment(item):
    x = item['images']/255 # normalize
    x = tf.image.random_flip_left_right(x)
    x = tf.image.random_flip_up_down(x)
    x = tf.image.rot90(x)
    y = item['labels']
    return (x, y)
ds_flowers_hub_tf = ds_flowers_hub_resized.tensorflow()

# We shuffle, cache, batch, augment and prefetch
ds_flowers_hub_data_aug = (ds_flowers_hub_tf
    .shuffle(len(ds_flowers_hub_resized), seed=shuffle_common_seed)
    .cache()
    .batch(batch_size)
    .map(normalize_and_augment, num_parallel_calls=tf.data.AUTOTUNE)
    .prefetch(tf.data.AUTOTUNE)
)
We check the images in the first batch:
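Re-using the same visualization helper as before:

visualize_img_label_in_first_batch_TF_ds(ds_flowers_hub_data_aug, batch_size)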
We see the same images in the first batch as in ds_directory_tf_data_data_aug (because we use the same shuffle_common_seed); however, since the augmentations are random, the images in ds_flowers_hub_data_aug are rotated and flipped in different directions than in ds_directory_tf_data_data_aug.
4) Segmentation Dataset: Image + Mask
All of the previous examples used image classification datasets, but many Computer Vision projects focus on segmentation tasks rather than classification. If you are just starting out and do not know what image segmentation is: segmentation is a pixel-wise classification method.
For this example, we will use the Kaggle dataset: Accurate damaged flower shapes/segmentation:
!export KAGGLE_USERNAME="xxxxx" && export KAGGLE_KEY="xxxxxx" && kaggle datasets download -d metavision/accurate-damaged-flower-shapessegmentation && unzip -n accurate-damaged-flower-shapessegmentation.zip
Looking inside the dataset, we see that the images of the flowers are under the subfolder called "720p" and the corresponding masks are under the subfolder "mask". We collect all the paths to the images in the list files_list_flowers_images and all the paths to the masks in the list files_list_flowers_masks. In total, we have 2544 image/mask pairs; one way of collecting these lists is sketched below.
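A minimal sketch using imutils as before (assumptions: the working directory is the root of the unzipped archive, and filtering on the subfolder names is sufficient):

from imutils import paths

all_files = sorted(list(paths.list_images('.')))  # adjust the root to the unzipped archive
files_list_flowers_images = [f for f in all_files if f'{os.sep}720p{os.sep}' in f]
files_list_flowers_masks = [f for f in all_files if f'{os.sep}mask{os.sep}' in f]
print(len(files_list_flowers_images), len(files_list_flowers_masks))  # expect 2544 and 2544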
As usual, we want to take a look at these images:
for i in range(4):
    img = Image.open(files_list_flowers_images[i])
    print(img.size)
    img.show()

for i in range(4):
    mask = Image.open(files_list_flowers_masks[i]).convert('L')
    print(mask.size)
    mask.show()
    print(np.unique(mask))
For the masks, we also display np.unique(mask) because we want to know whether the values are binary. Ideally, they should be: for example 0 for background and 1 for flower, since we are planning to do binary segmentation. However, we see here that we have np.unique(mask)=[ 0 20 74 77 78] for the first mask, for example. So we need to keep in mind that we will have to do something about this when creating the dataset.
This time we want to resize the images to 512x512, because the originals are too big to be used efficiently in a model on Google Colab (we would need far more resources). So we re-define some of the common variables:
resize_size = (512, 512)
batch_size = 4
shuffle_common_seed = 21
Okay, let’s start!
- tf.data:
We use the same tf.data.Dataset.from_tensor_slices architecture as before to create the TF dataset with tf.data, but this time we pass two file lists instead of one: files_list_flowers_images and files_list_flowers_masks:
ds_flowers_tf_data_seg = tf.data.Dataset.from_tensor_slices((files_list_flowers_images, files_list_flowers_masks))
ds_flowers_tf_data_seg = (ds_flowers_tf_data_seg
    # Calling parse_image_mask
    .map(parse_image_mask, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(batch_size)
    .shuffle(len(ds_flowers_tf_data_seg), seed=shuffle_common_seed)
    .prefetch(tf.data.AUTOTUNE))
ds_flowers_tf_data_seg
However, we need to modify the function used in the mapping to read both the image and the mask. For this, we implement the new mapping function parse_image_mask:
def parse_image_mask(image_name, mask_name):
    # read the image, decode it, resize it, and normalize it
    image = tf.io.read_file(image_name)
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.resize(image, resize_size) / 255.0

    # read the mask and decode it (decode_jpeg also supports decoding PNGs)
    mask = tf.io.read_file(mask_name)
    mask = tf.image.decode_jpeg(mask, channels=1)
    # Need to have binary values: 0 or 1
    mask = tf.cast(mask > 0, tf.int32)
    # Resize
    mask = tf.image.resize(mask, resize_size, method=tf.image.ResizeMethod.NEAREST_NEIGHBOR)

    # return the image and its mask
    return (image, mask)
In this new mapping function, we read the image, resize it, and normalize it. Then we read the mask, perform thresholding so that we only have binary values (0 or 1) for training the model, and finally resize it.
NB: we used method=tf.image.ResizeMethod.NEAREST_NEIGHBOR when resizing the masks so that resizing does not introduce values that were not initially in the image (for example, a value different from 0 or 1 for the mask).
We implement a function show_img_mask_in_first_batch to visualize the images and masks in the first batch of a dataset ds:
def show_img_mask_in_first_batch(ds, batch_size):
    for image, mask in ds:
        # first batch
        for b in range(batch_size):
            print(image.numpy()[b].shape)
            plt.imshow(image.numpy()[b])
            plt.show()
            print(mask.numpy()[b].shape)
            plt.imshow(mask.numpy()[b][:,:,0])
            plt.show()
            print(np.unique(mask.numpy()[b][:,:,0])) # we want [0. 1.]
        break
And we use it with our new segmentation dataset ds_flowers_tf_data_seg:
show_img_mask_in_first_batch(ds_flowers_tf_data_seg, batch_size)
Here we see that we do have only 0 and 1 as values in the mask image. Both the images and masks are correctly resized.
- Hub:
First, we create the dataset ds_flowers_hub_seg locally at the path ./flowers_seg_hub and populate it using the list files_list_flowers_images:
with hub.empty('./flowers_seg_hub') as ds_flowers_hub_seg:
    # Create the tensors with names of your choice.
    ds_flowers_hub_seg.create_tensor('images', htype = 'image', sample_compression = 'jpg')
    ds_flowers_hub_seg.create_tensor('masks', htype = 'image', sample_compression = 'png')

    # Iterate through the files and append to the hub dataset
    for file in tqdm(files_list_flowers_images):
        # Append to the images tensor using hub.read
        ds_flowers_hub_seg.images.append(hub.read(file))
        path_to_mask = file.replace('image', 'mask').replace('720p','mask').replace('jpg','png')
        # Append to the masks tensor using a Pillow Image
        ds_flowers_hub_seg.masks.append(np.array(Image.open(path_to_mask)))
NB: the paths to the masks are the same as the paths to the images once we replace the string "image" with "mask", "720p" with "mask", and "jpg" with "png". This is what we do when defining the variable path_to_mask.
Now that we have our dataset ds_flowers_hub_seg, we can create the TF dataset:
ds_flowers_hub_seg_tf = ds_flowers_hub_seg.tensorflow()
ds_flowers_hub_seg_tf = (ds_flowers_hub_seg_tf
    # calling to_model_fit
    .map(lambda x: to_model_fit(x))
    .batch(batch_size)
    .shuffle(len(ds_flowers_hub_seg), seed=shuffle_common_seed)
    .prefetch(tf.data.AUTOTUNE))
And this time, this is the mapping function to_model_fit that we use:
def to_model_fit(item):
    x = tf.image.resize(item['images'], resize_size)/255
    y = item['masks']
    # 3 channels to 1 channel
    y = tf.image.rgb_to_grayscale(y)
    # Need to have binary values: 0 or 1
    y = tf.cast(y > 0, tf.int32)
    # Resize
    y = tf.image.resize(y, resize_size, method=tf.image.ResizeMethod.NEAREST_NEIGHBOR)
    return (x, y)
This function resizes and normalizes the images, converts the RGB masks to grayscale (3 channels to 1 channel), performs the same thresholding as we did for ds_flowers_tf_data_seg, and finally resizes the masks.
Let’s take a look at the first batch:
show_img_mask_in_first_batch(ds_flowers_hub_seg_tf, batch_size)
This looks good! Now, we want to check that these datasets are both usable to train a segmentation model.
We use the Unet model architecture from the article Binary Semantic Segmentation: Cloud detection with U-net and Activeloop Hub:
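The unet helper is defined in that article; as a self-contained stand-in, a minimal encoder-decoder with the same interface could look like the sketch below (an assumption: the real architecture is deeper, but any model ending in a 1-channel sigmoid output fits the compile call that follows):

def unet(input_shape=(512, 512, 3)):
    inputs = tf.keras.Input(shape=input_shape)
    # Encoder: two downsampling blocks, keeping activations for skip connections
    c1 = tf.keras.layers.Conv2D(16, 3, padding='same', activation='relu')(inputs)
    p1 = tf.keras.layers.MaxPooling2D()(c1)
    c2 = tf.keras.layers.Conv2D(32, 3, padding='same', activation='relu')(p1)
    p2 = tf.keras.layers.MaxPooling2D()(c2)
    # Bottleneck
    b = tf.keras.layers.Conv2D(64, 3, padding='same', activation='relu')(p2)
    # Decoder: upsample and concatenate the skip connections
    u2 = tf.keras.layers.Concatenate()([tf.keras.layers.UpSampling2D()(b), c2])
    c3 = tf.keras.layers.Conv2D(32, 3, padding='same', activation='relu')(u2)
    u1 = tf.keras.layers.Concatenate()([tf.keras.layers.UpSampling2D()(c3), c1])
    c4 = tf.keras.layers.Conv2D(16, 3, padding='same', activation='relu')(u1)
    # 1-channel sigmoid output for binary segmentation
    outputs = tf.keras.layers.Conv2D(1, 1, activation='sigmoid')(c4)
    return tf.keras.Model(inputs, outputs)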
model = unet(input_shape = (512,512,3))
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss='binary_crossentropy',
              metrics=['accuracy', tf.keras.metrics.Recall(name="recall"),
                       tf.keras.metrics.Precision(name="precision"),
                       tf.keras.metrics.MeanIoU(num_classes=2, name='iou')])
and then train with the two datasets we just created:
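For example (a sketch; in practice you may want to re-instantiate the model between runs):

# Train for 1 epoch on each dataset to check that both pipelines feed the model correctly
model.fit(ds_flowers_tf_data_seg, epochs=1)
model.fit(ds_flowers_hub_seg_tf, epochs=1)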
Both datasets were usable to train Unet for 1 epoch 🌼
The Notebook for this tutorial is available here.