The keys to a successful project are efficient collaborative work and good communication, led by a diverse team. At Omdena, every project gathers 50+ collaborators from all around the world, who work together to develop innovative, ethical, and useful AI solutions in two months. Each project tackles issues like climate change, fake news, food insecurity, online threats, disease spread, bank security, and more.
Collaborators therefore need collaborative Machine Learning datasets that everyone can access easily, so that the progress of the project is not delayed by dataset-related issues.
In this article, I will show how we used collaborative Machine Learning datasets for the Omdena GPSDD project.
Photo by Ant Rozetsky on Unsplash
Project
The Omdena project "Improving Food Security and Crop Yield in Senegal" was a collaboration with the Global Partnership for Sustainable Development Data (GPSDD). The goal was to use machine learning to increase food security in Senegal. With this goal in mind, the project had several objectives tackling areas connected to food insecurity, such as crop yield, climate risk, crop diseases, deforestation, and food storage/transport.
Problem Statement
We needed a way to handle the datasets for the crop yield prediction subtask, which consisted of analyzing satellite data and field data to estimate yields at different levels.
Summary of project’s structure — Source: Omdena
So, we had several issues:
- Raw satellite images are too large to be easily stored and made accessible to all collaborators
- The Deep Learning (DL) models developed use preprocessed satellite images
- The preprocessed data are dependent on the crop type and so need to be carefully prepared
- We had one dataset per studied country and crop type (Maize, Rice, Millet), which was a deliberate choice
- The models are trained per crop type and take inputs whose size depends on the crop
- The training datasets (preprocessed data + ground truth) cannot easily be stored on or accessed from AWS or GitHub, especially since we trained the models on Google Colab
We solved most of these issues by using Activeloop.
Activeloop is a fast and simple framework for building and scaling data pipelines for machine learning.
Dataset description
For the GPSDD Senegal project, we used Activeloop to store the datasets used to train our Deep Learning models. These datasets contained the 32-bin histograms of satellite images, the normalized difference vegetation index (NDVI) values, and the yield values (ground truth), originally saved locally as npy files. The resulting Activeloop dataset schema was:
from hub import Dataset, schema

ds = Dataset(
    tag,
    shape=(histograms.shape[0],),
    schema={
        "histograms": schema.Tensor(histograms[0].shape, dtype="float"),
        "ndvi": schema.Tensor(ndvi[0].shape, dtype="float"),
        "yields": schema.Tensor(shape=(1,), dtype="float"),
    },
    mode="w+",
)
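Once the schema was defined, the samples had to be written into the dataset and pushed to the hub. Here is a minimal sketch of that step, assuming the Hub 1.x API and the yields_list described in the next section (not the exact project code):

# Write each sample into the dataset, then persist it.
# Assumption: flush() is the Hub 1.x call that uploads the data to the Activeloop hub.
for i in range(histograms.shape[0]):
    ds["histograms"][i] = histograms[i]
    ds["ndvi"][i] = ndvi[i]
    ds["yields"][i] = yields_list[i]

ds.flush()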
Ground Truth
In order to be able to add the ground truth yield values to the Activeloop dataset, we had to save them as a list of single-element lists, as follows:
yields_list = [[yield_1], [yield_2], [yield_3], …, [yield_n]]
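For instance, if the yields had been saved as a flat NumPy array, the conversion could look like this (illustrative only, with a hypothetical file name):

import numpy as np

yields_array = np.load("yields.npy")       # hypothetical path to the locally saved ground truth
yields_list = [[y] for y in yields_array]  # wrap each yield value in its own single-element list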
Use case: storing and combining separate datasets for each country
We had data from several countries that we wanted to keep in separate datasets, because we sometimes used them individually and sometimes combined them.
For example, to perform transfer learning with the crop yield prediction model, we first trained the model on the datasets from South Sudan and Ethiopia, then fine-tuned the resulting pre-trained model on the combined datasets from South Sudan, Ethiopia, and Senegal.
Using Activeloop for this purpose made the workflow easier and cleaner. Each dataset was loaded from the Activeloop hub using its unique path, and we could then combine the datasets easily.
For example:
import numpy as np
from hub import Dataset

tag1 = "username/SouthSudan_dataset"
tag2 = "username/Ethiopia_dataset"
tag3 = "username/Senegal_dataset"

ds1 = Dataset(tag1)
ds2 = Dataset(tag2)
ds3 = Dataset(tag3)

print(f"Dataset {tag1} shape: {ds1['histograms'].compute().shape}")
print(f"Dataset {tag2} shape: {ds2['histograms'].compute().shape}")
print(f"Dataset {tag3} shape: {ds3['histograms'].compute().shape}")

# Concatenate the three datasets along the sample axis
histograms = np.concatenate(
    (
        ds1["histograms"].compute(),
        ds2["histograms"].compute(),
        ds3["histograms"].compute(),
    ),
    axis=0)

yields_list = np.concatenate(
    (
        ds1["yields"].compute(),
        ds2["yields"].compute(),
        ds3["yields"].compute(),
    ),
    axis=0)

print(f"Datasets combined, histograms set's shape is {histograms.shape}")
print(f"Data loaded from {tag1}, {tag2} and {tag3}")
Training
When we combined the three datasets with np.concatenate, we used tf.data.Dataset.from_tensor_slices to convert the arrays into a TensorFlow dataset:
import tensorflow as tf

list_ds = tf.data.Dataset.from_tensor_slices((histograms, yields_list))
image_count = histograms.shape[0]
But when we worked with only one of the datasets, we directly used Activeloop's built-in TensorFlow conversion, ds.to_tensorflow():
def to_model_fit(item):
    x = item["histograms"]
    y = item["yields"]
    return (x, y)

list_ds = ds1.to_tensorflow()
list_ds = list_ds.map(lambda x: to_model_fit(x))
image_count = ds1["histograms"].compute().shape[0]
Here is a nice example of how to directly use Activeloop datasets to train with Tensorflow.
Then we split the data into train, validation, and test sets using the take and skip functions. Once we had the three sets, we shuffled, batched, and cached them using TensorFlow functions.
batch_size = 16

print("Total files: {}".format(image_count))
train_size = int(0.8 * image_count)
val_size = int(0.1 * image_count)
test_size = int(0.1 * image_count)

# Shuffle once, then carve out the test and validation sets with take/skip
list_ds = list_ds.shuffle(image_count)
test_ds = list_ds.take(test_size)
remaining_ds = list_ds.skip(test_size)
val_ds = remaining_ds.take(val_size)
train_ds = remaining_ds.skip(val_size)

train_ds = train_ds.shuffle(train_size)
train_ds = train_ds.batch(batch_size)

val_ds = val_ds.shuffle(val_size)
val_ds = val_ds.batch(val_size)

test_ds = test_ds.batch(test_size)
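The snippet above covers the shuffling, splitting, and batching; the caching step can be added with the standard tf.data calls, for example (a sketch, not the exact project code):

AUTOTUNE = tf.data.experimental.AUTOTUNE  # let tf.data pick the prefetch buffer size

# Cache the batched sets and overlap data preparation with training
train_ds = train_ds.cache().prefetch(AUTOTUNE)
val_ds = val_ds.cache().prefetch(AUTOTUNE)
test_ds = test_ds.cache().prefetch(AUTOTUNE)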
And finally, we trained our CNN model with the usual Keras compile and fit calls:
metrics_list = [
    'accuracy',
    tf.keras.metrics.RootMeanSquaredError(name='RMSE'),
    tf.keras.losses.MeanSquaredError(name='MSE'),
]

model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.MeanSquaredError(),
    metrics=metrics_list,
)

model.fit(train_ds,
          epochs=1,
          validation_data=val_ds,
          verbose=1)
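After training, the held-out test set built above can be checked against the same metrics; a minimal sketch, assuming the model and test_ds from the snippets above:

# evaluate() returns the loss followed by the compiled metrics
test_results = model.evaluate(test_ds, verbose=1)
print(dict(zip(model.metrics_names, test_results)))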
Consistency of path and data
Another advantage of using Activeloop in this project was that the dataset paths were accessible to all developers without anyone having to store the data locally, and we could be certain that every developer was working with the same dataset.
Update a dataset
It was also easy to replace a dataset by re-uploading the updated version to the same tag, which proved really useful when we collected and processed more data and had to update the datasets for training. All training runs were done in Google Colab notebooks, all using datasets stored on Activeloop. The import step therefore consisted only of loading the Dataset with the code above, which loads all the data at once rather than one file at a time, the latter often requiring a dataloader class or function.
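In a Colab notebook, that loading step came down to a few lines, for example (a sketch, assuming the legacy Hub client is distributed as the "hub" package):

# Assumption: the Activeloop Hub 1.x client is installed in the Colab runtime as "hub"
!pip install hub

from hub import Dataset

# Same tag for every collaborator, so everyone trains on exactly the same data
ds = Dataset("username/Senegal_dataset")
histograms = ds["histograms"].compute()
yields = ds["yields"].compute()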
Discussion
We could have used Activeloop to store the satellite images, but we decided instead to keep them in an S3 bucket, so that the raw data lived in the bucket and the preprocessed, ready-to-use datasets in Activeloop. All the preprocessing that led to the histograms was therefore done on a local machine, with the satellite images downloaded locally.
The way we stored the ground truth yield values could probably be improved by using the available schemas more efficiently.
To conclude, in this project we used Activeloop to store the ML datasets efficiently and easily, and to give all collaborators a unique, consistent path to the data.