Creating your own DataLoader in PyTorch for combining images and tabular data

Lucas Ramos
Published in Analytics Vidhya
4 min read · Jan 29, 2021


The main goal of this post is to show how you can load images and metadata/tabular data using a DataLoader in PyTorch, create batches, and feed them together to the network. This is often desired when we want to combine the metadata with the images at some point in the network.

A good example of when this could be useful is the case of clinical data. Clinical data is often composed of patient information (here referred to as metadata), like age and sex, but also other information, like the exams the patient underwent. These exams often generate images that contain lots of useful information.

Both data sources generate data that is unique and can add a lot to multiple prediction tasks. To make the most of all the information available, Convolutional Neural Networks can be trained on the imaging data, and the metadata can be added to, let’s say, the dense layers to provide extra information during prediction.
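As a rough illustration of the idea of adding metadata to the dense layers, here is a minimal sketch. The class name ImageTabularNet, the layer sizes, and the input shapes are all assumptions made for this example, not the network used in this series.

```python
import torch
from torch import nn


class ImageTabularNet(nn.Module):
    """Hypothetical model: CNN features concatenated with tabular features."""

    def __init__(self, n_tabular: int):
        super().__init__()
        # Tiny convolutional branch for single-channel images.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # -> (batch, 8, 1, 1)
            nn.Flatten(),             # -> (batch, 8)
        )
        # Dense layers see the image features plus the metadata.
        self.head = nn.Sequential(
            nn.Linear(8 + n_tabular, 16),
            nn.ReLU(),
            nn.Linear(16, 1),
        )

    def forward(self, image, tabular):
        features = self.cnn(image)
        combined = torch.cat([features, tabular], dim=1)
        return self.head(combined)


model = ImageTabularNet(n_tabular=3)
images = torch.randn(4, 1, 32, 32)   # batch of 4 fake images
tabular = torch.randn(4, 3)          # 3 fake metadata values per sample
prediction = model(images, tabular)
print(prediction.shape)  # torch.Size([4, 1])
```

The only part that matters here is the torch.cat call: the DataLoader has to deliver the image batch and the tabular batch side by side so they can be joined at that point.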

There are many tutorials for creating your own DataLoaders in PyTorch. One of the best is available in the PyTorch documentation. Below I will explain how each section of the DataLoader works and how you can adapt it to your needs.

The Dataset

One of the most important things for a simple and straightforward DataLoader is to structure your data well.

By structure I mean the following: make sure there’s a clear connection between your images and your tabular data. This can be a number or a column with the image id, name, or path. It doesn’t matter which, as long as you have a clear link between the variables in the table and one or multiple images.

In our case, we are using the test set from the OSIC Pulmonary Fibrosis Progression dataset, available on Kaggle. We will use the image name as a connection to the clinical table.
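Before writing any loading code, it can help to verify that the link really holds in both directions. The sketch below uses only the standard library. The function name check_link, the identifiers, and the .dcm file names are made up for illustration.

```python
from pathlib import Path


def check_link(patient_ids, image_files):
    """Return (ids without an image, images without a table row)."""
    ids = set(patient_ids)
    stems = {Path(name).stem for name in image_files}  # file name without extension
    return sorted(ids - stems), sorted(stems - ids)


# Made-up identifiers and file names, for illustration only.
table_ids = ["patient_a", "patient_b", "patient_c"]
image_files = ["patient_a.dcm", "patient_b.dcm", "patient_d.dcm"]

missing_images, orphan_images = check_link(table_ids, image_files)
print(missing_images)  # ['patient_c']
print(orphan_images)   # ['patient_d']
```

If either list is non-empty, it is usually easier to fix the table or the image folder up front than to handle missing samples inside the loading code.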

This is what our tabular data looks like:

Tabular data

The most important columns are the Patient column, which holds the image names and is the link to the image data, and the FVC column, which is our label. The remaining variables are used as extra data and are combined with the images in the network.

The complete code is linked below. I will describe what is happening inside each function in the following sections.

How does it work?

A DataLoader loads one sample at a time, but it returns the samples stacked into a tensor whose first dimension is the batch size. It takes care of batching, shuffling, and optional parallel loading for you, which can make training faster and your code more organized.
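The difference between a single sample and a batch is easiest to see in the tensor shapes. The following sketch uses TensorDataset with fake zero-filled data. The sizes (10 samples, 8x8 images, 3 tabular features) are arbitrary assumptions.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Ten fake single-channel 8x8 "images", each with 3 tabular features and a label.
images = torch.zeros(10, 1, 8, 8)
tabular = torch.zeros(10, 3)
labels = torch.zeros(10)
dataset = TensorDataset(images, tabular, labels)

# Indexing the dataset returns one sample.
image, features, label = dataset[0]
print(image.shape, features.shape)  # torch.Size([1, 8, 8]) torch.Size([3])

# The DataLoader stacks batch_size samples along a new first dimension.
loader = DataLoader(dataset, batch_size=4)
batch_images, batch_features, batch_labels = next(iter(loader))
print(batch_images.shape, batch_features.shape)  # torch.Size([4, 1, 8, 8]) torch.Size([4, 3])
```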

Below you have the definition of the dataset class that the DataLoader reads from and a link to the complete code (so you can copy and paste if you need it).

Complete Code
DataLoader class

In the class above, we have an __init__ method that initializes the variables we will use in the DataLoader. The __getitem__ method is responsible for loading a single instance of our data. The DataLoader will automatically call this method multiple times until the batch size is reached. The beauty here is that you don’t have to worry about calling this method or controlling the batch size and when to stop; the DataLoader will do it all for you.

In the __getitem__ method, we use idx, which is supplied by the DataLoader (in increasing order, or in random order when shuffling is enabled), to read the sample at position idx. We read the DICOM image given by the column that connects images to tabular data, and then read the tabular data itself and the label.
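A minimal sketch of a dataset with these three methods might look like the following. It is not the exact code from this post, and it makes several assumptions: the tabular features are already numeric, the image reader is passed in as a hypothetical load_image callable (in practice it could wrap a DICOM reader), and the toy table is made up, with "Age" and "Percent" used as example feature columns.

```python
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset


class CombineDataset(Dataset):
    """Sketch: returns (image, tabular features, label) for one table row."""

    def __init__(self, frame, id_col, label_col, load_image):
        self.frame = frame.reset_index(drop=True)
        self.id_col = id_col
        self.label_col = label_col
        self.load_image = load_image  # hypothetical callable: image id -> 2D array
        # Every column that is neither the link nor the label is a feature.
        self.feature_cols = [c for c in frame.columns if c not in (id_col, label_col)]

    def __len__(self):
        return len(self.frame)

    def __getitem__(self, idx):
        row = self.frame.iloc[idx]
        # Load the image through the column that links the table to the images.
        image = torch.as_tensor(self.load_image(row[self.id_col]), dtype=torch.float32)
        image = image.unsqueeze(0)  # add a channel dimension -> (1, H, W)
        tabular = torch.as_tensor(row[self.feature_cols].to_numpy(dtype=np.float32))
        label = torch.tensor(float(row[self.label_col]), dtype=torch.float32)
        return image, tabular, label


# Made-up table and a fake loader that returns blank 8x8 images.
frame = pd.DataFrame({
    "Patient": ["p0", "p1", "p2"],
    "Age": [70, 65, 58],
    "Percent": [58.2, 70.1, 82.0],
    "FVC": [2315, 3020, 2739],
})
dataset = CombineDataset(frame, "Patient", "FVC", load_image=lambda pid: np.zeros((8, 8)))

image, tabular, label = dataset[1]
print(len(dataset), image.shape, tabular.shape)  # 3 torch.Size([1, 8, 8]) torch.Size([2])
```

Passing the image reader in as an argument keeps the sketch runnable without any image files and makes it easy to swap in a real reader later.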

Once the DataLoader has collected a full batch, it returns the batch so it can be used to train a model. The __len__ method tells the DataLoader how many samples the dataset contains, which is how it knows when an epoch ends.
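To see how __len__ bounds an epoch, consider a toy dataset whose sample at position idx is simply idx. IndexDataset is a made-up class for this illustration.

```python
from torch.utils.data import DataLoader, Dataset


class IndexDataset(Dataset):
    """Toy dataset whose sample at position idx is just idx."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        return idx


dataset = IndexDataset(10)
loader = DataLoader(dataset, batch_size=4, shuffle=True)

seen = []
for batch in loader:  # one full pass over the loader is one epoch
    seen.extend(batch.tolist())

print(len(loader))                      # 3 batches: 4 + 4 + 2 samples
print(sorted(seen) == list(range(10)))  # True
```

Even with shuffling, every index from 0 to len(dataset) - 1 is requested exactly once per epoch, and the loop stops on its own after the last batch.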

Finally, you just need to create an instance of the CombineDataset class we defined and feed it to a DataLoader instance. In this case we did it only for the training set, but you can create multiple DataLoaders, one each for the training, validation, and test sets.

When defining the DataLoader instance, you can select the batch size (batch_size), whether to shuffle your data or use batches in sequence (shuffle), the number of worker processes used to speed up data loading (num_workers), and whether to drop the smaller final batch when the remaining samples are not enough to fill a whole batch (drop_last).
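The effect of these options can be checked on a small stand-in dataset. The sketch below uses TensorDataset with ten fake samples instead of the real data, and num_workers=0, which loads everything in the main process.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Ten fake samples: one feature and one label each.
values = torch.arange(10).float()
dataset = TensorDataset(values.unsqueeze(1), values)

# Keep the smaller final batch (the default behaviour).
keep_loader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=0, drop_last=False)
# Discard the final batch if it has fewer than batch_size samples.
drop_loader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=0, drop_last=True)

keep_sizes = [len(x) for x, _ in keep_loader]
drop_sizes = [len(x) for x, _ in drop_loader]
print(keep_sizes)  # [4, 4, 2]
print(drop_sizes)  # [4, 4]
```

Setting num_workers to a value above 0 starts that many worker processes to load samples in parallel, which tends to help most when reading each image is slow.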

DataLoader in action

That’s all: you have a working DataLoader that reads and connects the tabular and image data. The next step is to feed this data to a network. For the sake of length and simplicity, I will present this in the following article.
