Tutorial 3 - Training the bulkDGD model

The code and input data regarding this tutorial can be found in the bulkDGD/tutorials/tutorial_3 directory.

The output data are not included because of GitHub’s file size constraints.

Step 1 - Set the bulkDGD model

In this tutorial, we are going to train the bulkDGD model using a new set of samples.

First, we create a new instance of the bulkDGD model with the desired options. We model the genes’ counts using negative binomial distributions.

To do so, we need a configuration file with these options.

In this case, we will use the model_untrained.yaml file already present in the tutorial’s directory.

First, we set the logging so that every message above and including the INFO level gets reported to have a better idea of what the program is doing. By default, only messages associated with a WARNING level or above get reported.

# Import the 'logging' module.
import logging as log

# Set the logging options.
log.basicConfig(level = "INFO")

Then, we can load the configuration for the bulkDGD model using the ioutil.load_config_model() function.

# Import the 'ioutil' module from 'bulkDGD'.
from bulkDGD import ioutil

# Load the configuration.
config_model = ioutil.get_config_model("model.yaml")

# Check the configuration.
config_model = util.check_config_model(config = config_model)

Once loaded, the configuration consists of a dictionary of options, which maps to the arguments required by the core.model.BulkDGDModel constructor. One of these options is the genes_txt_file, which maps to the path to a plain text file containing a list of Ensembl IDs representing the genes that should be included in the bulkDGD model. If this option is set to "default", the list of genes defined in bulkDGD/data/model/genes/genes.txt is used.

Here, we use the custom list contained in the custom_genes.txt file (in the model_untrained.yaml configuration file, genes_txt_file is set to custom_genes.txt).

A detailed descritpions of the other options used to initialize the bulkDGD model can be found here.

We can now initialize the bulkDGD model.

# Import the 'model' module from 'bulkDGD.core'.
from bulkDGD.core import model

# Get the bulkDGD model (Gaussian mixture model and decoder).
dgd_model = model.BulkDGDModel(**config_model)

If we have a GPU available, we can move the model there.

# Import 'torch'.
import torch

# If a CPU with CUDA is available.
if torch.cuda.is_available():

    # Set the GPU as the device.
    device = torch.device("cuda")

# Otherwise
else:

    # Set the CPU as the device.
    device = torch.device("cpu")

# Move the model to the device.
dgd_model.device = device

Step 2 - Preprocess the input samples

We are going to use the samples provided in the samples_train.csv (training samples) and samples_test.csv (test samples) files.

The files have the following structure:

,ENSG00000187634,ENSG00000188976,ENSG00000187961,ENSG00000187583,...,tissue
1627,80736,275265,52208,2088,...,testis
111,44899,176358,65177,2660,...,adipose_visceral_omentum
555,60662,381897,90671,24486,...,breast_mammary_tissue
...

As we can see, each row contains the expression data for a specific sample. The first column contains the samples’ unique names, IDs, or indexes, while the rest of the columns contain either the expression data for a specific gene (identified by its Ensembl ID) or additional information about the samples. In our case, for example, the last column, named tissue, identifies the tissue from which each sample comes.

Before proceeding with the training, we want to make sure that the genes whose expression data are reported in the CSV files correspond to the genes included in the bulkDGD model and that these genes are reported in the correct order in the files. Furthermore, we would like to know whether we have duplicate samples, duplicate genes, and genes with missing expression values. We can do all this using the ioutil.preprocess_samples() function.

We load our CSV files as data frames using the ioutil.load_samples() function.

# Load the training samples into a data frame.
df_train_raw = \
   ioutil.load_samples(# The CSV file where the samples are stored
                       csv_file = "samples_train.csv",
                       # The field separator in the CSV file
                       sep = ",",
                       # Whether to keep the original samples' names/
                       # indexes (if True, they are assumed to be in
                       # the first column of the data frame
                       keep_samples_names = True,
                       # Whether to split the input data frame into
                       # two data frames, one containing only gene
                       # expression data and the other containing
                       # the extra data about the samples
                       split = False)

 # Load the test samples into a data frame.
 df_test_raw = \
     ioutil.load_samples(# The CSV file where the samples are stored
                         csv_file = "samples_test.csv",
                         # The field separator used in the CSV file
                         sep = ",",
                         # Whether to keep the original samples' names/
                         # indexes (if True, they are assumed to be in
                         # the first column of the data frame
                         keep_samples_names = True,
                         # Whether to split the input data frame into
                         # two data frames, one containing only gene
                         # expression data and the other containing
                         # additional information about the samples
                         split = False)

Then, we can preprocess the samples.

# Preprocess the training samples.
df_train, genes_excluded_train, genes_missing_train = \
    ioutil.preprocess_samples(df_samples = df_train_raw,
                              genes_txt_file = "custom_genes.txt")

# Preprocess the test samples.
df_test, genes_excluded_test, genes_missing_test = \
    ioutil.preprocess_samples(df_samples = df_test_raw,
                              genes_txt_file = "custom_genes.txt")

The function looks for duplicated samples, duplicated genes, and missing values in the columns containing gene expression data. If the function finds duplicated samples or genes with missing expression values, it raises a warning but keeps the samples where the duplication or missing values were found. However, the function will throw an error if it finds duplicated genes since the bulkDGD model assumes the input samples report expression data for unique genes.

Then, the function re-orders the columns containing gene expression data according to the list of genes included in the bulkDGD model and places all the columns containing additional information about the samples (in our case, the tissue column) as the last columns of the output data frame.

Finally, the function checks that all genes in the input samples are among those included in the bulkDGD model, and that all genes used in the bulkDGD model are found in the input samples.

The function returns three objects:

df_train/df_test is a data frame containing the preprocessed samples.
genes_excluded_train/genes_excluded_test is a list containing the Ensembl IDs of the genes that were found in the input samples but are not part of the set of genes included in the bulkDGD model. These genes are absent from df_train/df_test. In our case, no genes were excluded.
genes_missing_train/genes_missing_test is a list containing the Ensembl IDs of the genes that are part of the set of genes included in the the bulkDGD model but were not found in the input samples. These genes are added to df_train/df_test with a count of 0 for all samples. In our case, no genes were missing.

Step 3 - Get the training options

Before training the bulkDGD model, we need to obtain the configuration for the training procedure (which optimizers to use, for how many epochs to train, etc.). Here, we load the configuration from the bulkDGD/configs/training/training.yaml configuration file. We can refer to this file using its name (without extension) because the file is stored in the bulkDGD/configs/training directory.

The configuration can also be stored in a dictionary whose structure is described here.

# Load the configuration for training the bulkDGD model.
config_train = ioutil.load_config_train("training")

# Check the configuration.
config_train = util.check_config_train(config = config_train)

Step 4 - Train the bulkDGD model

We can now train the bulkDGD model.

# Train the bulkDGD model.
df_rep_train, df_rep_test, df_loss, df_time = \
     dgd_model.train(df_train = df_train,
                     df_test = df_test,
                     config_train = config_train)

The functions returns four objects:

df_rep_train is a pandas.DataFrame containing the representations found for the training samples in latent space. In this data frame, each row represents a different representation, and each column represents either the value of the representatione along a dimension of the latent space (in the latent_dim_* columns) or additional information about the original samples (in our case, the tissue column).
df_rep_test is a pandas.DataFrame containing the representations found for the test samples in latent space. In this data frame, each row represents a different representation, and each column represents either the value of the representatione along a dimension of the latent space (in the latent_dim_* columns) or additional information about the original samples (in our case, the tissue column).
df_loss is a pandas.DataFrame containing the losses computed per-epoch during the training procedure.
df_time is a pandas.DataFrame containing information about the CPU and wall clock time used by each training epoch and by the backpropagation steps through the decoder.

Furthermore, the function writes out two files, dec.pth and gmm.pth, containing the parameters of the trained decoder and Gaussian mixture model, respectively. If these files already exist in the working directory (if, for instance, you have already trained the model multiple times), a numerical suffix will be added to the new files as not to overwrite the old ones. Therefore, you will have dec_2.pth and gmm_2.pth in case dec.pth, dec_1.pth, gmm.pth, and gmm_1.pth already exist.

Step 5 - Save the outputs

We can save the preprocessed samples, the representations, the losses, and the information about the training time to CSV files using the ioutil.save_samples(), ioutil.save_representations(), ioutil.save_loss(), and ioutil.save_time() functions.

# Save the preprocessed training samples.
ioutil.save_samples(\
    # The data frame containing the samples
    df = df_train,
    # The output CSV file
    csv_file = "samples_preprocessed_train.csv",
    # The field separator in the output CSV file
    sep = ",")

# Save the preprocessed test samples.
ioutil.save_samples(\
    # The data frame containing the samples
    df = df_test,
    # The output CSV file
    csv_file = "samples_preprocessed_test.csv",
    # The field separator in the output CSV file
    sep = ",")

# Save the representations for the training samples.
ioutil.save_representations(\
    # The data frame containing the representations
    df = df_rep_train,
    # The output CSV file
    csv_file = "representations_train.csv",
    # The field separator in the output CSV file
    sep = ",")

# Save the representations for the test samples.
ioutil.save_representations(\
    # The data frame containing the representations
    df = df_rep_train,
    # The output CSV file
    csv_file = "representations_test.csv",
    # The field separator in the output CSV file
    sep = ",")

# Save the losses.
ioutil.save_loss(\
    # The data frame containing the losses
    df = df_loss,
    # The output CSV file
    csv_file = "loss.csv",
    # The field separator in the output CSV file
    sep = ",")

# Save the time data.
ioutil.save_time(\
    # The data frame containing the time data
    df = df_time,
    # The output CSV file
    csv_file = "train_time.csv",
    # The field separator in the output CSV file
    sep = ",")