Tutorial 2 - Differential Expression Analysis

In this tutorial, we will use the bulkDGD model to perform differential gene expression analysis (DEA) for a set of samples.

The code and input/output data regarding this tutorial can be found in the bulkDGD/tutorials/tutorial_2 directory.

In this tutorial, we will walk you through how to integrate differential expression analysis with bulkDGD in a Python script.

However, if you need to perform DEA on a large set of samples, we recommend using the bulkdgd_dea executable, which parallelizes the calculations.

Step 1 - Get the samples’ representations in latent space and corresponding decoder’s outputs

First, we need to find the best representations for these samples in the latent space defined by the bulkDGD model. You can follow the instructions provided in Tutorial 1 to find the representations and the corresponding decoder outputs. These outputs are the predicted scaled means and predicted r-values of the negative binomials modeling the genes’ counts in silico samples corresponding to the representations.

For this tutorial, we use the first ten preprocessed samples in the samples_preprocessed.csv file and the corresponding ten pre-calculated scaled means and r-values in the pred_means.csv and pred_r_values.csv files obtained following Tutorial 1.

First, we set the logging so that every message above and including the INFO level gets reported to have a better idea of what the program is doing. By default, only messages associated with a WARNING level or above get reported.

# Import the 'logging' module.
import logging as log

# Set the logging options.
log.basicConfig(level = "INFO")

We load the samples using the ioutil.load_samples() function.

# Import the 'ioutil' module from 'bulkDGD'.
from bulkDGD import ioutil

# Load the preprocessed samples into a data frame.
df_samples = \
    ioutil.load_samples(# The CSV file where the samples are stored
                        csv_file = "samples_preprocessed.csv",
                        # The field separator used in the CSV file
                        sep = ",",
                        # Whether to keep the original samples' names/
                        # indexes (if True, they are assumed to be in
                        # the first column of the data frame)
                        keep_samples_names = True,
                        # Whether to split the input data frame into
                        # two data frames, one containing only gene
                        # expression data and the other containing
                        # additional information about the samples
                        split = False)

# Get only the first ten samples.
df_samples = df_samples.iloc[:10,:]

Then, we load the predicted scaled means using the ioutil.load_decoder_outputs() function.

# Load the predicted scaled means into a data frame.
df_pred_means = \
    ioutil.load_decoder_outputs(# The CSV file where the predicted
                                # scaled means are stored
                                csv_file = "decoder_outputs.csv",
                                # The field separator used in the CSV
                                # file
                                sep = ",",
                                # Whether to split the input data frame
                                # into two data frames, one containing
                                # only the predicted scaled means and
                                # the other containing additional
                                # information about the original samples
                                split = False)

# Get only the first ten rows.
df_dec_out = df_dec_out.iloc[:10,:]

Finally, we load the predicted r-values using the ioutil.load_decoder_output() function. If we used an instance of the bulkDGD model using Poisson distributions instead of negative binomial distributions to model the predicted genes’ counts, we would not have an output file with the r-values and we would not need to load them.

# Load the predicted r-values into a data frame.
df_pred_r_values = \
   ioutil.load_decoder_outputs(# The CSV file where the predicted
                               # r-values are stored
                               csv_file = "decoder_outputs.csv",
                               # The field separator used in the CSV
                               # file
                               sep = ",",
                               # Whether to split the input data frame
                               # into two data frames, one containing
                               # only the predicted r-values and
                               # the other containing additional
                               # information about the original samples
                               split = False)

# Get only the first ten rows.
df_pred_r_values = df_pred_r_values.iloc[:10,:]

Step 3 - Perform differential expression analysis

We can perform differential expression analysis for each sample with the analysis.dea.perform_dea() function, and save the results to CSV files (one per sample).

# Import the 'dea' module from 'bulkDGD.analysis'.
from bulkDGD.analysis import dea

# For each sample
for sample in df_samples.index:

    # Perform differential expression analysis.
    dea_results, _ = \
        dea.perform_dea(# The observed gene counts for the current
                        # sample
                        obs_counts = df_samples.loc[sample,:],
                        # The predicted scaled means for the current
                        # sample
                        pred_means = df_pred_means.loc[sample,:],
                        # The r-values for the current sample
                        r_values = df_pred_r_values.loc[sample,:],
                        # Which statistics should be computed and
                        # included in the results
                        statistics = \
                            ["p_values", "q_values",
                             "log2_fold_changes"],
                        # The resolution for the p-values calculation
                        # (the higher, the more accurate the
                        # calculation; set to 'None' for an exact
                        # calculation)
                        resolution = 1e4,
                        # The family-wise error rate for the
                        # calculation of the q-values
                        alpha = 0.05,
                        # The method used to calculate the q-values
                        method = "fdr_bh")

    # Save the results.
    dea_results.to_csv(# The CSV file where to save the results
                       # for the current sample
                       f"dea_sample_{sample}.csv",
                       # The field separator to use in the output
                       # CSV file
                       sep = ",",
                       # Whether to keep the rows' names
                       index = True,
                       # Whether to keep the columns' names
                       header = True)