Tutorial 2 - Differential Expression Analysis
In this tutorial, we will use the bulkDGD model to perform differential gene expression analysis (DEA) for a set of samples.
The code and input/output data regarding this tutorial can be found in the bulkDGD/tutorials/tutorial_2 directory.
In this tutorial, we will walk you through how to integrate differential expression analysis with bulkDGD in a Python script.
However, if you need to perform DEA on a large set of samples, we recommend using the bulkdgd_dea executable, which parallelizes the calculations.
Step 1 - Get the samples’ representations in latent space and corresponding decoder’s outputs
First, we need to find the best representations for these samples in the latent space defined by the bulkDGD model. You can follow the instructions provided in Tutorial 1 to find the representations and the corresponding decoder outputs. These outputs are the predicted scaled means and predicted r-values of the negative binomials modeling the genes’ counts in silico samples corresponding to the representations.
For this tutorial, we use the first ten preprocessed samples in the samples_preprocessed.csv file and the corresponding ten pre-calculated scaled means and r-values in the pred_means.csv and pred_r_values.csv files obtained following Tutorial 1.
First, we set the logging so that every message above and including the INFO level gets reported to have a better idea of what the program is doing. By default, only messages associated with a WARNING level or above get reported.
# Import the 'logging' module.
import logging as log
# Set the logging options.
log.basicConfig(level = "INFO")
We load the samples using the ioutil.load_samples() function.
# Import the 'ioutil' module from 'bulkDGD'.
from bulkDGD import ioutil
# Load the preprocessed samples into a data frame.
df_samples = \
ioutil.load_samples(# The CSV file where the samples are stored
csv_file = "samples_preprocessed.csv",
# The field separator used in the CSV file
sep = ",",
# Whether to keep the original samples' names/
# indexes (if True, they are assumed to be in
# the first column of the data frame)
keep_samples_names = True,
# Whether to split the input data frame into
# two data frames, one containing only gene
# expression data and the other containing
# additional information about the samples
split = False)
# Get only the first ten samples.
df_samples = df_samples.iloc[:10,:]
Then, we load the predicted scaled means using the ioutil.load_decoder_outputs() function.
# Load the predicted scaled means into a data frame.
df_pred_means = \
ioutil.load_decoder_outputs(# The CSV file where the predicted
# scaled means are stored
csv_file = "decoder_outputs.csv",
# The field separator used in the CSV
# file
sep = ",",
# Whether to split the input data frame
# into two data frames, one containing
# only the predicted scaled means and
# the other containing additional
# information about the original samples
split = False)
# Get only the first ten rows.
df_dec_out = df_dec_out.iloc[:10,:]
Finally, we load the predicted r-values using the ioutil.load_decoder_output() function. If we used an instance of the bulkDGD model using Poisson distributions instead of negative binomial distributions to model the predicted genes’ counts, we would not have an output file with the r-values and we would not need to load them.
# Load the predicted r-values into a data frame.
df_pred_r_values = \
ioutil.load_decoder_outputs(# The CSV file where the predicted
# r-values are stored
csv_file = "decoder_outputs.csv",
# The field separator used in the CSV
# file
sep = ",",
# Whether to split the input data frame
# into two data frames, one containing
# only the predicted r-values and
# the other containing additional
# information about the original samples
split = False)
# Get only the first ten rows.
df_pred_r_values = df_pred_r_values.iloc[:10,:]
Step 3 - Perform differential expression analysis
We can perform differential expression analysis for each sample with the analysis.dea.perform_dea() function, and save the results to CSV files (one per sample).
# Import the 'dea' module from 'bulkDGD.analysis'.
from bulkDGD.analysis import dea
# For each sample
for sample in df_samples.index:
# Perform differential expression analysis.
dea_results, _ = \
dea.perform_dea(# The observed gene counts for the current
# sample
obs_counts = df_samples.loc[sample,:],
# The predicted scaled means for the current
# sample
pred_means = df_pred_means.loc[sample,:],
# The r-values for the current sample
r_values = df_pred_r_values.loc[sample,:],
# Which statistics should be computed and
# included in the results
statistics = \
["p_values", "q_values",
"log2_fold_changes"],
# The resolution for the p-values calculation
# (the higher, the more accurate the
# calculation; set to 'None' for an exact
# calculation)
resolution = 1e4,
# The family-wise error rate for the
# calculation of the q-values
alpha = 0.05,
# The method used to calculate the q-values
method = "fdr_bh")
# Save the results.
dea_results.to_csv(# The CSV file where to save the results
# for the current sample
f"dea_sample_{sample}.csv",
# The field separator to use in the output
# CSV file
sep = ",",
# Whether to keep the rows' names
index = True,
# Whether to keep the columns' names
header = True)