bulkdgd dea
This command performs differential expression analysis (DEA) of genes, comparing a “treated” sample (for instance, a cancer sample) with an “untreated” or “control” sample.
Within the bulkDGD model, the “control” sample is not an experimental sample: it is the decoder’s output for the best representation of the “treated” sample in latent space, which therefore acts as an in silico control sample.
bulkdgd dea expects two or three inputs. The first is a CSV file containing a data frame with a set of experimental “treated” samples. The program assumes that each row represents a sample and each column represents a gene or additional information about the samples. The second is a CSV file containing a data frame with the means of the distributions modeling the genes’ counts in the in silico “control” samples. The third is needed only if the genes’ counts were modeled using negative binomial distributions: it is a CSV file containing a data frame with the r-values of the negative binomials modeling the genes’ counts in the “control” samples. The last two files are obtained by running the bulkdgd find representations command on the “treated” samples.
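As a rough illustration of the samples layout described above, the following sketch writes and reads back a tiny CSV with one row per sample and one column per gene. The sample names, gene names, counts, and the extra `tissue` column are invented for the example, and treating the first column as the sample name is an assumption, not a documented requirement of the program.

```python
import csv
import io

# Hypothetical data: two samples (rows), three genes, plus one column
# holding additional information about the samples.
header = ["sample", "gene_1", "gene_2", "gene_3", "tissue"]
rows = [
    ["sample_A", 12, 0, 340, "lung"],
    ["sample_B", 7, 3, 295, "lung"],
]

# Write the data frame as CSV text (a real run would write to a file).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)
csv_text = buffer.getvalue()

# Read it back: each record is one sample, keyed by column name.
samples = list(csv.DictReader(io.StringIO(csv_text)))
gene_columns = [name for name in header if name.startswith("gene_")]
counts = {s["sample"]: [int(s[g]) for g in gene_columns] for s in samples}
print(counts)
```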
The output of bulkdgd dea is one CSV file per sample containing the results of the differential expression analysis. Each file reports the p-value, q-value (adjusted p-value), and log2-fold change for each gene.
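The q-values are p-values adjusted for multiple testing across genes. The sketch below shows one common adjustment, the Benjamini-Hochberg procedure, in plain Python. Whether this is the method bulkdgd dea applies by default is not stated here, and the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Hypothetical p-values for four genes.
p_values = [0.01, 0.04, 0.03, 0.005]
q_values = benjamini_hochberg(p_values)
print(q_values)
```

Each q-value is at least as large as the p-value it was derived from, which is why fewer genes pass a given threshold after adjustment.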
To speed up the analysis of a set of samples, bulkdgd dea uses the Dask Python package to parallelize the calculations.
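Because each sample is analyzed independently, the work maps naturally onto a pool of workers. The sketch below illustrates that per-sample pattern with the standard library's `concurrent.futures` as a stand-in for Dask; it does not reproduce how bulkdgd dea schedules its tasks, and `analyze_sample` is a hypothetical placeholder for the real per-sample analysis.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze_sample(sample):
    """Hypothetical per-sample step: count the genes with nonzero counts."""
    name, counts = sample
    return name, sum(1 for c in counts if c > 0)


# Hypothetical samples: (name, gene counts).
samples = [
    ("sample_A", [12, 0, 340]),
    ("sample_B", [7, 3, 295]),
    ("sample_C", [0, 0, 18]),
]

# Each sample is handled independently, so the calls can run concurrently.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = dict(pool.map(analyze_sample, samples))
print(results)
```

Threads are used here only to keep the example self-contained; for CPU-bound work, separate processes are what actually provide a speed-up.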
Command line
bulkdgd dea [-h] -is INPUT_SAMPLES -im INPUT_MEANS [-iv INPUT_RVALUES] [-op OUTPUT_PREFIX] [-pr P_VALUES_RESOLUTION] [-qa Q_VALUES_ALPHA] [-qm Q_VALUES_METHOD] [-d WORK_DIR] [-n N_PROC] [-lf LOG_FILE] [-lc] [-v] [-vv]
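As an illustration of the synopsis above, the following sketch assembles one possible invocation as an argument list and prints it as a shell command. The flags come from the synopsis; every file name, the output prefix, and the number of processes are hypothetical values chosen for the example.

```python
import shlex

# Hypothetical file names and values; only the flags come from the synopsis.
command = [
    "bulkdgd", "dea",
    "-is", "samples.csv",         # experimental "treated" samples
    "-im", "pred_means.csv",      # predicted means for the in silico controls
    "-iv", "pred_r_values.csv",   # r-values, only for negative binomials
    "-op", "dea_",                # prefix of the per-sample output files
    "-n", "2",                    # number of processes
    "-v",                         # INFO-level logging
]

# Print the command as it would be typed in a shell.
print(shlex.join(command))
```

The same list can be passed to `subprocess.run(command, check=True)` to launch the analysis from a Python script, provided bulkdgd is installed and the input files exist.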
Options
Help options
| Option | Description |
|---|---|
| `-h` | Show the help message and exit. |
Input files
| Option | Description |
|---|---|
| `-is` | The input CSV file containing a data frame with the gene expression data for the samples. |
| `-im` | The input CSV file containing the data frame with the predicted means of the distributions used to model the genes’ counts for each in silico control sample. |
| `-iv` | The input CSV file containing the data frame with the predicted r-values of the negative binomials for each in silico control sample, if negative binomial distributions were used to model the genes’ counts. |
Output files
| Option | Description |
|---|---|
| `-op` | The prefix of the output CSV file(s) that will contain the results of the differential expression analysis. Since the analysis will be performed for each sample, one file per sample will be created. The files’ names will have the form |
DEA options
| Option | Description |
|---|---|
| `-pr` | The resolution at which to sum over the probability mass function to compute the p-values. The higher the resolution, the more accurate the calculation. The default is |
| `-qa` | The alpha value used to calculate the q-values (adjusted p-values). The default is |
| `-qm` | The method used to calculate the q-values (i.e., to adjust the p-values). The default is |
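To make the idea of a p-value resolution more concrete, the sketch below computes a two-sided p-value for one gene by summing a negative binomial probability mass function over a grid of at most `resolution` counts. It is an illustration under stated assumptions, not the program's implementation: the parameterization by mean and r-value, the choice of the grid's upper bound, and the function names are all assumptions made for the example.

```python
import math


def nb_log_pmf(k, mean, r):
    """Log-PMF of a negative binomial with the given mean and r-value."""
    return (
        math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
        + r * math.log(r / (r + mean))
        + k * math.log(mean / (r + mean))
    )


def p_value(observed, mean, r, resolution=1000):
    """Approximate two-sided p-value of an observed count (sketch)."""
    # Assumed upper bound of the grid: well into the right tail.
    variance = mean + mean**2 / r
    upper = int(max(observed, mean + 10 * math.sqrt(variance))) + 1
    # Evenly spaced counts in [0, upper], at most `resolution` of them.
    step = max(1, math.ceil((upper + 1) / resolution))
    grid = range(0, upper + 1, step)
    log_pmf = [nb_log_pmf(k, mean, r) for k in grid]
    pmf = [math.exp(lp) for lp in log_pmf]
    # Sum the mass of all counts no more probable than the observed one,
    # normalized by the total mass on the grid.
    observed_lp = nb_log_pmf(observed, mean, r)
    tail = sum(p for lp, p in zip(log_pmf, pmf) if lp <= observed_lp)
    return tail / sum(pmf)


# Hypothetical gene: predicted mean 100, r-value 10.
print(p_value(150, mean=100.0, r=10.0))
print(p_value(400, mean=100.0, r=10.0))
```

With a coarser grid (a smaller `resolution`), fewer counts are evaluated and the sum becomes a rougher approximation, which is the trade-off between speed and accuracy that the resolution option controls.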
Run options
| Option | Description |
|---|---|
| `-n` | The number of processes to start. The default number of processes started is 1. |
Working directory options
| Option | Description |
|---|---|
| `-d` | The working directory. The default is the current working directory. |
Logging options
| Option | Description |
|---|---|
| `-lf` | The name of the log file. The file will be written in the working directory. The default file name is |
| `-lc` | Show log messages also on the console. |
| `-v` | Enable verbose logging (INFO level). |
| `-vv` | Enable maximally verbose logging for debugging purposes (DEBUG level). |