bulkdgd reduction
This command allows the user to perform dimensionality reduction analyses on sets of data points (such as latent representations generated by the bulkDGD model).
bulkdgd reduction has four sub-commands:
bulkdgd reduction pca performs a principal component analysis (PCA).
bulkdgd reduction kpca performs a kernel principal component analysis (KPCA).
bulkdgd reduction mds performs a multidimensional scaling analysis (MDS).
bulkdgd reduction tsne performs a t-distributed stochastic neighbor embedding analysis (t-SNE).
All sub-commands expect a CSV file containing a data frame with the data points as input, where each row represents a data point. Each column should contain either the data point values along a dimension or additional information about the data points.
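As an illustration of the expected layout, the following sketch builds a small CSV with one data point per row. The column names (`dim_1`, `dim_2`, `dim_3`, `tissue`) and the values are made up for the example; they are not names required by bulkDGD.

```python
import csv
import io


def build_example_csv():
    """Return CSV text with one data point per row."""
    # Hypothetical columns: "dim_*" hold the values of each data point
    # along a dimension, "tissue" holds additional information.
    header = ["dim_1", "dim_2", "dim_3", "tissue"]
    rows = [
        [0.12, -1.30, 0.88, "liver"],
        [0.10, -1.10, 0.91, "liver"],
        [-0.75, 0.42, -0.20, "lung"],
    ]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = build_example_csv()
print(csv_text)
```

In a file like this, the `dim_*` columns would be the ones considered by the analysis, while a column such as `tissue` would only carry extra information about the data points.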
The sub-commands produce three outputs:
A CSV file containing a data frame with the results of the dimensionality reduction.
A PKL file containing the fitted model used to perform the dimensionality reduction.
A file containing a scatter plot of the results of the dimensionality reduction.
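The CSV and PKL outputs can be read back in Python for further inspection. The sketch below assumes the PKL file is a standard Python pickle. The file names, the `C1`/`C2` column names, and the stand-in model object are hypothetical and only serve to make the example runnable. Loading a real fitted model would most likely also require the library that created it to be installed.

```python
import csv
import pickle
import tempfile
from pathlib import Path


def load_results(csv_path):
    """Read the results CSV into a list of dictionaries, one per data point."""
    with open(csv_path, newline="") as handle:
        return list(csv.DictReader(handle))


def load_model(pkl_path):
    """Load the fitted model. Only unpickle files from sources you trust."""
    with open(pkl_path, "rb") as handle:
        return pickle.load(handle)


# Demonstration with stand-in files (not real bulkDGD outputs).
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "results.csv"
    pkl_path = Path(tmp) / "model.pkl"
    csv_path.write_text("C1,C2\n0.1,-0.4\n1.2,0.3\n")
    with open(pkl_path, "wb") as handle:
        pickle.dump({"kind": "stand-in model"}, handle)

    results = load_results(csv_path)
    model = load_model(pkl_path)

print(results[0])
print(model)
```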
The sub-commands can also take a configuration file as input, specifying the plot’s aesthetics and output format. If not provided, the default configuration file for the analysis performed by the sub-command is used to generate the plot. These default configuration files can be found in bulkDGD/configs/plotting and are named after the analyses they refer to.
Parallelization
If the command is parallelized over several input files or configuration files, these files are assumed to share the same name and to be placed in different directories. The paths to these directories must be specified using the -ds, --dirs option (see the full description in the Parallelization options section below) and must be relative to the specified working directory (-d, --work-dir option).
You can also run the command with the same input file using different configuration files and vice versa.
In these cases, the files that vary among the runs must be placed in different directories and referenced by their corresponding options by name (not path).
In contrast, the input/configuration files that stay the same among runs may be placed anywhere and referenced by their corresponding options using their absolute path or path relative to the specified working directory.
The output and log files for each run will be written in the directory where the corresponding input/configuration files were placed. If the command is not parallelized, these files will be written in the working directory.
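The layout described above can be sketched as follows. The example creates one directory per run, each holding an identically named input file, and assembles the corresponding command without running it. The directory names (`run_1`, `run_2`) and the file name `input.csv` are hypothetical; the options are those listed in the command lines below.

```python
import shlex
import tempfile
from pathlib import Path


def prepare_parallel_run(work_dir, run_dirs, input_name):
    """Create one directory per run and return the matching command."""
    for name in run_dirs:
        run_dir = work_dir / name
        run_dir.mkdir(parents=True, exist_ok=True)
        # Placeholder file: a real run needs an actual CSV here.
        (run_dir / input_name).touch()
    return [
        "bulkdgd", "reduction", "pca",
        "-id", input_name,         # varies among runs: referenced by name
        "-d", str(work_dir),       # working directory
        "-p",                      # run in parallel
        "-n", str(len(run_dirs)),  # number of processes (one per run here)
        "-ds", *run_dirs,          # directories, relative to the working directory
    ]


with tempfile.TemporaryDirectory() as tmp:
    work_dir = Path(tmp)
    command = prepare_parallel_run(work_dir, ["run_1", "run_2"], "input.csv")
    created = sorted(
        p.relative_to(work_dir).as_posix() for p in work_dir.rglob("*.csv")
    )

print(shlex.join(command))
print(created)
```

With this layout, the output and log files of each run would end up in `run_1` and `run_2`, respectively.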
Command lines
bulkdgd reduction pca [-h] -id INPUT_DATA [-im INPUT_MODEL] [-ic INPUT_COLUMNS] [-oa OUTPUT_ANALYSIS] [-om OUTPUT_MODEL] [-op OUTPUT_PLOT] [-cd CONFIG_FILE_DIM_RED] [-cp CONFIG_FILE_PLOT] [-fp FILL_POS_INF] [-fn FILL_NEG_INF] [-gc GROUPS_COLUMN] [-gr GROUPS] [-pg] [-d WORK_DIR] [-lf LOG_FILE] [-lc] [-v] [-vv] [-p] [-n N_PROC] [-ds DIRS [DIRS ...]]
bulkdgd reduction kpca [-h] -id INPUT_DATA [-im INPUT_MODEL] [-ic INPUT_COLUMNS] [-oa OUTPUT_ANALYSIS] [-om OUTPUT_MODEL] [-op OUTPUT_PLOT] [-cd CONFIG_FILE_DIM_RED] [-cp CONFIG_FILE_PLOT] [-fp FILL_POS_INF] [-fn FILL_NEG_INF] [-gc GROUPS_COLUMN] [-gr GROUPS] [-pg] [-d WORK_DIR] [-lf LOG_FILE] [-lc] [-v] [-vv] [-p] [-n N_PROC] [-ds DIRS [DIRS ...]]
bulkdgd reduction mds [-h] -id INPUT_DATA [-im INPUT_MODEL] [-ic INPUT_COLUMNS] [-oa OUTPUT_ANALYSIS] [-om OUTPUT_MODEL] [-op OUTPUT_PLOT] [-cd CONFIG_FILE_DIM_RED] [-cp CONFIG_FILE_PLOT] [-fp FILL_POS_INF] [-fn FILL_NEG_INF] [-gc GROUPS_COLUMN] [-gr GROUPS] [-pg] [-d WORK_DIR] [-lf LOG_FILE] [-lc] [-v] [-vv] [-p] [-n N_PROC] [-ds DIRS [DIRS ...]]
bulkdgd reduction tsne [-h] -id INPUT_DATA [-im INPUT_MODEL] [-ic INPUT_COLUMNS] [-oa OUTPUT_ANALYSIS] [-om OUTPUT_MODEL] [-op OUTPUT_PLOT] [-cd CONFIG_FILE_DIM_RED] [-cp CONFIG_FILE_PLOT] [-fp FILL_POS_INF] [-fn FILL_NEG_INF] [-gc GROUPS_COLUMN] [-gr GROUPS] [-pg] [-d WORK_DIR] [-lf LOG_FILE] [-lc] [-v] [-vv] [-p] [-n N_PROC] [-ds DIRS [DIRS ...]]
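For a single, non-parallel run, a call could be assembled from Python as in the sketch below. It assumes the executable is invoked as `bulkdgd`, as in the descriptions at the top of this page. The file name `representations.csv` and the column name `tissue` are hypothetical. The command is only executed if the executable and the input file are actually available.

```python
import shlex
import shutil
import subprocess
from pathlib import Path

command = [
    "bulkdgd", "reduction", "tsne",
    "-id", "representations.csv",  # hypothetical input CSV file
    "-gc", "tissue",               # hypothetical column used to color the points
    "-lc",                         # also show log messages on the console
    "-v",                          # verbose logging (INFO level)
]

# Show the command as it would be typed in a shell.
print(shlex.join(command))

# Run it only when both the executable and the input file exist.
if shutil.which("bulkdgd") is not None and Path("representations.csv").is_file():
    subprocess.run(command, check=True)
```

Since -id is the only required option, everything else in the list can be dropped to fall back on the defaults.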
Options (all sub-commands)
Help options
| Option | Description |
|---|---|
| -h | Show the help message and exit. |
Input options
| Option | Description |
|---|---|
| -id | The input CSV file containing the data frame with the data points. |
| -im | The input PKL file containing the fitted model on which to project the new data points. |
| -ic | A comma-separated list of columns or a string representing a pattern matching the columns of interest. These will be the columns considered when performing the dimensionality reduction analysis. By default, all columns are considered. |
Output files
| Option | Description |
|---|---|
| -oa | The name of the output CSV file containing the results of the dimensionality reduction analysis. The default file name changes according to the sub-command used. |
| -om | The name of the output pickle file containing the fitted model used to perform the dimensionality reduction analysis. The default file name changes according to the sub-command used. |
| -op | The name of the output file containing the plot displaying the results of the dimensionality reduction analysis. The default file name changes according to the sub-command used. The file format and, therefore, extension are inferred from the |
Configuration files
| Option | Description |
|---|---|
| -cd | The YAML configuration file specifying the options for the dimensionality reduction analysis. If it is a name without an extension, it is assumed to be the name of a configuration file in |
| -cp | The YAML configuration file specifying the plot’s aesthetics and output format. If it is a name without an extension, it is assumed to be the name of a configuration file in |
Pre-processing options
| Option | Description |
|---|---|
| -fp | Replace all positive infinite values with the given value before performing the dimensionality reduction analysis. |
| -fn | Replace all negative infinite values with the given value before performing the dimensionality reduction analysis. |
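The effect of these two options can be illustrated with a minimal sketch. The function below is not bulkDGD's implementation. Its name, its parameters, and the choice to leave values untouched when no replacement is given are assumptions made for the example.

```python
import math


def fill_infinite(values, fill_pos_inf=None, fill_neg_inf=None):
    """Return a copy of 'values' with infinite entries replaced.

    A replacement of None leaves the corresponding infinite
    values unchanged (an assumption for this sketch).
    """
    filled = []
    for value in values:
        if math.isinf(value) and value > 0 and fill_pos_inf is not None:
            filled.append(fill_pos_inf)
        elif math.isinf(value) and value < 0 and fill_neg_inf is not None:
            filled.append(fill_neg_inf)
        else:
            filled.append(value)
    return filled


row = [0.5, float("inf"), -2.0, float("-inf")]
cleaned = fill_infinite(row, fill_pos_inf=1e6, fill_neg_inf=-1e6)
print(cleaned)  # [0.5, 1000000.0, -2.0, -1000000.0]
```

Replacing infinite values matters because analyses such as PCA generally cannot be fitted on data containing them.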
Plotting options
| Option | Description |
|---|---|
| -gc | The name/index of the column in the input data frame containing the groups by which the samples will be colored in the output plot. By default, the program assumes that no such column is present. |
| -gr | A comma-separated list of groups whose data points should be plotted. By default, all groups found in the |
| -pg | Whether to plot data points from the groups not included in the |
Working directory options
| Option | Description |
|---|---|
| -d | The working directory. The default is the current working directory. |
Logging options
| Option | Description |
|---|---|
| -lf | The name of the log file. The default file name depends on the sub-command. |
| -lc | Show log messages also on the console. |
| -v | Enable verbose logging (INFO level). |
| -vv | Enable maximally verbose logging for debugging purposes (DEBUG level). |
Parallelization options
| Option | Description |
|---|---|
| -p | Whether to run the command in parallel. |
| -n | The number of processes to start. The default is 1. |
| -ds | The directories containing the input/configuration files. It can be either a list of names or paths, a pattern that the names or paths match, or a plain text file containing the names of or the paths to the directories. If names are given, the directories are assumed to be inside the working directory. If paths are given, they are assumed to be relative to the working directory. |