core.model

This module contains the class implementing the full bulkDGD model (core.model.BulkDGDModel).

class core.model.BulkDGDModel(input_dim, gmm_options, dec_options, genes_txt_file=None, gmm_pth_file=None, dec_pth_file=None)

Class implementing the full bulkDGD model.

__init__(input_dim, gmm_options, dec_options, genes_txt_file=None, gmm_pth_file=None, dec_pth_file=None)

Initialize an instance of the class.

The model is initialized on the CPU. To move it to another device, set the device property.

Parameters:

input_dim : int

The dimensionality of the input (that is, the dimensionality of the representations, of the Gaussian mixture model, and of the first layer of the decoder).

gmm_options : dict

The options for setting up the Gaussian mixture model.

For the available options, refer to the Configuration for creating an instance of the bulkDGD model page.

dec_options : dict

The options for setting up the decoder.

For the available options, refer to the Configuration for creating an instance of the bulkDGD model page.

genes_txt_file : str

A .txt file containing the Ensembl IDs of the genes included in the model.

Training data will be checked to ensure counts are reported for all genes.

The number of output units in the decoder is initialized from the number of genes found in this file.

gmm_pth_file : str, optional

A .pth file with the GMM’s trained parameters (means, weights, and log-variance of the components).

Please ensure that the parameters match the Gaussian mixture model’s structure.

Omit it if the model needs training.

dec_pth_file : str, optional

A .pth file containing the decoder’s trained parameters (weights and biases).

Please ensure that the parameters match the decoder’s architecture.

Omit it if the model needs training.
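The gene check described for genes_txt_file can be pictured with a short sketch. This is not bulkDGD's implementation: the one-ID-per-line file layout, the helper names read_gene_ids and missing_genes, and the example Ensembl IDs are assumptions made for illustration.

```python
import io


def read_gene_ids(handle):
    """Read one Ensembl ID per line, skipping blank lines (assumed file layout)."""
    return [line.strip() for line in handle if line.strip()]


def missing_genes(gene_ids, columns):
    """Return the model's genes that have no matching column in the data."""
    available = set(columns)
    return [gene for gene in gene_ids if gene not in available]


# Stand-in for the contents of a genes .txt file (example IDs only).
genes_file = io.StringIO("ENSG00000000003\nENSG00000000005\n\nENSG00000000419\n")
gene_ids = read_gene_ids(genes_file)

# Column names of a hypothetical data frame of samples;
# "tissue" plays the role of additional information about the samples.
sample_columns = ["ENSG00000000003", "ENSG00000000419", "tissue"]

# The decoder's number of output units follows the number of genes in the file.
n_output_units = len(gene_ids)

# Genes the model expects but the data does not report.
missing = missing_genes(gene_ids, sample_columns)
```

In this sketch a non-empty missing list signals that the data does not report counts for every gene in the model.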

get_probability_density(df_rep)

Given a set of representations, compute the probability density of each representation under each component of the Gaussian mixture model and find, for each component, the representation(s) with the maximum probability density.

Parameters:

df_rep : pandas.DataFrame

A data frame containing the representations.

Returns:

df_prob_rep : pandas.DataFrame

A data frame containing the probability densities for each representation, together with the maximum probability density found and the component for which it is found.

df_prob_comp : pandas.DataFrame

A data frame containing, for each component, the representation(s) having the maximum probability density for the component, together with the probability density for that (those) representation(s).
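As a rough illustration of the two returned data frames, the sketch below evaluates Gaussian densities by hand with NumPy and pandas. It does not call bulkDGD: diagonal covariances parameterized by log-variances, unweighted component densities, and every column and variable name used here are assumptions.

```python
import numpy as np
import pandas as pd


def component_densities(df_rep, means, log_vars):
    """Density of every representation under every diagonal Gaussian component."""
    x = df_rep.to_numpy()[:, None, :]      # shape (n_rep, 1, dim)
    var = np.exp(log_vars)[None, :, :]     # shape (1, n_comp, dim)
    diff = x - means[None, :, :]
    log_dens = -0.5 * np.sum(np.log(2 * np.pi * var) + diff**2 / var, axis=2)
    cols = [f"comp_{i}" for i in range(means.shape[0])]
    return pd.DataFrame(np.exp(log_dens), index=df_rep.index, columns=cols)


# Three hypothetical two-dimensional representations.
df_rep = pd.DataFrame(
    [[0.0, 0.1], [2.9, 3.0], [0.2, -0.1]],
    index=["s1", "s2", "s3"],
    columns=["latent_dim_1", "latent_dim_2"],
)

# Two hypothetical components with unit variance (log-variance of zero).
means = np.array([[0.0, 0.0], [3.0, 3.0]])
log_vars = np.zeros((2, 2))

# Per representation: all densities, the maximum, and the component reaching it.
df_prob_rep = component_densities(df_rep, means, log_vars)
comp_cols = list(df_prob_rep.columns)
df_prob_rep["max_density"] = df_prob_rep[comp_cols].max(axis=1)
df_prob_rep["max_component"] = df_prob_rep[comp_cols].idxmax(axis=1)

# Per component: the representation with the highest density and that density.
df_prob_comp = pd.DataFrame(
    {
        "representation": df_prob_rep[comp_cols].idxmax(axis=0),
        "density": df_prob_rep[comp_cols].max(axis=0),
    }
)
```

Note that idxmax reports only the first maximum, so this sketch would not list several representations tied for the same component.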

get_representations(df_samples, config_rep)

Find the best representations for a set of samples.

Parameters:

df_samples : pandas.DataFrame

A data frame containing the samples.

config_rep : dict

A dictionary of options for the optimization(s). It varies according to the selected method.

The supported options for all available methods can be found here.

Returns:

df_rep : pandas.DataFrame

A data frame containing the representations.

Here, each row contains a representation, and the columns contain either the values of the representation along the latent space’s dimensions or additional information about the input samples found in the input data frame. Columns containing additional information, if present in the input data frame, will appear last in the data frame.

df_pred_means : pandas.DataFrame

A data frame containing the predicted means of the distributions modelling the genes’ counts for the representations found.

Here, each row contains the predicted means for a given representation, and the columns contain either the mean of a distribution or additional information about the input samples found in the input data frame. Columns containing additional information, if present in the input data frame, will appear last in the data frame.

If the genes’ counts are modelled using negative binomial distributions, the predicted means are scaled by the corresponding distributions’ r-values.

df_pred_r_values : pandas.DataFrame or None

A data frame containing the predicted r-values of the negative binomials for the representations found, if the genes’ counts are modelled by negative binomial distributions.

Here, each row contains the predicted r-values for a given representation, and the columns contain either the r-value of a negative binomial or additional information about the input samples found in the input data frame. Columns containing additional information, if present in the input data frame, will appear last in the data frame.

df_pred_r_values is None if the genes’ counts are modelled by Poisson distributions.

df_time : pandas.DataFrame

A data frame containing data about the CPU and wall clock time used by each epoch (and backpropagation step within each epoch) in each optimization step.

Here, each row represents an epoch of an optimization step, and the columns contain data about the platform where the calculation was run, the number of CPU threads used by the computation, and the CPU and wall clock time used by the entire epoch and by the backpropagation step run inside it.
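The column layout shared by df_rep, df_pred_means, and df_pred_r_values (values first, additional information last) can be illustrated with plain pandas. This is only a sketch of the layout, not of the optimization: the gene IDs, the "tissue" column, the latent column names, and the representation values are all made up.

```python
import pandas as pd

# Genes assumed to be included in the model (example Ensembl IDs).
gene_ids = ["ENSG00000000003", "ENSG00000000419"]

# Hypothetical input samples: gene counts plus one column of extra information.
df_samples = pd.DataFrame(
    {
        "ENSG00000000003": [10, 4],
        "tissue": ["liver", "lung"],
        "ENSG00000000419": [7, 0],
    },
    index=["sample_1", "sample_2"],
)

# Any column that is not a gene is treated as additional information.
extra_cols = [col for col in df_samples.columns if col not in gene_ids]

# Made-up representations, one row per sample.
rep_values = [[0.1, -0.3], [1.2, 0.8]]
df_rep = pd.DataFrame(
    rep_values,
    index=df_samples.index,
    columns=["latent_dim_1", "latent_dim_2"],
)

# Append the additional information so that it appears last.
df_rep = df_rep.join(df_samples[extra_cols])
```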

static rescale_pred_means(df_pred_means, df_pred_r_values)

Rescale the means of the negative binomials modeling the genes’ counts.

Parameters:

df_pred_means : pandas.DataFrame

A data frame containing the predicted scaled means of the negative binomials modeling the genes’ counts.

Here, each row contains the scaled mean for a given representation/sample, and the columns contain either the values of the scaled means or additional information.

The columns containing the scaled means must be named after the corresponding genes’ Ensembl IDs.

df_pred_r_values : pandas.DataFrame

A data frame containing the predicted r-values of the negative binomials modeling the genes’ counts.

Here, each row contains the r-value for a given representation/sample, and the columns contain either the r-values or additional information.

The columns containing the r-values must be named after the corresponding genes’ Ensembl IDs.

Returns:

df_scaled : pandas.DataFrame

A data frame containing the predicted means.

It contains the same columns as the df_pred_means data frame, in the same order in which they appear there.

However, the values in the columns containing the predicted means are scaled back by the corresponding r-values.
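A minimal sketch of such a rescaling with pandas follows. Two points are assumptions rather than statements about bulkDGD: that rescaling means element-wise multiplication of each scaled mean by its r-value, and that gene columns can be recognized by the "ENSG" prefix. The function name rescale_means_sketch and all data values are hypothetical.

```python
import pandas as pd


def rescale_means_sketch(df_pred_means, df_pred_r_values):
    """Multiply the scaled means by the r-values, gene by gene (assumed operation)."""
    # Assumption: gene columns are the ones named after Ensembl gene IDs.
    gene_cols = [c for c in df_pred_means.columns if str(c).startswith("ENSG")]
    df_rescaled = df_pred_means.copy()
    # pandas aligns both frames on row and column labels before multiplying.
    df_rescaled[gene_cols] = df_pred_means[gene_cols] * df_pred_r_values[gene_cols]
    return df_rescaled


df_pred_means = pd.DataFrame(
    {
        "ENSG00000000003": [0.5, 2.0],
        "ENSG00000000419": [1.5, 0.25],
        "tissue": ["liver", "lung"],
    },
    index=["sample_1", "sample_2"],
)
df_pred_r_values = pd.DataFrame(
    {
        "ENSG00000000003": [4.0, 2.0],
        "ENSG00000000419": [2.0, 8.0],
        "tissue": ["liver", "lung"],
    },
    index=["sample_1", "sample_2"],
)

df_scaled = rescale_means_sketch(df_pred_means, df_pred_r_values)
```

The additional-information column is copied through untouched, and the column order of df_pred_means is preserved.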

train(df_train, df_test, config_train, gmm_pth_file='gmm.pth', dec_pth_file='dec.pth', rep_pth_file='rep.pth')

Train the model.

Parameters:

df_train : pandas.DataFrame

A data frame containing the training data.

Each row should contain a unique sample, and each column should either contain a gene’s expression for that sample (if the column is named after the gene’s Ensembl ID) or additional information about the sample.

df_test : pandas.DataFrame

A data frame containing the testing data.

Each row should contain a unique sample, and each column should either contain a gene’s expression for that sample (if the column is named after the gene’s Ensembl ID) or additional information about the sample.

config_train : dict

A dictionary of options for the training.

gmm_pth_file : str, "gmm.pth"

The .pth file in which to save the GMM’s trained parameters (the components’ means, weights, and log-variances).

dec_pth_file : str, "dec.pth"

The .pth file in which to save the decoder’s trained parameters (weights and biases).

rep_pth_file : str, "rep.pth"

The .pth file in which to save the representations found for the training samples.

Returns:

df_loss : pandas.DataFrame

A data frame containing the training losses (for example, the loss for each of the model’s components per epoch and the overall loss per epoch).

df_time : pandas.DataFrame

A data frame containing data about the CPU and wall clock time used by each training epoch (and backpropagation step within each epoch).

Here, each row represents a training epoch, and the columns contain data about the platform where the calculation was run, the number of CPU threads used by the computation, and the CPU and wall clock time used by the entire epoch and by the backpropagation step run inside it.
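To make the distinction between CPU time and wall clock time concrete, here is a small standard-library sketch of per-epoch timing. It is not how bulkDGD records its timings: the row keys, the use of os.cpu_count() as a stand-in for the number of CPU threads, and the placeholder workload are all assumptions.

```python
import os
import platform
import time


def run_timed_epochs(n_epochs, forward_step, backprop_step):
    """Run n_epochs epochs and record CPU and wall clock time for each one."""
    rows = []
    for epoch in range(1, n_epochs + 1):
        epoch_cpu_start = time.process_time()    # CPU time used by this process
        epoch_wall_start = time.perf_counter()   # elapsed real time

        forward_step()

        # Time the backpropagation step separately, inside the epoch.
        bp_cpu_start = time.process_time()
        bp_wall_start = time.perf_counter()
        backprop_step()
        bp_cpu = time.process_time() - bp_cpu_start
        bp_wall = time.perf_counter() - bp_wall_start

        rows.append(
            {
                "platform": platform.platform(),
                "num_threads": os.cpu_count(),   # stand-in for the threads used
                "epoch": epoch,
                "time_tot_cpu": time.process_time() - epoch_cpu_start,
                "time_tot_wall": time.perf_counter() - epoch_wall_start,
                "time_backprop_cpu": bp_cpu,
                "time_backprop_wall": bp_wall,
            }
        )
    return rows


def busy_work():
    """Placeholder workload standing in for a real training step."""
    sum(i * i for i in range(10_000))


timing_rows = run_timed_epochs(3, busy_work, busy_work)
```

Passing timing_rows to pandas.DataFrame would give one row per epoch, similar in spirit to the df_time layout described above.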

property dec: The decoder.

property device: The device where the model is.

property gmm: The Gaussian mixture model.