core.model
This module contains the class implementing the full bulkDGD model (core.model.BulkDGDModel).
- class core.model.BulkDGDModel(input_dim, gmm_options, dec_options, genes_txt_file=None, gmm_pth_file=None, dec_pth_file=None)
Class implementing the full bulkDGD model.
- __init__(input_dim, gmm_options, dec_options, genes_txt_file=None, gmm_pth_file=None, dec_pth_file=None)
Initialize an instance of the class.
The model is initialized on the CPU. To move the model to another device, modify the device property.
- Parameters:
- input_dim
int The dimensionality of the input (= the dimensionality of the representations, of the Gaussian mixture model, and of the first layer of the decoder).
- gmm_options
dict The options for setting up the Gaussian mixture model.
For the available options, refer to the Configuration for creating an instance of the bulkDGD model page.
- dec_options
dict The options for setting up the decoder.
For the available options, refer to the Configuration for creating an instance of the bulkDGD model page.
- genes_txt_file
str A .txt file containing the Ensembl IDs of the genes included in the model.
Training data will be checked to ensure counts are reported for all genes.
The number of output units in the decoder is initialized from the number of genes found in this file.
- gmm_pth_file
str, optional A .pth file with the GMM’s trained parameters (means, weights, and log-variance of the components).
Please ensure that the parameters match the Gaussian mixture model’s structure.
Omit it if the model needs training.
- dec_pth_file
str, optional A .pth file containing the decoder’s trained parameters (weights and biases).
Please ensure that the parameters match the decoder’s architecture.
Omit it if the model needs training.
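As an illustration of how the genes file relates to the decoder's output size and to the check on the data, here is a minimal sketch. It assumes that the file lists one Ensembl ID per line (the file format is not specified on this page). `read_genes` and `missing_genes` are hypothetical helpers written for this example and are not part of bulkDGD.

```python
import os
import tempfile

import pandas as pd


def read_genes(genes_txt_file):
    """Read gene IDs from a text file, assuming one ID per line."""
    with open(genes_txt_file) as f:
        return [line.strip() for line in f if line.strip()]


def missing_genes(df, genes):
    """Return the genes that have no column in the data frame."""
    return [gene for gene in genes if gene not in df.columns]


# Write a toy genes file (the Ensembl IDs are only examples).
with tempfile.TemporaryDirectory() as tmp_dir:
    genes_file = os.path.join(tmp_dir, "genes.txt")
    with open(genes_file, "w") as f:
        f.write("ENSG00000000003\nENSG00000000005\nENSG00000000419\n")
    genes = read_genes(genes_file)

# Under the assumption above, this would be the number of output
# units in the decoder.
n_output_units = len(genes)

# Toy data lacking counts for one of the genes.
df_samples = pd.DataFrame(
    {"ENSG00000000003": [10, 3],
     "ENSG00000000005": [0, 7],
     "tissue": ["liver", "lung"]},
    index=["sample_1", "sample_2"])

# Genes listed in the file but absent from the data.
absent = missing_genes(df_samples, genes)
```

A non-empty `absent` list corresponds to the situation the check on the training data is meant to catch.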
- get_probability_density(df_rep)
Given a set of representations, get the probability density of each component of the Gaussian mixture model for each representation and the representation(s) having the maximum probability density for each component.
- Parameters:
- df_rep
pandas.DataFrame A data frame containing the representations.
- Returns:
- df_prob_rep
pandas.DataFrame A data frame containing the probability densities for each representation, together with the maximum probability density found for the representation and the component for which it is found.
- df_prob_comp
pandas.DataFrame A data frame containing, for each component, the representation(s) having the maximum probability density for the component, together with the probability density for that (those) representation(s).
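The relationship between the two returned data frames can be sketched with plain pandas. The example below starts from a made-up table of densities and does not call bulkDGD. The column names (`max_density`, `max_component`, `component`, `representation`, `density`) are assumptions chosen for illustration and may differ from the ones the method actually produces.

```python
import pandas as pd

# Toy probability densities: rows are representations, columns are
# components of the mixture. All names and values are made up.
df_dens = pd.DataFrame(
    [[0.10, 0.70, 0.20],
     [0.60, 0.30, 0.05],
     [0.20, 0.10, 0.90]],
    index=["rep_1", "rep_2", "rep_3"],
    columns=["comp_0", "comp_1", "comp_2"])

# Per representation: the maximum density and the component where
# it is found.
df_prob_rep = df_dens.copy()
df_prob_rep["max_density"] = df_dens.max(axis=1)
df_prob_rep["max_component"] = df_dens.idxmax(axis=1)

# Per component: the representation(s) with the maximum density.
# Ties are kept, so a component may contribute more than one row.
rows = []
for comp in df_dens.columns:
    max_val = df_dens[comp].max()
    for rep in df_dens.index[df_dens[comp] == max_val]:
        rows.append(
            {"component": comp, "representation": rep, "density": max_val})
df_prob_comp = pd.DataFrame(rows)
```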
- get_representations(df_samples, config_rep)
Find the best representations for a set of samples.
- Parameters:
- Returns:
- df_rep
pandas.DataFrame A data frame containing the representations.
Here, each row contains a representation and the columns contain either the values of the representations along the latent space’s dimensions or additional information about the input samples found in the input data frame. Columns containing additional information, if present in the input data frame, will appear last in the data frame.
- df_pred_means
pandas.DataFrame A data frame containing the predicted means of the distributions modelling the genes’ counts for the representations found.
Here, each row contains the predicted means for a given representation, and the columns contain either the mean of a distribution or additional information about the input samples found in the input data frame. Columns containing additional information, if present in the input data frame, will appear last in the data frame.
If the genes’ counts are modelled using negative binomial distributions, the predicted means are scaled by the corresponding distributions’ r-values.
- df_pred_r_values
pandas.DataFrame or None A data frame containing the predicted r-values of the negative binomials for the representations found, if the genes’ counts are modelled by negative binomial distributions.
Here, each row contains the predicted r-values for a given representation, and the columns contain either the r-value of a negative binomial or additional information about the input samples found in the input data frame. Columns containing additional information, if present in the input data frame, will appear last in the data frame.
df_pred_r_values is None if the genes’ counts are modelled by Poisson distributions.
- df_time
pandas.DataFrame A data frame containing data about the CPU and wall clock time used by each epoch (and backpropagation step within each epoch) in each optimization step.
Here, each row represents an epoch of an optimization step, and the columns contain data about the platform where the calculation was run, the number of CPU threads used by the computation, and the CPU and wall clock time used by the entire epoch and by the backpropagation step run inside it.
- static rescale_pred_means(df_pred_means, df_pred_r_values)
Rescale the means of the negative binomials modeling the genes’ counts.
- Parameters:
- df_pred_means
pandas.DataFrame A data frame containing the predicted scaled means of the negative binomials modeling the genes’ counts.
Here, each row contains the scaled mean for a given representation/sample, and the columns contain either the values of the scaled means or additional information.
The columns containing the scaled means must be named after the corresponding genes’ Ensembl IDs.
- df_pred_r_values
pandas.DataFrame A data frame containing the predicted r-values of the negative binomials modeling the genes’ counts.
Here, each row contains the r-value for a given representation/sample, and the columns contain either the r-values or additional information.
The columns containing the r-values must be named after the corresponding genes’ Ensembl IDs.
- Returns:
- df_scaled
pandas.DataFrame A data frame containing the predicted means.
It contains the same columns as the df_pred_means data frame, in the same order they appear in the df_pred_means data frame.
However, the values in the columns containing the predicted means are scaled back by the corresponding r-values.
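The following sketch shows one way the column handling described above could work. This page does not state the arithmetic of the rescaling, so the element-wise multiplication used here is an assumption made for illustration only. `rescale_sketch` is a hypothetical stand-in, not the library's implementation, and gene columns are recognised by the `ENSG` prefix of human Ensembl gene IDs, which is also an assumption.

```python
import pandas as pd


def rescale_sketch(df_pred_means, df_pred_r_values):
    """Rescale the gene columns, leaving the other columns untouched.

    ASSUMPTION: rescaling is an element-wise multiplication of each
    scaled mean by the corresponding r-value.
    """
    # Gene columns are those found in both data frames whose names
    # look like human Ensembl gene IDs.
    genes = [col for col in df_pred_means.columns
             if col.startswith("ENSG") and col in df_pred_r_values.columns]

    # Copy the input so that the column order and the additional
    # columns are preserved.
    df_scaled = df_pred_means.copy()
    df_scaled[genes] = df_pred_means[genes] * df_pred_r_values[genes]
    return df_scaled


# Toy inputs (values are made up).
df_pred_means = pd.DataFrame(
    {"ENSG00000000003": [0.5, 1.5],
     "ENSG00000000005": [2.0, 0.25],
     "tissue": ["liver", "lung"]},
    index=["sample_1", "sample_2"])
df_pred_r_values = pd.DataFrame(
    {"ENSG00000000003": [2.0, 4.0],
     "ENSG00000000005": [3.0, 8.0],
     "tissue": ["liver", "lung"]},
    index=["sample_1", "sample_2"])

df_scaled = rescale_sketch(df_pred_means, df_pred_r_values)
```

The result keeps the `tissue` column and the column order of `df_pred_means`; only the gene columns change.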
- train(df_train, df_test, config_train, gmm_pth_file='gmm.pth', dec_pth_file='dec.pth', rep_pth_file='rep.pth')
Train the model.
- Parameters:
- df_train
pandas.DataFrame A data frame containing the training data.
Each row should contain a unique sample, and each column should either contain a gene’s expression for that sample (if the column is named after the gene’s Ensembl ID) or additional information about the sample.
- df_test
pandas.DataFrame A data frame containing the testing data.
Each row should contain a unique sample, and each column should either contain a gene’s expression for that sample (if the column is named after the gene’s Ensembl ID) or additional information about the sample.
- config_train
dict A dictionary of options for the training.
- gmm_pth_file
str, "gmm.pth" The .pth file where to save the GMM’s trained parameters (means of the components, weights of the components, and log-variance of the components).
- dec_pth_file
str, "dec.pth" The .pth file where to save the decoder’s trained parameters (weights and biases).
- rep_pth_file
str, "rep.pth" The .pth file where to save the representations found for the training samples.
- Returns:
- df_loss
pandas.DataFrame A data frame containing the losses for the training (such as the loss for each of the model’s components per epoch, the overall loss per epoch, etc.).
- df_time
pandas.DataFrame A data frame containing data about the CPU and wall clock time used by each training epoch (and backpropagation step within each epoch).
Here, each row represents a training epoch, and the columns contain data about the platform where the calculation was run, the number of CPU threads used by the computation, and the CPU and wall clock time used by the entire epoch and by the backpropagation step run inside it.
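The layout expected for df_train and df_test (one unique sample per row, gene columns named after Ensembl IDs, plus optional additional columns) can be checked before training with a sketch like the one below. `check_layout` is a hypothetical helper, not part of bulkDGD. It assumes that samples are identified by the data frame's index and that gene columns can be recognised by the `ENSG` prefix of human Ensembl gene IDs.

```python
import pandas as pd


def check_layout(df, name):
    """Check that samples are unique and split the columns by type."""
    # ASSUMPTION: each sample is identified by its index label.
    if not df.index.is_unique:
        raise ValueError(f"'{name}' contains duplicated samples.")

    # ASSUMPTION: gene columns start with 'ENSG'; every other
    # column holds additional information about the samples.
    gene_cols = [col for col in df.columns if str(col).startswith("ENSG")]
    other_cols = [col for col in df.columns if col not in gene_cols]
    return gene_cols, other_cols


# Toy training and testing data (counts are made up).
df_train = pd.DataFrame(
    {"ENSG00000000003": [10, 3, 8],
     "ENSG00000000005": [0, 7, 2],
     "tissue": ["liver", "lung", "liver"]},
    index=["sample_1", "sample_2", "sample_3"])
df_test = pd.DataFrame(
    {"ENSG00000000003": [5],
     "ENSG00000000005": [1],
     "tissue": ["lung"]},
    index=["sample_4"])

train_genes, train_info = check_layout(df_train, "df_train")
test_genes, test_info = check_layout(df_test, "df_test")

# Both data frames should report counts for the same genes.
same_genes = train_genes == test_genes
```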
- property dec
The decoder.
- property device
The device where the model is.
- property gmm
The Gaussian mixture model.