analysis.dea

Utilities to perform differential expression analysis (DEA).

analysis.dea.get_log2_fold_changes(obs_counts, pred_means)

Get the log2-fold change of the expression of a set of genes.

Parameters:

obs_counts : pandas.Series

The observed gene counts in a single sample.

This is a series whose index contains either the genes’ Ensembl IDs or names of fields containing additional information about the sample.

pred_means : pandas.Series

The predicted means of the distributions modelling the genes’ counts in a single sample.

This is a series whose index contains either the genes’ Ensembl IDs or names of fields containing additional information about the sample.

Returns:

log2_fold_changes : pandas.Series

The log2-fold change associated with each gene in the given sample.

This is a series whose index corresponds to that of obs_counts and pred_means.
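As an illustration of the quantity described above, the following is a minimal sketch and not the library's implementation. It assumes that both series contain only numeric gene entries and that a pseudocount is added to avoid dividing by zero or taking the logarithm of zero. The function name `log2_fold_changes_sketch`, the `pseudocount` parameter, and the gene labels are hypothetical.

```python
import numpy as np
import pandas as pd


def log2_fold_changes_sketch(obs_counts, pred_means, pseudocount=1.0):
    """Log2 ratio of observed counts to predicted means (illustrative only)."""
    # Keep only the genes present in both series, in a shared order.
    obs, pred = obs_counts.align(pred_means, join="inner")
    obs = obs.astype(float)
    pred = pred.astype(float)
    # The pseudocount is an assumption of this sketch, not a documented detail.
    return np.log2((obs + pseudocount) / (pred + pseudocount))


# Placeholder gene labels; real inputs would be indexed by Ensembl IDs.
obs_counts = pd.Series({"gene_a": 7.0, "gene_b": 1.0, "gene_c": 3.0})
pred_means = pd.Series({"gene_a": 3.0, "gene_b": 3.0, "gene_c": 3.0})

lfc = log2_fold_changes_sketch(obs_counts, pred_means)
```

With a pseudocount of 1, a positive value means the gene was observed above its predicted mean, and a negative value means it was observed below it.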

analysis.dea.get_p_values(obs_counts, pred_means, r_values=None, resolution=None, return_pmf_values=False)

Given the observed gene counts and the predicted mean gene counts in a single sample, calculate the p-value associated with the predicted mean of each distribution modelling a gene’s counts by comparing it to the observed count.

Parameters:

obs_counts : pandas.Series

The observed gene counts in a single sample.

This is a series whose index contains either the genes’ Ensembl IDs or names of fields containing additional information about the sample.

pred_means : pandas.Series

The predicted means of the distributions modelling the genes’ counts in a single sample.

This is a series whose index contains either the genes’ Ensembl IDs or names of fields containing additional information about the sample.

If the genes’ counts were modelled using negative binomial distributions, the predicted means are scaled by the corresponding distributions’ r-values.

r_values : pandas.Series, optional

The predicted r-values of the negative binomial distributions modelling the genes’ counts in a single sample, if the genes’ counts were modelled using negative binomial distributions.

This is a series whose index contains either the genes’ Ensembl IDs or names of fields containing additional information about the sample.

If r_values is not provided, it is assumed that the genes’ counts were modelled using Poisson distributions.

resolution : int, optional

How accurate the calculation of the p-values should be.

The resolution sets how finely the probability mass function of each distribution is sampled when summing over it to compute the corresponding p-value.

The higher the resolution, the more accurate (and more computationally expensive) the calculation of the p-values will be.

If not passed, the calculation will be exact.

return_pmf_values : bool, False

Return the points at which the log-probability mass function was evaluated and the corresponding values of the log-probability mass function, together with the p-values.

Set it to True only if you use a low resolution (for instance, 1e3 or lower) or have a lot of RAM available, since the arrays containing the evaluation points and the corresponding values of the log-probability mass function will hold resolution floating-point numbers for each gene.

Returns:

p_values : pandas.Series

A series containing one p-value per gene.

ks : pandas.DataFrame

A data frame containing the count values at which the log-probability mass function was evaluated to compute the p-values.

The data frame has as many rows as the number of genes and as many columns as the number of count values.

This is an empty data frame if return_pmf_values is False.

pmfs : numpy.ndarray

A data frame containing the value of the log-probability mass function for each count value at which it was evaluated.

The data frame has as many rows as the number of genes and as many columns as the number of count values.

This is an empty data frame if return_pmf_values is False.
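To make the idea of summing over a probability mass function concrete, here is a minimal sketch for the Poisson case only. It is not the library's implementation. It assumes that the p-value of a gene is the total probability of all counts that are no more likely than the observed one under the predicted distribution, and that the infinite support can be truncated where the remaining tail mass is negligible. The function name `poisson_p_values_sketch`, the `tail_mass` parameter, and the gene labels are hypothetical; the negative binomial case and the `resolution` option are not covered.

```python
import numpy as np
import pandas as pd
from scipy.stats import poisson


def poisson_p_values_sketch(obs_counts, pred_means, tail_mass=1e-12):
    """Sum the Poisson PMF over outcomes no more likely than the observed one."""
    p_values = {}
    for gene, obs in obs_counts.items():
        mu = float(pred_means[gene])
        # Truncate the support where the upper tail holds about tail_mass.
        k_max = int(max(obs, poisson.ppf(1.0 - tail_mass, mu)))
        ks = np.arange(k_max + 1)
        log_pmf = poisson.logpmf(ks, mu)
        log_pmf_obs = poisson.logpmf(int(obs), mu)
        # Small tolerance so that ties with the observed count are included.
        mask = log_pmf <= log_pmf_obs + 1e-9
        p_values[gene] = min(1.0, float(np.exp(log_pmf[mask]).sum()))
    return pd.Series(p_values, name="p_value")


# Placeholder gene labels; real inputs would be indexed by Ensembl IDs.
obs_counts = pd.Series({"gene_a": 3, "gene_b": 20})
pred_means = pd.Series({"gene_a": 3.0, "gene_b": 3.0})

p_values = poisson_p_values_sketch(obs_counts, pred_means)
```

In this toy input, the count of gene_a sits at the mode of its predicted distribution and gets a p-value close to 1, while gene_b lies far in the upper tail and gets a very small p-value.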

analysis.dea.get_q_values(p_values, alpha=0.05, method='fdr_bh')

Get the q-values associated with a set of p-values.

The q-values are the p-values adjusted for the false discovery rate.

Parameters:

p_values : pandas.Series

The p-values.

alpha : float, 0.05

The family-wise error rate for the calculation of the q-values.

method : str, "fdr_bh"

The method used to adjust the p-values. The available methods are listed in the documentation for statsmodels.stats.multitest.multipletests.

Returns:

q_values : pandas.Series

A series containing the q-values (adjusted p-values).

The index of the series is equal to the index of the input series of p-values.

rejected : pandas.Series

A series of booleans indicating whether the null hypothesis associated with each p-value in the input series was rejected (True) or not (False).

The index of the series is equal to the index of the input series of p-values.
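The default method name, "fdr_bh", denotes the Benjamini-Hochberg procedure in statsmodels. The sketch below hand-rolls that procedure with NumPy to show what the adjustment does; it is an illustration, not the code this function runs, and the function name `bh_q_values_sketch` and the gene labels are hypothetical.

```python
import numpy as np
import pandas as pd


def bh_q_values_sketch(p_values, alpha=0.05):
    """Benjamini-Hochberg adjusted p-values and rejection flags."""
    p = p_values.to_numpy(dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by n / i.
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest p-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(n)
    q[order] = np.clip(scaled, 0.0, 1.0)
    q_values = pd.Series(q, index=p_values.index, name="q_value")
    rejected = pd.Series(q <= alpha, index=p_values.index, name="rejected")
    return q_values, rejected


# Placeholder gene labels; real inputs would be indexed by Ensembl IDs.
p_values = pd.Series(
    {"gene_a": 0.01, "gene_b": 0.04, "gene_c": 0.03, "gene_d": 0.5}
)

q_values, rejected = bh_q_values_sketch(p_values, alpha=0.05)
```

Both returned series keep the index of the input, which mirrors the behaviour documented for q_values and rejected.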

analysis.dea.perform_dea(obs_counts, pred_means, r_values=None, sample_name=None, statistics=['p_values', 'q_values', 'log2_fold_changes'], resolution=None, alpha=0.05, method='fdr_bh')

Perform differential expression analysis (DEA).

Parameters:

obs_counts : pandas.Series

The observed gene counts in a single sample.

This is a series whose index contains either the genes’ Ensembl IDs or names of fields containing additional information about the sample.

pred_means : pandas.Series

The predicted means of the distributions modelling the genes’ counts in a single sample.

This is a series whose index contains either the genes’ Ensembl IDs or names of fields containing additional information about the sample.

If the genes’ counts were modelled using negative binomial distributions, the predicted means are scaled by the corresponding distributions’ r-values.

r_values : pandas.Series, optional

The predicted r-values of the negative binomial distributions modelling the genes’ counts in a single sample, if the genes’ counts were modelled using negative binomial distributions.

This is a series whose index contains either the genes’ Ensembl IDs or names of fields containing additional information about the sample.

If r_values is not provided, it is assumed that the genes’ counts were modelled using Poisson distributions.

sample_name : str, optional

The name of the sample under consideration.

It is returned together with the results of the analysis so that each result can be matched to its sample when the function is run in parallel on multiple samples.

statistics : list, ["p_values", "q_values", "log2_fold_changes"]

The statistics to be computed. By default, all of them will be computed.

resolution : int, optional

How accurate the calculation of the p-values should be.

The resolution sets how finely the probability mass function of each distribution is sampled when summing over it to compute the corresponding p-value.

The higher the resolution, the more accurate (and more computationally expensive) the calculation of the p-values will be.

If not passed, the calculation will be exact.

alpha : float, 0.05

The family-wise error rate for the calculation of the q-values.

method : str, "fdr_bh"

The method used to calculate the q-values (in other words, to adjust the p-values). The available methods are listed in the documentation for statsmodels.stats.multitest.multipletests.

Returns:

df_stats : pandas.DataFrame

A data frame whose rows represent the genes on which the DEA was performed, and whose columns contain the statistics computed (p-values, q-values, log2-fold changes). If not all statistics were computed, the columns corresponding to the missing ones will be empty.

sample_name : str or None

The name of the sample under consideration.
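The reason for returning sample_name alongside the results is easiest to see when several samples are processed concurrently and the results come back in an arbitrary order. The sketch below shows that pattern with the standard library's concurrent.futures. The function `dea_stand_in` is a hypothetical placeholder that only mimics the (data frame, sample name) return shape documented above; it does not call or reproduce perform_dea, and the sample and gene labels are made up.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

import numpy as np
import pandas as pd


def dea_stand_in(obs_counts, pred_means, sample_name=None):
    """Hypothetical placeholder returning (df_stats, sample_name)."""
    # Only a log2-fold change with a pseudocount of 1 (an assumption).
    lfc = np.log2((obs_counts + 1.0) / (pred_means + 1.0))
    return pd.DataFrame({"log2_fold_changes": lfc}), sample_name


# Placeholder samples: (observed counts, predicted means) per sample name.
samples = {
    "sample_1": (
        pd.Series({"gene_a": 7.0, "gene_b": 1.0}),
        pd.Series({"gene_a": 3.0, "gene_b": 3.0}),
    ),
    "sample_2": (
        pd.Series({"gene_a": 3.0, "gene_b": 15.0}),
        pd.Series({"gene_a": 3.0, "gene_b": 3.0}),
    ),
}

results = {}
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [
        pool.submit(dea_stand_in, obs, pred, sample_name=name)
        for name, (obs, pred) in samples.items()
    ]
    # Futures complete in arbitrary order; the returned name identifies each.
    for future in as_completed(futures):
        df_stats, name = future.result()
        results[name] = df_stats
```

Keying the collected results by the returned name avoids relying on the order in which the workers finish.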