analysis.dea
Utilities to perform differential expression analysis (DEA).
- analysis.dea.get_log2_fold_changes(obs_counts, pred_means)
Get the log2-fold change of the expression of a set of genes.
- Parameters:
- obs_counts
pandas.Series The observed gene counts in a single sample.
This is a series whose index contains either the genes’ Ensembl IDs or names of fields containing additional information about the sample.
- pred_means
pandas.Series The predicted means of the distributions modelling the genes’ counts in a single sample.
This is a series whose index contains either the genes’ Ensembl IDs or names of fields containing additional information about the sample.
- Returns:
- log2_fold_changes
pandas.Series The log2-fold change associated with each gene in the given sample.
This is a series whose index corresponds to that of obs_counts and pred_means.
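The reference above does not state the formula, so the following is only a minimal sketch of how a per-gene log2-fold change could be computed from the two series. It assumes that the ratio is observed over predicted, that gene entries can be recognised by an "ENSG" prefix in the index, and that a small pseudocount guards against zeros. The function name sketch_log2_fold_changes, the prefix filter, the pseudocount, and the example IDs are all hypothetical and not part of analysis.dea.

```python
import numpy as np
import pandas as pd


def sketch_log2_fold_changes(obs_counts, pred_means, pseudocount=1e-6):
    # Keep only the entries that look like genes (assumption: Ensembl
    # gene IDs start with "ENSG"); other entries hold sample metadata.
    genes = [i for i in obs_counts.index if str(i).startswith("ENSG")]
    obs = obs_counts.loc[genes].astype(float)
    pred = pred_means.loc[genes].astype(float)
    # Assumed direction of the ratio: observed over predicted.
    return np.log2((obs + pseudocount) / (pred + pseudocount))


# Made-up gene IDs and values, plus one metadata field ("tissue").
obs_counts = pd.Series(
    {"ENSG00000000001": 8, "ENSG00000000002": 3, "tissue": "liver"}
)
pred_means = pd.Series(
    {"ENSG00000000001": 2.0, "ENSG00000000002": 3.0, "tissue": "liver"}
)
log2_fold_changes = sketch_log2_fold_changes(obs_counts, pred_means)
```

Under these assumptions, a gene observed at four times its predicted mean gets a log2-fold change close to 2, and a gene observed exactly at its predicted mean gets 0.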
- analysis.dea.get_p_values(obs_counts, pred_means, r_values=None, resolution=None, return_pmf_values=False)
Given the observed gene counts and the predicted mean gene counts in a single sample, calculate the p-value associated with the predicted mean of each distribution modelling a gene’s counts by comparing it to the observed count.
- Parameters:
- obs_counts
pandas.Series The observed gene counts in a single sample.
This is a series whose index contains either the genes’ Ensembl IDs or names of fields containing additional information about the sample.
- pred_means
pandas.Series The predicted means of the distributions modelling the genes’ counts in a single sample.
This is a series whose index contains either the genes’ Ensembl IDs or names of fields containing additional information about the sample.
If the genes’ counts were modelled using negative binomial distributions, the predicted means are scaled by the corresponding distributions’ r-values.
- r_values
pandas.Series, optional The predicted r-values of the negative binomial distributions modelling the genes’ counts in a single sample, if the genes’ counts were modelled using negative binomial distributions.
This is a series whose index contains either the genes’ Ensembl IDs or names of fields containing additional information about the sample.
If r_values is not provided, it is assumed that the genes’ counts were modelled using Poisson distributions.
- resolution
int, optional How accurate the calculation of the p-values should be.
The resolution corresponds to the coarseness of the sum over the probability mass function of each distribution to compute the corresponding p-value.
The higher the resolution, the more accurate (and more computationally expensive) the calculation of the p-values will be.
If not passed, the calculation will be exact.
- return_pmf_values
bool, False Whether to return the points at which the log-probability mass function was evaluated and the corresponding values of the log-probability mass function, together with the p-values.
Set it to True only if you have a low resolution (for instance, 1e3 or lower) or a lot of RAM available, since the arrays containing the points at which the log-probability mass function was evaluated and the corresponding values of the function will contain resolution floating-point numbers for each gene.
- Returns:
- p_values
pandas.Series A series containing one p-value per gene.
- ks
pandas.DataFrame A data frame containing the count values at which the log-probability mass function was evaluated to compute the p-values.
The data frame has as many rows as the number of genes and as many columns as the number of count values.
This is an empty data frame if return_pmf_values is False.
- pmfs
numpy.ndarray A data frame containing the value of the log-probability mass function for each count value at which it was evaluated.
The data frame has as many rows as the number of genes and as many columns as the number of count values.
This is an empty data frame if return_pmf_values is False.
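The description above leaves the exact definition of the p-value implicit. As a rough illustration only, the sketch below handles the Poisson case (no r_values) and assumes a two-sided p-value defined as the total probability of all counts that are at most as probable as the observed one. The helper names, the truncation of the right tail, and the example values are hypothetical. The sketch always evaluates every integer count, which corresponds to the exact case. A finite resolution would presumably evaluate the function on a coarser grid of counts instead.

```python
import math

import numpy as np
import pandas as pd


def poisson_log_pmf(k, mean):
    # Log-probability of observing k counts under a Poisson
    # distribution with the given (strictly positive) mean.
    return k * math.log(mean) - mean - math.lgamma(k + 1)


def sketch_p_value(obs_count, pred_mean, tail_width=20.0):
    # Evaluate the log-pmf on every integer count up to a point far
    # into the right tail (the truncation point is an assumption).
    far_tail = pred_mean + tail_width * math.sqrt(pred_mean) + tail_width
    upper = int(max(obs_count, far_tail))
    ks = np.arange(upper + 1)
    log_pmf = np.array([poisson_log_pmf(int(k), pred_mean) for k in ks])
    log_pmf_obs = poisson_log_pmf(int(obs_count), pred_mean)
    # Sum the probabilities of all counts that are at most as
    # probable as the observed count (assumed two-sided definition).
    mask = log_pmf <= log_pmf_obs + 1e-12
    return float(min(1.0, np.exp(log_pmf[mask]).sum()))


def sketch_p_values(obs_counts, pred_means):
    # One p-value per gene, indexed like the input series.
    return pd.Series(
        {g: sketch_p_value(obs_counts[g], pred_means[g]) for g in obs_counts.index}
    )


# Made-up gene IDs and values.
obs_counts = pd.Series({"ENSG_A": 5, "ENSG_B": 50})
pred_means = pd.Series({"ENSG_A": 5.5, "ENSG_B": 5.5})
p_values = sketch_p_values(obs_counts, pred_means)
```

In this sketch, a count at the mode of the predicted distribution yields a p-value near 1, while a count far in the tail yields a p-value near 0.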
- analysis.dea.get_q_values(p_values, alpha=0.05, method='fdr_bh')
Get the q-values associated with a set of p-values.
The q-values are the p-values adjusted for the false discovery rate.
- Parameters:
- p_values
pandas.Series The p-values.
- alpha
float, 0.05 The family-wise error rate for the calculation of the q-values.
- method
str, "fdr_bh" The method used to adjust the p-values. The available methods are listed in the documentation for statsmodels.stats.multitest.multipletests.
- Returns:
- q_values
pandas.Series A series containing the q-values (adjusted p-values).
The index of the series is equal to the index of the input series of p-values.
- rejected
pandas.Series A series containing booleans indicating whether a p-value in the input series was rejected (True) or not (False).
The index of the series is equal to the index of the input series of p-values.
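To make the default "fdr_bh" adjustment concrete, here is a small hand-written Benjamini-Hochberg sketch. It is not the code behind get_q_values, which according to the text above relies on the methods available in statsmodels.stats.multitest.multipletests. The function name sketch_q_values and the example p-values are hypothetical, and the procedure is reimplemented with NumPy only so that the example stays self-contained.

```python
import numpy as np
import pandas as pd


def sketch_q_values(p_values, alpha=0.05):
    p = p_values.to_numpy(dtype=float)
    n = p.size
    # Rank the p-values from smallest to largest.
    order = np.argsort(p)
    # Benjamini-Hochberg scaling: p * n / rank.
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest p-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    scaled = np.clip(scaled, 0.0, 1.0)
    # Put the adjusted values back in the original order.
    q = np.empty(n)
    q[order] = scaled
    q_values = pd.Series(q, index=p_values.index)
    rejected = q_values <= alpha
    return q_values, rejected


# Made-up gene IDs and p-values.
p_values = pd.Series(
    [0.01, 0.5, 0.03, 0.005],
    index=["ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D"],
)
q_values, rejected = sketch_q_values(p_values, alpha=0.05)
```

Both returned series keep the index of the input p-values, mirroring the q_values and rejected outputs described above.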
- analysis.dea.perform_dea(obs_counts, pred_means, r_values=None, sample_name=None, statistics=['p_values', 'q_values', 'log2_fold_changes'], resolution=None, alpha=0.05, method='fdr_bh')
Perform differential expression analysis (DEA).
- Parameters:
- obs_counts
pandas.Series The observed gene counts in a single sample.
This is a series whose index contains either the genes’ Ensembl IDs or names of fields containing additional information about the sample.
- pred_means
pandas.Series The predicted means of the distributions modelling the genes’ counts in a single sample.
This is a series whose index contains either the genes’ Ensembl IDs or names of fields containing additional information about the sample.
If the genes’ counts were modelled using negative binomial distributions, the predicted means are scaled by the corresponding distributions’ r-values.
- r_values
pandas.Series, optional The predicted r-values of the negative binomial distributions modelling the genes’ counts in a single sample, if the genes’ counts were modelled using negative binomial distributions.
This is a series whose index contains either the genes’ Ensembl IDs or names of fields containing additional information about the sample.
If r_values is not provided, it is assumed that the genes’ counts were modelled using Poisson distributions.
- sample_name
str, optional The name of the sample under consideration.
It is returned together with the results of the analysis to facilitate the identification of the sample when running the analysis in parallel for multiple samples (i.e., launching the function in parallel on multiple samples).
- statistics
list, {["p_values", "q_values", "log2_fold_changes"]} The statistics to be computed. By default, all of them will be computed.
- resolution
int, optional How accurate the calculation of the p-values should be.
The resolution corresponds to the coarseness of the sum over the probability mass function of each distribution to compute the corresponding p-value.
The higher the resolution, the more accurate (and more computationally expensive) the calculation of the p-values will be.
If not passed, the calculation will be exact.
- alpha
float, 0.05 The family-wise error rate for the calculation of the q-values.
- method
str, "fdr_bh" The method used to calculate the q-values (in other words, to adjust the p-values). The available methods are listed in the documentation for statsmodels.stats.multitest.multipletests.
- Returns:
- df_stats
pandas.DataFrame A data frame whose rows represent the genes on which the DEA was performed, and whose columns contain the statistics computed (p-values, q-values, log2-fold changes). If not all statistics were computed, the columns corresponding to the missing ones will be empty.
- sample_name
str or None The name of the sample under consideration.
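The shape of the returned pair can be illustrated with a short sketch that assembles already-computed statistics into one data frame and hands back the sample name alongside it. This is an assumption about the layout based only on the description above: the helper sketch_assemble_stats is hypothetical, and "empty" columns are represented here as columns filled with NaN.

```python
import pandas as pd

# Column names taken from the default value of the statistics parameter.
ALL_STATISTICS = ["p_values", "q_values", "log2_fold_changes"]


def sketch_assemble_stats(computed, genes, sample_name=None):
    # Start from a frame with one row per gene and one NaN-filled
    # column per possible statistic.
    df_stats = pd.DataFrame(index=genes, columns=ALL_STATISTICS, dtype=float)
    # Fill in only the statistics that were actually computed; the
    # series are aligned to the frame by their index.
    for name, series in computed.items():
        df_stats[name] = series
    # Return the sample name too, so that results collected from
    # parallel runs can be matched to their samples.
    return df_stats, sample_name


# Made-up gene IDs and values; q-values are deliberately left out.
genes = ["ENSG_A", "ENSG_B"]
computed = {
    "p_values": pd.Series([0.01, 0.5], index=genes),
    "log2_fold_changes": pd.Series([2.0, 0.0], index=genes),
}
df_stats, sample_name = sketch_assemble_stats(
    computed, genes, sample_name="sample_1"
)
```

Here the q_values column stays empty because no q-values were supplied, while the other two columns carry the per-gene values.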