recount3 - interacting with Recount3

Utilities to interact with the Recount3 platform and manipulate the data retrieved from it.

recount3.load_samples_batches(samples_file)

Load a file with information about the batches of samples to be downloaded from Recount3.

The file can be either a CSV file or a YAML file.

See the Notes section below for more details about their format.

Parameters:
samples_file : str

The input file.

Returns:
df : pandas.DataFrame

A data frame containing the information parsed from the file.

Notes

CSV file

If the input file is a CSV file, it should contain a comma-separated data frame.

The data frame is expected to have at least two columns:

  • "recount3_project_name", containing the name of the project the samples belong to.

  • "recount3_samples_category", containing the name of the category the samples belong to (a tissue type for GTEx data, a cancer type for TCGA data, and a project code for SRA data).

These three additional columns may also be present:

  • "query_string", containing the query string that should be used to filter each batch of samples by their metadata. The query string is passed to the pandas.DataFrame.query() method.

    If no "query_string" column is present, the samples will not be filtered.

  • "metadata_to_keep", containing a vertical line (|)-separated list of names of metadata columns that will be kept in the final data frames, together with the columns containing gene expression data.

    "recount3_project_name" and "recount3_samples_category" are valid column names, and, if passed, the final data frames will also include them (each data frame will, of course, contain only one repeated value for each of these columns, since it contains samples from a single category of a single project).

    By default, all metadata columns (plus the "recount3_project_name" and "recount3_samples_category" columns) are kept in the final data frames.

  • "metadata_to_drop", containing a vertical line (|)-separated list of names of metadata columns that will be dropped from the final data frames.

    The reserved keyword '_all_' can be used to drop all metadata columns from the final data frame of a specific batch of samples.

    "recount3_project_name" and "recount3_samples_category" are valid column names and, if passed, will result in these columns being dropped.
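The column layout described above can be illustrated with a small, self-contained sketch. It does not use load_samples_batches(); it only parses a hypothetical CSV with the standard library to show how the vertical line (|)-separated cells break down. The sample values, the split_names helper, and the treatment of an empty "query_string" cell as "no filtering" are assumptions made for illustration.

```python
import csv
import io

# Hypothetical file content: the values are illustrative only.
CSV_TEXT = """\
recount3_project_name,recount3_samples_category,query_string,metadata_to_keep,metadata_to_drop
sra,SRP179061,diagnosis == 'Control',tissue|source_name,
gtex,BLOOD,,,_all_
"""


def split_names(cell):
    """Split a '|'-separated cell into a list of column names."""
    return [name.strip() for name in cell.split("|") if name.strip()]


batches = []
for row in csv.DictReader(io.StringIO(CSV_TEXT)):
    batches.append({
        "project": row["recount3_project_name"],
        "category": row["recount3_samples_category"],
        # ASSUMPTION: an empty cell is treated as "do not filter".
        "query_string": row["query_string"] or None,
        "metadata_to_keep": split_names(row["metadata_to_keep"]),
        "metadata_to_drop": split_names(row["metadata_to_drop"]),
    })

for batch in batches:
    print(batch)
```

Note that a query string containing a comma would need to be enclosed in double quotes in the CSV file so that it is read as a single cell.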

YAML file

If the file is a YAML file, it should have the format exemplified below. We recommend using a YAML file over a CSV file when you have several studies for which different filtering conditions should be applied.

# SRA studies - it can be omitted if no SRA studies are
# included.
sra:

  # Conditions applied to all SRA studies.
  all:

    # Which metadata to keep in all studies (if found). It is
    # a vertical line (|)-separated list of names of metadata
    # columns that will be kept in the final data frames,
    # together with the columns containing gene expression
    # data.
    #
    # "recount3_project_name" and "recount3_samples_category"
    # are valid column names, and, if passed, the final data
    # frames will also include them (each data frame will, of
    # course, contain only one repeated value for each of these
    # columns, since it contains samples from a single category
    # of a single project).
    #
    # By default, all metadata columns (plus the
    # "recount3_project_name" and "recount3_samples_category"
    # columns) are kept in the final data frames.
    metadata_to_keep:

      # Keep in all studies.
      - source_name
      ...

    # Which metadata to drop from all studies (if found). It is
    # a vertical line (|)-separated list of names of metadata
    # columns that will be dropped from the final data frames.
    #
    # The reserved keyword '_all_' can be used to drop all
    # metadata columns from the data frames.
    #
    # "recount3_project_name" and "recount3_samples_category"
    # are valid column names and, if passed, will result in
    # these columns being dropped.
    metadata_to_drop:

      # Found in all studies.
      - age
      ...

  # Conditions applied to SRA study SRP179061.
  SRP179061:

    # The query string that should be used to filter each batch
    # of samples by their metadata. The query string is passed
    # to the 'pandas.DataFrame.query()' method for filtering.
    #
    # If no query string is present, the samples will not
    # be filtered.
    query_string: diagnosis == 'Control'

    # Which metadata to keep in this study (if found). It
    # follows the same rules as the 'metadata_to_keep' field
    # in the 'all' section.
    metadata_to_keep:
    - tissue

    # Which metadata to drop from this study (if found). It
    # follows the same rules as the 'metadata_to_drop' field
    # in the 'all' section.
    metadata_to_drop:
    - Sex

# GTEx studies - it can be omitted if no GTEx studies are
# included.
gtex:

  # Same format as for SRA studies - single studies are
  # identified by the tissue type each study refers to.
  ...

# TCGA studies - it can be omitted if no TCGA studies are
# included.
tcga:

  # Same format as for SRA studies - single studies are
  # identified by the cancer type each study refers to.
  ...
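
To make the relationship between the two formats concrete, the sketch below flattens a nested structure shaped like the YAML example into one row per batch, using the same column names as the CSV format. It is not the library's parser: the flatten_batches function is hypothetical, the dictionary is written by hand rather than read from a file, and the rule used to combine the 'all' settings with the study-specific ones (simple concatenation) is an assumption, since the exact rule is not stated here.

```python
# Hypothetical in-memory equivalent of the "sra" section shown above.
config = {
    "sra": {
        "all": {
            "metadata_to_keep": ["source_name"],
            "metadata_to_drop": ["age"],
        },
        "SRP179061": {
            "query_string": "diagnosis == 'Control'",
            "metadata_to_keep": ["tissue"],
            "metadata_to_drop": ["Sex"],
        },
    },
}


def flatten_batches(config):
    """Turn the nested layout into one row per batch of samples."""
    rows = []
    for project_name, studies in config.items():
        shared = studies.get("all", {})
        for samples_category, options in studies.items():
            if samples_category == "all":
                continue
            row = {
                "recount3_project_name": project_name,
                "recount3_samples_category": samples_category,
                "query_string": options.get("query_string"),
            }
            for key in ("metadata_to_keep", "metadata_to_drop"):
                # ASSUMPTION: shared and study-specific names are
                # simply concatenated; the real rule may differ.
                names = shared.get(key, []) + options.get(key, [])
                row[key] = "|".join(names)
            rows.append(row)
    return rows


rows = flatten_batches(config)
for row in rows:
    print(row)
```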

recount3.get_query_string(query_string)

Get the string that will be used to filter the samples according to their metadata.

Parameters:
query_str : str

The query string or the path to a plain text file containing the query string.

Returns:
query_str : str

The query string.
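
The "string or path" behavior described above could be approximated as follows. This is a minimal sketch, not the function's actual code: resolve_query_string is a hypothetical name, and deciding between the two cases by checking whether the argument is an existing file is an assumption.

```python
import os
import tempfile


def resolve_query_string(query_string):
    """Return the query string, reading it from a file if a path is given."""
    # ASSUMPTION: an argument that names an existing file is a path;
    # anything else is taken to be the query string itself.
    if os.path.isfile(query_string):
        with open(query_string, "r", encoding="utf-8") as handle:
            return handle.read().strip()
    return query_string


# Case 1: the query string is passed directly.
inline = resolve_query_string("diagnosis == 'Control'")

# Case 2: the query string is stored in a plain text file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "query.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("diagnosis == 'Control'\n")
    from_file = resolve_query_string(path)

print(inline)
print(from_file)
```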

recount3.get_gene_sums(project_name, samples_category, save_gene_sums=True, wd=None)

Get RNA-seq counts for samples deposited in the Recount3 platform.

Parameters:
project_name : str, {"gtex", "tcga", "sra"}

The name of the project of interest.

samples_category : str

The category of samples requested.

save_gene_sums : bool, default True

If True, save the original RNA-seq data file in the working directory.

The file name will be "{project_name}_{samples_category}_gene_sums.gz".

wd : str, optional

The working directory where the original RNA-seq data file will be saved, if save_gene_sums is True.

If not specified, it will be the current working directory.

Returns:
df_gene_sums : pandas.DataFrame

A data frame containing the RNA-seq counts for the samples associated with the given category.
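
The file naming rule and the working directory default described above can be sketched as below. The gene_sums_path helper is hypothetical and only builds the expected path; it does not download or save anything.

```python
import os


def gene_sums_path(project_name, samples_category, wd=None):
    """Build the path where the RNA-seq data file would be saved."""
    # Fall back to the current working directory when 'wd' is not given.
    wd = os.getcwd() if wd is None else wd
    file_name = f"{project_name}_{samples_category}_gene_sums.gz"
    return os.path.join(wd, file_name)


# 'data' is a hypothetical directory name.
path = gene_sums_path("sra", "SRP179061", wd="data")
default_path = gene_sums_path("sra", "SRP179061")

print(path)
print(default_path)
```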

recount3.get_metadata(project_name, samples_category, save_metadata=True, wd=None)

Get samples’ metadata from the Recount3 platform.

Parameters:
project_name : str, {"gtex", "tcga", "sra"}

The name of the project of interest.

samples_category : str

The category of samples requested.

save_metadata : bool, default True

If True, save the original metadata file in the working directory.

wd : str, optional

The working directory where the original metadata file will be saved, if save_metadata is True.

If not specified, it will be the current working directory.

Returns:
df_metadata : pandas.DataFrame

A data frame containing the metadata for the samples associated with the given category.

Notes

The "recount3_project_name" and the "recount3_samples_category" columns are automatically added to the metadata returned by the function and contain the project_name and samples_category of the samples, respectively.

recount3.merge_gene_sums_and_metadata(df_gene_sums, df_metadata)

Add the metadata to the RNA-seq counts for samples deposited in the Recount3 platform.

Parameters:
df_gene_sums : pandas.DataFrame

The data frame containing the RNA-seq counts for the samples.

df_metadata : pandas.DataFrame

The data frame containing the metadata for the samples.

Returns:
df_merged : pandas.DataFrame

The data frame containing both RNA-seq counts and metadata for the samples.
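
As a rough picture of what the merged data frame looks like, the sketch below joins two toy data frames with plain pandas. It does not call merge_gene_sums_and_metadata(); the sample names, gene names, and the choice of an inner join on a shared sample index are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical counts: one row per sample, one column per gene.
df_gene_sums = pd.DataFrame(
    {"gene_a": [10, 0, 7], "gene_b": [3, 5, 1]},
    index=["sample_1", "sample_2", "sample_3"],
)

# Hypothetical metadata for the same samples, listed in a different order.
df_metadata = pd.DataFrame(
    {
        "diagnosis": ["Control", "Control", "Case"],
        "tissue": ["brain", "brain", "brain"],
    },
    index=["sample_3", "sample_1", "sample_2"],
)

# ASSUMPTION: both data frames are indexed by sample, so the rows can
# be aligned on the index; only samples found in both are kept.
df_merged = df_gene_sums.join(df_metadata, how="inner")

print(df_merged)
```

Because the join aligns rows on the index, the differing row order of the two inputs does not matter.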

recount3.filter_by_metadata(df, query_string)

Filter samples using the associated metadata.

Parameters:
df : pandas.DataFrame

A data frame containing both RNA-seq counts and metadata for a set of samples.

query_string : str

A string to query the data frame with.

Returns:
df_filtered : pandas.DataFrame

The filtered data frame.
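
Since query strings are passed to pandas.DataFrame.query(), their syntax can be tried out directly on a toy data frame. The example below calls pandas itself rather than filter_by_metadata(); the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical merged data frame: one gene column plus two metadata columns.
df = pd.DataFrame(
    {
        "gene_a": [10, 0, 7],
        "diagnosis": ["Control", "Case", "Control"],
        "age": [61, 70, 58],
    },
    index=["sample_1", "sample_2", "sample_3"],
)

# A single condition on one metadata column.
df_controls = df.query("diagnosis == 'Control'")

# Conditions can be combined with 'and' / 'or'.
df_filtered = df.query("diagnosis == 'Control' and age > 60")

print(df_controls)
print(df_filtered)
```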