selene_sdk.interpret

This module provides functions and classes for interpreting models trained with Selene.

heatmap

selene_sdk.interpret.heatmap(score_matrix, mask=None, sequence_type=<class 'selene_sdk.sequences.genome.Genome'>, **kwargs)

Plots the input matrix of scores, generally those produced by an in silico mutagenesis experiment, on a heatmap.

Parameters
  • score_matrix (numpy.ndarray) – An \(L \times N\) array (where \(L\) is the length of the sequence, and \(N\) is the size of the alphabet) containing the scores for each base change at each position.

  • mask (numpy.ndarray, dtype=bool or None, optional) – Default is None. An \(L \times N\) array (where \(L\) is the length of the sequence, and \(N\) is the size of the alphabet) containing True at positions in the heatmap to mask. If None, no masking will occur.

  • sequence_type (class, optional) – Default is selene_sdk.sequences.Genome. The class of sequence that the in silico mutagenesis results are associated with. This is generally a sub-class of selene_sdk.sequences.Sequence.

  • **kwargs (dict) – Keyword arguments to pass to seaborn.heatmap. Some useful ones to remember are:

    • cbar_kws - Keyword arguments to forward to the colorbar.

    • yticklabels - Manipulate the tick labels on the y axis.

    • cbar - If False, hide the colorbar. If True, show the colorbar.

    • cmap - The color map to use for the heatmap.

Returns

The axes containing the heatmap plot.

Return type

matplotlib.pyplot.Axes
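
The layout expected for mask can be illustrated with a minimal sketch. It assumes a DNA alphabet whose columns are ordered A, C, G, T, and it masks the reference base at each position, which is one common choice rather than a requirement. The helper reference_mask is hypothetical and not part of selene_sdk; nested lists stand in for the numpy.ndarray that heatmap expects.

```python
# Assumed column order for a DNA alphabet. Check the alphabet of your
# sequence_type for the real order before relying on this.
ALPHABET = ["A", "C", "G", "T"]


def reference_mask(sequence, alphabet=ALPHABET):
    """Hypothetical helper: build an L x N nested list of booleans.

    Entry [i][j] is True when alphabet[j] is the reference base at
    position i, i.e. the cell that would be hidden in the heatmap.
    """
    return [[base == ref for base in alphabet] for ref in sequence]


mask = reference_mask("ACGT")
for row in mask:
    print(row)
```

Converting the result with numpy.asarray(mask, dtype=bool) gives an array with the \(L \times N\) shape described for mask above.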

ISMResult

class selene_sdk.interpret.ISMResult(data_frame, sequence_type=<class 'selene_sdk.sequences.genome.Genome'>)

Bases: object

An object storing the results of an in silico mutagenesis experiment.

Parameters
  • data_frame (pandas.DataFrame) – The data frame with the results from the in silico mutagenesis experiments.

  • sequence_type (class, optional) – Default is selene_sdk.sequences.Genome. The type of sequence that the in silico mutagenesis results are associated with. This should generally be a subclass of selene_sdk.sequences.Sequence.

Raises
  • ValueError – If the input data frame contains a base not included in the alphabet of sequence_type.

  • Exception – If multiple reference positions are specified in the input data frame.

  • Exception – If the input data does not contain scores for every mutation at every position.

static from_file(input_path, sequence_type=<class 'selene_sdk.sequences.genome.Genome'>)

Loads a selene_sdk.interpret.ISMResult from a pandas.DataFrame stored in a file of comma separated values (CSV).

Parameters
  • input_path (str) – A path to the file of comma separated input values.

  • sequence_type (class, optional) – Default is selene_sdk.sequences.Genome. The type of sequence that the in silico mutagenesis results are associated with. This should generally be a subclass of selene_sdk.sequences.Sequence.

Returns

The in silico mutagenesis results that were stored in the specified input file.

Return type

selene_sdk.interpret.ISMResult

get_score_matrix_for(feature, reference_mask=None, dtype=<class 'numpy.float64'>)

Extracts a feature from the in silico mutagenesis results as a matrix, where the reference base positions hold the value for the reference prediction, and alternative positions hold the results for making a one-base change from the reference base to the specified alternative base.

Parameters
  • feature (str) – The name of the feature to extract as a matrix.

  • reference_mask (float or None, optional) – Default is None. A value to mask the reference entries with. If left as None, then no masking will be performed on the reference positions.

  • dtype (numpy.dtype, optional) – Default is numpy.float64. The data type to use for the returned matrix.

Returns

An \(L \times N\) array (where \(L\) is the sequence length, and \(N\) is the size of the alphabet of sequence_type) that holds the results from the in silico mutagenesis experiment for the specified feature. The elements will be of type dtype.

Return type

numpy.ndarray

Raises

ValueError – If the input data frame contains a base not included in the alphabet of sequence_type.
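
The matrix layout described above can be sketched without the library. This is an illustration under stated assumptions, not the implementation in ISMResult: build_score_matrix is a hypothetical helper, the column order A, C, G, T is assumed, and the inputs (one reference prediction plus a dictionary of per-mutation scores) are a simplification of the data frame that ISMResult actually stores.

```python
ALPHABET = ["A", "C", "G", "T"]  # assumed column order


def build_score_matrix(reference, ref_score, alt_scores, reference_mask=None):
    """Hypothetical helper illustrating the L x N layout.

    reference      -- the unmutated sequence as a string
    ref_score      -- prediction for the unmutated sequence
    alt_scores     -- dict mapping (position, alt_base) to a prediction
    reference_mask -- if not None, written in place of ref_score
    """
    column = {base: j for j, base in enumerate(ALPHABET)}
    matrix = []
    for pos, ref_base in enumerate(reference):
        if ref_base not in column:
            raise ValueError("base {!r} is not in the alphabet".format(ref_base))
        row = [0.0] * len(ALPHABET)
        for base in ALPHABET:
            if base == ref_base:
                # Reference cell: reference prediction, or the mask value.
                row[column[base]] = (
                    ref_score if reference_mask is None else reference_mask
                )
            else:
                # Alternative cell: prediction after the one-base change.
                row[column[base]] = alt_scores[(pos, base)]
        matrix.append(row)
    return matrix


alt_scores = {
    (0, "C"): 0.1, (0, "G"): 0.2, (0, "T"): 0.3,
    (1, "A"): 0.4, (1, "G"): 0.6, (1, "T"): 0.7,
}
unmasked = build_score_matrix("AC", 0.5, alt_scores)
masked = build_score_matrix("AC", 0.5, alt_scores, reference_mask=0.0)
```

In unmasked, the cells for A at position 0 and C at position 1 both hold the reference prediction; in masked, those same cells hold the mask value instead.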

property reference_sequence

The reference sequence that the in silico mutagenesis experiment was performed on.

Returns

The reference sequence (i.e. non-mutated input) as a string of characters.

Return type

str

property sequence_type

The type of underlying sequence. This should generally be a subclass of selene_sdk.sequences.Sequence.

Returns

The type of sequence that the in silico mutagenesis was performed on.

Return type

class

rescale_score_matrix

selene_sdk.interpret.rescale_score_matrix(score_matrix, base_scaling='identity', position_scaling='identity')

Performs base-wise and position-wise scaling of a score matrix for a feature, usually produced from an in silico mutagenesis experiment.

Parameters
  • score_matrix (numpy.ndarray) – An \(L \times N\) matrix containing the scores for each position, where \(L\) is the length of the sequence, and \(N\) is the number of characters in the alphabet.

  • base_scaling ({‘identity’, ‘probability’, ‘max_effect’}) –

    The type of scaling performed on each base at a given position.

    • ‘identity’ - No transformation will be applied to the data.

    • ‘probability’ - The relative sizes of the bases will be the original input probabilities.

    • ‘max_effect’ - The relative sizes of the bases will be the max effect of the original input values.

  • position_scaling ({‘identity’, ‘probability’, ‘max_effect’}) –

    The type of scaling performed on each position.

    • ‘identity’ - No transformation will be applied to the data.

    • ‘probability’ - The sum of values at a position will be equal to the sum of the original input values at that position.

    • ‘max_effect’ - The sum of values at a position will be equal to the sum of the max effect values of the original input values at that position.

Returns

The transformed score matrix.

Return type

numpy.ndarray

Raises

ValueError – If an unsupported base_scaling or position_scaling is entered.

load_variant_abs_diff_scores

selene_sdk.interpret.load_variant_abs_diff_scores(input_path)

Loads the variant data, labels, and feature names from a diff scores file output from variant effect prediction.

Parameters

input_path (str) – Path to the input file.

Returns

  • tuple[0] is the matrix of absolute difference scores. The rows are the variants and the columns are the features for which the model makes predictions.

  • tuple[1] is the list of variant labels. Each tuple contains (chrom, pos, name, ref, alt).

  • tuple[2] is the list of features.

Return type

tuple(np.ndarray, list(tuple(str)), list(str))

variant_diffs_scatter_plot

selene_sdk.interpret.variant_diffs_scatter_plot(data, labels, features, output_path, filter_features=None, labels_sort_fn=<function ordered_variants_and_indices>, nth_percentile=None, hg_reference_version=None, threshold_line=None, auto_open=False)

Displays each variant’s max probability difference across features as a point in a scatter plot. The points in the scatter plot are ordered by the variant chromosome and position by default. Variants can be sorted differently by passing in a new labels_sort_fn.

Parameters
  • data (np.ndarray) – Absolute difference scores for variants across all features that a model predicts. This is the first value in the tuple returned by load_variant_abs_diff_scores.

  • labels (list(tuple(str))) – A list of variant labels. This is the second value in the tuple returned by load_variant_abs_diff_scores.

  • features (list(str)) – A list of the features the model predicts. This is the third value in the tuple returned by load_variant_abs_diff_scores.

  • output_path (str) – Path to output file. Must have ‘.html’ extension.

  • filter_features (types.FunctionType or None, optional) – Default is None. A function that takes in a list(str) of features and returns the list(int) of feature indices over which we would compute the max(probability difference) for each variant. For example, a user may only want to visualize the max probability difference for TF binding features. If None, uses all the features.

  • labels_sort_fn (types.FunctionType, optional) – Default is ordered_variants_and_indices. A function that takes in a list(tuple(str)) of labels corresponding to the rows in data and returns a tuple(list(tuple), list(int)), where the first value is the ordered list of variant labels and the second value is the ordered list of indices for those variant labels. By default, variants are sorted by chromosome and position.

  • nth_percentile (int [0, 100] or None, optional) – Default is None. If nth_percentile is not None, only displays the variants with a max absolute difference score within the nth_percentile of scores.

  • hg_reference_version (str {“hg19”, “hg38”} or None, optional) – Default is None. If hg_reference_version is not None, hovering over a point displays the gene(s) closest to that variant. A variant is considered close to a gene if it lies within the gene interval or near the gene; genes and their coordinates are taken from level 1 and 2 protein-coding genes in GENCODE v28. In the future, we will allow users to specify their own genome file so that this information can be annotated to variants from other organisms, other genome versions, etc.

  • threshold_line (float or None, optional) – Default is None. If threshold_line is not None, draws a horizontal line at the specified threshold. Helps focus the visual on variants above a certain threshold.

  • auto_open (bool, optional) – Default is False. If True, automatically opens a web browser that displays the plotted HTML file.

Returns

The generated Plotly figure.

Return type

plotly.graph_objs.graph_objs.Figure
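
The contract for filter_features, together with the per-variant maximum that the plot displays, can be sketched as follows. The feature names, the scores, and the function tf_binding_indices are all made up for illustration, and nested lists stand in for the np.ndarray of scores; this is not the library's internal code.

```python
def tf_binding_indices(features):
    """Hypothetical filter_features callable.

    Takes the list of feature names and returns the indices of the
    features to keep, here those whose name mentions CTCF.
    """
    return [i for i, name in enumerate(features) if "CTCF" in name]


# Made-up feature names and absolute difference scores
# (rows are variants, columns are features).
features = ["K562|CTCF", "HepG2|DNase", "GM12878|CTCF"]
data = [
    [0.10, 0.90, 0.30],
    [0.05, 0.20, 0.02],
]

keep = tf_binding_indices(features)

# For each variant, take the max score over the kept features only.
max_diffs = [max(row[i] for i in keep) for row in data]
```

A callable of this shape would be passed as filter_features=tf_binding_indices. Note that the first variant's largest score (0.90, for the DNase feature) is ignored because that feature was filtered out.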

ordered_variants_and_indices

selene_sdk.interpret.ordered_variants_and_indices(labels)

Get the ordered variant labels, where the labels are sorted by chromosome and position, and the indices corresponding to the sort.

Parameters

labels (list(tuple(str))) – The list of variant labels. Each label is a tuple of (chrom, pos, name, ref, alt).

Returns

The first value is the ordered list of labels. Each label is a tuple of (chrom, pos, ref, alt). The second value is the ordered list of label indices.

Return type

tuple(list(tuple), list(int))
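
A function with the same input and output shape can be sketched in plain Python. This is a sketch under assumptions rather than the library's implementation: sort_by_chrom_and_pos is a hypothetical name, chromosome names are assumed to carry a "chr" prefix, non-numeric chromosomes are placed after numeric ones, and the returned labels are the input tuples unchanged.

```python
def sort_by_chrom_and_pos(labels):
    """Hypothetical sorter: returns (ordered labels, ordered indices)."""

    def key(item):
        _index, (chrom, pos, _name, _ref, _alt) = item
        name = chrom[3:] if chrom.startswith("chr") else chrom
        # Numeric chromosomes sort by number; others go last (assumption).
        rank = int(name) if name.isdigit() else 1000
        # pos arrives as a string, so convert before comparing.
        return (rank, name, int(pos))

    ordered = sorted(enumerate(labels), key=key)
    indices = [i for i, _label in ordered]
    sorted_labels = [label for _i, label in ordered]
    return sorted_labels, indices


labels = [
    ("chr2", "500", "rs1", "A", "G"),
    ("chr1", "900", "rs2", "C", "T"),
    ("chr1", "100", "rs3", "G", "A"),
    ("chrX", "5", "rs4", "T", "C"),
]
sorted_labels, indices = sort_by_chrom_and_pos(labels)
```

The returned indices can be used to reorder the rows of a score matrix so that they stay aligned with the sorted labels.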

sort_standard_chrs

selene_sdk.interpret.sort_standard_chrs(chrom)

Returns the value on which the standard chromosomes can be sorted.

Parameters

chrom (str) – The chromosome.

Returns

The value on which to sort.

Return type

int
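
The idea of an integer sort key for chromosome names can be sketched as below. The function chrom_sort_key is hypothetical, and the specific values assigned to X, Y, M, and unrecognized names are assumptions made for this illustration; the values returned by sort_standard_chrs itself may differ.

```python
def chrom_sort_key(chrom):
    """Hypothetical sort key mapping a chromosome name to an int."""
    name = chrom[3:] if chrom.startswith("chr") else chrom
    if name.isdigit():
        return int(name)
    # Assumed ordering for the non-numeric standard chromosomes;
    # anything unrecognized sorts last.
    special = {"X": 23, "Y": 24, "M": 25}
    return special.get(name, 26)


chroms = ["chr10", "chrX", "chr2", "chrM", "chr1", "chrY"]
by_key = sorted(chroms, key=chrom_sort_key)
by_string = sorted(chroms)
```

Sorting with an integer key avoids the lexicographic ordering seen in by_string, where "chr10" comes before "chr2".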