selene_sdk.interpret¶

This module provides functions and classes for interpreting modules trained with Selene.

sequence_logo¶

selene_sdk.interpret.sequence_logo(score_matrix, order='value', width=1.0, ax=None, sequence_type=<class 'selene_sdk.sequences.genome.Genome'>, font_properties=None, color_scheme=None, **kwargs)[source]¶

Plots a sequence logo for visualizing motifs.

Parameters

score_matrix (np.ndarray) – An \(L \times N\) array (where \(L\) is the length of the sequence, and \(N\) is the size of the alphabet) containing the scores for each base occuring at each position.
order ({‘alpha’, ‘value’}) – The manner by which to sort the bases stacked at each position in the sequence logo plot.
- ‘alpha’ - Bases go in the order they are found in the sequence alphabet.
- ‘value’ - Bases go in the order of their value, with the largest at the bottom.
width (float, optional) – Default is 1. The width of each character in the plotted logo. A value of 1 will mean that there is no gap between each the characters at each position. A value of 0 will not draw any characters.
ax (matplotlib.pyplot.Axes or None, optional) – Default is None. The axes to plot on. If left as None, a new axis will be created.
sequence_type (class, optional) – Default is selene_sdk.sequences.Genome. The type of sequence that the in silico mutagenesis results are associated with. This should generally be a subclass of selene_sdk.sequences.Sequence.
font_properties (matplotlib.font_manager.FontProperties or None, optional) – Default is None. A matplotlib.font_manager.FontProperties object that specifies the properties of the font to use for plotting the motif. If None, no font will be used, and the text will be rendered by a path. This method of rendering paths is preferred, as it ensures all character heights correspond to the actual values, and that there are no extra gaps between the tops and bottoms of characters at each position in the sequence logo. If the user opts to use a value other than None, then no such guarantee can be made.
color_scheme (list(str) or None, optional) – Default is None. A list containing the hex codes or names of colors to use, appearing in the order of the bases of the sequence type. If left as None, a default palette will be made with seaborn.color_palette, and will have as many colors as there are characters in the input sequence alphabet.

Returns

The axes containing the sequence logo plot.

Return type

matplotlib.pyplot.Axes

Raises

ValueError – If the number of columns in score_matrix does not match the number of characters in the alphabet of sequence_type.
ValueError – If the number of colors in color_palette does not match the number of characters in the alphabet of sequence_type.

Examples

We have included an example of the output from a sequence_logo plot below:

heatmap¶

selene_sdk.interpret.heatmap(score_matrix, mask=None, sequence_type=<class 'selene_sdk.sequences.genome.Genome'>, **kwargs)[source]¶

Plots the input matrix of scores, generally those produced by an in silico mutagenesis experiment, on a heatmap.

Parameters

score_matrix (numpy.ndarray) – An \(L \times N\) array (where \(L\) is the length of the sequence, and \(N\) is the size of the alphabet) containing the scores for each base change at each position.
mask (numpy.ndarray, dtype=bool or None, optional) – Default is None. An \(L \times N\) array (where \(L\) is the length of the sequence, and \(N\) is the size of the alphabet) containing True at positions in the heatmap to mask. If None, no masking will occur.
sequence_type (class, optional) – Default is selene_sdk.sequences.Genome. The class of sequence that the in silico mutagenesis results are associated with. This is generally a sub-class of selene_sdk.sequences.Sequence.
**kwargs (dict) – Keyword arguments to pass to seaborn.heatmap. Some useful ones to remember are:
- cbar_kws - Keyword arguments to forward to the colorbar.
- yticklabels - Manipulate the tick labels on the y axis.
- cbar - If False, hide the color bar. If True, show the colorbar.
- cmap - The color map to use for the heatmap.

Returns

The axes containing the heatmap plot.

Return type

matplotlib.pyplot.Axes

ISMResult¶

class selene_sdk.interpret.ISMResult(data_frame, sequence_type=<class 'selene_sdk.sequences.genome.Genome'>)[source]¶

Bases: object

An object storing the results of an in silico mutagenesis experiment.

Parameters

data_frame (pandas.DataFrame) – The data frame with the results from the in silico mutagenesis experiments.
sequence_type (class, optional) – Default is selene_sdk.sequences.Genome. The type of sequence that the in silico mutagenesis results are associated with. This should generally be a subclass of selene_sdk.sequences.Sequence

Raises

ValueError – If the input data frame contains a base not included in the alphabet of sequence_type.
Exception – If multiple reference positions are specified in the input data frame.
Exception – If the input data does not contain scores for every mutation at every position.

static from_file(input_path, sequence_type=<class 'selene_sdk.sequences.genome.Genome'>)[source]¶

Loads a selene_sdk.interpret.ISMResult from a pandas.DataFrame stored in a file of comma separated values (CSV).

Parameters

input_path (str) – A path to the file of comma separated input values.
sequence_type (class, optional) – Default is selene_sdk.sequences.Genome. The type of sequence that the in silico mutagenesis results are associated with. This should generally be a subclass of selene_sdk.sequences.Sequence.

Returns

The in silico mutagenesis results that were stored in the specified input file.

Return type

selene_sdk.interpret.ISMResult

get_score_matrix_for(feature, reference_mask=None, dtype=<class 'numpy.float64'>)[source]¶

Extracts a feature from the in silico mutagenesis results as a matrix, where the reference base positions hold the value for the reference prediction, and alternative positions hold the results for making a one-base change from the reference base to the specified alternative base.

Parameters

feature (str) – The name of the feature to extract as a matrix.
reference_mask (float or None, optional) – Default is None. A value to mask the reference entries with. If left as None, then no masking will be performed on the reference positions.
dtype (numpy.dtype, optional) – Default is numpy.float64. The data type to use for the returned matrix.

Returns

A \(L \times N\) shaped array (where \(L\) is the sequence length, and \(N\) is the size of the alphabet of sequence_type) that holds the results from the in silico mutagenesis experiment for the specified feature. The elements will be of type dtype.

Return type

numpy.ndarray

Raises

ValueError – If the input data frame contains a base not included in the alphabet of sequence_type.

property reference_sequence¶

The reference sequence that the in silico mutagenesis experiment was performed on.

Returns: The reference sequence (i.e. non-mutated input) as a string of characters.
Return type: str

property sequence_type¶

The type of underlying sequence. This should generally be a subclass of selene_sdk.sequences.Sequence.

Returns: The type of sequence that the in silico mutagenesis was performed on.
Return type: class

rescale_score_matrix¶

selene_sdk.interpret.rescale_score_matrix(score_matrix, base_scaling='identity', position_scaling='identity')[source]¶

Performs base-wise and position-wise scaling of a score matrix for a feature, usually produced from an in silico mutagenesis experiment.

Parameters

score_matrix (numpy.ndarray) – An \(L \times N\) matrix containing the scores for each position, where \(L\) is the length of the sequence, and \(N\) is the number of characters in the alphabet.
base_scaling ({‘identity’, ‘probability’, ‘max_effect’}) –

The type of scaling performed on each base at a given position.
- ‘identity’ - No transformation will be applied to the data.
- ‘probability’ - The relative sizes of the bases will be the original input probabilities.
- ‘max_effect’ - The relative sizes of the bases will be the max effect of the original input values.
position_scaling ({‘identity’, ‘probability’, ‘max_effect’}) –

The type of scaling performed on each position.
- ‘identity’ - No transformation will be applied to the data.
- ‘probability’ - The sum of values at a position will be equal to the sum of the original input values at that position.
- ‘max_effect’ - The sum of values at a position will be equal to the sum of the max effect values of the original input values at that position.

Returns

The transformed score matrix.

Return type

numpy.ndarray

Raises

ValueError – If an unsupported base_scaling or position_scaling is entered.

load_variant_abs_diff_scores¶

selene_sdk.interpret.load_variant_abs_diff_scores(input_path)[source]¶

Loads the variant data, labels, and feature names from a diff scores file output from variant effect prediction.

TODO: should we move this out of vis.py?

Parameters

input_path (str) – Path to the input file.

Returns

tuple[0] is the matrix of absolute difference scores. The rows are the variants and the columns are the features for which the model makes predictions.
tuple[1] is the list of variant labels. Each tuple contains (chrom, pos, name, ref, alt).
tuple[2] is the list of features.

Return type

tuple(np.ndarray, list(tuple(str)), list(str))

variant_diffs_scatter_plot¶

selene_sdk.interpret.variant_diffs_scatter_plot(data, labels, features, output_path, filter_features=None, labels_sort_fn=<function ordered_variants_and_indices>, nth_percentile=None, hg_reference_version=None, threshold_line=None, auto_open=False)[source]¶

Displays each variant’s max probability difference across features as a point in a scatter plot. The points in the scatter plot are ordered by the variant chromosome and position by default. Variants can be sorted differently by passing in a new labels_sort_fn.

Parameters

data (np.ndarray) – Absolute difference scores for variants across all features that a model predicts. This is the first value in the tuple returned by load_variant_abs_diff_scores.
labels (list(tuple(str))) – A list of variant labels. This is the second value in the tuple returned by load_variant_abs_diff_scores.
features (list(str)) – A list of the features the model predicts. This is the third value in the tuple returned by load_variant_abs_diff_scores.
output_path (str) – Path to output file. Must have ‘.html’ extension.
filter_features (types.FunctionType or None, optional) – Default is None. A function that takes in a list(str) of features and returns the list(int) of feature indices over which we would compute the max(probability difference) for each variant. For example, a user may only want to visualize the max probability difference for TF binding features. If None, uses all the features.
labels_sort_fn (types.FunctionType, optional) – Default is ordered_variants_and_indices. A function that takes in a list(tuple(str)) of labels corresponding to the rows in data and returns a tuple(list(tuple), list(int)), where the first value is the ordered list of variant labels and the second value is the ordered list of indices for those variant labels. By default, variants are sorted by chromosome and position.
nth_percentile (int [0, 100] or None, optional) – Default is None. If nth_percentile is not None, only displays the variants with a max absolute difference score within the nth_percentile of scores.
hg_reference_version (str {“hg19”, “hg38”} or None, optional) – Default is None. On hover, we can display the gene(s) closest to each variant if hg_reference_version is not None, where closest can be a variant within a gene interval (where genes and their coordinates are taken from level 1 & 2 protein-coding genes in gencode v28) or near a gene. In the future, we will allow users to specify their own genome file so that this information can be annotated to variants from other organisms, other genome versions, etc.
threshold_line (float or None, optional) – Default is None. If threshold_line is not None, draws a horizontal line at the specified threshold. Helps focus the visual on variants above a certain threshold.
auto_open (bool, optional) – Default is False. If auto_open, will automatically open a web browser that displays the plotted HTML file.

Returns

The generated Plotly figure.

Return type

plotly.graph_objs.graph_objs.Figure

ordered_variants_and_indices¶

selene_sdk.interpret.ordered_variants_and_indices(labels)[source]¶

Get the ordered variant labels, where the labels are sorted by chromosome and position, and the indices corresponding to the sort,

Parameters: labels (list(tuple(str))) – The list of variant labels. Each label is a tuple of (chrom, pos, name, ref, alt).
Returns: The first value is the ordered list of labels. Each label is a tuple of (chrom, pos, ref, alt). The second value is the ordered list of label indices.
Return type: tuple(list(tuple), list(int))

sort_standard_chrs¶

selene_sdk.interpret.sort_standard_chrs(chrom)[source]¶

Returns the value on which the standard chromosomes can be sorted.

Parameters: chrom (str) – The chromosome
Returns: The value on which to sort
Return type: int