selene_sdk.utils

The utils module contains classes and methods that provide more general utilities that are used across the package. Most of this functionality cannot be appropriately confined to just one module, and thus is included here.

NonStrandSpecific

class selene_sdk.utils.NonStrandSpecific(model, mode='mean')[source]

Bases: torch.nn.modules.module.Module

A torch.nn.Module that wraps a user-specified model architecture if the architecture does not need to account for sequence strand-specificity.

Parameters
  • model (torch.nn.Module) – The user-specified model architecture.

  • mode ({‘mean’, ‘max’}, optional) – Default is ‘mean’. NonStrandSpecific will pass the input and the reverse-complement of the input into model. The mode specifies whether we should output the mean or max of the predictions as the non-strand specific prediction.

Variables
  • ~NonStrandSpecific.model (torch.nn.Module) – The user-specified model architecture.

  • ~NonStrandSpecific.mode ({'mean', 'max'}) – How to handle outputting a non-strand specific prediction.
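
The strand-combining behaviour described above can be sketched without PyTorch. This is an illustration under stated assumptions, not Selene's implementation: the input is a one-hot list of positions with column order A, C, G, T (an assumption), and reverse_complement, non_strand_specific, and toy_model are hypothetical names.

```python
def reverse_complement(one_hot):
    """Reverse-complement a one-hot sequence (assumed column order A, C, G, T)."""
    # Reversing the rows reverses the sequence; reversing each row swaps
    # A<->T and C<->G under the assumed column order.
    return [list(reversed(position)) for position in reversed(one_hot)]


def non_strand_specific(model, one_hot, mode="mean"):
    """Combine predictions for a sequence and its reverse complement."""
    if mode not in ("mean", "max"):
        raise ValueError("mode must be 'mean' or 'max'")
    forward = model(one_hot)
    reverse = model(reverse_complement(one_hot))
    if mode == "mean":
        return [(f + r) / 2 for f, r in zip(forward, reverse)]
    return [max(f, r) for f, r in zip(forward, reverse)]


def toy_model(one_hot):
    # Hypothetical stand-in for a trained model with two strand-dependent outputs:
    # "starts with A" and "ends with T".
    return [float(one_hot[0][0]), float(one_hot[-1][3])]


sequence_ac = [[1, 0, 0, 0], [0, 1, 0, 0]]  # "AC"; its reverse complement is "GT"
mean_prediction = non_strand_specific(toy_model, sequence_ac, mode="mean")
max_prediction = non_strand_specific(toy_model, sequence_ac, mode="max")
```

Because both strands are scored, the combined prediction is the same whether the forward sequence or its reverse complement is supplied.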

PerformanceMetrics

class selene_sdk.utils.PerformanceMetrics(get_feature_from_index_fn, report_gt_feature_n_positives=10, metrics={'average_precision': <function average_precision_score>, 'roc_auc': <function roc_auc_score>})[source]

Bases: object

Tracks and calculates metrics to evaluate how closely a model’s predictions match the true values it was designed to predict.

Parameters
  • get_feature_from_index_fn (types.FunctionType) – A function that takes an index (int) and returns a feature name (str).

  • report_gt_feature_n_positives (int, optional) – Default is 10. The minimum number of positive examples for a feature in order to compute the score for it.

  • metrics (dict) – A dictionary that maps metric names (str) to metric functions. By default, this contains “roc_auc”, which maps to sklearn.metrics.roc_auc_score, and “average_precision”, which maps to sklearn.metrics.average_precision_score.

Variables
  • ~PerformanceMetrics.skip_threshold (int) – The minimum number of positive examples of a feature that must be included in an update for a metric score to be calculated for it.

  • ~PerformanceMetrics.get_feature_from_index (types.FunctionType) – A function that takes an index (int) and returns a feature name (str).

  • ~PerformanceMetrics.metrics (dict) – A dictionary that maps metric names (str) to metric objects (Metric). By default, this contains “roc_auc” and “average_precision”.

add_metric(name, metric_fn)[source]

Begins tracking of the specified metric.

Parameters
  • name (str) – The name of the metric.

  • metric_fn (types.FunctionType) – A metric function.

remove_metric(name)[source]

Ends the tracking of the specified metric, and returns the previous scores associated with that metric.

Parameters

name (str) – The name of the metric.

Returns

The list of feature-specific scores obtained by previous uses of the specified metric.

Return type

list(float)

update(prediction, target)[source]

Evaluates the tracked metrics on a model prediction and its target value, and adds this to the metric histories.

Parameters
  • prediction (numpy.ndarray) – Value predicted by user model.

  • target (numpy.ndarray) – True value that the user model was trying to predict.

Returns

A dictionary mapping each metric name (str) to the average score of that metric across all features (float).

Return type

dict
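
The per-feature scoring and averaging that update describes can be sketched in plain Python. This is a simplified illustration, not the PerformanceMetrics implementation: pairwise_auc is a hypothetical stand-in for sklearn.metrics.roc_auc_score, score_features is a hypothetical helper, and the sketch assumes that "gt" in report_gt_feature_n_positives means a feature needs strictly more positives than the threshold, which the text above does not confirm.

```python
def pairwise_auc(scores, labels):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def score_features(prediction, target, metric_fn, n_positives_threshold):
    """Score each feature (column); skip features with too few positives."""
    feature_scores = []
    for j in range(len(target[0])):
        labels = [row[j] for row in target]
        scores = [row[j] for row in prediction]
        n_positives = sum(labels)
        # Assumption: strictly more positives than the threshold are required.
        # Features with no negatives are also skipped, since AUC is undefined.
        if n_positives <= n_positives_threshold or n_positives == len(labels):
            feature_scores.append(None)
        else:
            feature_scores.append(metric_fn(scores, labels))
    valid = [s for s in feature_scores if s is not None]
    average = sum(valid) / len(valid) if valid else None
    return average, feature_scores


# Four examples, two features (illustrative values only).
target = [[1, 0], [1, 0], [0, 1], [0, 0]]
prediction = [[0.9, 0.2], [0.8, 0.1], [0.3, 0.7], [0.1, 0.4]]
average, per_feature = score_features(prediction, target, pairwise_auc, 1)
```

Here the second feature has only one positive example, so it is skipped and recorded as None, mirroring how unevaluated features are reported by write_feature_scores_to_file.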

visualize(prediction, target, output_dir, **kwargs)[source]

Outputs ROC and PR curves. Does not support other metrics currently.

Parameters
  • prediction (numpy.ndarray) – Value predicted by user model.

  • target (numpy.ndarray) – True value that the user model was trying to predict.

  • output_dir (str) – The path to the directory to output the figures. Directories that do not currently exist will be automatically created.

  • **kwargs (dict) – Keyword arguments to pass to each visualization function. Each function accepts the following args:

    • style : str - Default is “seaborn-colorblind”. Specify a style available in matplotlib.pyplot.style.available to use.

    • dpi : int - Default is 500. Specify dots per inch (resolution) of the figure.

Returns

Outputs figures to output_dir.

Return type

None

write_feature_scores_to_file(output_path)[source]

Writes each metric’s score for each feature to a specified file.

Parameters

output_path (str) – The path to the output file where performance metrics will be written.

Returns

A dictionary mapping feature names (str) to sub-dictionaries (dict). Each sub-dictionary then maps metric names (str) to the score for that metric on the given feature. If a metric was not evaluated on a given feature, the score will be None.

Return type

dict

visualize_roc_curves

selene_sdk.utils.visualize_roc_curves(prediction, target, output_dir, report_gt_feature_n_positives=50, style='seaborn-colorblind', fig_title='Feature ROC curves', dpi=500)[source]

Output the ROC curves for each feature predicted by a model as an SVG.

Parameters
  • prediction (numpy.ndarray) – Value predicted by user model.

  • target (numpy.ndarray) – True value that the user model was trying to predict.

  • output_dir (str) – The path to the directory to output the figures. Directories that do not currently exist will be automatically created.

  • report_gt_feature_n_positives (int, optional) – Default is 50. Do not visualize an ROC curve for a feature with fewer than this many positive examples in target.

  • style (str, optional) – Default is “seaborn-colorblind”. Specify a style available in matplotlib.pyplot.style.available to use.

  • fig_title (str, optional) – Default is “Feature ROC curves”. Set the figure title.

  • dpi (int, optional) – Default is 500. Specify dots per inch (resolution) of the figure.

Returns

Outputs the figure in output_dir.

Return type

None

visualize_precision_recall_curves

selene_sdk.utils.visualize_precision_recall_curves(prediction, target, output_dir, report_gt_feature_n_positives=50, style='seaborn-colorblind', fig_title='Feature precision-recall curves', dpi=500)[source]

Output the precision-recall (PR) curves for each feature predicted by a model as an SVG.

Parameters
  • prediction (numpy.ndarray) – Value predicted by user model.

  • target (numpy.ndarray) – True value that the user model was trying to predict.

  • output_dir (str) – The path to the directory to output the figures. Directories that do not currently exist will be automatically created.

  • report_gt_feature_n_positives (int, optional) – Default is 50. Do not visualize a PR curve for a feature with fewer than this many positive examples in target.

  • style (str, optional) – Default is “seaborn-colorblind”. Specify a style available in matplotlib.pyplot.style.available to use.

  • fig_title (str, optional) – Default is “Feature precision-recall curves”. Set the figure title.

  • dpi (int, optional) – Default is 500. Specify dots per inch (resolution) of the figure.

Returns

Outputs the figure in output_dir.

Return type

None

initialize_logger

selene_sdk.utils.initialize_logger(output_path, verbosity=2)[source]

Initializes the logger for Selene. This function can only be called successfully once. If the logger has already been initialized with handlers, the function exits. Otherwise, it proceeds to set the logger configurations.

Parameters
  • output_path (str) – The path to the output file where logs will be written.

  • verbosity (int, {2, 1, 0}) –

    Default is 2. The level of logging verbosity to use.

    • 0 - Only warnings will be logged.

    • 1 - Information and warnings will be logged.

    • 2 - Debug messages, information, and warnings will all be logged.
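
The "initialize once" behaviour and the verbosity levels can be sketched with the standard logging module. This is a minimal sketch, not Selene's code: initialize_logger_sketch, the logger name "selene_sketch", and the log format are all assumptions made for illustration.

```python
import logging
import os
import tempfile


def initialize_logger_sketch(output_path, verbosity=2, name="selene_sketch"):
    """Configure a file logger once; later calls leave it untouched."""
    logger = logging.getLogger(name)
    if logger.handlers:
        # Already initialized: exit without changing the configuration.
        return logger
    levels = {0: logging.WARNING, 1: logging.INFO, 2: logging.DEBUG}
    logger.setLevel(levels[verbosity])
    handler = logging.FileHandler(output_path)
    handler.setFormatter(
        logging.Formatter("%(asctime)s - %(levelname)s - %(message)s"))
    logger.addHandler(handler)
    return logger


log_path = os.path.join(tempfile.mkdtemp(), "sketch.log")
logger = initialize_logger_sketch(log_path, verbosity=1)
# The second call finds existing handlers and changes nothing.
same_logger = initialize_logger_sketch(log_path, verbosity=2)
logger.debug("hidden at verbosity 1")
logger.info("written at verbosity 1")
```

In this sketch the second call's verbosity is ignored, which is one way to read "can only be called successfully once".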

load_features_list

selene_sdk.utils.load_features_list(input_path)[source]

Reads in a file of distinct feature names line-by-line and returns these features as a list. Each feature name in the file must occur on a separate line.

Parameters

input_path (str) – Path to the features file. Each feature in the input file must be on its own line.

Returns

The list of features. The features will appear in the list in the same order they appeared in the file (reading from top to bottom).

Return type

list(str)

Examples

A file at “input_features.txt” for the feature names YFP and YFG might look like this:

YFP
YFG

We can load these features from that file as follows:

>>> load_features_list("input_features.txt")
["YFP", "YFG"]
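
For readers who want to see the reading step itself, here is a minimal sketch of line-by-line loading. It is not the library's implementation: read_features is a hypothetical name, and skipping blank lines is an assumption that the text above does not state.

```python
import os
import tempfile


def read_features(input_path):
    """Return the feature names in a file, one per line, in file order."""
    with open(input_path, "r") as file_handle:
        # Assumption: surrounding whitespace is stripped and blank lines skipped.
        return [line.strip() for line in file_handle if line.strip()]


# Write the example file from above to a temporary directory, then read it back.
features_path = os.path.join(tempfile.mkdtemp(), "input_features.txt")
with open(features_path, "w") as file_handle:
    file_handle.write("YFP\nYFG\n")
features = read_features(features_path)
```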

load_model_from_state_dict

selene_sdk.utils.load_model_from_state_dict(state_dict, model)[source]

Loads model weights that were saved to a file previously by torch.save. This is a helper function to reconcile state dict keys where a model was saved with/without torch.nn.DataParallel and now must be loaded without/with torch.nn.DataParallel.

Parameters
  • state_dict (collections.OrderedDict) – The state of the model.

  • model (torch.nn.Module) – The PyTorch model, a module composed of submodules.

Returns

The model with weights loaded from the state dict.

Return type

torch.nn.Module

Raises

ValueError – If model state dict keys do not match the keys in state_dict.
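
torch.nn.DataParallel prefixes parameter names with "module.", which is why keys saved with the wrapper do not match a model built without it, and vice versa. The key reconciliation can be sketched on plain dictionaries as follows. This is an assumption-laden illustration rather than Selene's code: reconcile_keys is a hypothetical helper that operates on key names only and does not load weights into a model.

```python
from collections import OrderedDict

PREFIX = "module."  # prefix added to parameter names by torch.nn.DataParallel


def reconcile_keys(state_dict, model_keys):
    """Rename state_dict keys to match model_keys by adding or removing PREFIX."""
    model_keys = list(model_keys)
    saved_parallel = all(key.startswith(PREFIX) for key in state_dict)
    wants_parallel = all(key.startswith(PREFIX) for key in model_keys)
    renamed = OrderedDict()
    for key, value in state_dict.items():
        if saved_parallel and not wants_parallel:
            key = key[len(PREFIX):]
        elif wants_parallel and not saved_parallel:
            key = PREFIX + key
        renamed[key] = value
    if set(renamed) != set(model_keys):
        raise ValueError("state_dict keys do not match the model's keys")
    return renamed


# Saved with DataParallel, loaded into a plain model (placeholder values).
saved = OrderedDict([("module.conv.weight", 1), ("module.conv.bias", 2)])
stripped = reconcile_keys(saved, ["conv.weight", "conv.bias"])

# Saved from a plain model, loaded into a DataParallel-wrapped model.
plain = OrderedDict([("conv.weight", 1), ("conv.bias", 2)])
wrapped = reconcile_keys(plain, ["module.conv.weight", "module.conv.bias"])
```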

get_indices_and_probabilities

selene_sdk.utils.get_indices_and_probabilities(interval_lengths, indices)[source]

Given a list of interval lengths and the indices of interest in that list, computes the probability of sampling each index in indices, weighted by the length of the corresponding interval.

Parameters
  • interval_lengths (list(int)) – The list of lengths of intervals that we will draw from. This is used to weight the indices proportionally to interval length.

  • indices (list(int)) – The list of interval length indices to draw from.

Returns

indices, weights – Tuple of interval indices to sample from and the corresponding weights of those intervals.

Return type

tuple(list(int), list(float))
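
The length-proportional weighting can be illustrated with a few lines of plain Python. This is a simplified sketch, not the library function: indices_and_probabilities_sketch is a hypothetical name, the interval lengths are made-up values, and any special handling the real function may apply (for example, to very small weights) is omitted.

```python
import random


def indices_and_probabilities_sketch(interval_lengths, indices):
    """Weight each selected index by its interval's share of the selected length."""
    selected = [interval_lengths[i] for i in indices]
    total = float(sum(selected))
    weights = [length / total for length in selected]
    return indices, weights


interval_lengths = [100, 5000, 300, 200]  # illustrative values only
indices, weights = indices_and_probabilities_sketch(interval_lengths, [0, 2])
# Only intervals 0 and 2 are considered: 100 / 400 and 300 / 400.

# The weights can then drive a weighted draw of one interval index.
rng = random.Random(0)
sampled_index = rng.choices(indices, weights=weights, k=1)[0]
```

Note that the weights are normalized over the selected intervals only, so the length of interval 1 does not affect the result.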

load_path (for config.yml)

selene_sdk.utils.load_path(path, environ=None, instantiate=False, **kwargs)[source]

Convenience function for loading a YAML configuration from a file.

Parameters
  • path (str) – The path to the file to load on disk.

  • environ (dict, optional) – A dictionary used for ${FOO} substitutions in addition to environment variables. If a key appears both in os.environ and this dictionary, the value in this dictionary is used.

  • instantiate (bool, optional) – If False, do not actually instantiate the objects but instead produce a nested hierarchy of _Proxy objects.

  • **kwargs (dict) – Other keyword arguments, all of which are passed to yaml.load.

Returns

graph – The dictionary or object (if the top-level element specified a Python object to instantiate), or a nested hierarchy of _Proxy objects.

Return type

dict or object

Notes

Taken (with minor changes) from Pylearn2.
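
The precedence rule for ${FOO} substitutions (the environ argument wins over os.environ) can be sketched with string.Template from the standard library. This only illustrates the lookup order; it is not how load_path is implemented, and substitute_sketch and the SKETCH_DATA_DIR variable are hypothetical names.

```python
import os
from string import Template


def substitute_sketch(text, environ=None):
    """Expand ${NAME} placeholders, preferring environ over os.environ."""
    mapping = dict(os.environ)
    # Values passed explicitly override environment variables of the same name.
    mapping.update(environ or {})
    return Template(text).substitute(mapping)


os.environ["SKETCH_DATA_DIR"] = "/from/environment"
from_env = substitute_sketch("${SKETCH_DATA_DIR}/train.bed")
overridden = substitute_sketch(
    "${SKETCH_DATA_DIR}/train.bed", environ={"SKETCH_DATA_DIR": "/from/argument"})
```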

instantiate (configuration object)

selene_sdk.utils.instantiate(proxy, bindings=None)[source]

Instantiate a hierarchy of proxy objects.

Parameters
  • proxy (object) – A _Proxy object or list/dict/literal. Strings are run through _preprocess.

  • bindings (dict, optional) – A dictionary mapping previously instantiated _Proxy objects to their instantiated values.

Returns

obj – The result object from recursively instantiating the object DAG.

Return type

object

Notes

Taken (with minor changes) from Pylearn2.
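
The role of bindings, which lets a proxy that appears several times in the graph be instantiated only once, can be sketched as follows. This is a toy model of the idea, not the actual _Proxy machinery: ProxySketch and instantiate_sketch are hypothetical, and strings are returned unchanged here rather than being run through _preprocess.

```python
class ProxySketch:
    """Hypothetical placeholder for an object to build later."""

    def __init__(self, callable_, kwargs):
        self.callable = callable_
        self.kwargs = kwargs


def instantiate_sketch(node, bindings=None):
    """Recursively replace proxies with instances, reusing earlier results."""
    if bindings is None:
        bindings = {}
    if isinstance(node, ProxySketch):
        if node not in bindings:
            kwargs = {key: instantiate_sketch(value, bindings)
                      for key, value in node.kwargs.items()}
            bindings[node] = node.callable(**kwargs)
        return bindings[node]
    if isinstance(node, dict):
        return {key: instantiate_sketch(value, bindings)
                for key, value in node.items()}
    if isinstance(node, list):
        return [instantiate_sketch(value, bindings) for value in node]
    return node  # literals pass through unchanged in this sketch


shared = ProxySketch(list, {})
graph = {
    "first": shared,
    "second": shared,  # same proxy object, so the same instance is reused
    "nested": [ProxySketch(dict, {"x": 1}), "literal"],
}
result = instantiate_sketch(graph)
```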

initialize_model

selene_sdk.utils.initialize_model(model_configs, train=True, lr=None)[source]

Initializes the model and its associated criterion and, if training, its optimizer.

Parameters
  • model_configs (dict) – Model-specific configuration.

  • train (bool, optional) – Default is True. If train, also returns the user-specified optimizer class and optimizer arguments that can be found within the input model file.

  • lr (float or None, optional) – If train, a learning rate must be specified. Otherwise, None.

Returns

model, criterion or model, criterion, optim_class, optim_kwargs

  • torch.nn.Module - the model architecture

  • torch.nn._Loss - the loss function associated with the model

  • torch.optim - the optimizer associated with the model

  • dict - the optimizer arguments

The optimizer and its arguments are only returned if train is True.

Return type

tuple(torch.nn.Module, torch.nn._Loss) or tuple(torch.nn.Module, torch.nn._Loss, torch.optim, dict)

Raises

ValueError – If train but the lr specified is not a float.

parse_configs_and_run

selene_sdk.utils.parse_configs_and_run(configs, create_subdirectory=True, lr=None)[source]

Method to parse the configuration YAML file and run each operation specified.

Parameters
  • configs (dict) – The dictionary of nested configuration parameters. Will look for the following top-level parameters:

    • ops: A list of 1 or more of the values {“train”, “evaluate”, “analyze”}. The operations specified determine what objects and information we expect to parse in order to run these operations. This is required.

    • output_dir: Output directory to use for all the operations. If no output_dir is specified, assumes that all constructors that will be initialized (which have their own configurations in configs) have their own output_dir specified. Optional.

    • random_seed: A random seed set for torch and torch.cuda for reproducibility. Optional.

    • lr: The learning rate, if one of the operations in the list is “train”.

    • load_test_set: If ops: [train, evaluate], you may set this parameter to True if you would like to load the test set into memory ahead of time, and therefore save the test data to a .bed file at the start of training. This is only useful if you have a machine that can support a large increase (on the order of GBs) in memory usage and if you want to create a test dataset early on because you do not know whether your model will finish training and evaluation within the time allotted to your job.

  • create_subdirectory (bool, optional) – Default is True. If create_subdirectory, will create a directory within output_dir with the name formatted as “%Y-%m-%d-%H-%M-%S”, the date/time this method was run.

  • lr (float or None, optional) – Default is None. If “lr” (learning rate) is already specified as a top-level key in configs, there is no need to set lr to a value unless you want to override the value in configs. Otherwise, set lr to the desired learning rate if “train” is one of the operations to be executed.

Returns

Executes the operations listed and outputs any files to the dirs specified in each operation’s configuration.

Return type

None
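
Two of the rules above, the timestamped subdirectory and the precedence of the lr argument over configs["lr"], can be sketched as small helpers. These are hypothetical functions written for illustration only: resolve_output_dir and resolve_lr are not part of selene_sdk, the configuration values are made up, and the sketch computes the path without creating the directory.

```python
import os
from datetime import datetime

# Illustrative top-level configuration, as it might look after loading the YAML.
configs = {
    "ops": ["train", "evaluate"],
    "output_dir": "./outputs",
    "random_seed": 1337,
    "lr": 0.01,
}


def resolve_output_dir(output_dir, create_subdirectory=True, now=None):
    """Return the directory to write to, optionally with a timestamped subdirectory."""
    if output_dir is None or not create_subdirectory:
        return output_dir
    now = now or datetime.now()
    return os.path.join(output_dir, now.strftime("%Y-%m-%d-%H-%M-%S"))


def resolve_lr(configs, lr=None):
    """Prefer an explicitly passed lr; otherwise fall back to configs['lr']."""
    if lr is not None:
        return lr
    return configs.get("lr")


run_dir = resolve_output_dir(configs["output_dir"], now=datetime(2020, 1, 2, 3, 4, 5))
default_lr = resolve_lr(configs)
override_lr = resolve_lr(configs, lr=0.1)
```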

execute (run Selene operations)

selene_sdk.utils.execute(operations, configs, output_dir)[source]

Execute operations in Selene.

Parameters
  • operations (list(str)) – The list of operations to carry out in Selene.

  • configs (dict or object) – The loaded configurations from a YAML file.

  • output_dir (str or None) – The path to the directory where all outputs will be saved. If None, this means that an output_dir was not specified in the top-level configuration keys. output_dir must be specified in each class’s individual configuration wherever it is required.

Returns

Executes the operations listed and outputs any files to the dirs specified in each operation’s configuration.

Return type

None

Raises

ValueError – If an expected key in configuration is missing.