selene_sdk.utils
The utils module contains classes and methods that provide more general utilities that are used across the package. Most of this functionality cannot be appropriately confined to just one module, and thus is included here.
NonStrandSpecific
class selene_sdk.utils.NonStrandSpecific(model, mode='mean')
Bases: torch.nn.modules.module.Module
A torch.nn.Module that wraps a user-specified model architecture if the architecture does not need to account for sequence strand-specificity.
- Parameters
model (torch.nn.Module) – The user-specified model architecture.
mode ({‘mean’, ‘max’}, optional) – Default is ‘mean’. NonStrandSpecific will pass the input and the reverse-complement of the input into model. The mode specifies whether we should output the mean or max of the predictions as the non-strand specific prediction.
- Variables
~NonStrandSpecific.model (torch.nn.Module) – The user-specified model architecture.
~NonStrandSpecific.mode ({'mean', 'max'}) – How to handle outputting a non-strand specific prediction.
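The two modes can be illustrated with a minimal, pure-Python sketch. This is not the selene_sdk implementation: `reverse_complement`, `non_strand_specific`, and `toy_model` are hypothetical names, and the sketch assumes a 4 x L one-hot encoding whose rows are ordered A, C, G, T, so that flipping both axes yields the reverse complement.

```python
# Illustrative stand-in, not selene_sdk code.
# A sequence is a 4 x L one-hot matrix; rows are ASSUMED to be ordered A, C, G, T.

def reverse_complement(one_hot):
    """Reverse the positions and swap A<->T, C<->G by flipping both axes."""
    return [row[::-1] for row in one_hot[::-1]]


def non_strand_specific(model, one_hot, mode="mean"):
    """Combine a model's predictions on the input and its reverse complement."""
    forward = model(one_hot)
    reverse = model(reverse_complement(one_hot))
    if mode == "mean":
        return [(f + r) / 2 for f, r in zip(forward, reverse)]
    if mode == "max":
        return [max(f, r) for f, r in zip(forward, reverse)]
    raise ValueError("mode must be 'mean' or 'max'")


def toy_model(one_hot):
    """Hypothetical strand-sensitive model: 1.0 if position 0 is an A, else 0.0."""
    return [float(one_hot[0][0])]


# "AC" as one-hot: rows are A, C, G, T; columns are positions.
seq_ac = [[1, 0], [0, 1], [0, 0], [0, 0]]
mean_pred = non_strand_specific(toy_model, seq_ac, mode="mean")
max_pred = non_strand_specific(toy_model, seq_ac, mode="max")
```

The toy model scores the forward strand ("AC") as 1.0 and the reverse complement ("GT") as 0.0, so the two modes give different combined predictions for the same input.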
PerformanceMetrics
class selene_sdk.utils.PerformanceMetrics(get_feature_from_index_fn, report_gt_feature_n_positives=10, metrics={'average_precision': <function average_precision_score>, 'roc_auc': <function roc_auc_score>})
Bases: object
Tracks and calculates metrics to evaluate how closely a model’s predictions match the true values it was designed to predict.
- Parameters
get_feature_from_index_fn (types.FunctionType) – A function that takes an index (int) and returns a feature name (str).
report_gt_feature_n_positives (int, optional) – Default is 10. The minimum number of positive examples for a feature in order to compute the score for it.
metrics (dict) – A dictionary that maps metric names (str) to metric functions. By default, this contains “roc_auc”, which maps to sklearn.metrics.roc_auc_score, and “average_precision”, which maps to sklearn.metrics.average_precision_score.
- Variables
~PerformanceMetrics.skip_threshold (int) – The minimum number of positive examples of a feature that must be included in an update for a metric score to be calculated for it.
~PerformanceMetrics.get_feature_from_index (types.FunctionType) – A function that takes an index (int) and returns a feature name (str).
~PerformanceMetrics.metrics (dict) – A dictionary that maps metric names (str) to metric objects (Metric). By default, this contains “roc_auc” and “average_precision”.
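The role of skip_threshold can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: `average_metric` and `accuracy` are hypothetical helpers, the comparison treats the threshold as an inclusive minimum (as the description above words it), and a simple accuracy function stands in for the scikit-learn metrics.

```python
def average_metric(prediction, target, metric_fn, min_positives=10):
    """Score each feature (column) separately, skipping sparse features.

    prediction and target are lists of rows with one column per feature.
    Returns (average score, per-feature scores with None for skipped ones).
    """
    n_features = len(target[0])
    scores = []
    for j in range(n_features):
        y_true = [row[j] for row in target]
        y_pred = [row[j] for row in prediction]
        # ASSUMPTION: min_positives is an inclusive minimum.
        if sum(y_true) < min_positives:
            scores.append(None)
            continue
        scores.append(metric_fn(y_true, y_pred))
    kept = [s for s in scores if s is not None]
    average = sum(kept) / len(kept) if kept else None
    return average, scores


def accuracy(y_true, y_pred):
    """Hypothetical metric with the (y_true, y_score) call shape."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if bool(t) == (p >= 0.5))
    return hits / len(y_true)


target = [[1, 0], [1, 0], [0, 1], [1, 0]]
prediction = [[0.9, 0.2], [0.8, 0.1], [0.3, 0.7], [0.4, 0.6]]
average, per_feature = average_metric(prediction, target, accuracy, min_positives=2)
```

Here the second feature has only one positive example, so it is skipped and reported as None, and the average is taken over the first feature alone.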
add_metric(name, metric_fn)
Begins tracking of the specified metric.
- Parameters
name (str) – The name of the metric.
metric_fn (types.FunctionType) – A metric function.
remove_metric(name)
Ends the tracking of the specified metric, and returns the previous scores associated with that metric.
update(prediction, target)
Evaluates the tracked metrics on a model prediction and its target value, and adds this to the metric histories.
- Parameters
prediction (numpy.ndarray) – Value predicted by user model.
target (numpy.ndarray) – True value that the user model was trying to predict.
- Returns
A dictionary mapping each metric name (str) to the average score of that metric across all features (float).
- Return type
dict
visualize(prediction, target, output_dir, **kwargs)
Outputs ROC and PR curves. Does not support other metrics currently.
- Parameters
prediction (numpy.ndarray) – Value predicted by user model.
target (numpy.ndarray) – True value that the user model was trying to predict.
output_dir (str) – The path to the directory to output the figures. Directories that do not currently exist will be automatically created.
**kwargs (dict) – Keyword arguments to pass to each visualization function. Each function accepts the following args:
style : str - Default is “seaborn-colorblind”. Specify a style available in matplotlib.pyplot.style.available to use.
dpi : int - Default is 500. Specify dots per inch (resolution) of the figure.
- Returns
Outputs figures to output_dir.
- Return type
None
write_feature_scores_to_file(output_path)
Writes each metric’s score for each feature to a specified file.
- Parameters
output_path (str) – The path to the output file where performance metrics will be written.
- Returns
A dictionary mapping feature names (str) to sub-dictionaries (dict). Each sub-dictionary then maps metric names (str) to the score for that metric on the given feature. If a metric was not evaluated on a given feature, the score will be None.
- Return type
dict
selene_sdk.utils.visualize_roc_curves(prediction, target, output_dir, report_gt_feature_n_positives=50, style='seaborn-colorblind', fig_title='Feature ROC curves', dpi=500)
Output the ROC curves for each feature predicted by a model as an SVG.
- Parameters
prediction (numpy.ndarray) – Value predicted by user model.
target (numpy.ndarray) – True value that the user model was trying to predict.
output_dir (str) – The path to the directory to output the figures. Directories that do not currently exist will be automatically created.
report_gt_feature_n_positives (int, optional) – Default is 50. Do not visualize an ROC curve for a feature with fewer than this many positive examples in target.
style (str, optional) – Default is “seaborn-colorblind”. Specify a style available in matplotlib.pyplot.style.available to use.
fig_title (str, optional) – Default is “Feature ROC curves”. Set the figure title.
dpi (int, optional) – Default is 500. Specify dots per inch (resolution) of the figure.
- Returns
Outputs the figure in output_dir.
- Return type
None
selene_sdk.utils.visualize_precision_recall_curves(prediction, target, output_dir, report_gt_feature_n_positives=50, style='seaborn-colorblind', fig_title='Feature precision-recall curves', dpi=500)
Output the precision-recall (PR) curves for each feature predicted by a model as an SVG.
- Parameters
prediction (numpy.ndarray) – Value predicted by user model.
target (numpy.ndarray) – True value that the user model was trying to predict.
output_dir (str) – The path to the directory to output the figures. Directories that do not currently exist will be automatically created.
report_gt_feature_n_positives (int, optional) – Default is 50. Do not visualize a PR curve for a feature with fewer than this many positive examples in target.
style (str, optional) – Default is “seaborn-colorblind”. Specify a style available in matplotlib.pyplot.style.available to use.
fig_title (str, optional) – Default is “Feature precision-recall curves”. Set the figure title.
dpi (int, optional) – Default is 500. Specify dots per inch (resolution) of the figure.
- Returns
Outputs the figure in output_dir.
- Return type
None
initialize_logger
selene_sdk.utils.initialize_logger(output_path, verbosity=2)
Initializes the logger for Selene. This function can only be called successfully once. If the logger has already been initialized with handlers, the function exits. Otherwise, it proceeds to set the logger configurations.
- Parameters
output_path (str) – The path to the output file where logs will be written.
verbosity (int, {2, 1, 0}) –
Default is 2. The level of logging verbosity to use.
0 - Only warnings will be logged.
1 - Information and warnings will be logged.
2 - Debug messages, information, and warnings will all be logged.
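The call-once behavior and the verbosity levels described above can be sketched with the standard logging module. This is a stand-in, not Selene's implementation: the function name `init_logger_sketch`, the logger name "selene_sketch", the log format, and the returned logger are all assumptions made for illustration.

```python
import logging
import os
import tempfile


def init_logger_sketch(output_path, verbosity=2, name="selene_sketch"):
    """Configure a file logger once; later calls leave it untouched."""
    logger = logging.getLogger(name)
    if logger.handlers:
        # Already initialized with handlers: exit without reconfiguring.
        return logger
    levels = {0: logging.WARNING, 1: logging.INFO, 2: logging.DEBUG}
    logger.setLevel(levels[verbosity])
    handler = logging.FileHandler(output_path)
    handler.setFormatter(
        logging.Formatter("%(asctime)s - %(levelname)s - %(message)s"))
    logger.addHandler(handler)
    return logger


log_path = os.path.join(tempfile.mkdtemp(), "sketch.log")
logger = init_logger_sketch(log_path, verbosity=1)
# The second call has no effect because handlers are already attached.
same_logger = init_logger_sketch(log_path, verbosity=2)
```

Checking for existing handlers is what makes a repeated call harmless: the level chosen by the first call stays in force.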
load_features_list
selene_sdk.utils.load_features_list(input_path)
Reads in a file of distinct feature names line-by-line and returns these features as a list. Each feature name in the file must occur on a separate line.
- Parameters
input_path (str) – Path to the features file. Each feature in the input file must be on its own line.
- Returns
The list of features. The features will appear in the list in the same order they appeared in the file (reading from top to bottom).
- Return type
list(str)
Examples
A file at “input_features.txt” for the feature names YFP and YFG might look like this:
YFP
YFG
We can load these features from that file as follows:
>>> load_features_list("input_features.txt")
["YFP", "YFG"]
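For readers without Selene installed, the same behavior can be approximated with a few lines of standard-library Python. `read_features` is a hypothetical stand-in, and skipping blank lines is an assumption of this sketch rather than documented behavior.

```python
import os
import tempfile


def read_features(input_path):
    """Return the feature names in input_path, one per line, in file order."""
    with open(input_path, "r") as file_handle:
        # ASSUMPTION: blank lines are ignored.
        return [line.strip() for line in file_handle if line.strip()]


# Write a small features file, then read it back.
features_path = os.path.join(tempfile.mkdtemp(), "input_features.txt")
with open(features_path, "w") as file_handle:
    file_handle.write("YFP\nYFG\n")
features = read_features(features_path)
```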
load_model_from_state_dict
selene_sdk.utils.load_model_from_state_dict(state_dict, model)
Loads model weights that were saved to a file previously by torch.save. This is a helper function to reconcile state dict keys where a model was saved with/without torch.nn.DataParallel and now must be loaded without/with torch.nn.DataParallel.
- Parameters
state_dict (collections.OrderedDict) – The state of the model.
model (torch.nn.Module) – The PyTorch model, a module composed of submodules.
- Returns
The model with weights loaded from the state dict.
- Return type
torch.nn.Module
- Raises
ValueError – If model state dict keys do not match the keys in state_dict.
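The key mismatch arises because torch.nn.DataParallel stores the wrapped model under an attribute named module, so its state dict keys carry a "module." prefix. The sketch below shows one way such keys could be reconciled using plain dictionaries; `reconcile_keys` is a hypothetical helper and the prefix-based matching is an assumption, not necessarily how Selene does it.

```python
from collections import OrderedDict


def reconcile_keys(state_dict, model_keys):
    """Rename state_dict keys to match model_keys by adding or removing
    the 'module.' prefix, and raise ValueError if they still differ."""
    prefix = "module."
    model_wrapped = all(key.startswith(prefix) for key in model_keys)
    renamed = OrderedDict()
    for key, value in state_dict.items():
        saved_wrapped = key.startswith(prefix)
        if saved_wrapped and not model_wrapped:
            key = key[len(prefix):]   # saved with DataParallel, loading without
        elif model_wrapped and not saved_wrapped:
            key = prefix + key        # saved without DataParallel, loading with
        renamed[key] = value
    if set(renamed) != set(model_keys):
        raise ValueError("state dict keys do not match the model's keys")
    return renamed


# Placeholder floats stand in for weight tensors.
saved = OrderedDict([("module.conv.weight", 1.0), ("module.conv.bias", 0.0)])
unwrapped = reconcile_keys(saved, ["conv.weight", "conv.bias"])
rewrapped = reconcile_keys(unwrapped, ["module.conv.weight", "module.conv.bias"])
```

With real models, the renamed dictionary would then be passed to the model's load_state_dict method.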
get_indices_and_probabilities
selene_sdk.utils.get_indices_and_probabilities(interval_lengths, indices)
Given a list of different interval lengths and the indices of interest in that list, weight the probability that we will sample one of the indices in indices based on the interval lengths in that sublist.
- Parameters
interval_lengths (list(int)) – The list of lengths of intervals that we will draw from. This is used to weight the indices proportionally to interval length.
indices (list(int)) – The list of interval length indices to draw from.
- Returns
indices, weights – weights of those intervals.
- Return type
tuple(list(int), list(float)) Tuple of interval indices to sample from and the corresponding
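The length-proportional weighting can be written out in a few lines. This is a sketch of the idea only: `indices_and_weights` is a hypothetical name, and any additional handling the library may apply (for example, to intervals with negligible weight) is not reproduced here.

```python
import random


def indices_and_weights(interval_lengths, indices):
    """Weight each selected index by its interval length, normalized to 1."""
    selected = [interval_lengths[i] for i in indices]
    total = float(sum(selected))
    weights = [length / total for length in selected]
    return indices, weights


# Only intervals 0 and 2 are of interest; their lengths are 10 and 60.
idx, weights = indices_and_weights([10, 30, 60], [0, 2])

# The weights can then drive sampling, so longer intervals are drawn more often.
draws = random.choices(idx, weights=weights, k=5)
```

In this example the weights are 10/70 and 60/70, so index 2 is sampled six times as often as index 0 on average.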
load_path (for config.yml)
selene_sdk.utils.load_path(path, environ=None, instantiate=False, **kwargs)
Convenience function for loading a YAML configuration from a file.
- Parameters
path (str) – The path to the file to load on disk.
environ (dict, optional) – A dictionary used for ${FOO} substitutions in addition to environment variables. If a key appears both in os.environ and this dictionary, the value in this dictionary is used.
instantiate (bool, optional) – If False, do not actually instantiate the objects but instead produce a nested hierarchy of _Proxy objects.
**kwargs (dict) – Other keyword arguments, all of which are passed to yaml.load.
- Returns
graph – The dictionary or object (if the top-level element specified a Python object to instantiate), or a nested hierarchy of _Proxy objects.
- Return type
dict or object
Notes
Taken (with minor changes) from Pylearn2.
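The precedence rule for ${FOO} substitutions (values in environ win over os.environ) can be sketched with string.Template from the standard library. This is not the parser load_path uses: `substitute` and the variable SKETCH_DATA_DIR are hypothetical, and string.Template also expands the unbraced $FOO form, which the real function may not.

```python
import os
from string import Template


def substitute(text, environ=None):
    """Expand ${NAME} placeholders; entries in environ override os.environ."""
    values = dict(os.environ)
    values.update(environ or {})
    # safe_substitute leaves unknown placeholders unchanged instead of raising.
    return Template(text).safe_substitute(values)


os.environ["SKETCH_DATA_DIR"] = "/from/os/environ"
line = "path: ${SKETCH_DATA_DIR}/features.txt"
from_env = substitute(line)
overridden = substitute(line, {"SKETCH_DATA_DIR": "/from/dict"})
```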
instantiate (configuration object)
selene_sdk.utils.instantiate(proxy, bindings=None)
Instantiate a hierarchy of proxy objects.
- Parameters
proxy (object) – A _Proxy object or list/dict/literal. Strings are run through _preprocess.
bindings (dict, optional) – A dictionary mapping previously instantiated _Proxy objects to their instantiated values.
- Returns
obj – The result object from recursively instantiating the object DAG.
- Return type
object
Notes
Taken (with minor changes) from Pylearn2.
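Recursive instantiation of a proxy hierarchy with a bindings cache might look like the sketch below. Everything here is hypothetical: `Proxy` is a simplified stand-in for _Proxy (a callable plus keyword arguments), bindings is keyed by id() for simplicity, and the string preprocessing step is omitted.

```python
class Proxy:
    """Hypothetical stand-in for _Proxy: a callable plus keyword arguments."""

    def __init__(self, callable_, keywords):
        self.callable = callable_
        self.keywords = keywords


def instantiate_sketch(node, bindings=None):
    """Recursively build the objects described by a proxy hierarchy."""
    if bindings is None:
        bindings = {}
    if isinstance(node, Proxy):
        # A proxy shared by several parents is built once and then reused.
        if id(node) not in bindings:
            kwargs = {key: instantiate_sketch(value, bindings)
                      for key, value in node.keywords.items()}
            bindings[id(node)] = node.callable(**kwargs)
        return bindings[id(node)]
    if isinstance(node, dict):
        return {key: instantiate_sketch(value, bindings)
                for key, value in node.items()}
    if isinstance(node, list):
        return [instantiate_sketch(value, bindings) for value in node]
    return node  # literals pass through unchanged


class Layer:
    def __init__(self, width):
        self.width = width


class Net:
    def __init__(self, first, second):
        self.first = first
        self.second = second


shared = Proxy(Layer, {"width": 8})
net = instantiate_sketch(Proxy(Net, {"first": shared, "second": shared}))
```

Because the same proxy appears twice in the graph, the cache ensures that both attributes of the resulting object refer to a single Layer instance, which is what makes the hierarchy a DAG rather than a tree.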
initialize_model
selene_sdk.utils.initialize_model(model_configs, train=True, lr=None)
Initialize model (and associated criterion, optimizer)
- Parameters
model_configs (dict) – Model-specific configuration
train (bool, optional) – Default is True. If True, also returns the user-specified optimizer class and optimizer arguments found within the input model file.
lr (float or None, optional) – If train, a learning rate must be specified. Otherwise, None.
- Returns
model, criterion or model, criterion, optim_class, optim_kwargs –
torch.nn.Module - the model architecture
torch.nn._Loss - the loss function associated with the model
torch.optim - the optimizer associated with the model
dict - the optimizer arguments
The optimizer and its arguments are only returned if train is True.
- Return type
tuple(torch.nn.Module, torch.nn._Loss) or tuple(torch.nn.Module, torch.nn._Loss, torch.optim, dict)
- Raises
ValueError – If train but the lr specified is not a float.
parse_configs_and_run
selene_sdk.utils.parse_configs_and_run(configs, create_subdirectory=True, lr=None)
Method to parse the configuration YAML file and run each operation specified.
- Parameters
configs (dict) – The dictionary of nested configuration parameters. Will look for the following top-level parameters:
ops: A list of 1 or more of the values {“train”, “evaluate”, “analyze”}. The operations specified determine what objects and information we expect to parse in order to run these operations. This is required.
output_dir: Output directory to use for all the operations. If no output_dir is specified, assumes that all constructors that will be initialized (which have their own configurations in configs) have their own output_dir specified. Optional.
random_seed: A random seed set for torch and torch.cuda for reproducibility. Optional.
lr: The learning rate, if one of the operations in the list is “train”.
load_test_set: If ops: [train, evaluate], you may set this parameter to True to load the test set into memory ahead of time, and therefore save the test data to a .bed file at the start of training. This is only useful if your machine can support a large increase (on the order of GBs) in memory usage and you want to create the test dataset early on because you do not know whether your model will finish training and evaluation within the time allotted to your job.
create_subdirectory (bool, optional) – Default is True. If create_subdirectory, will create a directory within output_dir with the name formatted as “%Y-%m-%d-%H-%M-%S”, the date/time this method was run.
lr (float or None, optional) – Default is None. If “lr” (learning rate) is already specified as a top-level key in configs, there is no need to set lr to a value unless you want to override the value in configs. Otherwise, set lr to the desired learning rate if “train” is one of the operations to be executed.
- Returns
Executes the operations listed and outputs any files to the dirs specified in each operation’s configuration.
- Return type
None
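How create_subdirectory and lr interact with the top-level configuration keys can be sketched as follows. `resolve_run_settings` is a hypothetical helper written for this illustration; in particular, raising ValueError when "train" is requested without any learning rate is an assumption of the sketch, not documented behavior of parse_configs_and_run.

```python
import os
import tempfile
from time import strftime


def resolve_run_settings(configs, create_subdirectory=True, lr=None):
    """Work out the output directory and learning rate for a run."""
    output_dir = configs.get("output_dir")
    if output_dir is not None and create_subdirectory:
        # Timestamped subdirectory, formatted as described above.
        output_dir = os.path.join(output_dir, strftime("%Y-%m-%d-%H-%M-%S"))
    if output_dir is not None:
        os.makedirs(output_dir, exist_ok=True)
    # An explicit lr argument overrides the "lr" key in configs.
    learning_rate = lr if lr is not None else configs.get("lr")
    if "train" in configs["ops"] and learning_rate is None:
        # ASSUMPTION: fail early when no learning rate is available.
        raise ValueError("a learning rate is required when ops includes 'train'")
    return output_dir, learning_rate


base_dir = tempfile.mkdtemp()
configs = {"ops": ["train", "evaluate"], "output_dir": base_dir, "lr": 0.01}
run_dir, lr_used = resolve_run_settings(configs, lr=0.001)
flat_dir, lr_default = resolve_run_settings(configs, create_subdirectory=False)
```

The first call overrides the configured learning rate and writes into a timestamped subdirectory; the second uses the configured values and output_dir as given.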
execute (run Selene operations)
selene_sdk.utils.execute(operations, configs, output_dir)
Execute operations in Selene.
- Parameters
operations (list(str)) – The list of operations to carry out in Selene.
configs (dict or object) – The loaded configurations from a YAML file.
output_dir (str or None) – The path to the directory where all outputs will be saved. If None, this means that an output_dir was not specified in the top-level configuration keys. output_dir must be specified in each class’s individual configuration wherever it is required.
- Returns
Executes the operations listed and outputs any files to the dirs specified in each operation’s configuration.
- Return type
None
- Raises
ValueError – If an expected key in configuration is missing.