selene_sdk

This is the main module for Selene.

TrainModel

class selene_sdk.TrainModel(model, data_sampler, loss_criterion, optimizer_class, optimizer_kwargs, batch_size, max_steps, report_stats_every_n_steps, output_dir, save_checkpoint_every_n_steps=1000, save_new_checkpoints_after_n_steps=None, report_gt_feature_n_positives=10, n_validation_samples=None, n_test_samples=None, cpu_n_threads=1, use_cuda=False, data_parallel=False, logging_verbosity=2, checkpoint_resume=None, metrics={'average_precision': <function average_precision_score>, 'roc_auc': <function roc_auc_score>}, use_scheduler=True, deterministic=False, scheduler_kwargs={'factor': 0.8, 'patience': 16, 'verbose': True}, stopping_criteria=None)

Bases: object

This class ties together the various objects and methods needed to train and validate a model.

TrainModel saves two model files to output_dir: a checkpoint model, overwritten every save_checkpoint_every_n_steps steps, and a best-performing model, overwritten every report_stats_every_n_steps steps whenever the latest validation performance is better than that of the previous best-performing model.

TrainModel also writes two files that can be used to monitor training as Selene runs: selene_sdk.train_model.train.txt (training loss) and selene_sdk.train_model.validation.txt (validation loss and average ROC AUC). The columns in these files can be used to quickly visualize the training history (for example, with matplotlib: plt.plot(auc_list)) and to check whether the model is still improving, whether there are signs of overfitting, and so on.
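As a minimal sketch of the monitoring workflow described above, the helper below reads such a log into per-column lists using only the standard library. It assumes the file is tab-delimited with a single header row. The function name read_history and the column names in the example (loss, roc_auc) are hypothetical, so check the header of your own file before relying on them.

```python
import csv
import io


def read_history(handle, delimiter="\t"):
    """Read a delimited log with a header row into {column: [float, ...]}.

    Assumption: the first row holds column names and every other row holds
    numeric values. Cells that cannot be parsed (e.g. "NA") become None.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    columns = {name: [] for name in (reader.fieldnames or [])}
    for row in reader:
        for name in columns:
            try:
                columns[name].append(float(row[name]))
            except (TypeError, ValueError):
                columns[name].append(None)
    return columns


# Hypothetical file contents, used here in place of a real log file.
example = io.StringIO("loss\troc_auc\n0.69\t0.51\n0.52\t0.74\n0.48\tNA\n")
history = read_history(example)
auc_list = history["roc_auc"]

# With matplotlib installed, the history could then be plotted, e.g.:
#   import matplotlib.pyplot as plt
#   plt.plot(auc_list)
#   plt.show()
```

For a real run, pass an open file handle instead of the in-memory example, e.g. open(path, newline="").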

Parameters
  • model (torch.nn.Module) – The model to train.

  • data_sampler (selene_sdk.samplers.Sampler) – The example generator.

  • loss_criterion (torch.nn._Loss) – The loss function to optimize.

  • optimizer_class (torch.optim.Optimizer) – The optimizer to minimize loss with.

  • optimizer_kwargs (dict) – The dictionary of keyword arguments to pass to the optimizer’s constructor.

  • batch_size (int) – Specify the batch size to process examples. Should be a power of 2.

  • max_steps (int) – The maximum number of mini-batches to iterate over.

  • report_stats_every_n_steps (int) – The frequency with which to report summary statistics. You can set this value to be equivalent to a training epoch, where n_steps * batch_size is the total number of samples seen by the model so far. Selene evaluates the model on the validation dataset every report_stats_every_n_steps steps and, if the model obtains the best performance so far (based on the user-specified loss function), Selene saves the model state to a file called best_model.pth.tar in output_dir.

  • output_dir (str) – The output directory to save model checkpoints and logs in.

  • save_checkpoint_every_n_steps (int or None, optional) – Default is 1000. The frequency with which to save a model checkpoint. If None, it is set to the same value as report_stats_every_n_steps.

  • save_new_checkpoints_after_n_steps (int or None, optional) – Default is None. The number of steps after which Selene will continually save new checkpoint model weights files (checkpoint-<TIMESTAMP>.pth.tar) every save_checkpoint_every_n_steps. Before this point, the file checkpoint.pth.tar is overwritten every save_checkpoint_every_n_steps to limit the memory requirements.

  • n_validation_samples (int or None, optional) – Default is None. Specify the number of validation samples in the validation set. If n_validation_samples is None and the data sampler used is the selene_sdk.samplers.IntervalsSampler or selene_sdk.samplers.RandomSampler, we will retrieve 32000 validation samples. If None and using selene_sdk.samplers.MultiSampler, we will use all available validation samples from the appropriate data file.

  • n_test_samples (int or None, optional) – Default is None. Specify the number of test samples in the test set. If n_test_samples is None and

    • the sampler you specified has no test partition, you should not specify evaluate as one of the operations in the ops list. That is, Selene will not automatically evaluate your trained model on a test dataset, because the sampler you are using does not have any test data.

    • the sampler you use is of type selene_sdk.samplers.OnlineSampler (and the test partition exists), we will retrieve 640000 test samples.

    • the sampler you use is of type selene_sdk.samplers.MultiSampler (and the test partition exists), we will use all the test samples available in the appropriate data file.

  • cpu_n_threads (int, optional) – Default is 1. Sets the number of OpenMP threads used for parallelizing CPU operations.

  • use_cuda (bool, optional) – Default is False. Specify whether a CUDA-enabled GPU is available for torch to use during training.

  • data_parallel (bool, optional) – Default is False. Specify whether multiple GPUs are available for torch to use during training.

  • logging_verbosity ({0, 1, 2}, optional) –

    Default is 2. Set the logging verbosity level.

    • 0 - Only warnings will be logged.

    • 1 - Information and warnings will be logged.

    • 2 - Debug messages, information, and warnings will all be logged.

  • checkpoint_resume (str or None, optional) – Default is None. If checkpoint_resume is not None, it should be the path to a model file generated by torch.save that can now be read using torch.load.

  • use_scheduler (bool, optional) – Default is True. If True, a learning rate scheduler is used to reduce the learning rate on plateau. By default this is the PyTorch ReduceLROnPlateau scheduler with patience=16 and factor=0.8. Different scheduler parameters can be specified with scheduler_kwargs.

  • deterministic (bool, optional) – Default is False. If True, sets torch.backends.cudnn.deterministic to True and torch.backends.cudnn.benchmark to False. In the Selene CLI, if random_seed is set in the configuration YAML, Selene automatically passes deterministic=True to the TrainModel class.

  • scheduler_kwargs (dict, optional) – Default is patience=16, verbose=True, and factor=0.8. Set the parameters for the PyTorch ReduceLROnPlateau scheduler.

  • stopping_criteria (list or None, optional) – Default is None. If stopping_criteria is not None, it should be a list specifying how to use early stopping. The first value should be a str corresponding to one of the keys in metrics. The second value should be an int indicating the patience. If the specified metric does not improve within the given patience (usually corresponding to the number of epochs), training stops early.
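The way the step-based parameters above interact can be sketched in plain Python. This is an illustration under stated assumptions, not Selene's implementation: steps_per_epoch and checkpoint_filename are hypothetical helper names, rounding up for a partial final batch is an assumption, the timestamp string is a placeholder, and so is the exact boundary (step >= save_new_checkpoints_after_n_steps) at which timestamped files begin.

```python
import math


def steps_per_epoch(n_training_examples, batch_size):
    """Number of mini-batches needed to see every training example once."""
    return math.ceil(n_training_examples / batch_size)


def checkpoint_filename(step, save_new_checkpoints_after_n_steps, timestamp):
    """Pick the checkpoint file name for a given training step.

    Before the threshold (or when it is None) a single file is overwritten.
    From the threshold on, each checkpoint gets its own timestamped file.
    The ">=" comparison is an assumption made for this sketch.
    """
    if (save_new_checkpoints_after_n_steps is not None
            and step >= save_new_checkpoints_after_n_steps):
        return "checkpoint-{0}.pth.tar".format(timestamp)
    return "checkpoint.pth.tar"


# One report per pass over a hypothetical 100,000-example training set.
report_stats_every_n_steps = steps_per_epoch(100000, 64)
early_name = checkpoint_filename(500, 1000, "0101120000")
late_name = checkpoint_filename(1000, 1000, "0101120000")
```

Setting report_stats_every_n_steps this way means validation runs roughly once per epoch, which is also the unit in which the stopping_criteria patience is usually counted.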

Variables
  • TrainModel.model (torch.nn.Module) – The model to train.

  • TrainModel.sampler (selene_sdk.samplers.Sampler) – The example generator.

  • TrainModel.criterion (torch.nn._Loss) – The loss function to optimize.

  • TrainModel.optimizer (torch.optim.Optimizer) – The optimizer to minimize loss with.

  • TrainModel.batch_size (int) – The size of the mini-batch to use during training.

  • TrainModel.max_steps (int) – The maximum number of mini-batches to iterate over.

  • TrainModel.nth_step_report_stats (int) – The frequency with which to report summary statistics.

  • TrainModel.nth_step_save_checkpoint (int) – The frequency with which to save a model checkpoint.

  • TrainModel.use_cuda (bool) – If True, use a CUDA-enabled GPU. If False, use the CPU.

  • TrainModel.data_parallel (bool) – Whether to use multiple GPUs or not.

  • TrainModel.output_dir (str) – The directory to save model checkpoints and logs.

create_test_set()

Loads the set of test samples. We do not create the test set in the TrainModel object until this method is called, so that we avoid having to load it into memory until the model has been trained and is ready to be evaluated.
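The deferred loading described above is a lazy-initialization pattern. A minimal sketch follows, with a hypothetical LazyTestSet class and loader callable standing in for Selene's sampler logic. Caching the result on the first call is a choice made for this sketch, not a documented behavior of TrainModel.

```python
class LazyTestSet:
    """Hold a loader and only call it the first time the data is needed."""

    def __init__(self, loader):
        self._loader = loader      # callable returning the test samples
        self._test_data = None     # nothing is loaded at construction time

    def create_test_set(self):
        """Load the test samples once and reuse them on later calls."""
        if self._test_data is None:
            self._test_data = self._loader()
        return self._test_data


calls = []


def fake_loader():
    # Stand-in for an expensive read of the test partition.
    calls.append("loaded")
    return [("ACGT", 1), ("TTGA", 0)]


holder = LazyTestSet(fake_loader)
loaded_before = len(calls)          # constructing the object loads nothing
samples = holder.create_test_set()  # first call triggers the load
holder.create_test_set()            # second call reuses the cached samples
```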

evaluate()

Measures the model test performance.

Returns

A dictionary, where keys are the names of the loss metrics, and the values are the average value for that metric over the test set.

Return type

dict

train()

Trains the model on a batch of data.

Returns

The training loss.

Return type

float

train_and_validate()

Trains the model and measures validation performance.
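The control flow implied by the class description (periodic checkpoints, periodic validation, tracking the best model, optional early stopping) can be sketched as a plain Python loop. This is a simplified illustration, not Selene's source: train_and_validate_sketch and its callables are hypothetical, and for brevity early stopping here watches the validation loss, whereas stopping_criteria names one of the entries in metrics.

```python
def train_and_validate_sketch(train_step, validate, max_steps,
                              report_stats_every_n_steps,
                              save_checkpoint_every_n_steps,
                              patience=None):
    """Run a simplified train/validate loop and record what happened.

    train_step() trains on one mini-batch and returns the training loss.
    validate() returns the validation loss (lower is better).
    Returns a list of (step, event) tuples.
    """
    events = []
    best_loss = float("inf")
    reports_without_improvement = 0
    for step in range(1, max_steps + 1):
        train_step()
        if step % save_checkpoint_every_n_steps == 0:
            events.append((step, "checkpoint"))
        if step % report_stats_every_n_steps == 0:
            loss = validate()
            if loss < best_loss:
                # New best validation loss: the best model would be saved here.
                best_loss = loss
                reports_without_improvement = 0
                events.append((step, "best_model"))
            else:
                reports_without_improvement += 1
            if (patience is not None
                    and reports_without_improvement >= patience):
                events.append((step, "early_stop"))
                break
    return events


# Hypothetical validation losses, one per report.
val_losses = iter([0.9, 0.7, 0.8, 0.75, 0.6])
events = train_and_validate_sketch(
    train_step=lambda: 0.0,
    validate=lambda: next(val_losses),
    max_steps=10,
    report_stats_every_n_steps=2,
    save_checkpoint_every_n_steps=5,
    patience=2,
)
```

In this run the loss improves at steps 2 and 4, fails to improve at steps 6 and 8, and the loop stops at step 8 before the final validation loss of 0.6 is ever observed, which illustrates the trade-off of a small patience value.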

validate()

Measures model validation performance.

Returns

A dictionary, where keys are the names of the loss metrics, and the values are the average value for that metric over the validation set.

Return type

dict

EvaluateModel

class selene_sdk.EvaluateModel(model, criterion, data_sampler, features, trained_model_path, output_dir, batch_size=64, n_test_samples=None, report_gt_feature_n_positives=10, use_cuda=False, data_parallel=False, use_features_ord=None, metrics={'average_precision': <function average_precision_score>, 'roc_auc': <function roc_auc_score>})

Bases: object

Evaluate model on a test set of sequences with known targets.

Parameters
  • model (torch.nn.Module) – The model architecture.

  • criterion (torch.nn._Loss) – The loss function that was optimized during training.

  • data_sampler (selene_sdk.samplers.Sampler) – Used to retrieve samples from the test set for evaluation.

  • features (list(str)) – List of distinct features the model predicts.

  • trained_model_path (str) – Path to the trained model file, saved using torch.save.

  • output_dir (str) – The output directory in which to save model evaluation and logs.

  • batch_size (int, optional) – Default is 64. Specify the batch size to process examples. Should be a power of 2.

  • n_test_samples (int or None, optional) – Default is None. Use n_test_samples if you want to limit the number of samples on which you evaluate your model. If you are using a sampler of type selene_sdk.samplers.OnlineSampler, by default it will draw 640000 samples if n_test_samples is None.

  • report_gt_feature_n_positives (int, optional) – Default is 10. In the final test set, each class/feature must have more than report_gt_feature_n_positives positive samples in order to be considered in the test performance computation. The output file that states each class’ performance will report ‘NA’ for classes that do not have enough positive samples.

  • use_cuda (bool, optional) – Default is False. Specify whether a CUDA-enabled GPU is available for torch to use during evaluation.

  • data_parallel (bool, optional) – Default is False. Specify whether multiple GPUs are available for torch to use during evaluation.

  • use_features_ord (list(str) or None, optional) – Default is None. Specify an ordered list of features for which to run the evaluation. The features in this list must be identical to or a subset of features, and in the order you want the resulting test_targets.npz and test_predictions.npz to be saved. If using a FileSampler or H5DataLoader for the evaluation, you can pass in a dataset with the targets matrix only containing these features, but note that this subsetted targets matrix MUST be ordered the same way as features, and the predictions and targets .npz output will be reordered according to use_features_ord.
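The subsetting and reordering that use_features_ord describes can be sketched with plain lists. This is an illustration only: reorder_columns is a hypothetical helper, the feature names are made up, and the sketch covers only the case where the matrix has one column per entry in features.

```python
def reorder_columns(matrix, features, use_features_ord):
    """Select and reorder the columns of a samples-by-features matrix.

    matrix is a list of rows whose columns follow the order of features.
    use_features_ord must be identical to, or a subset of, features.
    """
    missing = [name for name in use_features_ord if name not in features]
    if missing:
        raise ValueError("Not in features: {0}".format(missing))
    index_of = {name: i for i, name in enumerate(features)}
    column_ixs = [index_of[name] for name in use_features_ord]
    return [[row[i] for i in column_ixs] for row in matrix]


features = ["CTCF", "H3K4me3", "DNase"]   # hypothetical feature names
predictions = [[0.9, 0.1, 0.4],
               [0.2, 0.8, 0.6]]
reordered = reorder_columns(predictions, features, ["DNase", "CTCF"])
```

Applying the same column selection to both the targets and the predictions keeps the two saved matrices aligned with each other.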

Variables
  • EvaluateModel.model (torch.nn.Module) – The trained model.

  • EvaluateModel.criterion (torch.nn._Loss) – The model was trained using this loss function.

  • EvaluateModel.sampler (selene_sdk.samplers.Sampler) – The example generator.

  • EvaluateModel.features (list(str)) – List of distinct features the model predicts.

  • EvaluateModel.batch_size (int) – The batch size to process examples. Should be a power of 2.

  • EvaluateModel.use_cuda (bool) – If True, use a CUDA-enabled GPU. If False, use the CPU.

  • EvaluateModel.data_parallel (bool) – Whether to use multiple GPUs or not.

  • EvaluateModel.metrics (dict) – A dictionary that maps metric names (str) to metric functions. By default, this contains “roc_auc”, which maps to sklearn.metrics.roc_auc_score, and “average_precision”, which maps to sklearn.metrics.average_precision_score.

evaluate()

Passes all samples retrieved from the sampler to the model in batches and returns the predictions. Also reports the model’s performance on these examples.

Returns

A dictionary, where keys are the features and the values are each a dict of the performance metrics (currently ROC AUC and AUPR) reported for each feature the model predicts.

Return type

dict
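The shape of the returned dictionary, together with the report_gt_feature_n_positives filter described in the parameters, can be sketched as follows. This is an assumption-laden illustration rather than Selene's code: per_feature_scores and accuracy_at_half are hypothetical, None stands in for the ‘NA’ entries of the output file, and a simple thresholded accuracy replaces the default scikit-learn metrics so the example needs no third-party packages.

```python
def per_feature_scores(targets, predictions, features, metrics,
                       report_gt_feature_n_positives=10):
    """Compute each metric per feature, or None when positives are too few.

    targets and predictions are lists of rows with one column per feature.
    metrics maps a metric name to a function fn(y_true, y_score).
    """
    results = {}
    for col, feature in enumerate(features):
        y_true = [row[col] for row in targets]
        y_score = [row[col] for row in predictions]
        # A feature needs MORE than the threshold number of positives.
        enough = sum(y_true) > report_gt_feature_n_positives
        results[feature] = {
            name: (fn(y_true, y_score) if enough else None)
            for name, fn in metrics.items()
        }
    return results


def accuracy_at_half(y_true, y_score):
    """Hypothetical stand-in metric: fraction correct at a 0.5 threshold."""
    hits = sum(int(s >= 0.5) == t for t, s in zip(y_true, y_score))
    return hits / len(y_true)


targets = [[1, 0], [1, 0], [0, 1], [1, 0]]
predictions = [[0.9, 0.2], [0.4, 0.1], [0.3, 0.7], [0.8, 0.6]]
scores = per_feature_scores(targets, predictions, ["feat_a", "feat_b"],
                            {"accuracy": accuracy_at_half},
                            report_gt_feature_n_positives=2)
```

Here feat_a has three positives and is scored, while feat_b has only one and is reported as None. Functions such as sklearn.metrics.roc_auc_score take (y_true, y_score) in the same order and could be passed in the metrics dictionary instead.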