selene_sdk.predict.predict_handlers

This module provides the classes and methods for prediction handlers, which generally are used for logging and saving outputs from models.

PredictionsHandler

class selene_sdk.predict.predict_handlers.PredictionsHandler(features, columns_for_ids, output_path_prefix, output_format, output_size=None, write_mem_limit=1500, write_labels=True)[source]

Bases: object

The abstract base class for handlers, which “handle” model predictions. Handlers are responsible for accepting predictions, storing these predictions or scores derived from them, and then writing them to file in a user-specified output format (Selene currently supports TSV and HDF5 file outputs).

Parameters
  • features (list(str)) – List of sequence-level features, in the same order that the model will return its predictions.

  • columns_for_ids (list(str)) – Columns in the file that will help to identify the sequence or variant to which the model prediction scores correspond.

  • output_path_prefix (str) – Path to the file to which Selene will write the handler's output. The path may contain a filename prefix. Selene will append a handler-specific name to the end of the path/prefix.

  • output_format ({‘tsv’, ‘hdf5’}) – Specify the desired output format. TSV can be specified if the final file should be easily perused (e.g. viewed in a text editor/Excel). However, saving to a TSV file is much slower than saving to an HDF5 file.

  • output_size (int, optional) – The total number of rows in the output. Must be specified when the output_format is hdf5.

  • write_mem_limit (int, optional) – Default is 1500. Specify the amount of memory you can allocate to storing model predictions/scores for this particular handler, in MB. Handler will write to file whenever this memory limit is reached.

  • write_labels (bool, optional) – Default is True. If you initialize multiple write handlers for the same set of inputs with output format hdf5, set write_labels to False on all handlers except one so that only one handler writes the row labels to an output file.

Variables

needs_base_pred (bool) – Whether the handler needs the base (reference) prediction as input to compute the final output.

abstract handle_batch_predictions(*args, **kwargs)[source]

Must be able to handle a batch of model predictions.

write_to_file()[source]

Writes accumulated handler results to file.
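
The interplay between the abstract handle_batch_predictions, write_mem_limit, and write_to_file can be sketched without Selene itself. The following is only an illustration of the pattern described above, not Selene's implementation: the class names SketchHandler and StoreAllHandler, the helper _store, the writes counter, the 8-bytes-per-value estimate, and the 1 MB = 1024 × 1024 bytes conversion are all assumptions made for this example.

```python
from abc import ABC, abstractmethod


class SketchHandler(ABC):
    """Hypothetical stand-in for the handler pattern; not Selene's code."""

    def __init__(self, write_mem_limit=1500):
        # Assumption: the limit is given in MB and 1 MB = 1024 * 1024 bytes.
        self._limit_bytes = write_mem_limit * 1024 * 1024
        self._rows = []
        self._buffered_bytes = 0
        self.writes = 0  # counts flushes so the behaviour is observable

    @abstractmethod
    def handle_batch_predictions(self, *args, **kwargs):
        """Subclasses decide what to compute and store for each batch."""

    def _store(self, rows, bytes_per_value=8):
        # Assumption: each stored value is a 64-bit float (8 bytes).
        for row in rows:
            self._rows.append(list(row))
            self._buffered_bytes += len(row) * bytes_per_value
        # Flush as soon as the estimated buffer size reaches the limit.
        if self._buffered_bytes >= self._limit_bytes:
            self.write_to_file()

    def write_to_file(self):
        # A real handler would write self._rows to a TSV or HDF5 file here.
        self.writes += 1
        self._rows = []
        self._buffered_bytes = 0


class StoreAllHandler(SketchHandler):
    """Stores the predictions unchanged."""

    def handle_batch_predictions(self, batch_predictions, batch_ids):
        self._store(batch_predictions)


batch = [[0.25, 0.75], [0.5, 0.125]]  # B = 2 rows, N = 2 features (32 bytes)

roomy = StoreAllHandler(write_mem_limit=1)  # 1 MB limit: nothing flushed yet
roomy.handle_batch_predictions(batch, batch_ids=[["a"], ["b"]])

tiny = StoreAllHandler(write_mem_limit=32 / (1024 * 1024))  # 32-byte limit
tiny.handle_batch_predictions(batch, batch_ids=[["a"], ["b"]])
```

In this sketch, roomy keeps both rows buffered until write_to_file is called explicitly, while tiny flushes as soon as the batch fills its 32-byte budget.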

DiffScoreHandler

class selene_sdk.predict.predict_handlers.DiffScoreHandler(features, columns_for_ids, output_path_prefix, output_format, output_size=None, write_mem_limit=1500, write_labels=True)[source]

Bases: selene_sdk.predict.predict_handlers.handler.PredictionsHandler

The “diff score” is the difference between alt and ref predictions (alt - ref).

Parameters
  • features (list(str)) – List of sequence-level features, in the same order that the model will return its predictions.

  • columns_for_ids (list(str)) – Columns in the file that help to identify the input sequence or variant to which the model prediction scores correspond.

  • output_path_prefix (str) – Path to the file to which Selene will write the difference scores. The path may contain a filename prefix. Selene will append diffs to the end of the prefix if specified (otherwise the file will be named diffs.tsv/.h5).

  • output_format ({‘tsv’, ‘hdf5’}) – Specify the desired output format. TSV can be specified if you would like the final file to be easily perused (e.g. viewed in a text editor/Excel). However, saving to a TSV file is much slower than saving to an HDF5 file.

  • output_size (int, optional) – The total number of rows in the output. Must be specified when the output_format is hdf5.

  • write_mem_limit (int, optional) – Default is 1500. Specify the amount of memory you can allocate to storing model predictions/scores for this particular handler, in MB. Handler will write to file whenever this memory limit is reached.

  • write_labels (bool, optional) – Default is True. If you initialize multiple write handlers for the same set of inputs with output format hdf5, set write_labels to False on all handlers except one so that only one handler writes the row labels to an output file.

Variables

needs_base_pred (bool) – Whether the handler needs the base (reference) prediction as input to compute the final output.

handle_batch_predictions(batch_predictions, batch_ids, baseline_predictions)[source]

Handles the model predictions for a batch of sequences. Computes the difference between the predictions for a batch of alternate sequences (i.e. sequences slightly changed/mutated from the reference) and the predictions for either a single reference sequence or a batch of reference sequences.

Parameters
  • batch_predictions (arraylike) – The predictions for a batch of sequences. This should have dimensions of \(B \times N\) (where \(B\) is the size of the mini-batch and \(N\) is the number of features).

  • batch_ids (list(arraylike)) – Batch of sequence identifiers. Each element is arraylike because it may contain more than one column (written to file) that together make up a unique identifier for a sequence.

  • baseline_predictions (arraylike) – The baseline prediction(s) used to compute the diff scores. Must either be a vector of \(N\) values or a matrix of dimensions \(B \times N\) (where \(B\) is the size of the mini-batch, and \(N\) is the number of features).
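
The two accepted baseline shapes can be illustrated with a small sketch. The function diff_scores below is hypothetical and is not part of Selene. It uses plain Python lists so that it has no dependencies; with NumPy arrays the same result follows from broadcasting in `batch_predictions - baseline_predictions`.

```python
def diff_scores(batch_predictions, baseline_predictions):
    """Return alt - ref for each row of a B x N batch.

    `baseline_predictions` is either one row of N values (shared by the
    whole batch) or B rows of N values (one reference per alternate).
    """
    # A flat list of numbers means a single reference shared by all rows.
    shared = not isinstance(baseline_predictions[0], (list, tuple))
    diffs = []
    for i, alt_row in enumerate(batch_predictions):
        ref_row = baseline_predictions if shared else baseline_predictions[i]
        diffs.append([alt - ref for alt, ref in zip(alt_row, ref_row)])
    return diffs


alt = [[0.75, 0.25, 0.5],
       [0.5, 0.625, 0.125]]   # B = 2 alternate sequences, N = 3 features
ref = [0.5, 0.5, 0.5]         # one reference prediction shared by the batch

scores = diff_scores(alt, ref)
print(scores)  # [[0.25, -0.25, 0.0], [0.0, 0.125, -0.375]]
```

Passing `[ref, ref]` as a \(B \times N\) baseline gives the same scores, since each alternate row is then paired with its own (here identical) reference row.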

write_to_file()[source]

Writes stored scores to a file.

LogitScoreHandler

class selene_sdk.predict.predict_handlers.LogitScoreHandler(features, columns_for_ids, output_path_prefix, output_format, output_size=None, write_mem_limit=1500, write_labels=True)[source]

Bases: selene_sdk.predict.predict_handlers.handler.PredictionsHandler

The logit score handler calculates and records the difference between logit(alt) and logit(ref) predictions (logit(alt) - logit(ref)). For reference, if some event occurs with probability \(p\), then the log-odds is the logit of \(p\), or

\[\mathrm{logit}(p) = \log\left(\frac{p}{1 - p}\right) = \log(p) - \log(1 - p)\]

Parameters
  • features (list(str)) – List of sequence-level features, in the same order that the model will return its predictions.

  • columns_for_ids (list(str)) – Columns in the file that help to identify the input sequence to which the features data corresponds.

  • output_path_prefix (str) – Path to the file to which Selene will write the logit scores. The path may contain a filename prefix. Selene will append logits to the end of the prefix.

  • output_format ({‘tsv’, ‘hdf5’}) – Specify the desired output format. TSV can be specified if you would like the final file to be easily perused. However, saving to a TSV file is much slower than saving to an HDF5 file.

  • output_size (int, optional) – The total number of rows in the output. Must be specified when the output_format is hdf5.

  • write_mem_limit (int, optional) – Default is 1500. Specify the amount of memory you can allocate to storing model predictions/scores for this particular handler, in MB. Handler will write to file whenever this memory limit is reached.

  • write_labels (bool, optional) – Default is True. If you initialize multiple write handlers for the same set of inputs with output format hdf5, set write_labels to False on all handlers except one so that only one handler writes the row labels to an output file.

Variables

needs_base_pred (bool) – Whether the handler needs the base (reference) prediction as input to compute the final output.

handle_batch_predictions(batch_predictions, batch_ids, baseline_predictions)[source]

Handles the model predictions for a batch of sequences.

Parameters
  • batch_predictions (arraylike) – The predictions for a batch of sequences. This should have dimensions of \(B \times N\) (where \(B\) is the size of the mini-batch and \(N\) is the number of features).

  • batch_ids (list(arraylike)) – Batch of sequence identifiers. Each element is arraylike because it may contain more than one column (written to file) that together make up a unique identifier for a sequence.

  • baseline_predictions (arraylike) – The baseline prediction(s) used to compute the logit scores. This must either be a vector of \(N\) values, or a matrix of shape \(B \times N\) (where \(B\) is the size of the mini-batch, and \(N\) is the number of features).
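
As a worked illustration of the logit score for a single feature, the sketch below applies the logit formula given for this class to one alternate and one reference probability. The helper names logit and logit_diff are hypothetical, and the sketch does not handle probabilities of exactly 0 or 1 (for which the logarithm is undefined); how Selene itself treats such values is not covered here.

```python
import math


def logit(p):
    """Log-odds of a probability p in the open interval (0, 1)."""
    return math.log(p) - math.log(1.0 - p)


def logit_diff(alt, ref):
    """Logit score for one feature: logit(alt) - logit(ref)."""
    return logit(alt) - logit(ref)


# An alternate prediction of 0.8 against a reference of 0.5:
# the odds go from 1:1 to 4:1, so the score equals log(4).
score = logit_diff(0.8, 0.5)
print(round(score, 4))  # 1.3863
```

The score is antisymmetric: swapping alt and ref flips its sign, and equal predictions give a score of 0.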

write_to_file()[source]

Writes the stored scores to a file.

WritePredictionsHandler

class selene_sdk.predict.predict_handlers.WritePredictionsHandler(features, columns_for_ids, output_path_prefix, output_format, output_size=None, write_mem_limit=1500, write_labels=True)[source]

Bases: selene_sdk.predict.predict_handlers.handler.PredictionsHandler

Collects batches of model predictions and writes all of them to file at the end.

Parameters
  • features (list(str)) – List of sequence-level features, in the same order that the model will return its predictions.

  • columns_for_ids (list(str)) – Columns in the file that help to identify the input sequence to which the features data corresponds.

  • output_path_prefix (str) – Path to the file to which Selene will write the predictions. The path may contain a filename prefix. Selene will append predictions to the end of the prefix.

  • output_format ({‘tsv’, ‘hdf5’}) – Specify the desired output format. TSV can be specified if you would like the final file to be easily perused. However, saving to a TSV file is much slower than saving to an HDF5 file.

  • output_size (int, optional) – The total number of rows in the output. Must be specified when the output_format is hdf5.

  • write_mem_limit (int, optional) – Default is 1500. Specify the amount of memory you can allocate to storing model predictions/scores for this particular handler, in MB. Handler will write to file whenever this memory limit is reached.

  • write_labels (bool, optional) – Default is True. If you initialize multiple write handlers for the same set of inputs with output format hdf5, set write_labels to False on all handlers except one so that only one handler writes the row labels to an output file.

Variables

needs_base_pred (bool) – Whether the handler needs the base (reference) prediction as input to compute the final output.

handle_batch_predictions(batch_predictions, batch_ids)[source]

Handles the predictions for a batch of sequences.

Parameters
  • batch_predictions (arraylike) – The predictions for a batch of sequences. This should have dimensions of \(B \times N\) (where \(B\) is the size of the mini-batch and \(N\) is the number of features).

  • batch_ids (list(arraylike)) – Batch of sequence identifiers. Each element is arraylike because it may contain more than one column (written to file) that together make up a unique identifier for a sequence.
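
To make the relationship between columns_for_ids, features, batch_ids, and batch_predictions concrete, here is a minimal sketch that lays them out as tab-separated rows. The function rows_to_tsv, the example column and feature names, and the exact column layout are assumptions for illustration only; they are not a statement of the file format Selene actually writes.

```python
import csv
import io


def rows_to_tsv(columns_for_ids, features, batch_ids, batch_predictions):
    """Render one header row plus one row per sequence as TSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    # Header: identifier columns first, then one column per feature.
    writer.writerow(list(columns_for_ids) + list(features))
    # Each sequence contributes its identifier values and its N predictions.
    for ids, preds in zip(batch_ids, batch_predictions):
        writer.writerow(list(ids) + list(preds))
    return buffer.getvalue()


tsv = rows_to_tsv(
    columns_for_ids=["chrom", "pos"],          # hypothetical id columns
    features=["feature_a", "feature_b"],       # hypothetical feature names
    batch_ids=[["chr1", 100], ["chr2", 250]],  # B = 2 identifiers
    batch_predictions=[[0.25, 0.75], [0.5, 0.125]],  # B x N predictions
)
print(tsv)
```

Each element of batch_ids supplies one value per entry in columns_for_ids, which is why it is itself arraylike.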

write_to_file()[source]

Writes the stored scores to a file.

WriteRefAltHandler

class selene_sdk.predict.predict_handlers.WriteRefAltHandler(features, columns_for_ids, output_path_prefix, output_format, output_size=None, write_mem_limit=1500, write_labels=True)[source]

Bases: selene_sdk.predict.predict_handlers.handler.PredictionsHandler

Used during variant effect prediction. This handler records the predicted values for the reference and alternate sequences, and stores these values in two separate files.

Parameters
  • features (list(str)) – List of sequence-level features, in the same order that the model will return its predictions.

  • columns_for_ids (list(str)) – Columns in the file that help to identify the input sequence to which the features data corresponds.

  • output_path_prefix (str) – Path for the file(s) to which Selene will write the reference and alternate predictions. The path may contain a filename prefix. Selene will append ref_predictions and alt_predictions to the end of the prefix to distinguish between the reference and alternate predictions files.

  • output_format ({‘tsv’, ‘hdf5’}) – Specify the desired output format. TSV can be specified if you would like the final file to be easily perused. However, saving to a TSV file is much slower than saving to an HDF5 file.

  • output_size (int, optional) – The total number of rows in the output. Must be specified when the output_format is hdf5.

  • write_mem_limit (int, optional) – Default is 1500. Specify the amount of memory you can allocate to storing model predictions/scores for this particular handler, in MB. Handler will write to file whenever this memory limit is reached.

  • write_labels (bool, optional) – Default is True. If you initialize multiple write handlers for the same set of inputs with output format hdf5, set write_labels to False on all handlers except one so that only one handler writes the row labels to an output file.

Variables

needs_base_pred (bool) – Whether the handler needs the base (reference) prediction as input to compute the final output.

handle_batch_predictions(batch_predictions, batch_ids, base_predictions)[source]

Handles the predictions for a batch of sequences.

Parameters
  • batch_predictions (arraylike) – The predictions for a batch of sequences. This should have dimensions of \(B \times N\) (where \(B\) is the size of the mini-batch and \(N\) is the number of features).

  • batch_ids (list(arraylike)) – Batch of sequence identifiers. Each element is arraylike because it may contain more than one column (written to file) that together make up a unique identifier for a sequence.

  • base_predictions (arraylike) – The baseline (reference) prediction(s) recorded alongside the alternate predictions. This must either be a vector of \(N\) values, or a matrix of shape \(B \times N\) (where \(B\) is the size of the mini-batch, and \(N\) is the number of features).

write_to_file()[source]

Writes the stored predictions to two files (one for ref, one for alt).
