selene_sdk.predict.predict_handlers
This module provides the classes and methods for prediction handlers, which generally are used for logging and saving outputs from models.
PredictionsHandler
- class selene_sdk.predict.predict_handlers.PredictionsHandler(features, columns_for_ids, output_path_prefix, output_format, output_size=None, write_mem_limit=1500, write_labels=True)
Bases:
object
The abstract base class for handlers, which “handle” model predictions. Handlers are responsible for accepting predictions, storing these predictions or scores derived from the predictions, and then returning them in a user-specified output format. Selene currently supports TSV and HDF5 file outputs.
- Parameters
features (list(str)) – List of sequence-level features, in the same order that the model will return its predictions.
columns_for_ids (list(str)) – Columns in the file that will help to identify the sequence or variant to which the model prediction scores correspond.
output_path_prefix (str) – Path to the file to which Selene will write the handler's output. The path may contain a filename prefix. Selene will append a handler-specific name to the end of the path/prefix.
output_format ({‘tsv’, ‘hdf5’}) – Specify the desired output format. TSV can be specified if the final file should be easily perused (e.g. viewed in a text editor/Excel). However, saving to a TSV file is much slower than saving to an HDF5 file.
output_size (int, optional) – The total number of rows in the output. Must be specified when the output_format is hdf5.
write_mem_limit (int, optional) – Default is 1500. Specify the amount of memory you can allocate to storing model predictions/scores for this particular handler, in MB. Handler will write to file whenever this memory limit is reached.
write_labels (bool, optional) – Default is True. If you initialize multiple write handlers for the same set of inputs with output format hdf5, set write_labels to False on all handlers except one, so that only one handler writes the row labels to an output file.
- Variables
needs_base_pred (bool) – Whether the handler needs the base (reference) prediction as input to compute the final output.
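The write_mem_limit behavior described above (buffer scores in memory, write to file once the limit is reached) can be pictured with a small sketch. This is not Selene's implementation: the class name BufferedRowWriter, the estimate of 8 bytes per value, and the flush counter are all assumptions made for illustration, and the actual file write is replaced by a counter.

```python
class BufferedRowWriter:
    """Hypothetical illustration of a memory-limited write buffer."""

    def __init__(self, write_mem_limit_mb=1500):
        # Convert the limit from MB to bytes.
        self._limit_bytes = write_mem_limit_mb * 1024 * 1024
        self._rows = []
        self._buffered_bytes = 0
        self.flush_count = 0
        self.rows_written = 0

    def add_rows(self, rows):
        for row in rows:
            self._rows.append(row)
            # Assumption: each score is stored as a 64-bit float (8 bytes).
            self._buffered_bytes += 8 * len(row)
        if self._buffered_bytes >= self._limit_bytes:
            self.flush()

    def flush(self):
        # A real handler would append the buffered rows to a TSV or HDF5
        # file here; this sketch only counts them.
        self.rows_written += len(self._rows)
        self._rows = []
        self._buffered_bytes = 0
        self.flush_count += 1


writer = BufferedRowWriter(write_mem_limit_mb=1)
# 128 rows of 1024 values at 8 bytes each is exactly 1 MB.
writer.add_rows([[0.0] * 1024 for _ in range(128)])
print(writer.flush_count, writer.rows_written)
```

A smaller limit means more frequent, smaller writes; a larger one trades memory for fewer writes.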
DiffScoreHandler
- class selene_sdk.predict.predict_handlers.DiffScoreHandler(features, columns_for_ids, output_path_prefix, output_format, output_size=None, write_mem_limit=1500, write_labels=True)
Bases:
selene_sdk.predict.predict_handlers.handler.PredictionsHandler
The diff score handler calculates and records the “diff score”: the difference between the alternate (alt) and reference (ref) predictions (alt - ref).
- Parameters
features (list(str)) – List of sequence-level features, in the same order that the model will return its predictions.
columns_for_ids (list(str)) – Columns in the file that help to identify the input sequence or variant to which the model prediction scores correspond.
output_path_prefix (str) – Path to the file to which Selene will write the difference scores. The path may contain a filename prefix. Selene will append diffs to the end of the prefix if specified (otherwise the file will be named diffs.tsv/.h5).
output_format ({‘tsv’, ‘hdf5’}) – Specify the desired output format. TSV can be specified if you would like the final file to be easily perused (e.g. viewed in a text editor/Excel). However, saving to a TSV file is much slower than saving to an HDF5 file.
output_size (int, optional) – The total number of rows in the output. Must be specified when the output_format is hdf5.
write_mem_limit (int, optional) – Default is 1500. Specify the amount of memory you can allocate to storing model predictions/scores for this particular handler, in MB. Handler will write to file whenever this memory limit is reached.
write_labels (bool, optional) – Default is True. If you initialize multiple write handlers for the same set of inputs with output format hdf5, set write_labels to False on all handlers except one, so that only one handler writes the row labels to an output file.
- Variables
needs_base_pred (bool) – Whether the handler needs the base (reference) prediction as input to compute the final output.
- handle_batch_predictions(batch_predictions, batch_ids, baseline_predictions)
Handles the model predictions for a batch of sequences. Computes the difference between the predictions for 1 or a batch of reference sequences and a batch of alternate sequences (i.e. sequences slightly changed/mutated from the reference).
- Parameters
batch_predictions (arraylike) – The predictions for a batch of sequences. This should have dimensions of \(B \times N\) (where \(B\) is the size of the mini-batch and \(N\) is the number of features).
batch_ids (list(arraylike)) – Batch of sequence identifiers. Each element is arraylike because it may contain more than one column (written to file) that together make up a unique identifier for a sequence.
baseline_predictions (arraylike) – The baseline prediction(s) used to compute the diff scores. Must be either a vector of \(N\) values or a matrix of dimensions \(B \times N\) (where \(B\) is the size of the mini-batch, and \(N\) is the number of features).
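As a rough illustration of the two accepted baseline shapes, the sketch below computes alt - ref with plain Python lists. The function name diff_scores is hypothetical and this is not Selene's code. With NumPy arrays the same result follows from `batch_predictions - baseline_predictions`, because broadcasting handles the single-vector case.

```python
def diff_scores(batch_predictions, baseline_predictions):
    """Return alt - ref for each row of a B x N batch.

    The baseline is either one vector of N values shared by every row,
    or a B x N matrix with one reference row per alternate row.
    """
    # A shared baseline is a flat sequence of numbers, not a sequence of rows.
    shared = not isinstance(baseline_predictions[0], (list, tuple))
    diffs = []
    for i, alt_row in enumerate(batch_predictions):
        ref_row = baseline_predictions if shared else baseline_predictions[i]
        if len(ref_row) != len(alt_row):
            raise ValueError("baseline and batch must have the same N")
        diffs.append([alt - ref for alt, ref in zip(alt_row, ref_row)])
    return diffs


batch = [[0.5, 0.25], [1.0, 0.75]]
# One reference vector shared by the whole batch.
print(diff_scores(batch, [0.25, 0.25]))  # [[0.25, 0.0], [0.75, 0.5]]
# One reference row per alternate row.
print(diff_scores(batch, [[0.5, 0.25], [0.5, 0.5]]))  # [[0.0, 0.0], [0.5, 0.25]]
```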
LogitScoreHandler
- class selene_sdk.predict.predict_handlers.LogitScoreHandler(features, columns_for_ids, output_path_prefix, output_format, output_size=None, write_mem_limit=1500, write_labels=True)
Bases:
selene_sdk.predict.predict_handlers.handler.PredictionsHandler
The logit score handler calculates and records the difference between logit(alt) and logit(ref) predictions (logit(alt) - logit(ref)). For reference, if some event occurs with probability \(p\), then the log-odds is the logit of \(p\), or
\[\mathrm{logit}(p) = \log\left(\frac{p}{1 - p}\right) = \log(p) - \log(1 - p)\]
- Parameters
features (list(str)) – List of sequence-level features, in the same order that the model will return its predictions.
columns_for_ids (list(str)) – Columns in the file that help to identify the input sequence to which the features data corresponds.
output_path_prefix (str) – Path to the file to which Selene will write the logit difference scores. The path may contain a filename prefix. Selene will append logits to the end of the prefix.
output_format ({‘tsv’, ‘hdf5’}) – Specify the desired output format. TSV can be specified if you would like the final file to be easily perused. However, saving to a TSV file is much slower than saving to an HDF5 file.
output_size (int, optional) – The total number of rows in the output. Must be specified when the output_format is hdf5.
write_mem_limit (int, optional) – Default is 1500. Specify the amount of memory you can allocate to storing model predictions/scores for this particular handler, in MB. Handler will write to file whenever this memory limit is reached.
write_labels (bool, optional) – Default is True. If you initialize multiple write handlers for the same set of inputs with output format hdf5, set write_labels to False on all handlers except one, so that only one handler writes the row labels to an output file.
- Variables
needs_base_pred (bool) – Whether the handler needs the base (reference) prediction as input to compute the final output.
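- handle_batch_predictions(batch_predictions, batch_ids, baseline_predictions)
Handles the model predictions for a batch of sequences. Computes logit(alt) - logit(ref) between the predictions for a batch of alternate sequences and the baseline (reference) prediction(s).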
- Parameters
batch_predictions (arraylike) – The predictions for a batch of sequences. This should have dimensions of \(B \times N\) (where \(B\) is the size of the mini-batch and \(N\) is the number of features).
batch_ids (list(arraylike)) – Batch of sequence identifiers. Each element is arraylike because it may contain more than one column (written to file) that together make up a unique identifier for a sequence.
baseline_predictions (arraylike) – The baseline prediction(s) used to compute the logit scores. This must either be a vector of \(N\) values, or a matrix of shape \(B \times N\) (where \(B\) is the size of the mini-batch, and \(N\) is the number of features).
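The logit score for a single variant can be worked through in a few lines. This is a minimal sketch of the formula above, not Selene's implementation; the helper names logit and logit_diff_row are hypothetical. The sketch rejects probabilities of exactly 0 or 1, where the logit is undefined. How Selene treats such values is not covered here.

```python
import math


def logit(p):
    """Log-odds of a probability p, defined for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("logit is only defined for 0 < p < 1")
    return math.log(p) - math.log(1.0 - p)


def logit_diff_row(alt_row, ref_row):
    """Return logit(alt) - logit(ref) for each of the N features."""
    return [logit(alt) - logit(ref) for alt, ref in zip(alt_row, ref_row)]


# Feature 1 moves from p = 0.5 (even odds) to p = 0.8 (odds of 4 to 1),
# so its score is close to log(4). Feature 2 is unchanged, so it scores 0.
print(logit_diff_row([0.8, 0.3], [0.5, 0.3]))
```

Unlike the plain difference alt - ref, this score weights changes near 0 or 1 more heavily than the same absolute change near 0.5.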
WritePredictionsHandler
- class selene_sdk.predict.predict_handlers.WritePredictionsHandler(features, columns_for_ids, output_path_prefix, output_format, output_size=None, write_mem_limit=1500, write_labels=True)
Bases:
selene_sdk.predict.predict_handlers.handler.PredictionsHandler
Collects batches of model predictions and writes all of them to file at the end.
- Parameters
features (list(str)) – List of sequence-level features, in the same order that the model will return its predictions.
columns_for_ids (list(str)) – Columns in the file that help to identify the input sequence to which the features data corresponds.
output_path_prefix (str) – Path to the file to which Selene will write the model predictions. The path may contain a filename prefix. Selene will append predictions to the end of the prefix.
output_format ({‘tsv’, ‘hdf5’}) – Specify the desired output format. TSV can be specified if you would like the final file to be easily perused. However, saving to a TSV file is much slower than saving to an HDF5 file.
output_size (int, optional) – The total number of rows in the output. Must be specified when the output_format is hdf5.
write_mem_limit (int, optional) – Default is 1500. Specify the amount of memory you can allocate to storing model predictions/scores for this particular handler, in MB. Handler will write to file whenever this memory limit is reached.
write_labels (bool, optional) – Default is True. If you initialize multiple write handlers for the same set of inputs with output format hdf5, set write_labels to False on all handlers except one, so that only one handler writes the row labels to an output file.
- Variables
needs_base_pred (bool) – Whether the handler needs the base (reference) prediction as input to compute the final output.
- handle_batch_predictions(batch_predictions, batch_ids)
Handles the predictions for a batch of sequences.
- Parameters
batch_predictions (arraylike) – The predictions for a batch of sequences. This should have dimensions of \(B \times N\) (where \(B\) is the size of the mini-batch and \(N\) is the number of features).
batch_ids (list(arraylike)) – Batch of sequence identifiers. Each element is arraylike because it may contain more than one column (written to file) that together make up a unique identifier for a sequence.
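To make the relationship between columns_for_ids, features, batch_ids, and batch_predictions concrete, here is a minimal sketch that writes one batch as TSV. The function write_tsv_batch and the column layout (identifier columns first, then one column per feature) are assumptions for illustration and may differ from the files Selene actually produces.

```python
import csv
import io


def write_tsv_batch(handle, columns_for_ids, features,
                    batch_ids, batch_predictions, write_header=True):
    """Write one row per sequence: its identifier columns, then N scores."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    if write_header:
        writer.writerow(list(columns_for_ids) + list(features))
    for ids, predictions in zip(batch_ids, batch_predictions):
        # Each element of batch_ids may span several identifier columns.
        writer.writerow(list(ids) + list(predictions))


buffer = io.StringIO()  # stands in for an open output file
write_tsv_batch(
    buffer,
    columns_for_ids=["chrom", "pos"],
    features=["CTCF", "DNase"],
    batch_ids=[("chr1", 100), ("chr2", 250)],
    batch_predictions=[[0.9, 0.1], [0.2, 0.7]],
)
print(buffer.getvalue())
```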
WriteRefAltHandler
- class selene_sdk.predict.predict_handlers.WriteRefAltHandler(features, columns_for_ids, output_path_prefix, output_format, output_size=None, write_mem_limit=1500, write_labels=True)
Bases:
selene_sdk.predict.predict_handlers.handler.PredictionsHandler
Used during variant effect prediction. This handler records the predicted values for the reference and alternate sequences, and stores these values in two separate files.
- Parameters
features (list(str)) – List of sequence-level features, in the same order that the model will return its predictions.
columns_for_ids (list(str)) – Columns in the file that help to identify the input sequence to which the features data corresponds.
output_path_prefix (str) – Path for the file(s) to which Selene will write the reference and alternate predictions. The path may contain a filename prefix. Selene will append ref_predictions and alt_predictions to the end of the prefix to distinguish between the reference and alternate predictions files.
output_format ({‘tsv’, ‘hdf5’}) – Specify the desired output format. TSV can be specified if you would like the final file to be easily perused. However, saving to a TSV file is much slower than saving to an HDF5 file.
output_size (int, optional) – The total number of rows in the output. Must be specified when the output_format is hdf5.
write_mem_limit (int, optional) – Default is 1500. Specify the amount of memory you can allocate to storing model predictions/scores for this particular handler, in MB. Handler will write to file whenever this memory limit is reached.
write_labels (bool, optional) – Default is True. If you initialize multiple write handlers for the same set of inputs with output format hdf5, set write_labels to False on all handlers except one, so that only one handler writes the row labels to an output file.
- Variables
needs_base_pred (bool) – Whether the handler needs the base (reference) prediction as input to compute the final output.
- handle_batch_predictions(batch_predictions, batch_ids, base_predictions)
Handles the predictions for a batch of sequences.
- Parameters
batch_predictions (arraylike) – The predictions for a batch of sequences. This should have dimensions of \(B \times N\) (where \(B\) is the size of the mini-batch and \(N\) is the number of features).
batch_ids (list(arraylike)) – Batch of sequence identifiers. Each element is arraylike because it may contain more than one column (written to file) that together make up a unique identifier for a sequence.
base_predictions (arraylike) – The baseline (reference) prediction(s) to record alongside the alternate predictions. This must either be a vector of \(N\) values, or a matrix of shape \(B \times N\) (where \(B\) is the size of the mini-batch, and \(N\) is the number of features).
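One way to picture the two output files is as a pair of row-aligned tables, where row i of the reference table belongs to the same variant as row i of the alternate table. The sketch below builds such tables from plain lists. The function pair_ref_alt_rows is hypothetical, and repeating a shared reference vector once per alternate row is an assumption about how alignment could be kept, not a description of Selene's internals.

```python
def pair_ref_alt_rows(batch_predictions, base_predictions):
    """Return (ref_rows, alt_rows), two B x N tables aligned row by row."""
    # A shared reference is a flat vector of N values rather than B rows.
    shared = not isinstance(base_predictions[0], (list, tuple))
    alt_rows = [list(row) for row in batch_predictions]
    if shared:
        # Assumption: repeat the single reference once per alternate row.
        ref_rows = [list(base_predictions) for _ in alt_rows]
    else:
        if len(base_predictions) != len(alt_rows):
            raise ValueError("need one reference row per alternate row")
        ref_rows = [list(row) for row in base_predictions]
    return ref_rows, alt_rows


# Two alternate sequences scored against the same reference sequence.
ref_rows, alt_rows = pair_ref_alt_rows([[0.8, 0.1], [0.6, 0.3]], [0.5, 0.2])
print(ref_rows)  # [[0.5, 0.2], [0.5, 0.2]]
print(alt_rows)  # [[0.8, 0.1], [0.6, 0.3]]
```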