selene_sdk.predict.predict_handlers
This module provides the classes and methods for prediction handlers, which are generally used to log and save the outputs of models.
PredictionsHandler
- class selene_sdk.predict.predict_handlers.PredictionsHandler(features, columns_for_ids, output_path_prefix, output_format, output_size=None, write_mem_limit=1500, write_labels=True)
Bases: object
The abstract base class for handlers, which “handle” model predictions. Handlers are responsible for accepting predictions, storing these predictions or scores derived from the predictions, and then returning them in a user-specified output format (Selene currently supports TSV and HDF5 file outputs).
- Parameters
features (list(str)) – List of sequence-level features, in the same order that the model will return its predictions.
columns_for_ids (list(str)) – Columns in the file that will help to identify the sequence or variant to which the model prediction scores correspond.
output_path_prefix (str) – Path to the file to which Selene will write the handler's output. The path may contain a filename prefix. Selene will append a handler-specific name to the end of the path/prefix.
output_format ({‘tsv’, ‘hdf5’}) – Specify the desired output format. TSV can be specified if the final file should be easily perused (e.g. viewed in a text editor/Excel). However, saving to a TSV file is much slower than saving to an HDF5 file.
output_size (int, optional) – The total number of rows in the output. Must be specified when the output_format is hdf5.
write_mem_limit (int, optional) – Default is 1500. Specify the amount of memory you can allocate to storing model predictions/scores for this particular handler, in MB. Handler will write to file whenever this memory limit is reached.
write_labels (bool, optional) – Default is True. If you initialize multiple write handlers for the same set of inputs with output format hdf5, set write_labels to False on all handlers except one so that only one handler writes the row labels to an output file.
- Variables
~PredictionsHandler.needs_base_pred (bool) – Whether the handler needs the base (reference) prediction as input to compute the final output
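The interaction between write_mem_limit and writing to file can be pictured with a small toy buffer. This is only a sketch under stated assumptions, not Selene's implementation: `rows_per_flush` and `RowBuffer` are hypothetical names, the estimate assumes 8-byte (float64) values and 1 MB = 1,000,000 bytes, and a plain list stands in for the output file.

```python
def rows_per_flush(n_features, write_mem_limit_mb=1500, bytes_per_value=8):
    """Estimate how many prediction rows fit under the memory limit.

    Assumption: each row holds n_features values of bytes_per_value bytes,
    and 1 MB is counted as 1,000,000 bytes.
    """
    limit_bytes = write_mem_limit_mb * 1_000_000
    return max(1, int(limit_bytes // (n_features * bytes_per_value)))


class RowBuffer:
    """Toy stand-in for a handler: buffer rows, flush when the buffer is full."""

    def __init__(self, max_rows):
        self.max_rows = max_rows
        self.pending = []   # rows held in memory
        self.written = []   # stands in for the output file

    def add_batch(self, batch):
        for row in batch:
            self.pending.append(list(row))
            if len(self.pending) >= self.max_rows:
                self.flush()

    def flush(self):
        # "Write" everything held in memory, then clear the buffer.
        self.written.extend(self.pending)
        self.pending = []


buffer = RowBuffer(max_rows=2)
buffer.add_batch([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
buffer.flush()  # final flush for the row still held in memory
```

With this accounting, a larger limit means fewer and larger writes, at the cost of holding more rows in memory between writes.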
DiffScoreHandler
- class selene_sdk.predict.predict_handlers.DiffScoreHandler(features, columns_for_ids, output_path_prefix, output_format, output_size=None, write_mem_limit=1500, write_labels=True)
Bases: selene_sdk.predict.predict_handlers.handler.PredictionsHandler
The “diff score” is the difference between alt and ref predictions (alt - ref).
- Parameters
features (list(str)) – List of sequence-level features, in the same order that the model will return its predictions.
columns_for_ids (list(str)) – Columns in the file that help to identify the input sequence or variant to which the model prediction scores correspond.
output_path_prefix (str) – Path to the file to which Selene will write the difference scores. The path may contain a filename prefix. Selene will append diffs to the end of the prefix if specified (otherwise the file will be named diffs.tsv/.h5).
output_format ({‘tsv’, ‘hdf5’}) – Specify the desired output format. TSV can be specified if you would like the final file to be easily perused (e.g. viewed in a text editor/Excel). However, saving to a TSV file is much slower than saving to an HDF5 file.
output_size (int, optional) – The total number of rows in the output. Must be specified when the output_format is hdf5.
write_mem_limit (int, optional) – Default is 1500. Specify the amount of memory you can allocate to storing model predictions/scores for this particular handler, in MB. Handler will write to file whenever this memory limit is reached.
write_labels (bool, optional) – Default is True. If you initialize multiple write handlers for the same set of inputs with output format hdf5, set write_labels to False on all handlers except one so that only one handler writes the row labels to an output file.
- Variables
~DiffScoreHandler.needs_base_pred (bool) – Whether the handler needs the base (reference) prediction as input to compute the final output
- handle_batch_predictions(batch_predictions, batch_ids, baseline_predictions)
Handles the model predictions for a batch of sequences. Computes the difference between the predictions for a batch of alternate sequences (i.e. sequences slightly changed/mutated from the reference) and the predictions for either a single reference sequence or a matching batch of reference sequences.
- Parameters
batch_predictions (arraylike) – The predictions for a batch of sequences. This should have dimensions of \(B \times N\) (where \(B\) is the size of the mini-batch and \(N\) is the number of features).
batch_ids (list(arraylike)) – Batch of sequence identifiers. Each element is arraylike because it may contain more than one column (written to file) that together make up a unique identifier for a sequence.
baseline_predictions (arraylike) – The baseline prediction(s) used to compute the diff scores. Must either be a vector of \(N\) values or a matrix of dimensions \(B \times N\) (where \(B\) is the size of the mini-batch, and \(N\) is the number of features).
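The two accepted baseline shapes can be illustrated with NumPy broadcasting. The following is a minimal sketch, not Selene's code: `diff_scores` is a hypothetical helper, and the prediction values are made up for illustration.

```python
import numpy as np


def diff_scores(batch_predictions, baseline_predictions):
    """Return alt - ref for a batch (hypothetical helper, not Selene's API)."""
    alt = np.asarray(batch_predictions, dtype=float)     # shape (B, N)
    ref = np.asarray(baseline_predictions, dtype=float)  # shape (N,) or (B, N)
    if ref.shape not in (alt.shape, alt.shape[1:]):
        raise ValueError(
            f"baseline shape {ref.shape} must be {alt.shape[1:]} or {alt.shape}"
        )
    # A length-N baseline is broadcast across all B rows of the batch.
    return alt - ref


alt = np.array([[0.9, 0.2],
                [0.4, 0.6]])                   # B = 2 sequences, N = 2 features
single_ref = np.array([0.5, 0.5])              # one reference for the whole batch
batch_ref = np.array([[0.5, 0.5],
                      [0.3, 0.7]])             # one reference per alternate

diffs_single = diff_scores(alt, single_ref)    # shape (2, 2)
diffs_batch = diff_scores(alt, batch_ref)      # shape (2, 2)
```

In both cases the result has one row per alternate sequence and one column per feature.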
LogitScoreHandler
- class selene_sdk.predict.predict_handlers.LogitScoreHandler(features, columns_for_ids, output_path_prefix, output_format, output_size=None, write_mem_limit=1500, write_labels=True)
Bases: selene_sdk.predict.predict_handlers.handler.PredictionsHandler
The logit score handler calculates and records the difference between logit(alt) and logit(ref) predictions (logit(alt) - logit(ref)). For reference, if some event occurs with probability \(p\), then the log-odds is the logit of \(p\), or
\[\mathrm{logit}(p) = \log\left(\frac{p}{1 - p}\right) = \log(p) - \log(1 - p)\]
- Parameters
features (list of str) – List of sequence-level features, in the same order that the model will return its predictions.
columns_for_ids (list of str) – Columns in the file that help to identify the input sequence to which the features data corresponds.
output_path_prefix (str) – Path to the file to which Selene will write the logit difference scores. The path may contain a filename prefix. Selene will append logits to the end of the prefix.
output_format ({‘tsv’, ‘hdf5’}) – Specify the desired output format. TSV can be specified if you would like the final file to be easily perused. However, saving to a TSV file is much slower than saving to an HDF5 file.
output_size (int, optional) – The total number of rows in the output. Must be specified when the output_format is hdf5.
write_mem_limit (int, optional) – Default is 1500. Specify the amount of memory you can allocate to storing model predictions/scores for this particular handler, in MB. Handler will write to file whenever this memory limit is reached.
write_labels (bool, optional) – Default is True. If you initialize multiple write handlers for the same set of inputs with output format hdf5, set write_labels to False on all handlers except one so that only one handler writes the row labels to an output file.
- Variables
~LogitScoreHandler.needs_base_pred (bool) – Whether the handler needs the base (reference) prediction as input to compute the final output
- handle_batch_predictions(batch_predictions, batch_ids, baseline_predictions)
Handles the model predictions for a batch of sequences.
- Parameters
batch_predictions (arraylike) – The predictions for a batch of sequences. This should have dimensions of \(B \times N\) (where \(B\) is the size of the mini-batch and \(N\) is the number of features).
batch_ids (list(arraylike)) – Batch of sequence identifiers. Each element is arraylike because it may contain more than one column (written to file) that together make up a unique identifier for a sequence.
baseline_predictions (arraylike) – The baseline prediction(s) used to compute the logit scores. This must either be a vector of \(N\) values, or a matrix of shape \(B \times N\) (where \(B\) is the size of the mini-batch, and \(N\) is the number of features).
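The logit score follows directly from the formula given for this class. The sketch below uses NumPy and two hypothetical helpers, `logit` and `logit_diff_scores`. Clipping probabilities away from 0 and 1 with `eps` is an assumption added here to keep the logarithms finite. It is not something this page states about Selene's behavior.

```python
import numpy as np


def logit(p, eps=1e-7):
    """logit(p) = log(p) - log(1 - p), with p clipped to [eps, 1 - eps].

    The clipping is an assumption of this sketch, not documented Selene behavior.
    """
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return np.log(p) - np.log1p(-p)


def logit_diff_scores(batch_predictions, baseline_predictions):
    """Return logit(alt) - logit(ref); a length-N baseline broadcasts over rows."""
    return logit(batch_predictions) - logit(baseline_predictions)


alt = np.array([[0.75, 0.5]])   # B = 1 sequence, N = 2 features (made-up values)
ref = np.array([0.5, 0.5])      # single reference prediction
scores = logit_diff_scores(alt, ref)
```

Because logit(0.5) is 0, the first score here reduces to logit(0.75) = log(3), and the second is 0.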
WritePredictionsHandler
- class selene_sdk.predict.predict_handlers.WritePredictionsHandler(features, columns_for_ids, output_path_prefix, output_format, output_size=None, write_mem_limit=1500, write_labels=True)
Bases: selene_sdk.predict.predict_handlers.handler.PredictionsHandler
Collects batches of model predictions and writes all of them to file at the end.
- Parameters
features (list(str)) – List of sequence-level features, in the same order that the model will return its predictions.
columns_for_ids (list(str)) – Columns in the file that help to identify the input sequence to which the features data corresponds.
output_path_prefix (str) – Path to the file to which Selene will write the model predictions. The path may contain a filename prefix. Selene will append predictions to the end of the prefix.
output_format ({‘tsv’, ‘hdf5’}) – Specify the desired output format. TSV can be specified if you would like the final file to be easily perused. However, saving to a TSV file is much slower than saving to an HDF5 file.
output_size (int, optional) – The total number of rows in the output. Must be specified when the output_format is hdf5.
write_mem_limit (int, optional) – Default is 1500. Specify the amount of memory you can allocate to storing model predictions/scores for this particular handler, in MB. Handler will write to file whenever this memory limit is reached.
write_labels (bool, optional) – Default is True. If you initialize multiple write handlers for the same set of inputs with output format hdf5, set write_labels to False on all handlers except one so that only one handler writes the row labels to an output file.
- Variables
~WritePredictionsHandler.needs_base_pred (bool) – Whether the handler needs the base (reference) prediction as input to compute the final output
- handle_batch_predictions(batch_predictions, batch_ids)
Handles the predictions for a batch of sequences.
- Parameters
batch_predictions (arraylike) – The predictions for a batch of sequences. This should have dimensions of \(B \times N\) (where \(B\) is the size of the mini-batch and \(N\) is the number of features).
batch_ids (list(arraylike)) – Batch of sequence identifiers. Each element is arraylike because it may contain more than one column (written to file) that together make up a unique identifier for a sequence.
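How batch_ids and batch_predictions line up in a TSV can be sketched with the standard library. This is an illustration only: `write_predictions_tsv` is a hypothetical function, the layout (identifier columns first, then one column per feature) is an assumption, and the column names, feature names, and values are invented.

```python
import csv
import io


def write_predictions_tsv(handle, columns_for_ids, features,
                          batch_ids, batch_predictions, write_header=True):
    """Write one TSV row per sequence: identifier columns, then feature scores.

    Hypothetical helper; the column order is an assumption of this sketch.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    if write_header:
        writer.writerow(list(columns_for_ids) + list(features))
    for ids, predictions in zip(batch_ids, batch_predictions):
        # Each element of batch_ids may hold several identifier columns.
        writer.writerow(list(ids) + [f"{p:.6g}" for p in predictions])


out = io.StringIO()  # stands in for the output file
write_predictions_tsv(
    out,
    columns_for_ids=["chrom", "pos"],
    features=["feature_a", "feature_b"],
    batch_ids=[("chr1", 1000), ("chr2", 2500)],
    batch_predictions=[[0.91, 0.12], [0.05, 0.77]],
)
tsv_text = out.getvalue()
```

Each row of the output then carries the full identifier for a sequence followed by its \(N\) prediction values, in the same order as features.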
WriteRefAltHandler
- class selene_sdk.predict.predict_handlers.WriteRefAltHandler(features, columns_for_ids, output_path_prefix, output_format, output_size=None, write_mem_limit=1500, write_labels=True)
Bases: selene_sdk.predict.predict_handlers.handler.PredictionsHandler
Used during variant effect prediction. This handler records the predicted values for the reference and alternate sequences, and stores these values in two separate files.
- Parameters
features (list(str)) – List of sequence-level features, in the same order that the model will return its predictions.
columns_for_ids (list(str)) – Columns in the file that help to identify the input sequence to which the features data corresponds.
output_path_prefix (str) – Path for the file(s) to which Selene will write the reference and alternate predictions. The path may contain a filename prefix. Selene will append ref_predictions and alt_predictions to the end of the prefix to distinguish between the reference and alternate prediction files written.
output_format ({‘tsv’, ‘hdf5’}) – Specify the desired output format. TSV can be specified if you would like the final file to be easily perused. However, saving to a TSV file is much slower than saving to an HDF5 file.
output_size (int, optional) – The total number of rows in the output. Must be specified when the output_format is hdf5.
write_mem_limit (int, optional) – Default is 1500. Specify the amount of memory you can allocate to storing model predictions/scores for this particular handler, in MB. Handler will write to file whenever this memory limit is reached.
write_labels (bool, optional) – Default is True. If you initialize multiple write handlers for the same set of inputs with output format hdf5, set write_labels to False on all handlers except one so that only one handler writes the row labels to an output file.
- Variables
~WriteRefAltHandler.needs_base_pred (bool) – Whether the handler needs the base (reference) prediction as input to compute the final output
- handle_batch_predictions(batch_predictions, batch_ids, base_predictions)
Handles the predictions for a batch of sequences.
- Parameters
batch_predictions (arraylike) – The predictions for a batch of sequences. This should have dimensions of \(B \times N\) (where \(B\) is the size of the mini-batch and \(N\) is the number of features).
batch_ids (list(arraylike)) – Batch of sequence identifiers. Each element is arraylike because it may contain more than one column (written to file) that together make up a unique identifier for a sequence.
base_predictions (arraylike) – The baseline (reference) prediction(s) to record alongside the alternate predictions. This must either be a vector of \(N\) values, or a matrix of shape \(B \times N\) (where \(B\) is the size of the mini-batch, and \(N\) is the number of features).