selene_sdk.samplers

This module provides classes and methods for sampling labeled data examples.

Sampler

class selene_sdk.samplers.Sampler(features, save_datasets=[], output_dir=None)[source]

Bases: object

The base class for samplers. It currently enforces that all samplers have modes for drawing training and validation samples to train a model.

Parameters
  • features (list(str)) – The list of features (classes) the model predicts.

  • save_datasets (list(str), optional) – Default is [], the empty list. The list of modes for which we should save sampled data to file (one or more of [‘train’, ‘validate’, ‘test’]).

  • output_dir (str or None, optional) – Default is None. Path to the output directory. Used if we save any of the data sampled. If save_datasets is non-empty, output_dir must be a valid path. If the directory does not yet exist, it will be created for you.

Variables
  • ~Sampler.modes (list(str)) – A list of the names of the modes that the object may operate in.

  • ~Sampler.mode (str or None) – The current mode that the object is operating in.

BASE_MODES = ('train', 'validate')

The types of modes that the Sampler object can run in.

abstract get_data_and_targets(batch_size, n_samples, mode=None)[source]

This method fetches a subset of the data from the sampler, divided into batches. This method also allows the user to specify what operating mode to run the sampler in when fetching the data.

Parameters
  • batch_size (int) – The size of the batches to divide the data into.

  • n_samples (int) – The total number of samples to retrieve.

  • mode (str, optional) – Default is None. The operating mode that the object should run in. If None, will use the current mode self.mode.

abstract get_feature_from_index(index)[source]

Returns the feature corresponding to an index in the feature vector.

Parameters

index (int) – The index of the feature to retrieve the name for.

Returns

The name of the feature occurring at the specified index.

Return type

str

abstract get_test_set(batch_size, n_samples=None)[source]

This method returns a subset of testing data from the sampler, divided into batches.

Parameters
  • batch_size (int) – The size of the batches to divide the data into.

  • n_samples (int or None, optional) – Default is None. The total number of test examples to retrieve. If None, 640000 examples are retrieved.

Returns

sequences_and_targets, targets_matrix – Tuple containing the list of sequence-target pairs, as well as a single matrix with all targets in the same order. Note that sequences_and_targets’s sequence elements are of the shape \(B \times L \times N\) and its target elements are of the shape \(B \times F\), where \(B\) is batch_size, \(L\) is the sequence length, \(N\) is the size of the sequence type’s alphabet, and \(F\) is the number of features. Further, targets_matrix is of the shape \(S \times F\), where \(S =\) n_samples.

Return type

tuple(list(tuple(numpy.ndarray, numpy.ndarray)), numpy.ndarray)

Raises

ValueError – If no test partition of the data was specified during sampler initialization.

abstract get_validation_set(batch_size, n_samples=None)[source]

This method returns a subset of validation data from the sampler, divided into batches.

Parameters
  • batch_size (int) – The size of the batches to divide the data into.

  • n_samples (int, optional) – Default is None. The total number of validation examples to retrieve. Handling for n_samples=None should be done by all classes that subclass selene_sdk.samplers.Sampler.

abstract sample(batch_size=1)[source]

Fetches a mini-batch of the data from the sampler.

Parameters

batch_size (int, optional) – Default is 1. The size of the batch to retrieve.

abstract save_dataset_to_file(mode, close_filehandle=False)[source]

Save samples for each partition (i.e. train/validate/test) to disk.

Parameters
  • mode (str) – Must be one of the modes specified in save_datasets during sampler initialization.

  • close_filehandle (bool, optional) – Default is False. close_filehandle=True assumes that all data corresponding to the input mode has been saved to file and save_dataset_to_file will not be called with mode again.

set_mode(mode)[source]

Sets the sampling mode.

Parameters

mode (str) – The name of the mode to use. It must be one of Sampler.BASE_MODES.

Raises

ValueError – If mode is not a valid mode.
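
The mode contract described above can be sketched with a toy class. This is a minimal illustration, not selene_sdk's implementation: ToySampler and its error message are hypothetical, and only BASE_MODES, modes, mode, and set_mode mirror names from this page.

```python
class ToySampler:
    """Illustrative stand-in for a sampler; not selene_sdk's Sampler."""

    BASE_MODES = ("train", "validate")

    def __init__(self):
        # Every sampler starts with the base modes and no active mode.
        self.modes = list(self.BASE_MODES)
        self.mode = None

    def set_mode(self, mode):
        # Reject any mode that the sampler does not know about.
        if mode not in self.modes:
            raise ValueError(
                "Invalid mode '{0}'; valid modes are {1}".format(mode, self.modes)
            )
        self.mode = mode


sampler = ToySampler()
sampler.set_mode("validate")
```

Calling sampler.set_mode("test") on this toy class raises ValueError, because "test" is not among its base modes.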

OnlineSampler

class selene_sdk.samplers.OnlineSampler(reference_sequence, target_path, features, seed=436, validation_holdout=['chr6', 'chr7'], test_holdout=['chr8', 'chr9'], sequence_length=1001, center_bin_to_predict=201, feature_thresholds=0.5, mode='train', save_datasets=[], output_dir=None)[source]

Bases: selene_sdk.samplers.sampler.Sampler

A sampler in which training/validation/test data is constructed from random sampling of the dataset for each batch passed to the model. This form of sampling may alleviate the problem of loading an extremely large dataset into memory when developing a new model.

Parameters
  • reference_sequence (selene_sdk.sequences.Sequence) – A reference sequence from which to create examples.

  • target_path (str) – Path to tabix-indexed, compressed BED file (*.bed.gz) of genomic coordinates mapped to the genomic features we want to predict.

  • features (list(str)) – List of distinct features that we aim to predict.

  • seed (int, optional) – Default is 436. Sets the random seed for sampling.

  • validation_holdout (list(str) or float, optional) – Default is [‘chr6’, ‘chr7’]. Holdout can be regional or proportional. If regional, expects a list (e.g. [‘X’, ‘Y’]). Regions must match those specified in the first column of the tabix-indexed BED file. If proportional, specify a proportion in the open interval (0.0, 1.0), typically 0.10 or 0.20.

  • test_holdout (list(str) or float, optional) – Default is [‘chr8’, ‘chr9’]. See documentation for validation_holdout for additional information.

  • sequence_length (int, optional) – Default is 1001. Model is trained on sequences of sequence_length where genomic features are annotated to the center regions of these sequences.

  • center_bin_to_predict (int, optional) – Default is 201. Query the tabix-indexed file for a region of length center_bin_to_predict.

  • feature_thresholds (float [0.0, 1.0], optional) – Default is 0.5. The feature_threshold to pass to the GenomicFeatures object.

  • mode ({‘train’, ‘validate’, ‘test’}, optional) – Default is ‘train’. The mode to run the sampler in.

  • save_datasets (list(str), optional) – Default is [], the empty list. The list of modes for which we should save the sampled data to file (e.g. [“test”, “validate”]).

  • output_dir (str or None, optional) – Default is None. The path to the directory where we should save sampled examples for a mode. If save_datasets is a non-empty list, output_dir must be specified. If the path in output_dir does not exist it will be created automatically.

Variables
  • ~OnlineSampler.reference_sequence (selene_sdk.sequences.Sequence) – The reference sequence that examples are created from.

  • ~OnlineSampler.target (selene_sdk.targets.Target) – The selene_sdk.targets.Target object holding the features that we would like to predict.

  • ~OnlineSampler.validation_holdout (list(str) or float) – The samples to hold out for validating model performance. These can be “regional” or “proportional”. If regional, this is a list of region names (e.g. [‘chrX’, ‘chrY’]). These regions must match those specified in the first column of the tabix-indexed BED file. If proportional, this is the fraction of total samples that will be held out.

  • ~OnlineSampler.test_holdout (list(str) or float) – The samples to hold out for testing model performance. See the documentation for validation_holdout for more details.

  • ~OnlineSampler.sequence_length (int) – The length of the sequences to train the model on.

  • ~OnlineSampler.bin_radius (int) – From the center of the sequence, the radius in which to detect a feature annotation in order to include it as a sample’s label.

  • ~OnlineSampler.surrounding_sequence_radius (int) – The length of sequence falling outside of the feature detection bin (i.e. bin_radius) center, but still within the sequence_length.

  • ~OnlineSampler.modes (list(str)) – The list of modes that the sampler can be run in.

  • ~OnlineSampler.mode (str) – The current mode that the sampler is running in. Must be one of the modes listed in modes.
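
To make the regional holdout concrete, the following sketch assigns a region to a partition by name. It is an illustration under assumptions: assign_partition is a hypothetical helper, not part of selene_sdk, and proportional holdouts are not covered.

```python
def assign_partition(chrom, validation_holdout, test_holdout):
    """Hypothetical helper: map a region name to the mode that owns it."""
    # Regions listed in a holdout are reserved for that partition.
    if chrom in validation_holdout:
        return "validate"
    if chrom in test_holdout:
        return "test"
    # Everything else remains available for training.
    return "train"


validation_holdout = ["chr6", "chr7"]
test_holdout = ["chr8", "chr9"]
partitions = {
    chrom: assign_partition(chrom, validation_holdout, test_holdout)
    for chrom in ["chr1", "chr6", "chr8"]
}
```

The region names must match the first column of the BED file exactly, so ‘chr6’ and ‘6’ would be treated as different regions in this sketch.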

Raises
  • ValueError – If mode is not a valid mode.

  • ValueError – If the parities of sequence_length and center_bin_to_predict are not the same.

  • ValueError – If sequence_length is smaller than center_bin_to_predict is.

  • ValueError – If the types of validation_holdout and test_holdout are not the same.
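
The parity and length checks listed above, and their relation to bin_radius and surrounding_sequence_radius, can be sketched as follows. This is a hedged illustration: compute_radii is a hypothetical helper, and deriving each radius by integer halving is an assumption rather than a statement about the library's code.

```python
def compute_radii(sequence_length, center_bin_to_predict):
    """Hypothetical helper: validate the two lengths and derive the radii."""
    # Both lengths must be odd or both even so the bin can be centered.
    if (sequence_length + center_bin_to_predict) % 2 != 0:
        raise ValueError(
            "sequence_length and center_bin_to_predict must have the same parity"
        )
    surrounding_length = sequence_length - center_bin_to_predict
    # The center bin cannot be longer than the sequence that contains it.
    if surrounding_length < 0:
        raise ValueError(
            "sequence_length must be at least center_bin_to_predict"
        )
    # Assumption: each radius is half of the corresponding length.
    bin_radius = center_bin_to_predict // 2
    surrounding_sequence_radius = surrounding_length // 2
    return bin_radius, surrounding_sequence_radius


radii = compute_radii(1001, 201)
```

Under these assumptions, a sequence_length of 1001 with a center_bin_to_predict of 201 gives a bin_radius of 100 and a surrounding_sequence_radius of 400.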

STRAND_SIDES = ('+', '-')

Defines the strands that features can be sampled from.

get_data_and_targets(batch_size, n_samples=None, mode=None)[source]

This method fetches a subset of the data from the sampler, divided into batches. This method also allows the user to specify what operating mode to run the sampler in when fetching the data.

Parameters
  • batch_size (int) – The size of the batches to divide the data into.

  • n_samples (int or None, optional) – Default is None. The total number of samples to retrieve. If n_samples is None, it is set to 32000 when the mode is validate and to 640000 when the mode is test. If the mode is train, you must specify a value for n_samples.

  • mode (str, optional) – Default is None. The mode to run the sampler in when fetching the samples. See selene_sdk.samplers.IntervalsSampler.modes for more information. If None, will use the current mode self.mode.

Returns

sequences_and_targets, targets_matrix – Tuple containing the list of sequence-target pairs, as well as a single matrix with all targets in the same order. Note that sequences_and_targets’s sequence elements are of the shape \(B \times L \times N\) and its target elements are of the shape \(B \times F\), where \(B\) is batch_size, \(L\) is the sequence length, \(N\) is the size of the sequence type’s alphabet, and \(F\) is the number of features. Further, targets_matrix is of the shape \(S \times F\), where \(S =\) n_samples.

Return type

tuple(list(tuple(numpy.ndarray, numpy.ndarray)), numpy.ndarray)
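
The shapes in this return value can be illustrated with a NumPy sketch. It assumes a stand-in draw function that returns random arrays, and batch_samples is a hypothetical helper. The handling of a final partial batch is also an assumption, not the library's documented behavior.

```python
import numpy as np


def batch_samples(draw_batch, batch_size, n_samples):
    """Hypothetical helper: collect batches until n_samples are covered."""
    sequences_and_targets = []
    targets_only = []
    remaining = n_samples
    while remaining > 0:
        # Assumption: the last batch may be smaller than batch_size.
        size = min(batch_size, remaining)
        sequences, targets = draw_batch(size)
        sequences_and_targets.append((sequences, targets))
        targets_only.append(targets)
        remaining -= size
    # Stack every batch's targets into one S x F matrix, in draw order.
    targets_matrix = np.vstack(targets_only)
    return sequences_and_targets, targets_matrix


rng = np.random.default_rng(436)
L, N, F = 10, 4, 3  # small illustrative sizes


def fake_draw(size):
    # Random arrays standing in for encoded sequences and binary labels.
    sequences = rng.random((size, L, N))
    targets = rng.integers(0, 2, size=(size, F))
    return sequences, targets


batches, targets_matrix = batch_samples(fake_draw, batch_size=64, n_samples=256)
```

Here each element of batches is a pair with shapes (64, 10, 4) and (64, 3), and targets_matrix has shape (256, 3).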

get_dataset_in_batches(mode, batch_size, n_samples=None)[source]

This method returns a subset of the data for a specified run mode, divided into mini-batches.

Parameters
  • mode ({‘test’, ‘validate’}) – The mode to run the sampler in when fetching the samples. See selene_sdk.samplers.IntervalsSampler.modes for more information.

  • batch_size (int) – The size of the batches to divide the data into.

  • n_samples (int or None, optional) – Default is None. The total number of samples to retrieve. If None, it will retrieve 32000 samples if mode is validate or 640000 samples if mode is test or train.

Returns

sequences_and_targets, targets_matrix – Tuple containing the list of sequence-target pairs, as well as a single matrix with all targets in the same order. The list is length \(S\), where \(S =\) n_samples. Note that sequences_and_targets’s sequence elements are of the shape \(B \times L \times N\) and its target elements are of the shape \(B \times F\), where \(B\) is batch_size, \(L\) is the sequence length, \(N\) is the size of the sequence type’s alphabet, and \(F\) is the number of features. Further, targets_matrix is of the shape \(S \times F\).

Return type

tuple(list(tuple(numpy.ndarray, numpy.ndarray)), numpy.ndarray)

get_feature_from_index(index)[source]

Returns the feature corresponding to an index in the feature vector.

Parameters

index (int) – The index of the feature to retrieve the name for.

Returns

The name of the feature occurring at the specified index.

Return type

str

get_sequence_from_encoding(encoding)[source]

Gets the string sequence from the one-hot encoding of the sequence.

Parameters

encoding (numpy.ndarray) – An \(L \times N\) array (where \(L\) is the length of the sequence and \(N\) is the size of the sequence type’s alphabet) containing the one-hot encoding of the sequence.

Returns

The sequence of \(L\) characters decoded from the input.

Return type

str
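
As an illustration of decoding, the sketch below converts an \(L \times N\) one-hot array back to a string. It assumes a DNA alphabet ordered as "ACGT" and strictly one-hot rows. Both are assumptions for this example, and decode_one_hot is a hypothetical function, not the library's method.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed column order for this illustration


def decode_one_hot(encoding, alphabet=ALPHABET):
    """Hypothetical decoder: take the hot column of each row as the base."""
    # argmax along the alphabet axis gives the index of the 1 in each row.
    indices = np.argmax(encoding, axis=1)
    return "".join(alphabet[i] for i in indices)


encoding = np.array([
    [1, 0, 0, 0],  # A
    [0, 0, 1, 0],  # G
    [0, 0, 0, 1],  # T
    [0, 1, 0, 0],  # C
])
decoded = decode_one_hot(encoding)
```

Rows that are not strictly one-hot, such as an encoding of an unknown base, would need separate handling that this sketch does not attempt.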

get_test_set(batch_size, n_samples=None)[source]

This method returns a subset of testing data from the sampler, divided into batches.

Parameters
  • batch_size (int) – The size of the batches to divide the data into.

  • n_samples (int or None, optional) – Default is None. The total number of test examples to retrieve. If None, 640000 examples are retrieved.

Returns

sequences_and_targets, targets_matrix – Tuple containing the list of sequence-target pairs, as well as a single matrix with all targets in the same order. Note that sequences_and_targets’s sequence elements are of the shape \(B \times L \times N\) and its target elements are of the shape \(B \times F\), where \(B\) is batch_size, \(L\) is the sequence length, \(N\) is the size of the sequence type’s alphabet, and \(F\) is the number of features. Further, targets_matrix is of the shape \(S \times F\), where \(S =\) n_samples.

Return type

tuple(list(tuple(numpy.ndarray, numpy.ndarray)), numpy.ndarray)

Raises

ValueError – If no test partition of the data was specified during sampler initialization.

get_validation_set(batch_size, n_samples=None)[source]

This method returns a subset of validation data from the sampler, divided into batches.

Parameters
  • batch_size (int) – The size of the batches to divide the data into.

  • n_samples (int or None, optional) – Default is None. The total number of validation examples to retrieve. If None, 32000 examples are retrieved.

Returns

sequences_and_targets, targets_matrix – Tuple containing the list of sequence-target pairs, as well as a single matrix with all targets in the same order. Note that sequences_and_targets’s sequence elements are of the shape \(B \times L \times N\) and its target elements are of the shape \(B \times F\), where \(B\) is batch_size, \(L\) is the sequence length, \(N\) is the size of the sequence type’s alphabet, and \(F\) is the number of features. Further, targets_matrix is of the shape \(S \times F\), where \(S =\) n_samples.

Return type

tuple(list(tuple(numpy.ndarray, numpy.ndarray)), numpy.ndarray)

save_dataset_to_file(mode, close_filehandle=False)[source]

Save samples for each partition (i.e. train/validate/test) to disk.

Parameters
  • mode (str) – Must be one of the modes specified in save_datasets during sampler initialization.

  • close_filehandle (bool, optional) – Default is False. close_filehandle=True assumes that all data corresponding to the input mode has been saved to file and save_dataset_to_file will not be called with mode again.

IntervalsSampler

class selene_sdk.samplers.IntervalsSampler(reference_sequence, target_path, features, intervals_path, sample_negative=False, seed=436, validation_holdout=['chr6', 'chr7'], test_holdout=['chr8', 'chr9'], sequence_length=1000, center_bin_to_predict=200, feature_thresholds=0.5, mode='train', save_datasets=['test'], output_dir=None)[source]

Bases: selene_sdk.samplers.online_sampler.OnlineSampler

Draws samples from pre-specified windows in the reference sequence.

Parameters
  • reference_sequence (selene_sdk.sequences.Sequence) – A reference sequence from which to create examples.

  • target_path (str) – Path to tabix-indexed, compressed BED file (*.bed.gz) of genomic coordinates mapped to the genomic features we want to predict.

  • features (list(str)) – List of distinct features that we aim to predict.

  • intervals_path (str) – The path to the file that contains the intervals to sample from. In this file, each interval should occur on a separate line.

  • sample_negative (bool, optional) – Default is False. This tells the sampler whether negative examples (i.e. with no positive labels) should be drawn when generating samples. If True, both negative and positive samples will be drawn. If False, only samples with at least one positive label will be drawn.

  • seed (int, optional) – Default is 436. Sets the random seed for sampling.

  • validation_holdout (list(str) or float, optional) – Default is [‘chr6’, ‘chr7’]. Holdout can be regional or proportional. If regional, expects a list (e.g. [‘X’, ‘Y’]). Regions must match those specified in the first column of the tabix-indexed BED file. If proportional, specify a proportion in the open interval (0.0, 1.0), typically 0.10 or 0.20.

  • test_holdout (list(str) or float, optional) – Default is [‘chr8’, ‘chr9’]. See documentation for validation_holdout for additional information.

  • sequence_length (int, optional) – Default is 1000. Model is trained on sequences of sequence_length where genomic features are annotated to the center regions of these sequences.

  • center_bin_to_predict (int, optional) – Default is 200. Query the tabix-indexed file for a region of length center_bin_to_predict.

  • feature_thresholds (float [0.0, 1.0] or None, optional) – Default is 0.5. The feature_threshold to pass to the GenomicFeatures object.

  • mode ({‘train’, ‘validate’, ‘test’}, optional) – Default is ‘train’. The mode to run the sampler in.

  • save_datasets (list(str), optional) – Default is [“test”]. The list of modes for which we should save the sampled data to file.

  • output_dir (str or None, optional) – Default is None. The path to the directory where we should save sampled examples for a mode. If save_datasets is a non-empty list, output_dir must be specified. If the path in output_dir does not exist it will be created automatically.

Variables
  • ~IntervalsSampler.reference_sequence (selene_sdk.sequences.Sequence) – The reference sequence that examples are created from.

  • ~IntervalsSampler.target (selene_sdk.targets.Target) – The selene_sdk.targets.Target object holding the features that we would like to predict.

  • ~IntervalsSampler.sample_from_intervals (list(tuple(str, int, int))) – A list of coordinates that specify the intervals we can draw samples from.

  • ~IntervalsSampler.interval_lengths (list(int)) – A list of the lengths of the intervals that we can draw samples from. The probability that we will draw a sample from an interval is a function of that interval’s length and the length of all other intervals.

  • ~IntervalsSampler.sample_negative (bool) – Whether negative examples (i.e. with no positive label) should be drawn when generating samples. If True, both negative and positive samples will be drawn. If False, only samples with at least one positive label will be drawn.

  • ~IntervalsSampler.validation_holdout (list(str) or float) – The samples to hold out for validating model performance. These can be “regional” or “proportional”. If regional, this is a list of region names (e.g. [‘chrX’, ‘chrY’]). These regions must match those specified in the first column of the tabix-indexed BED file. If proportional, this is the fraction of total samples that will be held out.

  • ~IntervalsSampler.test_holdout (list(str) or float) – The samples to hold out for testing model performance. See the documentation for validation_holdout for more details.

  • ~IntervalsSampler.sequence_length (int) – The length of the sequences to train the model on.

  • ~IntervalsSampler.bin_radius (int) – From the center of the sequence, the radius in which to detect a feature annotation in order to include it as a sample’s label.

  • ~IntervalsSampler.surrounding_sequence_radius (int) – The length of sequence falling outside of the feature detection bin (i.e. bin_radius) center, but still within the sequence_length.

  • ~IntervalsSampler.modes (list(str)) – The list of modes that the sampler can be run in.

  • ~IntervalsSampler.mode (str) – The current mode that the sampler is running in. Must be one of the modes listed in modes.
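
The relationship between sample_from_intervals and interval_lengths can be sketched as length-weighted sampling. This is an illustration under a stated assumption: the probability of choosing an interval is taken to be proportional to its length, which is one plausible reading of the description above. The interval coordinates are made up for the example.

```python
import numpy as np

# Hypothetical intervals as (chromosome, start, end) tuples.
sample_from_intervals = [
    ("chr1", 100, 600),
    ("chr1", 2000, 2100),
    ("chr2", 0, 400),
]
interval_lengths = [end - start for _, start, end in sample_from_intervals]

# Assumption: selection probability is proportional to interval length.
weights = np.array(interval_lengths) / sum(interval_lengths)

rng = np.random.default_rng(436)
index = rng.choice(len(sample_from_intervals), p=weights)
chrom, start, end = sample_from_intervals[index]

# Draw a position uniformly within the chosen interval (end exclusive).
position = int(rng.integers(start, end))
```

With these made-up intervals the lengths are 500, 100, and 400, so the first interval would be chosen about half of the time.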

sample(batch_size=1)[source]

Randomly draws a mini-batch of examples and their corresponding labels.

Parameters

batch_size (int, optional) – Default is 1. The number of examples to include in the mini-batch.

Returns

sequences, targets – A tuple containing the numeric representation of the sequence examples and their corresponding labels. The shape of sequences will be \(B \times L \times N\), where \(B\) is batch_size, \(L\) is the sequence length, and \(N\) is the size of the sequence type’s alphabet. The shape of targets will be \(B \times F\), where \(F\) is the number of features.

Return type

tuple(numpy.ndarray, numpy.ndarray)

RandomPositionsSampler

class selene_sdk.samplers.RandomPositionsSampler(reference_sequence, target_path, features, seed=436, validation_holdout=['chr6', 'chr7'], test_holdout=['chr8', 'chr9'], sequence_length=1000, center_bin_to_predict=200, feature_thresholds=0.5, mode='train', save_datasets=[], output_dir=None)[source]

Bases: selene_sdk.samplers.online_sampler.OnlineSampler

This sampler randomly selects a position in the genome and queries for a sequence centered at that position for input to the model.

TODO: generalize to selene_sdk.sequences.Sequence?

Parameters
  • reference_sequence (selene_sdk.sequences.Genome) – A reference sequence from which to create examples.

  • target_path (str) – Path to tabix-indexed, compressed BED file (*.bed.gz) of genomic coordinates mapped to the genomic features we want to predict.

  • features (list(str)) – List of distinct features that we aim to predict.

  • seed (int, optional) – Default is 436. Sets the random seed for sampling.

  • validation_holdout (list(str) or float, optional) – Default is [‘chr6’, ‘chr7’]. Holdout can be regional or proportional. If regional, expects a list (e.g. [‘chrX’, ‘chrY’]). Regions must match those specified in the first column of the tabix-indexed BED file. If proportional, specify a proportion in the open interval (0.0, 1.0), typically 0.10 or 0.20.

  • test_holdout (list(str) or float, optional) – Default is [‘chr8’, ‘chr9’]. See documentation for validation_holdout for additional information.

  • sequence_length (int, optional) – Default is 1000. Model is trained on sequences of sequence_length where genomic features are annotated to the center regions of these sequences.

  • center_bin_to_predict (int, optional) – Default is 200. Query the tabix-indexed file for a region of length center_bin_to_predict.

  • feature_thresholds (float [0.0, 1.0], optional) – Default is 0.5. The feature_threshold to pass to the GenomicFeatures object.

  • mode ({‘train’, ‘validate’, ‘test’}, optional) – Default is ‘train’. The mode to run the sampler in.

  • save_datasets (list(str), optional) – Default is [], the empty list. The list of modes for which we should save the sampled data to file.

  • output_dir (str or None, optional) – Default is None. The path to the directory where we should save sampled examples for a mode. If save_datasets is a non-empty list, output_dir must be specified. If the path in output_dir does not exist it will be created automatically.

Variables
  • ~RandomPositionsSampler.reference_sequence (selene_sdk.sequences.Genome) – The reference sequence that examples are created from.

  • ~RandomPositionsSampler.target (selene_sdk.targets.Target) – The selene_sdk.targets.Target object holding the features that we would like to predict.

  • ~RandomPositionsSampler.validation_holdout (list(str) or float) – The samples to hold out for validating model performance. These can be “regional” or “proportional”. If regional, this is a list of region names (e.g. [‘chrX’, ‘chrY’]). These regions must match those specified in the first column of the tabix-indexed BED file. If proportional, this is the fraction of total samples that will be held out.

  • ~RandomPositionsSampler.test_holdout (list(str) or float) – The samples to hold out for testing model performance. See the documentation for validation_holdout for more details.

  • ~RandomPositionsSampler.sequence_length (int) – The length of the sequences to train the model on.

  • ~RandomPositionsSampler.bin_radius (int) – From the center of the sequence, the radius in which to detect a feature annotation in order to include it as a sample’s label.

  • ~RandomPositionsSampler.surrounding_sequence_radius (int) – The length of sequence falling outside of the feature detection bin (i.e. bin_radius) center, but still within the sequence_length.

  • ~RandomPositionsSampler.modes (list(str)) – The list of modes that the sampler can be run in.

  • ~RandomPositionsSampler.mode (str) – The current mode that the sampler is running in. Must be one of the modes listed in modes.

sample(batch_size=1)[source]

Randomly draws a mini-batch of examples and their corresponding labels.

Parameters

batch_size (int, optional) – Default is 1. The number of examples to include in the mini-batch.

Returns

sequences, targets – A tuple containing the numeric representation of the sequence examples and their corresponding labels. The shape of sequences will be \(B \times L \times N\), where \(B\) is batch_size, \(L\) is the sequence length, and \(N\) is the size of the sequence type’s alphabet. The shape of targets will be \(B \times F\), where \(F\) is the number of features.

Return type

tuple(numpy.ndarray, numpy.ndarray)
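
The idea of sampling a random position and querying a window centered on it can be sketched as below. Everything here is illustrative: the chromosome lengths are made up, draw_window is a hypothetical helper, and weighting chromosomes by length and keeping the whole window inside the chromosome are assumptions rather than documented behavior.

```python
import numpy as np

# Hypothetical chromosome lengths for the illustration.
chrom_lengths = {"chr1": 5000, "chr2": 3000}
sequence_length = 1000
radius = sequence_length // 2


def draw_window(rng):
    """Hypothetical helper: pick a position and return its centered window."""
    chroms = list(chrom_lengths)
    total = sum(chrom_lengths.values())
    # Assumption: longer chromosomes are chosen proportionally more often.
    weights = [chrom_lengths[c] / total for c in chroms]
    chrom = chroms[rng.choice(len(chroms), p=weights)]
    # Restrict the center so the full window stays inside the chromosome.
    position = int(rng.integers(radius, chrom_lengths[chrom] - radius))
    return chrom, position - radius, position + radius


rng = np.random.default_rng(436)
chrom, start, end = draw_window(rng)
```

The returned window always spans sequence_length positions, which corresponds to the \(L\) dimension of the sequences array described above.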

MultiFileSampler

class selene_sdk.samplers.MultiFileSampler(train_sampler, validate_sampler, features, test_sampler=None, mode='train', save_datasets=[], output_dir=None)[source]

Bases: selene_sdk.samplers.sampler.Sampler

This sampler contains individual file samplers for each mode. The file samplers parse .bed/.mat files that correspond to training, validation, and testing and MultiFileSampler calls on the correct file sampler to draw samples for a given mode.

Variables
  • ~MultiFileSampler.train_sampler (selene_sdk.samplers.file_samplers.FileSampler) – Load your training data as a FileSampler before passing it into the MultiFileSampler constructor.

  • ~MultiFileSampler.validate_sampler (selene_sdk.samplers.file_samplers.FileSampler) – The validation dataset file sampler.

  • ~MultiFileSampler.features (list(str)) – The list of features the model should predict.

  • ~MultiFileSampler.test_sampler (None or selene_sdk.samplers.file_samplers.FileSampler, optional) – Default is None. The test file sampler is optional.

  • ~MultiFileSampler.mode (str or None) – Default is “train”. Must be one of {train, validate, test}. The starting mode in which to run the sampler.

  • ~MultiFileSampler.save_datasets (list(str), optional) – Default is [], the empty list. Currently, this parameter is included only so that MultiFileSampler is consistent with Sampler. The save dataset functionality for MultiFileSampler has not been defined yet.

  • ~MultiFileSampler.output_dir (str or None, optional) – Default is None. Used if the sampler has any data or logging statements to save to file. Currently not useful for MultiFileSampler.

  • ~MultiFileSampler.modes (list(str)) – A list of the names of the modes that the object may operate in.

  • ~MultiFileSampler.mode – Default is None. The current mode that the object is operating in.
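
The per-mode dispatch described for this class can be sketched with toy stand-ins. This is not selene_sdk code: ToyFileSampler and ToyMultiSampler are hypothetical classes that only mimic the idea of delegating sample calls to whichever file sampler matches the current mode.

```python
import numpy as np


class ToyFileSampler:
    """Hypothetical stand-in that serves rows from in-memory arrays."""

    def __init__(self, sequences, targets):
        self._sequences = sequences
        self._targets = targets
        self._cursor = 0

    def sample(self, batch_size=1):
        # Walk through the data in order, wrapping around at the end.
        n = len(self._sequences)
        idx = np.arange(self._cursor, self._cursor + batch_size) % n
        self._cursor = (self._cursor + batch_size) % n
        return self._sequences[idx], self._targets[idx]


class ToyMultiSampler:
    """Hypothetical dispatcher holding one file sampler per mode."""

    def __init__(self, train_sampler, validate_sampler,
                 test_sampler=None, mode="train"):
        self._samplers = {"train": train_sampler, "validate": validate_sampler}
        # The test mode exists only when a test sampler is supplied.
        if test_sampler is not None:
            self._samplers["test"] = test_sampler
        self.modes = list(self._samplers)
        self.set_mode(mode)

    def set_mode(self, mode):
        if mode not in self.modes:
            raise ValueError("Invalid mode: {0}".format(mode))
        self.mode = mode

    def sample(self, batch_size=1):
        # Delegate to the file sampler for the current mode.
        return self._samplers[self.mode].sample(batch_size)


train = ToyFileSampler(np.zeros((8, 10, 4)), np.zeros((8, 3)))
validate = ToyFileSampler(np.ones((4, 10, 4)), np.ones((4, 3)))
multi = ToyMultiSampler(train, validate)
multi.set_mode("validate")
sequences, targets = multi.sample(batch_size=2)
```

Because no test sampler was supplied in this usage, calling multi.set_mode("test") raises ValueError.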

get_data_and_targets(batch_size, n_samples, mode=None)[source]

This method fetches a subset of the data from the sampler, divided into batches. This method also allows the user to specify what operating mode to run the sampler in when fetching the data.

Parameters
  • batch_size (int) – The size of the batches to divide the data into.

  • n_samples (int) – The total number of samples to retrieve.

  • mode (str, optional) – Default is None. The operating mode that the sampler should run in. If None, will use the current self.mode.

get_feature_from_index(index)[source]

Returns the feature corresponding to an index in the feature vector.

Parameters

index (int) – The index of the feature to retrieve the name for.

Returns

The name of the feature occurring at the specified index.

Return type

str

get_test_set(batch_size, n_samples=None)[source]

This method returns a subset of testing data from the sampler, divided into batches.

Parameters
  • batch_size (int) – The size of the batches to divide the data into.

  • n_samples (int or None, optional) – Default is None. The total number of test examples to retrieve. If None, 640000 examples are retrieved.

Returns

sequences_and_targets, targets_matrix – Tuple containing the list of sequence-target pairs, as well as a single matrix with all targets in the same order. Note that sequences_and_targets’s sequence elements are of the shape \(B \times L \times N\) and its target elements are of the shape \(B \times F\), where \(B\) is batch_size, \(L\) is the sequence length, \(N\) is the size of the sequence type’s alphabet, and \(F\) is the number of features. Further, targets_matrix is of the shape \(S \times F\), where \(S =\) n_samples.

Return type

tuple(list(tuple(numpy.ndarray, numpy.ndarray)), numpy.ndarray)

Raises

ValueError – If no test partition of the data was specified during sampler initialization.

get_validation_set(batch_size, n_samples=None)[source]

This method returns a subset of validation data from the sampler, divided into batches.

Parameters
  • batch_size (int) – The size of the batches to divide the data into.

  • n_samples (int, optional) – Default is None. The total number of validation examples to retrieve. Handling for n_samples=None should be done by all classes that subclass selene_sdk.samplers.Sampler.

sample(batch_size=1)[source]

Fetches a mini-batch of the data from the sampler.

Parameters

batch_size (int, optional) – Default is 1. The size of the batch to retrieve.

save_dataset_to_file(mode, close_filehandle=False)[source]

This method is implemented in this class only because the TrainModel class calls it. In the future, we will likely remove this method or implement a different way of “saving the data” for file samplers. For example, we may output only the row numbers sampled so that users can reproduce the exact order in which the data was sampled.

Parameters
  • mode (str) – Must be one of the modes specified in save_datasets during sampler initialization.

  • close_filehandle (bool, optional) – Default is False. close_filehandle=True assumes that all data corresponding to the input mode has been saved to file and save_dataset_to_file will not be called with mode again.

set_mode(mode)[source]

Sets the sampling mode.

Parameters

mode (str) – The name of the mode to use. It must be one of Sampler.BASE_MODES (“train”, “validate”) or “test” if the test data is supplied.

Raises

ValueError – If mode is not a valid mode.