# selene_sdk.samplers

This module provides classes and methods for sampling labeled data examples.

## Sampler

class selene_sdk.samplers.Sampler(features, save_datasets=[], output_dir=None)

Bases: object

The base class for samplers. It currently enforces that all samplers have modes for drawing training and validation samples to train a model.

Parameters
• features (list(str)) – The list of features (classes) the model predicts.

• save_datasets (list(str), optional) – Default is [], the empty list. The list of modes for which we should save sampled data to file (one or more of [‘train’, ‘validate’, ‘test’]).

• output_dir (str or None, optional) – Default is None. Path to the output directory. Used if we save any of the data sampled. If save_datasets is non-empty, output_dir must be a valid path. If the directory does not yet exist, it will be created for you.

Variables
• modes (list(str)) – A list of the names of the modes that the object may operate in.

• mode (str or None) – The current mode that the object is operating in.

BASE_MODES = ('train', 'validate')

The types of modes that the Sampler object can run in.

abstract get_data_and_targets(batch_size, n_samples, mode=None)

This method fetches a subset of the data from the sampler, divided into batches. This method also allows the user to specify what operating mode to run the sampler in when fetching the data.

Parameters
• batch_size (int) – The size of the batches to divide the data into.

• n_samples (int) – The total number of samples to retrieve.

• mode (str, optional) – Default is None. The operating mode that the object should run in. If None, will use the current mode self.mode.

abstract get_feature_from_index(index)

Returns the feature corresponding to an index in the feature vector.

Parameters

index (int) – The index of the feature to retrieve the name for.

Returns

The name of the feature occurring at the specified index.

Return type

str

abstract get_test_set(batch_size, n_samples=None)

This method returns a subset of testing data from the sampler, divided into batches.

Parameters
• batch_size (int) – The size of the batches to divide the data into.

• n_samples (int or None, optional) – Default is None. The total number of test examples to retrieve. Handling for n_samples=None should be done by all classes that subclass selene_sdk.samplers.Sampler.

Returns

sequences_and_targets, targets_matrix – Tuple containing the list of sequence-target pairs, as well as a single matrix with all targets in the same order. Note that sequences_and_targets’s sequence elements are of the shape $$B \times L \times N$$ and its target elements are of the shape $$B \times F$$, where $$B$$ is batch_size, $$L$$ is the sequence length, $$N$$ is the size of the sequence type’s alphabet, and $$F$$ is the number of features. Further, targets_matrix is of the shape $$S \times F$$, where $$S =$$ n_samples.

Return type

tuple(list(tuple(numpy.ndarray, numpy.ndarray)), numpy.ndarray)

Raises

ValueError – If no test partition of the data was specified during sampler initialization.
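The return layout described above can be illustrated with a minimal, standard-library sketch. It is not Selene's implementation: `batch_examples` is a hypothetical helper, nested lists stand in for the NumPy arrays that real samplers return, and the final batch is allowed to be smaller than `batch_size`, which is an assumption of this sketch.

```python
def batch_examples(sequences, targets, batch_size):
    """Group S examples into batches of at most `batch_size`.

    `sequences` holds S items of shape L x N and `targets` holds S rows of
    length F. Returns (sequences_and_targets, targets_matrix).
    """
    if len(sequences) != len(targets):
        raise ValueError("sequences and targets must have the same length")
    sequences_and_targets = []
    for start in range(0, len(sequences), batch_size):
        end = start + batch_size
        sequences_and_targets.append((sequences[start:end], targets[start:end]))
    # All target rows, in the same order as the batches: S x F.
    targets_matrix = [row for _, batch in sequences_and_targets for row in batch]
    return sequences_and_targets, targets_matrix


# Toy data: S=5 examples, L=3 positions, N=4 letters, F=2 features.
seqs = [[[1, 0, 0, 0]] * 3 for _ in range(5)]
tgts = [[i % 2, 1] for i in range(5)]
batches, matrix = batch_examples(seqs, tgts, batch_size=2)
```

Each element of `batches` pairs a sequence batch with its target batch, while `matrix` keeps every target row in a single structure so that metrics can be computed over the whole set at once.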

abstract get_validation_set(batch_size, n_samples=None)

This method returns a subset of validation data from the sampler, divided into batches.

Parameters
• batch_size (int) – The size of the batches to divide the data into.

• n_samples (int or None, optional) – Default is None. The total number of validation examples to retrieve. Handling for n_samples=None should be done by all classes that subclass selene_sdk.samplers.Sampler.

abstract sample(batch_size=1, mode=None)

Fetches a mini-batch of the data from the sampler.

Parameters
• batch_size (int, optional) – Default is 1. The size of the batch to retrieve.

• mode (str, optional) – Default is None. The operating mode that the object should run in. If None, will use the current mode self.mode.

abstract save_dataset_to_file(mode, close_filehandle=False)

Save samples for each partition (i.e. train/validate/test) to disk.

Parameters
• mode (str) – Must be one of the modes specified in save_datasets during sampler initialization.

• close_filehandle (bool, optional) – Default is False. close_filehandle=True assumes that all data corresponding to the input mode has been saved to file and save_dataset_to_file will not be called with mode again.
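One way to read the `close_filehandle` contract is that a file handle stays open across repeated saves for a mode and is closed on the final call. The sketch below illustrates that reading with a hypothetical `DatasetWriter` class; the file name pattern, the `record` method, and the tab-separated line format are all assumptions for illustration, not Selene's on-disk format.

```python
import os
import tempfile


class DatasetWriter:
    """Hypothetical helper: one open file handle per saved mode."""

    def __init__(self, save_datasets, output_dir=None):
        if save_datasets and output_dir is None:
            raise ValueError("output_dir is required when save_datasets is non-empty")
        self._buffers = {mode: [] for mode in save_datasets}
        self._handles = {}
        if save_datasets:
            # Create the output directory if it does not exist yet.
            os.makedirs(output_dir, exist_ok=True)
            for mode in save_datasets:
                path = os.path.join(output_dir, f"{mode}_data.txt")  # assumed name
                self._handles[mode] = open(path, "w")

    def record(self, mode, line):
        """Buffer one sampled example for a mode that is being saved."""
        if mode in self._buffers:
            self._buffers[mode].append(line)

    def save_dataset_to_file(self, mode, close_filehandle=False):
        if mode not in self._handles:
            raise ValueError(f"no open file for mode {mode!r}")
        handle = self._handles[mode]
        handle.writelines(line + "\n" for line in self._buffers[mode])
        self._buffers[mode] = []
        if close_filehandle:
            # The caller promises not to save this mode again.
            handle.close()
            del self._handles[mode]


out_dir = os.path.join(tempfile.mkdtemp(), "sampled")
writer = DatasetWriter(["validate"], out_dir)
writer.record("validate", "chr6\t100\t1100")
writer.save_dataset_to_file("validate", close_filehandle=True)
```

Keeping the handle open between calls avoids reopening the file for every batch, at the cost of requiring the caller to signal the last save explicitly.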

set_mode(mode)

Sets the sampling mode.

Parameters

mode (str) – The name of the mode to use. It must be one of Sampler.BASE_MODES.

Raises

ValueError – If mode is not a valid mode.
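The mode bookkeeping described for `Sampler` can be sketched without Selene installed. `ToySampler` below is a hypothetical stand-in that mimics only `BASE_MODES`, `modes`, `mode`, and `set_mode`; the `has_test_partition` flag is an assumption used to show how a "test" mode might be added.

```python
class ToySampler:
    """Stand-in that mimics only the mode bookkeeping described above."""

    BASE_MODES = ("train", "validate")

    def __init__(self, has_test_partition=False):
        self.modes = list(self.BASE_MODES)
        if has_test_partition:
            # Assumption: "test" becomes available only when a test set exists.
            self.modes.append("test")
        self.mode = None

    def set_mode(self, mode):
        if mode not in self.modes:
            raise ValueError(f"mode must be one of {self.modes}, got {mode!r}")
        self.mode = mode


sampler = ToySampler()
sampler.set_mode("validate")
```

Calling `sampler.set_mode("test")` on this instance raises `ValueError`, because no test partition was declared.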

## OnlineSampler

class selene_sdk.samplers.OnlineSampler(reference_sequence, target_path, features, seed=436, validation_holdout=['chr6', 'chr7'], test_holdout=['chr8', 'chr9'], sequence_length=1000, center_bin_to_predict=200, feature_thresholds=0.5, mode='train', save_datasets=[], output_dir=None)

A sampler in which training/validation/test data is constructed from random sampling of the dataset for each batch passed to the model. This form of sampling may alleviate the problem of loading an extremely large dataset into memory when developing a new model.

Parameters
• reference_sequence (selene_sdk.sequences.Sequence) – A reference sequence from which to create examples.

• target_path (str) – Path to tabix-indexed, compressed BED file (*.bed.gz) of genomic coordinates mapped to the genomic features we want to predict.

• features (list(str)) – List of distinct features that we aim to predict.

• seed (int, optional) – Default is 436. Sets the random seed for sampling.

• validation_holdout (list(str) or float, optional) – Default is [‘chr6’, ‘chr7’]. Holdout can be regional or proportional. If regional, expects a list (e.g. [‘X’, ‘Y’]). Regions must match those specified in the first column of the tabix-indexed BED file. If proportional, specify a fraction strictly between 0.0 and 1.0, typically 0.10 or 0.20.

• test_holdout (list(str) or float, optional) – Default is [‘chr8’, ‘chr9’]. See documentation for validation_holdout for additional information.

• sequence_length (int, optional) – Default is 1000. Model is trained on sequences of sequence_length where genomic features are annotated to the center regions of these sequences.

• center_bin_to_predict (int, optional) – Default is 200. Query the tabix-indexed file for a region of length center_bin_to_predict.

• feature_thresholds (float [0.0, 1.0], optional) – Default is 0.5. The feature_threshold to pass to the GenomicFeatures object.

• mode ({‘train’, ‘validate’, ‘test’}, optional) – Default is ‘train’. The mode to run the sampler in.

• save_datasets (list(str), optional) – Default is [], the empty list. The list of modes for which we should save the sampled data to file (e.g. [“test”, “validate”]).

• output_dir (str or None, optional) – Default is None. The path to the directory where we should save sampled examples for a mode. If save_datasets is a non-empty list, output_dir must be specified. If the path in output_dir does not exist it will be created automatically.
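The two holdout styles described for `validation_holdout` and `test_holdout` can be sketched as follows. Both helpers are hypothetical and only illustrate the idea: `assign_mode` covers regional holdouts by chromosome name, and `proportional_split` covers proportional holdouts by computing partition sizes. How Selene actually assigns individual examples to partitions is not shown here.

```python
def assign_mode(chrom, validation_holdout, test_holdout):
    """Regional holdout: map a region name to the partition that owns it."""
    if chrom in validation_holdout:
        return "validate"
    if chrom in test_holdout:
        return "test"
    return "train"


def proportional_split(n_examples, validation_holdout, test_holdout):
    """Proportional holdout: return (n_train, n_validate, n_test) counts."""
    for fraction in (validation_holdout, test_holdout):
        if not 0.0 < fraction < 1.0:
            raise ValueError("holdout fractions must lie strictly between 0 and 1")
    n_validate = int(n_examples * validation_holdout)
    n_test = int(n_examples * test_holdout)
    return n_examples - n_validate - n_test, n_validate, n_test


mode_for_chr6 = assign_mode("chr6", ["chr6", "chr7"], ["chr8", "chr9"])
counts = proportional_split(1000, 0.25, 0.25)
```

Regional holdouts keep whole chromosomes out of training, which avoids leakage between neighbouring genomic positions; proportional holdouts are simpler to configure but do not give that guarantee.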

Variables
• reference_sequence (selene_sdk.sequences.Sequence) – The reference sequence that examples are created from.

• target (selene_sdk.targets.Target) – The selene_sdk.targets.Target object holding the features that we would like to predict.

• validation_holdout (list(str) or float) – The samples to hold out for validating model performance. These can be “regional” or “proportional”. If regional, this is a list of region names (e.g. [‘chrX’, ‘chrY’]). These regions must match those specified in the first column of the tabix-indexed BED file. If proportional, this is the fraction of total samples that will be held out.

• test_holdout (list(str) or float) – The samples to hold out for testing model performance. See the documentation for validation_holdout for more details.

• sequence_length (int) – The length of the sequences to train the model on.

• modes (list(str)) – The list of modes that the sampler can be run in.

• mode (str) – The current mode that the sampler is running in. Must be one of the modes listed in modes.

Raises
• ValueError – If mode is not a valid mode.

• ValueError – If the parities of sequence_length and center_bin_to_predict are not the same.

• ValueError – If sequence_length is smaller than center_bin_to_predict.

• ValueError – If the types of validation_holdout and test_holdout are not the same.
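The argument checks listed above can be expressed as a short sketch. `check_sampler_args` is a hypothetical function, not part of Selene; it covers the parity, length, and holdout-type conditions (the mode check is omitted) and, as an added illustration, returns the number of flanking positions on each side of the center bin.

```python
def check_sampler_args(sequence_length, center_bin_to_predict,
                       validation_holdout, test_holdout):
    """Raise ValueError for the invalid argument combinations listed above."""
    if (sequence_length + center_bin_to_predict) % 2 != 0:
        raise ValueError(
            "sequence_length and center_bin_to_predict must have the same parity")
    if sequence_length < center_bin_to_predict:
        raise ValueError(
            "sequence_length must be at least center_bin_to_predict")
    if type(validation_holdout) is not type(test_holdout):
        raise ValueError(
            "validation_holdout and test_holdout must have the same type")
    # With equal parities the flanking context splits evenly on both sides.
    return (sequence_length - center_bin_to_predict) // 2


flank = check_sampler_args(1000, 200, ["chr6", "chr7"], ["chr8", "chr9"])
```

The parity requirement follows from the centering: the bin can sit exactly in the middle of the sequence only if the leftover length is even.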

STRAND_SIDES = ('+', '-')

Defines the strands that features can be sampled from.

get_data_and_targets(batch_size, n_samples=None, mode=None)

This method fetches a subset of the data from the sampler, divided into batches. This method also allows the user to specify what operating mode to run the sampler in when fetching the data.

Parameters
• batch_size (int) – The size of the batches to divide the data into.

• n_samples (int or None, optional) – Default is None. The total number of samples to retrieve. If n_samples is None, it is set to 32000 in validate mode and to 640000 in test mode. In train mode, a value for n_samples must be specified.

• mode (str, optional) – Default is None. The mode to run the sampler in when fetching the samples. See selene_sdk.samplers.IntervalsSampler.modes for more information. If None, will use the current mode self.mode.
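The default handling for `n_samples` described above can be summarised in a small sketch. `resolve_n_samples` and `DEFAULT_N_SAMPLES` are hypothetical names, and raising `ValueError` when `n_samples` is missing in train mode is an assumption; the documentation only states that a value must be specified.

```python
DEFAULT_N_SAMPLES = {"validate": 32000, "test": 640000}


def resolve_n_samples(mode, n_samples=None):
    """Return the number of samples to draw for `mode`."""
    if n_samples is not None:
        return n_samples
    if mode not in DEFAULT_N_SAMPLES:
        # Assumption: a missing value in train mode is reported as an error.
        raise ValueError(f"n_samples must be specified when mode is {mode!r}")
    return DEFAULT_N_SAMPLES[mode]


n_validate = resolve_n_samples("validate")
n_test = resolve_n_samples("test")
n_train = resolve_n_samples("train", n_samples=128)
```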

Returns

sequences_and_targets, targets_matrix – Tuple containing the list of sequence-target pairs, as well as a single matrix with all targets in the same order. Note that sequences_and_targets’s sequence elements are of the shape $$B \times L \times N$$ and its target elements are of the shape $$B \times F$$, where $$B$$ is batch_size, $$L$$ is the sequence length, $$N$$ is the size of the sequence type’s alphabet, and $$F$$ is the number of features. Further, targets_matrix is of the shape $$S \times F$$, where $$S =$$ n_samples.

Return type

tuple(list(tuple(numpy.ndarray, numpy.ndarray)), numpy.ndarray)

get_dataset_in_batches(mode, batch_size, n_samples=None)

This method returns a subset of the data for a specified run mode, divided into mini-batches.

Parameters
• mode ({‘test’, ‘validate’}) – The mode to run the sampler in when fetching the samples. See selene_sdk.samplers.IntervalsSampler.modes for more information.

• batch_size (int) – The size of the batches to divide the data into.

• n_samples (int or None, optional) – Default is None. The total number of samples to retrieve. If None, it will retrieve 32000 samples if mode is validate or 640000 samples if mode is test or train.

Returns

sequences_and_targets, targets_matrix – Tuple containing the list of sequence-target pairs, as well as a single matrix with all targets in the same order. The list is length $$S$$, where $$S =$$ n_samples. Note that sequences_and_targets’s sequence elements are of the shape $$B \times L \times N$$ and its target elements are of the shape $$B \times F$$, where $$B$$ is batch_size, $$L$$ is the sequence length, $$N$$ is the size of the sequence type’s alphabet, and $$F$$ is the number of features. Further, targets_matrix is of the shape $$S \times F$$.

Return type

tuple(list(tuple(numpy.ndarray, numpy.ndarray)), numpy.ndarray)

get_feature_from_index(index)

Returns the feature corresponding to an index in the feature vector.

Parameters

index (int) – The index of the feature to retrieve the name for.

Returns

The name of the feature occurring at the specified index.

Return type

str

get_sequence_from_encoding(encoding)

Gets the string sequence from the one-hot encoding of the sequence.

Parameters

encoding (numpy.ndarray) – An $$L \times N$$ array (where $$L$$ is the length of the sequence and $$N$$ is the size of the sequence type’s alphabet) containing the one-hot encoding of the sequence.

Returns

The sequence of $$L$$ characters decoded from the input.

Return type

str
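Decoding a one-hot matrix back to a string can be sketched in plain Python. `decode_one_hot` is a hypothetical function: the alphabet order "ACGT" and the choice to emit "N" for any row that is not exactly one-hot are assumptions of this sketch, and nested lists stand in for the `numpy.ndarray` that `get_sequence_from_encoding` accepts.

```python
def decode_one_hot(encoding, alphabet="ACGT", unknown="N"):
    """Decode an L x N one-hot matrix (list of rows) into a string of length L."""
    one_hot_pattern = [0] * (len(alphabet) - 1) + [1]
    letters = []
    for row in encoding:
        if len(row) != len(alphabet):
            raise ValueError("each row must have one entry per alphabet letter")
        if sorted(row) == one_hot_pattern:
            # Exactly one position is set: look up its letter.
            letters.append(alphabet[list(row).index(1)])
        else:
            # Assumption: anything else (e.g. uniform 0.25) is an unknown base.
            letters.append(unknown)
    return "".join(letters)


decoded = decode_one_hot([
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [0.25, 0.25, 0.25, 0.25],
    [0, 0, 0, 1],
])
```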

get_test_set(batch_size, n_samples=None)

This method returns a subset of testing data from the sampler, divided into batches.

Parameters
• batch_size (int) – The size of the batches to divide the data into.

• n_samples (int or None, optional) – Default is None. The total number of test examples to retrieve. If None, 640000 examples are retrieved.

Returns

sequences_and_targets, targets_matrix – Tuple containing the list of sequence-target pairs, as well as a single matrix with all targets in the same order. Note that sequences_and_targets’s sequence elements are of the shape $$B \times L \times N$$ and its target elements are of the shape $$B \times F$$, where $$B$$ is batch_size, $$L$$ is the sequence length, $$N$$ is the size of the sequence type’s alphabet, and $$F$$ is the number of features. Further, targets_matrix is of the shape $$S \times F$$, where $$S =$$ n_samples.

Return type

tuple(list(tuple(numpy.ndarray, numpy.ndarray)), numpy.ndarray)

Raises

ValueError – If no test partition of the data was specified during sampler initialization.

get_validation_set(batch_size, n_samples=None)

This method returns a subset of validation data from the sampler, divided into batches.

Parameters
• batch_size (int) – The size of the batches to divide the data into.

• n_samples (int or None, optional) – Default is None. The total number of validation examples to retrieve. If None, 32000 examples are retrieved.

Returns

sequences_and_targets, targets_matrix – Tuple containing the list of sequence-target pairs, as well as a single matrix with all targets in the same order. Note that sequences_and_targets’s sequence elements are of the shape $$B \times L \times N$$ and its target elements are of the shape $$B \times F$$, where $$B$$ is batch_size, $$L$$ is the sequence length, $$N$$ is the size of the sequence type’s alphabet, and $$F$$ is the number of features. Further, targets_matrix is of the shape $$S \times F$$, where $$S =$$ n_samples.

Return type

tuple(list(tuple(numpy.ndarray, numpy.ndarray)), numpy.ndarray)

save_dataset_to_file(mode, close_filehandle=False)

Save samples for each partition (i.e. train/validate/test) to disk.

Parameters
• mode (str) – Must be one of the modes specified in save_datasets during sampler initialization.

• close_filehandle (bool, optional) – Default is False. close_filehandle=True assumes that all data corresponding to the input mode has been saved to file and save_dataset_to_file will not be called with mode again.

## IntervalsSampler

class selene_sdk.samplers.IntervalsSampler(reference_sequence, target_path, features, intervals_path, sample_negative=False, seed=436, validation_holdout=['chr6', 'chr7'], test_holdout=['chr8', 'chr9'], sequence_length=1000, center_bin_to_predict=200, feature_thresholds=0.5, mode='train', save_datasets=['test'], output_dir=None)

Draws samples from pre-specified windows in the reference sequence.
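The documentation for `interval_lengths` states that the chance of drawing from an interval depends on its length relative to the others. A minimal sketch of one such scheme is shown here, assuming the probability is directly proportional to length and that intervals are half-open `(chrom, start, end)` tuples; `draw_position` is a hypothetical helper, not Selene's sampling code.

```python
import random


def draw_position(intervals, rng):
    """Pick an interval with probability proportional to its length,
    then pick a position uniformly inside it."""
    lengths = [end - start for _, start, end in intervals]
    chrom, start, end = rng.choices(intervals, weights=lengths, k=1)[0]
    return chrom, rng.randrange(start, end)


intervals = [("chr1", 0, 1000), ("chr2", 500, 600)]
rng = random.Random(436)  # seeded for reproducibility
draws = [draw_position(intervals, rng) for _ in range(1000)]
```

Weighting by length makes every position inside the intervals roughly equally likely to be chosen, rather than favouring positions that happen to fall in short intervals.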

Parameters
• reference_sequence (selene_sdk.sequences.Sequence) – A reference sequence from which to create examples.

• target_path (str) – Path to tabix-indexed, compressed BED file (*.bed.gz) of genomic coordinates mapped to the genomic features we want to predict.

• features (list(str)) – List of distinct features that we aim to predict.

• intervals_path (str) – The path to the file that contains the intervals to sample from. In this file, each interval should occur on a separate line.

• sample_negative (bool, optional) – Default is False. This tells the sampler whether negative examples (i.e. with no positive labels) should be drawn when generating samples. If True, both negative and positive samples will be drawn. If False, only samples with at least one positive label will be drawn.

• seed (int, optional) – Default is 436. Sets the random seed for sampling.

• validation_holdout (list(str) or float, optional) – Default is [‘chr6’, ‘chr7’]. Holdout can be regional or proportional. If regional, expects a list (e.g. [‘X’, ‘Y’]). Regions must match those specified in the first column of the tabix-indexed BED file. If proportional, specify a fraction strictly between 0.0 and 1.0, typically 0.10 or 0.20.

• test_holdout (list(str) or float, optional) – Default is [‘chr8’, ‘chr9’]. See documentation for validation_holdout for additional information.

• sequence_length (int, optional) – Default is 1000. Model is trained on sequences of sequence_length where genomic features are annotated to the center regions of these sequences.

• center_bin_to_predict (int, optional) – Default is 200. Query the tabix-indexed file for a region of length center_bin_to_predict.

• feature_thresholds (float [0.0, 1.0] or None, optional) – Default is 0.5. The feature_threshold to pass to the GenomicFeatures object.

• mode ({‘train’, ‘validate’, ‘test’}, optional) – Default is ‘train’. The mode to run the sampler in.

• save_datasets (list(str), optional) – Default is [“test”]. The list of modes for which we should save the sampled data to file.

• output_dir (str or None, optional) – Default is None. The path to the directory where we should save sampled examples for a mode. If save_datasets is a non-empty list, output_dir must be specified. If the path in output_dir does not exist it will be created automatically.

Variables
• reference_sequence (selene_sdk.sequences.Sequence) – The reference sequence that examples are created from.

• target (selene_sdk.targets.Target) – The selene_sdk.targets.Target object holding the features that we would like to predict.

• sample_from_intervals (list(tuple(str, int, int))) – A list of coordinates that specify the intervals we can draw samples from.

• interval_lengths (list(int)) – A list of the lengths of the intervals that we can draw samples from. The probability that we will draw a sample from an interval is a function of that interval’s length and the length of all other intervals.

• sample_negative (bool) – Whether negative examples (i.e. with no positive label) should be drawn when generating samples. If True, both negative and positive samples will be drawn. If False, only samples with at least one positive label will be drawn.

• validation_holdout (list(str) or float) – The samples to hold out for validating model performance. These can be “regional” or “proportional”. If regional, this is a list of region names (e.g. [‘chrX’, ‘chrY’]). These regions must match those specified in the first column of the tabix-indexed BED file. If proportional, this is the fraction of total samples that will be held out.

• test_holdout (list(str) or float) – The samples to hold out for testing model performance. See the documentation for validation_holdout for more details.

• sequence_length (int) – The length of the sequences to train the model on.

• modes (list(str)) – The list of modes that the sampler can be run in.

• mode (str) – The current mode that the sampler is running in. Must be one of the modes listed in modes.

sample(batch_size=1, mode=None)

Randomly draws a mini-batch of examples and their corresponding labels.

Parameters
• batch_size (int, optional) – Default is 1. The number of examples to include in the mini-batch.

• mode (str, optional) – Default is None. The operating mode that the object should run in. If None, will use the current mode self.mode.

Returns

sequences, targets – A tuple containing the numeric representation of the sequence examples and their corresponding labels. The shape of sequences will be $$B \times L \times N$$, where $$B$$ is batch_size, $$L$$ is the sequence length, and $$N$$ is the size of the sequence type’s alphabet. The shape of targets will be $$B \times F$$, where $$F$$ is the number of features.

Return type

tuple(numpy.ndarray, numpy.ndarray)

## RandomPositionsSampler

class selene_sdk.samplers.RandomPositionsSampler(reference_sequence, target_path, features, seed=436, validation_holdout=['chr6', 'chr7'], test_holdout=['chr8', 'chr9'], sequence_length=1000, center_bin_to_predict=200, feature_thresholds=0.5, mode='train', save_datasets=[], output_dir=None)

This sampler randomly selects a position in the genome and queries for a sequence centered at that position for input to the model.
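The relationship between a sampled position, the input sequence window, and the center bin used to query features can be sketched with simple arithmetic. `windows_for_position` is a hypothetical helper that assumes half-open `(start, end)` coordinates and integer halving; Selene's exact coordinate conventions may differ.

```python
def windows_for_position(position, sequence_length=1000, center_bin_to_predict=200):
    """Return (sequence_window, center_bin) as half-open (start, end) pairs,
    both centered on `position`."""
    seq_start = position - sequence_length // 2
    seq_end = seq_start + sequence_length
    bin_start = position - center_bin_to_predict // 2
    bin_end = bin_start + center_bin_to_predict
    return (seq_start, seq_end), (bin_start, bin_end)


sequence_window, center_bin = windows_for_position(5000)
```

A real sampler also has to reject positions closer than half a sequence length to either end of a chromosome, since the window would otherwise run past the available sequence.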

TODO: generalize to selene_sdk.sequences.Sequence?

Parameters
• reference_sequence (selene_sdk.sequences.Genome) – A reference sequence from which to create examples.

• target_path (str) – Path to tabix-indexed, compressed BED file (*.bed.gz) of genomic coordinates mapped to the genomic features we want to predict.

• features (list(str)) – List of distinct features that we aim to predict.

• seed (int, optional) – Default is 436. Sets the random seed for sampling.

• validation_holdout (list(str) or float, optional) – Default is [‘chr6’, ‘chr7’]. Holdout can be regional or proportional. If regional, expects a list (e.g. [‘chrX’, ‘chrY’]). Regions must match those specified in the first column of the tabix-indexed BED file. If proportional, specify a fraction strictly between 0.0 and 1.0, typically 0.10 or 0.20.

• test_holdout (list(str) or float, optional) – Default is [‘chr8’, ‘chr9’]. See documentation for validation_holdout for additional information.

• sequence_length (int, optional) – Default is 1000. Model is trained on sequences of sequence_length where genomic features are annotated to the center regions of these sequences.

• center_bin_to_predict (int, optional) – Default is 200. Query the tabix-indexed file for a region of length center_bin_to_predict.

• feature_thresholds (float [0.0, 1.0], optional) – Default is 0.5. The feature_threshold to pass to the GenomicFeatures object.

• mode ({‘train’, ‘validate’, ‘test’}, optional) – Default is ‘train’. The mode to run the sampler in.

• save_datasets (list(str), optional) – Default is [], the empty list. The list of modes for which we should save the sampled data to file.

• output_dir (str or None, optional) – Default is None. The path to the directory where we should save sampled examples for a mode. If save_datasets is a non-empty list, output_dir must be specified. If the path in output_dir does not exist it will be created automatically.

Variables
• reference_sequence (selene_sdk.sequences.Genome) – The reference sequence that examples are created from.

• target (selene_sdk.targets.Target) – The selene_sdk.targets.Target object holding the features that we would like to predict.

• validation_holdout (list(str) or float) – The samples to hold out for validating model performance. These can be “regional” or “proportional”. If regional, this is a list of region names (e.g. [‘chrX’, ‘chrY’]). These regions must match those specified in the first column of the tabix-indexed BED file. If proportional, this is the fraction of total samples that will be held out.

• test_holdout (list(str) or float) – The samples to hold out for testing model performance. See the documentation for validation_holdout for more details.

• sequence_length (int) – The length of the sequences to train the model on.

• modes (list(str)) – The list of modes that the sampler can be run in.

• mode (str) – The current mode that the sampler is running in. Must be one of the modes listed in modes.

sample(batch_size=1, mode=None)

Randomly draws a mini-batch of examples and their corresponding labels.

Parameters
• batch_size (int, optional) – Default is 1. The number of examples to include in the mini-batch.

• mode (str, optional) – Default is None. The operating mode that the object should run in. If None, will use the current mode self.mode.

Returns

sequences, targets – A tuple containing the numeric representation of the sequence examples and their corresponding labels. The shape of sequences will be $$B \times L \times N$$, where $$B$$ is batch_size, $$L$$ is the sequence length, and $$N$$ is the size of the sequence type’s alphabet. The shape of targets will be $$B \times F$$, where $$F$$ is the number of features.

Return type

tuple(numpy.ndarray, numpy.ndarray)

## MultiFileSampler

class selene_sdk.samplers.MultiFileSampler(*args, **kwargs)

MultiFileSampler is deprecated and will be removed in a future version of Selene. Please use MultiSampler instead. It is kept only for backward compatibility with existing code that uses MultiFileSampler. Refer to the MultiSampler documentation for usage.