selene_sdk.samplers.file_samplers

This module provides classes and methods for sampling labeled data examples from files.

BedFileSampler

class selene_sdk.samplers.file_samplers.BedFileSampler(filepath, reference_sequence, n_samples, sequence_length=None, targets_avail=False, n_features=None)

Bases: selene_sdk.samplers.file_samplers.file_sampler.FileSampler

A sampler for which the dataset is loaded directly from a *.bed file.

Parameters
  • filepath (str) – The path to the file to load the data from.

  • reference_sequence (selene_sdk.sequences.Sequence) – A reference sequence from which to create examples.

  • n_samples (int) – Number of lines in the file, e.g. as reported by wc -l <filepath>.

  • sequence_length (int or None, optional) – Default is None. If the coordinates of each sample in the BED file already account for the full sequence (that is, end - start = sequence_length), there is no need to specify the sequence length. If sequence_length is not None, the length of each sample will be checked to determine whether the sample coordinates need to be truncated or expanded to reach the sequence length specified in the model architecture.

  • targets_avail (bool, optional) – Default is False. If True, the targets are assumed to be in the last column of the *.bed file. That column should contain the indices, separated by semicolons, of the features (classes) found within a given sample’s coordinates (e.g. 0;1;45;60). This assumes that only the presence or absence of each feature within the interval is of interest.

  • n_features (int or None, optional) – Default is None. If targets_avail is True, must specify n_features, the total number of features (classes).
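The coordinate adjustment described for sequence_length can be sketched as follows. This is a minimal illustration, not the library's implementation: the helper name resize_interval is hypothetical, and keeping the interval centered when expanding or truncating is an assumption.

```python
def resize_interval(start, end, sequence_length=None):
    """Return (start, end) adjusted so that end - start == sequence_length.

    Hypothetical helper; assumes the interval stays centered.
    No bounds checking is done, so start may become negative.
    """
    if sequence_length is None:
        # Coordinates already describe the full sequence.
        return start, end
    length = end - start
    if length == sequence_length:
        return start, end
    if length < sequence_length:
        # Expand: split the missing bases between both sides.
        missing = sequence_length - length
        left = missing // 2
        right = missing - left
        return start - left, end + right
    # Truncate: keep the central window of the requested length.
    new_start = start + (length - sequence_length) // 2
    return new_start, new_start + sequence_length


print(resize_interval(100, 200, 150))  # expanded interval
print(resize_interval(100, 200, 51))   # truncated interval
```

A real sampler would also need to handle intervals that extend past the ends of a chromosome, which this sketch ignores.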

Variables
  • ~BedFileSampler.filepath (str) – The path to the file to load the data from.

  • ~BedFileSampler.reference_sequence (selene_sdk.sequences.Sequence) – A reference sequence from which to create examples.

  • ~BedFileSampler.n_samples (int) – Number of lines in the file, e.g. as reported by wc -l <filepath>.

  • ~BedFileSampler.sequence_length (int or None, optional) – Default is None. If the coordinates of each sample in the BED file already account for the full sequence (that is, end - start = sequence_length), there is no need to specify the sequence length. If sequence_length is not None, the length of each sample will be checked to determine whether the sample coordinates need to be truncated or expanded to reach the sequence length specified in the model architecture.

  • ~BedFileSampler.targets_avail (bool) – If True, the targets are assumed to be in the last column of the *.bed file. That column should contain the indices, separated by semicolons, of the features (classes) found within a given sample’s coordinates (e.g. 0;1;45;60). This assumes that only the presence or absence of each feature within the interval is of interest.

  • ~BedFileSampler.n_features (int or None) – If targets_avail is True, must specify n_features, the total number of features (classes).
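To illustrate the target format used when targets_avail is True, the sketch below turns the last column of a BED line into a presence/absence vector of length n_features. The helper name decode_targets, the tab-separated layout, and the treatment of an empty last column as "no features present" are assumptions for illustration only.

```python
def decode_targets(line, n_features):
    """Parse the last column of a tab-separated BED line into a 0/1 list.

    Hypothetical helper: the last column holds semicolon-separated
    feature indices, e.g. "0;1;45;60".
    """
    cols = line.rstrip("\n").split("\t")
    targets = [0] * n_features
    field = cols[-1].strip()
    if field:  # assumption: an empty column means no features are present
        for idx in field.split(";"):
            targets[int(idx)] = 1
    return targets


print(decode_targets("chr1\t100\t200\t0;1;4", n_features=6))
# prints [1, 1, 0, 0, 1, 0]
```

An index greater than or equal to n_features raises an IndexError here, which is one reason n_features must be given whenever targets are available.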

get_data(batch_size, n_samples=None)

This method fetches a subset of the data from the sampler, divided into batches.

Parameters
  • batch_size (int) – The size of the batches to divide the data into.

  • n_samples (int, optional) – Default is None. The total number of samples to retrieve.

Returns

sequences – The list of sequences grouped into batches. An element in the sequences list is of the shape \(B \times L \times N\), where \(B\) is batch_size, \(L\) is the sequence length, and \(N\) is the size of the sequence type’s alphabet.

Return type

list(numpy.ndarray)
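The shape of the returned list can be illustrated with plain NumPy. This is a sketch of the batching idea only: the helper split_into_batches is hypothetical, and allowing the final batch to be smaller than batch_size is an assumption.

```python
import numpy as np


def split_into_batches(sequences, batch_size):
    """Split an S x L x N array into a list of batches along axis 0."""
    return [
        sequences[i:i + batch_size]
        for i in range(0, len(sequences), batch_size)
    ]


# 10 samples, sequence length 100, alphabet size 4 (placeholder values).
sequences = np.zeros((10, 100, 4))
batches = split_into_batches(sequences, batch_size=4)
print([batch.shape for batch in batches])
# prints [(4, 100, 4), (4, 100, 4), (2, 100, 4)]
```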

get_data_and_targets(batch_size, n_samples=None)

This method fetches a subset of the sequence data and targets from the sampler, divided into batches.

Parameters
  • batch_size (int) – The size of the batches to divide the data into.

  • n_samples (int, optional) – Default is None. The total number of samples to retrieve.

Returns

sequences_and_targets, targets_matrix – Tuple containing the list of sequence-target pairs, as well as a single matrix with all targets in the same order. The sequence elements of sequences_and_targets are of the shape \(B \times L \times N\) and its target elements are of the shape \(B \times F\), where \(B\) is batch_size, \(L\) is the sequence length, \(N\) is the size of the sequence type’s alphabet, and \(F\) is the number of features. Further, targets_matrix is of the shape \(S \times F\), where \(S =\) n_samples.

Return type

tuple(list(tuple(numpy.ndarray, numpy.ndarray)), numpy.ndarray)
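The relationship between the two return values can be sketched with random placeholder data. The variable names mirror the description above, but the data and the batching code are illustrative assumptions rather than the sampler's implementation.

```python
import numpy as np

S, L, N, F = 6, 8, 4, 3   # samples, sequence length, alphabet size, features
B = 4                     # batch_size

rng = np.random.default_rng(0)
sequences = rng.random((S, L, N))          # placeholder sequence data
targets = rng.integers(0, 2, size=(S, F))  # placeholder 0/1 targets

# List of (sequence batch, target batch) pairs.
sequences_and_targets = [
    (sequences[i:i + B], targets[i:i + B]) for i in range(0, S, B)
]

# One matrix holding every target row, in the same order as the batches.
targets_matrix = np.vstack([t for _, t in sequences_and_targets])

print(sequences_and_targets[0][0].shape)  # prints (4, 8, 4)
print(sequences_and_targets[0][1].shape)  # prints (4, 3)
print(targets_matrix.shape)               # prints (6, 3)
```

Keeping targets_matrix in batch order makes it straightforward to compare it row by row against model predictions concatenated over the same batches.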

sample(batch_size=1)

Draws a mini-batch of examples and their corresponding labels.

Parameters

batch_size (int, optional) – Default is 1. The number of examples to include in the mini-batch.

Returns

sequences, targets – A tuple containing the numeric representation of the sequence examples and their corresponding labels. The shape of sequences will be \(B \times L \times N\), where \(B\) is batch_size, \(L\) is the sequence length, and \(N\) is the size of the sequence type’s alphabet. The shape of targets will be \(B \times F\), where \(F\) is the number of features.

Return type

tuple(numpy.ndarray, numpy.ndarray)
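As an illustration of the \(B \times L \times N\) numeric representation, the sketch below one-hot encodes two short DNA strings and stacks them into a mini-batch. The one_hot helper and the "ACGT" alphabet are assumptions; in particular, characters outside the alphabet are left as all-zero rows here for simplicity, and the reference sequence class may encode them differently.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed alphabet, so N = 4


def one_hot(sequence, alphabet=ALPHABET):
    """Encode a string as an L x N array with a single 1 per known base."""
    index = {base: i for i, base in enumerate(alphabet)}
    encoding = np.zeros((len(sequence), len(alphabet)))
    for pos, base in enumerate(sequence):
        if base in index:  # unknown characters stay all zeros (assumption)
            encoding[pos, index[base]] = 1.0
    return encoding


# Stack two examples of length L = 4 into a batch with B = 2.
batch = np.stack([one_hot("ACGT"), one_hot("GGNA")])
print(batch.shape)  # prints (2, 4, 4)
```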

MatFileSampler

class selene_sdk.samplers.file_samplers.MatFileSampler(filepath, sequence_key, targets_key=None, random_seed=436, shuffle=True, sequence_batch_axis=0, sequence_alphabet_axis=1, targets_batch_axis=0)

Bases: selene_sdk.samplers.file_samplers.file_sampler.FileSampler

A sampler for which the dataset is loaded directly from a *.mat file.

Parameters
  • filepath (str) – The path to the file to load the data from.

  • sequence_key (str) – The key for the sequences data matrix.

  • targets_key (str, optional) – Default is None. The key for the targets data matrix.

  • random_seed (int, optional) – Default is 436. Sets the random seed for sampling.

  • shuffle (bool, optional) – Default is True. Shuffle the order of the samples in the matrix before sampling from it.

  • sequence_batch_axis (int, optional) – Default is 0. The batch axis of the sequences data matrix.

  • sequence_alphabet_axis (int, optional) – Default is 1. The alphabet axis of the sequences data matrix.

  • targets_batch_axis (int, optional) – Default is 0. The batch axis of the targets data matrix.
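The axis parameters describe how the stored matrix is laid out. One way to bring a 3-D sequences matrix into the \(B \times L \times N\) layout used by the return values is sketched below. The helper to_batch_length_alphabet is hypothetical, and whether the sampler rearranges its data in exactly this way is an assumption.

```python
import numpy as np


def to_batch_length_alphabet(sequences, batch_axis=0, alphabet_axis=1):
    """Reorder a 3-D array so its axes are (batch, length, alphabet)."""
    if sequences.ndim != 3 or batch_axis == alphabet_axis:
        raise ValueError("expected a 3-D array and two distinct axes")
    # The remaining axis is taken to be the sequence-length axis.
    length_axis = ({0, 1, 2} - {batch_axis, alphabet_axis}).pop()
    return np.transpose(sequences, (batch_axis, length_axis, alphabet_axis))


# Stored as (batch, alphabet, length), matching the default axis values.
stored = np.zeros((5, 4, 100))
print(to_batch_length_alphabet(stored).shape)  # prints (5, 100, 4)
```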

Variables

~MatFileSampler.n_samples (int) – The number of samples in the data matrix.
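The effect of shuffle and random_seed can be illustrated with a seeded permutation applied to both matrices, so that each sequence stays aligned with its targets. This is a sketch under assumptions: the helper shuffled_order is hypothetical, and the random number generator used by the sampler may differ.

```python
import numpy as np


def shuffled_order(n_samples, random_seed=436):
    """Return a reproducible permutation of the sample indices."""
    rng = np.random.RandomState(random_seed)
    return rng.permutation(n_samples)


n_samples = 5
sequences = np.arange(n_samples * 2 * 4).reshape(n_samples, 2, 4)
targets = np.arange(n_samples * 3).reshape(n_samples, 3)

# Index both matrices with the same order to keep rows paired.
order = shuffled_order(n_samples)
shuffled_sequences = sequences[order]
shuffled_targets = targets[order]
```

Calling shuffled_order again with the same seed returns the same permutation, which is what makes the sampling order reproducible across runs.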

get_data(batch_size, n_samples=None)

This method fetches a subset of the data from the sampler, divided into batches.

Parameters
  • batch_size (int) – The size of the batches to divide the data into.

  • n_samples (int, optional) – Default is None. The total number of samples to retrieve.

Returns

sequences – The list of sequences grouped into batches. An element in the sequences list is of the shape \(B \times L \times N\), where \(B\) is batch_size, \(L\) is the sequence length, and \(N\) is the size of the sequence type’s alphabet.

Return type

list(numpy.ndarray)

get_data_and_targets(batch_size, n_samples=None)

This method fetches a subset of the sequence data and targets from the sampler, divided into batches.

Parameters
  • batch_size (int) – The size of the batches to divide the data into.

  • n_samples (int, optional) – Default is None. The total number of samples to retrieve.

Returns

sequences_and_targets, targets_matrix – Tuple containing the list of sequence-target pairs, as well as a single matrix with all targets in the same order. The sequence elements of sequences_and_targets are of the shape \(B \times L \times N\) and its target elements are of the shape \(B \times F\), where \(B\) is batch_size, \(L\) is the sequence length, \(N\) is the size of the sequence type’s alphabet, and \(F\) is the number of features. Further, targets_matrix is of the shape \(S \times F\), where \(S =\) n_samples.

Return type

tuple(list(tuple(numpy.ndarray, numpy.ndarray)), numpy.ndarray)

sample(batch_size=1)

Draws a mini-batch of examples and their corresponding labels.

Parameters

batch_size (int, optional) – Default is 1. The number of examples to include in the mini-batch.

Returns

sequences, targets – A tuple containing the numeric representation of the sequence examples and their corresponding labels. The shape of sequences will be \(B \times L \times N\), where \(B\) is batch_size, \(L\) is the sequence length, and \(N\) is the size of the sequence type’s alphabet. The shape of targets will be \(B \times F\), where \(F\) is the number of features.

Return type

tuple(numpy.ndarray, numpy.ndarray)