selene_sdk.sequences
This module provides the types for representing biological sequences.
Sequence

class selene_sdk.sequences.Sequence
Bases: object
The abstract base class for biological sequence classes.

abstract property BASES_ARR
An array containing the alphabet (i.e., all possible symbols that may occur in a sequence). We expect that INDEX_TO_BASE[i] == BASES_ARR[i] is True for all valid i.
 Returns
The array of all members of the alphabet.
 Return type
numpy.ndarray, dtype=str

abstract property BASE_TO_INDEX
A dictionary mapping members of the alphabet (i.e., all possible symbols that can occur in a sequence) to integers.
 Returns
The dictionary mapping the alphabet to integers.
 Return type
dict

abstract property INDEX_TO_BASE
A dictionary mapping integers to members of the alphabet (i.e., all possible symbols that can occur in a sequence). We expect that INDEX_TO_BASE[i] == BASES_ARR[i] is True for all valid i.
 Returns
The dictionary mapping integers to the alphabet.
 Return type
dict

abstract property UNK_BASE
The base used to represent unknown positions. This is not the same as a character from outside the sequence’s alphabet, which is an error. An unknown base signifies that the position holds one of the bases from the alphabet, but we are uncertain which.
 Returns
The character representing an unknown base.
 Return type
str

abstract coords_in_bounds(*args, **kwargs)
Checks if queried coordinates are valid.
 Returns
True if the coordinates are in bounds, otherwise False.
 Return type
bool

abstract classmethod encoding_to_sequence(encoding)
Transforms the input numerical representation of a sequence into a string representation.
 Parameters
encoding (numpy.ndarray, dtype=numpy.float32) – The \(L \times N\) encoding of the sequence, where \(L\) is the length of the sequence, and \(N\) is the size of the sequence type’s alphabet.
 Returns
The sequence of bases decoded from the input array. This sequence will be of length \(L\).
 Return type
str

abstract get_encoding_from_coords(*args, **kwargs)
Extracts the numerical encoding for a sequence occurring at the given coordinates.
 Returns
The \(L \times N\) encoding of the sequence occurring at the queried coordinates, where \(L\) is the length of the sequence, and \(N\) is the size of the sequence type’s alphabet. Behavior is undefined for invalid coordinates.
 Return type
numpy.ndarray, dtype=numpy.float32

abstract get_sequence_from_coords(*args, **kwargs)
Extracts a string representation of a sequence at the given coordinates.
 Returns
The sequence of bases occurring at the queried coordinates. This sequence will be of length \(L\) if the coordinates are valid. Behavior is undefined for invalid coordinates.
 Return type
str

abstract classmethod sequence_to_encoding(sequence)
Transforms a biological sequence into a numerical representation.
 Parameters
sequence (str) – The input sequence of characters.
 Returns
The \(L \times N\) encoding of the sequence, where \(L\) is the length of the sequence, and \(N\) is the size of the sequence type’s alphabet.
 Return type
numpy.ndarray, dtype=numpy.float32

Genome

class selene_sdk.sequences.Genome(input_path, blacklist_regions=None, bases_order=None)
Bases: selene_sdk.sequences.sequence.Sequence
This class provides access to an organism’s genomic sequence.
This class supports retrieving parts of the sequence and converting these parts into their one-hot encodings. It is essentially a wrapper class around the pyfaidx.Fasta class.
 Parameters
input_path (str) – Path to an indexed FASTA file, that is, a *.fasta file with a corresponding *.fai file in the same directory. This file should contain the target organism’s genome sequence.
blacklist_regions (str or None, optional) – Default is None. Path to a tabix-indexed list of regions from which we should not output sequences. This is used to ensure that we are not sampling from areas where we will never collect measurements. You can pass as input “hg19” or “hg38” to use the blacklist regions released by ENCODE. You can also pass in your own tabix-indexed .gz file.
bases_order (list(str) or None, optional) – Default is None (use the default base ordering of [‘A’, ‘C’, ‘G’, ‘T’]). Specify a different ordering of DNA bases for one-hot encoding.
 Variables

BASES_ARR = ['A', 'C', 'G', 'T']
This is an array with the alphabet (i.e., all possible symbols that may occur in a sequence). We expect that INDEX_TO_BASE[i] == BASES_ARR[i] is True for all valid i.

BASE_TO_INDEX = {'A': 0, 'C': 1, 'G': 2, 'T': 3, 'a': 0, 'c': 1, 'g': 2, 't': 3}
A dictionary mapping members of the alphabet (i.e., all possible symbols that can occur in a sequence) to integers.

COMPLEMENTARY_BASE_DICT = {'A': 'T', 'C': 'G', 'G': 'C', 'N': 'N', 'T': 'A', 'a': 'T', 'c': 'G', 'g': 'C', 'n': 'N', 't': 'A'}
A dictionary mapping each base to its complementary base.

INDEX_TO_BASE = {0: 'A', 1: 'C', 2: 'G', 3: 'T'}
A dictionary mapping integers to members of the alphabet (i.e., all possible symbols that can occur in a sequence). We expect that INDEX_TO_BASE[i] == BASES_ARR[i] is True for all valid i.

UNK_BASE = 'N'
This is a base used to represent unknown positions. This is not the same as a character from outside the sequence’s alphabet. A character from outside the alphabet is an error. A position with an unknown base signifies that the position is one of the bases from the alphabet, but we are uncertain which.
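
The relationship between these class attributes can be checked directly. The sketch below copies the default values listed above into plain Python variables (it does not import selene_sdk) and verifies the documented invariant, along with the fact that lowercase and uppercase bases share an index.

```python
# Values copied from the Genome attribute listing above.
BASES_ARR = ['A', 'C', 'G', 'T']
BASE_TO_INDEX = {'A': 0, 'C': 1, 'G': 2, 'T': 3,
                 'a': 0, 'c': 1, 'g': 2, 't': 3}
INDEX_TO_BASE = {0: 'A', 1: 'C', 2: 'G', 3: 'T'}

# Documented invariant: INDEX_TO_BASE[i] == BASES_ARR[i] for all valid i.
invariant_holds = all(
    INDEX_TO_BASE[i] == BASES_ARR[i] for i in range(len(BASES_ARR))
)

# Lowercase input maps to the same column as its uppercase form.
case_insensitive = all(
    BASE_TO_INDEX[base] == BASE_TO_INDEX[base.lower()] for base in BASES_ARR
)

# Going base -> index -> base always yields the uppercase symbol.
round_trip = all(
    INDEX_TO_BASE[BASE_TO_INDEX[base.lower()]] == base for base in BASES_ARR
)

print(invariant_holds, case_insensitive, round_trip)
```

Note that UNK_BASE ('N') has no entry in BASE_TO_INDEX: it is not a member of the alphabet, so it has no column of its own in the encoding.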

coords_in_bounds(chrom, start, end)
Checks whether the region we want to query is within the bounds of the queried chromosome and non-overlapping with blacklist regions (if given).
 Parameters
chrom (str) – The name of the chromosome, e.g. “chr1”.
start (int) – The 0-based start coordinate of the sequence.
end (int) – One past the 0-based last position in the sequence.
 Returns
Whether we can retrieve a sequence from the bounds specified in the input.
 Return type
bool
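
The coordinate convention above (0-based start, end one past the last position) is a half-open interval. As a minimal sketch of what a bounds check under that convention looks like, the hypothetical helper below takes a dictionary of chromosome lengths. The helper name, the `chrom_lengths` argument, and the treatment of empty regions are assumptions made for illustration. It omits the blacklist check entirely and is not the library's implementation.

```python
def coords_in_bounds_sketch(chrom, start, end, chrom_lengths):
    """Hypothetical half-open [start, end) bounds check (no blacklist)."""
    if chrom not in chrom_lengths:
        return False
    # Assumption: an empty or inverted region (start >= end) is rejected.
    return 0 <= start < end <= chrom_lengths[chrom]


# Hypothetical chromosome length used only for this example.
chrom_lengths = {"chr1": 1000}

whole = coords_in_bounds_sketch("chr1", 0, 1000, chrom_lengths)
past_end = coords_in_bounds_sketch("chr1", 990, 1001, chrom_lengths)
negative = coords_in_bounds_sketch("chr1", -1, 10, chrom_lengths)
missing = coords_in_bounds_sketch("chr2", 0, 10, chrom_lengths)
print(whole, past_end, negative, missing)
```

Because end is exclusive, end may equal the chromosome length, and the region always covers exactly end - start positions.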

classmethod encoding_to_sequence(encoding)
Converts an input one-hot encoding to its DNA sequence.
 Parameters
encoding (numpy.ndarray, dtype=numpy.float32) – An \(L \times 4\) one-hot encoding of the sequence, where \(L\) is the length of the output sequence.
 Returns
The sequence of \(L\) nucleotides decoded from the input array.
 Return type
str

get_encoding_from_coords(chrom, start, end, strand='+', pad=False)
Gets the one-hot encoding of the genomic sequence at the queried coordinates.
 Parameters
chrom (str) – The name of the chromosome or region, e.g. “chr1”.
start (int) – The 0-based start coordinate of the first position in the sequence.
end (int) – One past the 0-based last position in the sequence.
strand ({‘+’, ‘-’, ‘.’}, optional) – Default is ‘+’. The strand the sequence is located on. ‘.’ is treated as ‘+’.
pad (bool, optional) – Default is False. Pad the output sequence with ‘N’ if start and/or end are out of bounds to return a sequence of length end - start.
 Returns
The \(L \times 4\) encoding of the sequence, where \(L = end - start\), unless chrom cannot be found in the input FASTA, start or end are out of bounds, or (if a blacklist exists) the region overlaps with a blacklist region. In these cases, it will return an empty encoding; that is, \(L = 0\) for the NumPy array returned.
 Return type
numpy.ndarray, dtype=numpy.float32
 Raises
ValueError – If the input char to strand is not one of the specified choices. (Raised in the call to self.get_sequence_from_coords)

get_encoding_from_coords_check_unk(chrom, start, end, strand='+', pad=False)
Gets the one-hot encoding of the genomic sequence at the queried coordinates and checks whether the sequence contains unknown base(s).
 Parameters
chrom (str) – The name of the chromosome or region, e.g. “chr1”.
start (int) – The 0-based start coordinate of the first position in the sequence.
end (int) – One past the 0-based last position in the sequence.
strand ({‘+’, ‘-’, ‘.’}, optional) – Default is ‘+’. The strand the sequence is located on. ‘.’ is treated as ‘+’.
pad (bool, optional) – Default is False. Pad the output sequence with ‘N’ if start and/or end are out of bounds to return a sequence of length end - start.
 Returns
* tuple[0] is the \(L \times 4\) encoding of the sequence containing data of numpy.float32 type, where \(L = end - start\), unless chrom cannot be found in the input FASTA, start or end are out of bounds, or (if a blacklist exists) the region overlaps with a blacklist region. In these cases, it will return an empty encoding; that is, \(L = 0\) for the NumPy array returned.
* tuple[1] is the boolean value that indicates whether the sequence contains any unknown base(s) specified in self.UNK_BASE.
 Return type
tuple(numpy.ndarray, bool)
 Raises
ValueError – If the input char to strand is not one of the specified choices. (Raised in the call to self.get_sequence_from_coords)

get_sequence_from_coords(chrom, start, end, strand='+', pad=False)
Gets the queried chromosome’s sequence at the input coordinates.
 Parameters
chrom (str) – The name of the chromosome, e.g. “chr1”.
start (int) – The 0-based start coordinate of the sequence.
end (int) – One past the 0-based last position in the sequence.
strand ({‘+’, ‘-’, ‘.’}, optional) – Default is ‘+’. The strand the sequence is located on. ‘.’ is treated as ‘+’.
pad (bool, optional) – Default is False. Pad the output sequence with ‘N’ if start and/or end are out of bounds to return a sequence of length end - start.
 Returns
The genomic sequence of length \(L\) where \(L = end - start\). If pad is False and one/both of start and end are out of bounds, will return an empty string. Also returns an empty string if chrom cannot be found in the input FASTA file. Otherwise, will return the sequence with padding at the start/end if appropriate.
 Return type
str
 Raises
ValueError – If the input char to strand is not one of the specified choices.
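
The effect of pad can be illustrated on a plain Python string. The helper below is a hypothetical stand-in written for this page, not the library's code: it works on an in-memory string instead of a FASTA file, ignores strand and blacklist handling, and assumes that a region with no overlap with the sequence yields an empty string.

```python
def fetch_sketch(chrom_seq, start, end, pad=False, unk_base="N"):
    """Hypothetical half-open [start, end) fetch with optional padding."""
    length = len(chrom_seq)
    # Assumption: empty, inverted, or fully out-of-range regions give "".
    if start >= end or start >= length or end <= 0:
        return ""
    out_of_bounds = start < 0 or end > length
    if out_of_bounds and not pad:
        return ""
    left = unk_base * max(0, -start)         # positions before index 0
    right = unk_base * max(0, end - length)  # positions past the last base
    core = chrom_seq[max(0, start):min(end, length)]
    return left + core + right


seq = "ACGTACGTAC"  # toy 10-base "chromosome"
inside = fetch_sketch(seq, 2, 6)
padded_left = fetch_sketch(seq, -2, 3, pad=True)
padded_right = fetch_sketch(seq, 8, 12, pad=True)
unpadded = fetch_sketch(seq, 8, 12)
print(inside, padded_left, padded_right, repr(unpadded))
```

With pad=True the result always has length end - start, which is what lets downstream code rely on a fixed window size near chromosome ends.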

classmethod sequence_to_encoding(sequence)
Converts an input sequence to its one-hot encoding.
 Parameters
sequence (str) – A nucleotide sequence of length \(L\).
 Returns
The \(L \times 4\) one-hot encoding of the sequence.
 Return type
numpy.ndarray, dtype=numpy.float32
Proteome

class selene_sdk.sequences.Proteome(input_path)
Bases: selene_sdk.sequences.sequence.Sequence
Provides access to an organism’s proteomic sequence.
It supports retrieving parts of the sequence and converting these parts into their one-hot encodings. It is essentially a wrapper class around the pyfaidx.Fasta class.
 Parameters
input_path (str) – Path to an indexed FASTA file containing amino acid sequences, that is, a *.fasta file with a corresponding *.fai file in the same directory. The file should contain the sequences from which training examples will be created.
 Variables

BASES_ARR = ['A', 'R', 'N', 'D', 'C', 'E', 'Q', 'G', 'H', 'I', 'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V']
This is an array with the alphabet (i.e., all possible symbols that may occur in a sequence). We expect that INDEX_TO_BASE[i] == BASES_ARR[i] is True for all valid i.

BASE_TO_INDEX = {'A': 0, 'C': 4, 'D': 3, 'E': 5, 'F': 13, 'G': 7, 'H': 8, 'I': 9, 'K': 11, 'L': 10, 'M': 12, 'N': 2, 'P': 14, 'Q': 6, 'R': 1, 'S': 15, 'T': 16, 'V': 19, 'W': 17, 'Y': 18}
A dictionary mapping members of the alphabet (i.e., all possible symbols that can occur in a sequence) to integers.

INDEX_TO_BASE = {0: 'A', 1: 'R', 2: 'N', 3: 'D', 4: 'C', 5: 'E', 6: 'Q', 7: 'G', 8: 'H', 9: 'I', 10: 'L', 11: 'K', 12: 'M', 13: 'F', 14: 'P', 15: 'S', 16: 'T', 17: 'W', 18: 'Y', 19: 'V'}
A dictionary mapping integers to members of the alphabet (i.e., all possible symbols that can occur in a sequence). We expect that INDEX_TO_BASE[i] == BASES_ARR[i] is True for all valid i.

UNK_BASE = 'X'
This is a base used to represent unknown positions. This is not the same as a character from outside the sequence’s alphabet. A character from outside the alphabet is an error. A position with an unknown base signifies that the position is one of the bases from the alphabet, but we are uncertain which.
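
BASE_TO_INDEX is printed above in alphabetical key order, which can make it look unrelated to BASES_ARR. The short check below, using values copied from this listing rather than imported from selene_sdk, shows that the dictionary is simply the position of each amino acid in BASES_ARR.

```python
# Values copied from the Proteome attribute listing above.
BASES_ARR = ['A', 'R', 'N', 'D', 'C', 'E', 'Q', 'G', 'H', 'I',
             'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V']
BASE_TO_INDEX = {'A': 0, 'C': 4, 'D': 3, 'E': 5, 'F': 13, 'G': 7, 'H': 8,
                 'I': 9, 'K': 11, 'L': 10, 'M': 12, 'N': 2, 'P': 14, 'Q': 6,
                 'R': 1, 'S': 15, 'T': 16, 'V': 19, 'W': 17, 'Y': 18}

# Rebuild the mapping from the array order and compare.
derived = {base: index for index, base in enumerate(BASES_ARR)}
matches = derived == BASE_TO_INDEX
print(matches, len(BASES_ARR))
```

Unlike the Genome mapping, this one contains no lowercase keys, and 'N' here is an ordinary alphabet member (asparagine), with 'X' serving as the unknown symbol.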

coords_in_bounds(prot, start, end)
Checks whether the coordinates we want to query are valid.
 Parameters
prot (str) – The name of the protein, e.g. “YFP”.
start (int) – The 0-based start coordinate of the first position in the sequence.
end (int) – One past the 0-based last position in the sequence.
 Returns
A boolean indicating whether we can retrieve a sequence from the queried coordinates.
 Return type
bool

classmethod encoding_to_sequence(encoding)
Converts an input one-hot encoding to its amino acid sequence.
 Parameters
encoding (numpy.ndarray, dtype=numpy.float32) – The \(L \times 20\) encoding of the sequence, where \(L\) is the length of the output amino acid sequence.
 Returns
The sequence of \(L\) amino acids decoded from the input array.
 Return type
str

get_encoding_from_coords(prot, start, end)
Gets the one-hot encoding of the protein’s sequence at the input coordinates.
 Parameters
prot (str) – The name of the protein, e.g. “YFP”.
start (int) – The 0-based start coordinate of the first position in the sequence.
end (int) – One past the 0-based last position in the sequence.
 Returns
The \(L \times 20\) encoding of the sequence, where \(L = end - start\).
 Return type
numpy.ndarray, dtype=numpy.float32

get_sequence_from_coords(prot, start, end)
Gets the queried protein sequence at the input coordinates.
 Parameters
prot (str) – The protein name, e.g. “YFP”.
start (int) – The 0-based start coordinate of the first position in the sequence.
end (int) – One past the 0-based last position in the sequence.
 Returns
The sequence of \(L\) amino acids at the specified coordinates, where \(L = end - start\).
 Return type
str

classmethod sequence_to_encoding(sequence)
Converts an input sequence to its one-hot encoding.
 Parameters
sequence (str) – The input sequence of amino acids of length \(L\).
 Returns
The \(L \times 20\) one-hot encoding, where \(L\) is the length of the input sequence.
 Return type
numpy.ndarray, dtype=numpy.float32
sequence_to_encoding

selene_sdk.sequences.sequence_to_encoding(sequence, base_to_index, bases_arr)
Converts an input sequence to its one-hot encoding.
 Parameters
sequence (str) – The input sequence of length \(L\).
base_to_index (dict) – A dict that maps input characters to indices, where the indices specify the column to assign as 1 when a base exists at the current position in the input. If a base does not exist at the current position in the input, its corresponding column in the encoding is set to zero. Note that the rows correspond directly to the positions in the input sequence. For instance, with a genome you would have each of [‘A’, ‘C’, ‘G’, ‘T’] as keys, mapping to values of [0, 1, 2, 3].
bases_arr (list(str)) – The characters in the sequence’s alphabet.
 Returns
The \(L \times N\) encoding of the sequence, where \(L\) is the length of the input sequence and \(N\) is the size of the sequence alphabet.
 Return type
numpy.ndarray, dtype=numpy.float32
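
To make the row and column layout concrete, here is a minimal NumPy sketch of a one-hot encoder with the same three arguments. It is an illustrative stand-in named `one_hot_sketch`, not the library function. In particular, how characters missing from base_to_index (such as an unknown base) are encoded is an assumption here: the sketch leaves those rows at zero, and the library may fill them differently.

```python
import numpy as np


def one_hot_sketch(sequence, base_to_index, bases_arr):
    """Illustrative L x N one-hot encoder (not selene_sdk's code)."""
    encoding = np.zeros((len(sequence), len(bases_arr)), dtype=np.float32)
    for row, base in enumerate(sequence):
        col = base_to_index.get(base)
        if col is not None:
            encoding[row, col] = 1.0
        # Assumption: characters absent from base_to_index keep a zero row.
    return encoding


bases_arr = ['A', 'C', 'G', 'T']
base_to_index = {'A': 0, 'C': 1, 'G': 2, 'T': 3,
                 'a': 0, 'c': 1, 'g': 2, 't': 3}

encoding = one_hot_sketch("ACgT", base_to_index, bases_arr)
print(encoding.shape)
print(encoding)
```

Each row corresponds to one position in the input and each column to one entry of bases_arr, so the input "ACgT" with the default ordering produces the 4 x 4 identity matrix.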
encoding_to_sequence

selene_sdk.sequences.encoding_to_sequence(encoding, bases_arr, unk_base)
Converts a sequence one-hot encoding to its string sequence.
 Parameters
encoding (numpy.ndarray, dtype=numpy.float32) – The \(L \times N\) encoding of the sequence, where \(L\) is the length of the sequence, and \(N\) is the size of the sequence alphabet.
bases_arr (list(str)) – A list of the bases in the sequence’s alphabet that corresponds to the correct columns for those bases in the encoding.
unk_base (str) – The base corresponding to the “unknown” character in this encoding. See selene_sdk.sequences.Sequence.UNK_BASE for more information.
 Returns
The sequence of \(L\) characters decoded from the input array.
 Return type
str
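
The decoding direction can be sketched the same way. The helper below, `decode_sketch`, is hypothetical and rests on a stated assumption about ambiguous rows: a row with a single largest entry decodes to the matching entry of bases_arr, and any row without a unique maximum decodes to unk_base. The library's exact rule for emitting unk_base is not given in this reference.

```python
import numpy as np


def decode_sketch(encoding, bases_arr, unk_base):
    """Illustrative decoder for an L x N encoding (not selene_sdk's code)."""
    chars = []
    for row in encoding:
        winners = np.flatnonzero(row == row.max())
        # Assumption: ties (e.g. a uniform or all-zero row) mean "unknown".
        chars.append(bases_arr[winners[0]] if len(winners) == 1 else unk_base)
    return "".join(chars)


encoding = np.array(
    [[1, 0, 0, 0],
     [0, 0, 1, 0],
     [0.25, 0.25, 0.25, 0.25],
     [0, 0, 0, 1]],
    dtype=np.float32,
)
decoded = decode_sketch(encoding, ['A', 'C', 'G', 'T'], 'N')
print(decoded)
```

The output string has one character per row, matching the \(L\) in the parameter description.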
get_reverse_encoding

selene_sdk.sequences.get_reverse_encoding(encoding, bases_arr, base_to_index, complementary_base_dict)
The Genome DNA bases encoding is created such that the reverse encoding can be quickly computed.
 Parameters
encoding (numpy.ndarray)
bases_arr (list(str))
base_to_index (dict)
complementary_base_dict (dict)
 Returns
 Return type
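
One way to see why the reverse encoding is cheap to compute, assuming the default column order [‘A’, ‘C’, ‘G’, ‘T’]: complementary bases sit at mirrored column positions (A and T at 0 and 3, C and G at 1 and 2), so flipping the columns complements every base and flipping the rows reverses the sequence. The sketch below demonstrates that property with NumPy. It is not the library's implementation, and the shortcut need not hold for a custom bases_order, which is presumably why the function accepts the mapping dictionaries.

```python
import numpy as np

BASES_ARR = ['A', 'C', 'G', 'T']
COMPLEMENT = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}


def reverse_complement_sketch(encoding):
    """Flip rows (reverse) and columns (complement); A, C, G, T order only."""
    return encoding[::-1, ::-1].copy()


sequence = "AACG"
# Build the forward one-hot encoding by picking rows of the identity matrix.
forward = np.eye(4, dtype=np.float32)[[BASES_ARR.index(b) for b in sequence]]

reverse = reverse_complement_sketch(forward)
decoded = "".join(BASES_ARR[i] for i in reverse.argmax(axis=1))

# Reference answer computed on the string itself.
expected = "".join(COMPLEMENT[b] for b in reversed(sequence))
print(decoded, expected)
```

Both approaches agree on "CGTT" for the input "AACG".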