selene_sdk.sequences
This module provides the types for representing biological sequences.
Sequence
-
class selene_sdk.sequences.Sequence
Bases: object
The abstract base class for biological sequence classes.
-
abstract property
BASES_ARR
¶ This is an array with the alphabet (i.e. all possible symbols that may occur in a sequence). We expect that INDEX_TO_BASE[i]==BASES_ARR[i] is True for all valid i.
- Returns
The array of all members of the alphabet.
- Return type
numpy.ndarray, dtype=str
-
abstract property
BASE_TO_INDEX
¶ A dictionary mapping members of the alphabet (i.e. all possible symbols that can occur in a sequence) to integers.
- Returns
The dictionary mapping the alphabet to integers.
- Return type
dict
-
abstract property
INDEX_TO_BASE
¶ A dictionary mapping integers to members of the alphabet (i.e. all possible symbols that can occur in a sequence). We expect that INDEX_TO_BASE[i]==BASES_ARR[i] is True for all valid i.
- Returns
The dictionary mapping integers to the alphabet.
- Return type
dict
-
abstract property
UNK_BASE
¶ This is a base used to represent unknown positions. This is not the same as a character from outside the sequence’s alphabet. A character from outside the alphabet is an error. A position with an unknown base signifies that the position is one of the bases from the alphabet, but we are uncertain which.
- Returns
The character representing an unknown base.
- Return type
str
-
abstract
coords_in_bounds
(*args, **kwargs)[source]¶ Checks if queried coordinates are valid.
- Returns
True if the coordinates are in bounds, otherwise False.
- Return type
bool
-
abstract classmethod
encoding_to_sequence
(encoding)[source]¶ Transforms the input numerical representation of a sequence into a string representation.
- Parameters
encoding (numpy.ndarray, dtype=numpy.float32) – The \(L \times N\) encoding of the sequence, where \(L\) is the length of the sequence, and \(N\) is the size of the sequence type’s alphabet.
- Returns
The sequence of bases decoded from the input array. This sequence will be of length \(L\).
- Return type
str
-
abstract
get_encoding_from_coords
(*args, **kwargs)[source]¶ Extracts the numerical encoding for a sequence occurring at the given coordinates.
- Returns
The \(L \times N\) encoding of the sequence occurring at the queried coordinates, where \(L\) is the length of the sequence, and \(N\) is the size of the sequence type’s alphabet. Behavior is undefined for invalid coordinates.
- Return type
numpy.ndarray, dtype=numpy.float32
-
abstract
get_sequence_from_coords
(*args, **kwargs)[source]¶ Extracts a string representation of a sequence at the given coordinates.
- Returns
The sequence of bases occurring at the queried coordinates. For valid coordinates, this sequence has length \(L\). Behavior is undefined for invalid coordinates.
- Return type
str
-
abstract classmethod
sequence_to_encoding
(sequence)[source]¶ Transforms a biological sequence into a numerical representation.
- Parameters
sequence (str) – The input sequence of characters.
- Returns
The \(L \times N\) encoding of the sequence, where \(L\) is the length of the sequence, and \(N\) is the size of the sequence type’s alphabet.
- Return type
numpy.ndarray, dtype=numpy.float32
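The distinction drawn under UNK_BASE, between a position whose base is unknown and a character that lies outside the alphabet, can be sketched in a few lines. This is only an illustration and does not import selene_sdk: the helper name `validate_symbols` is hypothetical, and raising a ValueError for out-of-alphabet characters is an assumption of this sketch, not documented library behavior.

```python
def validate_symbols(sequence, bases_arr, unk_base):
    """Return the positions holding the unknown base; reject other symbols."""
    alphabet = set(bases_arr)
    unknown_positions = []
    for position, symbol in enumerate(sequence):
        if symbol in alphabet:
            continue
        if symbol == unk_base:
            # Still a base from the alphabet, we just do not know which one.
            unknown_positions.append(position)
        else:
            # Assumption: this sketch raises; the library may signal errors differently.
            raise ValueError(
                f"{symbol!r} at position {position} is outside the alphabet"
            )
    return unknown_positions


print(validate_symbols("ACNNT", ["A", "C", "G", "T"], "N"))  # [2, 3]
```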
Genome
-
class selene_sdk.sequences.Genome(input_path, blacklist_regions=None, bases_order=None)
Bases: selene_sdk.sequences.sequence.Sequence
This class provides access to an organism’s genomic sequence.
This class supports retrieving parts of the sequence and converting these parts into their one-hot encodings. It is essentially a wrapper class around the pyfaidx.Fasta class.
- Parameters
input_path (str) – Path to an indexed FASTA file, that is, a *.fasta file with a corresponding *.fai file in the same directory. This file should contain the target organism’s genome sequence.
blacklist_regions (str or None, optional) – Default is None. Path to a tabix-indexed list of regions from which we should not output sequences. This is used to ensure that we are not sampling from areas where we will never collect measurements. You can pass as input “hg19” or “hg38” to use the blacklist regions released by ENCODE. You can also pass in your own tabix-indexed .gz file.
bases_order (list(str) or None, optional) – Default is None (use the default base ordering of [‘A’, ‘C’, ‘G’, ‘T’]). Specify a different ordering of DNA bases for one-hot encoding.
- Variables
-
BASES_ARR
= ['A', 'C', 'G', 'T']¶ This is an array with the alphabet (i.e. all possible symbols that may occur in a sequence). We expect that INDEX_TO_BASE[i]==BASES_ARR[i] is True for all valid i.
-
BASE_TO_INDEX
= {'A': 0, 'C': 1, 'G': 2, 'T': 3, 'a': 0, 'c': 1, 'g': 2, 't': 3}¶ A dictionary mapping members of the alphabet (i.e. all possible symbols that can occur in a sequence) to integers.
-
COMPLEMENTARY_BASE_DICT
= {'A': 'T', 'C': 'G', 'G': 'C', 'N': 'N', 'T': 'A', 'a': 'T', 'c': 'G', 'g': 'C', 'n': 'N', 't': 'A'}¶ A dictionary mapping each base to its complementary base.
-
INDEX_TO_BASE
= {0: 'A', 1: 'C', 2: 'G', 3: 'T'}¶ A dictionary mapping integers to members of the alphabet (i.e. all possible symbols that can occur in a sequence). We expect that INDEX_TO_BASE[i]==BASES_ARR[i] is True for all valid i.
-
UNK_BASE
= 'N'¶ This is a base used to represent unknown positions. This is not the same as a character from outside the sequence’s alphabet. A character from outside the alphabet is an error. A position with an unknown base signifies that the position is one of the bases from the alphabet, but we are uncertain which.
-
coords_in_bounds
(chrom, start, end)[source]¶ Check if the region we want to query is within the bounds of the queried chromosome and non-overlapping with blacklist regions (if given).
- Parameters
chrom (str) – The name of the chromosome, e.g. “chr1”.
start (int) – The 0-based start coordinate of the sequence.
end (int) – One past the 0-based last position in the sequence.
- Returns
Whether we can retrieve a sequence from the bounds specified in the input.
- Return type
bool
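To make the half-open coordinate convention and the blacklist check concrete, here is a minimal sketch using in-memory stand-ins. The function name `coords_in_bounds_sketch`, the `chrom_lengths` dict (standing in for the indexed FASTA), and the `blacklist` dict (standing in for the tabix-indexed file) are all hypothetical; treating an empty or inverted interval as out of bounds is likewise an assumption.

```python
def coords_in_bounds_sketch(chrom_lengths, blacklist, chrom, start, end):
    """Toy bounds check over half-open [start, end) coordinates.

    chrom_lengths: dict of chromosome name -> length (hypothetical stand-in
        for the indexed FASTA).
    blacklist: dict of chromosome name -> list of (start, end) intervals
        (hypothetical stand-in for the tabix-indexed file).
    """
    if chrom not in chrom_lengths:
        return False
    # Assumption: an empty or inverted interval is treated as out of bounds.
    if start < 0 or end > chrom_lengths[chrom] or start >= end:
        return False
    # Two half-open intervals overlap when each starts before the other ends.
    for b_start, b_end in blacklist.get(chrom, []):
        if start < b_end and b_start < end:
            return False
    return True


chrom_lengths = {"chr1": 1000}
blacklist = {"chr1": [(100, 200)]}
print(coords_in_bounds_sketch(chrom_lengths, blacklist, "chr1", 0, 100))     # True
print(coords_in_bounds_sketch(chrom_lengths, blacklist, "chr1", 150, 250))   # False
print(coords_in_bounds_sketch(chrom_lengths, blacklist, "chr1", 990, 1010))  # False
```

Because `end` is one past the last position, the query (0, 100) touches but does not overlap the blacklisted interval (100, 200).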
-
classmethod
encoding_to_sequence
(encoding)[source]¶ Converts an input one-hot encoding to its DNA sequence.
- Parameters
encoding (numpy.ndarray, dtype=numpy.float32) – An \(L \times 4\) one-hot encoding of the sequence, where \(L\) is the length of the output sequence.
- Returns
The sequence of \(L\) nucleotides decoded from the input array.
- Return type
str
-
get_encoding_from_coords
(chrom, start, end, strand='+', pad=False)[source]¶ Gets the one-hot encoding of the genomic sequence at the queried coordinates.
- Parameters
chrom (str) – The name of the chromosome or region, e.g. “chr1”.
start (int) – The 0-based start coordinate of the first position in the sequence.
end (int) – One past the 0-based last position in the sequence.
strand ({‘+’, ‘-’, ‘.’}, optional) – Default is ‘+’. The strand the sequence is located on. ‘.’ is treated as ‘+’.
pad (bool, optional) – Default is False. Pad the output sequence with ‘N’ if start and/or end are out of bounds to return a sequence of length end - start.
- Returns
The \(L \times 4\) encoding of the sequence, where \(L = end - start\), unless chrom cannot be found in the input FASTA, start or end are out of bounds, or (if a blacklist exists) the region overlaps with a blacklist region. In these cases, it will return an empty encoding, that is, \(L = 0\) for the NumPy array returned.
- Return type
numpy.ndarray, dtype=numpy.float32
- Raises
ValueError – If the input char to strand is not one of the specified choices. (Raised in the call to self.get_sequence_from_coords)
-
get_encoding_from_coords_check_unk
(chrom, start, end, strand='+', pad=False) Gets the one-hot encoding of the genomic sequence at the queried coordinates and checks whether the sequence contains unknown base(s).
- Parameters
chrom (str) – The name of the chromosome or region, e.g. “chr1”.
start (int) – The 0-based start coordinate of the first position in the sequence.
end (int) – One past the 0-based last position in the sequence.
strand ({‘+’, ‘-’, ‘.’}, optional) – Default is ‘+’. The strand the sequence is located on. ‘.’ is treated as ‘+’.
pad (bool, optional) – Default is False. Pad the output sequence with ‘N’ if start and/or end are out of bounds to return a sequence of length end - start.
- Returns
A tuple with two elements:
* tuple[0] is the \(L \times 4\) encoding of the sequence, containing data of numpy.float32 type, where \(L = end - start\), unless chrom cannot be found in the input FASTA, start or end are out of bounds, or (if a blacklist exists) the region overlaps with a blacklist region. In these cases, it will be an empty encoding, that is, \(L = 0\) for the NumPy array returned.
* tuple[1] is a boolean indicating whether the sequence contains any unknown base(s), as specified in self.UNK_BASE.
- Return type
tuple(numpy.ndarray, bool)
- Raises
ValueError – If the input char to strand is not one of the specified choices. (Raised in the call to self.get_sequence_from_coords)
-
get_sequence_from_coords
(chrom, start, end, strand='+', pad=False)[source]¶ Gets the queried chromosome’s sequence at the input coordinates.
- Parameters
chrom (str) – The name of the chromosome, e.g. “chr1”.
start (int) – The 0-based start coordinate of the sequence.
end (int) – One past the 0-based last position in the sequence.
strand ({‘+’, ‘-’, ‘.’}, optional) – Default is ‘+’. The strand the sequence is located on. ‘.’ is treated as ‘+’.
pad (bool, optional) – Default is False. Pad the output sequence with ‘N’ if start and/or end are out of bounds to return a sequence of length end - start.
- Returns
The genomic sequence of length \(L\) where \(L = end - start\). If pad is False and one/both of start and end are out of bounds, will return an empty string. Also returns an empty string if chrom cannot be found in the input FASTA file. Otherwise, will return the sequence with padding at the start/end if appropriate.
- Return type
str
- Raises
ValueError – If the input char to strand is not one of the specified choices.
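The interaction of pad and strand described above can be illustrated with a toy lookup over an in-memory dict. This is a sketch under stated assumptions, not the library's implementation: `get_sequence_sketch` and the `genome` dict are hypothetical, the ‘-’ strand is assumed to mean the reverse complement (computed with the documented COMPLEMENTARY_BASE_DICT values), and an empty or inverted interval is assumed to yield an empty string.

```python
COMPLEMENTARY_BASE_DICT = {
    "A": "T", "C": "G", "G": "C", "T": "A", "N": "N",
    "a": "T", "c": "G", "g": "C", "t": "A", "n": "N",
}


def get_sequence_sketch(genome, chrom, start, end, strand="+", pad=False):
    """Toy lookup over a dict of chromosome name -> sequence string."""
    if strand not in ("+", "-", "."):
        raise ValueError("strand must be one of '+', '-', '.'")
    # Assumption: an empty or inverted interval yields an empty string.
    if chrom not in genome or start >= end:
        return ""
    chrom_seq = genome[chrom]
    out_of_bounds = start < 0 or end > len(chrom_seq)
    if out_of_bounds and not pad:
        return ""
    # Positions outside the chromosome become 'N', so the result always
    # has length end - start.
    sequence = "".join(
        chrom_seq[i] if 0 <= i < len(chrom_seq) else "N"
        for i in range(start, end)
    )
    if strand == "-":
        # Assumption: the minus strand is the reverse complement.
        sequence = "".join(COMPLEMENTARY_BASE_DICT[b] for b in reversed(sequence))
    return sequence


genome = {"chr1": "ACGTAC"}
print(get_sequence_sketch(genome, "chr1", 0, 4))              # ACGT
print(get_sequence_sketch(genome, "chr1", 4, 8, pad=True))    # ACNN
print(get_sequence_sketch(genome, "chr1", 0, 3, strand="-"))  # CGT
```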
-
classmethod
sequence_to_encoding
(sequence)[source]¶ Converts an input sequence to its one-hot encoding.
- Parameters
sequence (str) – A nucleotide sequence of length \(L\).
- Returns
The \(L \times 4\) one-hot encoding of the sequence.
- Return type
numpy.ndarray, dtype=numpy.float32
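The bases_order parameter and the Genome class variables suggest how the lookup tables depend on the chosen base ordering. The sketch below derives case-insensitive tables from an ordering and reproduces the documented default values. The helper name `tables_from_bases_order` is hypothetical, and this is not necessarily how Genome builds its tables internally.

```python
def tables_from_bases_order(bases_order=None):
    """Hypothetical helper: derive lookup tables from a base ordering."""
    bases = list(bases_order) if bases_order is not None else ["A", "C", "G", "T"]
    base_to_index = {}
    for i, base in enumerate(bases):
        # Upper- and lowercase forms share one column, as in the
        # documented BASE_TO_INDEX value.
        base_to_index[base.upper()] = i
        base_to_index[base.lower()] = i
    index_to_base = {i: base.upper() for i, base in enumerate(bases)}
    return bases, base_to_index, index_to_base


bases_arr, base_to_index, index_to_base = tables_from_bases_order(["T", "G", "C", "A"])

# The invariant stated in the variable descriptions still holds.
assert all(index_to_base[i] == bases_arr[i] for i in range(len(bases_arr)))
print(base_to_index["a"])  # 3
```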
Proteome
-
class selene_sdk.sequences.Proteome(input_path)
Bases: selene_sdk.sequences.sequence.Sequence
Provides access to an organism’s proteomic sequence.
It supports retrieving parts of the sequence and converting these parts into their one-hot encodings. It is essentially a wrapper class around the pyfaidx.Fasta class.
- Parameters
input_path (str) – Path to an indexed FASTA file containing amino acid sequences, that is, a *.fasta file with a corresponding *.fai file in the same directory. The file should contain the sequences from which training examples will be created.
- Variables
-
BASES_ARR
= ['A', 'R', 'N', 'D', 'C', 'E', 'Q', 'G', 'H', 'I', 'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V']¶ This is an array with the alphabet (i.e. all possible symbols that may occur in a sequence). We expect that INDEX_TO_BASE[i]==BASES_ARR[i] is True for all valid i.
-
BASE_TO_INDEX
= {'A': 0, 'C': 4, 'D': 3, 'E': 5, 'F': 13, 'G': 7, 'H': 8, 'I': 9, 'K': 11, 'L': 10, 'M': 12, 'N': 2, 'P': 14, 'Q': 6, 'R': 1, 'S': 15, 'T': 16, 'V': 19, 'W': 17, 'Y': 18}¶ A dictionary mapping members of the alphabet (i.e. all possible symbols that can occur in a sequence) to integers.
-
INDEX_TO_BASE
= {0: 'A', 1: 'R', 2: 'N', 3: 'D', 4: 'C', 5: 'E', 6: 'Q', 7: 'G', 8: 'H', 9: 'I', 10: 'L', 11: 'K', 12: 'M', 13: 'F', 14: 'P', 15: 'S', 16: 'T', 17: 'W', 18: 'Y', 19: 'V'}¶ A dictionary mapping integers to members of the alphabet (i.e. all possible symbols that can occur in a sequence). We expect that INDEX_TO_BASE[i]==BASES_ARR[i] is True for all valid i.
-
UNK_BASE
= 'X'¶ This is a base used to represent unknown positions. This is not the same as a character from outside the sequence’s alphabet. A character from outside the alphabet is an error. A position with an unknown base signifies that the position is one of the bases from the alphabet, but we are uncertain which.
-
coords_in_bounds
(prot, start, end) Check if the coordinates we want to query are valid.
- Parameters
prot (str) – The name of the protein, e.g. “YFP”.
start (int) – The 0-based start coordinate of the first position in the sequence.
end (int) – One past the 0-based last position in the sequence.
- Returns
A boolean indicating whether we can retrieve a sequence from the queried coordinates.
- Return type
bool
-
classmethod
encoding_to_sequence
(encoding)[source]¶ Converts an input one-hot encoding to its amino acid sequence.
- Parameters
encoding (numpy.ndarray, dtype=numpy.float32) – The \(L \times 20\) encoding of the sequence, where \(L\) is the length of the output amino acid sequence.
- Returns
The sequence of \(L\) amino acids decoded from the input array.
- Return type
str
-
get_encoding_from_coords
(prot, start, end)[source]¶ Gets the one-hot encoding of the protein’s sequence at the input coordinates.
- Parameters
prot (str) – The name of the protein, e.g. “YFP”.
start (int) – The 0-based start coordinate of the first position in the sequence.
end (int) – One past the 0-based last position in the sequence.
- Returns
The \(L \times 20\) encoding of the sequence, where \(L = end - start\).
- Return type
numpy.ndarray, dtype=numpy.float32
-
get_sequence_from_coords
(prot, start, end)[source]¶ Gets the queried protein sequence at the input coordinates.
- Parameters
prot (str) – The protein name, e.g. “YFP”.
start (int) – The 0-based start coordinate of the first position in the sequence.
end (int) – One past the 0-based last position in the sequence.
- Returns
The sequence of \(L\) amino acids at the specified coordinates, where \(L = end - start\).
- Return type
str
-
classmethod
sequence_to_encoding
(sequence)[source]¶ Converts an input sequence to its one-hot encoding.
- Parameters
sequence (str) – The input sequence of amino acids of length \(L\).
- Returns
The \(L \times 20\) array, where \(L\) is the length of the input sequence.
- Return type
numpy.ndarray, dtype=numpy.float32
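As a rough illustration of the \(L \times 20\) layout, the following sketch encodes a short peptide using the documented Proteome alphabet order and decodes it again. It uses plain NumPy and does not call selene_sdk; the peptide "MKV" is an arbitrary example and is assumed to contain only the 20 listed residues.

```python
import numpy as np

# Alphabet order as listed for Proteome.BASES_ARR.
BASES_ARR = ["A", "R", "N", "D", "C", "E", "Q", "G", "H", "I",
             "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"]
BASE_TO_INDEX = {base: i for i, base in enumerate(BASES_ARR)}

peptide = "MKV"
columns = [BASE_TO_INDEX[residue] for residue in peptide]
# Row i of the identity matrix is the one-hot vector for alphabet index i.
encoding = np.eye(len(BASES_ARR), dtype=np.float32)[columns]

print(encoding.shape)  # (3, 20)
print(columns)         # [12, 11, 19]
decoded = "".join(BASES_ARR[i] for i in encoding.argmax(axis=1))
print(decoded)         # MKV
```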
sequence_to_encoding
-
selene_sdk.sequences.sequence_to_encoding(sequence, base_to_index, bases_arr)
Converts an input sequence to its one-hot encoding.
- Parameters
sequence (str) – The input sequence of length \(L\).
base_to_index (dict) – A dict that maps input characters to indices, where the indices specify the column to assign as 1 when a base exists at the current position in the input. If a base does not exist at the current position in the input, its corresponding column in the encoding is set as zero. Note that the rows correspond directly to the positions in the input sequence. For instance, with a genome you would have each of [‘A’, ‘C’, ‘G’, ‘T’] as keys, mapping to values of [0, 1, 2, 3].
bases_arr (list(str)) – The characters in the sequence’s alphabet.
- Returns
The \(L \times N\) encoding of the sequence, where \(L\) is the length of the input sequence and \(N\) is the size of the sequence alphabet.
- Return type
numpy.ndarray, dtype=numpy.float32
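The row and column layout described for base_to_index can be sketched in pure NumPy as follows. The function `sequence_to_encoding_sketch` is a hypothetical re-implementation for illustration only. In particular, leaving an all-zero row for characters missing from base_to_index is an assumption of this sketch, and the library may treat such characters differently.

```python
import numpy as np


def sequence_to_encoding_sketch(sequence, base_to_index, bases_arr):
    """Illustrative one-hot encoder: one row per position, one column per base."""
    encoding = np.zeros((len(sequence), len(bases_arr)), dtype=np.float32)
    for position, base in enumerate(sequence):
        column = base_to_index.get(base)
        if column is not None:
            encoding[position, column] = 1.0
        # Assumption: characters missing from base_to_index leave an
        # all-zero row in this sketch.
    return encoding


base_to_index = {"A": 0, "C": 1, "G": 2, "T": 3}
encoding = sequence_to_encoding_sketch("GATN", base_to_index, ["A", "C", "G", "T"])
print(encoding.shape)  # (4, 4)
```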
encoding_to_sequence
-
selene_sdk.sequences.encoding_to_sequence(encoding, bases_arr, unk_base)
Converts a sequence one-hot encoding to its string sequence.
- Parameters
encoding (numpy.ndarray, dtype=numpy.float32) – The \(L \times N\) encoding of the sequence, where \(L\) is the length of the sequence, and \(N\) is the size of the sequence alphabet.
bases_arr (list(str)) – A list of the bases in the sequence’s alphabet that corresponds to the correct columns for those bases in the encoding.
unk_base (str) – The base corresponding to the “unknown” character in this encoding. See selene_sdk.sequences.Sequence.UNK_BASE for more information.
- Returns
The sequence of \(L\) characters decoded from the input array.
- Return type
str
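Decoding can be sketched as picking, for each row, the column holding the largest value. The function `encoding_to_sequence_sketch` below is a hypothetical illustration, and the rule that a row without a unique maximum decodes to unk_base is an assumption of this sketch, not a documented guarantee.

```python
import numpy as np


def encoding_to_sequence_sketch(encoding, bases_arr, unk_base):
    """Illustrative decoder: one character per row of the encoding."""
    symbols = []
    for row in encoding:
        # Columns that share the row maximum.
        top_columns = np.flatnonzero(row == row.max())
        if len(top_columns) == 1:
            symbols.append(bases_arr[top_columns[0]])
        else:
            # Assumption: a row without a unique maximum decodes to unk_base.
            symbols.append(unk_base)
    return "".join(symbols)


encoding = np.array(
    [[1, 0, 0, 0], [0, 0, 1, 0], [0.25, 0.25, 0.25, 0.25], [0, 0, 0, 1]],
    dtype=np.float32,
)
print(encoding_to_sequence_sketch(encoding, ["A", "C", "G", "T"], "N"))  # AGNT
```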
get_reverse_encoding
-
selene_sdk.sequences.get_reverse_encoding(encoding, bases_arr, base_to_index, complementary_base_dict)
The Genome DNA bases encoding is created such that the reverse encoding can be quickly computed.
- Parameters
encoding (numpy.ndarray)
bases_arr (list(str))
base_to_index (dict)
complementary_base_dict (dict)
- Returns
- Return type
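One way to see why the reverse encoding is cheap to compute: reversing the rows reverses the sequence, and permuting the columns swaps each base with its complement. The sketch below is a hypothetical illustration (`reverse_encoding_sketch` is not the library function), and it assumes that the "reverse encoding" means the encoding of the reverse complement.

```python
import numpy as np


def reverse_encoding_sketch(encoding, bases_arr, base_to_index, complementary_base_dict):
    """Illustrative reverse-complement of an L x N one-hot encoding."""
    # Column j of the result takes the column of the base complementary
    # to bases_arr[j].
    complement_columns = [
        base_to_index[complementary_base_dict[base]] for base in bases_arr
    ]
    # Reverse the rows (positions), then permute the columns (bases).
    return encoding[::-1][:, complement_columns]


bases_arr = ["A", "C", "G", "T"]
base_to_index = {"A": 0, "C": 1, "G": 2, "T": 3}
complementary_base_dict = {"A": "T", "C": "G", "G": "C", "T": "A"}

# One-hot encoding of "AACG".
encoding = np.eye(4, dtype=np.float32)[[0, 0, 1, 2]]
reverse = reverse_encoding_sketch(
    encoding, bases_arr, base_to_index, complementary_base_dict
)
print("".join(bases_arr[i] for i in reverse.argmax(axis=1)))  # CGTT
```

With the default ['A', 'C', 'G', 'T'] ordering, complementary bases sit in mirrored columns, so in this sketch the result equals `encoding[::-1, ::-1]`, a reversal along both axes.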