selene_sdk.sequences¶

This module provides the types for representing biological sequences.

Sequence¶

class selene_sdk.sequences.Sequence[source]¶

Bases: object

The abstract base class for biological sequence classes.

abstract property BASES_ARR¶

This is an array with the alphabet (i.e. all possible symbols that may occur in a sequence). We expect that INDEX_TO_BASE[i]==BASES_ARR[i] is True for all valid i.

Returns: The array of all members of the alphabet.
Return type: numpy.ndarray, dtype=str

abstract property BASE_TO_INDEX¶

A dictionary mapping members of the alphabet (i.e. all possible symbols that can occur in a sequence) to integers.

Returns: The dictionary mapping the alphabet to integers.
Return type: dict

abstract property INDEX_TO_BASE¶

A dictionary mapping integers to members of the alphabet (i.e. all possible symbols that can occur in a sequence). We expect that INDEX_TO_BASE[i]==BASES_ARR[i] is True for all valid i.

Returns: The dictionary mapping integers to the alphabet.
Return type: dict

abstract property UNK_BASE¶

This is a base used to represent unknown positions. This is not the same as a character from outside the sequence’s alphabet. A character from outside the alphabet is an error. A position with an unknown base signifies that the position is one of the bases from the alphabet, but we are uncertain which.

Returns: The character representing an unknown base.
Return type: str

abstract coords_in_bounds(*args, **kwargs)[source]¶

Checks if queried coordinates are valid.

Returns: True if the coordinates are in bounds, otherwise False.
Return type: bool

abstract classmethod encoding_to_sequence(encoding)[source]¶

Transforms the input numerical representation of a sequence into a string representation.

Parameters: encoding (numpy.ndarray, dtype=numpy.float32) – The \(L \times N\) encoding of the sequence, where \(L\) is the length of the sequence, and \(N\) is the size of the sequence type’s alphabet.
Returns: The sequence of bases decoded from the input array. This sequence will be of length \(L\).
Return type: str

abstract get_encoding_from_coords(*args, **kwargs)[source]¶

Extracts the numerical encoding for a sequence occurring at the given coordinates.

Returns: The \(L \times N\) encoding of the sequence occuring at queried coordinates, where \(L\) is the length of the sequence, and \(N\) is the size of the sequence type’s alphabet. Behavior is undefined for invalid coordinates.
Return type: numpy.ndarray, dtype=numpy.float32

abstract get_sequence_from_coords(*args, **kwargs)[source]¶

Extracts a string representation of a sequence at the given coordinates.

Returns: The sequence of bases occuring at the queried coordinates. This sequence will be of length \(L\) normally, but only if the coordinates are valid. Behavior is undefined for invalid coordinates.
Return type: str

abstract classmethod sequence_to_encoding(sequence)[source]¶

Transforms a biological sequence into a numerical representation.

Parameters: sequence (str) – The input sequence of characters.
Returns: The \(L \times N\) encoding of the sequence, where \(L\) is the length of the sequence, and \(N\) is the size of the sequence type’s alphabet.
Return type: numpy.ndarray, dtype=numpy.float32

Genome¶

class selene_sdk.sequences.Genome(input_path, blacklist_regions=None, bases_order=None)[source]¶

Bases: selene_sdk.sequences.sequence.Sequence

This class provides access to an organism’s genomic sequence.

This class supports retrieving parts of the sequence and converting these parts into their one-hot encodings. It is essentially a wrapper class around the pyfaidx.Fasta class.

Parameters

input_path (str) – Path to an indexed FASTA file, that is, a *.fasta file with a corresponding *.fai file in the same directory. This file should contain the target organism’s genome sequence.
blacklist_regions (str or None, optional) – Default is None. Path to a tabix-indexed list of regions from which we should not output sequences. This is used to ensure that we are not sampling from areas where we will never collect measurements. You can pass as input “hg19” or “hg38” to use the blacklist regions released by ENCODE. You can also pass in your own tabix-indexed .gz file.
bases_order (list(str) or None, optional) – Default is None (use the default base ordering of [‘A’, ‘C’, ‘G’, ‘T’]). Specify a different ordering of DNA bases for one-hot encoding.

Variables

~Genome.genome (pyfaidx.Fasta) – The FASTA file containing the genome sequence.
~Genome.chrs (list(str)) – The list of chromosome names.
~Genome.len_chrs (dict) – A dictionary mapping the names of each chromosome in the file to the length of said chromosome.

BASES_ARR = ['A', 'C', 'G', 'T']¶: This is an array with the alphabet (i.e. all possible symbols that may occur in a sequence). We expect that INDEX_TO_BASE[i]==BASES_ARR[i] is True for all valid i.

BASE_TO_INDEX = {'A': 0, 'C': 1, 'G': 2, 'T': 3, 'a': 0, 'c': 1, 'g': 2, 't': 3}¶: A dictionary mapping members of the alphabet (i.e. all possible symbols that can occur in a sequence) to integers.

COMPLEMENTARY_BASE_DICT = {'A': 'T', 'C': 'G', 'G': 'C', 'N': 'N', 'T': 'A', 'a': 'T', 'c': 'G', 'g': 'C', 'n': 'N', 't': 'A'}¶: A dictionary mapping each base to its complementary base.

INDEX_TO_BASE = {0: 'A', 1: 'C', 2: 'G', 3: 'T'}¶: A dictionary mapping integers to members of the alphabet (i.e. all possible symbols that can occur in a sequence). We expect that INDEX_TO_BASE[i]==BASES_ARR[i] is True for all valid i.

UNK_BASE = 'N'¶: This is a base used to represent unknown positions. This is not the same as a character from outside the sequence’s alphabet. A character from outside the alphabet is an error. A position with an unknown base signifies that the position is one of the bases from the alphabet, but we are uncertain which.

coords_in_bounds(chrom, start, end)[source]¶

Check if the region we want to query is within the bounds of the queried chromosome and non-overlapping with blacklist regions (if given).

Parameters

chrom (str) – The name of the chromosomes, e.g. “chr1”.
start (int) – The 0-based start coordinate of the sequence.
end (int) – One past the 0-based last position in the sequence.

Returns

Whether we can retrieve a sequence from the bounds specified in the input.

Return type

bool

classmethod encoding_to_sequence(encoding)[source]¶

Converts an input one-hot encoding to its DNA sequence.

Parameters: encoding (numpy.ndarray, dtype=numpy.float32) – An \(L \times 4\) one-hot encoding of the sequence, where \(L\) is the length of the output sequence.
Returns: The sequence of \(L\) nucleotides decoded from the input array.
Return type: str

get_chr_lens()[source]¶

Gets the name and length of each chromosome sequence in the file.

Returns: A list of tuples of the chromosome names and lengths.
Return type: list(tuple(str, int))

get_chrs()[source]¶

Gets the list of chromosome names.

Returns: A list of the chromosome names.
Return type: list(str)

get_encoding_from_coords(chrom, start, end, strand='+', pad=False)[source]¶

Gets the one-hot encoding of the genomic sequence at the queried coordinates.

Parameters

chrom (str) – The name of the chromosome or region, e.g. “chr1”.
start (int) – The 0-based start coordinate of the first position in the sequence.
end (int) – One past the 0-based last position in the sequence.
strand ({‘+’, ‘-‘, ‘.’}, optional) – Default is ‘+’. The strand the sequence is located on. ‘.’ is treated as ‘+’.
pad (bool, optional) – Default is False. Pad the output sequence with ‘N’ if start and/or end are out of bounds to return a sequence of length end - start.

Returns

The \(L \times 4\) encoding of the sequence, where \(L = end - start\), unless chrom cannot be found in the input FASTA, start or end are out of bounds, or (if a blacklist exists) the region overlaps with a blacklist region. In these cases, it will return an empty encoding–that is, L = 0 for the NumPy array returned.

Return type

numpy.ndarray, dtype=numpy.float32

Raises

ValueError – If the input char to strand is not one of the specified choices. (Raised in the call to self.get_sequence_from_coords)

get_encoding_from_coords_check_unk(chrom, start, end, strand='+', pad=False)[source]¶

Gets the one-hot encoding of the genomic sequence at the queried coordinates and check whether the sequence contains unknown base(s).

Parameters

chrom (str) – The name of the chromosome or region, e.g. “chr1”.
start (int) – The 0-based start coordinate of the first position in the sequence.
end (int) – One past the 0-based last position in the sequence.
strand ({‘+’, ‘-‘, ‘.’}, optional) – Default is ‘+’. The strand the sequence is located on. ‘.’ is treated as ‘+’.
pad (bool, optional) – Default is False. Pad the output sequence with ‘N’ if start and/or end are out of bounds to return a sequence of length end - start.

Returns

tuple[0] is the \(L \times 4\) encoding of the sequence

containing data of numpy.float32 type, where \(L = end - start\), unless chrom cannot be found in the input FASTA, start or end are out of bounds, or (if a blacklist exists) the region overlaps with a blacklist region. In these cases, it will return an empty encoding–that is, L = 0 for the NumPy array returned. * tuple[1] is the boolean value that indicates whether the sequence contains any unknown base(s) specified in self.UNK_BASE

Return type

tuple(numpy.ndarray, bool)

Raises

ValueError – If the input char to strand is not one of the specified choices. (Raised in the call to self.get_sequence_from_coords)

get_sequence_from_coords(chrom, start, end, strand='+', pad=False)[source]¶

Gets the queried chromosome’s sequence at the input coordinates.

Parameters

chrom (str) – The name of the chromosomes, e.g. “chr1”.
start (int) – The 0-based start coordinate of the sequence.
end (int) – One past the 0-based last position in the sequence.
strand ({‘+’, ‘-‘, ‘.’}, optional) – Default is ‘+’. The strand the sequence is located on. ‘.’ is treated as ‘.’.
pad (bool, optional) – Default is False. Pad the output sequence with ‘N’ if start and/or end are out of bounds to return a sequence of length end - start.

Returns

The genomic sequence of length \(L\) where \(L = end - start\). If pad is False and one/both of start and end are out of bounds, will return an empty string. Also returns an empty string if chrom cannot be found in the input FASTA file. Otherwise, will return the sequence with padding at the start/end if appropriate.

Return type

str

Raises

ValueError – If the input char to strand is not one of the specified choices.

classmethod sequence_to_encoding(sequence)[source]¶

Converts an input sequence to its one-hot encoding.

Parameters: sequence (str) – A nucleotide sequence of length \(L\)
Returns: The \(L \times 4\) one-hot encoding of the sequence.
Return type: numpy.ndarray, dtype=numpy.float32

Proteome¶

class selene_sdk.sequences.Proteome(input_path)[source]¶

Bases: selene_sdk.sequences.sequence.Sequence

Provides access to an organism’s proteomic sequence.

It supports retrieving parts of the sequence and converting these parts into their one-hot encodings. It is essentially a wrapper class around the pyfaidx.Fasta class.

Parameters

input_path (str) – Path to an indexed FASTA file containing amino acid sequences, that is, a *.fasta file with a corresponding *.fai file in the same directory. File should contain the sequences from which training examples will be created.

Variables

~Proteome.proteome (pyfaidx.Fasta) – The FASTA or FAA file containing the protein sequences.
~Proteome.prots (list(str)) – The list of protein names.
~Proteome.len_prots (dict) – A dictionary that maps protein names to the lengths, and does so for all protein sequences in the proteome.

BASES_ARR = ['A', 'R', 'N', 'D', 'C', 'E', 'Q', 'G', 'H', 'I', 'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V']¶: This is an array with the alphabet (i.e. all possible symbols that may occur in a sequence). We expect that INDEX_TO_BASE[i]==BASES_ARR[i] is True for all valid i.

BASE_TO_INDEX = {'A': 0, 'C': 4, 'D': 3, 'E': 5, 'F': 13, 'G': 7, 'H': 8, 'I': 9, 'K': 11, 'L': 10, 'M': 12, 'N': 2, 'P': 14, 'Q': 6, 'R': 1, 'S': 15, 'T': 16, 'V': 19, 'W': 17, 'Y': 18}¶: A dictionary mapping members of the alphabet (i.e. all possible symbols that can occur in a sequence) to integers.

INDEX_TO_BASE = {0: 'A', 1: 'R', 2: 'N', 3: 'D', 4: 'C', 5: 'E', 6: 'Q', 7: 'G', 8: 'H', 9: 'I', 10: 'L', 11: 'K', 12: 'M', 13: 'F', 14: 'P', 15: 'S', 16: 'T', 17: 'W', 18: 'Y', 19: 'V'}¶: A dictionary mapping integers to members of the alphabet (i.e. all possible symbols that can occur in a sequence). We expect that INDEX_TO_BASE[i]==BASES_ARR[i] is True for all valid i.

UNK_BASE = 'X'¶: This is a base used to represent unknown positions. This is not the same as a character from outside the sequence’s alphabet. A character from outside the alphabet is an error. A position with an unknown base signifies that the position is one of the bases from the alphabet, but we are uncertain which.

coords_in_bounds(prot, start, end)[source]¶

Check if the coordinates we want to query is valid.

Parameters

prot (str) – The name of the protein, e.g. “YFP”.
start (int) – The 0-based start coordinate of the first position in the sequence.
end (int) – One past the 0-based last position in the sequence.

Returns

A boolean indicating whether we can retrieve a sequence from the queried coordinates.

Return type

bool

classmethod encoding_to_sequence(encoding)[source]¶

Converts an input one-hot encoding to its amino acid sequence.

Parameters: encoding (numpy.ndarray, dtype=numpy.float32) – The \(L \times 20\) encoding of the sequence, where \(L\) is the length of the output amino acid sequence.
Returns: The sequence of \(L\) amino acids decoded from the input array.
Return type: str

get_encoding_from_coords(prot, start, end)[source]¶

Gets the one-hot encoding of the protein’s sequence at the input coordinates.

Parameters

prot (str) – The name of the protein, e.g. “YFP”.
start (int) – The 0-based start coordinate of the first position in the sequence.
end (int) – One past the 0-based last position in the sequence.

Returns

The \(L \times 20\) encoding of the sequence, where \(L = end - start\).

Return type

numpy.ndarray, dtype=numpy.float32

get_prot_lens()[source]¶

Gets the name and length of each protein sequence in the file.

Returns: A list of tuples of protein names and protein lengths.
Return type: list(tuple(str, int))

get_prots()[source]¶

Gets the list of protein names.

Returns: A list of the protein names.
Return type: list(str)

get_sequence_from_coords(prot, start, end)[source]¶

Gets the queried protein sequence at the input coordinates.

Parameters

prot (str) – The protein name, e.g. “YFP”.
start (int) – The 0-based start coordinate of the first position in the sequence.
end (int) – One past the 0-based last position in the sequence.

Returns

The sequence of \(L\) amino acids at the specified coordinates, where \(L = end - start\).

Return type

str

classmethod sequence_to_encoding(sequence)[source]¶

Converts an input sequence to its one-hot encoding.

Parameters: sequence (str) – The input sequence of amino acids of length \(L\).
Returns: The \(L \times 20\) array, where L was the length of the input sequence.
Return type: numpy.ndarray, dtype=numpy.float32

sequence_to_encoding¶

selene_sdk.sequences.sequence_to_encoding(sequence, base_to_index, bases_arr)[source]¶

Converts an input sequence to its one-hot encoding.

Parameters

sequence (str) – The input sequence of length \(L\).
base_to_index (dict) – A dict that maps input characters to indices, where the indices specify the column to assign as 1 when a base exists at the current position in the input. If a base does not exist at the current position in the input, it’s corresponding column in the encoding is set as zero. Note that the rows correspond directly to the positions in the input sequence. For instance, with a a genome you would have each of [‘A’, ‘C’, ‘G’, ‘T’] as keys, mapping to values of [0, 1, 2, 3].
bases_arr (list(str)) – The characters in the sequence’s alphabet.

Returns

The \(L \times N\) encoding of the sequence, where \(L\) is the length of the input sequence and \(N\) is the size of the sequence alphabet.

Return type

numpy.ndarray, dtype=numpy.float32

encoding_to_sequence¶

selene_sdk.sequences.encoding_to_sequence(encoding, bases_arr, unk_base)[source]¶

Converts a sequence one-hot encoding to its string sequence.

Parameters

encoding (numpy.ndarray, dtype=numpy.float32) – The \(L \times N\) encoding of the sequence, where \(L\) is the length of the sequence, and \(N\) is the size of the sequence alphabet.
bases_arr (list(str)) – A list of the bases in the sequence’s alphabet that corresponds to the correct columns for those bases in the encoding.
unk_base (str) – The base corresponding to the “unknown” character in this encoding. See selene_sdk.sequences.Sequence.UNK_BASE for more information.

Returns

The sequence of \(L\) characters decoded from the input array.

Return type

str

get_reverse_encoding¶

selene_sdk.sequences.get_reverse_encoding(encoding, bases_arr, base_to_index, complementary_base_dict)[source]¶

The Genome DNA bases encoding is created such that the reverse encoding can be quickly computed.

Parameters

encoding (numpy.ndarray)
bases_arr (list(str))
base_to_index (dict)
complementary_base_dict (dict)

Returns

Return type

numpy.ndarray