selene_sdk.sequences

This module provides the types for representing biological sequences.

Sequence

class selene_sdk.sequences.Sequence[source]

Bases: object

The abstract base class for biological sequence classes.

abstract property BASES_ARR

This is an array with the alphabet (i.e. all possible symbols that may occur in a sequence). We expect that INDEX_TO_BASE[i]==BASES_ARR[i] is True for all valid i.

Returns

The array of all members of the alphabet.

Return type

numpy.ndarray, dtype=str

abstract property BASE_TO_INDEX

A dictionary mapping members of the alphabet (i.e. all possible symbols that can occur in a sequence) to integers.

Returns

The dictionary mapping the alphabet to integers.

Return type

dict

abstract property INDEX_TO_BASE

A dictionary mapping integers to members of the alphabet (i.e. all possible symbols that can occur in a sequence). We expect that INDEX_TO_BASE[i]==BASES_ARR[i] is True for all valid i.

Returns

The dictionary mapping integers to the alphabet.

Return type

dict
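The consistency expectation between BASES_ARR, BASE_TO_INDEX, and INDEX_TO_BASE can be checked mechanically. The following is a minimal sketch and not part of selene_sdk: alphabet_is_consistent is a hypothetical helper, and the constants are copied from the Genome class attributes documented in this module. Treating lowercase keys as equivalent to their uppercase symbols is an assumption based on the Genome BASE_TO_INDEX listing.

```python
def alphabet_is_consistent(bases_arr, base_to_index, index_to_base):
    """Return True if the three views of the alphabet agree (hypothetical helper)."""
    # INDEX_TO_BASE[i] == BASES_ARR[i] must hold for every valid index i.
    forward_ok = all(
        index_to_base.get(i) == base for i, base in enumerate(bases_arr)
    )
    # Every key of BASE_TO_INDEX must point at its own symbol, ignoring case
    # (assumption: lowercase keys such as 'a' are aliases for 'A').
    backward_ok = all(
        bases_arr[index] == symbol.upper()
        for symbol, index in base_to_index.items()
    )
    return forward_ok and backward_ok and len(index_to_base) == len(bases_arr)


# Values copied from the Genome class attributes.
BASES_ARR = ["A", "C", "G", "T"]
BASE_TO_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3, "a": 0, "c": 1, "g": 2, "t": 3}
INDEX_TO_BASE = {0: "A", 1: "C", 2: "G", 3: "T"}

consistent = alphabet_is_consistent(BASES_ARR, BASE_TO_INDEX, INDEX_TO_BASE)
```

A subclass that defines its own alphabet could run a check like this in a unit test to catch a mismatched ordering early.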

abstract property UNK_BASE

This is a base used to represent unknown positions. This is not the same as a character from outside the sequence’s alphabet. A character from outside the alphabet is an error. A position with an unknown base signifies that the position is one of the bases from the alphabet, but we are uncertain which.

Returns

The character representing an unknown base.

Return type

str
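The distinction between an unknown base and a character outside the alphabet can be illustrated with a small sketch. classify_symbol is a hypothetical helper, not a selene_sdk function. Raising ValueError for out-of-alphabet input is an assumption made for this illustration, since this section only states that such a character is an error.

```python
def classify_symbol(symbol, bases_arr, unk_base):
    """Label one character as 'known' or 'unknown'; raise for invalid input."""
    if symbol in bases_arr:
        return "known"
    if symbol == unk_base:
        # Some member of the alphabet occupies this position,
        # but we are uncertain which one.
        return "unknown"
    # Anything else lies outside the alphabet (assumption: signal with ValueError).
    raise ValueError(f"{symbol!r} is not in the alphabet {bases_arr}")


DNA_BASES = ["A", "C", "G", "T"]
labels = [classify_symbol(ch, DNA_BASES, "N") for ch in "ACGN"]
```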

abstract coords_in_bounds(*args, **kwargs)[source]

Checks if queried coordinates are valid.

Returns

True if the coordinates are in bounds, otherwise False.

Return type

bool

abstract classmethod encoding_to_sequence(encoding)[source]

Transforms the input numerical representation of a sequence into a string representation.

Parameters

encoding (numpy.ndarray, dtype=numpy.float32) – The \(L \times N\) encoding of the sequence, where \(L\) is the length of the sequence, and \(N\) is the size of the sequence type’s alphabet.

Returns

The sequence of bases decoded from the input array. This sequence will be of length \(L\).

Return type

str

abstract get_encoding_from_coords(*args, **kwargs)[source]

Extracts the numerical encoding for a sequence occurring at the given coordinates.

Returns

The \(L \times N\) encoding of the sequence occurring at the queried coordinates, where \(L\) is the length of the sequence, and \(N\) is the size of the sequence type’s alphabet. Behavior is undefined for invalid coordinates.

Return type

numpy.ndarray, dtype=numpy.float32

abstract get_sequence_from_coords(*args, **kwargs)[source]

Extracts a string representation of a sequence at the given coordinates.

Returns

The sequence of bases occurring at the queried coordinates. For valid coordinates, this sequence has length \(L\). Behavior is undefined for invalid coordinates.

Return type

str

abstract classmethod sequence_to_encoding(sequence)[source]

Transforms a biological sequence into a numerical representation.

Parameters

sequence (str) – The input sequence of characters.

Returns

The \(L \times N\) encoding of the sequence, where \(L\) is the length of the sequence, and \(N\) is the size of the sequence type’s alphabet.

Return type

numpy.ndarray, dtype=numpy.float32

Genome

class selene_sdk.sequences.Genome(input_path, blacklist_regions=None, bases_order=None, init_unpicklable=False)[source]

Bases: selene_sdk.sequences.sequence.Sequence

This class provides access to an organism’s genomic sequence.

This class supports retrieving parts of the sequence and converting these parts into their one-hot encodings. It is essentially a wrapper class around the pyfaidx.Fasta class.

Parameters
  • input_path (str) – Path to an indexed FASTA file, that is, a *.fasta file with a corresponding *.fai file in the same directory. This file should contain the target organism’s genome sequence.

  • blacklist_regions (str or None, optional) – Default is None. Path to a tabix-indexed list of regions from which we should not output sequences. This is used to ensure that we are not sampling from areas where we will never collect measurements. You can pass as input “hg19” or “hg38” to use the blacklist regions released by ENCODE. You can also pass in your own tabix-indexed .gz file.

  • bases_order (list(str) or None, optional) – Default is None (use the default base ordering of [‘A’, ‘C’, ‘G’, ‘T’]). Specify a different ordering of DNA bases for one-hot encoding.

  • init_unpicklable (bool, optional) – Default is False. When False, initialization is delayed until a relevant method is called, which allows the object to be pickled after instantiation. init_unpicklable must be False when multiprocessing is needed, e.g. with a DataLoader. Set init_unpicklable to True if you are using this class directly through Selene’s API and want to access class attributes without first calling a specific method in Genome.

Variables
  • ~Genome.genome (pyfaidx.Fasta) – The FASTA file containing the genome sequence.

  • ~Genome.chrs (list(str)) – The list of chromosome names.

  • ~Genome.len_chrs (dict) – A dictionary mapping the name of each chromosome in the file to its length.

BASES_ARR = ['A', 'C', 'G', 'T']

This is an array with the alphabet (i.e. all possible symbols that may occur in a sequence). We expect that INDEX_TO_BASE[i]==BASES_ARR[i] is True for all valid i.

BASE_TO_INDEX = {'A': 0, 'C': 1, 'G': 2, 'T': 3, 'a': 0, 'c': 1, 'g': 2, 't': 3}

A dictionary mapping members of the alphabet (i.e. all possible symbols that can occur in a sequence) to integers.

COMPLEMENTARY_BASE_DICT = {'A': 'T', 'C': 'G', 'G': 'C', 'N': 'N', 'T': 'A', 'a': 'T', 'c': 'G', 'g': 'C', 'n': 'N', 't': 'A'}

A dictionary mapping each base to its complementary base.
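As an illustration of how such a mapping is typically used, the sketch below computes the reverse complement of a DNA string. reverse_complement is a hypothetical helper written for this example, not a function exposed by selene_sdk. The dictionary is copied from the attribute above. Note that lowercase input maps to uppercase output.

```python
# Copied from Genome.COMPLEMENTARY_BASE_DICT.
COMPLEMENTARY_BASE_DICT = {
    "A": "T", "C": "G", "G": "C", "N": "N", "T": "A",
    "a": "T", "c": "G", "g": "C", "n": "N", "t": "A",
}


def reverse_complement(sequence):
    """Reverse the sequence and swap each base for its complement."""
    return "".join(COMPLEMENTARY_BASE_DICT[base] for base in reversed(sequence))


rc_upper = reverse_complement("ACGTN")
rc_lower = reverse_complement("acgt")
```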

INDEX_TO_BASE = {0: 'A', 1: 'C', 2: 'G', 3: 'T'}

A dictionary mapping integers to members of the alphabet (i.e. all possible symbols that can occur in a sequence). We expect that INDEX_TO_BASE[i]==BASES_ARR[i] is True for all valid i.

UNK_BASE = 'N'

This is a base used to represent unknown positions. This is not the same as a character from outside the sequence’s alphabet. A character from outside the alphabet is an error. A position with an unknown base signifies that the position is one of the bases from the alphabet, but we are uncertain which.

coords_in_bounds(chrom, start, end)[source]

Checks whether the queried region is within the bounds of the queried chromosome and does not overlap any blacklist region (if a blacklist was given).

Parameters
  • chrom (str) – The name of the chromosome, e.g. “chr1”.

  • start (int) – The 0-based start coordinate of the sequence.

  • end (int) – One past the 0-based last position in the sequence.

Returns

Whether we can retrieve a sequence from the bounds specified in the input.

Return type

bool
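The coordinate convention (0-based start, end one past the last position) can be sketched as a half-open interval check. This is a simplified illustration under stated assumptions, not the library’s implementation: region_in_bounds and the LEN_CHRS values are hypothetical, the blacklist check is omitted, and treating an empty interval (start >= end) as out of bounds is an assumption.

```python
def region_in_bounds(len_chrs, chrom, start, end):
    """Check a 0-based, half-open region [start, end) against chromosome lengths."""
    if chrom not in len_chrs:
        return False
    # The last valid position is len - 1, so end may equal the chromosome length.
    return 0 <= start < end <= len_chrs[chrom]


# Hypothetical chromosome lengths, in the shape of Genome.len_chrs.
LEN_CHRS = {"chr1": 1000}

whole_chrom_ok = region_in_bounds(LEN_CHRS, "chr1", 0, 1000)
past_end_ok = region_in_bounds(LEN_CHRS, "chr1", 990, 1001)
```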

classmethod encoding_to_sequence(encoding)[source]

Converts an input one-hot encoding to its DNA sequence.

Parameters

encoding (numpy.ndarray, dtype=numpy.float32) – An \(L \times 4\) one-hot encoding of the sequence, where \(L\) is the length of the output sequence.

Returns

The sequence of \(L\) nucleotides decoded from the input array.

Return type

str

get_chr_lens()[source]

Gets the name and length of each chromosome sequence in the file.

Returns

A list of tuples of the chromosome names and lengths.

Return type

list(tuple(str, int))

get_chrs()[source]

Gets the list of chromosome names.

Returns

A list of the chromosome names.

Return type

list(str)

get_encoding_from_coords(chrom, start, end, strand='+', pad=False)[source]

Gets the one-hot encoding of the genomic sequence at the queried coordinates.

Parameters
  • chrom (str) – The name of the chromosome or region, e.g. “chr1”.

  • start (int) – The 0-based start coordinate of the first position in the sequence.

  • end (int) – One past the 0-based last position in the sequence.

  • strand ({‘+’, ‘-’, ‘.’}, optional) – Default is ‘+’. The strand the sequence is located on. ‘.’ is treated as ‘+’.

  • pad (bool, optional) – Default is False. Pad the output sequence with ‘N’ if start and/or end are out of bounds to return a sequence of length end - start.

Returns

The \(L \times 4\) encoding of the sequence, where \(L = end - start\), unless chrom cannot be found in the input FASTA, start or end are out of bounds, or (if a blacklist exists) the region overlaps with a blacklist region. In these cases, it will return an empty encoding, that is, \(L = 0\) for the NumPy array returned.

Return type

numpy.ndarray, dtype=numpy.float32

Raises

ValueError – If the input char to strand is not one of the specified choices. (Raised in the call to self.get_sequence_from_coords)

get_encoding_from_coords_check_unk(chrom, start, end, strand='+', pad=False)[source]

Gets the one-hot encoding of the genomic sequence at the queried coordinates and checks whether the sequence contains unknown base(s).

Parameters
  • chrom (str) – The name of the chromosome or region, e.g. “chr1”.

  • start (int) – The 0-based start coordinate of the first position in the sequence.

  • end (int) – One past the 0-based last position in the sequence.

  • strand ({‘+’, ‘-’, ‘.’}, optional) – Default is ‘+’. The strand the sequence is located on. ‘.’ is treated as ‘+’.

  • pad (bool, optional) – Default is False. Pad the output sequence with ‘N’ if start and/or end are out of bounds to return a sequence of length end - start.

Returns

  • tuple[0] is the \(L \times 4\) encoding of the sequence containing data of numpy.float32 type, where \(L = end - start\), unless chrom cannot be found in the input FASTA, start or end are out of bounds, or (if a blacklist exists) the region overlaps with a blacklist region. In these cases, it will return an empty encoding, that is, \(L = 0\) for the NumPy array returned.

  • tuple[1] is the boolean value that indicates whether the sequence contains any unknown base(s) specified in self.UNK_BASE.

Return type

tuple(numpy.ndarray, bool)

Raises

ValueError – If the input char to strand is not one of the specified choices. (Raised in the call to self.get_sequence_from_coords)

get_sequence_from_coords(chrom, start, end, strand='+', pad=False)[source]

Gets the queried chromosome’s sequence at the input coordinates.

Parameters
  • chrom (str) – The name of the chromosome, e.g. “chr1”.

  • start (int) – The 0-based start coordinate of the sequence.

  • end (int) – One past the 0-based last position in the sequence.

  • strand ({‘+’, ‘-’, ‘.’}, optional) – Default is ‘+’. The strand the sequence is located on. ‘.’ is treated as ‘+’.

  • pad (bool, optional) – Default is False. Pad the output sequence with ‘N’ if start and/or end are out of bounds to return a sequence of length end - start.

Returns

The genomic sequence of length \(L\), where \(L = end - start\). If pad is False and start and/or end are out of bounds, an empty string is returned. An empty string is also returned if chrom cannot be found in the input FASTA file. Otherwise, the sequence is returned, padded at the start and/or end where appropriate.

Return type

str

Raises

ValueError – If the input char to strand is not one of the specified choices.
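The documented behavior for pad, strand, and missing chromosomes can be mimicked on an in-memory string. The sketch below is an illustration only, not the selene_sdk implementation: sequence_from_coords and TOY_GENOME are hypothetical, a plain dict stands in for the indexed FASTA file, blacklist handling is omitted, and returning an empty string for a region that does not overlap the chromosome at all is an assumption.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def sequence_from_coords(chrom_seqs, chrom, start, end, strand="+", pad=False):
    """Return the [start, end) slice of a chromosome, padding with 'N' if asked."""
    if strand not in ("+", "-", "."):
        raise ValueError(f"strand must be one of '+', '-', '.'; got {strand!r}")
    if chrom not in chrom_seqs:
        return ""
    chrom_seq = chrom_seqs[chrom]
    chrom_len = len(chrom_seq)
    # Assumption: a region with no overlap with the chromosome yields "".
    if start >= end or end <= 0 or start >= chrom_len:
        return ""
    out_of_bounds = start < 0 or end > chrom_len
    if out_of_bounds and not pad:
        return ""
    # Pad so that the result always has length end - start.
    left_pad = "N" * max(0, -start)
    right_pad = "N" * max(0, end - chrom_len)
    sequence = left_pad + chrom_seq[max(0, start):min(end, chrom_len)] + right_pad
    if strand == "-":
        # The minus strand is the reverse complement of the plus strand.
        sequence = "".join(COMPLEMENT[base] for base in reversed(sequence))
    return sequence


# Hypothetical toy genome standing in for a FASTA file.
TOY_GENOME = {"chr1": "ACGTACGT"}

plus = sequence_from_coords(TOY_GENOME, "chr1", 0, 3)
minus = sequence_from_coords(TOY_GENOME, "chr1", 0, 3, strand="-")
padded = sequence_from_coords(TOY_GENOME, "chr1", -2, 3, pad=True)
```

With pad=True the padded result has length end - start (here 5), while the same query with pad=False returns an empty string.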

classmethod sequence_to_encoding(sequence)[source]

Converts an input sequence to its one-hot encoding.

Parameters

sequence (str) – A nucleotide sequence of length \(L\)

Returns

The \(L \times 4\) one-hot encoding of the sequence.

Return type

numpy.ndarray, dtype=numpy.float32

Proteome

class selene_sdk.sequences.Proteome(input_path)[source]

Bases: selene_sdk.sequences.sequence.Sequence

Provides access to an organism’s proteomic sequence.

It supports retrieving parts of the sequence and converting these parts into their one-hot encodings. It is essentially a wrapper class around the pyfaidx.Fasta class.

Parameters

input_path (str) – Path to an indexed FASTA file containing amino acid sequences, that is, a *.fasta file with a corresponding *.fai file in the same directory. File should contain the sequences from which training examples will be created.

Variables
  • ~Proteome.proteome (pyfaidx.Fasta) – The FASTA or FAA file containing the protein sequences.

  • ~Proteome.prots (list(str)) – The list of protein names.

  • ~Proteome.len_prots (dict) – A dictionary mapping the name of each protein in the proteome to the length of its sequence.

BASES_ARR = ['A', 'R', 'N', 'D', 'C', 'E', 'Q', 'G', 'H', 'I', 'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V']

This is an array with the alphabet (i.e. all possible symbols that may occur in a sequence). We expect that INDEX_TO_BASE[i]==BASES_ARR[i] is True for all valid i.

BASE_TO_INDEX = {'A': 0, 'C': 4, 'D': 3, 'E': 5, 'F': 13, 'G': 7, 'H': 8, 'I': 9, 'K': 11, 'L': 10, 'M': 12, 'N': 2, 'P': 14, 'Q': 6, 'R': 1, 'S': 15, 'T': 16, 'V': 19, 'W': 17, 'Y': 18}

A dictionary mapping members of the alphabet (i.e. all possible symbols that can occur in a sequence) to integers.

INDEX_TO_BASE = {0: 'A', 1: 'R', 2: 'N', 3: 'D', 4: 'C', 5: 'E', 6: 'Q', 7: 'G', 8: 'H', 9: 'I', 10: 'L', 11: 'K', 12: 'M', 13: 'F', 14: 'P', 15: 'S', 16: 'T', 17: 'W', 18: 'Y', 19: 'V'}

A dictionary mapping integers to members of the alphabet (i.e. all possible symbols that can occur in a sequence). We expect that INDEX_TO_BASE[i]==BASES_ARR[i] is True for all valid i.
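The two dictionaries above follow directly from the ordering in BASES_ARR. The sketch below derives them from the list and is only an illustration of that relationship, not how the class necessarily builds its attributes.

```python
# Copied from Proteome.BASES_ARR.
BASES_ARR = ["A", "R", "N", "D", "C", "E", "Q", "G", "H", "I",
             "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"]

# Position in BASES_ARR -> amino acid symbol.
INDEX_TO_BASE = dict(enumerate(BASES_ARR))
# Amino acid symbol -> position in BASES_ARR (the inverse mapping).
BASE_TO_INDEX = {base: index for index, base in INDEX_TO_BASE.items()}

valine_index = BASE_TO_INDEX["V"]
```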

UNK_BASE = 'X'

This is a base used to represent unknown positions. This is not the same as a character from outside the sequence’s alphabet. A character from outside the alphabet is an error. A position with an unknown base signifies that the position is one of the bases from the alphabet, but we are uncertain which.

coords_in_bounds(prot, start, end)[source]

Checks whether the queried coordinates are valid.

Parameters
  • prot (str) – The name of the protein, e.g. “YFP”.

  • start (int) – The 0-based start coordinate of the first position in the sequence.

  • end (int) – One past the 0-based last position in the sequence.

Returns

A boolean indicating whether we can retrieve a sequence from the queried coordinates.

Return type

bool

classmethod encoding_to_sequence(encoding)[source]

Converts an input one-hot encoding to its amino acid sequence.

Parameters

encoding (numpy.ndarray, dtype=numpy.float32) – The \(L \times 20\) encoding of the sequence, where \(L\) is the length of the output amino acid sequence.

Returns

The sequence of \(L\) amino acids decoded from the input array.

Return type

str

get_encoding_from_coords(prot, start, end)[source]

Gets the one-hot encoding of the protein’s sequence at the input coordinates.

Parameters
  • prot (str) – The name of the protein, e.g. “YFP”.

  • start (int) – The 0-based start coordinate of the first position in the sequence.

  • end (int) – One past the 0-based last position in the sequence.

Returns

The \(L \times 20\) encoding of the sequence, where \(L = end - start\).

Return type

numpy.ndarray, dtype=numpy.float32

get_prot_lens()[source]

Gets the name and length of each protein sequence in the file.

Returns

A list of tuples of protein names and protein lengths.

Return type

list(tuple(str, int))

get_prots()[source]

Gets the list of protein names.

Returns

A list of the protein names.

Return type

list(str)

get_sequence_from_coords(prot, start, end)[source]

Gets the queried protein sequence at the input coordinates.

Parameters
  • prot (str) – The protein name, e.g. “YFP”.

  • start (int) – The 0-based start coordinate of the first position in the sequence.

  • end (int) – One past the 0-based last position in the sequence.

Returns

The sequence of \(L\) amino acids at the specified coordinates, where \(L = end - start\).

Return type

str

classmethod sequence_to_encoding(sequence)[source]

Converts an input sequence to its one-hot encoding.

Parameters

sequence (str) – The input sequence of amino acids of length \(L\).

Returns

The \(L \times 20\) one-hot encoding of the sequence, where \(L\) is the length of the input sequence.

Return type

numpy.ndarray, dtype=numpy.float32

sequence_to_encoding

selene_sdk.sequences.sequence_to_encoding(sequence, base_to_index, bases_arr)[source]

Converts an input sequence to its one-hot encoding.

Parameters
  • sequence (str) – The input sequence of length \(L\).

  • base_to_index (dict) – A dict that maps input characters to indices, where the indices specify the column to assign as 1 when a base exists at the current position in the input. If a base does not exist at the current position in the input, its corresponding column in the encoding is set to zero. Note that the rows correspond directly to the positions in the input sequence. For instance, with a genome you would have each of [‘A’, ‘C’, ‘G’, ‘T’] as keys, mapping to values of [0, 1, 2, 3].

  • bases_arr (list(str)) – The characters in the sequence’s alphabet.

Returns

The \(L \times N\) encoding of the sequence, where \(L\) is the length of the input sequence and \(N\) is the size of the sequence alphabet.

Return type

numpy.ndarray, dtype=numpy.float32
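The row and column layout described above can be sketched in a few lines. This is a dependency-free illustration, not the library’s implementation: one_hot_encode is a hypothetical helper that returns nested lists rather than a numpy.ndarray, and leaving the row for a character missing from base_to_index as all zeros is an assumption, since this section does not state how such characters are encoded.

```python
def one_hot_encode(sequence, base_to_index, bases_arr):
    """Build an L x N one-hot matrix (nested lists) for the input sequence."""
    n_bases = len(bases_arr)
    # One row per position in the sequence, one column per alphabet symbol.
    encoding = [[0.0] * n_bases for _ in sequence]
    for position, symbol in enumerate(sequence):
        column = base_to_index.get(symbol)
        if column is not None:
            encoding[position][column] = 1.0
        # Assumption: symbols not in base_to_index keep an all-zero row.
    return encoding


# Copied from Genome.BASE_TO_INDEX and Genome.BASES_ARR.
BASE_TO_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3, "a": 0, "c": 1, "g": 2, "t": 3}
BASES_ARR = ["A", "C", "G", "T"]

encoded = one_hot_encode("ACgT", BASE_TO_INDEX, BASES_ARR)
```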

encoding_to_sequence

selene_sdk.sequences.encoding_to_sequence(encoding, bases_arr, unk_base)[source]

Converts a sequence one-hot encoding to its string sequence.

Parameters
  • encoding (numpy.ndarray, dtype=numpy.float32) – The \(L \times N\) encoding of the sequence, where \(L\) is the length of the sequence, and \(N\) is the size of the sequence alphabet.

  • bases_arr (list(str)) – A list of the bases in the sequence’s alphabet that corresponds to the correct columns for those bases in the encoding.

  • unk_base (str) – The base corresponding to the “unknown” character in this encoding. See selene_sdk.sequences.Sequence.UNK_BASE for more information.

Returns

The sequence of \(L\) characters decoded from the input array.

Return type

str
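Decoding can be sketched as picking, for each row, the column with the largest value. The helper below is hypothetical and works on nested lists rather than a numpy.ndarray. Emitting unk_base whenever a row has no unique maximum (for example an all-zero or uniform row) is an assumption about how ambiguity is resolved, not a statement of the library’s behavior.

```python
def decode_one_hot(encoding, bases_arr, unk_base):
    """Turn an L x N matrix (nested lists) back into a string of length L."""
    decoded = []
    for row in encoding:
        row_max = max(row)
        max_columns = [i for i, value in enumerate(row) if value == row_max]
        if len(max_columns) == 1:
            decoded.append(bases_arr[max_columns[0]])
        else:
            # Assumption: ties mean the base cannot be determined.
            decoded.append(unk_base)
    return "".join(decoded)


example_encoding = [
    [1.0, 0.0, 0.0, 0.0],      # A
    [0.0, 0.0, 0.0, 1.0],      # T
    [0.25, 0.25, 0.25, 0.25],  # no unique maximum
    [0.0, 0.0, 0.0, 0.0],      # no unique maximum
]
decoded_sequence = decode_one_hot(example_encoding, ["A", "C", "G", "T"], "N")
```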

get_reverse_encoding

selene_sdk.sequences.get_reverse_encoding(encoding, bases_arr, base_to_index, complementary_base_dict)[source]

The Genome DNA bases encoding is created such that the reverse encoding can be quickly computed.

Parameters
  • encoding (numpy.ndarray)

  • bases_arr (list(str))

  • base_to_index (dict)

  • complementary_base_dict (dict)

Returns

Return type

numpy.ndarray
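One way to see why the reverse encoding can be computed quickly: with the default ordering [‘A’, ‘C’, ‘G’, ‘T’], complementary bases sit at mirrored column indices (A and T at 0 and 3, C and G at 1 and 2), so reversing both the rows and the columns of a one-hot matrix yields the encoding of the reverse complement. The sketch below demonstrates this on nested lists. reverse_complement_encoding is a hypothetical helper that assumes the default ordering, and it is not a description of how get_reverse_encoding itself is implemented.

```python
def reverse_complement_encoding(encoding):
    """Flip rows (reverse the sequence) and columns (complement each base).

    Assumes columns are ordered A, C, G, T.
    """
    return [list(reversed(row)) for row in reversed(encoding)]


# One-hot rows for the sequence "AAC" in A, C, G, T column order.
aac_encoding = [
    [1.0, 0.0, 0.0, 0.0],  # A
    [1.0, 0.0, 0.0, 0.0],  # A
    [0.0, 1.0, 0.0, 0.0],  # C
]
# The reverse complement of "AAC" is "GTT".
gtt_encoding = reverse_complement_encoding(aac_encoding)
```

For a custom bases_order this mirror symmetry does not necessarily hold, which is presumably why the function also takes base_to_index and complementary_base_dict.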