selene_sdk.targets

This module contains classes and methods for target features. Target feature classes define a way to access a “target feature”, such as a label or annotation on an input sequence.

Target

class selene_sdk.targets.Target[source]

Bases: object

The abstract base class for all target feature classes. Target feature classes define a way to access a “target feature”, such as a label or annotation on an input sequence.

abstract get_feature_data(*args, **kwargs)[source]

Retrieve the feature data for some coordinate.

GenomicFeatures

class selene_sdk.targets.GenomicFeatures(input_path, features, feature_thresholds=None)[source]

Bases: selene_sdk.targets.target.Target

Stores the dataset specifying sequence regions and features. Accepts a tabix-indexed *.bed file with the following columns, in order:

[chrom, start, end, strand, feature]

Note that chrom is interchangeable with any sort of region (e.g. a protein in a FAA file). Further, start is 0-based. Lastly, any additional columns following the five shown above will be ignored.
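The column layout above can be illustrated with a minimal parsing sketch. The helper name `parse_bed_line` and the sample row are hypothetical and are not part of selene_sdk; the sketch only shows how the first five tab-separated columns are read and how trailing columns are ignored.

```python
def parse_bed_line(line):
    """Hypothetical helper: split one tab-separated row into the five documented columns."""
    fields = line.rstrip("\n").split("\t")
    # Only the first five columns are used; anything after them is ignored.
    chrom, start, end, strand, feature = fields[:5]
    # start is 0-based; both coordinates are stored as integers.
    return chrom, int(start), int(end), strand, feature


# Illustrative row with one extra trailing column.
row = "chr1\t100\t250\t+\tCTCF\textra_column\n"
parsed = parse_bed_line(row)
print(parsed)
```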

Parameters
  • input_path (str) – Path to the tabix-indexed dataset. Note that for the file to be tabix-indexed, it must have been compressed with bgzip. Thus, input_path should be a *.gz file with a corresponding *.tbi file in the same directory.

  • features (list(str)) – The non-redundant list of genomic features (i.e. labels) that will be predicted.

  • feature_thresholds (float or dict or types.FunctionType or None) – Default is None. A genomic region is determined to be a positive sample if at least one genomic feature peak takes up a proportion of the region greater than or equal to the threshold specified for that feature.

    • None - No thresholds specified. All features found in a query region are annotated to that region.

    • float - A single threshold applies to all the features in the dataset.

    • dict - A dictionary mapping feature names (str) to threshold values (float), which assigns different thresholds to different features. If a feature’s threshold is not specified in the dictionary, the dictionary is assumed to contain a “default” key whose value is used as the threshold for that feature.

    • types.FunctionType - A function that takes a feature name as input and returns that feature’s threshold.
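The four accepted forms of feature_thresholds can be sketched as a single resolution step that yields either None or one threshold per feature. The helper `resolve_thresholds` and the feature names below are assumptions made for illustration; the library's internal logic may differ.

```python
from types import FunctionType


def resolve_thresholds(features, feature_thresholds=None):
    """Hypothetical helper: return a {feature: threshold} dict, or None."""
    if feature_thresholds is None:
        # No thresholds: every feature found in a query region is annotated to it.
        return None
    if isinstance(feature_thresholds, float):
        # A single threshold shared by all features.
        return {name: feature_thresholds for name in features}
    if isinstance(feature_thresholds, dict):
        # Features absent from the dict fall back to the "default" key.
        return {
            name: feature_thresholds[name]
            if name in feature_thresholds
            else feature_thresholds["default"]
            for name in features
        }
    if isinstance(feature_thresholds, FunctionType):
        # The function maps a feature name to its threshold.
        return {name: feature_thresholds(name) for name in features}
    raise TypeError("unsupported feature_thresholds type")


features = ["CTCF", "H3K4me3", "DNase"]  # illustrative feature names
from_float = resolve_thresholds(features, 0.5)
from_dict = resolve_thresholds(features, {"CTCF": 0.8, "default": 0.5})
from_func = resolve_thresholds(features, lambda name: 0.9 if name == "DNase" else 0.5)
```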

Variables
  • ~GenomicFeatures.data (tabix.open) – The data stored in a tabix-indexed *.bed file.

  • ~GenomicFeatures.n_features (int) – The number of distinct features.

  • ~GenomicFeatures.feature_index_dict (dict) – A dictionary mapping feature names (str) to indices (int), where the index is the position of the feature in features.

  • ~GenomicFeatures.index_feature_dict (dict) – A dictionary mapping indices (int) to feature names (str), where the index is the position of the feature in the input features.

  • ~GenomicFeatures.feature_thresholds (dict or None) –

    • dict - A dictionary mapping feature names (str) to thresholds (float), where the threshold is the minimum overlap that a feature annotation must have with a query region to be considered a positive example of that feature.

    • None - No threshold specifications. Assumes that all features returned by a tabix query are annotated to the query region.
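As a small illustration of how n_features, feature_index_dict, and index_feature_dict relate to the input features list, the following sketch builds all three from a hypothetical list of feature names. It mirrors the descriptions above rather than the library's actual constructor.

```python
features = ["CTCF", "H3K4me3", "DNase"]  # illustrative feature names

# Number of distinct features.
n_features = len(features)
# Feature name -> position of that feature in `features`.
feature_index_dict = {name: index for index, name in enumerate(features)}
# Position in `features` -> feature name (the inverse mapping).
index_feature_dict = {index: name for index, name in enumerate(features)}
```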

get_feature_data(chrom, start, end)[source]

For a sequence of length \(L = end - start\), return the features’ one-hot encoding corresponding to that region. Each of the \(L\) positions in the sequence has a binary vector of length n_features, in which the entry for a given genomic feature specifies whether that feature’s coordinates overlap with the position.

Parameters
  • chrom (str) – The name of the region (e.g. ‘1’, ‘2’, …, ‘X’, ‘Y’).

  • start (int) – The 0-based first position in the region.

  • end (int) – One past the 0-based last position in the region.

Returns

\(L \times N\) array, where \(L = end - start\) and \(N =\) self.n_features. Note that if we catch a tabix.TabixError, we assume the error was the result of there being no features present in the queried region and return a numpy.ndarray of zeros.

Return type

numpy.ndarray
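The shape and meaning of the returned array can be sketched with a dependency-free stand-in. The function `encode_region`, its `annotations` argument (a list of 0-based, half-open `(start, end, feature)` tuples standing in for the result of a tabix query), and the sample coordinates are all assumptions; the sketch uses nested lists where the method returns a numpy.ndarray, and it ignores feature thresholds.

```python
def encode_region(start, end, annotations, feature_index_dict):
    """Hypothetical stand-in: build an L x N binary matrix as nested lists."""
    length = end - start
    n_features = len(feature_index_dict)
    encoding = [[0] * n_features for _ in range(length)]
    for feat_start, feat_end, name in annotations:
        if name not in feature_index_dict:
            continue  # skip features that are not being predicted
        column = feature_index_dict[name]
        # Clip the annotation to the query region, then shift to row indices.
        first_row = max(feat_start, start) - start
        last_row = min(feat_end, end) - start
        for row in range(first_row, last_row):
            encoding[row][column] = 1
    return encoding


index = {"CTCF": 0, "H3K4me3": 1, "DNase": 2}  # illustrative features
hits = [(98, 102, "CTCF"), (103, 110, "DNase")]  # illustrative annotations
matrix = encode_region(100, 105, hits, index)
for row in matrix:
    print(row)
```

Here the query region covers positions 100 to 104, so the matrix has 5 rows and 3 columns. CTCF overlaps the first two positions and DNase the last two.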

is_positive(chrom, start, end)[source]

Determines whether the queried \([start, end)\) region of chrom contains any genomic features. If so, the query is considered positive.

Parameters
  • chrom (str) – The name of the region (e.g. ‘1’, ‘2’, …, ‘X’, ‘Y’).

  • start (int) – The 0-based first position in the region.

  • end (int) – One past the 0-based last position in the region.

Returns

True if this meets the criterion for a positive example, False otherwise. Note that if we catch a tabix.TabixError exception, we assume the error was the result of no features being present in the queried region and return False.

Return type

bool
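The positive-sample criterion can be sketched as an overlap test, under the assumption that a feature's overlap is compared against its threshold as a proportion of the query region's length. The function `region_is_positive`, its `annotations` argument (0-based, half-open `(start, end, feature)` tuples standing in for a tabix query result), and the sample values are hypothetical.

```python
def region_is_positive(start, end, annotations, thresholds=None):
    """Hypothetical stand-in for the positive-sample check."""
    region_length = end - start
    if region_length <= 0:
        return False
    for feat_start, feat_end, name in annotations:
        # Number of positions shared by the annotation and the query region.
        overlap = min(feat_end, end) - max(feat_start, start)
        if overlap <= 0:
            continue
        if thresholds is None:
            # No thresholds: any overlapping feature makes the region positive.
            return True
        if overlap / region_length >= thresholds[name]:
            return True
    return False


hits = [(40, 100, "CTCF")]  # illustrative annotation covering 60 of 100 positions
lenient = region_is_positive(0, 100, hits, {"CTCF": 0.5})
strict = region_is_positive(0, 100, hits, {"CTCF": 0.8})
no_features = region_is_positive(0, 100, [])
```

With these inputs, the CTCF peak covers 0.6 of the region, so it passes a threshold of 0.5 but not 0.8. An empty query result is treated as negative.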