selene_sdk.targets
This module contains classes and methods for working with target features. These classes define a way to access a “target feature”, such as a label or annotation on an input sequence.
Target
GenomicFeatures
-
class selene_sdk.targets.GenomicFeatures(input_path, features, feature_thresholds=None)
Bases: selene_sdk.targets.target.Target
Stores the dataset specifying sequence regions and features. Accepts a tabix-indexed *.bed file with the following columns, in order:
[chrom, start, end, strand, feature]
Note that chrom is interchangeable with any sort of region (e.g. a protein in a FAA file). Further, start is 0-based. Lastly, any additional columns following the five shown above will be ignored.
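To make the column layout concrete, here is a minimal sketch of reading one row of such a file. It assumes tab-delimited text (which tabix indexing expects); `BedRecord` and `parse_bed_line` are hypothetical names used only for illustration and are not part of selene_sdk.

```python
from typing import NamedTuple


class BedRecord(NamedTuple):
    """Hypothetical container for the five columns described above."""
    chrom: str
    start: int   # 0-based
    end: int
    strand: str
    feature: str


def parse_bed_line(line):
    """Parse one tab-delimited row, ignoring any columns after the fifth."""
    fields = line.rstrip("\n").split("\t")
    chrom, start, end, strand, feature = fields[:5]
    return BedRecord(chrom, int(start), int(end), strand, feature)


record = parse_bed_line("chr1\t100\t250\t+\tCTCF\tignored_column\n")
```

The sixth column in the sample row is dropped, mirroring the statement that additional columns are ignored.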
- Parameters
input_path (str) – Path to the tabix-indexed dataset. Note that for the file to be tabix-indexed, it must have been compressed with bgzip. Thus, input_path should be a *.gz file with a corresponding *.tbi file in the same directory.
features (list(str)) – The non-redundant list of genomic features (i.e. labels) that will be predicted.
feature_thresholds (float or dict or types.FunctionType or None) – Default is None. A genomic region is determined to be a positive sample if at least one genomic feature peak takes up a proportion of the region greater than or equal to the threshold specified for that feature.
None - No thresholds specified. All features found in a query region are annotated to that region.
float - A single threshold applies to all the features in the dataset.
dict - A dictionary mapping feature names (str) to threshold values (float), which assigns different thresholds to different features. If a feature’s name is absent from the dictionary keys, the dictionary is assumed to contain a “default” key whose value is used as that feature’s threshold.
types.FunctionType - A function that takes a feature name as input and returns that feature’s threshold.
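The four accepted forms of feature_thresholds can be reduced to a single per-feature mapping. The sketch below follows the descriptions above; `resolve_thresholds` is a hypothetical helper written for illustration, not the library's own code, and the feature names are made up.

```python
def resolve_thresholds(features, feature_thresholds=None):
    """Hypothetical helper: map every feature name to its threshold."""
    if feature_thresholds is None:
        # No thresholds: every feature found in a query region counts.
        return None
    if isinstance(feature_thresholds, (int, float)):
        # One threshold shared by all features.
        return {name: float(feature_thresholds) for name in features}
    if isinstance(feature_thresholds, dict):
        # Features missing from the dict fall back to the "default" key,
        # which is assumed to exist in that case.
        return {
            name: feature_thresholds[name]
            if name in feature_thresholds
            else feature_thresholds["default"]
            for name in features
        }
    if callable(feature_thresholds):
        # A function from feature name to threshold.
        return {name: feature_thresholds(name) for name in features}
    raise TypeError("unsupported feature_thresholds type")


features = ["CTCF", "DNase", "H3K27ac"]  # illustrative names only
shared = resolve_thresholds(features, 0.5)
mixed = resolve_thresholds(features, {"CTCF": 0.8, "default": 0.5})
by_function = resolve_thresholds(
    features, lambda name: 0.9 if name == "DNase" else 0.1)
```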
- Variables
data (tabix.open) – The data stored in a tabix-indexed *.bed file.
n_features (int) – The number of distinct features.
feature_index_dict (dict) – A dictionary mapping feature names (str) to indices (int), where the index is the position of the feature in the input features.
index_feature_dict (dict) – A dictionary mapping indices (int) to feature names (str), where the index is the position of the feature in the input features.
feature_thresholds (dict or None) –
dict - A dictionary mapping feature names (str) to thresholds (float), where the threshold is the minimum overlap that a feature annotation must have with a query region to be considered a positive example of that feature.
None - No threshold specifications. Assumes that all features returned by a tabix query are annotated to the query region.
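As a small illustration of how the two index dictionaries relate to the input features list, the following sketch builds them by hand. The feature names are placeholders, and this shows the documented relationship rather than the class's internal code.

```python
features = ["CTCF", "DNase", "H3K27ac"]  # illustrative names only

# Feature name -> position of that feature in `features`.
feature_index_dict = {name: i for i, name in enumerate(features)}

# Position in `features` -> feature name (the inverse mapping).
index_feature_dict = {i: name for name, i in feature_index_dict.items()}

n_features = len(features)
```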
-
get_feature_data(chrom, start, end)
For a sequence of length \(L = end - start\), return the features’ one-hot encoding corresponding to that region. Each position in the sequence has a binary vector of length n_features, in which each entry specifies whether the corresponding genomic feature’s coordinates overlap with that position.
- Parameters
chrom (str) – The name of the region (e.g. ‘1’, ‘2’, …, ‘X’, ‘Y’).
start (int) – The 0-based first position in the region.
end (int) – One past the 0-based last position in the region.
- Returns
\(L \times N\) array, where \(L = end - start\) and \(N =\) self.n_features. Note that if we catch a tabix.TabixError, we assume the error was the result of there being no features present in the queried region and return a numpy.ndarray of zeros.
- Return type
numpy.ndarray
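The shape of the result is easier to see with a tiny example. The sketch below builds the \(L \times N\) matrix from rows that are assumed to have already been fetched for the region. It uses plain nested lists instead of numpy.ndarray so that it has no dependencies, it ignores feature thresholds, and `encode_features` is a hypothetical name, not the library's implementation.

```python
def encode_features(start, end, rows, feature_index_dict):
    """Hypothetical sketch: rows are (feat_start, feat_end, feature_name)."""
    length = end - start
    n_features = len(feature_index_dict)
    # L x N matrix of zeros, one row per position in [start, end).
    encoding = [[0] * n_features for _ in range(length)]
    for feat_start, feat_end, name in rows:
        if name not in feature_index_dict:
            continue  # not one of the features being predicted
        col = feature_index_dict[name]
        # Clip the feature to the query region, in region coordinates.
        lo = max(feat_start, start) - start
        hi = min(feat_end, end) - start
        for pos in range(lo, hi):
            encoding[pos][col] = 1
    return encoding


index = {"CTCF": 0, "DNase": 1}  # illustrative feature names
rows = [(98, 102, "CTCF"), (103, 110, "DNase")]
matrix = encode_features(100, 105, rows, index)
# matrix == [[1, 0], [1, 0], [0, 0], [0, 1], [0, 1]]
```

Here the CTCF annotation covers the first two positions of the region and the DNase annotation covers the last two, so the middle position is all zeros. With no rows at all, the result is a matrix of zeros, which matches the behavior described for a failed tabix query.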
-
is_positive(chrom, start, end)
Determines whether the queried region of chrom contains any genomic features within \([start, end)\). If so, the query is considered positive.
- Parameters
chrom (str) – The name of the region (e.g. ‘1’, ‘2’, …, ‘X’, ‘Y’).
start (int) – The 0-based first position in the region.
end (int) – One past the 0-based last position in the region.
- Returns
True if this meets the criterion for a positive example, False otherwise. Note that if we catch a tabix.TabixError exception, we assume the error was the result of no features being present in the queried region and return False.
- Return type
bool
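The positive-example criterion can be sketched as follows, based on the documented rule that a feature must cover a proportion of the region greater than or equal to its threshold. This is an illustration under that reading, not the library's source: `is_positive_region` is a hypothetical name, and the rows are assumed to have already been fetched for the region.

```python
def is_positive_region(start, end, rows, thresholds=None):
    """Hypothetical sketch: rows are (feat_start, feat_end, feature_name)."""
    region_length = end - start
    for feat_start, feat_end, name in rows:
        # Number of positions shared by the feature and [start, end).
        overlap = min(feat_end, end) - max(feat_start, start)
        if overlap <= 0:
            continue
        if thresholds is None:
            # No thresholds: any overlapping feature makes the region positive.
            return True
        if overlap / region_length >= thresholds[name]:
            return True
    return False


rows = [(90, 150, "CTCF")]  # overlaps the last 10 positions of [0, 100)
strict = is_positive_region(0, 100, rows, {"CTCF": 0.5})
lenient = is_positive_region(0, 100, rows, {"CTCF": 0.05})
unthresholded = is_positive_region(0, 100, rows)
```

With a threshold of 0.5 the 10% overlap is not enough, while a threshold of 0.05 or no thresholds at all yields a positive region.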