selene_sdk.targets
This module contains classes and methods for working with target features. These classes define a way to access a “target feature”, such as a label or annotation on an input sequence.
Target
GenomicFeatures
- class selene_sdk.targets.GenomicFeatures(input_path, features, feature_thresholds=None, init_unpicklable=False)
Bases: selene_sdk.targets.target.Target
Stores the dataset specifying sequence regions and features. Accepts a tabix-indexed *.bed file with the following columns, in order:
[chrom, start, end, strand, feature]
Note that chrom is interchangeable with any sort of region (e.g. a protein in a FAA file). Further, start is 0-based. Lastly, any additional columns following the five shown above will be ignored.
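To make the column layout concrete, the following minimal sketch parses one row in the order listed above. The sample line, the feature name, and the helper parse_row are made up for illustration and are not part of selene_sdk. Treating end as exclusive is the usual BED convention and is an assumption here.

```python
# Illustrative only: parse one tab-separated row laid out as
# [chrom, start, end, strand, feature]. Not part of selene_sdk.
def parse_row(line):
    fields = line.rstrip("\n").split("\t")
    # Only the first five columns are used; anything after them is ignored.
    chrom, start, end, strand, feature = fields[:5]
    return chrom, int(start), int(end), strand, feature


# Made-up example row with one extra trailing column.
row = parse_row("chr1\t100\t250\t+\tCTCF\tignored_extra_column\n")
chrom, start, end, strand, feature = row

# start is 0-based; end is assumed exclusive (usual BED convention),
# so the interval length is simply end - start.
length = end - start
```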
- Parameters
input_path (str) – Path to the tabix-indexed dataset. Note that for the file to be tabix-indexed, it must have been compressed with bgzip. Thus, input_path should be a *.gz file with a corresponding *.tbi file in the same directory.
features (list(str)) – The non-redundant list of genomic features (i.e. labels) that will be predicted.
feature_thresholds (float or dict or types.FunctionType or None) – Default is None. A genomic region is determined to be a positive sample if at least one genomic feature peak takes up a proportion of the region greater than or equal to the threshold specified for that feature.
None - No thresholds specified. All features found in a query region are annotated to that region.
float - A single threshold applies to all the features in the dataset.
dict - A dictionary mapping feature names (str) to threshold values (float), which assigns different thresholds to different features. If a feature’s threshold is not specified in this dictionary, the dictionary is assumed to contain a “default” key whose value is used as the threshold for that feature.
types.FunctionType - define a function that takes as input the feature name and returns the feature’s threshold.
init_unpicklable (bool, optional) – Default is False. When False, initialization is delayed until a relevant method is called, which allows the object to be pickled after instantiation. init_unpicklable must be False when multiprocessing is needed (e.g. with DataLoader). Set init_unpicklable to True if you are using this class directly through Selene’s API and want to access class attributes without first calling a specific method in GenomicFeatures.
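The four forms accepted by feature_thresholds (None, float, dict, function) can be pictured as all resolving to one per-feature mapping. The sketch below is an assumption about how such a resolution could look, not Selene’s actual implementation; resolve_thresholds and the feature names are hypothetical.

```python
# Hypothetical helper, not part of selene_sdk: expand a threshold
# specification into a {feature_name: threshold} dict (or None).
def resolve_thresholds(features, feature_thresholds=None):
    if feature_thresholds is None:
        # No thresholds: every feature found in a region is annotated to it.
        return None
    if isinstance(feature_thresholds, (int, float)):
        # A single threshold applies to all features.
        return {name: float(feature_thresholds) for name in features}
    if isinstance(feature_thresholds, dict):
        # Per-feature thresholds; features missing from the dict fall back
        # to the "default" key, which is assumed to be present in that case.
        return {
            name: feature_thresholds[name]
            if name in feature_thresholds
            else feature_thresholds["default"]
            for name in features
        }
    if callable(feature_thresholds):
        # A function from feature name to threshold.
        return {name: feature_thresholds(name) for name in features}
    raise TypeError("unsupported feature_thresholds specification")


features = ["CTCF", "H3K27ac", "DNase"]  # made-up feature names
from_float = resolve_thresholds(features, 0.5)
from_dict = resolve_thresholds(features, {"CTCF": 0.9, "default": 0.5})
from_func = resolve_thresholds(features, lambda name: 0.1 if name == "DNase" else 0.7)
```

In this sketch a missing “default” key only raises a KeyError when some feature actually needs the fallback.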
- Variables
~GenomicFeatures.data (tabix.open) – The data stored in a tabix-indexed *.bed file.
~GenomicFeatures.n_features (int) – The number of distinct features.
~GenomicFeatures.feature_index_dict (dict) – A dictionary mapping feature names (str) to indices (int), where the index is the position of the feature in features.
~GenomicFeatures.index_feature_dict (dict) – A dictionary mapping indices (int) to feature names (str), where the index is the position of the feature in the input features.
~GenomicFeatures.feature_thresholds (dict or None) –
dict - A dictionary mapping feature names (str) to thresholds (float), where the threshold is the minimum overlap that a feature annotation must have with a query region to be considered a positive example of that feature.
None - No threshold specifications. Assumes that all features returned by a tabix query are annotated to the query region.
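As a small illustration of the attributes described above, the two index dictionaries are inverses of each other and can be derived from the features list alone. The feature names below are made up, and this is a sketch rather than the library’s own code.

```python
# Made-up, non-redundant feature list (order defines the indices).
features = ["CTCF", "H3K27ac", "DNase"]

# Feature name -> position in `features`.
feature_index_dict = {name: i for i, name in enumerate(features)}

# Position in `features` -> feature name.
index_feature_dict = {i: name for i, name in enumerate(features)}

# Number of distinct features.
n_features = len(features)
```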
- get_feature_data(chrom, start, end)
Computes which features overlap with the given region.
- Parameters
chrom (str) – The name of the region (e.g. ‘1’, ‘2’, …, ‘X’, ‘Y’).
start (int) – The 0-based first position in the region.
end (int) – One past the 0-based last position in the region.
- Returns
A target vector of size self.n_features where the `i`th position is equal to one if the `i`th feature is positive, and zero otherwise.
NOTE: If we catch a tabix.TabixError, we assume the error was the result of there being no features present in the queried region and return a numpy.ndarray of zeros.
- Return type
numpy.ndarray
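The target vector described for get_feature_data can be sketched in plain Python as follows. This is an assumption about the logic, not Selene’s implementation: target_vector is a hypothetical helper, the rows stand in for the results of a tabix query, and a list is used instead of a numpy.ndarray to keep the example dependency-free.

```python
# Hypothetical sketch, not part of selene_sdk.
# rows: iterable of (row_start, row_end, feature) tuples for one chrom.
# start/end: the query region [start, end), 0-based, end exclusive.
def target_vector(rows, start, end, feature_index, thresholds=None):
    target = [0] * len(feature_index)
    region_len = end - start
    for row_start, row_end, feature in rows:
        if feature not in feature_index:
            continue  # not one of the features being predicted
        overlap = min(row_end, end) - max(row_start, start)
        if overlap <= 0:
            continue  # the row does not intersect the query region
        # With no thresholds, any overlapping feature counts; otherwise the
        # overlap must cover at least the feature's threshold proportion.
        if thresholds is None or overlap / region_len >= thresholds[feature]:
            target[feature_index[feature]] = 1
    return target


feature_index = {"CTCF": 0, "H3K27ac": 1, "DNase": 2}  # made-up features
rows = [(90, 160, "CTCF"), (180, 300, "DNase")]        # made-up annotations
thresholds = {"CTCF": 0.5, "H3K27ac": 0.5, "DNase": 0.5}

with_thresholds = target_vector(rows, 100, 200, feature_index, thresholds)
without_thresholds = target_vector(rows, 100, 200, feature_index)
```

Here CTCF covers 60 of the 100 queried positions and passes the 0.5 threshold, while DNase covers only 20 and is marked positive only when no thresholds are given.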
- is_positive(chrom, start, end)
Determines whether the queried chrom contains any genomic features within the \([start, end)\) region. If so, the query is considered positive.
- Parameters
chrom (str) – The name of the region (e.g. ‘1’, ‘2’, …, ‘X’, ‘Y’).
start (int) – The 0-based first position in the region.
end (int) – One past the 0-based last position in the region.
- Returns
True if this meets the criterion for a positive example, False otherwise. Note that if we catch a tabix.TabixError exception, we assume the error was the result of no features being present in the queried region and return False.
- Return type
bool
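The behaviour described for is_positive, including the fallback when the query raises an error, can be sketched as below. Everything here is hypothetical: QueryError stands in for tabix.TabixError, fake_query stands in for a tabix query, and is_positive_sketch is an illustration under those assumptions rather than the library’s code.

```python
class QueryError(Exception):
    """Hypothetical stand-in for tabix.TabixError."""


def fake_query(chrom, start, end):
    # Made-up data source: only region "1" has any annotations.
    if chrom != "1":
        raise QueryError("no data for region " + chrom)
    return [(90, 160, "CTCF")]


def is_positive_sketch(query, chrom, start, end, thresholds=None):
    try:
        rows = query(chrom, start, end)
    except QueryError:
        # Assume the error means no features exist in the queried region.
        return False
    region_len = end - start
    for row_start, row_end, feature in rows:
        overlap = min(row_end, end) - max(row_start, start)
        if overlap <= 0:
            continue
        # Positive as soon as one feature meets its threshold
        # (or overlaps at all when no thresholds are given).
        if thresholds is None or overlap / region_len >= thresholds[feature]:
            return True
    return False


loose = is_positive_sketch(fake_query, "1", 100, 200, {"CTCF": 0.5})
strict = is_positive_sketch(fake_query, "1", 100, 200, {"CTCF": 0.9})
missing = is_positive_sketch(fake_query, "X", 100, 200, {"CTCF": 0.5})
```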