selene_sdk.predict

This module contains classes and methods for making and analyzing predictions with models that have already been trained.

AnalyzeSequences

class selene_sdk.predict.AnalyzeSequences(model, trained_model_path, sequence_length, features, batch_size=64, use_cuda=False, data_parallel=False, reference_sequence=<class 'selene_sdk.sequences.genome.Genome'>, write_mem_limit=1500)[source]

Bases: object

Score sequences and their variants using the predictions made by a trained model.

Parameters
  • model (torch.nn.Module) – A sequence-based model architecture.

  • trained_model_path (str or list(str)) – The path(s) to the weights file for a trained sequence-based model. For a single path, the model architecture must match model. For a list of paths, assumes that the model passed in is of type selene_sdk.utils.MultiModelWrapper, which takes in a list of models. The paths must be ordered the same way the models are ordered in that list. list(str) input is an API-only function– Selene’s config file CLI does not support the MultiModelWrapper functionality at this time.

  • sequence_length (int) – The length of sequences that the model is expecting.

  • features (list(str)) – The names of the features that the model is predicting.

  • batch_size (int, optional) – Default is 64. The size of the mini-batches to use.

  • use_cuda (bool, optional) – Default is False. Specifies whether CUDA-enabled GPUs are available for torch to use.

  • data_parallel (bool, optional) – Default is False. Specify whether multiple GPUs are available for torch to use during training.

  • reference_sequence (class, optional) – Default is selene_sdk.sequences.Genome. The type of sequence on which this analysis will be performed. Please note that if you need to use variant effect prediction, you cannot only pass in the class–you must pass in the constructed selene_sdk.sequences.Sequence object with a particular sequence version (e.g. Genome(“hg19.fa”)). This version does NOT have to be the same sequence version that the model was trained on. That is, if the sequences in your variants file are hg19 but your model was trained on hg38 sequences, you should pass in hg19.

  • write_mem_limit (int, optional) – Default is 5000. Specify, in MB, the amount of memory you want to allocate to storing model predictions/scores. When running one of _in silico_ mutagenesis, variant effect prediction, or prediction, prediction/score handlers will accumulate data in memory and only write this data to files periodically. By default, Selene will write to files when the total amount of data (across all handlers) takes up 5000MB of space. Please keep in mind that Selene will not monitor the memory needed to actually carry out the operations (e.g. variant effect prediction) or load the model, so write_mem_limit should always be less than the total amount of CPU memory you have available on your machine. For example, for variant effect prediction, we load all the variants in 1 file into memory before getting the predictions, so your machine must have enough memory to accommodate that. Another possible consideration is your model size and whether you are using it on the CPU or a CUDA-enabled GPU (i.e. setting use_cuda to True).

Variables
  • ~AnalyzeSequences.model (torch.nn.Module) – A sequence-based model that has already been trained.

  • ~AnalyzeSequences.sequence_length (int) – The length of sequences that the model is expecting.

  • ~AnalyzeSequences.batch_size (int) – The size of the mini-batches to use.

  • ~AnalyzeSequences.features (list(str)) – The names of the features that the model is predicting.

  • ~AnalyzeSequences.use_cuda (bool) – Specifies whether to use a CUDA-enabled GPU or not.

  • ~AnalyzeSequences.data_parallel (bool) – Whether to use multiple GPUs or not.

  • ~AnalyzeSequences.reference_sequence (class) – The type of sequence on which this analysis will be performed.

get_predictions(input, output_dir=None, output_format='tsv', strand_index=None)[source]

Get model predictions for sequences specified as a raw sequence, FASTA, or BED file.

Parameters
  • input (str) – A single sequence, or a path to the FASTA or BED file input.

  • output_dir (str, optional) – Default is None. Output directory to write the model predictions. If this is left blank a raw sequence input will be assumed, though an output directory is required for FASTA and BED inputs.

  • output_format ({‘tsv’, ‘hdf5’}, optional) – Default is ‘tsv’. Choose whether to save TSV or HDF5 output files. TSV is easier to access (i.e. open with text editor/Excel) and quickly peruse, whereas HDF5 files must be accessed through specific packages/viewers that support this format (e.g. h5py Python package). Choose

    • ‘tsv’ if your list of sequences is relatively small (\(10^4\) or less in order of magnitude) and/or your model has a small number of features (<1000).

    • ‘hdf5’ for anything larger and/or if you would like to access the predictions/scores as a matrix that you can easily filter, apply computations, or use in a subsequent classifier/model. In this case, you may access the matrix using mat[“data”] after opening the HDF5 file using mat = h5py.File(“<output.h5>”, ‘r’). The matrix columns are the features and will match the same ordering as your features .txt file (same as the order your model outputs its predictions) and the matrix rows are the sequences. Note that the row labels (FASTA description/IDs) will be output as a separate .txt file (should match the ordering of the sequences in the input FASTA).

  • strand_index (int or None, optional) – Default is None. If the trained model makes strand-specific predictions, your input BED file may include a column with strand information (strand must be one of {‘+’, ‘-‘, ‘.’}). Specify the index (0-based) to use it. Otherwise, by default ‘+’ is used. (This parameter is ignored if FASTA file is used as input.)

Returns

Writes the output to file(s) in output_dir. Filename will match that specified in the filepath. In addition, if any base in the given or retrieved sequence is unknown, the row labels .txt file or .tsv file will mark this sequence or region as contains_unk = True.

Return type

None

get_predictions_for_bed_file(input_path, output_dir, output_format='tsv', strand_index=None)[source]

Get model predictions for sequences specified as genome coordinates in a BED file. Coordinates do not need to be the same length as the model expected sequence input–predictions will be centered at the midpoint of the specified start and end coordinates.

Parameters
  • input_path (str) – Input path to the BED file.

  • output_dir (str) – Output directory to write the model predictions.

  • output_format ({‘tsv’, ‘hdf5’}, optional) – Default is ‘tsv’. Choose whether to save TSV or HDF5 output files. TSV is easier to access (i.e. open with text editor/Excel) and quickly peruse, whereas HDF5 files must be accessed through specific packages/viewers that support this format (e.g. h5py Python package). Choose

    • ‘tsv’ if your list of sequences is relatively small (\(10^4\) or less in order of magnitude) and/or your model has a small number of features (<1000).

    • ‘hdf5’ for anything larger and/or if you would like to access the predictions/scores as a matrix that you can easily filter, apply computations, or use in a subsequent classifier/model. In this case, you may access the matrix using mat[“data”] after opening the HDF5 file using mat = h5py.File(“<output.h5>”, ‘r’). The matrix columns are the features and will match the same ordering as your features .txt file (same as the order your model outputs its predictions) and the matrix rows are the sequences. Note that the row labels (FASTA description/IDs) will be output as a separate .txt file (should match the ordering of the sequences in the input FASTA).

  • strand_index (int or None, optional) – Default is None. If the trained model makes strand-specific predictions, your input file may include a column with strand information (strand must be one of {‘+’, ‘-‘, ‘.’}). Specify the index (0-based) to use it. Otherwise, by default ‘+’ is used.

Returns

Writes the output to file(s) in output_dir. Filename will match that specified in the filepath.

Return type

None

get_predictions_for_fasta_file(input_path, output_dir, output_format='tsv')[source]

Get model predictions for sequences in a FASTA file.

Parameters
  • input_path (str) – Input path to the FASTA file.

  • output_dir (str) – Output directory to write the model predictions.

  • output_format ({‘tsv’, ‘hdf5’}, optional) – Default is ‘tsv’. Choose whether to save TSV or HDF5 output files. TSV is easier to access (i.e. open with text editor/Excel) and quickly peruse, whereas HDF5 files must be accessed through specific packages/viewers that support this format (e.g. h5py Python package). Choose

    • ‘tsv’ if your list of sequences is relatively small (\(10^4\) or less in order of magnitude) and/or your model has a small number of features (<1000).

    • ‘hdf5’ for anything larger and/or if you would like to access the predictions/scores as a matrix that you can easily filter, apply computations, or use in a subsequent classifier/model. In this case, you may access the matrix using mat[“data”] after opening the HDF5 file using mat = h5py.File(“<output.h5>”, ‘r’). The matrix columns are the features and will match the same ordering as your features .txt file (same as the order your model outputs its predictions) and the matrix rows are the sequences. Note that the row labels (FASTA description/IDs) will be output as a separate .txt file (should match the ordering of the sequences in the input FASTA).

Returns

Writes the output to file(s) in output_dir.

Return type

None

in_silico_mutagenesis(sequence, save_data, output_path_prefix='ism', mutate_n_bases=1, output_format='tsv', start_position=0, end_position=None)[source]

Applies in silico mutagenesis to a sequence.

Parameters
  • sequence (str) – The sequence to mutate.

  • save_data (list(str)) – A list of the data files to output. Must input 1 or more of the following options: [“abs_diffs”, “diffs”, “logits”, “predictions”].

  • output_path_prefix (str, optional) – The path to which the data files are written. If directories in the path do not yet exist they will be automatically created.

  • mutate_n_bases (int, optional) – The number of bases to mutate at one time. We recommend leaving this parameter set to 1 at this time, as we have not yet optimized operations for double and triple mutations.

  • output_format ({‘tsv’, ‘hdf5’}, optional) – Default is ‘tsv’. The desired output format.

  • start_position (int, optional) – Default is 0. The starting position of the subsequence to be mutated.

  • end_position (int or None, optional) – Default is None. The ending position of the subsequence to be mutated. If left as None, then self.sequence_length will be used.

Returns

Outputs data files from in silico mutagenesis to output_dir. For HDF5 output and ‘predictions’ in save_data, an additional file named *_ref_predictions.h5 will be outputted with the model prediction for the original input sequence.

Return type

None

Raises
  • ValueError – If the value of start_position or end_position is negative.

  • ValueError – If there are fewer than mutate_n_bases between start_position and end_position.

  • ValueError – If start_position is greater or equal to end_position.

  • ValueError – If start_position is not less than self.sequence_length.

  • ValueError – If end_position is greater than self.sequence_length.

in_silico_mutagenesis_from_file(input_path, save_data, output_dir, mutate_n_bases=1, use_sequence_name=True, output_format='tsv', start_position=0, end_position=None)[source]

Apply in silico mutagenesis to all sequences in a FASTA file.

Please note that we have not parallelized this function yet, so runtime increases exponentially when you increase mutate_n_bases.

Parameters
  • input_path (str) – The path to the FASTA file of sequences.

  • save_data (list(str)) – A list of the data files to output. Must input 1 or more of the following options: [“abs_diffs”, “diffs”, “logits”, “predictions”].

  • output_dir (str) – The path to the output directory. Directories in the path will be created if they do not currently exist.

  • mutate_n_bases (int, optional) – Default is 1. The number of bases to mutate at one time in in silico mutagenesis.

  • use_sequence_name (bool, optional.) – Default is True. If use_sequence_name, output files are prefixed by the sequence name/description corresponding to each sequence in the FASTA file. Spaces in the sequence name are replaced with underscores ‘_’. If not use_sequence_name, output files are prefixed with an index \(i\) (starting with 0) corresponding to the :math:`i`th sequence in the FASTA file.

  • output_format ({‘tsv’, ‘hdf5’}, optional) – Default is ‘tsv’. The desired output format. Each sequence in the FASTA file will have its own set of output files, where the number of output files depends on the number of save_data predictions/scores specified.

  • start_position (int, optional) – Default is 0. The starting position of the subsequence to be mutated.

  • end_position (int or None, optional) – Default is None. The ending position of the subsequence to be mutated. If left as None, then self.sequence_length will be used.

Returns

Outputs data files from in silico mutagenesis to output_dir. For HDF5 output and ‘predictions’ in save_data, an additional file named *_ref_predictions.h5 will be outputted with the model prediction for the original input sequence.

Return type

None

Raises
  • ValueError – If the value of start_position or end_position is negative.

  • ValueError – If there are fewer than mutate_n_bases between start_position and end_position.

  • ValueError – If start_position is greater or equal to end_position.

  • ValueError – If start_position is not less than self.sequence_length.

  • ValueError – If end_position is greater than self.sequence_length.

in_silico_mutagenesis_predict(sequence, base_preds, mutations_list, reporters=[])[source]

Get the predictions for all specified mutations applied to a given sequence and, if applicable, compute the scores (“abs_diffs”, “diffs”, “logits”) for these mutations.

Parameters
  • sequence (str) – The sequence to mutate.

  • base_preds (numpy.ndarray) – The model’s prediction for sequence.

  • mutations_list (list(list(tuple))) – The mutations to apply to the sequence. Each element in mutations_list is a list of tuples, where each tuple specifies the int position in the sequence to mutate and what str base to which the position is mutated (e.g. (1, ‘A’)).

  • reporters (list(PredictionsHandler)) – The list of reporters, where each reporter handles the predictions made for each mutated sequence. Will collect, compute scores (e.g. AbsDiffScoreHandler computes the absolute difference between base_preds and the predictions for the mutated sequence), and output these as a file at the end.

Returns

Writes results to files corresponding to each reporter in reporters.

Return type

None

variant_effect_prediction(vcf_file, save_data, output_dir=None, output_format='tsv', strand_index=None, require_strand=False)[source]

Get model predictions and scores for a list of variants.

Parameters
  • vcf_file (str) – Path to a VCF file. Must contain the columns [#CHROM, POS, ID, REF, ALT], in order. Column header does not need to be present.

  • save_data (list(str)) – A list of the data files to output. Must input 1 or more of the following options: [“abs_diffs”, “diffs”, “logits”, “predictions”].

  • output_dir (str or None, optional) – Default is None. Path to the output directory. If no path is specified, will save files corresponding to the options in save_data to the current working directory.

  • output_format ({‘tsv’, ‘hdf5’}, optional) – Default is ‘tsv’. Choose whether to save TSV or HDF5 output files. TSV is easier to access (i.e. open with text editor/Excel) and quickly peruse, whereas HDF5 files must be accessed through specific packages/viewers that support this format (e.g. h5py Python package). Choose

    • ‘tsv’ if your list of variants is relatively small (\(10^4\) or less in order of magnitude) and/or your model has a small number of features (<1000).

    • ‘hdf5’ for anything larger and/or if you would like to access the predictions/scores as a matrix that you can easily filter, apply computations, or use in a subsequent classifier/model. In this case, you may access the matrix using mat[“data”] after opening the HDF5 file using mat = h5py.File(“<output.h5>”, ‘r’). The matrix columns are the features and will match the same ordering as your features .txt file (same as the order your model outputs its predictions) and the matrix rows are the sequences. Note that the row labels (chrom, pos, id, ref, alt) will be output as a separate .txt file.

  • strand_index (int or None, optional.) – Default is None. If applicable, specify the column index (0-based) in the VCF file that contains strand information for each variant.

  • require_strand (bool, optional.) – Default is False. Whether strand can be specified as ‘.’. If False, Selene accepts strand value to be ‘+’, ‘-‘, or ‘.’ and automatically treats ‘.’ as ‘+’. If True, Selene skips any variant with strand ‘.’. This parameter assumes that strand_index has been set.

Returns

Saves all files to output_dir. If any bases in the ‘ref’ column of the VCF do not match those at the specified position in the reference genome, the row labels .txt file will mark this variant as ref_match = False. If most of your variants do not match the reference genome, please check that the reference genome you specified matches the version with which the variants were called. The predictions can used directly if you have verified that the ‘ref’ bases specified for these variants are correct (Selene will have substituted these bases for those in the reference genome). In addition, if any base in the retrieved reference sequence is unknown, the row labels .txt file will mark this variant as contains_unk = True. Finally, some variants may show up in an ‘NA’ file. This is because the surrounding sequence context ended up being out of bounds or overlapping with blacklist regions or the chromosome containing the variant did not show up in the reference genome FASTA file.

Return type

None