Detailed Options

spats_shape_seq.run_spats(target_path, r1_path, r2_path, output_path, cotrans=False)
Convenience function for a common-case SPATS run that doesn’t
require any non-default configuration.
Parameters:
  • target_path – path to the targets FASTA file
  • r1_path – path to the R1 input data FASTQ file
  • r2_path – path to the R2 input data FASTQ file
  • output_path – path to write resulting reactivities
  • cotrans – pass True for cotrans-style experiments.
class spats_shape_seq.spats.Spats(cotrans=False)

The main SPATS driver.

Parameters:cotrans – pass True for cotrans-style experiments.
addTarget(name, seq, rowid=-1)
addTargets(*target_paths)

Used to add one or more target files for processing. Can be called multiple times to add more targets. Inputs are expected to be in FASTA format with one or more targets per path. Must be called before processing.

Parameters:target_paths – one or more filesystem paths to target files.
compare_results(other_spats, verbose=False)
Used to compare the results of the current run against another
SPATS instance. Must be run after running process_pair_data(), or after loading the data (load()) from a previously-run session.
Parameters:
  • other_spats – Spats instance to compare.
  • verbose – set to True for detailed output of mismatched sites.
Returns:

(match_count, total) : match_count indicates the number of sites matched, total indicates total number of sites.
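
The returned pair can be reduced to a single agreement figure. The following is a minimal sketch, not part of the library: match_fraction is a hypothetical helper, and it assumes only that the result is the (match_count, total) tuple described above.

```python
def match_fraction(result):
    """Return the fraction of matching sites from a (match_count, total) tuple."""
    match_count, total = result
    if total == 0:
        # No sites were compared; report 0.0 rather than dividing by zero.
        return 0.0
    return float(match_count) / total
```

For example, match_fraction(spats.compare_results(other)) could be checked against a threshold chosen for the experiment.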

compute_profiles()

Computes beta/theta/c reactivity values after pair data have been processed.

Returns:a profiles.Profiles object, which contains the reactivities for all targets.
counters

Returns the underlying counters.Counters object, which contains information about site and tag counts.

load(input_path)

Loads SPATS state from a file.

Parameters:input_path – the path of a previously saved SPATS session.
loadTargets(pair_db)
merge(input_path)

Merges SPATS state from a file with existing state.

Parameters:input_path – the path of a previously saved SPATS session.
merge_targets(pair_db)
process_pair(pair)

Used to process a single pair.Pair. Typically only used for debugging or analysis of specific cases.

Parameters:pair – a pair.Pair to process.
process_pair_data(data_r1_path, data_r2_path, force_mask=None)

Used to read and process a pair of FASTQ data files.

Note that this parses the pair data into an in-memory SQLite database, which on most modern systems will be fine except for the largest input sets. If you hit memory issues, create a disk-based SQLite DB via db.PairDB and then use process_pair_db().

Note that this may be called multiple times to process more than one set of data files before computing profiles.

Parameters:
  • data_r1_path – path to R1 fragments
  • data_r2_path – path to matching R2 fragments.
process_pair_db(pair_db, batch_size=65536)

Processes pair data provided by a db.PairDB.

Note that this may be called multiple times to process more than one set of inputs before computing profiles.

Parameters:pair_db – a db.PairDB of pairs to process.
reset_processor()
store(output_path)

Saves the state of the SPATS run for later processing.

Parameters:output_path – the path for writing the output. The recommended file extension is .spats.
targets
validate_results(data_r1_path, data_r2_path, algorithm='find_partial', verbose=False)
Used to validate the results of the current run against a
different algorithm. Must be run after running process_pair_data(), or after loading the data (load()) from a previously-run session.
Parameters:
  • data_r1_path – path to R1 fragments
  • data_r2_path – path to matching R2 fragments.
  • algorithm – Generally the default is correct, but you can select a particular algorithm for data validation (see run.Run.algorithm).
  • verbose – set to True for detailed output of mismatched sites.
Returns:

True if results validate, False otherwise.

write_reactivities(output_path)

Convenience function used to write the reactivities to an output file. Must be called after compute_profiles().

Parameters:output_path – the path for writing the output.
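
The ordering constraints stated for the Spats methods (targets before processing, profiles before writing) can be sketched as a single function. This is an illustrative outline only: compute_reactivities and its spats_cls parameter are hypothetical names, and the import path is assumed from the class name spats_shape_seq.spats.Spats shown above.

```python
def compute_reactivities(target_path, r1_path, r2_path, output_path,
                         cotrans=False, spats_cls=None):
    """Run the documented Spats steps in order and return the profiles."""
    if spats_cls is None:
        # Imported lazily (assumed import path) so the sketch can be read
        # or exercised with a stand-in class when the package is absent.
        from spats_shape_seq.spats import Spats as spats_cls
    spats = spats_cls(cotrans=cotrans)
    spats.addTargets(target_path)              # must be called before processing
    spats.process_pair_data(r1_path, r2_path)  # may be repeated for more file pairs
    profiles = spats.compute_profiles()        # reactivities for all targets
    spats.write_reactivities(output_path)      # must follow compute_profiles()
    return profiles
```

For runs that need no non-default configuration, the run_spats() convenience function covers the same ground in one call.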
class spats_shape_seq.run.Run

Encapsulates the inputs/config required for a Spats run.

adapter_b = None

default AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC

adapter_t = None

default AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT

algorithm = None

Default find_partial, set to lookup to use the lookup optimization.

allow_indeterminate = None

Default False, set to True to allow indeterminate nucleotides to be processed as matches. For example, with this set to True, the sequence ACNT in a pair will be considered a match (no error) to ACGT in the target. See also allowed_adapter_errors and allowed_target_errors.
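
The ACNT/ACGT example can be illustrated with a small comparison routine. This is a simplified sketch of the idea, not the library's matching code; count_mismatches is a hypothetical helper that compares two equal-length sequences position by position.

```python
def count_mismatches(read, target, allow_indeterminate=False):
    """Count positions where read and target differ.

    When allow_indeterminate is True, an 'N' in the read is treated as
    matching any nucleotide in the target.
    """
    if len(read) != len(target):
        raise ValueError("sequences must have equal length")
    errors = 0
    for read_nt, target_nt in zip(read, target):
        if read_nt == target_nt:
            continue
        if allow_indeterminate and read_nt == "N":
            continue  # indeterminate base counted as a match
        errors += 1
    return errors
```

With the flag off, ACNT against ACGT counts one error; with it on, the count is zero.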

allow_multiple_rt_starts = None

Default False, in which case the right edge must match the edge of the target. Set to True to allow other possibilities for the right edge.

allow_negative_values = None

Default False, set to True to allow beta, theta, and rho values to be negative (otherwise, negative values are set to 0.0).
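
The effect of this option on a list of computed values can be sketched as follows. clamp_reactivities is a hypothetical helper written for illustration; it assumes only the behavior described above (negative values become 0.0 unless the option is set).

```python
def clamp_reactivities(values, allow_negative_values=False):
    """Return a copy of values with negatives set to 0.0 unless allowed."""
    if allow_negative_values:
        return list(values)
    return [max(value, 0.0) for value in values]
```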

allowed_adapter_errors = None

Default 0, increase to allow the indicated number of errors when performing adapter trimming.

allowed_dumbbell_errors = None

Default 0, increase to allow the indicated number of errors when performing dumbbell trimming.

allowed_target_errors = None

Default 0, increase to allow the indicated number of errors (mutations / indels) when matching to the target. WARNING: the ambiguity of the match increases exponentially with this number of mutations/indels; it’s recommended to not set this higher than 2.

apply_config_restrictions()
collapse_left_prefixes = None

Default False, set to True to treat any read with a 5’ prefix as starting at site zero. Setting this will force the usage of the count_left_prefixes option.

collapse_only_prefixes = None

Default None, which means collapse all prefixes. Set to a list of comma-separated strings to only collapse prefixes that appear in the list. Specifying these will force the collapse_left_prefixes and count_left_prefixes options to be True.

compute_z_reactivity = None

Default False, set to True to also compute reactivity based upon z-scores.

config_dict()
config_string()
cotrans = None

Default False, set to True to run as a cotrans experiment. Pass a single target instead of a generated targets file.

cotrans_linker = None

Default CTGACTCGGGCACCAAGGAC, change as necessary for cotrans experiments.

cotrans_minimum_length = None

Default 20, set to adjust the minimum number of bp to use from the cotrans target.

count_edge_mutations = None

Defaults to None. If set to stop_and_mut, will count mutations that are at the site like any other mutation. If set to stop_only, will count the stop but no mutation. In the default behavior, neither stops nor edge mutations are counted.

count_left_prefixes = None

Default False, set to True to count and report information on pairs that align left of the 5’ end. The count for each different prefix encountered will be reported. Setting this will force the usage of the find_partial algorithm.
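
The three prefix options force one another in a chain: collapse_only_prefixes forces collapse_left_prefixes and count_left_prefixes, collapse_left_prefixes forces count_left_prefixes, and count_left_prefixes forces the find_partial algorithm. The sketch below models that chain on a plain dictionary. It is not the implementation of apply_config_restrictions(); apply_prefix_restrictions and the sample prefix strings are hypothetical.

```python
def apply_prefix_restrictions(config):
    """Return a copy of config with the documented prefix implications applied."""
    cfg = dict(config)
    if cfg.get("collapse_only_prefixes"):
        cfg["collapse_left_prefixes"] = True
        cfg["count_left_prefixes"] = True
    if cfg.get("collapse_left_prefixes"):
        cfg["count_left_prefixes"] = True
    if cfg.get("count_left_prefixes"):
        cfg["algorithm"] = "find_partial"
    return cfg


# Hypothetical prefixes; the requested "lookup" algorithm is overridden.
effective = apply_prefix_restrictions(
    {"collapse_only_prefixes": "ACGT,TTGA", "algorithm": "lookup"})
```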

count_mutations = None

Defaults to False. If set to True, will count both stops and mutations, and incorporate the mutation information into the reactivity profile computations. Note that setting this will force allowed_target_errors to be 1.

count_only_full_reads = None

Default False, set to True to only count reads with no stops.

debug = None

Default False, set to True to output detailed run information.

dumbbell = None

Default None, set to a string sequence for analyses which use a dumbbell sequence (on the front of R2).

generate_channel_reads = None

Default False, set to True to generate R1/R2 outputs for all matching reads, separated by channel (with handles stripped).

generate_sam = None

Default None, set to a string path to generate SAM outputs for the spats run.

handle_indels = None

Default False, set to True to look for indels (insertions/deletions) and incorporate their counts into the reactivity profile computations. Requires using the find_partial algorithm. Note that looking for indels while processing will be at least an order of magnitude slower, but could give more accurate reactivities.

ignore_stops_with_mismatched_overlap = None

Defaults to True. When R1 and R2 overlap and have a mismatch on their overlap, the default behavior is to throw away the pair. Set this to False to have the stop and any mutations counted.

indel_gap_extend_cost = None

Defaults to 1, set to the value to penalize the extension of indel (insertion or deletion) gaps in the Smith-Waterman alignment algorithm. Only applies when handle_indels is True.

indel_gap_open_cost = None

Defaults to 5, set to the value to penalize the initiation of indel (insertion or deletion) gaps in the Smith-Waterman alignment algorithm. Only applies when handle_indels is True.

indel_match_value = None

Defaults to 3, set to the value to reward matching characters in the Smith-Waterman alignment algorithm. Only applies when handle_indels is True.

indel_mismatch_cost = None

Defaults to 2, set to the value to penalize mismatching characters in the Smith-Waterman alignment algorithm. Only applies when handle_indels is True.
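
To show how the four indel scoring values interact, here is a generic Smith-Waterman local alignment score with affine gaps, using the documented defaults as parameter defaults. It is a textbook sketch, not the library's aligner, and it assumes that a gap of length k costs gap_open + (k - 1) * gap_extend; the library may define the open cost differently. smith_waterman_score is a hypothetical name.

```python
def smith_waterman_score(a, b, match=3, mismatch=2, gap_open=5, gap_extend=1):
    """Return the best local alignment score between sequences a and b."""
    neg_inf = float("-inf")
    rows, cols = len(a) + 1, len(b) + 1
    # best[i][j]: best local score for an alignment ending at a[i-1], b[j-1]
    best = [[0] * cols for _ in range(rows)]
    # gap_a[i][j]: best score ending with a gap in a (b[j-1] is unpaired)
    gap_a = [[neg_inf] * cols for _ in range(rows)]
    # gap_b[i][j]: best score ending with a gap in b (a[i-1] is unpaired)
    gap_b = [[neg_inf] * cols for _ in range(rows)]
    top = 0
    for i in range(1, rows):
        for j in range(1, cols):
            gap_a[i][j] = max(gap_a[i][j - 1] - gap_extend,
                              best[i][j - 1] - gap_open)
            gap_b[i][j] = max(gap_b[i - 1][j] - gap_extend,
                              best[i - 1][j] - gap_open)
            step = match if a[i - 1] == b[j - 1] else -mismatch
            diagonal = best[i - 1][j - 1] + step
            best[i][j] = max(0, diagonal, gap_a[i][j], gap_b[i][j])
            top = max(top, best[i][j])
    return top
```

With the defaults, two identical 4-nt sequences score 4 × 3 = 12. Under the stated assumption, a single-base insertion costs as much as two and a half mismatches, so short indels are only preferred when they restore a longer run of matches.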

load_from_config(config_dict)
log = None

Defaults to sys.stdout, set to a file-like object to gather debugging info.

masks = None

Default [ 'RRRY', 'YYYR' ]; the treated mask is first.
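
The default masks read naturally as IUPAC ambiguity codes, where R is a purine (A or G) and Y is a pyrimidine (C or T). The sketch below classifies a 4-nt handle under that reading. It is an illustration, not the library's mask code: matches_mask and classify_handle are hypothetical, and labelling the second mask "untreated" is an assumption based on the first being the treated mask.

```python
# IUPAC ambiguity codes used by the default masks.
IUPAC = {"R": "AG", "Y": "CT"}


def matches_mask(handle, mask):
    """Return True if every handle nucleotide is allowed by the mask code."""
    if len(handle) != len(mask):
        return False
    return all(nt in IUPAC.get(code, code) for nt, code in zip(handle, mask))


def classify_handle(handle, masks=("RRRY", "YYYR")):
    """Return 'treated', 'untreated', or None if no mask matches."""
    for index, mask in enumerate(masks):
        if matches_mask(handle, mask):
            return "treated" if index == 0 else "untreated"
    return None
```

Under this reading GGGC matches RRRY, CCCA matches YYYR, and ACGT matches neither.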

minimum_adapter_len = None

Defaults to 0, set higher to require a minimal amount of adapter in order to do trimming. Generally not necessary since a positive match in the target is required before trimming.

minimum_tag_match_length = None

Default 8, set to adjust the minimum length for matching tags for the reads analyzer.

minimum_target_match_length = None

Defaults to 10. Runs will potentially be faster if you set it higher, but may miss some pairs that have only shorter matching subsequences. You can set it lower, but then there’s some chance pairs will match the wrong place – in which case, they will have too many errors and be discarded – and it will allow shorter sequences at the end (which end up being adapter-trimmed) to be accepted. You might want to analyze your targets (xref target.Target.longest_target_self_matches()) to determine an appropriate value.

mutations_require_quality_score = None

Defaults to None. If set to a phred-score integer value (0 - 42), and count_mutations is True, then this will require the quality score on any mutation to be greater than or equal to the indicated phred score to be counted.
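
The quality threshold can be illustrated by decoding a FASTQ quality character and comparing it with the configured score. This sketch assumes Phred+33 encoding (the common Sanger / Illumina 1.8+ convention), which the text above does not state; mutation_passes_quality is a hypothetical helper, not a library function.

```python
def mutation_passes_quality(quality_string, position, minimum_phred, offset=33):
    """Return True if the base at position meets the minimum phred score.

    A minimum_phred of None means no quality requirement is applied.
    """
    if minimum_phred is None:
        return True
    phred = ord(quality_string[position]) - offset  # assumes Phred+33 encoding
    return phred >= minimum_phred
```

In this encoding 'I' decodes to 40 and '5' to 20, so with a threshold of 30 a mutation at a position scored '5' would not be counted.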

num_workers = None

Default None, which auto-detects the number of available cores (using multiprocessing.cpu_count()) and creates that many workers. Set to an integer to force an explicit number. Set to 1 to disable multiprocessing (sometimes useful for debugging). Only used when bulk processing input data via spats.Spats.process_pair_data() or spats.Spats.process_pair_db().
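
The resolution rule described above (None means one worker per detected core, an integer is used as given) can be written out as a short sketch. resolve_num_workers is a hypothetical helper for illustration, and the rejection of values below 1 is an added assumption.

```python
import multiprocessing


def resolve_num_workers(num_workers=None):
    """Return the worker count implied by a num_workers setting."""
    if num_workers is None:
        # Auto-detect: one worker per available core.
        return multiprocessing.cpu_count()
    if num_workers < 1:
        # Assumption for this sketch: non-positive counts are invalid.
        raise ValueError("num_workers must be at least 1")
    return int(num_workers)
```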

pair_length = None

Default None, in which case the pair length is detected from input data. Otherwise, can be set explicitly.

quiet = None

Default False, set to True to suppress output messages.

regions_of_interest = None

Default None, specify a list of pairs like (min, max) where each pair specifies a minimum and maximum nucleotide index indicating a region in which to watch for activity. If a read has a stop and/or mutation within this region, it will be tagged as interesting. Only meaningful when using the reads tool and the find_partial algorithm.
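
A region check over a list of (min, max) pairs might look like the following. This is a sketch under the assumption that both bounds are inclusive, which the description does not specify; is_interesting and the sample regions are hypothetical.

```python
def is_interesting(site, regions_of_interest):
    """Return True if site falls inside any (min, max) region (inclusive)."""
    if not regions_of_interest:
        return False  # None or empty list: nothing is being watched
    return any(low <= site <= high for low, high in regions_of_interest)


# Hypothetical regions covering nucleotides 10-20 and 25-30.
regions = [(10, 20), (25, 30)]
```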

result_set_name = None

Defaults to "default", set to a string to choose the name for this result set. Also used for resuming processing, xref resume_processing. Result sets can be compared using db.PairDB.differing_results().

resume_processing = None

Default False, set to True to resume processing (if there’s been a previous run using writeback_results and the same result_set_name).

rt_primers = None

Default None, which means allow RT starts from either anywhere (when allow_multiple_rt_starts is True) or from the right (3’) edge of the target only (when it is False). Set to a list of comma-separated strings to restrict where RT starts can happen in the target. Specifying these will force the allow_multiple_rt_starts option to be True. Note: each primer in the list must unambiguously match a single unique region of the target.

single_target_linker = None

Default None, if set to a sequence, then this will be required as a prefix to R1. Internally, this sets allow_multiple_rt_starts to True, and rt_primers to this value.

skip_database = None

Default True, set to False to parse to a database and only process unique counts. Typically straight parsing is faster, but on some datasets it can be faster to determine and only process unique counts.

validate_config()
writeback_results = None

Default False, set to True to write the results back to the input database for further analysis or incremental/resumable processing. xref resume_processing.

class spats_shape_seq.reads.ReadsData(db_path)

Encapsulates the data for reads analysis.

Parameters:db_path – the path for the reads data
pair_db

Access the underlying db.PairDB.

parse(target_path, r1_paths, r2_paths, sample_size=100000, show_progress_every=1000000)

Used to parse target and r1/r2 reads data for reads processing.

Parameters:
  • target_path – Path to targets file, must be in FASTA format.
  • r1_paths – List of paths to R1 reads files, must be in FASTQ format.
  • r2_paths – List of paths to R2 reads files, must be in FASTQ format.
  • sample_size – the number of samples to use for analysis. Samples will be (more or less) uniformly randomly selected from the population of pairs.
  • show_progress_every – by default, shows a ‘.’ for every 1 million pairs parsed. Set to 0 to disable output.
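
One standard way to draw a uniform sample of fixed size from a stream of unknown length is reservoir sampling (Algorithm R). The sketch below shows that technique as an illustration of what sample_size asks for; it is not necessarily how parse() selects its samples. reservoir_sample and its seed parameter are hypothetical.

```python
import random


def reservoir_sample(items, sample_size, seed=None):
    """Return up to sample_size items chosen uniformly at random from items."""
    rng = random.Random(seed)
    sample = []
    for index, item in enumerate(items):
        if index < sample_size:
            sample.append(item)  # fill the reservoir first
        else:
            # Replace a reservoir slot with probability sample_size / (index + 1).
            slot = rng.randint(0, index)
            if slot < sample_size:
                sample[slot] = item
    return sample
```

The stream is read once and only sample_size items are held in memory, which suits inputs with millions of pairs.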
class spats_shape_seq.reads.ReadsAnalyzer(reads_data, cotrans=False)

Performs the analysis/tagging required for the reads analysis.

Parameters:
  • reads_data – the ReadsData with the input data.
  • cotrans – pass True for cotrans-style experiments.
addTagPlugin(tag, handler)
addTagTarget(name, tag)
process_tags()

Processes the tags in the input data for analysis.

run

Provides access to the run.Run which is used to configure tag analysis.

tag_counts()

Returns a dictionary of { tag: count } of tags analyzed.
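
Since the result is a plain { tag: count } dictionary, it can be ranked with ordinary Python. The helper below is a hypothetical sketch, and the tag names in the sample dictionary are placeholders rather than tags the analyzer is known to emit.

```python
def top_tags(tag_counts, limit=5):
    """Return the most frequent (tag, count) pairs, ties broken by tag name."""
    ranked = sorted(tag_counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:limit]


# Placeholder data standing in for the result of tag_counts().
example_counts = {"tag_a": 3, "tag_b": 10, "tag_c": 3}
ranked_example = top_tags(example_counts, limit=2)
```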