Detailed Options¶
-
spats_shape_seq.
run_spats
(target_path, r1_path, r2_path, output_path, cotrans=False)¶ Convenience function for a common-case SPATS run that doesn’t require any non-default configuration.
Parameters: - target_path – path to the targets FASTA file
- r1_path – path to the R1 input data FASTQ file
- r2_path – path to the R2 input data FASTQ file
- output_path – path to write resulting reactivities
- cotrans – pass True for cotrans-style experiments.
-
class
spats_shape_seq.spats.
Spats
(cotrans=False)¶ The main SPATS driver.
Parameters: cotrans – pass True for cotrans-style experiments.
-
addTarget
(name, seq, rowid=-1)¶
-
addTargets
(*target_paths)¶ Used to add one or more target files for processing. Can be called multiple times to add more targets. Inputs are expected to be in FASTA format with one or more targets per path. Must be called before processing.
Parameters: target_paths – one or more filesystem paths to target files.
-
compare_results
(other_spats, verbose=False)¶ Used to compare the results of the current run against another SPATS instance. Must be run after running process_pair_data(), or after loading the data (load()) from a previously-run session.
Parameters: - other_spats – Spats instance to compare.
- verbose – set to True for detailed output of mismatched sites.
Returns: (match_count, total): match_count is the number of sites matched, total is the total number of sites.
-
compute_profiles
()¶ Computes beta/theta/c reactivity values after pair data have been processed.
Returns: a profiles.Profiles object, which contains the reactivities for all targets.
-
counters
¶ Returns the underlying counters.Counters object, which contains information about site and tag counts.
-
load
(input_path)¶ Loads SPATS state from a file.
Parameters: input_path – the path of a previously saved SPATS session.
-
loadTargets
(pair_db)¶
-
merge
(input_path)¶ Merges SPATS state from a file with existing state.
Parameters: input_path – the path of a previously saved SPATS session.
-
merge_targets
(pair_db)¶
-
process_pair
(pair)¶ Used to process a single pair.Pair. Typically only used for debugging or analysis of specific cases.
Parameters: pair – a pair.Pair to process.
-
process_pair_data
(data_r1_path, data_r2_path, force_mask=None)¶ Used to read and process a pair of FASTQ data files.
Note that this parses the pair data into an in-memory SQLite database, which on most modern systems will be fine except for the largest input sets. If you hit memory issues, create a disk-based SQLite DB via db.PairDB and then use process_pair_db().
Note that this may be called multiple times to process more than one set of data files before computing profiles.
Parameters: - data_r1_path – path to R1 fragments
- data_r2_path – path to matching R2 fragments.
-
process_pair_db
(pair_db, batch_size=65536)¶ Processes pair data provided by a db.PairDB.
Note that this may be called multiple times to process more than one set of inputs before computing profiles.
Parameters: pair_db – a db.PairDB of pairs to process.
-
reset_processor
()¶
-
store
(output_path)¶ Saves the state of the SPATS run for later processing.
Parameters: output_path – the path for writing the output. Recommended file extension is .spats
-
targets
¶
-
validate_results
(data_r1_path, data_r2_path, algorithm='find_partial', verbose=False)¶ Used to validate the results of the current run against a different algorithm. Must be run after running process_pair_data(), or after loading the data (load()) from a previously-run session.
Parameters: - data_r1_path – path to R1 fragments.
- data_r2_path – path to matching R2 fragments.
- algorithm – generally the default is correct, but you can select a particular algorithm for data validation (see run.Run.algorithm).
- verbose – set to True for detailed output of mismatched sites.
Returns: True if results validate, False otherwise.
-
write_reactivities
(output_path)¶ Convenience function used to write the reactivities to an output file. Must be called after compute_profiles().
Parameters: output_path – the path for writing the output.
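The call order described for the Spats methods above (add targets, process pair data, compute profiles, then write or store) can be sketched as a small driver. This is only an illustration: run_and_save is a hypothetical helper that is not part of the package, and it takes an already-constructed Spats-like object so that the sketch itself does not depend on the package being installed.

```python
def run_and_save(spats, target_path, r1_path, r2_path,
                 reactivities_path, session_path):
    """Drive a Spats-like object through one run (hypothetical helper)."""
    spats.addTargets(target_path)               # must be called before processing
    spats.process_pair_data(r1_path, r2_path)   # may be repeated for more file pairs
    profiles = spats.compute_profiles()         # beta/theta/c reactivity values
    spats.write_reactivities(reactivities_path) # requires compute_profiles() first
    spats.store(session_path)                   # ".spats" is the recommended extension
    return profiles

# Typical call, assuming the package is installed (not executed here):
#   from spats_shape_seq.spats import Spats
#   run_and_save(Spats(cotrans=False), "targets.fa", "R1.fastq", "R2.fastq",
#                "reactivities.out", "run.spats")
```

A session saved this way can later be reopened with load() and compared against another run with compare_results().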
-
-
class
spats_shape_seq.run.
Run
¶ Encapsulates the inputs/config required for a Spats run.
-
adapter_b
= None¶ Default AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
-
adapter_t
= None¶ Default AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
-
algorithm
= None¶ Default find_partial, set to lookup to use the lookup optimization.
-
allow_indeterminate
= None¶ Default False, set to True to allow indeterminate nucleotides to be processed as matches. For example, with this set to True, the sequence ACNT in a pair will be considered a match (no error) to ACGT in the target. See also allowed_adapter_errors and allowed_target_errors.
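The ACNT / ACGT example above can be illustrated with a small mismatch counter. This is a sketch of the idea only, not the SPATS matcher: count_mismatches is a hypothetical helper, and it assumes equal-length sequences with N as the only indeterminate code.

```python
def count_mismatches(read, target, allow_indeterminate=False):
    """Count positions where read and target differ (equal lengths assumed)."""
    if len(read) != len(target):
        raise ValueError("sequences must have equal length")
    errors = 0
    for read_nt, target_nt in zip(read, target):
        if read_nt == target_nt:
            continue
        if allow_indeterminate and read_nt == "N":
            continue  # assumption: an indeterminate read base counts as a match
        errors += 1
    return errors

print(count_mismatches("ACNT", "ACGT"))                            # 1
print(count_mismatches("ACNT", "ACGT", allow_indeterminate=True))  # 0
```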
-
allow_multiple_rt_starts
= None¶ Default False, in which the right edge must match the edge of the target. Set to True to allow other possibilities for the right edge.
-
allow_negative_values
= None¶ Default False, set to True to allow beta, theta, and rho values to be negative (otherwise, negative values are set to 0.0).
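The default clamping behaviour described above amounts to replacing negative values with 0.0. A minimal sketch, assuming the values arrive as a plain list of floats (clamp_reactivities is a hypothetical name, not a package function):

```python
def clamp_reactivities(values, allow_negative_values=False):
    """Return the values unchanged, or with negatives set to 0.0 (the default)."""
    if allow_negative_values:
        return list(values)
    return [max(0.0, v) for v in values]

print(clamp_reactivities([0.12, -0.03, 0.0, 0.4]))
print(clamp_reactivities([0.12, -0.03], allow_negative_values=True))
```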
-
allowed_adapter_errors
= None¶ Default 0, increase to allow the indicated number of errors when performing adapter trimming.
-
allowed_dumbbell_errors
= None¶ Default 0, increase to allow the indicated number of errors when performing dumbbell trimming.
-
allowed_target_errors
= None¶ Default 0, increase to allow the indicated number of errors (mutations / indels) when matching to the target. WARNING: the ambiguity of the match increases exponentially with this number of mutations/indels; it’s recommended to not set this higher than 2.
-
apply_config_restrictions
()¶
-
collapse_left_prefixes
= None¶ Default False, set to True to treat any read with a 5’ prefix as starting at site zero. Setting this will force the usage of the count_left_prefixes option.
-
collapse_only_prefixes
= None¶ Default None, which means collapse all prefixes. Set to a list of comma-separated strings to only collapse prefixes that appear in the list. Specifying these will force the collapse_left_prefixes and count_left_prefixes options to be True.
-
compute_z_reactivity
= None¶ Default False, set to True to also compute reactivity based upon z-scores.
-
config_dict
()¶
-
config_string
()¶
-
cotrans
= None¶ Default False, set to True to run as a cotrans experiment. Pass a single target instead of a generated targets file.
-
cotrans_linker
= None¶ Default CTGACTCGGGCACCAAGGAC, change as necessary for cotrans experiments.
-
cotrans_minimum_length
= None¶ Default 20, set to adjust the minimum number of bp to use from the cotrans target.
-
count_edge_mutations
= None¶ Defaults to None. If set to stop_and_mut, will count mutations that are at the site like any other mutation. If set to stop_only, will count the stop but no mutation. In the default behavior, neither stops nor edge mutations are counted.
-
count_left_prefixes
= None¶ Default False, set to True to count and report information on pairs that align left of the 5’ end. The count for each different prefix encountered will be reported. Setting this will force the usage of the find_partial algorithm.
-
count_mutations
= None¶ Defaults to False. If set to True, will count both stops and mutations, and incorporate the mutation information into the reactivity profile computations. Note that setting this will force allowed_target_errors to be 1.
-
count_only_full_reads
= None¶ Default False, set to True to only count reads with no stops.
-
debug
= None¶ Default False, set to True to output detailed run information.
-
dumbbell
= None¶ Default None, set to a string sequence for analyses which use a dumbbell sequence (on the front of R2).
-
generate_channel_reads
= None¶ Default False, set to True to generate R1/R2 outputs for all matching reads, separated by channel (with handles stripped).
-
generate_sam
= None¶ Default None, set to a string path to generate SAM outputs for the spats run.
-
handle_indels
= None¶ Default False, set to True to look for indels (insertions/deletions) and incorporate their counts into the reactivity profile computations. Requires using the find_partial algorithm. Note that looking for indels while processing will be at least an order of magnitude slower, but could give more accurate reactivities.
-
ignore_stops_with_mismatched_overlap
= None¶ Defaults to True. When R1 and R2 overlap and have a mismatch on their overlap, the default behavior is to throw away the pair. Set this to False to have the stop and any mutations counted.
-
indel_gap_extend_cost
= None¶ Defaults to 1, set to the value to penalize the extension of indel (insertion or deletion) gaps in the Smith-Waterman alignment algorithm. Only applies when handle_indels is True.
-
indel_gap_open_cost
= None¶ Defaults to 5, set to the value to penalize the initiation of indel (insertion or deletion) gaps in the Smith-Waterman alignment algorithm. Only applies when handle_indels is True.
-
indel_match_value
= None¶ Defaults to 3, set to the value to reward matching characters in the Smith-Waterman alignment algorithm. Only applies when handle_indels is True.
-
indel_mismatch_cost
= None¶ Defaults to 2, set to the value to penalize mismatching characters in the Smith-Waterman alignment algorithm. Only applies when handle_indels is True.
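To show how the four indel_* values interact, here is a minimal Smith-Waterman local-alignment scorer that uses the documented defaults (match 3, mismatch 2, gap open 5, gap extend 1). It is a generic textbook sketch with affine gaps, not the SPATS implementation. In particular, it assumes that the first position of a gap costs the open value and each further position costs the extend value; how SPATS combines the two costs is not stated here.

```python
def smith_waterman_score(a, b, match=3, mismatch=2, gap_open=5, gap_extend=1):
    """Best local alignment score of a and b with affine gap penalties."""
    neg_inf = float("-inf")
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]            # best score ending at (i, j)
    gap_in_a = [[neg_inf] * cols for _ in range(rows)]   # alignment ends with a gap in a
    gap_in_b = [[neg_inf] * cols for _ in range(rows)]   # alignment ends with a gap in b
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            # Either open a new gap from an aligned state or extend an existing one.
            gap_in_a[i][j] = max(score[i][j - 1] - gap_open,
                                 gap_in_a[i][j - 1] - gap_extend)
            gap_in_b[i][j] = max(score[i - 1][j] - gap_open,
                                 gap_in_b[i - 1][j] - gap_extend)
            step = match if a[i - 1] == b[j - 1] else -mismatch
            diagonal = score[i - 1][j - 1] + step
            # Local alignment: a score never drops below zero.
            score[i][j] = max(0, diagonal, gap_in_a[i][j], gap_in_b[i][j])
            best = max(best, score[i][j])
    return best

print(smith_waterman_score("ACGT", "ACGT"))            # 12: four matches
print(smith_waterman_score("AAAACCCC", "AAAAGGCCCC"))  # 18: eight matches, one 2-nt gap
```

Raising the open cost relative to the mismatch cost makes the aligner prefer substitutions over indels, which is the trade-off these options expose.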
-
load_from_config
(config_dict)¶
-
log
= None¶ Defaults to sys.stdout, set to a file-like object to gather debugging info.
-
masks
= None¶ Default [ 'RRRY', 'YYYR' ], treated mask is first.
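The default masks use IUPAC ambiguity codes: R stands for a purine (A or G) and Y for a pyrimidine (C or T). The sketch below classifies a four-nucleotide handle against the two masks. classify_handle is a hypothetical helper, the "untreated" label for the second mask is an assumption inferred from "treated mask is first", and where the handle sits in a read is not covered here.

```python
IUPAC = {"R": "AG", "Y": "CT"}      # only the codes used by the default masks
DEFAULT_MASKS = ["RRRY", "YYYR"]    # treated mask first, per the description above

def matches_mask(handle, mask):
    """True if every nucleotide is allowed by the mask code at that position."""
    if len(handle) != len(mask):
        return False
    # Codes missing from IUPAC (e.g. a literal "A") must match exactly.
    return all(nt in IUPAC.get(code, code) for nt, code in zip(handle, mask))

def classify_handle(handle, masks=DEFAULT_MASKS):
    """Return 'treated', 'untreated' (assumed label), or None if neither matches."""
    for label, mask in zip(("treated", "untreated"), masks):
        if matches_mask(handle, mask):
            return label
    return None

print(classify_handle("AAGC"))  # treated
print(classify_handle("CTTA"))  # untreated
print(classify_handle("ACGT"))  # None
```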
-
minimum_adapter_len
= None¶ Defaults to 0, set higher to require a minimal amount of adapter in order to do trimming. Generally not necessary since a positive match in the target is required before trimming.
-
minimum_tag_match_length
= None¶ Default 8, set to adjust the minimum length for matching tags for the reads analyzer.
-
minimum_target_match_length
= None¶ Defaults to 10. Runs will potentially be faster if you set it higher, but may miss some pairs that have only shorter matching subsequences. You can set it lower, but then there’s some chance pairs will match the wrong place (in which case they will have too many errors and be discarded), and it will allow shorter sequences at the end (which end up being adapter-trimmed) to be accepted. You may want to analyze your targets (see target.Target.longest_target_self_matches()) to determine an appropriate value.
-
mutations_require_quality_score
= None¶ Defaults to None. If set to an integer Phred score (0 - 42) and count_mutations is True, a mutation is counted only if its quality score is greater than or equal to that value.
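The quality filter described above can be sketched as follows. The helper names are hypothetical, and the sketch assumes the FASTQ quality string uses the common Phred+33 encoding (each character's ASCII code minus 33), which this page does not state.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]

def mutation_is_counted(quality_string, position, min_phred=None):
    """Apply the threshold; None (the default) means no quality requirement."""
    if min_phred is None:
        return True
    return phred_scores(quality_string)[position] >= min_phred

quality = "I#5"  # decodes to 40, 2, 20 under Phred+33
print(phred_scores(quality))
print(mutation_is_counted(quality, 1, min_phred=30))  # False: score 2 is below 30
```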
-
num_workers
= None¶ Default None, which auto-detects the number of available cores (using multiprocessing.cpu_count()) and creates that many workers. Set to an integer to force an explicit number. Set to 1 to disable multiprocessing (sometimes useful for debugging). Only used when bulk processing input data from within spats.Spats (spats.Spats.process_pair_data() or spats.Spats.process_pair_db()).
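The rule for resolving the worker count can be sketched with the standard library. resolve_num_workers is a hypothetical helper shown only to make the None / explicit / 1 cases concrete; it is not the package's own code.

```python
import multiprocessing

def resolve_num_workers(num_workers=None):
    """Return the number of workers to create for bulk processing."""
    if num_workers is None:
        # Auto-detect: one worker per available core.
        return multiprocessing.cpu_count()
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    return int(num_workers)  # 1 effectively disables multiprocessing

print(resolve_num_workers())   # machine-dependent
print(resolve_num_workers(1))  # 1
```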
-
pair_length
= None¶ Default None, in which case the pair length is detected from input data. Otherwise, can be set explicitly.
-
quiet
= None¶ Default False, set to True to suppress output messages.
-
regions_of_interest
= None¶ Default None, specify a list of pairs like (min, max) where each pair specifies a minimum and maximum nucleotide index indicating a region in which to watch for activity. If a read has a stop and/or mutation within this region, it will be tagged as interesting. Only meaningful when using the reads tool and the find_partial algorithm.
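The region test can be sketched as a simple range check over the stop and mutation sites of a read. is_interesting is a hypothetical helper, and treating both the minimum and the maximum index as inclusive is an assumption; the description above does not say whether the bounds are inclusive.

```python
def is_interesting(sites, regions_of_interest=None):
    """True if any stop/mutation site falls inside any (min, max) region."""
    if not regions_of_interest:
        return False  # default None: nothing is watched
    return any(low <= site <= high
               for site in sites
               for low, high in regions_of_interest)

regions = [(10, 20), (45, 60)]
print(is_interesting([5, 47], regions))  # True: 47 lies in (45, 60)
print(is_interesting([5, 30], regions))  # False
```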
-
result_set_name
= None¶ Defaults to "default", set to a string to choose the name for this result set. Also used for resuming processing, see resume_processing. Result sets can be compared using db.PairDB.differing_results().
-
resume_processing
= None¶ Default False, set to True to resume processing (if there’s been a previous run using writeback_results and the same result_set_name).
-
rt_primers
= None¶ Default None, which means allow RT starts either from anywhere (when allow_multiple_rt_starts is True) or from the right (3’) edge of the target only (when it is False). Set to a list of comma-separated strings to restrict where RT starts can happen in the target. Specifying these will force the allow_multiple_rt_starts option to be True. Note: each primer in the list must unambiguously match a single unique region of the target.
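The uniqueness requirement on primers can be checked before a run. The sketch below counts exact, possibly overlapping occurrences of each primer in the target and reports those that do not occur exactly once. The helper names are hypothetical, and exact forward-strand matching is an assumption; how SPATS itself matches primers is not described here.

```python
def count_occurrences(target, primer):
    """Count exact occurrences of primer in target, including overlapping ones."""
    count = 0
    start = target.find(primer)
    while start != -1:
        count += 1
        start = target.find(primer, start + 1)
    return count

def ambiguous_primers(target, rt_primers):
    """Return the primers that are missing from, or repeated in, the target."""
    return [p for p in rt_primers if count_occurrences(target, p) != 1]

target = "ACGTACGTTT"
print(ambiguous_primers(target, ["ACGT", "GTTT", "GGG"]))  # ['ACGT', 'GGG']
```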
-
single_target_linker
= None¶ Default None, if set to a sequence, then this will be required as a prefix to R1. Internally, this sets allow_multiple_rt_starts to True, and rt_primers to this value.
-
skip_database
= None¶ Default True, set to False to parse to a database and only process unique counts. Typically straight parsing is faster, but on some datasets it can be faster to determine and only process unique counts.
-
validate_config
()¶
-
writeback_results
= None¶ Default False, set to True to write the results back to the input database for further analysis or incremental/resumable processing. See resume_processing.
-
-
class
spats_shape_seq.reads.
ReadsData
(db_path)¶ Encapsulates the data for reads analysis.
Parameters: db_path – the path for the reads data.
-
pair_db
¶ Access the underlying db.PairDB.
-
parse
(target_path, r1_paths, r2_paths, sample_size=100000, show_progress_every=1000000)¶ Used to parse target and r1/r2 reads data for reads processing.
Parameters: - target_path – Path to targets file, must be in FASTA format.
- r1_paths – List of paths to R1 reads files, must be in FASTQ format.
- r2_paths – List of paths to R2 reads files, must be in FASTQ format.
- sample_size – the number of samples to use for analysis. Samples will be (more or less) uniformly randomly selected from the population of pairs.
- show_progress_every – default show a ‘.’ for every 1 million pairs parsed. Set to 0 to disable output.
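One standard way to draw a uniform random sample of fixed size from a stream of unknown length is reservoir sampling. The sketch below is offered only to illustrate what sample_size means for a large input; it is a swapped-in textbook technique, not a statement of how parse() selects pairs, and reservoir_sample is a hypothetical name.

```python
import random

def reservoir_sample(items, sample_size, rng=None):
    """Uniformly sample up to sample_size items from an iterable in one pass."""
    rng = rng or random.Random()
    reservoir = []
    for index, item in enumerate(items):
        if index < sample_size:
            reservoir.append(item)  # fill the reservoir first
        else:
            # Keep the new item with probability sample_size / (index + 1).
            slot = rng.randint(0, index)
            if slot < sample_size:
                reservoir[slot] = item
    return reservoir

sample = reservoir_sample(range(1000000), 10, rng=random.Random(0))
print(len(sample))  # 10
```

When the input holds fewer items than sample_size, every item is kept.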
-
-
class
spats_shape_seq.reads.
ReadsAnalyzer
(reads_data, cotrans=False)¶ Performs the analysis/tagging required for the reads analysis.
Parameters: - reads_data – the ReadsData with the input data.
- cotrans – pass True for cotrans-style experiments.
-
addTagPlugin
(tag, handler)¶
-
addTagTarget
(name, tag)¶
Processes the tags in the input data for analysis.
-
tag_counts
()¶ Returns a dictionary of { tag: count } of tags analyzed.