FAQ¶
!!! note
    Hashing and reading modes, along with the kmer parameters, are tightly associated with the kDataFrame object.
1. Hashing¶
1.1 What are kDataFrame hashing modes?¶
Hashing modes apply only to kDataFrameMQF and kDataFramePHMAP.
The kDataFrame can use one of the following hashing modes:
- 0: Setting the mode parameter to `0` makes the hashing irreversible. In other words, once a kmer is hashed and inserted into the `kDataFrame`, it cannot be converted back to its string representation.
- 1: Hashing mode `1` makes the hashing reversible, so you can iterate over the `kDataFrame` kmers in their string representation.
1.2 Why are there no user-selected hashing modes in kDataFrameMAP?¶
The only purpose of kDataFrameMAP is to store the kmers in lexicographic order. Consequently, no hashing is required; the kmers are simply stored in their two-bit representation.
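To make the idea concrete, here is a minimal sketch of a two-bit encoding in plain Python. It does not use kProcessor: the A=0, C=1, G=2, T=3 mapping and the helper names `two_bit_encode` and `two_bit_decode` are assumptions chosen for illustration. With this mapping, sorting kmers of equal length by their integer value gives the same order as sorting them lexicographically, and the integer can always be decoded back to the string.

```python
from itertools import product

# Assumed mapping: alphabetical order of the bases, two bits each.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def two_bit_encode(kmer):
    """Pack a nucleotide kmer into an integer, two bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | BASE_TO_BITS[base]
    return value


def two_bit_decode(value, k):
    """Unpack an integer produced by two_bit_encode into a kmer of length k."""
    bases = []
    for _ in range(k):
        bases.append(BITS_TO_BASE[value & 0b11])
        value >>= 2
    return "".join(reversed(bases))


all_3mers = ["".join(p) for p in product("ACGT", repeat=3)]
by_integer = sorted(all_3mers, key=two_bit_encode)
print(by_integer == sorted(all_3mers))  # True
print(two_bit_encode("ACGT"))           # 27 (binary 00 01 10 11)
```

Because the encoding is a plain positional code rather than a hash, no information is lost and the numeric order mirrors the string order.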
1.3 What’s the best hashing mode to use?¶
There is no single best hashing mode; the choice depends on the application.
The irreversible mode is faster than the reversible mode and requires less space. However, it is less precise, because it stores an approximate hash value for each kmer with a user-controlled false-positive rate. As the false-positive rate increases, so does the probability that multiple kmers are hashed to the same value. Moreover, querying the kDataFrame for a kmer that was never inserted can report that the kmer exists when it does not.
On the other hand, the reversible hashing mode is slower than the irreversible mode but allows the user to iterate over the kmers in their string representation.
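The trade-off can be illustrated with a small, self-contained sketch. It is not kProcessor's implementation: the reversible hash below is a simple invertible multiplication applied to the two-bit integer, and the irreversible hash is a truncated BLAKE2 digest standing in for an approximate hash. All function names and the multiplier constant are assumptions made for illustration. The modular inverse via `pow(x, -1, m)` needs Python 3.8 or later.

```python
import hashlib
from itertools import product

BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
# Any odd constant is invertible modulo a power of two (illustrative choice).
ODD_MULTIPLIER = 0x9E3779B97F4A7C15


def encode(kmer):
    value = 0
    for base in kmer:
        value = (value << 2) | BASE_TO_BITS[base]
    return value


def decode(value, k):
    return "".join("ACGT"[(value >> (2 * i)) & 3] for i in reversed(range(k)))


def reversible_hash(kmer):
    """Bijective mixing of the 2k-bit integer: nothing is lost."""
    modulus = 1 << (2 * len(kmer))
    return (encode(kmer) * ODD_MULTIPLIER) % modulus


def reversible_unhash(hashed, k):
    modulus = 1 << (2 * k)
    inverse = pow(ODD_MULTIPLIER, -1, modulus)
    return decode((hashed * inverse) % modulus, k)


def irreversible_hash(kmer, bits):
    """Keep only `bits` bits of a digest: smaller, but kmers can collide."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def find_collision(k, bits):
    """Return two distinct kmers of length k sharing a fingerprint, if any."""
    seen = {}
    for kmer in ("".join(p) for p in product("ACGT", repeat=k)):
        fingerprint = irreversible_hash(kmer, bits)
        if fingerprint in seen:
            return seen[fingerprint], kmer
        seen[fingerprint] = kmer
    return None


print(reversible_unhash(reversible_hash("GATTACA"), 7))  # GATTACA

# 256 possible 4-mers but only 64 fingerprints, so a collision must exist.
stored, never_stored = find_collision(k=4, bits=6)
stored_fingerprints = {irreversible_hash(stored, 6)}
print(irreversible_hash(never_stored, 6) in stored_fingerprints)  # True
```

The last line is a false positive in miniature: only `stored` was inserted, yet the lookup for `never_stored` succeeds because both kmers share a fingerprint. Using more bits lowers the chance of this happening at the cost of more space.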
2. Sequence Parsing¶
To extract kmers from a sequence file or string in kProcessor, pass a Python dictionary containing the parsing parameters.
2.1 Parameters¶
The dictionary keys are strings naming each parameter; the values are integers giving the parameter values.
2.2 Modes¶
There are four modes for decoding sequence substrings: KMERS, SKIPMERS, MINIMIZERS, and PROTEIN. The following table describes the compatibility between reading modes and hashing modes. A value of true marks a pair that can be used together.
| Reading Mode | Hashing Mode | Compatible? |
|:---:|:---:|:---:|
| KMERS | mumur_hasher | true |
| SKIPMERS | mumur_hasher | true |
| MINIMIZERS | mumur_hasher | true |
| PROTEIN | mumur_hasher | false |
| KMERS | integer_hasher | true |
| SKIPMERS | integer_hasher | true |
| MINIMIZERS | integer_hasher | true |
| PROTEIN | integer_hasher | false |
| KMERS | TwoBits_hasher | true |
| SKIPMERS | TwoBits_hasher | true |
| MINIMIZERS | TwoBits_hasher | true |
| PROTEIN | TwoBits_hasher | false |
| KMERS | nonCanonicalInteger_Hasher | true |
| SKIPMERS | nonCanonicalInteger_Hasher | true |
| MINIMIZERS | nonCanonicalInteger_Hasher | true |
| PROTEIN | nonCanonicalInteger_Hasher | false |
| KMERS | protein_hasher | false |
| SKIPMERS | protein_hasher | false |
| MINIMIZERS | protein_hasher | false |
| PROTEIN | protein_hasher | true |
| KMERS | proteinDayhoff_hasher | false |
| SKIPMERS | proteinDayhoff_hasher | false |
| MINIMIZERS | proteinDayhoff_hasher | false |
| PROTEIN | proteinDayhoff_hasher | true |
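The table follows a simple rule: the three nucleotide reading modes pair with the four nucleotide hashers, and PROTEIN pairs with the two protein hashers. The sketch below encodes that rule as a lookup. The helper `is_compatible` is hypothetical and works on plain strings; it is not part of kProcessor.

```python
NUCLEOTIDE_READING_MODES = {"KMERS", "SKIPMERS", "MINIMIZERS"}
PROTEIN_READING_MODES = {"PROTEIN"}

NUCLEOTIDE_HASHERS = {
    "mumur_hasher",
    "integer_hasher",
    "TwoBits_hasher",
    "nonCanonicalInteger_Hasher",
}
PROTEIN_HASHERS = {"protein_hasher", "proteinDayhoff_hasher"}


def is_compatible(reading_mode, hashing_mode):
    """Return True if the pair is marked true in the compatibility table."""
    if hashing_mode not in NUCLEOTIDE_HASHERS | PROTEIN_HASHERS:
        raise ValueError(f"unknown hashing mode: {hashing_mode}")
    if reading_mode in NUCLEOTIDE_READING_MODES:
        return hashing_mode in NUCLEOTIDE_HASHERS
    if reading_mode in PROTEIN_READING_MODES:
        return hashing_mode in PROTEIN_HASHERS
    raise ValueError(f"unknown reading mode: {reading_mode}")


print(is_compatible("SKIPMERS", "integer_hasher"))  # True
print(is_compatible("PROTEIN", "TwoBits_hasher"))   # False
```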
2.2.1 KMERS¶
- Description: Extracts sequence substrings in the default and most common mode, "kmers".
- Reading Mode: `KMERS`
- Parameters:
    - `k_size`: total number of nucleotide bases
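As a minimal illustration of what this mode produces, the following plain-Python sketch slides a window of `k_size` bases along a sequence. The function name `extract_kmers` is hypothetical and the code does not call kProcessor.

```python
def extract_kmers(sequence, k_size):
    """Return every substring of length k_size, sliding one base at a time."""
    return [sequence[i:i + k_size] for i in range(len(sequence) - k_size + 1)]


print(extract_kmers("ACGTAC", 3))  # ['ACG', 'CGT', 'GTA', 'TAC']
```

A sequence of length L yields L - k_size + 1 kmers, and none when it is shorter than `k_size`.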
2.2.2 SKIPMERS¶
- Description: A cyclic pattern of picked and skipped positions. Read more about skipmers.
- Reading Mode: `SKIPMERS`
- Parameters:
    - `k_size`: total number of bases
    - `m`: used bases per cycle
    - `n`: cycle length
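The three parameters can be sketched in plain Python as follows. This follows the general skip-mer definition (use the first `m` bases of every cycle of `n` until `k_size` bases are collected) and assumes `k_size` is a multiple of `m`; it is not necessarily how kProcessor implements the mode, and `extract_skipmers` is a hypothetical name.

```python
def extract_skipmers(sequence, k_size, m, n):
    """Collect skipmers: m used bases out of every n, k_size bases in total."""
    if m < 1 or m > n or k_size % m != 0:
        raise ValueError("expected 1 <= m <= n and k_size to be a multiple of m")
    # Number of sequence positions one skipmer stretches over.
    span = n * (k_size // m - 1) + m
    # Offsets inside the span that fall in the "used" part of each cycle.
    offsets = [j for j in range(span) if j % n < m]
    return [
        "".join(sequence[start + j] for j in offsets)
        for start in range(len(sequence) - span + 1)
    ]


print(extract_skipmers("ACGTACGTACGTAC", k_size=10, m=2, n=3))  # ['ACTAGTCGAC']
```

With `k_size=10`, `m=2`, `n=3`, each skipmer spans 14 positions and skips every third base. Setting `m` equal to `n` skips nothing and reduces to ordinary kmers.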
2.2.3 MINIMIZERS¶
- Description: Short substrings that represent the sequence. Read more about minimizers.
- Reading Mode: `MINIMIZERS`
- Parameters:
    - `k_size`: total number of bases
    - `w`: window size
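A common way to define minimizers is to keep, for every window of `w` consecutive kmers, the smallest kmer in that window. The sketch below assumes that definition with lexicographic ordering and leftmost tie-breaking; kProcessor may interpret `w` or order the kmers differently, and `extract_minimizers` is a hypothetical name.

```python
def extract_minimizers(sequence, k_size, w):
    """Return (position, kmer) for the smallest kmer of each window of w kmers."""
    kmers = [sequence[i:i + k_size] for i in range(len(sequence) - k_size + 1)]
    minimizers = []
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        # Smallest kmer wins; ties go to the leftmost position.
        position = min(window, key=lambda i: (kmers[i], i))
        # Adjacent windows often share a minimizer, so record it only once.
        if not minimizers or minimizers[-1][0] != position:
            minimizers.append((position, kmers[position]))
    return minimizers


print(extract_minimizers("TTACGTT", k_size=3, w=3))  # [(2, 'ACG')]
```

In this example the five kmers of the sequence are represented by a single minimizer, which is the space saving the mode is after.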
2.2.4 PROTEIN¶
- Description: Parses protein sequences to extract the sequence substrings.
- Reading Mode: `PROTEIN`
- Parameters:
    - `k_size`: total number of amino acids
2.3 Default Hashing and Reading Modes¶
- kDataFrameMQF/kDataFrameBMQF: The default reading mode is `KMERS` and the default hashing mode is `integer_hasher`, which is reversible.
- kDataFramePHMAP: The default reading mode is `KMERS` and the default hashing mode is `TwoBits_hasher`, which is reversible.
- kDataFrameMAP: The default reading mode is `KMERS` and the default hashing mode is `TwoBits_hasher`, which is reversible.
!!! note "technical information"
    `TwoBits_hasher` is used to avoid hashing the kmers twice, since the underlying data structures have their own internal hashing. `TwoBits_hasher` is not an actual hasher; it only converts a kmer substring to its corresponding two-bit representation.
2.4 Examples¶
2.4.1 Extract kmers with kmer size 31¶
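```python
import kProcessor as kp

# Assumes the reading and hashing mode constants are exposed on the kp module.
parse_params = {
    "kSize": 31,
}
KF_KMERS = kp.kDataFramePHMAP(kp.KMERS, kp.TwoBits_hasher, parse_params)
```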
2.4.2 Extract skipmers with k = 10, m = 2, n = 3¶
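```python
import kProcessor as kp

# Assumes the reading and hashing mode constants are exposed on the kp module.
parse_params = {
    "k": 10,  # total number of bases
    "m": 2,   # used bases per cycle
    "n": 3,   # cycle length
}
KF_SKIPMERS = kp.kDataFramePHMAP(kp.SKIPMERS, kp.integer_hasher, parse_params)
```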
2.4.3 Extract Minimizers with k = 5, w = 10¶
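```python
import kProcessor as kp

# Assumes the reading and hashing mode constants are exposed on the kp module.
parse_params = {
    "k": 5,   # total number of bases
    "w": 10,  # window size
}
KF_MINIMIZERS = kp.kDataFramePHMAP(kp.MINIMIZERS, kp.TwoBits_hasher, parse_params)
```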
2.4.4 Extract Protein with kSize = 5¶
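```python
import kProcessor as kp

# Assumes the reading and hashing mode constants are exposed on the kp module.
parse_params = {
    "kSize": 5,  # number of amino acids per substring
}
KF_PROTEIN = kp.kDataFramePHMAP(kp.PROTEIN, kp.protein_hasher, parse_params)
```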