Python API¶
The main programmatic way to interact with kProcessor is through its Python API.
See also the recipes for using the Python API.
Contents
- Python API
  - kDataFrame: The abstract base class for defining the kDataFrame
  - kDataFrameIterator: The abstract base class for defining a kDataFrame iterator
  - kDataFrameMQF: subclass derived from kDataFrame
  - kDataFrameMAP: subclass derived from kDataFrame
  - kDataFramePHMAP: subclass derived from kDataFrame
  - colored_kDataFrame: colored kDataFrame that holds the source sequence of each Kmer
  - Set Functions: Function like intersection & union
  - Kmer input using files or strings
kDataFrame: The abstract base class for defining the kDataFrame¶
-
class
kDataFrame¶ The abstract base class defining a kDataFrame.
-
reserve(n)¶ Request a capacity change so that the kDataFrame can approximately hold at least n kmers
Parameters: n – Minimum number of kmers
-
insert(kmer, N=1)¶ Insert the kmer N times into the kDataFrame, or increment its count by N if it already exists.
Parameters: - kmer (string) – The kmer whose count is incremented
- N (integer) – Kmer count (Optional, Default = 1)
Returns: Boolean value indicating whether the kmer is inserted or not
Return type: bool
-
setCount(kmer, N)¶ Set the kmer’s count to N in the kDataFrame
Parameters: - kmer (string) – The kmer whose count is set
- N (integer) – Kmer count
Returns: Boolean value indicating whether the kmer is inserted or not
Return type: bool
-
getCount(kmer)¶ Retrieve the number of times the kmer was inserted in the kDataFrame
Parameters: kmer (string) – The kmer whose count is retrieved
Returns: The count of the kmer in the kDataFrame
Return type: integer
-
erase(kmer)¶ Remove a kmer from the kDataFrame
Parameters: kmer (string) – The kmer to be erased
Returns: Boolean value indicating whether the kmer is erased or not
Return type: bool
-
size()¶ Number of kmers in the kDataFrame
Returns: The number of kmers in the kDataFrame
Return type: integer
-
max_size()¶ Maximum number of kmers that the kDataFrame can hold.
Returns: The maximum number of kmers that the kDataFrame can hold.
Return type: integer
-
empty()¶ Check whether the kDataFrame contains no kmers.
Returns: Boolean value indicating whether the kDataFrame is empty, i.e. whether its size is 0
Return type: boolean
-
load_factor()¶ Retrieve the current load factor of the kDataFrame as a percentage, indicating how full it is.
Returns: The current load factor of the kDataFrame.
Return type: integer
-
max_load_factor()¶ Retrieve the maximum load factor of the kDataFrame as a percentage.
Returns: The maximum load factor of the kDataFrame.
Return type: integer
-
begin()¶ Instantiate a kDataFrameIterator object pointing to the first kmer position
Returns: An iterator at the beginning of the kDataFrame.
Return type: kProcessor.kDataFrameIterator
-
end()¶ Instantiate a kDataFrameIterator object pointing to the last kmer position
Returns: An iterator at the end of the kDataFrame.
Return type: kProcessor.kDataFrameIterator
-
save()¶ Serialize the kDataFrame to disk as a binary file alongside other metadata files.
-
static
load(filePath)¶ A static method to load a kDataFrame file from disk.
Note
Load the file without the extension [.mqf, .map, .phmap]
Parameters: filePath – The serialized kDataFrame binary file without the extension
Returns: The loaded kDataFrame from disk
Return type: kProcessor.kDataFrame
Example:
>>> import kProcessor as kp
>>> # File path : "path/to/file.mqf"
>>> KF = kp.kDataFrame.load("path/to/file")
-
kSize()¶ Get the kmer size of the kDataFrame
Returns: kmer size
Return type: integer
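The counting behaviour described above (insert increments, setCount overwrites, erase removes) can be pictured with a small pure-Python stand-in. This is only a sketch: `ToyKmerTable` is a hypothetical dict-backed class written for illustration, it is not part of kProcessor, and the assumption that a missing kmer reports a count of 0 is not confirmed by this page.

```python
class ToyKmerTable:
    """Hypothetical dict-backed stand-in that mimics the documented counting semantics."""

    def __init__(self, k_size):
        self._k = k_size
        self._counts = {}

    def insert(self, kmer, n=1):
        # Add n to the existing count, or start the count at n.
        self._counts[kmer] = self._counts.get(kmer, 0) + n
        return True

    def setCount(self, kmer, n):
        # Overwrite the count instead of incrementing it.
        self._counts[kmer] = n
        return True

    def getCount(self, kmer):
        # Assumption: kmers that were never inserted report a count of 0.
        return self._counts.get(kmer, 0)

    def erase(self, kmer):
        # True only if the kmer was present and has been removed.
        return self._counts.pop(kmer, None) is not None

    def size(self):
        return len(self._counts)

    def empty(self):
        return self.size() == 0

    def kSize(self):
        return self._k


table = ToyKmerTable(5)
table.insert("ACGTA")            # count starts at 1
table.insert("ACGTA", 3)         # count is incremented by 3
count_after_inserts = table.getCount("ACGTA")
table.setCount("TTTTT", 7)       # count is set, not incremented
removed = table.erase("ACGTA")   # the kmer is dropped entirely
```

The real kDataFrame classes expose the same method names, so code written against this toy reads the same way, but capacity limits and hashing are deliberately left out here.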
-
kDataFrameIterator: The abstract base class for defining a kDataFrame iterator¶
-
class
kDataFrameIterator¶ Base class for kDataFrame Iterator
-
next()¶ Increment the iterator to the next kmer
Returns: kDataFrame Iterator pointing to the new kmer position
Return type: kProcessor.kDataFrameIterator
-
getKmer()¶ Get the kmer at the current iterator position
Returns: Kmer at the current position
Return type: string
-
getHashedKmer()¶ Get the hash value of the kmer at the current iterator position
Returns: Kmer’s hash value at the current position
Return type: integer
-
getCount()¶ Get the count of the kmer at the current iterator position
Returns: kmer count
Return type: integer
-
setCount()¶ Sets the count of the current kmer
Returns: True if succeeded, False if failed
Return type: boolean
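The begin()/end()/next() trio suggests a traversal loop in the style of C++ iterators. The sketch below imitates that loop with two hypothetical classes, `ToyFrame` and `ToyIterator`, which are not part of kProcessor. It assumes that end() acts as a one-past-the-last sentinel and that iterators can be compared with `!=`; neither detail is stated on this page.

```python
class ToyIterator:
    """Hypothetical iterator over (kmer, count) pairs, mirroring the documented method names."""

    def __init__(self, items, pos):
        self._items = items
        self._pos = pos

    def next(self):
        # Advance to the next kmer position and return the iterator itself.
        self._pos += 1
        return self

    def getKmer(self):
        return self._items[self._pos][0]

    def getCount(self):
        return self._items[self._pos][1]

    def __eq__(self, other):
        # Two iterators are equal when they point at the same position of the same frame.
        return self._items is other._items and self._pos == other._pos


class ToyFrame:
    """Hypothetical container exposing begin() and end()."""

    def __init__(self, counts):
        self._items = sorted(counts.items())

    def begin(self):
        return ToyIterator(self._items, 0)

    def end(self):
        # Assumption: end() is a sentinel placed just after the last kmer.
        return ToyIterator(self._items, len(self._items))


frame = ToyFrame({"ACG": 2, "CGT": 1})
seen = []
it = frame.begin()
end = frame.end()
while it != end:
    seen.append((it.getKmer(), it.getCount()))
    it.next()
```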
-
kDataFrameMQF: subclass derived from kDataFrame¶
-
class
kDataFrameMQF(kSize)¶ The class defining a kDataFrameMQF, derived from kDataFrame.
Instantiate a kDataFrameMQF object with predefined kmer size.
Parameters: kSize (integer) – Kmer Size
Returns: kProcessor.kDataFrameMQF
Instantiation Example:
>>> import kProcessor as kp
>>> KF_MQF_1 = kp.kDataFrameMQF(31) # kSize = 31
>>> KF_MQF_1 = kp.kDataFrameMQF(SKIPMERS, integer_hasher, {'m': 2, 'n': 3, 'k': 10}) # Reading mode = skipmers, hashing mode = integer hashing, (m, n, k) are the skipmers params.
>>> KF_MQF_2 = kp.kDataFrameMQF(PROTEIN, protein_hasher, {'kSize': 5}) # Reading/hashing mode = protein, kSize = 5
>>> KF_MQF_3 = kp.kDataFrameMQF(PROTEIN, proteinDayhoff_hasher, {'kSize': 11}) # Reading mode = protein, hashing mode = dayhoff encoding, kSize = 11
Note
Read more about hashing modes in the FAQ page.
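To make the (m, n, k) skipmers parameters more concrete, here is a minimal pure-Python sketch of skip-mer extraction following the general definition: in every cycle of n bases the first m are kept, until k bases have been collected. The function `skipmers` is a hypothetical helper, and whether kProcessor enumerates skipmers in exactly this way is an assumption.

```python
def skipmers(seq, m, n, k):
    """Yield skip-mers of seq: keep the first m bases of every n-base cycle, k bases in total."""
    # Offsets, relative to the start position, of the k bases that are kept.
    offsets = []
    pos = 0
    while len(offsets) < k:
        if pos % n < m:
            offsets.append(pos)
        pos += 1
    span = offsets[-1] + 1  # number of sequence positions one skip-mer stretches over
    for start in range(len(seq) - span + 1):
        yield "".join(seq[start + off] for off in offsets)


# With m=2, n=3, k=4 the kept offsets are 0, 1, 3, 4 (every third base is skipped).
example = list(skipmers("ACGTACG", 2, 3, 4))

# With m == n nothing is skipped, so the result is the ordinary kmers.
plain = list(skipmers("ACGT", 1, 1, 3))
```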
-
getTwin()¶ Create a new kDataFrameMQF using the same parameters as the current kDataFrameMQF.
Returns: A shallow copy of the current kDataFrameMQF.
Return type: kDataFrameMQF
-
reserve(n)¶ Request a capacity change so that the kDataFrameMQF can approximately hold at least n kmers
Parameters: n – Minimum number of kmers
-
insert(kmer, N=1)¶ Insert the kmer N times into the kDataFrameMQF, or increment its count by N if it already exists.
Parameters: - kmer (string) – The kmer whose count is incremented
- N (integer) – Kmer count (Optional, Default = 1)
Returns: Boolean value indicating whether the kmer is inserted or not
Return type: bool
-
setCount(kmer, N)¶ Set the kmer’s count to N in the kDataFrameMQF
Parameters: - kmer (string) – The kmer whose count is set
- N (integer) – Kmer count
Returns: Boolean value indicating whether the kmer is inserted or not
Return type: bool
-
getCount(kmer)¶ Retrieve the number of times the kmer was inserted in the kDataFrameMQF
Parameters: kmer (string) – The kmer whose count is retrieved
Returns: The count of the kmer in the kDataFrameMQF
Return type: integer
-
erase(kmer)¶ Remove a kmer from the kDataFrameMQF
Parameters: kmer (string) – The kmer to be erased
Returns: Boolean value indicating whether the kmer is erased or not
Return type: bool
-
size()¶ Number of kmers in the kDataFrameMQF
Returns: The number of kmers in the kDataFrameMQF
Return type: integer
-
max_size()¶ Maximum number of kmers that the kDataFrameMQF can hold.
Returns: The maximum number of kmers that the kDataFrameMQF can hold.
Return type: integer
-
empty()¶ Check whether the kDataFrameMQF contains no kmers.
Returns: Boolean value indicating whether the kDataFrameMQF is empty, i.e. whether its size is 0
Return type: boolean
-
load_factor()¶ Retrieve the current load factor of the kDataFrameMQF as a percentage, indicating how full it is.
Returns: The current load factor of the kDataFrameMQF.
Return type: integer
-
max_load_factor()¶ Retrieve the maximum load factor of the kDataFrameMQF as a percentage.
Returns: The maximum load factor of the kDataFrameMQF.
Return type: integer
-
begin()¶ Instantiate a kDataFrameIterator object pointing to the first kmer position
Returns: An iterator at the beginning of the kDataFrameMQF.
Return type: kProcessor.kDataFrameIterator
-
end()¶ Instantiate a kDataFrameIterator object pointing to the last kmer position
Returns: An iterator at the end of the kDataFrameMQF.
Return type: kProcessor.kDataFrameIterator
-
save()¶ Serialize the kDataFrameMQF to disk as a binary file alongside other metadata files.
-
static
load(filePath)¶ A static method to load a kDataFrameMQF file from disk.
Note
Load the file without the extension [.mqf, .map, .phmap]
Parameters: filePath – The serialized kDataFrameMQF binary file without the extension
Returns: The loaded kDataFrameMQF from disk
Return type: kProcessor.kDataFrameMQF
Example:
>>> import kProcessor as kp
>>> # File path : "path/to/file.mqf"
>>> KF = kp.kDataFrameMQF.load("path/to/file")
-
kSize()¶ Get the kmer size of the kDataFrameMQF
Returns: kmer size
Return type: integer
kDataFrameMAP: subclass derived from kDataFrame¶
-
class
kDataFrameMAP(kSize)¶ The class defining a kDataFrameMAP, derived from kDataFrame.
Parameters: kSize (integer) – Kmer Size
Returns: kProcessor.kDataFrameMAP
Note
Read more about the usage of kDataFrameMAP in the FAQ page.
-
reserve(n)¶ Request a capacity change so that the kDataFrameMAP can approximately hold at least n kmers
Parameters: n – Minimum number of kmers
-
insert(kmer, N=1)¶ Insert the kmer N times into the kDataFrameMAP, or increment its count by N if it already exists.
Parameters: - kmer (string) – The kmer whose count is incremented
- N (integer) – Kmer count (Optional, Default = 1)
Returns: Boolean value indicating whether the kmer is inserted or not
Return type: bool
-
setCount(kmer, N)¶ Set the kmer’s count to N in the kDataFrameMAP
Parameters: - kmer (string) – The kmer whose count is set
- N (integer) – Kmer count
Returns: Boolean value indicating whether the kmer is inserted or not
Return type: bool
-
getCount(kmer)¶ Retrieve the number of times the kmer was inserted in the kDataFrameMAP
Parameters: kmer (string) – The kmer whose count is retrieved
Returns: The count of the kmer in the kDataFrameMAP
Return type: integer
-
erase(kmer)¶ Remove a kmer from the kDataFrameMAP
Parameters: kmer (string) – The kmer to be erased
Returns: Boolean value indicating whether the kmer is erased or not
Return type: bool
-
size()¶ Number of kmers in the kDataFrameMAP
Returns: The number of kmers in the kDataFrameMAP
Return type: integer
-
max_size()¶ Maximum number of kmers that the kDataFrameMAP can hold.
Returns: The maximum number of kmers that the kDataFrameMAP can hold.
Return type: integer
-
empty()¶ Check whether the kDataFrameMAP contains no kmers.
Returns: Boolean value indicating whether the kDataFrameMAP is empty, i.e. whether its size is 0
Return type: boolean
-
load_factor()¶ Retrieve the current load factor of the kDataFrameMAP as a percentage, indicating how full it is.
Returns: The current load factor of the kDataFrameMAP.
Return type: integer
-
max_load_factor()¶ Retrieve the maximum load factor of the kDataFrameMAP as a percentage.
Returns: The maximum load factor of the kDataFrameMAP.
Return type: integer
-
begin()¶ Instantiate a kDataFrameIterator object pointing to the first kmer position
Returns: An iterator at the beginning of the kDataFrameMAP.
Return type: kProcessor.kDataFrameIterator
-
end()¶ Instantiate a kDataFrameIterator object pointing to the last kmer position
Returns: An iterator at the end of the kDataFrameMAP.
Return type: kProcessor.kDataFrameIterator
-
save()¶ Serialize the kDataFrameMAP to disk as a binary file alongside other metadata files.
-
static
load(filePath)¶ A static method to load a kDataFrameMAP file from disk.
Note
Load the file without the extension [.mqf, .map, .phmap]
Parameters: filePath – The serialized kDataFrameMAP binary file without the extension
Returns: The loaded kDataFrameMAP from disk
Return type: kProcessor.kDataFrameMAP
Example:
>>> import kProcessor as kp
>>> # File path : "path/to/file.map"
>>> KF = kp.kDataFrameMAP.load("path/to/file")
-
kSize()¶ Get the kmer size of the kDataFrameMAP
Returns: kmer size
Return type: integer
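Because load() expects the path without its extension, a small helper can derive that prefix from a full file name. The sketch below uses only the standard library; `load_prefix` is a hypothetical helper name, not a kProcessor function, and the extension list is taken from the note above.

```python
from pathlib import Path

# Extensions listed in the load() note above.
KNOWN_EXTENSIONS = {".mqf", ".map", ".phmap"}


def load_prefix(path):
    """Return path with a trailing kDataFrame extension removed, if one is present."""
    p = Path(path)
    if p.suffix in KNOWN_EXTENSIONS:
        p = p.with_suffix("")
    return p.as_posix()


prefix = load_prefix("path/to/file.map")
```

The resulting string is what would be passed to a call such as `kp.kDataFrameMAP.load(prefix)`.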
-
kDataFramePHMAP: subclass derived from kDataFrame¶
-
class
kDataFramePHMAP(kSize)¶ The class defining a kDataFramePHMAP, derived from kDataFrame.
Instantiate a kDataFramePHMAP object with predefined kmer size.
Parameters: - kSize (integer) – Kmer Size
- mode (integer) – Hashing mode for the kDataFramePHMAP, default = 1
Returns: kProcessor.kDataFramePHMAP
Instantiation Example:
>>> import kProcessor as kp
>>> KF_PHMAP_1 = kp.kDataFramePHMAP(31) # kSize = 31
>>> KF_PHMAP_2 = kp.kDataFramePHMAP(PROTEIN, protein_hasher, {'kSize': 5}) # Reading/hashing mode = protein, kSize = 5
>>> KF_PHMAP_3 = kp.kDataFramePHMAP(PROTEIN, proteinDayhoff_hasher, {'kSize': 11}) # Reading mode = protein, hashing mode = dayhoff encoding, kSize = 11
Note
Read more about reading and hashing modes in the FAQ page.
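As background for the Dayhoff hashing mode, the sketch below applies the commonly used six-group Dayhoff alphabet to a protein string before cutting it into kmers. The functions `dayhoff_encode` and `dayhoff_kmers` are hypothetical helpers, and it is an assumption that proteinDayhoff_hasher uses this exact grouping and letter assignment.

```python
# Common six-group Dayhoff alphabet: each amino acid is replaced by its group letter.
DAYHOFF_GROUPS = {
    "C": "a",
    "AGPST": "b",
    "DENQ": "c",
    "HKR": "d",
    "ILMV": "e",
    "FWY": "f",
}
DAYHOFF = {aa: code for group, code in DAYHOFF_GROUPS.items() for aa in group}


def dayhoff_encode(protein):
    """Translate a protein sequence into the reduced Dayhoff alphabet."""
    return "".join(DAYHOFF[aa] for aa in protein.upper())


def dayhoff_kmers(protein, k):
    """Return the kmers of the Dayhoff-encoded sequence, in order."""
    encoded = dayhoff_encode(protein)
    return [encoded[i:i + k] for i in range(len(encoded) - k + 1)]


# Two different peptides collapse to the same encoded string,
# which is why this mode tolerates conservative substitutions.
first = dayhoff_encode("MKVLA")
second = dayhoff_encode("MRILG")
kmers = dayhoff_kmers("MKVLA", 3)
```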
-
getTwin()¶ Create a new kDataFramePHMAP using the same parameters as the current kDataFramePHMAP.
Returns: A shallow copy of the current kDataFramePHMAP.
Return type: kDataFramePHMAP
-
reserve(n)¶ Request a capacity change so that the kDataFramePHMAP can approximately hold at least n kmers
Parameters: n – Minimum number of kmers
-
insert(kmer, N=1)¶ Insert the kmer N times into the kDataFramePHMAP, or increment its count by N if it already exists.
Parameters: - kmer (string) – The kmer whose count is incremented
- N (integer) – Kmer count (Optional, Default = 1)
Returns: Boolean value indicating whether the kmer is inserted or not
Return type: bool
-
setCount(kmer, N)¶ Set the kmer’s count to N in the kDataFramePHMAP
Parameters: - kmer (string) – The kmer whose count is set
- N (integer) – Kmer count
Returns: Boolean value indicating whether the kmer is inserted or not
Return type: bool
-
getCount(kmer)¶ Retrieve the number of times the kmer was inserted in the kDataFramePHMAP
Parameters: kmer (string) – The kmer whose count is retrieved
Returns: The count of the kmer in the kDataFramePHMAP
Return type: integer
-
erase(kmer)¶ Remove a kmer from the kDataFramePHMAP
Parameters: kmer (string) – The kmer to be erased
Returns: Boolean value indicating whether the kmer is erased or not
Return type: bool
-
size()¶ Number of kmers in the kDataFramePHMAP
Returns: The number of kmers in the kDataFramePHMAP
Return type: integer
-
max_size()¶ Maximum number of kmers that the kDataFramePHMAP can hold.
Returns: The maximum number of kmers that the kDataFramePHMAP can hold.
Return type: integer
-
empty()¶ Check whether the kDataFramePHMAP contains no kmers.
Returns: Boolean value indicating whether the kDataFramePHMAP is empty, i.e. whether its size is 0
Return type: boolean
-
load_factor()¶ Retrieve the current load factor of the kDataFramePHMAP as a percentage, indicating how full it is.
Returns: The current load factor of the kDataFramePHMAP.
Return type: integer
-
max_load_factor()¶ Retrieve the maximum load factor of the kDataFramePHMAP as a percentage.
Returns: The maximum load factor of the kDataFramePHMAP.
Return type: integer
-
begin()¶ Instantiate a kDataFrameIterator object pointing to the first kmer position
Returns: An iterator at the beginning of the kDataFramePHMAP.
Return type: kProcessor.kDataFrameIterator
-
end()¶ Instantiate a kDataFrameIterator object pointing to the last kmer position
Returns: An iterator at the end of the kDataFramePHMAP.
Return type: kProcessor.kDataFrameIterator
-
save()¶ Serialize the kDataFramePHMAP to disk as a binary file alongside other metadata files.
-
static
load(filePath)¶ A static method to load a kDataFramePHMAP file from disk.
Note
Load the file without the extension [.mqf, .map, .phmap]
Parameters: filePath – The serialized kDataFramePHMAP binary file without the extension
Returns: The loaded kDataFramePHMAP from disk
Return type: kProcessor.kDataFramePHMAP
Example:
>>> import kProcessor as kp
>>> # File path : "path/to/file.phmap"
>>> KF = kp.kDataFramePHMAP.load("path/to/file")
-
kSize()¶ Get the kmer size of the kDataFramePHMAP
Returns: kmer size
Return type: integer
colored_kDataFrame: colored kDataFrame that holds the source sequence of each Kmer¶
-
class
colored_kDataFrame¶ colored_kDataFrame class
Note
The colored_kDataFrame inherits all the functions from kProcessor.kDataFrame plus other new functions.
Introduction:
- The colored_kDataFrame class holds the kmers’ colors instead of their counts.
- A color is an integer that represents the set of targets containing that kmer.
- Example:
color: 1 represents the transcripts transcript_A, transcript_B and transcript_C
color: 2 represents the transcripts transcript_A, transcript_B
kmer: ACTGATCGATCGTACGAC has the color 2, which means it is found in both transcript_A and transcript_B
kmer: ATAAGCATTTACAGCAAT has the color 1, which means it is found in all of transcript_A, transcript_B and transcript_C
-
getColor(kmer)¶ Get the color of the kmer
Parameters: kmer (str) – Kmer string
Returns: The color of the kmer
Return type: int
-
getKmerSource(kmer)¶ Get all sample IDs that contain that kmer.
Parameters: kmer (str) – Kmer string
Returns: List of all sample IDs associated with that kmer.
Return type: list
-
getKmerSourceFromColor(color)¶ Get all sample IDs associated with that color.
Parameters: color (int) – Kmer color
Returns: List of all sample IDs associated with that color.
Return type: list
-
names_map()¶ Get the names map dictionary, with the sample ID as key and its group name as value.
Returns: names map dictionary.
Return type: dict
-
inverse_names_map()¶ Get the inverse names map dictionary, with the group name as key and its sample ID as value.
Returns: inverse names map dictionary.
Return type: dict
-
static
load(prefix)¶ Load a colored_kDataFrame file from disk.
Parameters: prefix (string) – file path
Returns: Colored kDataFrame that has been serialized on disk.
Return type: kProcessor.colored_kDataFrame
-
get_kDataFrame()¶ Get the kDataFrame object that holds the kmers alongside their colors.
Returns: The embedded kDataFrame inside the colored_kDataFrame.
Return type: kProcessor.kDataFrame
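The relationship between kmers, colors, sample IDs and group names in the transcript example from the introduction can be sketched with plain dictionaries. Everything below is illustrative: the sample IDs 0 to 2, the lower-case function names and the convention that an unknown kmer has color 0 are assumptions, not kProcessor behaviour.

```python
# Sample ID -> group name, in the spirit of names_map() (IDs are made up for this sketch).
names_map = {0: "transcript_A", 1: "transcript_B", 2: "transcript_C"}
inverse_names_map = {name: sample_id for sample_id, name in names_map.items()}

# Color -> sample IDs, and kmer -> color, taken from the introduction example.
color_to_ids = {1: [0, 1, 2], 2: [0, 1]}
kmer_to_color = {"ACTGATCGATCGTACGAC": 2, "ATAAGCATTTACAGCAAT": 1}


def get_color(kmer):
    # Assumption: kmers that were never indexed get color 0.
    return kmer_to_color.get(kmer, 0)


def get_kmer_source_from_color(color):
    return list(color_to_ids.get(color, []))


def get_kmer_source(kmer):
    # A kmer's sources are simply the sources of its color.
    return get_kmer_source_from_color(get_color(kmer))


sources = get_kmer_source("ACTGATCGATCGTACGAC")
source_names = [names_map[sample_id] for sample_id in sources]
```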
Set Functions: Function like intersection & union¶
-
kFrameUnion(input)¶ Calculate the union of the kDataFrames. The resulting kDataFrame contains all the kmers in the input list of kDataFrames. The count of each kmer equals the sum of its counts across the input list.
Warning
This function works only with kProcessor.kDataFrameMQF.
Parameters: input (list of kProcessor.kDataFrameMQF) – List of kDataFrames
Returns: New kDataFrame object holding the union of kmers in the kDataFrames list.
Return type: kProcessor.kDataFrame
-
kFrameIntersect(input)¶ Calculate the intersection of the kDataFrames. The resulting kDataFrame contains only the kmers that exist in all the kDataFrames. The count of each kmer equals the minimum of its counts across the input list.
Warning
This function works only with kProcessor.kDataFrameMQF.
Parameters: input (list of kProcessor.kDataFrameMQF) – List of kDataFrames
Returns: New kDataFrame object holding the intersection of kmers in the kDataFrames list.
Return type: kDataFrame
-
kFrameDiff(input)¶ Calculate the difference of the kDataFrames. The resulting kDataFrame contains only the kmers that exist in the first kDataFrame and in none of the remaining input kDataFrames. The count of each kmer equals its count in the first kDataFrame.
Warning
This function works only with kProcessor.kDataFrameMQF.
Parameters: input (list of kProcessor.kDataFrameMQF) – List of kDataFrames
Returns: New kDataFrame object holding the difference of kmers in the kDataFrames list.
Return type: kDataFrame
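The count rules stated for the three set functions (sum for union, minimum for intersection, first frame's count for difference) can be checked on plain dictionaries. This is a sketch of the semantics only: `frame_union`, `frame_intersect` and `frame_diff` are hypothetical helpers operating on `dict` objects, not on kDataFrameMQF instances.

```python
def frame_union(frames):
    """All kmers from every frame; counts are summed."""
    out = {}
    for frame in frames:
        for kmer, count in frame.items():
            out[kmer] = out.get(kmer, 0) + count
    return out


def frame_intersect(frames):
    """Only kmers present in every frame; the count is the minimum across frames."""
    first, rest = frames[0], frames[1:]
    return {
        kmer: min([count] + [frame[kmer] for frame in rest])
        for kmer, count in first.items()
        if all(kmer in frame for frame in rest)
    }


def frame_diff(frames):
    """Kmers of the first frame that appear in none of the others; counts come from the first."""
    first, rest = frames[0], frames[1:]
    return {
        kmer: count
        for kmer, count in first.items()
        if not any(kmer in frame for frame in rest)
    }


a = {"AAA": 2, "CCC": 1, "GGG": 5}
b = {"AAA": 3, "GGG": 1, "TTT": 4}

union = frame_union([a, b])
intersection = frame_intersect([a, b])
difference = frame_diff([a, b])
```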
Kmer input using files or strings¶
-
index(kframe, filename, chunk_size, names_fileName)¶ Index a sequences file with a predefined kmer decoding mode.
Parameters: - kframe (kProcessor.kDataFrame) – the kDataFrame to be filled with the kmers and their colors
- filename (str) – Sequence(s) file path
- chunk_size (int) – Number of sequences to parse at once.
- names_fileName – The TSV names file that maps target sequence headers to their groups.
Returns: colored kDataFrame with the decoded kmers and their colors.
Return type: kProcessor.colored_kDataFrame
Example 1:
>>> import kProcessor as kp
>>> KF = kp.kDataFrameMQF(31)
>>> ckf = kp.index(KF, "seq.fa", 1000, "seq.names")
Example 2:
>>> import kProcessor as kp
>>> KF_prot = kp.kDataFramePHMAP(PROTEIN, protein_hasher, {'kSize': 5})
>>> ckf_prot = kp.index(KF_prot, "seq.fa", 1000, "seq.names")
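To illustrate what indexing produces, the sketch below reads a names file and assigns one color per distinct set of groups that share a kmer. It is a toy model under stated assumptions: the names file is taken to be two tab-separated columns (sequence header, group name), colors are numbered from 1 in order of first appearance, and `read_names` and `toy_index` are hypothetical helpers rather than kProcessor functions.

```python
import io


def read_names(handle):
    """Parse an assumed two-column TSV: sequence header, then group name."""
    names = {}
    for line in handle:
        line = line.rstrip("\n")
        if line:
            header, group = line.split("\t")
            names[header] = group
    return names


def toy_index(sequences, names, k):
    """Return (kmer -> color, color -> sorted group names) for the given sequences."""
    kmer_groups = {}
    for header, seq in sequences.items():
        group = names[header]
        for i in range(len(seq) - k + 1):
            kmer_groups.setdefault(seq[i:i + k], set()).add(group)

    color_of_set = {}
    kmer_color = {}
    for kmer in sorted(kmer_groups):
        key = frozenset(kmer_groups[kmer])
        if key not in color_of_set:
            # Each distinct set of groups gets the next unused color.
            color_of_set[key] = len(color_of_set) + 1
        kmer_color[kmer] = color_of_set[key]

    color_sources = {color: sorted(groups) for groups, color in color_of_set.items()}
    return kmer_color, color_sources


names = read_names(io.StringIO("seq1\tgeneA\nseq2\tgeneA\nseq3\tgeneB\n"))
sequences = {"seq1": "ACGT", "seq2": "CGTA", "seq3": "GTAC"}
kmer_color, color_sources = toy_index(sequences, names, 3)
```

Here the kmer GTA occurs in sequences from both groups, so it receives a color of its own that maps back to both group names.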
-
countKmersFromFile(kframe, filename, chunk_size)¶ Load the kmers and their counts from the input file into the output kDataFrame. The input file can be in FASTQ or FASTA format.
Note
The kDataFrame is passed by reference, so this function returns nothing.
Parameters: - kframe (kProcessor.kDataFrame) – the kDataFrame to be filled with the kmers and their counts
- filename (str) – Sequence(s) file path
- chunk_size – Number of sequences to parse at once.
Example:
>>> import kProcessor as kp
>>> KF = kp.kDataFramePHMAP(11)
>>> kp.countKmersFromFile(KF, "seq.fa", 1000) # Fill the KF with the kmers and counts
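The role of chunk_size can be pictured as reading the sequence file in batches and updating a counting table in place. The following sketch does this for FASTA text with the standard library only; `read_fasta`, `chunks` and `count_kmers_from_handle` are hypothetical helpers, and a `collections.Counter` stands in for the kDataFrame.

```python
import io
from collections import Counter


def read_fasta(handle):
    """Yield (header, sequence) records from FASTA text."""
    header, parts = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        elif line:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def chunks(records, chunk_size):
    """Group records into lists of at most chunk_size items."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == chunk_size:
            yield batch
            batch = []
    if batch:
        yield batch


def count_kmers_from_handle(counts, handle, k, chunk_size):
    """Update counts in place, one batch of sequences at a time; returns nothing."""
    for batch in chunks(read_fasta(handle), chunk_size):
        for _header, seq in batch:
            for i in range(len(seq) - k + 1):
                counts[seq[i:i + k]] += 1


fasta = io.StringIO(">s1\nACGT\nAC\n>s2\nCGTA\n>s3\nAC\n")
counts = Counter()
result = count_kmers_from_handle(counts, fasta, 3, 2)
```

As with the documented function, the table is filled as a side effect and the call itself returns nothing.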
-
countKmersFromString(seq, kFrame)¶ Load the kmers in the input string into the output kDataFrame.
Note
The kDataFrame is passed by reference, so this function returns nothing.
Parameters: - seq (string) – Sequence to be parsed
- kFrame (kProcessor.kDataFrame) – the kDataFrame to be filled with the kmers and their counts
Example:
>>> import kProcessor as kp
>>> KF = kp.kDataFramePHMAP(11)
>>> seq = "ACGATCGATCGATTATATATATCGACGATCGATCGTACGTAGC"
>>> kp.countKmersFromString(seq, KF) # Fill the KF with the kmers and counts
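For a single string, kmer counting amounts to sliding a window of length k across the sequence. The sketch below applies that to the sequence from the example with k = 11; `count_kmers_from_string` is a hypothetical helper with a `Counter` standing in for the kDataFrame, and it ignores details such as canonical kmers or hashing that the library may apply.

```python
from collections import Counter


def count_kmers_from_string(seq, counts, k):
    """Add every length-k window of seq to counts, in place; returns nothing."""
    for i in range(len(seq) - k + 1):
        counts[seq[i:i + k]] += 1


seq = "ACGATCGATCGATTATATATATCGACGATCGATCGTACGTAGC"
k = 11
counts = Counter()
count_kmers_from_string(seq, counts, k)

# A sequence of length L contributes L - k + 1 windows in total.
total_windows = sum(counts.values())
```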