Python API

The main programmatic way to interact with kProcessor is through its Python API. See also the recipes for using the Python API.

kDataFrame: The abstract base class for defining the kDataFrame

class kDataFrame

The abstract base class defining a kDataFrame.

reserve(n)

Request a capacity change so that the kDataFrame can approximately hold at least n kmers

Parameters:n – Minimum number of kmers
insert(kmer, N=1)

Insert the kmer N times into the kDataFrame, or increment its count by N if it already exists.

Parameters:
  • kmer (string) – The kmer whose count is incremented
  • N (integer) – Kmer count (optional, default = 1)
Returns:

Boolean value indicating whether the kmer is inserted or not

Return type:

bool

setCount(kmer, N)

Set the kmer’s count to N in the kDataFrame

Parameters:
  • kmer (string) – The kmer whose count is set
  • N (integer) – Kmer count
Returns:

Boolean value indicating whether the kmer is inserted or not

Return type:

bool

getCount(kmer)

Retrieve the number of times the kmer was inserted into the kDataFrame

Parameters:kmer (string) – The kmer whose count is retrieved
Returns:The count of the kmer in the kDataFrame
Return type:integer
erase(kmer)

Removes a kmer from the kDataFrame

Parameters:kmer (string) – The kmer to be erased
Returns:Boolean value indicating whether the kmer is erased or not
Return type:bool
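The difference between insert, setCount, getCount and erase can be illustrated with a small pure-Python stand-in. This is only a sketch: ToyKmerCounter is a hypothetical dict-backed class that borrows the documented method names, not kProcessor's implementation, and it assumes that a kmer that was never inserted reports a count of 0.

```python
class ToyKmerCounter:
    """Hypothetical dict-backed stand-in that mimics the documented method names."""

    def __init__(self, k):
        self.k = k
        self._counts = {}

    def insert(self, kmer, n=1):
        # Add n to the existing count, or start from n for a new kmer.
        self._counts[kmer] = self._counts.get(kmer, 0) + n
        return True

    def setCount(self, kmer, n):
        # Overwrite the count instead of adding to it.
        self._counts[kmer] = n
        return True

    def getCount(self, kmer):
        # Assumption: a kmer that was never inserted reports a count of 0.
        return self._counts.get(kmer, 0)

    def erase(self, kmer):
        # Report whether anything was actually removed.
        return self._counts.pop(kmer, None) is not None


toy = ToyKmerCounter(k=5)
toy.insert("ACGTA")        # count becomes 1
toy.insert("ACGTA", 3)     # count becomes 4
toy.setCount("ACGTA", 10)  # count is overwritten with 10
count_after_set = toy.getCount("ACGTA")
erased = toy.erase("ACGTA")
count_after_erase = toy.getCount("ACGTA")
```

Calling insert twice accumulates the count, whereas setCount replaces it regardless of the previous value.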
size()

Number of kmers in the kDataFrame

Returns:The number of kmers in the kDataFrame
Return type:integer
max_size()

Maximum number of kmers that the kDataFrame can hold.

Returns:The maximum number of kmers that the kDataFrame can hold.
Return type:integer
empty()

Check whether the kDataFrame contains no kmers.

Returns:Boolean value indicating whether the kDataFrame is empty, i.e. whether its size is 0
Return type:boolean
load_factor()

Retrieve the current load factor of the kDataFrame as a percentage, indicating how full it is.

Returns:The current load factor of the kDataFrame.
Return type:integer
max_load_factor()

Retrieve the maximum load factor of the kDataFrame as a percentage.

Returns:The maximum load factor of the kDataFrame.
Return type:integer
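As a rough illustration of what a load factor expressed as a percentage means, the sketch below divides the number of stored kmers by the capacity. The function load_factor_percent and the threshold of 90 are hypothetical; how kProcessor computes or bounds its load factor internally is not specified here.

```python
def load_factor_percent(size, capacity):
    """Return how full a container is, as a whole-number percentage."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return (100 * size) // capacity


current = load_factor_percent(size=750, capacity=1000)
needs_resize = current >= 90  # 90 is a hypothetical maximum load factor
```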
begin()

Instantiate a kDataFrameIterator object pointing to the first kmer position

Returns:An iterator at the beginning of the kDataFrame.
Return type:kProcessor.kDataFrameIterator

end()

Instantiate a kDataFrameIterator object pointing to the last kmer position

Returns:An iterator at the end of the kDataFrame.
Return type:kProcessor.kDataFrameIterator
save()

Serialize the kDataFrame to disk as a binary file alongside other metadata files.

static load(filePath)

A static method to load a kDataFrame file from disk.

Note

Pass the file path without its extension (.mqf, .map, or .phmap).

Parameters:filePath – The path of the serialized kDataFrame binary file, without the extension
Returns:The kDataFrame loaded from disk
Return type:kProcessor.kDataFrame
Example:
>>> import kProcessor as kp
>>> # File path : "path/to/file.mqf"
>>> KF = kp.kDataFrame.load("path/to/file")
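Since load expects the path without its extension, a small helper can strip a known extension before the call. This is a sketch only: to_load_prefix is a hypothetical helper and is not part of kProcessor.

```python
from pathlib import Path

# Extensions listed in the note above.
KNOWN_EXTENSIONS = {".mqf", ".map", ".phmap"}


def to_load_prefix(file_path):
    """Drop a known kDataFrame extension so the result can be passed to load()."""
    path = Path(file_path)
    if path.suffix in KNOWN_EXTENSIONS:
        return path.with_suffix("").as_posix()
    return path.as_posix()


prefix = to_load_prefix("path/to/file.mqf")  # "path/to/file"
```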
kSize()

Get the kmer size of the kDataFrame

Returns:kmer size
Return type:integer

kDataFrameIterator: The abstract base class for defining a kDataFrame iterator

class kDataFrameIterator

Base class for kDataFrame Iterator

next()

Increment the iterator to the next kmer

Returns:kDataFrame Iterator pointing to the new kmer position
Return type:kProcessor.kDataFrameIterator
getKmer()

Get the kmer at the current iterator position

Returns:Kmer at the current position
Return type:string
getHashedKmer()

Get the hash value of the kmer at the current iterator position

Returns:Kmer’s hash value at the current position
Return type:integer
getCount()

Get the count of the kmer at the current iterator position

Returns:kmer count
Return type:integer
setCount()

Sets the count of the current kmer

Returns:True if succeeded, False if failed
Return type:boolean
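The traversal pattern implied by begin(), end() and next() can be sketched with a toy iterator. ToyIterator is hypothetical and only mimics the documented method names. The sketch assumes that the end iterator sits one position past the last kmer and that iterators can be compared with each other, neither of which is confirmed here for kProcessor.

```python
class ToyIterator:
    """Hypothetical iterator over a list of (kmer, count) pairs."""

    def __init__(self, items, position):
        self._items = items
        self._position = position

    def next(self):
        # Advance in place and return self, mirroring the documented return type.
        self._position += 1
        return self

    def getKmer(self):
        return self._items[self._position][0]

    def getCount(self):
        return self._items[self._position][1]

    def __eq__(self, other):
        return self._items is other._items and self._position == other._position


items = [("AAAAC", 2), ("ACGTA", 1), ("TTGCA", 5)]
it, end = ToyIterator(items, 0), ToyIterator(items, len(items))

seen = []
while it != end:  # assumption: iterators can be compared with the end iterator
    seen.append((it.getKmer(), it.getCount()))
    it.next()
```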

kDataFrameMQF: subclass derived from kDataFrame

class kDataFrameMQF(kSize)

The kDataFrameMQF class, a subclass of kDataFrame.

Instantiate a kDataFrameMQF object with predefined kmer size.

Parameters:kSize (integer) – Kmer Size
Returns:kProcessor.kDataFrameMQF
Instantiation Example:
>>> import kProcessor as kp
>>> KF_MQF_1 = kp.kDataFrameMQF(31)  # kSize = 31
>>> KF_MQF_2 = kp.kDataFrameMQF(kp.SKIPMERS, kp.integer_hasher, {'m': 2, 'n': 3, 'k': 10})  # Reading mode = skipmers, hashing mode = integer hashing; (m, n, k) are the skipmers parameters
>>> KF_MQF_3 = kp.kDataFrameMQF(kp.PROTEIN, kp.protein_hasher, {'kSize': 5})  # Reading/hashing mode = protein, kSize = 5
>>> KF_MQF_4 = kp.kDataFrameMQF(kp.PROTEIN, kp.proteinDayhoff_hasher, {'kSize': 11})  # Reading mode = protein, hashing mode = Dayhoff encoding, kSize = 11

Note

Read more about hashing modes in the FAQ page.

getTwin()

Create a new kDataFrameMQF using the same parameters as the current kDataFrameMQF.

Returns:A shallow copy of the current kDataFrameMQF.
Return type:kDataFrameMQF
reserve(n)

Request a capacity change so that the kDataFrameMQF can approximately hold at least n kmers

Parameters:n – Minimum number of kmers
insert(kmer, N=1)

Insert the kmer N times into the kDataFrameMQF, or increment its count by N if it already exists.

Parameters:
  • kmer (string) – The kmer whose count is incremented
  • N (integer) – Kmer count (optional, default = 1)
Returns:

Boolean value indicating whether the kmer is inserted or not

Return type:

bool

setCount(kmer, N)

Set the kmer’s count to N in the kDataFrameMQF

Parameters:
  • kmer (string) – The kmer whose count is set
  • N (integer) – Kmer count
Returns:

Boolean value indicating whether the kmer is inserted or not

Return type:

bool

getCount(kmer)

Retrieve the number of times the kmer was inserted into the kDataFrameMQF

Parameters:kmer (string) – The kmer whose count is retrieved
Returns:The count of the kmer in the kDataFrameMQF
Return type:integer
erase(kmer)

Removes a kmer from the kDataFrameMQF

Parameters:kmer (string) – The kmer to be erased
Returns:Boolean value indicating whether the kmer is erased or not
Return type:bool
size()

Number of kmers in the kDataFrameMQF

Returns:The number of kmers in the kDataFrameMQF
Return type:integer
max_size()

Maximum number of kmers that the kDataFrameMQF can hold.

Returns:The maximum number of kmers that the kDataFrameMQF can hold.
Return type:integer
empty()

Check whether the kDataFrameMQF contains no kmers.

Returns:Boolean value indicating whether the kDataFrameMQF is empty, i.e. whether its size is 0
Return type:boolean
load_factor()

Retrieve the current load factor of the kDataFrameMQF as a percentage, indicating how full it is.

Returns:The current load factor of the kDataFrameMQF.
Return type:integer
max_load_factor()

Retrieve the maximum load factor of the kDataFrameMQF as a percentage.

Returns:The maximum load factor of the kDataFrameMQF.
Return type:integer
begin()

Instantiate a kDataFrameIterator object pointing to the first kmer position

Returns:An iterator at the beginning of the kDataFrameMQF.
Return type:kProcessor.kDataFrameIterator

end()

Instantiate a kDataFrameIterator object pointing to the last kmer position

Returns:An iterator at the end of the kDataFrameMQF.
Return type:kProcessor.kDataFrameIterator
save()

Serialize the kDataFrameMQF to disk as a binary file alongside other metadata files.

static load(filePath)

A static method to load a kDataFrameMQF file from disk.

Note

Pass the file path without its extension (.mqf, .map, or .phmap).

Parameters:filePath – The path of the serialized kDataFrameMQF binary file, without the extension
Returns:The kDataFrameMQF loaded from disk
Return type:kProcessor.kDataFrameMQF
Example:
>>> import kProcessor as kp
>>> # File path : "path/to/file.mqf"
>>> KF = kp.kDataFrameMQF.load("path/to/file")
kSize()

Get the kmer size of the kDataFrameMQF

Returns:kmer size
Return type:integer

kDataFrameMAP: subclass derived from kDataFrame

class kDataFrameMAP(kSize)

The kDataFrameMAP class, a subclass of kDataFrame.

Parameters:kSize (integer) – Kmer Size
Returns:kProcessor.kDataFrameMAP

Note

Read more about the usage of kDataFrameMAP in the FAQ page.

reserve(n)

Request a capacity change so that the kDataFrameMAP can approximately hold at least n kmers

Parameters:n – Minimum number of kmers
insert(kmer, N=1)

Insert the kmer N times into the kDataFrameMAP, or increment its count by N if it already exists.

Parameters:
  • kmer (string) – The kmer whose count is incremented
  • N (integer) – Kmer count (optional, default = 1)
Returns:

Boolean value indicating whether the kmer is inserted or not

Return type:

bool

setCount(kmer, N)

Set the kmer’s count to N in the kDataFrameMAP

Parameters:
  • kmer (string) – The kmer whose count is set
  • N (integer) – Kmer count
Returns:

Boolean value indicating whether the kmer is inserted or not

Return type:

bool

getCount(kmer)

Retrieve the number of times the kmer was inserted into the kDataFrameMAP

Parameters:kmer (string) – The kmer whose count is retrieved
Returns:The count of the kmer in the kDataFrameMAP
Return type:integer
erase(kmer)

Removes a kmer from the kDataFrameMAP

Parameters:kmer (string) – The kmer to be erased
Returns:Boolean value indicating whether the kmer is erased or not
Return type:bool
size()

Number of kmers in the kDataFrameMAP

Returns:The number of kmers in the kDataFrameMAP
Return type:integer
max_size()

Maximum number of kmers that the kDataFrameMAP can hold.

Returns:The maximum number of kmers that the kDataFrameMAP can hold.
Return type:integer
empty()

Check whether the kDataFrameMAP contains no kmers.

Returns:Boolean value indicating whether the kDataFrameMAP is empty, i.e. whether its size is 0
Return type:boolean
load_factor()

Retrieve the current load factor of the kDataFrameMAP as a percentage, indicating how full it is.

Returns:The current load factor of the kDataFrameMAP.
Return type:integer
max_load_factor()

Retrieve the maximum load factor of the kDataFrameMAP as a percentage.

Returns:The maximum load factor of the kDataFrameMAP.
Return type:integer
begin()

Instantiate a kDataFrameIterator object pointing to the first kmer position

Returns:An iterator at the beginning of the kDataFrameMAP.
Return type:kProcessor.kDataFrameIterator

end()

Instantiate a kDataFrameIterator object pointing to the last kmer position

Returns:An iterator at the end of the kDataFrameMAP.
Return type:kProcessor.kDataFrameIterator
save()

Serialize the kDataFrameMAP to disk as a binary file alongside other metadata files.

static load(filePath)

A static method to load a kDataFrameMAP file from disk.

Note

Pass the file path without its extension (.mqf, .map, or .phmap).

Parameters:filePath – The path of the serialized kDataFrameMAP binary file, without the extension
Returns:The kDataFrameMAP loaded from disk
Return type:kProcessor.kDataFrameMAP
Example:
>>> import kProcessor as kp
>>> # File path : "path/to/file.map"
>>> KF = kp.kDataFrameMAP.load("path/to/file")
kSize()

Get the kmer size of the kDataFrameMAP

Returns:kmer size
Return type:integer

kDataFramePHMAP: subclass derived from kDataFrame

class kDataFramePHMAP(kSize)

The kDataFramePHMAP class, a subclass of kDataFrame.

Instantiate a kDataFramePHMAP object with predefined kmer size.

Parameters:
  • kSize (integer) – Kmer Size
  • mode (integer) – Hashing mode for the kDataFramePHMAP, default = 1
Returns:

kProcessor.kDataFramePHMAP

Instantiation Example:
>>> import kProcessor as kp
>>> KF_PHMAP_1 = kp.kDataFramePHMAP(31) # kSize = 31
>>> KF_PHMAP_2 = kp.kDataFramePHMAP(kp.PROTEIN, kp.protein_hasher, {'kSize': 5})  # Reading/hashing mode = protein, kSize = 5
>>> KF_PHMAP_3 = kp.kDataFramePHMAP(kp.PROTEIN, kp.proteinDayhoff_hasher, {'kSize': 11})  # Reading mode = protein, hashing mode = Dayhoff encoding, kSize = 11

Note

Read more about reading and hashing modes in the FAQ page.

getTwin()

Create a new kDataFramePHMAP using the same parameters as the current kDataFramePHMAP.

Returns:A shallow copy of the current kDataFramePHMAP.
Return type:kDataFramePHMAP
reserve(n)

Request a capacity change so that the kDataFramePHMAP can approximately hold at least n kmers

Parameters:n – Minimum number of kmers
insert(kmer, N=1)

Insert the kmer N times into the kDataFramePHMAP, or increment its count by N if it already exists.

Parameters:
  • kmer (string) – The kmer whose count is incremented
  • N (integer) – Kmer count (optional, default = 1)
Returns:

Boolean value indicating whether the kmer is inserted or not

Return type:

bool

setCount(kmer, N)

Set the kmer’s count to N in the kDataFramePHMAP

Parameters:
  • kmer (string) – The kmer whose count is set
  • N (integer) – Kmer count
Returns:

Boolean value indicating whether the kmer is inserted or not

Return type:

bool

getCount(kmer)

Retrieve the number of times the kmer was inserted into the kDataFramePHMAP

Parameters:kmer (string) – The kmer whose count is retrieved
Returns:The count of the kmer in the kDataFramePHMAP
Return type:integer
erase(kmer)

Removes a kmer from the kDataFramePHMAP

Parameters:kmer (string) – The kmer to be erased
Returns:Boolean value indicating whether the kmer is erased or not
Return type:bool
size()

Number of kmers in the kDataFramePHMAP

Returns:The number of kmers in the kDataFramePHMAP
Return type:integer
max_size()

Maximum number of kmers that the kDataFramePHMAP can hold.

Returns:The maximum number of kmers that the kDataFramePHMAP can hold.
Return type:integer
empty()

Check whether the kDataFramePHMAP contains no kmers.

Returns:Boolean value indicating whether the kDataFramePHMAP is empty, i.e. whether its size is 0
Return type:boolean
load_factor()

Retrieve the current load factor of the kDataFramePHMAP as a percentage, indicating how full it is.

Returns:The current load factor of the kDataFramePHMAP.
Return type:integer
max_load_factor()

Retrieve the maximum load factor of the kDataFramePHMAP as a percentage.

Returns:The maximum load factor of the kDataFramePHMAP.
Return type:integer
begin()

Instantiate a kDataFrameIterator object pointing to the first kmer position

Returns:An iterator at the beginning of the kDataFramePHMAP.
Return type:kProcessor.kDataFrameIterator

end()

Instantiate a kDataFrameIterator object pointing to the last kmer position

Returns:An iterator at the end of the kDataFramePHMAP.
Return type:kProcessor.kDataFrameIterator
save()

Serialize the kDataFramePHMAP to disk as a binary file alongside other metadata files.

static load(filePath)

A static method to load a kDataFramePHMAP file from disk.

Note

Pass the file path without its extension (.mqf, .map, or .phmap).

Parameters:filePath – The path of the serialized kDataFramePHMAP binary file, without the extension
Returns:The kDataFramePHMAP loaded from disk
Return type:kProcessor.kDataFramePHMAP
Example:
>>> import kProcessor as kp
>>> # File path : "path/to/file.phmap"
>>> KF = kp.kDataFramePHMAP.load("path/to/file")
kSize()

Get the kmer size of the kDataFramePHMAP

Returns:kmer size
Return type:integer

colored_kDataFrame: colored kDataFrame that holds the source sequence of each Kmer

class colored_kDataFrame

colored_kDataFrame class

Note

The colored_kDataFrame inherits all the functions of kProcessor.kDataFrame and adds new ones.

Introduction:
  • The colored_kDataFrame class holds the kmers' colors instead of their counts.
  • A color is an integer that represents the set of targets containing that kmer.
Example:

color 1: represents the transcripts transcript_A, transcript_B and transcript_C
color 2: represents the transcripts transcript_A and transcript_B

kmer ACTGATCGATCGTACGAC has the color 2, which means it is found in both transcript_A and transcript_B
kmer ATAAGCATTTACAGCAAT has the color 1, which means it is found in transcript_A, transcript_B and transcript_C

getColor(kmer)

Get the color of the kmer

Parameters:kmer (str) – Kmer string
Returns:The color of the kmer
Return type:int
getKmerSource(kmer)

Get all sample IDs that contain that kmer.

Parameters:kmer (str) – Kmer string
Returns:List of all sample IDs associated with that kmer.
Return type:list
getKmerSourceFromColor(color)

Get all sample IDs associated with that color.

Parameters:color (int) – Kmer color
Returns:List of all sample IDs associated with that color.
Return type:list
names_map()

Get the names map dictionary, which maps each sample ID (key) to its group name (value).

Returns:names map dictionary.
Return type:dict
inverse_names_map()

Get the inverse names map dictionary, which maps each group name (key) to its sample ID (value).

Returns:inverse names map dictionary.
Return type:dict
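The relationship between the two maps can be sketched with ordinary dictionaries. The sample IDs and group names below are hypothetical, and integer sample IDs are an assumption.

```python
# Hypothetical names map: sample ID -> group name.
names_map = {1: "transcript_A", 2: "transcript_B", 3: "transcript_C"}

# Swapping keys and values gives the inverse map: group name -> sample ID.
inverse_names_map = {name: sample_id for sample_id, name in names_map.items()}

# A list of sample IDs, such as one describing a color, can be made readable.
color_sample_ids = [1, 2]
color_group_names = [names_map[sample_id] for sample_id in color_sample_ids]
```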
static load(prefix)

Load a colored_kDataFrame file from disk.

Parameters:prefix (string) – file path
Returns:Colored kDataFrame that has been serialized on disk.
Return type:kProcessor.colored_kDataFrame
get_kDataFrame()

Get the kDataFrame object that holds the kmers alongside their colors.

Returns:the embedded kDataFrame inside the colored_kDataFrame.
Return type:kProcessor.kDataFrame

Set Functions: functions such as intersection and union

kFrameUnion(input)

Calculate the union of the kDataFrames. The resulting kDataFrame contains all the kmers in the input list of kDataFrames. The count of each kmer equals the sum of its counts across the input list.

Warning

This function works only with kProcessor.kDataFrameMQF.

Parameters:input (list of kProcessor.kDataFrameMQF) – List of kDataFrames
Returns:New kDataFrame object holding the union of kmers in the kDataFrames list.
Return type:kProcessor.kDataFrame
kFrameIntersect(input)

Calculate the intersection of the kDataFrames. The resulting kDataFrame contains only the kmers that exist in all the input kDataFrames. The count of each kmer equals the minimum of its counts across the input list.

Warning

This function works only with kProcessor.kDataFrameMQF.

Parameters:input (list of kProcessor.kDataFrameMQF) – List of kDataFrames
Returns:New kDataFrame object holding the intersection of kmers in the kDataFrames list.
Return type:kProcessor.kDataFrame
kFrameDiff(input)

Calculate the difference of the kDataFrames. The resulting kDataFrame contains only the kmers that exist in the first kDataFrame and in none of the remaining input kDataFrames. The count of each kmer equals its count in the first kDataFrame.

Warning

This function works only with kProcessor.kDataFrameMQF.

Parameters:input (list of kProcessor.kDataFrameMQF) – List of kDataFrames
Returns:New kDataFrame object holding the difference of kmers in the kDataFrames list.
Return type:kProcessor.kDataFrame
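The count rules of the three set functions (sum for union, minimum for intersection, first input's count for difference) can be checked on small inputs with collections.Counter. This is a pure-Python sketch of the documented semantics, not kProcessor code; toy_union, toy_intersect and toy_diff are hypothetical names.

```python
from collections import Counter
from functools import reduce


def toy_union(frames):
    # Every kmer from every input; counts are summed.
    return reduce(lambda a, b: a + b, frames)


def toy_intersect(frames):
    # Only kmers present in all inputs; counts are the minimum.
    return reduce(lambda a, b: a & b, frames)


def toy_diff(frames):
    # Kmers of the first input that appear in none of the others; counts unchanged.
    first, rest = frames[0], frames[1:]
    return Counter({k: c for k, c in first.items() if not any(k in f for f in rest)})


a = Counter({"ACG": 3, "CGT": 1, "GTA": 2})
b = Counter({"ACG": 1, "GTA": 5, "TAC": 4})

union = toy_union([a, b])             # ACG: 4, CGT: 1, GTA: 7, TAC: 4
intersection = toy_intersect([a, b])  # ACG: 1, GTA: 2
difference = toy_diff([a, b])         # CGT: 1
```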

Kmer input using files or strings

index(kframe, filename, chunk_size, names_fileName)

Index a sequence file using a predefined kmer decoding mode.

Parameters:
  • kframe (kProcessor.kDataFrame) – The kDataFrame to be filled with the kmers and their colors
  • filename (str) – Sequence(s) file path
  • chunk_size (int) – Number of sequences to parse at once.
  • names_fileName – The TSV names file that maps target sequence headers to their groups.
Returns:

colored kDataFrame with the decoded kmers and their colors.

Return type:

kProcessor.colored_kDataFrame

Example 1:
>>> import kProcessor as kp
>>> KF = kp.kDataFrameMQF(31)
>>> ckf = kp.index(KF, "seq.fa", 1000, "seq.names")
Example 2:
>>> import kProcessor as kp
>>> KF_prot = kp.kDataFramePHMAP(kp.PROTEIN, kp.protein_hasher, {'kSize': 5})
>>> ckf_prot = kp.index(KF_prot, "seq.fa", 1000, "seq.names")
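To make the idea of indexing concrete, the sketch below colors the kmers of two short sequences in pure Python. It is not kProcessor's algorithm. It assumes a names file with one "header TAB group" pair per line, and the input data and variable names are hypothetical.

```python
import csv
import io

names_tsv = "seq1\tgroup_A\nseq2\tgroup_B\n"      # assumed layout: header <TAB> group
sequences = {"seq1": "ACGTAC", "seq2": "CGTACG"}  # hypothetical input sequences
k = 4

# 1. Read the names file into header -> group.
header_to_group = dict(csv.reader(io.StringIO(names_tsv), delimiter="\t"))

# 2. Collect, for every kmer, the set of groups it occurs in.
kmer_groups = {}
for header, seq in sequences.items():
    for i in range(len(seq) - k + 1):
        kmer_groups.setdefault(seq[i:i + k], set()).add(header_to_group[header])

# 3. Give every distinct group set one integer color, then color the kmers.
color_of_group_set = {}
kmer_color = {}
for kmer, groups in kmer_groups.items():
    key = frozenset(groups)
    color = color_of_group_set.setdefault(key, len(color_of_group_set) + 1)
    kmer_color[kmer] = color
```

Kmers shared by both sequences (CGTA and GTAC) end up with the same color, while kmers unique to one group receive a color of their own.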
countKmersFromFile(kframe, filename, chunk_size)

Load the kmers in the input file, together with their counts, into the output kDataFrame. The input file can be in FASTQ or FASTA format.

Note

The kDataFrame is passed by reference, so this function returns nothing.

Parameters:
  • kframe (kProcessor.kDataFrame) – The kDataFrame to be filled with the kmers and their counts
  • filename (str) – Sequence(s) file path
  • chunk_size (int) – Number of sequences to parse at once.
Example:
>>> import kProcessor as kp
>>> KF = kp.kDataFramePHMAP(11)
>>> kp.countKmersFromFile(KF, "seq.fa", 1000) # Fill the KF with the kmers and counts
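The chunk_size parameter controls how many sequences are parsed at once. A minimal pure-Python sketch of chunked FASTA reading is shown below. It is not kProcessor's parser; read_fasta and in_chunks are hypothetical helpers, and the FASTA text is made up for the example.

```python
import io
from itertools import islice


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, parts = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        elif line:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def in_chunks(records, chunk_size):
    """Group an iterator of records into lists of at most chunk_size items."""
    records = iter(records)
    while chunk := list(islice(records, chunk_size)):
        yield chunk


fasta_text = ">s1\nACGT\nAC\n>s2\nGGGT\n>s3\nTTTA\n"
chunks = list(in_chunks(read_fasta(io.StringIO(fasta_text)), chunk_size=2))
```

With three records and a chunk size of 2, the records arrive as one chunk of two sequences followed by one chunk holding the remaining sequence.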
countKmersFromString(seq, kFrame)

Load the kmers in the input string into the output kDataFrame.

Note

The kDataFrame is passed by reference, so this function returns nothing.

Parameters:
  • seq (string) – Sequence to be parsed
  • kFrame (kProcessor.kDataFrame) – The kDataFrame to be filled with the kmers and their counts
Example:
>>> import kProcessor as kp
>>> KF = kp.kDataFramePHMAP(11)
>>> seq = "ACGATCGATCGATTATATATATCGACGATCGATCGTACGTAGC"
>>> kp.countKmersFromString(seq, KF) # Fill the KF with the kmers and counts
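What counting the kmers of a string amounts to can be sketched with a sliding window over the same sequence. This is a plain-Python illustration, not kProcessor's implementation; count_kmers is a hypothetical helper, and the sketch does not merge a kmer with its reverse complement.

```python
from collections import Counter


def count_kmers(seq, k):
    """Count every length-k substring of seq with a sliding window."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


seq = "ACGATCGATCGATTATATATATCGACGATCGATCGTACGTAGC"
counts = count_kmers(seq, k=11)
total_kmers = sum(counts.values())  # one kmer per window: len(seq) - k + 1
```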