5. Recipe 5: Kmers Indexing

5.1. Description

  1. Index a FASTA file with the index() function and store the resulting colored kDataFrame in ckf1
  2. Load the namesMap as a Python dictionary and print it
  3. Load the inverted namesMap as a Python dictionary and print it
  4. Query a kmer to get its color
  5. Save the colored kDataFrame to disk, load it again, and iterate over its kmers
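
To make the idea of a "color" concrete before running the recipe, here is a toy, pure-Python sketch of a colored kmer index. It is not kProcessor's implementation: `toy_index`, `sequences`, and the color numbering are hypothetical, and the sketch ignores details such as reverse complements. It only illustrates the assumption that kmers sharing the same set of source sequences share one color.

```python
from collections import defaultdict


def toy_index(sequences, k):
    """Toy colored kmer index (illustration only, not kProcessor).

    Returns two dictionaries:
      kmer_color:    kmer -> color
      color_sources: color -> sorted list of source IDs
    """
    # Collect the set of source IDs in which each kmer occurs.
    kmer_sources = defaultdict(set)
    for source_id, seq in sequences.items():
        for i in range(len(seq) - k + 1):
            kmer_sources[seq[i:i + k]].add(source_id)

    # Give every distinct set of sources its own color (arbitrary numbering).
    color_of_sources = {}
    kmer_color = {}
    for kmer, sources in kmer_sources.items():
        key = frozenset(sources)
        if key not in color_of_sources:
            color_of_sources[key] = len(color_of_sources) + 1
        kmer_color[kmer] = color_of_sources[key]

    color_sources = {c: sorted(s) for s, c in color_of_sources.items()}
    return kmer_color, color_sources


# Two hypothetical sequences with IDs 1 and 2.
sequences = {1: "ACGTAC", 2: "CGTACG"}
kmer_color, color_sources = toy_index(sequences, k=4)

# Kmers found in both sequences end up with the same color.
shared_color = kmer_color["CGTA"]
print(shared_color, color_sources[shared_color])
```

In the recipe below, getColor() and getKmerSourceFromColor() play the roles of the two dictionaries returned here.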

5.2. Implementation

5.2.1. Importing

[1]:
import kProcessor as kp

5.2.2. Create kDataFrame Object

[2]:
KF = kp.kDataFrameMQF(31)

5.2.3. Indexing

[3]:
# kp.index(kDataFrame, mode, params, file_path, chunk size, namesfile)
ckf1 = kp.index(KF, "kmers", {"k_size" : 31}, "data/min_test_sample.fa", 1000, "data/min_test_sample.fa.names")

5.2.4. Getting the names map and its inverse

[4]:
namesMap = ckf1.names_map()
inverse_namesMap = ckf1.inverse_names_map()
[5]:
namesMap
[5]:
{14: 'ENST00000616125.4',
 23: 'ENST00000379409.6',
 5: 'ENST00000420190.6',
 19: 'ENST00000338591.7',
 28: 'ENST00000428771.6',
 10: 'ENST00000622503.4',
 1: 'ENST00000641515.2',
 24: 'ENST00000379407.7',
 15: 'ENST00000620200.4',
 6: 'ENST00000437963.5',
 20: 'ENST00000622660.1',
 29: 'ENST00000304952.10',
 11: 'ENST00000618323.4',
 2: 'ENST00000335137.4',
 25: 'ENST00000491024.1',
 22: 'ENST00000379410.7',
 9: 'ENST00000618181.4',
 26: 'ENST00000341290.6',
 4: 'ENST00000332831.4',
 21: 'ENST00000466300.1',
 18: 'ENST00000327044.6',
 13: 'ENST00000618779.4',
 30: 'ENST00000484667.2',
 27: 'ENST00000433179.3',
 31: 'ENST00000624697.3',
 8: 'ENST00000617307.4',
 17: 'ENST00000455979.1',
 12: 'ENST00000616016.4',
 3: 'ENST00000426406.3',
 7: 'ENST00000342066.7',
 16: 'ENST00000341065.8'}
[6]:
# Inverse of the namesMap dictionary
inverse_namesMap
[6]:
{'ENST00000624697.3': 31,
 'ENST00000379407.7': 24,
 'ENST00000622503.4': 10,
 'ENST00000622660.1': 20,
 'ENST00000616125.4': 14,
 'ENST00000428771.6': 28,
 'ENST00000491024.1': 25,
 'ENST00000338591.7': 19,
 'ENST00000618779.4': 13,
 'ENST00000433179.3': 27,
 'ENST00000455979.1': 17,
 'ENST00000341290.6': 26,
 'ENST00000616016.4': 12,
 'ENST00000342066.7': 7,
 'ENST00000379410.7': 22,
 'ENST00000426406.3': 3,
 'ENST00000379409.6': 23,
 'ENST00000484667.2': 30,
 'ENST00000620200.4': 15,
 'ENST00000341065.8': 16,
 'ENST00000617307.4': 8,
 'ENST00000641515.2': 1,
 'ENST00000332831.4': 4,
 'ENST00000466300.1': 21,
 'ENST00000437963.5': 6,
 'ENST00000304952.10': 29,
 'ENST00000618181.4': 9,
 'ENST00000327044.6': 18,
 'ENST00000618323.4': 11,
 'ENST00000335137.4': 2,
 'ENST00000420190.6': 5}
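
As the two outputs suggest, the inverse map is the same dictionary with keys and values swapped. A minimal sketch in plain Python, using three entries copied from the output above (the variable names `names_map_subset` and `inverse_subset` are illustrative only):

```python
# Three entries copied from the namesMap output above.
names_map_subset = {
    1: "ENST00000641515.2",
    2: "ENST00000335137.4",
    3: "ENST00000426406.3",
}

# Swap keys and values to go from name to ID.
inverse_subset = {name: group_id for group_id, name in names_map_subset.items()}

# The inversion loses nothing only when every name is unique.
assert len(inverse_subset) == len(names_map_subset)

print(inverse_subset["ENST00000335137.4"])
```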

5.3. Get a kmer color from the coloredKDataFrame

[7]:
# kmer from the dataset to perform the query
kmer = "TCGAAGCTGGAGAAGGCGGACATCCTGGAGA"

# Get the color
kmer_color = ckf1.getColor(kmer)

print(kmer_color)
61

5.4. Get all the transcript IDs associated with the previous color

[8]:
transcript_ids = ckf1.getKmerSourceFromColor(kmer_color)

5.4.1. Get all transcript names of that color

[9]:
for tr_id in transcript_ids:
    print(namesMap[tr_id])
ENST00000428771.6
ENST00000304952.10
ENST00000484667.2
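
The three steps above (kmer to color, color to IDs, IDs to names) can be folded into one helper. This is a sketch, not part of kProcessor: `names_for_kmer` is a hypothetical helper, and `StandInIndex` is a hypothetical stand-in that mimics only the two methods used in this recipe so the example runs without the library. Its data is copied from the outputs above.

```python
def names_for_kmer(ckf, names_map, kmer):
    """Return the names of all sources that contain `kmer`."""
    color = ckf.getColor(kmer)
    return [names_map[source_id] for source_id in ckf.getKmerSourceFromColor(color)]


class StandInIndex:
    """Hypothetical stand-in exposing the two methods used in this recipe."""

    def __init__(self, kmer_colors, color_sources):
        self._kmer_colors = kmer_colors
        self._color_sources = color_sources

    def getColor(self, kmer):
        return self._kmer_colors[kmer]

    def getKmerSourceFromColor(self, color):
        return self._color_sources[color]


# Values taken from the cells above: color 61 maps to IDs 28, 29 and 30.
demo_index = StandInIndex(
    kmer_colors={"TCGAAGCTGGAGAAGGCGGACATCCTGGAGA": 61},
    color_sources={61: [28, 29, 30]},
)
demo_names = {
    28: "ENST00000428771.6",
    29: "ENST00000304952.10",
    30: "ENST00000484667.2",
}

result = names_for_kmer(demo_index, demo_names, "TCGAAGCTGGAGAAGGCGGACATCCTGGAGA")
print(result)
```

With the real objects from this recipe, the call would be names_for_kmer(ckf1, namesMap, kmer).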

5.4.2. Save the colored kDataFrame to disk with name “ckf1”

[10]:
ckf1.save("ckf1")

5.4.3. Load the colored kDataFrame again from disk.

Unlike the other kDataFrames, a colored_kDataFrame contains a kDataFrame alongside the color information of its kmers. To load one, use the static function load(file_path) of the colored_kDataFrame class.

[11]:
loaded_ckf = kp.colored_kDataFrame.load("ckf1")

We can use the colored_kDataFrame to query kmers and find their associated colors and source sequences. The kDataFrame encapsulated in the colored_kDataFrame, on the other hand, is used to iterate directly over the kmers and their colors.

5.4.4. Get the embedded kDataFrame in the colored_kDataFrame loaded_ckf

[12]:
loaded_kf = loaded_ckf.getkDataFrame()

5.4.6. Create a kDataFrame Iterator

[14]:
it = loaded_kf.begin()

5.4.7. Iterate and print the first 5 kmers with their colors

[15]:
for i in range(5):

    # Get the kmer
    kmer = it.getKmer()

    # Get the color (stored as count)
    color = it.getCount()

    # Verify by querying the kmer on the colored kDataFrame
    assert loaded_ckf.getColor(kmer) == color

    print("Kmer  : %s" % kmer)
    print("Color : %d" % color)

    print("-------------------------------")

    it.next() # Extremely important: advance the iterator, or the same kmer is read again
Kmer  : ACTTCCCAGCCCGCTTCCCGTCCCACCCTCG
Color : 64
-------------------------------
Kmer  : CCTCTCCGTCCGAGTCTTTGGGGGGCTCGTC
Color : 45
-------------------------------
Kmer  : GTACTCGGCCGGCGGCTATGACGGGGCCTCC
Color : 40
-------------------------------
Kmer  : AGGGCACCCTCCAGCACGGCCACGCCCGCTG
Color : 40
-------------------------------
Kmer  : CTGCAGCCGCCGCCAGAGGGTTTCCTTCGGC
Color : 18
-------------------------------
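
The manual begin() / getKmer() / getCount() / next() pattern above can be wrapped in a small generator so the advance step cannot be forgotten. This is a sketch under stated assumptions: `first_kmers` is a hypothetical helper, it assumes the kDataFrame holds at least n kmers (it performs no end-of-iteration check), and the two stand-in classes exist only so the example runs without kProcessor. The sample pairs are copied from the output above.

```python
def first_kmers(kf, n):
    """Yield (kmer, color) pairs for the first n entries of a kDataFrame.

    Assumes kf contains at least n kmers; no end-of-iteration check is made.
    """
    it = kf.begin()
    for _ in range(n):
        yield it.getKmer(), it.getCount()
        it.next()  # advance, otherwise the same entry would be yielded again


class StandInIterator:
    """Hypothetical stand-in for the iterator returned by begin()."""

    def __init__(self, items):
        self._items = items
        self._pos = 0

    def getKmer(self):
        return self._items[self._pos][0]

    def getCount(self):
        return self._items[self._pos][1]

    def next(self):
        self._pos += 1


class StandInKDataFrame:
    """Hypothetical stand-in exposing only begin()."""

    def __init__(self, items):
        self._items = list(items)

    def begin(self):
        return StandInIterator(self._items)


# First two (kmer, color) pairs from the output above.
demo_kf = StandInKDataFrame([
    ("ACTTCCCAGCCCGCTTCCCGTCCCACCCTCG", 64),
    ("CCTCTCCGTCCGAGTCTTTGGGGGGCTCGTC", 45),
])

pairs = list(first_kmers(demo_kf, 2))
for kmer, color in pairs:
    print("Kmer  : %s" % kmer)
    print("Color : %d" % color)
```

With the real object from this recipe, the equivalent loop would be `for kmer, color in first_kmers(loaded_kf, 5)`.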

5.5. Complete Script

import kProcessor as kp

KF = kp.kDataFrameMQF(31)

# kp.index(kDataFrame, mode, params, file_path, chunk size, namesfile)
ckf1 = kp.index(KF, "kmers", {"k_size" : 31}, "data/min_test_sample.fa", 1000, "data/min_test_sample.fa.names")

namesMap = ckf1.names_map()

inverse_namesMap = ckf1.inverse_names_map()

print(f"namesMap: {namesMap}")

# Inverse of the namesMap dictionary
print(f"inverse_namesMap: {inverse_namesMap}")


# kmer from the dataset to perform the query
kmer = "TCGAAGCTGGAGAAGGCGGACATCCTGGAGA"

# Get the color
kmer_color = ckf1.getColor(kmer)

print(kmer_color)

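# Get all the transcript IDs associated with that color
transcript_ids = ckf1.getKmerSourceFromColor(kmer_color)

# Print the transcript names of that color
for tr_id in transcript_ids:
    print(namesMap[tr_id])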


ckf1.save("ckf1")

loaded_ckf = kp.colored_kDataFrame.load("ckf1")
loaded_kf = loaded_ckf.getkDataFrame()

it = loaded_kf.begin()

for i in range(5):

    # Get the kmer
    kmer = it.getKmer()

    # Get the color (stored as count)
    color = it.getCount()

    # Verify by querying the kmer on the colored kDataFrame
    assert loaded_ckf.getColor(kmer) == color

    print("Kmer  : %s" % kmer)
    print("Color : %d" % color)

    print("-------------------------------")

    it.next() # Extremely important: advance the iterator, or the same kmer is read again