5. Recipe 5: Kmers Indexing
5.1. Description
- Index a FASTA file with the index() function and store the resulting colored_kDataFrame in ckf1
- Load the namesMap as a Python dictionary and print it
- Load the inverted namesMap as a Python dictionary and print it
- Query by kmer to get its color
- Save the colored_kDataFrame to disk
5.2. Implementation
5.2.1. Importing
[1]:
import kProcessor as kp
5.2.2. Create kDataFrame Object
[2]:
KF = kp.kDataFrameMQF(31)
5.2.3. Indexing
[3]:
# kp.index(kDataFrame, mode, params, file_path, chunk size, namesfile)
ckf1 = kp.index(KF, "kmers", {"k_size" : 31}, "data/min_test_sample.fa", 1000, "data/min_test_sample.fa.names")
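To make the "kmers" mode and the k_size parameter concrete, here is a minimal pure-Python sketch of sliding-window kmer extraction. It does not call kProcessor: the function name extract_kmers and the toy sequence are illustrative assumptions, and a real indexer may also canonicalize each kmer (for example by considering its reverse complement), which this sketch does not attempt.

```python
def extract_kmers(sequence, k_size):
    """Return every substring of length k_size, in order (hypothetical helper)."""
    sequence = sequence.upper()
    # A sequence of length n yields n - k_size + 1 overlapping kmers.
    return [sequence[i:i + k_size] for i in range(len(sequence) - k_size + 1)]


# Toy sequence and a small k so the result is easy to read;
# the recipe itself uses k_size = 31.
toy_kmers = extract_kmers("ACGTAC", 3)
print(toy_kmers)
```

A sequence shorter than k_size produces no kmers at all, which is why very short records contribute nothing to an index.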
5.2.4. Getting the names map and its inverse
[4]:
namesMap = ckf1.names_map()
inverse_namesMap = ckf1.inverse_names_map()
[5]:
namesMap
[5]:
{14: 'ENST00000616125.4',
23: 'ENST00000379409.6',
5: 'ENST00000420190.6',
19: 'ENST00000338591.7',
28: 'ENST00000428771.6',
10: 'ENST00000622503.4',
1: 'ENST00000641515.2',
24: 'ENST00000379407.7',
15: 'ENST00000620200.4',
6: 'ENST00000437963.5',
20: 'ENST00000622660.1',
29: 'ENST00000304952.10',
11: 'ENST00000618323.4',
2: 'ENST00000335137.4',
25: 'ENST00000491024.1',
22: 'ENST00000379410.7',
9: 'ENST00000618181.4',
26: 'ENST00000341290.6',
4: 'ENST00000332831.4',
21: 'ENST00000466300.1',
18: 'ENST00000327044.6',
13: 'ENST00000618779.4',
30: 'ENST00000484667.2',
27: 'ENST00000433179.3',
31: 'ENST00000624697.3',
8: 'ENST00000617307.4',
17: 'ENST00000455979.1',
12: 'ENST00000616016.4',
3: 'ENST00000426406.3',
7: 'ENST00000342066.7',
16: 'ENST00000341065.8'}
[6]:
# Inverse of the namesMap dictionary
inverse_namesMap
[6]:
{'ENST00000624697.3': 31,
'ENST00000379407.7': 24,
'ENST00000622503.4': 10,
'ENST00000622660.1': 20,
'ENST00000616125.4': 14,
'ENST00000428771.6': 28,
'ENST00000491024.1': 25,
'ENST00000338591.7': 19,
'ENST00000618779.4': 13,
'ENST00000433179.3': 27,
'ENST00000455979.1': 17,
'ENST00000341290.6': 26,
'ENST00000616016.4': 12,
'ENST00000342066.7': 7,
'ENST00000379410.7': 22,
'ENST00000426406.3': 3,
'ENST00000379409.6': 23,
'ENST00000484667.2': 30,
'ENST00000620200.4': 15,
'ENST00000341065.8': 16,
'ENST00000617307.4': 8,
'ENST00000641515.2': 1,
'ENST00000332831.4': 4,
'ENST00000466300.1': 21,
'ENST00000437963.5': 6,
'ENST00000304952.10': 29,
'ENST00000618181.4': 9,
'ENST00000327044.6': 18,
'ENST00000618323.4': 11,
'ENST00000335137.4': 2,
'ENST00000420190.6': 5}
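The two dictionaries printed above are inverses of each other. As a quick sanity check, the sketch below rebuilds the inverse from a three-entry excerpt of namesMap with a plain dict comprehension. The variable names names_map_excerpt and rebuilt_inverse are illustrative, and nothing here calls kProcessor.

```python
# Three entries copied from the namesMap output above.
names_map_excerpt = {
    28: "ENST00000428771.6",
    29: "ENST00000304952.10",
    30: "ENST00000484667.2",
}

# Swap keys and values; this is only safe because every name is unique.
rebuilt_inverse = {name: seq_id for seq_id, name in names_map_excerpt.items()}

# Round trip: ID -> name -> ID returns the original ID.
for seq_id, name in names_map_excerpt.items():
    assert rebuilt_inverse[name] == seq_id

print(rebuilt_inverse)
```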
5.3. Get a kmer color from the colored_kDataFrame
[7]:
# kmer from the dataset to perform the query
kmer = "TCGAAGCTGGAGAAGGCGGACATCCTGGAGA"
# Get the color
kmer_color = ckf1.getColor(kmer)
print(kmer_color)
61
5.4. Get all the transcript IDs associated with the previous color
[8]:
transcript_ids = ckf1.getKmerSourceFromColor(61)
5.4.1. Get all transcript names of that color
[9]:
for tr_id in transcript_ids:
    print(namesMap[tr_id])
ENST00000428771.6
ENST00000304952.10
ENST00000484667.2
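The lookup chain in this step is color -> sequence IDs -> sequence names. The sketch below mirrors it with plain dictionaries so that it runs without kProcessor. Both color_to_ids and names_for_color are hypothetical stand-ins, and the IDs 28, 29 and 30 are inferred by matching the three names printed above against namesMap.

```python
# Stand-in for the IDs behind color 61 in this recipe
# (inferred from the printed names; treat them as an assumption).
color_to_ids = {61: [28, 29, 30]}

# Excerpt of the namesMap shown earlier in the recipe.
names_map_excerpt = {
    28: "ENST00000428771.6",
    29: "ENST00000304952.10",
    30: "ENST00000484667.2",
}


def names_for_color(color, color_to_ids, names_map):
    """Resolve a color to the names of the sequences that share it."""
    return [names_map[seq_id] for seq_id in color_to_ids.get(color, [])]


color_names = names_for_color(61, color_to_ids, names_map_excerpt)
print(color_names)
```

Returning an empty list for an unknown color is a design choice of this sketch; how kProcessor itself handles an unknown color is not covered here.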
5.4.2. Save the colored kDataFrame on disk
[10]:
ckf1.save("ckf1")
5.4.3. Load the colored kDataFrame again from disk.
Unlike the other kDataFrames, a colored_kDataFrame is a structure that contains a kDataFrame alongside the color information of its kmers. To load one, call the static function load(file_path) of the colored_kDataFrame class.
[11]:
loaded_ckf = kp.colored_kDataFrame.load("ckf1")
The colored_kDataFrame can be used to query kmers and find their associated colors and source sequences. The kDataFrame encapsulated inside it, on the other hand, is used to iterate directly over the kmers and their colors.
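One way to picture the relationship between the two objects is as two lookup tables: kmer -> color, and color -> set of source sequences. The toy model below is an assumption made for illustration, not kProcessor's actual implementation. The function build_toy_index, the color numbering scheme and the toy sequences are all hypothetical.

```python
def build_toy_index(sequences, k_size):
    """Build kmer -> color and color -> source-ID tables (illustrative only)."""
    # First collect, for every kmer, the set of sequence IDs that contain it.
    kmer_to_sources = {}
    for seq_id, seq in sequences.items():
        for i in range(len(seq) - k_size + 1):
            kmer_to_sources.setdefault(seq[i:i + k_size], set()).add(seq_id)

    # Then give each distinct set of sources one integer color.
    sources_to_color = {}
    kmer_to_color = {}
    for kmer, sources in kmer_to_sources.items():
        key = frozenset(sources)
        if key not in sources_to_color:
            sources_to_color[key] = len(sources_to_color) + 1
        kmer_to_color[kmer] = sources_to_color[key]

    color_to_sources = {c: sorted(s) for s, c in sources_to_color.items()}
    return kmer_to_color, color_to_sources


toy_sequences = {1: "ACGTA", 2: "CGTAC"}
kmer_to_color, color_to_sources = build_toy_index(toy_sequences, 3)

# Query by kmer, then resolve the color to its sources.
shared_color = kmer_to_color["CGT"]
print(shared_color, color_to_sources[shared_color])
```

In this model, kmers that occur in exactly the same set of sequences share one color, so the first table stays compact and only the second table stores ID lists.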
5.4.4. Get the embedded kDataFrame in the colored_kDataFrame loaded_ckf
[12]:
loaded_kf = loaded_ckf.getkDataFrame()
5.4.5. Print the types of loaded_kf & loaded_ckf
[13]:
print(type(loaded_ckf))
print(type(loaded_kf))
<class 'kProcessor.colored_kDataFrame'>
<class 'kProcessor.kDataFrame'>
5.4.6. Create a kDataFrame Iterator
[14]:
it = loaded_kf.begin()
5.4.7. Iterate and print the first 5 kmers with their colors
[15]:
for i in range(5):
    # Get the kmer
    kmer = it.getKmer()
    # Get the color (stored as count)
    color = it.getCount()
    # Print the kmer and its color
    print("Kmer : %s" % kmer)
    print("Color : %d" % color)
    print("-------------------------------")
    it.next()  # Required: advances the iterator to the next kmer
Kmer : ACTTCCCAGCCCGCTTCCCGTCCCACCCTCG
Color : 64
-------------------------------
Kmer : CCTCTCCGTCCGAGTCTTTGGGGGGCTCGTC
Color : 45
-------------------------------
Kmer : GTACTCGGCCGGCGGCTATGACGGGGCCTCC
Color : 40
-------------------------------
Kmer : AGGGCACCCTCCAGCACGGCCACGCCCGCTG
Color : 40
-------------------------------
Kmer : CTGCAGCCGCCGCCAGAGGGTTTCCTTCGGC
Color : 18
-------------------------------
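Because the embedded kDataFrame stores each kmer's color in the count slot, iterating over it yields (kmer, color) pairs that can be grouped however you like. As a hedged sketch, the code below tallies kmers per color using the five pairs printed above as hard-coded input. In a real run the pairs would come from the iterator; first_five and kmers_per_color are illustrative names.

```python
from collections import Counter

# The five (kmer, color) pairs copied from the output above.
first_five = [
    ("ACTTCCCAGCCCGCTTCCCGTCCCACCCTCG", 64),
    ("CCTCTCCGTCCGAGTCTTTGGGGGGCTCGTC", 45),
    ("GTACTCGGCCGGCGGCTATGACGGGGCCTCC", 40),
    ("AGGGCACCCTCCAGCACGGCCACGCCCGCTG", 40),
    ("CTGCAGCCGCCGCCAGAGGGTTTCCTTCGGC", 18),
]

# Count how many kmers carry each color.
kmers_per_color = Counter(color for _, color in first_five)
for color, n_kmers in kmers_per_color.most_common():
    print(f"color {color}: {n_kmers} kmer(s)")
```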
5.5. Complete Script
import kProcessor as kp
KF = kp.kDataFrameMQF(31)
# kp.index(kDataFrame, mode, params, file_path, chunk size, namesfile)
ckf1 = kp.index(KF, "kmers", {"k_size" : 31}, "data/min_test_sample.fa", 1000, "data/min_test_sample.fa.names")
namesMap = ckf1.names_map()
inverse_namesMap = ckf1.inverse_names_map()
print(f"namesMap: {namesMap}")
# Inverse of the namesMap dictionary
print(f"inverse_namesMap: {inverse_namesMap}")
# kmer from the dataset to perform the query
kmer = "TCGAAGCTGGAGAAGGCGGACATCCTGGAGA"
# Get the color
kmer_color = ckf1.getColor(kmer)
print(kmer_color)
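# Get the IDs of all transcripts associated with that color
transcript_ids = ckf1.getKmerSourceFromColor(kmer_color)
for tr_id in transcript_ids:
    print(namesMap[tr_id])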
ckf1.save("ckf1")
loaded_ckf = kp.colored_kDataFrame.load("ckf1")
loaded_kf = loaded_ckf.getkDataFrame()
it = loaded_kf.begin()
for i in range(5):
    # Get the kmer
    kmer = it.getKmer()
    # Get the color (stored as count)
    color = it.getCount()
    # Print the kmer and its color
    print("Kmer : %s" % kmer)
    print("Color : %d" % color)
    print("-------------------------------")
    it.next()  # Required: advances the iterator to the next kmer