4. Recipe 4: Iteration over kDataFrame kmers

4.1. Description

  1. Create kDataFrame with kmerSize = 21
  2. Insert a few kmers with random counts
  3. Iterate over the kDataFrame's kmers and print each kmer and its count
  4. Save the results in a dictionary

4.2. Implementation

4.2.1. Importing

[1]:
import kProcessor as kp
import random

4.2.2. Create a list of 4 kmers

[2]:
kmers = ["ATCATACTGATCGATCGATGC", "CGTAACCTATGCTAGCTAGAT", "CTGACTACTCAGAGCTAGCCT", "CAATCGCTGATACGATACGTA"]
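
The list above is hard-coded. If randomly generated kmers are wanted instead, a minimal sketch using only the standard library could look like the following. The helper name `random_kmers`, its parameters, and the variable `random_kmer_list` are hypothetical and not part of kProcessor; the fixed seed is only there to make the result repeatable.

```python
import random

def random_kmers(n, k=21, seed=None):
    """Return n distinct random DNA strings of length k (hypothetical helper)."""
    rng = random.Random(seed)  # private generator, so seeding does not affect other code
    kmers = set()
    while len(kmers) < n:      # a set discards accidental duplicates
        kmers.add("".join(rng.choice("ACGT") for _ in range(k)))
    return sorted(kmers)       # sorted so the same seed always gives the same list

# Kept under a separate name so the recipe's hard-coded list is not overwritten
random_kmer_list = random_kmers(4, k=21, seed=42)
print(random_kmer_list)
```

Each string has length 21, matching the kmerSize used for the kDataFrame in this recipe.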

4.2.3. Create an empty kDataFrame

[3]:
kf2 = kp.kDataFrameMQF(21)

4.2.4. Insert all kmers using a for loop

[4]:
for kmer in kmers:
    random_count = random.randint(1,100) # generate random count between 1 and 100
    print("Inserting kmer: %s with count %d" % (kmer, random_count))
    kf2.insert(kmer, random_count)
Inserting kmer: ATCATACTGATCGATCGATGC with count 99
Inserting kmer: CGTAACCTATGCTAGCTAGAT with count 22
Inserting kmer: CTGACTACTCAGAGCTAGCCT with count 35
Inserting kmer: CAATCGCTGATACGATACGTA with count 82

4.2.5. Iterate over all kmers, print their counts, and save them in a dictionary

[5]:
# Create empty dictionary
kf2_data = {}

# Create an iterator positioned at the first kmer in the kDataFrame
it = kf2.begin()

while it != kf2.end():

    # Get the kmer string
    kmer = it.getKmer()

    # Get the kmer count
    count = it.getCount()

    # Print the data
    print("retrieved kmer: %s with count: %d" % (kmer, count))

    # Save data in a dictionary
    kf2_data[kmer] = count

    it.next()  # Advance the iterator; without this call the loop never ends
retrieved kmer: AGGCTAGCTCTGAGTAGTCAG with count: 35
retrieved kmer: ATCTAGCTAGCATAGGTTACG with count: 22
retrieved kmer: ATCATACTGATCGATCGATGC with count: 99
retrieved kmer: CAATCGCTGATACGATACGTA with count: 82
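
Two of the retrieved kmers differ from the strings that were inserted, and the order is not the insertion order. The output is consistent with each kmer being stored in canonical form, that is, the lexicographically smaller of the kmer and its reverse complement. This is an inference from the output above, not something the recipe states, and the helpers below are plain Python written for illustration rather than kProcessor functions.

```python
# Translation table mapping each DNA base to its complement
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(kmer):
    """Complement every base, then reverse the string."""
    return kmer.translate(_COMPLEMENT)[::-1]

def canonical(kmer):
    """Return the lexicographically smaller of a kmer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))

inserted = [
    "ATCATACTGATCGATCGATGC",
    "CGTAACCTATGCTAGCTAGAT",
    "CTGACTACTCAGAGCTAGCCT",
    "CAATCGCTGATACGATACGTA",
]

for kmer in inserted:
    print("%s -> %s" % (kmer, canonical(kmer)))
```

Under this assumption, looking up a count in `kf2_data` by the original string would miss the two kmers that were stored as their reverse complements; keying the lookup on `canonical(kmer)` avoids that.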

4.2.6. Dump the dictionary data to a file

[6]:
# Use a separate name for the file handle so the kf2 kDataFrame is not overwritten
with open("kf2_data.tsv", 'w') as tsv_file:
    tsv_file.write("kmer\tcount\n")
    for kmer, count in kf2_data.items():
        tsv_file.write("%s\t%d\n" % (kmer, count))
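
To check the dump, the file can be read back into a dictionary. The sketch below uses the standard `csv` module; `load_counts` is a hypothetical helper, and the demo writes a throwaway file in the same two-column layout (with two kmer/count pairs taken from the output above) so that it runs on its own.

```python
import csv
import os
import tempfile

def load_counts(path):
    """Read a tab-separated kmer/count file back into a dict (hypothetical helper)."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")  # first row supplies the column names
        return {row["kmer"]: int(row["count"]) for row in reader}

# Demo on a temporary file that mimics the layout of kf2_data.tsv
sample = {"AGGCTAGCTCTGAGTAGTCAG": 35, "ATCTAGCTAGCATAGGTTACG": 22}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "kf2_data.tsv")
    with open(path, "w") as out:
        out.write("kmer\tcount\n")
        for kmer, count in sample.items():
            out.write("%s\t%d\n" % (kmer, count))
    loaded = load_counts(path)

print(loaded == sample)
```

In the recipe itself, `load_counts("kf2_data.tsv")` would be expected to return a dictionary equal to `kf2_data`.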

4.3. Complete Script

import kProcessor as kp
import random

kmers = ["ATCATACTGATCGATCGATGC", "CGTAACCTATGCTAGCTAGAT", "CTGACTACTCAGAGCTAGCCT", "CAATCGCTGATACGATACGTA"]

kf2 = kp.kDataFrameMQF(21)

for kmer in kmers:
    random_count = random.randint(1,100) # generate random count between 1 and 100
    print("Inserting kmer: %s with count %d" % (kmer, random_count))
    kf2.insert(kmer, random_count)

# Create empty dictionary
kf2_data = {}

# Create an iterator positioned at the first kmer in the kDataFrame
it = kf2.begin()

while it != kf2.end():

    # Get the kmer string
    kmer = it.getKmer()

    # Get the kmer count
    count = it.getCount()

    # Print the data
    print("retrieved kmer: %s with count: %d" % (kmer, count))

    # Save data in a dictionary
    kf2_data[kmer] = count

    it.next()  # Advance the iterator; without this call the loop never ends


# Use a separate name for the file handle so the kf2 kDataFrame is not overwritten
with open("kf2_data.tsv", 'w') as tsv_file:
    tsv_file.write("kmer\tcount\n")
    for kmer, count in kf2_data.items():
        tsv_file.write("%s\t%d\n" % (kmer, count))

4.4. Output TSV

[7]:
%%bash
cat kf2_data.tsv
kmer    count
AGGCTAGCTCTGAGTAGTCAG   35
ATCTAGCTAGCATAGGTTACG   22
ATCATACTGATCGATCGATGC   99
CAATCGCTGATACGATACGTA   82