Chapter 17. K-mer Counting

A K-mer is a substring of length K (K > 0), and counting the occurrences of all such substrings is a central step in many analyses of DNA sequence data. Counting K-mers for a DNA sequence means finding frequencies of K-mers for the entire sequence. In bioinformatics, K-mer counting is used for genome and transcriptome assembly, metagenomic sequencing, and error correction of sequence reads. Although simple in principle, K-mer counting is a big data challenge, since a single DNA sample can contain several billion DNA sequences. The K-mer counting problem has been defined by the Schatz lab.

This chapter will provide a complete K-mer solution using MapReduce/Hadoop and Spark. Our implementation discovers all K-mers for a given K > 0 and finds the top N K-mers for a given N > 0. The complete MapReduce/Hadoop and Spark solutions are available on GitHub.

A K-mer is a DNA sequence of length K. For example, if a sample DNA sequence is CACACACAGT, then the following are 3-mers and 5-mers for that sequence:

original sequence: CACACACAGT
3-mers:            CAC
                    ACA
                     CAC
                      ACA
                       CAC
                        ACA
                         CAG
                          AGT

original sequence: CACACACAGT
5-mers:            CACAC
                    ACACA
                     CACAC
                      ACACA
                       CACAG
                        ACAGT

Given a DNA sequence (we’ll call it string S) comprising any of {A, C, G, T}, which represent the primary nucleobases, we are interested in counting the number of occurrences in S of every substring of length K.

Input Data for K-mer Counting

For input, we will use FASTQ, a concise and compact format that stores DNA ...

Get Data Algorithms now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.