Chapter 13. Out-of-Memory Approaches: Tabix and SQLite

In this chapter, we’ll look at out-of-memory approaches: computational strategies built around storing and working with data on disk rather than in memory. Reading data from a disk is much, much slower than working with data in memory (see “The Almighty Unix Pipe: Speed and Beauty in One”), but in many cases this is the approach we have to take when in-memory approaches (e.g., loading the entire dataset into R) or streaming approaches (e.g., using Unix pipes, as we did in Chapter 7) aren’t appropriate. Specifically, we’ll look at two tools for working with data out of memory: Tabix and SQLite databases.

Fast Access to Indexed Tab-Delimited Files with BGZF and Tabix

BGZF and Tabix solve a really important problem in genomics: we often need fast read-only random access to data linked to a genomic location or range. For the scale of data we encounter in genomics, retrieving this type of data is not trivial for a few reasons. First, the data may not fit entirely in memory, requiring an approach where data is kept out of memory (in other words, on a slow disk). Second, even powerful relational database systems can be sluggish when retrieving millions of entries that overlap a specific region, an incredibly common operation in genomics. The tools we’ll see in this section are specially designed to get around these limitations, allowing fast random access to tab-delimited genome position data.
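
To make the idea of indexed random access concrete, here is a minimal Python sketch. It is not Tabix's actual index format or algorithm, and it skips BGZF compression entirely. It assumes a plain tab-delimited file of single-position records (chromosome, position, name) that is already sorted by chromosome and position. The function names build_index and query, the step parameter, and the toy records are all hypothetical, chosen only for illustration.

```python
import bisect
import os
import tempfile


def build_index(path, step=2):
    """Record the byte offset of every `step`-th line of each chromosome.

    Assumes the file is sorted by chromosome, then position.
    Returns {chrom: (positions, offsets)} for the checkpointed lines.
    """
    index = {}
    counts = {}
    with open(path, "rb") as f:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            chrom, pos = line.split(b"\t")[:2]
            chrom = chrom.decode()
            n = counts.get(chrom, 0)
            if n % step == 0:  # the first line of each chromosome is always kept
                positions, offsets = index.setdefault(chrom, ([], []))
                positions.append(int(pos))
                offsets.append(offset)
            counts[chrom] = n + 1
    return index


def query(path, index, chrom, start, end):
    """Return records on `chrom` with start <= position <= end."""
    if chrom not in index:
        return []
    positions, offsets = index[chrom]
    # Last checkpoint strictly before `start`, so ties at `start` are not skipped.
    i = max(bisect.bisect_left(positions, start) - 1, 0)
    hits = []
    with open(path, "rb") as f:
        f.seek(offsets[i])  # jump straight to the checkpoint; earlier bytes are never read
        for line in f:
            fields = line.decode().rstrip("\r\n").split("\t")
            if fields[0] != chrom or int(fields[1]) > end:
                break  # sorted input: nothing further can match
            if int(fields[1]) >= start:
                hits.append(fields)
    return hits


# Toy data (made up for this sketch), written to a temporary file on disk.
rows = [
    ("chr1", 100, "a"), ("chr1", 250, "b"), ("chr1", 250, "c"),
    ("chr1", 900, "d"), ("chr2", 50, "e"), ("chr2", 300, "f"),
]
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as tmp:
    for chrom, pos, name in rows:
        tmp.write(f"{chrom}\t{pos}\t{name}\n")

index = build_index(tmp.name)
hits = query(tmp.name, index, "chr1", 200, 900)
print([h[2] for h in hits])
os.remove(tmp.name)
```

Only the small index lives in memory; the records stay on disk, and a query reads just the lines between the nearest checkpoint and the end of the requested range. Handling records that span a range rather than a single position takes more care, since a feature that starts before the query region can still overlap it.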

In chapter on alignment, we saw ...
