Preparing the dataset for analysis

Our starting point is a VCF file (or equivalent) containing calls made by a genotyper, in our case the Genome Analysis Toolkit (GATK), along with their annotations. Because we will be filtering NGS data, we need reliable decision criteria for calling a site. So, how do we get that information? Generally, we can't, but if we need to do it, there are three basic approaches:

  • Using a more robust sequencing technology for comparison; for example, using Sanger sequencing to verify NGS datasets. This is cost-prohibitive and can only be done for a few loci.
  • Sequencing closely related individuals, for example, two parents and their offspring. In this case, we use Mendelian inheritance rules to decide whether a certain call is acceptable ...
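
The trio idea above can be sketched in a few lines of plain Python. This is only an illustration, not the book's implementation: it assumes diploid, biallelic or multiallelic genotypes written as pairs of allele indices (0 for the reference allele, 1 for the first alternate, as in the VCF GT field), and it ignores missing calls, sex chromosomes, and genuine de novo mutations. The function name `is_mendelian_consistent` is hypothetical.

```python
from itertools import product


def is_mendelian_consistent(father, mother, child):
    """Return True if the child genotype can be built from one
    allele of the father and one allele of the mother.

    Genotypes are pairs of allele indices, e.g. (0, 1) for a
    heterozygous call. This representation is an assumption made
    for the sketch.
    """
    child_sorted = tuple(sorted(child))
    # Try every combination of one paternal and one maternal allele.
    return any(
        tuple(sorted((pat, mat))) == child_sorted
        for pat, mat in product(father, mother)
    )


# Hypothetical trio calls at three sites: (father, mother, child)
trio_calls = [
    ((0, 0), (0, 1), (0, 1)),  # child inherits 0 and 1: acceptable
    ((0, 0), (0, 0), (0, 1)),  # neither parent carries allele 1: suspect
    ((0, 1), (0, 1), (1, 1)),  # both parents can pass allele 1: acceptable
]

for father, mother, child in trio_calls:
    status = "ok" if is_mendelian_consistent(father, mother, child) else "suspect"
    print(father, mother, child, status)
```

A site flagged as suspect is not necessarily a wrong call, but a high rate of such sites under a given annotation threshold suggests that the threshold is too permissive.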
