Advances in Statistical Bioinformatics

Book Description

Providing genome-informed personalized treatment is a goal of modern medicine. Identifying new translational targets in nucleic acid characterizations is an important step toward that goal. The information tsunami produced by such genome-scale investigations is stimulating parallel developments in statistical methodology and inference, analytical frameworks, and computational tools. Within the context of genomic medicine and with a strong focus on cancer research, this book describes the integration of high-throughput bioinformatics data from multiple platforms to inform our understanding of the functional consequences of genomic alterations. This includes rigorous and scalable methods for simultaneously handling diverse data types such as gene expression array, miRNA, copy number, methylation, and next-generation sequencing data. This material is written for statisticians who are interested in modeling and analyzing high-throughput data. Chapters by experts in the field offer a thorough introduction to the biological and technical principles behind multiplatform high-throughput experimentation.

Table of Contents

  1. Cover
  2. Half Title
  3. Title
  4. Copyright
  5. Table of Contents
  6. List of Contributors
  7. Preface
  8. 1 An Introduction to Next-Generation Biological Platforms
    1. 1.1 Introduction
    2. 1.2 The Biology of Gene Silencing
      1. 1.2.1 DNA Methylation
      2. 1.2.2 RNA Interference
    3. 1.3 High-Throughput Profiling
      1. 1.3.1 Molecular Inversion Probe Arrays
      2. 1.3.2 Array Comparative Genomic Hybridization (aCGH)
      3. 1.3.3 Genome-Wide Association Studies
      4. 1.3.4 Reverse-Phase Protein Array
    4. 1.4 Next-Generation Sequencing
      1. 1.4.1 Whole-Genome and Whole-Exome Sequencing
      2. 1.4.2 ChIP-Seq
      3. 1.4.3 RNA-Seq
      4. 1.4.4 BS-seq
    5. 1.5 NGS Data Management and Analysis
    6. 1.6 Platform Integration
    7. Acknowledgments
    8. References
  9. 2 An Introduction to The Cancer Genome Atlas
    1. 2.1 Introduction
    2. 2.2 History and Goals of the TCGA Project
    3. 2.3 Sample Collection and Processing
      1. 2.3.1 Step 1: Tissue Collection
      2. 2.3.2 Step 2: Quality Control and DNA/RNA Extraction
      3. 2.3.3 Step 3: Molecular Profiling and Sequencing
      4. 2.3.4 Step 4: Data Collection and Public Distribution
      5. 2.3.5 Step 5: Data Analysis
    4. 2.4 Data Processing, Storage, and Access
      1. 2.4.1 TCGA Barcodes and UUIDs
      2. 2.4.2 The Data Coordinating Center
      3. 2.4.3 Data Access Matrix
      4. 2.4.4 Bulk Download
      5. 2.4.5 HTTP
      6. 2.4.6 CGHub
      7. 2.4.7 Sample and Data Relationship Format (SDRF) and Investigation Description Format (IDF) Files
      8. 2.4.8 File Format
      9. 2.4.9 Version
    5. 2.5 Tools for Visualizing and Analyzing TCGA Data
      1. 2.5.1 cBio Cancer Genomics Portal
      2. 2.5.2 MBatch Portal
      3. 2.5.3 Next-Generation Clustered Heat Maps
      4. 2.5.4 Regulome Explorer
      5. 2.5.5 Integrative Genome Viewer
      6. 2.5.6 Cancer Genomics Browser
    6. 2.6 Summary
    7. Acknowledgments
    8. References
  10. 3 DNA Variant Calling in Targeted Sequencing Data
    1. 3.1 Introduction
    2. 3.2 Background
      1. 3.2.1 Single-Nucleotide Variation
      2. 3.2.2 Long Padlock Probes
      3. 3.2.3 Array-Based Resequencing
    3. 3.3 Sequence Robust Multiarray Analysis
      1. 3.3.1 Quality Control
      2. 3.3.2 Variant Calling
    4. 3.4 Application of SRMA
      1. 3.4.1 Candidate Gene Study for Mitochondrial Diseases
      2. 3.4.2 Validation Results
      3. 3.4.3 Biological Findings
    5. 3.5 Conclusion
    6. Appendix
    7. References
  11. 4 Statistical Analysis of Mapped Reads from mRNA-Seq Data
    1. 4.1 Background
      1. 4.1.1 RNA Biology
      2. 4.1.2 RNA Technology
    2. 4.2 Mapping and Assembly Strategies
      1. 4.2.1 De Novo Assembly of the Transcriptome
      2. 4.2.2 Genome-Guided Assembly of the Transcriptome
      3. 4.2.3 Alignment to a Reference Transcriptome
    3. 4.3 Modeling Expression Levels
      1. 4.3.1 Poisson Model for Expression Quantification
    4. 4.4 Normalization
      1. 4.4.1 RPKM Normalization
      2. 4.4.2 Other Scaling Normalizations
      3. 4.4.3 Adjusted Transcript Lengths
      4. 4.4.4 Sequencing Bias
      5. 4.4.5 Fragment Size Distribution
      6. 4.4.6 Quantile Normalization
      7. 4.4.7 Nonparametric normalization factors
    5. 4.5 Modeling Overdispersion
    6. 4.6 Beyond Poisson and Negative Binomial Families
    7. 4.7 Differential Expression Analysis
      1. 4.7.1 Frequentist Methods
      2. 4.7.2 Bayesian Methods
      3. 4.7.3 Nonparametric Approaches
    8. 4.8 Allelic Imbalance
    9. 4.9 Concluding Remarks
    10. References
  12. 5 Model-Based Methods for Transcript Expression-Level Quantification in RNA-Seq
    1. 5.1 Introduction
    2. 5.2 Bias and Variation in RNA-Seq Experiments
      1. 5.2.1 Experimental Sources
      2. 5.2.2 Biological Sources
      3. 5.2.3 Other Sources
    3. 5.3 Base-Level Reads Count Data
    4. 5.4 Quantification Methods
      1. 5.4.1 Generalized Poisson model and GPseq
      2. 5.4.2 Poisson Mixed Effects Model and POME
      3. 5.4.3 Poisson Mixture Model and PMseq
      4. 5.4.4 Poisson Regression with Sequencing Preference Correction (mseq)
      5. 5.4.5 Cufflinks with Bias Adjustment
      6. 5.4.6 RPKM with GC-Content Correction
    5. 5.5 Comparison Results
    6. 5.6 Discussions
    7. References
  13. 6 Bayesian Model-Based Approaches for Solexa Sequencing Data
    1. 6.1 Introduction
    2. 6.2 A Hierarchical Model for GA-I
    3. 6.3 Analysis and Results for GA-I
    4. 6.4 Application to GA-II
    5. 6.5 Discussion
    6. Acknowledgment
    7. References
  14. 7 Statistical Aspects of ChIP-Seq Analysis
    1. 7.1 Introduction: The Purpose of the ChIP-seq Experiment
    2. 7.2 Aims
    3. 7.3 Experimental Overview
      1. 7.3.1 Control Experiments
      2. 7.3.2 Paired-End Sequencing
      3. 7.3.3 The Data
      4. 7.3.4 Potential Sources of Error and Bias
      5. 7.3.5 Histones/Nucleosomes
      6. 7.3.6 ReChIP
      7. 7.3.7 Other Experiments
      8. 7.3.8 Estimating Fragment Length
    4. 7.4 Peak-Calling in TF Data
      1. 7.4.1 Strategy-Independent Issues
      2. 7.4.2 Count-Based Strategies
      3. 7.4.3 Shape-Based Strategies
    5. 7.5 Peak-calling in Histone Mark Data
    6. 7.6 Validation
      1. 7.6.1 Functional Binding Site Validation
      2. 7.6.2 Binding Site Validation
      3. 7.6.3 Motif Analysis
      4. 7.6.4 Replication
      5. 7.6.5 Technical and Biological Replication
    7. 7.7 Assessing the Reliability of Peak-Callers
    8. 7.8 Differential Count-Based Strategies
      1. 7.8.1 Analysis Protocol
    9. 7.9 The Future of ChIP-seq
      1. 7.9.1 Integrating ChIP-seq with Expression Data
    10. References
  15. 8 Bayesian Modeling of ChIP-Seq Data from Transcription Factor to Nucleosome Positioning
    1. 8.1 Introduction
    2. 8.2 ChIP-seq Analysis
      1. 8.2.1 The PICS Framework
      2. 8.2.2 Other Methods to be Compared
      3. 8.2.3 Application to the FOXA1 Data
    3. 8.3 Nucleosome Positioning
      1. 8.3.1 The PING Framework
      2. 8.3.2 Methods to be Compared
      3. 8.3.3 Application to Experimental Data
    4. 8.4 Bioconductor Pipeline
    5. 8.5 Discussion
    6. 8.6 Acknowledgments
    7. References
  16. 9 Multivariate Linear Models for GWAS
    1. 9.1 Introduction
    2. 9.2 The Polygenic Model
    3. 9.3 Analysis of GWAS
    4. 9.4 Challenges in Multivariate Linear Models for GWAS
    5. 9.5 Lasso Approaches to GWAS
    6. 9.6 Bayesian Approaches to GWAS
    7. 9.7 Conclusion
    8. References
  17. 10 Bayesian Model Averaging for Genetic Association Studies
    1. 10.1 Genetic Association Studies
    2. 10.2 Statistical Analysis for Association Studies
      1. 10.2.1 Bayesian Variable Selection
    3. 10.3 Stochastic Search Variable Selection
      1. 10.3.1 Prior Specification
      2. 10.3.2 MCMC Sampling
      3. 10.3.3 Posterior Inference
      4. 10.3.4 Decision Rules and FDR Control
      5. 10.3.5 SSVS for Genetic Association Studies
    4. 10.4 Application: Folate Metabolism and Lung Cancer
    5. 10.5 Discussion
    6. References
  18. 11 Whole-Genome Multi-SNP-Phenotype Association Analysis
    1. 11.1 Introduction
      1. 11.1.1 Single-SNP Analysis has Difficulties in Assessing Overall Association Signals
      2. 11.1.2 Whole-Genome Multi-SNP Analysis – A Nontechnical Summary
    2. 11.2 Bayesian Variable Selection Regression
      1. 11.2.1 Prior Relating σₐ², π with Heritability
      2. 11.2.2 Computation and Inference
    3. 11.3 Penalized Regression
    4. 11.4 Estimate PVE Without Identifying Causal Variants
    5. 11.5 Binary Phenotypes
      1. 11.5.1 Extension of BVSR to Binary Phenotype
      2. 11.5.2 Machine Learning Approach
    6. 11.6 Discussion
    7. References
  19. 12 Methods for the Analysis of Copy Number Data in Cancer Research
    1. 12.1 Introduction
    2. 12.2 Allele Ratio and the Balance Statistic
    3. 12.3 Modeling Tumor Ploidy and DNA Purity
    4. 12.4 Further Examples
    5. 12.5 Estimating Copy Number
      1. 12.5.1 Early Methods
      2. 12.5.2 OncoSNP
      3. 12.5.3 ASCAT
      4. 12.5.4 PICNIC
      5. 12.5.5 FREEC
      6. 12.5.6 TAPS
      7. 12.5.7 Parent-Specific Copy Number
      8. 12.5.8 CNAnorm
      9. 12.5.9 Tightrope
      10. 12.5.10 ABSOLUTE
    6. 12.6 Summary of Tumor DNA Purity and Ploidy Results
    7. 12.7 Summary
    8. Acknowledgments
    9. References
  20. 13 Bayesian Models for Integrative Genomics
    1. 13.1 Introduction
    2. 13.2 Models That Integrate External Information With Experimental Data
      1. 13.2.1 Linear Models for Pathway and Gene Selection
      2. 13.2.2 Biomarker Selection in Mixture Models
    3. 13.3 Models That Integrate Data From Different Platforms
      1. 13.3.1 Graphical Models to Infer Regulatory Networks
    4. 13.4 Conclusion
    5. Acknowledgments
    6. References
  21. 14 Bayesian Graphical Models for Integrating Multiplatform Genomics Data
    1. 14.1 Introduction
    2. 14.2 Graph-Based Integration of Multiplatform Data
    3. 14.3 Objective Bayesian Model Selection for GGM
    4. 14.4 Application Data Example
      1. 14.4.1 Clinical Characteristics
      2. 14.4.2 microRNA Data Set
      3. 14.4.3 mRNA Data Set
      4. 14.4.4 Analysis Results
    5. 14.5 Discussion
    6. Acknowledgment
    7. References
  22. 15 Genetical Genomics Data: Some Statistical Problems and Solutions
    1. 15.1 Introduction and Review of Current Methods
      1. 15.1.1 Expression Quantitative Trait Loci (eQTL)
      2. 15.1.2 Methods for Identifying Cis- and Trans-eQTLs
    2. 15.2 Differential Co-expression Analysis
      1. 15.2.1 Dynamic Co-expression Analysis
      2. 15.2.2 Gene-set Based Differential Co-expression Analysis
    3. 15.3 Conditional Gaussian Graphical Model
      1. 15.3.1 Estimation Based on ℓ1 Penalization
      2. 15.3.2 Estimation Based on ℓ1-Constrained Minimization
    4. 15.4 Multi-Tissue eQTL Analysis
      1. 15.4.1 A Matrix-Normal Model for Multi-Tissue eQTL Data
      2. 15.4.2 ℓ1 Penalized Estimation
    5. 15.5 eQTL Analysis Using RNA-seq Data
      1. 15.5.1 RNA-seq Data and a Brief Review of Current Methods
      2. 15.5.2 A Poisson-Gamma Hierarchical Model for Differential Expression Analysis
    6. 15.6 Conclusions and Future Directions
    7. References
  23. 16 A Bayesian Framework for Integrating Copy Number and Gene Expression Data
    1. 16.1 Introduction
      1. 16.1.1 Overview
      2. 16.1.2 Biological Background
    2. 16.2 Motivating Examples
    3. 16.3 Probability Model
      1. 16.3.1 Sampling Model for w and y
      2. 16.3.2 Latent Probit Scores and Probit Regression
    4. 16.4 Bayesian Multiplicity Control
    5. 16.5 Simulation Study
    6. 16.6 Posterior Inference on the Breast Cancer Data Set
    7. 16.7 Discussion
    8. Appendix
    9. References
  24. 17 Application of Bayesian Sparse Factor Analysis Models in Bioinformatics
    1. 17.1 Introduction
    2. 17.2 Classical Factor Analysis Model
    3. 17.3 Bayesian Sparse Factor Models
      1. 17.3.1 Prior Specification in Bayesian Sparse Factor Model
      2. 17.3.2 Inferential Procedure
    4. 17.4 Bioinformatics Applications
      1. 17.4.1 Transcription Regulatory Network Inference
      2. 17.4.2 Biological Pathway Analysis
      3. 17.4.3 Genetic Analysis
      4. 17.4.4 Annotation of Spatial Gene Expression Patterns
      5. 17.4.5 Joint Analysis of Genomic and Pharmacological Data
    5. 17.5 Conclusions and Future Perspectives
    6. Acknowledgments
    7. References
  25. 18 Predicting Cancer Subtypes Using Survival-Supervised Latent Dirichlet Allocation Models
    1. 18.1 Background
    2. 18.2 The survLDA model
    3. 18.3 Empirical Results From TCGA Application
    4. 18.4 Prediction in the survLDA Model
    5. 18.5 Evaluation of Prediction in TCGA Cohort
    6. 18.6 Simulation Study to Assess Predictive Performance
    7. 18.7 Discussion
    8. References
  26. 19 Regularization Techniques for Highly Correlated Gene Expression Data with Unknown Group Structure
    1. 19.1 Introduction
    2. 19.2 The Tightest Convex Relaxation of ℓ2-norm Plus Cardinality
    3. 19.3 Methods
    4. 19.4 The Statistical Analysis of Lung Cancer Microarrays and Survival Outcome
    5. 19.5 Simulation Studies
    6. 19.6 Remarks
    7. References
  27. 20 Optimized Cross-Study Analysis of Microarray-Based Predictors
    1. 20.1 Background
    2. 20.2 Methods
      1. 20.2.1 Data Preparation
      2. 20.2.2 Patient Selection
      3. 20.2.3 Microarray Features Annotation
      4. 20.2.4 Regression Models
      5. 20.2.5 Integrative Correlation Method
      6. 20.2.6 Gene Screening Method
    3. 20.3 Results
      1. 20.3.1 Study Synopsis
      2. 20.3.2 Impact of Alternative Cross-referencing Procedures to Agreement among Platforms
      3. 20.3.3 Impact of Data Standardization and Filtering to Agreement among Platforms
      4. 20.3.4 Investigation of the Huang Data Set and Impact of Sample Selection on Agreement among Studies
      5. 20.3.5 Intrinsic Genes Signatures Validation
    4. 20.4 Discussion
    5. 20.5 Conclusions
    6. 20.6 Acknowledgments
    7. References
  28. 21 Functional Enrichment Testing: A Survey of Statistical Methods
    1. 21.1 Introduction and Motivation
    2. 21.2 Elements of Enrichment Testing
      1. 21.2.1 Thresholded versus Continuous Measures of Differential Expression
      2. 21.2.2 Competitive versus Self-Contained Hypotheses
      3. 21.2.3 Gene-Resampling versus Subject-Resampling
      4. 21.2.4 Basic versus Informed Tests
    3. 21.3 Software
    4. 21.4 Application
    5. 21.5 Discussion
    6. References
  29. 22 Discover Trend and Progression Underlying High-Dimensional Data
    1. 22.1 Introduction
    2. 22.2 Sample Progression Discovery
      1. 22.2.1 Clustering of Features
      2. 22.2.2 Constructing Minimum-Spanning Trees
      3. 22.2.3 Selecting Progression-Associated Features
      4. 22.2.4 Reconstructing the Overall Progression Pattern
    3. 22.3 Progression With Respect to Time
    4. 22.4 Progression With Branchings
    5. 22.5 Progression Analysis Versus Clustering
    6. 22.6 Discussions
    7. References
  30. 23 Bayesian Phylogenetics Adapts to Comprehensive Infectious Disease Sequence Data
    1. 23.1 Introduction
    2. 23.2 Infectious Disease Sequence Databases
    3. 23.3 Modeling Infectious Disease Dynamics
    4. 23.4 Phylogenetic Inference
    5. 23.5 Tools to Analyze Massive Sequence Data Sets
      1. 23.5.1 Modeling Advances
      2. 23.5.2 Computing Advances
    6. 23.6 Discussion
    7. References
  31. Index