Bioinformatics Data Skills

Book Description

Many biologists begin their bioinformatics training by learning Perl and R, but there’s a huge gap between solving small problems with messy scripts and analyzing large amounts of biological data. This practical book teaches the data skills you need to turn large sequencing datasets into reproducible and robust biological findings.

Table of Contents

  1. I. Ideology: Data Skills, Robust and Reproducible Bioinformatics
    1. 1. How to Learn Bioinformatics
      1. Why Bioinformatics? Biology’s Growing Data
      2. Learning Data Skills to Learn Bioinformatics
      3. New Challenges for Reproducible and Robust Research
      4. Reproducible Research
      5. Robust Research and the Golden Rule of Bioinformatics
      6. Adopting Robust and Reproducible Practices Will Prevent Headaches Too
      7. Recommendations for Robust Research
        1. Pay Attention to Experimental Design
        2. Write Code for Humans, Write Data for Computers
        3. Let Your Computer Do the Work For You
        4. Test Code, or Better Yet, Let Code Test Code
        5. Use Existing Libraries Whenever Possible
        6. Make Assertions and be Loud, in Code and in Your Methods
        7. Treat Data as Read-Only
        8. Spend Time Developing Frequently-Used Scripts into Tools
        9. Let Data Prove It’s High Quality
      8. Recommendations for Reproducible Research
        1. Release Your Code and Data
        2. Document Everything
        3. Make Figures and Statistics the Results of Scripts
        4. Use Code as Documentation
  2. II. Prerequisites: Setting up a Project, Working with Unix, Version Control, and Data
    1. 2. Setting up and Managing a Bioinformatics Project
      1. Project Directories and Directory Structures
      2. Keeping Project Documentation
      3. Use Directories to Divide Up Your Project into Sub-Projects
      4. Organizing Data to Automate Tasks
      5. Keeping Project Documentation in a Markdown Notebook
        1. Markdown Formatting Basics
        2. Using Pandoc to Render HTML
    2. 3. Remedial Unix Shell
      1. Why Do We Use Unix in Bioinformatics? Modularity and the Unix Philosophy
      2. Working with Streams and Redirection
        1. Redirecting Standard Out to a File
        2. Redirecting Standard Error
        3. Using Standard Input Redirection
      3. The Almighty Unix Pipe: Speed and Beauty in One
        1. Pipes in Action: Creating Simple Programs with Grep and Pipes
        2. Combining Pipes and Redirection
        3. Even More Redirection: A
      4. Managing and Interacting with Processes
        1. Background Processes
        2. Killing Processes
        3. Exit Status: How to Programmatically Tell Whether Your Command Worked
      5. Command Substitution
    3. 4. Working with Remote Machines
      1. Connecting to Remote Machines with SSH
      2. Quick Authentication with SSH Keys
      3. Maintaining Long-Running Jobs with Nohup and Tmux
        1. Nohup
      4. Working with Remote Machines through Tmux
        1. Installing and Configuring Tmux
        2. Creating, Detaching, and Attaching Tmux Sessions
        3. Working With Tmux Windows
    4. 5. Git for Scientists
      1. Why Git is Necessary in Bioinformatics Projects
        1. Git Allows You to Keep Snapshots of Your Project
        2. Git Helps You Keep Track of Important Changes to Code
        3. Git Helps Keep Software Organized, Even When People Leave
      2. Installing Git
      3. Basic Git: Creating Repositories, Tracking Files, Staging Changes, and Commits
        1. Git Setup: Telling Git Who You Are
        2. and
        3. Tracking Files in Git:
        4. Staging Files in Git:
        5. : Taking a Snapshot of your Project
        6. Seeing file Differences:
        7. Seeing Differences and Your Commit History:
        8. Moving and Removing Files:
        9. Telling Git what to Ignore:
        10. Undoing A Stage:
      4. Collaborating with Git: Git Remotes,
        1. Creating a Shared Central Repository with Github
        2. Authenticating with Git Remotes
        3. Connecting with Git Remotes:
        4. Pushing Commits to a Remote Repository
        5. Pulling Commits from a Remote Repository
        6. Working with your Collaborators: Pushing and Pulling
        7. Merge Conflicts
        8. More Github Workflows: Forking and Pull Requests
      5. Using Git to Make Life Easier: Working with Past Commits
        1. Getting Files from the Past:
        2. Stashing Your Changes:
        3. Comparing Commits and Files: More
        4. Undoing and Editing Commits:
      6. Working with Branches
        1. Creating and Working with Branches:
        2. Merging Branches:
        3. Branches and Remotes
    5. 6. Bioinformatics Data
      1. Retrieving Bioinformatics Data
        1. Wget & Curl
        2. Rsync and Secure Copy (scp)
      2. Data Integrity
        1. SHA and MD5 Checksums
      3. Looking at Differences Between Data
      4. Compressing Data and Working with Compressed Data
        1. Gzip
        2. Working with Gzipped Compressed Files
      5. Case Study: Reproducibly Downloading Data
    6. 7. A Rapid Introduction to the R Language
      1. Why Use R?
      2. The R Language’s Heritage and Design Features
      3. How R Fits into our Bioinformatics Workflow
      4. Working and Developing in R with RStudio
      5. Basic R: Building the Data Foundation
        1. First Steps: Assignment, Environments, and Functions
        2. Second Steps: Control Flow, Looping, and Including Files
        3. An Introduction to R’s Vectors, Indexing, and Vectorization
        4. Vector Types and Coercion
        5. Vector Names
        6. Special Values
        7. Digression: Getting Help in R
        8. Lists
        9. Factors
        10. Data with Dimensions: Arrays and Matrices
        11. Peeking Under the Hood: S3 Classes and Polymorphism
        12. Data Frames
      6. Working with Data in R
        1. Subsetting Vectors and Dataframes
        2. Reordering Vectors
        3. Joining Data: Matching Vectors and Merging Dataframes
        4. Exporting and Importing Data
        5. Writing and Applying Functions to Data
        6. Working with the Split-Apply-Combine Pattern
        7. Working with Strings
        8. Exploring Dataframes with
      7. Visualization
        1. Creating a Simple Scatterplot
        2. Histograms and Statistical Summaries
        3. Adding Facets
        4. Smoothing Data
        5. Setting Different Axis Scales
    7. 8. Working with Range Data
      1. A Crash Course in Genomic Ranges and Coordinate Systems
      2. An Interactive Introduction to Range Data with GenomicRanges
        1. Installing and Working with Bioconductor Packages
        2. Storing Generic Ranges with IRanges
        3. Basic Range Operations: Arithmetic, Transformations, and Set Operations
        4. Finding Overlapping Ranges
        5. Finding Nearest Ranges and Calculating Distance
        6. Run Length Encoding and Views
        7. Storing Genomic Ranges with GenomicRanges
        8. Grouping Data with GRangesList
        9. Working with Annotation Data: GenomicFeatures and rtracklayer
        10. Retrieving Promoter Regions: Flank and Promoters
        11. Retrieving Promoter Sequence: Connecting GenomicRanges with Sequence Data
        12. Getting Intergenic and Intronic Regions: Gaps, Reduce, and Setdiffs in Practice
        13. Finding and Working with Overlapping Ranges
        14. Calculating Coverage of GRanges Objects
      3. Working with Ranges Data on the Command Line with BEDTools
        1. Computing Overlaps with BEDTools Intersect
        2. BEDTools Slop and Flank
        3. Coverage with BEDTools
        4. Other BEDTools Subcommands and pybedtools
  3. About the Author
  4. Copyright