An Example

In this chapter, we will load a data set from NCBI’s Gene Expression Omnibus (GEO) website (http://www.ncbi.nlm.nih.gov/geo/). GEO is a public repository that archives and freely distributes microarray, next-generation sequencing, and other forms of high-throughput functional genomic data submitted by the scientific community. It is one of many resources available through the National Center for Biotechnology Information (NCBI), which is part of the National Library of Medicine, itself part of the U.S. National Institutes of Health (NIH). GEO is a very useful resource when learning to use Bioconductor, because you can find not only raw data but also references to the papers that analyzed that data.

As an example, we’ll use the data files from GSE2034 (http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE2034), a study that looked for predictors of relapse-free breast cancer survival. (I used data from the same study as an example in Survival Models.) My goal is not to re-create the results shown in the original papers, but to show how Bioconductor tools can be used to load and inspect this data.

Loading Raw Expression Data

Let’s start with an example of loading raw data into R. We’ll show how to load Affymetrix CEL files, which are output from Affymetrix’s scanner software. If you would like to try this yourself, you can download the raw CEL files from ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE2034/GSE2034_RAW.tar ...
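As a rough illustration of this loading step, here is a minimal sketch. It assumes the tar archive has already been downloaded and unpacked into a local directory (the name GSE2034_RAW is hypothetical) and that the Bioconductor package affy is installed. ReadAffy is affy's function for reading CEL files into an AffyBatch object. Depending on your package versions, gzip-compressed CEL files may need to be decompressed first.

```r
# Sketch only: the directory name below is an assumption, not part of the
# original text. Point it at wherever you unpacked GSE2034_RAW.tar.
cel_dir <- "GSE2034_RAW"

# Collect the CEL files (plain or gzip-compressed) with their full paths.
# If the directory does not exist, this is simply an empty character vector.
cel_files <- list.files(cel_dir, pattern = "\\.CEL(\\.gz)?$",
                        ignore.case = TRUE, full.names = TRUE)

if (requireNamespace("affy", quietly = TRUE) && length(cel_files) > 0) {
  # Read all of the CEL files into a single AffyBatch object.
  raw <- affy::ReadAffy(filenames = cel_files)
  print(raw)  # short summary: array size, chip type, number of samples
} else {
  message("Install the affy package and unpack the CEL files into ", cel_dir)
}
```

Listing the files explicitly, rather than relying on the current working directory, makes it easy to check which files will be read before starting what can be a slow, memory-hungry load.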
