11 Managing data projects

Deploying a successful data collection project requires more than knowledge of web technologies. This chapter focuses on the R and operating system functionality required to set up and maintain large-scale, automated data collection projects. We also discuss good practices for organizing and writing code that add robustness and traceability in case of errors. In Section 11.1, we start with an overview of R functions for interacting with the local file system. In Section 11.2, we show methods for iterative code execution, used to download pages or extract relevant information from multiple web documents. Section 11.3 provides a template for organizing extraction code and making it more robust to failed specification. We conclude the chapter with an overview of system tools that can execute R scripts automatically, which is a key requirement for building datasets from regularly updated Internet resources (Section 11.4).

11.1 Interacting with the file system

One type of R function that appears frequently in data projects is dedicated to working with files and folders on the local file system. Over the course of a data project, we are continuously interacting with the file system of our operating system. Web documents are stored locally, loaded into R, processed, and saved again after the post-processing or analysis. The file system has an important role in the data collection and analysis workflow and a firm ...
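
The store, load, process, and save cycle described above can be illustrated with a few base R functions. This is only a minimal sketch under stated assumptions: the folder name `data_project`, the file names, and the one-line HTML string are hypothetical placeholders, the document is written locally rather than downloaded, and the tag-stripping step is a crude stand-in for real processing.

```r
# Hypothetical project folder inside R's session temp directory (assumption)
project_dir <- file.path(tempdir(), "data_project")
raw_dir <- file.path(project_dir, "raw")

# Create the folder and any missing parent folders;
# showWarnings = FALSE keeps this quiet if the folder already exists
dir.create(raw_dir, recursive = TRUE, showWarnings = FALSE)

# Store a stand-in for a web document locally (no network access needed)
page_file <- file.path(raw_dir, "page1.html")
writeLines("<html><body><p>Hello</p></body></html>", page_file)

# Inspect what is stored on disk
stopifnot(file.exists(page_file))
html_files <- list.files(raw_dir, pattern = "\\.html$", full.names = TRUE)

# Load the document into R and process it
# (crude illustration: drop everything that looks like a tag)
page_text <- readLines(page_file)
processed <- gsub("<[^>]+>", "", page_text)

# Save the processed result next to the raw data
out_file <- file.path(project_dir, "page1_processed.txt")
writeLines(processed, out_file)

# File names only, without the leading path
basename(html_files)
```

Building paths with `file.path()` rather than pasting strings together keeps the separators consistent, and working inside `tempdir()` means the sketch leaves nothing behind in the actual project folder.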
