Data Intensive Distributed Computing

Book Description

Scientific and commercial applications across a diverse range of fields have become increasingly data-intensive over time. Data Intensive Distributed Computing: Challenges and Solutions for Large-scale Information Management focuses on the challenges that data-intensive applications impose on distributed systems and on the state-of-the-art solutions proposed to overcome them. Providing hints on how to manage low-level data handling issues in data-intensive distributed computing, this publication is ideal for scientists, researchers, engineers, and application developers alike. With knowledge of the correct data management techniques for their applications, readers will be able to focus on their primary goal, assured that their data management needs are handled reliably and efficiently.

Table of Contents

  1. Cover
  2. Title Page
  3. Copyright Page
  4. Preface
    1. Motivation and Organization of the Book
  5. Section 1: New Paradigms in Data Intensive Computing
    1. Chapter 1: Data-Aware Distributed Computing
      1. ABSTRACT
      2. INTRODUCTION
      3. BACKGROUND
      4. DATA SCHEDULING
      5. INTEGRATION WITH WORKFLOW PLANNING
      6. THROUGHPUT OPTIMIZATION
      7. Parallel TCP Stream Optimization
      8. Buffer Size Tuning vs Parallel Streams
      9. CONCLUSION
    2. Chapter 2: Towards Data Intensive Many-Task Computing
      1. Abstract
      2. Introduction
      3. Data Diffusion Architecture
      4. Theoretical Evaluation
      5. Micro-Benchmarks
      6. Synthetic Workloads
      7. Large-scale Astronomy Application Performance Evaluation
      8. Related Work
      9. Conclusion
    3. Chapter 3: Micro-Services
      1. ABSTRACT
      2. INTRODUCTION
      3. Micro-Service Oriented Architecture
      4. iRODS: integrated Rule-oriented Data Systems
      5. Policy Enforcement
      6. Data Management Applications
      7. Current Status and Conclusion
  6. Section 2: Distributed Storage
    1. Chapter 4: Distributed Storage Systems for Data Intensive Computing
      1. ABSTRACT
      2. Data Intensive Computing Challenges
      3. Demands and Requirements for Distributed Storage Systems in Data Intensive Science
      4. Case Studies in Distributed Storage Systems
      5. SUMMARY
    2. Chapter 5: Metadata Management in PetaShare Distributed Storage Network
      1. ABSTRACT
      2. Introduction
      3. System Overview
      4. CLIENT TOOLS
      5. Cross-Domain Metadata Management in PetaShare
      6. METADATA REPLICATION
      7. CONCLUSION
    3. Chapter 6: Data Intensive Computing with Clustered Chirp Servers
      1. ABSTRACT
      2. Introduction
      3. The Chirp File Server
      4. Case Study: The GRAND Data Lab
      5. Case Study: The Biometrics Research Grid
      6. Case Study: The Biocompute Web Portal
      7. Conclusion
  7. Section 3: Data & Workflow Management
    1. Chapter 7: A Survey of Scheduling and Management Techniques for Data-Intensive Application Workflows
      1. ABSTRACT
      2. INTRODUCTION
      3. RELATED WORK
      4. TERMS AND DEFINITIONS
      5. ABSTRACT MODEL OF A WORKFLOW MANAGEMENT SYSTEM
      6. SURVEY
      7. FUTURE DIRECTIONS
      8. CONCLUSION
    2. Chapter 8: Data Management in Scientific Workflows
      1. ABSTRACT
      2. Introduction
      3. Workflow Creation
      4. Workflow Planning and Execution
      5. Derived Data and Provenance
      6. Conclusion
    3. Chapter 9: Replica Management in Data Intensive Distributed Science Applications
      1. ABSTRACT
      2. Introduction
      3. Systems for Cataloguing and Discovery of Replicas
      4. Custom Replica Management Systems for Large Scientific Collaborations
      5. Policy-Driven Data Replication
      6. Related Work
      7. Conclusion
  8. Section 4: Data Discovery & Visualization
    1. Chapter 10: Data Intensive Computing for Bioinformatics
      1. ABSTRACT
      2. Introduction
      3. Innovations in Algorithms for Data Intensive Computing
      4. Innovations in Programming Models Using Cloud Technologies
      5. Iterative MapReduce with TWISTER
      6. Conclusion
    2. Chapter 11: Visualization of Large-Scale Distributed Data
      1. ABSTRACT
      2. Introduction
      3. The Large-Scale Data Visualization Pipeline
      4. Data Management for Supporting Distributed Visualization
      5. Data Rendering for Supporting Distributed Visualization
      6. Advanced Displays for Supporting Distributed Visualization
      7. Conclusion
    3. Chapter 12: On-Demand Visualization on Scalable Shared Infrastructure
      1. ABSTRACT
      2. Introduction
      3. Background
      4. An On-Demand System
      5. The Scheduler
      6. Results
      7. Conclusion and Future Work
  9. Compilation of References
  10. About the Contributors
  11. Index