Big Data

Book Description

Organizations today capture exponentially more data than ever before, and it is time for them to rethink how they digest that data. Through advanced algorithms and analytics techniques, organizations can harness this data, discover hidden patterns, and use the newly acquired knowledge to gain a competitive advantage.

Presenting the contributions of leading experts in their respective fields, Big Data: Algorithms, Analytics, and Applications bridges the gap between the vastness of Big Data and the appropriate computational methods for scientific and social discovery. It covers fundamental issues about Big Data, including efficient algorithmic methods to process data, better analytical strategies to digest data, and representative applications in diverse fields, such as medicine, science, and engineering. The book is organized into five main sections:

  1. Big Data Management—considers the research issues related to the management of Big Data, including indexing and scalability aspects
  2. Big Data Processing—addresses the problem of processing Big Data across a wide range of resource-intensive computational settings
  3. Big Data Stream Techniques and Algorithms—explores research issues regarding the management and mining of Big Data in streaming environments
  4. Big Data Privacy—focuses on models, techniques, and algorithms for preserving Big Data privacy
  5. Big Data Applications—illustrates practical applications of Big Data across several domains, including finance, multimedia tools, biometrics, and satellite Big Data processing

Overall, the book reports on state-of-the-art studies and achievements in algorithms, analytics, and applications of Big Data. It provides readers with the basis for further efforts in this challenging scientific field that will play a leading role in next-generation database, data warehousing, data mining, and cloud computing research. It also explores related applications in diverse sectors, covering technologies for media/data communication, elastic media/data storage, cross-network media/data fusion, and SaaS.

Table of Contents

  1. Foreword by Jack Dongarra
  2. Foreword by Dr. Yi Pan
  3. Foreword by D. Frank Hsu
  4. Preface
  5. Editors
  6. Contributors
  7. Section I - Big Data Management
    1. Chapter 1 - Scalable Indexing for Big Data Processing
      1. Abstract
      2. 1.1 Introduction
      3. 1.2 Permutation-Based Indexing
        1. 1.2.1 Indexing Model
        2. 1.2.2 Technical Implementation
      4. 1.3 Related Data Structures
        1. 1.3.1 Metric Inverted Files
        2. 1.3.2 Brief Permutation Index
        3. 1.3.3 Prefix Permutation Index
        4. 1.3.4 Neighborhood Approximation
        5. 1.3.5 Metric Suffix Array
        6. 1.3.6 Metric Permutation Table
      5. 1.4 Distributed Indexing
        1. 1.4.1 Data Based
          1. 1.4.1.1 Indexing
          2. 1.4.1.2 Searching
        2. 1.4.2 Reference Based
          1. 1.4.2.1 Indexing
          2. 1.4.2.2 Searching
        3. 1.4.3 Index Based
          1. 1.4.3.1 Indexing
          2. 1.4.3.2 Searching
      6. 1.5 Evaluation
        1. 1.5.1 Recall and Position Error
        2. 1.5.2 Indexing and Searching Performance
        3. 1.5.3 Big Data Indexing and Searching
      7. 1.6 Conclusion
      8. Acknowledgment
      9. References
    2. Chapter 2 - Scalability and Cost Evaluation of Incremental Data Processing Using Amazon’s Hadoop Service
      1. Abstract
      2. 2.1 Introduction
      3. 2.2 Introduction of MapReduce and Apache Hadoop
      4. 2.3 A Motivating Application: Movie Ratings from Netflix Prize
      5. 2.4 Implementation in Hadoop
      6. 2.5 Deployment Architecture
      7. 2.6 Scalability and Cost Evaluation
      8. 2.7 Discussions
      9. 2.8 Related Work
      10. 2.9 Conclusion
      11. Acknowledgment
      12. References
      13. Appendix 2.A: Source Code of Mappers and Reducers
    3. Chapter 3 - Singular Value Decomposition, Clustering, and Indexing for Similarity Search for Large Data Sets in High-Dimensional Spaces
      1. Abstract
      2. 3.1 Introduction
      3. 3.2 Data Reduction Methods and SVD
      4. 3.3 Clustering Methods
        1. 3.3.1 Partitioning Methods
        2. 3.3.2 Hierarchical Clustering
        3. 3.3.3 Density-Based Methods
        4. 3.3.4 Grid-Based Methods
        5. 3.3.5 Subspace Clustering Methods
      5. 3.4 Steps in Building an Index for k-NN Queries
      6. 3.5 Nearest Neighbors Queries in High-Dimensional Space
      7. 3.6 Alternate Method Combining SVD and Clustering
      8. 3.7 Survey of High-Dimensional Indices
      9. 3.8 Conclusions
      10. Acknowledgments
      11. References
      12. Appendix 3.A: Computing Approximate Distances with Dimensionality-Reduced Data
    4. Chapter 4 - Multiple Sequence Alignment and Clustering with Dot Matrices, Entropy, and Genetic Algorithms
      1. Abstract
      2. 4.1 Introduction
      3. 4.2 CDM
      4. 4.3 PEA
      5. 4.4 Divide and Conquer
      6. 4.5 GAs
      7. 4.6 DCGA
      8. 4.7 K-Means
      9. 4.8 Clustering Genetic Algorithm with the SSE Criterion
      10. 4.9 MapReduce Section
      11. 4.10 Simulation
      12. 4.11 Conclusion
      13. References
  8. Section II - Big Data Processing
    1. Chapter 5 - Approaches for High-Performance Big Data Processing: Applications and Challenges
      1. Abstract
      2. 5.1 Introduction
      3. 5.2 Big Data Definition and Concepts
      4. 5.3 Cloud Computing for Big Data Analysis
        1. 5.3.1 Data Analytics Tools as SaaS
        2. 5.3.2 Computing as IaaS
      5. 5.4 Challenges and Current Research Directions
      6. 5.5 Conclusions and Perspectives
      7. References
    2. Chapter 6 - The Art of Scheduling for Big Data Science
      1. Abstract
      2. 6.1 Introduction
      3. 6.2 Requirements for Scheduling in Big Data Platforms
      4. 6.3 Scheduling Models and Algorithms
      5. 6.4 Data Transfer Scheduling
      6. 6.5 Scheduling Policies
      7. 6.6 Optimization Techniques for Scheduling
      8. 6.7 Case Study on Hadoop and Big Data Applications
      9. 6.8 Conclusions
      10. References
    3. Chapter 7 - Time–Space Scheduling in the MapReduce Framework
      1. Abstract
      2. 7.1 Introduction
      3. 7.2 Overview of Big Data Processing Architecture
      4. 7.3 Self-Adaptive Reduce Task Scheduling
        1. 7.3.1 Problem Analysis
        2. 7.3.2 Runtime Analysis of MapReduce Jobs
        3. 7.3.3 A Method of Reduce Task Start-Time Scheduling
      5. 7.4 Reduce Placement
        1. 7.4.1 Optimal Algorithms for Cross-Rack Communication Optimization
        2. 7.4.2 Locality-Aware Reduce Task Scheduling
        3. 7.4.3 MapReduce Network Traffic Reduction
        4. 7.4.4 The Source of MapReduce Skews
        5. 7.4.5 Reduce Placement in Hadoop
      6. 7.5 NER in Biomedical Big Data Mining: A Case Study
        1. 7.5.1 Biomedical Big Data
        2. 7.5.2 Biomedical Text Mining and NER
        3. 7.5.3 MapReduce for CRFs
      7. 7.6 Concluding Remarks
      8. References
    4. Chapter 8 - GEMS: Graph Database Engine for Multithreaded Systems
      1. Abstract
      2. 8.1 Introduction
      3. 8.2 Related Infrastructures
      4. 8.3 GEMS Overview
      5. 8.4 GMT Architecture
        1. 8.4.1 GMT: Aggregation
        2. 8.4.2 GMT: Fine-Grained Multithreading
      6. 8.5 Experimental Results
        1. 8.5.1 Synthetic Benchmarks
        2. 8.5.2 BSBM
        3. 8.5.3 RDESC
      7. 8.6 Conclusions
      8. References
    5. Chapter 9 - KSC-net: Community Detection for Big Data Networks
      1. Abstract
      2. 9.1 Introduction
      3. 9.2 KSC for Big Data Networks
        1. 9.2.1 Notations
        2. 9.2.2 FURS Selection
        3. 9.2.3 KSC Framework
          1. 9.2.3.1 Training Model
          2. 9.2.3.2 Model Selection
          3. 9.2.3.3 Out-of-Sample Extension
        4. 9.2.4 Practical Issues
      4. 9.3 KSC-net Software
        1. 9.3.1 KSC Demo on Synthetic Network
        2. 9.3.2 KSC Subfunctions
        3. 9.3.3 KSC Demo on Real-Life Network
      5. 9.4 Conclusion
      6. Acknowledgments
      7. References
    6. Chapter 10 - Making Big Data Transparent to the Software Developers’ Community
      1. Abstract
      2. 10.1 Introduction
      3. 10.2 Software Developers’ Information Needs
        1. 10.2.1 Information Needs: Core Work Practice
        2. 10.2.2 Information Needs: Constructing and Maintaining Relationships
        3. 10.2.3 Information Needs: Professional/Career Development
      4. 10.3 Software Developers’ Ecosystem
        1. 10.3.1 Social Media Use
        2. 10.3.2 The Ecosystem
      5. 10.4 Information Overload and Awareness Issue
      6. 10.5 The Application of Big Data to Support the Software Developers’ Community
        1. 10.5.1 Data Generated from Core Practices
        2. 10.5.2 Software Analytics
      7. 10.6 Conclusion
      8. References
  9. Section III - Big Data Stream Techniques and Algorithms
    1. Chapter 11 - Key Technologies for Big Data Stream Computing
      1. Abstract
      2. 11.1 Introduction
        1. 11.1.1 Stream Computing
        2. 11.1.2 Application Background
        3. 11.1.3 Chapter Organization
      3. 11.2 Overview of a BDSC System
        1. 11.2.1 Directed Acyclic Graph and Stream Computing
        2. 11.2.2 System Architecture for Stream Computing
        3. 11.2.3 Key Technologies for BDSC Systems
          1. 11.2.3.1 System Structure
          2. 11.2.3.2 Data Stream Transmission
          3. 11.2.3.3 Application Interfaces
          4. 11.2.3.4 High Availability
      4. 11.3 Example BDSC Systems
        1. 11.3.1 Twitter Storm
          1. 11.3.1.1 Task Topology
          2. 11.3.1.2 Fault Tolerance
          3. 11.3.1.3 Reliability
          4. 11.3.1.4 Storm Cluster
        2. 11.3.2 Yahoo! S4
          1. 11.3.2.1 Processing Element
          2. 11.3.2.2 Processing Nodes
          3. 11.3.2.3 Fail-Over, Checkpointing, and Recovery Mechanism
          4. 11.3.2.4 System Architecture
        3. 11.3.3 Microsoft TimeStream and Naiad
          1. 11.3.3.1 TimeStream
          2. 11.3.3.2 Naiad
      5. 11.4 Future Perspective
        1. 11.4.1 Grand Challenges
          1. 11.4.1.1 High Scalability
          2. 11.4.1.2 High Fault Tolerance
          3. 11.4.1.3 High Consistency
          4. 11.4.1.4 High Load Balancing
          5. 11.4.1.5 High Throughput
        2. 11.4.2 On-the-Fly Work
      6. Acknowledgments
      7. References
    2. Chapter 12 - Streaming Algorithms for Big Data Processing on Multicore Architecture
      1. Abstract
      2. 12.1 Introduction
      3. 12.2 An Unconventional Big Data Processor
        1. 12.2.1 Terminology
        2. 12.2.2 Overview of Hadoop
        3. 12.2.3 Hadoop Alternative: Big Data Replay
      4. 12.3 Putting the Pieces Together
        1. 12.3.1 More on the Scope of the Problem
        2. 12.3.2 Overview of Literature
      5. 12.4 The Data Streaming Problem
        1. 12.4.1 Data Streaming Terminology
        2. 12.4.2 Related Information Theory and Formulations
        3. 12.4.3 Practical Applications and Designs
      6. 12.5 Practical Hashing and Bloom Filters
        1. 12.5.1 Bloom Filters: Store, Lookup, and Efficiency
        2. 12.5.2 Unconventional Bloom Filter Designs for Data Streams
        3. 12.5.3 Practical Data Streaming Targets
      7. 12.6 Big Data Streaming Optimization
        1. 12.6.1 A Simple Model of a Data Streaming Process
        2. 12.6.2 Streaming on Multicore
        3. 12.6.3 Performance Metrics
        4. 12.6.4 Example Analysis
      8. 12.7 Big Data Streaming on Multicore Technology
        1. 12.7.1 Parallel Processing Basics
        2. 12.7.2 DLL
        3. 12.7.3 Lock-Free Parallelization
        4. 12.7.4 Software APIs
      9. 12.8 Summary
      10. References
    3. Chapter 13 - Organic Streams: A Unified Framework for Personal Big Data Integration and Organization Towards Social Sharing and Individualized Sustainable Use
      1. Abstract
      2. 13.1 Introduction
      3. 13.2 Overview of Related Work
      4. 13.3 Organic Stream: Definitions and Organizations
        1. 13.3.1 Metaphors and Graph Model
        2. 13.3.2 Definition of Organic Stream
        3. 13.3.3 Organization of Social Streams
      5. 13.4 Experimental Result and Analysis
        1. 13.4.1 Functional Modules
        2. 13.4.2 Experiment Analysis
      6. 13.5 Summary
      7. References
    4. Chapter 14 - Managing Big Trajectory Data: Online Processing of Positional Streams
      1. Abstract
      2. 14.1 Introduction
      3. 14.2 Trajectory Representation and Management
      4. 14.3 Online Trajectory Compression with Spatiotemporal Criteria
      5. 14.4 Amnesic Multiresolution Trajectory Synopses
      6. 14.5 Continuous Range Search over Uncertain Locations
      7. 14.6 Multiplexing of Evolving Trajectories
      8. 14.7 Toward Next-Generation Management of Big Trajectory Data
      9. References
  10. Section IV - Big Data Privacy
    1. Chapter 15 - Personal Data Protection Aspects of Big Data
      1. Abstract
      2. 15.1 Introduction
        1. 15.1.1 Topic and Aim
        2. 15.1.2 Note to the Reader, Structure, and Arguments
      3. 15.2 Data Protection Aspects
        1. 15.2.1 Big Data and Analytics in Four Steps
        2. 15.2.2 Personal Data
          1. 15.2.2.1 Profiling Activities on Personal Data
          2. 15.2.2.2 Pseudonymization
          3. 15.2.2.3 Anonymous Data
          4. 15.2.2.4 Reidentification
        3. 15.2.3 Purpose Limitation
      4. 15.3 Conclusions and Recommendations
      5. References
    2. Chapter 16 - Privacy-Preserving Big Data Management: The Case of OLAP
      1. Abstract
      2. 16.1 Introduction
        1. 16.1.1 Problem Definition
        2. 16.1.2 Chapter Organization
      3. 16.2 Literature Overview and Survey
        1. 16.2.1 Privacy-Preserving OLAP in Centralized Environments
        2. 16.2.2 Privacy-Preserving OLAP in Distributed Environments
      4. 16.3 Fundamental Definitions and Formal Tools
      5. 16.4 Dealing with Overlapping Query Workloads
      6. 16.5 Metrics for Modeling and Measuring Accuracy
      7. 16.6 Metrics for Modeling and Measuring Privacy
      8. 16.7 Accuracy and Privacy Thresholds
      9. 16.8 Accuracy Grids and Multiresolution Accuracy Grids: Conceptual Tools for Handling Accuracy and Privacy
      10. 16.9 An Effective and Efficient Algorithm for Computing Synopsis Data Cubes
        1. 16.9.1 Allocation Phase
        2. 16.9.2 Sampling Phase
        3. 16.9.3 Refinement Phase
        4. 16.9.4 The computeSynDataCube Algorithm
      11. 16.10 Experimental Assessment and Analysis
      12. 16.11 Conclusions and Future Work
      13. References
  11. Section V - Big Data Applications
    1. Chapter 17 - Big Data in Finance
      1. Background
      2. 17.1 Introduction
      3. 17.2 Financial Domain Dynamics
        1. 17.2.1 Historical Landscape versus Emerging Trends
      4. 17.3 Financial Capital Market Domain: In-Depth View
        1. 17.3.1 Big Data Origins
        2. 17.3.2 Information Flow
        3. 17.3.3 Data Analytics
      5. 17.4 Emerging Big Data Landscape in Finance
        1. 17.4.1 Challenges
        2. 17.4.2 New Models of Computation and Novel Architectures
      6. 17.5 Impact on Financial Research and Emerging Research Landscape
        1. 17.5.1 Background
        2. 17.5.2 UHFD (Big Data)–Driven Research
        3. 17.5.3 UHFD (Big Data) Implications
        4. 17.5.4 UHFD (Big Data) Challenges
      7. 17.6 Summary
      8. References
    2. Chapter 18 - Semantic-Based Heterogeneous Multimedia Big Data Retrieval
      1. Abstract
      2. 18.1 Introduction
      3. 18.2 Related Work
      4. 18.3 Proposed Framework
        1. 18.3.1 Overview
        2. 18.3.2 Semantic Annotation
        3. 18.3.3 Optimization and User Feedback
        4. 18.3.4 Semantic Representation
        5. 18.3.5 NoSQL-Based Semantic Storage
        6. 18.3.6 Heterogeneous Multimedia Retrieval
      5. 18.4 Performance Evaluation
        1. 18.4.1 Running Environment and Software Tools
        2. 18.4.2 Performance Evaluation Model
        3. 18.4.3 Precision Ratio Evaluation
        4. 18.4.4 Time and Storage Cost
      6. 18.5 Discussions and Conclusions
      7. Acknowledgments
      8. References
    3. Chapter 19 - Topic Modeling for Large-Scale Multimedia Analysis and Retrieval
      1. Abstract
      2. 19.1 Introduction
      3. 19.2 Large-Scale Computing Frameworks
      4. 19.3 Probabilistic Topic Modeling
      5. 19.4 Couplings among Topic Models, Cloud Computing, and Multimedia Analysis
        1. 19.4.1 Large-Scale Topic Modeling
        2. 19.4.2 Topic Modeling for Multimedia
        3. 19.4.3 Large-Scale Computing in Multimedia
      6. 19.5 Large-Scale Topic Modeling for Multimedia Retrieval and Analysis
      7. 19.6 Conclusions and Future Directions
      8. References
    4. Chapter 20 - Big Data Biometrics Processing: A Case Study of an Iris Matching Algorithm on Intel Xeon Phi
      1. Abstract
      2. 20.1 Introduction
      3. 20.2 Background
        1. 20.2.1 Intel Xeon Phi
        2. 20.2.2 Iris Matching Algorithm
        3. 20.2.3 OpenMP
        4. 20.2.4 Intel VTune Amplifier
      4. 20.3 Experiments
        1. 20.3.1 Experiment Setup
        2. 20.3.2 Workload Characteristics
        3. 20.3.3 Impact of Different Affinity
        4. 20.3.4 Optimal Number of Threads
        5. 20.3.5 Vectorization
      5. 20.4 Conclusions
      6. Acknowledgments
      7. References
    5. Chapter 21 - Storing, Managing, and Analyzing Big Satellite Data: Experiences and Lessons Learned from a Real-World Application
      1. 21.1 Introduction
      2. 21.2 The Landsat Program
      3. 21.3 New Challenges and Solutions
        1. 21.3.1 The Conventional Satellite Imagery Distribution System
        2. 21.3.2 The New Satellite Data Distribution Policy
        3. 21.3.3 Impact on the Data Process Work Flow
        4. 21.3.4 Impact on the System Architecture, Hardware, and Software
        5. 21.3.5 Impact on the Characteristics of Users and Their Behaviors
        6. 21.3.6 The New System Architecture
      4. 21.4 Using Big Data Analytics to Improve Performance and Reduce Operation Cost
        1. 21.4.1 Vis-EROS: Big Data Visualization
        2. 21.4.2 FastStor: Data Mining-Based Multilayer Prefetching
      5. 21.5 Conclusions: Experiences and Lessons Learned
      6. Acknowledgments
      7. References
    6. Chapter 22 - Barriers to the Adoption of Big Data Applications in the Social Sector
      1. 22.1 Introduction
      2. 22.2 The Potential of Big Data: Benefits to the Social Sector—From Business to Social Enterprise to NGO
      3. 22.3 How NGOs can Leverage Big Data to Achieve Their Missions
      4. 22.4 Historical Limitations and Considerations
      5. 22.5 The Gap in Understanding within the Social Sector
      6. 22.6 Next Steps: How to Bridge the Gap
      7. 22.7 Conclusion
      8. References