Scalable Big Data Architecture: A Practitioner’s Guide to Choosing Relevant Big Data Architecture

Book Description

This book highlights the different types of data architecture and illustrates the many possibilities hidden behind the term "Big Data", from the use of NoSQL databases to the deployment of stream analytics architectures, machine learning, and governance.

Scalable Big Data Architecture covers real-world, concrete industry use cases that leverage complex distributed applications, which involve web applications, RESTful APIs, and a high throughput of large amounts of data stored in highly scalable NoSQL data stores such as Couchbase and Elasticsearch. This book demonstrates how data processing can be done at scale, from the use of NoSQL data stores to their combination with a Big Data distribution.

When data processing becomes too complex and involves different processing topologies, such as long-running jobs, stream processing, correlation of multiple data sources, and machine learning, it is often necessary to delegate the load to Hadoop or Spark and use the NoSQL store to serve the processed data in real time.

This book shows you how to choose a relevant combination of big data technologies available within the Hadoop ecosystem. It focuses on processing long jobs, architecture, stream data patterns, log analysis, and real-time analytics. Every pattern is illustrated with practical examples that use different open source projects such as Logstash, Spark, and Kafka.

Traditional data infrastructures are built for digesting and rendering data synthesis and analytics from large amounts of data. This book helps you understand why you should consider using machine learning algorithms early in the project, before being overwhelmed by the constraints imposed by the high throughput of Big Data.

Scalable Big Data Architecture is for developers, data architects, and data scientists looking for a better understanding of how to choose the most relevant pattern for a Big Data project and which tools to integrate into that pattern.

Table of Contents

  1. Cover
  2. Title
  3. Copyright
  4. Dedication
  5. Contents at a glance
  6. Contents
  7. About the Author
  8. About the Technical Reviewers
  9. Chapter 1: The Big (Data) Problem
    1. Identifying Big Data Symptoms
      1. Size Matters
      2. Typical Business Use Cases
    2. Understanding the Big Data Project’s Ecosystem
      1. Hadoop Distribution
      2. Data Acquisition
      3. Processing Language
      4. Machine Learning
      5. NoSQL Stores
    3. Creating the Foundation of a Long-Term Big Data Architecture
      1. Architecture Overview
      2. Log Ingestion Application
      3. Learning Application
      4. Processing Engine
      5. Search Engine
    4. Summary
  10. Chapter 2: Early Big Data with NoSQL
    1. NoSQL Landscape
      1. Key/Value
      2. Column
      3. Document
      4. Graph
      5. NoSQL in Our Use Case
    2. Introducing Couchbase
      1. Architecture
      2. Cluster Manager and Administration Console
      3. Managing Documents
    3. Introducing ElasticSearch
      1. Architecture
      2. Monitoring ElasticSearch
      3. Search with ElasticSearch
    4. Using NoSQL as a Cache in a SQL-based Architecture
      1. Caching Document
      2. ElasticSearch Plug-in for Couchbase with Couchbase XDCR
      3. ElasticSearch Only
    5. Summary
  11. Chapter 3: Defining the Processing Topology
    1. First Approach to Data Architecture
      1. A Little Bit of Background
      2. Dealing with the Data Sources
      3. Processing the Data
    2. Splitting the Architecture
      1. Batch Processing
      2. Stream Processing
    3. The Concept of a Lambda Architecture
    4. Summary
  12. Chapter 4: Streaming Data
    1. Streaming Architecture
      1. Architecture Diagram
      2. Technologies
    2. The Anatomy of the Ingested Data
      1. Clickstream Data
      2. The Raw Data
      3. The Log Generator
    3. Setting Up the Streaming Architecture
      1. Shipping the Logs in Apache Kafka
      2. Draining the Logs from Apache Kafka
    4. Summary
  13. Chapter 5: Querying and Analyzing Patterns
    1. Defining an Analytics Strategy
      1. Continuous Processing
      2. Real-Time Querying
    2. Process and Index Data Using Spark
      1. Preparing the Spark Project
      2. Understanding a Basic Spark Application
      3. Implementing the Spark Streamer
      4. Implementing a Spark Indexer
      5. Implementing a Spark Data Processing
    3. Data Analytics with Elasticsearch
      1. Introduction to the Aggregation Framework
    4. Visualize Data in Kibana
    5. Summary
  14. Chapter 6: Learning From Your Data?
    1. Introduction to Machine Learning
      1. Supervised Learning
      2. Unsupervised Learning
      3. Machine Learning with Spark
      4. Adding Machine Learning to Our Architecture
    2. Adding Machine Learning to Our Architecture
      1. Enriching the Clickstream Data
      2. Labelizing the Data
      3. Training and Making Prediction
    3. Summary
  15. Chapter 7: Governance Considerations
    1. Dockerizing the Architecture
      1. Introducing Docker
      2. Installing Docker
      3. Creating Your Docker Images
      4. Composing the Architecture
    2. Architecture Scalability
      1. Sizing and Scaling the Architecture
      2. Monitoring the Infrastructure Using the Elastic Stack
      3. Considering Security
    3. Summary
  16. Index