Real-World Hadoop

Book Description

If you’re a business team leader, CIO, business analyst, or developer interested in how Apache Hadoop and Apache HBase-related technologies can address problems involving large-scale data in cost-effective ways, this book is for you. Using real-world stories and situations, authors Ted Dunning and Ellen Friedman show Hadoop newcomers and seasoned users alike how NoSQL databases and Hadoop can solve a variety of business and research issues.

You’ll learn about early decisions and pre-planning that can make the process easier and more productive. If you’re already using these technologies, you’ll discover ways to gain the full range of benefits possible with Hadoop. While you don’t need a deep technical background to get started, this book does provide expert guidance to help managers, architects, and practitioners succeed with their Hadoop projects.

  • Examine a day in the life of big data: India’s ambitious Aadhaar project
  • Review tools in the Hadoop ecosystem such as Apache Spark, Storm, and Drill to learn how they can help you
  • Pick up a collection of technical and strategic tips that have helped others succeed with Hadoop
  • Learn from several prototypical Hadoop use cases, based on how organizations have actually applied the technology
  • Explore real-world stories that reveal how MapR customers combine use cases when putting Hadoop and NoSQL to work, including in production

Ted Dunning is Chief Applications Architect at MapR Technologies, and a committer and PMC member of the Apache Drill, Storm, Mahout, and ZooKeeper projects. He is also a mentor for the Apache Datafu, Kylin, Zeppelin, Calcite, and Samoa projects.

Ellen Friedman is a solutions consultant, speaker, and author, writing mainly about big data topics. She is a committer for the Apache Mahout project and a contributor to the Apache Drill project.

Table of Contents

  1. Dedication
  2. Preface
    1. How to Use This Book
  3. 1. Turning to Apache Hadoop and NoSQL Solutions
    1. A Day in the Life of a Big Data Project
    2. From Aadhaar to Your Own Big Data Project
    3. What Hadoop and NoSQL Do
    4. When Are Hadoop and NoSQL the Right Choice?
  4. 2. What the Hadoop Ecosystem Offers
    1. Typical Functions
    2. Data Storage and Persistence
    3. Data Ingest
      1. Apache Kafka
      2. Apache Sqoop
      3. Apache Flume
    4. Data Extraction from Hadoop
    5. Processing, Transforming, Querying
      1. Streaming
      2. Micro-batching
      3. Batch Processing
      4. Interactive Query
        1. Impala
        2. Apache Drill
        3. Apache Spark
      5. Search Abuse—Using Search and Indexing for Interactive Query
      6. Visualization Tools
    6. Integration via ODBC and JDBC
  5. 3. Understanding the MapR Distribution for Apache Hadoop
    1. Use of Existing Non-Hadoop Applications
    2. Making Use of a Realtime Distributed File System
    3. Meeting SLAs
    4. Deploying Data at Scale to Remote Locations
    5. Consistent Data Versioning
    6. Finding the Keys to Success
  6. 4. Decisions That Drive Successful Hadoop Projects
    1. Tip #1: Pick One Thing to Do First
    2. Tip #2: Shift Your Thinking
    3. Tip #3: Start Conservatively But Plan to Expand
    4. Tip #4: Be Honest with Yourself
    5. Tip #5: Plan Ahead for Maintenance
    6. Tip #6: Think Big: Don’t Underestimate What You Can (and Will) Want to Do
    7. Tip #7: Explore New Data Formats
    8. Tip #8: Consider Data Placement When You Expand a Cluster
    9. Tip #9: Plot Your Expansion
    10. Tip #10: Form a Queue to the Right, Please
    11. Tip #11: Provide Reliable Primary Persistence When Using Search Tools
    12. Tip #12: Establish Remote Clusters for Disaster Recovery
    13. Tip #13: Take a Complete View of Performance
    14. Tip #14: Read Our Other Books (Really!)
    15. Tip #15: Just Do It
  7. 5. Prototypical Hadoop Use Cases
    1. Data Warehouse Optimization
    2. Data Hub
    3. Customer 360
    4. Recommendation Engine
    5. Marketing Optimization
    6. Large Object Store
    7. Log Processing
    8. Realtime Analytics
    9. Time Series Database
  8. 6. Customer Stories
    1. Telecoms
    2. What Customers Want
    3. Working with Money
    4. Sensor Data, Predictive Maintenance, and a “Time Machine”
      1. A Time Machine
    5. Manufacturing
      1. Extending Quality Assurance
  9. 7. What’s Next?
  10. A. Additional Resources
    1. Additional Publications
  11. About the Authors
  12. Colophon
  13. Copyright