You are previewing Tera-Tom Genius Series - Hadoop Architecture and SQL.
O'Reilly logo
Tera-Tom Genius Series - Hadoop Architecture and SQL

Book Description

Hadoop is one of the most exciting technologies to ever emerge and is transforming the computer industry. Although Hadoop is designed to process MapReduce queries, it has evolved into accepting SQL, and then converts that SQL to MapReduce. This has opened the door for millions of customers who want to take advantage of their SQL knowledge to query Hadoop systems. Tera-Tom Genius Series - Hadoop Architecture and SQL, written by Tom Coffing and Jason Nolander, details the architecture of Hadoop and the SQL commands available. This book is perfect for anyone who wants to query Hadoop with SQL. It educates readers on how to create tables, how the data is distributed, and how the system processes the data. In addition, it includes hundreds of pages of SQL examples and explanations. The Authors Tera-Tom Coffing, who has written over 75 successful books on Data Warehousing, and Jason Nolander, who has over 20 years of financial industry experience, have written a book that is sure to be your “go to” book on Hadoop.

Table of Contents

  1. Cover
  2. The Tera-Tom Video Series
  3. The Tera-Tom Genius Series
  4. The Best Query Tool Works on all Systems
  5. Trademarks and Copyrights
  6. About Tom Coffing
  7. About Jason Nolander
  8. Contents
  9. Introduction
    1. History of Data Warehousing
    2. The Growth of Computer Data and Use of Databases
    3. Enterprise Data vs. Big Data
    4. Why is Big Data Important?
    5. Why Enhance Your Company’s Data Warehousing Capabilities?
    6. Benefits of Big Data for Your Company
    7. Management Considerations
    8. Customer Considerations
    9. An Industry Example
    10. What does it mean to be data driven?
  10. Chapter 1 – The Concepts of Hadoop
    1. What is Hadoop All About?
    2. There is a Named Node and Up to 4000 Data Nodes
    3. The Named Node's Directory Tree
    4. The Data Nodes
    5. Hive MetaStore
    6. Data Layout and Protection – Step 1
    7. Data Layout and Protection – Step 2
    8. Data Layout and Protection – Step 3
    9. Data Layout and Protection – Step 4
    10. How are Blocks Distributed Amongst the Cluster?
    11. What is Parallel Processing?
    12. The Basics of a Single Computer
    13. Data in Memory is Fast as Lightning
    14. Parallel Processing of Data
    15. Introduction to Hive
    16. Commodity Hardware Servers are Configured for Hadoop
    17. Commodity Hardware Allows Nodes to Scale Forever (Linear)
    18. The Named Node
    19. The Data Node's Responsibilities
    20. All Reducers, Some Reducers or a Single Reducer
    21. A Table has Columns and Rows
    22. Hadoop has Linear Scalability
    23. The Architecture of a Hadoop Data Warehouse
    24. How to Find All Databases in the System?
    25. Setting Your Default Database with the USE Command
    26. List the Tables in a Database with the Show Tables Command
    27. Show Basic Table Information with the Describe Command
    28. Show Detailed Table Information Using Describe Extended
    29. The Show Functions Command Lists all System Functions
    30. Describe Function Command Provides Function Information
    31. Describe Function Extended Command Provides Details
  11. Chapter 2 – The Basics of SQL
    1. Introduction
    2. SELECT * (All Columns) in a Table
    3. SELECT Specific Columns in a Table
    4. Commas in the Front or Back?
    5. Place your Commas in front for better Debugging Capabilities
    6. Sort the Data with the ORDER BY Keyword
    7. ORDER BY Can Use the Column Number
    8. SORT BY Can Be Used Instead of ORDER BY
    9. Changing the ORDER BY to Descending Order
    10. Using the SORT BY in DESC Mode
    11. Major Sort vs. Minor Sorts
    12. SORT BY Using Major and Minor Sorts
    13. SORT BY Defaults to Ascending
    14. Sorts are Alphabetical, NOT Logical
    15. Using A CASE Statement to Sort Logically
    16. How to ALIAS a Column Name
    17. A Missing Comma can by Mistake become an Alias
    18. Comments using Double Dashes Are Single Line Comments
    19. Comments for Multi-Lines
    20. Comments for Multi-Lines as Double Dashes per Line
    21. A Great Technique for Comments to Look for SQL Errors
  12. Chapter 3 – The WHERE Clause
    1. The WHERE Clause Limits Returning Rows
    2. Case Sensitivity is Important
    3. Character Data needs Single Quotes, but Numbers Don’t
    4. You Cannot Use the Alias in the Where Clause
    5. NULL means NO DATA so Equals Null Returns No Rows
    6. Use IS NULL or IS NOT NULL for Null Values
    7. NULL is UNKNOWN DATA so NOT Equal Won’t Work
    8. Use IS NULL or IS NOT NULL when dealing with NULLs
    9. Using Greater Than
    10. Using Greater Than or Equal To (>=)
    11. AND in the WHERE Clause
    12. Troubleshooting AND
    13. OR in the WHERE Clause
    14. Troubleshooting Or
    15. Troubleshooting Character Data
    16. Using Different Columns in an AND Statement
    17. Quiz – How many rows will return?
    18. Answer to Quiz – How many rows will return?
    19. What is the Order of Precedence?
    20. Using Parentheses to change the Order of Precedence
    21. An IN List is Another Technique
    22. Using an IN List in place of OR
    23. The IN List Can Use Character Data
    24. Using a NOT IN List
    25. BETWEEN is Inclusive
    26. NOT BETWEEN is Also Inclusive
    27. LIKE uses Wildcards Percent ‘%’ and Underscore ‘_’
    28. LIKE command Underscore is Wildcard for one Character
    29. Quiz –Who Has the Letter 'n' in the 3rd Position
    30. Answer - Who Has the Letter 'n' in the 3rd Position
    31. LIKE is Case Sensitive
    32. LIKE Command to Find the Last Character of a Last_Name
    33. LIKE Command to Find Multiple Characters
    34. LIKE Command to Find Either Character
    35. Answer – What Data is Left Justified and What is Right?
    36. An Example of Data with Left and Right Justification
    37. A Visual of CHARACTER Data vs. VARCHAR Data
    38. Use the TRIM command to remove spaces on CHAR Data Escape Characters
    39. RTRIM command Removes Trailing spaces on CHAR Data
  13. Chapter 4 – Distinct, Group By, Limit and Sample
    1. The Distinct Command
    2. The Distinct Command
    3. Distinct vs. GROUP BY
    4. Quiz – How many rows come back from the Distinct?
    5. Answer – How many rows come back from the Distinct?
    6. Limit Will Limit the Returning Rows
    7. Limit Works Brilliantly with ORDER BY
    8. TABLESAMPLE
    9. Collect_List and Collect_Set
  14. Chapter 5 – Aggregation
    1. Quiz – You calculate the Answer Set in your own Mind
    2. Answer – You calculate the Answer Set in your own Mind
    3. There are Five Aggregates
    4. Quiz – How many Columns and Rows come back?
    5. Answer – How Many Columns and Rows Come Back?
    6. Quiz – What Happens with This Query?
    7. Answer to Quiz – What Happens With This Query?
    8. GROUP BY when Aggregates and Normal Columns Mix
    9. GROUP BY Delivers one row per Group
    10. Limiting Rows and Improving Performance with WHERE
    11. Keyword HAVING tests Aggregates after they are Totaled
    12. Getting the Average Values Per Column
    13. Average Values Per Column for all Columns in a Table
  15. Chapter 6 – Join Functions
    1. Hadoop Joins
    2. The Shuffle Join (1 of 2)
    3. The Shuffle Join (2 of 2)
    4. Shuffle
    5. Map Strategy
    6. Duplication of the Smaller Table across All-AMPs
    7. Using Buckets for Table Joins
    8. Sort-Merge Bucket Join Needs No Shuffling or Duplication
    9. A Two-Table Join Using Traditional Join Syntax
    10. A Two-Table Join Using ANSI Syntax
    11. Traditional Join Using a Table Alias
    12. ANSI Join Using a Table Alias
    13. ANSI Join Using a Table Alias with Keyword AS
    14. ANSI Join Using the Keyword JOIN Instead of INNER JOIN
    15. You Can Fully Qualify All Columns for Clarity
    16. A Two-Table Join in Action
    17. Quiz – Can You Finish the Join Syntax?
    18. Answer to Quiz – Can You Finish the Join Syntax?
    19. Another Way to Write a Join
    20. A Cartesian Product Join
    21. Quiz – Can You Find the Error?
    22. Answer to Quiz – Can You Find the Error?
    23. Super Quiz – Can You Find the Difficult Error?
    24. Answer to Super Quiz – Can You Find the Difficult Error?
    25. Quiz – Which rows from both tables Won’t Return?
    26. Answer to Quiz – Which rows from both tables Won’t Return?
    27. LEFT OUTER JOIN
    28. LEFT OUTER JOIN Results
    29. A LEFT SEMI JOIN Replaces a Subquery
    30. What is a LEFT SEMI JOIN?
    31. RIGHT OUTER JOIN
    32. RIGHT OUTER JOIN Example and Results
    33. FULL OUTER JOIN
    34. FULL OUTER JOIN Results
    35. Which Tables are the Left and Which are the Right?
    36. Answer - Which Tables are the Left and Which are the Right?
    37. INNER JOIN with an Additional WHERE Clause
    38. INNER JOIN with an Additional AND Clause
    39. OUTER JOIN with Additional WHERE Clause
    40. OUTER JOIN with Additional AND Clause
    41. OUTER JOIN with Additional AND Clause Results
    42. Quiz – Why is this Considered an INNER JOIN?
    43. Evaluation Order for Outer Queries
    44. Cartesian Product Join
    45. The CROSS JOIN with a WHERE Clause
    46. The CROSS JOIN with a WHERE Clause Answer Set
    47. The CROSS JOIN with an ON Clause
    48. The CROSS JOIN with an ON Clause Answer Set
    49. The Self Join
    50. How would you Join these two tables?
    51. An Associative Table is a Bridge that Joins Two Tables
    52. Quiz – Can you Write the 3-Table Join?
    53. Answer – Can you Write the 3-Table Join?
    54. Quiz – Can you Place the ON Clauses at the End?
    55. Answer – Can you Place the ON Clauses at the End?
    56. The 5-Table Join – Logical Insurance Model
    57. Quiz - Write a Five Table Join Using ANSI Syntax
    58. Answer - Write a Five Table Join Using ANSI Syntax
    59. Quiz –Re-Write this putting the ON clauses at the END
    60. Answer –Re-Write this putting the ON clauses at the END
    61. The Nexus Query Chameleon Writes the SQL for Users.
  16. Chapter 7 – Sub-query Functions
    1. An IN List is much like a Subquery
    2. An IN List Never has Duplicates – Just like a Subquery
    3. The Subquery
    4. The Three Steps of How a Basic Subquery Works
    5. These are Equivalent Queries
    6. The Final Answer Set from the Subquery
    7. Quiz- Answer the Difficult Question
    8. Answer to Quiz- Answer the Difficult Question
    9. Should you use a Subquery or a Join?
    10. Quiz- Write the Subquery
    11. Answer to Quiz- Write the Subquery
    12. Quiz- Write the More Difficult Subquery
    13. Answer to Quiz- Write the More Difficult Subquery
    14. Quiz- Write the Subquery with an Aggregate
    15. Answer to Quiz- Write the Subquery with an Aggregate
    16. Quiz- Write the Correlated Subquery
    17. Answer to Quiz- Write the Correlated Subquery
    18. Quiz- Write the NOT Subquery
    19. Answer to Quiz- Write the NOT Subquery
    20. Quiz- Write the Subquery using a WHERE Clause
    21. Answer - Write the Subquery using a WHERE Clause
    22. Quiz – How many rows return on a NOT IN with a NULL?
    23. Answer – How many rows return on a NOT IN with a NULL?
    24. How to handle a NOT IN with potential NULL Values?
    25. Using a Correlated Exists
    26. How a Correlated Exists matches up
    27. The Correlated NOT Exists
  17. Chapter 8 – Date Functions
    1. Current_Date and Current_Timestamp Functions
    2. Extracted the Date from a Time Data Type
    3. Adding and Subtracting Days from a Time Column
    4. Adding Days and Providing a Discount
    5. Getting the Date Extracted From a Time Data Type
    6. Getting the Date Extracted from a Time Using Substring
    7. Getting the Date Extracted from a Time Using Concat
    8. Getting the Date Extracted in Day-Month-Year Format
    9. Getting the Date Extracted in Day-Month-Year Format
    10. The Date in Perfect Day-Month-Year Format with CASE
    11. Getting a Count of All Orders Per Year Per Month
    12. Extracting the Year, Month and Day from a Time Data Type
    13. Extracting the Hour, Minute and Second from Time Data
    14. The ADD_MONTHS Command
    15. The ADD_MONTHS Command to Add Years
    16. Using Cast to Change a Data Type
    17. The Months_Between Command
    18. NEXT_DAY Command Finds a Future Day of the Week
    19. NEXT_DAY Command Finds a Future Day of the Week
    20. Interval Day
    21. Hadoop Calendar Knows Leap Year
    22. Interval Day, Month, Year Plus Cast
  18. Chapter 9 – OLAP Functions
    1. The Row_Number Command
    2. Quiz – How did the Row_Number Reset?
    3. Quiz – How did the Row_Number Reset?
    4. Using a Derived Table and Row_Number
    5. Ordered Analytics OVER
    6. RANK and DENSE RANK
    7. RANK Defaults to Ascending Order
    8. Getting RANK to Sort in DESC Order
    9. RANK OVER and PARTITION BY
    10. PERCENT_RANK OVER
    11. PERCENT_RANK OVER with 14 rows in Calculation
    12. PERCENT_RANK OVER with 21 rows in Calculation
    13. CSUM – Rows Unbounded Preceding Explained
    14. CSUM – Making Sense of the Data
    15. CSUM – Making Even More Sense of the Data
    16. CSUM – The Major and Minor Sort Key(s)
    17. The ANSI CSUM – Getting a Sequential Number
    18. Reset with a PARTITION BY Statement
    19. PARTITION BY only Resets a Single OLAP not ALL of them
    20. CURRENT ROW AND UNBOUNDED FOLLOWING
    21. Different Windowing Options
    22. Moving Sum has a Moving Window
    23. How ANSI Moving SUM Handles the Sort
    24. Quiz – How is that Total Calculated?
    25. Answer to Quiz – How is that Total Calculated?
    26. Moving SUM every 3-rows Vs a Continuous Average
    27. PARTITION BY Resets an ANSI OLAP
    28. The Moving Window is Current Row and Preceding
    29. Moving Average
    30. Moving Average Using a CAST Statement
    31. Moving Average every 3-rows Vs a Continuous Average
    32. PARTITION BY Resets an ANSI OLAP
    33. COUNT OVER for a Sequential Number
    34. COUNT OVER without Rows Unbounded Preceding
    35. Quiz – What caused the COUNT OVER to Reset?
    36. Answer to Quiz – What caused the COUNT OVER to Reset?
    37. The MAX OVER Command
    38. MAX OVER with PARTITION BY Reset
    39. The MIN OVER Command
    40. MIN OVER without Rows Unbounded Preceding
    41. The CSUM for Each Product_Id and the Next Start Date
    42. How Ntile Works
    43. Ntile
    44. Ntile Continued
    45. Ntile Percentile
    46. Another Ntile Example
    47. Using Quantiles (Partitions of Four)
    48. NTILE with a Single Sort Key
    49. NTILE Using a Value of 10
    50. NTILE With a Partition
    51. Using FIRST_VALUE
    52. FIRST_VALUE
    53. FIRST_VALUE After Sorting by the Highest Value
    54. FIRST_VALUE with Partitioning
    55. FIRST_VALUE Combined with Row_Number
    56. FIRST_VALUE And Row_Number with Different Sort
    57. Using LAST_VALUE
    58. LAST_VALUE
    59. Using LAG and LEAD
    60. LEAD
    61. LEAD
    62. LEAD with Partitioning
    63. Finding the First Occurrence
    64. Finding the Last Occurrence
    65. Using LEAD
    66. Using LEAD with an Offset of 2
    67. Using LAG
    68. Using LAG with an Offset of 2
    69. LAG
    70. LAG with Partitioning
    71. CUME_DIST
    72. CUME_DIST with a Partition
    73. SUM(SUM(n))
  19. Chapter 10 – Temporary Tables
    1. There are two types of Temporary Tables
    2. CREATING A Derived Table
    3. CREATING A Derived Table using the WITH Command
    4. The Same Derived Query shown Two Different Ways
    5. Most Derived Tables Are Used to Join To Other Tables
    6. The Three Components of a Derived Table
    7. Visualize This Derived Table
    8. Our Join Example with A Different Column Aliasing Style
    9. Column Aliasing Can Default for Normal Columns
    10. Our Join Example with the WITH Syntax
    11. Quiz - Answer the Questions
    12. Answer to Quiz - Answer the Questions
    13. Clever Tricks on Aliasing Columns in a Derived Table
    14. Two Derived Tables Joining to a Permanent Table
    15. The Key to Multiple WITH Tables
    16. Joining Two WITH Tables to a Permanent Table
    17. Using a Derived Table and Row_Number
    18. LEAD
    19. Finding the First Occurrence
    20. Finding the Last Occurrence
    21. Creating a Temporary Table
    22. Creating, Populating and Querying a Temporary Table
    23. Creating a Temporary Table Using the LIKE Keyword
    24. Creating a Temporary Table Using the LIKE Keyword
    25. Creating a Temporary Table and Populating it Simultaneously
    26. Creating a Temporary Table that Joins Multiple Tables
    27. Many Users Can Use the Same Temporary Table Name
  20. Chapter 11 – Strings
    1. The LENGTH Command Counts Characters
    2. The LENGTH Command – Spaces can Count too
    3. The LENGTH Command Doesn't Count Trailing Spaces
    4. UPPER and LOWER Commands
    5. Using the LOWER Command
    6. A LOWER Command Example
    7. Using the UPPER Command
    8. An UPPER Command Example
    9. Non-Letters Are Unaffected by UPPER and LOWER
    10. SOUNDEX
    11. REGEXP_REPLACE
    12. Concatenation
    13. The TRIM Command trims both Leading and Trailing Spaces
    14. SUBSTRING and SUBSTR are equal, but use different syntax
    15. How SUBSTRING Works with NO ENDING POSITION
    16. Using SUBSTRING to move Backwards
    17. How SUBSTRING Works with an Ending Position of 0
    18. An Example using SUBSTRING and LENGTH Together
    19. Concatenation and SUBSTRING
    20. The Context_Ngrams Function
    21. Sentences Function
    22. Explode Ngrams Sentences to Find the 5 Most Popular Words
    23. Explode Ngrams Sentences to Find the 5 Most Two-Words
    24. Explode Ngrams Sentences for the Top 5 Trigrams
    25. Explode Ngrams Sentences Finding Words Following a Phrase
    26. Explode Ngrams Sentences Finding Words Following a Phrase
  21. Chapter 12 – Interrogating the Data
    1. Quiz – Fill in the Answers for the NULLIF Command
    2. Quiz – Fill in the Answers for the NULLIF Command
    3. The COALESCE Command – Fill In the Answers
    4. COALESCE is Equivalent to This CASE Statement
    5. Some Great CAST (Convert and Store) Examples
    6. Quiz - The Basics of the CASE Statements
    7. Answer to Quiz - The Basics of the CASE Statements
    8. Using an ELSE in the Case Statement
    9. Using an ELSE as a Safety Net
    10. Rules for a Valued Case Statement
    11. Rules for a Searched Case Statement
    12. Valued Case Vs. A Searched Case
    13. The CASE Challenge
    14. The CASE Challenge Answer
    15. Combining Searched Case and Valued Case
    16. A Trick for getting a Horizontal Case
    17. Nested Case
    18. Put a CASE in the ORDER BY
  22. Chapter 13 – View Functions
    1. The Fundamentals of Views
    2. Creating a Simple View to Restrict Sensitive Columns
    3. Describe a View
    4. Describe Extended a View
    5. You SELECT from a View
    6. Creating Views to Protect Sensitive Columns and Rows
    7. Querying Sensitive Columns and Rows in a View
    8. Basic Rules for Views
    9. How to Modify a View?
    10. An Exception to the ORDER BY Rule inside a View
    11. Views Are Sometimes CREATED for Formatting
    12. Creating a View to Join Tables Together
    13. How to Alias Columns in a View CREATE
    14. The Standard Way Most Aliasing is done
    15. What Happens When Both Aliasing Options Are Present?
    16. Resolving Aliasing Problems in a View CREATE
    17. Answer to Resolving Aliasing Problems in a View CREATE
    18. Aggregates on View Aggregates
    19. Altering a Table After a View Has Been Created
  23. Chapter 14 – Creating Databases and Tables
    1. Creating a Database
    2. The Basics of Creating a Table
    3. The ROW FORMAT will be Delimited or Serde
    4. Hive Data Type Fundamentals
    5. An Example of a Table Using All Basic Data Types
    6. Settings so Hive Can Automatically Partition a Table
    7. Creating a Partitioned Table
    8. Creating an External Table
    9. Creating an External Table With a Specific Location
    10. INSERT/SELECT is One Method of Loading Data
    11. Using Buckets for Table Joins
    12. Defining Skewed Tables
    13. Defining a Table Location
    14. Creating a Text File Table
    15. Distribute By for Loading Data
    16. Sort By on Data Loads
    17. Cluster By Distributes and Sorts by the Same Key
    18. Hive does Not Store Data, But HDFS Does in These Formats
    19. Creating Tables as a Text file
    20. Hive SerDes Means Serializer/Deserializer
    21. Creating a Table as a SERDE
    22. Creating Tables as a SERDE with Advanced Options
    23. Creating Tables as an RCFile
    24. Creating Tables as ORC files
    25. Altering a Table to Add a Column
    26. Renaming a Table
    27. Dropping a Table
    28. Creating a Table Using a CTAS
    29. Creating a Table Using a CTAS Join
    30. Creating a Temporary Table Using a CTAS
    31. Creating a Temporary Table Using a LIKE Command
    32. Collecting Statistics – Cost Based Optimization (CBO)
    33. Collecting Statistics on Particular Columns of a Table
    34. Best Practices for Hive Cost Based Optimization
    35. Setting the Following Properties to Enable CBO
    36. Vectorization
    37. Use the DESCRIBE FORMATTED Function to See Statistics
    38. Hadoop Numeric Data Types
    39. Hadoop Date/Time Data Types
    40. Hadoop String Data Types Continued
    41. Hadoop Miscellaneous Data Types Continued
  24. Chapter 15 – Data Manipulation Language (DML)
    1. INSERT Syntax # 1
    2. INSERT example with Syntax 1
    3. INSERT Syntax # 2
    4. INSERT example with Syntax 2
    5. INSERT/SELECT Command
    6. INSERT/SELECT example using All Columns (*)
    7. INSERT/SELECT example with Less Columns
    8. DELETE and TRUNCATE Examples
  25. Chapter 16 – Statistical Aggregate Functions
    1. Numeric Manipulation Functions
    2. Finding the Cube Root
    3. Ceiling Gets the Smallest Integer Not Smaller Than X
    4. Floor Finds the Largest Integer Not Greater Than X
    5. The Round Function and Precision
    6. The Conv Function
    7. The Stats Table
    8. Compute_Stats Function
    9. The STDDEV_POP Function
    10. A STDDEV_POP Example
    11. The STDDEV_SAMP Function
    12. A STDDEV_SAMP Example
    13. The VAR_POP Function
    14. A VAR_POP Example
    15. The VAR_SAMP Function
    16. A VAR_SAMP Example
    17. The VARIANCE Function
    18. A VARIANCE Example
    19. The CORR Function
    20. A CORR Example
    21. Another CORR Example so you can Compare
    22. The COVAR_POP Function
    23. A COVAR_POP Example
    24. Another COVAR_POP Example so you can Compare
    25. The COVAR_SAMP Function
    26. A COVAR_SAMP Example
    27. Another COVAR_SAMP Example so you can Compare
    28. Using GROUP BY
  26. Chapter 17 – Hadoop EXPLAIN
    1. There are Many Options to See an EXPLAIN Plan
    2. Explain Output has Three Parts
    3. EXPLAIN EXTENDED and the Abstract Syntax Tree
    4. EXPLAIN EXTENDED Stage Plans and Stage Dependencies
    5. EXPLAIN DEPENDENCY Keywords in an Explain
    6. EXPLAIN AUTHORIZATION Keywords in an Explain
    7. Using a WHERE Clause Explains a Predicate
    8. EXPLAIN with an ORDER BY Statement