5. Using Hadoop, Hive, and Shark to Ask Questions about Large Datasets

The concept of data warehousing has long been in the domain of large enterprises. A huge industry has developed to attack the problem of quickly asking questions about data from across an entire organization. Data warehouse practices encompass the design of complicated ETL pipelines and the art of processing data from transactional databases using OLAP cubes and star schemas. This mature field is being challenged by new approaches to dealing with data warehousing issues: approaches that, in some cases, can be more scalable and performant as well as cheaper.

The Hadoop project provides an open-source platform for distributing data processing tasks across clusters of low-cost ...

Get Data Just Right: Introduction to Large-Scale Data & Analytics now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.