25  Data Partitioning for Designing and Simulating Efficient Huge Databases

Ladjel Bellatreche, Kamel Boukhalfa, Pascal Richard, and Soumia Benkrid

25.1   INTRODUCTION

Data warehousing is becoming more complex in terms of applications, data size, and queries, including joins and aggregations. Data warehouse projects invariably stress performance and scalability because of the data volumes involved and the complexity of the queries. For instance, eBay's data warehouse holds 2 petabytes of user data and serves millions of queries per day.1 Data distribution for ensuring high parallelism has therefore become a crucial issue for the research community.

Most major commercial database systems support data distribution and parallelism (Teradata, Oracle, IBM, Microsoft SQL Server 2008 R2 Parallel Data Warehouse, Sybase, etc.). Data warehouses store large volumes of data, mainly in relational models such as star or snowflake schemas. A star schema contains a large fact table and several dimension tables, and queries against it typically combine many of these tables. The most common operations are joins, aggregations, and selections [1]. Joins are well known to be expensive, especially when the involved relations are substantially larger than main memory [2], which is usually the case in business intelligence applications. The typical queries defined on the star schema are commonly referred to as star join queries and exhibit the following two characteristics: ...
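To make the star join pattern concrete, the following sketch builds a tiny star schema in an in-memory SQLite database and runs one such query against it. The table and column names (fact_sales, dim_date, dim_product) are illustrative assumptions, not taken from the chapter; the point is only the shape of the query: selections on dimension tables, joins through the central fact table, and an aggregation.

```python
# Minimal star schema sketch: one fact table referencing two dimension
# tables, queried with a star join (joins + selection + aggregation).
# All table/column names here are hypothetical, for illustration only.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales (
    date_id INTEGER REFERENCES dim_date,
    product_id INTEGER REFERENCES dim_product,
    amount REAL
);
INSERT INTO dim_date VALUES (1, 2023), (2, 2024);
INSERT INTO dim_product VALUES (10, 'books'), (20, 'music');
INSERT INTO fact_sales VALUES (1, 10, 5.0), (2, 10, 7.5), (2, 20, 3.0);
""")

# Star join: the fact table is joined with each relevant dimension,
# a selection restricts one dimension, and results are aggregated.
cur.execute("""
SELECT d.year, p.category, SUM(f.amount) AS total
FROM fact_sales f
JOIN dim_date d    ON f.date_id = d.date_id
JOIN dim_product p ON f.product_id = p.product_id
WHERE p.category = 'books'
GROUP BY d.year, p.category
ORDER BY d.year
""")
rows = cur.fetchall()
print(rows)  # [(2023, 'books', 5.0), (2024, 'books', 7.5)]
```

In a real warehouse, the fact table would be orders of magnitude larger than the dimensions, which is why the chapter focuses on how the fact table's data is partitioned and distributed.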
