Solving Operational Business Intelligence with InfoSphere Warehouse Advanced Edition

Chapter 7. Understand and address data latency requirements

• Processing window
  The ingest processing window is the period during which the data is available for processing by the ingest process. For example, if data is presented to the data warehouse at 4 p.m. and the processing must be complete by 5 p.m., then the processing window is one hour.

• Data volume
  Data volume refers to the quantity of data presented for ingest during each processing window. Consider both average and peak volumes, and plan for peak volumes.

7.1.2 Calculate the data ingest rate

Use the values from the SLOs to calculate the rate of ingest required for each of the data sources. Express the target data ingest rate per data node in megabytes per second (MBps).

Calculate the ingest rate as follows:

• Estimate the volume of data in megabytes (MB) to be ingested per ingest cycle.

• State the time in seconds allowed for the ingest process to complete.

• Divide the volume by the time. In a partitioned database, divide the result by the number of data nodes receiving data.
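The three steps above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `ingest_rate_mbps` and the sample figures (72,000 MB per cycle, a one-hour window, four data nodes) are assumptions made for the example and do not come from the text.

```python
def ingest_rate_mbps(volume_mb: float, window_seconds: float, data_nodes: int = 1) -> float:
    """Return the target ingest rate per data node in megabytes per second (MBps).

    volume_mb      -- raw data volume to ingest per cycle, in MB (use the peak volume)
    window_seconds -- time allowed for the ingest process to complete, in seconds
    data_nodes     -- number of data nodes receiving data (1 if not partitioned)
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    if data_nodes < 1:
        raise ValueError("data_nodes must be at least 1")
    # Step 3: divide the volume by the time, then by the number of data nodes.
    return volume_mb / window_seconds / data_nodes


# Hypothetical figures: 72,000 MB peak volume, one-hour window, 4 data nodes.
rate = ingest_rate_mbps(volume_mb=72_000, window_seconds=3600, data_nodes=4)
print(f"{rate:.1f} MBps per data node")  # prints "5.0 MBps per data node"
```

The result covers raw data only, so treat it as a lower bound on the throughput that the infrastructure must sustain.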

The data volume and ingest rate each refer to the raw data to be ingested. Data transformations and the implementation of indexes, materialized query tables (MQTs), and multi-dimensional clustering (MDC) tables can significantly increase the actual data volume and the number of transactions needed in the database. After these additional transactions are identified, their execution time must be accommodated within the processing window.

Through this process, you determine the rate at which you must ingest data to meet your service level objectives. Your infrastructure must have the capacity to support the ingest rate and also support your service level objectives for the query and maintenance workloads. This must be the focus of your initial infrastructure and data throughput tests.

7.1.3 Analyze your ETL scenarios

This section provides a checklist to help you identify the key characteristics of your ETL application in a systematic fashion. The key distinctions and options are presented for each item. Choose the option that best matches your situation. A checklist item can have multiple answers because there might be different answers for different data sources and tables in your project.

The checklist has the following sections:

• Determining your data ingest pattern

• Transformations involving the target database

• Data volume and latency

• Populating summary (or aggregate) tables

Determine your ETL pattern

In an operational data warehouse environment, the luxury of having an offline window at the end of each day to process data in large batches is not always available. It is expected that data is presented for processing at frequent intervals during the day and that data must be ingested online without affecting the availability of data to the business. The different patterns can be described as follows:

• Continuous feed
  Data arrives continually in the form of individual records from a data source or data feed using messaging middleware (or by OS pipe, or through SQL operations). The ETL processes run continuously and ingest each insert and update as it arrives. Thus, new data is constantly becoming available to business users rather than at fixed intervals.

• Concurrent batch ("intra-day batch")
  Several times a day, data is extracted from a source system and prepared for ingesting into the target database. The ETL application processes data in batches (files) as they arrive or on a schedule. The target table is updated at scheduled intervals, ranging from twice a day to every 15 minutes.

• Dedicated batch window ("daily batch")
  After the close of the business day (for example, 5 p.m.), data is extracted from a source system and prepared for ingesting into the target database. The ETL application populates the target project table during a dedicated, scheduled batch window (for example, 5 p.m. to midnight).
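As a rough aid for working through this checklist item, the three patterns can be captured in a small Python sketch that labels a data source by how its data arrives. The names `IngestPattern` and `classify_pattern`, and the two inputs used to decide, are hypothetical and chosen only for illustration. A real assessment would weigh more factors than arrival frequency.

```python
from enum import Enum


class IngestPattern(Enum):
    CONTINUOUS_FEED = "continuous feed"
    CONCURRENT_BATCH = "concurrent batch (intra-day batch)"
    DEDICATED_BATCH_WINDOW = "dedicated batch window (daily batch)"


def classify_pattern(arrives_as_records: bool, batches_per_day: int = 0) -> IngestPattern:
    """Label a data source with one of the three ingest patterns.

    arrives_as_records -- True if individual records arrive continually
                          (for example, through messaging middleware)
    batches_per_day    -- number of batch deliveries per day when data
                          arrives as files rather than as records
    """
    if arrives_as_records:
        return IngestPattern.CONTINUOUS_FEED
    if batches_per_day < 1:
        raise ValueError("a batch source needs at least one batch per day")
    if batches_per_day > 1:
        # From twice a day up to every 15 minutes (96 batches per day).
        return IngestPattern.CONCURRENT_BATCH
    return IngestPattern.DEDICATED_BATCH_WINDOW


# Each source or table is classified on its own, so one project can mix patterns.
print(classify_pattern(arrives_as_records=True).value)
print(classify_pattern(arrives_as_records=False, batches_per_day=96).value)
print(classify_pattern(arrives_as_records=False, batches_per_day=1).value)
```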

A given database or star schema (or even a given dimension or fact table) might be populated using more than one pattern.

Although the pattern labels emphasize that each pattern differs in terms of latency, that is not the only or primary difference. Each pattern requires a somewhat different approach in articulating service level objectives and deciding which ingest methods might be suitable.


Solving Operational Business Intelligence with InfoSphere Warehouse Advanced Edition by Whei-Jen Chen, Pat Bates, Timothy Donovan, Garrett Fitzsimons, Jon Lind, Rogerio Silva
