Chapter 7. Understand and address data latency requirements 269
• Processing window
The ingest processing window is the period during which the data is available for
processing by the ingest process. For example, if data is presented to the
data warehouse at 4 p.m. and the processing must be complete by 5 p.m.,
then the processing window is one hour.
• Data volume
Data volume refers to the quantity of data presented for ingest during each
processing window. Consider both average and peak volumes and plan for
peak volumes.
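As a minimal sketch of these two inputs, the following Python derives the processing window from the 4 p.m. to 5 p.m. example above and picks the peak volume for planning. The function names and the sample volumes are hypothetical and used only for illustration.

```python
from datetime import datetime
from typing import Sequence


def processing_window_seconds(presented_at: datetime, deadline: datetime) -> float:
    """Return the length of the ingest processing window in seconds."""
    window = (deadline - presented_at).total_seconds()
    if window <= 0:
        raise ValueError("deadline must be later than the presentation time")
    return window


def planning_volume_mb(observed_volumes_mb: Sequence[float]) -> float:
    """Return the volume to plan for: the peak, not the average."""
    if not observed_volumes_mb:
        raise ValueError("at least one observed volume is required")
    return max(observed_volumes_mb)


# Data presented at 4 p.m., processing complete by 5 p.m.
window = processing_window_seconds(
    datetime(2024, 1, 15, 16, 0), datetime(2024, 1, 15, 17, 0)
)
print(window)  # 3600.0

# Hypothetical volumes (MB) observed in recent processing windows.
volumes = [1200.0, 1500.0, 4200.0, 1350.0]
print(sum(volumes) / len(volumes))  # average: 2062.5
print(planning_volume_mb(volumes))  # peak: 4200.0
```

Here the average understates the peak by roughly half, which is why planning for the peak matters.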
7.1.2 Calculate the data ingest rate
Use the values from the SLOs to calculate the rate of ingest required for each of
the data sources. Express the target data ingest rate per data node as
megabytes per second (MBps).
Calculate the ingest rate as follows:
• Estimate the volume of data in megabytes (MB) to be ingested per ingest
cycle.
• State the time in seconds allowed for the ingest process to complete.
• Divide the volume by the time. In a partitioned database, also divide the
result by the number of data nodes receiving data.
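The steps above can be sketched in a few lines of Python. The function name and the sample figures (36,000 MB, a one-hour window, eight data nodes) are assumptions chosen for illustration, not values from a real system.

```python
def ingest_rate_mbps_per_node(
    volume_mb: float, window_seconds: float, data_nodes: int = 1
) -> float:
    """Return the target ingest rate in MBps per data node.

    volume_mb      -- raw data to ingest per ingest cycle, in megabytes
    window_seconds -- time allowed for the ingest process to complete
    data_nodes     -- data nodes receiving data (1 if not partitioned)
    """
    if volume_mb < 0:
        raise ValueError("volume_mb must not be negative")
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    if data_nodes < 1:
        raise ValueError("data_nodes must be at least 1")
    return volume_mb / window_seconds / data_nodes


# Hypothetical peak volume of 36,000 MB, a one-hour window, 8 data nodes.
rate = ingest_rate_mbps_per_node(36_000, 3_600, data_nodes=8)
print(f"{rate:.2f} MBps per data node")  # 1.25 MBps per data node
```

Repeat the calculation for each data source, using the peak volume for that source.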
The data volume and ingest rate each refer to the raw data to be ingested. Data
transformations and the implementation of indexes, materialized query tables
(MQTs), and multidimensional clustering (MDC) tables can significantly increase
the actual data volume and the number of transactions needed in the database.
After these additional transactions are identified, their execution time must be
accommodated.
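One possible way to accommodate this extra execution time is to reserve it from the processing window before calculating the raw ingest rate. The sketch below assumes the additional work (index maintenance, MQT refresh, and so on) runs inside the same window as the raw ingest; the function name and the `overhead_seconds` estimate are hypothetical, and in practice the estimate would come from your own tests.

```python
def raw_ingest_rate_with_overhead(
    volume_mb: float,
    window_seconds: float,
    overhead_seconds: float,
    data_nodes: int = 1,
) -> float:
    """Return the MBps per data node needed for raw ingest after
    reserving time in the window for additional transactions."""
    if data_nodes < 1:
        raise ValueError("data_nodes must be at least 1")
    available_seconds = window_seconds - overhead_seconds
    if available_seconds <= 0:
        raise ValueError("additional work leaves no time for raw ingest")
    return volume_mb / available_seconds / data_nodes


# Hypothetical: 36,000 MB, a one-hour window, 10 minutes reserved
# for index and MQT maintenance, 8 data nodes.
rate = raw_ingest_rate_with_overhead(36_000, 3_600, 600, data_nodes=8)
print(f"{rate:.2f} MBps per data node")  # 1.50 MBps per data node
```

With the same volume and window but no reserved time, the result would be 1.25 MBps per data node, so the reserved 10 minutes raises the required raw rate by a fifth.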
This process tells you the rate at which you must ingest data to meet your
service level objectives. Your infrastructure must have the capacity to support
that ingest rate while also meeting your service level objectives for the query
and maintenance workloads. Confirming this capacity must be the focus of
your initial infrastructure and data throughput tests.
7.1.3 Analyze your ETL scenarios
This section provides a checklist to help you identify the key characteristics of
your ETL application in a systematic fashion. The key distinctions and options
are presented for each item.
Choose the option that best matches your situation. A checklist item can have
multiple answers because there might be different answers for different data
sources and tables in your project.
The checklist has the following sections:
• Determining your data ingest pattern
• Transformations involving the target database
• Data volume and latency
• Populating summary (or aggregate) tables
Determine your ETL pattern
In an operational data warehouse environment, the luxury of having an offline
window at the end of each day to process data in large batches is not always
available. Instead, data is typically presented for processing at frequent
intervals during the day and must be ingested online without affecting the
availability of data to the business. The ETL patterns can be described as
follows:
• Continuous feed
Data arrives continually in the form of individual records from a data source or
data feed using messaging middleware (or by OS pipe, or through SQL
operations). The ETL processes run continuously and ingest each insert and
update as it arrives. Thus, new data is constantly becoming available to
business users rather than at fixed intervals.
• Concurrent batch (“intra-day batch”)
Several times a day, data is extracted from a source system and prepared for
ingesting into the target database. The ETL processes data in batches (files)
as they arrive or on a schedule. The target table is updated at scheduled
intervals, ranging from twice a day to every 15 minutes.
• Dedicated batch window (“daily batch”)
After the close of the business day (for example, 5 p.m.), data is extracted
from a source system and prepared for ingesting into the target database.
The ETL application populates the target project table during a dedicated,
scheduled batch window (for example, 5 p.m. to midnight).
A given database or star schema (or even a given dimension or fact table) might
be populated using more than one pattern.
Although the pattern labels emphasize that the patterns differ in latency,
latency is neither the only nor the primary difference. Each pattern requires a
somewhat different approach to articulating service level objectives and deciding
which ingest methods might be suitable.
