
A guest post by Ben Keeping, a freelance developer based in the UK, specializing in applications written in .NET and Ruby, using lean and agile techniques. He can be contacted @benkeeping.

The world runs on data, and the business demand for intelligence about how that data is being used can only increase as more and more data is made public. Demand for real-time, or near-real-time, reporting, statistics, and aggregates is now the norm, as opposed to reports being run once per week. This blog post will explore some of the approaches for delivering business intelligence.

The Problem

You have a highly normalized database with millions of rows, and you are unable to run real-time queries in order to generate statistics or aggregations because the query would be too slow, or use too many of the database server’s resources.

Imagine an online ordering site which is backed by this kind of database schema:
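The schema diagram isn't reproduced in this copy of the post, but a plausible normalized layout, with every attribute factored out into its own lookup table, might look like the following sketch (all table and column names here are my own assumptions, not from the original):

```python
import sqlite3

# Hypothetical normalized schema for the ordering site (illustrative only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_types    (id INTEGER PRIMARY KEY, name TEXT);  -- Individual, Commercial, Partner
CREATE TABLE users         (id INTEGER PRIMARY KEY, name TEXT,
                            user_type_id INTEGER REFERENCES user_types(id));
CREATE TABLE account_types (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE accounts      (id INTEGER PRIMARY KEY,
                            user_id INTEGER REFERENCES users(id),
                            account_type_id INTEGER REFERENCES account_types(id));
CREATE TABLE countries     (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE product_types (id INTEGER PRIMARY KEY, name TEXT);  -- DVDs, Electrical equipment
CREATE TABLE products      (id INTEGER PRIMARY KEY, name TEXT,
                            product_type_id INTEGER REFERENCES product_types(id));
CREATE TABLE orders        (id INTEGER PRIMARY KEY,
                            user_id INTEGER REFERENCES users(id),
                            country_id INTEGER REFERENCES countries(id),
                            order_date TEXT);
CREATE TABLE order_items   (id INTEGER PRIMARY KEY,
                            order_id INTEGER REFERENCES orders(id),
                            product_id INTEGER REFERENCES products(id),
                            quantity INTEGER);
""")
```

Answering "product types per user type" against this layout means joining five of those tables, which is exactly the kind of query that becomes painful at millions of rows.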


(Note that I’m not saying this is a great schema; it’s just an example of a highly normalized database for the purposes of this post. Although NoSQL is gaining popularity, the majority of companies probably still have normalized SQL-based databases.)

The business wishes to run a suite of reports, which could include:

  • Product types per country per order date
  • Product types per account types
  • Product types per user type

… and so on.

Classic Batch Job / ETL-like Process

Extract-Transform-Load (ETL) is a classic and popular method for generating data warehouses. The general theory is that you have one or more sources of data that you extract data from. That data is then transformed into your desired format, and the results are loaded into your target database.

Perhaps as part of the transformation process, you de-normalize some data. For example, instead of having a Users table that has a key to User_Types, you create a Users table with the actual user type of Commercial or Partner or Individual.
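A minimal sketch of that transform step, using SQLite and illustrative names (the original post doesn't show this code, so treat it as an assumption-laden example):

```python
import sqlite3

# Source (production-like) schema with a user_types lookup table.
src = sqlite3.connect(":memory:")
src.executescript("""
CREATE TABLE user_types (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, user_type_id INTEGER);
INSERT INTO user_types VALUES (1, 'Individual'), (2, 'Commercial'), (3, 'Partner');
INSERT INTO users VALUES (1, 'Ada', 2), (2, 'Ben', 1);
""")

# Transform: resolve the foreign key so the reporting-side Users table
# carries the user type as plain text instead of a key to User_Types.
src.executescript("""
CREATE TABLE users_denormalized AS
SELECT u.id, u.name, t.name AS user_type
FROM users u
JOIN user_types t ON t.id = u.user_type_id;
""")
```

After the load, report queries read `user_type` directly from the row, with no join against the lookup table.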

The reports are then run against the reporting database, and emailed out, or made available on a website, or perhaps you’re using some tool like Crystal Reports or Business Objects, so people can query the data away from your production database.

This kind of approach works for a lot of people, and can be fairly powerful: because you’re not customizing the data too much and you’re retaining much of the original schema, it’s fairly quick to write new reports.

The downside is that because the data is still fairly normalized, if you start hitting the millions of rows mark, and you have lots of users querying your reporting database, performance could once again become an issue.

Disk is Cheap

Another approach would be to embrace de-normalization, because disk space is cheap. Instead of generally retaining the normalized schema, you create tables designed to meet a specific report.

So during your ETL process, you create a table per report. For example, for the report ‘Product types per user type,’ you create a Product_types_per_user_type table, which can be queried much more simply because you’ve done all of the hard work of joining the tables during the ETL.

So your table would look something like this:

Count     Product type           User type
4345089   Electrical equipment   Individual
1242368   Electrical equipment   Commercial
352966    Electrical equipment   Partner
493611    DVDs                   Individual
36583     DVDs                   Commercial
265892    DVDs                   Partner

With your data highly optimized for your report, the reports can be run quickly, with further subqueries and aggregations possible in real time.
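The ETL step that builds such a report table can be sketched like this (SQLite, with a toy denormalized `order_lines` table standing in for the transformed source data; names and numbers are illustrative, not from the original post):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in for the already-denormalized order data produced by the ETL.
conn.executescript("""
CREATE TABLE order_lines (product_type TEXT, user_type TEXT);
INSERT INTO order_lines VALUES
  ('DVDs', 'Individual'), ('DVDs', 'Individual'), ('DVDs', 'Commercial'),
  ('Electrical equipment', 'Partner');

-- One table per report: the expensive joins and grouping are paid for
-- once, during the ETL, not every time someone views the report.
CREATE TABLE product_types_per_user_type AS
SELECT COUNT(*) AS count, product_type, user_type
FROM order_lines
GROUP BY product_type, user_type;
""")
```

The report itself then reduces to a trivial `SELECT` against `product_types_per_user_type`.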

But what if your ETL process is taxing your production database too much when you run the batch job? Perhaps the job was running overnight, because traffic to your site is slower then, but now it’s still running at 9 a.m. and impacting your production system?

Domain Events & Pub/Sub

Sometimes handling large amounts of business intelligence data requires a different approach.

A domain event captures the state of a domain object at a certain time. Often implemented using an asynchronous publish/subscribe pattern, domain events provide an excellent way of acting upon domain changes non-invasively, i.e., without affecting the control flow of the code issuing the event.

For example in our ordering system, when creating an order, a domain event could be raised:
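The original code sample isn't reproduced in this copy of the post; as a rough Python sketch (the event fields, publisher, and function names are all assumptions), raising the event might look like:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical domain event: an immutable snapshot of the order at creation time.
@dataclass(frozen=True)
class OrderCreated:
    order_id: int
    user_type: str
    product_types: tuple
    occurred_at: datetime

_subscribers = []

def subscribe(handler):
    _subscribers.append(handler)

def publish(event):
    # A real system would hand the event to a message bus asynchronously;
    # a synchronous loop keeps this sketch self-contained.
    for handler in _subscribers:
        handler(event)

def create_order(order_id, user_type, product_types):
    # ... persist the order as normal, then raise the event ...
    publish(OrderCreated(order_id, user_type, tuple(product_types),
                         datetime.now(timezone.utc)))
```

The order-creation code path stays oblivious to reporting: it just announces what happened.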

Once this domain event is fired, the subscribed handlers will pick up the change, extract the required information for a particular report, and then save that change.

So for our report that aggregates product types by user types, you may implement something such as:
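Again, the original handler code isn't reproduced here, but a self-contained Python sketch (with the event shown as a plain dict, and an in-memory dict standing in for the report table) could be:

```python
from collections import defaultdict

# In-memory stand-in for the Product_types_per_user_type report table.
products_by_user_type = defaultdict(int)

# Hypothetical handler: it ignores everything except the two fields its
# report needs, then bumps the aggregated count for that combination.
def handle_order_created(event):
    for product_type in event["product_types"]:
        products_by_user_type[(product_type, event["user_type"])] += 1

# Example: one Commercial order containing two DVD line items.
handle_order_created({"user_type": "Commercial", "product_types": ["DVDs", "DVDs"]})
```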

You’d implement different handlers to update the data for each report. So the above handler may be just extracting product type and user type in order to update the aggregated products by users table.

But you may have another handler that focused on products by countries, or products by account types and so on. Each of these handlers would be pushing data into their own report-based table, meaning that you’re generating near-real-time report data without impacting your production database.

NoSQL-based Solutions

So far we’ve been talking about SQL-based solutions, but there is also a suite of NoSQL technologies that can act as great reporting data stores. Using the domain event and pub/sub patterns means you can really leverage different technology without affecting your core codebase. NoSQL DBs such as Mongo have some great features for aggregations.

In an asynchronous system, one problem with updating counts of things is that you may run into thread-safety issues. If you had several threads attempting to update the count of DVDs ordered by the user type ‘Commercial’, you could potentially end up with an incorrect count.

Mongo has a great operator called $inc, which updates a field by an amount in an atomic operation.

So imagine our ‘products by user types’ document looked like this:
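The document itself isn't reproduced in this copy of the post; it presumably resembled something like the following (field names are assumptions):

```json
{
  "product_type": "DVDs",
  "user_type": "Commercial",
  "count": 36583
}
```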

We could issue a mongo command against our reporting collection like this:
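The command is also missing from this copy; in the mongo shell it would look roughly like this (collection and field names are assumptions), using `$inc` so the increment happens atomically on the server:

```javascript
db.products_by_user_type.update(
  { product_type: "DVDs", user_type: "Commercial" },  // match key for this report row
  { $inc: { count: 1 } },                             // atomic server-side increment
  { upsert: true }                                    // create the document if it doesn't exist yet
)
```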

This means that rather than trying to work out how to increment our aggregated products by user types collection in a thread-safe manner, Mongo can handle it for us.


Handling massive amounts of data, and delivering business intelligence reports in real time or near real time, is a complicated problem. There are lots of ways to implement it, and lots of technologies to help you do it. De-normalizing data may be enough to make a reporting system perform well, or you may need to generate specialized tables. Domain events have a certain complexity to them, but can be an excellent solution that operates without impacting your production system.

For more details about Business Intelligence, see the resources below from Safari Books Online.


Safari Books Online has the content you need

Business Intelligence, 2nd Edition is a full overview of what comprises business intelligence. It is intended to provide an introduction to the concepts to uncomplicate the learning process when implementing a business intelligence program.
Business Intelligence illustrates how organizations can make better, faster decisions about their customers, partners, and operations by turning mountains of data into valuable business information that’s always at the fingertips of decision makers. You’ll learn what’s involved in using business intelligence to bring together information, people, and technology to create successful business strategies, and how to execute those strategies with confidence.
Organizational Applications of Business Intelligence Management offers a deep look into the latest research, tools, implementations, frameworks, architectures, and case studies within the field of Business Intelligence Management.


