Data Warehouse 201

So, continuing on from the previous article, Data Warehouse 101, I would like to delve a little bit deeper. First, I would like to backtrack on how I defined a data warehouse in the Data Warehouse 101 article. I defined a data warehouse as follows.:

a large set of databases where data from disparate systems may be stored before their use for reporting

This is one way to construct a data warehouse, but the requirement that data come from disparate systems is not true. Another type of data warehouse is one where there is a transactional system, perhaps an in-house transactional system. The performance of this transactional system is very important, so we don’t want to run report queries against its database. The source system for the data warehouse (Enterprise Data Warehouse or EDW) is only one system, namely, our in-house system.

The transactional system is comprised of one or many “normalized” databases. A “normalized” database has many tables in order to reduce the redundancy of data. Normalized databases are not efficient for reporting, so we need an ETL process (Extract, Transform, and Load) to copy the normalized data to a new set of databases and schemas. These schemas will each end up being architected into a “star” configuration. Hence, they are called “star” schemas. “Star” schemas are the de facto standard architecture for reporting in an EDW. Simply put, a “star” schema is comprised of a fact table with foreign keys, each of which point to a dimension table. A fact table contains quantities of things that are already pre-calculated and ready to report on.

Dimension tables answer the questions of “who, what, where, when, and how?”. A fact table answers the question of “how many or how much?”. For example, a fact table named FactSales could contain the following.: DateId, StoreId, ProductId, NumberOfUnitsSold. The FactSales table would be at the center of the “star”. The NumberOfUnitsSold is the lone fact in this table. This fact table answers “how many?” -> NumberOfUnitsSold. The other attributes are foreign keys. Each of the foreign keys point to a dimension table. There would be 3 dimension tables: DimProduct, DimStore, and DimDate. Each of these answer “what?” -> (Product), “where?” -> (Store), and “when?” -> (Date). The schema looks like this.:

star_schema_example

(source: www.zentut.com)

The Fact_Sales table may be useful, in this case, if there are enough different combinations of date, store, and product that identify numberOfUnitsSold. Imagine 10 years of history on many permutations of these attributes.

But, what about data warehouses for healthcare? A healthcare data warehouse may not look like one for product sales. What would it look like? It may not be that easy to define. We will take a look at some of these challenges in another post.

Thanks for reading!