Data Warehouse 101

What is a data warehouse?

I like to answer this question by defining a warehouse first.  What is a warehouse?

Here is the definition provided by Google: a large building where raw materials or manufactured goods may be stored before their export or distribution for sale

So, where do these raw materials or manufactured goods come from?

Raw materials and manufactured goods come from a variety of different sources.  They are harvested or produced by different companies.  It is the warehouse that provides the item to the next step in the supply chain.

So, the data warehouse  is the same thing.  The data warehouse is supplied by many different systems.  Each source database will define its own primary keys for each of its tables, but those primary keys are specifically designed for each specific source database.  Since a data warehouse is comprised of multiple sources, then it follows that each set of primary keys (for each source database) are designed differently than another source database.  There also could be duplicate primary keys between source databases.

So, what needs to happen?

We need to add a new primary key called a surrogate key so that we can guarantee primary key uniqueness.

We also need to cleanse the data, meaning possibly reformatting it.  For example, if source database 1 has phone numbers in its tables that has dashes in it, and source database 2 has phone numbers with no dashes, then we need to make a decision on how to format the phone number in the data warehouse, because we want it to be in the same format for searching purposes.

Back to the original question, what is a data warehouse?  I tweaked the definition of a physical warehouse noted above to be the following.:

a large set of databases where data from disparate systems may be stored before their use for reporting

Next up, we will dive a little deeper into how a data warehouse is constructed.  Stay tuned!