Total Pageviews

Tuesday, 8 January 2013

Surrogate Keys

When building a data warehouse, it is important that primary keys of dimension tables remain stable.  To accomplish this, it is strongly recommended that surrogate keys be created and used for primary keys for all dimension tables instead of using natural keys.  Surrogate keys are keys that are maintained within the data warehouse instead of natural keys which are taken from source data systems.  There are several reasons for the use of surrogate keys:

Allows integrating data from multiple source systems, (i.e. in case two source systems use the same value in natural key field)
Legacy systems that provide historical data might have used a different numbering system than a current online transaction processing system.  A surrogate key uniquely identifies each entity in the dimension table regardless of its source key.  A separate field can be used to contain the key used in the source system.Systems developed independently in company divisions may not use the same keys, or they may use keys that conflict with data in the systems of other divisions.  This situation may not cause problems when each division independently reports summary data, but it cannot be permitted in the data warehouse where data is consolidated.

Protect from changes in the source system, (i.e. renaming of value in natural key field)
This situation is usually less likely than others, but some systems have been known to reuse keys belonging to obsolete data.  However, the key may still be in use in historical data in the data warehouse, and the same key cannot be used to identify different entities.

Allows for slowly changing dimensions, (i.e. type 2)
This can be a common situation.  For example, if a salesperson is transferred from one region to another, the company may prefer to track two things: sales data for the salesperson with the person’s original region for data prior to the transfer date, and sales data for the salesperson in the person’s new region after the transfer date.  To represent this organization of data, the salesperson’s record must exist in two places in the sales force dimension table, which is not possible if the salesperson’s company employee identification number is used as the primary key for the dimension table.  A surrogate key allows the same salesperson to participate in different locations in the dimension hierarchy.
In this case, the salesperson will be represented twice in the dimension table with two different surrogate keys.  These surrogate keys are used to join the salesperson’s records to the sets of facts appropriate to the various locations in the hierarchy occupied by the salesperson.
The employee’s identification number should be carried in a separate column in the table so information about the employee can be reviewed or summarized regardless of the number of times the employee’s record appears in the dimension table.
Dimensions that exhibit this type of change are called slowly changing dimensions.
Another example of a situation that causes this type of change is the creation of a new version of a product, such as a reduced-fat version of a food item.  The item will receive a new SKU or Uniform Product Code (UPC), but may retain most of the same attributes of the original item, which is still manufactured and sold.  The appropriate use of surrogate keys can allow the two versions of the item to be summarized together or separately.

Allows you to create rows in the dimension that don’t exist in the source (i.e. Sales Rep Not Assigned Yet)
So can use a -1 value for rows in the fact table that are considered unassigned.  By adding a row in the dimension that has -1 as its surrogate key, those fact table rows with -1 in the dimension key field will fall into the unassigned bucket.

Improves performance (joins) and database size by using integer type instead of text
Surrogate fields are always integer types, so when joined to dimensions you will get maximum performance.
Two other minor reasons: You avoid reliance on awkward “smart” keys made up of codes from the dimension’s source systems; Space savings in the fact tables when these dimension keys are embedded in the fact tables as foreign keys.
The implementation and management of surrogate keys is the responsibility of the data warehouse.  OLTP systems are rarely affected by these situations, and the purpose of these keys is to accurately track history in the data warehouse.  Surrogate keys are maintained in the data preparation area during the data transformation process.

Why should I use a Data Warehouse?

Why should I use a Data Warehouse?

There are hundreds of reasons why a data warehouse is useful to your organization, I would suggest the following list be a good starting point: (If you have these needs you may need a true back-end enterprise scalable historical data store : or Enterprise Data Warehouse)….
ü  Real-time issues – your current systems aren’t enabled to integrate disparate sources of data and keep historical records of those integrations, in near real-time.
ü  Scalability issues – you have tons of historical data you need to gather in to an easily accessible place, common formats, common keys, and common access methods. AND you need to ensure that the system is scalable over the next 3 to 5 years.
ü  Avoidance of Siloed Solution Sets – if you have many different or disparate solutions already in existence, yet your corporation is unable to answer common questions requiring consistency across your enterprise.
ü  Enterprise Class System of Record – across historical and integrated data sets, if you have a need to do this, you probably need an enterprise data warehouse
ü  Disparate Source Systems along with Internal and External Data Sets – if you need to ingrate all of these for a single enterprise vision WITH HISTORY, then you need a data warehouse.
ü  Self-Service BI – if you have a need to eventually reach this goal, where users can “visualize” and construct their own reports, then you probably need an enterprise data warehouse, along with it’s highly integrated historical facts from all the different sources in your organization.
ü  Kick start for a Master Data Management initiative. If you want Master Data, then it is important to understand the nature of your history – where the problems exist, how the data does and does not align with business perception, and basically where to “get” the golden copies of records you want to begin populating your MDM solution (remember: MDM is NOT just a tool, it’s people, process, governance, and so on).. Yes, you can build MDM solutions without a Data Warehouse, but how good is your confidence that the data you selected is truly “gold copy” if you don’t have historical evidence to back it up?
ü  If you do ANY sort of data mining, you need a data warehouse. Data Mining is becoming (or already is) the heart-and-soul of better decision making in BI. And of course, the mining engine is only as smart as the domain of information that you provide to it, along with the model that is designed. Statistics say: you can project for 1/2 as much time as you have history for. So: with 2 years, you can project (with some accuracy) only 1 year out. The same goes for Data Mining initiatives, AND the better interconnected the data set is (by Business Keys across the enterprise) the better your Data Mining confidence ratings will be.