Total Pageviews

Tuesday, 8 January 2013

Surrogate Keys

When building a data warehouse, it is important that primary keys of dimension tables remain stable.  To accomplish this, it is strongly recommended that surrogate keys be created and used for primary keys for all dimension tables instead of using natural keys.  Surrogate keys are keys that are maintained within the data warehouse instead of natural keys which are taken from source data systems.  There are several reasons for the use of surrogate keys:

Allows integrating data from multiple source systems, (i.e. in case two source systems use the same value in natural key field)
Legacy systems that provide historical data might have used a different numbering system than a current online transaction processing system.  A surrogate key uniquely identifies each entity in the dimension table regardless of its source key.  A separate field can be used to contain the key used in the source system.Systems developed independently in company divisions may not use the same keys, or they may use keys that conflict with data in the systems of other divisions.  This situation may not cause problems when each division independently reports summary data, but it cannot be permitted in the data warehouse where data is consolidated.

Protect from changes in the source system, (i.e. renaming of value in natural key field)
This situation is usually less likely than others, but some systems have been known to reuse keys belonging to obsolete data.  However, the key may still be in use in historical data in the data warehouse, and the same key cannot be used to identify different entities.

Allows for slowly changing dimensions, (i.e. type 2)
This can be a common situation.  For example, if a salesperson is transferred from one region to another, the company may prefer to track two things: sales data for the salesperson with the person’s original region for data prior to the transfer date, and sales data for the salesperson in the person’s new region after the transfer date.  To represent this organization of data, the salesperson’s record must exist in two places in the sales force dimension table, which is not possible if the salesperson’s company employee identification number is used as the primary key for the dimension table.  A surrogate key allows the same salesperson to participate in different locations in the dimension hierarchy.
In this case, the salesperson will be represented twice in the dimension table with two different surrogate keys.  These surrogate keys are used to join the salesperson’s records to the sets of facts appropriate to the various locations in the hierarchy occupied by the salesperson.
The employee’s identification number should be carried in a separate column in the table so information about the employee can be reviewed or summarized regardless of the number of times the employee’s record appears in the dimension table.
Dimensions that exhibit this type of change are called slowly changing dimensions.
Another example of a situation that causes this type of change is the creation of a new version of a product, such as a reduced-fat version of a food item.  The item will receive a new SKU or Uniform Product Code (UPC), but may retain most of the same attributes of the original item, which is still manufactured and sold.  The appropriate use of surrogate keys can allow the two versions of the item to be summarized together or separately.

Allows you to create rows in the dimension that don’t exist in the source (i.e. Sales Rep Not Assigned Yet)
So can use a -1 value for rows in the fact table that are considered unassigned.  By adding a row in the dimension that has -1 as its surrogate key, those fact table rows with -1 in the dimension key field will fall into the unassigned bucket.

Improves performance (joins) and database size by using integer type instead of text
Surrogate fields are always integer types, so when joined to dimensions you will get maximum performance.
Two other minor reasons: You avoid reliance on awkward “smart” keys made up of codes from the dimension’s source systems; Space savings in the fact tables when these dimension keys are embedded in the fact tables as foreign keys.
The implementation and management of surrogate keys is the responsibility of the data warehouse.  OLTP systems are rarely affected by these situations, and the purpose of these keys is to accurately track history in the data warehouse.  Surrogate keys are maintained in the data preparation area during the data transformation process.

No comments:

Post a Comment