Wednesday, October 17, 2007

Top Five Reasons to use Surrogate IDs on your Tables

Every once in awhile, I hear of a challenge to the relational model of databases. For example, I hear suggestions such as Lucene or couchDB to serve as a persistence layer for semi-structural models.
Even though I don't think the relation model is going to be overthrown any time in the near future: I would like to suggest an amendment as to how relational schemas are designed: You _should_ attach an indexed ID column of a common data type to each table you create.

For example, if I have a table [Pubs], I would like to expect an auto-incrementing ID number as a primary key name "PubID" on the table.

Why?

Reason #1: Joins.
For the sake of consistency, if every table has a guaranteed unique [TableName] + ID column, joins between tables becomes much more intuitive. Foreign key relationships become extremely easy to understand. Many-to-many join tables become an afterthought.

Reason#2: Indexing on a non-business logic relevant field.
Since the "ID" column is surrogate, it has no coneptual relationship to the rest of the data on the row. Since Business logic can change, it makes for a weaker primary key. Note that ANY refactoring of the business related columns can occur while keeping the model intact if you have an "ID" column. Whats going to happen if, for some reason down the road, you have to dramatically change the data in one of your columns, and you always assumed this to be the primary key? You'll be in some trouble there.

Reason #3: The ORM advantage.
ORMs exist to bridge the gap between the rich object model and the rigid relational persistence layer. A common ORM phenomenon is an object for every table. Of course, it would be nice to have a common ancestor amongst all of our persistable objects (active record plays into this concept heavily), and wouldnt it be nice if every persistable object had an ID, such that the ID could be abstracted? If the ID is is pushed up the inheritance tree into an abstract class, then we have the ability to perform basic ID-based operations on abstract persistent objects. For example, consider the active-record flavored persistent object code below:


using NHibernate;
namespace Basic.Entities
{
public abstract class PersistentObject
{
#region members
private int id = 0;
#endregion

#region properties
public int Id
{
get { return id; }
}
public bool IsPersisted
{
get { return id != 0; }
}
#endregion

#region methods
public void Save()
{
ISession session = NHibernateHelper.GetCurrentSession();
session.SaveOrUpdateCopy(this);
session.Flush();
}
public virtual void Delete()
{
ISession session = NHibernateHelper.GetCurrentSession();
session.Delete(this);
session.Flush();
}
public static T Get(int id)
{
ISession session = NHibernateHelper.GetCurrentSession();
return session.Get(id);
}
#endregion
}
}
Of course, in my persistence code, I can leverage the



ID generator feature of NHibernate if I arrange my code in this way, which is very nice.


Reason #4: You automatically know if the data has been persisted.

This is an afterthought to the auto-incrementing ID plan, if the ID of a row is null, or zero (depending on your choice of a surrogate data type), then you automatically know if the row has been persisted to the database. If you know if the row has been saved, you know to perform changes with an insert or an update (fortunately, sql 2008's merge function also mitigates this problem).

Reason #5: There is no need to allocate an ID.
If the ID column is automatically generated by the server, there is no need to search, allocate, and assign any kind of ID to the persistent objects, let the database do the work. Less code is better code.


When should I NOT use surrogate ID columns?
Some tables exist solely for the purpose of joining other tables together. A dedicated n-way many-to-many relationship table has no meaning in itself without the joined data oof the related tables. I never use surrogate IDs in this case. Tables that serve as links for enumerations in the code should not have auto-incrementing ID fields; the IDs of these rows should match the enumerated values within the code.

Common arguments I hear against surrogate ID columns

The IDs are just extra weight. They are getting in the way.
Perhaps you are not leveraging any of the top 5 reasons I explained above. For most purposes, an int adds 32 bits to every field, and can store over 2 billion unique IDs, if this isnt enough, consider the BigInt, which costs an extra 32 bits and can store 9,223,372,036,854,775,807 unique IDs. Are you telling me at an extra 32 bits is going to ruin the performance of your application?

If I use surrogate IDs as my primary key, then the data will be randomly scattered out based on its time of creation.
This is nonsense; clustered indexes dictate the physical arrangement of your data on the disk, and primary keys do NOT have to be the clustered index. A better candidate for a clustered index is the foreign key relationship to a table's parent.


Common Anti-Patterns:


IDs are a persistence-related concern and have nothing to do with business logic.
The IDs are simply for persistence-related convenience. NO business logic should depend on them, no assumptions should be made about them. If you see any arithmetic involving IDs in your business logic, hang it up and fins a new job.

0 Comments:

Post a Comment

<< Home