Tuesday, October 23, 2007

Speaking Real Words

I'm a big fan of wikipedia, mainly how I can stumble from tangents of one article to another much like channel surfing is to TV. I was just reading about Pythia, the Oracle of Delphi, and this quote struck me:
It is often said that the Pythia delivered oracles in a frenzied state induced by vapors rising from the ground, and that she spoke gibberish which priests reshaped into the enigmatic prophecies preserved in Greek literature.
I find this has amusing parallels to my career. Sometimes, I visit recruiters or managers that initially treat me like a babbling oracle that requires translation to understand "brilliance". It's like developers have some unwritten license or expected behavior to be as incoherent as they please, and people will chalk it up to their esoteric thought process.

This notion is just silly. I don't claim to be the best communicator, but what good are your ideas if you cannot translate them to anyone else? Smart people aren't always eccentric, and eccentric people are not always smart. Think about this the next time someone tries to project this odd character upon you, why are they acting this way? There are some people that have communication troubles because of a unique train of thought, but this is a disability, a hindrance to be overcome, not a personality trait to be admired.

In my contracts, I don't have a manager, agent, or salesperson. My clients call me directly, and I always explain details to their full satisfaction, this is not anything new or special, it is simply common courtesy regardless of the profession.

Monday, October 22, 2007

SQL String Filters

I had this recent chore with reporting that had me thinking about cool ways to filter substring patterns from text.

The following is a common snippet of udf code that removes, or filters out non-alphanumeric characters from a sql field:

CREATE FUNCTION dbo.FilterNonAlphaNumericCharacters
(
@input VARCHAR(8000)
)
RETURNS VARCHAR(8000)
AS
BEGIN
DECLARE @i SMALLINT
SET @i = PATINDEX('%[^0-9A-Za-z ]%', @input)
WHILE @i > 0
BEGIN
SET @input = STUFF(@input, @i, 1, '')
SET @i = PATINDEX('%[^0-9A-Za-z]%', @input)
END
RETURN @input
END
GO
It would be executed like:
select dbo.FilterNonAlphaNumericCharacters(PubSectionName) from pubsections
Here's another one: Sometimes a sql value will have more than one space between words. Heres a udf for removing multiple spaces (again, very common across the internet)
CREATE FUNCTION dbo.FilterMultipleSpaces
(
@input varchar(8000)
)
RETURNS varchar(8000)
AS
BEGIN
WHILE CHARINDEX(' ', @input) > 0
BEGIN
SET @input = REPLACE(@input, ' ', ' ')
END
RETURN @input
END
GO
It would be executed like:
select dbo.FilterMultipleSpaces(PubSectionName) from pubsections
What are the common elements from these functions? We have a target pattern that we are trying to filter out of a string. Instead writing specialized filter after filter, I decided to write something a little more universal.

CREATE FUNCTION dbo.MyFilter
(
@input VARCHAR(8000),
@filterPattern VARCHAR(100),
@replacementVal varchar(100)
)
RETURNS VARCHAR(8000)
AS
BEGIN
DECLARE @i SMALLINT
declare @paddingval int

if( len(@replacementVal) = 0)
set @paddingVal = 1
else
set @paddingVal = len(@replacementVal)

SET @i = PATINDEX(@filterPattern, @input)
WHILE @i > 0
BEGIN
SET @input = STUFF(@input, @i, @paddingVal, @replacementVal)
SET @i = PATINDEX(@filterPattern, @input)
END
RETURN @input
END
GO

Now, to filter out non-alphanumeric characters, the following two statements are equivalent

select dbo.MyFilter(PubSectionName,'%[^0-9A-Za-z ]%','') from pubsections
select dbo.FilterNonAlphaNumericCharacters(PubSectionName) from pubsections

and these two statements are also equivalent for removing multiple spaces

select dbo.MyFilter(PubSectionName,'% %',' ') from pubsections
select dbo.FilterMultipleSpaces(PubSectionName) from pubsections

I think its cool, because it will replace any pattern with a specified string such as

select PubSectionName from pubsections
select dbo.MyFilter2(PubSectionName,'% %','-=-') from pubsections

Still the code above is getting more complex and inefficient I have hesitations about extending this functionality to something too generic.

Labels:

Friday, October 19, 2007

Semi-structure ideas: some better definitions

I took a look back at my post yesterday and realized that my subject really needs some context to give it substance.

What scenarios define semi-structural data?
Perhaps different types of semi-structural data calls for different approaches to persistence. I'll try to define the gamut of types I've observed. How does the structure of data change with respect to time?

Dynamic Attributes - This is far and away the most common phenomenon of semi-structured data. The classic scenario is: the client needs you to add/remove attributes on to an existing data structure to satisfy their particular needs. Most of the modern software best-practices for maintainability are founded around this scenario. The key to dynamic attributes is that the entities adhere to a central structural definition, the entities are defined by their central definition, like all objects adhere to classes. When the data structure changes, it is commonly an expansion, not a removal, and to maintain a uniform definition, all legacy data is filled with default values to accommodate the change. If the frequency of structure change is rapid, developers tend to use the relational database in creative ways.

Dynamic Entities - Instead of dynamic structure definitions, we can also have dynamic structures. An example of this is user-defined objects, or dynamically created tables. If the structures are to be dynamic, the relationship between the structures must also be dynamic as well. One thing dynamic entities and dynamic attributes have in common is that every entity is defined by its schema, much like the object/class relationship.



Diverse Entities - This scenario is true semi-structural data, where entities are dynamically defined, and have little or no relation at all to a central schema definition. A common place you will see this is full-text indexing on documents; each entity is a document, they have JSON or URL-based relationships to one another, but each document is unique. The only thing all documents are guaranteed to have in common is that they are documents. What I find interesting about this scenario is the limited adherence to a rigid central definition. What if entities had limited adherence to multiple central definitions at the same (for example, the entity is slightly like the definition of a sales item and slightly like the definition of a physical location, such that the entity can rightly be used in either scenario?)

At the moment, this is an unfinished post, I plan on discussing common and creative ways people currently satisfy these scenarios in the relational database (even though I suspect there is a better way).

Thursday, October 18, 2007

Persistence of Semi-Structural Data

Earl - "Look at the snowflakes, someone once told me, no two snowflakes are the same"
Mooch - "Earl, no two anythings are the same!"
-Patrick McDonnell (Mutts Comic)

It amuses me when we set our own limits around our abilities: how often we place arbitrary or imaginary boundaries on what can and cannot be accomplished. When something that was originally assumed impossible is now proven possible, it suddenly becomes easy to do and simple to understand.

One of the boundaries Ive been witnessing is the constrained form of the relational database. For performance and traditional reasons, the schemas we work with are static and rigid. Furthermore, we use databases and software to track and model real-life phenomena. The problem is, unlike our databases, real life is rarely static and rigid; it is dynamic, it is diverse, it is in flux, never the same. Its a shame when people work in a fashion that is dictated simply by the limits of their software, and not vice versa isnt it? We as computer scientists are forced to be creative to find ways to represent the dynamic phenomenon in the static fashion by using as many commonalities as we can. In a sense, we try to make the static representation as dynamic as possible. That is our Art as I see it anyways.

Why this limit on the databases ability? Why can't I find some sort of a persistence medium that works with dynamic (semi-structural) data? I read this article from a very bright developer ayende rahien who explores the use of lucene as a semi-structural persistence layer. I explored alternative approaches using couchdb, but all of these have the sensation of trying to pound the square peg into the round hole. Dynamic indexing seems to be a requirement for flexible querying, but these systems are designed for document persistence, not an object persistence layer. I would assume the cost of dynamic indexing causes a ding in scalability.

Still, when the structure of our data becomes dynamic, do we fundamentally lose the ability of flexible queries and indexing performance, or is this notion of a rigid schema an imaginary boundary?

Wednesday, October 17, 2007

Top Five Reasons to use Surrogate IDs on your Tables

Every once in awhile, I hear of a challenge to the relational model of databases. For example, I hear suggestions such as Lucene or couchDB to serve as a persistence layer for semi-structural models.
Even though I don't think the relation model is going to be overthrown any time in the near future: I would like to suggest an amendment as to how relational schemas are designed: You _should_ attach an indexed ID column of a common data type to each table you create.

For example, if I have a table [Pubs], I would like to expect an auto-incrementing ID number as a primary key name "PubID" on the table.

Why?

Reason #1: Joins.
For the sake of consistency, if every table has a guaranteed unique [TableName] + ID column, joins between tables becomes much more intuitive. Foreign key relationships become extremely easy to understand. Many-to-many join tables become an afterthought.

Reason#2: Indexing on a non-business logic relevant field.
Since the "ID" column is surrogate, it has no coneptual relationship to the rest of the data on the row. Since Business logic can change, it makes for a weaker primary key. Note that ANY refactoring of the business related columns can occur while keeping the model intact if you have an "ID" column. Whats going to happen if, for some reason down the road, you have to dramatically change the data in one of your columns, and you always assumed this to be the primary key? You'll be in some trouble there.

Reason #3: The ORM advantage.
ORMs exist to bridge the gap between the rich object model and the rigid relational persistence layer. A common ORM phenomenon is an object for every table. Of course, it would be nice to have a common ancestor amongst all of our persistable objects (active record plays into this concept heavily), and wouldnt it be nice if every persistable object had an ID, such that the ID could be abstracted? If the ID is is pushed up the inheritance tree into an abstract class, then we have the ability to perform basic ID-based operations on abstract persistent objects. For example, consider the active-record flavored persistent object code below:


using NHibernate;
namespace Basic.Entities
{
public abstract class PersistentObject
{
#region members
private int id = 0;
#endregion

#region properties
public int Id
{
get { return id; }
}
public bool IsPersisted
{
get { return id != 0; }
}
#endregion

#region methods
public void Save()
{
ISession session = NHibernateHelper.GetCurrentSession();
session.SaveOrUpdateCopy(this);
session.Flush();
}
public virtual void Delete()
{
ISession session = NHibernateHelper.GetCurrentSession();
session.Delete(this);
session.Flush();
}
public static T Get(int id)
{
ISession session = NHibernateHelper.GetCurrentSession();
return session.Get(id);
}
#endregion
}
}
Of course, in my persistence code, I can leverage the



ID generator feature of NHibernate if I arrange my code in this way, which is very nice.


Reason #4: You automatically know if the data has been persisted.

This is an afterthought to the auto-incrementing ID plan, if the ID of a row is null, or zero (depending on your choice of a surrogate data type), then you automatically know if the row has been persisted to the database. If you know if the row has been saved, you know to perform changes with an insert or an update (fortunately, sql 2008's merge function also mitigates this problem).

Reason #5: There is no need to allocate an ID.
If the ID column is automatically generated by the server, there is no need to search, allocate, and assign any kind of ID to the persistent objects, let the database do the work. Less code is better code.


When should I NOT use surrogate ID columns?
Some tables exist solely for the purpose of joining other tables together. A dedicated n-way many-to-many relationship table has no meaning in itself without the joined data oof the related tables. I never use surrogate IDs in this case. Tables that serve as links for enumerations in the code should not have auto-incrementing ID fields; the IDs of these rows should match the enumerated values within the code.

Common arguments I hear against surrogate ID columns

The IDs are just extra weight. They are getting in the way.
Perhaps you are not leveraging any of the top 5 reasons I explained above. For most purposes, an int adds 32 bits to every field, and can store over 2 billion unique IDs, if this isnt enough, consider the BigInt, which costs an extra 32 bits and can store 9,223,372,036,854,775,807 unique IDs. Are you telling me at an extra 32 bits is going to ruin the performance of your application?

If I use surrogate IDs as my primary key, then the data will be randomly scattered out based on its time of creation.
This is nonsense; clustered indexes dictate the physical arrangement of your data on the disk, and primary keys do NOT have to be the clustered index. A better candidate for a clustered index is the foreign key relationship to a table's parent.


Common Anti-Patterns:


IDs are a persistence-related concern and have nothing to do with business logic.
The IDs are simply for persistence-related convenience. NO business logic should depend on them, no assumptions should be made about them. If you see any arithmetic involving IDs in your business logic, hang it up and fins a new job.

Monday, October 15, 2007

The art within a windows crash

Sometimes, I amuse me


Wednesday, October 03, 2007

Just when you think youve seen everything...

Scott Guthrie has announced the downloadable source code for the .NET 3.5 CLR

Thats right, microsoft's flagship development platform is finally transparent. This changes the game so much its not even funny. I wonder if people will start to tailor their code more around the internals of a given CLR version and less around the exposed interfaces... the result? Higher dependency on a specific version of the runtime.

Im disappointed to say that in the few screenshots I got to see of the CLR there were many examples of code I didnt like.

There are no secrets anymore.