Sunday, June 01, 2008

Stored procedures are not always the best tool for selecting

Working as a consultant in the past few years has given me an opportunity to view a diverse set of code bases, and see come of the common trends and habits of developers all over the map.

Literally every single codebase including (some frameworks) I have seen in the past few years bear a common design that bugs me:

READ-ONLY SQL STATEMENTS DONT NECESSARILY BELONG IN STORED PROCEDURES, THATS WHAT FUNCTIONS ARE FOR.

If you dont know what a database function is, just think of it as a parametric view. Now let me back this up some facts, and a code example that I feel justifies this argument once and for all.
  1. There is NO performance difference between a user-defined function and a stored procedure. None at all. Since the introduction of UDFs in SQL Server 2000, functions and stored procedures leverage the same compiled/cached query optimizations. In sql 2005, both stored procedures and functions can be written in the CLR environment. Since a function is guaranteed to be read-only and deterministic, in some cases a function will have a more optimal query plan than the equivalent stored procedure.
  2. A function GUARANTEES there are no side effects to the data. Since UDFs are parametric views, they are by definition deterministic and read-only opterations, so when you call them you never have to worry about any surprise side effects to your data (stored procedures cannot guarantee this).
  3. Stored procedures are not as modular and flexible as functions. In many cases, you may need to continue ETL operations on the result set from a compiled query. Every wished you could select or join from a stored procedure? Consider this scenario: I need a northwind query to retrieve employees with a HiredDate between a given date range;
    Stored procedure:
    CREATE PROCEDURE sp_GetEmployeesHiredWithindateRange
    @startDate datetime,
    @endDate datetime
    AS
    BEGIN
    select * from employees where hireDate between @startDate and @endDate
    END
    Function equivalent:
    CREATE FUNCTION fn_GetEmployeesHiredWithindateRange
    (
    @startDate datetime,
    @endDate datetime
    )
    RETURNS TABLE
    AS
    RETURN
    (
    select * from employees where hireDate between @startDate and @endDate
    )
    I call them like this:
    select * from fn_GetEmployeesHiredWithindateRange('5/25/1990','5/25/2005')
    exec sp_GetEmployeesHiredWithindateRange '5/25/1990','5/25/2005'
    and they will return the same result set. Now, imagine I have a special case where I want to combine this query with whom each employee in the original results reports to.
    With a function, I just modify the query such as:
    select * from fn_GetEmployeesHiredWithindateRange('5/25/1990','5/25/2005') e1
    inner join employees e2 on e2.employeeid = e1.reportsto
    and I am done. If I chose the stored procedure route, there is no flexible way to do this, I would need to write another stored procedure just for this case. Because you can select from the result set of functions, you can do some GREAT code reuse design on the database side that would have been a nightmare in stored procedures.
  4. Stored procedures ARE NOT the only way to protect against sql injection. I see quite a few people defending their choice of stored procedures because the rigid syntax of a sproc call protects the system against query injections. There are easier ways! Let me show you an example of how to write a UDF call that has the same level of protection:
           string sqlQuery = "select * from fn_GetEmployeesHiredWithindateRange(@startDate,@enddate)";
    DataTable dataTable = new DataTable();
    SqlCommand sqlCommand = new SqlCommand(query, sqlConnection);
    sqlCommand.CommandType = CommandType.Text;
    sqlCommand.CommandTimeout = 300;
    sqlCommand.Parameters.AddWithValue("@startDate",new DateTime(1990,5,25));
    sqlCommand.Parameters.AddWithValue("@endDate",new DateTime(2005,5,25));
    SqlDataAdapter sqlDataAdapter = new SqlDataAdapter(sqlCommand);
    sqlDataAdapter.Fill(dataTable);
    return dataTable;
    I am writing ad-hoc sql inside of my C#, and parameterizing it out like a stored procedure. You know what that ends up looking like on the sql profiler?
    exec sp_executesql N'select * from fn_GetEmployeesHiredWithindateRange(@startDate,@endDate)',N'@startDate datetime, @endDate datetime',@startDate='5/25/1990',@endDate='5/25/2005'
    sp_executesql is a call from ADO.NET for running generated queries in a controlled stored procedure fashion. Subsequent calls of a similar structure enjoy the speed boost of a compiled query. This is the magic behind NHibernate's query-generation logic. In terms of SQL injection, this is every bit as safe as a stored procedure call, but with more flexibility.
To sum this all up:
  • Stored procedures are the right choice for compiled database functions with side effects (eg: inserts and updates)
  • Functions are the right choice for compiled parametric queries (eg selects)
  • When you think of functions, think of them as views with arguments.

Some other notes:
  • You cannot "order by" inside of a function, but you can in a stored procedure. You cannot order the result set from a stored procedure, but you can order the result set from a function.
  • Stored procedures have output parameters, functions don't.
  • Functions can return tables or single values, for example:
    Here is the calling syntax for a single-valued output function:
    select fn_getStartDateForEmployee(employeeID) as startDate from employees
    and here is the calling syntax for a table-valued output function:
    select * from fn_GetEmployeesHiredWithindateRange('1/1/2000','1/1/2005')

Labels:

Sunday, December 09, 2007

Linq to SQL vs NHibernate Part 1: What do they have in common?

Choosing a technology such as object persistence is one of the first steps in any major project, and it's a tough call to make. We spent some time at my company trying to figure out if Linq to sql was a better ORM than NHibernate. After some experiments, I came to the engineer's conclusion: It depends. As Linq gains popularity, people will be wondering the same questions, so I'm writing a few unbiased posts to sort out their differences (just as a warning in advance, my expertise is with NHibernate)

Crash Course:
Linq is the query syntax added to the C# language for 3.0. Trees, relational data, objects, xml etc can all be queried using the common Linq syntax reminiscent of SQL. It is easy and flexible, strongly typed, and compiled. Linq to SQL is a natural extension of Linq into an ORM, and it is touted as a "lightweight" data mapper, and heavily hyped by microsoft in previous beta versions. Linq to SQL is built specifically for sql server 2005 and above.

NHibernate is a mature open source project designed specifically to solve ORM problems. It is an extremely flexible and configurable ORM, and its been battle-proven for many enterprise projects. It is database agnostic, and supports a wide array of different database brands. Like many active open-source projects, it is undergoing constant evolution, which makes good documentation hard to find.

What do they have in common?

Mapping syntax
Any object/data mapping system is going to need object definitions and and a corresponding DDL. Both libraries are extremely flexible with their initial configuration, but there is a way to use both of them in a similar fashion.

Just like any good ORM, your objects are simply plain old objects that happen to be persistable as an afterthought. The object definition is any class file. The XML mapping gives the ORM library the links between objects and their tables. It defines the mappings between an objects properties and the columns in the database. It defines the relationship between objects (collections and encapsulation) and the corresponding data relations (many-to-many, many-to-one, one-to-one)

Both NHibernate and Linq accept these mapping files as arguments to their initialization.
Persistent Object Lifecycles
Scoping - Within a common scope, the loading of two instances of the same database row should yield two references to the same object. This scope in NHibernate is called a "Session", in Linq, it is called a "DataContext", and both guarantee reference equality between two instances of the same data under the same scope.

Version Management - Once you have loaded objects under a scope, you should be able to efficiently synchronize the database with the updated state of your objects. Linq's DataContext exposes a SubmitChanges() method for this very purpose, and NHibernate has a Flush() method.

Adjustable Fetching Schemes - Loading objects from the database is a bit of a catch-22, you don't want to load the entire database up in to an object graph, but you don't want a roundtrip to the database every time you need a new object. Both of the libraries support highly configurable lazy and eager fetching schemes. Both of them use lazy loading by default, and both use left-outer-join as a default behavior when eagerly fetching peripheral objects.

Concurrency Concerns - Enterprise data is volatile, and we need an ability to recognize and manage the scenarios when data is changed by external forces. By default behavior, NHibernate and Linq behave in an optimistic concurrency fashion, which basically loads rows without locking them, and throws exceptions if the objects you are saving have changed since you loaded them. Both libraries have multiple means of customizing concurrency behavior.

Custom Database Objects - There are some operations that are simply better off left to the database to perform, such as large scale "en-masse" updates and reporting. These operations are easily implemented as stored procedures or indexed views. Both libraries support the ability to interface with custom database objects. Linq has very strong integration with stored procedures and views, but it only works with sql server 2005 and above. NHibernate is database agnostic, but there code that references database objects is string-based, which makes the connection brittle in comparison.

Code Generators
Personally, I am not a fan of any code generator related to something as important as your DDL, but there seems to be a very big demand for code generation, and there are convenience tools for both ORMs.
Linq Visual Designer - Linq comes with a built-in Visual studio designer for Linq Objects. It looks just like the visual dataset designer, because it was built by the same guy who made the visual dataset designer. IMO, this designer is a great way to get you nowhere in 30 seconds. It is only useful for the most trivial of object graph complexities, it uses partial classes to separate the mapping code from your user code, and it is brittle code at best.
Linq SQLMetal - In terms of codegens, this is Linq's saving grace right here. Given a database connection, sqlmetal can generate clean code for the objects, mappings, or both, with an array of options for fine-tuning the output code.
MyGeneration - A free 3rd-party codegen that has DDL "templates" for both NHibernate and Linq (amnong many others). This is a great way to generate code if you have an existing database schema.
NHibernate SchemaExport - All of the codegens above deal with the conversion of an existing database schema into object and mapping code. SchemaExport goes in the other direction, building a database schema from the mappings. I spoke on SchemaExport in the past, I am a very big fan of this one.

Integration -
Since the two technologies are not necessarily in direct competition, the is currently a push to harness the power of the linq-style querying in to nHibernate. More about Linq for NHibernate can be found here, here and here.

This post covers some of the commonalities, and in the next few days, I'll be comparing some of the more important factors such as performance, flexibility, and usability.


UPDATE:
PerpetuumSoft is a 3rd-party company has filled the dire need for a database synchronization tool with Linq To Sql. Given your object model definitions in Lint to SQL, their Database Restyle application is a royalty-free component that gives you the essential ability to synchronize a schema from a changing object design, so you can design from a truly object-centric point of view.

Labels: , ,

Monday, December 03, 2007

Top three "DUH" features of SQL 2008...

I'm very excited to get my hands dirty with SQL 2008.

Don't get me wrong, there is much to love with this current 2005 version, but some of these features I hear coming down the pipeline have me wondering why we didn't do things like this in the first place.
  1. The MERGE command. Is this data a new row, or is it changes to an existing row? How many times do we write SaveOrUpdate(..) methods in our code with this "IF its new, insert it, otherwise update it" code? This line of thinking became a subconscious action, and we were used to writing this logic over and over. Merge allows you to synchronize two sets of data in an "en masse" approach. Suppose you have a dataset, xml, or some other input called "src" that is the data you want to merge into a table called "dest":
    MERGE dest USING src
    ON (src.key = dest.key)
    WHEN MATCHED THEN
    UPDATE [all dest values to update... regular update command here]
    WHEN NOT MATCHED THEN
    INSERT [a simple insert command from src to dest!]
    Cmon tell me that is not cool! You can wrap this in a transaction and a try..catch block to make it better. Duh! We needed this years ago!
  2. Heirachical Data. Tree-structured data is a fact of life. When you need it , you need it, but SQL server support for this scenario has been very shaky in the past, and everyone needs to get creative to store trees in the relational database. Some ways are compliacted, some ways are non-performant, and all of them expose some ugly compromise. Finally, sql 2008 supports clustered indexing in a hierarchical fashion (breadth-first or depth-first ordering). Consider this table:
    CREATE TABLE ParentChildOrg
    (
    EmployeeID int PRIMARY KEY,
    ManagerId int REFERENCES ParentChildOrg(EmployeeID),
    EmployeeName nvarchar(50)
    ) ;
    GO
    Using this definition, here is some sample code from the docs that runs an optimized insert:
    DECLARE @Manager hierarchyid
    SELECT @Manager = CAST('/3/1/' AS hierarchyid)
    INSERT HumanResources.EmployeeDemo (OrgNode, LoginID, Title, HireDate)
    VALUES
    (@Manager.GetDescendant(NULL, NULL),
    'adventure-works\FirstNewEmployee', 'Application Intern', '3/11/07') ;
    Granted, I'm still trying to learn the good and bad things that come with the syntax of the hierarchyId data type, this is much needed support for sql server. Is there a need for native support for trees? duh!
  3. Native file support for BLOBS. I have learned some serious hard lessons about this one in the past. You need to store blobs, if you store them directly in your database, then you are setting yourself up for a scalability nightmare scenario. If you are storing URLs to the blob data, then you need to manually synchronize your data with sql server. Finally, our 2008 version allows us to store blobs as an INTERNAL file structure to sql server, while exposing them as filestreams and not large unwieldy hunks of heavy data that wreak your I/O. We've all been treating sql server like a file server in the past, because we need it to serve files, not just chunks of data. DUH!
Its like a cell phone, or email... I mean, once you have it, you ponder how you ever got anything done without it, and those are the signs of Good ideas.

Labels:

Wednesday, November 28, 2007

Open source enterprise sample using ActiveRecord

On the 29th, Ill be working an ACM lecture at UNH on the life of a computer consultant.

Something that is important to me: code speaks louder than words. Its cool to hear new ideas, new philosophies, but coded examples of these ideas backs them up with a little reality. The problem is, its hard to give code examples without lots and lots of surrounding context to make the code real.

I put together a new open source project on google code. It is a "zen-garden" style project for enterprise-level data applications using Castle ActiveRecord under the hood. I kick-started this project by giving it a foundation, a domain, some controllers, and a web-based user interface. Now it has some critical mass. The application itself is not important so much as the ability to practice and learn new skills by playing with the code.

This is the link to the code project homepage. Be sure to view the project wiki for the nitty gritty details. This is the link to the subversion trunk. To download and run the application you will need:
Downloading the Project Code
If you don't want to join the open source project, you can simply download the app from the download tab. Make sure you have winrar or some rar extractor equivalent.

Joining and Contributing to the Project
  1. Contact me and I will add you to the project as a user, so you can reside under the subversion source control system. Dont be shy, I promise I will get you everything you need.
  2. You need to get a google account. If you have a gmail email address, you have a google account. If you don't, then go and register. (Google accounts are very worthwhile to any internet-junkie like myself)
  3. Select a folder where you want to download the code.
  4. Using tortise svn, right click on the selected folder, and select ToriseSVN--Checkout


use this url for the project when prompted:
https://activerecordenterprisesample.googlecode.com/svn/trunk/



You will be prompted for a username. Use your google account username.
You will be prompted for a password. Use the google code password I gave you.
With that, you will become an active member of the project.

Getting the code to run
  1. Look at the EntityLayer project. Modify the App.config file, be sure to set the sqldialect so it matches your database. Modify the connectionstring so it can reach your database.
  2. Open the DatabaseSetup.cs class. See the test that creates an empty database? Execute it using testdriven-net by right-clicking on the [Test] keyword. This will create your database.
  3. Once this works, modify all of the *.config files in the project in the same manner.
  4. Set the WebLayer as the startup project, by right clicking on the project in the solution explorer. Hit F5, and you should have the website up and running.

Labels: , , ,

Monday, October 22, 2007

SQL String Filters

I had this recent chore with reporting that had me thinking about cool ways to filter substring patterns from text.

The following is a common snippet of udf code that removes, or filters out non-alphanumeric characters from a sql field:

CREATE FUNCTION dbo.FilterNonAlphaNumericCharacters
(
@input VARCHAR(8000)
)
RETURNS VARCHAR(8000)
AS
BEGIN
DECLARE @i SMALLINT
SET @i = PATINDEX('%[^0-9A-Za-z ]%', @input)
WHILE @i > 0
BEGIN
SET @input = STUFF(@input, @i, 1, '')
SET @i = PATINDEX('%[^0-9A-Za-z]%', @input)
END
RETURN @input
END
GO
It would be executed like:
select dbo.FilterNonAlphaNumericCharacters(PubSectionName) from pubsections
Here's another one: Sometimes a sql value will have more than one space between words. Heres a udf for removing multiple spaces (again, very common across the internet)
CREATE FUNCTION dbo.FilterMultipleSpaces
(
@input varchar(8000)
)
RETURNS varchar(8000)
AS
BEGIN
WHILE CHARINDEX(' ', @input) > 0
BEGIN
SET @input = REPLACE(@input, ' ', ' ')
END
RETURN @input
END
GO
It would be executed like:
select dbo.FilterMultipleSpaces(PubSectionName) from pubsections
What are the common elements from these functions? We have a target pattern that we are trying to filter out of a string. Instead writing specialized filter after filter, I decided to write something a little more universal.

CREATE FUNCTION dbo.MyFilter
(
@input VARCHAR(8000),
@filterPattern VARCHAR(100),
@replacementVal varchar(100)
)
RETURNS VARCHAR(8000)
AS
BEGIN
DECLARE @i SMALLINT
declare @paddingval int

if( len(@replacementVal) = 0)
set @paddingVal = 1
else
set @paddingVal = len(@replacementVal)

SET @i = PATINDEX(@filterPattern, @input)
WHILE @i > 0
BEGIN
SET @input = STUFF(@input, @i, @paddingVal, @replacementVal)
SET @i = PATINDEX(@filterPattern, @input)
END
RETURN @input
END
GO

Now, to filter out non-alphanumeric characters, the following two statements are equivalent

select dbo.MyFilter(PubSectionName,'%[^0-9A-Za-z ]%','') from pubsections
select dbo.FilterNonAlphaNumericCharacters(PubSectionName) from pubsections

and these two statements are also equivalent for removing multiple spaces

select dbo.MyFilter(PubSectionName,'% %',' ') from pubsections
select dbo.FilterMultipleSpaces(PubSectionName) from pubsections

I think its cool, because it will replace any pattern with a specified string such as

select PubSectionName from pubsections
select dbo.MyFilter2(PubSectionName,'% %','-=-') from pubsections

Still the code above is getting more complex and inefficient I have hesitations about extending this functionality to something too generic.

Labels: