Using SQL to find missing numbers
Just recently I was given this really cool challenge where I had a set of numbers within a range and I needed to use SQL to find the missing numbers within the range such that the set was complete.
Because of its simple and declarative yet powerful syntax, Sql is my favorite language to work with. At a glance, I thought this problem would be easy:
Now what fills in the blanks? We cant select out of nothing, can we? For this to work, I would need to select from an already complete set to find the missing values in the original set. OK then, how do I declare the complete set?
Nope. I cannot use the identity function in the manner I would prefer. I get this error:
Well that sucks. What do I do now? Make temp tables to "select into"?
No fun. I wont bother exploring the use of cursors, iteration, or any TSQL approach yet. There's got to be a better way right? Consider this, you could select n+1 from the existing set such that n is in the set but n+1 is not.
Its a mathematically inductive-flavored way of finding the gaps in the set. Cool, but there's still a few problems: It only returns the first number (n+1) in a sequenced gap. It fails to find any number less than the min in the existing set (not n-1). Since this finds the lowest missing number in a sequence gap, the following will find the highest number in a sequence gap:
We could infuse these two methods together to find all of the gaps in a number sequence. I would look something like the following:
This would return a result set like:
This is better, but it still works only within the min and max of the number sequence and not the full range. It couldnt find missing numbers such as {1} or {9,10}. At this point I was in a rut when I found this solution online:
http://vadivel.blogspot.com/2007/04/find-missing-numbers-gaps-within-table.html
This is very elaborate, but once I started to test it, I found a fatal flaw, its recursive nature. For example, if I select a very large range of numbers I get the error:
Unfortunately, thats just not acceptable, and I had to continue searching. The solution I finally ended up using was a bit of a cop-out: I leveraged a CLR routine to do my dirty work.
To call this method, we can now simply write:
and we are done.
I'm not 100% happy with this solution, because I failed to find a simple way to write my original query using ANSI sql. This solution requires sql 2005 or later. I dont have any speed benchmarks to compare it with, but increasing the range size in orders of magnitude didn't seem to slow down the response by a visible amount, so at least it scales well. The interesting part is: the sql language forces you to work with solid sets, not ranges, and this problem exposed a cool weakness in the sql language.
select * from ... where not exists ... in [Existing Set] and [Its within the range]
Now what fills in the blanks? We cant select out of nothing, can we? For this to work, I would need to select from an already complete set to find the missing values in the original set. OK then, how do I declare the complete set?
select * from IDENTITY(smallint, 1, 1) as foo
where not exists x in [Existing Set] and [Its within the range]
Nope. I cannot use the identity function in the manner I would prefer. I get this error:
The IDENTITY function can only be used when the SELECT statement has an INTO clause.
Well that sucks. What do I do now? Make temp tables to "select into"?
declare @b table( g int);
select top 100 identity(int,1,1) into @b
No fun. I wont bother exploring the use of cursors, iteration, or any TSQL approach yet. There's got to be a better way right? Consider this, you could select n+1 from the existing set such that n is in the set but n+1 is not.
Select s.n + 1 from [Set] s where not exists
(
Select 1 from [Set] s2 where s2.n = s.n + 1
)
and s.n < @upperBound
Its a mathematically inductive-flavored way of finding the gaps in the set. Cool, but there's still a few problems: It only returns the first number (n+1) in a sequenced gap. It fails to find any number less than the min in the existing set (not n-1). Since this finds the lowest missing number in a sequence gap, the following will find the highest number in a sequence gap:
Select s.n - 1 from [Set] s where not exists
(
Select 1 from [Set] s2 where s2.n = s.n - 1
)
and s.n > @lowerBound
We could infuse these two methods together to find all of the gaps in a number sequence. I would look something like the following:
select low.n + 1 as start, min(high.n) - 1 as stop
from [Set] as low
left outer join [Set] as r on low.n = r.n - 1
left outer join [Set] as high on low.n
where r.n is null and high.n is not null
group by low.n, r.n
having low.n >@lowerBound and r.n < @higherBound
This would return a result set like:
start | stop
-----------
3 | 3
5 | 5
8 | 9
This is better, but it still works only within the min and max of the number sequence and not the full range. It couldnt find missing numbers such as {1} or {9,10}. At this point I was in a rut when I found this solution online:
http://vadivel.blogspot.com/2007/04/find-missing-numbers-gaps-within-table.html
With tempData (result) as
(
Select distinct FG.n + 1 from [Set] FG where not exists
(
Select 1 from [Set] FGP where FGP.n = FG.n + 1
) and FG.n < @upperBound
Union All
Select TD.result + 1 from tempData TD where not exists
(
Select 1 from [Set] FGP where FGP.n = TD.result + 1
) and TD.result < @upperBound
)
Select result as 'Missing Numbers' from tempData order by result;
This is very elaborate, but once I started to test it, I found a fatal flaw, its recursive nature. For example, if I select a very large range of numbers I get the error:
The statement terminated. The maximum recursion 100 has been exhausted before statement completion.
Unfortunately, thats just not acceptable, and I had to continue searching. The solution I finally ended up using was a bit of a cop-out: I leveraged a CLR routine to do my dirty work.
class MyInt : IEnumerator
{
public int counter = -1;
public int max_count;
public MyInt(int start, int end)
{
max_count = end;
counter = start - 1;
}
#region IEnumerator Members
object IEnumerator.Current
{
get { return this; }
}
bool IEnumerator.MoveNext()
{
counter++;
return (counter <= max_count);
}
void IEnumerator.Reset()
{
counter = -1;
}
#endregion
}
...
[SqlFunction(
IsDeterministic = true,
DataAccess = DataAccessKind.None,
IsPrecise = true,
FillRowMethodName = "Test_Ints_FillRow")]
public static IEnumerator Test_Ints(int start, int end)
{
return new MyInt(start, end);
}
public static void Test_Ints_FillRow(Object obj, out int myInt)
{
MyInt mi = obj as MyInt;
myInt = mi.counter;
}
...
CREATE FUNCTION TestInts( @start int, @end int ) RETURNS TABLE( MyInt int ) EXTERNAL NAME Assembly.CLRMethods.Test_Ints
....
To call this method, we can now simply write:
select x.myInt from testints(@lowerBound, @upperBound) as x where x.myInt not in [Set]
and we are done.
I'm not 100% happy with this solution, because I failed to find a simple way to write my original query using ANSI sql. This solution requires sql 2005 or later. I dont have any speed benchmarks to compare it with, but increasing the range size in orders of magnitude didn't seem to slow down the response by a visible amount, so at least it scales well. The interesting part is: the sql language forces you to work with solid sets, not ranges, and this problem exposed a cool weakness in the sql language.

