Pete On Software

RSS Feed

Archive for January, 2010

Row_Number() Part One

Sometimes when you are performing queries, you have data that does not have a primary key and contains duplicate records. If it had had a primary key, you could group on the columns that constitute a duplicate and just keep the MIN or MAX of the primary key, deleting the others. However, lacking that primary key, you have no way of uniquely identifying a row. Additionally, UNIQUEIDENTIFIER types don’t work with MAX and MIN, so you’d have a hard time picking a row in that case as well.

One option is of course to add an identity column to the table and do the operation that I described. However, sometimes you don’t have the appropriate permissions or just aren’t allowed to modify the table to do that. You can use the ROW_NUMBER() function to assign a unique row number to each row by partition. I’ll talk a little more about partition later, but when not explicitly specified in the query, the partition refers to the entire table.

Examine the script below. (All Sql Code contained here will only work on Sql Server 2005 and later)

You get the results shown here:
Initial duplicates.

If we want to add a rownumber to the table, we would alter the part of the query that is

to say this

and get these results
Initial row number.

That helps some, but wouldn’t it be better if we could get a row number assigned within groups? That is what the PARTITION BY clause does. Change the SELECT sql again so that it now looks like this (note: you can PARTITION BY and ORDER BY any columns you want, and they can be the same or different from each other)

Now, we get this
Row Number with Partition

Okay, now all we have to do is keep the records that get assigned a one, forsaking all others. Change that same section of code again to read like this and then run the script. Note, I am using the CTE syntax here.

Now we have a table with no duplicates.
No more duplicates

That’s it, that is all there is to it. In my next blog, I will cover how we can use this trick in joins to ensure that we only get a 1 to 1 join, even when the data is 1 to Many.

Common Table Expressions

CTE (Common Table Expression) syntax can be a little tricky at first, but once you have a little primer, you will love using them as much as I do. Examine the following code below that uses a derived table from which to select. (Please realize that this is a contrived example and you could of course just join the tables in the query directly.)

The query between the parentheses creates a table structure (that I aliased as “c” and “t” for my two) that only lasts for the life of the query. I can then reference those tables in the query in any way that I would a regular table. However, that syntax can make it very difficult to read, especially in long queries with several complicated derived tables.

Sql Server 2005 introduced a new syntax to be used instead called Common Table Expressions (CTEs). Using a CTE, the above query would be written like this

Alternatively, you can also explictly declare you column names in the CTE definition and not have to alias the columns in the query itself.

One “gotcha” is that the CTE – beginning with the WITH keyword – must be the first statement that is encountered. All previous statements above it must be terminated with a GO or a semicolon. If you ran the following

You’d get the errors shown below (or something very close)

Some people even just make a habit of actually prefixing the WITH with a semicolon every time to make sure that everything goes off without a hitch. If I do run this script instead there are no issues.

This is just a quick primer on CTEs, but I hope that you can find it useful. I really find that it cleans my code up quite a bit.