Row_Number() Part One – Pete on Software

Sometimes when you are performing queries, you have data that does not have a primary key and contains duplicate records. If it had had a primary key, you could group on the columns that constitute a duplicate and just keep the MIN or MAX of the primary key, deleting the others. However, lacking that primary key, you have no way of uniquely identifying a row. Additionally, UNIQUEIDENTIFIER types don’t work with MAX and MIN, so you’d have a hard time picking a row in that case as well.

One option is of course to add an identity column to the table and do the operation that I described. However, sometimes you don’t have the appropriate permissions or just aren’t allowed to modify the table to do that. You can use the ROW_NUMBER() function to assign a unique row number to each row by partition. I’ll talk a little more about partition later, but when not explicitly specified in the query, the partition refers to the entire table.

Examine the script below. (All Sql Code contained here will only work on Sql Server 2005 and later)

USE tempdb
GO

CREATE TABLE HasDupes (ProductName VARCHAR(50), UpcCode VARCHAR(10))
GO

INSERT HasDupes (ProductName, UpcCode)
SELECT 'Diapers', '0123456780'
UNION ALL
SELECT 'Diapers', '0123456780'
UNION ALL
SELECT 'Diapers', '0123456780'
UNION ALL
SELECT 'Diapers', '0123456780'
UNION ALL
SELECT 'Formula', '0987654320'
UNION ALL
SELECT 'Formula', '0987654320'
UNION ALL
SELECT 'Formula', '0987654320'
UNION ALL
SELECT 'Diet Coke', '0564534230'
UNION ALL
SELECT 'Chair', '0872372530'
GO

SELECT *
FROM HasDupes
GO

DROP TABLE HasDupes
GO

You get the results shown here:
Initial duplicates.

If we want to add a rownumber to the table, we would alter the part of the query that is

SELECT *
FROM HasDupes
GO

to say this

SELECT ProductName, UpcCode, ROW_NUMBER() OVER (ORDER BY ProductName ASC) AS RowNum
FROM HasDupes
GO

and get these results
Initial row number.

That helps some, but wouldn’t it be better if we could get a row number assigned within groups? That is what the PARTITION BY clause does. Change the SELECT sql again so that it now looks like this (note: you can PARTITION BY and ORDER BY any columns you want, and they can be the same or different from each other)

SELECT ProductName, UpcCode, ROW_NUMBER() OVER (PARTITION BY UpcCode ORDER BY ProductName ASC) AS RowNum
FROM HasDupes
GO

Now, we get this
Row Number with Partition

Okay, now all we have to do is keep the records that get assigned a one, forsaking all others. Change that same section of code again to read like this and then run the script. Note, I am using the CTE syntax here.

WITH ProductDupes
AS
(
	SELECT ProductName, UpcCode, ROW_NUMBER() OVER (PARTITION BY UpcCode ORDER BY ProductName ASC) AS RowNum
	FROM HasDupes
)

DELETE FROM ProductDupes WHERE RowNum > 1
GO

SELECT *
FROM HasDupes
GO

Now we have a table with no duplicates.
No more duplicates

That’s it, that is all there is to it. In my next blog, I will cover how we can use this trick in joins to ensure that we only get a 1 to 1 join, even when the data is 1 to Many.

Pete on Software

Related posts:

Pete

Leave a Reply