NEWS

Wednesday, March 24, 2010

New Techniques for Removing Duplicate Rows in SQL Server 2005

SQL Server 2005 has introduced the row_number() function, which provides an alternative means of identifying duplicates. Rewriting the first example, for tables with no primary key, we can now assign a row number to each row in a duplicate group, with a command such as:



DECLARE  @duplicateTable4 TABLE (data VARCHAR(20))


INSERT INTO @duplicateTable4 VALUES ('not duplicate row')


INSERT INTO @duplicateTable4 VALUES ('duplicate row')


INSERT INTO @duplicateTable4 VALUES ('duplicate row')


INSERT INTO @duplicateTable4 VALUES ('second duplicate row')


INSERT INTO @duplicateTable4 VALUES ('second duplicate row')


 


SELECT  data


      , row_number() OVER ( PARTITION BY data ORDER BY data ) AS nr


FROM    @duplicateTable4


The result will show:



data                 nr


-------------------- --------------------


duplicate row        1


duplicate row        2


not duplicate row    1


second duplicate row 1


second duplicate row 2


In the above example, we specify an ordering and partitioning for the row_number() function. Note that the row_number() is a ranking window function, therefore the ORDER BY and the PARTITION BY in the OVER clause are used only to determine the value for the nr column, and they do not affect the row order of the query. Also, while the above is similar to our previous GROUP BY clause, there is a big difference concerning the returned rows. With GROUP BY you must use an aggregate on the columns that are not listed after the GROUP BY. With the OVER clause there is no such restriction, and you can get access to the individual rows in the groups specified by the PARTITION BY clause. This gives us access to the individual duplicate rows, so we can get not only the number of occurrences, but also a sequence number for the individual duplicates. To filter out the duplicate rows only, we could just put the above query into a CTE or a subquery. The CTE approach is as follows:



DECLARE  @duplicateTable4 TABLE (data VARCHAR(20))


INSERT INTO @duplicateTable4 VALUES ('not duplicate row')


INSERT INTO @duplicateTable4 VALUES ('duplicate row')


INSERT INTO @duplicateTable4 VALUES ('duplicate row')


INSERT INTO @duplicateTable4 VALUES ('second duplicate row')


INSERT INTO @duplicateTable4 VALUES ('second duplicate row')


;


WITH    numbered


          AS ( SELECT   data


                      , row_number() OVER ( PARTITION BY data ORDER BY data ) AS nr


               FROM     @duplicateTable4


             )


    SELECT  data


    FROM    numbered


    WHERE   nr > 1


This is not really any different from what we could do on SQL Server 2000.  However, here comes an absolutely amazing feature in SQL Server 2005 and later: We can refer to, and identify, a duplicate row based on the row_number() column and then, with the above CTE expression, we can use a DELETE statement instead of a SELECT, and directly remove the duplicate entries from our table.


We can demonstrate this technique with the following example:



DECLARE  @duplicateTable4 TABLE (data VARCHAR(20))


INSERT INTO @duplicateTable4 VALUES ('not duplicate row')


INSERT INTO @duplicateTable4 VALUES ('duplicate row')


INSERT INTO @duplicateTable4 VALUES ('duplicate row')


INSERT INTO @duplicateTable4 VALUES ('second duplicate row')


INSERT INTO @duplicateTable4 VALUES ('second duplicate row')


;


WITH    numbered


          AS ( SELECT   data


                      , row_number() OVER ( PARTITION BY data ORDER BY data ) AS nr


               FROM     @duplicateTable4


             )


    DELETE  FROM numbered


    WHERE   nr > 1


This solution will even work with large objects, if you stick to the new large object types introduced in SQL Server 2005: i.e. use VARCHAR(MAX) instead of TEXT, NVARCHAR(MAX) instead of NTEXT, and VARBINARY(MAX) instead of IMAGE. These new types are comparable to the deprecated TEXT, NTEXT and IMAGE, and they have the advantage that you will be able to use them with both DISTINCT and row_number().


 I find this last solution, using CTE, ROW_NUMBER() and DELETE, fascinating. Partly because now we can identify rows in a table when there is no other alternative way of doing it, and partly because it is a solution to a problem that should not, in theory, exist at all since production tables will have a unique constraint or a primary key to prevent duplicates getting into the table in the first place.

No comments:

Post a Comment