Efficient SQL Query to Remove Duplicates with ROW_NUMBER

abhishek
Mar 01
766
0
1

Blog

Using ROW_NUMBER() and PARTITION BY (Preferred Approach)

The ROW_NUMBER() function assigns a unique row number to each record within a partition (group). We can use this to identify and delete duplicates while keeping only the required data.

Query Syntax

WITH CTE AS (
    SELECT 
        *, 
        ROW_NUMBER() OVER (PARTITION BY column1, column2 ORDER BY id) AS row_num
    FROM table_name
    WHERE condition  -- Apply filtering condition here
) 
DELETE FROM CTE 
WHERE row_num > 1;

Example

Consider a Customer table with duplicate entries based on Email.

ID	Name	Email	City
1	John	[email protected]	NY
2	Jane	[email protected]	LA
3	John	[email protected]	NY
4	Sam	[email protected]	TX

Removing Duplicates While Keeping the First Entry.

;WITH CTE AS (
    SELECT ID 
    FROM (
        SELECT ID, 
               ROW_NUMBER() OVER (PARTITION BY NAME, Email, City ORDER BY ID) AS RN  
        FROM Customers 
        WHERE City = 'NY'  -- Only NY state filtering condition
    ) AS sub
    WHERE RN > 1
)
DELETE FROM Customers 
WHERE ID IN (SELECT ID FROM CTE);

Explanation of the Query

Identifies Duplicates
- ROW_NUMBER() OVER (PARTITION BY Name, Email, City ORDER BY ID)
- Assign a row number (RN) for each duplicate group, keeping the first record (RN = 1).
Filters Out Duplicates (RN > 1): Only marks duplicate records where City = 'NY'.
Deletes Duplicate Records: Deletes only IDs from the CTE that have RN > 1
This syntax will be useful when we are joining more tables and deleting duplicates from one specific table.

Please comment on your ideas, and follow for more.