SQL tips: de-duping a table

Duplicate rows are one of those headaches every data engineer runs into. They creep in through messy source files, flawed joins, or upstream jobs that didn’t quite behave. Once inside, they can throw off reports, inflate metrics, and generally waste compute.

The tricky part isn’t just removing duplicates, but making sure you’re keeping the right record.

In this post, we’ll walk through a clean, repeatable way to remove duplicates using SQL window functions.

Setting up the example

The idea is simple. We’ll start by identifying the duplicates with ROW_NUMBER(). Then, we keep the first record per group based on our business key, and safely delete the rest, leaving a tidy dataset.

Let’s create a sample table with some duplicates to play with:

CREATE TABLE MBDupes
(
EntityId NVARCHAR(128) NOT NULL,
EntitySite NVARCHAR(128) NOT NULL,
EntityType NVARCHAR(32) NOT NULL,
ObjectId INT NOT NULL
);
GO

-- Insert data including duplicates
INSERT INTO MBDupes(EntityId, EntitySite, EntityType, ObjectId) VALUES
(N'10021', N'Investec', N'JOB', 696969),
(N'10411', N'Investec', N'JOB', 254815),
(N'10026', N'Lovells', N'PROPERTY', 676869),
(N'10215605', N'MawsonCollins', N'PROPERTY', 102461),
(N'10539', N'Source', N'JOB', 253044),
(N'1085', N'Living Room', N'PROPERTY', 221915),
(N'10-Labourers', N'RavenscroftConstruction', N'JOB', 253328),

-- Our duplicate rows
(N'10215605', N'MawsonCollins', N'PROPERTY', 1269),
(N'10539', N'Source', N'JOB', 6987),
(N'10539', N'Source', N'JOB', 6969);

Here’s the data setup (duplicates made bold):

EntityId EntitySite EntityType ObjectId
10021 Investec JOB 696969
10411 Investec JOB 254815
1085 Living Room PROPERTY 221915
10026 Lovells PROPERTY 676869
10215605 MawsonCollins PROPERTY 1269
10215605 MawsonCollins PROPERTY 102461
10-Labourers RavenscroftConstruction JOB 253328
10539 Source JOB 6969
10539 Source JOB 6987
10539 Source JOB 253044

The approach: tag, then delete

We use the ROW_NUMBER() window function to assign a sequential number to each row within the same business key (in this case, EntitySite and EntityId). Duplicates get a number greater than 1.

SELECT
EntitySite,
EntityId,
EntityType,
ObjectId,
ROW_NUMBER() OVER (PARTITION BY EntitySite, EntityId ORDER BY ObjectId) AS row_num
FROM dbo.MBDupes;

Sample output:

EntitySite EntityId EntityType ObjectId row_num
Investec 10021 JOB 696969 1
Investec 10411 JOB 254815 1
Living Room 1085 PROPERTY 221915 1
Lovells 10026 PROPERTY 676869 1
MawsonCollins 10215605 PROPERTY 1269 1
MawsonCollins 10215605 PROPERTY 102461 2
RavenscroftConstruction 10-Labourers JOB 253328 1
Source 10539 JOB 6969 1
Source 10539 JOB 6987 2
Source 10539 JOB 253044 3

Deleting the duplicates

Using a common table expression (CTE), we delete all rows with row_num > 1, keeping only the first occurrence of each duplicate set, prioritising by ObjectId descending (keeping the latest):

WITH cte AS (
SELECT
EntitySite,
EntityId,
ROW_NUMBER() OVER (PARTITION BY EntitySite, EntityId ORDER BY ObjectId DESC) AS row_num
FROM dbo.MBDupes
)
DELETE FROM cte WHERE row_num > 1;

Result:

Only unique rows remain, with duplicates removed.

Final table (duplicates removed)

EntityId EntitySite EntityType ObjectId
10021 Investec JOB 696969
10411 Investec JOB 254815
10026 Lovells PROPERTY 676869
10215605 MawsonCollins PROPERTY 102461
10539 Source JOB 253044
1085 Living Room PROPERTY 221915
10-Labourers RavenscroftConstruction JOB 253328

Summary

This pattern is simple, reliable, and easy to adapt whether you’re cleaning a staging table or tightening controls in production pipelines. Keep it in your toolkit, it’s one of those small fixes that saves big headaches later.