SQL tips: de-duping a table

Duplicate rows are one of those headaches every data engineer runs into. They creep in through messy source files, flawed joins, or upstream jobs that didn’t quite behave. Once inside, they can throw off reports, inflate metrics, and generally waste compute.

The tricky part isn’t just removing duplicates, but making sure you’re keeping the right record.

In this post, we’ll walk through a clean, repeatable way to remove duplicates using SQL window functions.

Setting up the example

The idea is simple. We’ll start by identifying the duplicates with ROW_NUMBER(). Then, we keep the first record per group based on our business key, and safely delete the rest, leaving a tidy dataset.

Let’s create a sample table with some duplicates to play with:

CREATE TABLE MBDupes
(
EntityId NVARCHAR(128) NOT NULL,
EntitySite NVARCHAR(128) NOT NULL,
EntityType NVARCHAR(32) NOT NULL,
ObjectId INT NOT NULL
);
GO

-- Insert data including duplicates
INSERT INTO MBDupes(EntityId, EntitySite, EntityType, ObjectId) VALUES
(N'10021', N'Investec', N'JOB', 696969),
(N'10411', N'Investec', N'JOB', 254815),
(N'10026', N'Lovells', N'PROPERTY', 676869),
(N'10215605', N'MawsonCollins', N'PROPERTY', 102461),
(N'10539', N'Source', N'JOB', 253044),
(N'1085', N'Living Room', N'PROPERTY', 221915),
(N'10-Labourers', N'RavenscroftConstruction', N'JOB', 253328),

-- Our duplicate rows
(N'10215605', N'MawsonCollins', N'PROPERTY', 1269),
(N'10539', N'Source', N'JOB', 6987),
(N'10539', N'Source', N'JOB', 6969);

Here’s the data setup (duplicates made bold):

EntityId	EntitySite	EntityType	ObjectId
10021	Investec	JOB	696969
10411	Investec	JOB	254815
1085	Living Room	PROPERTY	221915
10026	Lovells	PROPERTY	676869
10215605	MawsonCollins	PROPERTY	1269
10215605	MawsonCollins	PROPERTY	102461
10-Labourers	RavenscroftConstruction	JOB	253328
10539	Source	JOB	6969
10539	Source	JOB	6987
10539	Source	JOB	253044

The approach: tag, then delete

We use the ROW_NUMBER() window function to assign a sequential number to each row within the same business key (in this case, EntitySite and EntityId). Duplicates get a number greater than 1.

SELECT
EntitySite,
EntityId,
EntityType,
ObjectId,
ROW_NUMBER() OVER (PARTITION BY EntitySite, EntityId ORDER BY ObjectId) AS row_num
FROM dbo.MBDupes;

Sample output:

EntitySite	EntityId	EntityType	ObjectId	row_num
Investec	10021	JOB	696969	1
Investec	10411	JOB	254815	1
Living Room	1085	PROPERTY	221915	1
Lovells	10026	PROPERTY	676869	1
MawsonCollins	10215605	PROPERTY	1269	1
MawsonCollins	10215605	PROPERTY	102461	2
RavenscroftConstruction	10-Labourers	JOB	253328	1
Source	10539	JOB	6969	1
Source	10539	JOB	6987	2
Source	10539	JOB	253044	3

Deleting the duplicates

Using a common table expression (CTE), we delete all rows with row_num > 1, keeping only the first occurrence of each duplicate set, prioritising by ObjectId descending (keeping the latest):

WITH cte AS (
SELECT
EntitySite,
EntityId,
ROW_NUMBER() OVER (PARTITION BY EntitySite, EntityId ORDER BY ObjectId DESC) AS row_num
FROM dbo.MBDupes
)
DELETE FROM cte WHERE row_num > 1;

Result:

Only unique rows remain, with duplicates removed.

Final table (duplicates removed)

EntityId	EntitySite	EntityType	ObjectId
10021	Investec	JOB	696969
10411	Investec	JOB	254815
10026	Lovells	PROPERTY	676869
10215605	MawsonCollins	PROPERTY	102461
10539	Source	JOB	253044
1085	Living Room	PROPERTY	221915
10-Labourers	RavenscroftConstruction	JOB	253328

Summary

This pattern is simple, reliable, and easy to adapt whether you’re cleaning a staging table or tightening controls in production pipelines. Keep it in your toolkit, it’s one of those small fixes that saves big headaches later.