SQL tips: de-duping a table
Duplicate rows are one of those headaches every data engineer runs into. They creep in through messy source files, flawed joins, or upstream jobs that didn’t quite behave. Once inside, they can throw off reports, inflate metrics, and generally waste compute.
The tricky part isn’t just removing duplicates, but making sure you’re keeping the right record.
In this post, we’ll walk through a clean, repeatable way to remove duplicates using SQL window functions.
Setting up the example
The idea is simple. We’ll start by identifying the duplicates with ROW_NUMBER(). Then, we keep the first record per group based on our business key, and safely delete the rest, leaving a tidy dataset.
Let’s create a sample table with some duplicates to play with:
CREATE TABLE MBDupes
(
EntityId NVARCHAR(128) NOT NULL,
EntitySite NVARCHAR(128) NOT NULL,
EntityType NVARCHAR(32) NOT NULL,
ObjectId INT NOT NULL
);
GO
-- Insert data including duplicates
INSERT INTO MBDupes(EntityId, EntitySite, EntityType, ObjectId) VALUES
(N'10021', N'Investec', N'JOB', 696969),
(N'10411', N'Investec', N'JOB', 254815),
(N'10026', N'Lovells', N'PROPERTY', 676869),
(N'10215605', N'MawsonCollins', N'PROPERTY', 102461),
(N'10539', N'Source', N'JOB', 253044),
(N'1085', N'Living Room', N'PROPERTY', 221915),
(N'10-Labourers', N'RavenscroftConstruction', N'JOB', 253328),
-- Our duplicate rows
(N'10215605', N'MawsonCollins', N'PROPERTY', 1269),
(N'10539', N'Source', N'JOB', 6987),
(N'10539', N'Source', N'JOB', 6969);
Here’s the data setup (duplicates made bold):
| EntityId | EntitySite | EntityType | ObjectId |
|---|---|---|---|
| 10021 | Investec | JOB | 696969 |
| 10411 | Investec | JOB | 254815 |
| 1085 | Living Room | PROPERTY | 221915 |
| 10026 | Lovells | PROPERTY | 676869 |
| 10215605 | MawsonCollins | PROPERTY | 1269 |
| 10215605 | MawsonCollins | PROPERTY | 102461 |
| 10-Labourers | RavenscroftConstruction | JOB | 253328 |
| 10539 | Source | JOB | 6969 |
| 10539 | Source | JOB | 6987 |
| 10539 | Source | JOB | 253044 |
The approach: tag, then delete
We use the ROW_NUMBER() window function to assign a sequential number to each row within the same business key (in this case, EntitySite and EntityId). Duplicates get a number greater than 1.
SELECT
EntitySite,
EntityId,
EntityType,
ObjectId,
ROW_NUMBER() OVER (PARTITION BY EntitySite, EntityId ORDER BY ObjectId) AS row_num
FROM dbo.MBDupes;
Sample output:
| EntitySite | EntityId | EntityType | ObjectId | row_num |
|---|---|---|---|---|
| Investec | 10021 | JOB | 696969 | 1 |
| Investec | 10411 | JOB | 254815 | 1 |
| Living Room | 1085 | PROPERTY | 221915 | 1 |
| Lovells | 10026 | PROPERTY | 676869 | 1 |
| MawsonCollins | 10215605 | PROPERTY | 1269 | 1 |
| MawsonCollins | 10215605 | PROPERTY | 102461 | 2 |
| RavenscroftConstruction | 10-Labourers | JOB | 253328 | 1 |
| Source | 10539 | JOB | 6969 | 1 |
| Source | 10539 | JOB | 6987 | 2 |
| Source | 10539 | JOB | 253044 | 3 |
Deleting the duplicates
Using a common table expression (CTE), we delete all rows with row_num > 1, keeping only the first occurrence of each duplicate set, prioritising by ObjectId descending (keeping the latest):
WITH cte AS (
SELECT
EntitySite,
EntityId,
ROW_NUMBER() OVER (PARTITION BY EntitySite, EntityId ORDER BY ObjectId DESC) AS row_num
FROM dbo.MBDupes
)
DELETE FROM cte WHERE row_num > 1;
Result:
Only unique rows remain, with duplicates removed.
Final table (duplicates removed)
| EntityId | EntitySite | EntityType | ObjectId |
|---|---|---|---|
| 10021 | Investec | JOB | 696969 |
| 10411 | Investec | JOB | 254815 |
| 10026 | Lovells | PROPERTY | 676869 |
| 10215605 | MawsonCollins | PROPERTY | 102461 |
| 10539 | Source | JOB | 253044 |
| 1085 | Living Room | PROPERTY | 221915 |
| 10-Labourers | RavenscroftConstruction | JOB | 253328 |
Summary
This pattern is simple, reliable, and easy to adapt whether you’re cleaning a staging table or tightening controls in production pipelines. Keep it in your toolkit, it’s one of those small fixes that saves big headaches later.