How to Deduplicate Data from Tables in SQL Server

Remove Duplicate Records from the Database Using Queries

Importing records sometimes fails, and the database administrator or programmer finds hundreds (sometimes thousands) of duplicate records in the system. Duplicates degrade data integrity and should be removed from the table. An administrator can dedupe a table using its identifier column and a temporary table. The temporary table holds the IDs of the records the administrator plans to delete, which makes it possible to verify the list before running the delete query. These steps describe a quick-and-dirty way to rid tables of duplicate records.

Transfer Dedupe Data to a Temporary Table

To test the query before deleting duplicate records, send the result set to a temporary table. Temporary tables can be created in SQL Server on the fly. A table name that begins with the hash sign (“#”) indicates a local temporary table, which is visible only to the session that created it and is dropped automatically when that session (or the stored procedure that created it) ends. The database programmer can create and remove these tables with queries run directly from SQL Server Management Studio or from within a stored procedure. Before deduping, the administrator needs to decide which field identifies a duplicate record. Suppose the table is a list of customers. Administrators can use a phone number, email address or Social Security number as the matching criterion. The right criterion depends on the business and on the data contained in the table. The following query finds phone numbers that occur more than once and copies each one, along with one customer ID per phone number, to a temporary table named “#myTempTable.”

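-- For each phone number that occurs more than once, store the phone number,
-- the number of occurrences, and the ID of one record to delete.
select phone, count(*) as myDupes,
  (select top 1 customerId from customer c2 where c2.phone = c1.phone
   order by customerId desc) as customerId
into #myTempTable from customer c1
group by phone
having count(*) > 1

This query has four parts. The first part of the select list retrieves the phone number. The second counts how many times that phone number occurs in the table. The third is a subquery (a “sub-select”) that picks exactly one record from each group of duplicates and places its customer ID into the temporary table. Ordering by customerId in descending order picks the record with the highest ID, which is the most recently added one if IDs are assigned in increasing order. The subquery needs a column alias because SELECT INTO requires a name for every column. The fourth part, the group by and having clauses, limits the result to phone numbers that occur more than once. The stored customer ID is the key to deleting the duplicate records.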
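
Before deleting anything, it may help to look at the full rows behind the IDs in the temporary table. The following is a minimal sketch, assuming the customer table and a #myTempTable that has the columns phone, myDupes and customerId. The column name plannedAction is purely illustrative, and because this article names no other customer columns, the query simply selects all of them.

```sql
-- Sketch: list every row that shares a duplicated phone number and flag
-- the rows whose IDs are stored in #myTempTable (the ones slated for deletion).
-- Assumes #myTempTable has the columns phone, myDupes and customerId.
select c.*,
       case when c.customerId = t.customerId
            then 'delete' else 'keep' end as plannedAction  -- illustrative column name
from customer c
join #myTempTable t on t.phone = c.phone
order by c.phone, c.customerId
```

Each duplicated phone number should show exactly one row marked for deletion. A phone number with more than one row marked to keep occurs more than twice in the table.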

Deleting the Duplicate Database Records

Now that the temporary table is created, the administrator can run a second query to remove the duplicates from the original table. The key that links the customer table to the temporary table is the customerID copied to the temporary table. Because the first query copied only one record from each group of duplicates, one record per group is deleted and the rest remain. If the same phone number occurs more than twice, drop the temporary table and repeat both queries until the first one returns no rows. An example of a query that deletes the records listed in the temporary table is as follows:

-- "in" is required: the subquery returns one row per duplicated phone number
delete from customer
where customerID in (select customerId from #myTempTable)

This query deletes every record in the customer table whose ID appears in the temporary table. An advantage of this method is that the temporary table still records which customer IDs were removed and for which phone numbers. It does not hold the full records, however, and it disappears when the session ends, so it cannot restore the deleted rows by itself. An administrator who may need to revert should first save the complete records to a permanent backup table, a CSV file or another external backup.
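
One way to keep a restorable copy is to save the complete rows before deleting them and to run both steps inside a transaction. The following is a sketch under assumptions, not a definitive procedure: the backup table name customerDupesBackup is hypothetical, customerId is assumed to be unique in the customer table, and #myTempTable is assumed to have a customerId column.

```sql
-- Sketch: back up the full rows, delete them, and commit only if the
-- number of deleted rows matches the number of IDs in #myTempTable.
-- customerDupesBackup is a hypothetical name; the table must not exist yet,
-- because SELECT INTO creates it.
declare @expected int = (select count(customerId) from #myTempTable);

begin transaction;

-- Copy the complete rows that are about to be deleted.
select c.*
into customerDupesBackup
from customer c
where c.customerId in (select customerId from #myTempTable);

-- Delete the same rows from the original table.
delete from customer
where customerId in (select customerId from #myTempTable);

-- @@rowcount holds the number of rows affected by the delete above.
if @@rowcount = @expected
    commit transaction;
else
    rollback transaction;  -- undoes both the delete and the backup table
```

After a commit, the saved rows can be exported or inserted back into the customer table if needed. If customerId is an identity column, inserting the rows back would likely require enabling IDENTITY_INSERT for the customer table first.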