Cleansing Files of Duplicate records via Azure Data Factory / Synapse

Nandan Hegde
1y
4.5k
0
0

Article

Problem Statement

Data Cleaning is an important part of ETL/ELT processes as it ensures that only high-quality data is loaded into the system. This helps to improve the accuracy of the data analytics.

So how to remove the duplicate records from the file as a part of the clean-up process via ADF / Synapse Pipeline.

Prerequisites

Azure Data Factory / Synapse

Solution

In the case of SQL, for the below scenario assuming the Uniqueness of the rows is based on Column C1 values; the rows highlighted indicate the duplicate rows.

solution

To remove the duplicate records in SQL, we can make use of the ROW_NUMBER and PARTITION / OVER BY function as seen below

removeDuplicates

To Cleanse the file of duplicate records via Azure Data Factory / Synapse, we can leverage the similar concept of ROW_NUMBER and PARTITION / OVER BY.

We would leverage Dataflow activity to achieve the use case.

Below is the flow for the same.

flow

Sample File

sampleFile

Create a Source flow and the corresponding dataset mapping to the Azure Blob storage (in this scenario) location wherein the duplicate data file is present.

Source location

SourceLocation

Dataset

Dataset

where Dataset Parameter

DatasetParams

SourceSettings

Data Preview

DataPreview

Select the Window function to derive the ROW_NUMBER functionality.
Select the column via which we need to Partition.
Note. In the case of Composite keys, we can select multiple columns.
In Sort, select the column via which we need to sort in case of duplicate rows.
Add a Column with the name, RowNbr, with the expression as rowNumber().