Problem Statement
Data Cleaning is an important part of ETL/ELT processes as it ensures that only high-quality data is loaded into the system. This helps to improve the accuracy of the data analytics.
So how to remove the duplicate records from the file as a part of the clean-up process via ADF / Synapse Pipeline.
Prerequisites
- Azure Data Factory / Synapse
Solution
In the case of SQL, for the below scenario assuming the Uniqueness of the rows is based on Column C1 values; the rows highlighted indicate the duplicate rows.
![solution]()
To remove the duplicate records in SQL, we can make use of the ROW_NUMBER and PARTITION / OVER BY function as seen below
![removeDuplicates]()
To Cleanse the file of duplicate records via Azure Data Factory / Synapse, we can leverage the similar concept of ROW_NUMBER and PARTITION / OVER BY.
We would leverage Dataflow activity to achieve the use case.
Below is the flow for the same.
![flow]()
Sample File
![sampleFile]()
Create a Source flow and the corresponding dataset mapping to the Azure Blob storage (in this scenario) location wherein the duplicate data file is present.
Source location
![SourceLocation]()
Dataset
![Dataset]()
where Dataset Parameter
![DatasetParams]()
![SourceSettings]()
Data Preview
![DataPreview]()
- Select the Window function to derive the ROW_NUMBER functionality.
![windowSetting]()
- Select the column via which we need to Partition.
Note. In the case of Composite keys, we can select multiple columns.
![Incoming Stream]()
- In Sort, select the column via which we need to sort in case of duplicate rows.
![IncomingRangeBy]()
![IncomingWindowColumn]()
- Add a Column with the name, RowNbr, with the expression as rowNumber().
Data Preview
![DataPreview]()
Filter the records with RowNbr equal to one as those represent unique rows via the Filter function.
![FilterSettings]()
//Filter On Expression :
RowNbr==1
Data Preview
![DataPreview1]()
Finally, add a Sink section to generate the cleansed file.
![SinkSection]()
where Dataset
![atasetConnection]()
With Sink Settings as below
![SinkSettings]()
And in Mapping, delete the additional column RowNbr mapping.
![Mapping]()
Output
Create a Pipeline, call the Dataflow via Dataflow activity, and trigger the pipeline.
![Output]()
Result
![Result]()