Fuzzy Grouping
The Fuzzy Grouping transform groups records by comparing the similarity of the values in selected columns.
Records that differ only by a likely misspelling can be grouped together for further analysis, or the duplicates that are found can be removed. The sensitivity to differences between values is adjustable.
1. Input
The Fuzzy Grouping transform requires one input transform that has at least one column.
Consider the following input as an example:
2. Add the transform
Click the connector link between two transforms to select it.
In the toolbar, choose Insert Other, then Fuzzy Grouping.
To configure the transform, select it and choose Configure in the toolbar.
3. Configure
Steps to configure the Fuzzy Grouping transform:
- Uncheck any columns that should be excluded from the output.
- Drag and drop the columns that should be grouped from under Input to Grouping Columns.
- Enter the Probability Threshold. Valid values range from 0.0001 to 1.0, where a value of 1.0 requires records to match exactly in order to be grouped together.
- Select Ignore String Case if you want a case-insensitive match.
- Select Output Top Level Records Only if you want to omit the records found to be duplicates.
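To make the effect of these options more concrete, the following Python sketch imitates them outside the product. It is only an illustration under stated assumptions: this page does not describe the similarity measure the transform uses, so the standard library's difflib.SequenceMatcher stands in for it, and the names similarity and fuzzy_group, their parameters, and the sample records are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b, ignore_case=True):
    """Return a similarity score between 0.0 and 1.0 for two strings.

    Assumption: SequenceMatcher.ratio() is a stand-in; the transform's
    actual scoring method is not documented on this page.
    """
    if ignore_case:
        a, b = a.lower(), b.lower()
    return SequenceMatcher(None, a, b).ratio()


def fuzzy_group(records, columns, threshold=0.8,
                ignore_case=True, top_level_only=False):
    """Group records whose grouping columns are similar enough.

    Hypothetical helper: the first record of each group plays the role
    of the top-level record, and later records join the first group whose
    top-level record scores at or above the threshold on every column.
    """
    groups = []
    for record in records:
        for group in groups:
            top = group[0]
            # The weakest column decides whether the record joins the group.
            score = min(
                similarity(str(record[c]), str(top[c]), ignore_case)
                for c in columns
            )
            if score >= threshold:
                group.append(record)
                break
        else:
            # No existing group was similar enough, so start a new one.
            groups.append([record])
    if top_level_only:
        # Keep one record per group and drop the ones treated as duplicates.
        return [group[0] for group in groups]
    return groups


# Made-up sample data for illustration only.
records = [
    {"id": 1, "name": "Jon Smith"},
    {"id": 2, "name": "John Smith"},
    {"id": 3, "name": "jon smith"},
    {"id": 4, "name": "Alice Brown"},
]

for group in fuzzy_group(records, ["name"], threshold=0.8):
    print([r["name"] for r in group])

print(fuzzy_group(records, ["name"], threshold=0.8, top_level_only=True))
```

In this sketch, a threshold of 0.8 places the three spellings of the same name in one group, while a threshold of 1.0 groups only values that are identical after the optional case folding. Passing top_level_only=True returns just the first record of each group, which mirrors the intent of Output Top Level Records Only.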
4. Output
The figures below illustrate the output for the example input.
- With Output Top Level Records Only unchecked:
- With Output Top Level Records Only checked: