Derived Column transformation | The Derived Column transformation, Derived Column, adds the _Similarity columns to each row and sets the column value to 1. |
Fuzzy Lookup transformation | The Fuzzy Lookup transformation, Fuzzy Lookup against Existing Customers, performs a fuzzy lookup to identify customer records that are fuzzy matches of existing customer records. The transformation adds a _Similarity column that contains a similarity score to each row. A score of 0.0 means no match was found, whereas a score of 1.0 means an exact match was found. A score between 0.0 and 1.0 is a measure of similarity in which a value closer to 1.0 indicates greater similarity. |
Conditional Split transformation | The first Conditional Split transformation, ConditionalSplit on Similarity, directs input rows to one of two outputs depending on the value of the similarity score determined by the fuzzy lookup. Rows with a similarity score >= .70 are written to the ExistingCustomerLeads table. Rows with similarity scores < 70 are probably valid new customer leads and additional cleaning is done on these rows. The second Conditional Split transformation, Conditional Split on Canonical Record for Group, directs input rows to one of two outputs depending on whether the data row is a duplicate. If the values of the keyin and keyout columns are equal, the row is used as the canonical row in the group, and the canonical row is inserted into the NewCustomerLeads table. If the keyin and key_out columns are not equal, the row is treated as a fuzzy duplicate and the row is inserted into the DuplicateCustomerLeads table. |
Union All transformation | The Union All transformation, Union All, merges rows of existing customers—both exact and fuzzy matches—into one dataset. |
Fuzzy Grouping transformation | The Fuzzy Grouping transformation, Fuzzy Grouping, groups customers who are likely duplicates. The transformation adds three columns keyin, keyout and score to each row. keyin is a unique identifier assigned to each input row and keyout contains the particular keyin assigned to the row that best represents all the rows in a fuzzy group. All rows in a fuzzy group will have the same keyout value. The score column is a value between 0.0 and 1.0 that describes the textual similarity between a given input row and the row selected to be the canonical value. |
OLE DB destinations | The OLE DB destination, OLE DB Destination - Existing Customers, inserts rows into the ExistingCustomerLeads table.The OLE DB destination, OLE DB Destination - Unique Customer Leads, inserts rows into the NewCustomerLeads table.The OLE DB destination, OLE DB Destination - Duplicate Customer Leads, inserts rows into the DuplicateCustomerLeads table. |
Execute SQL task | The Execute SQL task, Return Row Count for ExistingCustomersLeads table, runs a parameterized SQL statement. The statement returns a count for the number of rows in the ExistingCustomerLeads table where the Similarity column has a value of 1. The SQL statement stores this row count as a Single row result set in a package variable, RowCount. The Similarity column value is stored in a package variable that is mapped to the parameter in the SQL statement. A _Similarity value of 1 means that there is an exact match between the customer record in the input table, CustomerLeads, and the customer record in the Lookup reference dataset. |
Script task | The Script task, Display Row Count, displays the row count value for the ExistingCustomerLeads table, based on the value stored in the RowCount package variable. |
File connection manager | The File connection manager, CreateTables.sql, connects to the file that contains the SQL the package uses. |
OLE DB connection manager | The OLE DB connection manager, (local).AdventureWorks, connects to the AdventureWorks database on the local server. |