Note: |
---|
This sample uses the Fuzzy Grouping and Fuzzy Lookup transformations, which are available only in the Enterprise version of SQL Server. |
Important: | |
---|---|
Samples are provided for educational purposes only. They are not intended to be used in a production environment and have not been tested in a production environment. Microsoft does not provide technical support for these samples. | To learn more about data cleaning, search for the following articles in the MSDN Library:
RequirementsRunning this sample package requires the following:
Location of the Sample PackageIf the samples were installed to the default installation location, the Data Cleaning package is located in the following folder:C:\Program Files\Microsoft SQL Server\100\Samples\Integration Services\Package Samples\DataCleaning Sample\Data Cleaning\. The following files are required to run this sample package. | |
File | Description |
DataCleaning.dtsx | The sample package. |
CreateTables.sql | SQL statements to create tables. |
Note: |
---|
If you open the package in SSIS Designer and view the package properties, you will notice that the DelayValidation property is set to True. Validation of the package must be delayed because some tables used by the Data Cleaning sample package—the CustomerLeads, and the three output tables named ExistingCustomerLeads, NewCustomerLeads, and DuplicateCustomerLeads—are not created until the first time the package runs. If DelayValidation is set to False, a validation error occurs when you open the package in SSIS Designer before running the package. |
Element | Purpose |
---|---|
Execute SQL task | The Execute SQL task is named Create Customer Address Reference Table View, Populate NewCustomer Input Table and Create Output Tables. This task creates the input table, CustomerLeads, and also creates the three output tables named ExistingCustomerLeads, NewCustomerLeads, and DuplicateCustomerLeads. |
Data Flow task | The Data Flow task, Fuzzy Lookup Data Flow Task, executes the data flow in the package. |
OLE DB source | The OLE DB source, OLE DB Source - Customer Leads, reads records from the CustomerLeads table. |
Lookup transformation | The Lookup transformation, Lookup against Existing Customers, performs an exact lookup to identify existing customers. If the lookup succeeds, the record is inserted into the ExistingCustomerLeads table. |
Derived Column transformation | The Derived Column transformation, Derived Column, adds the _Similarity columns to each row and sets the column value to 1. |
Fuzzy Lookup transformation | The Fuzzy Lookup transformation, Fuzzy Lookup against Existing Customers, performs a fuzzy lookup to identify customer records that are fuzzy matches of existing customer records. The transformation adds a _Similarity column that contains a similarity score to each row. The score 0.0 means no match was found, whereas 1.0 means an exact match was found. A score between 0.0 and 1.0 is a measure of similarity in which a value closer to 1.0 indicates greater similarity. |
Conditional Split transformation | The first Conditional Split transformation, ConditionalSplit on Similarity, directs input rows to one of two outputs depending on the value of the similarity score determined by the fuzzy lookup. Rows with a similarity score >= .70 are written to the ExistingCustomerLeads table. Rows with similarity scores < 70 are probably valid new customer leads and additional cleaning is done on these rows. The second Conditional Split transformation, Conditional Split on Canonical Record for Group, directs input rows to one of two outputs depending on whether the data row is a duplicate. If the values of the keyin and keyout columns are equal, the row is used as the canonical row in the group, and the canonical row is inserted into the NewCustomerLeads table. If the keyin and key_out columns are not equal, the row is treated as a fuzzy duplicate and the row is inserted into the DuplicateCustomerLeads table. |
Union All transformation | The Union All transformation, Union All, merges rows of existing customers—both exact and fuzzy matches—into one dataset. |
Fuzzy Grouping transformation | The Fuzzy Grouping transformation, Fuzzy Grouping, groups customers who are likely duplicates. The transformation adds three columns keyin, keyout and score to each row. keyin is a unique identifier assigned to each input row and keyout contains the particular keyin assigned to the row that best represents all the rows in a fuzzy group. All rows in a fuzzy group will have the same keyout value. The score column is a value between 0.0 and 1.0 that describes the textual similarity between a given input row and the row selected to be the canonical value. |
OLE DB destinations | The OLE DB destination, OLE DB Destination - Existing Customers, inserts rows into the ExistingCustomerLeads table.The OLE DB destination, OLE DB Destination - Unique Customer Leads, inserts rows into the NewCustomerLeads table.The OLE DB destination, OLE DB Destination - Duplicate Customer Leads, inserts rows into the DuplicateCustomerLeads table. |
File connection manager | The File connection manager, CreateTables.sql, connects to the file that contains the SQL the package uses. |
OLE DB connection manager | The OLE DB connection manager, (local).AdventureWorks, connects to the AdventureWorks database on the local server. |
Table | Description |
---|---|
ExistingCustomerLeads | Contains records that exactly match an existing customer, and records that fuzzily match an existing customer with very high textual similarity. |
NewCustomerLeads | Contains records for which there was no good match to an existing customer. If the list contained multiple instances of the same name, or a highly similar version of a particular name, only one record will be directed to NewCustomerLeads, and the duplicates will be directed to DuplicateCustomerLeads. |
DuplicateCustomerLeads | Contains duplicates of new customers. |
Select * from AdventureWorks.FuzzyLookupExample.ExistingCustomerLeads Select * from AdventureWorks.FuzzyLookupExample.NewCustomerLeads Select * from AdventureWorks.FuzzyLookupExample.DuplicateCustomerLeads | © 2007 Microsoft Corporation. All rights reserved. |