Overview

The MapPoint Batch Geocoder is an open source transformation component for SQL Server Integration Services that performs batch geocoding of address information directly within the SSIS pipeline using the geocoding capabilities of MapPoint's Customer Data Services. The transformation outputs both the original input data and the geodetic spatial data using the new Spatial Data Types of SQL Server 2008.

The transformation leverages the geocoding capabilities of MapPoint's Customer Data Services. Address information within the pipeline is buffered and then uploaded to MapPoint CDS in batches. Once MapPoint geocodes the addresses, the data is downloaded and reintegrated back into the SSIS pipeline. During reintegration, spatial data is joined back with the original input data and continued through the pipeline.

The output columns include both the original input columns as well as two additional columns:

Capabilities

Prerequisites

Installation

  1. Compile the MapPoint Batch Geocoder transformation component.
  2. Copy the transformation component (MapPoint Batch Geocoder.dll) to the PipelineComponents folder for SSIS (i.e. "C:\Program Files\Microsoft SQL Server\100\DTS\PipelineComponents\").
  3. Add the assembly to the Global Assembly Cache (GAC). (see article 'How to: Install an Assembly into the Global Assembly Cache' - http://msdn.microsoft.com/en-us/library/dkkx7f79.aspx)
  4. On the Data Flow tab of SSIS, right click in the toolbox and select "Choose Items…".
  5. Select the "SSIS Data Flow Items" tab.
  6. Choose "MapPoint Batch Geocoder" from the list of components, and click OK.
The component should now be available from the list of Data Flow Transformations in the toolbox. For convenience, an installation package is also available which installs the runtime version. However, it does not add the component to Visual Studio.

Usage

The MapPoint Batch Geocoder works just like any other transformation component; once you add a Data Flow task to the Control flow design surgace, drag and drop the "MapPoint Batch Geocoder" from the toolbox and place it somewhere between a source and destination. After connecting the transformation to upstream (input) item and selecting the available input columns, configuration of the geocoder can begin.

Transformation Configuration

Below is a list of the available configuration options:

Logging

Log events are generated during geocoding and can be captured using standard SSIS logging. All events are reported under the OnInformation event.

Design Considerations

Isolating the Geocoding Process

The possibility of connectivity failure increases anytime Internet-based services are involved, as is the case with this component’s dependency on the MapPoint’s CDS webservice. The ETL processes can be designed so that a geocoding failure does not cause the entire ETL process to fail, particularly if the ETL process is long and difficult/tedious to restart.

Instead of performing geocoding as an extra data flow step in an already complex package, a more resilient design is to have a separate package which performs the geocoding against records which have already been processed by ETL and landed in the target database. The package will only geocode records which have yet to be geocoded, indicated by a null or flag. Once the geocoding is complete, the package updates the existing rows and clears the flag.

Because of the design, the geocoding package can be executed out-of-band, which handles the case of a network failure or even a historical reload of data.

Data Cleansing

Due to the nature of how address information is entered by operational systems, it is quite common for this information to contain errors (formatting, spelling, typos, etc). However, the MapPoint Batch Geocoder will not perform any data cleansing of address information before submitting to MapPoint. Since error-prone address information will yield poor or incorrect geocoding results, consideration for adding data cleansing practices to ETL processes should be made.

The MatchCode value returned by MapPoint can be used to identify records that had an issue being geocoded. A decision should be made for each possible MatchCode value of whether the quality of the geocode is useful enough to be consumed by downstream systems and whether further address cleansing (i.e. manual intervention) is required.

Controlling the MapPoint Batch Size

For performance reasons, geocoding is performed in batch mode. The batch size is controlled by the SSIS buffer controls DefaultBufferMaxRows and DefaultBufferSize properties of a data flow task. Managing the batch size is useful for achieving the right balance between geocoding too much data at once (i.e. all is lost in case of failure, or MapPoint limitations), and overall process performance.

MapPoint Limitations

Because the MapPoint Batch Geocoder is based on the MapPoint CDS webservice, it must adhere to the rules and constraints dictated by the Customer Data Services. Although most of these restrictions are already accounted for in the transformation’s design, some may still require consideration. For instance: See Formatting MapPoint Web Service Data Source Files for more information - http://msdn.microsoft.com/en-us/library/cc534849.aspx