The NASA Facilities example

Note that this tutorial was created for BETL v0.1, which was developed for a customer that used an Inmon-style data warehouse.

The current BETL version is v0.2, which is optimized for Data Vault. Just try to get the push procedure working...


I'm a big fan of learning by example, so I will show you how BETL works by loading a 'random' sample dataset from NASA ( https://data.nasa.gov/data ).

I've saved the dataset as an Excel file, NASA_Facilities.xlsx, on my file system. We are going to load this data into a SQL Server staging database first, and then decide how to transform it into our data warehouse.

Extracting the source data into Staging

Pre: We have an Excel file F containing a snapshot of the source data.
Post: My_Staging contains a table called [NF].[NASA Facilities] holding an exact replica of the data in F.

To keep this tutorial simple, we will not elaborate on this step here.
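If you do want to do this step by hand, one possible approach (not part of BETL; just a sketch) is an ad hoc OPENROWSET query against the Excel file. It assumes the Microsoft ACE OLE DB provider is installed, ad hoc distributed queries are enabled, the NF schema already exists in My_Staging (create it first if not), and the worksheet is named NASA_Facilities; adjust the path, provider version and sheet name to your own situation.

USE [My_Staging]
GO

-- Sketch only: pull the worksheet into staging as-is.
-- File path, sheet name and provider version are assumptions.
SELECT *
INTO   [NF].[NASA Facilities]
FROM   OPENROWSET('Microsoft.ACE.OLEDB.12.0',
                  'Excel 12.0;Database=C:\temp\NASA_Facilities.xlsx;HDR=YES',
                  'SELECT * FROM [NASA_Facilities$]')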

Integrating the source data into our data warehouse model

Pre: My_Staging contains a table called [NF].[NASA Facilities] containing the latest snapshot of source data D.
Post: The source data D is transformed and loaded into normalized tables in My_DWH. (In this example we have chosen an Inmon-style 3NF data warehouse model.)

Some of the things we have to do are:
1. Convert data types.
2. Normalize the data to 3NF.
3. Detect natural keys.
4. Historize (detect changes).
5. Handle synonyms (duplicates).
6. Apply data warehouse terminology (rename columns).
See http://www.biblog.nl/category/bi-design-patterns/ for more on these steps and ETL patterns.

In our simple example we have three entities: Research Centers, Research Facilities and Contacts.
It's important to model the domain top-down at this stage (i.e. look at the questions the business has and model the data accordingly, keeping in mind that we must be able to map the source data onto this data warehouse model).
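To make the target model a bit more tangible, here is a rough sketch of the three entities and how they relate (illustrative only; as we will see, BETL generates the actual DWH tables itself during the PUSH, so there is no need to create anything by hand):

-- Rough 3NF target model (illustrative sketch, not DDL to execute):
--   Center   (center_key)                              one row per research center
--   Contact  (contact_key)                             derived from the contact column
--   Facility (facility_key, center_key, contact_key,   one row per facility,
--             occupy_date, facility_status, ...)       referencing Center and Contact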

For each domain entity/table we define a view in My_Staging, in our case NF.Facility and NF.Center (I prefer short singular names).
Besides normalization (2), we can also do the data type conversion (1) here and apply data warehouse naming (6). We also apply the multiple schema pattern (use different schemas for different source systems).

-- Step 1. For each domain entity/table we define a view in My_Staging. In our case NF.Facility and NF.Center

create view NF.Facility as
SELECT
      convert(varchar(255), [Facility]) facility_key
      ,convert(varchar(255), [Center]) center_key
      ,[Occupied] occupy_date
      ,convert(varchar(50), [Status])  facility_status
      ,convert(varchar(255), [URL Link]) url
      ,[Record Date] record_date
      ,[Last Update] last_update_date
      ,convert(varchar(100), [Country]) country_key
      ,convert(varchar(255),[Contact]) contact_key
      ,convert(varchar(20),[Phone]) facility_phone
      ,convert(varchar(100),[Location]) location
      ,convert(varchar(100),[City]) city
      ,convert(varchar(25),[State]) state_key
      ,convert(varchar(25),[Zipcode]) zipcode
  FROM [NF].[NASA Facilities]

* Note: _I have a feeling that all these convert(varchar(x), ...) statements could be implemented more elegantly, but I'm not sure how. Please feel free to share your suggestions._

Each column has a content type, which tells us whether it is a natural primary key, a natural foreign key, an attribute or metadata. When the content type is unknown, it is guessed from the column name and ordinal position. For example, natural keys are suffixed with _key. A natural primary key is an important identifier of an entity and may be used for change detection. Columns whose names end with _date are automatically converted to the date data type (for datetime, use the _dt suffix). So all we have done so far is create a view in My_Staging (and yet this is already a major step in our ETL development, as we will shortly see).

For the entity Center we will create a similar view.

create view NF.Center as
SELECT distinct
      convert(varchar(255), [Center]) center_key
  FROM [NF].[NASA Facilities]

Note that we use distinct here, because center_key is a natural primary key and it would not be unique without it. If we omitted the distinct, we would get a primary key violation when NF.Center is transferred into the DWH. For the entity Contact we do not have to create a view, because we only have a contact_key, which will automatically create a Contact entity in the DWH.
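If you want to verify this up front, a quick ad hoc query on the staging view (just a sanity check, not part of BETL) shows whether any center_key occurs more than once:

-- Sanity check: center_key must be unique before the push.
-- An empty result set means we are fine.
select   center_key, count(*) as row_count
from     NF.Center
group by center_key
having   count(*) > 1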

Pushing the data into the data warehouse

The next step is calling the PUSH command in BETL on the views we just created. This reads the views and their column metadata and inserts the data into the My_DWH database, creating all necessary tables automatically. Hold off on calling this command for a moment, though, because we need to set some metadata first.

exec dbo.run 'PUSH [Center]', 'NF'


* _Note that 'PUSH [Center]' is the command string and 'NF' is the scope parameter (short for NASA Facilities). The scope is used to resolve the arguments in the command string, so Center is the entity that lives in [My_Staging].[NF]._

* _When using SSIS we can fill in the scope automatically with a package variable. We then only have to name a control flow item PUSH Center, and the whole T-SQL string above is generated by an expression in SSIS._
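For example (this is my assumption about one way to wire it up, not something BETL prescribes): name the Execute SQL Task PUSH Center, add a package variable User::scope containing 'NF', and set the task's SqlStatementSource through an expression along these lines:

"exec dbo.run '" + @[System::TaskName] + "', '" + @[User::scope] + "'"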

Define scope

We must tell BETL that NF is the scope for all objects that live in the schema [LOCALHOST].[My_Staging].[NF]. We can do this by retrieving the object_id for this schema and updating its scope column.

exec dbo.run 'info [My_Staging]'

object_id    full_object_name
150          [LOCALHOST].[My_Staging].[NF]

update vw_object set scope= 'NF' where object_id = 150

Define target schema

We want to transfer all views in the My_Staging.NF schema into the My_DWH.NF schema. First we create the destination schema:

USE [My_DWH]
GO

CREATE SCHEMA [NF] AUTHORIZATION [dbo]
GO

Next, we make sure that this schema is known in vw_object (because we need to reference this schema using object_id):

exec dbo.run 'refresh My_DWH'


The output of this command will tell you the object_id of NF.

object_id    full_object_name
160          [LOCALHOST].[My_DWH].[NF]

The target_schema_id is implemented in BETL as a property. Properties are flexible attributes that can be assigned to objects; you can view all of them in vw_prop. Now we can set the target schema:

exec set_prop 'target_schema_id', 160, '[LOCALHOST].[My_Staging].[NF]'

When successful, this statement will select the property that was just set.

PUSH

Now we are ready to execute the push. Note that PUSH will create the target table if it does not exist. You can specify a transfer_method_id by editing vw_object; when it is NULL, BETL performs a truncate/insert by default (this is the only method currently implemented).
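If you want to check these settings first, you can peek at vw_object before pushing. The query below is only an illustration; the like filter is an assumption about how your object names look, so adjust it as needed.

-- Illustrative: inspect the staging objects before the push.
-- A NULL transfer_method_id means the default truncate/insert is used.
select object_id, full_object_name, transfer_method_id
from   vw_object
where  full_object_name like '%My_Staging%NF%'

With that in place, the push itself is a single call: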

exec dbo.run 'PUSH [Center]', 'NF'