Managing Services
In ibi Data Quality, a service represents a workflow that implements the logic to verify and cleanse data. Services can be either generic or domain-specific.
Built-in Services
ibi Data Quality provides a set of built-in services that can be exposed as Rules. These services are registered in the DQCore workspace and cannot be edited or replaced by end users. However, service developers can define and register new services in the Custom workspace.
For more information on the default services that are included in the product, see Packaged DQ Services.
DQ Service Requirements
A new service can be added to ibi Data Quality only if it complies with the following ibi Data Quality service specifications.
Request Method
This requirement applies to external DQ Services. The Service must support HTTP POST.
Authentication Method
This requirement applies to external DQ Services. The service may require basic authentication (user ID and password), but no other security mechanism is supported.
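While developing an external DQ Service, it can help to exercise it with a request that meets these two requirements. The following is a minimal sketch, not a description of how iDQ itself formats its calls: the URL, the credentials, and the text/csv content type are assumptions made for illustration, and the request is built but never sent.

```python
import base64
import urllib.request


def build_service_request(url: str, csv_block: str, user: str, password: str) -> urllib.request.Request:
    """Build (but do not send) an HTTP POST request with basic authentication."""
    # Basic authentication: base64 of "user:password" in the Authorization header.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=csv_block.encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "text/csv",  # assumption, not stated in this guide
        },
    )


# Hypothetical endpoint and credentials, for illustration only.
request = build_service_request(
    "https://example.com/cleanse_email", "a@b.com\n", "svc_user", "secret"
)
print(request.get_method())
```

Sending the request (for example with urllib.request.urlopen) is left out so that the sketch can be read and run without a live service.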
Input Data
- The service should support input data as a single block of CSV text, that is, a string of CSV data with each row separated by a newline (\n) character.
- The input data must not include a column header.
- Input values must be provided in the order specified in the service definition provided to ibi Data Quality. For more information, see Managing Services.
- When creating the service definition, input column names should be prefixed with in_.
  For example, an email cleanse service should have an input variable called in_email.
Output Data
- Output column names that represent cleansed input values should be prefixed with out_.
  Example: An email cleanse service should have an output variable called out_email.
- In addition to the desired output columns, each output row should have these two additional columns:
  - Tag value: A short (abbreviated) string that summarizes the data defects or other meaningful facts about the results of the analysis.
  - Tag category: A string that summarizes the final outcome of the data analysis. Note - Only one of these four supported values is allowed for tag_category:
    - Empty. Indicates that the input data is an empty string or contains only white spaces.
    - VALID. Indicates that the input data is valid and has the same value as the output data.
    - CLEANSED. Indicates that the input value is cleansed and a standardized, augmented, or enriched output value was generated.
    - INVALID. Indicates that the input value is invalid: either the defects cannot be fixed or the value does not represent the corresponding class of data.
    Warning - If a service generates an output tag_category value that does not belong to the set of values listed above, the service will fail to execute.
- All input values should be included in the output, and records should be emitted in the same order in which they were provided in the input data set.
- The output should be a CSV block with the first line being a column header.
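These output rules can be checked mechanically before a service is registered. The sketch below is one possible checker, not part of the product: the function name check_output is hypothetical, it assumes one output row per input row, and the category spellings are copied from the list above and should be confirmed against your release.

```python
import csv
import io

# Copied from the list above; confirm the exact spelling and case for your release.
ALLOWED_CATEGORIES = {"Empty", "VALID", "CLEANSED", "INVALID"}


def check_output(input_block: str, output_block: str) -> list[str]:
    """Return a list of problems found in a service's CSV output (empty if none)."""
    inputs = list(csv.reader(io.StringIO(input_block)))
    output = list(csv.reader(io.StringIO(output_block)))
    if not output:
        return ["output is empty; the first line must be a column header"]
    header, rows = output[0], output[1:]
    problems = [f"missing column: {name}"
                for name in ("tag_value", "tag_category") if name not in header]
    if problems:
        return problems
    if len(rows) != len(inputs):
        problems.append(f"expected {len(inputs)} rows, got {len(rows)}")
    position = header.index("tag_category")
    for number, row in enumerate(rows, start=1):
        category = row[position] if len(row) > position else ""
        if category not in ALLOWED_CATEGORIES:
            problems.append(f"row {number}: unsupported tag_category {category!r}")
    return problems


# Hypothetical input and output of an email cleanse service.
sample_input = "a@b.com\nbad\n"
sample_output = (
    "out_email,tag_value,tag_category\n"
    "a@b.com,VALID_EMAIL,VALID\n"
    ",INVALID_EMAIL,INVALID\n"
)
print(check_output(sample_input, sample_output))
```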
Parameters
Parameters represent name-value pairs that are passed in the Service request and are immutable values in the scope of a request executed by the Service. Rule authors use these parameters to create different Rules for a given service by setting different parameter values when creating Rules in the Rules Editor.
Example - The “cleanse_email” service specifies a parameter called “default_email” which allows a Rule author to set up a default email to replace invalid email addresses in the output.
Note - If you are developing a Python script to deploy into iDQ’s custom services, you need to support parameters by defining key-value pairs as arguments to the cleanse function.
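As an illustration of how a parameter changes a service's behavior, the following sketch shows a hosted cleanse function that reads default_email from its parameters. It is a simplified example under stated assumptions: the email check is deliberately naive, the tag values VALID_EMAIL and INVALID_EMAIL are hypothetical, and handling of empty input is omitted for brevity.

```python
import csv
import io


def idq_cleanse_email(csvstr: str, params: dict) -> str:
    """Replace invalid email addresses with the default_email parameter."""
    # "default_email" is the parameter named in the example above.
    default_email = params.get("default_email", "")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["in_email", "out_email", "tag_value", "tag_category"])
    for row in csv.reader(io.StringIO(csvstr)):
        value = row[0] if row else ""
        local, _, domain = value.partition("@")
        # Naive check for illustration only: text before "@" and a dot after it.
        if local and "." in domain:
            writer.writerow([value, value, "VALID_EMAIL", "VALID"])
        else:
            writer.writerow([value, default_email, "INVALID_EMAIL", "INVALID"])
    return out.getvalue()


# Two Rules built on the same service differ only in the parameter value.
rule_a = idq_cleanse_email("a@b.com\nnope\n", {"default_email": "unknown@example.com"})
rule_b = idq_cleanse_email("a@b.com\nnope\n", {"default_email": ""})
print(rule_a)
```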
DQ Service Definition
In order to register a new Service in iDQ, you need to create a definition for the service in JSON format (an example is included below).
Attributes | Description |
---|---|
name | Name of the service. It must be unique in the “custom” workspace. |
description | A brief description of the service. |
location | Service location URL. |
sendFullDataSet | Optional. Set this to “true” if the service expects the entire data set as input, as opposed to one row at a time. |
supportsJson | Optional. Set this to “true” if the service can interpret the request body in JSON format and generate a response in JSON format. |
batchSize | Set this value to null if your DQ Service generates exactly one output record for each input record. Set this value to 1 if your DQ Service generates more than one output record for each input record. When batchSize is 1, users should expect performance degradation because iDQ makes a separate call to the DQ Service for each row of the input data set. If a DQ Service generates multiple output rows for a single input row, the output data set will have multiple rows with the same dq_recid. Note - Users cannot use the Merge & Export feature when analyzing data sets with Rules that implement DQ Services with batchSize set to 1. |
inputColumns | An array of paired values: name (name of the input column) and description (brief description of the input column). |
outputColumns | An array of paired values: name (name of the output column) and description (brief description of the output column). |
parameters | (Optional) An array of paired values: name (name of the parameter) and description (brief description of the parameter). |
credentials | (Optional) A pair of values: user (user name for the service credentials) and password (password for the service credentials). |
The following is an example JSON document for a new service called cleanse_vin:
{
"name": "cleanse_vin",
"description": "Cleanse vehicle identification number",
"location": "/idq/custom/custom_cleanse_svcs/idq_cleanse_vin",
"supportsJson": false,
"sendFullDataSet": false,
"batchSize": null,
"inputColumns": [
{
"name": "in_vin",
"description": "value to be cleansed or verified"
}
],
"outputColumns": [
{
"name": "out_vin",
"description": "input value when tag_category is VALID, cleansed value when tag_category is CLEANSED, default value when tag_category is MISSING or INVALID"
},
{
"name": "tag_value",
"description": "Tag value that provides explanation for malformed, unexpected or missing data"
},
{
"name": "tag_category",
"description": "Tag category that categorizes tags as Missing Data, Cleansed Data or Invalid Data"
}
],
"tags": [
{
"name": "EMPTY_OR_WHITESPACES",
"description": "Input value is an empty string or contains all whitespaces"
},
{
"name": "INVALID_VIN",
"description": "Value does not represent a valid VIN"
},
{
"name": "VALID_VIN",
"description": "Input value is Valid"
},
{
"name": "NORMALIZED_VIN",
"description": "Input value was reformatted or cleansed"
}
],
"parameters": [
{
"name": "default_vin",
"description": "Default value when tag_category is MISSING or INVALID"
}
]
}
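Before registering a definition like the one above, it can be useful to check it against the naming conventions in DQ Service Requirements. The following sketch is a hypothetical helper, not a product feature. It assumes that name, location, inputColumns, and outputColumns are required, and it uses a trimmed copy of the cleanse_vin definition.

```python
import json


def check_definition(definition: dict) -> list[str]:
    """Return naming problems found in a DQ Service definition (empty if none)."""
    problems = []
    # Assumption: these attributes are required because the table does not mark them optional.
    for key in ("name", "location", "inputColumns", "outputColumns"):
        if not definition.get(key):
            problems.append(f"missing attribute: {key}")
    for column in definition.get("inputColumns") or []:
        if not column["name"].startswith("in_"):
            problems.append(f"input column {column['name']!r} should start with in_")
    output_names = [column["name"] for column in definition.get("outputColumns") or []]
    for required in ("tag_value", "tag_category"):
        if required not in output_names:
            problems.append(f"output column {required!r} is missing")
    return problems


# Trimmed copy of the cleanse_vin definition shown above.
definition = json.loads("""
{
  "name": "cleanse_vin",
  "location": "/idq/custom/custom_cleanse_svcs/idq_cleanse_vin",
  "batchSize": null,
  "inputColumns": [{"name": "in_vin", "description": "value to be cleansed or verified"}],
  "outputColumns": [
    {"name": "out_vin", "description": "cleansed value"},
    {"name": "tag_value", "description": "tag value"},
    {"name": "tag_category", "description": "tag category"}
  ]
}
""")
print(check_definition(definition))
```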
Adding custom DQ services
There are two ways to add your custom DQ Service in ibi Data Quality:
- External DQ Services
  - Develop and deploy your custom DQ logic as an external application with REST endpoints.
  - Register the new DQ Service.
- Hosted DQ Services
  - Develop and deploy your custom DQ logic as a Python script in iDQ’s custom services container.
  - Register the new DQ Service.
Regardless of where you decide to host the new DQ Service, the service has to comply with the minimum requirements as mentioned in DQ Service Requirements.
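Process Overview
Process to add: deploy the Python script (Add Python Script), confirm that its functions are listed (Verify Functions), test each function (Test Function), register the DQ Service (Register DQ Service), and confirm the registration (Verify DQ Service).
Process to rollback: delete the DQ Service (Delete DQ Service), then delete the Python script (Delete Python Script).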
Creating Python DQ script
This section covers the detailed steps for a developer to directly deploy custom Python scripts in iDQ with minimal effort.
Important considerations
- The only Python runtime environment supported in this release is 3.11.3.
- All deployed scripts run in the same Python environment, so deploy scripts with great care: conflicting dependencies may result in unpredictable behavior.
- You can deploy more than one cleanse function in a single script, provided each function name has idq_ as a prefix.
  For example, if you submit a script with the name “cleanse_svcs” that contains two different functions, idq_cleanse_vin and idq_cleanse_ssn, two different endpoints will be available at the end of the deployment process:
  idq/custom/cleanse_svcs/idq_cleanse_vin
  idq/custom/cleanse_svcs/idq_cleanse_ssn
- You must test your Python script before deploying it in iDQ.
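To preview which endpoints a script is likely to produce, the idq_ naming rule can be applied to the script source before deployment. This is a sketch under an assumption that is not confirmed by this guide: only top-level functions are treated as candidates. The helper name expected_endpoints is hypothetical.

```python
import ast


def expected_endpoints(script_name: str, source: str) -> list[str]:
    """List the endpoints implied by the idq_ functions defined in a script."""
    tree = ast.parse(source)
    # Assumption: only top-level functions whose names start with idq_ are exposed.
    names = [
        node.name
        for node in tree.body
        if isinstance(node, ast.FunctionDef) and node.name.startswith("idq_")
    ]
    return [f"idq/custom/{script_name}/{name}" for name in names]


# A stand-in script with two cleanse functions and one helper.
source = '''
def idq_cleanse_vin(csvstr, params):
    return csvstr

def helper(value):
    return value

def idq_cleanse_ssn(csvstr, params):
    return csvstr
'''
for endpoint in expected_endpoints("cleanse_svcs", source):
    print(endpoint)
```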
High level steps:
- Import required packages.
- Create a cleanse function with “idq_” as the name prefix. Note - The system only recognizes functions whose names start with “idq_”; all other functions are ignored.
- The function should have two arguments:
  - csvstr - represents a string of CSV data with each row separated by a newline (\n) character
  - params - represents a dictionary of key-value pairs that are passed in the request and are immutable values in the scope of a runtime execution of the script.
The following example walks you through the steps of creating a Python script that verifies Vehicle Identification Numbers (VINs) as input values.
Steps:
- Import all your required packages.
- Define a function to cleanse VINs. The name of the function should always start with idq_, and the function should always have two arguments.
- Read input values and parameters.
- Define output columns.
- Replace input NULL values with empty strings and count the number of rows in the input data.
- For each row, execute a sub function that verifies the input value and returns the verified, cleansed, and enriched output.
- Create a Pandas DataFrame from the output list, convert the DataFrame to CSV format, and return the results.
In the sub function, each input row is analyzed and, based on the analysis results, a tag_value and a tag_category are set.
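The steps above can be sketched as follows. This is an illustrative outline rather than the product's reference script. It uses the standard csv module in place of the Pandas DataFrame mentioned in the steps so that it runs without extra packages, and it checks only the length and character set of a VIN, not its check digit. The tag values are taken from the cleanse_vin definition, while the helper name _check_vin and the spelling of the empty-input category are assumptions to confirm for your release.

```python
import csv
import io
import re

# A VIN has 17 characters and never contains the letters I, O, or Q.
VIN_PATTERN = re.compile(r"[A-HJ-NPR-Z0-9]{17}")
OUTPUT_COLUMNS = ["in_vin", "out_vin", "tag_value", "tag_category"]
EMPTY_CATEGORY = "Empty"  # spelling as listed under Output Data; confirm for your release


def _check_vin(value: str, default_vin: str) -> list[str]:
    """Sub function: analyze one input value and return one output row."""
    if not value.strip():
        return [value, default_vin, "EMPTY_OR_WHITESPACES", EMPTY_CATEGORY]
    candidate = value.strip().upper()
    if not VIN_PATTERN.fullmatch(candidate):
        return [value, default_vin, "INVALID_VIN", "INVALID"]
    if candidate == value:
        return [value, value, "VALID_VIN", "VALID"]
    return [value, candidate, "NORMALIZED_VIN", "CLEANSED"]


def idq_cleanse_vin(csvstr: str, params: dict) -> str:
    """Cleanse function: the idq_ prefix and the two arguments are required."""
    default_vin = (params or {}).get("default_vin", "")
    # Read the input values, treating missing (NULL) cells as empty strings.
    values = [row[0] if row else "" for row in csv.reader(io.StringIO(csvstr))]
    row_count = len(values)
    output = [_check_vin(value, default_vin) for value in values]
    assert len(output) == row_count  # one output row per input row
    # Emit a CSV block whose first line is the column header.
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(OUTPUT_COLUMNS)
    writer.writerows(output)
    return out.getvalue()


result = idq_cleanse_vin(
    "1HGCM82633A004352\n1hgcm82633a004352\nABC\n \n", {"default_vin": "UNKNOWN"}
)
print(result)
```

Keeping the per-row analysis in a separate sub function makes it easy to test the tagging logic on single values before the script is deployed.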
Deploying Python DQ script
You will need an API client like Postman to deploy custom Python DQ scripts.
The following API services will enable you to deploy, test and register your new DQ Service. Before using these services, you will need to authenticate and authorize your user account. For more information, refer to the section Authorize.
Get Catalog
Description
Use this endpoint to retrieve a catalog of existing custom DQ Services that have already been deployed in the environment.
Endpoint
https://{{host}}:{{port}}/api/v1/custom/catalog
HTTP Method
GET
Request Body
NA
Response
Status | OK when the request is successful. |
Code | 200 when the request is successful. |
Message | NA |
developerMessage | NA |
responseType | java.util.ArrayList |
response | A list of RESTful service endpoints, one for each DQ function with the prefix idq_ |
exception | NA |
Add Python Script
Description
Use this endpoint to deploy your custom Python script.
Endpoint
https://{{host}}:{{port}}/api/v1/custom/module/{name of your python script}
Note - The name of your Python script can contain only letters, numbers, and underscores. Because this name becomes part of the service endpoint URL, provide a short, meaningful name that is unique and does not contain spaces or special characters.
HTTP Method
POST
Request Body
Paste the contents of your Python script
Response
Status | OK when the request is successful. |
Code | 200 when the request is successful. |
Message | NA |
developerMessage | NA |
responseType | java.lang.String |
response | The name of the Python script you deployed |
exception | Null |
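For readers who prefer a script to an API client such as Postman, the Add Python Script request can be assembled as in the sketch below. It is a minimal illustration with several assumptions: the host and port are placeholders, the text/plain content type is a guess, and the authorization step described in the section Authorize is not shown. The request is built but not sent.

```python
import re
import urllib.request

# Script names may contain only letters, numbers, and underscores.
NAME_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def build_deploy_request(base_url: str, script_name: str, script_source: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that deploys a Python script."""
    if not NAME_PATTERN.fullmatch(script_name):
        raise ValueError(f"invalid script name: {script_name!r}")
    return urllib.request.Request(
        f"{base_url}/api/v1/custom/module/{script_name}",
        data=script_source.encode("utf-8"),  # the request body is the script itself
        method="POST",
        headers={"Content-Type": "text/plain"},  # assumption, not stated in this guide
    )


# Placeholder host and port; substitute the values for your environment.
deploy = build_deploy_request(
    "https://idq.example.com:9443",
    "cleanse_svcs",
    "def idq_cleanse_vin(csvstr, params):\n    return csvstr\n",
)
print(deploy.get_method(), deploy.full_url)
```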
Verify Functions
Depending on the number of functions and the number of required packages imported in the script, the deployment process might take from a few seconds to several minutes to complete.
To verify that your service has deployed correctly, run the Get Catalog request described above.
For example, suppose the request to create the cleanse VIN Python script was posted to this endpoint:
https://{{host}}:{{port}}/api/v1/custom/module/cleanse_svcs
After successful deployment, calling Get Catalog should list your functions in the response message.
Note - The response contains the endpoint that you will need to register the DQ Service in iDQ.
{
"status": "OK",
"code": 200,
"message": null,
"developerMessage": null,
"responseType": "java.util.ArrayList",
"response": [
"/idq/custom/cleanse_svcs/idq_cleanse_vin"
],
"exception": null
}
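A response like the one above can be checked programmatically. The sketch below parses the Get Catalog response body and lists the deployed endpoints. The function name deployed_endpoints is hypothetical, and the sample text is the response shown above.

```python
import json


def deployed_endpoints(catalog_text: str) -> list[str]:
    """Return the endpoints listed in a Get Catalog response body."""
    body = json.loads(catalog_text)
    if body.get("code") != 200:
        raise RuntimeError(f"catalog request failed: {body.get('message')}")
    return list(body.get("response") or [])


# The Get Catalog response shown above.
catalog_text = """
{
  "status": "OK",
  "code": 200,
  "message": null,
  "developerMessage": null,
  "responseType": "java.util.ArrayList",
  "response": ["/idq/custom/cleanse_svcs/idq_cleanse_vin"],
  "exception": null
}
"""
endpoints = deployed_endpoints(catalog_text)
print("/idq/custom/cleanse_svcs/idq_cleanse_vin" in endpoints)
```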
Test Function
Description
Use this endpoint to test the functions defined in your custom Python script.
Note - If you have defined more than one idq_ function in your script, you will have to test each function individually.
Endpoint
https://{{host}}:{{port}}/api/v1/service/test
HTTP Method
POST
Parameters
input
A single row of input data.
Request Body
JSON message that provides the location of the function you intend to test.
Example:
{ "location": "/idq/custom/custom_cleanse_svcs/idq_cleanse_vin"}
Response
Status | OK when the request is successful. |
Code | 200 when the request is successful. |
Message | NA |
developerMessage | NA |
responseType | com.tibco.tdq.common.model.butler.ServiceValidationResponse |
response | url: Endpoint of the service that was tested. payload: Input data submitted as the request for this test. response: Comma-separated sets of output values, with each row separated by a newline (\n) character. |
exception | Null |
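The Test Function call combines an input parameter with a JSON body. A sketch of assembling it follows. It assumes that input is sent as a query-string parameter, which this guide does not state explicitly, and it uses a placeholder host and port. The request is built but not sent.

```python
import json
import urllib.parse
import urllib.request


def build_test_request(base_url: str, location: str, input_row: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that tests one idq_ function."""
    # Assumption: "input" travels in the query string.
    query = urllib.parse.urlencode({"input": input_row})
    return urllib.request.Request(
        f"{base_url}/api/v1/service/test?{query}",
        data=json.dumps({"location": location}).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Placeholder host and port; the location comes from the example above.
test_request = build_test_request(
    "https://idq.example.com:9443",
    "/idq/custom/custom_cleanse_svcs/idq_cleanse_vin",
    "1HGCM82633A004352",
)
print(test_request.full_url)
```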
Register DQ Service
The last step of the deployment process is to register the function as a DQ Service so Rule authors can use this service to create new Rules.
Note - If you have defined more than one idq_ function in your script, you will have to define and register each corresponding DQ Service individually.
Endpoint
https://{{host}}:9803/api/v1/service
HTTP Method
POST
Parameters
NA
Request Body
JSON message that describes the DQ Service. Refer to the section on Service Definition.
Example:
{
"name": "cleanse_vin",
"description": "Cleanse vehicle identification number",
"location": "/idq/custom/custom_cleanse_svcs/idq_cleanse_vin",
"supportsJson": false,
"sendFullDataSet": false,
"batchSize": null,
"inputColumns": [
{
"name": "in_vin",
"description": "value to be cleansed or verified"
}
],
"outputColumns": [
{
"name": "out_vin",
"description": "input value when tag_category is VALID, cleansed value when tag_category is CLEANSED, default value when tag_category is MISSING or INVALID"
},
{
"name": "tag_value",
"description": "Tag value that provides explanation for malformed, unexpected or missing data"
},
{
"name": "tag_category",
"description": "Tag category that categorizes tags as Missing Data, Cleansed Data or Invalid Data"
}
],
"tags": [
{
"name": "EMPTY_OR_WHITESPACES",
"description": "Input value is an empty string or contains all whitespaces"
},
{
"name": "INVALID_VIN",
"description": "Value does not represent a valid VIN"
},
{
"name": "VALID_VIN",
"description": "Input value is Valid"
},
{
"name": "NORMALIZED_VIN",
"description": "Input value was reformatted or cleansed"
}
],
"parameters": [
{
"name": "default_vin",
"description": "Default value when tag_category is MISSING or INVALID"
}
]
}
Response
Status | OK when the request is successful. |
Code | 200 when the request is successful. |
Message | NA |
developerMessage | NA |
responseType | com.tibco.tdq.common.model.rules.Service |
response | id: Unique identifier assigned to the service. workspace: custom. version: 1.0. The rest of the attributes are as defined in the service definition. |
exception | Null |
Verify DQ Service
After registering a DQ Service, use this endpoint to retrieve the service info and verify that your newly registered DQ Service is now available for use in iDQ.
Note - If you have defined more than one idq_ function in your script and registered them in iDQ, you will have to verify each DQ Service individually.
Endpoint
https://{{host}}:9803/api/v1/service
HTTP Method
GET
Parameters
name
name of the DQ Service you have registered
workspace
custom (by default, all new DQ Services are deployed to the custom workspace)
Request Body
NA
Response
Status | OK when the request is successful. |
Code | 200 when the request is successful. |
Message | NA |
developerMessage | NA |
responseType | com.tibco.tdq.common.model.rules.Service |
response | Description of the DQ Service in JSON format |
exception | Null |
After successfully registering your new DQ Service, you can log in to the iDQ user interface, go to the Service Registry page, and find your newly registered DQ Service ready for use.
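The Verify DQ Service call can also be scripted. The sketch below builds the request URL and reads the service id from a response. It rests on two assumptions: name and workspace are sent as query-string parameters, and the response uses the same envelope as the other calls, with the service object under response. The host and the sample id are placeholders.

```python
import json
import urllib.parse


def verify_url(base_url: str, name: str, workspace: str = "custom") -> str:
    """Build the Verify DQ Service URL for a registered service."""
    # Assumption: both parameters travel in the query string.
    query = urllib.parse.urlencode({"name": name, "workspace": workspace})
    return f"{base_url}/api/v1/service?{query}"


def service_id(response_text: str) -> str:
    """Extract the service id from a Verify DQ Service response body."""
    body = json.loads(response_text)
    return body["response"]["id"]


# Hypothetical response; the id value is a placeholder, not a real identifier.
sample_response = (
    '{"status": "OK", "code": 200, '
    '"response": {"id": "example-id", "name": "cleanse_vin", "workspace": "custom"}}'
)
print(verify_url("https://idq.example.com:9803", "cleanse_vin"))
print(service_id(sample_response))
```

The id returned here is the value that the Delete DQ Service endpoint expects in its URL.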
Delete DQ Service
Description
Use this endpoint to delete a previously registered DQ Service from iDQ.
Note - You can only remove DQ Services that were registered in the “custom” workspace. You cannot delete prepackaged DQ Services that are shipped with the product.
Endpoint
https://{{host}}:{{port}}/api/v1/service/{id of the service}
You can find the id of the service from the response of the Verify DQ Service call.
HTTP Method
DELETE
Parameters
NA
Request Body
NA
Response
Status | OK when the request is successful. |
Code | 200 when the request is successful. |
Message | NA |
developerMessage | NA |
responseType | org.springframework.http.ResponseEntity |
response | body: 1. statusCode: OK. statusCodeValue: 200. |
exception | Null |
Delete Python Script
Description
Use this endpoint to delete your Python script from the environment.
Note - If your script contains more than one idq_ function, all functions will be removed from the system.
Also, before removing a script, make sure you remove the corresponding DQ Services by deleting them from the Service Registry.
Endpoint
https://{{host}}:{{port}}/api/v1/custom/module/{name of the python script}
HTTP Method
DELETE
Parameters
NA
Request Body
NA
Response
Status | OK when the request is successful. |
Code | 200 when the request is successful. |
Message | NA |
developerMessage | NA |
responseType | org.springframework.http.ResponseEntity |
response | body: cleanse_svcs (name of the script that is deleted). statusCode: OK. statusCodeValue: 200. |
exception | Null |
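Because the DQ Services must be removed before the script that backs them, a rollback is an ordered sequence of DELETE calls. The sketch below only lists the calls in that order and sends nothing. The function name rollback_plan, the host, and the service ids are hypothetical.

```python
def rollback_plan(base_url: str, service_ids: list[str], script_name: str) -> list[tuple[str, str]]:
    """Return the DELETE calls for a rollback, in the order they should be made."""
    # First delete every registered DQ Service that depends on the script.
    steps = [("DELETE", f"{base_url}/api/v1/service/{sid}") for sid in service_ids]
    # Then delete the script, which removes all of its idq_ functions.
    steps.append(("DELETE", f"{base_url}/api/v1/custom/module/{script_name}"))
    return steps


# Placeholder host and ids, for illustration only.
plan = rollback_plan("https://idq.example.com:9443", ["id-1", "id-2"], "cleanse_svcs")
for method, url in plan:
    print(method, url)
```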