Setting up Custom Data Quality Checks for a Data Product

Data Quality System Overview

The Foundation backend provides a data quality system for configuring both automatic and custom data quality rules on data products. The system supports:

  • Automatic Checks: Generated based on schema definitions (column types, constraints, etc.)

  • Custom Checks: User-defined rules using Great Expectations syntax

  • Quality Scoring: Weighted scoring system with configurable thresholds

  • Validation Execution: Automated quality checks and reporting

Key API Endpoints

1. Custom Expectation Management

Base URL: /api/data/data_product/quality/expectation/custom

Add Custom Expectation

POST /api/data/data_product/quality/expectation/custom?identifier={data_product_id}

Request Body (ExpectationItem):

{
  "type": "expect_column_values_to_be_between",
  "kwargs": {
    "column": "year",
    "min_value": 1980,
    "max_value": 2020
  },
  "meta": {
    "description": "Year should be between 1980 and 2020"
  }
}
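
The request body above can also be assembled programmatically. The following is a minimal Python sketch, not part of the Foundation API itself: the helper names `build_expectation` and `add_expectation_url` are hypothetical, and only the path, the `identifier` query parameter, and the `type`/`kwargs`/`meta` fields come from this page.

```python
import json
from urllib.parse import urlencode

# Path taken from this page; the host is deliberately left out.
CUSTOM_PATH = "/api/data/data_product/quality/expectation/custom"


def build_expectation(expectation_type, description, **kwargs):
    """Return a dict in the ExpectationItem shape (type, kwargs, meta)."""
    return {
        "type": expectation_type,
        "kwargs": kwargs,
        "meta": {"description": description},
    }


def add_expectation_url(data_product_id):
    """Return the path and query string for the add call (hypothetical helper)."""
    return f"{CUSTOM_PATH}?{urlencode({'identifier': data_product_id})}"


item = build_expectation(
    "expect_column_values_to_be_between",
    "Year should be between 1980 and 2020",
    column="year",
    min_value=1980,
    max_value=2020,
)
body = json.dumps(item, indent=2)

print(add_expectation_url("123e4567-e89b-12d3-a456-426614174000"))
print(body)
```

Building the payload from keyword arguments keeps the `kwargs` object aligned with whatever parameters the chosen expectation type takes.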

Update Custom Expectation

PUT /api/data/data_product/quality/expectation/custom?identifier={data_product_id}&custom_identifier={expectation_id}

Delete Custom Expectation

DELETE /api/data/data_product/quality/expectation/custom?identifier={data_product_id}&custom_identifier={expectation_id}
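
The three calls share one path and differ only in HTTP method and query parameters. A small Python sketch of that pattern follows. It makes several assumptions: `BASE_URL` is a placeholder host, `custom_expectation_request` is a hypothetical helper, `exp-1` is a made-up expectation id, and the update call is assumed to accept the same ExpectationItem body as the add call, which this page does not state. The requests are only constructed here, not sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://foundation.example.com"  # placeholder host (assumption)
CUSTOM_PATH = "/api/data/data_product/quality/expectation/custom"


def custom_expectation_request(method, data_product_id, expectation_id=None, item=None):
    """Build (but do not send) a request for a custom expectation call."""
    params = {"identifier": data_product_id}
    if expectation_id is not None:
        # Update and delete address an existing expectation by its id.
        params["custom_identifier"] = expectation_id
    data = json.dumps(item).encode("utf-8") if item is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    url = f"{BASE_URL}{CUSTOM_PATH}?{urlencode(params)}"
    return Request(url, data=data, headers=headers, method=method)


product_id = "123e4567-e89b-12d3-a456-426614174000"
item = {
    "type": "expect_column_values_to_be_unique",
    "kwargs": {"column": "user_id"},
    "meta": {"description": "User IDs should be unique"},
}

requests_to_send = [
    custom_expectation_request("POST", product_id, item=item),
    # Assumption: PUT takes the same body shape as POST.
    custom_expectation_request("PUT", product_id, "exp-1", item=item),
    custom_expectation_request("DELETE", product_id, "exp-1"),
]
for req in requests_to_send:
    print(req.get_method(), req.full_url)
```

Each request could then be passed to `urllib.request.urlopen`, after adding whatever authentication headers the deployment requires.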

2. Quality Configuration

Get Current Expectations

GET /api/data/data_product/quality/expectation?identifier={data_product_id}

Update Quality Weights

PUT /api/data/data_product/quality/expectation/weights?identifier={data_product_id}

Request Body:

{
  "accuracy": 0.2,
  "completeness": 0.3,
  "consistency": 0.1,
  "uniqueness": 0.1,
  "validity": 0.3
}
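
To illustrate how such weights might feed into a single score, here is a hedged Python sketch. It assumes the weights are expected to sum to 1 and that the overall score is a plain weighted sum of per-dimension scores in the range 0 to 1. Neither assumption is confirmed by this page, and the function names are hypothetical.

```python
import math

DIMENSIONS = ("accuracy", "completeness", "consistency", "uniqueness", "validity")


def check_weights(weights):
    """Raise ValueError if a dimension is missing or the weights do not sum to 1.

    The sum-to-1 rule is an assumption, not documented behavior.
    """
    missing = [d for d in DIMENSIONS if d not in weights]
    if missing:
        raise ValueError(f"missing weights for: {missing}")
    total = sum(weights[d] for d in DIMENSIONS)
    if not math.isclose(total, 1.0):
        raise ValueError(f"weights sum to {total}, expected 1.0")


def weighted_score(scores, weights):
    """Combine per-dimension scores into one number (assumed weighted sum)."""
    return sum(weights[d] * scores[d] for d in DIMENSIONS)


weights = {
    "accuracy": 0.2,
    "completeness": 0.3,
    "consistency": 0.1,
    "uniqueness": 0.1,
    "validity": 0.3,
}
check_weights(weights)

# Illustrative per-dimension scores, not real results.
scores = {d: 1.0 for d in DIMENSIONS}
print(weighted_score(scores, weights))
```

Validating the weights locally before sending the PUT request catches a missing dimension early, regardless of how the backend itself treats the values.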

Update Quality Thresholds

PUT /api/data/data_product/quality/expectation/thresholds?identifier={data_product_id}

Request Body:

{
  "table": 0.8,
  "columns": {
    "column_name": {
      "accuracy": 0.9,
      "completeness": 0.8,
      "consistency": 1.0,
      "uniqueness": 0.0,
      "validity": 0.5
    }
  }
}
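
As a rough illustration of how per-column thresholds could be read, the Python sketch below compares a column's dimension scores against the thresholds from the request body above. It assumes a dimension passes when its score is greater than or equal to its threshold. That comparison rule, the sample scores, and the `failing_dimensions` helper are all assumptions for illustration.

```python
def failing_dimensions(column_scores, column_thresholds):
    """Return the dimensions whose score is below the threshold, sorted by name.

    Assumes 'score >= threshold' means pass; a missing score counts as 0.0.
    """
    return sorted(
        dim
        for dim, minimum in column_thresholds.items()
        if column_scores.get(dim, 0.0) < minimum
    )


# Thresholds for one column, copied from the request body above.
thresholds = {
    "accuracy": 0.9,
    "completeness": 0.8,
    "consistency": 1.0,
    "uniqueness": 0.0,
    "validity": 0.5,
}

# Illustrative scores, not real validation output.
scores = {
    "accuracy": 0.95,
    "completeness": 0.75,
    "consistency": 1.0,
    "uniqueness": 0.4,
    "validity": 0.5,
}

print(failing_dimensions(scores, thresholds))
```

Under this reading, a threshold of 0.0 (as for `uniqueness` above) effectively switches the check off for that dimension.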

3. Quality Execution and Results

Run Quality Checks

POST /api/data/data_product/compute/builder/run/quality?identifier={data_product_id}

Get Validation Results

GET /api/data/data_product/quality/validations?identifier={data_product_id}

Get Quality Overview

GET /api/data/data_product/quality/overview

Custom Expectation Types

The system supports all Great Expectations expectation types. Here are common examples:

Column Value Expectations

{
  "type": "expect_column_values_to_be_between",
  "kwargs": {
    "column": "age",
    "min_value": 0,
    "max_value": 120
  },
  "meta": {"description": "Age should be between 0 and 120"}
}

Column Type Expectations

{
  "type": "expect_column_values_to_be_of_type",
  "kwargs": {
    "column": "email",
    "type_": "StringType"
  },
  "meta": {"description": "Email should be a string"}
}

Uniqueness Expectations

{
  "type": "expect_column_values_to_be_unique",
  "kwargs": {
    "column": "user_id"
  },
  "meta": {"description": "User IDs should be unique"}
}

Null Value Expectations

{
  "type": "expect_column_values_to_not_be_null",
  "kwargs": {
    "column": "required_field"
  },
  "meta": {"description": "Required field cannot be null"}
}

Regex Pattern Expectations

{
  "type": "expect_column_values_to_match_regex",
  "kwargs": {
    "column": "phone_number",
    "regex": "^\\+?[1-9]\\d{1,14}$"
  },
  "meta": {"description": "Phone number should match international format"}
}
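
Before registering a regex expectation, it can help to try the pattern against a few sample values locally. The Python sketch below does this for the phone-number pattern above and also shows the backslash doubling that JSON requires. The sample values are made up, and the backend's regex engine may differ from Python's `re` in edge cases, so treat this as a sanity check only.

```python
import json
import re

# The same pattern as in the JSON above, written as a Python raw string.
PATTERN = r"^\+?[1-9]\d{1,14}$"
PHONE_RE = re.compile(PATTERN)


def is_valid(value):
    """Return True if the whole value matches the pattern."""
    return PHONE_RE.fullmatch(value) is not None


samples = ["+14155552671", "4155552671", "0123456", "+1", "+1 415 555 2671"]
for value in samples:
    print(f"{value!r}: {is_valid(value)}")

# json.dumps doubles each backslash, which yields the escaped form used in the request body.
print(json.dumps({"regex": PATTERN}))
```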

Complete Workflow

1. Configure Custom Expectations

# Add a custom expectation (BASE_URL is a placeholder for your Foundation backend host)
curl -X POST "$BASE_URL/api/data/data_product/quality/expectation/custom?identifier=123e4567-e89b-12d3-a456-426614174000" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "expect_column_values_to_be_between",
    "kwargs": {
      "column": "revenue",
      "min_value": 0,
      "max_value": 1000000
    },
    "meta": {
      "description": "Revenue should be between 0 and 1M"
    }
  }'

2. Set Quality Weights (Optional)

curl -X PUT "$BASE_URL/api/data/data_product/quality/expectation/weights?identifier=123e4567-e89b-12d3-a456-426614174000" \
  -H "Content-Type: application/json" \
  -d '{
    "accuracy": 0.3,
    "completeness": 0.2,
    "consistency": 0.2,
    "uniqueness": 0.1,
    "validity": 0.2
  }'

3. Run Quality Checks

curl -X POST "$BASE_URL/api/data/data_product/compute/builder/run/quality?identifier=123e4567-e89b-12d3-a456-426614174000" \
  -H "Content-Type: application/json" \
  -d '{
    "config": {
      "spark_config": {
        "spark.sql.adaptive.enabled": "true"
      }
    }
  }'

4. Review Results

# Get validation results
curl "$BASE_URL/api/data/data_product/quality/validations?identifier=123e4567-e89b-12d3-a456-426614174000"

# Get current expectations
curl "$BASE_URL/api/data/data_product/quality/expectation?identifier=123e4567-e89b-12d3-a456-426614174000"
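
The four workflow steps can also be scripted. The following Python sketch builds the same requests in order using only the standard library. `BASE_URL` is a placeholder host, the helper names are hypothetical, and authentication headers are omitted because this page does not specify them. The requests are constructed but not sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://foundation.example.com"  # placeholder host (assumption)
API = "/api/data/data_product"


def _request(method, path, data_product_id, body=None):
    """Build one request; a JSON body is attached only when given."""
    url = f"{BASE_URL}{API}{path}?{urlencode({'identifier': data_product_id})}"
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if body is not None else {}
    return Request(url, data=data, headers=headers, method=method)


def workflow_requests(data_product_id):
    """Return the workflow requests in the order described above."""
    expectation = {
        "type": "expect_column_values_to_be_between",
        "kwargs": {"column": "revenue", "min_value": 0, "max_value": 1000000},
        "meta": {"description": "Revenue should be between 0 and 1M"},
    }
    weights = {
        "accuracy": 0.3,
        "completeness": 0.2,
        "consistency": 0.2,
        "uniqueness": 0.1,
        "validity": 0.2,
    }
    run_config = {"config": {"spark_config": {"spark.sql.adaptive.enabled": "true"}}}
    return [
        _request("POST", "/quality/expectation/custom", data_product_id, expectation),
        _request("PUT", "/quality/expectation/weights", data_product_id, weights),
        _request("POST", "/compute/builder/run/quality", data_product_id, run_config),
        _request("GET", "/quality/validations", data_product_id),
        _request("GET", "/quality/expectation", data_product_id),
    ]


for req in workflow_requests("123e4567-e89b-12d3-a456-426614174000"):
    print(req.get_method(), req.full_url)
```

To execute the workflow, each request could be passed to `urllib.request.urlopen` in sequence. Whether the quality run completes before the call returns is not stated on this page, so results may need to be fetched again later.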

Authentication & Permissions

Quality management endpoints require the following permissions:

  • Manage permission to create, update, or delete expectations

  • Read permission to view results

  • Browse permission to view the quality overview

The system uses the IAM framework to control access to data products and their quality configurations.
