Using the API to Implement Data Transformations

Transformations

Transformations convert raw data into usable data products. A transformation defines the processing steps needed to turn input data (from data objects or other data products) into the structure defined by the data product's schema.

Purpose

  • Data cleaning: Remove inconsistencies and errors from raw data

  • Data enrichment: Add calculated fields and business logic

  • Data integration: Combine data from multiple sources

  • Schema compliance: Ensure output matches data product schema exactly

Important: The output of your transformations must match the schema defined for the data product exactly.

Creating Transformations

Transformation Builder Structure

Endpoint: PUT /api/data/data_product/compute/builder?identifier={product_id}

{
  "config": {
    "docker_tag": "0.0.23",
    "executor_core_request": "800m",
    "executor_core_limit": "1500m",
    "executor_instances": 1,
    "executor_memory": "5120m",
    "driver_core_request": "0.3",
    "driver_core_limit": "800m",
    "driver_memory": "2048m"
  },
  "inputs": {
    "input_data_object_id": {
      "input_type": "data_object",
      "identifier": "data-object-id-here",
      "preview_limit": 10
    }
  },
  "transformations": [
    {
      "transform": "cast",
      "input": "input_data_object_id",
      "output": "casted_data",
      "changes": [
        {
          "column": "customer_id",
          "data_type": "integer",
          "kwargs": {}
        }
      ]
    }
  ],
  "finalisers": {
    "input": "casted_data",
    "enable_quality": true,
    "write_config": {"mode": "overwrite"},
    "enable_profiling": true,
    "enable_classification": false
  },
  "preview": false
}
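The builder payload above is submitted with an HTTP PUT to the endpoint shown. The following is a minimal sketch using Python's standard library; the base URL, product identifier, and bearer-token authentication are placeholders and assumptions to adapt to your platform:

```python
import json
import urllib.request

BASE_URL = "https://platform.example.com"  # assumption: replace with your host
PRODUCT_ID = "your-data-product-id"        # assumption: the target data product

# Trimmed-down builder payload; see the full structure above.
builder_payload = {
    "config": {
        "docker_tag": "0.0.23",
        "executor_instances": 1,
        "executor_memory": "5120m",
    },
    "inputs": {
        "input_data_object_id": {
            "input_type": "data_object",
            "identifier": "data-object-id-here",
            "preview_limit": 10,
        }
    },
    "transformations": [],
    "finalisers": {
        "input": "input_data_object_id",
        "write_config": {"mode": "overwrite"},
    },
    "preview": True,  # best practice: preview before a full run
}

def submit_builder(payload, product_id, token):
    """PUT the builder payload to the endpoint documented above.
    Bearer auth is an assumption; adapt to your platform's auth scheme."""
    url = f"{BASE_URL}/api/data/data_product/compute/builder?identifier={product_id}"
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Setting `"preview": True` runs the pipeline on a sample (bounded by each input's `preview_limit`) so you can validate the output schema cheaply before a full run.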

Python Functions

Common Transformations

Cast - Convert Data Types
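The `cast` step converts column data types. Its shape is confirmed by the builder example above; here it is as a Python dict, with a second change added for illustration (the `signup_date` column is a hypothetical example):

```python
# "cast" step: convert column data types.
cast_step = {
    "transform": "cast",
    "input": "input_data_object_id",  # an input key or a prior step's output
    "output": "casted_data",          # handle consumed by downstream steps
    "changes": [
        {"column": "customer_id", "data_type": "integer", "kwargs": {}},
        # assumption: illustrative second column, not from the example above
        {"column": "signup_date", "data_type": "date", "kwargs": {}},
    ],
}
```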

Select Columns - Choose Specific Columns
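A select step keeps only the listed columns. The transform name and `columns` key below are assumptions modelled on the cast step's shape; check your builder reference for the exact spelling:

```python
# "select" step: keep only the listed columns.
# NOTE: "select" and "columns" are assumed names, not confirmed by this page.
select_step = {
    "transform": "select",
    "input": "casted_data",      # a prior step's output handle
    "output": "selected_data",
    "columns": ["customer_id", "order_total", "region"],
}
```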

Filter - Apply Conditions
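A filter step drops rows that fail a condition. The `condition` key and its SQL-like string syntax are assumptions by analogy with the expression step described below:

```python
# "filter" step: keep only rows matching a condition.
# NOTE: the "condition" key and SQL-like syntax are assumptions.
filter_step = {
    "transform": "filter",
    "input": "selected_data",    # a prior step's output handle
    "output": "filtered_data",
    "condition": "order_total > 0 AND region IS NOT NULL",
}
```

Filtering early in the pipeline reduces the data volume that later steps must process (see Best Practices below).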

Group By - Aggregate Data
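A group-by step aggregates rows over one or more key columns. The `group_columns` and `aggregations` keys below are assumed names for illustration only:

```python
# "group_by" step: aggregate rows per group.
# NOTE: "group_columns" and "aggregations" are assumed key names.
group_by_step = {
    "transform": "group_by",
    "input": "filtered_data",    # a prior step's output handle
    "output": "aggregated_data",
    "group_columns": ["region"],
    "aggregations": [
        {"column": "order_total", "function": "sum", "alias": "total_sales"},
    ],
}
```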

Join - Combine Data Sources
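A join step combines two data sources on matching keys. The second source would typically be another entry declared under `inputs`; the `right_input`, `join_type`, and `on` keys are assumptions for illustration:

```python
# "join" step: combine two sources on key columns.
# NOTE: "right_input", "join_type", and "on" are assumed key names.
join_step = {
    "transform": "join",
    "input": "aggregated_data",          # left side: a prior step's output
    "output": "joined_data",
    "right_input": "region_reference",   # assumption: a second declared input
    "join_type": "left",
    "on": ["region"],
}
```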

Expression - Apply SQL-like Expressions
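An expression step derives new columns from SQL-like expressions, which is how calculated fields for data enrichment are typically added. The `expressions` key and its entry shape are assumptions:

```python
# "expression" step: derive new columns from SQL-like expressions.
# NOTE: the "expressions" key and entry shape are assumptions.
expression_step = {
    "transform": "expression",
    "input": "joined_data",      # a prior step's output handle
    "output": "enriched_data",
    "expressions": [
        {"expression": "total_sales / population", "alias": "sales_per_capita"},
    ],
}
```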

Complete Example
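A complete builder payload chains steps by feeding each step's `output` into the next step's `input`, with the final handle passed to `finalisers`. In the sketch below, only the cast step's shape is confirmed by the builder structure earlier on this page; the filter step's shape and all identifiers are illustrative assumptions:

```python
# Full builder payload: cast, then filter, then finalise.
complete_payload = {
    "config": {
        "docker_tag": "0.0.23",
        "executor_core_request": "800m",
        "executor_core_limit": "1500m",
        "executor_instances": 1,
        "executor_memory": "5120m",
        "driver_core_request": "0.3",
        "driver_core_limit": "800m",
        "driver_memory": "2048m",
    },
    "inputs": {
        "orders": {
            "input_type": "data_object",
            "identifier": "orders-object-id",  # assumption: example identifier
            "preview_limit": 10,
        },
    },
    "transformations": [
        {
            "transform": "cast",
            "input": "orders",
            "output": "casted",
            "changes": [
                {"column": "customer_id", "data_type": "integer", "kwargs": {}},
            ],
        },
        {
            "transform": "filter",  # assumption: step shape not confirmed
            "input": "casted",
            "output": "filtered",
            "condition": "customer_id IS NOT NULL",
        },
    ],
    "finalisers": {
        "input": "filtered",  # the last step's output handle
        "enable_quality": True,
        "write_config": {"mode": "overwrite"},
        "enable_profiling": True,
        "enable_classification": False,
    },
    "preview": True,  # validate on a sample before the full run
}
```

Note that `finalisers.input` must name the output of the last transformation, and that the final output's columns must match the data product schema exactly.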

Monitoring and Validation

Status Values

  • COMPLETED: Transformation finished successfully

  • FAILED: Transformation encountered an error

  • RUNNING: Transformation currently executing

  • STARTING_UP: Transformation is starting

  • SCHEDULED: Transformation scheduled for execution

  • UNSCHEDULED: Job could not be scheduled
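Of the statuses above, COMPLETED, FAILED, and UNSCHEDULED are terminal; the others indicate a run in progress. A simple polling loop can wait for a terminal status. The `get_status` callable below is a hypothetical stand-in for however your client fetches the current status:

```python
import time

# Terminal statuses from the list above.
TERMINAL_STATUSES = {"COMPLETED", "FAILED", "UNSCHEDULED"}

def wait_for_completion(get_status, poll_seconds=10, timeout_seconds=3600):
    """Poll until the transformation reaches a terminal status.

    get_status is a hypothetical callable returning one of the status
    values documented above; adapt it to your client library.
    """
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("transformation did not reach a terminal status in time")
```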

Troubleshooting

Failed Transformations

  1. Check compute logs: Use get_transformation_logs() for detailed error information

  2. Verify schema match: Ensure transformation output matches product schema exactly

  3. Check quality validations: Use get_quality_validations() for schema mismatch details
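The three steps above can be sketched as a small triage helper. The two callables stand in for the `get_transformation_logs()` and `get_quality_validations()` helpers named in this section; their real signatures and return shapes may differ, so treat the `passed` field below as an assumption:

```python
def diagnose_failure(get_transformation_logs, get_quality_validations):
    """Triage a FAILED transformation following the steps above.

    Both arguments are stand-ins for the platform helpers named in this
    section; their actual signatures and return shapes may differ.
    """
    # Step 1: check the compute logs for an error.
    logs = get_transformation_logs()
    if logs:
        return ("compute_error", logs)
    # Steps 2-3: no error in the logs, so look for schema mismatches
    # in the quality validations (assumption: entries carry a "passed" flag).
    validations = get_quality_validations()
    failed = [v for v in validations if not v.get("passed", True)]
    if failed:
        return ("schema_mismatch", failed)
    return ("unknown", None)
```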

Schema Validation Issues

  • If the transformation status is FAILED but the logs show no error, check the quality validations

  • Schema mismatches are the most common cause of failed transformations

  • Use get_builder_schemas() to see schema evolution through transformation steps

Best Practices

  • Preview first: Always test with preview=True before running full transformations

  • Schema alignment: Ensure final output exactly matches data product schema

  • Error handling: Check both compute logs and quality validations for failures

  • Resource sizing: Adjust executor instances and memory based on data volume

  • Step validation: Use builder state to verify schema at each transformation step

  • Keep transformations simple and modular

  • Prefer built-in transformation types over custom SQL when possible

  • For large datasets, filter early in the pipeline to reduce data volume
