Configuring Data Transformations through the UI

Transformations convert raw data into usable data products by applying a sequence of processing steps to the input data. They define how the data from your sources is processed to match the expected schema of your data product.

When to Use Transformations

  • When raw data needs processing before business use

  • When you need to standardize, clean, or enrich data

  • When you need to combine data from multiple sources

  • When you need to apply business rules or calculations

  • When you need to restructure data to match consumer requirements

Adding Transformations

  1. Navigate to the Data Catalog or the Data Landscape

  2. Search for the data product you want to add transformations to

  3. Navigate to the Lineage tab of your data product

  4. Click the Edit Transformation button

  5. In the "Add Transformation" dialog, you'll need to specify:

    • Data Frame: Select the input data source

    • Transformation: Choose the type of transformation to apply (e.g., "select_columns")

    • Transformation name: Give your transformation a descriptive name

    • Columns: Select which columns to include or modify

  6. For a "select_columns" transformation:

    • Check the boxes next to columns you want to include in your output

    • Each column shows its data type (e.g., VARCHAR)

  7. Click OK to add the transformation to your pipeline

Transformation Types in Detail

MeshX Foundation supports numerous transformation types to manipulate and refine your data:

1. Column Selection and Renaming

  • select_columns: Choose which columns to include in the output

    • Use the checkboxes to select columns to keep

    • Deselected columns will be excluded from the output

  • rename_column: Change column names for clarity or standardization

    • Specify old column names and their new names

    • Helpful for standardizing naming conventions or making cryptic column names more readable
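
The UI generates these steps for you, but it can help to see the equivalent dataframe operations. Below is a minimal PySpark sketch of the same semantics; the table and column names are invented for illustration, and the platform's actual engine and generated code may differ.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical raw input with more columns than the data product needs.
raw = spark.createDataFrame(
    [(1, "Alice", "2024-01-05", "internal-note")],
    ["cust_id", "cust_nm", "signup_dt", "debug_info"],
)

# select_columns: keep only the checked columns; everything else is dropped.
selected = raw.select("cust_id", "cust_nm", "signup_dt")

# rename_column: map cryptic source names to standardized output names.
renamed = (
    selected
    .withColumnRenamed("cust_nm", "customer_name")
    .withColumnRenamed("signup_dt", "signup_date")
)

renamed.show()
```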

2. Data Type Modifications

  • cast: Convert data types (e.g., string to integer, string to date)

    • Select columns to convert

    • Choose the target data type for each column

    • Supported types include INTEGER, VARCHAR, DECIMAL, DATE, TIMESTAMP, and BOOLEAN, among others
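
For reference, a cast behaves like the following PySpark sketch; the columns, values, and target types are illustrative rather than taken from the platform.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input where every field arrived as a string.
raw = spark.createDataFrame(
    [("42", "19.99", "2024-03-01")],
    ["quantity", "unit_price", "order_date"],
)

# cast: convert each selected column to its target type.
typed = raw.select(
    F.col("quantity").cast("int"),
    F.col("unit_price").cast("decimal(10,2)"),
    F.col("order_date").cast("date"),
)

typed.printSchema()
```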

3. Filtering and Conditions

  • filter_with_condition: Filter rows based on specific criteria

    • Create conditions like "column_name > 100" or "status = 'ACTIVE'"

    • Supports complex expressions using AND, OR, and parentheses

    • Only rows matching the condition are kept in the output
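
Conceptually this is a dataframe filter with a SQL-style predicate, as in the illustrative snippet below (the columns, values, and condition are made up).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

orders = spark.createDataFrame(
    [(1, 250, "ACTIVE"), (2, 80, "ACTIVE"), (3, 500, "CANCELLED")],
    ["order_id", "amount", "status"],
)

# filter_with_condition: only rows matching the condition survive.
kept = orders.filter("amount > 100 AND status = 'ACTIVE'")

kept.show()  # keeps only order 1
```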

4. Data Aggregation

  • group_by: Aggregate data based on specific dimensions

    • Select grouping columns (dimensions)

    • Define aggregation functions for measure columns

    • Supported aggregations: sum, count, min, max, avg, etc.

    • Creates summary statistics for each unique combination of dimension values
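
A small PySpark sketch of the same aggregation logic, with invented dimensions and measures:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

sales = spark.createDataFrame(
    [("EU", "A", 100), ("EU", "B", 40), ("US", "A", 75)],
    ["region", "product", "amount"],
)

# group_by: dimensions form the grouping key, measures get aggregation functions.
summary = sales.groupBy("region").agg(
    F.sum("amount").alias("total_amount"),
    F.count("*").alias("order_count"),
    F.avg("amount").alias("avg_amount"),
)

summary.show()  # one summary row per region
```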

5. Data Joining

  • join_rename_select: Join two dataframes and handle column naming

    • Select the type of join (inner, left, right, full)

    • Specify join conditions (equality or other comparisons)

    • Choose which columns to include from each source

    • Rename columns to resolve conflicts or improve clarity

    • Note: Columns are prefixed with "left_" and "right_" in the joined result
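
The snippet below sketches an equivalent join in PySpark. The explicit aliasing and renaming mimic how the joined columns are disambiguated; the platform applies its left_/right_ prefixes automatically, whereas here the renaming is done by hand and all names are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

orders = spark.createDataFrame([(1, 101, 250)], ["order_id", "customer_id", "amount"])
customers = spark.createDataFrame([(101, "Alice")], ["customer_id", "name"])

# Left join on an equality condition, then select and rename to avoid ambiguous names.
joined = (
    orders.alias("l")
    .join(customers.alias("r"), F.col("l.customer_id") == F.col("r.customer_id"), "left")
    .select(
        F.col("l.order_id"),
        F.col("l.customer_id").alias("left_customer_id"),
        F.col("r.name").alias("right_customer_name"),
        F.col("l.amount"),
    )
)

joined.show()
```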

6. Data Reshaping

  • pivot: Restructure data from long to wide format

    • Select dimension columns that remain as rows

    • Choose the column whose values become new columns

    • Specify the values column to populate the new structure

    • Define aggregation functions for when multiple values map to the same cell

  • unpivot: Restructure data from wide to long format

    • Select key columns to keep as dimensions

    • Choose value columns to convert to rows

    • Specify name for the new category column

    • Specify name for the new value column
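
The sketch below shows both reshaping directions on a toy dataset, assuming Spark-style semantics. The stack() expression is just one way to illustrate unpivoting and is not necessarily how the platform implements it.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

long_df = spark.createDataFrame(
    [("EU", "Q1", 100), ("EU", "Q2", 120), ("US", "Q1", 75)],
    ["region", "quarter", "sales"],
)

# pivot (long -> wide): "region" stays as a row, distinct "quarter" values become
# columns, and sum() resolves cases where several rows map to the same cell.
wide_df = long_df.groupBy("region").pivot("quarter").agg(F.sum("sales"))
wide_df.show()

# unpivot (wide -> long): fold the Q1/Q2 columns back into (quarter, sales) rows.
back_to_long = wide_df.selectExpr(
    "region",
    "stack(2, 'Q1', Q1, 'Q2', Q2) as (quarter, sales)",
)
back_to_long.show()
```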

7. Advanced Transformations

  • select_expression: Apply expressions to create or transform columns

    • Use SQL-like expressions to create or modify columns

    • Include "*" to keep all existing columns

    • Create calculated fields with arithmetic operations

    • Apply conditional logic with CASE statements

  • deduplicate: Remove duplicate rows

    • Specify columns to check for duplicates

    • Control which duplicate record to keep (first, last)

  • union: Combine rows from two dataframes

    • Choose whether to require matching columns

    • Specify whether to remove duplicates

    • Useful for combining data from multiple sources with similar structure
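
To make the three transformation types above concrete, here is an illustrative PySpark sketch with invented sample data. Note that choosing which duplicate to keep (first or last) normally requires an explicit ordering, which a plain drop-duplicates call does not provide.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

jan = spark.createDataFrame([(1, 100.0), (1, 100.0), (2, 80.0)], ["order_id", "amount"])
feb = spark.createDataFrame([(3, 55.0)], ["order_id", "amount"])

# select_expression: SQL-like expressions; "*" keeps all existing columns.
with_calc = jan.selectExpr(
    "*",
    "amount * 1.2 AS amount_with_tax",
    "CASE WHEN amount >= 100 THEN 'large' ELSE 'small' END AS size_band",
)

# deduplicate: drop repeated rows, here based on order_id.
deduped = with_calc.dropDuplicates(["order_id"])

# union: append rows from a second dataframe; unionByName aligns columns by name,
# and dropDuplicates() removes any repeated rows from the combined result.
combined = jan.unionByName(feb).dropDuplicates()

with_calc.show()
deduped.show()
combined.show()
```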

Managing Multiple Transformations

  1. Transformations are applied in sequence, with the output of one feeding into the next

  2. You can add multiple transformations by clicking the Add Transformation button again

  3. Each transformation appears in the visual pipeline view

  4. You can edit or delete transformations by selecting them in the pipeline
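
Conceptually, a pipeline of UI-configured steps behaves like a chain of dataframe calls in which each step consumes the previous step's output. An illustrative PySpark equivalent (the data and step order are invented; the comments mirror the transformation types):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

raw = spark.createDataFrame(
    [("1", "EU", "250", "ACTIVE"), ("2", "US", "80", "CLOSED")],
    ["order_id", "region", "amount", "status"],
)

# Each step consumes the previous step's output, exactly as in the visual pipeline.
result = (
    raw
    .select("order_id", "region", "amount", "status")       # 1. select_columns
    .withColumn("amount", F.col("amount").cast("int"))      # 2. cast
    .filter("status = 'ACTIVE'")                             # 3. filter_with_condition
    .groupBy("region").agg(F.sum("amount").alias("total"))   # 4. group_by
)

result.show()
```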

Transformation Pipeline Configuration

The UI simplifies the process, but behind the scenes, the transformation pipeline consists of:

  1. Config Section: Controls the computing resources for your transformation

  2. Inputs Section: Defines the data sources to process

  3. Transformations Section: The sequence of transformations to apply

  4. Finalisers Section: How the transformed data should be written and validated

The UI handles most of these configurations automatically, but understanding the pipeline structure can help when building complex transformations.
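
As an illustration only, a pipeline with these four sections might be represented roughly like the Python structure below. The keys, section names, and values are hypothetical; the UI generates the real configuration, and the platform's actual format may differ.

```python
# Hypothetical sketch of a pipeline definition; not the platform's documented format.
pipeline_config = {
    "config": {             # Config: computing resources for the run
        "executor_instances": 2,
        "executor_memory": "4g",
    },
    "inputs": [             # Inputs: the data sources to process
        {"name": "raw_orders", "dataframe": "orders_source"},
    ],
    "transformations": [    # Transformations: applied in order, top to bottom
        {"type": "filter_with_condition", "name": "active_only",
         "condition": "status = 'ACTIVE'"},
        {"type": "select_columns", "name": "keep_core_columns",
         "columns": ["order_id", "region", "amount"]},
    ],
    "finalisers": [         # Finalisers: how the output is written and validated
        {"type": "write_output", "target": "orders_data_product"},
    ],
}
```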

Previewing and Testing Transformations

  1. Before applying transformations to your entire dataset, you can preview results:

    • Use the preview feature to see how transformations affect a small sample of data

    • Check that the column types and values appear as expected

    • Verify that the transformations produce the desired output structure

  2. Run the transformation on a limited dataset:

    • For large datasets, consider testing on a subset first

    • Monitor execution time and resource usage

    • Adjust transformation parameters as needed for performance
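
Outside the UI's preview feature, the same idea can be sketched by limiting the input to a small sample and inspecting the result, assuming a Spark-style runtime (the data and column names are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

raw = spark.createDataFrame(
    [("1", "250"), ("2", "80"), ("3", "bad-value")],
    ["order_id", "amount"],
)

# Test the transformation on a small sample before running it over the full dataset.
sample = raw.limit(100)
transformed = sample.withColumn("amount", F.col("amount").cast("int"))

transformed.printSchema()  # check that column types match the expected schema
transformed.show(10)       # eyeball a few rows; the unparseable value becomes NULL after the cast
```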

Monitoring Transformation Execution

Once you've configured your transformations:

  1. Execute the transformation pipeline

  2. Monitor progress in the execution log

  3. Check for success or failure status

  4. If failed, review the logs for error details:

    • Schema mismatches

    • Data quality issues

    • Resource limitations

    • Syntax errors

Important: If a transformation fails but no error appears in the logs, it may be due to a schema mismatch. The output data structure must match the schema defined for the data product. Check the Data Quality tab for validation results.

Best Practices

  • Preview first: Always test with preview=True before running full transformations

  • Schema alignment: Ensure final output exactly matches data product schema

  • Error handling: Check both compute logs and quality validations for failures

  • Resource sizing: Adjust executor instances and memory based on data volume

  • Step validation: Use the builder state to verify the schema at each transformation step (see the sketch after this list)

  • Modularity: Keep transformations simple and modular

  • Built-in types: Prefer built-in transformation types over custom SQL when possible

  • Filter early: For large datasets, filter early in the pipeline to reduce data volume
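
The exact builder-state API is platform-specific, but the idea of verifying the schema after each step can be sketched in PySpark as follows (the expected schema string and sample data are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

raw = spark.createDataFrame([("1", "250")], ["order_id", "amount"])

# Check the schema after each step so a mismatch is caught at the step that caused it,
# not only when the output is validated against the data product schema.
step1 = raw.select("order_id", "amount")
print(step1.schema.simpleString())

step2 = step1.withColumn("amount", F.col("amount").cast("int"))
print(step2.schema.simpleString())

# The final output schema must match the data product's schema.
expected = "struct<order_id:string,amount:int>"
assert step2.schema.simpleString() == expected, "schema drift detected"
```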
