Configuring Data Transformations through the UI

Transformations convert raw data into usable data products by applying a sequence of processing steps to the input data. They define how the data from your sources is processed to match the expected schema of your data product.

When to Use Transformations

  • When raw data needs processing before business use

  • When you need to standardize, clean, or enrich data

  • When you need to combine data from multiple sources

  • When you need to apply business rules or calculations

  • When you need to restructure data to match consumer requirements

Adding Transformations

  1. Navigate to the Data Catalog or the Data Landscape

  2. Search for the data product you want to add transformations to

  3. Navigate to the Lineage tab of your data product

  4. Click the Edit Transformation button

  5. In the "Add Transformation" dialog, you'll need to specify:

    • Data Frame: Select the input data source

    • Transformation: Choose the type of transformation to apply (e.g., "select_columns")

    • Transformation name: Give your transformation a descriptive name

    • Columns: Select which columns to include or modify

  6. For a "select_columns" transformation:

    • Check the boxes next to columns you want to include in your output

    • Each column shows its data type (e.g., VARCHAR)

  7. Click OK to add the transformation to your pipeline

Transformation Types in Detail

MeshX Foundation supports numerous transformation types to manipulate and refine your data:

1. Column Selection and Renaming

  • select_columns: Choose which columns to include in the output

    • Use the checkboxes to select columns to keep

    • Deselected columns will be excluded from the output

  • rename_column: Change column names for clarity or standardization

    • Specify old column names and their new names

    • Helpful for standardizing naming conventions or making cryptic column names more readable
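
The UI generates these steps for you, but it can help to see the equivalent dataframe operations. Below is a minimal PySpark sketch of the same semantics; the table and column names are invented for illustration, and the platform's actual engine and generated code may differ.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical raw input with more columns than the data product needs.
raw = spark.createDataFrame(
    [(1, "Alice", "2024-01-05", "internal-note")],
    ["cust_id", "cust_nm", "signup_dt", "debug_info"],
)

# select_columns: keep only the checked columns; everything else is dropped.
selected = raw.select("cust_id", "cust_nm", "signup_dt")

# rename_column: map cryptic source names to standardized output names.
renamed = (
    selected
    .withColumnRenamed("cust_nm", "customer_name")
    .withColumnRenamed("signup_dt", "signup_date")
)

renamed.show()
```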

2. Data Type Modifications

  • cast: Convert data types (e.g., string to integer, string to date)

    • Select columns to convert

    • Choose the target data type for each column

    • Supported types include INTEGER, VARCHAR, DECIMAL, DATE, TIMESTAMP, and BOOLEAN, among others
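
For reference, a cast behaves like the following PySpark sketch; the columns, values, and target types are illustrative rather than taken from the platform.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input where every field arrived as a string.
raw = spark.createDataFrame(
    [("42", "19.99", "2024-03-01")],
    ["quantity", "unit_price", "order_date"],
)

# cast: convert each selected column to its target type.
typed = raw.select(
    F.col("quantity").cast("int"),
    F.col("unit_price").cast("decimal(10,2)"),
    F.col("order_date").cast("date"),
)

typed.printSchema()
```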

3. Filtering and Conditions

  • filter_with_condition: Filter rows based on specific criteria

    • Create conditions like "column_name > 100" or "status = 'ACTIVE'"

    • Supports complex expressions using AND, OR, and parentheses

    • Only rows matching the condition are kept in the output
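
Conceptually this is a dataframe filter with a SQL-style predicate, as in the illustrative snippet below (the columns, values, and condition are made up).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

orders = spark.createDataFrame(
    [(1, 250, "ACTIVE"), (2, 80, "ACTIVE"), (3, 500, "CANCELLED")],
    ["order_id", "amount", "status"],
)

# filter_with_condition: only rows matching the condition survive.
kept = orders.filter("amount > 100 AND status = 'ACTIVE'")

kept.show()  # keeps only order 1
```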

4. Data Aggregation

  • group_by: Aggregate data based on specific dimensions

    • Select grouping columns (dimensions)

    • Define aggregation functions for measure columns

    • Supported aggregations: sum, count, min, max, avg, etc.

    • Creates summary statistics for each unique combination of dimension values
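
A small PySpark sketch of the same aggregation logic, with invented dimensions and measures:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

sales = spark.createDataFrame(
    [("EU", "A", 100), ("EU", "B", 40), ("US", "A", 75)],
    ["region", "product", "amount"],
)

# group_by: dimensions form the grouping key, measures get aggregation functions.
summary = sales.groupBy("region").agg(
    F.sum("amount").alias("total_amount"),
    F.count("*").alias("order_count"),
    F.avg("amount").alias("avg_amount"),
)

summary.show()  # one summary row per region
```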

5. Data Joining

  • join_rename_select: Join two dataframes and handle column naming

    • Select the type of join (inner, left, right, full)

    • Specify join conditions (equality or other comparisons)

    • Choose which columns to include from each source

    • Rename columns to resolve conflicts or improve clarity

    • Note: Columns are prefixed with "left_" and "right_" in the joined result
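
The snippet below sketches an equivalent join in PySpark. The explicit aliasing and renaming mimic how the joined columns are disambiguated; the platform applies its left_/right_ prefixes automatically, whereas here the renaming is done by hand and all names are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

orders = spark.createDataFrame([(1, 101, 250)], ["order_id", "customer_id", "amount"])
customers = spark.createDataFrame([(101, "Alice")], ["customer_id", "name"])

# Left join on an equality condition, then select and rename to avoid ambiguous names.
joined = (
    orders.alias("l")
    .join(customers.alias("r"), F.col("l.customer_id") == F.col("r.customer_id"), "left")
    .select(
        F.col("l.order_id"),
        F.col("l.customer_id").alias("left_customer_id"),
        F.col("r.name").alias("right_customer_name"),
        F.col("l.amount"),
    )
)

joined.show()
```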

6. Data Reshaping

  • pivot: Restructure data from long to wide format

    • Select dimension columns that remain as rows

    • Choose the column whose values become new columns

    • Specify the values column to populate the new structure

    • Define aggregation functions for when multiple values map to the same cell

  • unpivot: Restructure data from wide to long format

    • Select key columns to keep as dimensions

    • Choose value columns to convert to rows

    • Specify name for the new category column

    • Specify name for the new value column
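
The sketch below shows both reshaping directions on a toy dataset, assuming Spark-style semantics. The stack() expression is just one way to illustrate unpivoting and is not necessarily how the platform implements it.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

long_df = spark.createDataFrame(
    [("EU", "Q1", 100), ("EU", "Q2", 120), ("US", "Q1", 75)],
    ["region", "quarter", "sales"],
)

# pivot (long -> wide): "region" stays as a row, distinct "quarter" values become
# columns, and sum() resolves cases where several rows map to the same cell.
wide_df = long_df.groupBy("region").pivot("quarter").agg(F.sum("sales"))
wide_df.show()

# unpivot (wide -> long): fold the Q1/Q2 columns back into (quarter, sales) rows.
back_to_long = wide_df.selectExpr(
    "region",
    "stack(2, 'Q1', Q1, 'Q2', Q2) as (quarter, sales)",
)
back_to_long.show()
```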

7. Advanced Transformations

  • select_expression: Apply expressions to create or transform columns

    • Use SQL-like expressions to create or modify columns

    • Include "*" to keep all existing columns

    • Create calculated fields with arithmetic operations

    • Apply conditional logic with CASE statements

  • deduplicate: Remove duplicate rows

    • Specify columns to check for duplicates

    • Control which duplicate record to keep (first, last)

  • union: Combine rows from two dataframes

    • Choose whether to require matching columns

    • Specify whether to remove duplicates

    • Useful for combining data from multiple sources with similar structure
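
To make the three transformation types above concrete, here is an illustrative PySpark sketch with invented sample data. Note that choosing which duplicate to keep (first or last) normally requires an explicit ordering, which a plain drop-duplicates call does not provide.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

jan = spark.createDataFrame([(1, 100.0), (1, 100.0), (2, 80.0)], ["order_id", "amount"])
feb = spark.createDataFrame([(3, 55.0)], ["order_id", "amount"])

# select_expression: SQL-like expressions; "*" keeps all existing columns.
with_calc = jan.selectExpr(
    "*",
    "amount * 1.2 AS amount_with_tax",
    "CASE WHEN amount >= 100 THEN 'large' ELSE 'small' END AS size_band",
)

# deduplicate: drop repeated rows, here based on order_id.
deduped = with_calc.dropDuplicates(["order_id"])

# union: append rows from a second dataframe; unionByName aligns columns by name,
# and dropDuplicates() removes any repeated rows from the combined result.
combined = jan.unionByName(feb).dropDuplicates()

with_calc.show()
deduped.show()
combined.show()
```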

Managing Multiple Transformations

  1. Transformations are applied in sequence, with the output of one feeding into the next

  2. You can add multiple transformations by clicking the Add Transformation button again

  3. Each transformation appears in the visual pipeline view

  4. You can edit or delete transformations by selecting them in the pipeline
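
Conceptually, a pipeline of UI-configured steps behaves like a chain of dataframe calls in which each step consumes the previous step's output. An illustrative PySpark equivalent (the data and step order are invented; the comments mirror the transformation types):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

raw = spark.createDataFrame(
    [("1", "EU", "250", "ACTIVE"), ("2", "US", "80", "CLOSED")],
    ["order_id", "region", "amount", "status"],
)

# Each step consumes the previous step's output, exactly as in the visual pipeline.
result = (
    raw
    .select("order_id", "region", "amount", "status")       # 1. select_columns
    .withColumn("amount", F.col("amount").cast("int"))      # 2. cast
    .filter("status = 'ACTIVE'")                             # 3. filter_with_condition
    .groupBy("region").agg(F.sum("amount").alias("total"))   # 4. group_by
)

result.show()
```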

Transformation Pipeline Configuration

The UI simplifies the process, but behind the scenes, the transformation pipeline consists of:

  1. Config Section: Controls the computing resources for your transformation

  2. Inputs Section: Defines the data sources to process

  3. Transformations Section: The sequence of transformations to apply

  4. Finalisers Section: How the transformed data should be written and validated

The UI handles most of these configurations automatically, but understanding the pipeline structure can help when building complex transformations.
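
As an illustration only, a pipeline with these four sections might be represented roughly like the Python structure below. The keys, section names, and values are hypothetical; the UI generates the real configuration, and the platform's actual format may differ.

```python
# Hypothetical sketch of a pipeline definition; not the platform's documented format.
pipeline_config = {
    "config": {             # Config: computing resources for the run
        "executor_instances": 2,
        "executor_memory": "4g",
    },
    "inputs": [             # Inputs: the data sources to process
        {"name": "raw_orders", "dataframe": "orders_source"},
    ],
    "transformations": [    # Transformations: applied in order, top to bottom
        {"type": "filter_with_condition", "name": "active_only",
         "condition": "status = 'ACTIVE'"},
        {"type": "select_columns", "name": "keep_core_columns",
         "columns": ["order_id", "region", "amount"]},
    ],
    "finalisers": [         # Finalisers: how the output is written and validated
        {"type": "write_output", "target": "orders_data_product"},
    ],
}
```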

Previewing and Testing Transformations

  1. Before applying transformations to your entire dataset, you can preview results:

    • Use the preview feature to see how transformations affect a small sample of data

    • Check that the column types and values appear as expected

    • Verify that the transformations produce the desired output structure

  2. Run the transformation on a limited dataset:

    • For large datasets, consider testing on a subset first

    • Monitor execution time and resource usage

    • Adjust transformation parameters as needed for performance
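
Outside the UI's preview feature, the same idea can be sketched by limiting the input to a small sample and inspecting the result, assuming a Spark-style runtime (the data and column names are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

raw = spark.createDataFrame(
    [("1", "250"), ("2", "80"), ("3", "bad-value")],
    ["order_id", "amount"],
)

# Test the transformation on a small sample before running it over the full dataset.
sample = raw.limit(100)
transformed = sample.withColumn("amount", F.col("amount").cast("int"))

transformed.printSchema()  # check that column types match the expected schema
transformed.show(10)       # eyeball a few rows; the unparseable value becomes NULL after the cast
```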

Monitoring Transformation Execution

Once you've configured your transformations:

  1. Execute the transformation pipeline

  2. Monitor progress in the execution log

  3. Check for success or failure status

  4. If failed, review the logs for error details:

    • Schema mismatches

    • Data quality issues

    • Resource limitations

    • Syntax errors

Important: If a transformation fails but no error appears in the logs, it may be due to a schema mismatch. The output data structure must match the schema defined for the data product. Check the Data Quality tab for validation results.

Best Practices

  • Preview first: Always test with preview=True before running full transformations

  • Schema alignment: Ensure final output exactly matches data product schema

  • Error handling: Check both compute logs and quality validations for failures

  • Resource sizing: Adjust executor instances and memory based on data volume

  • Step validation: Use the builder state to verify the schema at each transformation step (see the sketch after this list)

  • Modularity: Keep transformations simple and modular

  • Built-in types: Prefer built-in transformation types over custom SQL when possible

  • Filter early: For large datasets, filter early in the pipeline to reduce data volume
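
The exact builder-state API is platform-specific, but the idea of verifying the schema after each step can be sketched in PySpark as follows (the expected schema string and sample data are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

raw = spark.createDataFrame([("1", "250")], ["order_id", "amount"])

# Check the schema after each step so a mismatch is caught at the step that caused it,
# not only when the output is validated against the data product schema.
step1 = raw.select("order_id", "amount")
print(step1.schema.simpleString())

step2 = step1.withColumn("amount", F.col("amount").cast("int"))
print(step2.schema.simpleString())

# The final output schema must match the data product's schema.
expected = "struct<order_id:string,amount:int>"
assert step2.schema.simpleString() == expected, "schema drift detected"
```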
