Configuring Data Transformations through the UI
Transformations convert raw data into usable data products by applying a sequence of processing steps. They define how data from your sources is cleaned, combined, and reshaped to match the expected schema of your data product.
When to Use Transformations
When raw data needs processing before business use
When you need to standardize, clean, or enrich data
When you need to combine data from multiple sources
When you need to apply business rules or calculations
When you need to restructure data to match consumer requirements
Adding Transformations
Navigate to the Data Catalog or the Data Landscape
Search for the data product you want to add transformations for
Navigate to the Lineage tab of your data product
Click the Edit Transformation button
In the "Add Transformation" dialog, you'll need to specify:
Data Frame: Select the input data source
Transformation: Choose the type of transformation to apply (e.g., "select_columns")
Transformation name: Give your transformation a descriptive name
Columns: Select which columns to include or modify
For a "select_columns" transformation:
Check the boxes next to columns you want to include in your output
Each column shows its data type (e.g., VARCHAR)
Click OK to add the transformation to your pipeline

Transformation Types in Detail
MeshX Foundation supports numerous transformation types to manipulate and refine your data:
1. Column Selection and Renaming
select_columns: Choose which columns to include in the output
Use the checkboxes to select columns to keep
Deselected columns will be excluded from the output
rename_column: Change column names for clarity or standardization
Specify old column names and their new names
Helpful for standardizing naming conventions or making cryptic column names more readable
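These steps are configured entirely through the UI, but if you think in code, the sketch below shows roughly equivalent operations in PySpark. PySpark is used here purely as an illustration (it is not the MeshX Foundation API), and the raw_orders table and column names are hypothetical.

```python
# Illustrative PySpark equivalent (not the MeshX Foundation API).
# Table and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.table("raw_orders")

# select_columns: keep only the checked columns; everything else is dropped
selected = df.select("order_id", "cust_nm", "order_total")

# rename_column: standardize a cryptic source name into a readable one
renamed = selected.withColumnRenamed("cust_nm", "customer_name")
```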
2. Data Type Modifications
cast: Convert data types (e.g., string to integer, string to date)
Select columns to convert
Choose the target data type for each column
Supported types include: INTEGER, VARCHAR, DECIMAL, DATE, TIMESTAMP, BOOLEAN, etc.
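As another illustrative PySpark sketch (hypothetical table and column names), a cast step corresponds to converting column types like this:

```python
# Illustrative PySpark equivalent; hypothetical table and column names.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.table("raw_orders")

# cast: convert source strings into typed columns
typed = (
    df.withColumn("order_total", col("order_total").cast("decimal(10,2)"))
      .withColumn("order_date", col("order_date").cast("date"))
      .withColumn("is_priority", col("is_priority").cast("boolean"))
)
```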
3. Filtering and Conditions
filter_with_condition: Filter rows based on specific criteria
Create conditions like "column_name > 100" or "status = 'ACTIVE'"
Supports complex expressions using AND, OR, and parentheses
Only rows matching the condition are kept in the output
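An illustrative PySpark equivalent of a filter_with_condition step, using a hypothetical predicate that combines AND, OR, and parentheses:

```python
# Illustrative PySpark equivalent; hypothetical columns and thresholds.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.table("raw_orders")

# filter_with_condition: only rows matching the predicate are kept
filtered = df.filter("(order_total > 100 OR is_priority = true) AND status = 'ACTIVE'")
```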
4. Data Aggregation
group_by: Aggregate data based on specific dimensions
Select grouping columns (dimensions)
Define aggregation functions for measure columns
Supported aggregations: sum, count, min, max, avg, etc.
Creates summary statistics for each unique combination of dimension values
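A group_by step corresponds roughly to the following PySpark aggregation (illustrative only; the dimensions and measures are hypothetical):

```python
# Illustrative PySpark equivalent; hypothetical dimensions and measures.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.table("raw_orders")

# group_by: one output row per (region, product_line) combination
summary = df.groupBy("region", "product_line").agg(
    F.sum("order_total").alias("total_revenue"),
    F.count("*").alias("order_count"),
    F.avg("order_total").alias("avg_order_value"),
)
```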
5. Data Joining
join_rename_select: Join two dataframes and handle column naming
Select the type of join (inner, left, right, full)
Specify join conditions (equality or other comparisons)
Choose which columns to include from each source
Rename columns to resolve conflicts or improve clarity
Note: Columns are prefixed with "left_" and "right_" in the joined result
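The sketch below shows what a join_rename_select step amounts to in PySpark terms: a join followed by selecting and renaming columns, with aliases mimicking the "left_"/"right_" prefixes. The tables, join key, and join type are hypothetical.

```python
# Illustrative PySpark equivalent; hypothetical tables, keys, and join type.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
orders = spark.table("raw_orders")
customers = spark.table("raw_customers")

# join_rename_select: left join, then pick and rename columns.
# The aliases mimic the "left_"/"right_" prefixes in the joined result.
joined = (
    orders.join(customers, orders["customer_id"] == customers["id"], "left")
    .select(
        orders["order_id"].alias("left_order_id"),
        orders["order_total"].alias("left_order_total"),
        customers["name"].alias("right_customer_name"),
    )
)
```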
6. Data Reshaping
pivot: Restructure data from long to wide format
Select dimension columns that remain as rows
Choose the column whose values become new columns
Specify the values column to populate the new structure
Define aggregation functions for when multiple values map to the same cell
unpivot: Restructure data from wide to long format
Select key columns to keep as dimensions
Choose value columns to convert to rows
Specify name for the new category column
Specify name for the new value column
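In PySpark terms (illustrative only, with hypothetical quarterly revenue data), pivot and unpivot correspond roughly to:

```python
# Illustrative PySpark equivalent; hypothetical quarter/revenue data.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.table("quarterly_revenue")

# pivot: one row per region, one column per quarter value, cells summed
wide = df.groupBy("region").pivot("quarter").agg(F.sum("revenue"))

# unpivot: fold the quarter columns back into (quarter, revenue) rows
# (stack() is the classic idiom; Spark 3.4+ also offers DataFrame.unpivot)
long_form = wide.selectExpr(
    "region",
    "stack(4, 'Q1', Q1, 'Q2', Q2, 'Q3', Q3, 'Q4', Q4) AS (quarter, revenue)",
)
```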
7. Advanced Transformations
select_expression: Apply expressions to create or transform columns
Use SQL-like expressions to create or modify columns
Include "*" to keep all existing columns
Create calculated fields with arithmetic operations
Apply conditional logic with CASE statements
deduplicate: Remove duplicate rows
Specify columns to check for duplicates
Control which duplicate record to keep (first, last)
union: Combine rows from two dataframes
Choose whether to require matching columns
Specify whether to remove duplicates
Useful for combining data from multiple sources with similar structure
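The following sketch illustrates select_expression, deduplicate, and union with rough PySpark equivalents. Table names, columns, and business rules are hypothetical, and the product's own expression syntax may differ in detail.

```python
# Illustrative PySpark equivalents; hypothetical tables, columns, and rules.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.table("raw_orders")
df_archive = spark.table("orders_archive")

# select_expression: keep everything ("*") and add calculated/conditional columns
expr = df.selectExpr(
    "*",
    "quantity * unit_price AS line_total",
    "CASE WHEN order_total > 1000 THEN 'HIGH' ELSE 'STANDARD' END AS value_tier",
)

# deduplicate: one row per order_id (sort first if keeping "first"/"last" matters)
deduped = expr.dropDuplicates(["order_id"])

# union: stack rows from a second, similarly structured source by column name,
# then drop exact duplicates
combined = df.unionByName(df_archive, allowMissingColumns=True).dropDuplicates()
```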
Managing Multiple Transformations
Transformations are applied in sequence, with the output of one feeding into the next
You can add multiple transformations by clicking the Add Transformation button again
Each transformation appears in the visual pipeline view
You can edit or delete transformations by selecting them in the pipeline
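Conceptually, a chain of transformations behaves like method calls applied one after another, as in this illustrative PySpark sketch (hypothetical tables and columns):

```python
# Illustrative PySpark sketch of sequential transformations; hypothetical names.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

result = (
    spark.table("raw_orders")
    .select("order_id", "region", "order_total", "status")  # select_columns
    .filter("status = 'ACTIVE'")                             # filter_with_condition
    .groupBy("region")                                       # group_by
    .agg(F.sum("order_total").alias("total_revenue"))
)
```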
Transformation Pipeline Configuration
The UI simplifies the process, but behind the scenes, the transformation pipeline consists of:
Config Section: Controls the computing resources for your transformation
Inputs Section: Defines the data sources to process
Transformations Section: The sequence of transformations to apply
Finalisers Section: How the transformed data should be written and validated
The UI handles most of these configurations automatically, but understanding the pipeline structure can help when building complex transformations.
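As a rough mental model only, the pipeline generated behind the scenes can be pictured as a structure like the Python dictionary below. All field names and values here are hypothetical; the UI produces the real configuration for you, and the actual format may differ.

```python
# Hypothetical sketch of the four-part pipeline; field names and values are
# illustrative only -- the UI generates the real configuration for you.
pipeline = {
    "config": {"executor_instances": 2, "executor_memory": "4g"},
    "inputs": [{"name": "raw_orders", "source": "landing_zone.orders"}],
    "transformations": [
        {"type": "filter_with_condition", "name": "active_only",
         "condition": "status = 'ACTIVE'"},
        {"type": "select_columns", "name": "keep_core_columns",
         "columns": ["order_id", "region", "order_total"]},
    ],
    "finalisers": [{"type": "write_output", "validate_schema": True}],
}
```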
Previewing and Testing Transformations
Before applying transformations to your entire dataset, you can preview results:
Use the preview feature to see how transformations affect a small sample of data
Check that the column types and values appear as expected
Verify that the transformations produce the desired output structure
Run the transformation on a limited dataset:
For large datasets, consider testing on a subset first
Monitor execution time and resource usage
Adjust transformation parameters as needed for performance
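One simple way to test on a subset, shown here as an illustrative PySpark sketch with a hypothetical input table, is to limit the sample size and then inspect the schema and a few rows:

```python
# Illustrative PySpark sketch; hypothetical input table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Work on a small slice while iterating on transformation parameters
sample = spark.table("raw_orders").limit(1000)

sample.printSchema()  # confirm column names and types look right
sample.show(20)       # eyeball a few rows of output
```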
Monitoring Transformation Execution
Once you've configured your transformations:
Execute the transformation pipeline
Monitor progress in the execution log
Check for success or failure status
If failed, review the logs for error details:
Schema mismatches
Data quality issues
Resource limitations
Syntax errors
Important: If a transformation fails but no error appears in the logs, it may be due to a schema mismatch. The output data structure must match the schema defined for the data product. Check the Data Quality tab for validation results.
Best Practices
Preview first: Always test with preview=True before running full transformations
Schema alignment: Ensure final output exactly matches data product schema
Error handling: Check both compute logs and quality validations for failures
Resource sizing: Adjust executor instances and memory based on data volume
Step validation: Use builder state to verify schema at each transformation step
Keep transformations simple and modular
Prefer built-in transformation types over custom SQL when possible
For large datasets, filter early in the pipeline to reduce data volume