Using a LightGBM Model
Model Overview
LightGBM (Light Gradient Boosting Machine) is a high-performance, open-source gradient boosting framework. Within Foundation, it is used for regression and time-series forecasting tasks, predicting continuous numerical values from historical patterns and relationships in your data. The model is particularly effective for business metrics forecasting, demand prediction, estimating performance indicators, and any scenario where you need to predict future numerical values from past observations.
The LightGBM implementation in Foundation provides advanced capabilities including automatic feature engineering for time-series data, intelligent feature selection based on importance scores, and automated hyperparameter tuning to optimize model performance.
Training and Inference Approach
LightGBM in Foundation operates exclusively through a training/inference approach, which means the workflow is divided into two distinct phases. First, you train a model using historical data, which Foundation then stores in its model repository. Subsequently, you can load this trained model to generate predictions on new data. This separation ensures model reproducibility, version control, and the ability to deploy the same model across different data products or time periods.
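To make the two phases concrete, the following minimal Python sketch shows the same pattern using the open-source lightgbm package directly: train a model, persist it as an artifact, then reload that artifact later for inference. The file paths and column names are placeholders; within Foundation, the model repository and the data products described below take the place of the local files shown here.

import lightgbm as lgb
import pandas as pd

# --- Phase 1: train on historical data and persist the model ---
history = pd.read_csv("train.csv")  # placeholder: numeric historical features
X = history.drop(columns=["performance_metric"])
y = history["performance_metric"]

model = lgb.LGBMRegressor(num_leaves=31, max_depth=6,
                          learning_rate=0.01, n_estimators=200)
model.fit(X, y)
model.booster_.save_model("model.txt")  # the stored, versionable artifact

# --- Phase 2: reload the stored artifact and generate predictions ---
booster = lgb.Booster(model_file="model.txt")
new_data = pd.read_csv("new_features.csv")  # placeholder: same feature columns
predictions = booster.predict(new_data)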
Model Metadata Data Product
The first step in implementing LightGBM is creating a data product that handles model training and stores the resulting metadata. This data product processes your feature data, trains the model according to your specifications, and outputs comprehensive information about the trained model including its location, performance metrics, and configuration.
Schema Configuration
The metadata data product requires a schema that captures all essential information about the trained model. Here's the standard schema structure:
{
  "details": {
    "data_product_type": "stored",
    "fields": [
      {
        "name": "metadata",
        "primary": false,
        "optional": true,
        "data_type": {"column_type": "VARCHAR"},
        "classification": "internal"
      },
      {
        "name": "model_path",
        "primary": false,
        "optional": true,
        "data_type": {"column_type": "VARCHAR"},
        "classification": "internal"
      },
      {
        "name": "metadata_path",
        "primary": false,
        "optional": true,
        "data_type": {"column_type": "VARCHAR"},
        "classification": "internal"
      },
      {
        "name": "version",
        "primary": false,
        "optional": true,
        "data_type": {"column_type": "VARCHAR"},
        "classification": "internal"
      },
      {
        "name": "model_type",
        "primary": false,
        "optional": true,
        "data_type": {"column_type": "VARCHAR"},
        "classification": "internal"
      },
      {
        "name": "created_at",
        "primary": false,
        "optional": true,
        "data_type": {"column_type": "TIMESTAMPTZ"},
        "classification": "internal"
      }
    ]
  }
}

Builder Configuration
The builder configuration defines how the model training transformation executes. It includes data preprocessing steps, the training transformation itself, and configuration for computational resources:
{
  "config": {
    "docker_tag": "0.0.58",
    "executor_core_request": "800m",
    "executor_core_limit": "4000m",
    "executor_instances": 1,
    "executor_memory": "7168m",
    "driver_memory": "3072m"
  },
  "inputs": {
    "input_ml_features": {
      "input_type": "data_product",
      "identifier": "<data_product_id>",
      "preview_limit": 10
    }
  },
  "transformations": [
    {
      "transform": "filter_with_condition",
      "input": "input_ml_features",
      "output": "cleaned_data",
      "condition": "year >= 2023 AND date <= current_date()"
    },
    {
      "transform": "regression_training",
      "input": "cleaned_data",
      "output": "model_metadata",
      "timestamp_col": "date",
      "target_cols": ["performance_metric"],
      "feature_cols": [],
      "drop_cols": [],
      "random_seed": 42,
      "time_features": ["hour", "day_of_week", "month", "day_of_month"],
      "train_ratio": 0.8,
      "feature_selection_threshold": 0.01,
      "enable_parameter_tuning": true,
      "forecast_horizon": 1,
      "forecast_granularity": "day",
      "parameter_tuning": {
        "search_type": "random",
        "n_configs": 3,
        "numLeaves": [15, 31, 63],
        "maxDepth": [4, 6, 8],
        "learningRate": [0.01, 0.05, 0.1],
        "numIterations": [100, 200]
      },
      "model_bucket": "models",
      "project_name": "example_project"
    }
  ],
  "finalisers": {
    "input": "model_metadata",
    "enable_quality": false,
    "write_config": {"mode": "append"}
  }
}

The regression_training transformation accepts numerous parameters that control the training process. When feature_cols is left empty, the transformation automatically selects all numeric columns as features. The time_features parameter enables automatic generation of temporal features from the timestamp column, enriching the model with cyclical patterns. The enable_parameter_tuning option activates hyperparameter optimization, testing different configurations to find the optimal model settings.
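As one illustration of that behaviour (a sketch of the documented semantics, not Foundation's actual implementation), automatic feature selection and time-feature generation might look roughly like this in pandas; the function name build_features is hypothetical:

import pandas as pd

def build_features(df, timestamp_col, target_cols, drop_cols, feature_cols):
    # Hypothetical sketch of the documented behaviour of regression_training.
    df = df.copy()
    ts = pd.to_datetime(df[timestamp_col])
    # time_features: temporal columns derived from the timestamp column
    df["hour"] = ts.dt.hour
    df["day_of_week"] = ts.dt.dayofweek
    df["month"] = ts.dt.month
    df["day_of_month"] = ts.dt.day
    if not feature_cols:  # empty feature_cols: use every numeric column
        excluded = set(target_cols) | set(drop_cols) | {timestamp_col}
        feature_cols = [c for c in df.select_dtypes(include="number").columns
                        if c not in excluded]
    return df[feature_cols]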
Metadata Output Format
After successful training, the metadata data product produces a single record containing comprehensive model information:
{
  "metadata": {
    "model_name": "performance_forecast_model",
    "version": "v1",
    "training_date": "2025-01-15 14:30:00+00:00",
    "target_column": "performance_metric",
    "timestamp_column": "date",
    "feature_columns": [
      "feature_a", "feature_b", "day_of_week_cos",
      "day_of_week_sin", "month", "rolling_mean_7"
    ],
    "metrics": {
      "rmse": 8.45,
      "mae": 6.32,
      "mape": 0.15
    },
    "hyperparameters": {
      "objective": "regression",
      "numLeaves": 31,
      "maxDepth": 6,
      "learningRate": 0.01,
      "numIterations": 200
    },
    "training_samples": 500,
    "test_samples": 125
  },
  "model_path": "models/example_project/performance_metric/v1/model",
  "metadata_path": "models/example_project/performance_metric/v1/metadata.json",
  "version": "v1",
  "created_at": "2025-01-15T14:30:00.000Z"
}

Predictions Data Product
Once a model is trained and stored, the predictions data product uses it to generate forecasts. This data product loads the specified model version (or the latest version if unspecified) and applies it to generate predictions for the defined forecast horizon.
Schema Configuration
The predictions schema defines the structure of the forecasted values:
{
  "details": {
    "data_product_type": "stored",
    "fields": [
      {
        "name": "date",
        "data_type": {"column_type": "DATE"},
        "classification": "internal"
      },
      {
        "name": "performance_metric",
        "data_type": {"column_type": "DOUBLE"},
        "classification": "internal"
      },
      {
        "name": "model_version",
        "data_type": {"column_type": "VARCHAR"},
        "classification": "internal"
      },
      {
        "name": "_predicted_at",
        "data_type": {"column_type": "TIMESTAMPTZ"},
        "classification": "internal"
      }
    ]
  }
}

Builder Configuration
The predictions builder loads the trained model and generates forecasts:
{
  "config": {
    "docker_tag": "0.0.58",
    "executor_core_request": "800m",
    "executor_core_limit": "4000m",
    "executor_instances": 1,
    "executor_memory": "7168m",
    "driver_memory": "3072m"
  },
  "inputs": {
    "input_ml_features": {
      "input_type": "product"
    }
  },
  "transformations": [
    {
      "transform": "filter_with_condition",
      "input": "input_ml_features",
      "output": "filtered_data",
      "condition": "year >= 2023 AND date <= current_date()"
    },
    {
      "transform": "regression_prediction",
      "input": "filtered_data",
      "output": "predictions",
      "target_cols": ["performance_metric"],
      "timestamp_col": "date",
      "model_bucket": "models",
      "project_name": "example_project",
      "version": null,
      "forecast_horizon": 30,
      "forecast_granularity": "day",
      "history_periods": 30,
      "forecast_method": "mean"
    },
    {
      "transform": "rename_column",
      "input": "predictions",
      "output": "final_predictions",
      "changes": {
        "performance_metric_prediction": "performance_metric"
      }
    }
  ],
  "finalisers": {
    "input": "final_predictions",
    "write_config": {"mode": "overwrite"}
  }
}

The regression_prediction transformation handles model loading and inference. Setting version to null automatically uses the latest model version, though you can specify a particular version for reproducibility. The forecast_horizon determines how many future periods to predict, while history_periods defines how much historical data to use for generating predictions. The forecast_method parameter controls how historical values are aggregated when creating lagged features.
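The sketch below illustrates two of these semantics in plain Python: resolving a null version to the latest stored one, and aggregating the trailing history window when forecast_method is "mean". Both function names are hypothetical, and the logic is an assumption about the documented behaviour, not Foundation's source:

import numpy as np

def resolve_version(available_versions, version=None):
    # version null/None: fall back to the most recent stored version,
    # e.g. ["v1", "v2", "v3"] resolves to "v3".
    return version or max(available_versions, key=lambda v: int(v.lstrip("v")))

def seed_lag_value(history, history_periods=30, forecast_method="mean"):
    # Aggregate the trailing window of actuals that seeds lagged features
    # for future rows, where no observed values exist yet.
    window = np.asarray(history[-history_periods:])
    if forecast_method == "mean":
        return float(window.mean())
    raise ValueError(f"unsupported forecast_method: {forecast_method!r}")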
Prediction Output Format
The predictions data product generates forecasted values for the specified horizon:
date        performance_metric  model_version  _predicted_at
2025-01-16  87.45               v3             2025-01-15T14:30:00.000Z
2025-01-17  89.32               v3             2025-01-15T14:30:00.000Z
2025-01-18  86.78               v3             2025-01-15T14:30:00.000Z
2025-01-19  91.23               v3             2025-01-15T14:30:00.000Z
2025-01-20  88.56               v3             2025-01-15T14:30:00.000Z
Each prediction includes the forecasted date, the predicted value, the model version used, and a timestamp indicating when the predictions were generated. This structure enables tracking of prediction history and model performance over time.
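For example, once actual values become available, stored predictions can be joined back to them to monitor error per model version. This is a hypothetical pandas sketch; the file names and the actual_value column are placeholders:

import pandas as pd

# Placeholder inputs: the predictions output and the later-observed actuals.
preds = pd.read_csv("predictions.csv")  # date, performance_metric, model_version, _predicted_at
actuals = pd.read_csv("actuals.csv")    # date, actual_value

joined = preds.merge(actuals, on="date")
joined["abs_error"] = (joined["performance_metric"] - joined["actual_value"]).abs()
# Mean absolute error per model version, to compare versions over time.
print(joined.groupby("model_version")["abs_error"].mean())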
Feature Engineering and Model Optimization
Foundation's LightGBM implementation includes sophisticated feature engineering capabilities that automatically enhance your input data. When time_features are specified, the system generates cyclical encodings for temporal patterns, ensuring the model captures seasonality effectively. The framework also creates lag features and rolling statistics based on the forecast_horizon and history_periods parameters, enriching the feature space with historical context.
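In conventional pandas/NumPy terms, the generated columns listed in the metadata example above (day_of_week_sin, day_of_week_cos, rolling_mean_7) correspond to transformations like the following sketch; the exact column set Foundation produces depends on the configured time_features and horizon:

import numpy as np
import pandas as pd

def add_time_context(df, ts_col="date", target="performance_metric"):
    # Cyclical encodings map periodic values onto the unit circle, so the
    # model sees Sunday (6) and Monday (0) as neighbours, not extremes.
    dow = pd.to_datetime(df[ts_col]).dt.dayofweek
    df["day_of_week_sin"] = np.sin(2 * np.pi * dow / 7)
    df["day_of_week_cos"] = np.cos(2 * np.pi * dow / 7)
    # Lag and rolling features carry historical context into each row;
    # shift(1) keeps the current row's own value out of its features.
    df["lag_1"] = df[target].shift(1)
    df["rolling_mean_7"] = df[target].shift(1).rolling(7).mean()
    return df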
The feature selection mechanism evaluates each feature's contribution to the model's predictive power, automatically excluding features that fall below the specified threshold. This process not only improves model performance by removing noise but also reduces computational requirements and helps prevent overfitting.
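One plausible reading of feature_selection_threshold (an assumption about the mechanism, not Foundation's code) is a cut on each feature's share of total importance, as in this sketch:

def select_features(model, feature_names, threshold=0.01):
    # Normalise LightGBM's feature importances to shares of the total and
    # keep only the features whose share clears the threshold.
    importances = model.feature_importances_
    shares = importances / importances.sum()
    return [name for name, share in zip(feature_names, shares)
            if share >= threshold]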
When parameter tuning is enabled, the system tests multiple hyperparameter configurations using cross-validation on the training data. The search_type parameter determines whether to use random search or grid search, with random search typically providing good results more efficiently for larger parameter spaces. The system evaluates each configuration based on the specified metrics and selects the best performing model for deployment.
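With the parameter_tuning block shown earlier, random search draws n_configs samples from the full grid rather than evaluating all of it. A sketch of the sampling step, using the grid from the builder configuration above:

import random
from itertools import product

grid = {
    "numLeaves": [15, 31, 63],
    "maxDepth": [4, 6, 8],
    "learningRate": [0.01, 0.05, 0.1],
    "numIterations": [100, 200],
}

# 3 x 3 x 3 x 2 = 54 combinations; grid search would evaluate all 54,
# while random search samples just n_configs of them.
all_configs = [dict(zip(grid, values)) for values in product(*grid.values())]
candidates = random.sample(all_configs, k=3)  # n_configs = 3
# Each candidate is then trained with cross-validation on the training split,
# scored on the held-out folds, and the best configuration is kept.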