Using a K-Means Model
Model Overview
K-Means is an unsupervised clustering algorithm that groups similar data points into distinct clusters based on their feature similarity. Foundation's implementation provides advanced capabilities beyond standard clustering, including automatic optimal cluster selection, statistical outlier detection, and comprehensive cluster analysis. The model is particularly effective for customer segmentation, anomaly detection, performance categorization, and identifying patterns in operational data.
The K-Means implementation in Foundation incorporates several sophisticated features that enhance its effectiveness. The algorithm uses K-means++ initialization, which intelligently selects initial cluster centers to improve convergence speed and final cluster quality. The model includes automatic outlier detection through statistical analysis of distances to cluster centroids, identifying data points that deviate significantly from their assigned clusters. Additionally, the implementation provides automatic feature scaling and preprocessing, ensuring all features contribute equally to the clustering process regardless of their original scales.
Dual Operating Modes
Foundation's K-Means implementation supports two distinct operating modes, providing flexibility for different use cases and operational requirements.
Training/Inference Approach
This approach separates model training and application into distinct phases. During training, the model learns cluster centers from your data and stores them for future use. During inference, these stored cluster centers are applied to new data for consistent clustering and outlier detection. This mode is ideal for production scenarios where you need stable, versioned clustering models that can be applied consistently across different time periods or datasets.
Batch Processing Mode
This mode trains and applies the clustering model in a single execution without persistence. The model learns cluster centers from the input data and immediately applies them to generate cluster assignments and outlier classifications. This approach is optimal for one-time analyses, exploratory data analysis, or scenarios where fresh clusters are needed for each batch of data without the need for model versioning or reuse.
Training/Inference Approach
Model Metadata Data Product
The metadata data product handles K-Means model training and stores comprehensive information about the trained clustering model.
Schema Configuration
{
"details": {
"data_product_type": "stored",
"fields": [
{
"name": "metadata",
"primary": false,
"optional": true,
"data_type": {
"column_type": "VARCHAR"
},
"classification": "internal"
},
{
"name": "model_path",
"primary": false,
"optional": true,
"data_type": {
"column_type": "VARCHAR"
},
"classification": "internal"
},
{
"name": "metadata_path",
"primary": false,
"optional": true,
"data_type": {
"column_type": "VARCHAR"
},
"classification": "internal"
},
{
"name": "version",
"primary": false,
"optional": true,
"data_type": {
"column_type": "VARCHAR"
},
"classification": "internal"
},
{
"name": "model_name",
"primary": false,
"optional": true,
"data_type": {
"column_type": "VARCHAR"
},
"classification": "internal"
},
{
"name": "model_type",
"primary": false,
"optional": true,
"data_type": {
"column_type": "VARCHAR"
},
"classification": "internal"
},
{
"name": "created_at",
"primary": false,
"optional": true,
"data_type": {
"column_type": "TIMESTAMPTZ"
},
"classification": "internal"
}
]
}
}
Builder Configuration
{
"config": {
"docker_tag": "0.1.033",
"executor_core_request": "800m",
"executor_core_limit": "2000m",
"executor_instances": 1,
"min_executor_instances": 1,
"max_executor_instances": 1,
"executor_memory": "6144m",
"driver_core_request": "0.3",
"driver_core_limit": "1500m",
"driver_memory": "3072m"
},
"inputs": {
"input_operations_data": {
"input_type": "data_product",
"identifier": "operations-data-id",
"preview_limit": 10
}
},
"transformations": [
{
"transform": "filter_with_condition",
"input": "input_operations_data",
"output": "filtered_data",
"condition": "metric_value IS NOT NULL"
},
{
"transform": "k_means_training",
"input": "filtered_data",
"output": "model_metadata",
"features_to_use": [
"throughput_rate",
"efficiency_score",
"quality_metric",
"resource_utilization",
"cycle_time"
],
"drop_cols": [
"category",
"notes"
],
"number_clusters": null,
"min_number_clusters": 3,
"max_number_clusters": 10,
"optimal_cluster_selection": true,
"statistical_threshold": 3.0,
"random_seed": 42,
"model_bucket": "models",
"project_name": "operations_clustering",
"enable_mlflow": true
}
],
"finalisers": {
"input": "model_metadata",
"enable_quality": false,
"write_config": {
"mode": "overwrite"
},
"enable_profiling": false,
"enable_classification": false
},
"preview": false
}
The k_means_training transformation provides extensive configuration options:
features_to_use: List of numeric columns to use for clustering. If not specified, all numeric columns are automatically selected.
drop_cols: Columns to exclude from the dataset before processing.
number_clusters: Fixed number of clusters to create. Used when optimal_cluster_selection is false.
min_number_clusters: Minimum number of clusters to consider during optimization (default: 2).
max_number_clusters: Maximum number of clusters to evaluate during optimization. If not specified, defaults to 15 or dataset_size/20, whichever is smaller.
optimal_cluster_selection: When true, automatically determines the optimal number of clusters using silhouette analysis, testing different k values to find the configuration with the best cluster separation.
statistical_threshold: Number of standard deviations above the mean centroid distance beyond which a point is considered an outlier (default: 3.0). Lower values make detection stricter, so more points are flagged as outliers.
random_seed: Seed for reproducible clustering results.
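The interaction between number_clusters, min_number_clusters, max_number_clusters, and optimal_cluster_selection can be sketched as follows. This is a minimal illustration of the defaults described in the list above, not Foundation's actual code: the helper name resolve_cluster_range, the integer division, and the inclusive upper bound are all assumptions.

```python
def resolve_cluster_range(dataset_size, number_clusters=None,
                          min_number_clusters=None, max_number_clusters=None,
                          optimal_cluster_selection=True):
    """Return the candidate k values to evaluate (hypothetical helper)."""
    if not optimal_cluster_selection:
        # A fixed k is only used when automatic selection is switched off.
        if number_clusters is None:
            raise ValueError("number_clusters is required when "
                             "optimal_cluster_selection is false")
        return [number_clusters]

    # Documented default for the lower bound is 2.
    k_min = 2 if min_number_clusters is None else min_number_clusters

    if max_number_clusters is None:
        # Documented default: 15 or dataset_size/20, whichever is smaller.
        # Integer division is an assumption of this sketch.
        k_max = min(15, dataset_size // 20)
    else:
        k_max = max_number_clusters

    if k_max < k_min:
        raise ValueError("max_number_clusters must be >= min_number_clusters")
    # Treating the range as inclusive is also an assumption.
    return list(range(k_min, k_max + 1))


# Mirrors the training example: search k from 3 to 10.
candidates = resolve_cluster_range(1000, min_number_clusters=3,
                                   max_number_clusters=10)
```

With 100 rows and no explicit bounds, the sketch would search k from 2 to 5, because 100/20 is smaller than 15.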
Metadata Output Format
{
"metadata": {
"model_name": "operations_clustering_kmeans_model",
"version": "v1",
"training_date": "2025-01-22 09:45:00+00:00",
"model_type": "KMeans",
"configuration": {
"k_clusters": 5,
"features": [
"throughput_rate",
"efficiency_score",
"quality_metric",
"resource_utilization",
"cycle_time"
],
"random_seed": 42,
"statistical_threshold": 3.0,
"optimal_cluster_selection": true
},
"cluster_centers": [
{
"cluster_id": 0,
"centroid": [125.5, 0.82, 0.95, 0.75, 45.2]
},
{
"cluster_id": 1,
"centroid": [95.3, 0.65, 0.88, 0.60, 62.8]
}
],
"outlier_thresholds": {
"0": {
"mean_distance": 15.6,
"std_distance": 4.2,
"threshold": 28.2,
"cluster_size": 245
},
"1": {
"mean_distance": 18.3,
"std_distance": 5.1,
"threshold": 33.6,
"cluster_size": 189
}
}
},
"model_path": "models/operations_clustering/kmeans/v1/model",
"metadata_path": "models/operations_clustering/kmeans/v1/metadata.json",
"version": "v1",
"model_name": "operations_clustering_kmeans_model",
"model_type": "KMeans",
"created_at": "2025-01-22T09:45:00.000Z"
}
Predictions Data Product
The predictions data product loads the trained K-Means model and applies it to new data for cluster assignment and outlier detection.
Schema Configuration
{
"details": {
"data_product_type": "stored",
"fields": [
{
"name": "record_id",
"primary": false,
"optional": true,
"data_type": {
"column_type": "VARCHAR"
},
"classification": "internal"
},
{
"name": "timestamp",
"primary": false,
"optional": true,
"data_type": {
"column_type": "TIMESTAMP"
},
"classification": "internal"
},
{
"name": "throughput_rate",
"primary": false,
"optional": true,
"data_type": {
"column_type": "DOUBLE"
},
"classification": "internal"
},
{
"name": "efficiency_score",
"primary": false,
"optional": true,
"data_type": {
"column_type": "DOUBLE"
},
"classification": "internal"
},
{
"name": "quality_metric",
"primary": false,
"optional": true,
"data_type": {
"column_type": "DOUBLE"
},
"classification": "internal"
},
{
"name": "resource_utilization",
"primary": false,
"optional": true,
"data_type": {
"column_type": "DOUBLE"
},
"classification": "internal"
},
{
"name": "cycle_time",
"primary": false,
"optional": true,
"data_type": {
"column_type": "DOUBLE"
},
"classification": "internal"
},
{
"name": "cluster_id",
"primary": false,
"optional": true,
"data_type": {
"column_type": "INTEGER"
},
"classification": "internal"
},
{
"name": "classification",
"primary": false,
"optional": true,
"data_type": {
"column_type": "VARCHAR"
},
"classification": "internal"
},
{
"name": "is_outlier",
"primary": false,
"optional": true,
"data_type": {
"column_type": "BOOLEAN"
},
"classification": "internal"
},
{
"name": "distance_to_centroid",
"primary": false,
"optional": true,
"data_type": {
"column_type": "DOUBLE"
},
"classification": "internal"
}
]
}
}
Builder Configuration
{
"config": {
"docker_tag": "0.1.032",
"executor_core_request": "800m",
"executor_core_limit": "2000m",
"executor_instances": 1,
"min_executor_instances": 1,
"max_executor_instances": 1,
"executor_memory": "6144m",
"driver_core_request": "0.3",
"driver_core_limit": "1500m",
"driver_memory": "3072m"
},
"inputs": {
"input_operations_data": {
"input_type": "data_product",
"identifier": "operations-data-id",
"preview_limit": 10
}
},
"transformations": [
{
"transform": "filter_with_condition",
"input": "input_operations_data",
"output": "filtered_data",
"condition": "metric_value IS NOT NULL"
},
{
"transform": "k_means_inference",
"input": "filtered_data",
"output": "predictions",
"model_bucket": "models",
"project_name": "operations_clustering",
"version": null,
"output_cols": [
"record_id",
"timestamp",
"throughput_rate",
"efficiency_score",
"quality_metric",
"resource_utilization",
"cycle_time"
],
"include_distance": true
}
],
"finalisers": {
"input": "predictions",
"enable_quality": false,
"write_config": {
"mode": "overwrite"
},
"enable_profiling": false,
"enable_classification": false
},
"preview": false
}
The k_means_inference transformation parameters include:
version: Specific model version to load, or null for the latest version.
output_cols: List of columns from the original data to include in the output. If not specified, all original columns are retained.
include_distance: Whether to include the distance_to_centroid column in the output, useful for understanding clustering confidence.
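To make the inference step concrete, the sketch below assigns a point to its nearest stored centroid and compares the distance with that cluster's stored threshold. It is an illustration under stated assumptions, not Foundation's implementation: the classify function, the toy two-dimensional centers, and the threshold values are hypothetical, and feature scaling is left out for brevity.

```python
import math


def classify(point, centers, thresholds, include_distance=True):
    """Assign the nearest centroid and flag outliers (illustrative only)."""
    # Euclidean distance from the point to every stored cluster center.
    distances = {cid: math.dist(point, center)
                 for cid, center in centers.items()}
    cluster_id = min(distances, key=distances.get)
    distance = distances[cluster_id]

    # A point is an outlier when it lies beyond its cluster's threshold.
    is_outlier = distance > thresholds[cluster_id]
    result = {
        "cluster_id": cluster_id,  # kept even for outliers
        "classification": "outlier" if is_outlier else f"cluster_{cluster_id}",
        "is_outlier": is_outlier,
    }
    if include_distance:
        result["distance_to_centroid"] = distance
    return result


# Toy values, not taken from a real model.
centers = {0: (0.0, 0.0), 1: (10.0, 10.0)}
thresholds = {0: 3.0, 1: 3.0}
inlier = classify((1.0, 1.0), centers, thresholds)
outlier = classify((4.0, 0.0), centers, thresholds)
```

In this sketch the second point still reports cluster_id 0, but its classification becomes "outlier" because its distance of 4.0 exceeds the threshold of 3.0.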
Prediction Output Format
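| record_id | timestamp | throughput_rate | efficiency_score | cluster_id | classification | is_outlier | distance_to_centroid |
| --- | --- | --- | --- | --- | --- | --- | --- |
| OP-001 | 2025-01-22 10:00:00 | 118.5 | 0.79 | 0 | cluster_0 | false | 12.45 |
| OP-002 | 2025-01-22 10:15:00 | 92.3 | 0.62 | 1 | cluster_1 | false | 8.76 |
| OP-003 | 2025-01-22 10:30:00 | 145.8 | 0.91 | 0 | outlier | true | 42.18 |
| OP-004 | 2025-01-22 10:45:00 | 105.2 | 0.71 | 2 | cluster_2 | false | 15.23 |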
Batch Processing Mode
The batch processing mode combines training and inference in a single transformation, ideal for exploratory analysis and one-time clustering tasks.
Schema Configuration
The schema includes both input features and clustering results:
{
"details": {
"data_product_type": "stored",
"fields": [
{
"name": "record_id",
"primary": false,
"optional": true,
"data_type": {
"column_type": "VARCHAR"
},
"classification": "internal"
},
{
"name": "timestamp",
"primary": false,
"optional": true,
"data_type": {
"column_type": "TIMESTAMP"
},
"classification": "internal"
},
{
"name": "throughput_rate",
"primary": false,
"optional": true,
"data_type": {
"column_type": "DOUBLE"
},
"classification": "internal"
},
{
"name": "efficiency_score",
"primary": false,
"optional": true,
"data_type": {
"column_type": "DOUBLE"
},
"classification": "internal"
},
{
"name": "quality_metric",
"primary": false,
"optional": true,
"data_type": {
"column_type": "DOUBLE"
},
"classification": "internal"
},
{
"name": "cluster_id",
"primary": false,
"optional": true,
"data_type": {
"column_type": "INTEGER"
},
"classification": "internal"
},
{
"name": "classification",
"primary": false,
"optional": true,
"data_type": {
"column_type": "VARCHAR"
},
"classification": "internal"
},
{
"name": "is_outlier",
"primary": false,
"optional": true,
"data_type": {
"column_type": "BOOLEAN"
},
"classification": "internal"
},
{
"name": "distance_to_centroid",
"primary": false,
"optional": true,
"data_type": {
"column_type": "DOUBLE"
},
"classification": "internal"
}
]
}
}
Builder Configuration
{
"config": {
"docker_tag": "0.1.033",
"executor_core_request": "800m",
"executor_core_limit": "2000m",
"executor_instances": 1,
"min_executor_instances": 1,
"max_executor_instances": 1,
"executor_memory": "6144m",
"driver_core_request": "0.3",
"driver_core_limit": "1500m",
"driver_memory": "3072m"
},
"inputs": {
"input_operations_data": {
"input_type": "data_product",
"identifier": "operations-data-id",
"preview_limit": 10
}
},
"transformations": [
{
"transform": "filter_with_condition",
"input": "input_operations_data",
"output": "filtered_data",
"condition": "metric_value IS NOT NULL"
},
{
"transform": "k_means_batch_processing",
"input": "filtered_data",
"output": "batch_clusters",
"features_to_use": [
"throughput_rate",
"efficiency_score",
"quality_metric"
],
"drop_cols": [
"notes"
],
"number_clusters": 5,
"min_number_clusters": null,
"max_number_clusters": null,
"optimal_cluster_selection": false,
"statistical_threshold": 2.5,
"random_seed": 42,
"output_cols": [
"record_id",
"timestamp",
"throughput_rate",
"efficiency_score",
"quality_metric"
],
"include_distance": true,
"project_name": "operations_clustering",
"enable_mlflow": true
}
],
"finalisers": {
"input": "batch_clusters",
"enable_quality": false,
"write_config": {
"mode": "overwrite"
},
"enable_profiling": false,
"enable_classification": false
},
"preview": false
}
Batch Processing Output
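| record_id | timestamp | throughput_rate | efficiency_score | cluster_id | classification | is_outlier | distance_to_centroid |
| --- | --- | --- | --- | --- | --- | --- | --- |
| OP-001 | 2025-01-22 10:00:00 | 118.5 | 0.79 | 1 | cluster_1 | false | 10.25 |
| OP-002 | 2025-01-22 10:15:00 | 92.3 | 0.62 | 3 | cluster_3 | false | 7.48 |
| OP-003 | 2025-01-22 10:30:00 | 145.8 | 0.91 | 1 | outlier | true | 38.92 |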
Clustering Algorithm Features
Foundation's K-Means implementation incorporates several advanced features that enhance its effectiveness and reliability. The K-means++ initialization strategy picks each new initial center with probability proportional to its squared distance from the centers already chosen, so the starting centers tend to be well spread apart. This improves convergence speed and reduces the risk of poor local optima. The algorithm uses Euclidean distance as the similarity metric, with automatic feature standardization ensuring all dimensions contribute equally to distance calculations.
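The K-means++ seeding step can be sketched in a few lines of plain Python. This is the textbook procedure, shown only to illustrate the idea; it is not Foundation's code, the function name kmeans_pp_init is hypothetical, and the sketch assumes the data contains at least k distinct points.

```python
import math
import random


def kmeans_pp_init(points, k, seed=42):
    """Pick k initial centers with K-means++ seeding (textbook version)."""
    rng = random.Random(seed)
    # The first center is drawn uniformly at random.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        d2 = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        # Sample the next center with probability proportional to d2, so
        # far-away points are favoured and chosen points have weight zero.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


# Three loose groups of toy points.
points = [(0.0, 0.0), (0.5, 0.2), (10.0, 10.0), (10.3, 9.8),
          (20.0, 0.0), (19.7, 0.4)]
initial_centers = kmeans_pp_init(points, k=3)
```

A fixed seed makes the selection repeatable, which is the role random_seed plays in the configuration.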
The outlier detection mechanism analyzes the distribution of distances within each cluster, calculating mean and standard deviation of distances to the centroid. Points exceeding the threshold (mean + statistical_threshold × std) are flagged as outliers, receiving a special "outlier" classification while maintaining their cluster assignment for reference. This dual classification system enables both cluster-based analysis and anomaly detection simultaneously.
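The threshold rule can be checked against the example metadata: for cluster 0, 15.6 + 3.0 × 4.2 = 28.2, and for cluster 1, 18.3 + 3.0 × 5.1 = 33.6. The sketch below computes the same statistics for one cluster. The function name outlier_threshold is hypothetical, and the use of the population standard deviation is an assumption, since the source does not say which estimator is used.

```python
import statistics


def outlier_threshold(distances, statistical_threshold=3.0):
    """Per-cluster distance statistics and outlier cut-off (illustrative)."""
    mean_distance = statistics.fmean(distances)
    # Population standard deviation: an assumption of this sketch.
    std_distance = statistics.pstdev(distances)
    return {
        "mean_distance": mean_distance,
        "std_distance": std_distance,
        # threshold = mean + statistical_threshold * std
        "threshold": mean_distance + statistical_threshold * std_distance,
        "cluster_size": len(distances),
    }


# Toy distances for a single cluster; the last point is far from the rest.
distances = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.3, 1.0, 0.9, 6.0]
stats = outlier_threshold(distances, statistical_threshold=2.0)
flags = [d > stats["threshold"] for d in distances]
```

Because a far point also inflates the mean and standard deviation of its own cluster, small clusters can mask outliers at high threshold values. The toy example uses 2.0 for that reason.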
When optimal_cluster_selection is enabled, the system performs silhouette analysis across the specified range of cluster numbers. The silhouette coefficient measures how similar each point is to its own cluster compared to other clusters, with values ranging from -1 to 1. The algorithm selects the k value that maximizes the average silhouette score, ensuring well-separated and cohesive clusters.
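As an illustration of the selection criterion, the sketch below computes the mean silhouette coefficient for precomputed label assignments and keeps the k with the highest score. It is a plain-Python, quadratic-time version for small examples only. The function silhouette_score and the candidate labelings are assumptions for this sketch, and a production system would use an optimized library routine instead.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient; needs at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's zero distance to itself does not affect the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
# Hypothetical cluster assignments produced for k = 2 and k = 3.
candidates = {2: [0, 0, 1, 1], 3: [0, 0, 1, 2]}
best_k = max(candidates, key=lambda k: silhouette_score(points, candidates[k]))
```

Here the two-cluster labeling wins, because splitting the right-hand pair into singletons lowers the average score.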
The model includes comprehensive preprocessing with automatic handling of missing values through row exclusion, feature scaling using standardization (zero mean, unit variance), and validation of numeric data types. This preprocessing pipeline is stored with the model to ensure consistent application during inference.
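The reason for storing the preprocessing with the model is that new data must be scaled with the statistics learned at training time, not with its own. A minimal sketch of that idea follows. The helper names are hypothetical, and the population standard deviation and the guard for constant columns are assumptions rather than documented behaviour.

```python
import statistics


def drop_incomplete(rows):
    """Exclude rows that contain a missing value in any feature."""
    return [row for row in rows if all(v is not None for v in row)]


def fit_scaler(rows):
    """Learn per-feature mean and standard deviation from training rows."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid division by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def transform(rows, means, stds):
    """Standardize rows using previously stored statistics."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)]
            for row in rows]


training = drop_incomplete([[1.0, 100.0], [2.0, None], [3.0, 300.0]])
means, stds = fit_scaler(training)            # stored with the model
scaled_training = transform(training, means, stds)
scaled_new = transform([[2.0, 250.0]], means, stds)  # inference-time data
```

The new row is scaled with the training means and deviations, so its standardized coordinates are directly comparable with the stored cluster centers.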
Model Versioning and Management
Foundation automatically manages model versions, incrementing the version number with each training run. This versioning system enables you to track model evolution, compare performance across versions, and roll back to previous versions if needed. The model storage structure follows a consistent pattern: models/[project_name]/[target_column]/[version]/. For K-Means, which has no target column, the example metadata above uses kmeans in that position (models/operations_clustering/kmeans/v1/). This pattern makes it easy to locate and manage models across different projects and targets.
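The versioning behaviour can be pictured with a small sketch. Everything here is an assumption made for illustration: the helper names, the v-prefixed numbering inferred from the "v1" value in the example metadata, and the rule that a null version resolves to the highest existing number. Foundation's actual storage logic is not shown in this document.

```python
import re


def version_number(version):
    """Return the integer part of a label like 'v3', or None if malformed."""
    match = re.fullmatch(r"v(\d+)", version)
    return int(match.group(1)) if match else None


def next_version(existing_versions):
    """Label for the next training run (hypothetical helper)."""
    numbers = [n for n in map(version_number, existing_versions)
               if n is not None]
    return f"v{max(numbers, default=0) + 1}"


def latest_version(existing_versions):
    """What a null version might resolve to at inference time."""
    valid = [v for v in existing_versions if version_number(v) is not None]
    return max(valid, key=version_number) if valid else None


def model_dir(model_bucket, project_name, segment, version):
    """Build a storage prefix following the documented path pattern."""
    return f"{model_bucket}/{project_name}/{segment}/{version}/"


existing = ["v1", "v2", "v10"]
path = model_dir("models", "operations_clustering", "kmeans",
                 next_version(existing))
```

Comparing versions numerically rather than as strings matters once the count passes nine, since "v10" sorts before "v2" alphabetically.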
The metadata stored alongside each model provides complete reproducibility, capturing not only the model parameters but also the feature columns, training date, and performance metrics. This comprehensive tracking ensures you can understand exactly how each model was created and how it performs, facilitating model governance and compliance requirements.