Adding and Managing Data Sources

Data sources are the connection points through which Foundation ingests data from external systems such as databases, file systems, APIs, and streaming sources. Each data source stores the configuration and credentials needed to connect to its external system and retrieve data from it.

When to Use Data Sources

Data sources are essential in the following scenarios:

External data ingestion: When you need to ingest data from external systems into your data platform
Reusable connections: When you want to establish a reusable connection configuration for a specific data source
Secure access: When you need to maintain secure credentials for accessing external systems
Pipeline automation: When you want to set up regular data ingestion pipelines from specific sources
Multiple data types: When working with various data formats (databases, cloud storage, APIs, streaming data)

Creating a Data Source

Creating a data source is a multi-step process that involves initial creation, linking to data systems, configuration, and setting up secrets.

Step 1: Initial Data Source Creation

Endpoint: POST /api/data/data_source

Request Body:

{
  "entity": {
    "name": "Data Source example",
    "entity_type": "origin",
    "label": "DSE",
    "description": "This is an example for data source"
  },
  "entity_info": {
    "owner": "[email protected]",
    "contact_ids": [
      "Data Source contact"
    ],
    "links": [
      "example.com"
    ]
  }
}

Required Headers

Authorization: Bearer {your_access_token}
x-org: {your_organization_name}

Key Fields Explanation

name: Descriptive name for the data source
entity_type: Always "origin" for data source endpoints
label: Short identification code (typically 3 letters)
description: Detailed description of what this data source provides
entity_info: Contact information for the person/team responsible for this data source

Response

The API returns a response with the data source details:

{
  "entity": {
    "identifier": "1a083d33-46a1-45e5-a709-fd3d5ac9823f",
    "urn": "urn:meshx:backend:data:root:origin:1a083d33-46a1-45e5-a709-fd3d5ac9823f",
    "name": "Data Source example",
    "is_system": false,
    "description": "This is an example for data source",
    "label": "DSE",
    "created_at": "2025-04-10T13:21:18.078779Z",
    "state": {
      "code": "001",
      "reason": "Requires configuration.",
      "healthy": false
    },
    "owner": null
  },
  "entity_info": {
    "owner": "[email protected]",
    "contact_ids": [
      "Data Source contact"
    ],
    "links": [
      "example.com"
    ]
  },
  "links": {
    "parents": [],
    "children": []
  },
  "compute_identifier": null,
  "secrets": [],
  "connection": null
}

Important: The initial state code is "001" with the reason "Requires configuration." and healthy set to false. This is expected: the data source has no connection details until you configure them in the following steps.
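
As a small illustration of reading the state block in the response above, the following sketch reports whether a data source still needs configuration. The helper name needs_configuration is hypothetical; the field names and the "001" code are taken from the sample response.

```python
def needs_configuration(response: dict) -> bool:
    """Return True when the state code is "001" or the data source is not healthy."""
    # "state" may be missing or null, so fall back to an empty dict.
    state = response.get("entity", {}).get("state") or {}
    return state.get("code") == "001" or not state.get("healthy", False)


# Trimmed copy of the sample response shown above.
sample = {
    "entity": {
        "identifier": "1a083d33-46a1-45e5-a709-fd3d5ac9823f",
        "state": {"code": "001", "reason": "Requires configuration.", "healthy": False},
    },
    "connection": None,
}

print(needs_configuration(sample))  # True
```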

Step 2: Link Data Source to Data System

Endpoint: POST /api/data/link/data_system/data_source

Query Parameters:

identifier: The data system identifier
child_identifier: The data source identifier

Step 3: Configure Connection Details

Endpoint: PUT /api/data/data_source/connection?identifier={data_source_id}

Request Body (S3 example):

{
  "connection": {
    "connection_type": "s3",
    "url": "s3-endpoint-url",
    "access_key": {
      "env_key": "S3_ACCESS_KEY"
    },
    "access_secret": {
      "env_key": "S3_SECRET_KEY"
    }
  }
}

Step 4: Set Connection Secrets

Endpoint: POST /api/data/data_source/secret?identifier={data_source_id}

Request Body:

{
  "S3_ACCESS_KEY": "your_access_key_value",
  "S3_SECRET_KEY": "your_secret_key_value"
}
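
The keys in the secrets body are expected to match the env_key values used in the Step 3 connection body. A minimal sketch of a local pre-flight check for that is shown below; the helper names collect_env_keys and missing_secrets are hypothetical and not part of the Foundation API.

```python
def collect_env_keys(node) -> set:
    """Recursively gather every "env_key" value found in a connection config."""
    found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "env_key" and isinstance(value, str):
                found.add(value)
            else:
                found |= collect_env_keys(value)
    elif isinstance(node, list):
        for item in node:
            found |= collect_env_keys(item)
    return found


def missing_secrets(connection_body: dict, secrets_body: dict) -> set:
    """Return env_key names referenced by the connection but absent from the secrets."""
    return collect_env_keys(connection_body) - set(secrets_body)


# Connection body from Step 3, with a secrets body that forgets one key.
connection_body = {
    "connection": {
        "connection_type": "s3",
        "url": "s3-endpoint-url",
        "access_key": {"env_key": "S3_ACCESS_KEY"},
        "access_secret": {"env_key": "S3_SECRET_KEY"},
    }
}
secrets_body = {"S3_ACCESS_KEY": "your_access_key_value"}

print(missing_secrets(connection_body, secrets_body))  # {'S3_SECRET_KEY'}
```

Running a check like this before calling the secret endpoint can catch a mismatched key name without a round trip to the API.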

Supported Connection Types

Foundation currently supports the following connection types:

"database": For database connections (PostgreSQL, MySQL, etc.)
"s3": For S3-compatible storage systems
"synthetic": For generating synthetic data for testing purposes

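The list above can be turned into a simple client-side guard, sketched here. The set mirrors only the three types named in this section and may lag behind what the platform accepts; the function name validate_connection_type is hypothetical.

```python
# Connection types as listed in this section (an assumption that the list is current).
SUPPORTED_CONNECTION_TYPES = {"database", "s3", "synthetic"}


def validate_connection_type(connection_body: dict) -> str:
    """Return the connection_type if it is one of the listed types, else raise ValueError."""
    connection_type = connection_body.get("connection", {}).get("connection_type")
    if connection_type not in SUPPORTED_CONNECTION_TYPES:
        raise ValueError(
            f"Unsupported connection_type {connection_type!r}; "
            f"expected one of {sorted(SUPPORTED_CONNECTION_TYPES)}"
        )
    return connection_type


print(validate_connection_type({"connection": {"connection_type": "s3"}}))  # s3
```
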
Complete Python Example

Here's a comprehensive Python example that demonstrates the entire data source creation and configuration process:

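import os

import requests

# Assumed setup, not part of the original example: the base URL and the
# environment variable names below are placeholders to adapt to your deployment.
API_URL = os.environ.get("FOUNDATION_API_URL", "https://your-foundation-host/api")

def get_headers():
    """Build the required headers from environment variables (an assumption)."""
    return {
        "Authorization": f"Bearer {os.environ['FOUNDATION_ACCESS_TOKEN']}",
        "x-org": os.environ["FOUNDATION_ORG"],
    }

def create_data_source(name, description, owner_email="[email protected]"):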
    """Create a new data source"""
    data_source_resp = requests.post(
        f"{API_URL}/data/data_source",
        headers=get_headers(),
        json={
            "entity": {
                "name": name,
                "entity_type": "origin",
                "label": name[:3].upper(),
                "description": description
            },
            "entity_info": {
                "owner": owner_email,
                "contact_ids": [f"{name} contact"],
                "links": ["example.com"]
            }
        }
    )
    
    if data_source_resp.status_code == 200:
        return data_source_resp.json()["entity"]["identifier"]
    else:
        print(f"Error creating data source: {data_source_resp.text}")
        return None

def link_data_source_to_data_system(data_system_id, data_source_id):
    """Link a data source to a data system"""
    link_resp = requests.post(
        f"{API_URL}/data/link/data_system/data_source",
        headers=get_headers(),
        params={
            "identifier": data_system_id,
            "child_identifier": data_source_id
        }
    )
    
    return link_resp.status_code == 200

def configure_s3_data_source(data_source_id, s3_url):
    """Configure S3 connection for a data source"""
    connection_resp = requests.put(
        f"{API_URL}/data/data_source/connection?identifier={data_source_id}",
        headers=get_headers(),
        json={
            "connection": {
                "connection_type": "s3",
                "url": s3_url,
                "access_key": {"env_key": "S3_ACCESS_KEY"},
                "access_secret": {"env_key": "S3_SECRET_KEY"}
            }
        }
    )
    
    return connection_resp.status_code == 200

def set_s3_secrets(data_source_id, access_key, secret_key):
    """Set S3 access credentials for a data source"""
    secrets_resp = requests.post(
        f"{API_URL}/data/data_source/secret?identifier={data_source_id}",
        headers=get_headers(),
        json={
            "S3_ACCESS_KEY": access_key,
            "S3_SECRET_KEY": secret_key
        }
    )
    
    return secrets_resp.status_code == 200

Managing Existing Data Sources

Once you have created and configured data sources, you can perform various management operations:

List All Data Sources

GET /api/data/data_source/list

Get Specific Data Source

GET /api/data/data_source?identifier={data_source_id}

Update Data Source

PUT /api/data/data_source?identifier={data_source_id}

Delete Data Source

DELETE /api/data/data_source?identifier={data_source_id}
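
The management endpoints above share the same headers and the same identifier query parameter, so they can be built by one helper. The sketch below only constructs the requests with the standard library and does not send them. The base URL, token, and organization values are placeholders, and build_request is a hypothetical helper name. The body of the update (PUT) request is not documented in this section, so it is left out.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

API_URL = "https://foundation.example.com/api"  # hypothetical base URL


def build_request(method, path, token, org, identifier=None, body=None):
    """Build (but do not send) a request for a data source endpoint."""
    url = f"{API_URL}{path}"
    if identifier is not None:
        url += "?" + urlencode({"identifier": identifier})
    headers = {"Authorization": f"Bearer {token}", "x-org": org}
    data = None
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return Request(url, data=data, headers=headers, method=method)


data_source_id = "1a083d33-46a1-45e5-a709-fd3d5ac9823f"
list_req = build_request("GET", "/data/data_source/list", "TOKEN", "my-org")
get_req = build_request("GET", "/data/data_source", "TOKEN", "my-org", identifier=data_source_id)
delete_req = build_request("DELETE", "/data/data_source", "TOKEN", "my-org", identifier=data_source_id)

for req in (list_req, get_req, delete_req):
    print(req.get_method(), req.full_url)
```

Passing one of these objects to urllib.request.urlopen would perform the call; the same method, URL, and headers could equally be used with the requests library as in the example above.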

Important Notes

Secret Management: The keys in the secrets JSON object must match the env_key values you specified in the connection configuration
Connection State: After proper configuration, the data source state should change from "001" (Requires configuration) to a healthy state
Cascading Effects: Remember that data sources feed into data objects, so any connection issues will affect downstream data processing
Security: Always use environment variables or secure secret management for sensitive connection details
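
To confirm that the connection state has moved to healthy after configuration, one option is to re-fetch the data source a few times and read entity.state.healthy. The sketch below assumes the state may not update immediately, which this section does not state, and takes the fetch step as a callable so it can wrap GET /api/data/data_source however you issue it. The name wait_until_healthy is hypothetical.

```python
import time
from typing import Callable


def wait_until_healthy(fetch: Callable[[], dict], attempts: int = 5, delay_seconds: float = 2.0) -> bool:
    """Call fetch() until entity.state.healthy is true or attempts run out."""
    for attempt in range(attempts):
        state = fetch().get("entity", {}).get("state") or {}
        if state.get("healthy"):
            return True
        if attempt < attempts - 1:
            time.sleep(delay_seconds)  # pause before the next check
    return False


# Stand-in for real API responses: unhealthy first, then healthy.
responses = iter([
    {"entity": {"state": {"code": "001", "healthy": False}}},
    {"entity": {"state": {"healthy": True}}},
])

print(wait_until_healthy(lambda: next(responses), attempts=3, delay_seconds=0))  # True
```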

