Importing Files into Foundation

This guide provides comprehensive instructions for importing files into the Foundation platform for data processing. It covers both UI-based and programmatic approaches for file upload, along with the subsequent steps to transform uploaded files into data objects ready for processing.

What This Guide Covers

This guide specifically addresses scenarios where you need to:

  • Upload files directly to Foundation's managed storage

  • Create data objects from uploaded files

  • Set up the complete pipeline from file upload to data ingestion

Prerequisites

Before importing files, ensure you have:

  • A configured Data Source that can connect to Foundation's storage

  • Appropriate permissions to upload files and create data objects

  • Access to your Foundation environment's storage interface

Method 1: Importing Files via Storage UI

The Storage UI provides a visual interface for uploading files directly to Foundation's managed storage buckets.

Accessing the Storage UI

Navigate to your Foundation's storage interface:

https://storage.<env_name>.meshx.foundation/ui

Replace <env_name> with your Foundation environment name (e.g., stg, prod, dev).

Understanding Storage Buckets

When you access the storage interface, you will see several pre-configured buckets, each serving a specific purpose:

  • connectors: Configuration files for data connectors

  • iceberg: Iceberg table storage

  • models: Machine learning model storage

  • samples: Sample data and test files (recommended for user uploads)

  • warehouse: PostgreSQL table exports

For general file uploads, we recommend using a dedicated bucket (like samples) or creating your own bucket for better organization.

Uploading Files

  1. Select Your Target Bucket

    1. Click on the bucket where you want to store your files

    2. For testing and general use, samples is typically appropriate

  2. Initiate Upload

    1. Click the "Upload" button in the top-right corner of the interface

  3. Select Files or Folders

    1. Choose Files: Select individual files from your local machine

    2. Choose Folder: Upload entire folder structures

    3. Maximum file size: 5GB per file

  4. Complete Upload

    1. Files will be uploaded to the current directory in the selected bucket

    2. You can create subdirectories by navigating into them before uploading

File Organization Best Practices

samples/
├── project-alpha/
│   ├── raw-data/
│   │   ├── transactions-2024.csv
│   │   └── customers-2024.csv
│   └── processed/
└── project-beta/
    └── daily-exports/

Method 2: Importing Files via AWS CLI

Foundation's Storage Engine provides an S3-compatible proxy that allows you to use standard AWS S3 commands regardless of the underlying cloud provider. This enables programmatic and batch file uploads.

Configuration

First, configure your AWS CLI with the appropriate endpoint and credentials:

# Set the endpoint URL for your environment
export S3_ENDPOINT_URL="https://storage.<env_name>.meshx.foundation"

# Example for staging environment
export S3_ENDPOINT_URL="https://storage.stg.meshx.foundation"

# Configure AWS credentials for Foundation storage
aws configure set aws_access_key_id <your_storage_username>
aws configure set aws_secret_access_key <your_storage_password>
aws configure set region us-east-1  # Default region, adjust if needed

# Alternative: Set credentials as environment variables
export AWS_ACCESS_KEY_ID=<your_storage_username>
export AWS_SECRET_ACCESS_KEY=<your_storage_password>

Note: The storage credentials (username/password) should be provided by your Foundation administrator. These are specific to the Foundation storage system and are different from your Foundation UI login credentials. For persistent configuration, you can also create a profile in your AWS credentials file:

# Edit ~/.aws/credentials
[foundation-storage]
aws_access_key_id = <your_storage_username>
aws_secret_access_key = <your_storage_password>

# Then use the profile in commands
aws s3 ls --endpoint-url $S3_ENDPOINT_URL --profile foundation-storage

Common S3 Operations

List Available Buckets

aws s3 ls --endpoint-url $S3_ENDPOINT_URL

Create a New Bucket

aws s3 mb s3://<bucket_name> --endpoint-url $S3_ENDPOINT_URL

# Example
aws s3 mb s3://my-data-bucket --endpoint-url $S3_ENDPOINT_URL

Upload a Single File

aws s3 cp <local_file_path> s3://<bucket_name>/<destination_path> --endpoint-url $S3_ENDPOINT_URL

# Example
aws s3 cp /home/user/data/sales_2024.csv s3://samples/sales/sales_2024.csv --endpoint-url $S3_ENDPOINT_URL

Upload Multiple Files (Sync Directory)

aws s3 sync <local_directory> s3://<bucket_name>/<destination_path> --endpoint-url $S3_ENDPOINT_URL

# Example
aws s3 sync "/mnt/c/Users/user/data/cargo-output" s3://samples/cargo-synthetic/ --endpoint-url $S3_ENDPOINT_URL

List Files in a Bucket

aws s3 ls s3://<bucket_name>/<path> --endpoint-url $S3_ENDPOINT_URL --recursive

# Example
aws s3 ls s3://samples/cargo-synthetic/ --endpoint-url $S3_ENDPOINT_URL --recursive

Download Files

aws s3 cp s3://<bucket_name>/<file_path> <local_destination> --endpoint-url $S3_ENDPOINT_URL

# Example
aws s3 cp s3://samples/sales/report.csv ./local_reports/report.csv --endpoint-url $S3_ENDPOINT_URL
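
Because the Storage Engine exposes a standard S3 API, you can also script transfers from Python instead of the CLI. The snippet below is a minimal sketch using boto3 against the same endpoint and storage credentials described in the Configuration section; the bucket and object names are illustrative.

import boto3

# Minimal sketch: uses the same endpoint and storage credentials as the
# CLI configuration above; bucket and object names are illustrative.
s3 = boto3.client(
    "s3",
    endpoint_url="https://storage.<env_name>.meshx.foundation",
    aws_access_key_id="<your_storage_username>",
    aws_secret_access_key="<your_storage_password>",
)

# Upload a local file into the samples bucket
s3.upload_file("sales_2024.csv", "samples", "sales/sales_2024.csv")

# Confirm the upload by listing the prefix
for obj in s3.list_objects_v2(Bucket="samples", Prefix="sales/").get("Contents", []):
    print(obj["Key"], obj["Size"])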

Creating Data Objects from Uploaded Files

After uploading files to Foundation storage, you need to create Data Objects to make them available for processing.

Step 1: Ensure Data Source Configuration

Your data source must be configured to connect to Foundation's S3 storage:

# Configure S3 data source (if not already done)
PUT /api/data/data_source/connection?identifier={data_source_id}

{
  "connection": {
    "connection_type": "s3",
    "url": "https://storage.<env_name>.meshx.foundation",
    "access_key": {
      "env_key": "S3_ACCESS_KEY"
    },
    "access_secret": {
      "env_key": "S3_SECRET_KEY"
    }
  }
}
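
If you prefer to script this step, the same request can be issued with Python's requests library. The sketch below assumes API_URL (base API endpoint), get_headers() (authenticated request headers), and data_source_id are defined in your client setup, as in the monitoring example later in this guide.

import requests

# Sketch only: API_URL, get_headers() and data_source_id are assumed to be
# defined in your client setup (see the monitoring example below).
connection_payload = {
    "connection": {
        "connection_type": "s3",
        "url": "https://storage.<env_name>.meshx.foundation",
        "access_key": {"env_key": "S3_ACCESS_KEY"},
        "access_secret": {"env_key": "S3_SECRET_KEY"},
    }
}

resp = requests.put(
    f"{API_URL}/data/data_source/connection",
    params={"identifier": data_source_id},
    json=connection_payload,
    headers=get_headers(),
)
resp.raise_for_status()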

Step 2: Create Data Object

Create a data object pointing to your uploaded file:

POST /api/data/data_object

{
  "entity": {
    "name": "Cargo Transactions Upload",
    "entity_type": "data_object",
    "label": "CTU",
    "description": "Uploaded cargo transaction data from local system"
    "owner": "[email protected]",
  },
  "entity_info": {
    "owner": "[email protected]",
    "contact_ids": ["Data Engineering Team"],
    "links": []
  }
}

Step 3: Link Data Object to Data Source

Connect the data object to your S3 data source:

POST /api/data/link/data_source/data_object

Parameters:
- identifier: {s3_data_source_id}
- child_identifier: {data_object_id}
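
A rough Python sketch of these two calls is shown below, under the same assumptions about API_URL and get_headers(); s3_data_source_id is the identifier of the data source from Step 1, and the field used to read the new object's identifier from the response is an assumption to verify against your environment.

import requests

# Sketch only: API_URL, get_headers() and s3_data_source_id are assumed to be
# defined as in the earlier examples.
create_resp = requests.post(
    f"{API_URL}/data/data_object",
    json={
        "entity": {
            "name": "Cargo Transactions Upload",
            "entity_type": "data_object",
            "label": "CTU",
            "description": "Uploaded cargo transaction data from local system",
        },
        "entity_info": {"contact_ids": ["Data Engineering Team"], "links": []},
    },
    headers=get_headers(),
)
create_resp.raise_for_status()
data_object_id = create_resp.json()["entity"]["identifier"]  # assumed response field

# Link the new data object to the S3 data source
link_resp = requests.post(
    f"{API_URL}/data/link/data_source/data_object",
    params={"identifier": s3_data_source_id, "child_identifier": data_object_id},
    headers=get_headers(),
)
link_resp.raise_for_status()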

Step 4: Configure File Path

Point the data object to your uploaded file:

PUT /api/data/data_object/config?identifier={data_object_id}

{
  "configuration": {
    "data_object_type": "csv",
    "path": "/samples/cargo-synthetic/transactions_2024.csv",
    "has_header": true,
    "delimiter": ",",
    "quote_char": "\"",
    "escape_char": null,
    "multi_line": false
  }
}

For other file types, adjust the configuration accordingly, for example by changing data_object_type and omitting the CSV-specific options (has_header, delimiter, quote_char, and so on).
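
To apply the configuration programmatically, a rough sketch with Python requests follows, under the same API_URL and get_headers() assumptions as the earlier sketches; adjust the payload to match your file type.

import requests

# Sketch only: API_URL, get_headers() and data_object_id are assumed to be
# defined as in the earlier examples.
csv_config = {
    "configuration": {
        "data_object_type": "csv",  # swap for the type name your environment supports
        "path": "/samples/cargo-synthetic/transactions_2024.csv",
        "has_header": True,
        "delimiter": ",",
        "quote_char": "\"",
        "escape_char": None,
        "multi_line": False,
    }
}

resp = requests.put(
    f"{API_URL}/data/data_object/config",
    params={"identifier": data_object_id},
    json=csv_config,
    headers=get_headers(),
)
resp.raise_for_status()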

Monitoring Import Status

After creating a data object, monitor its ingestion status:

import requests

# Assumes API_URL (base API endpoint) and get_headers() (authenticated
# request headers) are defined elsewhere in your client setup.

def check_import_status(data_object_id):
    """Check if the file import completed successfully."""
    
    # Get data object details
    resp = requests.get(
        f"{API_URL}/data/data_object?identifier={data_object_id}",
        headers=get_headers()
    )
    
    if resp.status_code != 200:
        print("Unable to fetch status")
        return False
    
    data = resp.json()
    
    # Check state
    state = data["entity"]["state"]
    print(f"Status: {state['code']} - {state['reason']}")
    print(f"Healthy: {state['healthy']}")
    
    # Check compute job if exists
    compute_id = data.get("compute_identifier")
    if compute_id:
        compute_resp = requests.get(
            f"{API_URL}/data/compute?identifier={compute_id}",
            headers=get_headers()
        )
        if compute_resp.status_code == 200:
            job_status = compute_resp.json()["status"]["status"]
            print(f"Ingestion job: {job_status}")
    
    return state["healthy"]
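
For example, you might poll the helper until the data object reports healthy; the interval and retry count below are arbitrary.

import time

# Poll until the data object reports healthy (arbitrary interval and retry count)
for _ in range(30):
    if check_import_status(data_object_id):
        print("Import completed successfully")
        break
    time.sleep(10)
else:
    print("Import did not reach a healthy state within the polling window")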

Troubleshooting Common Issues

File Not Found After Upload

  • Verify the bucket and path in your data object configuration

  • Ensure the path starts with / and includes the bucket name

  • Check file permissions in the S3 storage

Data Object Stuck in Unhealthy State

  • Review compute job logs using /api/data/compute/log?identifier={compute_id}

  • Verify data source credentials are correctly configured

  • Check file format matches the configured data_object_type
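
A quick way to pull those logs from Python, under the same API_URL and get_headers() assumptions as the earlier sketches:

import requests

# Sketch only: fetch ingestion job logs for a stuck data object;
# compute_id comes from the data object's compute_identifier field.
log_resp = requests.get(
    f"{API_URL}/data/compute/log",
    params={"identifier": compute_id},
    headers=get_headers(),
)
if log_resp.status_code == 200:
    print(log_resp.text)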

Authentication Issues

  • Verify your S3 credentials are correctly set in the data source

  • Ensure your user has necessary permissions for the bucket

  • Check endpoint URL matches your environment

Best Practices

  1. Organize Files Logically: Create a clear folder structure in your storage buckets

  2. Use Descriptive Names: Include dates and data types in file names

  3. Batch Similar Files: Group related files in the same upload session

  4. Validate Before Upload: Check file formats and data quality locally first

  5. Monitor Ingestion: Always verify data objects reach healthy state after configuration

  6. Document Data Sources: Maintain clear documentation of what each uploaded file contains

  7. Clean Up Old Files: Regularly remove obsolete files from storage to manage costs

Next Steps

After successfully importing files and creating data objects:

  1. Create Source-Aligned Data Products (SADPs) to transform the raw data

  2. Set up regular ingestion schedules if files are updated periodically

  3. Configure data quality checks on the ingested data

  4. Build Consumer-Aligned Data Products (CADPs) for specific use cases
