Importing Files into Foundation

This guide provides comprehensive instructions for importing files into the Foundation platform for data processing. It covers both UI-based and programmatic approaches for file upload, along with the subsequent steps to transform uploaded files into data objects ready for processing.

What This Guide Covers

This guide specifically addresses scenarios where you need to:

  • Upload files directly to Foundation's managed storage

  • Create data objects from uploaded files

  • Set up the complete pipeline from file upload to data ingestion

Prerequisites

Before importing files, ensure you have:

  • A configured Data Source that can connect to Foundation's storage

  • Appropriate permissions to upload files and create data objects

  • Access to your Foundation environment's storage interface

Method 1: Importing Files via Storage UI

The Storage UI provides a visual interface for uploading files directly to Foundation's managed storage buckets.

Accessing the Storage UI

Navigate to your Foundation's storage interface:

https://storage.<env_name>.meshx.foundation/ui

Replace <env_name> with your Foundation environment name (e.g., stg, prod, dev).

Understanding Storage Buckets

When you access the storage interface, you will see several pre-configured buckets, each serving a specific purpose:

  • connectors: Configuration files for data connectors

  • iceberg: Iceberg table storage

  • models: Machine learning model storage

  • samples: Sample data and test files (recommended for user uploads)

  • warehouse: PostgreSQL table exports

For general file uploads, we recommend using a dedicated bucket (like samples) or creating your own bucket for better organization.

Uploading Files

  1. Select Your Target Bucket

    1. Click on the bucket where you want to store your files

    2. For testing and general use, samples is typically appropriate

  2. Initiate Upload

    1. Click the "Upload" button in the top-right corner of the interface

  3. Select Files or Folders

    1. Choose Files: Select individual files from your local machine

    2. Choose Folder: Upload entire folder structures

    3. Maximum file size: 5GB per file

  4. Complete Upload

    1. Files will be uploaded to the current directory in the selected bucket

    2. You can create subdirectories by navigating into them before uploading

File Organization Best Practices

samples/
├── project-alpha/
│   ├── raw-data/
│   │   ├── transactions-2024.csv
│   │   └── customers-2024.csv
│   └── processed/
└── project-beta/
    └── daily-exports/

Method 2: Importing Files via AWS CLI

Foundation's Storage Engine provides an S3-compatible proxy that allows you to use standard AWS S3 commands regardless of the underlying cloud provider. This enables programmatic and batch file uploads.

Configuration

First, configure your AWS CLI with the appropriate endpoint and credentials:

# Set the endpoint URL for your environment
export S3_ENDPOINT_URL="https://storage.<env_name>.meshx.foundation"

# Example for staging environment
export S3_ENDPOINT_URL="https://storage.stg.meshx.foundation"

# Configure AWS credentials for Foundation storage
aws configure set aws_access_key_id <your_storage_username>
aws configure set aws_secret_access_key <your_storage_password>
aws configure set region us-east-1  # Default region, adjust if needed

# Alternative: Set credentials as environment variables
export AWS_ACCESS_KEY_ID=<your_storage_username>
export AWS_SECRET_ACCESS_KEY=<your_storage_password>

Note: The storage credentials (username/password) should be provided by your Foundation administrator. These are specific to the Foundation storage system and are different from your Foundation UI login credentials. For persistent configuration, you can also create a profile in your AWS credentials file:

# Edit ~/.aws/credentials
[foundation-storage]
aws_access_key_id = <your_storage_username>
aws_secret_access_key = <your_storage_password>

# Then use the profile in commands
aws s3 ls --endpoint-url $S3_ENDPOINT_URL --profile foundation-storage

Common S3 Operations

List Available Buckets

aws s3 ls --endpoint-url $S3_ENDPOINT_URL

Create a New Bucket

aws s3 mb s3://<bucket_name> --endpoint-url $S3_ENDPOINT_URL

# Example
aws s3 mb s3://my-data-bucket --endpoint-url $S3_ENDPOINT_URL

Upload a Single File

aws s3 cp <local_file_path> s3://<bucket_name>/<destination_path> --endpoint-url $S3_ENDPOINT_URL

# Example
aws s3 cp /home/user/data/sales_2024.csv s3://samples/sales/sales_2024.csv --endpoint-url $S3_ENDPOINT_URL

Upload Multiple Files (Sync Directory)

aws s3 sync <local_directory> s3://<bucket_name>/<destination_path> --endpoint-url $S3_ENDPOINT_URL

# Example
aws s3 sync "/mnt/c/Users/user/data/cargo-output" s3://samples/cargo-synthetic/ --endpoint-url $S3_ENDPOINT_URL

List Files in a Bucket

aws s3 ls s3://<bucket_name>/<path> --endpoint-url $S3_ENDPOINT_URL --recursive

# Example
aws s3 ls s3://samples/cargo-synthetic/ --endpoint-url $S3_ENDPOINT_URL --recursive

Download Files

aws s3 cp s3://<bucket_name>/<file_path> <local_destination> --endpoint-url $S3_ENDPOINT_URL

# Example
aws s3 cp s3://samples/sales/report.csv ./local_reports/report.csv --endpoint-url $S3_ENDPOINT_URL
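
Because the Storage Engine exposes a standard S3 API, you can also script transfers from Python instead of the CLI. The snippet below is a minimal sketch using boto3 against the same endpoint and storage credentials described in the Configuration section; the bucket and object names are illustrative.

import boto3

# Minimal sketch: uses the same endpoint and storage credentials as the
# CLI configuration above; bucket and object names are illustrative.
s3 = boto3.client(
    "s3",
    endpoint_url="https://storage.<env_name>.meshx.foundation",
    aws_access_key_id="<your_storage_username>",
    aws_secret_access_key="<your_storage_password>",
)

# Upload a local file into the samples bucket
s3.upload_file("sales_2024.csv", "samples", "sales/sales_2024.csv")

# Confirm the upload by listing the prefix
for obj in s3.list_objects_v2(Bucket="samples", Prefix="sales/").get("Contents", []):
    print(obj["Key"], obj["Size"])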

Creating Data Objects from Uploaded Files

After uploading files to Foundation storage, you need to create Data Objects to make them available for processing.

Step 1: Ensure Data Source Configuration

Your data source must be configured to connect to Foundation's S3 storage:

# Configure S3 data source (if not already done)
PUT /api/data/data_source/connection?identifier={data_source_id}

{
  "connection": {
    "connection_type": "s3",
    "url": "https://storage.<env_name>.meshx.foundation",
    "access_key": {
      "env_key": "S3_ACCESS_KEY"
    },
    "access_secret": {
      "env_key": "S3_SECRET_KEY"
    }
  }
}
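
If you prefer to script this step, the same request can be issued with Python's requests library. The sketch below assumes API_URL (base API endpoint), get_headers() (authenticated request headers), and data_source_id are defined in your client setup, as in the monitoring example later in this guide.

import requests

# Sketch only: API_URL, get_headers() and data_source_id are assumed to be
# defined in your client setup (see the monitoring example below).
connection_payload = {
    "connection": {
        "connection_type": "s3",
        "url": "https://storage.<env_name>.meshx.foundation",
        "access_key": {"env_key": "S3_ACCESS_KEY"},
        "access_secret": {"env_key": "S3_SECRET_KEY"},
    }
}

resp = requests.put(
    f"{API_URL}/data/data_source/connection",
    params={"identifier": data_source_id},
    json=connection_payload,
    headers=get_headers(),
)
resp.raise_for_status()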

Step 2: Create Data Object

Create a data object pointing to your uploaded file:

POST /api/data/data_object

{
  "entity": {
    "name": "Cargo Transactions Upload",
    "entity_type": "data_object",
    "label": "CTU",
    "description": "Uploaded cargo transaction data from local system"
    "owner": "[email protected]",
  },
  "entity_info": {
    "owner": "[email protected]",
    "contact_ids": ["Data Engineering Team"],
    "links": []
  }
}

Step 3: Link Data Object to Data Source

Connect the data object to your S3 data source:

POST /api/data/link/data_source/data_object

Parameters:
- identifier: {s3_data_source_id}
- child_identifier: {data_object_id}
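
A rough Python sketch of these two calls is shown below, under the same assumptions about API_URL and get_headers(); s3_data_source_id is the identifier of the data source from Step 1, and the field used to read the new object's identifier from the response is an assumption to verify against your environment.

import requests

# Sketch only: API_URL, get_headers() and s3_data_source_id are assumed to be
# defined as in the earlier examples.
create_resp = requests.post(
    f"{API_URL}/data/data_object",
    json={
        "entity": {
            "name": "Cargo Transactions Upload",
            "entity_type": "data_object",
            "label": "CTU",
            "description": "Uploaded cargo transaction data from local system",
        },
        "entity_info": {"contact_ids": ["Data Engineering Team"], "links": []},
    },
    headers=get_headers(),
)
create_resp.raise_for_status()
data_object_id = create_resp.json()["entity"]["identifier"]  # assumed response field

# Link the new data object to the S3 data source
link_resp = requests.post(
    f"{API_URL}/data/link/data_source/data_object",
    params={"identifier": s3_data_source_id, "child_identifier": data_object_id},
    headers=get_headers(),
)
link_resp.raise_for_status()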

Step 4: Configure File Path

Point the data object to your uploaded file:

PUT /api/data/data_object/config?identifier={data_object_id}

{
  "configuration": {
    "data_object_type": "csv",
    "path": "/samples/cargo-synthetic/transactions_2024.csv",
    "has_header": true,
    "delimiter": ",",
    "quote_char": "\"",
    "escape_char": null,
    "multi_line": false
  }
}

For other file types, adjust the configuration accordingly, for example by changing data_object_type and omitting the CSV-specific options (has_header, delimiter, quote_char, and so on).
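
To apply the configuration programmatically, a rough sketch with Python requests follows, under the same API_URL and get_headers() assumptions as the earlier sketches; adjust the payload to match your file type.

import requests

# Sketch only: API_URL, get_headers() and data_object_id are assumed to be
# defined as in the earlier examples.
csv_config = {
    "configuration": {
        "data_object_type": "csv",  # swap for the type name your environment supports
        "path": "/samples/cargo-synthetic/transactions_2024.csv",
        "has_header": True,
        "delimiter": ",",
        "quote_char": "\"",
        "escape_char": None,
        "multi_line": False,
    }
}

resp = requests.put(
    f"{API_URL}/data/data_object/config",
    params={"identifier": data_object_id},
    json=csv_config,
    headers=get_headers(),
)
resp.raise_for_status()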

Monitoring Import Status

After creating a data object, monitor its ingestion status:

import requests

# Assumes API_URL (base API endpoint) and get_headers() (authenticated
# request headers) are defined elsewhere in your client setup.

def check_import_status(data_object_id):
    """Check if the file import completed successfully."""
    
    # Get data object details
    resp = requests.get(
        f"{API_URL}/data/data_object?identifier={data_object_id}",
        headers=get_headers()
    )
    
    if resp.status_code != 200:
        print("Unable to fetch status")
        return False
    
    data = resp.json()
    
    # Check state
    state = data["entity"]["state"]
    print(f"Status: {state['code']} - {state['reason']}")
    print(f"Healthy: {state['healthy']}")
    
    # Check compute job if exists
    compute_id = data.get("compute_identifier")
    if compute_id:
        compute_resp = requests.get(
            f"{API_URL}/data/compute?identifier={compute_id}",
            headers=get_headers()
        )
        if compute_resp.status_code == 200:
            job_status = compute_resp.json()["status"]["status"]
            print(f"Ingestion job: {job_status}")
    
    return state["healthy"]
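
For example, you might poll the helper until the data object reports healthy; the interval and retry count below are arbitrary.

import time

# Poll until the data object reports healthy (arbitrary interval and retry count)
for _ in range(30):
    if check_import_status(data_object_id):
        print("Import completed successfully")
        break
    time.sleep(10)
else:
    print("Import did not reach a healthy state within the polling window")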

Troubleshooting Common Issues

File Not Found After Upload

  • Verify the bucket and path in your data object configuration

  • Ensure the path starts with / and includes the bucket name

  • Check file permissions in the S3 storage

Data Object Stuck in Unhealthy State

  • Review compute job logs using /api/data/compute/log?identifier={compute_id}

  • Verify data source credentials are correctly configured

  • Check file format matches the configured data_object_type
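
A quick way to pull those logs from Python, under the same API_URL and get_headers() assumptions as the earlier sketches:

import requests

# Sketch only: fetch ingestion job logs for a stuck data object;
# compute_id comes from the data object's compute_identifier field.
log_resp = requests.get(
    f"{API_URL}/data/compute/log",
    params={"identifier": compute_id},
    headers=get_headers(),
)
if log_resp.status_code == 200:
    print(log_resp.text)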

Authentication Issues

  • Verify your S3 credentials are correctly set in the data source

  • Ensure your user has necessary permissions for the bucket

  • Check endpoint URL matches your environment

Best Practices

  1. Organize Files Logically: Create a clear folder structure in your storage buckets

  2. Use Descriptive Names: Include dates and data types in file names

  3. Batch Similar Files: Group related files in the same upload session

  4. Validate Before Upload: Check file formats and data quality locally first

  5. Monitor Ingestion: Always verify data objects reach healthy state after configuration

  6. Document Data Sources: Maintain clear documentation of what each uploaded file contains

  7. Clean Up Old Files: Regularly remove obsolete files from storage to manage costs

Next Steps

After successfully importing files and creating data objects:

  1. Create Source-Aligned Data Products (SADPs) to transform the raw data

  2. Set up regular ingestion schedules if files are updated periodically

  3. Configure data quality checks on the ingested data

  4. Build Consumer-Aligned Data Products (CADPs) for specific use cases
