Dataset Manager

The FlexAI Dataset Manager provides a comprehensive solution for uploading, organizing, and managing Datasets used in your AI training and Fine-tuning workloads. Whether your data is stored locally or in the cloud, the Dataset Manager streamlines the process of making it available to your AI Workloads.

The FlexAI Dataset Manager enables you to:

  • Upload Datasets from a local machine or Cloud Storage providers
  • Organize data with flexible directory structures
  • Manage multiple Datasets for complex training scenarios

Multiple Upload Sources

Upload from a local machine, AWS S3, Google Cloud Storage, the Hugging Face Hub, and more

Flexible Structure

Support for flat or hierarchical directory structures with custom file organization

Immutable Storage

Datasets are immutable once created, ensuring reproducible training results

Use multiple Datasets on a single Workload

Attach multiple Datasets to a single Training or Fine-tuning Job for complex scenarios.

Upload Datasets directly from your local filesystem:

  • Individual files with custom destination paths
  • Directory contents preserving nested structure
  • Multiple files with batch upload capabilities
  • Flexible mapping between source and destination paths

Seamlessly upload Datasets from cloud storage without local downloads:

  • Amazon S3: Direct integration with S3 buckets and object keys
  • Google Cloud Storage: Native support for GCS buckets
  • Cloudflare R2: Compatible with R2 storage endpoints
  • MinIO: Self-hosted S3-compatible storage integration
  • Hugging Face Hub: Direct Dataset downloads from public and private repositories

Visit the Remote Storage Connection Manager section for detailed instructions on setting up and using remote storage connections.

FlexAI places no constraints on a Dataset's internal layout: both flat and deeply nested structures are supported.

Here’s an example of a Dataset with a flat structure, named my_flat_dataset when it was attached to a Training or Fine-tuning Job:

  • /input/ # The contents of /input/ are read-only
    • my_flat_dataset/
      • data.bin

A single Dataset with a hierarchical structure


Here’s an example of a Dataset containing a large set of nested directories and files, named my-dataset when it was attached to a Training or Fine-tuning Job:

  • /input/ # The contents of /input/ are read-only
    • my-dataset/
      • train/
        • images/
          • image001.jpg
          • image002.jpg
        • labels.csv
      • validation/
      • test/
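Inside a training script, a layout like this can be traversed with ordinary filesystem calls. The sketch below assumes the hierarchical example above is mounted at /input/my-dataset/; the helper name is our own illustration, not part of any FlexAI SDK.

```python
from pathlib import Path

def list_training_images(dataset_root: str) -> list[Path]:
    """Return the training images under <root>/train/images/, sorted for a stable order."""
    images_dir = Path(dataset_root) / "train" / "images"
    return sorted(images_dir.glob("*.jpg"))

# Inside a Job, the Dataset named "my-dataset" is mounted read-only, so:
# images = list_training_images("/input/my-dataset")
```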

This example shows how multiple Datasets are organized when attached to a single Training or Fine-tuning Job. Each Dataset gets its own sub-directory under /input/.

Here we have three Datasets attached to the same Job:

  • fineweb
  • ylecun_mnist
  • OpenAssistant-oasst1

  • /input/ # The contents of /input/ are read-only
    • fineweb/
    • ylecun_mnist/
    • OpenAssistant-oasst1/

Datasets are mounted as read-only resources in your Runtime Environment:

  • Location: /input/<dataset_name>/
  • Multiple Datasets: Each Dataset gets its own sub-directory
  • Preserved structure: The original file structure at creation time is maintained
  • Direct access: Your code can access files immediately
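Because every Dataset is mounted read-only under /input/&lt;dataset_name&gt;/, resolving a Dataset's location in your code is a matter of joining paths. The helper below is our own sketch, not a FlexAI API; only the /input/ mount point comes from the documentation above.

```python
from pathlib import Path

INPUT_ROOT = Path("/input")  # documented mount point; contents are read-only

def dataset_path(name: str, root: Path = INPUT_ROOT) -> Path:
    """Return the mount path of an attached Dataset, e.g. /input/ylecun_mnist."""
    return root / name

# With the three Datasets from the multi-Dataset Job example:
# for name in ("fineweb", "ylecun_mnist", "OpenAssistant-oasst1"):
#     print(dataset_path(name))
```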

The Dataset Manager supports any file format your training code can process:

  • Images: JPEG, PNG, TIFF, and other image formats
  • Text: Plain text, JSON, CSV, Parquet files
  • Audio: WAV, MP3, FLAC audio files
  • Video: MP4, AVI, MOV video formats
  • Archives: TAR, ZIP compressed archives
  • Binary: Model files, custom binary formats
  • Encrypted: Password-protected or encrypted files
  • Other: Any format not listed above; files are stored and mounted as-is

Optimize upload performance with efficient batch operations:

  • Multiple file flags - Upload many files in a single command
  • Directory uploads - Transfer entire directories efficiently
  • Remote transfers - Server-to-server transfers for cloud data
  • Parallel processing - Concurrent upload streams for faster transfers
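The parallel-processing idea above can be sketched with a thread pool. Here `upload_file` is a placeholder for whatever upload call you actually use (an SDK, a CLI wrapper, or an HTTP client), not a FlexAI function:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def upload_file(path: Path) -> str:
    # Placeholder: substitute your real upload call here.
    return f"uploaded {path.name}"

def parallel_upload(paths: list[Path], max_workers: int = 8) -> list[str]:
    """Upload files on concurrent streams; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(upload_file, paths))
```

Threads suit this workload because uploads are I/O-bound; `max_workers` caps the number of concurrent streams.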

Maximize storage efficiency and minimize costs:

  • Immutable Datasets - Reuse existing Datasets across multiple Training or Fine-tuning Jobs
  • Deduplication - Avoid uploading the same files to multiple Datasets; attach a shared Dataset instead
  • Versioning - Create new Datasets for updated data while retaining previous versions
  • Dataset management - Archive or delete unused Datasets

Ready to start managing your Datasets? Explore these resources: