Dataset Manager

The FlexAI Dataset Manager provides a comprehensive solution for uploading, organizing, and managing Datasets used in your AI training and Fine-tuning workloads. Whether your data is stored locally or in the cloud, the Dataset Manager streamlines the process of making it available to your AI Workloads.

The FlexAI Dataset Manager enables you to:

  • Upload Datasets from a local machine or Cloud Storage providers
  • Organize data with flexible directory structures
  • Manage multiple Datasets for complex training scenarios

Multiple Upload Sources

Upload from a local machine, AWS S3, Google Cloud Storage, the Hugging Face Hub, and more

Flexible Structure

Support for flat or hierarchical directory structures with custom file organization

Immutable Storage

Datasets are immutable once created, ensuring reproducible training results

Use multiple Datasets on a single Workload

Attach multiple Datasets to a single Training or Fine-tuning Job for complex scenarios.

Upload Datasets directly from your local filesystem:

  • Individual files with custom destination paths
  • Directory contents preserving nested structure
  • Multiple files with batch upload capabilities
  • Flexible mapping between source and destination paths

Seamlessly upload Datasets from cloud storage without local downloads:

  • Amazon S3: Direct integration with S3 buckets and object keys
  • Google Cloud Storage: Native support for GCS buckets
  • Cloudflare R2: Compatible with R2 storage endpoints
  • MinIO: Self-hosted S3-compatible storage integration
  • Hugging Face Hub: Direct Dataset downloads from public and private repositories

Visit the Remote Storage Connection Manager section for detailed instructions on setting up and using remote storage connections.

FlexAI places no constraints on a Dataset's internal layout: both flat and deeply nested structures are supported.

Here’s an example of a Dataset with a flat structure, named my_flat_dataset when it was attached to a Training or Fine-tuning Job:

  • /input/ # The contents of /input/ are read-only
    • my_flat_dataset/
      • data.bin

A single Dataset with a hierarchical structure


Here’s an example of a Dataset containing a large set of nested directories and files, named my-dataset when it was attached to a Training or Fine-tuning Job:

  • /input/ # The contents of /input/ are read-only
    • my-dataset/
      • train/
        • images/
          • image001.jpg
          • image002.jpg
        • labels.csv
      • validation/
      • test/
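Inside a training script, a layout like this can be traversed with ordinary filesystem calls. The sketch below assumes the hierarchical example above is mounted at /input/my-dataset/; the helper name is our own illustration, not part of any FlexAI SDK.

```python
from pathlib import Path

def list_training_images(dataset_root: str) -> list[Path]:
    """Return the training images under <root>/train/images/, sorted for a stable order."""
    images_dir = Path(dataset_root) / "train" / "images"
    return sorted(images_dir.glob("*.jpg"))

# Inside a Job, the Dataset named "my-dataset" is mounted read-only, so:
# images = list_training_images("/input/my-dataset")
```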

This example shows how multiple Datasets are organized when attached to a single Training or Fine-tuning Job. Each Dataset gets its own sub-directory under /input/.

Here we have three Datasets attached to the same Job:

  • fineweb
  • ylecun_mnist
  • OpenAssistant-oasst1

  • /input/ # The contents of /input/ are read-only
    • fineweb/
    • ylecun_mnist/
    • OpenAssistant-oasst1/

Datasets are mounted as read-only resources in your Runtime Environment:

  • Location: /input/<dataset_name>/
  • Multiple Datasets: Each Dataset gets its own sub-directory
  • Preserved structure: The original file structure at creation time is maintained
  • Direct access: Your code can access files immediately
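Because every Dataset is mounted read-only under /input/&lt;dataset_name&gt;/, resolving a Dataset's location in your code is a matter of joining paths. The helper below is our own sketch, not a FlexAI API; only the /input/ mount point comes from the documentation above.

```python
from pathlib import Path

INPUT_ROOT = Path("/input")  # documented mount point; contents are read-only

def dataset_path(name: str, root: Path = INPUT_ROOT) -> Path:
    """Return the mount path of an attached Dataset, e.g. /input/ylecun_mnist."""
    return root / name

# With the three Datasets from the multi-Dataset Job example:
# for name in ("fineweb", "ylecun_mnist", "OpenAssistant-oasst1"):
#     print(dataset_path(name))
```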

The Dataset Manager supports any file format your training code can process:

  • Images: JPEG, PNG, TIFF, and other image formats
  • Text: Plain text, JSON, CSV, Parquet files
  • Audio: WAV, MP3, FLAC audio files
  • Video: MP4, AVI, MOV video formats
  • Archives: TAR, ZIP compressed archives
  • Binary: Model files, custom binary formats
  • Encrypted: Password-protected or encrypted files
  • Other: Any format not listed above; files are stored and mounted as-is

Optimize upload performance with efficient batch operations:

  • Multiple file flags - Upload many files in a single command
  • Directory uploads - Transfer entire directories efficiently
  • Remote transfers - Server-to-server transfers for cloud data
  • Parallel processing - Concurrent upload streams for faster transfers
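The parallel-processing idea above can be sketched with a thread pool. Here `upload_file` is a placeholder for whatever upload call you actually use (an SDK, a CLI wrapper, or an HTTP client), not a FlexAI function:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def upload_file(path: Path) -> str:
    # Placeholder: substitute your real upload call here.
    return f"uploaded {path.name}"

def parallel_upload(paths: list[Path], max_workers: int = 8) -> list[str]:
    """Upload files on concurrent streams; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(upload_file, paths))
```

Threads suit this workload because uploads are I/O-bound; `max_workers` caps the number of concurrent streams.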

Maximize storage efficiency and minimize costs:

  • Immutable Datasets - Reuse existing Datasets across multiple Training or Fine-tuning Jobs
  • Deduplication - Avoid uploading the same files to multiple Datasets; attach a shared Dataset instead
  • Versioning - Create new Datasets for updated data while retaining previous versions
  • Dataset management - Archive or delete unused Datasets

Ready to start managing your Datasets? Explore these resources: