Overview
The FlexAI Dataset Manager enables you to:
- Upload Datasets from a local machine or Cloud Storage providers
- Organize data with flexible directory structures
- Manage multiple Datasets for complex training scenarios
Key Features
Multiple Upload Sources
Upload from a local machine, AWS S3, Google Cloud Storage, the Hugging Face Hub, and more
Flexible Structure
Support for flat or hierarchical directory structures with custom file organization
Immutable Storage
Datasets are immutable once created, ensuring reproducible training results
Use multiple Datasets on a single Workload
Attach multiple Datasets to a single Training or Fine-tuning Job for complex scenarios.
Dataset Sources
Local Machine
Upload Datasets directly from your local filesystem:
- Individual files with custom destination paths
- Directory contents preserving nested structure
- Multiple files with batch upload capabilities
- Flexible mapping between source and destination paths
Remote Storage Providers
Upload Datasets from cloud storage without local downloads:
- Amazon S3: Direct integration with S3 buckets and object keys
- Google Cloud Storage: Native support for GCS buckets
- Cloudflare R2: Compatible with R2 storage endpoints
- MinIO: Self-hosted S3-compatible storage integration
- Hugging Face Hub: Direct Dataset downloads from public and private repositories
Dataset Organization
File Structure
FlexAI supports completely flexible Dataset organization.
A single Dataset with a flat structure
Here’s an example of a Dataset with a flat structure. It was named my_flat_dataset at the time of its attachment to a Training or Fine-tuning Job:
A single Dataset with a hierarchical structure
Here’s an example of a Dataset with a large set of nested directories and files inside. It was named my-dataset at the time of its attachment to a Training or Fine-tuning Job:
Multiple Datasets
This example shows how multiple Datasets are organized when attached to a single Training or Fine-tuning Job. Each Dataset gets its own sub-directory under /input/.
Here we have three Datasets attached to the same Job:
- fineweb
- ylecun_mnist
- OpenAssistant-oasst1
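Because each attached Dataset gets its own sub-directory under /input/, training code can discover what is attached by listing that directory. The following is a minimal sketch, not part of the FlexAI tooling: the function name list_attached_datasets is hypothetical, and the root is a parameter so the same code can be tried against a local directory.

```python
from pathlib import Path


def list_attached_datasets(input_root: str = "/input") -> list[str]:
    """Return the names of the Dataset sub-directories under input_root.

    Hypothetical helper: assumes every attached Dataset appears as one
    sub-directory of input_root, as described above.
    """
    root = Path(input_root)
    if not root.is_dir():
        # Nothing mounted (for example, when running outside a Job).
        return []
    # Only directories count as Datasets; stray files are ignored.
    return sorted(p.name for p in root.iterdir() if p.is_dir())
```

Inside a Job with the three Datasets above attached, calling list_attached_datasets() would be expected to return their three names.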
Runtime Access
Datasets are mounted as read-only resources in your Runtime Environment:
- Location: /input/<dataset_name>/
- Multiple Datasets: Each Dataset gets its own sub-directory
- Preserved structure: The original file structure at creation time is maintained
- Direct access: Your code can access files immediately
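As an illustration of the points above, the sketch below walks one mounted Dataset and returns its files as paths relative to the Dataset root, which makes the preserved structure visible. The helper name list_dataset_files is hypothetical. Since the mount is read-only, any output the code produces would need to be written somewhere other than /input/.

```python
from pathlib import Path


def list_dataset_files(dataset_name: str, input_root: str = "/input") -> list[Path]:
    """Return every file in a Dataset, relative to the Dataset's root directory.

    Hypothetical helper: assumes the Dataset is mounted at
    <input_root>/<dataset_name>/ as described above.
    """
    dataset_dir = Path(input_root) / dataset_name
    if not dataset_dir.is_dir():
        raise FileNotFoundError(f"Dataset directory not found: {dataset_dir}")
    # rglob("*") descends into nested directories, so hierarchical
    # Datasets are handled the same way as flat ones.
    return sorted(
        path.relative_to(dataset_dir)
        for path in dataset_dir.rglob("*")
        if path.is_file()
    )
```

For example, list_dataset_files("my-dataset") would list the files of a Dataset attached under the name my-dataset.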
File Format Support
The Dataset Manager supports any file format your training code can process:
- Images: JPEG, PNG, TIFF, and other image formats
- Text: Plain text, JSON, CSV, Parquet files
- Audio: WAV, MP3, FLAC audio files
- Video: MP4, AVI, MOV video formats
- Archives: TAR, ZIP compressed archives
- Binary: Model files, custom binary formats
- Encrypted: Password-protected or encrypted files
- Other: Any other file formats not listed
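Since a Dataset can mix any of these formats, it is often useful for training code to check what a Dataset actually contains before choosing loaders. The following is a rough sketch under that assumption: count_files_by_extension is a hypothetical helper that only inspects file name suffixes and does not validate file contents.

```python
from collections import Counter
from pathlib import Path


def count_files_by_extension(dataset_dir: str) -> Counter:
    """Count the files in dataset_dir (recursively) by lowercased extension.

    Files without an extension are counted under the key "<none>".
    """
    counts: Counter = Counter()
    for path in Path(dataset_dir).rglob("*"):
        if path.is_file():
            # Lowercase so that ".JPG" and ".jpg" are counted together.
            counts[path.suffix.lower() or "<none>"] += 1
    return counts
```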
Performance Optimization
Batch Operations
Optimize upload performance with efficient batch operations:
- Multiple file flags - Upload many files in a single command
- Directory uploads - Transfer entire directories efficiently
- Remote transfers - Server-to-server transfers for cloud data
- Parallel processing - Concurrent upload streams for faster transfers
Storage Efficiency
Maximize storage efficiency and minimize costs:
- Immutable Datasets - Reuse existing Datasets across multiple Training or Fine-tuning Jobs
- Deduplication - Whenever possible, try to avoid uploading duplicate files
- Versioning - Create new Datasets for updated data while retaining previous versions
- Dataset management - Archive or delete unused Datasets
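One way to act on the deduplication advice is to check a local directory for byte-identical files before uploading it. The sketch below is a local pre-upload check, not a FlexAI feature: find_duplicate_files is a hypothetical helper that groups files by their SHA-256 content hash.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicate_files(directory: str) -> list[list[Path]]:
    """Return groups of files under directory that have identical contents."""
    by_digest: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(Path(directory).rglob("*")):
        if not path.is_file():
            continue
        digest = hashlib.sha256()
        with path.open("rb") as fh:
            # Hash in 1 MiB chunks so large files are not loaded into memory.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        by_digest[digest.hexdigest()].append(path)
    # Keep only the hashes shared by more than one file.
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

Each returned group lists paths whose contents match, so all but one file per group could be left out of the upload.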