Uploading Datasets from your local machine
If your datasets are stored locally on your machine, you can upload them to FCS Storage Services by using the FlexAI CLI's dataset push subcommand:
flexai dataset push <dataset_name> (--file <source_path>=<fcs_dataset_path> ... | --file <source_path> ...)
The --file flag
The dataset push command's -f/--file flag offers a flexible way to upload files to a dataset, as it can be used in three different ways.
1. Uploading files one by one to a specific destination path
Uploading files individually is useful when you need to specify a particular source or destination path for each of them:
flexai dataset push text-records-dataset-1 \
--file openwebtext_mini/urlsf_subset00.tar=owt/0.tar \
--file openwebtext_mini/urlsf_subset01.tar=owt/1.tar \
--file openwebtext_mini/test.tar=test/owt.tar \
--file sf-wikitext/test-00000-of-00001.parquet=test/wikitext.parquet \
--file sf-wikitext/train-00000-of-00001.parquet=wikitext/train.parquet
Here you pick individual files from your local machine and set a specific destination path for each of them: some files come from the local openwebtext_mini directory, others from sf-wikitext, and in both cases the test files are uploaded to the Dataset's test directory, as shown below:
// FCS dataset:
.
└── <text-records-dataset-1>/
    ├── owt/
    │   ├── 0.tar
    │   └── 1.tar
    ├── test/
    │   ├── owt.tar
    │   └── wikitext.parquet
    └── wikitext/
        └── train.parquet
Pushing files this way can become cumbersome as the list of files grows. Fortunately, you can automate the process by using a script or a loop to upload multiple files following the patterns that fit your needs. However, if a specific Dataset structure is not a requirement, you can upload files without specifying a destination path.
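The script-or-loop approach mentioned above could look something like the following minimal sketch. It assumes Bash, reuses the openwebtext_mini file names from the example, and picks the owt/<index>.tar destination pattern purely for illustration. The build_file_args helper is hypothetical and not part of the FlexAI CLI. The push command is only printed so it can be reviewed first.

```shell
#!/usr/bin/env bash
# Hypothetical helper: collect one "--file <source>=<destination>" pair per
# archive found in the given directory. The result is stored in the global
# array "file_args".
build_file_args() {
  local src_dir=$1 index=0 path
  file_args=()
  for path in "$src_dir"/urlsf_subset*.tar; do
    [ -e "$path" ] || continue          # skip the literal pattern when nothing matches
    file_args+=(--file "$path=owt/$index.tar")
    index=$((index + 1))
  done
}

build_file_args openwebtext_mini

# Dry run: print the command instead of executing it.
if [ "${#file_args[@]}" -gt 0 ]; then
  echo flexai dataset push text-records-dataset-1 "${file_args[@]}"
fi
```

Once the printed command looks right, removing the echo runs the actual push. Building the arguments in an array keeps paths with spaces intact.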
2. Uploading files without picking a destination path
Considering the example above, you could decide to simply upload the files without specifying a destination path. This would result in the files being stored in the root directory of the FCS Dataset:
flexai dataset push text-records-dataset-2 \
--file openwebtext_mini/urlsf_subset00.tar \
--file openwebtext_mini/urlsf_subset01.tar \
--file openwebtext_mini/test.tar \
--file sf-wikitext/test-00000-of-00001.parquet \
--file sf-wikitext/train-00000-of-00001.parquet
Since none of the file names are the same, they won't overwrite each other, ending up in a structure like this:
// FCS dataset:
.
└── <text-records-dataset-2>/
    ├── urlsf_subset00.tar
    ├── urlsf_subset01.tar
    ├── test.tar
    ├── test-00000-of-00001.parquet
    └── train-00000-of-00001.parquet
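Because files pushed without a destination path all land in the Dataset's root, it can help to check for repeated file names before pushing. The sketch below assumes Bash. The duplicate_basenames function is a hypothetical local helper, not a FlexAI CLI feature.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every base name that appears more than once
# in the list of paths passed as arguments.
duplicate_basenames() {
  local path
  for path in "$@"; do
    printf '%s\n' "${path##*/}"         # strip the directory part
  done | sort | uniq -d
}

files=(
  openwebtext_mini/urlsf_subset00.tar
  openwebtext_mini/urlsf_subset01.tar
  openwebtext_mini/test.tar
  sf-wikitext/test-00000-of-00001.parquet
  sf-wikitext/train-00000-of-00001.parquet
)

dupes=$(duplicate_basenames "${files[@]}")
if [ -n "$dupes" ]; then
  echo "These names would collide in the dataset root:"
  echo "$dupes"
else
  echo "No duplicate file names found."
fi
```

If any name is reported, giving those files explicit destination paths with the <source_path>=<fcs_dataset_path> form avoids the collision.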
However, if picking and choosing files to upload is not part of your requirements, you can use the third method: a bulk upload of files from a directory.
3. Uploading a directory's contents
The --file flag also allows you to push the contents of a directory into a Dataset. This is particularly useful when you already have a directory containing the multiple files that make up your dataset.
Let's assume the following file structure on your local machine:
// Your host system:
~
└── your_current_working_directory/
    └── my-dataset/
        ├── train/
        │   ├── t_1.txt
        │   ├── t_2.txt
        │   ├── t_3.txt
        │   └── deep-text/
        │       ├── t_1.txt
        │       └── t_2.txt
        └── test/
            ├── test_1.txt
            ├── test_2.txt
            └── deep-text_test/
                └── t_1.txt
Yes, file names have been deliberately kept similar to show how pushing an entire directory with nested sub-directories is handled (no overwrite risk!).
Uploading the contents of the my-dataset directory would follow the same pattern as before:
flexai dataset push text-records-dataset-3 --file my-dataset
Resulting in the following FCS Dataset structure:
// FCS dataset:
.
└── <text-records-dataset-3>/
    ├── train/
    │   ├── t_1.txt
    │   ├── t_2.txt
    │   ├── t_3.txt
    │   └── deep-text/
    │       ├── t_1.txt
    │       └── t_2.txt
    └── test/
        ├── test_1.txt
        ├── test_2.txt
        └── deep-text_test/
            └── t_1.txt
Notice that the my-dataset directory is not included in the FCS Dataset structure; only its contents are uploaded.
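To preview the paths a directory push should produce, you can list the files relative to the directory itself. This is a local sketch using standard find, sed, and sort. The list_relative_files function is a hypothetical helper and does not contact FCS.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list every file under a directory, with paths shown
# relative to that directory (the directory name itself is left out).
list_relative_files() {
  local dir=$1
  ( cd "$dir" && find . -type f | sed 's|^\./||' | LC_ALL=C sort )
}

# Only run the preview when the example directory exists.
if [ -d my-dataset ]; then
  list_relative_files my-dataset
fi
```

For the layout above, the listing contains entries such as train/t_1.txt and train/deep-text/t_1.txt, which shows why the similarly named files do not clash.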
Troubleshooting
Pushing large files
Sometimes you may encounter issues when trying to upload large files directly from your computer. These problems are usually network-related. There are a few things you can try:
- Switch to a wired connection if possible.
- Split the file into smaller parts or chunks that you can then join back together at runtime.
- If the file is stored in a Cloud Storage Service such as Amazon S3 or Google Cloud Storage, you can upload directly to FCS Storage Services by creating a Remote Storage Provider Connection.
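As a rough illustration of the splitting option, the sketch below uses the standard split and cat utilities. The function names, the .part- suffix, and the 512m default chunk size are assumptions chosen for the example, not FlexAI requirements.

```shell
#!/usr/bin/env bash
# Hypothetical helper: split a file into fixed-size parts named
# <file>.part-aa, <file>.part-ab, and so on.
split_file() {
  local file=$1 size=${2:-512m}
  split -b "$size" "$file" "$file.part-"
}

# Hypothetical helper: rebuild <file> from its parts. The shell expands the
# glob in alphabetical order, which matches the order split created them in.
join_parts() {
  local file=$1
  cat "$file".part-* > "$file"
}

# Example usage (paths are placeholders):
#   split_file big_archive.tar        # before pushing: creates the parts
#   join_parts big_archive.tar        # at runtime: rebuilds the archive
```

Each part can then be pushed with its own --file flag, and the join step would run wherever the dataset is consumed.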
The process fails when uploading files from a remote machine you're connected to via SSH
If you are uploading files from a machine that you're connected to via SSH, the process may fail if the SSH connection closes. To avoid this, you can use a terminal multiplexer, like screen or tmux, to keep the process running even after you close the remote session.
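With tmux, one possible workflow is sketched below. The session name, log file name, and the start_push_session helper are assumptions made for the example. The push command is the one from the directory example above.

```shell
#!/usr/bin/env bash
# Illustrative values: pick any session and log file names you like.
session="dataset-push"
log_file="push.log"
push_cmd="flexai dataset push text-records-dataset-3 --file my-dataset"

# Hypothetical helper: start a detached tmux session that runs the push and
# copies its output to a log file, so the result is still available after the
# command finishes and the session closes.
start_push_session() {
  tmux new-session -d -s "$session" "$push_cmd 2>&1 | tee $log_file"
}

# On the remote machine:
#   start_push_session             # start the upload; safe to disconnect
#   tmux attach -t dataset-push    # after reconnecting, watch the progress
```

The function is only defined here, not called, so nothing is uploaded until you run start_push_session yourself.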