
Uploading Datasets from your local machine

Navigate to the Dataset Manager in the FlexAI Console:

  1. Visit the “Add Dataset” section of the FlexAI Console 🔗
  2. Enter a name for your dataset: nanoGPT-dataset
  3. Select the “Local” option for “Upload Origin”
  4. Select the + Upload Item button to open the “Upload Items” dialog

Uploading files individually is useful when the source files live in different locations on your machine and/or you want to set a specific destination path for each of them:

Let’s assume the following file structure on your local machine:

  • ~/
    • openwebtext_mini/
      • urlsf_subset00.tar
      • urlsf_subset01.tar
      • test.tar
    • sf-wikitext/
      • test-00000-of-00001.parquet
      • train-00000-of-00001.parquet

You can upload specific files from the openwebtext_mini and sf-wikitext directories to a FlexAI Dataset named text-records-dataset-1, while also specifying a custom destination path for each of them (including a different file name for each), as seen below:

  • text-records-dataset-1/
    • owt/
      • urlsf_subset00.tar
      • urlsf_subset01.tar
    • test/
      • test.tar
      • test-00000-of-00001.parquet
    • wikitext/
      • train-00000-of-00001.parquet

You can achieve this by going through the following steps iteratively for each file:

  1. Use the “Select file” option to open a file browser dialog
  2. Select the file you want to upload from your local machine
  3. In the “Destination Path” field, enter the desired destination path within the Dataset
  4. Select the Add button to confirm the file selection and destination mapping
  5. Below the “Upload Items” file list, select the + Add items button to open the “Upload Items” dialog again.
  6. Repeat the steps above for each file you want to upload
  7. Finally, select the Add Dataset button to start the upload process.

Multiple files, no defined destination path


Considering the example above, you could decide to simply upload the files without specifying a destination path. This would result in the files being placed at the root directory of the FlexAI Dataset:

Since none of the file names are the same, they won’t overwrite each other, and the resulting FlexAI Dataset structure looks as follows:

  • text-records-dataset-2/
    • urlsf_subset00.tar
    • urlsf_subset01.tar
    • test.tar
    • test-00000-of-00001.parquet
    • train-00000-of-00001.parquet

However, if picking and choosing files to upload is not required for your data workflow, you can use the third method: a bulk upload of the files in a directory.

The --file flag also allows you to push the contents of a directory into a Dataset. This is particularly useful when you already have a directory that contains all the files that make up your dataset.
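
For instance, a minimal sketch of a single-file push versus a whole-directory push might look like the lines below. This assumes the Dataset push command is flexai dataset push; the command name, dataset name, and paths are placeholders, so check the CLI reference for the exact syntax:

  # Hypothetical single-file push into a Dataset
  flexai dataset push my-dataset-name --file ~/some-file.parquet

  # Hypothetical whole-directory push: the directory's contents go into the Dataset
  flexai dataset push my-dataset-name --file ~/some-directory/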

Let’s assume the following file structure on your local machine:

  • ~/
    • my-dataset/
      • train/
        • t_1.txt
        • t_2.txt
        • t_3.txt
        • deep-text/
          • t_1.txt
          • t_2.txt
      • test/
        • test_1.txt
        • test_2.txt
        • deep-text_test/
          • t_1.txt

Yes, file names have been deliberately kept similar to show how pushing an entire directory with nested sub-directories is handled (no overwrite risk!).

Uploading the contents of the my-dataset directory to a Dataset named text-records-dataset-3 would follow the same pattern as before:
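
As a sketch, assuming the same flexai dataset push command as above (the exact command name may differ in your CLI version):

  # Push the contents of ~/my-dataset into the text-records-dataset-3 Dataset
  flexai dataset push text-records-dataset-3 --file ~/my-dataset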

Resulting in the following FlexAI Dataset structure:

  • text-records-dataset-3/
    • train/
      • t_1.txt
      • t_2.txt
      • t_3.txt
      • deep-text/
        • t_1.txt
        • t_2.txt
    • test/
      • test_1.txt
      • test_2.txt
      • deep-text_test/
        • t_1.txt

Sometimes you may encounter issues when uploading large files directly from your computer. These problems are usually related to network instability. There are a few things you can try:

  • Switch to a wired connection if possible.
  • Split the file into smaller parts or chunks that you can then join back together at runtime (see the sketch after this list).
  • If the file is stored in a Cloud Storage Service such as Amazon S3 or Google Cloud Storage, you can upload directly to the FlexAI Dataset Manager by creating a Remote Storage Provider Connection.
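
For example, splitting a large archive into chunks and joining them back together can be done with the standard split and cat utilities (file names below are placeholders):

  # Split a large archive into 1 GB chunks (big-file.tar.part-aa, -ab, ...)
  split -b 1G big-file.tar big-file.tar.part-

  # At runtime, join the chunks back into the original file
  cat big-file.tar.part-* > big-file.tar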

The process fails when uploading files from a remote machine you’re connected to via SSH


If you are trying to upload files from a machine that you’re connected to via SSH, the process may fail because the SSH connection gets closed. To avoid this, you can use a terminal multiplexer, like screen 🔗 or tmux 🔗, to keep the process running even after you close the remote session.
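
For example, with tmux you could start the upload inside a named session, detach, and close the SSH connection without interrupting it (the session name is illustrative):

  # Start a named tmux session and run the upload command inside it
  tmux new -s dataset-upload

  # Detach from the session with Ctrl-b d, then close the SSH connection if needed

  # Later, reconnect via SSH and re-attach to check on the upload
  tmux attach -t dataset-upload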