Some datasets are too large to download or push to FlexAI, and you may prefer to use that data transfer time more efficiently. Streaming the dataset is a useful technique in those cases. This experiment demonstrates how to stream a large dataset during a Training Job on FlexAI, using the streaming capabilities of the HuggingFace Datasets library.
Connect to GitHub (if needed)
If you haven’t already connected FlexAI to GitHub, you’ll need to set up a code registry connection: [...] -u flag in training commands.
Running the Training Job streaming a dataset
Here is an example using the code/causal-language-modeling/train.py script to stream the FineWeb dataset, which is over 90 TB in size:
The command specifies the following:
- The Training Job’s name (gpt2training-stream).
- The URL of the repository containing the training script (https://github.com/flexaihq/blueprints).
- The name of the dataset to be used (empty-dataset or any other dataset you have available).
- The training script to run (code/causal-language-modeling/train.py).
Below that, the first argument passed to the script is --dataset_streaming true, which tells the script to use the Datasets library with its streaming capabilities enabled.
The next lines specify the arguments that will be passed to the training script during execution to adjust the Training Job’s hyperparameters or customize its behavior. For instance, --max_train_samples and --max_eval_samples can be used to tweak the sample size.
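To make the role of these arguments concrete, here is a minimal sketch of how a script could parse them. It is not the parsing code from train.py, which may work differently: the flag names come from the text above, while the str_to_bool helper, the defaults, and the use of argparse are assumptions made for illustration.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Convert a command-line string such as "true" or "0" into a bool."""
    normalized = value.strip().lower()
    if normalized in {"true", "1", "yes"}:
        return True
    if normalized in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean value, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: the flag names come from the text above, while
    # the types and defaults are assumptions made for this sketch.
    parser = argparse.ArgumentParser(description="Streaming options (illustrative)")
    parser.add_argument("--dataset_streaming", type=str_to_bool, default=False)
    parser.add_argument("--max_train_samples", type=int, default=None)
    parser.add_argument("--max_eval_samples", type=int, default=None)
    return parser


if __name__ == "__main__":
    # Parse an explicit list instead of sys.argv so the example is repeatable.
    args = build_parser().parse_args(
        ["--dataset_streaming", "true", "--max_train_samples", "1000"]
    )
    print(args.dataset_streaming, args.max_train_samples, args.max_eval_samples)
```

Because --dataset_streaming is followed by an explicit value (true) rather than being a bare switch, the sketch converts that string to a boolean itself.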
The code
You will notice that the train function in the code/causal-language-modeling/train.py script calls the _load_model_and_tokenizer function to load the model and tokenizer using the user-provided arguments.
The load_and_tokenize helper function from the code/dataset/prepare_save_dataset.py file is responsible for using the HuggingFace Datasets library and enabling its streaming capabilities, which it does by setting the streaming argument of load_dataset to True.
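The following is a rough sketch of the streaming pattern, not the actual load_and_tokenize implementation from the repository. The function names open_stream and tokenize_stream are hypothetical, the dataset id HuggingFaceFW/fineweb and the text column name are assumptions, and a whitespace split stands in for a real tokenizer. The only library call it relies on is load_dataset with streaming=True.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List, Optional


def open_stream(dataset_name: str, split: str = "train") -> Iterable[Dict]:
    """Open a Hub dataset lazily; nothing is downloaded up front."""
    # Imported here so the rest of the sketch runs without `datasets` installed.
    from datasets import load_dataset

    # streaming=True returns an iterable dataset that fetches examples on demand.
    return load_dataset(dataset_name, split=split, streaming=True)


def tokenize_stream(
    examples: Iterable[Dict],
    tokenize: Callable[[str], List],
    text_column: str = "text",  # assumed column name
    max_samples: Optional[int] = None,
) -> Iterator[List]:
    """Tokenize examples one at a time, stopping after max_samples if set."""
    limited = examples if max_samples is None else islice(examples, max_samples)
    for example in limited:
        yield tokenize(example[text_column])


if __name__ == "__main__":
    # "HuggingFaceFW/fineweb" is assumed to be the Hub id for FineWeb.
    stream = open_stream("HuggingFaceFW/fineweb")
    whitespace_tokenize = str.split  # stand-in for a real tokenizer
    for tokens in tokenize_stream(stream, whitespace_tokenize, max_samples=3):
        print(len(tokens))
```

Since examples are pulled one at a time, capping the sample count (as --max_train_samples does conceptually) simply stops the iteration early and avoids reading the rest of the dataset.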
Code
code/causal-language-modeling/train.py
code/causal-language-modeling/requirements.txt