This blueprint provides a step-by-step guide for evaluating language models on FlexAI using the LM-Evaluation-Harness framework. LM-Evaluation-Harness is a unified, extensible toolkit for few-shot evaluation of language models across hundreds of standardized NLP benchmarks. In this guide, you’ll learn how to:
- Run evaluations across multiple NLP tasks and benchmarks
- Compare model performance consistently
- Use FlexAI’s managed compute environment for large-scale, reproducible model evaluation
If you haven’t already connected FlexAI to GitHub, run flexai code-registry connect to set up a code registry connection. This allows FlexAI to pull repositories directly using the repository URL in training commands.
What is LM-Evaluation-Harness?
LM-Evaluation-Harness is a comprehensive evaluation framework that provides:
- 300+ Tasks: Standardized implementations of popular NLP benchmarks
- Multiple Model Backends: Support for HuggingFace, OpenAI, Anthropic, and more
- Reproducible Evaluation: Consistent evaluation protocols across different models
- Flexible Configuration: Easy customization of evaluation parameters
- Comprehensive Metrics: Detailed performance metrics and statistical analysis
Commonly used benchmarks include:
- HellaSwag: commonsense reasoning
- MMLU: multi-task language understanding (57 subjects)
- GSM8K: grade school math word problems
- HumanEval: code generation capabilities
- TruthfulQA: model truthfulness and reliability
- ARC: AI2 Reasoning Challenge
- WinoGrande: Winograd schema challenge
Quick Start
Run a basic evaluation on HellaSwag with this single command:
Common Evaluation Scenarios
Create Secrets
To access models from HuggingFace (especially gated models), you need a HuggingFace token. Use the flexai secret create command to store your HuggingFace token as a secret:
[Optional] Pre-fetch Models
To speed up evaluation and avoid downloading large models at runtime, you can pre-fetch your models to FlexAI storage:
- Create a HuggingFace storage provider:
- Push the model checkpoint to your storage:
- Use the pre-fetched model in your evaluation:
Monitoring Evaluation Progress
You can check the status and progress of your evaluation job:
Getting Evaluation Results
Once the evaluation job completes, you can access the results:
Understanding Evaluation Results
Sample Results Structure
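As a rough guide to the structure, lm-eval writes a JSON file whose top-level "results" object maps each task name to its metrics. The sketch below flattens such an object into rows. The payload is a hand-written placeholder, not real output, and the exact metric keys are an assumption that depends on the lm-eval version: recent releases append a filter name (as in "acc,none"), while older ones use the bare metric name.

```python
import json

# Illustrative payload only: the values are placeholders, not real scores.
SAMPLE_RESULTS = """
{
  "results": {
    "hellaswag": {"acc,none": 0.5, "acc_stderr,none": 0.005, "alias": "hellaswag"}
  }
}
"""


def flatten_results(payload):
    """Return (task, metric, value) rows for every numeric metric."""
    rows = []
    for task, metrics in payload.get("results", {}).items():
        for key, value in metrics.items():
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                continue  # skip non-numeric entries such as "alias"
            metric = key.split(",")[0]  # "acc,none" -> "acc"
            rows.append((task, metric, value))
    return rows


rows = flatten_results(json.loads(SAMPLE_RESULTS))
for task, metric, value in rows:
    print(f"{task:12s} {metric:12s} {value:.4f}")
```

Stripping the filter suffix keeps the metric names aligned with the short names used in the rest of this guide.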
Key Metrics Explained
- acc: Raw accuracy score
- acc_stderr: Standard error of accuracy
- acc_norm: Length-normalized accuracy (for some tasks)
- pass@k: Pass rate for code generation tasks
- bleu: BLEU score for generation tasks
- rouge: ROUGE scores for summarization tasks
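Two of these metrics are easier to read with a small worked example. The sketch below turns acc and acc_stderr into an approximate 95% confidence interval using a normal approximation, and computes pass@k with the unbiased estimator commonly used for code generation benchmarks (from n generated samples, c of which pass the tests). Both functions are illustrative helpers, not part of lm-eval, and the input numbers are placeholders.

```python
import math


def confidence_interval(acc, acc_stderr, z=1.96):
    """Normal-approximation interval: acc +/- z * stderr, clipped to [0, 1]."""
    half_width = z * acc_stderr
    return max(0.0, acc - half_width), min(1.0, acc + half_width)


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c passed the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


low, high = confidence_interval(0.60, 0.01)  # placeholder score and stderr
print(f"95% CI: [{low:.4f}, {high:.4f}]")
print(f"pass@1 with 3 of 10 samples passing: {pass_at_k(10, 3, 1):.2f}")
```

When two models' intervals overlap heavily, the difference between their scores may not be meaningful on that task.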
Advanced Evaluation Scenarios
Multi-Node Evaluation
For very large models or extensive benchmark suites:
Custom Task Evaluation
For evaluating on custom tasks or datasets:
Evaluation with Reduced Precision
For memory-efficient evaluation:
Expected Results and Benchmarks
Typical Benchmark Performance
HellaSwag (Commonsense Reasoning):
- Random baseline: ~25%
- GPT-3 (175B): ~78.9%
- Human performance: ~95.6%
MMLU (Multi-task Language Understanding):
- Random baseline: ~25%
- GPT-3 (175B): ~43.9%
- Human expert performance: ~89.8%
GSM8K (Math Word Problems):
- Random baseline: ~0%
- GPT-3 (175B): ~17.9%
- Human performance: ~85%
HumanEval (Code Generation):
- Random baseline: ~0%
- GPT-3 (175B): ~14.2%
- Human performance: ~90%
Model Size vs Performance
Generally, larger models perform better, but with diminishing returns:
- Small models (< 1B): Basic language understanding
- Medium models (1B-7B): Reasonable performance on most tasks
- Large models (7B-70B): Strong performance across benchmarks
- Very large models (> 70B): State-of-the-art performance
Technical Details
Resource Requirements
Recommended Configurations:
Small Models (< 1B parameters):
- Accelerators: 1-2 GPUs
- Memory: 8-16GB GPU memory
- Evaluation Time: 30 minutes - 2 hours
Medium Models (1B-7B parameters):
- Accelerators: 2-4 GPUs
- Memory: 16-32GB GPU memory
- Evaluation Time: 1-6 hours
Large Models (7B-70B parameters):
- Accelerators: 4-8 GPUs
- Memory: 40-80GB GPU memory
- Evaluation Time: 2-12 hours
Very Large Models (> 70B parameters):
- Accelerators: 8+ GPUs (multi-node recommended)
- Memory: 80GB+ GPU memory
- Evaluation Time: 6-24 hours
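A quick back-of-the-envelope check helps when choosing a configuration. The sketch below estimates the memory needed for the model weights alone as parameter count times bytes per parameter. It is an assumption-laden lower bound: activations, the KV cache, and framework overhead come on top, so it does not reproduce the ranges listed above.

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gb(num_params, dtype="bfloat16"):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for label, params in [("1B", 1e9), ("7B", 7e9), ("70B", 70e9)]:
    fp32 = weight_memory_gb(params, "float32")
    bf16 = weight_memory_gb(params, "bfloat16")
    print(f"{label}: ~{fp32:.0f} GB in float32, ~{bf16:.0f} GB in bfloat16")
```

Halving the bytes per parameter is the reason dtype=bfloat16 or dtype=float16 is the first thing to try when a model does not fit.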
Optimization Tips
Memory Optimization:
- Use dtype=bfloat16 or dtype=float16 for reduced memory usage
- Reduce batch_size for large models
- Use device_map=auto for automatic device placement
Speed Optimization:
- Increase batch_size when memory allows
- Use max_batch_size for adaptive batching
- Set limit for quick testing with a subset of the data
- Use multiple GPUs with --device cuda
Evaluation Quality:
- Use an appropriate num_fewshot for few-shot tasks
- Enable --log_samples for detailed analysis
- Run multiple times with different seeds to gauge run-to-run variance
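When a task is run several times with different seeds, the scores can be summarized with a mean and a sample standard deviation. This is a minimal sketch using only the standard library; the three accuracies are placeholders standing in for real runs.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation) for a list of scores."""
    mean = statistics.mean(scores)
    spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, spread


# Placeholder accuracies from three hypothetical runs with different seeds.
mean, spread = summarize_runs([0.50, 0.52, 0.54])
print(f"accuracy: {mean:.3f} +/- {spread:.3f} over 3 runs")
```

A difference between two models that is smaller than this spread is hard to distinguish from seed noise.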
Command Line Parameters Explained
- --model hf: Use HuggingFace Transformers backend
- --model_args: Model-specific arguments (path, dtype, etc.)
- --tasks: Comma-separated list of evaluation tasks
- --device: Device placement (cuda, cpu)
- --batch_size: Batch size for evaluation
- --num_fewshot: Number of examples for few-shot evaluation
- --output_path: Path to save results JSON
- --log_samples: Save individual sample predictions
- --limit: Limit number of samples per task (for testing)
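To show how these flags fit together, the following sketch assembles an lm_eval argument list in Python and prints it as a shell command. The helper build_lm_eval_command is hypothetical, and the model name (gpt2), task names, and output directory are placeholders. It covers only the flags listed above, and how the resulting command is passed to a FlexAI job is not shown.

```python
import shlex


def build_lm_eval_command(model_args, tasks, *, device="cuda", batch_size=8,
                          num_fewshot=None, output_path=None,
                          log_samples=False, limit=None):
    """Assemble an lm_eval argument list from the flags described above."""
    args = [
        "lm_eval",
        "--model", "hf",
        "--model_args", ",".join(f"{k}={v}" for k, v in model_args.items()),
        "--tasks", ",".join(tasks),
        "--device", device,
        "--batch_size", str(batch_size),
    ]
    if num_fewshot is not None:
        args += ["--num_fewshot", str(num_fewshot)]
    if output_path is not None:
        args += ["--output_path", output_path]
    if log_samples:
        args.append("--log_samples")
    if limit is not None:
        args += ["--limit", str(limit)]
    return args


command = build_lm_eval_command(
    {"pretrained": "gpt2", "dtype": "bfloat16"},  # placeholder model
    ["hellaswag", "arc_easy"],
    output_path="results",  # placeholder directory
    log_samples=True,
    limit=100,
)
print(shlex.join(command))
```

Building the command as a list and quoting it with shlex.join avoids quoting mistakes when model arguments contain special characters.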
Popular Task Collections
Core Benchmarks
Academic Benchmarks
Reasoning Tasks
Language Understanding
Code Generation
Math and Logic
Troubleshooting
Common Issues:
Out of Memory Errors:
- Reduce batch_size (try 1, 2, 4)
- Use reduced precision: dtype=bfloat16 or dtype=float16
- Increase number of GPUs: --accels 8
- Enable CPU offloading: device_map=auto
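The batch-size advice can be automated with a simple back-off loop. In this sketch, run_eval is a hypothetical callable that launches one evaluation at a given batch size, and MemoryError is a stand-in for whatever out-of-memory error your setup actually raises. It is not an lm-eval or FlexAI API.

```python
def largest_working_batch_size(run_eval, start=32, minimum=1):
    """Halve the batch size after each out-of-memory failure.

    run_eval is a hypothetical callable that launches one evaluation with the
    given batch size and raises MemoryError when the job runs out of memory.
    """
    batch_size = start
    while batch_size >= minimum:
        try:
            run_eval(batch_size)
            return batch_size
        except MemoryError:
            batch_size //= 2
    raise RuntimeError("evaluation does not fit even at the minimum batch size")


def fake_eval(batch_size):
    """Stand-in for a real job: pretend anything above 4 exhausts memory."""
    if batch_size > 4:
        raise MemoryError(f"batch_size={batch_size} is too large")


print(largest_working_batch_size(fake_eval))
```

lm-eval also accepts --batch_size auto, which searches for a fitting batch size itself, so a manual loop like this is mainly useful when each attempt is a separate job.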
Model Loading Failures:
- Verify model name is correct on HuggingFace
- Check that your HuggingFace token has permission to access gated models
- Use trust_remote_code=True for custom models
- Verify there’s sufficient disk space for model downloads
Task Not Found:
- List available tasks: lm_eval --tasks list
- Check task name spelling and capitalization
- Verify task is supported in your lm-eval version
- Use task groups: mmlu_* instead of individual tasks
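Task-name problems can be caught before a job is submitted. The sketch below is a local pre-check, not lm-eval's own resolution logic: it expands wildcard patterns against a list of known task names and suggests close matches for names that do not exist. The AVAILABLE list is a small hypothetical subset; in practice it would come from the output of lm_eval --tasks list.

```python
import difflib
import fnmatch


def resolve_tasks(requested, available):
    """Expand wildcard patterns and suggest fixes for unknown task names."""
    resolved, problems = [], {}
    for name in requested:
        # fnmatchcase keeps matching case-sensitive on every platform.
        matches = sorted(t for t in available if fnmatch.fnmatchcase(t, name))
        if matches:
            resolved.extend(m for m in matches if m not in resolved)
        else:
            problems[name] = difflib.get_close_matches(name.lower(), available, n=3)
    return resolved, problems


# Hypothetical subset of task names, e.g. copied from `lm_eval --tasks list`.
AVAILABLE = ["hellaswag", "gsm8k", "winogrande", "mmlu_anatomy", "mmlu_astronomy"]

resolved, problems = resolve_tasks(["mmlu_*", "HellaSwag", "gsm8k"], AVAILABLE)
print("resolved:", resolved)
print("unknown names and suggestions:", problems)
```

Here the miscapitalized HellaSwag is reported with hellaswag as the suggested fix, which is the spelling and capitalization check described above.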
Slow Evaluation:
- Increase batch_size when memory allows
- Use --limit for quick testing (e.g., --limit 100)
- Consider fewer tasks for initial testing
- Use faster/smaller models for development
Authentication Issues:
- Create a HuggingFace token with appropriate permissions
- Store token as FlexAI secret correctly
- Verify token is not expired
- Check model access permissions (especially for gated models)
Code
requirements.txt