Contents:
- Overview
- Installation
- Available Tools
- Server-Based Workflow
- NPU and Hybrid Models
- OGA-Load for Model Preparation
- Accuracy Testing
- Benchmarking
- Export a Finetuned Model
- LLM Report
- Memory Usage
- Power Profiling
- System Information
## Overview

The lemonade-eval CLI provides tools for evaluating, benchmarking, and preparing LLMs. It is designed to work alongside the Lemonade Server, enabling:
- Performance benchmarking of models running on Lemonade Server
- Accuracy testing using MMLU, HumanEval, Perplexity, and lm-eval-harness
- Model preparation for OGA (ONNX Runtime GenAI) on NPU and CPU devices
The CLI uses a unique command syntax where each unit of functionality is called a Tool. A single call to lemonade-eval can invoke multiple Tools in sequence, with each tool passing its state to the next.
For example:

```bash
lemonade-eval -i Qwen3-4B-Instruct-2507-GGUF load bench
```

can be read as:

> Run `lemonade-eval` on the input (`-i`) model Qwen3-4B-Instruct-2507-GGUF. First, load it on the Lemonade Server (`load`), then benchmark it (`bench`).
Use `lemonade-eval -h` to see available options and tools, and `lemonade-eval TOOL -h` for help on a specific tool.
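Putting these pieces together, a typical session might look like the following sketch; every command here is documented in the sections below:

```bash
# 1. Start Lemonade Server (in a separate terminal)
lemonade-server serve

# 2. Load a model, benchmark it, and run an accuracy test
lemonade-eval -i Qwen3-4B-Instruct-2507-GGUF load bench
lemonade-eval -i Qwen3-4B-Instruct-2507-GGUF load accuracy-mmlu --tests management

# 3. Summarize all collected results
lemonade-eval report --perf
```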
## Installation

First, install Lemonade Server from the latest release:
- Windows: Download and run `lemonade-server.msi`
- Linux: See Linux installation options
Next, create a Python environment. Choose one of the following methods:
Using venv:

```bash
python -m venv lemon
# Windows:
lemon\Scripts\activate
# Linux/macOS:
source lemon/bin/activate
```

Using conda:
```bash
conda create -n lemon python=3.12
conda activate lemon
```

Using uv:
```bash
uv venv lemon --python 3.12
# Windows:
lemon\Scripts\activate
# Linux/macOS:
source lemon/bin/activate
```

Clone the repository and install in editable mode:

```bash
git clone https://github.com/lemonade-sdk/lemonade-eval.git
cd lemonade-eval
pip install -e .
```

Optional extras:

```bash
# For OGA CPU inference:
pip install -e .[oga-cpu]
# For RyzenAI NPU support (Windows + Python 3.12 only):
pip install -e .[oga-ryzenai] --extra-index-url=https://pypi.amd.com/simple
# For model generation/export (Windows + Python 3.12 only):
pip install -e .[oga-ryzenai,model-generate] --extra-index-url=https://pypi.amd.com/simple
```

## Available Tools

| Tool | Description |
|---|---|
| `load` | Load a model onto a running Lemonade Server |
| `bench` | Benchmark a model loaded on Lemonade Server |
| `oga-load` | Load and prepare OGA models for NPU/CPU inference |
| `accuracy-mmlu` | Evaluate accuracy using MMLU benchmark |
| `accuracy-humaneval` | Evaluate code generation accuracy |
| `accuracy-perplexity` | Calculate perplexity scores |
| `lm-eval-harness` | Run lm-evaluation-harness benchmarks |
| `llm-prompt` | Send a prompt to a loaded model |
| `report` | Display benchmarking and accuracy results |
| `cache` | Manage the lemonade-eval cache |
| `version` | Display version information |
| `system-info` | Query system information from Lemonade Server |
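After installation, a quick sanity check using the `version` and help tools from the table above:

```bash
lemonade-eval version   # confirm the install
lemonade-eval -h        # list global options and available tools
```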
## Server-Based Workflow

Most lemonade-eval tools require a running Lemonade Server. Start the server first:
```bash
lemonade-server serve
```

Then use lemonade-eval to load models and run evaluations:

```bash
# Load a model and prompt it
lemonade-eval -i Qwen3-4B-Instruct-2507-GGUF load llm-prompt -p "Hello, world!"
# Load and benchmark a model
lemonade-eval -i Qwen3-4B-Instruct-2507-GGUF load bench
# Load and run accuracy tests
lemonade-eval -i Qwen3-4B-Instruct-2507-GGUF load accuracy-mmlu --tests management
```

By default, lemonade-eval connects to http://localhost:8000. Use `--server-url` to connect to a different server:

```bash
lemonade-eval -i Qwen3-4B-Instruct-2507-GGUF load --server-url http://192.168.1.100:8000 bench
```

## NPU and Hybrid Models

For NPU and Hybrid inference on AMD Ryzen AI processors, use Lemonade Server with -NPU or -Hybrid models:

```bash
# Load and prompt a Hybrid model (NPU + iGPU)
lemonade-eval -i Llama-3.2-1B-Instruct-Hybrid load llm-prompt -p "Hello!"

# Load and benchmark an NPU model
lemonade-eval -i Qwen-2.5-3B-Instruct-NPU load bench

# Load and run accuracy tests on Hybrid
lemonade-eval -i Qwen3-4B-Hybrid load accuracy-mmlu --tests management
```

Requirements:

- Processor: AMD Ryzen AI 300- and 400-series processors (e.g., Strix Point, Krackan Point, Gorgon Point)
- Operating System: Windows 11
- NPU Driver: Install the NPU Driver
See the Models List for all available -NPU and -Hybrid models.
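One way to compare results across NPU and Hybrid models is to benchmark each and then aggregate with the `report` tool (described below); a sketch using the example models above:

```bash
# Benchmark an NPU model and a Hybrid model, then compare in one report
lemonade-eval -i Qwen-2.5-3B-Instruct-NPU load bench
lemonade-eval -i Llama-3.2-1B-Instruct-Hybrid load bench
lemonade-eval report --perf
```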
## OGA-Load for Model Preparation

The `oga-load` tool is for preparing custom OGA (ONNX Runtime GenAI) models. It can build and quantize models from Hugging Face for use on NPU, iGPU, or CPU.
> **Note:** For running pre-built NPU/Hybrid models, use the server-based workflow above with `-NPU` or `-Hybrid` models. The `oga-load` tool is primarily for model preparation and testing custom checkpoints.
```bash
# Prepare and test a model on CPU
lemonade-eval -i microsoft/Phi-3-mini-4k-instruct oga-load --device cpu --dtype int4 llm-prompt -p "Hello!"
```

See Installation above for OGA extras (`oga-cpu` or `oga-ryzenai`).
See OGA for iGPU and CPU for more details on model building and caching.
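The same flow applies to your own Hugging Face checkpoints. A minimal sketch, using only the documented `--device` and `--dtype` flags; `your-org/your-finetuned-model` is a hypothetical placeholder:

```bash
# Build and quantize a custom checkpoint for CPU, then smoke-test it
# (your-org/your-finetuned-model is a hypothetical placeholder)
lemonade-eval -i your-org/your-finetuned-model oga-load --device cpu --dtype int4 llm-prompt -p "Sanity check"
```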
## Accuracy Testing

### MMLU

Test language understanding across many subjects:

```bash
# With GGUF model
lemonade-eval -i Qwen3-4B-Instruct-2507-GGUF load accuracy-mmlu --tests management
# With Hybrid model
lemonade-eval -i Qwen3-4B-Hybrid load accuracy-mmlu --tests management
```

See MMLU Accuracy for the full list of subjects.
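To cover several subjects, you can loop over them and summarize afterwards; a hedged sketch (management is the documented example, while philosophy and anatomy are standard MMLU subject names assumed here to be valid `--tests` values):

```bash
# Run a few MMLU subjects one at a time, then aggregate with `report`
for subject in management philosophy anatomy; do
  lemonade-eval -i Qwen3-4B-Instruct-2507-GGUF load accuracy-mmlu --tests "$subject"
done
lemonade-eval report --perf
```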
### HumanEval

Test code generation capabilities:

```bash
lemonade-eval -i Qwen3-4B-Instruct-2507-GGUF load accuracy-humaneval
```

See HumanEval Accuracy for details.
### Perplexity

Calculate perplexity scores (requires an OGA model loaded via `oga-load`):

```bash
lemonade-eval -i microsoft/Phi-3-mini-4k-instruct oga-load --device cpu --dtype int4 accuracy-perplexity
```

See Perplexity Evaluation for interpretation guidance.
### lm-eval-harness

Run standardized benchmarks from lm-evaluation-harness:

```bash
# Run GSM8K math benchmark with GGUF model
lemonade-eval -i Qwen3-4B-Instruct-2507-GGUF load lm-eval-harness --task gsm8k --limit 10
# Run with Hybrid model
lemonade-eval -i Qwen3-4B-Hybrid load lm-eval-harness --task gsm8k --limit 10
```

See lm-eval-harness for supported tasks and options.
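You can sweep several tasks the same way; a hedged sketch (gsm8k is documented above, while hellaswag and arc_easy are standard lm-evaluation-harness task names assumed to be supported here):

```bash
# Run a few lm-eval-harness tasks with a small sample limit
for task in gsm8k hellaswag arc_easy; do
  lemonade-eval -i Qwen3-4B-Instruct-2507-GGUF load lm-eval-harness --task "$task" --limit 10
done
```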
## Benchmarking

Benchmark models loaded on Lemonade Server:
```bash
lemonade-eval -i Qwen3-4B-Instruct-2507-GGUF load bench
```

The benchmark measures:
- Time to First Token (TTFT): Latency before first token is generated
- Tokens per Second: Generation throughput
- Memory Usage: Peak memory consumption (on Windows)
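To compare several models, one approach is to benchmark them back to back and aggregate with the `report` tool; a sketch reusing model names documented elsewhere in this guide:

```bash
# Benchmark a few models in sequence, then summarize the results
for model in Qwen3-4B-Instruct-2507-GGUF Llama-3.2-1B-Instruct-GGUF; do
  lemonade-eval -i "$model" load bench
done
lemonade-eval report --perf
```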
Customize the run with iteration counts and output length:

```bash
lemonade-eval -i Qwen3-4B-Instruct-2507-GGUF load bench --iterations 5 --warmup-iterations 2 --output-tokens 128
```

## Export a Finetuned Model

To prepare your own fine-tuned model for OGA:
- Quantize the model using Quark
- Export using `oga-load`
See the Finetuned Model Export Guide for detailed instructions.
## LLM Report

View a summary of all benchmarking and accuracy results:
```bash
lemonade-eval report --perf
```

Results can be filtered by model name, device type, and data type:
```bash
lemonade-eval report --perf --filter-model "Qwen"
```

## Memory Usage

On Windows, memory usage of the inference server backend can be tracked with the `--memory` flag.
For example:
```bash
lemonade-eval --memory -i Llama-3.2-1B-Instruct-GGUF load bench
```

This generates a PNG file, stored in both the current folder and the build folder, plotting the memory usage of the inference backend over the lemonade-eval tool sequence. Learn more by running `lemonade-eval -h`.
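Since `--memory` applies to the whole tool sequence, it can be combined with other evaluations; a hedged sketch reusing the MMLU example from above:

```bash
# Track backend memory while running an accuracy test (Windows only)
lemonade-eval --memory -i Qwen3-4B-Instruct-2507-GGUF load accuracy-mmlu --tests management
```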
## Power Profiling

For power profiling, see Power Profiling.
## System Information

To view system information and available devices, use the `system-info` tool:
```bash
lemonade-eval system-info
```

By default, this shows essential information including OS version, processor, and physical memory.
For detailed system information including BIOS version, CPU max clock, Windows power setting, and Python packages, use the `--verbose` flag:
```bash
lemonade-eval system-info --verbose
```

For JSON output format:

```bash
lemonade-eval system-info --format json
```
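The JSON form is convenient for scripting; a minimal sketch, assuming `jq` is installed (the report's key names depend on your system and are not specified here):

```bash
# Pretty-print the system report's top-level keys
lemonade-eval system-info --format json | jq 'keys'
```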