Contents:
- Overview
- Installation
- Available Tools
- Server-Based Workflow
- NPU and Hybrid Models
- OGA-Load for Model Preparation
- Accuracy Testing
- Benchmarking
- Export a Finetuned Model
- LLM Report
- Memory Usage
- Power Profiling
- System Information
## Overview

The lemonade-eval CLI provides tools for evaluating, benchmarking, and preparing LLMs. It is designed to work alongside the Lemonade Server, enabling:
- Performance benchmarking of models running on Lemonade Server
- Accuracy testing using MMLU, HumanEval, Perplexity, and lm-eval-harness
- Model preparation for OGA (ONNX Runtime GenAI) on NPU and CPU devices
The CLI uses a unique command syntax where each unit of functionality is called a Tool. A single call to lemonade-eval can invoke multiple Tools in sequence, with each tool passing its state to the next.
For example:

```bash
lemonade-eval -i Qwen3-4B-Instruct-2507-GGUF load bench
```

can be read as:

> Run `lemonade-eval` on the input (`-i`) model Qwen3-4B-Instruct-2507-GGUF. First, load it on the Lemonade Server (`load`), then benchmark it (`bench`).
Use `lemonade-eval -h` to see available options and tools, and `lemonade-eval TOOL -h` for help on a specific tool.
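Putting these pieces together, a typical session might look like the following sketch; every command here is documented in the sections below:

```bash
# 1. Start Lemonade Server (in a separate terminal)
lemonade-server serve

# 2. Load a model, benchmark it, and run an accuracy test
lemonade-eval -i Qwen3-4B-Instruct-2507-GGUF load bench
lemonade-eval -i Qwen3-4B-Instruct-2507-GGUF load accuracy-mmlu --tests management

# 3. Summarize all collected results
lemonade-eval report --perf
```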
## Installation

First, install Lemonade Server from the latest release:
- Windows: Download and run `lemonade-server.msi`
- Linux: See Linux installation options
Next, create a Python environment. Choose one of the following methods:
Using venv:

```bash
python -m venv lemon
# Windows:
lemon\Scripts\activate
# Linux/macOS:
source lemon/bin/activate
```

Using conda:
```bash
conda create -n lemon python=3.12
conda activate lemon
```

Using uv:
```bash
uv venv lemon --python 3.12
# Windows:
lemon\Scripts\activate
# Linux/macOS:
source lemon/bin/activate
```

Clone the repository and install in editable mode:

```bash
git clone https://github.com/lemonade-sdk/lemonade-eval.git
cd lemonade-eval
pip install -e .
```

Optional extras:

```bash
# For OGA CPU inference:
pip install -e .[oga-cpu]
# For RyzenAI NPU support (Windows + Python 3.12 only):
pip install -e .[oga-ryzenai] --extra-index-url=https://pypi.amd.com/simple
# For model generation/export (Windows + Python 3.12 only):
pip install -e .[oga-ryzenai,model-generate] --extra-index-url=https://pypi.amd.com/simple
```

## Available Tools

| Tool | Description |
|---|---|
| `load` | Load a model onto a running Lemonade Server |
| `bench` | Benchmark a model loaded on Lemonade Server |
| `oga-load` | Load and prepare OGA models for NPU/CPU inference |
| `accuracy-mmlu` | Evaluate accuracy using MMLU benchmark |
| `accuracy-humaneval` | Evaluate code generation accuracy |
| `accuracy-perplexity` | Calculate perplexity scores |
| `lm-eval-harness` | Run lm-evaluation-harness benchmarks |
| `llm-prompt` | Send a prompt to a loaded model |
| `report` | Display benchmarking and accuracy results |
| `cache` | Manage the lemonade-eval cache |
| `version` | Display version information |
| `system-info` | Query system information from Lemonade Server |
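After installation, a quick sanity check using the `version` and help tools from the table above:

```bash
lemonade-eval version   # confirm the install
lemonade-eval -h        # list global options and available tools
```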
## Server-Based Workflow

Most lemonade-eval tools require a running Lemonade Server. Start the server first:
```bash
lemonade-server serve
```

Then use lemonade-eval to load models and run evaluations:

```bash
# Load a model and prompt it
lemonade-eval -i Qwen3-4B-Instruct-2507-GGUF load llm-prompt -p "Hello, world!"
# Load and benchmark a model
lemonade-eval -i Qwen3-4B-Instruct-2507-GGUF load bench
# Load and run accuracy tests
lemonade-eval -i Qwen3-4B-Instruct-2507-GGUF load accuracy-mmlu --tests management
```

By default, lemonade-eval connects to http://localhost:8000. Use `--server-url` to connect to a different server:

```bash
lemonade-eval -i Qwen3-4B-Instruct-2507-GGUF load --server-url http://192.168.1.100:8000 bench
```

## NPU and Hybrid Models

For NPU and Hybrid inference on AMD Ryzen AI processors, use Lemonade Server with -NPU or -Hybrid models:

```bash
# Load and prompt a Hybrid model (NPU + iGPU)
lemonade-eval -i Llama-3.2-1B-Instruct-Hybrid load llm-prompt -p "Hello!"

# Load and benchmark an NPU model
lemonade-eval -i Qwen-2.5-3B-Instruct-NPU load bench

# Load and run accuracy tests on Hybrid
lemonade-eval -i Qwen3-4B-Hybrid load accuracy-mmlu --tests management
```

Requirements:

- Processor: AMD Ryzen AI 300- and 400-series processors (e.g., Strix Point, Krackan Point, Gorgon Point)
- Operating System: Windows 11
- NPU Driver: Install the NPU Driver
See the Models List for all available -NPU and -Hybrid models.
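One way to compare results across NPU and Hybrid models is to benchmark each and then aggregate with the `report` tool (described below); a sketch using the example models above:

```bash
# Benchmark an NPU model and a Hybrid model, then compare in one report
lemonade-eval -i Qwen-2.5-3B-Instruct-NPU load bench
lemonade-eval -i Llama-3.2-1B-Instruct-Hybrid load bench
lemonade-eval report --perf
```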
## OGA-Load for Model Preparation

The `oga-load` tool is for preparing custom OGA (ONNX Runtime GenAI) models. It can build and quantize models from Hugging Face for use on NPU, iGPU, or CPU.
> **Note:** For running pre-built NPU/Hybrid models, use the server-based workflow above with `-NPU` or `-Hybrid` models. The `oga-load` tool is primarily for model preparation and testing custom checkpoints.
```bash
# Prepare and test a model on CPU
lemonade-eval -i microsoft/Phi-3-mini-4k-instruct oga-load --device cpu --dtype int4 llm-prompt -p "Hello!"
```

See Installation above for OGA extras (`oga-cpu` or `oga-ryzenai`).
See OGA for iGPU and CPU for more details on model building and caching.
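The same flow applies to your own Hugging Face checkpoints. A minimal sketch, using only the documented `--device` and `--dtype` flags; `your-org/your-finetuned-model` is a hypothetical placeholder:

```bash
# Build and quantize a custom checkpoint for CPU, then smoke-test it
# (your-org/your-finetuned-model is a hypothetical placeholder)
lemonade-eval -i your-org/your-finetuned-model oga-load --device cpu --dtype int4 llm-prompt -p "Sanity check"
```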
## Accuracy Testing

### MMLU

Test language understanding across many subjects:

```bash
# With GGUF model
lemonade-eval -i Qwen3-4B-Instruct-2507-GGUF load accuracy-mmlu --tests management
# With Hybrid model
lemonade-eval -i Qwen3-4B-Hybrid load accuracy-mmlu --tests management
```

See MMLU Accuracy for the full list of subjects.
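To cover several subjects, you can loop over them and summarize afterwards; a hedged sketch (management is the documented example, while philosophy and anatomy are standard MMLU subject names assumed here to be valid `--tests` values):

```bash
# Run a few MMLU subjects one at a time, then aggregate with `report`
for subject in management philosophy anatomy; do
  lemonade-eval -i Qwen3-4B-Instruct-2507-GGUF load accuracy-mmlu --tests "$subject"
done
lemonade-eval report --perf
```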
### HumanEval

Test code generation capabilities:

```bash
lemonade-eval -i Qwen3-4B-Instruct-2507-GGUF load accuracy-humaneval
```

See HumanEval Accuracy for details.
### Perplexity

Calculate perplexity scores (requires an OGA model loaded via `oga-load`):

```bash
lemonade-eval -i microsoft/Phi-3-mini-4k-instruct oga-load --device cpu --dtype int4 accuracy-perplexity
```

See Perplexity Evaluation for interpretation guidance.
### lm-eval-harness

Run standardized benchmarks from lm-evaluation-harness:

```bash
# Run GSM8K math benchmark with GGUF model
lemonade-eval -i Qwen3-4B-Instruct-2507-GGUF load lm-eval-harness --task gsm8k --limit 10
# Run with Hybrid model
lemonade-eval -i Qwen3-4B-Hybrid load lm-eval-harness --task gsm8k --limit 10
```

See lm-eval-harness for supported tasks and options.
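You can sweep several tasks the same way; a hedged sketch (gsm8k is documented above, while hellaswag and arc_easy are standard lm-evaluation-harness task names assumed to be supported here):

```bash
# Run a few lm-eval-harness tasks with a small sample limit
for task in gsm8k hellaswag arc_easy; do
  lemonade-eval -i Qwen3-4B-Instruct-2507-GGUF load lm-eval-harness --task "$task" --limit 10
done
```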
## Benchmarking

Benchmark models loaded on Lemonade Server:
```bash
lemonade-eval -i Qwen3-4B-Instruct-2507-GGUF load bench
```

The benchmark measures:
- Time to First Token (TTFT): Latency before first token is generated
- Tokens per Second: Generation throughput
- Memory Usage: Peak memory consumption (on Windows)
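To compare several models, one approach is to benchmark them back to back and aggregate with the `report` tool; a sketch reusing model names documented elsewhere in this guide:

```bash
# Benchmark a few models in sequence, then summarize the results
for model in Qwen3-4B-Instruct-2507-GGUF Llama-3.2-1B-Instruct-GGUF; do
  lemonade-eval -i "$model" load bench
done
lemonade-eval report --perf
```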
Customize the run with iteration counts and output length:

```bash
lemonade-eval -i Qwen3-4B-Instruct-2507-GGUF load bench --iterations 5 --warmup-iterations 2 --output-tokens 128
```

## Export a Finetuned Model

To prepare your own fine-tuned model for OGA:
- Quantize the model using Quark
- Export using `oga-load`
See the Finetuned Model Export Guide for detailed instructions.
## LLM Report

View a summary of all benchmarking and accuracy results:
```bash
lemonade-eval report --perf
```

Results can be filtered by model name, device type, and data type:
```bash
lemonade-eval report --perf --filter-model "Qwen"
```

## Memory Usage

On Windows, memory usage of the inference server backend can be tracked with the `--memory` flag.
For example:
```bash
lemonade-eval --memory -i Llama-3.2-1B-Instruct-GGUF load bench
```

This generates a PNG file, stored in both the current folder and the build folder, plotting the memory usage of the inference backend over the lemonade-eval tool sequence. Learn more by running `lemonade-eval -h`.
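Since `--memory` applies to the whole tool sequence, it can be combined with other evaluations; a hedged sketch reusing the MMLU example from above:

```bash
# Track backend memory while running an accuracy test (Windows only)
lemonade-eval --memory -i Qwen3-4B-Instruct-2507-GGUF load accuracy-mmlu --tests management
```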
## Power Profiling

For power profiling, see Power Profiling.
## System Information

To view system information and available devices, use the `system-info` tool:
```bash
lemonade-eval system-info
```

By default, this shows essential information including OS version, processor, and physical memory.
For detailed system information including BIOS version, CPU max clock, Windows power setting, and Python packages, use the `--verbose` flag:
```bash
lemonade-eval system-info --verbose
```

For JSON output format:

```bash
lemonade-eval system-info --format json
```
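The JSON form is convenient for scripting; a minimal sketch, assuming `jq` is installed (the report's key names depend on your system and are not specified here):

```bash
# Pretty-print the system report's top-level keys
lemonade-eval system-info --format json | jq 'keys'
```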