lemonade-eval CLI

Overview

The lemonade-eval CLI provides tools for evaluating, benchmarking, and preparing LLMs. It is designed to work alongside the Lemonade Server, enabling:

  • Performance benchmarking of models running on Lemonade Server
  • Accuracy testing using MMLU, HumanEval, Perplexity, and lm-eval-harness
  • Model preparation for OGA (ONNX Runtime GenAI) on NPU and CPU devices

The CLI uses a composable command syntax in which each unit of functionality is called a Tool. A single call to lemonade-eval can invoke multiple Tools in sequence, with each Tool passing its state to the next.

For example:

lemonade-eval -i Qwen3-4B-Instruct-2507-GGUF load bench

This can be read as:

Run lemonade-eval on the input (-i) model Qwen3-4B-Instruct-2507-GGUF. First, load it on the Lemonade Server (load), then benchmark it (bench).

Use lemonade-eval -h to see available options and tools, and lemonade-eval TOOL -h for help on a specific tool.
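
Longer sequences work the same way. As a sketch (assuming each Tool accepts the state the previous one produces, as the two-tool examples in this README suggest), the following loads a model, prompts it, and then benchmarks it:

lemonade-eval -i Qwen3-4B-Instruct-2507-GGUF load llm-prompt -p "Hello!" bench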

Installation

1. Install Lemonade Server

Install Lemonade Server from the latest release.
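
If you prefer installing from PyPI instead of a release installer, one likely route (assuming the package name lemonade-sdk, which this README does not confirm) is:

pip install lemonade-sdk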

2. Create a Python Environment

Choose one of the following methods:

Using venv:

python -m venv lemon
# Windows:
lemon\Scripts\activate
# Linux/macOS:
source lemon/bin/activate

Using conda:

conda create -n lemon python=3.12
conda activate lemon

Using uv:

uv venv lemon --python 3.12
# Windows:
lemon\Scripts\activate
# Linux/macOS:
source lemon/bin/activate

3. Install lemonade-eval

Clone the repository and install in editable mode:

git clone https://github.com/lemonade-sdk/lemonade-eval.git
cd lemonade-eval
pip install -e .

Optional extras:

# For OGA CPU inference:
pip install -e .[oga-cpu]

# For RyzenAI NPU support (Windows + Python 3.12 only):
pip install -e .[oga-ryzenai] --extra-index-url=https://pypi.amd.com/simple

# For model generation/export (Windows + Python 3.12 only):
pip install -e .[oga-ryzenai,model-generate] --extra-index-url=https://pypi.amd.com/simple
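
After installation, a quick smoke test is to print the version using the version tool (listed in the table below):

lemonade-eval version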

Available Tools

Tool                 Description
load                 Load a model onto a running Lemonade Server
bench                Benchmark a model loaded on Lemonade Server
oga-load             Load and prepare OGA models for NPU/CPU inference
accuracy-mmlu        Evaluate accuracy using the MMLU benchmark
accuracy-humaneval   Evaluate code generation accuracy with HumanEval
accuracy-perplexity  Calculate perplexity scores
lm-eval-harness      Run lm-evaluation-harness benchmarks
llm-prompt           Send a prompt to a loaded model
report               Display benchmarking and accuracy results
cache                Manage the lemonade-eval cache
version              Display version information
system-info          Query system information from Lemonade Server

Server-Based Workflow

Most lemonade-eval tools require a running Lemonade Server. Start the server first:

lemonade-server serve

Then use lemonade-eval to load models and run evaluations:

# Load a model and prompt it
lemonade-eval -i Qwen3-4B-Instruct-2507-GGUF load llm-prompt -p "Hello, world!"

# Load and benchmark a model
lemonade-eval -i Qwen3-4B-Instruct-2507-GGUF load bench

# Load and run accuracy tests
lemonade-eval -i Qwen3-4B-Instruct-2507-GGUF load accuracy-mmlu --tests management

Server Connection Options

By default, lemonade-eval connects to http://localhost:8000. Use --server-url to connect to a different server:

lemonade-eval -i Qwen3-4B-Instruct-2507-GGUF load --server-url http://192.168.1.100:8000 bench

NPU and Hybrid Models

For NPU and Hybrid inference on AMD Ryzen AI processors, use Lemonade Server with -NPU or -Hybrid models:

# Load and prompt a Hybrid model (NPU + iGPU)
lemonade-eval -i Llama-3.2-1B-Instruct-Hybrid load llm-prompt -p "Hello!"

# Load and benchmark an NPU model
lemonade-eval -i Qwen-2.5-3B-Instruct-NPU load bench

# Load and run accuracy tests on Hybrid
lemonade-eval -i Qwen3-4B-Hybrid load accuracy-mmlu --tests management

Requirements for NPU/Hybrid

  • Processor: AMD Ryzen AI 300- and 400-series processors (e.g., Strix Point, Krackan Point, Gorgon Point)
  • Operating System: Windows 11
  • NPU Driver: Install the latest NPU driver

See the Models List for all available -NPU and -Hybrid models.

OGA-Load for Model Preparation

The oga-load tool is for preparing custom OGA (ONNX Runtime GenAI) models. It can build and quantize models from Hugging Face for use on NPU, iGPU, or CPU.

Note: For running pre-built NPU/Hybrid models, use the server-based workflow above with -NPU or -Hybrid models. The oga-load tool is primarily for model preparation and testing custom checkpoints.

Usage

# Prepare and test a model on CPU
lemonade-eval -i microsoft/Phi-3-mini-4k-instruct oga-load --device cpu --dtype int4 llm-prompt -p "Hello!"

Installation for OGA

See Installation above for OGA extras (oga-cpu or oga-ryzenai).

See OGA for iGPU and CPU for more details on model building and caching.

Accuracy Testing

MMLU

Test language understanding across many subjects:

# With GGUF model
lemonade-eval -i Qwen3-4B-Instruct-2507-GGUF load accuracy-mmlu --tests management

# With Hybrid model
lemonade-eval -i Qwen3-4B-Hybrid load accuracy-mmlu --tests management

See MMLU Accuracy for the full list of subjects.
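
Multiple subjects can likely be evaluated in one pass, assuming --tests accepts a space-separated list (not confirmed by this README; check accuracy-mmlu -h):

lemonade-eval -i Qwen3-4B-Instruct-2507-GGUF load accuracy-mmlu --tests management philosophy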

HumanEval

Test code generation capabilities:

lemonade-eval -i Qwen3-4B-Instruct-2507-GGUF load accuracy-humaneval

See HumanEval Accuracy for details.

Perplexity

Calculate perplexity scores (requires an OGA model loaded via oga-load):

lemonade-eval -i microsoft/Phi-3-mini-4k-instruct oga-load --device cpu --dtype int4 accuracy-perplexity

See Perplexity Evaluation for interpretation guidance.

lm-eval-harness

Run standardized benchmarks from lm-evaluation-harness:

# Run GSM8K math benchmark with GGUF model
lemonade-eval -i Qwen3-4B-Instruct-2507-GGUF load lm-eval-harness --task gsm8k --limit 10

# Run with Hybrid model
lemonade-eval -i Qwen3-4B-Hybrid load lm-eval-harness --task gsm8k --limit 10

See lm-eval-harness for supported tasks and options.

Benchmarking

With Lemonade Server

Benchmark models loaded on Lemonade Server:

lemonade-eval -i Qwen3-4B-Instruct-2507-GGUF load bench

The benchmark measures:

  • Time to First Token (TTFT): Latency before the first token is generated
  • Tokens per Second: Generation throughput
  • Memory Usage: Peak memory consumption (on Windows)

Options

lemonade-eval -i Qwen3-4B-Instruct-2507-GGUF load bench --iterations 5 --warmup-iterations 2 --output-tokens 128
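
Judging by the flag names, --iterations sets the number of measured runs, --warmup-iterations the number of unmeasured warmup passes, and --output-tokens the number of tokens generated per run; run lemonade-eval bench -h for the authoritative descriptions.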

Exporting a Finetuned Model

To prepare your own fine-tuned model for OGA:

  1. Quantize the model using Quark
  2. Export using oga-load (see the sketch below)

See the Finetuned Model Export Guide for detailed instructions.
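
As a rough sketch of step 2 (assuming oga-load accepts a local checkpoint path as the -i input and that npu is a valid --device value; neither is confirmed by this README):

lemonade-eval -i ./my-finetuned-checkpoint oga-load --device npu --dtype int4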

LLM Report

View a summary of all benchmarking and accuracy results:

lemonade-eval report --perf

Results can be filtered by model name, device type, and data type:

lemonade-eval report --perf --filter-model "Qwen"

Memory Usage

On Windows, memory usage of the inference server backend can be tracked with the --memory flag. For example:

lemonade-eval --memory -i Llama-3.2-1B-Instruct-GGUF load bench

This generates a PNG file, saved in both the current folder and the model's build folder, containing a plot of the inference backend's memory usage over the lemonade-eval tool sequence. Learn more by running lemonade-eval -h.

Power Profiling

For power profiling, see Power Profiling.

System Information

To view system information and available devices, use the system-info tool:

lemonade-eval system-info

By default, this shows essential information including OS version, processor, and physical memory.

For detailed system information including BIOS version, CPU max clock, Windows power setting, and Python packages, use the --verbose flag:

lemonade-eval system-info --verbose

For JSON output format:

lemonade-eval system-info --format json
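
The JSON format is handy for scripting, for example to archive a machine's configuration alongside its benchmark results (using only the flags shown above):

lemonade-eval system-info --format json > system_info.json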
