A lightweight, OpenAI API-compatible server for running LLMs on AMD Ryzen AI NPUs using ONNX Runtime GenAI.
This server enables running Large Language Models on AMD Ryzen AI 300-series processors with NPU acceleration. It implements the OpenAI API specification, making it compatible with existing LLM applications and tools.
Key Features:
- OpenAI API Compatible: /v1/chat/completions, /v1/completions, /v1/responses
- Tool/Function Calling: OpenAI-compatible function calling support
- Multiple Execution Modes: NPU, Hybrid (NPU+iGPU), CPU
- Streaming Support: Real-time Server-Sent Events for all endpoints
- Echo Parameter: Option to include prompt in completion output
- Stop Sequences: Custom stop sequences for generation control
- Minimal Dependencies: Single executable + DLLs
- Simple Architecture: One-model-per-process design
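As a quick taste of the echo and stop-sequence features, here is a minimal sketch using the OpenAI Python client against a locally running server (build and startup instructions follow below). It assumes the server is already listening on the default 127.0.0.1:8080 with a model loaded; the model name is ignored because each process serves a single model.

# Minimal sketch: /v1/completions with echo and stop sequences.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.completions.create(
    model="ignored",               # one model per process; the name is not used
    prompt="The three primary colors are",
    max_tokens=32,
    echo=True,                     # include the prompt in the returned text
    stop=["\n\n"],                 # custom stop sequence: halt at the first blank line
)

print(response.choices[0].text)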
Windows Requirements:
- Windows 11 (64-bit)
- Visual Studio 2022
- CMake 3.20 or higher
- Ryzen AI Software 1.6.0 with LLM patch
  - Base installation must be at C:\Program Files\RyzenAI\1.6.0
  - LLM patch must be applied on top of the base installation
  - Download from: https://ryzenai.docs.amd.com
Hardware Requirements:
- AMD Ryzen AI 300- or 400-series processor (for NPU execution)
- Minimum 16GB RAM (32GB recommended for larger models)
# Clone the repository
git clone https://github.com/lemonade-sdk/ryzenai-server.git
cd ryzenai-server
# Create and enter build directory
mkdir build
cd build
# Configure with CMake
cmake .. -G "Visual Studio 17 2022" -A x64
# Build
cmake --build . --config Release

The executable and required DLLs will be created at:
build\bin\Release\ryzenai-server.exe
All necessary Ryzen AI DLLs are automatically copied to the output directory during build.
Note: The Ryzen AI DLLs included in the release are licensed under the AMD Software End User License Agreement. See AMD_LICENSE in the release package for full terms.
If Ryzen AI is installed in a custom location:
cmake .. -G "Visual Studio 17 2022" -A x64 -DOGA_ROOT="C:\custom\path\to\RyzenAI\1.6.0"ryzenai-server/
├── CMakeLists.txt # Build configuration
│
├── src/ # Source files
│ ├── main.cpp # Entry point
│ ├── server.cpp # HTTP server (cpp-httplib)
│ ├── inference_engine.cpp # ONNX Runtime GenAI wrapper
│ ├── command_line.cpp # CLI argument parsing
│ ├── types.cpp # Data structures
│ ├── tool_calls.cpp # OpenAI tool/function calling
│ └── reasoning.cpp # Reasoning content handling
│
├── include/ryzenai/ # Headers
│ ├── server.h
│ ├── inference_engine.h
│ ├── command_line.h
│ ├── types.h
│ ├── tool_calls.h
│ └── reasoning.h
│
└── external/ # Header-only dependencies
├── cpp-httplib/ # HTTP server (auto-downloaded)
└── json/ # JSON library (auto-downloaded)
Design Principles:
- Simplicity: One process serves one model - no dynamic loading/unloading
- RAII: Resource management follows C++ best practices with smart pointers
- Thread Safety: Shared resources protected with proper synchronization
- Single Binary: Minimal dependencies for easy deployment
┌─────────────────────────────────────────────────┐
│ HTTP Server (cpp-httplib) │
│ OpenAI API Endpoints │
├─────────────────────────────────────────────────┤
│ Request Handlers │
│ (chat, completions, streaming) │
├─────────────────────────────────────────────────┤
│ Inference Engine │
│ ONNX Runtime GenAI │
├─────────────────────────────────────────────────┤
│ Execution Providers │
│ NPU / Hybrid / CPU │
└─────────────────────────────────────────────────┘
Server: HTTP server using cpp-httplib with OpenAI-compatible endpoints. Features:
- 8-thread pool for concurrent request handling
- Built-in CORS support (Access-Control-Allow-Origin: *)
- Request routing and response formatting
- Chunked transfer encoding for streaming
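For example, the streaming chat endpoint can be consumed as raw Server-Sent Events over the chunked connection. This is a sketch using the requests library; it assumes the stream follows the OpenAI convention of "data: {json}" lines terminated by "data: [DONE]".

# Sketch: reading the server's SSE stream directly.
import json
import requests

payload = {
    "messages": [{"role": "user", "content": "Write a haiku about NPUs."}],
    "stream": True,
}

with requests.post("http://localhost:8080/v1/chat/completions", json=payload, stream=True) as r:
    for line in r.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue                      # skip blank keep-alive lines
        data = line[len("data: "):]
        if data == "[DONE]":
            break                         # end-of-stream sentinel
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"].get("content", "")
        print(delta, end="", flush=True)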
Inference Engine: Wraps ONNX Runtime GenAI API, managing model loading, generation parameters, and streaming callbacks. Applies chat templates and handles tool call extraction.
Execution Providers: Supports three modes:
- Hybrid: NPU + iGPU
- NPU: Pure NPU execution
- CPU: CPU-only fallback
These dependencies are automatically downloaded during build:
- cpp-httplib (v0.26.0) - HTTP server [MIT License]
- nlohmann/json (v3.11.3) - JSON parsing [MIT License]
These dependencies must be manually installed by the developer:
- ONNX Runtime GenAI - Inference engine
# Specify NPU mode
ryzenai-server.exe -m C:\path\to\onnx\model --mode npu
# Hybrid mode with custom port
ryzenai-server.exe -m C:\path\to\onnx\model --mode hybrid --port 8081
# CPU mode
ryzenai-server.exe -m C:\path\to\onnx\model --mode cpu
# Verbose logging
ryzenai-server.exe -m C:\path\to\onnx\model --verbose

Command-line options:
- -m, --model PATH - Path to ONNX model directory (required)
- --host ADDRESS - Server host address (default: 127.0.0.1)
- -p, --port PORT - Server port (default: 8080)
- --mode MODE - Execution mode: npu, hybrid, cpu (default: hybrid)
- -c, --ctx-size SIZE - Context size in tokens (default: 2048)
- -t, --threads NUM - Number of CPU threads (default: 4)
- -v, --verbose - Enable verbose logging
- -h, --help - Show help message
Models must be in ONNX format compatible with Ryzen AI. Required files:
- model.onnx or model.onnx.data
- genai_config.json
- Tokenizer files (tokenizer.json, tokenizer_config.json, etc.)
Models are typically cached in:
C:\Users\<Username>\.cache\huggingface\hub\
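Before starting the server, it can be handy to confirm a model directory actually contains the files listed above. This is a small illustrative check (file presence only; it does not verify that the model is compatible with Ryzen AI):

# Sketch: pre-flight check of a model directory.
from pathlib import Path

def check_model_dir(model_dir: str) -> list[str]:
    d = Path(model_dir)
    missing = []
    if not (d / "model.onnx").exists() and not (d / "model.onnx.data").exists():
        missing.append("model.onnx (or model.onnx.data)")
    for name in ("genai_config.json", "tokenizer.json", "tokenizer_config.json"):
        if not (d / name).exists():
            missing.append(name)
    return missing

missing = check_model_dir(r"C:\path\to\onnx\model")
print("Model directory looks complete" if not missing else f"Missing files: {missing}")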
The server implements OpenAI-compatible API endpoints.
GET /health - Returns server status and Ryzen AI-specific information:
{
"status": "ok",
"model": "phi-3-mini-4k-instruct",
"execution_mode": "hybrid",
"max_prompt_length": 4096,
"ryzenai_version": "1.6.0"
}

Other endpoints:
- GET / - Server information and available endpoints
- POST /v1/chat/completions - Chat completions with tool/function calling support
- POST /v1/completions - Text completions with echo parameter
- POST /v1/responses - OpenAI Responses API format
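Since /v1/responses follows the OpenAI Responses API format, a recent openai Python package can target it directly. This is a hedged sketch; it assumes the server accepts the standard request shape and that your client version includes Responses support.

# Sketch: calling /v1/responses with the OpenAI Python client.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.responses.create(
    model="ignored",                      # one model per process; the name is not used
    input="Summarize what an NPU is in one sentence.",
)

print(response.output_text)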
All endpoints support both streaming and non-streaming modes. The server applies chat templates automatically and extracts tool calls from model output.
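Tool calling uses the standard OpenAI request and response shape. The sketch below uses a made-up get_weather tool for illustration; whether the model actually emits a tool call depends on the model loaded into this server process.

# Sketch: OpenAI-style tool/function calling against /v1/chat/completions.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="ignored",
    messages=[{"role": "user", "content": "What's the weather in Austin?"}],
    tools=tools,
)

message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    print(call.function.name, json.loads(call.function.arguments))
else:
    print(message.content)

To smoke-test the server from a terminal: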
# Start the server
cd build\bin\Release
ryzenai-server.exe -m C:\path\to\model --verbose
# Test health endpoint (in another terminal)
curl http://localhost:8080/health
# Test chat completion
curl http://localhost:8080/v1/chat/completions ^
-H "Content-Type: application/json" ^
-d "{\"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}], \"max_tokens\": 50}"This server is designed to be used as a backend for Lemonade Server. When running Lemonade Server, the ryzenai-server executable is automatically downloaded from GitHub releases and managed by the Lemonade Router.
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8080/v1",
api_key="not-needed"
)
response = client.chat.completions.create(
model="ignored", # Model is already loaded
messages=[
{"role": "user", "content": "What is 2+2?"}
]
)
print(response.choices[0].message.content)
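The same client can also stream tokens as they are generated. A minimal sketch, reusing the client object created above; delta.content may be None on the initial role-only chunk, so it is filtered out.

# Streaming variant, reusing the client from the example above.
stream = client.chat.completions.create(
    model="ignored",
    messages=[{"role": "user", "content": "Count from 1 to 5."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()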
Troubleshooting:
If the model fails to load, check:
- Model path is correct and contains required ONNX files
- Ryzen AI 1.6.0 is installed at the correct path
- NPU drivers are up to date (Windows Update)
- Model is compatible with your Ryzen AI version
All required DLLs should be automatically copied during build. If you get DLL errors:
- Verify Ryzen AI is installed correctly
- Rebuild with cmake --build . --config Release
- Manually copy DLLs from C:\Program Files\RyzenAI\1.6.0\deployment\ to the executable directory (a helper sketch follows this list)
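If you prefer to script the manual copy, here is a small sketch; the paths assume the default Ryzen AI 1.6.0 install location and a Release build directory.

# Sketch: copy the Ryzen AI runtime DLLs next to the executable.
import shutil
from pathlib import Path

deployment = Path(r"C:\Program Files\RyzenAI\1.6.0\deployment")
target = Path(r"build\bin\Release")

for dll in deployment.glob("*.dll"):
    shutil.copy2(dll, target / dll.name)
    print(f"copied {dll.name}")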
If port 8080 is occupied:
ryzenai-server.exe -m C:\path\to\model --port 8081

Code Style:
- C++17 standard
- RAII for resource management
- Smart pointers (no raw pointers)
- Const correctness
- snake_case for functions
- PascalCase for types
Debug build with symbols:
cmake --build . --config Debug

Debug executable location:
build\bin\Debug\ryzenai-server.exe
Streaming with JSON Library: Creating nlohmann::json objects directly in ONNX Runtime streaming callbacks can cause crashes. The workaround is to manually construct JSON strings in callbacks. This is stable and performs well.
- Ryzen AI Documentation: https://ryzenai.docs.amd.com
- ONNX Runtime GenAI: https://github.com/microsoft/onnxruntime-genai
- Lemonade Server: https://github.com/lemonade-sdk/lemonade - Parent project providing model orchestration
This project's source code is licensed under the MIT License - see LICENSE for details.
Release Artifacts (ryzenai-server.zip):
- The ryzenai-server.exe binary and the header-only dependencies (cpp-httplib, nlohmann/json) are MIT licensed
- The Ryzen AI DLLs included in binary releases are licensed under the AMD Software End User License Agreement - see the AMD_LICENSE file in the release package for full terms
Note: When you download a release, the AMD_LICENSE file is included alongside the DLLs. The source code in this repository does not include the DLLs - they are copied from your local Ryzen AI installation during the build process.