A lightweight, OpenAI API-compatible server for running LLMs on AMD Ryzen AI NPUs using ONNX Runtime GenAI.
This server enables running Large Language Models on AMD Ryzen AI 300-series processors with NPU acceleration. It implements the OpenAI API specification, making it compatible with existing LLM applications and tools.
Key Features:
- OpenAI API Compatible: /v1/chat/completions, /v1/completions, /v1/responses
- Tool/Function Calling: OpenAI-compatible function calling support
- Multiple Execution Modes: NPU, Hybrid (NPU+iGPU), CPU
- Streaming Support: Real-time Server-Sent Events for all endpoints
- Echo Parameter: Option to include prompt in completion output
- Stop Sequences: Custom stop sequences for generation control
- Minimal Dependencies: Single executable + DLLs
- Simple Architecture: One-model-per-process design
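As a quick taste of the echo and stop-sequence features, here is a minimal sketch using the OpenAI Python client against a locally running server (build and startup instructions follow below). It assumes the server is already listening on the default 127.0.0.1:8080 with a model loaded; the model name is ignored because each process serves a single model.

# Minimal sketch: /v1/completions with echo and stop sequences.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.completions.create(
    model="ignored",               # one model per process; the name is not used
    prompt="The three primary colors are",
    max_tokens=32,
    echo=True,                     # include the prompt in the returned text
    stop=["\n\n"],                 # custom stop sequence: halt at the first blank line
)

print(response.choices[0].text)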
Windows Requirements:
- Windows 11 (64-bit)
- Visual Studio 2022
- CMake 3.20 or higher
- Ryzen AI Software 1.6.0 with LLM patch
  - Base installation must be at C:\Program Files\RyzenAI\1.6.0
  - LLM patch must be applied on top of the base installation
  - Download from: https://ryzenai.docs.amd.com
Hardware Requirements:
- AMD Ryzen AI 300- or 400-series processor (for NPU execution)
- Minimum 16GB RAM (32GB recommended for larger models)
# Clone the repository
git clone https://github.com/lemonade-sdk/ryzenai-server.git
cd ryzenai-server
# Create and enter build directory
mkdir build
cd build
# Configure with CMake
cmake .. -G "Visual Studio 17 2022" -A x64
# Build
cmake --build . --config Release

The executable and required DLLs will be created at:
build\bin\Release\ryzenai-server.exe
All necessary Ryzen AI DLLs are automatically copied to the output directory during build.
Note: The Ryzen AI DLLs included in the release are licensed under the AMD Software End User License Agreement. See AMD_LICENSE in the release package for full terms.
If Ryzen AI is installed in a custom location:
cmake .. -G "Visual Studio 17 2022" -A x64 -DOGA_ROOT="C:\custom\path\to\RyzenAI\1.6.0"ryzenai-server/
├── CMakeLists.txt # Build configuration
│
├── src/ # Source files
│ ├── main.cpp # Entry point
│ ├── server.cpp # HTTP server (cpp-httplib)
│ ├── inference_engine.cpp # ONNX Runtime GenAI wrapper
│ ├── command_line.cpp # CLI argument parsing
│ ├── types.cpp # Data structures
│ ├── tool_calls.cpp # OpenAI tool/function calling
│ └── reasoning.cpp # Reasoning content handling
│
├── include/ryzenai/ # Headers
│ ├── server.h
│ ├── inference_engine.h
│ ├── command_line.h
│ ├── types.h
│ ├── tool_calls.h
│ └── reasoning.h
│
└── external/ # Header-only dependencies
├── cpp-httplib/ # HTTP server (auto-downloaded)
└── json/ # JSON library (auto-downloaded)
Design Principles:
- Simplicity: One process serves one model - no dynamic loading/unloading
- RAII: Resource management follows C++ best practices with smart pointers
- Thread Safety: Shared resources protected with proper synchronization
- Single Binary: Minimal dependencies for easy deployment
┌─────────────────────────────────────────────────┐
│ HTTP Server (cpp-httplib) │
│ OpenAI API Endpoints │
├─────────────────────────────────────────────────┤
│ Request Handlers │
│ (chat, completions, streaming) │
├─────────────────────────────────────────────────┤
│ Inference Engine │
│ ONNX Runtime GenAI │
├─────────────────────────────────────────────────┤
│ Execution Providers │
│ NPU / Hybrid / CPU │
└─────────────────────────────────────────────────┘
Server: HTTP server using cpp-httplib with OpenAI-compatible endpoints. Features:
- 8-thread pool for concurrent request handling
- Built-in CORS support (Access-Control-Allow-Origin: *)
- Request routing and response formatting
- Chunked transfer encoding for streaming
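For example, the streaming chat endpoint can be consumed as raw Server-Sent Events over the chunked connection. This is a sketch using the requests library; it assumes the stream follows the OpenAI convention of "data: {json}" lines terminated by "data: [DONE]".

# Sketch: reading the server's SSE stream directly.
import json
import requests

payload = {
    "messages": [{"role": "user", "content": "Write a haiku about NPUs."}],
    "stream": True,
}

with requests.post("http://localhost:8080/v1/chat/completions", json=payload, stream=True) as r:
    for line in r.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue                      # skip blank keep-alive lines
        data = line[len("data: "):]
        if data == "[DONE]":
            break                         # end-of-stream sentinel
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"].get("content", "")
        print(delta, end="", flush=True)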
Inference Engine: Wraps ONNX Runtime GenAI API, managing model loading, generation parameters, and streaming callbacks. Applies chat templates and handles tool call extraction.
Execution Providers: Supports three modes:
- Hybrid: NPU + iGPU
- NPU: Pure NPU execution
- CPU: CPU-only fallback
These dependencies are automatically downloaded during build:
- cpp-httplib (v0.26.0) - HTTP server [MIT License]
- nlohmann/json (v3.11.3) - JSON parsing [MIT License]
These dependencies must be manually installed by the developer:
- ONNX Runtime GenAI - Inference engine
# Specify NPU mode
ryzenai-server.exe -m C:\path\to\onnx\model --mode npu
# Hybrid mode with custom port
ryzenai-server.exe -m C:\path\to\onnx\model --mode hybrid --port 8081
# CPU mode
ryzenai-server.exe -m C:\path\to\onnx\model --mode cpu
# Verbose logging
ryzenai-server.exe -m C:\path\to\onnx\model --verbose

Command-line options:
- -m, --model PATH - Path to ONNX model directory (required)
- --host ADDRESS - Server host address (default: 127.0.0.1)
- -p, --port PORT - Server port (default: 8080)
- --mode MODE - Execution mode: npu, hybrid, cpu (default: hybrid)
- -c, --ctx-size SIZE - Context size in tokens (default: 2048)
- -t, --threads NUM - Number of CPU threads (default: 4)
- -v, --verbose - Enable verbose logging
- -h, --help - Show help message
Models must be in ONNX format compatible with Ryzen AI. Required files:
- model.onnx or model.onnx.data
- genai_config.json
- Tokenizer files (tokenizer.json, tokenizer_config.json, etc.)
Models are typically cached in:
C:\Users\<Username>\.cache\huggingface\hub\
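Before starting the server, it can be handy to confirm a model directory actually contains the files listed above. This is a small illustrative check (file presence only; it does not verify that the model is compatible with Ryzen AI):

# Sketch: pre-flight check of a model directory.
from pathlib import Path

def check_model_dir(model_dir: str) -> list[str]:
    d = Path(model_dir)
    missing = []
    if not (d / "model.onnx").exists() and not (d / "model.onnx.data").exists():
        missing.append("model.onnx (or model.onnx.data)")
    for name in ("genai_config.json", "tokenizer.json", "tokenizer_config.json"):
        if not (d / name).exists():
            missing.append(name)
    return missing

missing = check_model_dir(r"C:\path\to\onnx\model")
print("Model directory looks complete" if not missing else f"Missing files: {missing}")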
The server implements OpenAI-compatible API endpoints.
GET /health - Returns server status and Ryzen AI-specific information:
{
"status": "ok",
"model": "phi-3-mini-4k-instruct",
"execution_mode": "hybrid",
"max_prompt_length": 4096,
"ryzenai_version": "1.6.0"
}

Other endpoints:
- GET / - Server information and available endpoints
- POST /v1/chat/completions - Chat completions with tool/function calling support
- POST /v1/completions - Text completions with echo parameter
- POST /v1/responses - OpenAI Responses API format
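Since /v1/responses follows the OpenAI Responses API format, a recent openai Python package can target it directly. This is a hedged sketch; it assumes the server accepts the standard request shape and that your client version includes Responses support.

# Sketch: calling /v1/responses with the OpenAI Python client.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.responses.create(
    model="ignored",                      # one model per process; the name is not used
    input="Summarize what an NPU is in one sentence.",
)

print(response.output_text)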
All endpoints support both streaming and non-streaming modes. The server applies chat templates automatically and extracts tool calls from model output.
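Tool calling uses the standard OpenAI request and response shape. The sketch below uses a made-up get_weather tool for illustration; whether the model actually emits a tool call depends on the model loaded into this server process.

# Sketch: OpenAI-style tool/function calling against /v1/chat/completions.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="ignored",
    messages=[{"role": "user", "content": "What's the weather in Austin?"}],
    tools=tools,
)

message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    print(call.function.name, json.loads(call.function.arguments))
else:
    print(message.content)

To smoke-test the server from a terminal: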
# Start the server
cd build\bin\Release
ryzenai-server.exe -m C:\path\to\model --verbose
# Test health endpoint (in another terminal)
curl http://localhost:8080/health
# Test chat completion
curl http://localhost:8080/v1/chat/completions ^
-H "Content-Type: application/json" ^
-d "{\"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}], \"max_tokens\": 50}"This server is designed to be used as a backend for Lemonade Server. When running Lemonade Server, the ryzenai-server executable is automatically downloaded from GitHub releases and managed by the Lemonade Router.
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8080/v1",
api_key="not-needed"
)
response = client.chat.completions.create(
model="ignored", # Model is already loaded
messages=[
{"role": "user", "content": "What is 2+2?"}
]
)
print(response.choices[0].message.content)
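The same client can also stream tokens as they are generated. A minimal sketch, reusing the client object created above; delta.content may be None on the initial role-only chunk, so it is filtered out.

# Streaming variant, reusing the client from the example above.
stream = client.chat.completions.create(
    model="ignored",
    messages=[{"role": "user", "content": "Count from 1 to 5."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()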
Troubleshooting:
If the model fails to load, check:
- Model path is correct and contains required ONNX files
- Ryzen AI 1.6.0 is installed at the correct path
- NPU drivers are up to date (Windows Update)
- Model is compatible with your Ryzen AI version
All required DLLs should be automatically copied during build. If you get DLL errors:
- Verify Ryzen AI is installed correctly
- Rebuild with cmake --build . --config Release
- Manually copy DLLs from C:\Program Files\RyzenAI\1.6.0\deployment\ to the executable directory (a helper sketch follows this list)
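If you prefer to script the manual copy, here is a small sketch; the paths assume the default Ryzen AI 1.6.0 install location and a Release build directory.

# Sketch: copy the Ryzen AI runtime DLLs next to the executable.
import shutil
from pathlib import Path

deployment = Path(r"C:\Program Files\RyzenAI\1.6.0\deployment")
target = Path(r"build\bin\Release")

for dll in deployment.glob("*.dll"):
    shutil.copy2(dll, target / dll.name)
    print(f"copied {dll.name}")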
If port 8080 is occupied:
ryzenai-server.exe -m C:\path\to\model --port 8081

Code Style:
- C++17 standard
- RAII for resource management
- Smart pointers (no raw pointers)
- Const correctness
- snake_case for functions
- PascalCase for types
Debug build with symbols:
cmake --build . --config Debug

Debug executable location:
build\bin\Debug\ryzenai-server.exe
Streaming with JSON Library: Creating nlohmann::json objects directly in ONNX Runtime streaming callbacks can cause crashes. The workaround is to manually construct JSON strings in callbacks. This is stable and performs well.
- Ryzen AI Documentation: https://ryzenai.docs.amd.com
- ONNX Runtime GenAI: https://github.com/microsoft/onnxruntime-genai
- Lemonade Server: https://github.com/lemonade-sdk/lemonade - Parent project providing model orchestration
This project's source code is licensed under the MIT License - see LICENSE for details.
Release Artifacts (ryzenai-server.zip):
- The ryzenai-server.exe binary and the header-only dependencies (cpp-httplib, nlohmann/json) are MIT licensed
- The Ryzen AI DLLs included in binary releases are licensed under the AMD Software End User License Agreement - see the AMD_LICENSE file in the release package for full terms
Note: When you download a release, the AMD_LICENSE file is included alongside the DLLs. The source code in this repository does not include the DLLs - they are copied from your local Ryzen AI installation during the build process.