A high-performance, production-ready Vision-Language Model (VLM) inference server built entirely in Rust
Transform images and text into insights with this OpenAI-compatible inference server powered by real AI models (LLaVA 1.5) and the Candle ML framework.
Modern AI applications need to understand both images and text together - whether it's analyzing medical scans, describing product images, or answering questions about visual data. But deploying vision-language models (VLMs) is challenging:
- Complex Infrastructure: Most solutions require Python, CUDA, multiple dependencies
- Poor Performance: High latency, memory inefficiency, difficult to scale
- Vendor Lock-in: Cloud-only solutions with high costs
We built this to solve those problems with a pure Rust implementation that's:
- Fast: Low-latency inference with Metal GPU support (Apple Silicon) and CPU fallback
- Efficient: 14GB model running on consumer hardware
- Production-Ready: OpenAI-compatible API, streaming support, proper error handling
- Easy to Deploy: Single binary, no Python required
📖 Read the full story in our deep-dive blog post →
Get up and running in 5 minutes:
- Rust 1.70+: Install Rust
- 8GB+ RAM: 16GB recommended for optimal performance
- macOS or Linux: Apple Silicon (M1/M2/M3) or x86_64
# Clone the repository
git clone https://github.com/mixpeek/multimodal-inference-server.git
cd vlm-inference-server
# Build the project (Release mode for best performance)
cargo build --release
# Start the worker (downloads 14GB model on first run)
./target/release/vlm-worker --host 0.0.0.0 --port 50051 &
# Start the gateway (HTTP API)
./target/release/vlm-gateway --host 0.0.0.0 --port 8080 &# Send a chat completion request
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "vlm-prod",
"messages": [{"role": "user", "content": "Hello, how are you?"}],
"max_tokens": 20
}'That's it! You now have a VLM inference server running locally.
- OpenAI-Compatible API: Drop-in replacement for OpenAI's chat completions endpoint
- Streaming Support: Real-time token streaming via Server-Sent Events (SSE)
- Health Checks:
/healthzand/readyzendpoints for Kubernetes - Graceful Shutdown: Proper cleanup and connection draining
- LLaVA 1.5 7B: State-of-the-art vision-language model
- CLIP Vision Encoder: High-quality image understanding
- LLaMA-2 Text Generation: Powerful language model
- Automatic Downloads: Models fetched from HuggingFace Hub
- Pure Rust: Memory-safe, zero-cost abstractions
- Candle ML Framework: Fast tensor operations, Metal GPU support
- gRPC Communication: Efficient gateway ↔ worker communication
- KV Cache: Optimized attention caching for generation
- Modular Architecture: Clean separation of concerns
- Trait-Based Design: Easy to swap ML backends
- Comprehensive Tests: Unit, integration, and GPU tests
- Rich Documentation: API docs, architecture guides, examples
┌─────────────┐ ┌──────────┐ ┌─────────────┐
│ Client │ │ Gateway │ │ Worker │
│ (HTTP) │─────────▶│ (HTTP) │─────────▶│ (gRPC) │
│ │◀─────────│ │◀─────────│ │
└─────────────┘ └──────────┘ └─────────────┘
│ │
│ ▼
│ ┌─────────────────┐
│ │ Candle Engine │
│ │ ┌───────────┐ │
│ │ │ CLIP │ │
│ │ │ Vision │ │
│ │ └───────────┘ │
│ │ ┌───────────┐ │
│ │ │ LLaMA-2 │ │
│ │ │ LLM │ │
│ │ └───────────┘ │
│ └─────────────────┘
│
▼
┌──────────────────┐
│ Observability │
│ Metrics, Logs │
└──────────────────┘
Learn more: Architecture Documentation →
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "vlm-prod",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain quantum computing in simple terms."}
],
"max_tokens": 100,
"temperature": 0.7
}'curl -N -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "vlm-prod",
"messages": [{"role": "user", "content": "Write a haiku about Rust"}],
"max_tokens": 50,
"stream": true
}'curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "vlm-prod",
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "What'\''s in this image?"},
{"type": "image_url", "image_url": {"url": "..."}}
]
}],
"max_tokens": 100
}'More examples: examples/
| Endpoint | Method | Description |
|---|---|---|
/v1/chat/completions |
POST | Create a chat completion (OpenAI-compatible) |
/healthz |
GET | Health check |
/readyz |
GET | Readiness check |
/v1/models |
GET | List available models |
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
model |
string | Yes | - | Model identifier (e.g., "vlm-prod") |
messages |
array | Yes | - | Conversation messages |
max_tokens |
integer | No | 256 | Maximum tokens to generate |
temperature |
float | No | 1.0 | Sampling temperature (0.0-2.0) |
top_p |
float | No | 1.0 | Nucleus sampling parameter |
stream |
boolean | No | false | Enable streaming responses |
stop |
string/array | No | - | Stop sequences |
Full API documentation: docs/API.md
Benchmarks on Apple M3 Ultra (192GB RAM, CPU mode):
| Metric | Value |
|---|---|
| Model Loading | ~30s (one-time) |
| Vision Encoding | 100-200ms per image |
| Prefill (256 tokens) | 500ms-1s |
| Decode | 100-200ms per token |
| End-to-End (20 tokens) | 2-5s |
| Memory Usage | ~16GB |
Performance tuning: docs/PERFORMANCE.md
vlm-inference-server/
├── crates/
│ ├── api_types/ # OpenAI-compatible API types
│ ├── proto/ # gRPC protocol definitions
│ ├── gateway/ # HTTP edge service
│ ├── worker/ # Inference worker service
│ ├── engine/ # ML engine trait definitions
│ ├── engine_adapters/
│ │ ├── mock_engine/ # Test mock
│ │ └── candle_engine/ # Candle ML implementation
│ ├── multimodal/ # Image preprocessing
│ ├── scheduler/ # Batching & admission control
│ ├── kv_cache/ # Key-value cache management
│ ├── sampling/ # Token sampling strategies
│ ├── common/ # Shared utilities
│ └── observability/ # Tracing & metrics
├── docs/ # Documentation
├── examples/ # Usage examples
└── scripts/ # Helper scripts
# Run all tests
cargo test --workspace
# Run specific crate tests
cargo test --package vlm-candle-engine
# Run with logging
RUST_LOG=debug cargo test
# Run integration tests
cargo test --test '*'# Debug build (faster compile, slower runtime)
cargo build
# Release build (optimized)
cargo build --release
# With specific features
cargo build --bin vlm-worker --features candle --release
# Check code style
cargo fmt --all -- --check
cargo clippy --all-targets --all-features# Build image
docker build -t vlm-inference-server .
# Run container
docker run -p 8080:8080 -p 50051:50051 vlm-inference-server# See k8s/ directory for full manifests
kubectl apply -f k8s/deployment.yaml
kubectl apply -f k8s/service.yaml# See systemd/ directory for service files
sudo cp systemd/vlm-worker.service /etc/systemd/system/
sudo cp systemd/vlm-gateway.service /etc/systemd/system/
sudo systemctl enable --now vlm-worker vlm-gatewayDeployment guides: docs/DEPLOYMENT.md
We love contributions! Whether you're:
- 🐛 Reporting a bug
- 💡 Suggesting a feature
- 📝 Improving documentation
- 🔧 Submitting a pull request
Please read our Contributing Guide first.
- Fork and clone the repository
- Create a feature branch:
git checkout -b feature/amazing-feature - Make your changes and add tests
- Run tests:
cargo test --workspace - Commit:
git commit -m "Add amazing feature" - Push:
git push origin feature/amazing-feature - Open a Pull Request
A: Yes! The Candle engine supports Metal (Apple Silicon), CUDA (NVIDIA), and CPU. Metal support is enabled by default on macOS.
A: Currently LLaVA 1.5 7B. The modular architecture makes it easy to add other models (see Adding Models).
A: Yes! It handles real inference with proper error handling, health checks, and observability. The main limitation is tokenization (tokens shown as tok{id} instead of decoded text).
A: Minimum 8GB, recommended 16GB+. The 14GB model is memory-mapped, so it doesn't all load into RAM at once.
A: Yes! This project is licensed under Apache 2.0, which permits commercial use.
More questions? Check the full FAQ or open an issue.
- Phase 0: Core infrastructure (gateway, worker, protocols)
- Phase 1: Candle VLM engine integration
- Phase 2: Real model inference (LLaVA 1.5 7B)
- Phase 3: Production hardening
- Real tokenizer integration
- Image preprocessing pipeline
- Paged KV cache (vLLM-style)
- Flash Attention support
- Phase 4: Advanced features
- Multi-model support
- Dynamic batching
- Model quantization (int8/int4)
- Distributed inference
See the full roadmap: docs/ROADMAP.md
This project builds on incredible work from the ML and Rust communities:
- Candle: Minimalist ML framework from HuggingFace
- LLaVA: Visual instruction tuning research
- Tonic: gRPC framework for Rust
- Axum: Web framework from the Tokio team
Special thanks to:
- HuggingFace for open-sourcing Candle and hosting models
- The Rust community for amazing tools and libraries
- All contributors who helped make this project better
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Copyright 2026 VLM Inference Server Contributors
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
- Documentation: docs/
- Examples: examples/
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Blog Post: Why and How We Built This →
If you find this project useful, please consider giving it a star ⭐️
It helps others discover the project and motivates continued development!
Built with ❤️ using Rust
Quick Start • Features • Architecture • Contributing • Blog Post
This README was crafted following best practices from:
