A ChatML interface integrated with RAG, web search, and more for LLMs and VLMs. Simply plug in your v1 endpoint from OpenAI, Google AI Studio, Ollama, vLLM, or any other OpenAI-compatible endpoint.
Don't have enough resources? No worries: choose from several models in the drop-down and run LLMs locally in your browser!
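If you already have an OpenAI-compatible server running, the snippet below is a minimal sketch of what "plug in your v1 endpoint" means: it points the official `openai` Python client at a local vLLM server. The base URL, API key, and model name are placeholders; swap in whichever endpoint and model you actually use.

```python
# Minimal sketch: any server exposing the OpenAI v1 API can act as a backend.
# The base_url, api_key, and model name below are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # e.g. a local vLLM server, or any compatible endpoint
    api_key="EMPTY",                      # vLLM ignores the key unless you configure one
)

response = client.chat.completions.create(
    model="tiiuae/Falcon3-7B-Instruct-GPTQ-Int4",  # must match the model you served
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Say hello in one sentence."},
    ],
)
print(response.choices[0].message.content)
```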
Install vllm, fastapi, uvicorn, and optionally ngrok (any other reverse proxy works as well).
```bash
pip install -r requirements.txt
```

**Terminal 1**
```bash
uvicorn app.main:app --host 0.0.0.0 --port 3000
```

The following is the list of models we have successfully tried so far on vllm==0.12.x. The errors we faced and their fixes are logged in the Remarks column.
| Model Name | Command | Remarks |
|---|---|---|
| Falcon3-7B-Instruct-GPTQ-Int4 | `vllm serve tiiuae/Falcon3-7B-Instruct-GPTQ-Int4 --max-model-len 4096 --gpu-memory-utilization 0.85` | |
| Ministral-3-8B-Instruct-2512-AWQ-4bit | `vllm serve cyankiwi/Ministral-3-8B-Instruct-2512-AWQ-4bit --gpu-memory-utilization 0.85 --max-model-len 6144 --max-num-batched-tokens 1024` | |
| OpenGVLab/InternVL3-8B-AWQ | `vllm serve OpenGVLab/InternVL3-8B-AWQ --max-model-len 4096 --gpu-memory-utilization 0.75 --max-num-batched-tokens 1024 --trust-remote-code --quantization awq` | Note: the AWQ-quantized model won't load unless the `--quantization awq` flag is set. |
| OpenGVLab/InternVL3-2B | `vllm serve OpenGVLab/InternVL3-2B --max-model-len 4096 --gpu-memory-utilization 0.75 --max-num-batched-tokens 1024 --trust-remote-code` | |
| Nemotron Cascade 8B | `vllm serve cyankiwi/Nemotron-Cascade-8B-AWQ-4bit --max-model-len 4096 --gpu-memory-utilization 0.85 --max-num-batched-tokens 1024 --trust-remote-code` | |
| Nemotron Orchestrator 8B | `vllm serve cyankiwi/Nemotron-Orchestrator-8B-AWQ-4bit --served-model-name Nemotron-orchestrator --max-model-len 4096 --gpu-memory-utilization 0.85 --max-num-batched-tokens 1024 --trust-remote-code` | |
| Qwen2-VL 2B Instruct | `vllm serve Qwen/Qwen2-VL-2B-Instruct --max-model-len 4096 --gpu-memory-utilization 0.75 --max-num-batched-tokens 1024` | |
| Nvidia Cosmos Reason2 2B | `vllm serve nvidia/Cosmos-Reason2-2B --max-model-len 8192 --max-num-batched-tokens 2048 --gpu-memory-utilization 0.8` | |
| H2OVL Mississippi 2B | `vllm serve h2oai/h2ovl-mississippi-2b --max-model-len 4096 --max-num-batched-tokens 2048 --gpu-memory-utilization 0.75` | Does not support a system prompt; needs special handling. (Not yet fixed) |
| Gemma 3 4B Instruct | `vllm serve ISTA-DASLab/gemma-3-4b-it-GPTQ-4b-128g --max-model-len 4096 --max-num-batched-tokens 1024 --gpu-memory-utilization 0.8` | Original model OOMs; using a community GPTQ-quantized model instead. Max concurrency observed: 9 |
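Several of the models above are VLMs. As a rough sketch (assuming the `OpenGVLab/InternVL3-2B` command from the table is running on vLLM's default port 8000), an image-plus-text request against the OpenAI-compatible endpoint looks like this; the image URL is a placeholder.

```python
# Sketch of a multimodal chat request against a vLLM OpenAI-compatible server.
# Assumes one of the VLMs from the table (here InternVL3-2B) is serving on localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="OpenGVLab/InternVL3-2B",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/sample.jpg"},  # placeholder image
                },
            ],
        }
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```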
