Skip to content

vLLM

vLLM is a high-throughput inference engine for serving models on your own GPUs. Its vllm serve command exposes an OpenAI-compatible API at http://localhost:8000/v1, which Routeplane fronts as one provider block.

  • Routeplane installed, with a routeplane.yaml (scaffold one with routeplane init).

  • vLLM serving a model:

    Terminal window
    vllm serve meta-llama/Llama-3.1-8B-Instruct # default port 8000
routeplane.yaml
providers:
vllm:
api_base: http://localhost:8000/v1
api_protocol:
- "*": chat_completions
models:
- id: meta-llama/Llama-3.1-8B-Instruct

The models id must match the name vLLM serves. By default that’s the full Hugging Face repo id; pass --served-model-name my-model to vllm serve to alias it to something shorter, then use that alias here.

**Optional auth.** vLLM is keyless by default. If you launched it with `--api-key ` (or `VLLM_API_KEY`), add `api_key: ${VLLM_API_KEY}` to the provider block — it resolves from the environment at load time. **Port clash with Unsloth.** vLLM and Unsloth Studio both default to `:8000`. If you run both, start one on another port (`vllm serve … --port 8001`) and update `api_base`.
Terminal window
routeplane route vllm:meta-llama/Llama-3.1-8B-Instruct

Then start Routeplane and send a request.