vLLM
vLLM is a high-throughput inference engine for serving models on your own GPUs. Its vllm serve command exposes an OpenAI-compatible API at http://localhost:8000/v1, which Routeplane fronts as one provider block.
Prerequisites
Section titled “Prerequisites”-
Routeplane installed, with a
routeplane.yaml(scaffold one withrouteplane init). -
vLLM serving a model:
Terminal window vllm serve meta-llama/Llama-3.1-8B-Instruct # default port 8000
Add vLLM to Routeplane
Section titled “Add vLLM to Routeplane”providers: vllm: api_base: http://localhost:8000/v1 api_protocol: - "*": chat_completions models: - id: meta-llama/Llama-3.1-8B-InstructThe models id must match the name vLLM serves. By default that’s the full Hugging Face repo id; pass --served-model-name my-model to vllm serve to alias it to something shorter, then use that alias here.
Route to it
Section titled “Route to it”routeplane route vllm:meta-llama/Llama-3.1-8B-InstructThen start Routeplane and send a request.