vLLM Deployment & Inference Guide

A guide to saving fine-tuned LLMs and deploying them with vLLM for production serving

💻Installing vLLM

For NVIDIA GPUs, use uv and run:

pip install --upgrade pip
pip install uv
uv pip install -U vllm --torch-backend=auto

For AMD GPUs, please use the nightly Docker image: rocm/vllm-dev:nightly

For the nightly build of vLLM on NVIDIA GPUs, run:

pip install --upgrade pip
pip install uv
uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly

See the vLLM docs for more details.
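To check the install worked, a minimal sanity check (not part of the official install steps) is to import vLLM and print its version:

python -c "import vllm; print(vllm.__version__)"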

🚚Deploying vLLM models

After saving your fine-tune, you can simply do:

vllm serve unsloth/gpt-oss-120b
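The server exposes an OpenAI-compatible API, by default on port 8000. As a minimal sketch (assuming the default host and port, and the openai Python package installed), you can query it like this:

from openai import OpenAI

# Point the OpenAI client at the local vLLM server (default port 8000)
client = OpenAI(base_url = "http://localhost:8000/v1", api_key = "EMPTY")

response = client.chat.completions.create(
    model = "unsloth/gpt-oss-120b",
    messages = [{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)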

🚒vLLM Deployment Server Flags, Engine Arguments & Options

Important server flags, engine arguments and options are covered at vLLM Deployment Server Flags, Engine Arguments & Options.
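For example, a few commonly used flags (the values below are illustrative, so tune them to your hardware):

# Cap context length, limit GPU memory usage, and shard across 2 GPUs
vllm serve unsloth/gpt-oss-120b \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.90 \
    --tensor-parallel-size 2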

🦥Deploying Unsloth finetunes in vLLM

After fine-tuning (see our Fine-tuning LLMs Guide) or using our notebooks at Unsloth Notebooks, you can save and deploy your models directly through vLLM within a single workflow. For example, a minimal Unsloth fine-tuning script:

from unsloth import FastLanguageModel
import torch

# Load the base model in 4-bit to reduce VRAM usage during training
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gpt-oss-20b",
    max_seq_length = 2048,
    load_in_4bit = True,
)
# Attach LoRA adapters for parameter-efficient fine-tuning
model = FastLanguageModel.get_peft_model(model)

To save to 16-bit for vLLM, use:

model.save_pretrained_merged("finetuned_model", tokenizer, save_method = "merged_16bit")
## OR to upload to HuggingFace:
model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")
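The merged 16-bit folder can then be served by pointing vLLM at the local path (assuming the model was saved to ./finetuned_model as above):

vllm serve ./finetuned_model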

To save just the LoRA adapters, either use:

model.save_pretrained("finetuned_model")
tokenizer.save_pretrained("finetuned_model")

Or use our built-in function to do the same:

model.save_pretrained_merged("finetuned_model", tokenizer, save_method = "lora")
## OR to upload to HuggingFace
model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")
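vLLM can also serve LoRA adapters on top of the base model without merging. A minimal sketch, assuming the adapters were saved to ./finetuned_model and the base model matches the one you trained on:

# Serve the base model with the LoRA adapter registered under the name "finetuned"
vllm serve unsloth/gpt-oss-20b \
    --enable-lora \
    --lora-modules finetuned=./finetuned_model

Requests can then select the adapter by passing model = "finetuned" to the OpenAI-compatible API.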

To merge to 4-bit for loading on Hugging Face, first call merged_4bit. Then use merged_4bit_forced only if you are certain you want to merge to 4-bit. We highly discourage this unless you know what you will do with the 4-bit model (for example, DPO training or Hugging Face's online inference engine).

model.save_pretrained_merged("finetuned_model", tokenizer, save_method = "merged_4bit")
## To upload to HuggingFace:
model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")
