vLLM Deployment & Inference Guide
A guide on saving fine-tuned LLMs and deploying them with vLLM for serving in production.
💻Installing vLLM
For NVIDIA GPUs, use uv and run:
pip install --upgrade pip
pip install uv
uv pip install -U vllm --torch-backend=auto
For AMD GPUs, please use the nightly Docker image: rocm/vllm-dev:nightly
To install the nightly branch on NVIDIA GPUs, run:
pip install --upgrade pip
pip install uv
uv pip install -U vllm \
    --torch-backend=auto \
    --extra-index-url https://wheels.vllm.ai/nightly
See the vLLM docs for more details.
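To confirm the install worked, you can run a quick sanity check from Python inside the same environment (a minimal sketch; it only checks that vLLM imports and that a CUDA device is visible):
# Minimal post-install sanity check.
import torch
import vllm
print("vLLM version:", vllm.__version__)
print("CUDA available:", torch.cuda.is_available())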
🚚Deploying vLLM models
After saving your fine-tune, you can simply do:
vllm serve unsloth/gpt-oss-120b
🚒vLLM Deployment Server Flags, Engine Arguments & Options
Some important server flags and engine arguments are listed at vLLM Deployment Server Flags, Engine Arguments & Options.
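Once vllm serve is running, it exposes an OpenAI-compatible API (by default at http://localhost:8000/v1), so any OpenAI client can query it. Below is a rough sketch using the openai Python package; the port, the api_key placeholder, and the prompt are only examples:
# Query a running `vllm serve` instance via its OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM does not require a real key by default
response = client.chat.completions.create(
    model="unsloth/gpt-oss-120b",  # must match the model name the server was started with
    messages=[{"role": "user", "content": "Summarize what vLLM does in one sentence."}],
)
print(response.choices[0].message.content)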
🦥Unsloth vLLM Instructions
To save to 16bit for vLLM, use:
model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit")
model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")To merge to 4bit to load on HuggingFace, first call merged_4bit. Then use merged_4bit_forced if you are certain you want to merge to 4bit. I highly discourage you, unless you know what you are going to do with the 4bit model (ie for DPO training for eg or for HuggingFace's online inference engine)
model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit")
model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")To save just the LoRA adapters, either use:
model.save_pretrained("model")
tokenizer.save_pretrained("tokenizer")Or just use our builtin function to do that:
model.save_pretrained_merged("model", tokenizer, save_method = "lora")
model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")Last updated