Saving to SGLang for deployment
Saving models to 16-bit for SGLang deployment and serving
To save to 16-bit for SGLang, use:

```python
model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit")
model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")
```

To save just the LoRA adapters, either use:

```python
model.save_pretrained("model")
tokenizer.save_pretrained("tokenizer")
```

Or just use our built-in function to do that:

```python
model.save_pretrained_merged("model", tokenizer, save_method = "lora")
model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")
```

💻 Installing SGLang
For NVIDIA GPUs, do:

```shell
pip install --upgrade pip
pip install uv
uv pip install "sglang" --prerelease=allow
```

For Docker, try the below:

```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server --model-path unsloth/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
```

See https://docs.sglang.ai/get_started/install.html for more details.
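Once the container (or pip-installed server) above is running, you can sanity-check it from Python. This is a minimal sketch assuming the server exposes a `/health` endpoint on the mapped port 30000 (verify against the SGLang docs for your version); `health_url` and `is_up` are hypothetical helper names:

```python
from urllib.request import urlopen
from urllib.error import URLError

# Build the URL for the server's health endpoint (assumed to be /health,
# served on the port mapped in the docker run command above).
def health_url(host="localhost", port=30000):
    return f"http://{host}:{port}/health"

# Return True if the server answers the health check with HTTP 200.
def is_up(host="localhost", port=30000, timeout=2):
    try:
        with urlopen(health_url(host, port), timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False

print(health_url())  # http://localhost:30000/health
print(is_up())       # False unless a server is actually listening on 30000
```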
🚚 Deploying SGLang models
After saving your finetune, you can simply do:

```shell
python3 -m sglang.launch_server --model-path unsloth/Llama-3.2-1B-Instruct --host 0.0.0.0
```

🚒 SGLang Deployment Server Flags, Engine Arguments & Options
Under construction
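With the server from the deployment step above running, you can query it over its OpenAI-compatible API. The sketch below only builds the request; `build_chat_request` is a hypothetical helper, and the `/v1/chat/completions` path on port 30000 assumes SGLang's OpenAI-compatible server defaults (check the SGLang docs for your version):

```python
import json
from urllib.request import Request, urlopen

# Hypothetical helper: build the URL and JSON payload for a chat completion
# request against the server's OpenAI-compatible endpoint.
def build_chat_request(model, prompt, host="localhost", port=30000, max_tokens=64):
    url = f"http://{host}:{port}/v1/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return url, payload

url, payload = build_chat_request("unsloth/Llama-3.2-1B-Instruct", "Hello!")
print(url)
print(json.dumps(payload))

# To actually send it (requires the server to be running):
# req = Request(url, data=json.dumps(payload).encode(),
#               headers={"Content-Type": "application/json"})
# print(urlopen(req).read().decode())
```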