Saving to SGLang for deployment

How to save your finetuned model to 16-bit for deployment and serving with SGLang

To save to 16bit for SGLang, use:

model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit")
model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

To save just the LoRA adapters, either use:

model.save_pretrained("model")
tokenizer.save_pretrained("tokenizer")

Or use our built-in function, which does both in one call:

model.save_pretrained_merged("model", tokenizer, save_method = "lora")
model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")
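Once merged to 16-bit, the saved folder is a standard Hugging Face checkpoint, so it can be passed directly to SGLang. A minimal sketch, assuming you used "model" as the output directory in save_pretrained_merged above:

```shell
# Serve the locally merged 16-bit checkpoint with SGLang.
# "./model" is the directory written by save_pretrained_merged("model", ...).
python3 -m sglang.launch_server --model-path ./model --host 0.0.0.0 --port 30000
```

If you pushed to the Hub instead, pass the repo name (e.g. "hf/model") as --model-path and SGLang will download it.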

💻Installing SGLang

For NVIDIA GPUs, do:

pip install --upgrade pip
pip install uv
uv pip install "sglang" --prerelease=allow

For Docker, run the following:

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path unsloth/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000

See https://docs.sglang.ai/get_started/install.html for more details.

🚚Deploying SGLang models

After saving your finetune, launch a server by passing your saved folder or Hub repo name as --model-path. For example:

python3 -m sglang.launch_server --model-path unsloth/Llama-3.2-1B-Instruct --host 0.0.0.0
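The running server exposes an OpenAI-compatible API. A minimal client sketch, assuming the server above is listening on localhost:30000 (SGLang's default port) and using only the Python standard library:

```python
# Sketch: query a running SGLang server via its OpenAI-compatible
# /v1/chat/completions endpoint. Assumes the launch_server command
# above is running on localhost:30000.
import json
import urllib.request

def build_chat_request(model: str, prompt: str) -> bytes:
    # Standard OpenAI chat-completions payload, JSON-encoded for POSTing.
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "max_tokens": 64,
    }
    return json.dumps(payload).encode("utf-8")

def query_server(prompt: str, base_url: str = "http://localhost:30000") -> str:
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=build_chat_request("unsloth/Llama-3.2-1B-Instruct", prompt),
        headers={"Content-Type": "application/json"},
    )
    # Returns the assistant's reply text from the first choice.
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Call query_server("Hello!") once the server reports it is ready; swap the model name for your own saved finetune.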

🚒SGLang Deployment Server Flags, Engine Arguments & Options

Under construction
