SGLang Deployment & Inference Guide
A guide to saving and deploying LLMs with SGLang for production serving.
You can serve any LLM or fine-tuned model via SGLang for low-latency, high-throughput inference. SGLang supports text and image/video model inference on any GPU setup, with support for some GGUFs.
See below for standard SGLang setup, saving, and deployment instructions:
💻Setting up SGLang
For pip or uv installation on NVIDIA GPUs:
# OPTIONAL use a virtual environment
python -m venv unsloth_env
source unsloth_env/bin/activate
# Install Rust (needed to build outlines-core) and its build dependencies
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source $HOME/.cargo/env
sudo apt-get install -y pkg-config libssl-dev
# Install SGLang
pip install --upgrade pip
pip install uv
uv pip install "sglang" --prerelease=allow
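Optionally, you can confirm the installation worked with a quick import check. A minimal sketch, assuming a CUDA-capable GPU is visible to PyTorch:
# Quick sanity check that SGLang and PyTorch installed correctly
import torch
import sglang

print("SGLang version:", sglang.__version__)
print("CUDA available:", torch.cuda.is_available())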
Note: if you see the error below, update Rust and outlines-core as specified above in Setting up SGLang:
hint: This usually indicates a problem with the package or the build environment.
help: `outlines-core` (v0.1.26) was included because `sglang` (v0.5.5.post2) depends on `outlines` (v0.1.11) which depends on `outlines-core`
For Docker setups, run:
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server --model-path unsloth/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
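Once the container is running, you can verify the server is up before sending requests. A minimal sketch using Python requests; the /health route and localhost:30000 address are assumptions based on recent SGLang builds and the -p 30000:30000 mapping above:
import requests

# Liveness check against the SGLang server launched in Docker above
resp = requests.get("http://localhost:30000/health", timeout = 5)
print("Server healthy:", resp.status_code == 200)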
🚚Deploying SGLang models
After saving your fine-tune, you can simply do:
python3 -m sglang.launch_server \
--model-path unsloth/Llama-3.2-1B-Instruct \
--host 0.0.0.0 --port 30000
You can then use an OpenAI Chat completions library to call the model (in another terminal or using tmux):
# Install openai via pip install openai
from openai import OpenAI
import json
openai_client = OpenAI(
base_url = "http://0.0.0.0:30000/v1",
api_key = "sk-no-key-required",
)
completion = openai_client.chat.completions.create(
model = "unsloth/Llama-3.2-1B-Instruct",
messages = [{"role": "user", "content": "What is 2+2?"},],
)
print(completion.choices[0].message.content)
And you will get 2 + 2 = 4.
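Because the endpoint is OpenAI-compatible, streaming also works; here is a minimal sketch reusing the openai_client from the snippet above:
# Stream tokens as they are generated instead of waiting for the full reply
stream = openai_client.chat.completions.create(
    model = "unsloth/Llama-3.2-1B-Instruct",
    messages = [{"role": "user", "content": "Write one sentence about the ocean."}],
    stream = True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta is not None:
        print(delta, end = "", flush = True)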
🦥Unsloth SGLang Instructions
After fine-tuning with our Fine-tuning LLMs Guide or using our notebooks at Unsloth Notebooks, you can save or deploy your models directly through SGLang within a single workflow.
To save to 16-bit for SGLang, use:
model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit")
model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")
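The merged 16-bit folder is a regular Hugging Face checkpoint, so SGLang can load it directly. A minimal offline sketch, assuming the local folder name "model" used above:
import sglang as sgl

# Load the merged 16-bit checkpoint saved above straight from disk
engine = sgl.Engine(model_path = "model")
print(engine.generate("Hello!", {"temperature": 0, "max_new_tokens": 32})["text"])
engine.shutdown()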
To save just the LoRA adapters, either use:
model.save_pretrained("model")
tokenizer.save_pretrained("tokenizer")
Or just use our built-in function to do that:
model.save_pretrained_merged("model", tokenizer, save_method = "lora")
model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")
🚃gpt-oss-20b: Unsloth & SGLang Deployment Guide
Below is a step-by-step tutorial for training gpt-oss-20b with Unsloth and deploying it with SGLang. It includes performance benchmarks across multiple quantization formats.
Unsloth Fine-tuning and Exporting Formats
If you're new to fine-tuning, you can read our guide or try the gpt-oss 20B fine-tuning notebook at gpt-oss: How to Run & Fine-tune. After training, you can export the model in multiple formats:
# 1. 16-bit merged model (full precision)
model.save_pretrained_merged(
"finetuned_model",
tokenizer,
save_method = "merged_16bit",
)
# 2. 4-bit MXFP4 model (ONLY for GPT-OSS)
model.save_pretrained_merged(
"finetuned_model",
tokenizer,
save_method = "mxfp4", # (ONLY FOR GPT-OSS otherwise choose merged_16bit)
)
Use OpenAI Completions
First launch a server for finetuned_model on port 30002 (see the launch command in the Benchmarking section below), then:
from openai import OpenAI
import json
openai_client = OpenAI(
base_url = "http://0.0.0.0:30002/v1",
api_key = "sk-no-key-required",
)
completion = openai_client.chat.completions.create(
model = "finetuned_model",
messages = [{"role": "user", "content": "What is 2+2?"},],
)
print(completion.choices[0].message.content)
## OUTPUT ##
# <|channel|>analysis<|message|>The user asks a simple math question. We should answer 4. Also we should comply with policy. No issues.<|end|><|start|>assistant<|channel|>final<|message|>2 + 2 equals 4.
⚡Benchmarking SGLang
Below are commands you can run to benchmark the speed of your fine-tuned model:
python -m sglang.launch_server \
--model-path finetuned_model \
--host 0.0.0.0 --port 30002
Then in another terminal or via tmux:
# Batch Size=8, Input=1024, Output=1024
python -m sglang.bench_one_batch_server \
--model finetuned_model \
--base-url http://0.0.0.0:30002 \
--batch-size 8 \
--input-len 1024 \
--output-len 1024
You will see benchmarking output like the below:

We used a single B200 GPU with gpt-oss-20b and got the results below (~2,500 output tokens/s throughput):
Batch/Input/Output    TTFT (s)    Latency (s)    Input throughput (tok/s)    Output throughput (tok/s)
8/1024/1024           0.40        3.59           20,718.95                   2,562.87
8/8192/1024           0.42        3.74           154,459.01                  2,473.84
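As a rough guide to how these columns relate (under the interpretation above, where TTFT covers prefill and the remaining latency covers decode), the throughputs are just token counts divided by the time spent in each phase:
# Approximate consistency check for the first row of the table above
batch, input_len, output_len = 8, 1024, 1024
ttft, total_latency = 0.40, 3.59

input_throughput  = batch * input_len  / ttft                    # ~20,500 tok/s prefill
output_throughput = batch * output_len / (total_latency - ttft)  # ~2,570 tok/s decode
print(input_throughput, output_throughput)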
See https://docs.sglang.ai/advanced_features/server_arguments.html for the full list of SGLang server arguments.
🏃SGLang Interactive Offline Mode
You can also use SGLang in offline mode (i.e. not as a server) inside a Python interactive environment.
import sglang as sgl
engine = sgl.Engine(model_path = "unsloth/Qwen3-0.6B", random_seed = 42)
prompt = "Today is a sunny day and I like"
sampling_params = {"temperature": 0, "max_new_tokens": 256}
outputs = engine.generate(prompt, sampling_params)["text"]
print(outputs)
engine.shutdown()
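The offline engine also accepts a list of prompts and batches them internally; a minimal sketch building on the example above:
import sglang as sgl

engine = sgl.Engine(model_path = "unsloth/Qwen3-0.6B", random_seed = 42)
prompts = [
    "The capital of France is",
    "Today is a sunny day and I like",
]
sampling_params = {"temperature": 0, "max_new_tokens": 64}
# generate() returns one result dict per prompt when given a list
outputs = engine.generate(prompts, sampling_params)
for prompt, out in zip(prompts, outputs):
    print(prompt, "->", out["text"])
engine.shutdown()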
🎇GGUFs in SGLang
SGLang also supports GGUFs! Qwen3 MoE is still under construction, but most dense models (Llama 3, Qwen 3, Mistral, etc.) are supported.
First install the latest gguf python package via:
pip install -e "git+https://github.com/ggml-org/llama.cpp.git#egg=gguf&subdirectory=gguf-py" # install a python package from a repo subdirectory
Then, for example in offline mode, do:
from huggingface_hub import hf_hub_download
model_path = hf_hub_download(
"unsloth/Qwen3-32B-GGUF",
filename = "Qwen3-32B-UD-Q4_K_XL.gguf",
)
import sglang as sgl
engine = sgl.Engine(model_path = model_path, random_seed = 42)
prompt = "Today is a sunny day and I like"
sampling_params = {"temperature": 0, "max_new_tokens": 256}
outputs = engine.generate(prompt, sampling_params)["text"]
print(outputs)
engine.shutdown()