SGLang Deployment & Inference Guide

A guide on saving LLMs and deploying them with SGLang for production serving

You can serve any LLM or fine-tuned model via SGLang for low-latency, high-throughput inference. SGLang supports text and image/video model inference on any GPU setup, and also supports some GGUF models.

💻Installing SGLang

To install SGLang and Unsloth on NVIDIA GPUs, run the below, ideally inside a virtual environment (so it won't break your other Python libraries):

# OPTIONAL use a virtual environment
python -m venv unsloth_env
source unsloth_env/bin/activate

# Install Rust, outlines-core then SGLang
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env && sudo apt-get install -y pkg-config libssl-dev
pip install --upgrade pip && pip install uv
uv pip install "sglang" && uv pip install unsloth
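
To check the installation worked, a quick sanity check like the below should run (a minimal sketch; it only confirms the packages are importable and a GPU is visible):

# Quick sanity check that SGLang, Unsloth and a CUDA GPU are all visible
import torch
from importlib.metadata import version

print("sglang  :", version("sglang"))
print("unsloth :", version("unsloth"))
print("GPU     :", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "no CUDA device found")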

For Docker setups run:

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path unsloth/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
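
Once the container is up, you can confirm the server is ready before sending traffic. A minimal sketch, assuming SGLang's native /health and /get_model_info routes and the requests package:

# Ping the Docker-served endpoint (assumes SGLang's /health and /get_model_info routes)
import requests

base = "http://localhost:30000"
print("health     :", requests.get(f"{base}/health").status_code)    # 200 once the server is ready
print("model info :", requests.get(f"{base}/get_model_info").json()) # shows the loaded model path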

🐛Debugging SGLang Installation issues

Note: if you see the error below, the outlines-core build failed. Install Rust and the build dependencies (pkg-config, libssl-dev) as shown in the installation steps above, then retry:

hint: This usually indicates a problem with the package or the build environment.
  help: `outlines-core` (v0.1.26) was included because `sglang` (v0.5.5.post2) depends on `outlines` (v0.1.11) which depends on `outlines-core`

If you see a Flashinfer issue like below:

/home/daniel/.cache/flashinfer/0.5.2/100a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_64_head_dim_vo_64_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_ragged_kernel_mask_1.cu:1:10: fatal error: flashinfer/attention/prefill.cuh: No such file or directory
    1 | #include <flashinfer/attention/prefill.cuh>
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
ninja: build stopped: subcommand failed.

Possible solutions:
1. set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
2. set --cuda-graph-max-bs to a smaller value (e.g., 16)
3. disable torch compile by not using --enable-torch-compile
4. disable CUDA graph by --disable-cuda-graph. (Not recommended. Huge performance loss)
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose

To fix this, remove the FlashInfer cache, i.e. the directory listed in the error message, via rm -rf ~/.cache/flashinfer, then relaunch so the kernels are rebuilt.
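
If you prefer to do the cleanup from Python, the equivalent of the rm -rf above is a small sketch like:

# Remove the FlashInfer JIT cache so its kernels get rebuilt on the next launch
import shutil
from pathlib import Path

cache_dir = Path.home() / ".cache" / "flashinfer"
shutil.rmtree(cache_dir, ignore_errors = True)
print("Removed:", cache_dir)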

🚚Deploying SGLang models

To deploy any model, for example unsloth/Llama-3.2-1B-Instruct, run the below in a separate terminal (otherwise it will block your current terminal; you can also use tmux):

python3 -m sglang.launch_server \
    --model-path unsloth/Llama-3.2-1B-Instruct \
    --host 0.0.0.0 --port 30000

You can then use the OpenAI Chat completions library to call the model (in another terminal or using tmux):

# Install openai via pip install openai
from openai import OpenAI
openai_client = OpenAI(
    base_url = "http://0.0.0.0:30000/v1",
    api_key = "sk-no-key-required",
)
completion = openai_client.chat.completions.create(
    model = "unsloth/Llama-3.2-1B-Instruct",
    messages = [{"role": "user", "content": "What is 2+2?"},],
)
print(completion.choices[0].message.content)

And you will get 2 + 2 = 4.
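
The endpoint also supports streaming through the same OpenAI client, which is useful if you want tokens as they are generated rather than waiting for the full reply (a small sketch using the standard stream = True flag):

# Stream the reply token-by-token; reuses the openai_client defined above
stream = openai_client.chat.completions.create(
    model = "unsloth/Llama-3.2-1B-Instruct",
    messages = [{"role": "user", "content": "What is 2+2?"},],
    stream = True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta is not None:
        print(delta, end = "", flush = True)
print()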

🦥Deploying Unsloth finetunes in SGLang

After fine-tuning (see our Fine-tuning Guide) or using our Unsloth Notebooks, you can save or deploy your models directly through SGLang within a single workflow. For example, a minimal Unsloth fine-tuning script:

from unsloth import FastLanguageModel
import torch
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gpt-oss-20b",
    max_seq_length = 2048,
    load_in_4bit = True,
)
model = FastLanguageModel.get_peft_model(model) # attach LoRA adapters (default LoRA rank and target modules)

To save to 16-bit for SGLang, use:

model.save_pretrained_merged("finetuned_model", tokenizer, save_method = "merged_16bit")
## OR to upload to HuggingFace:
model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

To save just the LoRA adapters, either use:

model.save_pretrained("finetuned_model")
tokenizer.save_pretrained("finetuned_model")

Or just use our built-in function to do the same:

model.save_pretrained_merged("model", tokenizer, save_method = "lora")
## OR to upload to HuggingFace
model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")
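
If you only kept the LoRA adapters, you can still produce a merged 16-bit folder for SGLang later on. Below is a minimal sketch using PEFT (Unsloth's save_pretrained_merged above already does this for you; the base model name here is illustrative and should match the model you fine-tuned from):

# Merge saved LoRA adapters into a 16-bit base model so SGLang can serve the result
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("unsloth/Llama-3.2-1B-Instruct", torch_dtype = torch.bfloat16)
model = PeftModel.from_pretrained(base, "finetuned_model")   # folder containing the LoRA adapters
merged = model.merge_and_unload()                            # fold the LoRA weights into the base model
merged.save_pretrained("finetuned_model_merged")
AutoTokenizer.from_pretrained("finetuned_model").save_pretrained("finetuned_model_merged")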

🚃gpt-oss-20b: Unsloth & SGLang Deployment Guide

Below is a step-by-step tutorial for training gpt-oss-20b with Unsloth and deploying it with SGLang. It includes performance benchmarks across multiple quantization formats.

Step 1: Unsloth Fine-tuning and Exporting Formats

If you're new to fine-tuning, you can read our Fine-tuning Guide, or try our gpt-oss 20B fine-tuning notebook. After training, you can export the model in multiple formats:

model.save_pretrained_merged(
    "finetuned_model", 
    tokenizer, 
    save_method = "merged_16bit",
)
## For gpt-oss specific mxfp4 conversions:
model.save_pretrained_merged(
    "finetuned_model", 
    tokenizer, 
    save_method = "mxfp4", # (ONLY FOR gpt-oss otherwise choose "merged_16bit")
)

Step 2: Deployment with SGLang

We saved our gpt-oss finetune to the folder "finetuned_model", so in a new terminal we can launch it as an inference endpoint with SGLang:

python -m sglang.launch_server \
    --model-path finetuned_model \
    --host 0.0.0.0 --port 30002

You might have to wait a while at the Capturing batches (bs=1 avail_mem=20.84 GB) step!

Step 3: Calling the inference endpoint

To call the inference endpoint, first launch a new terminal. We can then call the model like below:

from openai import OpenAI
openai_client = OpenAI(
    base_url = "http://0.0.0.0:30002/v1",
    api_key = "sk-no-key-required",
)
completion = openai_client.chat.completions.create(
    model = "finetuned_model",
    messages = [{"role": "user", "content": "What is 2+2?"},],
)
print(completion.choices[0].message.content)

## OUTPUT ##
# <|channel|>analysis<|message|>The user asks a simple math question. We should answer 4. Also we should comply with policy. No issues.<|end|><|start|>assistant<|channel|>final<|message|>2 + 2 equals 4.
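
gpt-oss replies in the harmony format, with an analysis channel followed by a final channel, as the output above shows. If you only want the final answer, a small sketch that splits on the channel marker seen in that output:

# Keep only the "final" channel of a gpt-oss harmony-formatted reply; reuses `completion` from above
raw = completion.choices[0].message.content
marker = "<|channel|>final<|message|>"
final_answer = raw.split(marker)[-1] if marker in raw else raw
print(final_answer)   # -> 2 + 2 equals 4.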

💎FP8 Online Quantization

To deploy models with FP8 online quantization in SGLang, which gives roughly 30 to 50% more throughput, about 50% less memory usage, and around 2x longer context length support, run the below:

python -m sglang.launch_server \
    --model-path unsloth/Llama-3.2-1B-Instruct \
    --host 0.0.0.0 --port 30002 \
    --quantization fp8 \
    --kv-cache-dtype fp8_e4m3

You can also use --kv-cache-dtype fp8_e5m2, which has a larger dynamic range and might resolve FP8 inference issues if you see them. Alternatively, use our pre-quantized FP8 models listed at https://huggingface.co/unsloth/models?search=-fp8.
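
The same FP8 settings should also work with SGLang's offline engine (covered below), assuming sgl.Engine forwards the quantization and kv_cache_dtype arguments the same way as sglang.launch_server. A sketch:

# FP8 online quantization in offline mode (assumes Engine accepts the same args as the server)
import sglang as sgl

engine = sgl.Engine(
    model_path = "unsloth/Llama-3.2-1B-Instruct",
    quantization = "fp8",
    kv_cache_dtype = "fp8_e4m3",
    random_seed = 42,
)
print(engine.generate("Today is a sunny day and I like", {"temperature": 0, "max_new_tokens": 64})["text"])
engine.shutdown()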

⚡Benchmarking SGLang

Below are commands you can run to benchmark the inference speed of your finetuned model. First launch the server:

python -m sglang.launch_server \
    --model-path finetuned_model \
    --host 0.0.0.0 --port 30002

Then in another terminal or via tmux:

# Batch Size=8, Input=1024, Output=1024
python -m sglang.bench_one_batch_server \
    --model finetuned_model \
    --base-url http://0.0.0.0:30002 \
    --batch-size 8 \
    --input-len 1024 \
    --output-len 1024

The benchmark will then run and print its results. We used a single B200 GPU with gpt-oss-20b and got the below results (~2,500 tokens/s output throughput):

Batch/Input/Output    TTFT (s)    ITL (s)    Input Throughput (tok/s)    Output Throughput (tok/s)
8/1024/1024           0.40        3.59       20,718.95                   2,562.87
8/8192/1024           0.42        3.74       154,459.01                  2,473.84

See https://docs.sglang.ai/advanced_features/server_arguments.html for the full list of SGLang server arguments.
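
If you just want a rough end-to-end number without the benchmarking script, a client-side sketch like the below fires concurrent requests at the OpenAI-compatible endpoint (the prompt and request counts are illustrative, and it assumes the server reports token usage):

# Rough client-side throughput check against the running endpoint
import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url = "http://0.0.0.0:30002/v1", api_key = "sk-no-key-required")

def one_request(_):
    response = client.chat.completions.create(
        model = "finetuned_model",
        messages = [{"role": "user", "content": "Write a short poem about llamas."}],
        max_tokens = 256,
    )
    return response.usage.completion_tokens

start = time.time()
with ThreadPoolExecutor(max_workers = 8) as pool:
    total_tokens = sum(pool.map(one_request, range(8)))
elapsed = time.time() - start
print(f"{total_tokens} output tokens in {elapsed:.2f}s -> {total_tokens / elapsed:.1f} tokens/s")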

🏃SGLang Interactive Offline Mode

You can also use SGLang in offline mode (i.e. not as a server) inside a Python interactive environment.

import sglang as sgl
engine = sgl.Engine(model_path = "unsloth/Qwen3-0.6B", random_seed = 42)

prompt = "Today is a sunny day and I like"
sampling_params = {"temperature": 0, "max_new_tokens": 256}
outputs = engine.generate(prompt, sampling_params)["text"]
print(outputs)
engine.shutdown()
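
The engine also accepts a list of prompts, which is the easiest way to batch offline generation (a short sketch; the prompts are illustrative):

# Batched offline generation: pass a list of prompts and get a list of outputs back
import sglang as sgl

engine = sgl.Engine(model_path = "unsloth/Qwen3-0.6B", random_seed = 42)
prompts = [
    "Today is a sunny day and I like",
    "The capital of France is",
]
sampling_params = {"temperature": 0, "max_new_tokens": 64}
for prompt, output in zip(prompts, engine.generate(prompts, sampling_params)):
    print(prompt, "->", output["text"])
engine.shutdown()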

🎇GGUFs in SGLang

Interestingly, SGLang also supports GGUFs! Qwen3 MoE support is still under construction, but most dense models (Llama 3, Qwen 3, Mistral, etc.) are supported.

First install the latest gguf python package via:

pip install -e "git+https://github.com/ggml-org/llama.cpp.git#egg=gguf&subdirectory=gguf-py" # install a python package from a repo subdirectory

Then, for example in SGLang offline mode, you can do:

from huggingface_hub import hf_hub_download
model_path = hf_hub_download(
    "unsloth/Qwen3-32B-GGUF",
    filename = "Qwen3-32B-UD-Q4_K_XL.gguf",
)
import sglang as sgl
engine = sgl.Engine(model_path = model_path, random_seed = 42)

prompt = "Today is a sunny day and I like"
sampling_params = {"temperature": 0, "max_new_tokens": 256}
outputs = engine.generate(prompt, sampling_params)["text"]
print(outputs)
engine.shutdown()

🎬High throughput GGUF serving with SGLang

First download the specific GGUF file like below:

from huggingface_hub import hf_hub_download
hf_hub_download("unsloth/Qwen3-32B-GGUF", filename="Qwen3-32B-UD-Q4_K_XL.gguf", local_dir=".")

Then serve the specific file Qwen3-32B-UD-Q4_K_XL.gguf, set the served model name via --served-model-name unsloth/Qwen3-32B, and point --tokenizer-path to the Hugging Face compatible tokenizer:

python -m sglang.launch_server \
    --model-path Qwen3-32B-UD-Q4_K_XL.gguf \
    --host 0.0.0.0 --port 30002 \
    --served-model-name unsloth/Qwen3-32B \
    --tokenizer-path unsloth/Qwen3-32B
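
The endpoint can then be called with the served model name, using the same OpenAI client pattern as before:

# Call the GGUF-backed endpoint using the name set via --served-model-name
from openai import OpenAI

client = OpenAI(base_url = "http://0.0.0.0:30002/v1", api_key = "sk-no-key-required")
completion = client.chat.completions.create(
    model = "unsloth/Qwen3-32B",
    messages = [{"role": "user", "content": "What is 2+2?"},],
)
print(completion.choices[0].message.content)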
