Unsloth Inference

Learn how to run your finetuned model with Unsloth's faster inference.

Unsloth supports natively 2x faster inference. For our inference only notebook, click here.

All QLoRA, LoRA and non LoRA inference paths are 2x faster. This requires no change of code or any new dependencies.

from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 64)

NotImplementedError: A UTF-8 locale is required. Got ANSI

Sometimes when you execute a cell this error can appear. To solve this, in a new cell, run the below:

import locale
locale.getpreferredencoding = lambda: "UTF-8"

PreviousSGLang NextTroubleshooting Inference

Last updated 1 month ago

Was this helpful?