How to Run Local LLMs with Docker: Step-by-Step Guide

Learn how to run Large Language Models (LLMs) with Docker & Unsloth on your local device.

You can now run any model, including Unsloth Dynamic GGUFs, on Mac or Windows with a single line of code or no code at all. Thanks to our partnership with Docker, deploying models is effortless, and most GGUF models on Docker are now powered by Unsloth.

Before you start, make sure to look over hardware requirements and our tips for optimizing performance when running LLMs on your device.

Tutorials in this guide: Docker Terminal Tutorial | Docker no-code Tutorial

To get started, run OpenAI gpt-oss with a single command:

docker model run ai/gpt-oss:20B

Or to run a specific Unsloth model / quant from Hugging Face:

docker model run hf.co/unsloth/gpt-oss-20b-GGUF:F16

Why Unsloth + Docker?

We collaborate with model labs such as Google's Gemma team to fix model bugs and boost accuracy. Our Dynamic GGUFs consistently outperform other quantization methods, giving you high-accuracy, efficient inference.

If you use Docker, you can run models instantly with zero setup. Docker uses Docker Model Runner (DMR), which lets you run LLMs as easily as containers with no dependency issues. DMR uses Unsloth models and llama.cpp under the hood for fast, efficient, up-to-date inference.

⚙️ Hardware Info + Performance

For the best performance, aim for your VRAM + RAM combined to be at least equal to the size of the quantized model you're downloading. If you have less, the model will still run, but significantly slower.

Make sure your device also has enough disk space to store the model. If the model only just fits in memory, expect around 5 tokens/s, depending on model size.

Having extra RAM/VRAM available will improve inference speed, and additional VRAM delivers the biggest performance boost (provided the entire model fits).

Example: If you're downloading gpt-oss-20b (F16) and the model is 13.8 GB, ensure that your disk space and RAM + VRAM > 13.8 GB.
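If you want to sanity-check your hardware before pulling a model, the commands below are one rough way to do it on Linux (macOS and Windows have their own equivalents; the nvidia-smi line assumes an NVIDIA GPU):

free -h                                             # total and available RAM
nvidia-smi --query-gpu=memory.total --format=csv    # VRAM (NVIDIA GPUs only)
df -h .                                             # free disk space on the current drive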

Quantization recommendations:

  • For models under 30B parameters, use at least 4-bit (Q4).

  • For models with 70B parameters or larger, use a minimum of 2-bit quantization (e.g., UD-Q2_K_XL).

⚡ Step-by-Step Tutorials

Below are two ways to run models with Docker: one using the terminal, and the other using Docker Desktop with no code:

Method #1: Docker Terminal

Step 1: Install Docker

Docker Model Runner is already available in both Docker Desktop and Docker CE.
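To confirm the Model Runner plugin is available before moving on, you can try the commands below (a quick sketch; subcommand names may vary slightly between Docker versions, so check docker model --help if one is missing):

docker model status    # should report that Docker Model Runner is running
docker model --help    # lists the available model subcommands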

Step 2: Run the model

Decide on a model to run, then run the command via terminal.

  • Browse the verified catalog of trusted models available on Docker Hub or Unsloth's Hugging Face page.

  • Open a terminal to run the commands. To verify that Docker is installed, type 'docker' and press Enter.

  • Docker Hub defaults to running Unsloth Dynamic 4-bit quants; however, you can select your own quantization level (see step #3).

For example, to run OpenAI gpt-oss-20b in a single command:

docker model run ai/gpt-oss:20B

Or to run a specific Unsloth gpt-oss quant from Hugging Face:

docker model run hf.co/unsloth/gpt-oss-20b-GGUF:UD-Q8_K_XL

This is how running gpt-oss-20b should look via CLI:

gpt-oss-20b from Docker Hub
gpt-oss-20b with Unsloth's UD-Q8_K_XL quantization
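If you prefer to download a model first and chat later, the Model Runner also lets you pull and manage models separately. A brief sketch (check docker model --help for the full command list):

docker model pull hf.co/unsloth/gpt-oss-20b-GGUF:UD-Q8_K_XL   # download without starting a chat
docker model list                                             # show models on your machine
docker model run hf.co/unsloth/gpt-oss-20b-GGUF:UD-Q8_K_XL    # start chatting with the pulled model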
Step 3: Run a specific quantization level

If you want to run a specific quantization of a model, append : and the quantization name to the model (e.g., Q4 for Docker Hub models or UD-Q4_K_XL for Unsloth quants). You can view all available quantizations on each model’s Docker Hub page, e.g. the tags listed on the gpt-oss page.

The same applies to Unsloth quants on Hugging Face: visit the model’s HF page, choose a quantization, then run something like: docker model run hf.co/unsloth/gpt-oss-20b-GGUF:Q2_K_L

gpt-oss quantization levels on Docker Hub
Unsloth gpt-oss quantization levels on Hugging Face

Method #2: Docker Desktop (no code)

Step 1: Install Docker Desktop

Docker Model Runner is already available in Docker Desktop.

  1. Decide on a model to run, open Docker Desktop, then click on the 'Models' tab.

  2. Click 'Add models +' or open the Docker Hub tab, then search for the model.

Browse the verified model catalog available on Docker Hub.

#1. Click 'Models' tab then 'Add models +'
#2. Search for your desired model.
Step 2: Pull the model

Click the model you want to run to see available quantizations.

  • Quantizations range from 1–16 bits. For models under 30B parameters, use at least 4-bit (Q4).

  • Choose a size that fits your hardware: ideally, your combined unified memory, RAM, or VRAM should be equal to or greater than the model size. For example, an 11GB model runs well on 12GB unified memory.

#3. Select which quantization you would like to pull.
#4. Wait for model to finish downloading, then Run it.
Step 3: Run the model

Type any prompt in the 'Ask a question' box and use the LLM like you would use ChatGPT.

An example of running Qwen3-4B UD-Q8_K_XL

To run the latest models:

You can run any new model on Docker as long as it’s supported by llama.cpp or vLLM and available on Docker Hub.
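For example, to try a newer model straight from Unsloth's Hugging Face collection (the repo and tag below are an illustrative assumption; confirm they exist on the model's HF page first):

docker model run hf.co/unsloth/Qwen3-4B-GGUF:UD-Q8_K_XL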

What Is the Docker Model Runner?

The Docker Model Runner (DMR) is an open-source tool that lets you pull and run AI models as easily as you run containers. GitHub: https://github.com/docker/model-runner

It provides a consistent runtime for models, similar to how Docker standardized app deployment. Under the hood, it uses optimized backends (like llama.cpp) for smooth, hardware-efficient inference on your machine.

Whether you’re a researcher, developer, or hobbyist, you can now:

  • Run open models locally in seconds.

  • Avoid dependency hell, everything is handled in Docker.

  • Share and reproduce model setups effortlessly.
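Docker Model Runner also exposes an OpenAI-compatible API, so tools that already speak the OpenAI API can point at your local model. A minimal sketch, assuming host-side TCP access is enabled on the default port 12434 (the port, path, and model name here follow Docker's DMR docs at the time of writing and may differ on your setup):

curl http://localhost:12434/engines/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/gpt-oss:20B",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'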
