How to Run Local LLMs with Docker: Step-by-Step Guide
Learn how to run Large Language Models (LLMs) with Docker & Unsloth on your local device.
You can now run any model, including Unsloth Dynamic GGUFs, on Mac or Windows with a single line of code or no code at all. Thanks to our partnership with Docker, deploying models is effortless, and most GGUF models on Docker are now powered by Unsloth.
Before you start, make sure to look over hardware requirements and our tips for optimizing performance when running LLMs on your device.
To get started, run OpenAI gpt-oss with a single command:
docker model run ai/gpt-oss:20B
Or to run a specific Unsloth model / quant from Hugging Face:
docker model run hf.co/unsloth/gpt-oss-20b-GGUF:F16
Why Unsloth + Docker?
We collaborate with model labs such as Google's Gemma team to fix model bugs and boost accuracy. Our Dynamic GGUFs consistently outperform other quantization methods, giving you high-accuracy, efficient inference.
If you use Docker, you can run models instantly with zero setup. Docker uses Docker Model Runner (DMR), which lets you run LLMs as easily as containers with no dependency issues. DMR uses Unsloth models and llama.cpp under the hood for fast, efficient, up-to-date inference.
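If you want a feel for the workflow before diving into the tutorials, a minimal session looks something like this (a sketch using the DMR CLI's documented subcommands; run docker model --help to confirm what your Docker version supports):

# Download a model without starting it, then list what's stored locally
docker model pull ai/gpt-oss:20B
docker model list

# Chat with the model interactively; remove it later to free disk space
docker model run ai/gpt-oss:20B
docker model rm ai/gpt-oss:20B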
⚙️ Hardware Info + Performance
For the best performance, aim for your VRAM + RAM combined to be at least equal to the size of the quantized model you're downloading. If you have less, the model will still run, but significantly slower.
Make sure your device also has enough disk space to store the model. If the model only barely fits in memory, expect around 5 tokens/s, depending on model size.
Having extra RAM/VRAM available will improve inference speed, and additional VRAM enables the biggest performance boost (provided the entire model fits in VRAM).
Quantization recommendations:
For models under 30B parameters, use at least 4-bit (Q4).
For models 70B parameters or larger, use a minimum of 2-bit quantization (e.g., UD-Q2_K_XL). A quick way to estimate file sizes is sketched below.
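As a rule of thumb, a GGUF file takes roughly parameters × bits ÷ 8 bytes, plus some overhead for metadata and any higher-precision layers. A quick back-of-the-envelope check in the shell (illustrative estimates, not exact file sizes):

# approx. size in GB ≈ parameters (in billions) × bits ÷ 8
echo $(( 20 * 4 / 8 ))   # 20B model at 4-bit → prints 10 (≈10 GB; aim for ≥ ~11 GB RAM+VRAM)
echo $(( 70 * 2 / 8 ))   # 70B model at 2-bit → prints 17 (integer math; ≈17.5 GB in practice)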
⚡ Step-by-Step Tutorials
Below are two ways to run models with Docker: one using the terminal, and the other using Docker Desktop with no code:
Method #1: Docker Terminal
Install Docker
Docker Model Runner is already available in both Docker Desktop and Docker CE.
Run the model
Decide on a model to run, then run the command via terminal.
Browse the verified catalog of trusted models available on Docker Hub or Unsloth's Hugging Face page.
Open a terminal to run the commands. To verify that Docker is installed, type 'docker' and press Enter (see the quick checks below).
Docker Hub defaults to running the Unsloth Dynamic 4-bit quant; however, you can select your own quantization level (see step #3).
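Either of these quick checks confirms your setup (docker model status assumes a Docker version that ships the Model Runner):

docker --version        # confirms the Docker CLI is installed
docker model status     # confirms Docker Model Runner is available and running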
For example, to run OpenAI gpt-oss-20b in a single command:
docker model run ai/gpt-oss:20B
Or to run a specific Unsloth gpt-oss quant from Hugging Face:
docker model run hf.co/unsloth/gpt-oss-20b-GGUF:UD-Q8_K_XL
This is how running gpt-oss-20b should look via CLI:


To run a specific quantization level:
If you want to run a specific quantization of a model, append : and the quantization name to the model (e.g., Q4 for Docker or UD-Q4_K_XL). You can view all available quantizations on each model's Docker Hub page; the gpt-oss page, for example, lists all of its quantizations.
The same applies to Unsloth quants on Hugging Face: visit the model’s HF page, choose a quantization, then run something like: docker model run hf.co/unsloth/gpt-oss-20b-GGUF:Q2_K_L
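Side by side, the two addressing schemes look like this (the Docker Hub tag below is illustrative; check the model's Docker Hub or Hugging Face page for the tags that actually exist):

# Docker Hub: append the quantization to the image tag
docker model run ai/gpt-oss:20B-UD-Q4_K_XL   # illustrative tag
# Hugging Face: repo path, a colon, then the quant name
docker model run hf.co/unsloth/gpt-oss-20b-GGUF:UD-Q4_K_XL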


Method #2: Docker Desktop (no code)
Install Docker Desktop
Docker Model Runner is already available in Docker Desktop.
Decide on a model to run, open Docker Desktop, then click on the 'Models' tab.
Click 'Add models +' or the Docker Hub tab, then search for the model.
Browse the verified model catalog available on Docker Hub.


Pull the model
Click the model you want to run to see available quantizations.
Quantizations range from 1–16 bits. For models under 30B parameters, use at least 4-bit (Q4).
Choose a size that fits your hardware: ideally, your combined unified memory, RAM, or VRAM should be equal to or greater than the model size. For example, an 11GB model runs well on 12GB unified memory.


To run the latest models:
You can run any new model on Docker as long as it's supported by llama.cpp or vLLM and available on Docker Hub.
What Is the Docker Model Runner?
The Docker Model Runner (DMR) is an open-source tool that lets you pull and run AI models as easily as you run containers. GitHub: https://github.com/docker/model-runner
It provides a consistent runtime for models, similar to how Docker standardized app deployment. Under the hood, it uses optimized backends (like llama.cpp) for smooth, hardware-efficient inference on your machine.
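Because DMR exposes an OpenAI-compatible API, any OpenAI-style client can talk to your local model. A minimal sketch with curl (this assumes host TCP access is enabled, e.g. via docker desktop enable model-runner --tcp 12434; the port and path can differ by version, so check the DMR docs):

# Send a chat completion request to the local Model Runner
curl http://localhost:12434/engines/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ai/gpt-oss:20B",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'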
Whether you’re a researcher, developer, or hobbyist, you can now:
Run open models locally in seconds.
Avoid dependency hell; everything is handled in Docker.
Share and reproduce model setups effortlessly.