🧩 NVIDIA Nemotron 3 Nano - How To Run Guide
Run & fine-tune NVIDIA Nemotron 3 Nano locally on your device!
NVIDIA releases Nemotron 3 Nano, a 30B-parameter hybrid reasoning MoE model with ~3.6B active parameters, built for fast, accurate coding, math, and agentic tasks. It has a 1M-token context window and is the best model in its size class on SWE-Bench, GPQA Diamond, reasoning, chat, and throughput.
Nemotron 3 Nano runs on 24GB of RAM/VRAM (or unified memory), and you can now fine-tune it locally. Thanks to NVIDIA for providing Unsloth with day-zero support.
NVIDIA Nemotron 3 Nano GGUF to run: unsloth/Nemotron-3-Nano-30B-A3B-GGUF. We also uploaded BF16 and FP8 variants.
⚙️ Usage Guide
NVIDIA recommends these settings for inference:
General chat/instruction (default):
temperature = 1.0
top_p = 1.0
Tool calling use-cases:
temperature = 0.6
top_p = 0.95
For most local use, set:
max_new_tokens = 32,096 to 262,144 for standard prompts, with a max of 1M tokens. Increase for deep reasoning or long-form generation as your RAM/VRAM allows.
You can inspect the chat template format by applying the tokenizer's chat template as below:
from transformers import AutoTokenizer

# Load the tokenizer (repo name assumed; any of the uploaded variants works)
tokenizer = AutoTokenizer.from_pretrained("unsloth/Nemotron-3-Nano-30B-A3B")

print(tokenizer.apply_chat_template([
    {"role" : "user", "content" : "What is 1+1?"},
    {"role" : "assistant", "content" : "2"},
    {"role" : "user", "content" : "What is 2+2?"}
], add_generation_prompt = True, tokenize = False,
))
Nemotron 3 chat template format:
🖥️ Run Nemotron-3-Nano-30B-A3B
Depending on your use case, you will need different settings.
Llama.cpp Tutorial (GGUF):
Instructions to run the model in llama.cpp (note that we will be using 4-bit quantization to fit most devices):
Obtain the specific llama.cpp PR from GitHub (we are using https://github.com/ggml-org/llama.cpp/pull/18058), or follow the build instructions below. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
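As a rough sketch, the build might look like this (assuming Ubuntu/Debian with CUDA; fetching the unmerged PR via git fetch pull/18058/head is one standard approach):

```bash
# Install build dependencies (Debian/Ubuntu assumed)
apt-get update && apt-get install -y build-essential cmake curl libcurl4-openssl-dev git

# Clone llama.cpp and check out the Nemotron support PR (pull/18058)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git fetch origin pull/18058/head:nemotron-3-nano
git checkout nemotron-3-nano

# Build with CUDA; use -DGGML_CUDA=OFF for CPU-only inference
cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build build --config Release -j --target llama-cli llama-server
cd ..
```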
You can directly pull from Hugging Face. You can increase the context to 1M as your RAM/VRAM allows.
Follow this for general instruction use-cases:
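For instance, a llama-cli sketch that pulls the 4-bit quant directly from Hugging Face (the UD-Q4_K_XL tag and the 262,144 context are illustrative; adjust for your hardware):

```bash
./llama.cpp/build/bin/llama-cli \
    -hf unsloth/Nemotron-3-Nano-30B-A3B-GGUF:UD-Q4_K_XL \
    --jinja \
    --n-gpu-layers 99 \
    --temp 1.0 \
    --top-p 1.0 \
    --ctx-size 262144
```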
Follow this for tool-calling use-cases:
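The same sketch, with the recommended tool-calling sampling settings:

```bash
./llama.cpp/build/bin/llama-cli \
    -hf unsloth/Nemotron-3-Nano-30B-A3B-GGUF:UD-Q4_K_XL \
    --jinja \
    --n-gpu-layers 99 \
    --temp 0.6 \
    --top-p 0.95 \
    --ctx-size 262144
```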
Download the model via (after installing pip install huggingface_hub hf_transfer ). You can choose UD-Q4_K_XL or other quantized versions.
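A download sketch in Python (local_dir and the allow_patterns filter are illustrative):

```python
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # enable hf_transfer for faster downloads

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id = "unsloth/Nemotron-3-Nano-30B-A3B-GGUF",
    local_dir = "Nemotron-3-Nano-30B-A3B-GGUF",
    allow_patterns = ["*UD-Q4_K_XL*"],  # or another quantized version
)
```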
Then run the model in conversation mode:
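For example, pointing llama-cli at the files downloaded above (the exact .gguf filename is an assumption; check your local_dir):

```bash
./llama.cpp/build/bin/llama-cli \
    --model Nemotron-3-Nano-30B-A3B-GGUF/Nemotron-3-Nano-30B-A3B-UD-Q4_K_XL.gguf \
    --jinja \
    --n-gpu-layers 99 \
    --temp 1.0 \
    --top-p 1.0 \
    --ctx-size 262144
```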
Also, adjust the context window as required, and make sure your hardware can handle more than a 256K context window before raising it. Setting it to 1M may trigger a CUDA OOM crash, which is why the default is 262,144.
Because the model was trained with NoPE (no positional embeddings), you only need to change max_position_embeddings to extend the context. Since the model doesn't use explicit positional embeddings, YaRN scaling isn't needed.
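As a sketch in transformers (repo name assumed, as above):

```python
from transformers import AutoConfig

# Raise the context limit; with NoPE there is no rope_scaling/YaRN to configure
config = AutoConfig.from_pretrained("unsloth/Nemotron-3-Nano-30B-A3B")
config.max_position_embeddings = 1_048_576  # 1M tokens

# Pass `config` to from_pretrained when loading the model
```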
🦥 Fine-tuning Nemotron 3 Nano and RL
Unsloth now supports fine-tuning of all Nemotron models, including Nemotron 3 Nano. The 30B model does not fit on a free Colab GPU; however, we still made an 80GB A100 Colab notebook for you to fine-tune with. 16-bit LoRA fine-tuning of Nemotron 3 Nano will use around 60GB VRAM:
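A minimal LoRA setup sketch with Unsloth (hyperparameters are illustrative; the notebook above is the reference):

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Nemotron-3-Nano-30B-A3B",  # repo name assumed
    max_seq_length = 4096,
    load_in_4bit = False,  # 16-bit LoRA uses ~60GB VRAM, as noted above
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    lora_alpha = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
)
```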
✨ Reinforcement Learning + NeMo Gym
We worked with the open-source NVIDIA NeMo Gym team to help democratize RL environments. Our collaboration enables single-turn rollout RL training across many domains of interest, including math, coding, and tool use, using training environments and datasets from NeMo Gym:
🎉 Llama-server serving & deployment
To deploy Nemotron 3 for production, we use llama-server. In a new terminal (for example, via tmux), deploy the model via:
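A serving sketch, reusing the build and download from above (host and port are illustrative):

```bash
./llama.cpp/build/bin/llama-server \
    --model Nemotron-3-Nano-30B-A3B-GGUF/Nemotron-3-Nano-30B-A3B-UD-Q4_K_XL.gguf \
    --host 0.0.0.0 \
    --port 8001 \
    --jinja \
    --n-gpu-layers 99 \
    --temp 1.0 \
    --top-p 1.0 \
    --ctx-size 262144
```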
When you run the above, you will get:

Then in a new terminal, after doing pip install openai, do:
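A client sketch against the server above (port 8001 assumed from the serving command; llama-server does not validate the API key or model name):

```python
from openai import OpenAI

client = OpenAI(
    base_url = "http://localhost:8001/v1",
    api_key = "sk-no-key-required",  # any non-empty string works
)

completion = client.chat.completions.create(
    model = "nemotron-3-nano",  # placeholder; the server serves whatever model it loaded
    messages = [{"role": "user", "content": "What is 2+2?"}],
    temperature = 1.0,
    top_p = 1.0,
)
print(completion.choices[0].message.content)
```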
This will print the model's response.
Also check out our latest collaboration guide published on NVIDIA's official Developer blog:
Benchmarks
Nemotron-3-Nano-30B-A3B is the best-performing model in its size class across all benchmarks, including throughput.
