🧩 NVIDIA Nemotron 3 Nano - How To Run Guide

Run & fine-tune NVIDIA Nemotron 3 Nano locally on your device!

NVIDIA releases Nemotron 3 Nano, a 30B-parameter hybrid reasoning MoE model with ~3.6B active parameters, built for fast, accurate coding, math and agentic tasks. It has a 1M-token context window and is the best in its size class on SWE-Bench, GPQA Diamond, reasoning, chat and throughput.

Nemotron 3 Nano runs on 24GB of RAM/VRAM (or unified memory), and you can now fine-tune it locally. Thanks to NVIDIA for providing Unsloth with day-zero support.


NVIDIA Nemotron 3 Nano GGUF to run: unsloth/Nemotron-3-Nano-30B-A3B-GGUF. We also uploaded BF16 and FP8 variants.

⚙️ Usage Guide

NVIDIA recommends these settings for inference:

General chat/instruction (default):

  • temperature = 1.0

  • top_p = 1.0

Tool calling use-cases:

  • temperature = 0.6

  • top_p = 0.95

For most local use, set:

  • max_new_tokens = 32,096 to 262,144 for standard prompts (the model supports up to 1M tokens)

  • Increase for deep reasoning or long-form generation as your RAM/VRAM allows.
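For example, these sampling settings map directly onto a transformers generate() call if you are running the 16-bit weights. A minimal sketch, assuming the BF16 upload lives at unsloth/Nemotron-3-Nano-30B-A3B (check the model card for the exact repo id) and that your transformers version and hardware support the architecture:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "unsloth/Nemotron-3-Nano-30B-A3B"   # assumed repo id -- verify on Hugging Face
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype = torch.bfloat16, device_map = "auto")

messages = [{"role" : "user", "content" : "What is 2+2?"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt = True, return_tensors = "pt").to(model.device)

# NVIDIA's recommended defaults for general chat
outputs = model.generate(inputs, max_new_tokens = 1024, do_sample = True, temperature = 1.0, top_p = 1.0)
print(tokenizer.decode(outputs[0], skip_special_tokens = True))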

You can see the chat template format by loading the tokenizer (the repo id below is assumed to be the 16-bit upload; check the model card) and running:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("unsloth/Nemotron-3-Nano-30B-A3B")   # assumed repo id

print(tokenizer.apply_chat_template([
    {"role" : "user", "content" : "What is 1+1?"},
    {"role" : "assistant", "content" : "2"},
    {"role" : "user", "content" : "What is 2+2?"},
    ], add_generation_prompt = True, tokenize = False,
))

Nemotron 3 chat template format:

Nemotron 3 uses <think> with token id 12 and </think> with id 13 for reasoning. Use --special to see the tokens.
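You can verify these ids directly from the tokenizer (same assumed repo id as above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("unsloth/Nemotron-3-Nano-30B-A3B")   # assumed repo id
print(tokenizer.convert_tokens_to_ids("<think>"))    # expected: 12
print(tokenizer.convert_tokens_to_ids("</think>"))   # expected: 13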

🖥️ Run Nemotron-3-Nano-30B-A3B

Depending on your use-case you will need to use different settings.

Llama.cpp Tutorial (GGUF):

Instructions to run in llama.cpp (note we will be using 4-bit to fit most devices):

1. Obtain the specific llama.cpp PR from GitHub (we are using https://github.com/ggml-org/llama.cpp/pull/18058), or follow the build instructions below. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or only want CPU inference.
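A typical build sequence looks like this (a sketch: adjust to your platform, note the PR may already be merged into the main branch by the time you read this, and downloading with -hf in the next step needs llama.cpp built with libcurl available):

# Build llama.cpp with the Nemotron 3 support PR (#18058)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git fetch origin pull/18058/head:nemotron-3-nano
git checkout nemotron-3-nano
cmake -B build -DGGML_CUDA=ON             # use -DGGML_CUDA=OFF for CPU-only inference
cmake --build build --config Release -j
cd ..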

2. You can pull and run the model directly from Hugging Face. Increase the context up to 1M tokens as your RAM/VRAM allows.

Follow this for general instruction use-cases:
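A sketch of the command (the binary path and the UD-Q4_K_XL quant tag are assumptions; adjust them to your build and chosen quant):

./llama.cpp/build/bin/llama-cli \
    -hf unsloth/Nemotron-3-Nano-30B-A3B-GGUF:UD-Q4_K_XL \
    --jinja \
    --ctx-size 262144 \
    --temp 1.0 \
    --top-p 1.0 \
    -ngl 99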

Follow this for tool-calling use-cases:
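The same sketch, but with NVIDIA's recommended tool-calling sampling settings:

./llama.cpp/build/bin/llama-cli \
    -hf unsloth/Nemotron-3-Nano-30B-A3B-GGUF:UD-Q4_K_XL \
    --jinja \
    --ctx-size 262144 \
    --temp 0.6 \
    --top-p 0.95 \
    -ngl 99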

3. Download the model (after running pip install huggingface_hub hf_transfer). You can choose UD-Q4_K_XL or other quantized versions.
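For example, via the huggingface_hub Python API (the allow_patterns filter assumes the UD-Q4_K_XL naming; widen it if you pick another quant):

import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"   # faster downloads via hf_transfer
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id = "unsloth/Nemotron-3-Nano-30B-A3B-GGUF",
    local_dir = "Nemotron-3-Nano-30B-A3B-GGUF",
    allow_patterns = ["*UD-Q4_K_XL*"],
)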

4. Then run the model in conversation mode:
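A sketch of the command (the exact .gguf filename is an assumption; point --model at whatever file you actually downloaded in step 3):

./llama.cpp/build/bin/llama-cli \
    --model Nemotron-3-Nano-30B-A3B-GGUF/Nemotron-3-Nano-30B-A3B-UD-Q4_K_XL.gguf \
    --jinja \
    --ctx-size 262144 \
    --temp 1.0 \
    --top-p 1.0 \
    -ngl 99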

Nemotron 3 uses <think> with token id 12 and </think> with id 13 for reasoning. Use --special to see the tokens.

Also, adjust the context window as required, and make sure your hardware can handle anything above a 256K context window before raising it. Setting it to 1M may trigger a CUDA out-of-memory error and crash, which is why the default here is 262,144.

🦥 Fine-tuning Nemotron 3 Nano and RL

Unsloth now supports fine-tuning of all Nemotron models, including Nemotron 3 Nano. The 30B model does not fit on a free Colab GPU; however, we made an 80GB A100 Colab notebook you can use to fine-tune it. 16-bit LoRA fine-tuning of Nemotron 3 Nano uses around 60GB of VRAM.
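A minimal LoRA setup sketch with Unsloth (the repo id and target_modules below are assumptions; the Colab notebook has the exact, tested configuration):

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Nemotron-3-Nano-30B-A3B",   # assumed repo id -- see the notebook
    max_seq_length = 4096,
    load_in_4bit = False,    # 16-bit LoRA, ~60GB VRAM
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    lora_alpha = 16,
    lora_dropout = 0,
    # Standard attention/MLP projection names; the hybrid Mamba/MoE blocks may need a
    # different list -- defer to the notebook.
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
)

From here, the model and tokenizer drop into a standard TRL SFTTrainer setup as in our other notebooks.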

Reinforcement Learning + NeMo Gym

We worked with the open-source NVIDIA NeMo Gym team to help democratize RL environments. Our collaboration enables single-turn rollout RL training across many domains of interest, including math, coding and tool-use, using training environments and datasets from NeMo Gym.

🎉 Llama-server serving & deployment

To deploy Nemotron 3 for production, we use llama-server. In a new terminal (for example via tmux), deploy the model via:
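A sketch, assuming the UD-Q4_K_XL file downloaded earlier and port 8001 (pick any free port; --host 0.0.0.0 exposes the server on your network):

./llama.cpp/build/bin/llama-server \
    --model Nemotron-3-Nano-30B-A3B-GGUF/Nemotron-3-Nano-30B-A3B-UD-Q4_K_XL.gguf \
    --host 0.0.0.0 \
    --port 8001 \
    --jinja \
    --ctx-size 262144 \
    --temp 1.0 \
    --top-p 1.0 \
    -ngl 99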

When you run the above, llama-server will load the model and start listening for requests on the chosen port.

Then, in a new terminal, after running pip install openai, do:
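For example (this assumes the server from the previous step is listening on port 8001):

from openai import OpenAI

client = OpenAI(base_url = "http://127.0.0.1:8001/v1", api_key = "sk-no-key-required")

response = client.chat.completions.create(
    model = "Nemotron-3-Nano-30B-A3B",   # llama-server serves the loaded GGUF regardless of this name
    messages = [{"role" : "user", "content" : "What is 2+2?"}],
    temperature = 1.0,
    top_p = 1.0,
)
print(response.choices[0].message.content)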

This will print the model's reply.

Benchmarks

Nemotron-3-Nano-30B-A3B is the best-performing model in its size class across benchmarks such as SWE-Bench, GPQA Diamond, reasoning, chat and throughput.
