🐳DeepSeek-OCR: How to Run & Fine-tune
Guide on how to run and fine-tune DeepSeek-OCR locally.
DeepSeek-OCR is a 3B-parameter vision model for OCR and document understanding. It uses context optical compression to convert 2D layouts into vision tokens, enabling efficient long-context processing.
Capable of handling tables, papers, and handwriting, DeepSeek-OCR achieves 97% precision while using 10× fewer vision tokens than text tokens - making it 10× more efficient than text-based LLMs.
You can fine-tune DeepSeek-OCR to enhance its vision or language performance. In our Unsloth free fine-tuning notebook, we demonstrated an 88.26 percentage-point improvement in language understanding.
Our model upload, which enables fine-tuning and additional inference support: DeepSeek-OCR
🖥️ Running DeepSeek-OCR
To run the model in vLLM or Unsloth, here are the recommended settings:
⚙️ Recommended Settings
DeepSeek recommends these settings:
Temperature = 0.0
max_tokens = 8192
ngram_size = 30
window_size = 90
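The `ngram_size` and `window_size` settings guard against runaway repetition during greedy decoding (temperature = 0.0). As a rough illustration only (a hypothetical helper, not DeepSeek's actual implementation), a repetition check with these parameters could look like:

```python
def repeats_recent_ngram(tokens, ngram_size=30, window_size=90):
    """Return True if the last `ngram_size` tokens already occur
    within the preceding `window_size` tokens, i.e. the model has
    started looping (hypothetical sketch, not DeepSeek's code)."""
    if len(tokens) < ngram_size + 1:
        return False
    tail = tuple(tokens[-ngram_size:])
    # Only search the recent window leading up to the tail.
    start = max(0, len(tokens) - ngram_size - window_size)
    haystack = tokens[start:-1]  # exclude the final position itself
    for i in range(len(haystack) - ngram_size + 1):
        if tuple(haystack[i:i + ngram_size]) == tail:
            return True
    return False
```

When such a check fires, a decoder would typically stop generation early rather than emit the same table row or line over and over.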
📖 vLLM: Run DeepSeek-OCR Tutorial
Obtain the latest vLLM via:
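Assuming a standard pip environment, for example:

```shell
pip install --upgrade vllm
```

Note that DeepSeek-OCR support may require a fairly recent vLLM release; if the model fails to load, try upgrading to the newest available build.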
Then run the following code:
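A minimal offline-inference sketch with vLLM's Python API is below. The model name, image path, and prompt string are assumptions for illustration; check the DeepSeek-OCR model card for the exact prompt template:

```python
from vllm import LLM, SamplingParams
from PIL import Image

# Model name is an assumption; use the upload you intend to serve.
llm = LLM(model="unsloth/DeepSeek-OCR", trust_remote_code=True)

# Recommended settings from above.
sampling = SamplingParams(temperature=0.0, max_tokens=8192)

image = Image.open("document.png")  # placeholder path
# Prompt format taken from the DeepSeek-OCR model card; verify before use.
prompt = "<image>\n<|grounding|>Convert the document to markdown."

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    sampling,
)
print(outputs[0].outputs[0].text)
```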
🦥 Unsloth: Run DeepSeek-OCR Tutorial
Obtain the latest Unsloth via `pip install --upgrade unsloth`. If you already have Unsloth, update it via `pip install --upgrade --force-reinstall --no-deps --no-cache-dir unsloth unsloth_zoo`. Then use the code below to run DeepSeek-OCR:
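A minimal sketch of loading and running the model with Unsloth's `FastVisionModel` follows; the image path is a placeholder, and the `infer` helper comes from the model's remote code, so its exact signature may differ from this sketch:

```python
from unsloth import FastVisionModel

# Load the Unsloth upload of DeepSeek-OCR (name assumed; see our HF page).
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/DeepSeek-OCR",
    load_in_4bit=False,       # OCR quality is sensitive; full precision is safer
    trust_remote_code=True,
)
FastVisionModel.for_inference(model)  # switch to inference mode

# The custom `infer` method ships with the model's remote code.
result = model.infer(
    tokenizer,
    prompt="<image>\n<|grounding|>Convert the document to markdown.",
    image_file="document.png",  # placeholder path
    output_path="./output",
)
print(result)
```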
🦥 Fine-tuning DeepSeek-OCR
Unsloth supports fine-tuning of DeepSeek-OCR. Because the default model isn't runnable on the latest transformers version, we incorporated changes from the Stranger Vision HF team to enable inference. As usual, Unsloth trains DeepSeek-OCR 1.4× faster with 40% less VRAM and 5× longer context lengths, with no accuracy degradation.
We created two free DeepSeek-OCR Colab notebooks (with and without eval):
DeepSeek-OCR: Fine-tuning only notebook
DeepSeek-OCR: Fine-tuning + Evaluation notebook (A100)
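The notebooks follow Unsloth's usual vision fine-tuning recipe. A condensed sketch is below (dataset name and hyperparameters are placeholders, not the notebook's exact values):

```python
from unsloth import FastVisionModel
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/DeepSeek-OCR",
    trust_remote_code=True,
)

# Attach LoRA adapters to both the vision and language towers.
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,
    finetune_language_layers=True,
    r=16,
    lora_alpha=16,
)

# `train_dataset` should yield image + transcript conversations,
# e.g. a Persian OCR dataset as in our notebook (placeholder here).
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    data_collator=UnslothVisionDataCollator(model, tokenizer),
    args=SFTConfig(
        per_device_train_batch_size=8,  # matches the batch size we report
        max_steps=60,                    # 60 steps was enough in our run
        learning_rate=2e-4,              # placeholder hyperparameter
        output_dir="outputs",
    ),
)
trainer.train()
```

For the exact dataset processing and evaluation loop, use the notebooks above rather than this sketch.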
Fine-tuning DeepSeek-OCR on a 200K-sample Persian dataset resulted in substantial gains in Persian text detection and understanding. We evaluated the base model against our fine-tuned version on 200 Persian transcript samples, observing an 88.26 percentage-point absolute improvement in Character Error Rate (CER). After only 60 training steps (batch size = 8), the mean CER decreased from 149.07% to 60.81%, a roughly 59% relative reduction in errors on Persian text.
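CER is the edit distance between the predicted and reference transcripts divided by the reference length, which is why it can exceed 100% when the model hallucinates extra text (as in the 149.07% baseline). A minimal sketch of the metric (not the notebook's exact eval code):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                 # deletion
                curr[j - 1] + 1,             # insertion
                prev[j - 1] + (ca != cb),    # substitution (0 if equal)
            ))
        prev = curr
    return prev[-1]

def cer(prediction: str, reference: str) -> float:
    """Character Error Rate: edits needed per reference character.
    Can exceed 1.0 when the prediction inserts many spurious characters."""
    return levenshtein(prediction, reference) / max(len(reference), 1)
```

Averaging `cer` over the 200 eval samples gives the mean CER figures reported here.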
You can replace the Persian dataset with your own to improve DeepSeek-OCR for other use-cases. For reproducible eval results, use our eval notebook above. For detailed eval results, see below:
Fine-tuned Evaluation Results:
DeepSeek-OCR Baseline
Mean Baseline Model Performance: 149.07% CER for this eval set!
DeepSeek-OCR Fine-tuned
With 60 steps, we reduced CER from 149.07% to 60.43% (an ~89 percentage-point CER improvement)
An example from the 200K Persian dataset we used (you may use your own), showing the image on the left and the corresponding text on the right.


