📈 Datasets Guide
Learn how to create & prepare a dataset for fine-tuning.
For LLMs, datasets are collections of data that can be used to train our models. In order to be useful for training, text data needs to be in a format that can be tokenized.
One of the key parts of creating a dataset is your chat template and how you are going to design it. Tokenization is also important as it breaks text into tokens, which can be words, sub-words, or characters so LLMs can process it effectively. These tokens are then turned into embeddings and are adjusted to help the model understand the meaning and context.
To enable the process of tokenization, datasets need to be in a format that can be read by a tokenizer.
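As a conceptual sketch of that text → tokens → ids pipeline (using a toy whitespace tokenizer and a made-up vocabulary, not a real tokenizer like the one shipped with your model):

```python
# Toy illustration of tokenization: real tokenizers (BPE, SentencePiece)
# split text into sub-words, but the text -> tokens -> ids pipeline is the same.
text = "fine tune the model"

# Hypothetical vocabulary mapping each token to an integer id.
vocab = {"fine": 0, "tune": 1, "the": 2, "model": 3}

tokens = text.split()                    # ["fine", "tune", "the", "model"]
token_ids = [vocab[t] for t in tokens]   # [0, 1, 2, 3]
print(token_ids)
```

A real tokenizer also handles unknown words, special tokens, and sub-word splits, but the dataset's job is the same: provide text in a shape the tokenizer can consume.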
| Format | Description | Training Type |
| --- | --- | --- |
| Raw Corpus | Raw text from a source such as a website, book, or article. | Continued Pretraining |
| Instruct | Instructions for the model to follow and an example of the output to aim for. | Supervised fine-tuning (SFT) |
| Conversation | Multiple-turn conversation between a user and an AI assistant. | Supervised fine-tuning (SFT) |
| RLHF | Conversation between a user and an AI assistant, with the assistant's responses being ranked by a human evaluator. | Reinforcement Learning |
Before we format our data, we want to identify the following:
Purpose of dataset
Knowing the purpose of the dataset will help us determine what data we need and which format to use.
The purpose could be adapting a model to a new task, such as summarization, or improving a model's ability to role-play a specific character. For example:
Chat-based dialogues (Q&A, customer support, conversations).
Structured tasks (classification, summarization, generation tasks).
Domain-specific data (medical, finance, technical).
Style of output
The style of output will let us know what sources of data we will use to reach our desired output.
For example, the output you want could be JSON, HTML, plain text, or code, and in a particular language such as Spanish, English, or German.
Data source
When we know the purpose and style of the data we need, we must analyze the source and quality of that data.
The source of data can be a CSV file, PDF, or even a website. You can also synthetically generate data, but extra care is required to make sure each example is high quality and relevant.
One of the best ways to create a better dataset is by combining it with a more generalized dataset from Hugging Face, like ShareGPT, to make your model smarter and more diverse.
When we have identified the relevant criteria, and collected the necessary data, we can then format our data into a machine readable format that is ready for training.
Formatting your data with the appropriate template converts your dataset into a readable format for fine-tuning.
For continued pretraining, we use raw text format without specific structure.
This format preserves natural language flow and allows the model to learn from continuous text.
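For example (filename and contents hypothetical), a continued-pretraining dataset is often stored as JSONL where each record holds a single `"text"` field:

```python
import json

# Hypothetical raw-text records for continued pretraining:
# no roles, no instructions, just continuous text.
records = [
    {"text": "Pasta carbonara is a traditional Roman pasta dish. ..."},
    {"text": "The sauce is made by mixing eggs, cheese, and pepper. ..."},
]

# Write one JSON object per line (JSONL).
with open("pretrain.jsonl", "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")
```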
If we are adapting a model to a new task, and intend for the model to output text in a single turn based on a specific set of instructions, we can use the instruction format in Alpaca style.
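A single Alpaca-style record (content hypothetical) looks like this; the `input` field may be empty when the instruction is self-contained:

```python
# Hypothetical Alpaca-style example: instruction, optional input, target output.
alpaca_example = {
    "instruction": "Summarize the following article in one sentence.",
    "input": "The city council voted on Tuesday to approve the new budget...",
    "output": "The city council approved the new budget on Tuesday.",
}
```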
When we want multiple turns of conversation, we can use OpenAI's ChatML conversational format.
Each message alternates between `human` and `assistant`, allowing for natural dialogue flow.
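A multi-turn conversation in this style (contents hypothetical, using OpenAI-style role names) might look like:

```python
# Hypothetical multi-turn conversation: roles alternate between user and assistant.
conversation = {
    "messages": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "2 + 2 equals 4."},
        {"role": "user", "content": "And if I add 3 more?"},
        {"role": "assistant", "content": "That gives 7."},
    ]
}
```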
Q: How can I format my raw data into Alpaca instruct?
A: One approach is to create a Python script to process your raw data. If you're working on a summarization task, you can use a local LLM to generate instructions and outputs for each example.
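A minimal sketch of such a script (field names and data hypothetical), pairing raw documents with summaries to build Alpaca records:

```python
# Hypothetical raw (document, summary) pairs; in practice the summaries
# could come from a local LLM.
raw_pairs = [
    ("Long article about solar power...", "Solar power article summary."),
    ("Long article about wind farms...", "Wind farms article summary."),
]

def to_alpaca(document, summary):
    # Wrap each pair in the Alpaca instruction/input/output structure.
    return {
        "instruction": "Summarize the following text.",
        "input": document,
        "output": summary,
    }

dataset = [to_alpaca(doc, s) for doc, s in raw_pairs]
print(len(dataset))  # 2
```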
Yes, you can use any local LLM like Llama 3.3 (70B) or OpenAI's GPT-4.5 to generate synthetic data. Generally, it is better to use a bigger model like Llama 3.3 (70B) to ensure the highest-quality outputs. You can directly use inference engines like vLLM, Ollama, or llama.cpp to generate synthetic data, but it will require some manual work to collect it and prompt for more data. Synthetic data serves several goals, including:
Produce entirely new data - either from scratch or from your existing dataset
Augment existing data, e.g. automatically structuring your dataset in the chosen format
You can do this process automatically rather than manually via a notebook for synthetic data generation which we will be releasing soon.
Your goal is to prompt the model to generate and process QA data in your specified format. The model needs to learn both the structure you provided and the context, so ensure you have at least 10 examples of data already. Example prompts:
Prompt for generating more dialogue on an existing dataset:
Prompt if you have no dataset:
Prompt for a dataset without formatting:
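As an illustration only (not the original prompts), a synthetic-generation prompt could be assembled like this, where the seed examples and the summarization task are assumptions:

```python
# Hypothetical seed examples drawn from your existing dataset.
seed_examples = [
    '{"instruction": "Summarize...", "input": "...", "output": "..."}',
    '{"instruction": "Summarize...", "input": "...", "output": "..."}',
]

# Prompt asking the model to produce more rows in the same JSON format.
prompt = (
    "You are generating training data for a summarization model.\n"
    "Here are examples of the exact JSON format to follow:\n"
    + "\n".join(seed_examples)
    + "\nGenerate 10 new, diverse examples in the same JSON format."
)
print(prompt)
```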
It is recommended to check the quality of generated data and remove or improve irrelevant or poor-quality responses. Depending on your dataset, it may also need to be balanced across many areas so your model does not overfit. You can then feed this cleaned dataset back into your LLM to regenerate data, now with even more guidance.
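A lightweight cleaning pass (thresholds and field names are assumptions) might drop exact duplicates and too-short responses:

```python
# Hypothetical generated rows; the "output" field holds model responses.
rows = [
    {"instruction": "Summarize A.", "output": "A short summary of A."},
    {"instruction": "Summarize A.", "output": "A short summary of A."},  # duplicate
    {"instruction": "Summarize B.", "output": "ok"},                     # too short
    {"instruction": "Summarize C.", "output": "A useful summary of C."},
]

seen = set()
cleaned = []
for row in rows:
    key = (row["instruction"], row["output"])
    # Drop exact duplicates and responses under a minimum length.
    if key in seen or len(row["output"]) < 10:
        continue
    seen.add(key)
    cleaned.append(row)

print(len(cleaned))  # 2
```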
We generally recommend using at least 100 rows of data for fine-tuning to achieve reasonable results. For optimal performance, a dataset with over 300 rows is preferable, and in this case, more data usually leads to better outcomes. If your dataset is too small you can also add synthetic data or add a dataset from Hugging Face to diversify it. However, the effectiveness of your fine-tuned model depends heavily on the quality of the dataset, so be sure to thoroughly clean and prepare your data.
If you want to fine-tune a model that already has reasoning capabilities, like the distilled versions of DeepSeek-R1 (e.g. DeepSeek-R1-Distill-Llama-8B), you will still need question/task and answer pairs; however, each answer must be rewritten so it includes the reasoning/chain-of-thought process and the steps taken to derive the answer. For a model that does not have reasoning capabilities but that you want to gain them, you will need a standard question/answer dataset combined with a way to score the model's outputs, such as a reward function. This training process is known as Reinforcement Learning, e.g. GRPO.
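A reward function for this kind of RL training can be as simple as checking the final answer; the `####` answer marker and the function signature below are assumptions, loosely following common GSM8K-style setups:

```python
def correctness_reward(completion, reference_answer):
    # Hypothetical convention: the model writes its chain of thought,
    # then the final answer after a "####" marker.
    if "####" not in completion:
        return 0.0
    final = completion.split("####")[-1].strip()
    return 1.0 if final == reference_answer.strip() else 0.0

print(correctness_reward("Step 1... Step 2... #### 42", "42"))  # 1.0
print(correctness_reward("I think the answer is 41", "42"))     # 0.0
```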
If you have multiple datasets for fine-tuning, you can either:
Standardize the format of all datasets, combine them into a single dataset, and fine-tune on this unified dataset.
Use the Multiple Datasets notebook to fine-tune on multiple datasets directly.
You can fine-tune an already fine-tuned model multiple times, but it's best to combine all the datasets and perform the fine-tuning in a single process instead. Training an already fine-tuned model can potentially alter the quality and knowledge acquired during the previous fine-tuning process.
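Standardizing and combining can be sketched in plain Python (field names hypothetical), mapping every dataset onto one shared schema before merging:

```python
# Two hypothetical datasets with different field names.
dataset_a = [{"question": "What is 2+2?", "answer": "4"}]
dataset_b = [{"prompt": "Capital of France?", "response": "Paris"}]

def standardize(rows, in_key, out_key):
    # Rename fields so every dataset uses instruction/output.
    return [{"instruction": r[in_key], "output": r[out_key]} for r in rows]

combined = (
    standardize(dataset_a, "question", "answer")
    + standardize(dataset_b, "prompt", "response")
)
print(len(combined))  # 2
```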
The dataset for fine-tuning a vision or multimodal model also includes image inputs. For example, the Llama 3.2 Vision Notebook uses a radiography case to show how AI can help medical professionals analyze X-rays, CT scans, and ultrasounds more efficiently.
We'll be using a sampled version of the ROCO radiography dataset. You can access the dataset here. The dataset includes X-rays, CT scans and ultrasounds showcasing medical conditions and diseases. Each image has a caption written by experts describing it. The goal is to finetune a VLM to make it a useful analysis tool for medical professionals.
Let's take a look at the dataset, and check what the 1st example shows:
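A sketch of that check (assuming the loaded dataset exposes `image` and `caption` fields, illustrated here with stand-in data rather than the real download):

```python
# Stand-in for the loaded radiography dataset: a list of {image, caption} records.
dataset = [
    {"image": "<PIL.Image object>",
     "caption": "Panoramic radiography shows an osteolytic lesion..."},
]

sample = dataset[0]
print(sample["caption"])
```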
Panoramic radiography shows an osteolytic lesion in the right posterior maxilla with resorption of the floor of the maxillary sinus (arrows).
To format the dataset, all vision finetuning tasks should be formatted as follows:
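Concretely, each example becomes a multimodal message list where the user turn carries both text and an image. The structure below follows common VLM chat formats; the exact keys may differ per library, and the instruction and caption are placeholders:

```python
# Hypothetical instruction plus a stand-in for the actual PIL image object.
instruction = "You are an expert radiographer. Describe accurately what you see."
image = "<PIL.Image object>"
caption = "Panoramic radiography shows an osteolytic lesion..."

example = {
    "messages": [
        {"role": "user", "content": [
            {"type": "text", "text": instruction},
            {"type": "image", "image": image},
        ]},
        {"role": "assistant", "content": [
            {"type": "text", "text": caption},
        ]},
    ]
}
```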
We will craft a custom instruction asking the VLM to act as an expert radiographer. Note also that instead of just one instruction, you can add multiple turns to make it a dynamic conversation.
Let's convert the dataset into the "correct" format for finetuning:
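A conversion function along these lines (assuming the dataset exposes `image` and `caption` fields) maps every sample into that message structure:

```python
instruction = "You are an expert radiographer. Describe accurately what you see."

def convert_to_conversation(sample):
    # Wrap each (image, caption) pair as one user turn and one assistant turn.
    conversation = [
        {"role": "user", "content": [
            {"type": "text", "text": instruction},
            {"type": "image", "image": sample["image"]},
        ]},
        {"role": "assistant", "content": [
            {"type": "text", "text": sample["caption"]},
        ]},
    ]
    return {"messages": conversation}

# Apply to every sample; with a Hugging Face dataset this could also be
# done via a list comprehension or dataset.map:
# converted_dataset = [convert_to_conversation(s) for s in dataset]
```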
The first example is now structured as a multi-turn conversation carrying both image and text content.
Before we do any finetuning, maybe the vision model already knows how to analyse the images? Let's check if this is the case!
And the result:
For more details, view our dataset section in the notebook here.
Diversify your dataset so your model does not overfit and become too specific.