Datasets Guide
Learn how to create & prepare a dataset for fine-tuning.
For LLMs, datasets are collections of data that can be used to train our models. To be useful for training, text data needs to be in a format that can be tokenized.
One of the key parts of creating a dataset is deciding on its format and how you are going to design it. Tokenization is also important: it breaks text into tokens, which can be words, sub-words, or characters, so LLMs can process it effectively. These tokens are then turned into embeddings, which are adjusted during training to help the model understand the meaning and context.
To enable the process of tokenization, datasets need to be in a format that can be read by a tokenizer.
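To make the idea concrete, here is a toy sketch of sub-word tokenization. The vocabulary below is made up for illustration; real tokenizers (e.g. BPE) learn their sub-word vocabularies from data.

```python
# Toy vocabulary mapping sub-words to ids (illustrative, not a real tokenizer).
vocab = {"fine": 0, "-": 1, "tun": 2, "ing": 3, "token": 4, "izer": 5}

def toy_tokenize(text):
    """Greedily match the longest known sub-word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            i += 1  # skip characters not covered by the vocabulary
    return tokens

print(toy_tokenize("fine-tuning"))  # -> ['fine', '-', 'tun', 'ing']
```

Each token would then be looked up in an embedding table during training.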
Raw Corpus: raw text from a source such as a website, book, or article. Used for Continued Pretraining (CPT).
Instruct: instructions for the model to follow and an example of the output to aim for. Used for supervised fine-tuning (SFT).
Conversation: a multi-turn conversation between a user and an AI assistant. Used for supervised fine-tuning (SFT).
RLHF: a conversation between a user and an AI assistant, with the assistant's responses ranked by a script, another model, or a human evaluator. Used for Reinforcement Learning (RL).
Before we format our data, we want to identify the following:
Purpose of dataset
Knowing the purpose of the dataset will help us determine what data we need and format to use.
The purpose could be adapting a model to a new task, such as summarization, or improving a model's ability to role-play a specific character. For example:
Chat-based dialogues (Q&A, learn a new language, customer support, conversations).
Structured tasks (classification, summarization, generation tasks).
Domain-specific data (medical, finance, technical).
Style of output
The style of output will let us know what sources of data we will use to reach our desired output.
For example, the type of output you want to achieve could be JSON, HTML, text or code. Or perhaps you want it to be Spanish, English or German etc.
Data source
When we know the purpose and style of the data we need, we need to analyze its quality and quantity. Hugging Face and Wikipedia are great sources of datasets, and Wikipedia is especially useful if you are looking to train a model to learn a language.
The source of data can be a CSV file, a PDF, or even a website. You can also generate data synthetically, but extra care is required to make sure each example is high quality and relevant.
One of the best ways to create a better dataset is to combine it with a more generalized dataset from Hugging Face, like ShareGPT, to make your model smarter and more diverse.
When we have identified the relevant criteria and collected the necessary data, we can format it into a machine-readable format that is ready for training.
For continued pretraining, we use raw text. This format preserves natural language flow and allows the model to learn from continuous text.
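An illustrative raw-corpus row might look like the following (the text and field name are examples; a single `"text"` column is the common convention):

```python
# A raw-corpus dataset is just rows of plain text under a "text" key.
raw_corpus_row = {
    "text": (
        "Coffee is a beverage brewed from roasted coffee beans. "
        "It is among the most popular drinks in the world."
    )
}
print(raw_corpus_row["text"][:30])
```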
When we want multiple turns of conversation we can use the ShareGPT format:
This template uses "from"/"value" attribute keys, and messages alternate between human and gpt, allowing for natural dialogue flow.
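An illustrative ShareGPT-style row (the conversation content is made up):

```python
# ShareGPT format: "from"/"value" keys, alternating human and gpt turns.
sharegpt_row = {
    "conversations": [
        {"from": "human", "value": "Can you help me make pasta carbonara?"},
        {"from": "gpt", "value": "Sure! Would you like the traditional Roman recipe?"},
        {"from": "human", "value": "Yes please."},
        {"from": "gpt", "value": "You will need eggs, pecorino, and guanciale."},
    ]
}
```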
The other common format is OpenAI's ChatML format, which is what Hugging Face defaults to. It is probably the most used format, and alternates between user and assistant roles.
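An illustrative ChatML-style row (content made up):

```python
# ChatML format: "role"/"content" keys, alternating user and assistant turns.
chatml_row = {
    "messages": [
        {"role": "user", "content": "What is 2+2?"},
        {"role": "assistant", "content": "2+2 equals 4."},
    ]
}
```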
For datasets that follow the common ChatML format, preparing the dataset for training or finetuning consists of four simple steps:
Check the chat templates that Unsloth currently supports:
This will print out the list of templates currently supported by Unsloth. Here is an example output:
Use get_chat_template to apply the right chat template to your tokenizer:
Define your formatting function. Here's an example:
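The Unsloth notebooks typically call `tokenizer.apply_chat_template` inside this function; the self-contained sketch below hand-writes ChatML markers instead so the idea is visible without the library. The marker strings and field names mirror the ChatML convention.

```python
# Stand-in for tokenizer.apply_chat_template: render messages as ChatML text.
def to_chatml_text(messages):
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

def formatting_prompts_func(examples):
    # `examples` is a batch: a dict of columns, each holding a list of rows.
    texts = [to_chatml_text(convo) for convo in examples["messages"]]
    return {"text": texts}

batch = {"messages": [[{"role": "user", "content": "Hi"},
                       {"role": "assistant", "content": "Hello!"}]]}
print(formatting_prompts_func(batch)["text"][0])
```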
This function loops through your dataset applying the chat template you defined to each sample.
Finally, let's load the dataset and apply the required modifications:
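With Hugging Face `datasets`, this step is typically `load_dataset(...)` followed by `dataset.map(formatting_prompts_func, batched=True)`. The sketch below uses a plain Python list as a stand-in for the Dataset object; the template markers are illustrative.

```python
# A plain list stands in for a Hugging Face Dataset here.
rows = [
    {"messages": [{"role": "user", "content": "Hi"},
                  {"role": "assistant", "content": "Hello!"}]},
]

def apply_template(row):
    text = "".join(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
                   for m in row["messages"])
    return {**row, "text": text}

# Equivalent of dataset.map(...) over every row.
formatted = [apply_template(r) for r in rows]
print(formatted[0]["text"])
```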
If your dataset uses the ShareGPT format with "from"/"value" keys instead of the ChatML "role"/"content" format, you can use the standardize_sharegpt function to convert it first.
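A hand-rolled sketch of what the conversion does — rename the keys and map the speaker names (the role mapping here is an assumption based on the two formats described above):

```python
# Map ShareGPT speaker names onto ChatML roles.
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}

def standardize_row(row):
    """Convert one ShareGPT row ("from"/"value") to ChatML ("role"/"content")."""
    messages = [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in row["conversations"]
    ]
    return {"messages": messages}

row = {"conversations": [{"from": "human", "value": "Hi"},
                         {"from": "gpt", "value": "Hello!"}]}
print(standardize_row(row))
```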
Q: How can I use the Alpaca instruct format?
Q: Should I always use the standardize_sharegpt method?
A: Only use the standardize_sharegpt method if your dataset is in the ShareGPT format but your model expects the ChatML format instead.
Q: Why not use the apply_chat_template function that comes with the tokenizer?
Q: What if my template is not currently supported by Unsloth?
Yes, you can use any local LLM like Llama 3.3 (70B) or OpenAI's GPT-4.5 to generate synthetic data. Generally, it is better to use a bigger model like Llama 3.3 (70B) to ensure the highest-quality outputs. You can directly use inference engines like vLLM, Ollama, or llama.cpp to generate synthetic data, but it will require some manual work to collect it and prompt for more data. The goals for synthetic data include:
Produce entirely new data - either from scratch or from your existing dataset
Augment existing data e.g. automatically structure your dataset in the correct chosen format
You can do this process automatically rather than manually via a notebook for synthetic data generation which we will be releasing soon.
Your goal is to prompt the model to generate and process QA data in your specified format. The model will need to learn both the structure you provided and the context, so make sure you have at least 10 examples of data already. Example prompts:
Prompt for generating more dialogue on an existing dataset:
Prompt if you have no dataset:
Prompt for a dataset without formatting:
It is recommended to check the quality of generated data to remove or improve irrelevant or poor-quality responses. Depending on your dataset, it may also need to be balanced in several areas so your model does not overfit. You can then feed this cleaned dataset back into your LLM to regenerate data, now with even more guidance.
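A minimal sketch of such a cleaning pass, assuming QA pairs as tuples: drop exact duplicates and suspiciously short answers (the length threshold is an arbitrary illustrative choice).

```python
def clean_pairs(pairs, min_answer_len=20):
    """Remove duplicate QA pairs and answers shorter than min_answer_len chars."""
    seen, cleaned = set(), []
    for q, a in pairs:
        key = (q.strip().lower(), a.strip().lower())
        if key in seen or len(a.strip()) < min_answer_len:
            continue
        seen.add(key)
        cleaned.append((q, a))
    return cleaned

pairs = [
    ("What is SFT?", "Supervised fine-tuning trains a model on labeled input/output pairs."),
    ("What is SFT?", "Supervised fine-tuning trains a model on labeled input/output pairs."),
    ("What is RL?", "idk"),
]
print(clean_pairs(pairs))  # the duplicate and the too-short answer are removed
```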
We generally recommend using a bare minimum of at least 100 rows of data for fine-tuning to achieve reasonable results. For optimal performance, a dataset with over 1,000 rows is preferable, and in this case, more data usually leads to better outcomes. If your dataset is too small you can also add synthetic data or add a dataset from Hugging Face to diversify it. However, the effectiveness of your fine-tuned model depends heavily on the quality of the dataset, so be sure to thoroughly clean and prepare your data.
If you have multiple datasets for fine-tuning, you can either:
Standardize the format of all datasets, combine them into a single dataset, and fine-tune on this unified dataset.
You can fine-tune an already fine-tuned model multiple times, but it's best to combine all the datasets and perform the fine-tuning in a single process instead. Training an already fine-tuned model can potentially alter the quality and knowledge acquired during the previous fine-tuning process.
See an example of using the Alpaca dataset inside of Unsloth on Google Colab:
We will now use the Alpaca Dataset, created by calling GPT-4 itself. It is a list of 52,000 instructions and outputs, which was very popular when Llama-1 was released, since it made finetuning a base LLM competitive with ChatGPT itself.
You can see there are 3 columns in each row: an instruction, an input, and an output. We essentially combine each row into one large prompt like below. We then use this to finetune the language model, which made it behave very similarly to ChatGPT. We call this process supervised instruction finetuning.
But a big issue for ChatGPT-style assistants is that we only allow 1 instruction / 1 prompt, not multiple columns / inputs. In ChatGPT, for example, you must submit a single prompt, not multiple prompts.
This essentially means we have to "merge" multiple columns into 1 large prompt for finetuning to actually function!
For example, the very famous Titanic dataset has many, many columns. Your job is to predict whether a passenger survived or died based on their age, passenger class, fare price, etc. We can't simply pass this into ChatGPT; rather, we have to "merge" this information into one large prompt.
For example, if we ask ChatGPT with our "merged" single prompt which includes all the information for that passenger, we can then ask it to guess or predict whether the passenger has died or survived.
Other finetuning libraries require you to manually prepare your dataset for finetuning by merging all your columns into one prompt. In Unsloth, we simply provide the function to_sharegpt, which does this in one go!
Now this is a bit more complicated, since we allow a lot of customization, but there are a few points:
You must enclose all columns in curly braces {}. These are the column names in the actual CSV / Excel file.
Optional text components must be enclosed in [[]]. For example, if the column "input" is empty, the merging function will skip the enclosed text entirely. This is useful for datasets with missing values.
Select the output or target / prediction column in output_column_name. For the Alpaca dataset, this will be output.
For example in the Titanic dataset, we can create a large merged prompt format like below, where each column / piece of text becomes optional.
For example, pretend the dataset looks like this, with a lot of missing data:

Embarked: S, Age: 23, Fare: (missing)
Embarked: (missing), Age: 18, Fare: 7.25
Then, we do not want the result to be:
The passenger embarked from S. Their age is 23. Their fare is EMPTY.
The passenger embarked from EMPTY. Their age is 18. Their fare is $7.25.
Instead by optionally enclosing columns using [[]]
, we can exclude this information entirely.
[[The passenger embarked from S.]] [[Their age is 23.]] [[Their fare is EMPTY.]]
[[The passenger embarked from EMPTY.]] [[Their age is 18.]] [[Their fare is $7.25.]]
becomes:
The passenger embarked from S. Their age is 23.
Their age is 18. Their fare is $7.25.
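The [[...]] semantics above can be sketched in plain Python: a bracketed section is dropped entirely whenever any {column} inside it is missing. This is a hand-rolled illustration of the behavior described, not Unsloth's actual implementation.

```python
import re

def merge_prompt(template, row):
    """Fill {column} placeholders; drop any [[...]] section with a missing value."""
    def fill_optional(match):
        section = match.group(1)
        filled = section
        for col in re.findall(r"\{(\w+)\}", section):
            if not row.get(col):
                return ""  # missing value -> drop the whole optional section
            filled = filled.replace("{" + col + "}", str(row[col]))
        return filled

    merged = re.sub(r"\[\[(.*?)\]\]", fill_optional, template)
    for col, val in row.items():  # fill any remaining required columns
        merged = merged.replace("{" + col + "}", str(val))
    return merged.strip()

template = ("[[The passenger embarked from {embarked}.]] "
            "[[Their age is {age}.]] [[Their fare is ${fare}.]]")
print(merge_prompt(template, {"embarked": "S", "age": 23, "fare": None}))
print(merge_prompt(template, {"embarked": None, "age": 18, "fare": 7.25}))
```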
A big issue, if you didn't notice, is that the Alpaca dataset is single-turn, while ChatGPT is interactive and lets you talk over multiple turns. The Alpaca dataset only provides singular conversations, but we want the finetuned language model to somehow learn how to do multi-turn conversations just like ChatGPT.
So we introduced the conversation_extension parameter, which essentially selects some random rows in your single-turn dataset and merges them into one conversation! For example, if you set it to 3, we randomly select 3 rows and merge them into 1. Setting it too high can make training slower, but could make your chatbot and final finetune much better!
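A self-contained sketch of this idea, assuming single-turn rows with instruction/output columns (the random sampling and splicing mimic the described behavior, not Unsloth's internal code):

```python
import random

def extend_conversations(rows, conversation_extension=3, seed=0):
    """Splice N randomly chosen single-turn rows into one multi-turn conversation."""
    rng = random.Random(seed)
    picked = rng.sample(rows, conversation_extension)
    convo = []
    for row in picked:
        convo.append({"role": "user", "content": row["instruction"]})
        convo.append({"role": "assistant", "content": row["output"]})
    return convo

rows = [{"instruction": f"Question {i}", "output": f"Answer {i}"} for i in range(10)]
convo = extend_conversations(rows, conversation_extension=3)
print(len(convo))  # 3 Q/A pairs -> 6 messages
```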
Then set output_column_name to the prediction / output column. For the Alpaca dataset, it would be the output column.
We then use the standardize_sharegpt function to convert the dataset into the correct format for finetuning. Always call this!
Let's take a look at the dataset, and check what the 1st example shows:
Panoramic radiography shows an osteolytic lesion in the right posterior maxilla with resorption of the floor of the maxillary sinus (arrows).
To format the dataset, all vision finetuning tasks should be formatted as follows:
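A sketch of the expected structure: each sample becomes a conversation whose user turn carries both a text part and an image part. The placeholder strings are illustrative; in practice the image field holds the actual image object.

```python
instruction = ("You are an expert radiographer. "
               "Describe accurately what you see in this image.")

# One vision-finetuning sample: content is a list of typed parts.
conversation_example = [
    {"role": "user",
     "content": [{"type": "text", "text": instruction},
                 {"type": "image", "image": "<image object goes here>"}]},
    {"role": "assistant",
     "content": [{"type": "text", "text": "<expert-written caption goes here>"}]},
]
```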
We will craft a custom instruction asking the VLM to be an expert radiographer. Notice also that instead of just one instruction, you can add multiple turns to make it a dynamic conversation.
Let's convert the dataset into the "correct" format for finetuning:
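A hedged sketch of the conversion helper: it maps a raw image/caption sample into the conversation structure above. The field names ("image", "caption") are assumptions about the dataset schema.

```python
def convert_to_conversation(sample, instruction):
    """Wrap one {image, caption} sample into a user/assistant conversation."""
    return {"messages": [
        {"role": "user",
         "content": [{"type": "text", "text": instruction},
                     {"type": "image", "image": sample["image"]}]},
        {"role": "assistant",
         "content": [{"type": "text", "text": sample["caption"]}]},
    ]}

sample = {"image": "xray_001.png",
          "caption": "Osteolytic lesion in the right posterior maxilla."}
converted = convert_to_conversation(sample, "Describe this radiograph.")
print(converted["messages"][1])
```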
The first example is now structured like below:
Before we do any finetuning, maybe the vision model already knows how to analyse the images? Let's check if this is the case!
And the result:
For continued pretraining, we use a raw text format without specific structure:
If we are adapting a model to a new task, and intend for the model to output text in a single turn based on a specific set of instructions, we can use the Instruct format.
A: If your dataset is already in the Alpaca format, follow the formatting steps shown in the Llama 3.1 example. If you need to convert your data to the Alpaca format, one approach is to create a Python script to process your raw data. If you're working on a summarization task, you can use a local LLM to generate instructions and outputs for each example.
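For reference, an Alpaca-format row has instruction/input/output columns, and training merges them into one prompt. The template string below is the standard Stanford Alpaca prompt; the row content is made up.

```python
alpaca_row = {
    "instruction": "Summarize the following text.",
    "input": "Coffee is a brewed drink prepared from roasted coffee beans.",
    "output": "Coffee is a popular drink made from roasted beans.",
}

# Standard Alpaca prompt template, merging the three columns into one prompt.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)

print(ALPACA_TEMPLATE.format(**alpaca_row))
```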
A: The chat_template attribute, when a model is first uploaded by the original model owners, sometimes contains errors and may take time to be updated. In contrast, at Unsloth, we thoroughly check and fix any errors in the chat_template for every model when we upload the quantized versions to our repositories. Additionally, our get_chat_template and apply_chat_template methods offer advanced data-manipulation features, which are fully documented in our Chat Templates documentation.
A: Submit a feature request on the Unsloth GitHub issues page. As a temporary workaround, you can use the tokenizer's own apply_chat_template function until your feature request is approved and merged.
Diversify your dataset so your model does not overfit and become too specific.
If you want to fine-tune a model that already has reasoning capabilities, like the distilled versions of DeepSeek-R1 (e.g. DeepSeek-R1-Distill-Llama-8B), you will still need question/task and answer pairs; however, each answer will need to include the reasoning / chain-of-thought process and the steps taken to derive it. For a model that does not have reasoning, and that you want to train so that it later encompasses reasoning capabilities, you will need to utilize a standard dataset, but this time without reasoning in its answers.
Use the notebook to fine-tune on multiple datasets directly.
You can access the GPT-4 version of the Alpaca dataset. Below are some examples from the dataset:
The dataset for fine-tuning a vision or multimodal model also includes image inputs. For example, our vision example uses a radiography case to show how AI can help medical professionals analyze X-rays, CT scans, and ultrasounds more efficiently.
We'll be using a sampled version of the ROCO radiography dataset. The dataset includes X-rays, CT scans, and ultrasounds showcasing medical conditions and diseases. Each image has a caption written by experts describing it. The goal is to finetune a VLM to make it a useful analysis tool for medical professionals.
For more details, see the dataset section of our documentation.