📈 Datasets 101
Learn all the essentials of creating a dataset for fine-tuning!
What is a Dataset?
For large language models, datasets are collections of data that can be used to train our models. In order to be useful for training, text data needs to be in a format that can be tokenized.
Tokenization
Tokenization is the process of breaking text into units called tokens. These units can be representative of words, sub-words or even characters.
The typical approach with large language models is to break text into sub-word chunks; this allows the model to handle inputs that would otherwise be out of vocabulary.
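As a rough illustration (the library and model shown are just one possible choice, not a requirement), a sub-word tokenizer splits unfamiliar words into smaller known pieces:

```python
# A minimal sketch of sub-word tokenization using the Hugging Face
# transformers library; any sub-word tokenizer behaves similarly.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Tokenization handles uncommon words gracefully."
tokens = tokenizer.tokenize(text)               # sub-word pieces, e.g. ['Token', 'ization', ...]
ids = tokenizer.convert_tokens_to_ids(tokens)   # integer IDs the model actually sees

print(tokens)
print(ids)
```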
During training, tokens are embedded in a high-dimensional latent space. By utilizing attention mechanisms, the model fine-tunes these embeddings to produce contextually relevant outputs.
In Summary: Tokenization turns raw text into a format that is both machine-readable and retains meaningful information.
Data Format
To enable the process of tokenization, datasets need to be in a format that can be read by a tokenizer.
| Format | Description | Training Type |
| --- | --- | --- |
| Raw Corpus | Raw text from a source such as a website, book, or article. | Continued pretraining |
| Instruct | Instructions for the model to follow and an example of the output to aim for. | Supervised fine-tuning (SFT) |
| Conversation | Multiple-turn conversation between a user and an AI assistant. | Supervised fine-tuning (SFT) |
| RLHF | Conversation between a user and an AI assistant, with the assistant's responses ranked by a human evaluator. | Reinforcement learning |
It's worth noting that different styles of format exist for each of these types.
Getting Started
Before we format our data, we want to identify the following:
Purpose of dataset
Knowing the purpose of the dataset will help us determine what data we need and format to use.
The purpose could be adapting a model to a new task, such as summarization, or improving a model's ability to role-play a specific character.
Style of output
The style of output we want determines which sources of data we will use to reach it.
For example, the output could be JSON, HTML, plain text, or code, and it may need to be in a particular language such as Spanish, English, or German.
Data source
When we know the purpose and style of the data we need, we can look for a data source to collect our data from.
The source of data can be a CSV file, a PDF, or even a website. You can also generate data synthetically, but extra care is required to make sure each example is high quality and relevant.
Formatting Our Data
Once we have identified the relevant criteria and collected the necessary data, we can format the data into a machine-readable form that is ready for training.
For continued pretraining, we use a raw text format without any specific structure.
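As an illustration (the passage below is a placeholder, though a single "text" field per example is the common convention), a raw-corpus entry can be as simple as:

```python
# A minimal sketch of one raw-corpus example for continued pretraining.
# The passage is illustrative; a real dataset contains many such entries.
raw_example = {
    "text": (
        "Pasta carbonara is a traditional Roman pasta dish. The sauce is made "
        "with egg yolks, pecorino cheese, guanciale, and black pepper."
    )
}
```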
This format preserves natural language flow and allows the model to learn from continuous text.
If we are adapting a model to a new task and intend for it to output text in a single turn based on a specific set of instructions, we can use the instruction format in Alpaca style.
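A single Alpaca-style example typically has "instruction", "input" (optional context), and "output" fields; the contents below are illustrative placeholders:

```python
# A sketch of one Alpaca-style instruction example; the field contents are illustrative.
alpaca_example = {
    "instruction": "Summarize the following passage in one sentence.",
    "input": "Large language models are trained on vast amounts of text data...",
    "output": "Large language models learn language patterns from very large text corpora.",
}
```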
When we want multiple turns of conversation, we can use the ShareGPT conversational format.
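A ShareGPT-style example commonly stores a "conversations" list whose turns use "from" and "value" keys, with human turns marked "human" and assistant turns marked "gpt"; the dialogue below is an illustrative placeholder:

```python
# A sketch of one ShareGPT-style conversational example; the dialogue is illustrative.
sharegpt_example = {
    "conversations": [
        {"from": "human", "value": "Can you help me make pasta carbonara?"},
        {"from": "gpt", "value": "Of course! Would you like the traditional Roman recipe?"},
        {"from": "human", "value": "Yes, the traditional version please."},
        {"from": "gpt", "value": "You'll need guanciale, egg yolks, pecorino cheese, and black pepper."},
    ]
}
```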
Each message alternates between the human and the assistant, allowing for natural dialogue flow.
Q: How can I format my raw data into Alpaca instruct?
A: There are many different methods for turning raw data into each respective format.
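For example, if your raw data is a CSV of question and answer pairs, a small script can map each row to the Alpaca fields. In the sketch below, the file name and column names ("qa_pairs.csv", "question", "answer") are hypothetical:

```python
# A hedged sketch: convert a CSV of question/answer pairs into Alpaca-style
# examples and save them as a JSONL file. The file and column names are hypothetical.
import json
import pandas as pd

df = pd.read_csv("qa_pairs.csv")

with open("alpaca_dataset.jsonl", "w") as f:
    for _, row in df.iterrows():
        example = {
            "instruction": row["question"],
            "input": "",              # no extra context in this simple case
            "output": row["answer"],
        }
        f.write(json.dumps(example) + "\n")
```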
Multiple datasets
If you have multiple datasets for fine-tuning, you can either:
Standardize the format of all datasets, combine them into a single dataset, and fine-tune on this unified dataset (a sketch of this approach follows below).
Use the Multiple Datasets notebook to fine-tune on multiple datasets directly.
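For the first option, a minimal sketch using the Hugging Face datasets library might look like the following; the file names are placeholders, and it assumes both datasets have already been standardized to the same columns:

```python
# A sketch of combining two already-standardized datasets into one for fine-tuning.
# The file names are placeholders; both files are assumed to share the same columns.
from datasets import load_dataset, concatenate_datasets

dataset_a = load_dataset("json", data_files="dataset_a.jsonl", split="train")
dataset_b = load_dataset("json", data_files="dataset_b.jsonl", split="train")

combined = concatenate_datasets([dataset_a, dataset_b]).shuffle(seed=42)
print(combined)
```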