📈Datasets 101

Learn all the essentials of creating a dataset for fine-tuning!

What is a Dataset?

For large language models, datasets are collections of data that can be used to train our models. In order to be useful for training, text data needs to be in a format that can be tokenized.

Tokenization

Tokenization is the process of breaking text into units called tokens. These units can represent words, sub-words, or even characters.

The typical approach with large language models is to split text into sub-word chunks, which allows models to handle inputs that are out of vocabulary.
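To make the sub-word idea concrete, here is a minimal sketch of a greedy longest-match splitter. The vocabulary and the `tokenize` helper are hypothetical and exist only for illustration. Real tokenizers learn their vocabulary from data and use more elaborate algorithms, so treat this as a toy, not as how any particular library works.

```python
def tokenize(word, vocab):
    """Greedy longest-match split of a word into sub-word pieces from vocab."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # Nothing in the vocabulary matched: fall back to one character.
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces


# Hypothetical toy vocabulary (an assumption for this example only).
VOCAB = {"token", "ization", "un", "break", "able", "s"}

print(tokenize("tokenization", VOCAB))
print(tokenize("unbreakable", VOCAB))
print(tokenize("xyz", VOCAB))
```

A word that is not in the vocabulary as a whole, such as "unbreakable", can still be represented by pieces that are, and a completely unknown string degrades to single characters instead of failing.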

During training, each token is mapped to an embedding, a vector in a high-dimensional latent space. Using attention mechanisms, the model adjusts these embeddings so that its outputs are contextually relevant.

In summary: tokenization turns raw text into a format that is machine-readable while retaining meaningful information.

Data Format

To enable the process of tokenization, datasets need to be in a format that can be read by a tokenizer.

| Format | Description | Training Type |
| --- | --- | --- |
| Raw Corpus | Raw text from a source such as a website, book, or article. | Continued Pretraining |
| Instruct | Instructions for the model to follow and an example of the output to aim for. | Supervised fine-tuning (SFT) |
| Conversation | Multiple-turn conversation between a user and an AI assistant. | Supervised fine-tuning (SFT) |
| RLHF | Conversation between a user and an AI assistant, with the assistant's responses being ranked by a human evaluator. | Reinforcement Learning |

Note that each of these types comes in several different formatting styles.
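As an illustration of how ranked responses might be stored, the sketch below turns one human ranking into pairwise preference records. The field names `prompt`, `chosen`, and `rejected` are an assumption borrowed from a common convention, not something this guide prescribes, and `ranked_to_pairs` is a hypothetical helper.

```python
from itertools import combinations


def ranked_to_pairs(prompt, ranked_responses):
    """Build one record per (better, worse) pair.

    ranked_responses must be ordered best first, as ranked by the evaluator.
    """
    return [
        {"prompt": prompt, "chosen": better, "rejected": worse}
        # combinations() keeps input order, so the first item of each
        # pair is always the higher-ranked response.
        for better, worse in combinations(ranked_responses, 2)
    ]


pairs = ranked_to_pairs(
    "Can you help me make pasta carbonara?",
    [
        "Sure. Would you like the traditional Roman recipe?",  # ranked 1st
        "Yes.",                                                # ranked 2nd
        "I cannot cook.",                                      # ranked 3rd
    ],
)
for pair in pairs:
    print(pair["chosen"], ">", pair["rejected"])
```

Three ranked responses yield three pairs. Whether your training setup expects pairs, full rankings, or scalar scores depends on the method you use.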

Getting Started

Before we format our data, we want to identify the following:

1. Purpose of dataset

Knowing the purpose of the dataset helps us determine what data we need and which format to use.

The purpose could be adapting a model to a new task, such as summarization, or improving a model's ability to role-play a specific character.

2. Style of output

The style of output tells us which sources of data we need to reach our desired output.

For example, the output you want could be JSON, HTML, text, or code. It could also be in a particular language, such as Spanish, English, or German.

3. Data source

Once we know the purpose and style of the data we need, we can look for a source to collect it from.

The data source can be a CSV file, a PDF, or even a website. You can also generate data synthetically, but extra care is required to make sure each example is high quality and relevant.

Formatting Our Data

Once we have identified the relevant criteria and collected the necessary data, we can convert the data into a machine-readable format that is ready for training.

For continued pretraining, we use a raw text format with no specific structure:

  "text": "Pasta carbonara is a traditional Roman pasta dish. The sauce is made by mixing raw eggs with grated Pecorino Romano cheese and black pepper. The hot pasta is then tossed with crispy guanciale (cured pork cheek) and the egg mixture, creating a creamy sauce from the residual heat. Despite popular belief, authentic carbonara never contains cream or garlic. The dish likely originated in Rome in the mid-20th century, though its exact origins are debated..."

This format preserves natural language flow and allows the model to learn from continuous text.
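One way to produce records in this shape is to wrap each document in a `{"text": ...}` object and write one object per line (JSON Lines). This is a minimal sketch: the helper names are hypothetical, and JSON Lines is an assumed storage choice, not a requirement stated here.

```python
import json


def to_text_records(documents):
    """Wrap each non-empty document in a {"text": ...} record."""
    records = []
    for doc in documents:
        cleaned = doc.strip()  # trim edges only, keep the internal flow intact
        if cleaned:
            records.append({"text": cleaned})
    return records


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


docs = [
    "Pasta carbonara is a traditional Roman pasta dish.",
    "   ",  # empty after stripping, so it is skipped
    "Authentic carbonara never contains cream or garlic.\n",
]
records = to_text_records(docs)
print(to_jsonl(records))
```

Only leading and trailing whitespace is removed, so paragraph breaks inside a document survive as escaped `\n` characters in the JSON string.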

If we are adapting a model to a new task and want it to produce text in a single turn based on a specific set of instructions, we can use the Alpaca-style instruction format:

"Instruction": "Task we want the model to perform."

"Input": "Optional, but useful, it will essentially be the user's query."

"Output": "The expected result of the task and the output of the model."
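Before training, the three fields are usually merged into a single prompt string. The sketch below uses the widely circulated Alpaca prompt wording. The lowercase key names and the `format_alpaca` helper are assumptions for this example, so match the keys to whatever your dataset actually uses.

```python
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)


def format_alpaca(record):
    """Render one Alpaca-style record as a single training string."""
    return ALPACA_TEMPLATE.format(
        instruction=record["instruction"],
        input=record.get("input", ""),  # the input field is optional
        output=record["output"],
    )


example = {
    "instruction": "Summarize the text in one sentence.",
    "input": "Pasta carbonara is a traditional Roman pasta dish made with "
             "eggs, Pecorino Romano, guanciale, and black pepper.",
    "output": "Carbonara is a Roman pasta made with eggs, cheese, and guanciale.",
}
print(format_alpaca(example))
```

In practice an end-of-sequence token is typically appended to each formatted string so the model learns where to stop. That step depends on the tokenizer and is left out here.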

When we want multiple turns of conversation, we can use the ShareGPT conversational format:

{
  "conversations": [
    {
      "from": "human",
      "value": "Can you help me make pasta carbonara?"
    },
    {
      "from": "assistant",
      "value": "Would you like the traditional Roman recipe, or a simpler version?"
    },
    {
      "from": "human",
      "value": "The traditional version please"
    },
    {
      "from": "assistant",
      "value": "The authentic Roman carbonara uses just a few ingredients: pasta, guanciale, eggs, Pecorino Romano, and black pepper. Would you like the detailed recipe?"
    }
  ]
}

Each message alternates between human and assistant, allowing for natural dialogue flow.
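Many chat templates expect `role`/`content` messages instead of `from`/`value` pairs. The following sketch converts a conversation in the shape shown above and checks that turns alternate. The `ROLE_MAP` table and `to_messages` helper are hypothetical, and the target role names `user` and `assistant` are an assumption about what your template expects.

```python
# Map source speaker labels to chat roles. "gpt" is included because some
# ShareGPT-style datasets use it for the assistant side.
ROLE_MAP = {"human": "user", "assistant": "assistant", "gpt": "assistant"}


def to_messages(sample):
    """Convert from/value turns to role/content messages, enforcing alternation."""
    messages = []
    for turn in sample["conversations"]:
        role = ROLE_MAP[turn["from"]]
        if messages and messages[-1]["role"] == role:
            raise ValueError(f"two consecutive {role} turns")
        messages.append({"role": role, "content": turn["value"]})
    return messages


sample = {
    "conversations": [
        {"from": "human", "value": "Can you help me make pasta carbonara?"},
        {"from": "assistant", "value": "Would you like the traditional Roman recipe?"},
        {"from": "human", "value": "The traditional version please"},
    ]
}
for message in to_messages(sample):
    print(message["role"], ":", message["content"])
```

Raising on repeated speakers is a deliberate choice for this sketch: it surfaces malformed samples early instead of silently training on them.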

Q: How can I format my raw data into Alpaca instruct?

A: There is no single method for turning raw data into a given format. The right approach depends on the source and structure of your data.
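As one example among many, data that already sits in columns can be mapped to instruction records directly. The sketch below reads a small CSV and emits Alpaca-style records. The column names (`question`, `context`, `answer`), the lowercase output keys, and the `csv_to_alpaca` helper are all hypothetical.

```python
import csv
import io

# Inline CSV so the example is self-contained; in practice this is a file.
CSV_DATA = """question,context,answer
What is guanciale?,,Cured pork cheek.
Summarize the text.,Carbonara is a Roman pasta dish.,A Roman pasta dish.
"""


def csv_to_alpaca(csv_text):
    """Map each CSV row to an instruction/input/output record."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [
        {
            "instruction": row["question"].strip(),
            "input": row["context"].strip(),  # may be empty
            "output": row["answer"].strip(),
        }
        for row in rows
    ]


for record in csv_to_alpaca(CSV_DATA):
    print(record)
```

Unstructured sources such as PDFs or web pages need an extra step first, because something has to decide what the instruction and the expected output are for each example.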

Multiple datasets

If you have multiple datasets for fine-tuning, you can either:

  • Standardize the format of all datasets, combine them into a single dataset, and fine-tune on this unified dataset.

  • Use the Multiple Datasets notebook to fine-tune on multiple datasets directly.
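The first option can be sketched as follows: convert every dataset to one shared shape, then concatenate. Here Alpaca-style records are rewritten as two-turn conversations so they can sit next to existing conversational data. The helper names and the lowercase keys are assumptions for this example.

```python
def alpaca_to_conversation(record):
    """Convert one Alpaca-style record into a two-turn conversation."""
    prompt = record["instruction"]
    if record.get("input"):
        # Append the optional input below the instruction.
        prompt += "\n\n" + record["input"]
    return {
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "assistant", "value": record["output"]},
        ]
    }


def combine(alpaca_records, conversation_records):
    """Standardize Alpaca records, then merge with conversational records."""
    unified = [alpaca_to_conversation(r) for r in alpaca_records]
    unified.extend(conversation_records)
    return unified


alpaca_data = [
    {"instruction": "Translate to Spanish.", "input": "Good morning", "output": "Buenos días"},
]
chat_data = [
    {"conversations": [
        {"from": "human", "value": "Can you help me make pasta carbonara?"},
        {"from": "assistant", "value": "Would you like the traditional Roman recipe?"},
    ]},
]
dataset = combine(alpaca_data, chat_data)
print(len(dataset), "samples in one format")
```

After merging, it is usually worth shuffling the combined list so that training batches mix samples from both sources.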
