Unsloth now supports full fine-tuning, 8-bit and all models! 🦥

๐Ÿ—ฃ๏ธText-to-Speech (TTS) Fine-tuning

Learn how to fine-tune TTS voice models with Unsloth.

Fine-tuning a TTS model lets you adapt it to your own dataset, specific use case, or style/tone. This process helps customize the model for unique voices, speaking styles, new languages, or specific types of content.

With Unsloth, you can fine-tune TTS models 1.2x faster with 50% less memory than other Flash Attention 2 implementations. This support includes OpenAI's Whisper, Orpheus, and most of the currently popular TTS models.

Because voice models are usually small, you can train them with LoRA 16-bit or full fine-tuning (FFT), which may provide higher-quality results.

Fine-tuning Notebooks:

Choosing and Loading a TTS Model

For TTS, the primary model used in our examples is Orpheus-TTS (3B) – a Llama-based speech model. Orpheus was pre-trained on a large speech corpus and can generate highly realistic speech, with support for emotional cues (laughs, sighs, etc.) out of the box. We'll use Orpheus as our example for TTS fine-tuning. To load it in LoRA 16-bit:

from unsloth import FastModel

model_name = "unsloth/orpheus-3b-0.1-pretrained"
model, tokenizer = FastModel.from_pretrained(
    model_name,
    load_in_4bit=False  # set True for 4-bit (QLoRA); False loads 16-bit for LoRA
)

When this runs, Unsloth will download the model weights. If you prefer 8-bit, you can use load_in_8bit=True; for full 16-bit fine-tuning, set full_finetuning=True (ensure you have enough VRAM). You can also replace the model name with other TTS models.
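
For reference, a minimal sketch of those alternative loading modes (same flags as described above; pick one depending on your VRAM):

from unsloth import FastModel

model_name = "unsloth/orpheus-3b-0.1-pretrained"

# 8-bit loading
model, tokenizer = FastModel.from_pretrained(model_name, load_in_8bit=True)

# Or full 16-bit fine-tuning (needs substantially more VRAM)
model, tokenizer = FastModel.from_pretrained(
    model_name,
    load_in_4bit=False,
    full_finetuning=True,
)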

Note: Orpheus's tokenizer already includes special tokens for audio output (more on this later). You do not need a separate vocoder – Orpheus will output audio tokens directly, which can be decoded to a waveform.

Preparing Your Dataset

At minimum, a TTS fine-tuning dataset consists of audio clips and their corresponding transcripts (text). Let's use the Elise dataset, a single-speaker corpus of a female character reading pre-written scripts, as our example of how to prepare data:

Elise dataset: A small (~3 hours) single-speaker speech corpus from Hugging Face. There are two variants:

  • MrDragonFox/Elise – an augmented version with emotion tags embedded in the transcripts. (This clone adds labels like <laughs>, <sighs>, etc., to the text.)

  • Jinsaryko/Elise – base version with transcripts.

The dataset is organized with one audio and transcript per entry. On Hugging Face, these datasets have fields such as audio (the waveform), text (the transcription), and some metadata (speaker name, pitch stats, etc.). We need to feed Unsloth a dataset of audio-text pairs.

Option 1: Using the Hugging Face Datasets library – This is the easiest route if your data is in HF format or a CSV.

from datasets import load_dataset

# Load the Elise dataset from HF (the variant with emotion tags)
dataset = load_dataset("MrDragonFox/Elise", split="train")
# Alternatively, use "Jinsaryko/Elise" for the base version without emotion tags

This will download the data (approx 328 MB for ~1.2k samples). Each item in dataset has dataset[i]["audio"] (an Audio object with array data and sampling rate) and dataset[i]["text"] (the transcript string). You can inspect a sample:

sample = dataset[0]
print(sample["text"])
# e.g., "Oh, honestly, probably still your house <laughs>. But still, I mean, running the dishes through the dishwasher..."

In the MrDragonFox/Elise version, you'll notice tags like <laughs> or <chuckles> in the text – these indicate expressive cues. These tags are enclosed in angle brackets and will be treated as special tokens by the model (they match Orpheus's expected tags like <laugh> and <sigh>).
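
To see which tags your copy of the data actually contains, a quick (illustrative) scan over the transcripts:

import re
from collections import Counter

# Count occurrences of angle-bracket tags such as <laughs> or <sighs>
tag_counts = Counter()
for text in dataset["text"]:
    tag_counts.update(re.findall(r"<[a-z_]+>", text))
print(tag_counts.most_common(10))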

Option 2: Preparing a custom dataset – If you have your own audio files and transcripts:

  • Organize audio clips (WAV/FLAC files) in a folder.

  • Create a CSV or TSV file with columns for file path and transcript. For example:

    filename,text
    0001.wav,Hello there!
    0002.wav,<sigh> I am very tired.
  • Use load_dataset("csv", data_files="mydata.csv", split="train") to load it. You might need to tell the dataset loader how to handle audio paths. An alternative is using the datasets.Audio feature to load audio data on the fly:

    from datasets import load_dataset, Audio
    dataset = load_dataset("csv", data_files="mydata.csv", split="train")
    # Rename the path column and cast it so audio is decoded (and resampled) on access
    dataset = dataset.rename_column("filename", "audio")
    dataset = dataset.cast_column("audio", Audio(sampling_rate=24000))

    Then dataset[i]["audio"] will contain the audio array.

  • Ensure transcripts are normalized (no unusual characters that the tokenizer might not know, other than the emotion tags if used). Also make sure all audio clips share a consistent sampling rate, resampling if necessary to the rate the model expects (e.g. 24 kHz for Orpheus); a quick check is sketched below.
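
A quick way to verify that a clip decoded at the expected rate (a sketch; assumes the audio/text column names from the examples above):

sample = dataset[0]
audio = sample["audio"]
duration_sec = len(audio["array"]) / audio["sampling_rate"]
print(audio["sampling_rate"], "Hz |", round(duration_sec, 2), "s |", sample["text"][:60])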

Emotion tags: If your dataset includes expressive sounds (laughter, sighs, etc.), mark them in the transcript with a tag. Orpheus supports tags like <laugh>, <chuckle>, <sigh>, <cough>, <sniffle>, <groan>, <yawn>, <gasp>, etc. For example: "I missed you <laugh> so much!". During training, the model will learn to associate these tags with the corresponding audio patterns. The Elise dataset with tags already has many of these (e.g., 336 occurrences of "laughs", 156 of "sighs", etc., as listed in its card). If your dataset lacks such tags but you want to incorporate them, you can manually annotate the transcripts where the audio contains those expressions.

In summary, for dataset preparation:

  • You need a list of (audio, text) pairs.

  • Use the HF datasets library to handle loading and optional preprocessing (like resampling).

  • Include any special tags in the text that you want the model to learn (ensure they are in <angle_brackets> format so the model treats them as distinct tokens); see the tokenizer check sketched after this list.

  • (Optional) If multi-speaker, you could include a speaker ID token in the text or use a separate speaker embedding approach, but that's beyond this basic guide (Elise is single-speaker).
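
Before training, it can help to see how the tokenizer actually splits an emotion tag. A minimal check, assuming the tokenizer loaded earlier:

# Inspect how an emotion tag is tokenized by the tokenizer loaded above
ids = tokenizer("I missed you <laugh> so much!")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))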

Fine-Tuning TTS with Unsloth

Now, let's bring it all together and run the fine-tuning. We'll illustrate using Python code (which you can run in a Jupyter notebook, Colab, etc.). This is analogous to running the Unsloth CLI with corresponding arguments.

Step 1: Initialize Model and Dataset

from unsloth import FastModel
from transformers import Trainer, TrainingArguments
from datasets import load_dataset, Audio

# Load the pre-trained Orpheus model (in 4-bit mode) and tokenizer
model_name = "unsloth/orpheus-3b-0.1-pretrained-unsloth-bnb-4bit"
model, tokenizer = FastModel.from_pretrained(model_name, load_in_4bit=True)

# Load the dataset (Elise) and ensure audio is 24kHz
dataset = load_dataset("Jinsaryko/Elise", split="train")
# Cast the audio to 24kHz if not already
dataset = dataset.cast_column("audio", Audio(sampling_rate=24000))

Note: If memory is very limited or the dataset is large, you can stream the data or load it in chunks. Here, 3 hours of audio easily fits in RAM. If using your own dataset CSV, load it similarly.
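
For example, a minimal sketch of the streaming route (same dataset, loaded lazily instead of into RAM):

from datasets import load_dataset

# Stream samples one at a time instead of downloading the whole split into memory
streamed = load_dataset("Jinsaryko/Elise", split="train", streaming=True)
first = next(iter(streamed))
print(first["text"])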

Step 2: Preprocess the data for training

We need to prepare inputs for the Trainer. For text-to-speech, one approach is to train the model in a causal manner: concatenate text and audio token IDs as the target sequence. However, since Orpheus is a decoder-only LLM that outputs audio, we can feed the text as input (context) and have the audio token IDs as labels. In practice, Unsloth's integration might do this automatically if the model's config identifies it as text-to-speech. If not, we can do something like:

# Tokenize the text transcripts
def preprocess_function(example):
    # Tokenize the text (keep the special tokens like <laugh> intact)
    tokens = tokenizer(example["text"], return_tensors="pt")
    # Flatten to list of token IDs
    input_ids = tokens["input_ids"].squeeze(0)
    # The model will generate audio tokens after these text tokens.
    # For training, we can set labels equal to input_ids (so it learns to predict next token).
    # But that only covers text tokens predicting the next text token (which might be an audio token or end).
    # A more sophisticated approach: append a special token indicating start of audio, and let the model generate the rest.
    # For simplicity, use the same input as labels (the model will learn to output the sequence given itself).
    return {"input_ids": input_ids, "labels": input_ids}

train_data = dataset.map(preprocess_function, remove_columns=dataset.column_names)

Important: The above is a simplification. In reality, to fine-tune Orpheus properly, you would need the audio tokens as part of the training labels. Orpheus's pre-training likely involved converting audio to discrete tokens (via an audio codec) and training the model to predict those given the preceding text. For fine-tuning on new voice data, you would similarly need to obtain the audio tokens for each clip (using Orpheus's audio codec). The Orpheus GitHub provides a script for data processing – it encodes audio into sequences of <custom_token_x> tokens.

However, Unsloth may abstract this away: if the model is a FastModel with an associated processor that knows how to handle audio, it might automatically encode the audio in the dataset to tokens. If not, you'd have to manually encode each audio clip to token IDs (using Orpheus's codebook). This is an advanced step beyond this guide, but keep in mind that simply using text tokens won't teach the model the actual audio – it needs to match the audio patterns.

For brevity, let's assume Unsloth provides a way to feed audio directly (for example, by setting a processor and passing the audio array). If Unsloth does not yet support automatic audio tokenization, you might need to use the Orpheus repository's encode_audio function to get token sequences for the audio, then use those as labels. (The dataset entries also include phonemes and some acoustic features, which suggests such a pipeline.)
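
As a rough sketch of that idea (not Unsloth's or Orpheus's actual pipeline; encode_audio below stands in for the Orpheus repository's audio-encoding step):

# Sketch only: `encode_audio` is a hypothetical stand-in for the Orpheus data-processing
# step that converts a waveform into discrete codec code IDs.
def codes_to_custom_tokens(codes):
    # Map codec code IDs onto the <custom_token_x> vocabulary mentioned above
    return "".join(f"<custom_token_{c}>" for c in codes)

def build_training_text(example):
    codes = encode_audio(example["audio"]["array"], example["audio"]["sampling_rate"])  # hypothetical
    return example["text"] + codes_to_custom_tokens(codes)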

Step 3: Set up training arguments and Trainer

training_args = TrainingArguments(
    output_dir="orpheus_finetune_elise",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=5,
    learning_rate=1e-5,
    fp16=True,  # use mixed precision if available
    logging_steps=50,
    save_strategy="epoch",
    report_to="none"  # or "tensorboard" if you want to use TB
)
# Pad variable-length input_ids and labels within each batch (labels padded with -100)
from transformers import DataCollatorForSeq2Seq
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # ensure a pad token exists for batching
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# Instantiate Trainer
trainer = Trainer(
    model=model,
    train_dataset=train_data,
    data_collator=data_collator,
    args=training_args
)

Here we use a small per-device batch with gradient accumulation for an effective batch size of 8, train for 5 epochs over ~1,200 samples (roughly 750 optimizer steps), set LR=1e-5, and enable FP16 training (which helps even with a 4-bit base). Adjust as needed.
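
To make the step arithmetic explicit (assuming ~1,200 samples, as in Elise):

effective_batch = 2 * 4                    # per_device_train_batch_size * gradient_accumulation_steps
steps_per_epoch = 1200 // effective_batch  # 150
total_steps = steps_per_epoch * 5          # ~750 optimizer steps over 5 epochs
print(effective_batch, steps_per_epoch, total_steps)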

Step 4: Begin fine-tuning

trainer.train()

This will start the training loop. You should see logs of loss every 50 steps (as set by logging_steps). The training might take some time depending on GPU – for example, on a Colab T4 GPU, a few epochs on 3h of data may take 1-2 hours. Unsloth's optimizations will make it faster than standard HF training.

During training, Unsloth applies its magic (patches, fused ops, etc.) behind the scenes to speed up computation.

Step 5: Save the fine-tuned model

After training completes (or if you stop it mid-way when you feel it's sufficient), save the model:

trainer.save_model("orpheus_finetune_elise/final")

This saves the model weights (for LoRA, it might save only the adapter weights if the base is not fully fine-tuned). If you used --push_model in the CLI or call trainer.push_to_hub(), you can upload it to the Hugging Face Hub directly.
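
If you do want to push to the Hub, a minimal sketch (the repo id is a placeholder and you must be logged in first):

# Placeholder repo id; requires `huggingface-cli login` or an HF token in the environment
model.push_to_hub("your-username/orpheus-3b-elise")
tokenizer.push_to_hub("your-username/orpheus-3b-elise")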

Now you should have a fine-tuned TTS model in the directory. The next step is to test it out!
