Assuming your dataset is a list of lists of dictionaries like the one below:
    [
        [
            {'from': 'human', 'value': 'Hi there!'},
            {'from': 'gpt', 'value': 'Hi how can I help?'},
            {'from': 'human', 'value': 'What is 2+2?'},
        ],
        [
            {'from': 'human', 'value': "What's your name?"},
            {'from': 'gpt', 'value': "I'm Daniel!"},
            {'from': 'human', 'value': 'Ok! Nice!'},
            {'from': 'gpt', 'value': 'What can I do for you?'},
            {'from': 'human', 'value': 'Oh nothing :)'},
        ],
    ]
You can use our get_chat_template function to format it. Set chat_template to any of zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, or unsloth, and use mapping to tell it which dictionary keys and values your dataset uses (from, value, etc.). Setting map_eos_token lets you map <|im_end|> to the EOS token without any training.
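As a rough picture of what the mapping argument expresses, the sketch below renames ShareGPT-style keys and speaker names to the role/content convention in plain Python. The helper apply_mapping is hypothetical and is not part of Unsloth; the library may handle the mapping differently internally (for instance by adapting the template rather than the data).

```python
# Illustrative only: what a ShareGPT-style mapping means, in plain Python.
# `apply_mapping` is a hypothetical helper, not an Unsloth function.
mapping = {"role": "from", "content": "value", "user": "human", "assistant": "gpt"}

def apply_mapping(conversation, mapping):
    """Rename dataset keys and speaker names to the role/content convention."""
    role_key = mapping["role"]        # dataset key holding the speaker ("from")
    content_key = mapping["content"]  # dataset key holding the text ("value")
    # Invert the speaker names: "human" -> "user", "gpt" -> "assistant"
    speakers = {mapping["user"]: "user", mapping["assistant"]: "assistant"}
    return [
        {"role": speakers[turn[role_key]], "content": turn[content_key]}
        for turn in conversation
    ]

conversation = [
    {"from": "human", "value": "Hi there!"},
    {"from": "gpt", "value": "Hi how can I help?"},
    {"from": "human", "value": "What is 2+2?"},
]
normalized = apply_mapping(conversation, mapping)
print(normalized[0])  # {'role': 'user', 'content': 'Hi there!'}
```

A turn whose speaker is not listed in the mapping raises a KeyError here, which is a convenient way to spot unexpected speaker names in a dataset.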
You can also make your own custom chat templates! For example, the chat template we use internally is below. You must pass in a tuple of (custom_template, eos_token), and the eos_token must be used inside the template.
    unsloth_template = \
        "{{ bos_token }}"\
        "{{ 'You are a helpful assistant to the user\n' }}"\
        "{% for message in messages %}"\
            "{% if message['role'] == 'user' %}"\
                "{{ '>>> User: ' + message['content'] + '\n' }}"\
            "{% elif message['role'] == 'assistant' %}"\
                "{{ '>>> Assistant: ' + message['content'] + eos_token + '\n' }}"\
            "{% endif %}"\
        "{% endfor %}"\
        "{% if add_generation_prompt %}"\
            "{{ '>>> Assistant: ' }}"\
        "{% endif %}"
    unsloth_eos_token = "eos_token"

    tokenizer = get_chat_template(
        tokenizer,
        chat_template = (unsloth_template, unsloth_eos_token,), # You must provide a template and EOS token
        mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
        map_eos_token = True, # Maps <|im_end|> to </s> instead
    )