This is an automated archive made by the Lemmit Bot.
The original was posted on /r/machinelearning by /u/mtasic85 on 2024-10-18 08:35:43+00:00.
Mainstream LLM tokenizers can't always encode a string and then decode it back to the exact same string — they aren't lossless. For example, several Llama, Mistral, and Phi tokenizers cannot round-trip the string ' Who let the dog out?! !'.

If you run this code:
from transformers import AutoTokenizer

models = [
    'meta-llama/Llama-2-7b',
    'meta-llama/Meta-Llama-3-8B',
    'meta-llama/Llama-3.1-8B',
    'mistralai/Mistral-7B-v0.3',
    'mistralai/Mixtral-8x7B-v0.1',
    'mistralai/Mixtral-8x22B-v0.1',
    'mistralai/Mistral-Nemo-Instruct-2407',
    'mistralai/Mistral-Small-Instruct-2409',
    'mistralai/Mistral-Large-Instruct-2407',
    'microsoft/phi-1',
    'microsoft/phi-1_5',
    'microsoft/phi-2',
    'microsoft/Phi-3-mini-4k-instruct',
    'microsoft/Phi-3.5-mini-instruct',
]

text = ' Who let the dog out?! !'

for n in models:
    tokenizer = AutoTokenizer.from_pretrained(n)
    text2 = tokenizer.decode(tokenizer.encode(text, add_special_tokens=False))

    if text2 == text:
        print('OK: ', n, repr(text2))
    else:
        print('ERR:', n, repr(text2))
You will get:
OK:  meta-llama/Llama-2-7b ' Who let the dog out?! !'
ERR: meta-llama/Meta-Llama-3-8B ' Who let the dog out?!!'
ERR: meta-llama/Llama-3.1-8B ' Who let the dog out?!!'
ERR: mistralai/Mistral-7B-v0.3 'Who let the dog out?! !'
OK:  mistralai/Mixtral-8x7B-v0.1 ' Who let the dog out?! !'
ERR: mistralai/Mixtral-8x22B-v0.1 'Who let the dog out?! !'
OK:  mistralai/Mistral-Nemo-Instruct-2407 ' Who let the dog out?! !'
OK:  mistralai/Mistral-Small-Instruct-2409 ' Who let the dog out?! !'
OK:  mistralai/Mistral-Large-Instruct-2407 ' Who let the dog out?! !'
ERR: microsoft/phi-1 ' Who let the dog out?!!'
ERR: microsoft/phi-1_5 ' Who let the dog out?!!'
ERR: microsoft/phi-2 ' Who let the dog out?!!'
OK:  microsoft/Phi-3-mini-4k-instruct ' Who let the dog out?! !'
OK:  microsoft/Phi-3.5-mini-instruct ' Who let the dog out?! !'
Every model marked ERR fails to encode and then decode back to the same string. Note that there are two distinct failure modes in the output above: the Mistral-7B-v0.3 and Mixtral-8x22B cases drop the leading space, while the Llama 3 and Phi cases collapse '?! !' into '?!!'.
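One likely contributor to the second failure mode (an assumption based on transformers' default behavior, not verified against each model's tokenizer config): by default, decode() applies a clean_up_tokenization_spaces post-processing step that collapses a space before common punctuation, which alone is enough to turn '?! !' into '?!!'. A minimal sketch of that cleanup rule:

```python
# Sketch of the space-cleanup step transformers applies on decode when
# clean_up_tokenization_spaces is enabled. This mirrors the replacements in
# PreTrainedTokenizerBase.clean_up_tokenization; treat it as an illustration
# of one loss source, not the only way these tokenizers lose information.
def clean_up_tokenization(out_string: str) -> str:
    return (
        out_string.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
        .replace(" ' ", "'")
        .replace(" n't", "n't")
        .replace(" 'm", "'m")
        .replace(" 's", "'s")
        .replace(" 've", "'ve")
        .replace(" 're", "'re")
    )

text = ' Who let the dog out?! !'
print(repr(clean_up_tokenization(text)))  # ' Who let the dog out?!!'
```

Passing clean_up_tokenization_spaces=False to decode() may avoid this particular loss. The Mistral cases that drop the leading space point to a different cause, presumably the SentencePiece-style handling of the prefix space during encoding.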