This is an automated archive made by the Lemmit Bot.
The original was posted on /r/machinelearning by /u/mtasic85 on 2024-10-18 08:35:43+00:00.
Mainstream LLM tokenizers can't always encode a string and then decode it back to the exact same string — they aren't lossless. For example, several Llama, Mistral, and Phi tokenizers cannot round-trip the string ' Who let the dog out?! !'.

If you run this code:
from transformers import AutoTokenizer

models = [
    'meta-llama/Llama-2-7b',
    'meta-llama/Meta-Llama-3-8B',
    'meta-llama/Llama-3.1-8B',
    'mistralai/Mistral-7B-v0.3',
    'mistralai/Mixtral-8x7B-v0.1',
    'mistralai/Mixtral-8x22B-v0.1',
    'mistralai/Mistral-Nemo-Instruct-2407',
    'mistralai/Mistral-Small-Instruct-2409',
    'mistralai/Mistral-Large-Instruct-2407',
    'microsoft/phi-1',
    'microsoft/phi-1_5',
    'microsoft/phi-2',
    'microsoft/Phi-3-mini-4k-instruct',
    'microsoft/Phi-3.5-mini-instruct',
]

text = ' Who let the dog out?! !'

for n in models:
    tokenizer = AutoTokenizer.from_pretrained(n)
    text2 = tokenizer.decode(tokenizer.encode(text, add_special_tokens=False))

    if text2 == text:
        print('OK: ', n, repr(text2))
    else:
        print('ERR:', n, repr(text2))
You will get:
OK:  meta-llama/Llama-2-7b ' Who let the dog out?! !'
ERR: meta-llama/Meta-Llama-3-8B ' Who let the dog out?!!'
ERR: meta-llama/Llama-3.1-8B ' Who let the dog out?!!'
ERR: mistralai/Mistral-7B-v0.3 'Who let the dog out?! !'
OK:  mistralai/Mixtral-8x7B-v0.1 ' Who let the dog out?! !'
ERR: mistralai/Mixtral-8x22B-v0.1 'Who let the dog out?! !'
OK:  mistralai/Mistral-Nemo-Instruct-2407 ' Who let the dog out?! !'
OK:  mistralai/Mistral-Small-Instruct-2409 ' Who let the dog out?! !'
OK:  mistralai/Mistral-Large-Instruct-2407 ' Who let the dog out?! !'
ERR: microsoft/phi-1 ' Who let the dog out?!!'
ERR: microsoft/phi-1_5 ' Who let the dog out?!!'
ERR: microsoft/phi-2 ' Who let the dog out?!!'
OK:  microsoft/Phi-3-mini-4k-instruct ' Who let the dog out?! !'
OK:  microsoft/Phi-3.5-mini-instruct ' Who let the dog out?! !'
Every model marked ERR fails to encode and then decode back to the same string. Note that there are two distinct failure modes in the output above: the Mistral-7B-v0.3 and Mixtral-8x22B cases drop the leading space, while the Llama 3 and Phi cases collapse '?! !' into '?!!'.
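One likely contributor to the second failure mode (an assumption based on transformers' default behavior, not verified against each model's tokenizer config): by default, decode() applies a clean_up_tokenization_spaces post-processing step that collapses a space before common punctuation, which alone is enough to turn '?! !' into '?!!'. A minimal sketch of that cleanup rule:

```python
# Sketch of the space-cleanup step transformers applies on decode when
# clean_up_tokenization_spaces is enabled. This mirrors the replacements in
# PreTrainedTokenizerBase.clean_up_tokenization; treat it as an illustration
# of one loss source, not the only way these tokenizers lose information.
def clean_up_tokenization(out_string: str) -> str:
    return (
        out_string.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
        .replace(" ' ", "'")
        .replace(" n't", "n't")
        .replace(" 'm", "'m")
        .replace(" 's", "'s")
        .replace(" 've", "'ve")
        .replace(" 're", "'re")
    )

text = ' Who let the dog out?! !'
print(repr(clean_up_tokenization(text)))  # ' Who let the dog out?!!'
```

Passing clean_up_tokenization_spaces=False to decode() may avoid this particular loss. The Mistral cases that drop the leading space point to a different cause, presumably the SentencePiece-style handling of the prefix space during encoding.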