clean_up_tokenization_spaces=True causes formatting issues, why is it set?

#44
by dzhulgakov - opened

Setting clean_up_tokenization_spaces=True in tokenizer_config.json causes odd whitespace changes in the output and makes the tokenizer's encode+decode round trip lossy. This is especially pronounced for code. Minimal repro:

from transformers import AutoTokenizer

t = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B-Instruct")
s = "foo ?? bar"
ids = t.encode(s)
s_cleanup = t.decode(ids)  # picks up clean_up_tokenization_spaces=True from tokenizer_config.json
s_no_cleanup = t.decode(ids, clean_up_tokenization_spaces=False)
print(ids)
print(s)
print(s_cleanup)
print(s_no_cleanup)

which outputs

[128000, 8134, 9602, 3703]
foo ?? bar
<|begin_of_text|>foo?? bar
<|begin_of_text|>foo ?? bar

Notice the missing space before ?? in the cleaned-up decode (third printed line): the round trip through encode+decode no longer reproduces the input.
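For context on where the space goes: the cleanup pass is just a set of hard-coded string replacements that strip the space before common punctuation and contractions. The sketch below is a paraphrase of the helper in transformers (see PreTrainedTokenizerBase.clean_up_tokenization for the exact list); it reproduces the lossy behavior on the repro string without needing the model:

```python
def clean_up_tokenization(out_string: str) -> str:
    """Approximation of transformers' cleanup pass: drop the space
    before common English punctuation and contractions."""
    return (
        out_string.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
        .replace(" ' ", "' ")
        .replace(" n't", "n't")
        .replace(" 'm", "'m")
        .replace(" 's", "'s")
        .replace(" 've", "'ve")
        .replace(" 're", "'re")
    )

print(clean_up_tokenization("foo ?? bar"))  # foo?? bar — the " ?" rule eats the space
```

These rules make sense for detokenizing English prose from word-level tokenizers, but for a BPE tokenizer that already encodes spaces in its tokens (and especially for code, where "foo ??" and "foo??" are different programs) they just corrupt the output.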

FWIW, Llama2 had it as False.

Meta's official repo (https://github.com/meta-llama/llama3/blob/main/llama/tokenizer.py) doesn't apply any special post-processing around tiktoken either, and for the text above it preserves the space. Why was this setting turned on for Llama3?
