Replies: 2 comments 1 reply
-
You have the size relationship backwards: the original LLaMA vocabulary is 32K, while ours is close to 50K. The encoding issue is already explained in the technical report; if you're interested, take a look for yourself.

Of course you can; it's a matter of technical choice, and no one says the vocabulary must be extended.

There is no clear research conclusion yet on what impact not extending the vocabulary would have.
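The two vocabulary sizes are easy to check yourself. A minimal sketch (not from the thread; `meta-llama/Llama-2-7b-hf` is the gated Hugging Face repo for the base model, and `path/to/chinese-llama` is a placeholder you should point at a local copy of the extended tokenizer):

```python
# Minimal sketch: compare the vocabulary sizes of the original and extended tokenizers.
# "path/to/chinese-llama" is a placeholder for a local Chinese-LLaMA tokenizer directory.
from transformers import LlamaTokenizer

orig = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # gated repo; requires access
ext = LlamaTokenizer.from_pretrained("path/to/chinese-llama")

print(len(orig))  # 32000 -- the original LLaMA vocabulary
print(len(ext))   # close to 50K after the Chinese vocabulary extension
```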
-
It's not backwards. The original LLaMA tokenizer represents a single Chinese token with a combination of several token IDs, the same way tiktoken does. I tested the LLaMA 1 and LLaMA 2 tokenizers, and they cover all Chinese tokens.
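This multi-ID behavior is SentencePiece byte fallback, and it can be observed directly. A minimal sketch (same gated repo as above; the exact pieces printed may vary by tokenizer version): a Chinese character missing from the base vocabulary is encoded as its three UTF-8 bytes, one token ID each.

```python
# Minimal sketch: show byte fallback in the original LLaMA tokenizer.
from transformers import LlamaTokenizer

tok = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

ids = tok.encode("饕", add_special_tokens=False)  # a character unlikely to be in the base vocab
print(ids)
print(tok.convert_ids_to_tokens(ids))
# Expect byte pieces such as ['▁', '<0xE9>', '<0xA5>', '<0x95>']:
# three UTF-8 byte tokens for one character, plus the SentencePiece word prefix.
```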
-
The original LLaMA 1 and LLaMA 2 tokenizers already cover all of the tokens in chinese-llama, so why extend the Chinese vocabulary at all? Couldn't the LoRA language-model training be done directly on top of the original LLaMA tokenizer? And if the vocabulary is not extended, what effect does that have on the results?
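One measurable effect of skipping the extension is encoding efficiency. A minimal sketch (placeholder path as before; the sentence and counts are illustrative): without dedicated Chinese tokens, many characters fall back to three byte-level IDs each, so sequences grow, the effective context window shrinks, and training and inference process fewer characters per step.

```python
# Minimal sketch: count how many tokens the same Chinese sentence costs
# under the original tokenizer vs. an extended one.
# "path/to/chinese-llama" is a placeholder for a local Chinese-LLaMA tokenizer directory.
from transformers import LlamaTokenizer

orig = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # gated repo; requires access
ext = LlamaTokenizer.from_pretrained("path/to/chinese-llama")

text = "人工智能正在改变我们的生活方式。"
print(len(orig.encode(text, add_special_tokens=False)))  # longer: byte fallback for rarer characters
print(len(ext.encode(text, add_special_tokens=False)))   # shorter: dedicated Chinese tokens
```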