Pretraining data is concatenated first and then split into block_size chunks, which can leave a single sample with unrelated context #724
Unanswered
sameul-yuan
asked this question in
Q&A
Replies: 0 comments
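In the pretraining data processing, the entire pt_sample_data.txt is first concatenated and then split into chunks of block_size. This can put completely unrelated content into the same sample for autoregressive training, which may make the model prone to producing nonsense. Is this how pretraining data is usually processed?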
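To make the concern concrete, here is a minimal sketch of the concatenate-then-chunk step as I understand it. It is not the repository's actual code: the token ids, the `EOS` separator id, and both function names are hypothetical and chosen only for illustration.

```python
from typing import List

EOS = 0  # hypothetical end-of-document token id (assumption, not from the repo)


def concat_and_chunk(docs: List[List[int]], block_size: int) -> List[List[int]]:
    """Concatenate all tokenized documents, then cut fixed-size blocks."""
    # Append a separator after each document so boundaries stay visible.
    stream = [tok for doc in docs for tok in doc + [EOS]]
    # Drop the trailing remainder that does not fill a whole block.
    usable = (len(stream) // block_size) * block_size
    return [stream[i:i + block_size] for i in range(0, usable, block_size)]


def blocks_mixing_documents(blocks: List[List[int]]) -> List[int]:
    """Return indices of blocks that contain a separator before their last token,
    i.e. blocks where tokens of a following document share the same sample."""
    return [i for i, block in enumerate(blocks) if EOS in block[:-1]]


docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]  # three toy "documents"
blocks = concat_and_chunk(docs, block_size=4)
mixed = blocks_mixing_documents(blocks)
print(blocks)
print(mixed)
```

In this toy run the middle block holds the end of the second document followed by the start of the third, which is exactly the case I am asking about: the model is trained to predict the third document's tokens with the second document as its context.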