Pretraining data problem #73

Open
ruleGreen opened this issue Jul 28, 2020 · 3 comments

Comments

@ruleGreen

Hello, I'd like to ask why I never get any tfrecords when I pretrain with my own data.

[three screenshots attached]

@zhang-yunke

Hello, have you solved this problem? When I pretrain, the number of instances I get is far smaller than the number of articles.

@zhang-yunke

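@ruleGreen Hello, I found a problem on my side: around line 330 of create_pretraining_data.py, the list passed in sometimes turns out to be two-dimensional, so the later steps cannot read anything from it. I made the following change:
document = all_documents[document_index]
changed to
document = all_documents[document_index]
document = np.squeeze(document).tolist()
After this, both the number of instances and the sample data look much more normal.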
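
To illustrate what the change above does, here is a minimal sketch. It assumes the problematic entry is a token list wrapped in one extra outer list. The sample tokens and the helper name `flatten_document` are hypothetical and not taken from the repository. Note that `np.squeeze(...).tolist()` returns a bare scalar rather than a list for single-element input, so the sketch adds `np.atleast_1d`. That extra call is an assumption about what downstream code expects, not part of the fix described above.

```python
import numpy as np


def flatten_document(document):
    """Drop length-1 dimensions from a nested list and return a plain list.

    Hypothetical helper: assumes `document` is rectangular, i.e. every
    inner list has the same length.
    """
    squeezed = np.squeeze(document)       # e.g. shape (1, 3) -> (3,)
    return np.atleast_1d(squeezed).tolist()  # keep a list even for one token


# Made-up sample data standing in for all_documents[document_index]
nested = [["tok_a", "tok_b", "tok_c"]]    # unexpected 2-D shape (1, 3)
already_flat = ["tok_a", "tok_b", "tok_c"]

print(flatten_document(nested))
print(flatten_document(already_flat))
print(flatten_document([["tok_a"]]))
```

`np.squeeze` converts its input to an array first, so this only suits rectangular input. Recent NumPy versions raise an error on ragged nested lists, for example sentences of different lengths.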

@Rxma1805

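Hello, I'd like to ask: does the dataset need to be prepared with each article as a separate txt file, with one sentence per line in each file? I'm not sure whether the next-sentence-prediction training task requires the data to be built this way.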
