Issue in Segment ID list for BERT example in documentation #258

Open

fz-29 opened this issue Dec 1, 2021 · 1 comment

Comments

fz-29 commented Dec 1, 2021

The documentation section on the blog uses a segments_ids list of the wrong length in the BERT example.

To reproduce the issue:

Running the given code and printing the shapes of the inputs passed to the BERT model makes the issue visible.

text_1 = "Who was Jim Henson ?"
text_2 = "Jim Henson was a puppeteer"

# Tokenized input with special tokens around it (for BERT: [CLS] at the beginning and [SEP] at the end)
indexed_tokens = tokenizer.encode(text_1, text_2, add_special_tokens=True)

# Define sentence A and B indices associated to 1st and 2nd sentences (see paper)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]

# Convert inputs to PyTorch tensors
segments_tensors = torch.tensor([segments_ids])
tokens_tensor = torch.tensor([indexed_tokens])
print(segments_tensors.shape)
print(tokens_tensor.shape)

The output is:

torch.Size([1, 16])
torch.Size([1, 14])

The shape mismatch is clear: the segment IDs tensor has 16 entries, while the tokenized input has only 14.

Proposed Fix:

Since the segments_ids are hardcoded, they could instead be derived programmatically from the tokenized input; see the sketch below.

(I can raise the pull request, but I was not able to find the source file for this specific documentation page.)
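
One possible shape for that fix, as a minimal sketch rather than the documentation's actual code: it assumes a transformers tokenizer whose encode_plus method returns token_type_ids, and reuses the tokenizer, text_1, and text_2 from the snippet above.

# Sketch: derive the segment IDs from the tokenizer output instead of hardcoding them.
encoded = tokenizer.encode_plus(text_1, text_2, add_special_tokens=True)
indexed_tokens = encoded['input_ids']
segments_ids = encoded['token_type_ids']  # 0 for sentence A tokens, 1 for sentence B tokens

segments_tensors = torch.tensor([segments_ids])
tokens_tensor = torch.tensor([indexed_tokens])
print(segments_tensors.shape)  # now always matches tokens_tensor.shape
print(tokens_tensor.shape)

This keeps the two tensors in sync no matter how the example sentences are tokenized.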

@jdsgomes commented

Hi @fz-29, thank you for raising this issue.
The page you are looking for is this one: https://github.com/pytorch/hub/blob/8f8788108bd95b39f9c8729aa1904161e476401e/huggingface_pytorch-transformers.md.
Please add the original authors to the PR review.
