Missing EOS? #12

Open
Zadagu opened this issue May 22, 2021 · 12 comments

Zadagu commented May 22, 2021

Hi,
I'm looking through your training data (the JSON representation).
I found an instance where tokens are not followed by an EOS node.

  "targets": [
    "Nm",
    "idle,sources",
    "Cal",
    "Nm",
    "get",
    "EOS",
    "Nm",
    "i",
    "EOS",
    "EOS"
  ],
  "target_seq": "idleSources.get(i)",

Could you please elaborate why there is no EOS after "idle,sources" in this case?

urialon (Contributor) commented May 23, 2021

Hi @Zadagu ,
Thank you for your interest in SLM.

I understand the confusion - the order of targets is not their sequential order.
Specifically, it seems that idleSources can be copied as a full token from the context. Thus, the model will be trained to predict it by copying the entire token from another head_path.

Because it is a full token (marked with the comma in idle,sources), there is no need to predict an EOS after it.
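
If it helps, here is a minimal sketch of that target-construction logic (this is not our preprocessing code, just an illustration; the function name and the set-of-context-tokens argument are assumptions):

```python
def build_targets(subtokens, context_tokens):
    """Illustrative only: build the "targets" for one predicted token.

    If the full token (its subtokens joined by ',') appears in the context
    and can therefore be copied as a whole, it becomes a single target with
    no EOS; otherwise the token is predicted subtoken by subtoken and
    terminated with an EOS. Node-type targets such as "Nm" / "Cal" are
    omitted here.
    """
    full_token = ",".join(subtokens)          # e.g. "idle,sources"
    if full_token in context_tokens:
        return [full_token]                   # copied whole -> no EOS
    return list(subtokens) + ["EOS"]          # predicted subtoken by subtoken


# The example from this issue:
print(build_targets(["idle", "sources"], {"idle,sources"}))  # ['idle,sources']
print(build_targets(["get"], {"idle,sources"}))              # ['get', 'EOS']
print(build_targets(["i"], {"idle,sources"}))                # ['i', 'EOS']
```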

Does that make sense?
I hope it helps,
Uri

Zadagu (Author) commented May 23, 2021

Hi @urialon,
thank you for your explanation.
Might I ask one more question?
I read your paper and I'm still not sure how s("idle,sources") is composed.
Is it:

s("idle,sources") = \sum_{val(l)="idle,sources"} s_copy_token(l) + \sum_{val(l0)="idle"} s_copy_0(l) + \sum_{val(l1)="sources"} s_copy_1(l)

or is it just:

s("idle,sources") = \sum_{val(l)="idle,sources"} s_copy_token(l)

urialon (Contributor) commented May 23, 2021

Hi @Zadagu ,
Sure, you can ask as many questions as you like :-)

Let me separate my answer into two parts:

Training

If idle,sources appears in the context, that is, if it can be copied as a full token, its score is computed as in your second option (s_copy_token(l) only). In that case the model is trained to copy the full token, i.e., to predict a single "target" without an EOS.
(There is a rare case where "idle,sources" is frequent enough to appear in the vocabulary; in that case it also gets its vocabulary score.)

If idle,sources does not appear in the context as a full token, then the model is trained to predict it as a sequence of subtokens (two "targets", plus an additional EOS). Each subtoken gets its score from the copy-subtoken pointer and its (possible) vocabulary score.

Beam Search

For simplicity, let's assume that we're doing beam search with a beam size of 1, but the same idea holds for a beam of any size.

During beam search, the model computes the score for each of "idle", "sources", and "idle,sources".
Each of them gets a score from two sources: vocabulary score + copy score.
"idle" and "sources" get their copy score from the sub-token copy pointer, while "idle,sources" get its copy score from the full-token copy pointer.

Then, we just take the argmax (or top-k if the beam is greater than 1). If the argmax is "idle,sources", beam search can consider this token as "done", without needing to predict EOS.
If the argmax is "idle" or "sources", the model continues predicting subtokens until reaching an EOS.
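
A rough sketch of this decision, with beam size 1 (the three score functions below are placeholders for the vocabulary distribution and the two copy pointers, not real APIs):

```python
def decode_token(candidates, vocab_score, copy_subtoken_score, copy_full_token_score,
                 max_subtokens=5):
    """Greedy (beam size 1) sketch of decoding a single token.

    `candidates` maps each candidate string to True if it is a full token
    that can be copied from the context, and False if it is a subtoken
    (or EOS). Every candidate's score is its vocabulary score plus the
    copy score from the appropriate pointer; in the real model these are
    recomputed at every step, conditioned on the prefix predicted so far.
    """
    prefix = []
    for _ in range(max_subtokens):
        def total(cand):
            copy = (copy_full_token_score if candidates[cand]
                    else copy_subtoken_score)(cand, prefix)
            return vocab_score(cand, prefix) + copy

        best = max(candidates, key=total)
        if candidates[best]:            # e.g. "idle,sources": a full-token copy,
            return best                 # so the token is complete, no EOS predicted
        if best == "EOS":               # end of a subtoken sequence
            break
        prefix.append(best)             # e.g. "idle", then "sources", ...
    return ",".join(prefix)
```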

Does that make sense?
Feel free to ask any question.
Uri

Zadagu (Author) commented May 23, 2021

Thank you Uri. This makes sense.

Zadagu (Author) commented May 24, 2021

Hi @urialon,
Thank you for giving me the opportunity to ask questions. :)

I'm wondering: what are the dimensions of \tilde{h}?
I tried to infer them from the expression E^{type} * \tilde{h}, but that conflicts with s_{gen}(w) = E^{subtoken} * \tilde{h}, because E^{subtoken} has a different width than E^{type}.

urialon (Contributor) commented May 24, 2021

Hi @Zadagu ,
I agree that this is not clear from the paper.
The dimensions of \tilde{h}, E^{type}, and E^{subtoken} are all equal (they must be, as you noticed); in our experiments they were all 512.

There is a minor inaccuracy in the paper, in the function e(n_i) (page 4, bottom left):

If n_i is a subtoken, its embedding is also concatenated with the subtoken's child id, where subtokens are numbered according to their order within the token. For example, fileStatus consists of two subtokens, ['file', 'status'], with child ids [1, 2].
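
In code, the corrected e(n_i) for a subtoken node could look roughly like the following sketch (the module names and the child-id embedding size are assumptions for illustration, not the exact SLM configuration):

```python
import torch
import torch.nn as nn

SUBTOKEN_DIM, CHILD_ID_DIM = 512, 32   # assumed sizes, for illustration only

subtoken_embedding = nn.Embedding(1000, SUBTOKEN_DIM)   # E^{subtoken}
child_id_embedding = nn.Embedding(40, CHILD_ID_DIM)     # position of the subtoken in its token

def e_subtoken(subtoken_id, child_id):
    """e(n_i) for a subtoken node: the subtoken embedding concatenated with
    an embedding of its child id. For fileStatus -> ['file', 'status'],
    the child ids are [1, 2]."""
    return torch.cat([subtoken_embedding(subtoken_id),
                      child_id_embedding(child_id)], dim=-1)

# e.g. the subtoken 'status' (some id, say 7) at position 2 within its token:
vec = e_subtoken(torch.tensor(7), torch.tensor(2))  # shape: (544,)
```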

I hope it helps :-)
Uri

Zadagu (Author) commented Jun 10, 2021

Hi @urialon,

thank you for the clarification.
In the meantime I tried to fix my reimplementation according to your comments.
But it ended up with far fewer parameters (9.1M) than yours (15M).

Now I wonder what I might have forgotten.
My model consists of the following (see the rough tally after the list):

  • an embedding matrix for subtokens, 1000 x 512
  • an embedding matrix for node types, 225 x 512
  • an embedding matrix for child ids, 40 x 32
  • a 2-layer LSTM with an input size of 534 and a width of 256
  • a 4-layer Transformer with 8 heads and an input size of 256
  • Ci with a size of 9 x 256 x 256
  • Wr with a size of 256 x 256
  • Wg with a size of 512 x 512
  • for copying, 5 matrices Wc, each of size 256 x 512 (1 for tokens, 4 for subtokens)
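
For reference, here is how I sanity-check the count per component (a sketch; the Transformer feed-forward width of 1024 is an assumption, and Ci plus the five Wc matrices are plain tensors counted by hand in the comment):

```python
import torch.nn as nn

def n_params(module):
    """Number of trainable parameters of a module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

# Sizes copied from the list above. Ci (9 x 256 x 256 = 589,824) and the five
# Wc matrices (5 x 256 x 512 = 655,360) are not itemized here.
components = {
    "subtoken embedding":  nn.Embedding(1000, 512),
    "node type embedding": nn.Embedding(225, 512),
    "child id embedding":  nn.Embedding(40, 32),
    "path LSTM":           nn.LSTM(input_size=534, hidden_size=256, num_layers=2),
    "transformer":         nn.TransformerEncoder(
                               nn.TransformerEncoderLayer(d_model=256, nhead=8,
                                                          dim_feedforward=1024),
                               num_layers=4),
    "Wg":                  nn.Linear(512, 512),
    "Wr":                  nn.Linear(256, 256),
}

for name, module in components.items():
    print(f"{name:20s} {n_params(module):>12,}")
print(f"{'total':20s} {sum(n_params(m) for m in components.values()):>12,}")
```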

For most of the weight matrices I'm already using an FC layer with bias.
Could you please tell me, where I'm wrong?

Are you using a Positional Embedding like in "Attention is all you need" for the child ids? Or is this just another randomly initialized matrix?

Kind regards,
Zadagu

urialon (Contributor) commented Jun 17, 2021

Hi @Zadagu ,
Sorry for the late reply.

I'm not sure why I did not mention this in the paper, but I think the source of the discrepancy is as follows:

After the LSTM layers, I projected the states to a dimension of 512 using an FC layer (+bias).
The projection matrix is of size 256x512, and there are separate projection matrices for the "standard" paths and for the "root path".

I think I did that to help the model distinguish between the "standard" paths (the blue paths in Figure 2) and the "root path" (the orange path in Figure 2).
I probably just meant to distinguish between them, and did not notice that I was increasing their dimensions.

So this adds some weights and increases the size of other tensors. For example, Ci must be 9x512x512, which is an increase of 1.5M parameters.
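
A sketch of that projection (dimensions as described above; the names are mine, not from the released code):

```python
import torch.nn as nn

LSTM_DIM, PROJECTED_DIM = 256, 512

# Separate projections so the model can tell the two kinds of paths apart:
project_standard_path = nn.Linear(LSTM_DIM, PROJECTED_DIM, bias=True)  # blue paths in Figure 2
project_root_path     = nn.Linear(LSTM_DIM, PROJECTED_DIM, bias=True)  # orange root path

def encode_path(lstm_final_state, is_root_path):
    """Project a path's final LSTM state from 256-d to 512-d."""
    proj = project_root_path if is_root_path else project_standard_path
    return proj(lstm_final_state)

# Everything downstream then works in 512-d, which is why Ci grows to 9 x 512 x 512.
```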

Sorry for this inconsistency, let me know how many parameters you are getting now.

Zadagu (Author) commented Jun 17, 2021

Hi @urialon,

I'm very grateful for your reply. Thank you very much for looking into this.

After adding the projection layer, I got a total of 21.5M parameters.
Are you using the projected path embeddings everywhere (e.g. in the copy mechanism) or just in the Transformer?
That could explain the difference, because I use the bigger size everywhere.

Kind regards,
Zadagu

urialon (Contributor) commented Jun 18, 2021 via email

Zadagu (Author) commented Jun 18, 2021

Hi @urialon,

We are getting closer. Now I get 17.2M parameters. 🤓
But I'm running out of ideas as to where the difference comes from.

Kind regards,
Zadagu

urialon (Contributor) commented Jun 23, 2021

Hi Zadagu,

> Ci with a size of 9 x 256 x 256

In my implementation, the first dimension is 8 instead of 9 (that is: 8 x 512 x 512).
How close are we now?
Uri
