Problem with recognizing unseen NEs #5
@fliegenpilz357 Yes, I think you are right. The problem may be with the empty abstract_map(s). FYI, the preprocessing scripts for named entities are borrowed from https://github.com/sheng-z/stog. I will try to find some time to investigate this issue. Let me know if you have any new findings. Thank you. UPDATE: This should be related to pre-processing and post-processing. That is, if the abstract_map is empty, the entities are lost in the pre-processing stage, and neither the parser nor the post-processing can do anything about it.
I also think it has something to do with the pre-processing from https://github.com/sheng-z/stog. When predicting the test sentences from LDC, everything seems to work fine; it's just new, arbitrary sentences that don't work well, for some reason. Again, thanks for the quick answer. I will let you know if I find out anything about this.
Hi Deng, I also came across this issue of the abstract_map not getting populated for unseen named entities, while working perfectly fine for those in the LDC dataset. While analyzing the pre-processing code, specifically the text_anonymizor.py file, I observed that the named entities are compared against a pre-built dictionary in the 'text_anonymization_rules.json' file [part of the amr_2.0_utils folder], and the abstract_map only gets populated if a match is found. A work-around would be to regenerate the amr_2.0_utils using the script provided here: sheng-z/stog#3 (comment)
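To make the matching behavior described above concrete, here is a rough sketch of what dictionary-based anonymization does (the function name, placeholder scheme, and rule format here are illustrative, not the actual stog code):

```python
def anonymize(tokens, rules):
    """Replace known named-entity spans with placeholders and record
    the mapping in abstract_map (illustrative sketch, not stog's code)."""
    abstract_map = {}
    out = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest span first so "New York City" beats "New York".
        for j in range(len(tokens), i, -1):
            span = " ".join(tokens[i:j])
            if span in rules:
                placeholder = f"NAMED_ENTITY_{len(abstract_map)}"
                abstract_map[placeholder] = {"span": span, "type": rules[span]}
                out.append(placeholder)
                i = j
                matched = True
                break
        if not matched:
            out.append(tokens[i])  # unseen entity: falls through untouched
            i += 1
    return out, abstract_map

tokens = "President Zemin Jiang visited Hungary".split()
rules = {"Zemin Jiang": "person", "Hungary": "country"}
print(anonymize(tokens, rules))
```

Note how an entity absent from `rules` (an unseen NE) is simply passed through as ordinary tokens and never enters `abstract_map` — which matches the behavior reported in this issue.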
Thanks @mkartik. But that doesn't explain why it works well on the test partition of LDC, does it? Anyway, may I ask whether you have tried this workaround and were able to parse unseen sentences with better quality?
@mkartik Thanks for pointing that out!
@fliegenpilz357, the entries (namely, named entities) in the test partition of LDC are similar to those in the train and dev partitions of the LDC dataset, which, I suppose, is why it works for the test partition. I haven't yet tried the workaround of regenerating the amr_utils, but I will be doing so in the coming days. @jcyk I guess regeneration would not require gold AMRs, so we would be able to regenerate using our own annotated training data. It might also be possible to update the '_replace_span' function in the text_anonymizer.py file to relax the match constraint, but I haven't yet explored that option.
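Relaxing the match constraint could, in spirit, look something like this (purely illustrative; the real `_replace_span` in stog works differently, and the `lookup` helper here is hypothetical):

```python
def normalize(span):
    # Lowercase and collapse runs of whitespace so that minor
    # surface differences no longer prevent a dictionary hit.
    return " ".join(span.lower().split())

def lookup(span, rules):
    # Exact match first, then a relaxed (normalized) match.
    if span in rules:
        return rules[span]
    normalized = {normalize(k): v for k, v in rules.items()}
    return normalized.get(normalize(span))

rules = {"Zemin  Jiang": "person"}
print(lookup("zemin jiang", rules))  # → person
```

Of course, a purely surface-level relaxation still cannot recover entities that are absent from the rules dictionary altogether; it only widens the net for near-misses.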
Thanks a lot @mkartik for this helpful investigation! Yes, there is a significant overlap (unfortunately) between train NEs and test NEs. But I think it still does not explain everything; the overlap is not 100%. For example, I have just performed the following little experiment.
Guofang Shen , The foreign ministry spokesperson , announced at a news conference held this afternoon that President Gentzs Aerpade of the Hungary Republic , would pay a State visit to China from September 14th to the 17th at the invitation of president Zemin Jiang .
Occurrences of "Guofang Shen" in train = 0 (likewise on dev), and likewise with "Gentzs Aerpade".
Expectation: it should not matter (much?) for the parser/preprocessing. Result: it matters a lot. The preprocessing/parser performs perfectly on the original sentences even when they contain entities that are not seen in train or dev. But it performs worse when we change the unseen named entities to other unseen named entities. Example: [EDIT: removed long complicated example, see the simplified example in my next post below]
@fliegenpilz357 hi, I think this example is too complicated. |
Yes, that's what I mean. I am very sorry for the complicated example. I have simplified it a lot now; I hope it is clearer. Again, I'm sorry for the long example above. If I am not mistaken, "Guofang Shen" and "Gentzs Aerpade" are both only in the LDC test set but not in train/dev, so replacing them should not have any (major) effect.
As can be seen, there is already some manipulation of the tokenization (# ::tokens) going on. In the end, this may be related to the empty abstract_map, which again propagates the error into the parser.
@fliegenpilz357 As pointed out by @mkartik, an entity can only be recognized if a match is found, so "Guofang Shen" must be included in the rules. I would suspect that the original authors used the test set when building their utils (text_anonymization_rules.json). In any case, the problem is with the pre-processing and post-processing, which are simply borrowed from the stog repo.
Yes, it is very interesting, and I agree that the bug is not exactly in your parser but in the pre-processing in stog. There are also some similar issues already reported over there, as I see now. Unfortunately, Sheng seems very busy and does not respond (much). So, @jcyk, I appreciate your help and quick answers a lot! I don't expect you to fix this, but if, by any chance, you find out what's causing the pre-processing trouble, it would be awesome to let me know or update this issue.
@fliegenpilz357
As suggested by @mkartik, one possible solution is to regenerate 'text_anonymization_rules.json'. Another possible solution is to modify the preprocessing rules (I guess we had better remove this inconvenient dependency entirely). I do plan to investigate both options, but maybe not in the short term. So please do let me know if you find a good solution. Thanks!
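Regenerating the rules file from one's own data could, in spirit, be as simple as collecting NE spans into a dictionary and dumping them to JSON (a hypothetical sketch; the actual regeneration script linked in the stog issue above does considerably more, and the `(span, type)` pairs here are made up):

```python
import json

def build_rules(ner_spans):
    """Build an anonymization rules dict from (surface_span, entity_type)
    pairs, e.g. produced by running an NER tagger over training sentences.
    The first type seen for a span wins; duplicates are ignored."""
    rules = {}
    for span, etype in ner_spans:
        rules.setdefault(span, etype)
    return rules

spans = [("Zemin Jiang", "person"), ("Hungary", "country"),
         ("Zemin Jiang", "person")]  # duplicate span, kept once
rules = build_rules(spans)
with open("text_anonymization_rules.json", "w") as f:
    json.dump(rules, f, indent=2)
```

Whether this produces rules in the exact format stog expects would need to be checked against the original file; this only illustrates the general idea of regenerating the dictionary from in-domain data.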
Note that the parser is supposed to see anonymized entities only; it cannot handle any specific names. Therefore, if the preprocessing does not catch an entity, the parser can do nothing about it.
Also please note that there is a no-graph-recategorization version of the parser in this repo, though with slightly worse accuracy in the official evaluation. Since it doesn't rely on such hard-coded rules, it will not have this problem (and may even generalize better?).
Best,
Deng
@jcyk I can confirm that it works much better with the non-GR model. Thanks for your quick advice, and again, congratulations on your very awesome work! I actually like the non-GR approach a lot more than the one that depends on stog's super-complicated anonymization, even though it's slightly worse in Smatch. I even suspect it generalizes better than the 80-Smatch model, but it's hard to tell without knowing the exact bug in the stog pre-processing or fully understanding stog's complicated pre-processing pipeline. Hence, I'd still leave this issue open. It might be interesting for other people too, and perhaps someone will find the bug in the pre-processing; then the 80-Smatch model can also be used for unseen sentences/entities.
Hi Deng,
I am now able to parse arbitrary sentences with your pre-trained model, thanks to your valuable tips!
My pipeline looks as follows:
That writes a file ckpt.pt_test_out.pred.post, which I assume is the final result.
However, the quality of parses of arbitrary sentences is not so good. The parser seems to struggle with unseen named entities, and this leads to errors. Named entities that are in the training data are recognized perfectly, but new ones are not. Here is an example of a short sentence:
The parser has apparently struggled with the new named entities. Due to this, the parse also contains many other errors (e.g., two ARG2s of c6). It has not even detected the tsar "Feodor".
Here is an example of a longer sentence, this time from US press.
Again, it has not properly recognized any named entity, and because of this (?) it also made many other errors (like Grand Slam --> "grand manufacture").
A last example, from sports.
Again, none of the named entities were recognized, and the parser has hallucinated new concepts (e.g., "(c12 / company)"). The famous tennis player Djiokovic does not even occur in the parse.
These sentences were just randomly sampled; all my outputs look more or less like this. Do you have any idea where the problem could be? It doesn't seem to be in the post-processing; the NE errors are already present (mostly, as far as I can assess) in the parser output file ckpt.pt_test_out/ckpt.pt_test_out.pred.
Could it be because the # ::abstract_map {} is always empty?
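One quick way to check this hypothesis is to count how many sentences in the pre-processed file carry an empty abstract_map (a small diagnostic sketch; it only assumes the `# ::abstract_map` comment-line format visible in the files above, and the sample file name is made up):

```python
def empty_abstract_maps(path):
    """Count (empty, total) ::abstract_map lines in a pre-processed AMR file."""
    total = empty = 0
    with open(path) as f:
        for line in f:
            if line.startswith("# ::abstract_map"):
                total += 1
                payload = line.split("::abstract_map", 1)[1].strip()
                if payload in ("", "{}"):
                    empty += 1
    return empty, total

# Tiny self-contained demo with one empty and one populated map.
sample = """# ::abstract_map {}
# ::abstract_map {"NE_0": {"span": "Zemin Jiang"}}
"""
with open("sample.amr", "w") as f:
    f.write(sample)
print(empty_abstract_maps("sample.amr"))  # → (1, 2)
```

If the count for your arbitrary-sentence input is close to 100% empty while the LDC test input is mostly populated, that would confirm the pre-processing (not the parser) as the point of failure.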