The exchange (inflected forms) field has many critical problems #107

Open
a7785292 opened this issue Jul 4, 2023 · 2 comments
Comments

@a7785292

a7785292 commented Jul 4, 2023

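Nouns in particular are basically unusable. For example:
women - 0:woman/1:s/s:womens/3:womens (what on earth is "womens"? A plural of a plural?)
children - 0:child/1:s/s:childrens (this is already plural; "children" almost never takes another s)
sheep - s:sheeps (the singular and plural forms are identical)
lives - 0:life/1:s (this clearly has two lemmas, life and live, so it should be 1:3s)
.......................

This is not the result of rigorous testing. I found all of these with a few casual searches, and I do mean casual. The phonetic transcriptions and definitions have plenty of problems too; it is painful to look at.

A reasonable guess is that careful testing would turn up several thousand errors.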
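
The entries quoted above can be checked mechanically. The following is a minimal sketch, not part of the project: it assumes, based only on the examples in this comment, that an exchange string is a list of `key:value` items separated by `/`, that key `0` holds the lemma, key `1` says which inflection of the lemma the word is, and key `s` holds a plural form. The function names `parse_exchange` and `plural_looks_suspicious` are hypothetical.

```python
def parse_exchange(exchange):
    """Split an exchange string such as "0:woman/1:s/s:womens" into a dict."""
    result = {}
    for item in exchange.split("/"):
        key, sep, value = item.partition(":")
        if not sep:
            continue  # skip empty or malformed items that have no colon
        result[key] = value
    return result


def plural_looks_suspicious(exchange):
    """Return True when a word marked as a plural of its lemma
    (assumed: "s" appears in field "1") also lists a plural of its own
    (assumed: field "s" is present)."""
    fields = parse_exchange(exchange)
    is_plural_of_lemma = "s" in fields.get("1", "")
    return is_plural_of_lemma and "s" in fields


# The four entries quoted in the comment above.
samples = {
    "women": "0:woman/1:s/s:womens/3:womens",
    "children": "0:child/1:s/s:childrens",
    "sheep": "s:sheeps",
    "lives": "0:life/1:s",
}
flagged = [word for word, exch in samples.items() if plural_looks_suspicious(exch)]
print(flagged)
```

This heuristic only catches the "plural of a plural" pattern (women, children). Cases such as sheep or the missing second lemma for lives would need a separate reference list to detect.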

@simsilver

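I checked, and lemma.en.txt does contain womens. That entry comes from the BNC, and a search confirms that it really does occur there.
[image]

Judging from the section "The History of the Possessive Apostrophe" in this article, it is most likely a misuse of women's.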

@skywind3000
Owner

skywind3000 commented Nov 24, 2023

lemma.en.txt includes the frequency information derived from BNC statistics:

woman/60142 -> women
women/43 -> womens

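Across the corpus, the women -> womens mapping does occur 43 times. I did consider whether womens should be kept, and in the end I chose to preserve the BNC mappings faithfully, because:

1) Some texts in the corpus really do use womens. Keeping this information at least lets you look up women from womens, right?
2) I also provide the frequency: the 43 in "women/43" is the number of occurrences. With that count you can easily filter out noise below a threshold of your choosing. For example, you could ignore mappings with a count under 500 and treat them as incorrect usage.

I have kept the most complete corpus information for you and left the choice to you.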
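
The filtering and reverse lookup described above could be sketched as follows. This is an illustration under stated assumptions, not the project's own code: it assumes every data line has the shape `lemma/count -> forms` shown in the two sample lines, that multiple forms (if any) are comma-separated, and that any line not matching that shape can be skipped. The names `parse_lemma_lines` and `build_reverse_index` are hypothetical, and the threshold of 500 is simply the figure suggested in the comment.

```python
import re
from collections import defaultdict

# Matches lines such as "women/43 -> womens".
LINE_RE = re.compile(r"^(?P<lemma>[^/\s]+)/(?P<count>\d+)\s*->\s*(?P<forms>.+)$")


def parse_lemma_lines(lines):
    """Yield (lemma, count, [forms]) for every line that matches LINE_RE."""
    for raw in lines:
        match = LINE_RE.match(raw.strip())
        if match is None:
            continue  # skip blank lines and anything in an unrecognized shape
        # Assumption: several forms on one line are separated by commas.
        forms = [f.strip() for f in match.group("forms").split(",") if f.strip()]
        yield match.group("lemma"), int(match.group("count")), forms


def build_reverse_index(lines, min_count=0):
    """Map each inflected form to the set of lemmas it is listed under,
    ignoring entries whose count is below min_count."""
    index = defaultdict(set)
    for lemma, count, forms in parse_lemma_lines(lines):
        if count < min_count:
            continue
        for form in forms:
            index[form].add(lemma)
    return dict(index)


sample = ["woman/60142 -> women", "women/43 -> womens"]

everything = build_reverse_index(sample)               # keeps womens -> women
filtered = build_reverse_index(sample, min_count=500)  # drops the count-43 entry

print(everything)
print(filtered)
```

With no threshold, womens can still be traced back to women; with `min_count=500` the low-count mapping disappears and only woman -> women remains.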

Another way to obtain word-form variations is to use this library:
https://www.nodebox.net/code/index.php/Linguistics

Its data is quite limited, though.
