Text containing Arabic numerals + uppercase letters sometimes produces an empty summary #215

Open
bennyji opened this issue Jun 15, 2024 · 2 comments
@bennyji
bennyji commented Jun 15, 2024

When summarizing Chinese text, some inputs that mix Arabic numerals with uppercase letters cause the summary to come back empty.
See https://zhuanlan.zhihu.com/p/703598986 for details.

@miso-belica
Owner

Hello, can you please provide the settings of the summarizer you use? Language, stemmer, tokenizer, ... anything needed to reproduce this. Thank you.

@bennyji
Author

bennyji commented Jun 19, 2024

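Taking the service hosted on Hugging Face as an example, you can reproduce this bug at https://huggingface.co/spaces/issam9/sumy_space.
The settings are:
method: LSA
language: chinese
sentence count: 2
input type: text
The following two texts cannot be summarized; there is no output: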

数字经济加快发展,5G用户普及率超过50%。强化农业发展支持政策,有力开展抗灾夺丰收,乡村振兴扎实推进。--强化生态环境保护治理,加快发展方式绿色转型。深入推进美丽中国建设。持续打好蓝天、碧水、净土保卫战。制定支持绿色低碳产业发展政策。

3D技术是一种以这种三维空间为基础的数字技术。广泛应用于多个领域。包括但不限于娱乐、设计、制造、教育和医疗等。使用场景例如:3D成像与显示、3D打印、3D建模与渲染、3D扫描、3D交互技术等。

However, if 5G in the first text is changed to 5g, summarization succeeds.
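For anyone who wants to try this outside the Hugging Face page, here is a minimal local sketch of the same settings (LSA, chinese, sentence count 2). It assumes `sumy` and `jieba` are installed. How the Space builds its pipeline is not known, so the stop-word setup and the default stemmer below are assumptions, and the helper names `summarize` and `main` are made up for illustration. No expected output is shown because the result has not been verified locally.

```python
# Reproduction sketch for the report above.
# Assumption: `pip install sumy jieba`; the Space's real pipeline may differ.

# The two inputs from the report, copied verbatim.
TEXT_5G = (
    "数字经济加快发展,5G用户普及率超过50%。强化农业发展支持政策,有力开展抗灾夺丰收,"
    "乡村振兴扎实推进。--强化生态环境保护治理,加快发展方式绿色转型。深入推进美丽中国建设。"
    "持续打好蓝天、碧水、净土保卫战。制定支持绿色低碳产业发展政策。"
)
TEXT_3D = (
    "3D技术是一种以这种三维空间为基础的数字技术。广泛应用于多个领域。"
    "包括但不限于娱乐、设计、制造、教育和医疗等。"
    "使用场景例如:3D成像与显示、3D打印、3D建模与渲染、3D扫描、3D交互技术等。"
)
# Variant reported to work: "5G" lowercased to "5g".
TEXT_5G_LOWER = TEXT_5G.replace("5G", "5g")


def summarize(text, language="chinese", sentence_count=2):
    """Return the LSA summary as a list of sentence strings."""
    # Imported here so the constants above can be used without sumy installed.
    from sumy.nlp.tokenizers import Tokenizer
    from sumy.parsers.plaintext import PlaintextParser
    from sumy.summarizers.lsa import LsaSummarizer
    from sumy.utils import get_stop_words

    parser = PlaintextParser.from_string(text, Tokenizer(language))
    summarizer = LsaSummarizer()  # default stemmer; an assumption
    summarizer.stop_words = get_stop_words(language)  # an assumption
    return [str(sentence) for sentence in summarizer(parser.document, sentence_count)]


def main():
    """Print the summary (possibly empty) for each input."""
    cases = [("5G", TEXT_5G), ("5g", TEXT_5G_LOWER), ("3D", TEXT_3D)]
    for label, text in cases:
        print(label, summarize(text))
```

Calling `main()` prints one list per input; an empty list for the "5G" or "3D" case alongside a non-empty one for "5g" would match the behavior described above.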
