Heading in Chinese #110
Comments
This is indeed problematic when you have Chinese paragraphs with capital English letters in them. Good catch! Using
I ran into the same problem, so I rewrote the document() function in plaintext.py to deal with it.
@seven-linglx Can you share your solution with us? Could you post the code snippet here?
Of course, I am happy to share it. This is my rewritten function:

    def document(self):
        current_paragraph = []
        paragraphs = []
        for line in self._text.splitlines():
            line = line.strip()
            if line:
                # Every non-empty line is treated as ordinary text; no heading detection.
                current_paragraph.append(line)
            else:
                # A blank line closes the current paragraph.
                sentences = self._to_sentences(current_paragraph)
                paragraphs.append(Paragraph(sentences))
                current_paragraph = []
        # Flush whatever is left as the last paragraph.
        sentences = self._to_sentences(current_paragraph)
        paragraphs.append(Paragraph(sentences))
        print(paragraphs)  # preview
        return ObjectDocumentModel(paragraphs)

In fact, this may not really solve the problem, because it simply ignores headings in the document. In other words, it suits the case where the document has no headings, or where you don't mind the program treating headings as ordinary text.

This is the test document: 技术有限公司在IT泡沫之前是一间籍籍无名的公司。但从IT泡沫之后该公司以中国为据点急速成长,快速吸引各界注目,市场不局限于发展中国家。

This is the result: 1987年9月,以“民间科技企业”身份获深圳市工商局批准获得注册,注册资本2.1万元,员工14人,主要业务为代理中资控股的香港康力投资有限公司的HAX小型模拟交换机。
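The grouping logic of the function above can be tried without sumy. The following is a minimal standalone sketch, not sumy's code: `split_paragraphs` is a hypothetical helper that returns lists of lines instead of `Paragraph` objects, and, unlike the function above, it skips the empty paragraphs that consecutive blank lines would otherwise produce. The sample text is made up for illustration.

```python
def split_paragraphs(text):
    """Group non-empty lines into paragraphs separated by blank lines.

    Hypothetical helper for illustration; every line is treated as plain
    text, so an all-uppercase line is not singled out as a heading.
    """
    paragraphs = []
    current = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            current.append(line)
        elif current:
            # A blank line closes the paragraph; repeated blank lines are skipped.
            paragraphs.append(current)
            current = []
    if current:
        # Flush the last paragraph if the text does not end with a blank line.
        paragraphs.append(current)
    return paragraphs


sample = "INTRODUCTION\n这是一家IT公司。\n\n\n第二段。"
print(split_paragraphs(sample))
```

Returning plain lists keeps the sketch independent of sumy's model classes; in the real parser each group would still go through `self._to_sentences(...)`.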
Thank you all. I think this is trickier than it looks. I tried to find a solution, but it seems I would have to introduce a new parser. I would like to ask you a few things, because I know nothing about Chinese text.
Thanks in advance, and sorry for the really late reply. Have a nice day 🌞
I agree with you that PlaintextParser should handle truly plain text. But you could provide an optional API in PlaintextParser that lets the user specify the headings of the plain text, instead of having PlaintextParser detect them, because text in an ideal format is hard to come by. Leaving this decision to the user is better than introducing a new parser. About your questions:
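One way such an optional, user-controlled heading hook could look is sketched below. This is an assumption for illustration only and not part of sumy's API: `label_lines` and `has_only_uppercase_letters` are hypothetical names, and the sketch labels lines instead of building sumy's document model.

```python
from typing import Callable, List, Optional, Tuple


def has_only_uppercase_letters(line: str) -> bool:
    """Example predicate: True only if every alphabetic character is uppercase.

    Chinese characters are alphabetic but not uppercase, so a Chinese
    sentence that contains "IT" is not reported as a heading.
    """
    letters = [ch for ch in line if ch.isalpha()]
    return bool(letters) and all(ch.isupper() for ch in letters)


def label_lines(
    text: str,
    is_heading: Optional[Callable[[str], bool]] = None,
) -> List[Tuple[str, bool]]:
    """Return (line, is_heading) pairs for every non-empty line.

    With no predicate supplied, nothing is treated as a heading, so the
    caller decides whether and how headings are detected.
    """
    predicate = is_heading if is_heading is not None else (lambda line: False)
    labelled = []
    for raw in text.splitlines():
        line = raw.strip()
        if line:
            labelled.append((line, predicate(line)))
    return labelled


text = "INTRODUCTION\n这是一家IT公司。"
print(label_lines(text))
print(label_lines(text, has_only_uppercase_letters))
```

Passing the predicate in keeps the parser itself free of language-specific rules, which is the trade-off argued for in the comment above.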
I found that sumy distinguishes headings from other sentences, so I checked the source code and found the following:
Whether a line is a heading is decided by the str.isupper() function.
But for a str made up of Chinese characters, isupper() returns True as soon as the string contains uppercase Latin letters (and no lowercase ones), even though it is just a normal sentence rather than a heading.
For example:
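A minimal sketch that reproduces this behavior with Python's built-in `str.isupper()`; the sample strings are illustrative assumptions, not the reporter's original example:

```python
# str.isupper() only looks at cased characters. Chinese characters have no
# case, so they are ignored and the uppercase "IT" alone decides the result.
samples = [
    "INTRODUCTION",       # a genuine all-caps heading
    "这是一家IT公司。",     # Chinese sentence containing uppercase letters
    "这是一家公司。",       # Chinese sentence with no cased characters
    "这是一家It公司。",     # mixed case, so not all cased characters are upper
]

results = {s: s.isupper() for s in samples}
for sentence, flagged in results.items():
    print(f"{sentence!r}: isupper() -> {flagged}")
```

The second sample is the problematic case: it is an ordinary sentence, yet `isupper()` returns `True` for it, exactly as it does for the real heading.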