Sentences being joined that shouldn't be #3

Open
kpu opened this issue Aug 20, 2019 · 3 comments

Comments

kpu commented Aug 20, 2019

Faheem reports that this appeared in se-en: [Name of person] Umu.se makes use of cookies to improve the user experience.By continuing to use the website you agree to the usage of cookies.
There's no space in "experience.By", but a space does appear in the source document, including in the WARC we crawled from umu.se.

kpu commented Aug 20, 2019

Tagging @kirefu (who can't be assigned to this issue) to do the initial investigation.

kirefu commented Aug 20, 2019

The problem lies in bleualign_cpp, though I'm not sure where, as I don't speak C++. I'm trying to find the bug, but it could go quicker if @lpla also took a look.

To run an example of the problem on valhalla:

cd /fs/meili0/faheem/postprocess/sv-en
../../bitextor.malign/bleualign-cpp/bleualign_cpp --text1 mona_sv.xz --text2 mona_en.xz --text1translated mona_sv.trans.xz --matches mona.matches --doc-threshold 0.1 --bleu-threshold 0.2 --output-dir blue/

kpu commented Aug 21, 2019

In ec8f4f7 I've updated bleualign to put a space between consecutive sentences found by the gap filler. However, we still have a policy question: should consecutive sentences found by the gap filler go onto one line‽
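
For illustration only, here is a minimal sketch of the kind of fix described above: joining consecutive sentences with an explicit separator rather than concatenating them directly. This is not the code from ec8f4f7; the function name `JoinSentences` and its parameters are hypothetical and do not come from bleualign_cpp.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (an assumption, not part of bleualign_cpp):
// joins the sentences of one gap-filled block with an explicit separator,
// so "experience." and "By continuing..." no longer run together.
std::string JoinSentences(const std::vector<std::string>& sentences,
                          const std::string& separator = " ") {
  std::string joined;
  for (const std::string& sentence : sentences) {
    if (sentence.empty()) continue;             // skip empties so separators never double up
    if (!joined.empty()) joined += separator;   // separator only between sentences
    joined += sentence;
  }
  return joined;
}
```

Called with the default separator, this keeps the sentences on one line with a space between them. Passing `"\n"` instead would correspond to the other side of the policy question, one sentence per line.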

@lpla lpla transferred this issue from bitextor/bitextor Feb 19, 2020