Itstool experimental branch #303

Open
wants to merge 19 commits into
base: stable

jralls
Member

@jralls jralls commented Mar 27, 2023

Recreate #120

@gjanssens original description:

This branch is meant to experiment with an alternative workflow for documentation authoring and translation. In this workflow, the independent per-language documents are replaced with a single master document, and translation happens using gettext and the ITS extensions.

The single master document will serve for all languages as follows:

- Sections with content that is valid for all languages will be written in English.
- Using the proper tooling, the translatable messages will be extracted into a message catalog (pot file), which can be translated for each language we support.
- Sections that are relevant only for a single language will also be added to the master document, but marked with special ITS tags. As a result, messages from these sections will not appear in the catalog files.
- ITS also supports sections that are valid for some but not all languages, and sections that are not relevant for some languages.
This branch is working with a reduced sample book (many chapters removed) for simplicity.

It has converted this subset into a master document and has created a po file for the German language.

Note that creating this initial po file is fairly complicated (the pot file, on the other hand, is easy). I have written a script that extracts msgids and the corresponding msgstrs from the original and translated files. The only caveat is that the order in which the xml tags appear in both documents has to match exactly. If it doesn't, the script will fail. So to use the script, any tag mismatches must first be corrected manually (using a smart diff tool to find the differences does help, but it's still a huge one-time manual effort).
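
To illustrate why the tag order has to match, here is a minimal sketch of such a pairing step. It is not the script from this branch: the function names `normalize` and `pair_messages` are hypothetical, and it pairs only the leading text of each element, whereas itstool works on larger translatable units that may include inline markup.

```python
import xml.etree.ElementTree as ET


def normalize(text):
    """Collapse runs of whitespace; only the tag order is significant."""
    return " ".join((text or "").split())


def pair_messages(original_xml, translated_xml):
    """Pair element texts from two documents that share the same tag order.

    Returns a list of (msgid, msgstr) tuples. Raises ValueError at the
    first position where the two tag sequences differ.
    """
    orig = list(ET.fromstring(original_xml).iter())
    trans = list(ET.fromstring(translated_xml).iter())
    pairs = []
    for position, (o, t) in enumerate(zip(orig, trans)):
        if o.tag != t.tag:
            raise ValueError(
                f"tag mismatch at element {position}: <{o.tag}> vs <{t.tag}>"
            )
        msgid = normalize(o.text)
        if msgid:  # skip elements without text of their own
            pairs.append((msgid, normalize(t.text)))
    if len(orig) != len(trans):
        raise ValueError("documents have a different number of elements")
    return pairs
```

A single swapped or missing tag makes every later pair unreliable, which is why the sketch stops at the first mismatch instead of trying to resynchronize.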

Lastly, one thing gettext with ITS won't solve for us: the untranslated parts of an incomplete translation will appear as English text, rather than being left out as they were in the original workflow. I'll admit that I personally prefer complete documentation that may be partially untranslated to incomplete documentation in which what is there is fully translated. Including untranslated content may also better expose the remaining work to potential new translators. We can investigate how tooling or scripting can improve the experience, though.
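
The fallback described above can be sketched in a few lines. This is an illustration only: the catalog is modelled as a plain dict from msgid to msgstr, and `translate` and `untranslated` are hypothetical helper names, not part of gettext or of this branch.

```python
def translate(msgid, catalog):
    """Return the translation, or the English msgid when none exists."""
    msgstr = catalog.get(msgid, "")
    return msgstr if msgstr else msgid


def untranslated(msgids, catalog):
    """List the msgids that would still be shown in English."""
    return [m for m in msgids if not catalog.get(m)]
```

A helper like `untranslated` is one way tooling could surface the remaining work for translators.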

Note also that the last commits are there to simulate large documentation changes and evaluate how this impacts translatability of the master document and the language specific po files in particular.

sunfish62 and others added 19 commits September 12, 2018 20:44
… this commit required additional content be created to accommodate links, resulting in the addition of ch_configuring.xml and ch_importing.xml to the documentation source.
…te files. xmllint yields:

ch_basics.xml:1520: parser error : Premature end of data in tag chapter line 18

^
gnucash-guide.xml:497: parser error : Failure to process entity chapter2
&chapter2;
          ^
gnucash-guide.xml:497: parser error : Entity 'chapter2' not defined
&chapter2;
          ^
Note this commit re-adds the custom HTML entity definitions.
This is needed for itstool 2.0.2 to be able to parse the files.
Due to a bug, it won't properly parse the DTD.
…into itstool

With some adjustments to get the branch up to speed with the XInclude changes
The xml files I have kept are those needed to merge in sunfish62's PR.
That should give enough context to evaluate how pot based translations
work when moving texts from one book to another.
This is the first step to extract translations into a po file.
When the tag structure of the document files is exactly the same
across languages, the extraction of translatable strings and
translated messages can be done in a semi-automated way.
Note some sections exist only in one language. ITS has a mechanism to
handle this, but that can only be used *after* the documents have
been converted to po based translations. So in the interim missing
sections are added with markers that can later be searched for. I am
currently using these markers:
* LANG-DE: will be found in the C documents to indicate this section
  only exists in the German translation, not in the English original
* UNTRANSLATED-DE: used in the German translation to indicate this
  section did not exist in the translation, but does in the English
  original.
* FUZZY-DE: used in the German translation, indicating the context
  of this section has changed and needs review.

Sections marked with LANG-DE or UNTRANSLATED-DE are either candidates
for a special ITS marker (if the section is only relevant for one
language) or for future translation.
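
As a rough aid for finding these markers later, a small scanner along the following lines could be used. It is a sketch under assumptions: the markers are taken to appear as plain tokens somewhere in the file text, the pattern is generalized to any two-letter language code (the branch only uses DE so far), and `count_markers` is a hypothetical name.

```python
import re
from collections import Counter

# LANG-xx, UNTRANSLATED-xx and FUZZY-xx, where xx is a two-letter code.
MARKER_RE = re.compile(r"\b(LANG|UNTRANSLATED|FUZZY)-([A-Z]{2})\b")


def count_markers(text):
    """Count each marker (for example 'FUZZY-DE') found in the text."""
    return Counter(
        f"{m.group(1)}-{m.group(2)}" for m in MARKER_RE.finditer(text)
    )
```

Running it over each xml file gives a quick per-file overview of the sections that still need a decision.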

Note also that whitespace differences are not important. ITS cleans up
whitespace before extracting the strings. The sequence of the tags in
a file is what matters.
…r conversion to po

At the top level, these global targets have been defined:
- gnucash-docs-de-english.fpot, gnucash-docs-de-native.fpot,
  gnucash-docs-it-english.fpot, gnucash-docs-it-native.fpot,
  gnucash-docs-ja-english.fpot, gnucash-docs-ja-native.fpot,
  gnucash-docs-pt-english.fpot, gnucash-docs-pt-native.fpot,
  gnucash-docs-ru-english.fpot, gnucash-docs-ru-native.fpot:
  These all generate a po template file based on the native language (de, it, ja, pt, ru).
  The files with "native" in their name will have msgids in the native language. The files with
  "english" in their name will have the English msgids that go with the native language. (For example,
  for the de language these fpot files will have msgids for both the guide and the help manual,
  while the fpot files for ru will only consider the guide.)
  Note the 'f' stands for 'fake', as the template files for other languages are not really meant
  as template files: they have msgids in the native language instead of English. But
  this fake template file can later be used to compile a proper po message catalog including that
  language's translations.
- gnucash-docs-de-english.struct, gnucash-docs-de-native.struct,...: these rules will extract
  the msgid order as found in the fpot with the corresponding name and store this order as
  gnucash-docs-<lang>-<english,native>.struct. This roughly maps to the xml tag
  hierarchy of the original xml files. The extracted order can be used to verify whether the msgid order
  in the English pot file (in srcdir/po) is identical to the msgid order in the fpot file for another
  language. This is crucial for a later automatic generation of po files.
  If the msgid order doesn't match exactly, the automatic compilation can't work.
  Note that to fix misalignments, xml nodes may have to be added or combined in the original xml files
  (either the English ones or the translated ones, depending on the mismatch), or extracted strings
  should be harmonized (match capitalization and punctuation within the same language, such that one
  msgid is used the same number of times in both files).
- fpot-de, fpot-pt,...: pseudo targets that will generate the proper gnucash-docs-xy.fpot and .struct for that language
- de.po, it.po,...: will generate a po file in the given language starting from
  gnucash-docs.pot and the fpot file for the given language. Again, if the associated
  .struct files differ, this command will print an error and exit.
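
The check-then-combine step behind the <lang>.po targets can be sketched as follows. This is not the actual make rule or script: each catalog is assumed to be already parsed into a list of (structure key, msgid) tuples, where the structure key stands in for one line of a .struct file, and `struct_diff` and `build_po_entries` are hypothetical names.

```python
import difflib


def struct_diff(english_struct, native_struct):
    """Return unified-diff lines for two structure listings (empty if equal)."""
    return list(
        difflib.unified_diff(
            english_struct,
            native_struct,
            fromfile="english.struct",
            tofile="native.struct",
            lineterm="",
        )
    )


def build_po_entries(english, native):
    """Combine two aligned catalogs into (msgid, msgstr) pairs.

    Both arguments are lists of (structure_key, msgid) tuples. Raises
    ValueError with a diff of the structures if they do not line up.
    """
    diff = struct_diff([key for key, _ in english], [key for key, _ in native])
    if diff:
        raise ValueError("structure mismatch:\n" + "\n".join(diff))
    return [(en, nat) for (_, en), (_, nat) in zip(english, native)]
```

Printing the diff in the error points directly at the first nodes that need to be added, combined, or harmonized.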

Aside from these global targets, there are similar targets per book (guide/C, help/de, ...):
- <entity>.fpot: will generate an fpot file for a single entity. For example ch_accts.fpot
  will generate such a file for ch_accts.xml. It will also generate the associated ch_accts.struct
  which again can be used to compare the same source file in different languages.
- fpots: will run the *.fpot rule for each source file in the current directory.
These additional targets are provided as it's likely easier to start the msgid alignment
on a file-by-file basis. After all files in a language have been lined up with the same files
in the C sources, the global fpot file for that language can be evaluated to make sure the
alignment holds when all files are parsed into one pot file.
Note this likely obscures the rule to extract a <lang>.po file from
two fake pot files. If this happens, the file in src/po/<lang>.po should
temporarily be renamed for the extraction to work.
@jralls jralls changed the title Itstool Itstool experimental branch Mar 27, 2023
3 participants