Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xmldocument::save uses non-transformed encoding parameter #617

Closed
aral-matrix opened this issue Jun 19, 2024 · 4 comments
Closed

xmldocument::save uses non-transformed encoding parameter #617

aral-matrix opened this issue Jun 19, 2024 · 4 comments
Labels

Comments

@aral-matrix
Copy link

In pugixml.cpp / xml_document::save, the impl::xml_buffered_writer constructor invoked for buffered_writer uses get_write_encoding on parameter encoding, which can transform xml_encoding::encoding_auto into a specific encoding (e.g. encoding_utf8), stored in buffered_writer.encoding.

Accordingly, buffered_writer.encoding should be evaluated, while currently only the unmodified function parameter encoding is checked.

Pull request #616 xmldocument::save - use encoding as interpreted by get_write_encoding… fixes this.

The same pull request also adds explicit UTF-8 encoding to the target file, but because this can change how an empty document looks, this requires a change to the tests/test_document.cpp, which I don't know how to make because the tests are hard to read due to macros.

@zeux
Copy link
Owner

zeux commented Jun 19, 2024

Accordingly, buffered_writer.encoding should be evaluated, while currently only the unmodified function parameter encoding is checked.

What user visible consequences does this have without the change to add utf8?

Why should the default behavior be to emit encoding=utf-8 for every document? That’s already what the XML specification says should be the default.

@aral-matrix
Copy link
Author

What user visible consequences does this have without the change to add utf8?

As I said in the pull request (forgot to mention it explicitly here): it's a minor bugfix - in the current implementation, I don't see a problem - however, if get_write_encoding ever gets modified to interpret e.g. encoding_auto as something that is specifically checked in xml_document::save to emit an encoding property, the check would fail / the buffered_writer would write a different encoding than the document header would specify.

That's why it's a bug without consequences in the current implementation, but it's still a bug as far as I can tell.

Why should the default behavior be to emit encoding=utf-8 for every document? That’s already what the XML specification says should be the default.

Well I suppose should is not the right word - it's just a matter of preference in my case - when working with XLSX documents created in e.g. LibreOffice, the encoding=utf-8 gets stripped when saving a modified file with pugixml.

I just thought that this initial xml tag has basically zero impact on performance, so explicit encoding is better than implicit. It's your project - leave that change out if you don't like it :)

@zeux
Copy link
Owner

zeux commented Jun 20, 2024

Ok - I don't know that it matters either way, but feel free to submit a PR that changes the encoding check to buffered_writer.encoding.

I do not think the default output should be changed. Stylistic preferences can be accommodated using a documented method in this case: https://pugixml.org/docs/manual.html#saving.declaration - and the default should be the minimal output that results in unambiguous decoding result, which is why there's a special case for latin1 but not for any other encoding - every other encoding can/should use BOM or default to UTF8 without any extras.

I'm going to close this either way, feel free to change the PR to only change the encoding if you need that change.

@zeux zeux closed this as completed Jun 20, 2024
@zeux zeux added question and removed bug labels Jun 20, 2024
@aral-matrix
Copy link
Author

aral-matrix commented Jun 21, 2024

Actually, I am not entirely sure how to patch this -
I just pushed (to my fork) a change as you suggest, and this change also breaks the checks - which I do not understand, because all I changed is two checks on encoding to check buffered_writer.encoding instead, and with line no. 3630 in get_write_encoding:
if (encoding != encoding_auto) return encoding;
this should be exactly the desired behavior?

Before I send you a pull request, I was hoping you could help me understand why my forked version breaks these validations:

Test document_save_file_wide failed: doc contents does not match L"<?xml version=\"1.0\"?><node/>" at tests/test_document.cpp:776
Test document_save_file failed: doc contents does not match L"<?xml version=\"1.0\"?><node/>" at tests/test_document.cpp:762
FAILURE: 2 out of 964 tests failed.

The "Test document_save_file_wide" failed makes be wonder whether line 3621 is the one that triggers a differing behavior under test with my path. Is there maybe an actual bug here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants