Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix wchar array encoding in structs #1348

Open
wants to merge 5 commits into
base: dev
Choose a base branch
from

Conversation

Emperor-RE
Copy link

@Emperor-RE Emperor-RE commented May 16, 2023

Checklist

Which kind of PR do you create?

  • This PR only contains minor fixes.
  • This PR contains major feature update.
  • This PR introduces a new function/api for Qiling Framework.

Coding convention?

  • The new code conforms to Qiling Framework naming convention.
  • The imports are arranged properly.
  • Essential comments are added.
  • The reference of the new code is pointed out.

Extra tests?

  • No extra tests are needed for this PR.
  • I have added enough tests for this PR.
  • [] Tests will be added after some discussion and review.

Changelog?

  • This PR doesn't need to update Changelog.
  • Changelog will be updated after some proper review.
  • Changelog has been updated in my PR.

Target branch?

  • The target branch is dev branch.

One last thing


Structs that use the ctypes.c_wchar_array type were being double encoded, causing the bytes to be padded with '\x00\x00\x00' instead of just '\x00'. I added a few lines to struct.py that fix this.
Example code that shows the problem by reading KUSER_SHARED_DATA->NtSystemRoot:

from qiling import *

ql = Qiling(["x86_windows/bin/x86_hello.exe"], "x86_windows")
ql.log.info(f"NtSystemRoot bytes: {ql.mem.read(0x7ffe0030, 0x20).hex()}")
ql.log.info(f"NtSystemRoot wstring: {ql.os.utils.read_wstring(0x7ffe0030)}")

@elicn
Copy link
Member

elicn commented May 17, 2023

Thanks for reporting that and taking the time to write a patch.

As the one who wrote that code, I was a bit surprised I haven't taken this scenario into account and checked that myelf. Your repro script showed it quite well, but testing it on another environment actually worked well. After more digging, it turned out that this is a misbehavior of Python 3.8 where Python 3.9 works fine the way it is.

Is it possible you were testing this using Python 3.8?
If so, we'll need to come up with a workaround approach that could be easily removed when we proceed to a newer Python version.

Python 3.8:
image

Same script executed on Python 3.9 (Win version, but I don't think it matters):
image

A few more experiments showed that ctypes.create_unicode_buffer behaves differently on those versions. I guess this is related.

ubuff = ctypes.create_unicode_buffer("hello", 8)
print(bytes(ubuff).hex(" "))

@elicn elicn self-assigned this May 17, 2023
@Emperor-RE
Copy link
Author

Emperor-RE commented May 17, 2023

Thanks for the quick reaction!
I am running Python version 3.10.8 on MacOS, with that version it produces the results mentioned in my pull request.
It might also be caused by the operating system, could you check if it produces the same results for you on version 3.10 on Windows or Linux?

@Emperor-RE
Copy link
Author

Emperor-RE commented May 17, 2023

As a follow-up, this problem also seems to be ocurring with struct.ref assignments, example of my own project where the same problem occurs (in assigning fdata_obj.cFileName), even after my patch:

fdata_struct = make_win32_find_data(ql.arch.bits, wide=True)
with fdata_struct.ref(ql.mem, pointer) as fdata_obj:
    fdata_obj.cFileName          = filename

I'll await your reply to see if this also needs its own patch or if we need to find another solution with regards to the python versions.

@elicn
Copy link
Member

elicn commented May 17, 2023

I tested that on a Python 3.9 running on top of WSL, and it behaves like 3.8 (that is, use excessive padding) so it is not a matter of versions after all. I beleive that could be a result of a default encoding, where Python over POSIX set it to UTF-8 by default [means that any string you provide it is already implemented as a UTF-8 buffer under the hood, which gets re-padded by ctypes].

My Windows Python encoding is set to cp1252 (Western Europe), so it might be the reason it works OK there.
However, I tried forcing the default encoding on WSL and it did not change anything:

$ python3.9.exe -X utf8=1 :

>>> import locale ; locale.getpreferredencoding()
'UTF-8'
>>> import ctypes ; bytes(ctypes.create_unicode_buffer("hello", 8))
b'h\x00e\x00l\x00l\x00o\x00\x00\x00\x00\x00\x00\x00'
>>>

$ python3.9.exe -X utf8=0:

>>> import locale ; locale.getpreferredencoding()
'cp1252'
>>> import ctypes ; bytes(ctypes.create_unicode_buffer("hello", 8))
b'h\x00e\x00l\x00l\x00o\x00\x00\x00\x00\x00\x00\x00'

So I am a bit confued here.. I am pretty much convienced this is an encoding / locale issue, but I cannot figure out how to determine that in order to work around it.

@Emperor-RE
Copy link
Author

Emperor-RE commented May 17, 2023

Yeah i'm pretty sure it has to do with encoding of arrays. When i tested it with some dummy code that called bytes() seperately for each ctypes.c_wchar, that does result in proper encoding (with proper \x00 padding) Calling bytes() on a ctypes_c_wchar_array object results in double encoding (with \x00\x00\x00 padding).

@Emperor-RE
Copy link
Author

Emperor-RE commented May 17, 2023

I think it is dependent on if your python build is compiled using UCS2 or UCS4, can you check which type you are running?
I tested it in a windows VM which was compiled using UCS2, that had no issue with creating a unicode buffer, whilst a MacOS UCS4 build does produce the problem.
https://stackoverflow.com/questions/1446347/how-to-find-out-if-python-is-compiled-with-ucs-2-or-ucs-4

@elicn
Copy link
Member

elicn commented May 17, 2023

Both show a result of 1114111, but behave differently with regards to the c_wchar type.
When running:

ctypes.sizeof(ctypes.c_wchar)

One returns 4 ("extra padding"), while the other returns 2 ("normal padding"). I do believe this has something to do with the locale, but I can't figure it out.

@Emperor-RE
Copy link
Author

Emperor-RE commented May 17, 2023

I've spent quite some time following this up, but the most logical conclusion i can think of is that ctypes follows the "internal" python representation of the OS, and it is still related to Linux/Mac representing a wchar_t as 4 bytes while windows represents it as 2 bytes.

@elicn
Copy link
Member

elicn commented May 18, 2023

No doubt about that. ctypes follows the hosting system specs for everything else, and this one is not different.
However, I designed Qiling's struct module in a way that one could get the structures layed out according to the emulated system, and not the hosting one. The question now is how to use c_wchar in a way that it can be guaranteed to be UCS2 (as expected in Windows structures) regardless of the host it is emualted on.

@Emperor-RE
Copy link
Author

I just pushed a fix that adds a wrapper class to substitue c_wchar arrays and represent them as an array of c_ubytes under the hood. This should ensure that they are always handled as UCS2 regardless of the host os. I tested it and it passes the struct tests of qiling & it works with the binary i'm emulating 👍

@elicn
Copy link
Member

elicn commented May 21, 2023

Why did you add +2 for null terminator? I think you are confusing it with LPCWSTR or something similar.
Adding those extra bytes will change the layout of the structure, causing all following fields to be unaligned.

@Emperor-RE
Copy link
Author

My bad, that indeed causes misalignment. For my use-case that didn't matter, but i'll change it to the proper length of bytes in a new commit.

I have a different question about some of the other windows struct types, could you elaborate on why you internally represent a lot of the string pointers types as the STRING type in qiling/os/windows/api.py? For example PANSI_STRING = STRING. I'm passing an executable through qiling that uses the LdrGetProcedureAddress API call, but the function hook for that does not work properly because it is trying to use a string instead of a pointer to an ANSI string (Code causing this). I have created a fix for it locally that dereferences the variable and fetches the correct struct values, but i'm assuming there is a reason why you did it that way, so i'd prefer to fix it using your intended approach.

@elicn
Copy link
Member

elicn commented May 23, 2023

Tagging an argument as STRING lets Qiling treat it as a pointer to a null-terminating ASCII string. Qiling safely dereferences the pointer and reads the string from memory, so the user doesn't have to do it every time a C string is passed in to a function.

However, ANSI_STRING and UNICODE_STRING are not ordinary ASCII and wide strings, they are structures. I did create a dedicated structure for UNICODE_STRING (see here) but not for ANSI_STRING. Mapping them both to STRING and WSTRING (respectively) must be a leftover from older code.

I will add this to my fix backlog [probably through OS resolver handlers].
Thanks for pointing this out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants