Feat: Make Artifacts support in-structure commenting #102

Open · 4 tasks
mahaloz opened this issue Aug 25, 2024 · 1 comment

Comments

@mahaloz
Member

mahaloz commented Aug 25, 2024

Background

In most decompilers, like IDA Pro, types and their members can carry comments, like:

struct Elf64_Vernaux // sizeof=0x10
{                                       // XREF: LOAD:0000000000400410/r
     unsigned __int32 vna_hash;         // this is some comment on this first member
     unsigned __int16 vna_flags;
     unsigned __int16 vna_other;
     unsigned __int32 vna_name __offset(OFF64,0x400390);
     unsigned __int32 vna_next;
};

libbs does not currently support this. An ideal solution would look like this:

my_struct = deci.structs["Elf64_Vernaux"]
print(my_struct.comments[0])         # this is some comment on this first member
print(my_struct.members[0].comment)  # this is some comment on this first member

Implementation

To support this type of commenting, we'll need to do a few things:

  • Update member-like Artifacts to have a comment attribute (see the sketch after this list)
  • Support setting/getting these in each decompiler (as much as possible)
  • Refactor Function to support comments
  • Remove the old comments system that simply stored all comments globally
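
A minimal sketch of what this could look like on the Artifact side, assuming dataclass-style classes; the StructMember/Struct names mirror libbs naming, but the exact fields and signatures here are assumptions rather than the current libbs API:

    from dataclasses import dataclass, field
    from typing import Dict, Optional

    @dataclass
    class StructMember:
        name: str
        offset: int
        type_: Optional[str] = None
        size: int = 0
        comment: Optional[str] = None   # new: per-member comment

    @dataclass
    class Struct:
        name: str
        size: int = 0
        members: Dict[int, StructMember] = field(default_factory=dict)

        @property
        def comments(self) -> Dict[int, str]:
            # convenience view keyed by member offset, matching the example above
            return {off: m.comment for off, m in self.members.items() if m.comment}
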
@arizvisa

arizvisa commented Aug 27, 2024

If you don't want to use the edm_t.cmt and udm_t.cmt attributes to enumerate or serialize complex field comments, you can also unpack/save them from the result of tinfo_t.serialize(), which was the pre-8.4 method anyway ("fields" work similarly).

Decoding the bytes returned by tinfo_t.serialize() into a list of comments is basically: consume a byte, determine whether it's an 8-bit or 16-bit length, decode that length, use it to extract the comment bytes, UTF-8 decode them, and repeat until done.

    import builtins

    def decode_bytes(data):
        '''Decode the given `data` into a list containing the length and the bytes for each encoded string.'''
        results, iterable = [], (octet for octet in bytearray(data))

        integer = next(iterable, None)
        length_plus_one, ok = integer or 0, integer is not None
        while ok:
            # a length byte below 0x7f stands alone; otherwise the next byte must continue the encoding
            one = 1 if length_plus_one < 0x7f else next(iterable, None)
            assert (one == 1) and length_plus_one > 0
            encoded = bytearray(octet for index, octet in zip(builtins.range(length_plus_one - 1), iterable))   # using zip to clamp the bytes consumed
            results.append((length_plus_one - 1, encoded))

            integer = next(iterable, None)
            length_plus_one, ok = integer or 0, integer is not None
        return results
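
A hedged usage sketch to tie this to the blob being decoded, assuming IDAPython's tinfo_t.serialize() returns the (type, fields, cmts) triple referenced above; `tif` is a hypothetical tinfo_t for the struct, obtained elsewhere:

    # Hypothetical usage: `tif` is a struct tinfo_t looked up elsewhere.
    serialized_type, serialized_fields, serialized_cmts = tif.serialize()
    if serialized_cmts:
        member_comments = [encoded.decode('utf-8') for _, encoded in decode_bytes(serialized_cmts)]
    else:
        member_comments = []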

Encoding the string passed to tinfo_t.deserialize(til, type, fields, cmts=None) goes the other way: encode the length of each UTF-8 encoded comment and concatenate everything back into a single stream of bytes.

Apologies for the unreadability of the following; "encode_length" is all that's really relevant:

    import itertools

    def encode_bytes(strings):
        '''Encode the list of `strings` with their lengths and return them as bytes.'''
        encode_length = lambda integer: bytearray([integer + 1] if integer + 1 < 0x80 else [integer + 1, 1])
        iterable = (bytes(string) if isinstance(string, (bytes, bytearray)) else string.encode('utf-8') for string in strings)
        pairs = ((len(chunk), chunk) for chunk in iterable)
        return bytes(bytearray().join(itertools.chain(*((encode_length(length), bytearray(chunk)) for length, chunk in pairs))))
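
And a sketch of the round trip under the same assumptions: only the cmts argument is rebuilt here, the type/fields blobs come straight from serialize(), and get_idati() is assumed for the local type library:

    import ida_typeinf

    # Hypothetical round-trip: re-encode the (possibly edited) comments and pass
    # them back to deserialize() alongside the untouched type/fields blobs.
    new_cmts = encode_bytes(member_comments)
    new_tif = ida_typeinf.tinfo_t()
    new_tif.deserialize(ida_typeinf.get_idati(), serialized_type, serialized_fields, new_cmts)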

However, it's worth confirming that the performance of serializing/deserializing comments at scale actually matters for binsync. minsc creates an index of all commentable "things" so that they can be tagged for searching and (mis-)used to store nearly-arbitrary data, so being able to check whether a tinfo_t even has comments, or to distinguish exactly what was updated (name/comment/other) in response to events without iterating through all the fields one-by-one, made a difference there.

...I'm literally praying that they don't try to retrofit repeatable/non-repeatable comments into this, btw.

Projects status: paused
Development: No branches or pull requests
2 participants