Recover type information #190

Open
frabert opened this issue Oct 21, 2021 · 8 comments
Labels: decomp (Related to LLVM IR to C decompiler), enhancement (New feature or request), user-story


@frabert
Collaborator

frabert commented Oct 21, 2021

Currently, some type information (e.g. const, volatile) regarding variables is lost.
For example, this source code

struct foo {
  int x;
};

typedef struct foo foo_t;

int main(void) {
  const volatile foo_t **a = 0;
  void *b = a;
}

roundtrips to:

unsigned int main();
struct struct_foo {
    unsigned int x;
};
unsigned int main() {
    struct struct_foo **a; // lost const volatile
    unsigned char *b; // was void* before
    a = (void *)0U;
    b = (unsigned char *)a;
    return 0U;
}

It would be nice to recover the original type information when possible.

frabert added the enhancement, decomp, and user-story labels Oct 21, 2021
@frabert
Collaborator Author

frabert commented Oct 22, 2021

One problem I stumbled upon while working on this: currently, all local declarations are emitted at the top of the function, and the initializing assignments follow. This is not valid for const variables, e.g.

const int a = 0;

is valid, but

const int a;
a = 0;

is not, since a const object cannot be assigned to after its declaration.

@pgoodman
Contributor

Is there an opportunity here? For example, the const qualifier in debug info is telling us "this is assigned once." There will be a declref or something in the code on the assignment statement. The const is thus a kind of hint to us that "hey, we /could/ do the decl and assignment at the site of this assignment."

I'm not sure how to guarantee that valid code is produced. For example, if the scoping ends up wrong, then that means we possibly already admitted undefined behaviour into the code (by having a read-before-write of a).
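
To make the suggestion concrete, here is a minimal C sketch of the two forms. The function names (hoisted_form, merged_form, check_equivalence) and the value 42 are made up for illustration and are not decompiler output. The merge is only safe when the single assignment comes before every use and sits in a scope that encloses all of them, which is the scoping concern raised above.

```c
#include <assert.h>

/* Form emitted today: the declaration is hoisted to the top and the
   assignment comes later. This only compiles because `a` is not const. */
int hoisted_form(void) {
    int a;
    a = 42;
    return a;
}

/* Form a const-aware pass could emit instead: the declaration is merged
   with its single assignment, at the site of that assignment. */
int merged_form(void) {
    const int a = 42;
    return a;
}

/* Both forms should be observably equivalent. */
void check_equivalence(void) {
    assert(hoisted_form() == merged_form());
}
```

If the assignment ends up inside a nested block while a use stays outside it, the merged declaration would go out of scope too early, so a pass like this would probably have to fall back to the hoisted, non-const form in that case.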

@frabert
Collaborator Author

frabert commented Oct 29, 2021

char types pose a bit of an issue: all of the other integral types are either signed or unsigned, with signed being the default when not specified. char is different, in that char is "its own thing", separate from signed char and unsigned char.
For example,

int *a = /* ... */;
signed int *b = a; // OK!
unsigned int *c = a; // Warning

char *x = /* ... */;
signed char *y = x; // Warning
unsigned char *z = x; // Warning

The issue is that debug info only specifies either signed or unsigned for integral types, with no way of expressing a "plain" char. This causes lots of warnings, e.g. for strings.
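
The "its own thing" point can be checked with a small C11 sketch using _Generic. The macro and function names below (CHAR_KIND, kind_of_plain_char, and so on) are hypothetical and only serve the illustration.

```c
#include <assert.h>

/* Maps an expression to a tag based on its exact type. Listing char,
   signed char and unsigned char side by side is only legal because the
   three are distinct, mutually incompatible types. Listing both int and
   signed int would be rejected, since those name the same type. */
#define CHAR_KIND(x) _Generic((x), \
    char: 0,                       \
    signed char: 1,                \
    unsigned char: 2,              \
    default: -1)

int kind_of_plain_char(void)    { char c = 'a';          return CHAR_KIND(c); }
int kind_of_signed_char(void)   { signed char c = 'a';   return CHAR_KIND(c); }
int kind_of_unsigned_char(void) { unsigned char c = 'a'; return CHAR_KIND(c); }

/* Each variable selects its own association, on any target. */
void check_char_kinds(void) {
    assert(kind_of_plain_char() == 0);
    assert(kind_of_signed_char() == 1);
    assert(kind_of_unsigned_char() == 2);
}
```

This holds regardless of whether plain char happens to be signed or unsigned on the target, which is why a char * recovered as signed char * or unsigned char * triggers pointer-type warnings either way.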

@pgoodman
Contributor

Perhaps from the triple and the data layout we can get whether or not char is signed on the platform? I know in LanguageOptions there's a way to force signed or unsigned char.

@frabert
Collaborator Author

frabert commented Oct 29, 2021

According to https://llvm.org/docs/LangRef.html#langref-datalayout we do not have something to tell us whether chars are signed or unsigned by default, but even if we did... we'd still need to specify the signedness in the C code, otherwise we would not be guaranteed to have the same semantics on platforms with different signedness, once re-compiled... or am I missing something?

At that point the only issue becomes deciding whether global strings (which, at first glance, do not have debug info attached) are signed char* or unsigned char*, but that could be solved via a command line argument or something like that.
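
A rough C illustration of the portability argument, assuming only the standard limits.h macros. The function names are made up. Widening a signed char or an unsigned char gives the same result on every target, while widening a plain char depends on the platform that recompiles the code.

```c
#include <assert.h>
#include <limits.h>

/* 1 if plain char is signed for this compiler/target, 0 otherwise. */
int plain_char_is_signed(void) { return CHAR_MIN < 0; }

/* Explicit signedness: the widened value is the same everywhere. */
int widen_signed(signed char c)     { return c; }
int widen_unsigned(unsigned char c) { return c; }

/* Plain char: the smallest widened value differs between targets. */
int widen_plain_min(void) { char c = CHAR_MIN; return c; }

void check_widening(void) {
    assert(widen_signed(SCHAR_MIN) == SCHAR_MIN);
    assert(widen_unsigned(UCHAR_MAX) == UCHAR_MAX);
    assert(widen_plain_min() == (plain_char_is_signed() ? SCHAR_MIN : 0));
}
```

Note that CHAR_MIN only describes the platform compiling the C code, not the one the bitcode was originally built for, so it cannot stand in for the missing information in the IR.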

@pgoodman
Contributor

Do you see "omnipotent char" show up anywhere in debug info? We might be able to see that and use char. There is CharIsSigned, and here is how LLDB infers this.

@frabert
Collaborator Author

frabert commented Oct 30, 2021

Yes, "omnipotent char" does show up as part of TBAA attributes on load and store instructions.
I've tried searching for something but couldn't really grasp what that means.

@pgoodman
Contributor

Perhaps it's actually a sort of joke type for pointers. Perhaps, if we know it's meant to be a char but want to hedge against signed/unsigned, we could have a typedef like typedef signed char char_t or typedef unsigned char char_t, or have a char_t and a uchar_t.
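
A minimal sketch of what such a typedef prelude could look like in the emitted C. The macro DECOMP_CHAR_IS_UNSIGNED and the helper char_t_strlen are hypothetical names, not existing options or functions of the decompiler. The macro stands in for whatever command line switch ends up selecting the signedness.

```c
#include <assert.h>

/* Hypothetical switch: pick the signedness of char_t once, in one place. */
#ifdef DECOMP_CHAR_IS_UNSIGNED
typedef unsigned char char_t;
#else
typedef signed char char_t;
#endif
typedef unsigned char uchar_t;

/* Length of a NUL-terminated char_t string. */
unsigned long char_t_strlen(const char_t *s) {
    unsigned long n = 0;
    while (s[n] != 0) {
        n++;
    }
    return n;
}

void check_char_t(void) {
    const char_t msg[] = { 'h', 'i', 0 };
    assert(sizeof(char_t) == 1 && sizeof(uchar_t) == 1);
    assert(char_t_strlen(msg) == 2);
}
```

Because char_t is still either signed char or unsigned char underneath, passing a string literal or calling a libc function that takes char * would still need a cast to avoid pointer-sign warnings.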
