Taint analysis improvements #123

Open
fxshlein opened this issue Dec 7, 2023 · 0 comments
fxshlein commented Dec 7, 2023

Hi! I've been implementing a static analysis that is similar to taint analysis, but with a somewhat more complex propagated state. Along the way, I made improvements to several of the components involved, which vastly improved my results. I wanted to share them in case they are relevant to you 🙂

I wrote them in Kotlin in our internal code, but I'd be happy to open a PR if the changes would be useful to you.

Note that some of these changes are based on the assumption that the heap is not tracked.

1. Replacing virtual calls with calls to implementations

This greatly improved the results because I have a lot of cases like this:

class Taint {
    static String source() { ... }

    static void sink(String value) { ... }
}

interface Interface {
    void takeTheArgument(String argument);
}

class Implementation implements Interface {
    @Override
    public void takeTheArgument(String argument) {
        // The tainted value reaches the sink only through this override.
        Taint.sink(argument);
    }
}

class Main {
    // And somewhere:
    static void main(Interface value) {
        // Virtual call through the interface with a tainted argument.
        value.takeTheArgument(Taint.source());
    }
}

The main method calls takeTheArgument on an Interface with a tainted argument, but the call is ignored because it targets the interface method rather than the implementation.

Instead, when building the CFA, for every invokevirtual I also add calls to the implementations of the method in each subclass that implements it. In the resulting CFA, the single invokevirtual then has multiple outgoing calls instead of one. The existing algorithms seem to handle this fine, and taint analysis now also looks at the implementations of the methods.
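To illustrate the idea, here is a minimal, self-contained sketch of resolving one virtual call to every implementation below the declared type. VirtualCallExpander, addType, and resolveTargets are hypothetical names made up for this example, and the string-based hierarchy is an assumption; none of this is the ProGuardCORE API or the code I actually wrote.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified class hierarchy; not the ProGuardCORE API.
class VirtualCallExpander {
    // Type name -> direct subtypes (subclasses or implementing classes).
    private final Map<String, List<String>> directSubtypes = new LinkedHashMap<>();
    // Type name -> names of the methods that have a body in that type.
    private final Map<String, Set<String>> concreteMethods = new LinkedHashMap<>();

    void addType(String name, String superType, String... methodsWithBody) {
        concreteMethods.put(name, new LinkedHashSet<>(List.of(methodsWithBody)));
        if (superType != null) {
            directSubtypes.computeIfAbsent(superType, k -> new ArrayList<>()).add(name);
        }
    }

    // Returns every "Type.method" that a virtual call on declaredType may dispatch to.
    List<String> resolveTargets(String declaredType, String method) {
        Set<String> targets = new LinkedHashSet<>();
        collect(declaredType, method, targets);
        return new ArrayList<>(targets);
    }

    private void collect(String type, String method, Set<String> targets) {
        if (concreteMethods.getOrDefault(type, Set.of()).contains(method)) {
            targets.add(type + "." + method);
        }
        // Walk down the hierarchy so overrides in subtypes are included too.
        for (String subtype : directSubtypes.getOrDefault(type, List.of())) {
            collect(subtype, method, targets);
        }
    }
}
```

With the example above registered as addType("Interface", null) and addType("Implementation", "Interface", "takeTheArgument"), resolving takeTheArgument on Interface yields the single target in Implementation, and each such target would become one extra outgoing call edge in the CFA.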

2. Starting from taint sources instead of the main entrypoint

Instead of directly using a BamCpaRun, I implemented a top level CpaRun based on a custom transfer relation, which will:

  • If the current position is an exit node, backtrack to the method entry, use the reduce operator to create the return state, and then use all the callers as the successors.
  • If the current position is a call, use a BamCpaRun to analyze that call. The BamCpaRun will have the called method as its main method.
  • Otherwise, just use the regular JvmModelTrackingTransferRelation.
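The three cases above can be sketched as a simple dispatch. This is only an illustration: SourceRootedTransfer, Location, Kind, and the three handler functions are hypothetical stand-ins for the real CFA nodes, the nested BamCpaRun, and JvmModelTrackingTransferRelation, and successors are plain strings to keep the example self-contained.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch; the handlers stand in for the real CPA components.
class SourceRootedTransfer {
    enum Kind { EXIT, CALL, OTHER }

    // Simplified stand-in for a CFA location.
    static final class Location {
        final String name;
        final Kind kind;

        Location(String name, Kind kind) {
            this.name = name;
            this.kind = kind;
        }
    }

    private final Function<Location, List<String>> returnToCallers;
    private final Function<Location, List<String>> analyzeCallWithBam;
    private final Function<Location, List<String>> regularTransfer;

    SourceRootedTransfer(Function<Location, List<String>> returnToCallers,
                         Function<Location, List<String>> analyzeCallWithBam,
                         Function<Location, List<String>> regularTransfer) {
        this.returnToCallers = returnToCallers;
        this.analyzeCallWithBam = analyzeCallWithBam;
        this.regularTransfer = regularTransfer;
    }

    List<String> successors(Location location) {
        switch (location.kind) {
            case EXIT:
                // Reduce to a return state and continue in every caller.
                return returnToCallers.apply(location);
            case CALL:
                // Analyze the callee in a nested run rooted at the called method.
                return analyzeCallWithBam.apply(location);
            default:
                // Everything else goes through the regular transfer relation.
                return regularTransfer.apply(location);
        }
    }
}
```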

As this sometimes starts analyzing from the middle of a method, the stack is modified to return an empty state instead of throwing when someone tries to pop from an empty stack.
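A minimal sketch of that stack behavior, assuming a generic element type and a caller-supplied "empty" state; LenientOperandStack is a hypothetical name for this example and not the actual stack class used by the analysis:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of a stack that tolerates popping when empty.
class LenientOperandStack<T> {
    private final Deque<T> items = new ArrayDeque<>();
    private final T emptyState;

    LenientOperandStack(T emptyState) {
        this.emptyState = emptyState;
    }

    void push(T value) {
        items.push(value);
    }

    // Returns the empty state instead of throwing when nothing is left,
    // which happens when analysis starts in the middle of a method.
    T pop() {
        return items.isEmpty() ? emptyState : items.pop();
    }
}
```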

When, after backtracking, the return value of a function is not tainted, the algorithm stops, since the only thing that could taint the state again is another call to a taint source, and that is analyzed separately.

This allows me to drastically reduce the amount of code that is analyzed, since, instead of starting from a main method, I can start analysis from inside my source method. It also circumvented some issues I had, where some pieces of code were not reached from the main method. The program I'm analyzing is fairly huge, so it's kind of expected that coverage would not be perfect.

3. Nested Call Filter

This one was again really useful for reducing the amount of code analyzed. It is a simple predicate that lets the implementation decide whether to actually analyze a function call. The predicate is called in an if statement in BamTransferRelation. If the method should not be entered, it is treated like an unknown method.

Given that:

  • The run is directly starting from the sources as entry points (see above)
  • The heap is not being tracked (i.e. a forgetful heap model is being used)

A call where none of the operands are tainted can be filtered out, as there is no way for any code in that call to become tainted. If that code calls a taint source again, that source will also be an entry point and will be analyzed separately.
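As a rough sketch of such a filter under the two assumptions above: Call, ANY_OPERAND_TAINTED, and handle are hypothetical names for this example, and the per-operand taint flags are a simplification of the real abstract state.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of a nested call filter; not the actual BamTransferRelation hook.
class NestedCallFilters {
    // Simplified stand-in for a call site with one taint flag per operand.
    static final class Call {
        final String target;
        final List<Boolean> operandTainted;

        Call(String target, List<Boolean> operandTainted) {
            this.target = target;
            this.operandTainted = operandTainted;
        }
    }

    // Enter the callee only if at least one operand carries taint.
    static final Predicate<Call> ANY_OPERAND_TAINTED =
            call -> call.operandTainted.contains(Boolean.TRUE);

    // Calls rejected by the filter are handled like calls to unknown methods.
    static String handle(Call call, Predicate<Call> filter) {
        return filter.test(call) ? "enter " + call.target : "treat " + call.target + " as unknown";
    }
}
```

Passing the filter in as a Predicate keeps the decision with the concrete analysis, so a different analysis could plug in a different rule.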

4. Taint analysis from field access

This one is fairly straightforward and probably intended functionality, but it seems that the taint analysis algorithm doesn't implement it currently:
By overriding JvmForgetfulHeapAbstractState and returning a tainted state from getFieldOrDefault depending on the passed fqn, it's possible to use a field as a taint source without requiring one of the more complex heap models.
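The idea can be sketched with a toy heap model. FieldSourceHeap and the Taint enum are hypothetical, and the two-argument getFieldOrDefault here is a simplified assumption that only mirrors the name of the real method; it does not reproduce the actual signature in JvmForgetfulHeapAbstractState.

```java
import java.util.Set;

// Hypothetical toy model of a forgetful heap with field taint sources.
class FieldSourceHeap {
    enum Taint { TAINTED, UNTAINTED }

    // Fully qualified names of the fields that act as taint sources.
    private final Set<String> sourceFields;

    FieldSourceHeap(Set<String> sourceFields) {
        this.sourceFields = sourceFields;
    }

    // A forgetful heap normally returns the default for every read;
    // here, reads of registered source fields return a tainted state instead.
    Taint getFieldOrDefault(String fqn, Taint defaultValue) {
        return sourceFields.contains(fqn) ? Taint.TAINTED : defaultValue;
    }
}
```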
