Slave has Died issue #243

5angjun · 2023-10-11T09:38:46Z

Hello, I'm sangjun who is very interested in this project.

However, I want to know how to fix some error in Manager & Workers Communication.

I want to make died process to restart new qemu and reconnect fuzzing process when slave has died.

Dying slaves is very critical when i try to fuzzing very long hours ex) over 6hours.

So i think what needs to be impored is to re-engage dead workers in the fuzzing process.

Is any idea of this??

    def wait(self, timeout=None):
        results = []
        r, w, e = select.select(self.clients, (), (), timeout)
        for sock_ready in r:
            if sock_ready == self.listener:
                c = self.listener.accept()
                self.clients.append(c)
                self.clients_seen += 1
            else:
                try:
                    msg = sock_ready.recv_bytes()
                    msg = msgpack.unpackb(msg, strict_map_key=False)
                    results.append((sock_ready, msg))
                except (EOFError, IOError):
                    sock_ready.close()
                    self.clients.remove(sock_ready)
                    self.logger.info("Worker disconnected (remaining %d/%d)." % (len(self.clients)-1, self.clients_seen))
                    if len(self.clients) == 1:
                        raise SystemExit("All Workers exited.")
        return results
     ```

Wenzel · 2023-10-11T09:45:01Z

Hi @5angjun,

I think we need to understand why the slaves (or Workers) are dying in the first place ?
That shouldn't happen.

You can get more logging information with --log and combine it with --debug to extract useful debug output.

cc @il-steffen , can we expect dying workers during a fuzzing campaign ? Something I'm missing ?

il-steffen · 2023-10-11T09:48:50Z

Worker exit can happen on Qemu segfault or unhandled exception in the worker / mutation logic. The above logic is only to handle the loss of the socket connection, you need to look at why the worker exited..

5angjun · 2023-10-11T10:35:35Z

I think the error occured when qemu died.

The last died code is this.

    def run_qemu(self):
        self.control.send(b'x')
        self.control.recv(1)

So i think it is nice to restart fuzzing campaign when qemu die.

il-steffen · 2023-10-11T11:23:02Z

Please have a look why this is happening. In general we want to fix anything that causes workers to die during a fuzzing campaign.
There are cases where restarting won't help, for instance if the disk is full then Qemu will just exit again on next file/log write.

In some cases there may be Qemu segfault that is not easy to fix, for instance we had bugs related to specific virtio fuzzing harnesses where fixing Qemu did not make much sense. In this case it would make sense to catch + restart the worker. This should be possible from the manager, and then the fuzzing campaign can just continue running.

The manager main loop is here: https://github.com/IntelLabs/kafl.fuzzer/blob/master/kafl_fuzzer/manager/manager.py#L85
We enter this just after launching the workers: https://github.com/IntelLabs/kafl.fuzzer/blob/master/kafl_fuzzer/manager/core.py#L104
The workers are python threads which in turn launch Qemu sub-processes. The threads should abort normally on Qemu communication error or uncatched exceptions, so you should be able to detect and restart the thread with same settings.

With some luck, the socket connection code you referenced above should detect the new worker and the main loop will start dispatching jobs again.

5angjun · 2023-10-11T13:04:00Z

This situation appears when allocating a lot of RAM to a vm image and performing parallel fuzzing. In my case, this problem appeared while fuzzing the Windows built-in driver for a long time.

For example, my host computer's RAM size is 84G, but when I allocated 10G of RAM to each vm and fuzzed it with 8 cores ( use almost 82G / 84G ), qemu or worker died (there is a high probability that qemu died).

But the manager process is still alive. I am thinking about how to modify the code to revive dead workers in the manager process.

As a person who loves kAFL, I will also think about how to modify kAFL to make it a masterpiece.😀😀😀

Thank

hyjun0407 mentioned this issue Feb 7, 2024

Memory Leak(?) problem with usermode harness #271

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Slave has Died issue #243

Slave has Died issue #243

5angjun commented Oct 11, 2023 •

edited by Wenzel

Loading

Wenzel commented Oct 11, 2023

il-steffen commented Oct 11, 2023

5angjun commented Oct 11, 2023

il-steffen commented Oct 11, 2023

5angjun commented Oct 11, 2023 •

edited

Loading

Slave has Died issue #243

Slave has Died issue #243

Comments

5angjun commented Oct 11, 2023 • edited by Wenzel Loading

Wenzel commented Oct 11, 2023

il-steffen commented Oct 11, 2023

5angjun commented Oct 11, 2023

il-steffen commented Oct 11, 2023

5angjun commented Oct 11, 2023 • edited Loading

5angjun commented Oct 11, 2023 •

edited by Wenzel

Loading

5angjun commented Oct 11, 2023 •

edited

Loading