Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PicoServer - get_sender_client_id() returning wrong value #30

Open
samikool opened this issue Sep 30, 2021 · 3 comments
Open

PicoServer - get_sender_client_id() returning wrong value #30

samikool opened this issue Sep 30, 2021 · 3 comments

Comments

@samikool
Copy link

Apologies if this is the wrong place for this question, but I am having a problem with the PicoServer App.

Essentially, I have ported an ethernet driver from the Zynq7000 to the Zync Ultrascale+, and am trying to use the PicoServer App to test it. It works about 90% of the time I load and run it, but sometimes it flakes out. I have done a lot of investigation but am a bit stuck as this is the first project I have worked on in either seL4 or camkes.

Right now, I have the Listener component start up and get its socket into the listening state, then I signal the echo client to start up and attempt to make a connection. From my understanding of the app, these components should enter the while loop at the bottom and call seL4_Wait(() and then be woken up by the PicoServer when their socket receives an event. However, I am running into one of two scenarios.

I leave the seL4_Wait() call alone, and the Listener socket is never woken up, even though, when I trace down the TCP/IP stack, I see the syn from the Echo going out and trying to make the connection.

I change seL4_Wait() to and seL4_Poll() call, and the connection is made between the two components, however at times it doesn't work as mentioned above. Specifically when I attempt to send data, either data is not sent, or it says data is sent but the listener doesn't actually receive anything.

I am a little stuck on where to start debugging situation 1, but in situation 2, I have traced the problem to when the component checks its client_id, which then corresponds to what socket gets returned from the PicoServer. A call is made to pico_send_get_sender_id(), which switches on the pico_send_badge to return the ID of the Echo component. At times pico_send_badge is not correctly set to the Echo component's badge, which is causing the wrong badge to get returned, which then causes the wrong client_id to get returned, which then causes the wrong socket to get returned. The stack then attempts to send from the Listener socket instead of the echo socket, which obviously causes an error. At times the error happens on the receiving end, where the Echo component's socket is returned and it tries to receive the data it just sent to the Listener socket, which again causes an error.

My best guess is that since there is a single handle message function being called by potentially two different component threads, that the pico_send_badge is not getting read/written correctly at times. I've tried adding the volatile keyword, but it didn't fix this issue.

I am a bit at a loss on how to progress in debugging this issue. If anyone has any suggestions, or needs more info please let me know. This is my first issue on creation on Github so if this is the wrong place for this or I'm leaving something out please let me know.

@kent-mcleod
Copy link
Member

Essentially, I have ported an ethernet driver from the Zynq7000 to the Zync Ultrascale+, and am trying to use the PicoServer App to test it. It works about 90% of the time I load and run it, but sometimes it flakes out.

Do you mean that sometimes it correctly loads and runs and sometimes it fails to startup, or that when it runs it sometimes fails on individual packets?

At times pico_send_badge is not correctly set to the Echo component's badge, which is causing the wrong badge to get returned, which then causes the wrong client_id to get returned, which then causes the wrong socket to get returned.

Which version of the project are you using? Are you able to add a GitHub url to where pico_send_get_sender_id() is called? When I look it seems that only pico_control_get_sender_id() is being called?
Looking at a recent checkout, this is what the server is doing (https://github.com/seL4/global-components/blob/master/remote-drivers/picotcp-socket-sync/server/server.c#L437):

int pico_send_send(int socket_fd, int len, int buffer_offset)
{
    seL4_Word client_id = client_check();

And client_check():

static inline seL4_Word client_check(void)
{
    /* Client IDs start from one to avoid using the zero badge */
    seL4_Word client_id = pico_control_get_sender_id();
    ZF_LOGF_IF(client_id >= num_clients, "Client ID is greater than the number of clients registered!");
    return client_id;
}

This does look like a bug in the picoserver component, where for the pico_send* interface calls, the client check function should be calling pico_send_get_sender_id() instead of pico_control_get_sender_id(). Because it's checking the sender ID for the wrong interface it's going to get things completely wrong. This is probably the root cause of the errors you are seeing.

The camkes runtime should probably also be setting pico_control_badge, pico_send_badge and other similar global variables to invalid values when not currently handling an RPC for that interface so that pico_*_get_sender_id() can report an error when called incorrectly.

@samikool
Copy link
Author

samikool commented Oct 14, 2021

Alright this will be a long one, but with the info you gave I took a second look this week, and was able to identify and fix the problem. I'll try to be as detailed as possible.

Answering Questions

Do you mean that sometimes it correctly loads and runs and sometimes it fails to startup, or that when it runs it sometimes fails on individual packets?

I was unsure at first, but after fixing the bug, I can confirm the driver I ported is working as intended, and this was definitely a problem with the PicoServer component.

Which version of the project are you using?

I am using the default.xml manifest from the camkes-manifest github at tag 3.10.0.

Are you able to add a GitHub url to where pico_send_get_sender_id() is called? When I look it seems that only pico_control_get_sender_id() is being called?

I am not sadly, because these functions are actually generated when calling the init-build.sh script. However, you are correct, even in the generated component code, these functions are never called, which did turn out to be part of the problem.

Intended Behaviour

The original problem I described was that the wrong client_id was being returned by pico_control_get_sender_id. This was true, but was sort of misleading as I investigated further. The reason it was misleading was I was using the PicoServer component in an unintended way. The intended way to use the component and PicoTCP stack in general is to have a low level device loop that is constantly looping, accepting TCP connections, packets, ARP requests, etc. (this is the part where my driver was functioning correctly). When these connections and packets are detected, the loop calls a socket callback function, which puts an event in the socket's structure, and then flags the need to emit a signal, using a PicoControl function that is generated during the initialization with the init-build.sh script. This theoretically wakes up the correct CAmkES component which owns the socket, and allows it to handle the network event. I will try to show this all below.

CAmkES Components Waiting

int run(void)
{
    printf("%s instance starting up, going to be listening on %s:%d\n",
           get_instance_name(), ip_addr, LISTENER_PORT);

    int socket = listener_control_open(false);
    if (socket == -1) {
        assert(!"Failed to open a socket for listening!");
    }

    int ret = listener_control_bind(socket, PICOSERVER_ANY_ADDR_IPV4, LISTENER_PORT);
    if (ret) {
        assert(!"Failed to bind a socket for listening!");
    }

    ret = listener_control_listen(socket, 1);
    if (ret) {
        assert(!"Failed to listen for incoming connections!");
    }

    /* Now poll for events and handle them */
    seL4_Word badge;

    while (1) {
        seL4_Wait(listener_control_notification(), &badge);
        handle_picoserver_notification();
    }
}

As you can see in in picoserver example application, both the Echo and Listener components do some initialization of their sockets, and then pause on a seL4_Wait call. I will show below how they are meant to be woken up.

Socket callback function

static void socket_cb(uint16_t ev, struct pico_socket *s)
{
    /* ZF_LOGE("\033[32mpico_socket addr = %x, ev = %d\033[0m", s, ev); */

    /* Find the picoserver_socket struct that houses the pico_socket */
    picoserver_socket_t *client_socket = client_get_socket_by_addr(s);

    /** I removed a huge comment that was here to save space */

    if (client_socket->async_transport) {
        if (ev & PICO_SOCK_EV_RD) {
            ev &= ~PICO_SOCK_EV_RD;
            rx_queue_handle();
            rx_socket(client_socket);
        }
        if (ev & PICO_SOCK_EV_WR) {
            ev &= ~PICO_SOCK_EV_WR;
            tx_queue_handle();
            tx_socket(client_socket);
        }
    }
    if (ev) {
        seL4_Word client_id = client_socket->client_id;
        int ret = client_put_event(client_id, client_socket->socket_fd, ev);
        ZF_LOGF_IF(ret == -1, "Failed to set the event flags for client %"PRIuPTR"'s socket %d",
                   client_id + 1, client_socket->socket_fd);

        emit_client = 1;
    }
}

You can see when this callback is invoked by the low level device loop, it is passed an event, and socket_fd. If there is an event, then the emit_client flag is set to 1. Additionally, the the function calls client_put_event which simply adds the event to the socket's structure as shown below.

Putting Client Event

int client_put_event(seL4_Word client_id, int socket_id, uint16_t event)
{
    picoserver_client_t *client = clients[client_id];
    /* Fetch the socket from the client and add the event */
    picoserver_socket_t *client_socket = client_get_socket(client_id, socket_id);
    if (client_socket == NULL) {
        return -1;
    }
    client_socket->events |= event;

    /* Add the socket to the set if not already there */
    int ret = 0;
    kh_put(socket_event, client->socket_event_set, socket_id, &ret);
    if (ret == -1) {
        return -1;
    }

    return 0;
}

Later, after the event has already been put, the CAmkES component actually needs to be notified that it's socket received an event so it can handle it. This is where the emit_client flag comes in to play. In the low level device loop, a notify_client function is called.

Notify Client

static void notify_client(UNUSED seL4_Word badge, void *cookie)
{
    if (emit_client) {
        pico_control_emit(1);
        emit_client = 0;
    }
    if (emit_client_async) {
        tx_virtqueue.notify();
        emit_client_async = 0;
    }
}

This function is called in the low level device loop, and essentially just checks the flag, and emits a signal, then resets the flag. The pico_control_emit function is another one that is generated during the init-build script. However this is where the bug lies.

The Bug

The problem I was running into when running the example app is that only one of my components was actually being woken up and handling its events. I eventually tried removing the seL4_Wait call from the component that was stuck. This is where my original post gets the idea the pico_control_get_sender_id function was returning the wrong id . I eventually figured out that by removing the seL4_Wait call, I was short circuiting the signal logic, which was causing a race condition inside PicoControl. The ID it was supposed to return was being changed, before it could be returned, because one component was behaving as if it was constantly being signaled.

After seeing your post I took a closer look at the notify_client function and spotted the error. notify_client calls pico_control_emit which is generated when the build is first initialized, and for me, looks like this:

void pico_control_emit(unsigned int badge) {
    
    static void (*lookup[])(void) = {
        
            [1] = pico_control_emit_1,
        
            [2] = pico_control_emit_2,
        
    };
    assert(badge < ARRAY_SIZE(lookup));
    assert(lookup[badge]);
    lookup[badge]();
}
void pico_control_emit_1(void) {
    seL4_Signal(14);
}
void pico_control_emit_2(void) {
    seL4_Signal(13);
}

Essentially, depending on whatever ID is passed in, seL4_Signal is called, which will wake the corresponding component from its seL4_Wait call. The component then will call its handle function to handle the event at the application layer.

Well looking back at the notify_client function the problem becomes really obvious. The pico_control_emit function is only ever passed a hardcoded 1. This is why one of my CAmkES components was never being woken up, and the other was working fine. It is also why, when debugging, I could see the socket callback function firing as well as the events being successfully put, but still saw my component stuck on a seL4_Wait call.

The Fix

The fix I implemented was very straight forward, and from testing seems to be working flawlessly. Although I have not tried testing with >2 components yet.

In the server_cb function: emit_client = 1; -> emit_client = client_id+1;

+1 because CAmkES component IDs start at 0, while PicoServer registers keeps track of components starting at 1.

The next change was in notify_client function: pico_control_emit(1) -> pico_control_emit(emit_client);

Now instead of a hardcoded emit, pico_control_emit receives the ID of the component it needs to signal. I believe this must have been how it was always intended to work. I have no idea why there would have been a hardcoded 1 there in the first place. This would have broken anytime someone had >1 component connected to the PicoServer, which from my understanding, is supposed to be the exact use case of the PicoServer component.

One More Bug

With that fixed I ran into one other bug, and to come full circle, your comment about pico_send_get_sender_id never being called ended up coming into play here. The issue was when I was trying to send/recv data with pico_send_send/pico_recv_recv, I was getting the wrong buffers. The Echo component was sometimes grabbing the Listener's buffers and vice-versa. On further investigation, this seemed to vary every time I reinitialized the build. Sometimes Echo was broken but Listener wasn't, sometimes the other way around, or sometimes it worked perfectly. What I eventually found is that the ID for the PicoControl component doesn't always match the PicoSend and PicoRecv IDs. For example, Echo could be 0 with respect to PicoControl, but 1 with respect to PicoSend. So if only the PicoControl ID is used, then the wrong buffers are accessed.

The fix here was also very straight forward, in the pico_send_send/pico_recv_recv functions, I added a send_client_id and recv_client_id and use those to grab the buffers instead of the original client_id. The original client_id is now just used to get the socket, since that is the part associated with PicoControl. I will post code below:

Send

int pico_send_send(int socket_fd, int len, int buffer_offset)
{
    seL4_Word control_client_id = client_check();
    seL4_Word send_client_id = pico_send_get_sender_id();
    /*
     * client_id needs to be incremented here as the CAmkES generated interfaces are one off
     * from ours
     */
    size_t buffer_size = pico_send_buf_size(pico_send_enumerate_badge(send_client_id ));
    void *client_buf = pico_send_buf(pico_send_enumerate_badge(send_client_id ));
    picoserver_socket_t *client_socket = NULL;

    int ret = server_communication_common(control_client_id , socket_fd, len, buffer_offset,
                                          buffer_size, &client_socket, &client_buf);

    if (ret) {
        return -1;
    }

    ret = pico_socket_send(client_socket->socket, client_buf, len);
    return ret;
}

Receive

int pico_recv_recv(int socket_fd, int len, int buffer_offset)
{
    seL4_Word control_client_id = client_check();
    seL4_Word recv_client_id= pico_recv_get_sender_id();
    /*
     * client_id needs to be incremented here as the CAmkES generated interfaces are one off
     * from ours
     */
    size_t buffer_size = pico_recv_buf_size(pico_recv_enumerate_badge(recv_client_id));
    void *client_buf = pico_recv_buf(pico_recv_enumerate_badge(recv_client_id));
    picoserver_socket_t *client_socket = NULL;

    int ret = server_communication_common(control_client_id , socket_fd, len, buffer_offset,
                                          buffer_size, &client_socket, &client_buf);
    if (ret) {
        return -1;
    }

    ret = pico_socket_recv(client_socket->socket, client_buf, len);
    return ret;
}

With this change, everything seems to be working perfectly for how I'm using it.

Conclusion

Obviously there are larger problems, as something like this should probably be changed in every function instead of the two I'm using. I don't have a commit ready to fix all of this yet, and would be interested to hear feedback on it before getting one ready. However, from what I can tell, when the PicoServer component claims it can multiplex access to an Ethdriver component for multiple CAmkES components, it is lying. I cannot see how it would ever wake up more than one component in its current iteration. I should also throw out that I have not taken a look at how the code is generated. Potentially, a hardcoded 1 used to work because signaling was handled differently, but something was changed which caused this bug.

I am going to do more testing with >2 components, as well as work with UDP and see if I can confirm these fixes will work. If so, I do plan on making a commit which should address the issues I have shown. I will also clean it up a bit, and maybe induce a second client_to_signal variable or something, to make it more clear how the signaling is working.

@samikool
Copy link
Author

samikool commented Oct 15, 2021

Tested with 3 components, and more changes will be necessary. Since socket_cb, and notify_client are called by two different threads, some components are not getting signaled before others have their socket_cb function called. This overwrites what client needed emitting, and it will get stuck waiting.

I have it mostly working by adding an array with length == number of clients
added to picoserver init

client_needs_emit = (int *) calloc(num_clients, sizeof(int));

In socket_cb

if (ev) {
        seL4_Word client_id = client_socket->client_id;
        int ret = client_put_event(client_id, client_socket->socket_fd, ev);
        ZF_LOGF_IF(ret == -1, "Failed to set the event flags for client %"PRIuPTR"'s socket %d",
                   client_id + 1, client_socket->socket_fd);
        // printf("client_id: %lu needs emitting...\n", client_id);
        emit_client = 1;
        client_needs_emit[client_id] = 1;
    }

In notify_client

if (emit_client) {
        for(int i=0; i<num_clients; i++)
        {
            if(client_needs_emit[i])
            {   
                pico_control_emit(i + 1);
                client_needs_emit[i] = 0;  
            }
        }
        emit_client = 0;
    }

This was my first thought of a solution, although it seems wasteful to traverse the entire array every time, Maybe using a linked list would be more efficient.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants