Wayland communication protocol draft, v0.2 #4

Open
KarsMulder opened this issue Nov 29, 2023 · 20 comments

@KarsMulder

Overview

So I have been thinking about a new protocol for a while now, and have written down quite a bit about it. It is still pretty half-baked, insufficiently implemented, and several parts are not even implemented yet, but I thought it would be a good idea to post my current progress.

This protocol focuses entirely on how the communication between the Wayland compositor and a mapper script running as Wayland client could work. Libinput may or may not be able to be refactored to use the same structure internally as the compositor is supposed to use according to this protocol, but I haven't checked the feasibility of that yet; this is not a proposal on how libinput should be refactored.

There are several changes that still need to be made to this protocol, so I've created a new issue for it where I can update the top few posts as I progress.

@KarsMulder
Author

KarsMulder commented Nov 29, 2023

Layers

Like the filter protocol at kernel level I proposed earlier, I think that event processing should be separated into layers. Here is a simplified image with only four layers drawn, but I think we should have about 2^32 of them:

[image: layers]

Unlike my previous proposal, this time layers are a concept that exists on a compositor level, not on a device level. Devices do not have layers; if anything, layers have devices. Or more accurately, each layer has its own view of the properties and state of the devices.

Before doing anything, a mapper script must attach itself to one of those layers (or multiple if it wants to, but that's potentially a really bad idea.) Let's say that our hypothetical script attaches itself to layer 1000.

Once attached to a layer, the script can query the following:

  • Which input devices are available at this layer;
  • The properties of those input devices;

Devices are layer dependent
A script has to attach itself to a layer before it can see which input devices are available, because the set of existing input devices and their properties depend on the layer.

A script at layer 1000 can create a virtual input device. That device will be visible to everything attached to layer 1001 or higher, and invisible to everything at a lower layer.

Device properties are layer dependent
Input devices have properties, such as:

  • Name of the device;
  • Vendor and product id;
  • Device type (keyboard/pointer/touchpad);
  • Range of the absolute axes (EV_ABS);
  • Associated XKB keymap;
  • Desired acceleration profile;
  • And more...

These properties can depend on the layer as well. Suppose that a device named "Plain Keyboard" exists at layer 0. Then a script running at layer 1000 is allowed to change this name to "Fancy Keyboard". Any script running at layer 1001 or higher will see this device's name as "Fancy Keyboard", whereas any script at a lower layer will still see it as "Plain Keyboard".

Unless a script explicitly changes the properties of a device, all devices at later layers are assumed to have the same properties as those on earlier layers.
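The visibility rule above can be sketched as a sparse lookup. This is only an illustration under assumptions: `PropertyHistory`, `set_at` and `view_from` are hypothetical names that do not appear in the protocol, and the sketch assumes that the layer that writes a property keeps seeing the old value itself, which the draft does not specify.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>
#include <utility>

// Hypothetical helper: the history of one property of one device.
class PropertyHistory {
public:
    // `initial` is the value the device has before any layer touched it.
    explicit PropertyHistory(std::string initial) : initial_(std::move(initial)) {}

    // A script attached to `writer_layer` changes the property.
    void set_at(uint32_t writer_layer, std::string value) {
        writes_[writer_layer] = std::move(value);
    }

    // The value seen by a script attached to `viewer_layer`: the latest write
    // made by a strictly lower layer, or the initial value if there is none.
    const std::string& view_from(uint32_t viewer_layer) const {
        auto it = writes_.lower_bound(viewer_layer);  // first write at a layer >= viewer
        if (it == writes_.begin()) return initial_;
        return std::prev(it)->second;
    }

private:
    std::string initial_;
    std::map<uint32_t, std::string> writes_;  // key: the layer that wrote the value
};
```

Only layers that actually change something take up space, so unoccupied layers cost nothing: a lookup at layer 49999 simply finds the write made at layer 1000.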

There are some use cases for wanting to change the device properties:

  • Some device database might want to look up the vendor+product id in a database to find more information about it than the kernel knows and fix some quirks like "this device is known to report incorrect absolute axis ranges";
  • Some user configuration tool wants to change the properties of a specific device.

But most interestingly, the associated XKB keymap is treated as a device property. Mapper scripts are able to change the associated keymap in a way that later layers will observe their changes without interfering with previous layers that are maybe changing the keymap as well.

Events

I haven't figured out the right model for events yet. Maybe we can find some way to make it work with the evdev format. Maybe we'll just have to use the libinput format. Whatever format we end up using, I want to define some new concepts.

States
Some events are persistent, which means that they retain their last state until they are changed by another event. Examples are EV_KEY events and EV_ABS events.

Other events are transient, such as EV_REL events. There is no significant semantic meaning to be associated to the value of the last EV_REL event you saw.

The collection of the last value observed of every persistent event of a particular device is called the state of that device. For a keyboard, the state might be "the K key is pressed, everything else is released". For a joystick, the state might be "the X axis is 94 and the Y axis is -4".
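A minimal sketch of how such a state could be tracked, assuming evdev-style (type, code, value) events. `InputEvent`, `DeviceState` and `apply_event` are hypothetical names, only EV_KEY and EV_ABS are treated as persistent here, and a missing entry is taken to mean "released" or "0".

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>

// Numeric values mirror <linux/input-event-codes.h>.
constexpr uint32_t EV_KEY = 0x01, EV_REL = 0x02, EV_ABS = 0x03;

struct InputEvent { uint32_t type; uint32_t code; int32_t value; };

// Last observed value of every persistent (type, code) pair of one device.
using DeviceState = std::map<std::pair<uint32_t, uint32_t>, int32_t>;

// Assumption of this sketch: only key and absolute-axis events are persistent.
bool is_persistent(uint32_t type) { return type == EV_KEY || type == EV_ABS; }

void apply_event(DeviceState& state, const InputEvent& ev) {
    if (!is_persistent(ev.type)) return;  // transient events leave no trace
    const auto key = std::make_pair(ev.type, ev.code);
    int32_t value = ev.value;
    if (ev.type == EV_KEY && value == 2) value = 1;  // autorepeat still means "pressed"
    if (value == 0) state.erase(key);                // back to the default, keep the map sparse
    else state[key] = value;
}
```

Feeding "press K, move the mouse, X axis is 94, Y axis is -4" through `apply_event` leaves three entries in the state; the mouse movement is not remembered.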

Anyway, remember that at every layer, a set of input devices exist. The compositor should keep track of the state of each input device at each layer, even (in theory) of layers to which no scripts are attached. Of course it does so sparsely. If there is a script attached to layer 1000 and the next occupied layer is 50000, then it should just know that the devices and their states are identical at layers 1000–49999.

To recap:

  • At each layer, there is a set of input devices available;
  • For each (layer, device) pair, there is a set of device properties;
  • For each (layer, device) pair, there is a state, which is the set of the last event value seen at this layer for each persistent event type.

Event flow
All events are attached to a particular device and enter the stream at one layer. Then they automatically propagate to the next layer until they encounter a layer where their associated device is being listened to.

Listening
To keep things efficient, no events are routed to mappers unless they've declared that they want to listen to a certain device. Listening is pretty similar to opening an event device and then grabbing it: after you listen to a device, no events will reach later layers unless you write them back.

Unlike grabbing, listening does not make a device unavailable at later layers: later layers still see the device, they just receive no events from it unless the listener writes them back. A script is free to write events back to a device it is listening to, and those events will appear again in the event stream at the next layer.
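The routing described in this section could look roughly like the following on the compositor side. This is a sketch under assumptions: `ListenTable`, `route` and `route_written_back` are hypothetical names, and devices are identified by a plain integer id.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <set>

// Hypothetical table: for every occupied layer, the device ids it listens to.
using ListenTable = std::map<uint32_t, std::set<uint32_t>>;

// Returns the layer whose mapper receives an event of `device_id` that enters
// the stream at `entry_layer`, or nothing if the event passes through every
// layer untouched and reaches the end of the chain.
std::optional<uint32_t> route(const ListenTable& table, uint32_t device_id,
                              uint32_t entry_layer) {
    for (auto it = table.lower_bound(entry_layer); it != table.end(); ++it) {
        if (it->second.count(device_id) != 0) return it->first;
    }
    return std::nullopt;
}

// An event written back by the mapper at `layer` re-enters the stream at the
// next layer, so the writer never receives its own events.
std::optional<uint32_t> route_written_back(const ListenTable& table,
                                           uint32_t device_id, uint32_t layer) {
    if (layer == UINT32_MAX) return std::nullopt;  // there is no later layer
    return route(table, device_id, layer + 1);
}
```

Layers that do not listen to the device are skipped entirely, which is what keeps events away from mappers that never asked for them.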

This means that there is no need to create virtual devices unless you really want to. Writing back to the original device is usually a good choice because that means that two scripts that are configured to listen specifically to a device called "Fancy Keyboard" can both work without having to agree on the name of a virtual device.

Most mapper scripts that were not configured otherwise should probably just use a policy of "listen to all keyboard-type devices", so it won't matter whether those keyboard devices were "real" or "virtual".

Unresolved question:

  • Currently, a script has to explicitly tell the compositor about each individual device it wants to listen to, with some synchronization built into the protocol to delay events until the script has decided whether it wants to listen or not. This is the most versatile way, but it also creates the largest overhead when a new device is added. I wonder if the protocol should instead make scripts declare filters such as "I want to listen to all devices that have these properties"?

Initializing listening and unlistening
Remember how the compositor was supposed to keep track of the state at each layer? It comes in handy when some mapper listens or unlistens a device.

Whenever a mapper listens to a new device, the compositor will send the mapper the current state of that device (probably assuming that all keys that do not get sent are released, for efficiency). The mapper then sends output events telling the compositor what the state should have been if the mapper had been active all along. For example, if the mapper maps K→Q and the current state says that K is pressed, then the mapper sends the event "press Q" and tells the compositor it is done initializing.

The compositor then checks for the difference between the current state and what the mapper thinks the state should've been, and then inserts events "release K, press Q" into the event chain to bridge the difference.

Whenever a device gets unlistened, the compositor should know what its state was on the previous layer, and its state on the current layer. It then bridges the difference, for example by inserting "release Q, press K" in the event chain.

The ability for the compositor to handle these state transitions is important to maintain the integrity of the event chain in case a script crashes.
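Bridging two states boils down to computing a diff. The sketch below is one possible way to do it, not a prescribed algorithm: `InputEvent`, `DeviceState` and `bridge` are hypothetical names, a missing entry means "released" or "0", and releases are emitted before presses so that the output matches the "release K, press Q" example.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

struct InputEvent { uint32_t type; uint32_t code; int32_t value; };

// Last observed value of every persistent (type, code) pair of one device.
using DeviceState = std::map<std::pair<uint32_t, uint32_t>, int32_t>;

static int32_t value_in(const DeviceState& state,
                        const std::pair<uint32_t, uint32_t>& key) {
    auto it = state.find(key);
    return it == state.end() ? 0 : it->second;
}

// Returns the events that turn state `from` into state `to`.
std::vector<InputEvent> bridge(const DeviceState& from, const DeviceState& to) {
    std::vector<InputEvent> events;
    // Pass 1: everything that is active in `from` but not in `to`, e.g. "release K".
    for (const auto& [key, old_value] : from) {
        if (old_value != 0 && value_in(to, key) == 0)
            events.push_back({key.first, key.second, 0});
    }
    // Pass 2: everything whose target value is non-zero and differs, e.g. "press Q".
    for (const auto& [key, new_value] : to) {
        if (new_value != 0 && new_value != value_in(from, key))
            events.push_back({key.first, key.second, new_value});
    }
    return events;
}
```

Calling `bridge(state_before_mapper, state_after_mapper)` covers the listen case; swapping the arguments yields "release Q, press K" for the unlisten or crash case.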

The first layer
I have intentionally not clarified where the devices and events on the first layer come from. That's something that needs to be figured out later.

Maybe they come directly from event devices. Maybe they're the output of libinput. Maybe libinput's internal structure gets split into layers so that libinput internally uses the described event flow as well, although with a more efficient internal API that allows it to skip over the "communicate with the compositor" part.

State synchronisation

Sometimes the environment of a layer changes. Changes in environment may include:

  • A new input device gets created or destroyed;
  • The properties of an input device change;
  • The active window changes.

In these cases, it is possible that the state of the input devices (= the last value of all persistent events) needs to change even if users didn't actually input any events.

For example, suppose there are two open windows, Firefox and Dolphin. There is a script that maps the key K to the key Q on Dolphin but not on Firefox. Suppose that the Firefox window is active and the user is currently holding the key K down. Then the user clicks the Dolphin window.

This means that by the time the Dolphin window gets activated, it should look like the key "Q" has been held down since before this window got the focus. Dolphin should not first observe "key K is down, release K, press Q". Nor should the Firefox window receive the events "release K, press Q" before it loses focus.

For such cases, we need to change the state atomically. To facilitate this, there are two Wayland events and two Wayland requests available:

  • (event) begin_update_state;
  • (event) finish_update_state;
  • (request) begin_sync_state;
  • (request) finish_sync_state;

Whenever the environment changes, say the active window has changed, the compositor should first send the event begin_update_state. Then it should send Wayland events informing the mapper of how the environment changed (in this case, that the active window changed). In case any previous layers want to update their state atomically as part of the same change, the mapper may subsequently receive zero or more input events representing how the state of the devices has changed due to the previous layers. Finally, the compositor will send finish_update_state to mark that all events related to this change in environment have been sent.

The mapper must reply to begin_update_state with the request begin_sync_state. This helps the compositor to deduce the intent of the input events it receives: all input events sent by the mapper before begin_sync_state were actual input events that were meant to happen before it became aware that the environment changed (in this example, input events that Firefox was meant to receive). All input events that the mapper sends after begin_sync_state but before finish_sync_state are meant to represent events that are meant to update the state of the devices as result of the change in the environment. In this case, the mapper would send "release K, press Q", and the state of the device would be updated atomically between Firefox losing focus and Dolphin gaining focus. All input events sent after finish_sync_state reflect actual input events that are meant to be sent to Dolphin.
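How the compositor could deduce the intent of an incoming input event can be sketched as a tiny state machine per layer and per environment change. `SyncTracker` and `EventKind` are hypothetical names; the `serial` argument mirrors the `serial` argument that the XML draft attaches to these events and requests.

```cpp
#include <cassert>
#include <cstdint>

enum class EventKind { ForOldEnvironment, StateUpdate, ForNewEnvironment };

// Hypothetical compositor-side tracker, created when begin_update_state(serial)
// is sent to a mapper.
class SyncTracker {
    enum class Stage { AwaitingBegin, Syncing, Done };

public:
    explicit SyncTracker(uint32_t serial) : serial_(serial) {}

    // The mapper sent begin_sync_state(serial). Returns false on a mismatch.
    bool on_begin_sync(uint32_t serial) {
        if (stage_ != Stage::AwaitingBegin || serial != serial_) return false;
        stage_ = Stage::Syncing;
        return true;
    }

    // The mapper sent finish_sync_state(serial). Returns false on a mismatch.
    bool on_finish_sync(uint32_t serial) {
        if (stage_ != Stage::Syncing || serial != serial_) return false;
        stage_ = Stage::Done;
        return true;
    }

    // How to interpret an input event that the mapper sends right now.
    EventKind classify() const {
        switch (stage_) {
            case Stage::AwaitingBegin: return EventKind::ForOldEnvironment;
            case Stage::Syncing:       return EventKind::StateUpdate;
            case Stage::Done:          break;
        }
        return EventKind::ForNewEnvironment;
    }

private:
    uint32_t serial_;
    Stage stage_ = Stage::AwaitingBegin;
};
```

In the Firefox/Dolphin example, events classified as `ForOldEnvironment` still go to Firefox, `StateUpdate` events ("release K, press Q") only change the device state, and `ForNewEnvironment` events go to Dolphin.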

There are some differences between "normal event sending mode" and the "sync mode" which is used between begin_sync_state and finish_sync_state. In sync mode, the events are only meant to update the state of the event device, not to process actual input events.

  • The compositor must discard any transient input events like mouse movements that it receives while in sync mode, because transient events do not affect the state;
  • If the key K is not pressed, then sending "press K, release K" will do nothing because the state transition is "K not pressed → K not pressed". The compositor should discard both events and not relay them to later layers (potentially sending such events may even be deemed a protocol violation; requires more thought);
  • Because the difference in state is the only thing that matters, the ordering between unrelated events does not matter either. If a program were to write "press A, press B" while in sync mode, the compositor is free to reorder that into "press B, press A" when relaying those events to the next layer. This is unlike normal mode, where the compositor must preserve the order of the events;
  • The compositor may (must?) wait until the previous layer has declared finish_sync_state before it sends begin_update_state to the next layers, whereas normal events should be relayed as soon as possible.
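The first three rules can be illustrated with a small accumulator that only looks at the net state change. This is a sketch under assumptions: `SyncWindow` and its methods are hypothetical, only EV_KEY and EV_ABS count as persistent, and no-op sequences are silently dropped rather than treated as a protocol violation.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Numeric values mirror <linux/input-event-codes.h>.
constexpr uint32_t EV_KEY = 0x01, EV_REL = 0x02, EV_ABS = 0x03;

struct InputEvent { uint32_t type; uint32_t code; int32_t value; };
using DeviceState = std::map<std::pair<uint32_t, uint32_t>, int32_t>;

// Collects the events a mapper sends between begin_sync_state and
// finish_sync_state for one device.
class SyncWindow {
public:
    explicit SyncWindow(DeviceState current)
        : before_(current), pending_(std::move(current)) {}

    void on_event(const InputEvent& ev) {
        if (ev.type != EV_KEY && ev.type != EV_ABS) return;  // transient: discarded
        pending_[std::make_pair(ev.type, ev.code)] = ev.value;
    }

    // Net changes to relay to the next layer. They come out ordered by
    // (type, code) rather than by arrival, which sync mode permits.
    std::vector<InputEvent> finish() const {
        std::vector<InputEvent> out;
        for (const auto& [key, value] : pending_) {
            auto it = before_.find(key);
            const int32_t old_value = (it == before_.end()) ? 0 : it->second;
            if (value != old_value) out.push_back({key.first, key.second, value});
        }
        return out;
    }

private:
    DeviceState before_;   // state when sync mode started
    DeviceState pending_;  // state after the events seen so far
};
```

With K held down beforehand, a mapper that sends "mouse move, press A, release A, release K, press Q" ends up relaying just two events: the mouse movement is dropped and the A press/release pair cancels out.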

@KarsMulder
Author

Reserved.

@KarsMulder
Author

KarsMulder commented Nov 29, 2023

Here is the protocol in the standard XML format that Wayland protocols use. Compatible with wayland-scanner:

<?xml version="1.0" encoding="UTF-8"?>
<protocol name="input_mapper">
  <interface name="im_input_mapper" version="1">
    <description summary="core global object">
      The global object that is defined for all input-mapping supporting
      compositors. Its task is handing out layers to input mappers.
    </description>

    <request name="create_layer">
      <description summary="creates a layer">
        A new mapper application first needs to define a layer where it wants to
        act. It will be able to see all devices created at lower layers and be
        able to receive events from them (if requested). All events generated at
        this layer will be visible to applications listening at higher layers.

        Only one application can open a layer at a particular depth at a time.
        TODO: Figure out how to report having failed to create a layer, or figure
        out another way of handling multiple applications opening a layer at the
        same depth.
      </description>
      <arg name="layer" type="new_id" interface="im_layer" summary="the newly allocated layer"/>
      <arg name="depth" type="uint" summary="depth of the new layer"/>
    </request>
  </interface>

  <interface name="im_layer" version="1">
    <description summary="event processing layer">
      A single layer in the event processing chain. Events generated at lower
      layers are visible here, events generated here are visible at higher levels.
      
      Not only events can be mapped, input devices can be created and have their
      properties changed as well. When the properties of an input device get
      changed, the changes are only visible in higher layers.
    </description>

    <!-- Functionality for keeping track of the input devices. -->
    <event name="new_device">
      <description summary="a new input device appeared">
        Tells you that a new input device has appeared. After receiving this event,
        a stream of zero or more device_property events will follow. The whole
        sequence is enclosed in a begin_update_state/finish_update_state pair, to
        which the client must respond with begin_sync_state and finish_sync_state.

        TODO: I think those devices should have their own id, i.e. the type of the
        device should be new_id&lt;im_device&gt; instead of uint. That makes the protocol
        more complicated so I need more time to integrate that.
      </description>
      <arg name="device_id" type="uint" summary="the input device that just appeared" />
    </event>

    <event name="device_remove">
      <description summary="an input device is no longer available">
        Tells you that an input device is no longer available, for example
        because its cable was unplugged.
      </description>
      <arg name="device_id" type="uint" summary="the input device that got removed" />
    </event>

    <event name="device_property">
      <description summary="a property of a device appeared or changed">
        Tells you either a property of a device you didn't know, or tells you that
        a property of a device has been changed.

        A stream of one or more device_property events will always be followed up
        by a finish_update_state event. The changed properties become definitive
        after finish_update_state has been sent.
      </description>
      <arg name="device_id" type="uint" summary="the input device this property belongs to" />
      <arg name="property" type="string" summary="the name of the property" />
      <arg name="value" type="string" summary="the value of the property" />
    </event>

    <!-- TODO: add requests to modify the properties of devices. -->
    <!-- TODO: the current draft "devices on later layers are assumed to have the
               same properties as those on earlier layers unless changed" is somewhat
               different from how events work. On the other hand, you get informed
               of device properties whether you want or not, so maybe they deserve
               to be treated differently. This needs more thought. -->

    <!-- Functionality for dealing with events -->
    <request name="listen_device">
      <description summary="listen to a new device">
        Does two things at once: tells the compositor to send all events on this
        device on this layer to you, and removes those events from the event
        processing stream.

        After you listen, you will receive a begin_update_state event and thereafter
        a stream of input events representing the current state of this device, e.g.
        which keys are pressed, where the absolute axes currently are, etc. Those
        events will be followed up by a finish_update_state event.

        You should reply with begin_sync_state, then map those events to what you
        believe the current state of this device should've been if your mapper had
        been active all along. Then you must respond with a finish_sync_state.

        Input events you send with send_input_event after you send begin_sync_state
        but before you send finish_sync_state, will not be written to the input
        stream. Instead, the compositor will look at the difference between the
        current state and what it should've been according to the events you sent.
        The compositor then sends events to later layers to resolve the difference.
        
        For example, suppose that key A was down before you listened to a device. The
        previous events get sent in and you send out one event: B down, followed up by
        finish_sync_state. The compositor then sees that the state changed
        from (A down, B up) to (A up, B down) and will generate two events to bridge the
        difference: A up, B down. These two events get sent to later layers as if they
        were common input events.

        All input events you write after finish_sync_state will be sent to later layers
        following the usual input stream process.

        After that protocol has been completed, all input events from this device
        that hit this layer will be sent to you, and they will be removed from the
        input event stream unless you write identical events back to the input event
        stream with send_input_event.

        TODO: should "open" and "grab" be separate actions? You'd think so, but since
              Wayland is asynchronous, it is hard to properly order events when they're
              not grabbed...
        TODO: what should happen with listen_device requests that happen in the middle
              of a begin_update_state/finish_update_state window? Do the device events
              get added to the same window, or to a new one?
      </description>
      <arg name="device_id" type="uint" summary="the device you want to listen to" />
    </request>

    <request name="unlisten_device">
      <description summary="undoes listen_device">
        Ungrab a device and no longer receive events from it.

        After unlistening to a device, the compositor shall look at the difference
        between the state of the device before and after this layer. It shall then
        generate events to make the state after this layer match up with what the
        state before this layer was. E.g. if on this layer the state from "key K
        released" changed to "key K pressed", then the compositor shall send a
        "key K up" event to the next layer upon receiving an unlisten request.

        TODO: requires more thought on semantics and state synchronisation. What
              should happen if an unlisten request was sent in the middle of some
              other request, such as synchronizing a changed active surface?
      </description>
      <arg name="device_id" type="uint" summary="the device you want to unlisten" />
    </request>

    <event name="input_event">
      <description summary="input event arrived">
        An input event has arrived from one of the lower layers on one of the devices
        that you're listening to. The input event has been removed from the event
        processing pool, but can be added back to it by sending a send_input_event request.

        TODO: Not much thought has been put yet into what information this event exactly
              contains. Potentially we should split it into multiple kinds of events like
              "keyboard_event", "pointer_event", "touchpad_event"...

        TODO: I kinda want to associate unicode codepoints (or even grapheme clusters!)
              to events, but I'm not sure if that's really feasible taking integration
              with existing protocols into account...

        TODO: should we attach a "source"/"channel" to each event to avoid multiple mapper
              scripts interfering with each other. E.g. if two scripts press K and then one
              releases K, is it acceptable that the second K is released too?
      </description>
      <arg name="device_id" type="uint" summary="the device that generated this event"/>
      <arg name="type" type="uint" summary="the evdev type of this event"/>
      <arg name="code" type="uint" summary="the evdev code of this event"/>
      <arg name="value" type="int" summary="the evdev value of this event"/>
    </event>

    <request name="send_input_event">
      <description summary="send an input event">
        Adds an input event to the event stream at the first layer that comes after
        the current layer. Can be used to spontaneously generate events, or to map
        or pass on events you received from devices you're listening to.
      </description>
      <arg name="device_id" type="uint" summary="the device that generated this event"/>
      <arg name="type" type="uint" summary="the evdev type of this event"/>
      <arg name="code" type="uint" summary="the evdev code of this event"/>
      <arg name="value" type="int" summary="the evdev value of this event"/>
    </request>

    <!-- Functionality for changing the active map based on which application has focus. -->
    <enum name="device_type">
      <description summary="input device type">
        The type of input device: keyboard, pointer, or touch.
      </description>
      <entry name="keyboard" value="0" summary="keyboard device" />
      <entry name="pointer" value="1" summary="pointer device" />
      <entry name="touch" value="2" summary="touch device" />
    </enum>

    <event name="enter_surface">
      <description summary="a surface received focus">
        A surface is about to receive focus from either the keyboard or pointer.
        All layers are informed about this before wl_keyboard::enter or wl_pointer::enter
        is sent to the target surface. A begin_update_state event has been sent before this
        event, and a finish_update_state event will be sent after it.

        All input events sent between sending the corresponding begin_sync_state
        and finish_sync_state requests will alter the state of the devices, but will
        not generate any kind of input event sent to any application except later layers.
        Input events you write after sending begin_sync_state and before
        finish_sync_state will be reported to later layers between their
        begin_update_state and finish_update_state events.

        TODO: should we have an event for leaving surfaces too?
        TODO: maybe we should only send these events to layers that have asked to
              deal with these events instead of bothering all layers with them?
        TODO: do we want a way to send events to surfaces that do not have focus?
      </description>
      <arg name="surface" type="uint" summary="something to identify the surface; not sure what type this should be"/>
      <arg name="focus_type" type="uint" enum="device_type" summary="from which device focus was received"/>
    </event>

    <event name="begin_update_state">
      <description summary="the environment is about to change">
        The state of the environment and/or event devices is about to be atomically
        changed. After you receive a begin_update_state event, zero or more of the
        following events will follow:

        * new_device
        * device_remove
        * device_property
        * enter_surface

        After those events, zero or more persistent input events may follow, which
        should tell you the difference between the current state as you knew it so far,
        and what the state should be after the changes are committed.

        After receiving a begin_update_state event, you must respond with a begin_sync_state
        request. After the begin_sync_state request, you are free to send out other requests
        to alter device properties or new persistent events.

        You can also write persistent input events between the corresponding begin_sync_state
        and finish_sync_state requests. Those events shall be interpreted as updating some kind
        of state of the device, rather than actual input events.
        
        Depending on the nature of what triggered begin_update_state, the input events you
        write after writing begin_sync_state may either:

        1. Be sent to future layers between their begin_update_state and finish_update_state
           events.
        2. Decide what the state after this layer should be. That state may be compared
           to some other state, and then the events that form the difference between those
           states are sent to later layers. (TODO: This sentence is incomprehensible.)
      </description>
      <arg name="serial" type="uint" summary="used for the corresponding begin_sync_state request"/>
    </event>

    <event name="finish_update_state">
      <description summary="the environment has changed">
        Sent after all state changes have been sent in after a begin_update_state event.
        The receiver must respond with a finish_sync_state request.
      </description>
      <arg name="serial" type="uint" summary="same as in begin_update_state"/>
    </event>

    <request name="begin_sync_state">
      <description summary="change current state">
        Tells the compositor that all input events sent by this layer between this
        request and the later corresponding finish_sync_state request should
        be interpreted as updates to the current state of the input devices, rather
        than actual input events.

        It is also possible to change device properties between begin_sync_state
        and finish_sync_state. TODO: Implement that
      </description>
      <arg name="serial" type="uint" summary="received from begin_update_state"/>
    </request>

    <request name="finish_sync_state">
      <description summary="current state synchronized">
        Send this in response to finish_update_state. After this request, the
        input events you write will be treated as normal input events again.
      </description>
      <arg name="serial" type="uint" summary="received from finish_update_state"/>
    </request>
  </interface>
</protocol>

@KarsMulder
Author

KarsMulder commented Nov 29, 2023

Here is a sample C++ program that uses the above protocol to swap the A and B keys. Or rather, that's what it would do if any compositor supported the protocol.

#include <wayland-client.h>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <map>
#include <string>
#include "libevdev-1.0/libevdev/libevdev.h"
// protocol.h is to be generated by wayland-scanner
#include "protocol.h"

// I was going to write this in C. Then I figured that C doesn't have an std::map type built in
// and I could not be bothered writing the boilerplate for a hashmap myself. So this code is
// written in C++. Anyway, any function or structure starting with "im_" is part of the new
// Wayland protocol and can be generated by wayland-scanner

// Represents what we know about the input devices available to us.
struct device {
    std::map<std::string, std::string> properties;
    bool is_listening;
};

// The global state of our program. Keeps track of all input devices that exist.
struct my_state {
    std::map<uint32_t, device> devices;
};

int main(int argc, char *argv[]) {
    // Obtain a reference to the im_input_mapper global with the usual boilerplate.
    struct wl_display* display = wl_display_connect(NULL);
    if (!display) {
        fprintf(stderr, "Wayland compositor not found!\n");
        exit(1);
    }
    
    struct wl_registry* registry = wl_display_get_registry(display);
    struct wl_registry_listener registry_listener = {
        .global = register_global,
        .global_remove = unregister_global,
    };
    struct im_input_mapper* input_mapper = NULL;
    
    wl_registry_add_listener(registry, &registry_listener, &input_mapper);
    wl_display_roundtrip(display);
    
    if (!input_mapper) {
        fprintf(stderr, "This compositor does not support event mapping.\n");
        exit(1);
    }

    // End of the boilerplate. Here the real part of the new protocol starts.
    struct my_state state = {
        .devices = std::map<uint32_t, device>(),
    };
    
    // Note: C++20 requires designated initializers to appear in declaration order, which for the
    // listener struct generated by wayland-scanner is the order of the events in the protocol XML.
    struct im_layer_listener layer_listener = {
        // This function gets called whenever a new device becomes visible to us.
        // After the new_device event, a series of zero or more device_property events will trigger.
        .new_device = handle_new_device,

        // Like new_device, but tells you that a device no longer exists.
        .device_remove = handle_device_remove,

        // Tells us either the property of a new device, or that a property of an existing device changed.
        .device_property = handle_device_property,

        // This function gets called by Wayland whenever a new input event is ready for one of the devices
        // we're listening to.
        .input_event = handle_input_event,

        // Tells us that a new surface has received focus.
        .enter_surface = handle_enter_surface,

        // A ping-pong request. Requests that the client acknowledges that it has seen the new device and
        // all its properties that were sent by the server, and already acted on them if it wanted to.
        //
        // The begin_update_state/finish_update_state events are always sent around (before and after)
        // one or more of the following events: new_device, device_property, device_remove, enter_surface.
        //
        // Pretty much like EV_SYN, but for devices. Unlike EV_SYN, the client is required to respond to this.
        //
        // Whenever a begin_update_state/finish_update_state event is sent by the server, normal event
        // processing halts until the client has responded with finish_sync_state. In case that update
        // happened because a new input device appeared, halting the processing of events avoids race
        // conditions wherein a new script creates a new device and writes events before later layers
        // even got to declare that they want to listen to the new device.
        .begin_update_state = handle_begin_update_state,
        .finish_update_state = handle_finish_update_state,
    };
    
    // Creating a layer inserts us at some point in the event mapping chain.
    // We will see devices and events generated by layers with lower priority, and our events will be
    // visible to any mapper with higher priority. I choose priority 0x153f0eda arbitrarily.
    struct im_layer* layer = im_input_mapper_create_layer(input_mapper, 0x153f0eda);
    im_layer_add_listener(layer, &layer_listener, &state);
    
    while (wl_display_dispatch(display) != -1) {
        // The infinite event loop.
        //
        // On the first iteration:
        // After having created a layer, we will receive begin_update_state, then a bunch of new_device and
        // device_property events, followed up by finish_update_state. In handle_device_property(), we will
        // declare that we listen to all devices of the keyboard type.
        //
        // On the second iteration:
        // After having declared to which devices we want to listen and responding with finish_sync_state,
        // we will receive another begin_update_state, followed by the current state of all devices we
        // want to listen to (i.e. all keys that are currently pressed), followed by finish_update_state.
        // We will reply with begin_sync_state, map the events as we always do (i.e. switch KEY_A and KEY_B)
        // and then reply finish_sync_state.
        //
        // On subsequent iterations:
        // We will most likely just receive input_event events which are handled by handle_input_event().
        // Occasionally a new device may appear or an existing device may disappear, which shall be handled
        // just as in the first iteration.
    }
}

// Boilerplate functions for obtaining a reference to the im_input_mapper global through the Wayland protocol.
void register_global(void* data, struct wl_registry* registry, uint32_t name, const char* interface, uint32_t version) {
    // This function has one, and only one, goal: to find the global called "im_input_mapper".
    if (strcmp(interface, im_input_mapper_interface.name) == 0) {
        struct im_input_mapper** mapper = (struct im_input_mapper**) data;
        *mapper = (struct im_input_mapper*) wl_registry_bind(registry, name, &im_input_mapper_interface, 1);
    }
}
void unregister_global(void* data, struct wl_registry* registry, uint32_t name) {
    // I can't be bothered to properly implement this part of the Wayland protocol for this sample script.
}

// This function is pretty simple: it gets called on all events on devices that we're listening to.
//
// Currently I'm using "device id + evdev event" as the format for an event. There probably need to be some
// more details added to it. It might also be a good idea to have different functions for different event types,
// like handle_keyboard_event, handle_pointer_event, handle_touchpad_events. Anyway, let's not get lost in the details.
void handle_input_event(
    void* data, struct im_layer* layer,
    uint32_t device_id, uint32_t type, uint32_t code, int32_t value
) {
    // One thing to keep in mind is that Wayland is asynchronous.
    //
    // There is no real way to distinguish between events that got generated as response to handle_input_event()
    // and events that got generated sporadically, e.g. by a timer. Unless you go out of your way to give
    // each event an ID and include "generated_from=..." as part of each event.
    //
    // For this simple case it doesn't matter, but it affects protocol design for some more complicated
    // situations that require precise ordering.
    if (type == EV_KEY && code == KEY_A) {
        im_layer_send_input_event(layer, device_id, type, KEY_B, value);
    } else if (type == EV_KEY && code == KEY_B) {
        im_layer_send_input_event(layer, device_id, type, KEY_A, value);
    } else {
        im_layer_send_input_event(layer, device_id, type, code, value);
    }
}

// Tells us that a new device with a new device_id exists.
//
// I wonder if it is a good idea to try to reuse the existing `new_id` type of the Wayland protocol for this,
// or not. Right now, I'm treating device_id's as something separate from the usual id system.
void handle_new_device(
    void* data, struct im_layer* layer,
    uint32_t device_id
) {
    struct my_state* state = (struct my_state*) data;
    struct device new_device = {
        .properties = std::map<std::string, std::string>(),
        .is_listening = false,
    };
    
    state->devices[device_id] = new_device;
}

// Tells us that a device_id no longer exists.
void handle_device_remove(
    void* data, struct im_layer* layer,
    uint32_t device_id
) {
    struct my_state* state = (struct my_state*) data;
    state->devices.erase(device_id);
}

// Tell the compositor that we want to map the events from devices if and only if they are keyboard devices.
void handle_device_property(
    void* data, struct im_layer* layer,
    uint32_t device_id, const char* property, const char* value
) {
    // Update our local cache of the device properties.
    struct my_state* state = (struct my_state*) data;
    std::string property_str(property);
    std::string value_str(value);
    
    struct device& dev = state->devices[device_id];
    dev.properties[property_str] = value_str;
    
    // It would be possible to change the device properties here, for example we may want to replace its
    // XKB keymap with another keymap that has more keys. But for now let's assume that we do not want
    // to change the device and the compositor assumes that the device properties remain unchanged unless
    // we actually invoke a function like im_update_device_property().
    
    
    // Listen if and only if it is a keyboard. Listen is pretty much open+grab in one: it makes events
    // reaching our layer get sent to this program, and also removes them unless we write them back.
    //
    // I'm not sure what the proper way to encode property/value pairs would be, so for simplicity I'm going
    // to pretend they're all strings. Probably a bad choice, but let's not get lost in the details now.
    if (property_str != "type") return;
    // Compare the std::string copy; comparing the raw const char* would only compare pointers.
    bool is_keyboard = (value_str == "keyboard");

    if (is_keyboard && !dev.is_listening) {
        im_layer_listen_device(layer, device_id);
        dev.is_listening = true;
    } else if (!is_keyboard && dev.is_listening) {
        im_layer_unlisten_device(layer, device_id);
        dev.is_listening = false;
    }
}

// begin_update_state arrives before, and finish_update_state after, all Wayland events about new,
// changed, or removed devices/device properties. By replying to both with the matching sync requests,
// we acknowledge that we have seen all changes and have acted upon them.
void handle_begin_update_state(
    void* data, struct im_layer* layer,
    uint32_t serial
) {
    im_layer_begin_sync_state(layer, serial);
}

void handle_finish_update_state(
    void* data, struct im_layer* layer,
    uint32_t serial
) {
    im_layer_finish_sync_state(layer, serial);
}

// Tells us that a new surface has received the focus.
void handle_enter_surface(
    void* data, struct im_layer* layer,
    uint32_t surface, uint32_t focus_type
) {
    // Do nothing.
}

@KarsMulder
Author

Reserved.

@KarsMulder
Author

KarsMulder commented Nov 29, 2023

Changelog

  • Added events begin_update_state / finish_update_state and corresponding begin_sync_state / finish_sync_state requests, as a more uniform replacement of a bunch of other single-purpose events and requests;

@KarsMulder KarsMulder changed the title Wayland communication protocol draft, v0.1 Wayland communication protocol draft, v0.2 Dec 2, 2023
@kermitfrog
Owner

I didn't read all the code parts yet. So far it sounds like a good start :)

Input devices have properties, such as:
..
Associated XKB keymap

That would open interesting possibilities, but I worry if it will be difficult for existing UI-toolkits to adapt to this. Maybe simply sending unicode data would be easier.

Whenever a mapper listens to a new device, the compositor will send the mapper the current state of all input events (probably assuming that all keys that do not get sent are released, for efficiency) [..]
The compositor then checks for the difference between the current state and what the mapper thinks the state should've been, and then inserts events "release K, press Q" into the event chain to bridge the difference.

This would solve problems. I hope it can be efficiently implemented.
Would it be enough to only track which states are active (as in key is pressed)? That should be more efficient than keeping state for every event.

Switching between active windows
Whenever the focus changes to a new surface, all layers receive the events begin_update_state, enter_surface and finish_update_state

At first I wondered why you bother with layers for this. A protocol that mappers (as well as every other interested application) can subscribe to and then change the configuration accordingly seemed sufficient.

But it might make sense if you send information about states (e.g. keys) that are currently triggered, so the mapper knows which keys to release/press.

There is also the question of "What is a change to surface?" or rather "What triggers may a mapper want to react to?".
Inputmangler can (with help from kwin and dbus) react to changes in the window class as well as the window title. For example this gives me the possibility to have different mappings depending on the website I'm currently visiting.

So I request that changes to the window title be part of the protocol, too. There might be other interesting stuff to react to that is best obtained from Wayland.
A dbus slot for trigger_mapper_update(id) for the user's custom changes might be a good idea.

@KarsMulder
Author

That would open interesting possibilities, but I worry if it will be difficult for existing UI-toolkits to adapt to this. Maybe simply sending unicode data would be easier.

I agree that sending Unicode would be better, but the issue is that the current Wayland input protocol (the one that applications use to talk with the compositor) simply expects an XKB map plus scancodes. I think that that was a mistake during the design of the protocol, but it's probably hard to fix now.

The workaround would be for the input_mapper module to translate scancodes into Unicode characters, feed those to the mapper scripts, and at the end translate those Unicode characters back to scancodes again. This does have some drawbacks though.

First, it may not always be possible to translate from Unicode back to scancodes. For example, if Unicode gets sent, then "lowercase a" and "uppercase A" become two different characters. What is the compositor supposed to do if an application wants to send "lowercase a" while the user is holding the shift key? None of the keys on the standard XKB layout will correspond to "lowercase a" while shift is held. Should the compositor send some different grapheme (uppercase A) than requested? Should it briefly release and repress shift? Should it modify the XKB layout to add an unconditional "lowercase a" button to it?

Second, it does not necessarily make writing scripts easier. For example, if a mapper now wants to press the shift key, it must not only map to the shift event, but it is now also responsible for turning all "lowercase a" graphemes into "uppercase A" graphemes. Similarly, if a script wants to map the key K to Q, it must now apply two maps: "k→q" and "K→Q", otherwise either the lowercase variant or the uppercase variant would get skipped over.

(Also, the uppercase-lowercase character pairs depend on your locale. Have fun!)

More fundamentally, it adds another layer of abstraction, increasing the difference between how mappers handle input and how applications handle input. Mapper scripts can add this abstraction themselves by using a library, but it is harder to remove that layer of abstraction in case a mapper script really needs fine-grained control. For example, a script might start fighting against the compositor because it really wants to use a specific scancode but the compositor decided on a wholly different scancode↔Unicode map.

The current solution
That said, it is not like the approach of letting mappers update the XKB layout is the ideal solution either. Properly updating the layout is a task that can only be reasonably accomplished by using a library, much like how you're probably not going to map evdev events without relying on libevdev.

But more fundamentally, the current Wayland setup is mostly designed to have only a single XKB layout active at once. If different keyboards have different layouts, the compositor is allowed to switch the active map based on which keyboard was last typed on, but that has its obvious drawbacks as well. Libinput doesn't seem to actually do that, so I doubt many compositors go that far either. I think this is another big mistake in the design of the Wayland input protocol.

But either way, if we go with the "the XKB layout is a property of the input device", then which layout should the compositor use when multiple devices end up with different layouts?

That question could be answered by "pick one arbitrarily" or "pick the biggest one", and then scripts could deal with it by always making sure that they modify all keyboard layouts in the same way, but that is still a non-ideal solution.

I could change the protocol to make the active XKB keymap a global property instead of a device-dependent property, but then we'd be further entrenching the current status quo and would actually say that at a Wayland protocol level, there must be a single keymap that applies to all input devices, rather than merely "to keep the implementation simple, the compositors have decided to assume that there is only a single keymap for all input devices."

Tl;dr: the Wayland keyboard input protocol is causing trouble no matter from which angle I approach this issue.

Would it be enough to only track which states are active (as in key is pressed)? That should be more efficient than keeping state for every event.

Yes, that should be possible.

(On a side note, I've read a bit of the source code of libevdev and libxkbcommon, and both of them decide to keep track of the states in a dense array, rather than a sparse one. They can get away with that because there are only a thousand-or-so keys, so at one bit per key that takes like 128 bytes of memory. If we want to keep track of all 2^21 Unicode characters, it will have to be done sparsely. But that's an implementation detail.)
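
To make the dense/sparse distinction a bit more concrete, here is a minimal C++ sketch. It is only an illustration under assumptions: the names DenseKeyState, SparseKeyState, KeyEvent and bridge are made up for this example, 1024 is an assumed upper bound on key codes, and bridge imitates the "release K, press Q" bridging quoted above by emitting all releases before all presses.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Dense tracking: one bit per key code. 1024 is an assumed upper bound for
// this sketch, which works out to 1024 / 8 = 128 bytes of state.
constexpr std::size_t kKeyCount = 1024;

struct DenseKeyState {
    std::bitset<kKeyCount> pressed;

    void set(uint32_t code, bool down) {
        assert(code < kKeyCount);
        pressed.set(code, down);
    }
    bool is_down(uint32_t code) const {
        return code < kKeyCount && pressed.test(code);
    }
};

// Sparse tracking: only remember what is currently pressed. This is the
// shape that would still work for a huge key space such as Unicode.
using SparseKeyState = std::set<uint32_t>;

struct KeyEvent {
    uint32_t code;
    int32_t value;  // 1 = press, 0 = release
};

// Events needed to get from `actual` to `wanted`: releases first, then presses.
std::vector<KeyEvent> bridge(const SparseKeyState& actual, const SparseKeyState& wanted) {
    std::vector<KeyEvent> out;
    for (uint32_t code : actual) {
        if (wanted.count(code) == 0) out.push_back({code, 0});
    }
    for (uint32_t code : wanted) {
        if (actual.count(code) == 0) out.push_back({code, 1});
    }
    return out;
}
```

With actual = {KEY_K} and wanted = {KEY_Q}, bridge returns a release of K followed by a press of Q. Only tracking active states keeps both sets small no matter how large the key space is.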

So I request that changes to the window title be part of the protocol, too. There might be other interesting stuff to react to that is best obtained from Wayland. A dbus slot for trigger_mapper_update(id) for the user's custom changes might be a good idea.

The window title seems to be part of the XDG shell extension protocol rather than the core Wayland protocol.

I don't suppose we'll see a compositor which will implement input mapping but not XDG shell, but who knows? Maybe some future mobile operating system will use Wayland without XDG Shell. It may be prudent to implement it in such a way that "window title" is an optional addition to this protocol rather than a core feature of it.

Relatedly, I think that mappers should only receive enter_surface events if they've actually told the compositor they care about them*, to reduce latency in case zero of the mappers care about whatever the active window is. I should add some way to communicate "I want to be informed about these environment changes" to the compositor, where "change in window title" happens to be one of the changes you can subscribe to, and "this particular dbus event" might be another of them.

*That means that applications that don't care about enter_surface events will not receive the enter_surface event, but that they will still receive the begin_update_state and finish_update_state events if some other layer before them caused the state to change in response to a change in the active surface.

@KarsMulder
Author

Before I go on to write my post below, I'd like to clear up any confusion about why I decided to include active window changes in the protocol itself rather than expecting scripts to rely on some standard dbus protocol. Here are my reasons:

So why is changing the active window part of the protocol anyway?

You could wonder why mapper scripts are not supposed to just listen to that by means of some dbus event, which would also be viable for most purposes, but integrating it in the protocol has two advantages:

  • It ensures that the change of the input mappers is atomic with window changes;
  • It ensures a deterministic order in which the mapper scripts adjust their mappings.

Atomic window changes
To explain the first one. Suppose the user is running some kind of program like Krita, and has a script running that maps a certain foot button to leftctrl, but only when Krita is the active window. For other windows the button does nothing (or whatever it does is irrelevant.)

The user is currently browsing the web and Firefox has the focus. The user then decides to ctrl+click some button in Krita, which transfers the focus to Krita and clicks a button in it with a single action. The user would expect that if the foot button is held during the click, Krita will interpret it as a ctrl+click.

When active window changes are part of the protocol, this can be taken care of. These are roughly the steps that happen:

  • The compositor receives a click signal on the Krita window and sees that the focus is about to change;
  • A mapper script has subscribed to "active window changed" events. This mapper script will now receive the following events: begin_update_state, enter_surface (Krita), finish_update_state, input_event (click)
  • The mapper script will output the following requests: begin_sync_state, send_input (leftctrl), finish_sync_state, send_input (click)
  • The compositor will inform Krita that it has received the focus with a wl_keyboard::enter event after the last mapper script sent finish_sync_state. The wl_keyboard::enter.keys property (and the subsequent wl_keyboard::modifiers event) tell Krita that the leftctrl key has been down already since before Krita received focus.
  • When the subsequent click event is sent, Krita will know that it is to be interpreted as a ctrl+click.

Now, if the mapper script were to update on the basis of a dbus event, it would be likely that Krita already received focus and the click event before dbus managed to reach your mapper and got the ctrl key pressed. This would mean that the first interaction with any program might be interpreted under the keymap that was active for the previous program.

When changing the active window is part of the protocol itself however, the compositor will know when all active mapper scripts have updated their state for the program that is about to receive focus, and ensures that the active maps update seamlessly when switching between windows.

Deterministic order
(Okay, this one is sufficiently obscure that I haven't actually thought of a practical example for it. Here is a theoretical example because I think software should always be correct and not just probably correct.)

Suppose that you have Mapper A which maps a foot button to leftshift for Firefox and to leftctrl for Krita. You also have Mapper B at a later layer which maps shift+K to "X" on Firefox and ctrl+K to "Y" on Krita, also blocking the respective modifier keys. When the focus shifts from Firefox to Krita while the foot button and the K key are held, the desired effect is "release X, press Y", which should of course be processed atomically such that Firefox never saw the X key get released, and Krita thinks Y had been pressed since before it got focus.

If both mappers update independently due to dbus, it would be possible for Mapper B to update first, resulting in "shift+K" getting passed on to Krita, before Mapper A also updates to make that "ctrl+K", which Mapper B then turns into "Y". Krita would get to see the following barrage of events: key X is held since before receiving focus. Release X. Press shift. Press K. Release K. Release shift. Press Y.

When the compositor is in charge of notifying mapper scripts that the active window has changed, it can ensure that it starts with the lowest-layer mapper, and only inform later-layer mappers about changes when the lower layers have replied finish_sync_state and thus sent all events they want to send due to state changes.
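
The Mapper A / Mapper B example can be walked through in a small C++ sketch. Everything in it is hypothetical (the Keys and Mapper types, mapper_a, mapper_b, keys_for_enter, and the use of plain strings as key names). Each call to a mapper stands in for one begin_update_state ... finish_sync_state round trip, and the layers are visited in ascending priority order so that a later layer only ever sees the already-updated output of the earlier ones.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <set>
#include <string>

using Keys = std::set<std::string>;
// Given the newly focused window and the keys held on input,
// return the keys that are held on output.
using Mapper = std::function<Keys(const std::string& window, const Keys& held)>;

// Mapper A: foot button -> leftshift on Firefox, leftctrl on Krita.
Keys mapper_a(const std::string& window, const Keys& held) {
    Keys out;
    for (const std::string& key : held) {
        if (key == "foot" && window == "Firefox") out.insert("leftshift");
        else if (key == "foot" && window == "Krita") out.insert("leftctrl");
        else out.insert(key);
    }
    return out;
}

// Mapper B: shift+K -> X on Firefox, ctrl+K -> Y on Krita, blocking the modifier.
Keys mapper_b(const std::string& window, const Keys& held) {
    std::string modifier, result;
    if (window == "Firefox") { modifier = "leftshift"; result = "X"; }
    else if (window == "Krita") { modifier = "leftctrl"; result = "Y"; }
    else return held;

    if (held.count(modifier) == 0 || held.count("K") == 0) return held;
    Keys out = held;
    out.erase(modifier);
    out.erase("K");
    out.insert(result);
    return out;
}

// Computes the key set the compositor would report in wl_keyboard::enter.
// std::map iterates in ascending key order, so lower-priority layers run first
// and each layer finishes before the next one is informed.
Keys keys_for_enter(const std::map<uint32_t, Mapper>& layers,
                    const std::string& window, Keys held) {
    for (const auto& entry : layers) {
        held = entry.second(window, held);
    }
    return held;
}
```

With Mapper A at a lower priority than Mapper B, focusing Krita while the foot button and K are held yields just "Y" as the held key set. Swapping the two priorities yields "leftctrl" and "K" instead, which is the kind of order dependence the compositor-driven ordering is meant to rule out.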

@KarsMulder
Author

Anyway, no update to the protocol yet. I'm just writing what I'm currently thinking about.

There is one problem that I'm currently having with both the protocol regarding the device properties, and the protocol regarding updating the state based on dbus notifications: the Wayland data types do not include any kind of dynamic type or sum type. With sum type, I mean something like std::variant in C++ or enum in Rust.

Basically, there are a variety of types that a device property might have:

  • String, which is useful e.g. for "Manufacturer: Sony Corp.";
  • Int, which is useful e.g. for "ABS_X_MAX_VALUE: 127";
  • File descriptor, which is the way that XKB keyboard layouts are currently passed on.

For compactness in the protocol definition, I might want to have just a single event and request, e.g. device_property and set_device_property, but then I wouldn't be able to write in standard Wayland format what the type of that property would be. I could do something like this:

<event name="device_property">
    <description summary="a property of a device has changed">
    Informs you that a certain property of a certain device now has a value.
    You might receive this event because you haven't seen that device before,
    because that device's property has changed since you last saw it, or because
    a new property has been added to that device.

    The new value could be an int, a string, or a file descriptor. The "type"
    argument tells you what the type is. The "value" argument contains the
    bytes of the new value, which you will have to interpret according to the
    type argument.
    </description>
    <arg name="device_id" type="uint" summary="the device whose property changed"/>
    <arg name="property" type="string" summary="the name of the property that changed"/>
    <arg name="type" type="uint" enum="datatype" summary="the type (int, string, fd) of the property"/>
    <arg name="value" type="array" summary="the new value of the property"/>
</event>

... but that is pretty stupidly roundabout. If we receive such an event, then libwayland would parse the bytes received from the Wayland socket as semantic data, and that semantic data once again contains plain bytes which would then have to be parsed as semantic data again by the application. It feels like a real abuse of the Wayland protocol.

So the other workaround would be to have several events and requests, one for each datatype: device_property_int, set_device_property_int, device_property_str, set_device_property_str, device_property_fd, set_device_property_fd. This would increase the amount of callbacks that the user needs to define, but does at least not try to put another protocol over the Wayland protocol.

Also, that set of functions makes it possible to use different types for device property values, but not yet for keys. So far I've been assuming that device property names should be strings, but are we sure that a property name like "ABS_X_MAX_VALUE" is best encoded as a string? I'd prefer for that to be a tuple ("max_axis", ABS_X). Now if I want to be able to use tuples as property names, things get even more complicated...
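
On the client side, the per-type callbacks could still be folded back into a single cache. A rough C++ sketch of that idea follows. The callbacks on_property_int, on_property_str and on_property_fd, the FileDescriptor wrapper, and the (name, code) pair used as the property key are all assumptions made for this example; none of them are defined by the draft.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <variant>

// Thin wrapper so that a file descriptor is a distinct type from a plain int.
struct FileDescriptor {
    int fd;
};

// The sum type that the Wayland wire format itself lacks.
using PropertyValue = std::variant<int32_t, std::string, FileDescriptor>;

// Tuple-style property name, e.g. ("max_axis", ABS_X).
// Properties without a meaningful code use 0 in this sketch.
using PropertyKey = std::pair<std::string, uint32_t>;
using PropertyCache = std::map<PropertyKey, PropertyValue>;

// One hypothetical callback per wire type, all feeding the same cache.
void on_property_int(PropertyCache& cache, const PropertyKey& key, int32_t value) {
    cache[key].emplace<int32_t>(value);
}
void on_property_str(PropertyCache& cache, const PropertyKey& key, const std::string& value) {
    cache[key].emplace<std::string>(value);
}
void on_property_fd(PropertyCache& cache, const PropertyKey& key, FileDescriptor value) {
    cache[key].emplace<FileDescriptor>(value);
}

// Consumers dispatch on the stored type with std::visit.
std::string describe(const PropertyValue& value) {
    struct Visitor {
        std::string operator()(int32_t v) const { return "int " + std::to_string(v); }
        std::string operator()(const std::string& v) const { return "string " + v; }
        std::string operator()(const FileDescriptor& v) const { return "fd " + std::to_string(v.fd); }
    };
    return std::visit(Visitor{}, value);
}
```

This keeps the protocol free of a second encoding layer while the script itself still gets to work with one property map. It does not answer how tuple keys should be expressed on the wire.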

The same problem also shows up for custom update triggers. In case the state update protocol starts because the active window changed, you'd want to send at least a string to the application containing the name of the window that is now active, and possibly some more information. But what can we send for custom triggers like a "dbus signal received"? Do those get to send no information whatsoever, or would that also require some way to send a string or int or two strings or whatever along with the custom signals?

@KarsMulder
Author

KarsMulder commented Dec 12, 2023

Yet Another Alternative Approach

The difference between how events are being handled and how state updates are getting handled feels a bit inelegant. I feel like there is some elegant unified solution out there, and I am somehow missing it. Or maybe there's not. Here's what I'm thinking about right now.

Suppose state changes become events
Suppose that all state updates (e.g. new device appeared, device property changed, active window changed) were now turned into persistent events. A new keyboard showing up could result in something like the following events:

// Compositor -> Mapper

Event:
    Type: EV_CONFIG_CHANGE
    Code: EV_NEW_DEVICE
    Value: 700    // 700 is the ID assigned to the new device
    Device: 0     // 0 is the ID of some global virtual device that always exists
                  // and tells you about other devices showing up and disappearing.
    Sequence: 2   // Tells you the how-manyth event this is that you have received.
                  // Wraps back to zero on overflow. Increments by two each time.

Event:
    Type: EV_PROPERTY
    Code: EV_DEVICE_TYPE
    Value: KEYBOARD
    Device: 700   // This event originates from the newly created device
    Sequence: 4

Event:
    Type: EV_PROPERTY
    Code: EV_KEYMAP
    Value: [some fd]
    Device: 700
    Sequence: 6

Event:
    Type: EV_PROPERTY
    Code: EV_MANUFACTURER
    Value: "Dell Computer Corp."
    Device: 700
    Sequence: 8

// Etc.

When a mapper script is already running, it will receive the above information like it receives all other events. When a mapper script just launches, it is informed about the current state of all persistent events, and thus naturally receives the current properties of all devices.

But how do we then ensure proper synchronisation on state changes?
The current begin_sync_state and finish_sync_state mark points at which we're sure that the compositor and the mapper have the same opinion of what's going on. The new approach to that is by adding sequence numbers to all events that the compositor sends to the mapper.

Suppose that the compositor sends these events to the mapper:

// Compositor -> Mapper

Event:
    Type: EV_KEY
    Code: KEY_A
    Value: 1
    Device: 700
    Sequence: 890

Event:
    Type: EV_KEY
    Code: KEY_K
    Value: 1
    Device: 700
    Sequence: 892

In case the mapper script wants to pass these events on as-is, it should write the same events to its Wayland socket with the same sequence numbers. If it wants to replace KEY_K with KEY_Q, it should write a KEY_Q event with the same sequence number as the KEY_K event.

If it wants to map the KEY_K event to Ctrl+Shift+Q, it should write these events:

// Mapper -> Compositor

Event:
    Type: EV_KEY
    Code: KEY_A
    Value: 1
    Device: 700
    Sequence: 890

Event:
    Type: EV_KEY
    Code: KEY_CTRL
    Value: 1
    Device: 700
    Sequence: 891

Event:
    Type: EV_KEY
    Code: KEY_SHIFT
    Value: 1
    Device: 700
    Sequence: 891

Event:
    Type: EV_KEY
    Code: KEY_Q
    Value: 1
    Device: 700
    Sequence: 892

Notice how both Ctrl and Shift have the same sequence number, which is an odd number equal to the sequence of KEY_K minus one. This tells the compositor that both of these events happened before the KEY_K event, but after whatever event came before the KEY_K event.

In case it wants to send some event after the KEY_K event rather than before it, e.g. mapping K to K+R, it should assign the sequence number of KEY_K to the last event that was generated in response to it, so the output in response to receiving KEY_A+KEY_K as above should be:

// Mapper -> Compositor

Event:
    Type: EV_KEY
    Code: KEY_A
    Value: 1
    Device: 700
    Sequence: 890

Event:
    Type: EV_KEY
    Code: KEY_K
    Value: 1
    Device: 700
    Sequence: 891  // Not the same as the received KEY_K event.

Event:
    Type: EV_KEY
    Code: KEY_R
    Value: 1
    Device: 700
    Sequence: 892  // The same as the received KEY_K event.

In case the mapper wants to drop those events, it must send some drop notification with those sequence numbers:

// Mapper -> Compositor

Event:
    Type: EV_DROP
    Code: 0
    Value: 0
    Device: 700
    Sequence: 890

Event:
    Type: EV_DROP
    Code: 0
    Value: 0
    Device: 700
    Sequence: 892

Lastly, sequence numbers are specific for each mapper script. Different mapper scripts may see different sequence numbers attached to the same events. This is necessary in case a mapper outputs an event with an odd sequence number, because the compositor guarantees that the next mapper script will only receive events with unique, even sequence numbers.
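
The numbering rules above boil down to one small rule on the mapper side: the last event written in response to an input carries the input's sequence number, every earlier one carries that number minus one, and an empty response becomes a drop notification. A minimal C++ sketch, where the Event struct and the respond helper are names invented for this example:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

struct Event {
    std::string type;
    std::string code;
    int32_t value;
    uint32_t device;
    uint32_t sequence;
};

// `replacements` are the events the mapper wants to emit instead of `input`.
// Their sequence fields are ignored on input and filled in here.
std::vector<Event> respond(const Event& input, std::vector<Event> replacements) {
    // The compositor only hands out even sequence numbers.
    assert(input.sequence % 2 == 0);

    if (replacements.empty()) {
        // Dropping still has to be acknowledged with the input's sequence number.
        return {Event{"EV_DROP", "0", 0, input.device, input.sequence}};
    }
    for (Event& event : replacements) {
        // Odd number: "after the previous input, before this one".
        // Unsigned arithmetic wraps around, matching the wrap-on-overflow rule.
        event.sequence = input.sequence - 1;
    }
    replacements.back().sequence = input.sequence;
    return replacements;
}
```

Feeding in KEY_K with sequence 892 and the replacement list Ctrl, Shift, Q gives 891, 891, 892, and the replacement list K, R gives 891, 892, matching the listings above.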

Changing the active window with sequence numbers
So, now we no longer have this whole state-update protocol. Instead, when the active window changes, the compositor will send two events:

// Compositor -> Mapper

Event:
    Type: EV_FOCUS_KEYBOARD
    Code: DEACTIVATE_WINDOW
    Value: "Firefox"
    Device: ???  // Maybe device 0. Maybe some other virtual device.
    Sequence: 1128

Event:
    Type: EV_FOCUS_KEYBOARD
    Code: ACTIVATE_WINDOW
    Value: "Dolphin"
    Device: ???
    Sequence: 1130

The mapper must now respond to these sequence numbers. And unless that script wants to break everything, it should just send these events back as-is.

However, it is allowed to send events with sequence number 1129 in between those two events, to indicate that its state has changed when the focus changed. At the end of the last layer, the compositor will discard all events that were sent in between DEACTIVATE_WINDOW and ACTIVATE_WINDOW, considering those events to have been addressed to nothing.

Essentially, the previous begin_update_state and finish_update_state Wayland events just turned into ordinary input events.

Filters
Right now, actual input events, device properties, and even changes in the active window have all been reduced to a single interface: "input event".

That is a lot of input events. And since the compositor needs to maintain a strict order between the events, all events need to be fed into and back from all active mapper scripts. With a few scripts, that could mean something like a millisecond of delay.

To slightly improve efficiency, I propose that we allow mappers to declare filters, where they can choose to ignore certain events. Here are some examples of filters in a provisional scheme that I spent almost 30 seconds of thought on:

// Tells the compositor that you don't care what the active window is.
IGNORE Type=EV_FOCUS_KEYBOARD

// Tells the compositor to ignore all EV_KEY events except KEY_A events from device 700.
IGNORE Type=EV_KEY
UNIGNORE Type=EV_KEY Code=KEY_A Device=700

Or maybe instead of "ignoring" it should be the other way around: everything is ignored by default, except the events you explicitly unignored. I'll think more about that when I have more time.

Anyway, whenever the mapper has told the compositor that it wants to ignore events matching certain criteria, the compositor will assume that the mapper will pass on all those events as-is. Instead of writing those events to the mapper and waiting for the mapper to write an identical event back, the compositor will do that in the mapper's stead, reducing the latency.
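
As a rough sketch of how the compositor might evaluate such filters, consider the Rust below. The `Rule`, `Action` and `mapper_wants` names are made up for this example, and it assumes two things the draft does not specify: that later rules override earlier ones, and that events are sent to the mapper unless some rule says otherwise.

```rust
// EV_KEY and KEY_A use the values from linux/input-event-codes.h.
const EV_KEY: u16 = 0x01;
const KEY_A: u16 = 30;

#[derive(Debug, Clone, Copy, PartialEq)]
struct Event {
    ty: u16,
    code: u16,
    device: u32,
}

#[derive(Debug, Clone, Copy)]
enum Action {
    Ignore,
    Unignore,
}

// A `None` field means "matches anything", like an omitted key in the
// IGNORE/UNIGNORE lines above.
#[derive(Debug, Clone, Copy)]
struct Rule {
    action: Action,
    ty: Option<u16>,
    code: Option<u16>,
    device: Option<u32>,
}

impl Rule {
    fn matches(&self, ev: &Event) -> bool {
        self.ty.map_or(true, |t| t == ev.ty)
            && self.code.map_or(true, |c| c == ev.code)
            && self.device.map_or(true, |d| d == ev.device)
    }
}

/// Returns true if the event must be written to the mapper, false if the
/// compositor may pass it through on the mapper's behalf.
/// Assumption: the last matching rule wins; the default is "not ignored".
fn mapper_wants(rules: &[Rule], ev: &Event) -> bool {
    let mut wanted = true;
    for rule in rules {
        if rule.matches(ev) {
            wanted = matches!(rule.action, Action::Unignore);
        }
    }
    wanted
}
```

Flipping the initial value of `wanted` to `false` would give the "everything is ignored by default" variant mentioned above.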

But why the sequence numbers? Can't the compositor figure out the order of the events based on the order that the mapper wrote them?
Remember that Wayland is an asynchronous protocol. Suppose, for example, that the compositor wrote two events, KEY_A and KEY_C, to the mapper, and the mapper wrote back KEY_X, KEY_Z. Then there are several possibilities for what could have happened:

  1. KEY_A was mapped to KEY_X, and KEY_C was mapped to KEY_Z;
  2. KEY_A was dropped, and KEY_C was mapped to KEY_X, KEY_Z;
  3. KEY_A was mapped to KEY_X, KEY_Z, and KEY_C was dropped;
  4. The mapper dropped both events and decided to generate KEY_X, KEY_Z because of unrelated reasons;
  5. More options!

Now the first three options may look indistinguishable from an end-user perspective, but it starts to matter in case another ignored event happened in between KEY_A and KEY_C. Suppose that the mapper has announced that it would ignore all KEY_B events, and the user actually pressed the sequence ABC, then what would the output be?

Without sequence numbers, the compositor would have no way of knowing whether it has to be XBZ (option 1), BXZ (option 2), or XZB (option 3).

This is why sequence numbers are important: as soon as the mapper writes an event with the same sequence number as the KEY_A event back to the compositor, then the compositor knows that KEY_B will follow immediately after that event.
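
The ABC example can be sketched in a few lines of Rust. This is a batch simplification (a real compositor would merge the streams incrementally), and it assumes KEY_A, KEY_B and KEY_C carry sequence numbers 1, 2 and 3 and that every event the mapper writes back is tagged with the sequence number of the event it responds to. The `merge` function is a hypothetical name.

```rust
/// Merges the events a mapper wrote back with the events the compositor
/// passed through on the mapper's behalf. Each event is a
/// (sequence number, key name) pair.
fn merge(mapper_out: &[(u64, char)], bypassed: &[(u64, char)]) -> String {
    let mut all: Vec<(u64, char)> = mapper_out.iter().chain(bypassed).copied().collect();
    // `sort_by_key` is stable, so events sharing a sequence number keep the
    // order in which the mapper wrote them.
    all.sort_by_key(|&(seq, _)| seq);
    all.into_iter().map(|(_, key)| key).collect()
}
```

With KEY_B bypassed at sequence number 2, tagging X with 1 and Z with 3 yields XBZ (option 1), tagging both with 3 yields BXZ (option 2), and tagging both with 1 yields XZB (option 3).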


Difficulties:
I haven't figured out how to deal with mapper scripts sporadically generating events that are not in response to other events, such as generating events when a timer expires or a CLI command was entered. Maybe using "sporadic events use the same sequence number as the last sequence number you saw" will just work, but this requires more thought.

Some sort of state abridging will still have to be implemented in the compositor to handle cases of new mapper scripts showing up and disappearing.

In case a mapper script adjusts its filters in response to events, strange things may happen if the compositor already wrote subsequent events to the mapper's input socket.

If the mapper script ignores new events, it may still receive events it already declared it wanted to ignore. If it un-ignores events, it may receive its subsequent events out of order.

If the compositor waits to write the next event to the mapper's socket until the mapper has responded to the last event, performance will measurably degrade.

This could probably be fixed by the mapper script using a wrapper library a la libevdev that automatically reorders events in such cases.

@KarsMulder
Author

KarsMulder commented Dec 19, 2023

Looking at the problem with a higher level of abstraction

Like before, I'm going to use the term "persistent event" for events of types like EV_KEY, EV_ABS, and other such types which are assumed to maintain their last value until a different value gets reported, whereas "transient events" are events like EV_REL for which the last value is of no importance.

I'm going to ignore the existence of transient events for now.

Basically, at each layer and each point in time, there is a certain "state", which encompasses the devices that currently exist, which properties those devices have, and what the last seen value of all persistent events is. Basically the following:

// Language: Rust
struct State {
    devices: Vec<DeviceId>,
    device_properties: HashMap<(DeviceId, PropertyId), PropertyValue>,
    event_states: HashMap<(DeviceId, EventType, EventCode), EventValue>,
}

(Like I said in my last post, you may or may not want to treat device properties as a special kind of persistent event. Whether or not you do so is irrelevant for this post.)

The state can change with time: for example, when the user presses a key on a real keyboard, the state on layer 0 changes to set that key from a "key up" state to a "key down" state.

Layers are basically things that map one state to another state

When a mapper receives a "key down" event, what it really observes is that the state of the previous layer has changed. If it then writes another "key down" event to its output, what it really is doing is changing the state of the current layer. These events are basically a scheme to relay changes of the state without having to copy the entire state each time any part of it changes.

Events are basically a compression scheme
If the state was one big binary blob, then an event might as well be a message "n bytes at offset k just changed to value". Because usually only a small part of the state changes, this is efficient compression: the whole state may be several kilobytes large, but events are just a few bytes.

Sometimes you do have to relay the whole state, for example when a new mapper script starts up, it needs to know the full state of the previous layer. However, even the full state can be relayed by having the mapper script assume a certain default state involving things like "all keys are released, all absolute axes are 0" and then sending a series of events that represent the difference between the default state and the actual state.
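
A minimal sketch of that idea, using a cut-down version of the `State` struct above that only tracks persistent event values. The `sync_events` and `apply` names are hypothetical, and the default state is assumed to be "every value is 0".

```rust
use std::collections::BTreeMap;

// (device, event type, event code) -> last seen value
type EventState = BTreeMap<(u32, u16, u16), i32>;

#[derive(Debug, Clone, Copy, PartialEq)]
struct Event {
    device: u32,
    ty: u16,
    code: u16,
    value: i32,
}

/// The events that turn the assumed default state (every value 0) into
/// `actual`. Entries that already equal the default need no event.
fn sync_events(actual: &EventState) -> Vec<Event> {
    actual
        .iter()
        .filter(|(_, v)| **v != 0)
        .map(|(&(device, ty, code), &value)| Event { device, ty, code, value })
        .collect()
}

/// Applies one event to a mapper's local copy of the state.
fn apply(state: &mut EventState, ev: &Event) {
    state.insert((ev.device, ev.ty, ev.code), ev.value);
}
```

A freshly started mapper would begin with an empty `EventState`, treat missing entries as 0, and call `apply` for each event returned by `sync_events`.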

A change in the active window is basically a single atomic change in the state

Now suppose that the name of the active window was also part of the state somewhere. (Maybe a property of a certain special device.) When the active window changes and doing so causes a key K to be pressed and another key Q to be released, we want a single transition to the state to happen:

  1. Active window name becomes "Firefox", key K is now pressed, key Q is now released.

And we do not want three different transitions to the state to happen:

  1. Active window name becomes "Firefox";
  2. Then the key Q is released;
  3. Then the key K is pressed.

Now if you were to express the first way (single change) and second way (three changes) in events, then both of them would correspond to three events, but semantically they mean something different.

If there is a single atomic change to the state, then Firefox could be informed the key K was already pressed since before it got focus, but with three subsequent changes Firefox would have to be informed that Q was pressed since before it got focus and was quickly released thereafter.

So what did we just learn?
So far I have been distinguishing two different purposes for events:

  1. To send messages about input events the user made or that some mapper wants to make;
  2. To inform mappers about changes in the state due to listening to a new device or due to the active window changing.

The difference between these two is that in the first case, you do want the UI to actively respond to the events, but in the second case you don't.

The fundamental reason that these two look different is because I had been treating each event as a "state update" by itself, and then required additional methods begin_update_state, finish_update_state, begin_sync_state and finish_sync_state for situations where we needed a state update that contains multiple events.

However, instead of seeing each event as a separate state change, it would be much more natural to consider each state change to consist of any number of events; the state changes are the things that happen to which the UI should respond, and the events are merely a compact way of communicating state changes.

And look—this has already been solved elegantly in the evdev protocol with EV_SYN. EV_SYN is currently used to group events that you want to happen atomically: for example, if you tap a part of the touch screen, then you want the active X and Y coordinate to update together, not one after another.

This could be reused for the active-window-changing scenario. If you were to send the event that changes the active window in the same EV_SYN report as the keys that got pressed/released due to that change, then the compositor should be able to see that those presses/releases should happen atomically with the change in focus, and ensure that they happen after the last window lost focus and before the new window gains focus. Problem solved?
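
As a sketch of what "one report is one state change" could look like on the receiving side, the Rust below splits an event stream into reports at each SYN_REPORT. `EV_SYN` and `SYN_REPORT` have their usual evdev values; `split_reports` is a hypothetical helper, and any numeric type used for a focus event would be an assumption, since the draft does not assign one.

```rust
const EV_SYN: u16 = 0x00;
const SYN_REPORT: u16 = 0;

#[derive(Debug, Clone, Copy, PartialEq)]
struct Event {
    ty: u16,
    code: u16,
    value: i32,
}

/// Groups events into reports. Each returned report is one atomic state
/// change; events after the last SYN_REPORT are still pending and are not
/// returned.
fn split_reports(events: &[Event]) -> Vec<Vec<Event>> {
    let mut reports = Vec::new();
    let mut current = Vec::new();
    for &ev in events {
        if ev.ty == EV_SYN && ev.code == SYN_REPORT {
            // Close the current report; empty reports are skipped.
            if !current.is_empty() {
                reports.push(std::mem::take(&mut current));
            }
        } else {
            current.push(ev);
        }
    }
    reports
}
```

A focus change plus the release of Q and the press of K, followed by a single SYN_REPORT, then comes out as one report of three events rather than three separate state changes.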

Unlistened devices / ignored events are basically default maps on the state

I have lumped all devices together into a single state because some libinput features like disable-while-typing treat some events on some devices differently based on what other devices are currently doing, demonstrating that there are practical use cases wherein we cannot treat all devices as independent from each other. If two devices emit two events, we need an order in which those events occurred.

However, being used to the current evdev API, it is pretty counterintuitive to automatically listen to all events on all devices; that would e.g. increase latency on joysticks even if you only wanted to map keyboard keys. As such, I have been trying to think of a way to declare some kind of filter that says "I only want to be informed of events on these devices."

Such a filter is basically a way to tell the compositor:

  1. Do not inform me if the input state changes in the specified way;
  2. If the input state changes in the specified way, change the output state in the same way.

In this example, the "specified way" could be "any event on any device except these devices."

This basically carves a blind spot in the mapper's view of the input state, and delegates control over part of the output state back to the compositor.

So what did we just learn?
In the initial draft, my v0.2 revision, and in the new scheme I proposed last week, listening to new devices or unlistening to devices was significantly more involved than it should have been, requiring stuff like begin_update_state, finish_update_state, begin_sync_state and finish_sync_state.

Now the reason that this is difficult is because a change in the filters actually changes the blind spots of the application. Usually, events are used to relay changes in the state, but in this case they're used for a completely different purpose: the input events do not reflect "the state changed from this to that", but instead reflect "you didn't know the state but it was this all along".

In other words, events are used to encode information about the state, but they do not semantically represent changes in the state.

Now I do not have a solution for an easier way to handle the listen/unlisten protocol, but now I at least have more insight into why the protocol ends up a mess every time I try.

(Frankly, everything would become simpler if I were to remove the ability to listen to only certain devices and required all mapper scripts to relay all events on all devices, whether they want to or not.)

@kermitfrog
Owner

Just a quick note: You might have noticed that I'm not very active lately. That's because of a pile of stuff I need to take care of. January will add more to the pile, so I don't expect to have much capacity left for another 4 weeks or so.

@KarsMulder
Author

KarsMulder commented Dec 30, 2023

You might have noticed that I'm not very active lately. That's because of a pile of stuff I need to take care of. January will add more to the pile, so I don't expect to have much capacity left for another 4 weeks or so.

I'm relieved to hear that; I was worried that the whole project was getting slowed down because of my own tardiness.


My current thoughts:

Device properties
Maybe making device properties changeable was a bad idea after all, because in practice a change in device properties is the only thing that could cause a script not to want to listen to a device anymore, or make a running script want to start listening to a device that already existed.

If device properties were required to remain static, then it would be easy for a script to just declare "I want to listen to all keyboard devices" and we'd never have to deal with the hassle that ensues when a non-keyboard device changes to a keyboard device or vice versa. It really simplifies many things.

In case you want to change the device properties, the same good old method remains: create a virtual device and redirect all events from the old device to the new device.

The problem with the old method is that this makes it harder to configure devices in configuration screens, because the device you want to configure is now "virtual device #n" instead of "Dell Keyboard". This problem could be somewhat mitigated by being able to declare hints like "this virtual device originates from that other device", but that does not solve all issues.

Keyboard layout
Previously I intended to make the keyboard layout a property of the keyboard. However, each active window only has one keyboard layout active at any single point in time anyway. With that in mind, I think that there should instead just be a single global property "active keyboard layout for active window".

If two different keyboards want a different layout, then a certain layer is free to change that active layout property every time it receives an event from each keyboard. E.g. if there is a US keyboard and a German keyboard active, then a layer (either a mapper script or a built-in part of libinput) can set the active layout to "German" whenever the German keyboard presses a key, and set it to "US" whenever the US keyboard presses a key.
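
A layer doing this could be as small as the following sketch. The `LayoutSwitcher` type, the device IDs and the layout names are all hypothetical; how the layer would actually set the global property is left open here, just as it is in the draft.

```rust
use std::collections::HashMap;

/// Tracks which layout each keyboard wants and which one is active.
struct LayoutSwitcher {
    per_device: HashMap<u32, &'static str>,
    active: Option<&'static str>,
}

impl LayoutSwitcher {
    /// Called for every key event. Returns `Some(layout)` if the global
    /// "active keyboard layout" property must change before the event is
    /// forwarded, and `None` if nothing needs to change.
    fn on_key_event(&mut self, device: u32) -> Option<&'static str> {
        // Devices without a configured layout never trigger a switch.
        let wanted = *self.per_device.get(&device)?;
        if self.active == Some(wanted) {
            return None;
        }
        self.active = Some(wanted);
        Some(wanted)
    }
}
```

Returning `None` for repeated events from the same keyboard keeps the layer from emitting a redundant property change on every keystroke.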

Having a global "active keyboard layout" property is better because it lies closer to how the Wayland input protocol actually works, enabling lower level of control over the input. E.g. if the user is dissatisfied with the logic libinput uses for when to switch layout, they will be able to put a mapper script in charge of switching layouts instead of libinput.

(What if the US keyboard and the German keyboard press a key simultaneously? There is no correct answer because the Wayland input protocol is simply not capable of dealing with such situations.)

Gestures
Wayland currently does not have a way to customize gestures, and it is quite imaginable that some users are dissatisfied with the default gestures provided by libinput. So far the above protocol hasn't actually given much thought to gestures, but now I realize that we might actually have a problem here:

  • If the mapper scripts run on evdev events before being processed by libinput, then it cannot benefit from fundamental filters such as button debouncing and palm detection.
  • If the mapper scripts run on libinput events after being processed by libinput, then the events that would result in gestures have already been removed from the stream by libinput, making it impossible for mapper scripts to interpret gestures themselves.

The ideal solution would, once again, be to have a modular libinput where an earlier layer does the essential filtering and a later layer does gesture interpretation, but that would require a significant refactor of libinput and the cooperation of the libinput developers.

Purpose of the protocol
I think I said some time in the past that the reasons we needed a Wayland protocol instead of a Kernel API extension was:

  1. Active-window-dependent mappings;
  2. Mapping to keys not on the keyboard;
  3. AHK-style hotstrings;
  4. Gestures;

Purpose #1 could be roughly achieved by listening to D-Bus signals. It's not perfect because the switch in the mappings will happen slightly later than the switch in focus, so the first event being sent to the new active application will always use the old mapping instead of the new one. This is mostly relevant if you want to map tap/click events to something else. In practice, it probably won't matter that much for most purposes, especially if hovering the mouse over a window already activates the new mappings for pointers.

Purpose #2 could be achieved with a standard protocol for setting the keyboard layout.

Purpose #3 is not even achievable with the protocol I'm drafting right now. It is possible on the IME layer, but the IME layer of the Wayland protocol is currently not composable yet.

Purpose #4 is as mentioned above.

In short, I'm starting to think that I might be looking in completely the wrong direction. I've been trying to find an elegant protocol to map events over Wayland, but in the end I might currently be adding a whole lot of abstraction in order to solve problems that didn't really need all that abstraction, while not solving the actual difficult problems.

@kermitfrog
Owner

Ok.. now I finally managed to read all of this thread.
I might need some time to digest all the information properly..

For now some thoughts:

do we want a way to send events to surfaces that do not have focus?

I already can send scroll events from the mouse to a different window than the one where my keyboard has focus. A possibility to send any event to any window would be nice. But I would not assign it a high priority...


Assuming we actually solve it via a wayland-protocol-extension...

..there are a few questions that need to be figured out at some point:

  • what do we do with devices that are ignored by libinput (/dev/js*)? How to combine them with other devices?
  • If the whole input remapping should be processed in a deterministic order (and I'd say that we absolutely want that), we have to treat it as a pipeline. Events must be processed by mapper A, then B, then C..
    This means there will be a certain delay, which will be longer the more mappers are involved.
    So.. what is the maximum time one mapper / the whole pipeline can have before the compositor decides that it's broken and it has to intervene in some way? What would that intervention look like?

As for sending information about the title: on second thought I think the best solution would be to send a Map<String, String> along with the window change event. The map could be filled with additional Information that either the compositor is configured to send or values sent by the user. That is: the compositor should support a dbus signal that, when triggered would trigger enter_surface(surface=[whatever it was before], focus_type=USER). Mappers could ignore any values that do not apply to them - or even all user events.


I'm relieved to hear that; I was worried that the whole project was getting slowed down because of my own tardiness.

Don't worry - I think you've put a lot of good work into this (more than me). Besides we all have limited capacity and doing this kind of thing right needs a lot of concentrated thought.


Purpose of the protocol
I think I said some time in the past that the reasons we needed a Wayland protocol instead of a Kernel API extension was:

  1. Active-window-dependent mappings;

I agree that the difference of having this as a wayland protocol vs. a DBUS signal probably won't matter that much for most purposes.
It would still be nice if we could extend the wayland protocol to simply offer a listening mechanism for window changes. But we can probably do without the rest.

  2. Mapping to keys not on the keyboard;

For most purposes it might suffice to use Compose as a workaround. Interfacing with an IME would probably be even better.

  3. AHK-style hotstrings;

Assuming I understand these correctly: doing this on the IME layer might be best, but I don't see why it can't be implemented anywhere... sure, it requires withholding input until you're sure whether it is a hotstring or not (and yes: no preview), and it might not be a good idea having multiple mappers doing this at the same time... but I don't see how any problems with this can be avoided by any protocol.

  4. Gestures;

Modular libinput would clearly be the way to go.. At least until some programs decide that libinput is not powerful enough and decide to do that part on their own anyway...
I feel like it may be a good idea not to put too much thought into gestures right now. Things might change here and for now.. maybe there are some possible workarounds? Not sure..

In short, I'm starting to think that I might be looking in completely the wrong direction. I've been trying to find an elegant protocol to map events over Wayland, but in the end I might currently be adding a whole lot of abstraction in order to solve problems that didn't really need all that abstraction, while not solving the actual difficult problems.

Overengineering is a danger that comes with the profession ;P
I'm not yet sure if we need all that complexity or not. Your first approach seems like a pretty good start, but only an actual implementation with lots of testing would tell that for sure :/.

My gut feeling tells me that a double approach might be in order: ask wayland & libinput developers for comments while simultaneously going back to the lower level.
Maybe we can find a solution that covers the most important stuff at the EVDEV level and can later use some kind of IME bridge for the rest.

Or maybe we end up with a mixture of all these..

@KarsMulder
Author

I already can send scroll events from the mouse to a different window than the one where my keyboard has focus.

That's because the keyboard and mouse have separate focuses: mouse focus is determined by wl_pointer::enter and wl_pointer::leave, but keyboard focus is determined by wl_keyboard::enter and wl_keyboard::leave.

what do we do with devices that are ignored by libinput (/dev/js*)? How to combine them with other devices?

Since those devices are not handled by Wayland, I suppose the only real answer is to map them on evdev-level the way you used to: by grabbing the event devices and creating a shadow uinput device.

That does mean that the joysticks do not get to enjoy the composability benefits that this Wayland protocol is supposed to bring. That could be used as argument to favor an evdev-level solution over a Wayland solution: a Wayland solution is only applicable to keyboards, mice and similar, whereas an evdev-level solution would also benefit joysticks.

So.. what is the maximum time one mapper / the whole pipeline can have before the compositor decides that it's broken and it has to intervene in some way? What would that intervention look like?

I think that the compositor should be allowed to kick out mappers that take an unreasonably long time to process events. When any ordinary GUI or CLI application stops responding, the user will eventually kill it. If a mapper script is blocking the user from sending the input commands to kill that script, we may as well automate the process of killing that mapper script.

I think that the protocol should include a way for the compositor to tell the script "I've decided to start ignoring you for whatever reason kthxbai", which may be sent for reasons such as

  • The script taking too long to respond;
  • The user revoked that script's permission to handle input events;
  • The compositor showed the user a message "This program is reducing the responsiveness of your computer by 120ms. Do you want to stop it?" and the user clicked the "yes" button;
  • Any other reason I haven't thought of yet.

I do not think that there is much point in standardizing a maximum time delay because that standard would be non-actionable for anyone trying to implement a standard-compliant mapper. For example, imagine that the standard said "the mapper script MUST respond within one second" or "the maximum average response delay MUST be less than 10ms", then what should somebody trying to write a standard-compliant mapper do to ensure compliance with the standards?

How long a particular operation takes is highly dependent on the speed of the user's system, how many other processes are competing for the same resources, and the mercy of the kernel's scheduler. In the worst case, an otherwise responsive mapper script could gain a latency of several seconds if the kernel decided to swap out some of its memory pages to a hard drive made of spinning rust.

If we were to standardize a maximum, then an especially bad implementation could decide to send a dummy response after 0.9s if no proper response has been formulated yet in order to ensure standard-compliance — but that kind of compliance would absolutely not improve the user experience.

Any kind of maximum time written into the protocol would restrict the freedom of compositor implementations while being pretty much non-actionable for those who write mapper scripts. The best that a script can do is to just try to handle events as quickly as possible, but that would be a good policy regardless of what the protocol standard says anyway.

Therefore, I propose that we do not write any explicit maximum time into the protocol. We should write that scripts are expected to provide timely responses to all events they read (even if their response is just "I discard this event"), but let the compositors decide what the maximum is and how they measure it (maximum delay? average delay? root mean square delay?)
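
To illustrate how that leaves the policy with the compositor, here is a sketch of a compositor-side watchdog. The `Policy` and `Watchdog` types and both thresholds are invented for this example; a real compositor could just as well use a root mean square, a sliding window, or a user prompt.

```rust
use std::time::Duration;

/// How a particular compositor chooses to judge a mapper. Nothing here would
/// be visible to the mapper or written into the protocol.
enum Policy {
    Max(Duration),
    Mean(Duration),
}

struct Watchdog {
    policy: Policy,
    samples: Vec<Duration>,
}

impl Watchdog {
    /// Records how long the mapper took to respond to one event.
    fn record(&mut self, delay: Duration) {
        self.samples.push(delay);
    }

    /// True if the compositor should stop routing events through this mapper.
    fn should_evict(&self) -> bool {
        if self.samples.is_empty() {
            return false;
        }
        match self.policy {
            Policy::Max(limit) => self.samples.iter().any(|d| *d > limit),
            Policy::Mean(limit) => {
                let total: Duration = self.samples.iter().sum();
                total / self.samples.len() as u32 > limit
            }
        }
    }
}
```

The same two samples (5 ms and 120 ms) get a mapper kicked out under `Policy::Max(100ms)` but not under `Policy::Mean(100ms)`, which is exactly the kind of choice that is hard to standardize.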

As for sending information about the title: on second thought I think the best solution would be to send a Map<String, String> along with the window change event. The map could be filled with additional Information that either the compositor is configured to send or values sent by the user. That is: the compositor should support a dbus signal that, when triggered would trigger enter_surface(surface=[whatever it was before], focus_type=USER). Mappers could ignore any values that do not apply to them - or even all user events.

I'm not sure I completely follow this, so could you please clarify this?

  1. Which entity is supposed to be actually sending those D-Bus signals?
  2. Which entity is supposed to be listening to those D-Bus signals?

Assuming I understand these correctly: doing this on the IME layer might be best, but I don't see why it can't be implemented anywhere... sure, it requires withholding input until you're sure whether it is a hotstring or not (and yes: no preview), and it might not be a good idea having multiple mappers doing this at the same time... but I don't see how any problems with this can be avoided by any protocol.

Probably best handled at the IME-layer indeed. The problem is that there can currently be only a single IME active at any time and the IME protocol does not seem to be intended to help multiple IME's cooperate, so if somebody were to write a fancy IME that allows for AHK-style hotstrings and whatever, then Japanese and Chinese users would not be able to benefit from it.

Then again, thinking some more about it, it may be just conceptually difficult for different IME's to work together, since they both assume they have control over the text field. Furthermore, even if it was possible to simultaneously run some AHK-style mapper together with a Japanese IME, a workflow of "first you type characters, then the IME converts it, THEN the hotstring maps apply" would simply be bad UX. Imagine a hotstring that adds the current date like this:

[Step 1: type keys      ] きょうのひづけ
[Step 2: convert by IME ] 今日の日付
[Step 3: apply hotstring] 2月12日

The second step should be skipped. The best UX involves immediately jumping from step 1 to step 3 using the same UI as the rest of the IME uses. In fact, any Japanese IME worth its salt should support custom user-defined dictionary entries; it is traditionally used to make it possible to type names, but I think you could abuse it to at least map some arbitrary fixed input to some arbitrary fixed output.

(Also, the state of the Linux-compatible Japanese IME's is currently so bad that the lack of hotstrings is nowhere near the biggest issue.)

In short, maybe we do not actually need composability of multiple IME's. Maybe we just need an English/Non-CJK language IME that supports hotstrings?

I'm actually starting to feel inspired to try writing one. I'm not sure if it will actually be useful to anyone, but the experience of actually implementing an IME may turn out to be valuable. I'll look into this if I find some time.

My gut feeling tells me that a double approach might be in order: ask wayland & libinput developers for comments while simultaneously going back to the lower level.
Maybe we can find a solution that covers the most important stuff at the EVDEV level and can later use some kind of IME bridge for the rest.

Good idea. I think we should resume the plan of improving the Kernel API for shadowing input devices. If nothing else, it will still be handy for composably mapping joysticks even after we get a Wayland API.

@kermitfrog
Owner

I think that the compositor should be allowed to kick out mappers that take an unreasonably long time to process events. When any ordinary GUI or CLI application stops responding, the user will eventually kill it. [..] Therefore, I propose that we do not write any explicit maximum time into the protocol. We should write that scripts are expected to provide timely responses to all events they read [..]

Ok, point taken.

I'm not sure I completely follow this, so could you please clarify this?

  1. Which entity is supposed to be actually sending those D-Bus signals?
  2. Which entity is supposed to be listening to those D-Bus signals?

  1. The compositor
  2. The mapper

Probably best handled at the IME-layer indeed. The problem is that there can currently be only a single IME active at any time and the IME protocol does not seem to be intended to help multiple IME's cooperate, so if somebody were to write a fancy IME that allows for AHK-style hotstrings and whatever, then Japanese and Chinese users would not be able to benefit from it.
[..]
I'm actually starting to feel inspired to try writing one. I'm not sure if it will actually be useful to anyone, but the experience of actually implementing an IME may turn out to be valuable. I'll look into this if I find some time.

What about a man-in-the-middle type IME that forwards things between one real active IME and the application, and only intervenes to do some extra transformations? I don't know if that is possible, but if so it might be sufficient to enable those features for any IME.


So one of the next steps is to contact the wayland/libinput devs.. I guess the right place would either be the wayland mailing list or the wayland issue tracker on gitlab.freedesktop.org.

The initial message should give an overview of the problem (multiple input mappers are not working together well, see #3) as well as the possible solutions we discussed so far (this issue and the multitude of posts in #2 regarding evdev-level solutions). It should probably link to #3 as well as this thread. Or maybe the whole discussion (including #2)?
The intent is to request comments on our plans so far and invite to participate.

Which important parts am I forgetting? And: am I overthinking this?


While looking at the wayland issue tracker I came across this WIP-MR: https://gitlab.freedesktop.org/wayland/wayland/-/merge_requests/62
It's somewhat related.. and has been around for 4 years without completion.

@KarsMulder
Author

I'm not sure I completely follow this, so could you please clarify this?

[clarification]

Okay, I understand it now. I don't know what the performance implications would be, but it does seem to make sense: compositors can volunteer to send any additional information they want without requiring a new protocol version. A mapper could probe which information the current compositor sends, and if it knows that the current compositor sends a particular piece of information X (e.g. window_title), then the mapper can wait with responding to the enter_surface Wayland event until it receives the window_title D-Bus event, or delay processing all received window_title D-Bus events until it receives a corresponding enter_surface event.

It may be prudent to include an event ID or surface ID in both the enter_surface Wayland event and the corresponding D-Bus events to avoid risk of desynchronisation (i.e. the wrong D-Bus event getting associated with the wrong enter_surface event).

Even if there is a 0% dropout rate on both the D-Bus and the Wayland socket, there is a bit of a tricky situation right at the startup of the script, since there will be a slight delay between the point at which you connect to the Wayland socket and the point at which you connect to the D-Bus socket. If right after the script starts it for example receives both a D-Bus event and an enter_surface Wayland event, then it may be impossible to tell whether these two events correspond to each other, or if the surface changed both right before this mapper launched and right after the mapper launched and you're now receiving a delayed event from the D-Bus.
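
To make the ID idea a bit more concrete, here is a minimal sketch of how a mapper could pair the two event streams. It assumes that both the enter_surface Wayland event and the D-Bus event carry a shared `serial` value; that field, the class name and the method names are all hypothetical and not part of any existing interface.

```python
from dataclasses import dataclass, field


@dataclass
class SurfaceCorrelator:
    """Pairs enter_surface events with D-Bus metadata via a shared serial.

    The `serial` value is an assumption made for illustration only; no
    current Wayland or D-Bus interface provides it.
    """
    pending_enter: set = field(default_factory=set)
    pending_meta: dict = field(default_factory=dict)

    def on_enter_surface(self, serial):
        """Handle the Wayland event; return metadata if it already arrived."""
        if serial in self.pending_meta:
            return self.pending_meta.pop(serial)
        self.pending_enter.add(serial)
        return None

    def on_dbus_metadata(self, serial, metadata):
        """Handle the D-Bus event; return metadata once its enter_surface
        has been seen, otherwise store it and return None."""
        if serial in self.pending_enter:
            self.pending_enter.remove(serial)
            return metadata
        self.pending_meta[serial] = metadata
        return None
```

With such a serial, the startup case stops being ambiguous: a delayed D-Bus event for a surface change that happened before the mapper connected carries a serial that never matches any enter_surface event the mapper receives, so it simply stays unpaired.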

What about a man-in-the-middle type IME that forwards things between one real active IME and the application, and only intervenes to do some extra transformations? I don't know if that is possible, but if so it might be sufficient to enable those features for any IME.

An intermediate layer for the IME already exists. The IME does not usually talk to the compositor directly, but through an intermediary IME framework such as Fcitx 5 or IBus. In fact, there are so many intermediaries available that the Arch Wiki needs a whole table to explain which IME can work with which framework.

Fcitx 5 is an "input method framework with extension support" [source]. In particular, it already appears to have a plugin QuickPhrase that functions like a lightweight AutoHotkey.

This means that we may not really need a composable Wayland protocol for IMEs, since we can just write Fcitx 5 addons or contribute patches to make your favourite IME framework even more extensible than it already is.

Which important parts am I forgetting?

I think the three most important things to post are: (1) what is the problem that needs to be solved, (2) why it must be solved, (3) why we think Wayland is the right place to solve that problem.

We should of course link to the discussion we had here, but I do not expect many (any?) of them to read through everything we've posted here.

I think too many ideas may have flown around to accurately summarize them in the post we make at the Wayland channels, but it may be worth providing a short summary of the major roadblocks we've encountered while trying to solve the problem and posting that on the Wayland channels.

(And "provide a short summary" probably means "I should write a summary", and "should write a summary" probably means "should've written a summary during the past week". I apologize again for my decreased amount of engagement with this issue and thank @kermitfrog for his continued effort.)

While looking at the wayland issue tracker I came across this WIP-MR: https://gitlab.freedesktop.org/wayland/wayland/-/merge_requests/62
It's somewhat related.. and has been around for 4 years without completion.

It's a real pity that this never got merged, because a lot of the issues we're dealing with would become solvable if Wayland didn't require us to have only a single keyboard with a single layout.

The good news is that I don't see any hard "this won't happen" comments on the proposed protocol. I wonder if it is possible to make the proposal get taken seriously again?

@KarsMulder
Author

I've compiled a list of problems that are still difficult or finicky to solve, or "this is why I still do not have a full proposal for a new protocol yet".


The basic idea is that we have a set of mapper scripts, each of which processes the events generated by the previous mapper script:

basic_protocol

Now this is just a simple concept. More involved schemes have been proposed already, but the basic idea of "one mapper script reads the events generated by the previous mapper script" still feels like a good idea. However, there are still problems for which we have no solution or an overly complicated solution, or otherwise points that require consideration when designing a protocol.
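
As a rough illustration of that basic idea, here is a minimal sketch of a mapper chain. The event representation (a key name plus a press/release value) and the helper names `remap` and `run_chain` are made up for this example and do not correspond to any proposed protocol.

```python
from typing import Callable, Iterable, List, Tuple

Event = Tuple[str, int]  # (key name, value): 1 = press, 0 = release
Mapper = Callable[[Event], List[Event]]


def remap(src: str, dst: str) -> Mapper:
    """Build a hypothetical mapper that turns key `src` into key `dst`."""
    def mapper(event: Event) -> List[Event]:
        key, value = event
        return [(dst, value)] if key == src else [event]
    return mapper


def run_chain(mappers: Iterable[Mapper], events: Iterable[Event]) -> List[Event]:
    """Feed the output of each mapper into the next one, in order."""
    stream = list(events)
    for mapper in mappers:
        stream = [out for event in stream for out in mapper(event)]
    return stream
```

With `remap("A", "B")` followed by `remap("B", "C")`, a press of A reaches the application as C; with the two mappers in the opposite order it arrives as B. That is why users need to be able to control the order of the chain.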

Here's a compressed list of problems that are not easy to solve:

Monolithic libinput
Currently, libinput is a single monolithic library which takes as input evdev events and outputs libinput events.

It is important for users to be able to feed the output of one mapper script into another. As such, the best model is probably to have the events from the physical event input devices flow first into Mapper 1, then to Mapper 2, etc., and finally to the application.

At some point between "event device" and "application", the events have to go through libinput. However, libinput currently doesn't fit well before the mapper scripts nor after the mapper scripts.

The first issue is that libinput does several different things at the same time. It simultaneously applies fundamental filters such as button debouncing and palm detection, which you would probably want to apply before doing any kind of event mapping, as well as high-level filters like gesture detection, which you'd probably want to have applied after any kind of event mapping.

Ideally, libinput would be refactored to be more modular, so it becomes possible to insert a script after the fundamental filters like button debouncing yet before gesture detection.

Gesture support
Users wanting to use custom gestures other than the ones provided by libinput is a plausible use case, but this does require us to consider how mapper scripts will communicate gestures.

In case a script wants to handle a gesture similar to a libinput gesture, then we may also need a way to disable a particular gesture detection from libinput.

Event format
The events that are output by libinput are kind of unwieldy. They contain many fields that are redundant with each other, like the same coordinates but represented differently.

Somewhere in the chain from "event device" to "application", events will have to be converted from the evdev format to the libinput format, but it is not clear whether that should happen at the start, the end, or somewhere in the middle.

Having only a single keyboard layout active at any time
This is already causing conceptual trouble before we introduce any mapper scripts. If the user has both a French and a German keyboard connected to their computer, then there is simply no way to properly handle the case where the user is pressing a key at both keyboards simultaneously.

I suppose that this situation was deemed acceptable because it was considered unlikely that a user would ever actually press keys on multiple keyboards at the same time, but when virtual input devices become involved, the situation becomes pricklier.

Does each virtual input device get its own keyboard layout, or are they supposed to send scancodes according to whatever is the currently active layout of the system?

If each device gets its own layout, then that could lead to problematic switching of layouts, for example a foot button trying to execute the sequence Ctrl+S on a QWERTY layout while the user is typing something with their hands on an AZERTY layout.

If each virtual device mimics the currently active layout, then the implementation of both the mappers and the mapping protocol becomes more complicated, especially if we want to properly handle cases where a virtual keyboard is holding a key while the active keyboard layout changes.

Mapping to keys not on the active keyboard layout
One of the current limitations of evdev-level mapping is that we cannot map to keys that are not on the current keyboard layout. A proposed solution to that problem is to make it possible for mapper scripts to change the keyboard layout under the Wayland protocol, so they would for example become able to type β by changing the layout to something that includes the β key and then pressing its associated scancode. This would be a pretty decent (though not ideal) solution if each virtual input device could have its own associated keyboard layout. However, the situation becomes once again pricklier when all input devices, whether real or virtual, have to share the same keyboard layout.

Synchronisation between multiple input devices
Some mapper scripts care only about some kind of events, for example one script might turn some keyboard events into other keyboard events, but do nothing to mouse events. In order to avoid mouse events getting delayed by this script, you might consider not sending mouse events to this script at all, but then we do need to think about synchronisation between different input devices.

Within each individual input device, the compositor should obviously maintain the order of the events, e.g. if "ABC" gets typed, then the events A, B, and C should be relayed in that order, and not in the order "A, C, B".

We should consider whether we also want to consistently maintain a global order of events. For example, if the combined event stream from the keyboard and mouse is "[ctrl] [left click]", would it be permissible for the compositor to reorder that to "[left click] [ctrl]"?

If not, then we do need additional synchronisation primitives. For example, if a script has elected to map the events of the keyboard but not the mouse, and the user does "[ctrl] [left click]", then the event "[left click]" cannot be processed further until the mapper script has decided how to handle the [ctrl] event, so the mapper needs some way to tell the compositor "I have handled the [ctrl] event".
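
A minimal sketch of what such a primitive could look like on the compositor side, assuming a global order must be kept. The class `OrderedRelay` and its methods `submit` and `ack` are hypothetical names; the "I have handled this event" message is modelled as the `ack` call.

```python
from collections import deque


class OrderedRelay:
    """Keeps a global event order while only some devices go through a mapper.

    Events from unmapped devices are held back as long as an earlier event
    is still waiting for the mapper's acknowledgement.
    """

    def __init__(self, mapped_devices):
        self.mapped_devices = set(mapped_devices)
        self.queue = deque()   # entries in arrival order
        self.delivered = []    # what the application would receive

    def submit(self, device, event):
        """Accept a new event; return True if it was handed to the mapper."""
        needs_mapper = device in self.mapped_devices
        outputs = None if needs_mapper else [event]
        self.queue.append({"event": event, "outputs": outputs})
        self._flush()
        return needs_mapper

    def ack(self, event, outputs):
        """The mapper reports which events replace `event` (possibly none)."""
        for entry in self.queue:
            if entry["event"] == event and entry["outputs"] is None:
                entry["outputs"] = list(outputs)
                break
        self._flush()

    def _flush(self):
        # Deliver from the head of the queue until an unacknowledged event.
        while self.queue and self.queue[0]["outputs"] is not None:
            self.delivered.extend(self.queue.popleft()["outputs"])
```

In the "[ctrl] [left click]" example, the click stays in the queue until the mapper calls `ack` for the ctrl event, after which both are delivered in their original order. The cost is that a slow mapper delays events from devices it never asked for.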

On a side note, for some features such as palm detection, it is important for scripts to be able to listen to multiple devices simultaneously.

(Many parts of the protocol would become simpler if all mapper scripts had to process all events from all input devices, whether they care about those events or not. But that would result in lower performance.)

Active-window-dependent mappings
Last but not least, it is common for users to want to use a different mapping for each application. The question is, how do you let scripts change the active mapping depending on which window is active?

There are various models that can be used for this, such as informing scripts that the window has changed, running a separate instance of each script for each window, or using a different virtual event device for each window. The first option probably has the least overhead.

It does however require us to figure out how to communicate changes in the active window to mapper scripts. This could be done either over the Wayland protocol itself, or over another communication layer like D-Bus. Using D-Bus has the advantage that there is more freedom in what kind of information the compositor wants to transmit, but we will still need some kind of synchronisation primitives in the Wayland protocol to make sure that the state of the mapper scripts changes atomically with which window (or surface) currently has focus.

Ideally, we'd also have a way for mapper scripts to change their current mapping without generating input events upon a change in the active window, e.g. if the key "A" is being held, the map "A→X" is used for the current window, and the focus changes to a window where the map "A→Y" should apply, then the window that receives focus should ideally see the key Y as being pressed in the wl_keyboard::enter::key event, rather than seeing X as being pressed and then receiving "release X, press Y" events.
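
To illustrate the difference between the two behaviours, here is a small sketch. Mappings are modelled as plain dictionaries from physical key to output key, and the function names are invented for this example; how a real mapper would report its state to the compositor is still an open question.

```python
def keys_on_enter(held_physical, new_map):
    """Ideal case: the keys the newly focused window should see as already
    pressed in its enter event, computed directly under the new mapping."""
    return sorted(new_map.get(key, key) for key in held_physical)


def naive_transition(held_physical, old_map, new_map):
    """Fallback: the extra events a window would see if the mapper could only
    switch mappings by releasing the old key and pressing the new one."""
    events = []
    for key in sorted(held_physical):
        old_key = old_map.get(key, key)
        new_key = new_map.get(key, key)
        if old_key != new_key:
            events.append(("release", old_key))
            events.append(("press", new_key))
    return events
```

For a held "A" with the maps A→X and A→Y, `keys_on_enter` yields just Y for the enter event, whereas the naive approach generates a "release X, press Y" pair that the window receiving focus should ideally never see.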

@kermitfrog
Owner

An intermediate layer for the IME already exists. The IME does not usually talk to the compositor directly, but through an intermediary IME framework such as Fcitx 5 or IBus.

Ah, good to know :)

It may be prudent to include an event ID or surface ID in both the enter_surface Wayland event and the corresponding D-Bus events to avoid risk of desynchronisation (i.e. the wrong D-Bus event getting associated with the wrong enter_surface event).

Hm.. yes, desynchronisation could be a problem.. But what if the mapper ignores enter_surface and only acts on the D-Bus event? If all relevant data (class, title, ..) is in that one event, desynchronisation should no longer be a problem.

Which important parts am I forgetting?

I think the three most important things to post are: (1) what is the problem that needs to be solved, (2) why it must be solved, (3) why we think Wayland is the right place to solve that problem.

(1) & (2): absolutely yes!
(3) is where I'm having increasing doubts.. there is still a lot I need to (re)think here, but I'm wondering if there is a way around this need. Except for the active-window-dependent mappings part - that's definitely something we want in Wayland.

(And "provide a short summary" probably means "I should write a summary", and "should write a summary" probably means "should've written a summary during the past week". I apologize again for my decreased amount of engagement with this issue and thank @kermitfrog for his continued effort.)

Really? My impression is that I'm the one blocking this with weeks of inactivity in between, while you're quickly reacting. In the end, we all have the non-digital-life that tends to cause interruptions ;)

In any case I'm planning to put a stronger focus on this, so we might soon reach that consensus that we once wanted to have before escalating it to all the other relevant people.
My current plan is:

  1. continue to re-examine all the old posts in the other thread (about half done so far) and summarize anything that might have been forgotten by now.
  2. Come up with a solution that will get us as far as possible without changing anything outside evdev.
  3. Reach out to evdev-devs.
  4. Extend 2. to libinput.

Here's a compressed list of problems that are not easy to solve:

And here are some even more compressed comments (based on the goal of making things as simple as possible).

Monolithic libinput & Gesture support

I wonder if it might suffice to simply be able to call individual parts of libinput (e.g. "debounce this button") and maybe configure it to disable some features for specific devices. That shouldn't be too hard to do (I hope).

Event format

As long as we can somehow convert them to evdev events, we should be able to feed them back to an evdev-based mapper. That should take care of most use cases. For the rest (e.g. transforming gestures), mappers need to be adapted anyway.

Having only a single keyboard layout active at any time

Would be nice, but getting support for multiple layouts at the same time might take a while..
In the meantime there might be a partial workaround by means of a fused layout. That is, creating a variant of the first layout, using keycodes that are not present on the keyboard to represent a second layout and mapping those keys accordingly. The number of layouts would be limited, but I think at least 3 should be possible in most cases.
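
A rough sketch of how such a fused layout could be computed. Layouts are modelled as dictionaries from keycode to symbol, which ignores modifier levels such as Shift; the function name and the keycode numbers used with it are only illustrative.

```python
def fuse_layouts(primary, secondary, spare_keycodes):
    """Extend `primary` so every symbol of `secondary` is reachable.

    Returns (fused, translation): `fused` maps keycode -> symbol, and
    `translation` maps each keycode of the second keyboard to the keycode
    a mapper would have to emit under the fused layout.
    """
    fused = dict(primary)
    code_for_symbol = {}
    for code, symbol in primary.items():
        code_for_symbol.setdefault(symbol, code)

    # Only keycodes that the first layout does not use can be repurposed.
    spare = [code for code in spare_keycodes if code not in primary]
    translation = {}
    for code, symbol in secondary.items():
        if symbol not in code_for_symbol:
            if not spare:
                raise ValueError("not enough spare keycodes for the fused layout")
            new_code = spare.pop(0)
            fused[new_code] = symbol
            code_for_symbol[symbol] = new_code
        translation[code] = code_for_symbol[symbol]
    return fused, translation
```

Symbols that both layouts share reuse the existing keycode, so only the symbols unique to the second layout consume spare keycodes. That is what limits how many layouts can be fused.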

Mapping to keys not on the active keyboard layout

Arbitrary codes can be handled by the compose workaround.

Synchronisation between multiple input devices

That is something I need to think about more.. but in the best case it might not actually matter, as processing should be faster than the user can trigger events. Cases I imagine:

  • ctrl+click by user: press ctrl first, then click afterwards. The delay should be long enough for mappers to process it.
  • ctrl+click by mapper: should work as above, if the mapper inserts a delay.
  • mapping the combination of ctrl+click: would trigger something on click while ctrl is held down. If ctrl needs to be delayed to see what the next event is, then I don't think there is anything we can do about it anyway.
