Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rumble-Associated Disconnection #189

Closed
hibes opened this issue May 12, 2020 · 46 comments · Fixed by #202
Closed

Rumble-Associated Disconnection #189

hibes opened this issue May 12, 2020 · 46 comments · Fixed by #202

Comments

@hibes
Copy link

hibes commented May 12, 2020

Describe the bug
After initial connection and the rumble associated, any subsequent rumble sent to the controller seems to cause the controller to hang and subsequent inputs to the controls no longer produce the desired effect.

To Reproduce
Steps to reproduce the behavior:

  • Connect Xbox Elite 2 Wireless Controller
  • Launch a game which causes rumble (Tonight We Riot, Rocket League with Camera Shake enabled)
  • Play until a rumble occurs.
  • Observe loss of input capability.

Expected behavior
A continuation of input capability after the rumble occurs.

System information

$ uname -a
Linux hpop 5.4.0-7626-generic #30~1588169883~20.04~bbe668a-Ubuntu SMP Wed Apr 29 21:00:02 UTC  x86_64 x86_64 x86_64 GNU/Linux
$ dmesg | grep xpadneo
[61183.823216] xpadneo 0005:045E:0B05.000F: fixing up report size
[61183.823219] xpadneo 0005:045E:0B05.000F: fixing up Rx axis
[61183.823220] xpadneo 0005:045E:0B05.000F: fixing up Ry axis
[61183.823222] xpadneo 0005:045E:0B05.000F: fixing up Z axis
[61183.823223] xpadneo 0005:045E:0B05.000F: fixing up Rz axis
[61183.823225] xpadneo 0005:045E:0B05.000F: fixing up button mapping
[61183.824363] LED trigger xpadneo-98:7a:14:1f:c3:82-battery-charging-or-full failed to register (-17)
[61183.824366] LED trigger xpadneo-98:7a:14:1f:c3:82-battery-charging failed to register (-17)
[61183.824368] LED trigger xpadneo-98:7a:14:1f:c3:82-battery-full failed to register (-17)
[61183.824370] LED trigger xpadneo-98:7a:14:1f:c3:82-battery-charging-blink-full-solid failed to register (-17)
[61183.824422] xpadneo 0005:045E:0B05.000F: battery registered
[61183.828575] xpadneo udev: 0005:045E:0B05.000F
[61183.832466] xpadneo udev: ok
[61204.332598] xpadneo 0005:045E:0B05.000F: input,hidraw0: BLUETOOTH HID v9.03 Gamepad [Xbox Elite Wireless Controller] on 5c:80:b6:70:8b:5d
$ sudo find "/sys/kernel/debug/hid/" -name "0005:045E:*" -exec sh -c 'echo "{}" && head -1 "{}/rdesc" | tee /dev/tty | cksum && echo' \;
/sys/kernel/debug/hid/0005:045E:0B05.0010

Additional context
I attempted to search for issues which were explicitly focused on this issue, but the only issues which bring it up merely mention it. I hope this ticket is warranted as a focus explicitly on the rumble issue.

A workaround is to disconnect and reconnect the controller, but this takes a considerable amount of time.

@hibes hibes changed the title Xbox Elite 2 Wireless Controller Xbox Elite 2 Wireless Controller Rumble-Associated Disconnection May 12, 2020
@hibes
Copy link
Author

hibes commented May 12, 2020

Is it possible to add an option to filter out rumble events and to not forward them to the controller? I'd rather use the controller without rumble than have the device disconnect. It could serve as a temporary workaround.

@kakra
Copy link
Collaborator

kakra commented May 12, 2020

This problem is not specific to the Elite controller, it depends on the BT dongle used. I'll leave this open to collect some more feedback but I'm tempted to close it in favor of #171 (which is closed due to being identified as a dongle issue). However, I'm not sure if that's a hardware issue with the dongle or something that's going wrong in the BT driver of Linux. Comparing against Windows could help.

In any case: Yes, you can already filter out rumble events. Just pass disable_ff=3 to the module parameters: It will still evaluate the rumble commands but since no rumble motor is enabled, it won't send any commands to the controller. It should also disable the initial connection rumble. So if done correctly, there shouldn't be a single rumble. It's described in the docs on how to add the parameter.

@kakra
Copy link
Collaborator

kakra commented May 12, 2020

@hibes One idea... Maybe you could run sudo btmon | tee btmon.log to monitor the BT connection until the rumble causes a disconnect. You may want to trim the file to around the point when it disconnected.

While monitoring, I'd suggest to disconnect all other BT devices, and please turn off the controller before starting the monitor so we can capture the connection init, too.

@hibes
Copy link
Author

hibes commented May 13, 2020

This was on a fresh boot, still a pretty enormous file, but there shouldn't be much excess noise and the rumble occurred fairly quickly.

btmon.log

@kakra
Copy link
Collaborator

kakra commented May 21, 2020

Okay, it looks like it disconnected after around 26 seconds, then immediately reconnected but probably the rumble was now playing continuously or stopped playing while the controller still reported input events.

This sent cmd 0x03 (rumble, 2nd byte) to the controller:

< ACL Data TX: Handle 256 flags 0x00 dlen 14              #322 [hci0] 26.510854
      Channel: 65 len 10 [PSM 19 mode Basic (0x00)] {chan 1}
        a2 03 02 0a 0a 14 28 05 05 03                    ......(...      

Immediately after it, the controller semi-disconnected as btmon exits sniff mode (but I can only guess that's what happening, there's no explicit disconnect):

< HCI Command: Exit Sniff Mode (0x02|0x0004) plen 2       #323 [hci0] 26.510866
        Handle: 256
> HCI Event: Command Status (0x0f) plen 4                 #324 [hci0] 26.512915
      Exit Sniff Mode (0x02|0x0004) ncmd 1
        Status: Success (0x00)
> HCI Event: Number of Completed Packets (0x13) plen 5    #325 [hci0] 26.521883
        Num handles: 1
        Handle: 256
        Count: 1
> HCI Event: Mode Change (0x14) plen 6                    #326 [hci0] 26.531916
        Status: Success (0x00)
        Handle: 256
        Mode: Active (0x00)
        Interval: 0.000 msec (0x0000)
> HCI Event: Max Slots Change (0x1b) plen 3               #327 [hci0] 26.534916
        Handle: 256
        Max slots: 5
> HCI Event: Max Slots Change (0x1b) plen 3               #328 [hci0] 26.535925
        Handle: 256
        Max slots: 1
> HCI Event: Mode Change (0x14) plen 6                    #329 [hci0] 26.547922
        Status: Success (0x00)
        Handle: 256
        Mode: Sniff (0x02)
        Interval: 10.000 msec (0x0010)

But eventually btmon entered sniff mode again and we continue seeing input events 0x01 (input report, 2nd byte):

> ACL Data RX: Handle 256 flags 0x02 dlen 60              #330 [hci0] 26.660350
      Channel: 65 len 56 [PSM 19 mode Basic (0x00)] {chan 1}
        a1 01 bb 82 2d 80 7f 7d 2e 85 00 00 00 00 00 00  ....-..}........
        00 00 bb 82 2d 80 7f 7d 2e 85 00 00 00 00 00 00  ....-..}........
        00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        00 00 00 00 00 00 00 00                          ........        

Another rumble sent to the controller and the cycle repeats:

< ACL Data TX: Handle 256 flags 0x00 dlen 14              #331 [hci0] 26.839337
      Channel: 65 len 10 [PSM 19 mode Basic (0x00)] {chan 1}
        a2 03 0c 0a 0a 14 28 05 05 03                    ......(...      
< HCI Command: Exit Sniff Mode (0x02|0x0004) plen 2       #332 [hci0] 26.839350
        Handle: 256
> HCI Event: Command Status (0x0f) plen 4                 #333 [hci0] 26.839983
      Exit Sniff Mode (0x02|0x0004) ncmd 1
        Status: Success (0x00)
> HCI Event: Number of Completed Packets (0x13) plen 5    #334 [hci0] 26.860985
        Num handles: 1
        Handle: 256
        Count: 1
> HCI Event: Mode Change (0x14) plen 6                    #335 [hci0] 26.865985
        Status: Success (0x00)
        Handle: 256
        Mode: Active (0x00)
        Interval: 0.000 msec (0x0000)
> HCI Event: Max Slots Change (0x1b) plen 3               #336 [hci0] 26.866989
        Handle: 256
        Max slots: 5
> HCI Event: Max Slots Change (0x1b) plen 3               #337 [hci0] 26.867988
        Handle: 256
        Max slots: 1
> HCI Event: Mode Change (0x14) plen 6                    #338 [hci0] 26.880991
        Status: Success (0x00)
        Handle: 256
        Mode: Sniff (0x02)
        Interval: 10.000 msec (0x0010)
> ACL Data RX: Handle 256 flags 0x02 dlen 60              #339 [hci0] 27.620570
      Channel: 65 len 56 [PSM 19 mode Basic (0x00)] {chan 1}
        a1 01 bb 82 2d 80 7f 7d 2e 85 00 00 00 00 00 01  ....-..}........
        00 00 bb 82 2d 80 7f 7d 2e 85 00 00 00 00 00 01  ....-..}........
        00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        00 00 00 00 00 00 00 00                          ........        

We are still receiving input:

> ACL Data RX: Handle 256 flags 0x02 dlen 60              #340 [hci0] 27.780595
      Channel: 65 len 56 [PSM 19 mode Basic (0x00)] {chan 1}
        a1 01 bb 82 2d 80 7f 7d 2e 85 00 00 00 00 00 00  ....-..}........
        00 00 bb 82 2d 80 7f 7d 2e 85 00 00 00 00 00 00  ....-..}........
        00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        00 00 00 00 00 00 00 00                          ........        
> ACL Data RX: Handle 256 flags 0x02 dlen 60              #341 [hci0] 28.220648
      Channel: 65 len 56 [PSM 19 mode Basic (0x00)] {chan 1}
        a1 01 bb 82 2d 80 7f 7d 2e 85 00 00 00 00 00 01  ....-..}........
        00 00 bb 82 2d 80 7f 7d 2e 85 00 00 00 00 00 01  ....-..}........
        00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        00 00 00 00 00 00 00 00                          ........        
> ACL Data RX: Handle 256 flags 0x02 dlen 60              #342 [hci0] 28.380716
      Channel: 65 len 56 [PSM 19 mode Basic (0x00)] {chan 1}
        a1 01 bb 82 2d 80 7f 7d 2e 85 00 00 00 00 00 00  ....-..}........
        00 00 bb 82 2d 80 7f 7d 2e 85 00 00 00 00 00 00  ....-..}........
        00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        00 00 00 00 00 00 00 00                          ........        
> ACL Data RX: Handle 256 flags 0x02 dlen 60              #343 [hci0] 28.740750
      Channel: 65 len 56 [PSM 19 mode Basic (0x00)] {chan 1}
        a1 01 bb 82 2d 80 7f 7d 2e 85 00 00 00 00 00 01  ....-..}........
        00 00 bb 82 2d 80 7f 7d 2e 85 00 00 00 00 00 01  ....-..}........
        00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        00 00 00 00 00 00 00 00                          ........        

The controller seems to mostly work until around after 67 seconds when another rumble command is sent:

< ACL Data TX: Handle 256 flags 0x00 dlen 14             #1665 [hci0] 67.008246
      Channel: 65 len 10 [PSM 19 mode Basic (0x00)] {chan 1}
        a2 03 0f 00 00 34 00 ff 00 ff                    .....4....      
< HCI Command: Exit Sniff Mode (0x02|0x0004) plen 2      #1666 [hci0] 67.008262
        Handle: 256
> HCI Event: Command Status (0x0f) plen 4                #1667 [hci0] 67.009723
      Exit Sniff Mode (0x02|0x0004) ncmd 1
        Status: Success (0x00)

It now receives multiple rumble commands in succession and then starts to send bogus event data:

< ACL Data TX: Handle 256 flags 0x00 dlen 14             #1695 [hci0] 67.120267
      Channel: 65 len 10 [PSM 19 mode Basic (0x00)] {chan 1}
        a2 03 0f 00 00 2e 00 ff 00 ff                    ..........      
< HCI Command: Exit Sniff Mode (0x02|0x0004) plen 2      #1696 [hci0] 67.120282
        Handle: 256
> HCI Event: Command Status (0x0f) plen 4                #1697 [hci0] 67.121696
      Exit Sniff Mode (0x02|0x0004) ncmd 1
        Status: Success (0x00)
> ACL Data RX: Handle 256 flags 0x02 dlen 60             #1698 [hci0] 67.123699
      Channel: 65 len 56 [PSM 19 mode Basic (0x00)] {chan 1}
        a1 01 ff ff e0 87 1b 7f 8f 85 00 00 00 00 00 00  ................
        00 00 ff ff e0 87 1b 7f 8f 85 00 00 00 00 00 00  ................
        00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        00 00 00 00 00 00 00 00                          ........        

The bytes 0x01 0xff 0xff don't look right. This is something I've seen before, too. The controller firmware probably crashed now (or at least it is responding like partly crashed). It looks like at least one axis is stuck now. Rumble also probably now longer works correctly at this point.

If you'd watch the logs a little longer you would probably see some axis stuck at particular values, and a few moments later no more inputs are being reported and the controller disconnects, resets after a timeout, then reconnects.

@hibes Can you confirm this observation?

Since our driver actually does nothing else than sending and receiving HID reports, we have absolutely no control about how the Bluetooth protocol works with the controller. I think this is something that has to be fixed in the Bluetooth drivers or the controller firmware. Something about how Linux uses the Bluetooth protocol seems to negatively impact the stability of the controller firmware. The Linux Bluetooth stack is probably very standards compliant, so it's a quirk of how the firmware works. Maybe it can be fixed in the Linux driver stack, maybe not.

It doesn't even look like we are overwhelming the controller with commands, the strange btmon behavior seems to happen out of the blue just on sending 0x03 commands (rumble). As long as it's working flawlessly I couldn't observe that behavior.

Is anyone more familiar with the Bluetooth protocol who may explain why this is happening? Or maybe anyone who could recreate a similar Bluetooth protocol log in Windows? If the issue cannot be reproduced in Windows, maybe we find a difference in how the Bluetooth protocol is handled...

@kakra kakra changed the title Xbox Elite 2 Wireless Controller Rumble-Associated Disconnection Rumble-Associated Disconnection May 22, 2020
@kakra kakra pinned this issue May 22, 2020
@hibes
Copy link
Author

hibes commented May 22, 2020

@hibes Can you confirm this observation?

I've broken down some of the individual assertions and questions that I saw. Let me know if I missed something.

the rumble was now playing continuously or stopped playing while the controller still reported input events.

The latter. I think I didn't have functioning inputs until that point.

you would probably see some axis stuck at particular values

Yep, this stuck axis behavior occurs almost every time, I think it has to do with the state of the joystick when the rumble occurs.

If you'd watch the logs a little longer you would probably see some axis stuck at particular values, and a few moments later no more inputs are being reported and the controller disconnects, resets after a timeout, then reconnects.

In past experiences with the controller, what you're describing is exactly what happened. Did you want a log that shows that? It's not hard to reproduce if it'd be helpful.

@kakra
Copy link
Collaborator

kakra commented May 22, 2020

If you can confirm with any logs that the rumble problem results from specific joystick positions, then I'll look at such logs. But I was able to reproduce the problem with the directional rumble test program in this repository even with the joysticks idle.

Looking at the stale branches of this repository, there seemed to exist an idea that the problem may be related to trigger rumble. Maybe try to disable the trigger rumble.

It would be interesting to see a Bluetooth monitor log of Windows sending rumble to the controller... Maybe the controller doesn't like how we handle some combination of rumble values. I don't think it is related in any way to the joystick position but if you can prove otherwise, I'd be fine with that.

you would probably see some axis stuck at particular values

Yep, this stuck axis behavior occurs almost every time, I think it has to do with the state of the joystick when the rumble occurs.

I've seen cases during tests where the rumble would stuck (either no more rumble at all out of the blue or continuous rumble that doesn't stop) while I was still able to control the game camera and character. But that situation only lasted a few seconds, maybe a minute or two at best, until the controller completely stopped working with the Xbox logo blinking. After around 30 seconds, the controller would reconnect but rumble no longer works. But the sticks and buttons are working again.

@kakra
Copy link
Collaborator

kakra commented May 22, 2020

BTW:

It now receives multiple rumble commands in succession and then starts to send bogus event data:

< ACL Data TX: Handle 256 flags 0x00 dlen 14             #1695 [hci0] 67.120267
      Channel: 65 len 10 [PSM 19 mode Basic (0x00)] {chan 1}
        a2 03 0f 00 00 2e 00 ff 00 ff                    ..........      
< HCI Command: Exit Sniff Mode (0x02|0x0004) plen 2      #1696 [hci0] 67.120282
        Handle: 256
> HCI Event: Command Status (0x0f) plen 4                #1697 [hci0] 67.121696
      Exit Sniff Mode (0x02|0x0004) ncmd 1
        Status: Success (0x00)
> ACL Data RX: Handle 256 flags 0x02 dlen 60             #1698 [hci0] 67.123699
      Channel: 65 len 56 [PSM 19 mode Basic (0x00)] {chan 1}
        a1 01 ff ff e0 87 1b 7f 8f 85 00 00 00 00 00 00  ................
        00 00 ff ff e0 87 1b 7f 8f 85 00 00 00 00 00 00  ................
        00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        00 00 00 00 00 00 00 00                          ........        

The bytes 0x01 0xff 0xff don't look right.

From the HID report descriptor:

0x27, 0xFF, 0xFF, 0x00, 0x00,  //     Logical Maximum (65534)

This tells us that the maximum axis value is 65534, or fe ff in hex. But it clearly returned a value of ff ff to us which is out of bounds. This indicates a crash or instability of the firmware to me. Tho, I'm still not quite sure why the HID report descriptor shows 0xFF 0xFF 0x00 0x00 which is 65535 in signed 32-bit but the parsed value says 65534 (which is one less). The USB HID documentation is not quite clear on that topic. It says nothing about the value being signed or unsigned, it only says that in the report the value can be assumed unsigned if both logical minimum and maximum in the descriptor are positive, so I deduce that the value is unsigned in the descriptor. Thus, if you want to have a 16-bit value above 0x7fff you have to extend it from 16 bit to 32 bit (that's why the report descriptor has 4 bytes at that row and not 2). But the wording of the spec is very complicated and unstructured at many points.

This may actually be a bug in the decoder I'm using for the 0x27 case (logical maximum 32-bit), as the kernel decodes it as 65535. For 16-bit (0x26) and 8-bit (0x25) values it doesn't show this strange off-by-1 effect.

Whatever the case is, the value 65535 is valid for the kernel, so we are having no problem resulting from this. This may indicate that this value is actually no indicator of the firmware stability, still I see values of some axis sometimes stuck just before the controller disconnects.

If you want to do some tests, look for data starting with a1 01 in btmon:

  • a1 (first byte) seems to describe the Bluetooth packet type, I didn't dig into this further, let's assume it says "here's some data from the device"
  • 01 (second byte) is the report ID for an input event, what follows is described in our file xb1s_linux.md in the docs folder:
    • 2 bytes X left stick, unsigned 16-bit
    • 2 bytes Y left stick, unsigned 16-bit
    • 2 bytes X right stick, unsigned 16-bit
    • 2 bytes Y right stick, unsigned 16-bit
    • 2 bytes left trigger, unsigned 10-bits
    • 2 bytes right trigger, unsigned 10-bits
    • 1 byte dpad, 4 bits (one bit per direction)
    • 2 bytes buttons, depends on Windows (10 buttons) or Linux (15 buttons) mode, 1 bit per button
    • 1 byte SELECT button (depends on mode, 1 bit)

The Linux mode is a bit strange with 15 bits mapping to 10 buttons (with some bits obviously dead), that's why I remapped the button bits in the driver so it stays compatible with old-school JS interface and matches Windows mode.

Ignore a1 04 which is a battery report:

  • a1 (first byte)
  • 04 (second byte)
  • 1 byte battery status (but not following the HID spec)

If you are seeing a2 03 that's a rumble report sent TO the device:

  • a2 (first byte) is a Bluetooth packet SENT TO the device
  • 03 (second byte) is the rumble report ID, what follows is also described in our docs:
    • 1 byte motor mask, 4-bits, 1 bit per rumble motor
    • 4 bytes of motor magnitudes, 1 byte per rumble motor, values 0-100%
    • 2 bytes of pulse timing values in 10ms steps: pulse sustain, pulse release
    • 1 byte loop count: how often to loop pulse sustain and release

I'd expect one should see a2 03 pass through just normally but in your logs I'm always seeing the btmon seemingly exiting sniff mode after this. That should probably not happen. That could indicate something going wrong within the Bluetooth layer and we have no control over that as that's happening in a completely different driver. OTOH, I'm always seeing the same behavior but the controller is stable for me (at least for long periods of time, I've seen 1 or 2 disconnects in a series of multiple hours). Someone knowing the Bluetooth stack may explain why btmon says "exiting sniff mode". If that's not meant to happen, there may be a bug in the Bluetooth driver.

What's the value of bcdDevice in lsusb -v for your BT dongle? Mine is 88.91.

@Mikaka27
Copy link
Contributor

Hi,
I'm not sure if it's related, but I'm seeing random disconnects of ordinary Xbox One gamepad.
Whenever this happens (rarely) it behaves like you describe here, the rumble motors are running and some joystick axis is stuck permanently.
I have to restart controller to recover.

I'm using Intel 7260hmw bluetooth (laptop wifi/bt card) connected via adapter to pci express (I'm on desktop PC).

@kakra
Copy link
Collaborator

kakra commented May 28, 2020

@Mikaka27 I hope that kernel 5.7 fixes a few problems for us, it seems to mostly stem from how the controller interacts with the Bluetooth stack. And that's in front of our driver. We cannot do anything about it except maybe try reducing the number of BT packets sent (which I've already done) or flipping the order of some protocol interactions (I was thinking about finding a way to send packets only after receiving some so we won't built up buffering in the controller). I'm pretty sure this is a bug in the controller firmware, and maybe it's something like overwriting internal buffers without locking (it's probably single-threaded but interrupt driven, so locks are still needed). I've once worked on a custom USB HID device with a custom-made firmware maybe 10 year ago, and that had similar issues that went away when the firmware was fixed to do proper locking on the IO buffers for the HID protocol by disabling interrupts. Otherwise the stability of the firmware vastly depended on the amount and order of data you received and sent to the device. That was USB directly, tho. We have the Bluetooth and HID drivers in front of us and cannot really steer the packet order during concurrent receive/sent. Seeing some axes stuck while other still work is usually a strong indicator of a bug in an interrupt-driven firmware. But I could not yet actually see a clear and conclusive pattern to support this observation.

@kakra
Copy link
Collaborator

kakra commented Jun 2, 2020

Please refer to #198 if this is a new issue that showed up only after a kernel update.

@He-Ro
Copy link

He-Ro commented Jun 8, 2020

I also got random disconnects which occur when some rumbling effect occurs.
But it is not reproducible exactly. E.g. it does not occur with the first rumble.

Info:

  • 2 Xbox One controllers connected to internal Bluetooth of laptop
  • Arch Linux 5.6.15-arch1-1

I captured a log with sudo btmon | tee /tmp/btmon.log:
btmon.log.gz

I started the log, then I turned the two controllers on. Then I played a game with both controllers. I ended the log taking shortly after one of the controllers rumbled continuously and did not accept any input. I shutdown the controller by pressing the main button for several seconds. Then I turned it on again. (I am not sure if I ended the log taking before or after re restarting of the controller).

If there is anything else you would need, please let me know.

@kakra
Copy link
Collaborator

kakra commented Jun 8, 2020

Does the second controller still work when the first fails?

@medusalix
Copy link

medusalix commented Jun 8, 2020

I've seen cases during tests where the rumble would stuck (either no more rumble at all out of the blue or continuous rumble that doesn't stop) while I was still able to control the game camera and character.

I'm pretty sure this is a bug in the controller firmware, and maybe it's something like overwriting internal buffers without locking

There's definitely something wrong with the controller's rumble packet handling in the firmware. I ran into a similar issue with my driver for the wireless dongle where the controller vibrates for an unusually long time while ignoring any further rumble commands. I never ran into a case where the input stopped working though. I know this might not be very helpful, just wanted to share my thoughts with you.

@kakra
Copy link
Collaborator

kakra commented Jun 8, 2020

I think we're facing two problems here:

The one is a buggy Bluetooth implementation in the controller (maybe as part of the firmware) and there're probably also bugs in the kernel drivers especially with some of the chipset revisions used in the vast amount of cheap dongles. And usually affects packet flow "on the air".

The other is a firmware bug with the rumble protocol (which may interact with the Bluetooth/firmware part of the first), probably something with interrupt handling and buffer locking within the firmware.

That you're only seeing the second may indicate that the original dongle uses a completely different code path in the firmware.

I figured out lately (with the direct hidraw test program in misc/examples/c_hidraw) that the firmware actually exposes the motor control as independent loops running inside the firmware: We could start one motor with some pulse parameters looping for a long time, then a little moment later start another motor with different parameters and both "programs" will run independent from each other. So the firmware is a lot more complex than it would need to be for a memless ff device.

I wonder how the Windows driver programs the rumble motors. Maybe this would be a clue at finally fixing this? Maybe it's wrong to program all motors in parallel with only one command? Did you by coincidence do some packet traces of this and analyzed them?

@medusalix
Copy link

medusalix commented Jun 8, 2020

We could start one motor with some pulse parameters looping for a long time, then a little moment later start another motor with different parameters and both "programs" will run independent from each other.

Oh wow, that makes it even more likely to hit some kind of race condition while processing rumble packets. Which would make sense because I only ran into that continuous-rumble problem in games that send a lot of force feedback events, thereby increasing the chance of hitting such a case.

I wonder how the Windows driver programs the rumble motors.

If I remember correctly, the Windows driver always programmed all motors with the maximum duration at the same time. Much like XInput's XInputSetState function which doesn't even allow you to specify the duration at all. I can do some additional packet captures if you want, either USB or wireless, they use the same packet format.

@kakra
Copy link
Collaborator

kakra commented Jun 8, 2020

Oh wow, that makes it even more likely to hit some kind of race condition

Yes, I'm suspecting this. BTW: This works by setting the motor enable bit in byte 2 of the rumble packet only for the motors you want to program. Leaving the other bits zero will not stop them, they just continue to play their last command. Only setting the rumble strength to 0 makes the motors stop if you'd also set the enable bit of the motor.

This protocol is really strange: We have those four enable bits, but we also have a strength per motor. It seems like two different engineering teams worked on this and simply smashed their results together.

So if there's a race condition: What could be done to alleviate the problem? I can think of some solutions:

Sending only reprograms for motors that actually had their strength changed since the last command.

Reducing the running time of a command to the absolute minimum needed so the internal looping in the firmware would stop early. In the kernel, we can expect a command every 50ms as this is the resolution of the ff-memless emulation driver sitting in front of us. But I already tried setting sustain to 5, release to 0 and loop to 1 - and that was kinda choppy, the kernel timing isn't quite exact for this.

@medusalix
Copy link

This protocol is really strange: We have those four enable bits, but we also have a strength per motor. It seems like two different engineering teams worked on this and simply smashed their results together.

Yeah, especially considering that Microsoft's driver apparently doesn't even make use of the motor-bits. They might be used in this UWP API though. The article recommends setting all the vibration values at once.

Sending only reprograms for motors that actually had their strength changed since the last command.

That seems like the best solution, if it fixes the underlying problem.

I only rely on the length and delay values of the ff_replay struct for my driver (I can't use ff-memless from user space). This works very well for games that make use of those value, while other games that do the timing internally often cause the controller to get stuck while rumbling.

@kakra
Copy link
Collaborator

kakra commented Jun 8, 2020

I've set sustain to 60ms now, release to 0, loops to 1. Choppiness seems to be gone (as opposed to 50ms). But it'll need some time to show the stability of this as I only rarely had this issue. Also, this hack conflicts with my "optimization" of sending rumble commands only when motors changed: If I do not send another command the next 50ms cycle, the rumble will just stop. But if it proves more stable, sending reprograms only for changed motors may solve this.

I only rely on the length and delay values of the ff_replay struct for my driver (I can't use ff-memless from user space). This works very well for games that make use of those value, while other games that do the timing internally often cause the controller to get stuck while rumbling.

If your driver registers as ff_memless device (I don't know if this is even an option in uinput), you should see length and delay always 0 as the ff_memless driver will emulate all the various effect uploads (square, sawtooth, sine) and emulate them by sending updated constant rumble values every 50ms. If you'd instead opt-in for effect uploads, you'd probably need to implement the emulation itself as the controller only supports square wave rumble. The Linux drivers for XB1S controllers (the one in hid-microsoft and xpadneo) just emulate a constant rumble with a very long sustain time of the square pulse (sustain = U8_MAX, release = 0, loop = U8_MAX, essentially a square wave with infinite sustain period), relying on the ff_memless driver to send us updated values for all the various values resulting from combining all the various sine/square/sawtooth/whatever effects that could play in parallel.

@kakra
Copy link
Collaborator

kakra commented Jun 8, 2020

I don't know if this is even an option in uinput

Oh, I see you just posted that it's not an option. ;-)

kakra added a commit to kakra/xpadneo that referenced this issue Jun 8, 2020
@kakra
Copy link
Collaborator

kakra commented Jun 8, 2020

@medusalix I've taken the opportunity to actually implement this optimization instead of hacking it in - it was super simple after my previous efforts of code redesign. I now see the rumble packets with varying motor enable flags, and as such only sending updates for motors that changed:

# sudo btmon | fgrep --color=auto "a2 03"
        a2 03 03 00 00 01 01 ff 00 ff                    ..........      
        a2 03 03 00 00 00 00 ff 00 ff                    ..........      
        a2 03 07 00 1d 1d 1d ff 00 ff                    ..........      
        a2 03 07 00 63 63 63 ff 00 ff                    ....ccc...      
        a2 03 07 00 3d 3d 3d ff 00 ff                    ....===...      
        a2 03 07 00 23 23 23 ff 00 ff                    ....###...      
        a2 03 07 00 0d 22 22 ff 00 ff                    .....""...      
        a2 03 04 00 0c 22 22 ff 00 ff                    .....""...      
        a2 03 07 00 00 21 21 ff 00 ff                    .....!!...      
        a2 03 03 00 00 20 20 ff 00 ff                    .....  ...      
        a2 03 03 00 00 1f 1f ff 00 ff                    ..........      
        a2 03 03 00 00 1d 1d ff 00 ff                    ..........      
        a2 03 03 00 00 1b 1b ff 00 ff                    ..........      
        a2 03 03 00 00 17 17 ff 00 ff                    ..........      
        a2 03 03 00 00 12 12 ff 00 ff                    ..........      

See #202

Also, I see you adapted the directional rumble into your driver. I've reworked this as direction with FF_RUMBLE seems out of specification of the ff driver design, instead we support a new pressure-based trigger rumble. You may want to adapt it to your driver.

kakra added a commit to kakra/xpadneo that referenced this issue Jun 8, 2020
@kakra kakra reopened this Jun 10, 2020
@medusalix
Copy link

medusalix commented Jun 19, 2020

Now, just one motor may "crash", it spins for a while, then stops.

It's always the left/main motor for me. The right one never crashes.
I've logged the force feedback events while playing a round of Broforce and I can semi-reliably reproduce the problem using the following script:

import time
import evdev
from evdev import ecodes, InputDevice, ff

for name in evdev.list_devices():
    dev = InputDevice(name)
    if ecodes.EV_FF in dev.capabilities():
        break

effect_id = -1
rumble = ff.Rumble(strong_magnitude=65000, weak_magnitude=0)

while rumble.strong_magnitude > 0:
    rumble.strong_magnitude -= 1000
    effect = ff.Effect(
        ecodes.FF_RUMBLE, effect_id, 0,
        ff.Trigger(0, 0),
        ff.Replay(0, 0),
        ff.EffectType(ff_rumble_effect=rumble)
    )
    effect_id = dev.upload_effect(effect)
    dev.write(ecodes.EV_FF, effect_id, 1)
    time.sleep(0.008)
    
for i in range(1000):
    dev.write(ecodes.EV_FF, effect_id, 1)
    time.sleep(0.005)

The first loop iterates over the full magnitude range of the left motor (from high to low).
The second loop tries to stop the rumble by sending zero magnitudes for a short while. The motor locks up and refuses to stop.

I've had to play around with the delays a little bit, but it does the trick for me.

@kakra
Copy link
Collaborator

kakra commented Jun 19, 2020

It's always the left/main motor for me. The right one never crashes.

I didn't really notice that but now that you write it: Yes, it was always the left motor. But maybe only because of parsing direction of the payload within the firmware? IOT, if we "fixed" this problem, maybe the right motor shows the same problem.

But kudos for creating a reproducer script. Will try later.

@medusalix
Copy link

I've gone ahead and captured some rumble packets on Windows. Nothing unusual, expect that the magnitudes indeed only ever reach 100 and not 255 (as you've rightfully noticed in your comment). I'll have to fix that for xow.
Interestingly, there seems to be a minimum delay of 10 milliseconds between each transmission of a rumble packet. Once I had implemented that in my driver, the motors did no longer lock up. I guess that's the only thing you have to pay attention too, I can't reproduce this problem anymore.

@kakra
Copy link
Collaborator

kakra commented Jun 19, 2020

In theory, the kernel ff-memless driver only gives us one rumble command per 20ms... I'll try to log it. Thanks for analyzing.

@kakra
Copy link
Collaborator

kakra commented Jun 20, 2020

@medusalix Okay, testing shows that ff-memless calls us much more often than it is supposed to do given its 50ms time resolution. But maybe the timing is just a little off more often than not due to scheduling issues. I'll be pushing a branch with that throttles access by deferring too fast programming into the future, and only programming the final state of that period. It's been working fine so far. Thanks a lot for your suggestion.

kakra added a commit to kakra/xpadneo that referenced this issue Jun 20, 2020
Thanks to @medusalix great findings that the controller crashes are
most probably indeed a timing issue, this commit introduces a
modified rumble protocol which throttles rumble commands sent to the
controller. Also, it seems we cannot rely on `ff-memless` to only send
us commands once per 50ms although that's the proposed time resolution
of the protocol. It may depend on kernel HZ setting.

This commit migrates the rumble worker into a delayed worker. When
programming a rumble motor, we now remember the timing and allow the
next reprogramming only 10ms+ in the future. Further requests will be
buffered and the final command will be replayed to the controller when
the delayed worker finally runs. Care was taken that we do not
accidentally introduce a fixed lag of 10ms for rumble events.

Maybe-fixes: atar-axis#180
Maybe-fixes: atar-axis#189
Signed-off-by: Kai Krakow <[email protected]>
kakra added a commit to kakra/xpadneo that referenced this issue Jun 20, 2020
Thanks to @medusalix great findings that the controller crashes are
most probably indeed a timing issue, this commit introduces a
modified rumble protocol which throttles rumble commands sent to the
controller. Also, it seems we cannot rely on `ff-memless` to only send
us commands once per 50ms although that's the proposed time resolution
of the protocol. It may depend on kernel HZ setting.

This commit migrates the rumble worker into a delayed worker. When
programming a rumble motor, we now remember the timing and allow the
next reprogramming only 10ms+ in the future. Further requests will be
buffered and the final command will be replayed to the controller when
the delayed worker finally runs. Care was taken that we do not
accidentally introduce a fixed lag of 10ms for rumble events.

Maybe-fixes: atar-axis#180
Maybe-fixes: atar-axis#189
Signed-off-by: Kai Krakow <[email protected]>
@kakra
Copy link
Collaborator

kakra commented Jun 22, 2020

Testing looks good so far with my PR #204, I didn't have a single controller disconnect/rumble crash yet. But I didn't have many issues with that anyways. So anybody affected might want to test that PR?

@medusalix
Copy link

medusalix commented Jun 22, 2020

I wonder whether dropping all rumble commands in that 10 ms window would cause any issues. Implementing a proper queue that buffers any excess rumbles in user space is a tricky thing to do.
Based on my Wireshark investigations, I assume Microsoft has found a clever way to buffer quick successions of rumbles, while still having a threshold where they are dropping all rumbles and only sending the last one when they receive no more commands.

@kakra
Copy link
Collaborator

kakra commented Jun 22, 2020

I do not drop them in that way... I'll buffer them and send only the last command buffered (so it's actually a one-packet buffer). The delay until the buffer is sent is calculated dynamically based on the last time we've sent data. For this, the actual send command will write a timestamp+10ms for the next acceptable earliest time.

So the first command may send the rumble immediately, successive commands may be buffered until after 10ms, when we send the latest buffered command. So we get a lag of 10ms max. After multiple successive commands, the rumble eventually stops: The way the implementation works, is that we always send the last command seen, so we actually send a stop command no matter what.

I think that matches "a clever way to buffer ... only sending the last one" you mentioned above?

@kakra
Copy link
Collaborator

kakra commented Jun 22, 2020

In your driver you may implement this by splitting the command receiver and command sender into two threads: The receiver will update the buffer (put a spin lock around) and signal the sender. The sender will block on a signal from the receiver and read the buffer (put a spin lock around), send the buffer to hardware, then wait 10ms before it loops back to waiting on the signal. I think for the signal, a semaphore could work.

Actually, I wanted to implement it that way but the kernel does not allow me to use mdelay() in some of the contexts, and a worker is supposed to exit fast otherwise we may block other worker processes in the kernel. So my implementation with timekeeping of the next window etc is quite complicated.

@medusalix
Copy link

medusalix commented Jun 22, 2020

I think that matches "a clever way to buffer ... only sending the last one" you mentioned above?

Yeah, sounds like that might be exactly what Microsoft is doing.

In your driver you may implement this by splitting the command receiver and command sender into two threads: The receiver will update the buffer (put a spin lock around) and signal the sender. The sender will block on a signal from the receiver and read the buffer (put a spin lock around), send the buffer to hardware, then wait 10ms before it loops back to waiting on the signal. I think for the signal, a semaphore could work.

Makes perfect sense, thank you. So I don't even need a queue, I just overwrite the last rumble command until the 10 milliseconds are over and I'm allowed to send another packet.

@kakra
Copy link
Collaborator

kakra commented Jun 22, 2020

Makes perfect sense, thank you. So I don't even need a queue, I just overwrite the last rumble command until the 10 milliseconds are over and I'm allowed to send another packet.

Yes, just ensure that you actually do this final send even when there's no more incoming rumble requests for some time. A semaphore and a loop in a different thread (that just waits 10ms right after sending the packet) will do the trick. It's a classical producer/consumer algorithm, you may want to Google that if you're unsure how to start.

@kakra
Copy link
Collaborator

kakra commented Jun 22, 2020

The tricky part is that you're working with two different locks (semaphore and spin locks) which introduces a chance of a deadlock and you need to prevent that... Since you want to update the buffer independent of the semaphore, you need to carefully design the interactions between the two threads: You don't want the receiver update the buffer after the sender reads it but before the sender sent it, so the spin lock must last until after sending the buffer to hardware, but not while the sender is sleeping. Your design must ensure that your buffer state is always known-sent or known-pending when the receiver looks at it.

You should create a concurrency diagram of your code to see that there's no point in time where you could lose a packet. I guarded my code with a final if that could detect this case - but it should never happen. If it does, there's a bug in my code.

PS: The wording here is swapped compared to classical consumer/producer: The receiver is the producer here (it produces the packet received by the API), the sender is the consumer (it consumes the buffer by sending it to hardware). You may thus use the wording consumer/producer in your code and put a little comment in context of API and hardware.

@medusalix
Copy link

I've had a similar problem in earlier versions of my driver where I needed to synchronize USB reads from two different endpoints. I had decided to put all USB packets into a queue and protected them with a mutex. Another thread was simultaneously reading from that queue while being notified of new packets using a conditional_variable. I believe that's also a producer/consumer problem. Oh boy did it cause problems... Never again lol.

I think I'll stick to the simpler approach of calculating the delta time and discarding rumble commands. While it might (potentially) miss some packets, it does at least save me from opening up that can of worms again.

@kakra
Copy link
Collaborator

kakra commented Jun 22, 2020

An easier approach would be to just send your commands 10ms late... I think, this is barely noticeable. And it's quite easy to get things started with two threads. Adding the "immediate on first packet" is a simple matter then. I've went this route in my code.

It took me around 4 hours to get my code into shape. The main problem is: When I crashed or hung the kernel, I had to reboot, and a clean reboot took 30 minutes sometimes, or I even had to hard-reboot. So in the end, it maybe had taken me just 1.5 to 2 hours.

Since you don't need to handle a queue, it's pretty simple. If you'd like, I can mentor you a little bit on the synchronization part but I don't know your code good enough to know where to start with thread init and tear-down.

@kakra
Copy link
Collaborator

kakra commented Jun 22, 2020

Another thread was simultaneously reading from that queue while being notified of new packets using a conditional_variable. I believe that's also a producer/consumer problem.

You may want to look into designing multi-threaded consumer/producer algorithms. There're some good read on the internet with simple examples. Also, I believe that C++ already ships with a FIFO queue implementation suitable for multi-threaded usage in consumer/producer scenario by subclassing it: https://stackoverflow.com/questions/23661759/creating-a-blocking-queue

So your main issue was probably abstracting the code properly.

@kakra
Copy link
Collaborator

kakra commented Jun 22, 2020

@medusalix
Copy link

medusalix commented Jun 22, 2020

The main issue was the fact that I couldn't find a way to stop the consumer (when the driver was shut down) without a horrendous amount of locking. I think my solution back then involved at least three different mutexes (imagine the horror). There is also not a safe way to cancel asynchronous USB transfers and I had to handle SIGINT and SIGTERM and pass them to the USB dongle. I think the actual queue worked fine 😃.
Now that I've switched to libusb's sync API I only need a single lock and it works rock solid so far.

EDIT: Oh, I forgot to mention that there was a USB hotplug handler too. Also required locking.

@kakra
Copy link
Collaborator

kakra commented Jun 22, 2020

The main issue was the fact that I couldn't find a way to stop the consumer (when the driver was shut down) without a horrendous amount of locking. I think my solution back then involved at least three different mutexes (imagine the horror). There is also not a safe way to cancel asynchronous USB transfers and I had to handle SIGINT and SIGTERM and pass them to the USB dongle. I think the actual queue worked fine .

The link I've posted actually covers all of that - it can signal the end of production, so you can properly catch signals and inform the queue to stop consuming and exit cleanly:

Or this: https://github.com/CodeExMachina/BlockingCollection

What you describe sounds a lot like an abstraction/design issue. I've done a lot of those wrong decisions in my 20+ years of programming career so I think I can smell it when it reeks. ;-)

But the interaction with USB probably bloated everything up, the consumer and producer for a single packet buffer should be much easier and without most of the pitfalls, you just need to guard access to the buffer with a spin-lock. Good that the sync API simplified everything, that's always a good choice.

You may also get away with a double buffer by using https://en.cppreference.com/w/cpp/atomic/atomic/exchange and swapping the consumer with the receiver buffer. The http://www.cplusplus.com/reference/condition_variable/condition_variable/ then just waits for a signal in the consumer and swaps the buffers, then acts on it (Update: actually, the producer should swap the buffers, then signal the consumer!). You can put an additional exit flag into the buffers to signal an intent to exit immediately.

Another idea: Maybe even a triple buffer... That's how graphic applications prevent using vsync. The producer just swaps two buffers it constantly updates, a front and a back buffer (the latter will be updated in the background). The consumer then swap the front buffer with its own buffer. But this introduces a possible lag of one buffer, you need to ensure that the latest buffer is always sent to the device at some point in the future. This should be easy enough by ensuring a proper order of signal and swapping. The advantage of this: It's completely lock-less.

Compared to GPU triple buffering: Your receiver (producer) is the uinput client (the application). It will update the back buffer, then swap it with the front buffer. Your sender (consumer) is the hardware (the display), it will wait on the condition (aka sync signal for an adaptive sync display), then swaps the front buffer with its send buffer (the frame buffer for the display), then returns to waiting on the condition. So for naming I suggest using producer, consumer, buffer { .front, .back, .send }.

@medusalix
Copy link

What you describe sounds a lot like an abstraction/design issue. I've done a lot of those wrong decisions in my 20+ years of programming career so I think I can smell it when it reeks. ;-)

Absolutely. I'm not a software engineer by any means. Things can get messy real quick sometimes.

Another idea: Maybe even a triple buffer...

I like to keep things simple and straightforward, you know. Least amount of threads, simple Makefile, the list goes on... Saves you a lot of trouble ;)

@kakra
Copy link
Collaborator

kakra commented Jun 22, 2020

I like to keep things simple and straightforward, you know. Least amount of threads, simple Makefile, the list goes on... Saves you a lot of trouble ;)

Then triple buffer is probably the way to go as you do not need thread synchronization for buffer access, just signalling.

kakra added a commit to kakra/xpadneo that referenced this issue Jun 26, 2020
Thanks to @medusalix great findings that the controller crashes are
most probably indeed a timing issue, this commit introduces a
modified rumble protocol which throttles rumble commands sent to the
controller. Also, it seems we cannot rely on `ff-memless` to only send
us commands once per 50ms although that's the proposed time resolution
of the protocol. It may depend on kernel HZ setting.

This commit migrates the rumble worker into a delayed worker. When
programming a rumble motor, we now remember the timing and allow the
next reprogramming only 10ms+ in the future. Further requests will be
buffered and the final command will be replayed to the controller when
the delayed worker finally runs. Care was taken that we do not
accidentally introduce a fixed lag of 10ms for rumble events.

Maybe-fixes: atar-axis#180
Maybe-fixes: atar-axis#189
Signed-off-by: Kai Krakow <[email protected]>
kakra added a commit to kakra/xpadneo that referenced this issue Jun 27, 2020
Thanks to @medusalix great findings that the controller crashes are
most probably indeed a timing issue, this commit introduces a
modified rumble protocol which throttles rumble commands sent to the
controller. Also, it seems we cannot rely on `ff-memless` to only send
us commands once per 50ms although that's the proposed time resolution
of the protocol. It may depend on kernel HZ setting.

This commit migrates the rumble worker into a delayed worker. When
programming a rumble motor, we now remember the timing and allow the
next reprogramming only 10ms+ in the future. Further requests will be
buffered and the final command will be replayed to the controller when
the delayed worker finally runs. Care was taken that we do not
accidentally introduce a fixed lag of 10ms for rumble events.

Maybe-fixes: atar-axis#180
Maybe-fixes: atar-axis#189
Signed-off-by: Kai Krakow <[email protected]>
kakra added a commit to kakra/xpadneo that referenced this issue Jun 27, 2020
Thanks to @medusalix great findings that the controller crashes are
most probably indeed a timing issue, this commit introduces a
modified rumble protocol which throttles rumble commands sent to the
controller. Also, it seems we cannot rely on `ff-memless` to only send
us commands once per 50ms although that's the proposed time resolution
of the protocol. It may depend on kernel HZ setting.

This commit migrates the rumble worker into a delayed worker. When
programming a rumble motor, we now remember the timing and allow the
next reprogramming only 10ms+ in the future. Further requests will be
buffered and the final command will be replayed to the controller when
the delayed worker finally runs. Care was taken that we do not
accidentally introduce a fixed lag of 10ms for rumble events.

Maybe-fixes: atar-axis#180
Maybe-fixes: atar-axis#189
Signed-off-by: Kai Krakow <[email protected]>
kakra added a commit to kakra/xpadneo that referenced this issue Jun 27, 2020
Thanks to @medusalix great findings that the controller crashes are
most probably indeed a timing issue, this commit introduces a
modified rumble protocol which throttles rumble commands sent to the
controller. Also, it seems we cannot rely on `ff-memless` to only send
us commands once per 50ms although that's the proposed time resolution
of the protocol. It may depend on kernel HZ setting.

This commit migrates the rumble worker into a delayed worker. When
programming a rumble motor, we now remember the timing and allow the
next reprogramming only 10ms+ in the future. Further requests will be
buffered and the final command will be replayed to the controller when
the delayed worker finally runs. Care was taken that we do not
accidentally introduce a fixed lag of 10ms for rumble events.

Maybe-fixes: atar-axis#180
Fixes: atar-axis#189
Signed-off-by: Kai Krakow <[email protected]>
@kakra kakra closed this as completed in 8e7eba3 Jun 27, 2020
@kakra kakra unpinned this issue Jun 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants