Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Commands on Docker host fail with "returned empty string" #644

Open
jamesmyatt opened this issue Jun 8, 2023 · 14 comments
Open

Commands on Docker host fail with "returned empty string" #644

jamesmyatt opened this issue Jun 8, 2023 · 14 comments
Labels
bug Something isn't working docker-plain plain Docker (no swarm, no compose, no stack)

Comments

@jamesmyatt
Copy link

SUMMARY

I'm unable to run commands on a docker host. This used to work.

ISSUE TYPE
  • Bug Report
COMPONENT NAME

community.docker

ANSIBLE VERSION
ansible [core 2.15.0]
  config file = /workspaces/project/ansible.cfg
  configured module search path = ['/home/vscode/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /home/vscode/.local/lib/python3.10/site-packages/ansible
  ansible collection location = /home/vscode/.ansible/collections:/usr/share/ansible/collections
  executable location = /home/vscode/.local/bin/ansible
  python version = 3.10.10 (main, Mar 23 2023, 17:28:09) [GCC 10.2.1 20210110] (/usr/local/bin/python)
  jinja version = 3.1.2
  libyaml = True
COLLECTION VERSION
$ ansible-galaxy collection list community.docker

# /home/vscode/.ansible/collections/ansible_collections
Collection       Version
---------------- -------
community.docker 3.4.6  

# /home/vscode/.local/lib/python3.10/site-packages/ansible_collections
Collection       Version
---------------- -------
community.docker 3.4.6  

# /usr/local/lib/python3.10/site-packages/ansible_collections
Collection       Version
---------------- -------
community.docker 3.4.6  
CONFIGURATION
CONFIG_FILE() = /workspaces/project/ansible.cfg
OS / ENVIRONMENT

Running Ansible in vscode dev container (mcr.microsoft.com/vscode/devcontainers/python:0-3.10). Docker version 24.0.1, build 6802122.

STEPS TO REPRODUCE

I have a role that includes the following tasks:

---
- name: Create temporary container
  community.docker.docker_container:
    name: tmp_data
    image: python:3.10-bullseye
    state: started
    auto_remove: yes
    command: /bin/bash
    interactive: yes
    tty: yes
    volumes_from:
      - "{{ container_name }}"

- name: Add container to inventory
  add_host:
    name: "{{ inventory_hostname }}+tmp_data"
    ansible_connection: community.docker.docker
    ansible_host: tmp_data
    ansible_user: root
    ansible_become: yes
    ansible_remote_tmp: /tmp/.ansible
    ansible_docker_extra_args: "-H=tcp://{{ ansible_host }}:2375"
    ansible_python_interpreter: python

- name: ping
  delegate_to: "{{ inventory_hostname }}+tmp_data"
  ansible.builtin.ping:
EXPECTED RESULTS

I expect this to work.

ACTUAL RESULTS

Running with -vvv, everything works up to the last task, which fails:

<tmp_data> ESTABLISH DOCKER CONNECTION FOR USER: root
<tmp_data> EXEC ['/usr/bin/docker', '-H=tcp://192.168.1.168:2375', b'exec', b'-u', 'root', b'-i', 'tmp_data', '/bin/sh', '-c', '/bin/sh -c \'( umask 77 && mkdir -p "` echo /tmp/.ansible `"&& mkdir "` echo /tmp/.ansible/ansible-tmp-1686177373.2244887-17139-101563472711239 `" && echo ansible-tmp-1686177373.2244887-17139-101563472711239="` echo /tmp/.ansible/ansible-tmp-1686177373.2244887-17139-101563472711239 `" ) && sleep 0\'']
fatal: [windshear-1 -> windshear-1+tmp_data]: FAILED! => {
    "msg": "failed to resolve remote temporary directory from ansible-tmp-1686177373.2244887-17139-101563472711239: `( umask 77 && mkdir -p \"` echo /tmp/.ansible `\"&& mkdir \"` echo /tmp/.ansible/ansible-tmp-1686177373.2244887-17139-101563472711239 `\" && echo ansible-tmp-1686177373.2244887-17139-101563472711239=\"` echo /tmp/.ansible/ansible-tmp-1686177373.2244887-17139-101563472711239 `\" )` returned empty string"
}

Seems odd that the command has '/bin/sh', '-c' twice. Running the same docker command at the command line also doesn't work-it just hangs with no output--but running the command again without the duplicate /bin/sh -c works as I'd expect.

@jamesmyatt jamesmyatt changed the title Docker host commands fail with "returned empty string" Commands on Docker host fail with "returned empty string" Jun 8, 2023
@felixfontein felixfontein added bug Something isn't working docker-plain plain Docker (no swarm, no compose, no stack) labels Jun 8, 2023
@felixfontein
Copy link
Collaborator

This used to work.

Do you have a bit more precise information with which versions of community.docker and ansible-core this used to work?

@felixfontein
Copy link
Collaborator

I'm trying to reproduce this locally:

---
- hosts: localhost
  gather_facts: false
  tasks:
    - name: Create temporary container
      community.docker.docker_container:
        name: test
        image: python:3.10-bullseye
        state: started
        auto_remove: true
        command: /bin/bash
        interactive: true
        tty: true

    - name: Add container to inventory
      add_host:
        name: "test"
        ansible_connection: community.docker.docker
        ansible_host: test
        ansible_user: root
        ansible_become: true
        ansible_remote_tmp: /tmp/.ansible
        ansible_python_interpreter: python

    - name: ping
      delegate_to: "test"
      ansible.builtin.ping:

But it works fine for me (I'm running it with ANSIBLE_PIPELINING=false, otherwise it runs other, simpler commands):

<test> ESTABLISH DOCKER CONNECTION FOR USER: root
<test> EXEC ['/usr/bin/docker', b'exec', b'-u', 'root', b'-i', 'test', '/bin/sh', '-c', '/bin/sh -c \'( umask 77 && mkdir -p "` echo /tmp/.ansible `"&& mkdir "` echo /tmp/.ansible/ansible-tmp-1686254840.6949074-18066-241769342060559 `" && echo ansible-tmp-1686254840.6949074-18066-241769342060559="` echo /tmp/.ansible/ansible-tmp-1686254840.6949074-18066-241769342060559 `" ) && sleep 0\'']
Using module file /path/to/ansible/modules/ping.py
<test> PUT /path/to/.ansible/tmp/ansible-local-18034qycxoiwk/tmpvzp9a6hh TO /tmp/.ansible/ansible-tmp-1686254840.6949074-18066-241769342060559/AnsiballZ_ping.py
<test> EXEC ['/usr/bin/docker', b'exec', b'-u', 'root', b'-i', 'test', '/bin/sh', '-c', "/bin/sh -c 'chmod u+x /tmp/.ansible/ansible-tmp-1686254840.6949074-18066-241769342060559/ /tmp/.ansible/ansible-tmp-1686254840.6949074-18066-241769342060559/AnsiballZ_ping.py && sleep 0'"]
<test> EXEC ['/usr/bin/docker', b'exec', b'-u', 'root', b'-i', 'test', '/bin/sh', '-c', "/bin/sh -c 'python /tmp/.ansible/ansible-tmp-1686254840.6949074-18066-241769342060559/AnsiballZ_ping.py && sleep 0'"]
<test> EXEC ['/usr/bin/docker', b'exec', b'-u', 'root', b'-i', 'test', '/bin/sh', '-c', "/bin/sh -c 'rm -f -r /tmp/.ansible/ansible-tmp-1686254840.6949074-18066-241769342060559/ > /dev/null 2>&1 && sleep 0'"]
ok: [localhost -> test] => {
    "changed": false,
    "invocation": {
        "module_args": {
            "data": "pong"
        }
    },
    "ping": "pong"
}

This could also be related to using -H=tcp://192.168.1.168:2375, i.e. the Docker version. In any case, I cannot reproduce the problem you are having.

@jamesmyatt
Copy link
Author

jamesmyatt commented Jun 9, 2023

This used to work.

Do you have a bit more precise information with which versions of community.docker and ansible-core this used to work?

Sorry. No. I wish I did.

I'll do some more tests.

@jamesmyatt
Copy link
Author

For what it's worth, the Docker versions are 24.0.1 locally and 24.0.2 on the host.

Confusingly, running that command directly at the command prompt works fine. I'm pretty sure it didn't yesterday. But the playbook still fails for me.

/usr/bin/docker -H=tcp://192.168.1.168:2375 exec -u root -i tmp_data /bin/sh -c '/bin/sh -c "( umask 77 && mkdir -p \"` echo /tmp/.ansible `\" && mkdir \"` echo /tmp/.ansible/ansible-tmp-1686177373.2244887-17139-101563472711239 `\" && echo ansible-tmp-1686177373.2244887-17139-101563472711239=\"` echo /tmp/.ansible/ansible-tmp-1686177373.2244887-17139-101563472711239 `\" ) && sleep 0"'

@jamesmyatt
Copy link
Author

jamesmyatt commented Jun 10, 2023

The playbook you posted does work on the machine hosting the docker container, but doesn't work from another machine. Even when adding ansible_docker_extra_args: "-H=tcp://192.168.1.168:2375" to the add_host task on either machine.

However, as I said, running that command directly from the other machine does work. Also running community.docker.docker_container with the docker_host argument also works (see below). So I don't think it's just a Docker problem. So what's different between the shell environment and the Ansible environment?

Also, I'm open to a better way of connecting to the remote Docker containers than -H=tcp://192.168.1.168:2375.

---
- hosts: localhost
  gather_facts: false
  tasks:
    - name: Create temporary container
      community.docker.docker_container:
        docker_host: tcp://192.168.1.168:2375
        name: test
        image: python:3.10-bullseye
        state: started
        auto_remove: true
        command: /bin/bash
        interactive: true
        tty: true

    - name: Add container to inventory
      add_host:
        name: "test"
        ansible_connection: community.docker.docker
        ansible_host: test
        ansible_user: root
        ansible_become: true
        ansible_remote_tmp: /tmp/.ansible
        ansible_python_interpreter: python
        ansible_docker_extra_args: "-H=tcp://192.168.1.168:2375"

    - name: ping
      delegate_to: "test"
      ansible.builtin.ping:

@felixfontein
Copy link
Collaborator

The playbook you posted does work on the machine hosting the docker container, but doesn't work from another machine. Even when adding ansible_docker_extra_args: "-H=tcp://192.168.1.168:2375" to the add_host task on either machine.

What are the differences between these two machines with regard to Python version, ansible-core version, community.docker version, Docker version, ...?

@jamesmyatt
Copy link
Author

I never figured this out, but instead I worked around it. Thanks for your help. Instead, I changed it to connect to the docker host via SSH instead:

    ansible_docker_extra_args: "-H=ssh://{{ ansible_user }}@{{ ansible_host }}"

In fact, I think this is an all-round better solution anyway, since it's both more secure and easier. You're using the existing SSH connection, rather than configuring Docker to open a new port, and it uses SSH for security rather than allowing open access. I don't know why the Docker documentation doesn't recommend this approach more prominently for remote access (e.g. https://docs.docker.com/config/daemon/remote-access/). I don't know if it's worth adding something to the docs here.

@calillo
Copy link

calillo commented Sep 6, 2023

I have the same problem when connecting to a remote docker engine, it happens because the exec call checks the command output too soon before it is executed.

I did an analysis of the packets with wireshark and the response from the docker engine is that the command is still running, possible it goes into error because the return code is null and not zero, it would be enough to introduce a delay or repeat the call when the command is still running.

`
POST /v1.41/containers/test/exec HTTP/1.1
Host: 10.0.10.17:2375
User-Agent: Docker-Client/20.10.25+azure-2 (linux)
Content-Length: 533
Content-Type: application/json

{"User":"","Privileged":false,"Tty":false,"AttachStdin":true,"AttachStderr":true,"AttachStdout":true,"Detach":false,"DetachKeys":"","Env":null,"WorkingDir":"","Cmd":["/bin/sh","-c","/bin/sh -c '( umask 77 \u0026\u0026 mkdir -p "echo ~/.ansible/tmp"\u0026\u0026 mkdir "echo ~/.ansible/tmp/ansible-tmp-1693931639.118705-877823-209799273796602" \u0026\u0026 echo ansible-tmp-1693931639.118705-877823-209799273796602="echo ~/.ansible/tmp/ansible-tmp-1693931639.118705-877823-209799273796602" ) \u0026\u0026 sleep 0'"]}
HTTP/1.1 201 Created
Api-Version: 1.42
Content-Type: application/json
Docker-Experimental: false
Ostype: linux
Server: Docker/23.0.6 (linux)
Date: Tue, 05 Sep 2023 16:33:59 GMT
Content-Length: 74

{"Id":"37620a287a45af02f0481d8ef439eee6b3d93483e07c208f2db841400545a8b5"}
GET /v1.41/exec/37620a287a45af02f0481d8ef439eee6b3d93483e07c208f2db841400545a8b5/json HTTP/1.1
Host: 10.0.10.17:2375
User-Agent: Docker-Client/20.10.25+azure-2 (linux)

HTTP/1.1 200 OK
Api-Version: 1.42
Content-Type: application/json
Docker-Experimental: false
Ostype: linux
Server: Docker/23.0.6 (linux)
Date: Tue, 05 Sep 2023 16:33:59 GMT
Content-Length: 682

{"ID":"37620a287a45af02f0481d8ef439eee6b3d93483e07c208f2db841400545a8b5","Running":true,"ExitCode":null,"ProcessConfig":{"tty":false,"entrypoint":"/bin/sh","arguments":["-c","/bin/sh -c '( umask 77 && mkdir -p "echo ~/.ansible/tmp"&& mkdir "echo ~/.ansible/tmp/ansible-tmp-1693931639.118705-877823-209799273796602" && echo ansible-tmp-1693931639.118705-877823-209799273796602="echo ~/.ansible/tmp/ansible-tmp-1693931639.118705-877823-209799273796602" ) && sleep 0'"],"privileged":false},"OpenStdin":true,"OpenStderr":true,"OpenStdout":true,"CanRemove":false,"ContainerID":"9f4c334008a42c6d101143ebc9274fc3fa58c8db1603c3f4524f62878f5cfed6","DetachKeys":"","Pid":0}
`

@jfjauvin
Copy link

SUMMARY

When executing a task delegated to a host using the community.docker.docker connection, the task fails with the failed to resolve remote temporary directory from [...] returned empty string

ISSUE TYPE

  • Bug Report

COMPONENT NAME

community.docker

ANSIBLE VERSION

ansible [core 2.15.5]
  config file = None
  configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/lib/python3.12/site-packages/ansible
  ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections
  executable location = /usr/local/bin/ansible
  python version = 3.12.0 (main, Oct 11 2023, 23:03:18) [GCC 12.2.0] (/usr/local/bin/python)
  jinja version = 3.1.2
  libyaml = True

DOCKER VERSION

Docker Host

Docker version 24.0.6, build ed223bc

Ansible Container

Docker version 20.10.24+dfsg1, build 297e128

I have tried to upgrade the binary in the Ansible container to version 24, but the issue still persists.

COLLECTION VERSION

ansible-galaxy collection list community.docker

# /usr/local/lib/python3.12/site-packages/ansible_collections
Collection       Version
---------------- -------
community.docker 3.4.9 

CONFIGURATION

CONFIG_FILE() = None

OS / ENVIRONMENT

Docker Desktop

Windows 11 Version 22H2 (OS Build 22621.2283)
Docker Desktop Version 4.24.1 (123237)
Engine: 24.0.6

Ansible Container

PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

Target Container

NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.18.4
PRETTY_NAME="Alpine Linux v3.18"
HOME_URL="https://alpinelinux.org/"
BUG_REPORT_URL="https://gitlab.alpinelinux.org/alpine/aports/-/issues"

STEPS TO REPRODUCE

Create the following 3 files in WSL.

/Dockerfile

FROM python:3.12-slim AS ansible

RUN set -eux; \
	apt-get update; \
	apt-get install -y --no-install-recommends \
	docker.io \
	; \
	pip install --upgrade pip; \
	pip install --no-cache-dir \
	boto \
	boto3 \
	ansible \
	; 

/docker-compose.yaml

version: "3.8"

services:
  ansible:
    stdin_open: true
    tty: true
    build:
      target: ansible
    stop_signal: SIGKILL
    container_name: "ansible"
    volumes:
      - ./task.yaml:/task.yaml

  alpine:
    stdin_open: true
    tty: true
    image: "alpine:latest"
    container_name: "alpine"
    stop_signal: SIGKILL

/task.yaml

---
- hosts: localhost
  connection: local
  vars:
      docker_extra_args: "-H=tcp://host.docker.internal:2375"
  user: root
  tasks:
  
  - name: Add container to inventory
    ansible.builtin.add_host:
      name: alpine
      ansible_connection: community.docker.docker
      ansible_docker_extra_args: "{{docker_extra_args}}"

  - name: "Delegating command to docker container"
    delegate_to: alpine
    ansible.builtin.command: "echo 'test'"

Start the docker containers, then execute the task in the ansible container

$ docker compose up --detach

$ docker exec -u root -i ansible /bin/sh -c "ansible-playbook task.yaml"

EXPECTED RESULTS

The expected result would be the task completing successfully. Docker Desktop version 4.18.0 is the last version in which this would work. Since the bug appeared after a Docker Desktop update, you might think the issue is related to docker, but it's not the case, please see my analysis.

ACTUAL RESULTS

ansible-playbook [core 2.15.5]
  config file = None
  configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/lib/python3.12/site-packages/ansible
  ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections
  executable location = /usr/local/bin/ansible-playbook
  python version = 3.12.0 (main, Oct 11 2023, 23:03:18) [GCC 12.2.0] (/usr/local/bin/python)
  jinja version = 3.1.2
  libyaml = True
No config file found; using defaults
setting up inventory plugins
Loading collection ansible.builtin from 
host_list declined parsing /etc/ansible/hosts as it did not pass its verify_file() method
Skipping due to inventory source not existing or not being readable by the current user
script declined parsing /etc/ansible/hosts as it did not pass its verify_file() method
auto declined parsing /etc/ansible/hosts as it did not pass its verify_file() method
Skipping due to inventory source not existing or not being readable by the current user
yaml declined parsing /etc/ansible/hosts as it did not pass its verify_file() method
Skipping due to inventory source not existing or not being readable by the current user
ini declined parsing /etc/ansible/hosts as it did not pass its verify_file() method
Skipping due to inventory source not existing or not being readable by the current user
toml declined parsing /etc/ansible/hosts as it did not pass its verify_file() method
[WARNING]: No inventory was parsed, only implicit localhost is available
[WARNING]: provided hosts list is empty, only localhost is available. Note that the implicit localhost does not match 'all'
Loading callback plugin default of type stdout, v2.0 from /usr/local/lib/python3.12/site-packages/ansible/plugins/callback/default.py
Skipping callback 'default', as we already have a stdout callback.
Skipping callback 'minimal', as we already have a stdout callback.
Skipping callback 'oneline', as we already have a stdout callback.

PLAYBOOK: task.yaml ****************************************************************************************************************************************************************************************************
Positional arguments: task.yaml
verbosity: 4
connection: smart
timeout: 10
become_method: sudo
tags: ('all',)
inventory: ('/etc/ansible/hosts',)
forks: 5
1 plays in task.yaml

PLAY [localhost] *******************************************************************************************************************************************************************************************************

TASK [Gathering Facts] *************************************************************************************************************************************************************************************************
task path: /task.yaml:2
<127.0.0.1> ESTABLISH LOCAL CONNECTION FOR USER: root
<127.0.0.1> EXEC /bin/sh -c 'echo ~root && sleep 0'
<127.0.0.1> EXEC /bin/sh -c '( umask 77 && mkdir -p "` echo /root/.ansible/tmp `"&& mkdir "` echo /root/.ansible/tmp/ansible-tmp-1697208460.66339-603-211097265800276 `" && echo ansible-tmp-1697208460.66339-603-211097265800276="` echo /root/.ansible/tmp/ansible-tmp-1697208460.66339-603-211097265800276 `" ) && sleep 0'
Using module file /usr/local/lib/python3.12/site-packages/ansible/modules/setup.py
<127.0.0.1> PUT /root/.ansible/tmp/ansible-local-599xz7g9c5e/tmp0b6q65ax TO /root/.ansible/tmp/ansible-tmp-1697208460.66339-603-211097265800276/AnsiballZ_setup.py
<127.0.0.1> EXEC /bin/sh -c 'chmod u+x /root/.ansible/tmp/ansible-tmp-1697208460.66339-603-211097265800276/ /root/.ansible/tmp/ansible-tmp-1697208460.66339-603-211097265800276/AnsiballZ_setup.py && sleep 0'
<127.0.0.1> EXEC /bin/sh -c '/usr/local/bin/python /root/.ansible/tmp/ansible-tmp-1697208460.66339-603-211097265800276/AnsiballZ_setup.py && sleep 0'
<127.0.0.1> EXEC /bin/sh -c 'rm -f -r /root/.ansible/tmp/ansible-tmp-1697208460.66339-603-211097265800276/ > /dev/null 2>&1 && sleep 0'
ok: [localhost]

TASK [Add container to inventory] **************************************************************************************************************************************************************************************
task path: /task.yaml:9
creating host via 'add_host': hostname=alpine
changed: [localhost] => {
    "add_host": {
        "groups": [],
        "host_name": "alpine",
        "host_vars": {
            "ansible_connection": "community.docker.docker",
            "ansible_docker_extra_args": "-H=tcp://host.docker.internal:2375"
        }
    },
    "changed": true
}

TASK [Delegating command to docker container] **************************************************************************************************************************************************************************
task path: /task.yaml:15
Loading collection community.docker from /usr/local/lib/python3.12/site-packages/ansible_collections/community/docker
<alpine> ESTABLISH DOCKER CONNECTION FOR USER: root
<alpine> EXEC ['/usr/bin/docker', '-H=tcp://host.docker.internal:2375', b'exec', b'-u', 'root', b'-i', 'alpine', '/bin/sh', '-c', "/bin/sh -c 'echo ~root && sleep 0'"]
<alpine> EXEC ['/usr/bin/docker', '-H=tcp://host.docker.internal:2375', b'exec', b'-u', 'root', b'-i', 'alpine', '/bin/sh', '-c', '/bin/sh -c \'echo "`pwd`" && sleep 0\'']
<alpine> EXEC ['/usr/bin/docker', '-H=tcp://host.docker.internal:2375', b'exec', b'-u', 'root', b'-i', 'alpine', '/bin/sh', '-c', '/bin/sh -c \'( umask 77 && mkdir -p "` echo ~/.ansible/tmp `"&& mkdir "` echo ~/.ansible/tmp/ansible-tmp-1697208461.9177275-648-193093839347660 `" && echo ansible-tmp-1697208461.9177275-648-193093839347660="` echo ~/.ansible/tmp/ansible-tmp-1697208461.9177275-648-193093839347660 `" ) && sleep 0\'']
fatal: [localhost -> alpine]: FAILED! => {
    "msg": "failed to resolve remote temporary directory from ansible-tmp-1697208461.9177275-648-193093839347660: `( umask 77 && mkdir -p \"` echo ~/.ansible/tmp `\"&& mkdir \"` echo ~/.ansible/tmp/ansible-tmp-1697208461.9177275-648-193093839347660 `\" && echo ansible-tmp-1697208461.9177275-648-193093839347660=\"` echo ~/.ansible/tmp/ansible-tmp-1697208461.9177275-648-193093839347660 `\" )` returned empty string"
}

PLAY RECAP *************************************************************************************************************************************************************************************************************
localhost                  : ok=2    changed=1    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0

Debugging

@calillo had the idea of analysing the traffic exchanged between the docker binary and the docker engine. I did the same to try to figure out what is happening and discovered some additional key information.

Docker Binary Behavior

According to my findings, when using the docker exec command, 2 connections will be made to the Docker Engine.

First Connection

  1. Ping the engine
  2. Get information about the container
  3. Start the exec, get an Id
  4. Wait, the second connection is initiated, then wake up when it's done
  5. Get the summary of the exec

Second Connection

  1. Connect with the Id provided in step 3 of the first connection
  2. Upgrade the HTTP connection to TCP in order to stream the output of the executed task
  3. When the exec finishes, the connection is closed

Comparing against functionning version

In order to find out what is happening, I've sniffed packets from 3 differents scenarios:

  • Docker Engine v1.43 (Docker Desktop 4.24.1) via Ansible
  • Docker Engine v1.43 (Docker Desktop 4.24.1) via Bash
  • Docker Engine v1.41 (Docker Desktop 4.18.0) via Ansible

We'll concentrate on the second connection where we are expecting some output.

Docker Engine v1.43 (Docker Desktop 4.24.1) via Ansible

When inspecting the packets of the second connection, something is weird: the FIN flag is initiated by the binary. How does it know that the connection needs to be closed?

  • 172.26.0.2 Docker Binary in the ansible container
  • 192.168.65.254 Docker Engine on the host

4 24 2 fin

After the second connection closes, the first connection queries the API to get the summary of the exec command. We can see that it was prematurely closed since it's still running and the exit code is still null.

4 24 1 5

Docker Engine v1.43 (Docker Desktop 4.24.1) via Bash

That same command that fails in Ansible, if we run in bash within the ansible container (so that it runs from the same environement), it executes successufly with the expected ouput.

# From WSL
$ docker exec -i ansible bash

# Within Ansible container
$ docker -H=tcp://host.docker.internal:2375 exec -u root -i alpine /bin/sh -c "/bin/sh -c '( umask 77 && mkdir -p "` echo ~/.ansible/tmp `"&& mkdir "` echo ~/.ansible/tmp/ansible-tmp-1697223552.034312-223-168432577105606 `" && echo ansible-tmp-1697223552.034312-223-168432577105606="` echo ~/.ansible/tmp/ansible-tmp-1697223552.034312-223-168432577105606 `" ) && sleep 0'"

ansible-tmp-1697223552.034312-223-168432577105606=/root/.ansible/tmp/ansible-tmp-1697223552.034312-223-168432577105606

$

When inspecting the packets, there is a big difference: the FIN flag is initiated by the Engine once it finishes streaming the command. This is the behaviour I would expect.

  • 172.26.0.2 Docker Binary in the ansible container
  • 192.168.65.254 Docker Engine on the host

bash 2 fin

Docker Engine v1.41 (Docker Desktop 4.18.0) via Ansible

Why did it work with the previous docker engine? Let's inspect the packet flow. We get the same behaviour has with the newest engine, but luckily we get the output in between the FIN request and the FIN confirmation.

  • 172.19.0.3 Docker Binary in the ansible container
  • 192.168.65.2 Docker Engine on the host

4 18 2 fin

Since we managed to get the expected response from the exec command, when the first connection queries the API about the summary of the command, running is false and the exit code is 0.

4 18 1 5

Conclusion

From my understanding, this is not a new bug. It's been dormant and the changes to the engine caused it to surface.

The biggest question is what causes Ansible to close the second connection prematurely? Maybe the handle where the socket is piped is closed, thus cascading upward?

@felixfontein
Copy link
Collaborator

I cannot reproduce this. I guess the main difference is that I use Linux, not WSL. My Docker client in the container is exactly the same as you have, and my Docker daemon also is 24.0.6, though a slightly different build:

Server:
 Engine:
  Version:          24.0.6
  API version:      1.43 (minimum version 1.12)
  Go version:       go1.21.1
  Git commit:       1a7969545d
  Built:            Sat Sep 30 15:48:58 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.7.7
  GitCommit:        8c087663b0233f6e6e2f4515cee61d49f14746a8.m
 runc:
  Version:          1.1.9
  GitCommit:        
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

(I'm using Arch Linux with Docker from the system packages.)

@calillo
Copy link

calillo commented Oct 15, 2023

In my case the docker engine is on a remote VM (not in localhost), in this scenario it practically always happens.
I confirm @jfjauvin consideration, if try to run on bash the same command work successeful. So, from my point of view, it's a session/connection management issue by community.docker

@felixfontein
Copy link
Collaborator

I did now try this with a Docker daemon on a remote machine reached via TCP. Same result: it works flawlessly for me.

The connection plugin also isn't doing any magic (assuming you don't use become: true with a plugin that needs to handle a prompt): it will simply execute the docker command line with subprocess.Popen (https://github.com/ansible-collections/community.docker/blob/main/plugins/connection/docker.py#L300-L305) and then uses Popen.communicate to wait for it to complete (https://github.com/ansible-collections/community.docker/blob/main/plugins/connection/docker.py#L343).

@jfjauvin
Copy link

I have also tested it on Archlinux, and like @felixfontein it works. I took the opportunity to packet sniff the exchange and we are getting the same weird behavior of Ansible or the Binary sending the FIN flag as soon as the connection is upgraded.

  • 172.22.0.2 Docker Binary in the ansible container
  • 172.11.0.1 Docker Engine on the host

image

archlinux fin packet.pcap.gz

As @felixfontein mentioned, there's not at lot happening in the connection plugin, but the binary behaves differently has it does in a shell.

When I have some time later this week, I'll try to isolate running the docker binary in a simple python script using subprocess.Popen and see if we get the same behavior. In the past, I've seen some binaries behave different when they weren't running a in a terminal (TTY).

@felixfontein
Copy link
Collaborator

@jfjauvin did you had a chance to try this out?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working docker-plain plain Docker (no swarm, no compose, no stack)
Projects
None yet
Development

No branches or pull requests

4 participants