Fixes and improvements to deployment #90

Merged 9 commits into main on Apr 19, 2024

Conversation

markgoddard

  • deployment: Add comments to inventories
  • deployment: Always get CA fingerprint

    This avoids an issue when adding hosts to a cluster where the host
    getting the fingerprint has already been bootstrapped, so does not query
    the CA's fingerprint.

  • deployment: Assert that there is only one HAProxy server

    Currently we are not deploying any failover mechanism such as keepalived,
    so limit to one HAProxy server.
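
As a rough illustration of the "only one HAProxy server" rule, here is a minimal standalone sketch that checks group sizes in an INI-style inventory before running the playbook. It is not the assertion used in the playbook itself; the `GROUP_LIMITS` table and both function names are hypothetical, and the limits simply mirror the comments in the example inventory.

```python
from collections import defaultdict

# Hypothetical limits (min, max) mirroring the inventory comments;
# not taken from the playbook itself.
GROUP_LIMITS = {
    "haproxy": (1, 1),
    "jaeger": (0, 1),
    "minio": (0, 1),
    "prometheus": (0, 1),
    "step-ca": (1, 1),
}


def parse_inventory(text):
    """Return {group: [entries]} for a simple INI-style inventory.

    Only the first token of each entry line is kept, so host variables
    such as ansible_host=... are ignored. Entries of ':children' groups
    are group names rather than hosts.
    """
    groups = defaultdict(list)
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            groups[current]  # register the group even if it stays empty
            continue
        if current is not None:
            groups[current].append(line.split()[0])
    return dict(groups)


def check_limits(groups, limits=GROUP_LIMITS):
    """Return a list of violations; an empty list means the inventory passes."""
    problems = []
    for name, (low, high) in limits.items():
        count = len(groups.get(name, []))
        if not low <= count <= high:
            problems.append(f"{name}: expected {low}..{high} hosts, found {count}")
    return problems
```

Calling `check_limits(parse_inventory(open("deployment/inventory").read()))` on an inventory with three hosts under `[haproxy]` would report a single violation for that group.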
@valeriupredoi

many thanks @markgoddard 🍺 I am currently testing this branch with this inventory:

# Example inventory for deployment to a single host (localhost).

# HAProxy load balancer.
# Should contain exactly one host.
[haproxy]
activeh

# Jaeger distributed tracing UI.
# Should contain at most one host.
[jaeger]
activeh

# Minio object storage service (for test & development only).
# Should contain at most one host.
[minio]
activeh

# Prometheus monitoring service.
# Should contain at most one host.
[prometheus]
activeh

# Reductionist servers.
# May contain multiple hosts.
[reductionist]
activeh
active2
active3

# Step Certificate Authority (CA).
# Should contain exactly one host.
[step-ca]
activeh

# Do not edit.
[step:children]
reductionist

# Do not edit.
[docker:children]
haproxy
jaeger
minio
prometheus
reductionist
step-ca

but it's currently hanging at "Gathering Facts":

[vpredoi@activeh ~]$ ansible-playbook -i reductionist-rs/deployment/inventory reductionist-rs/deployment/site.yml
[WARNING]: Invalid characters were found in group names but not replaced, use -vvvv to see details

PLAY [Install Docker] ******************************************************************************************************************************

TASK [Gathering Facts] *****************************************************************************************************************************
ok: [active2]
ok: [active3]

I can send you the debug log but there are no critical issues there, just a lot of fluff, and the ssh connections to active2 and active3 work fine. Any clues?

@valeriupredoi

also, just for the record, the deps have not been updated:

[vpredoi@activeh ~]$ pip install -r reductionist-rs/deployment/requirements.txt
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: ansible-core<2.16 in ./.local/lib/python3.9/site-packages (from -r reductionist-rs/deployment/requirements.txt (line 1)) (2.15.10)
Requirement already satisfied: jinja2>=3.0.0 in ./.local/lib/python3.9/site-packages (from ansible-core<2.16->-r reductionist-rs/deployment/requirements.txt (line 1)) (3.1.3)
Requirement already satisfied: PyYAML>=5.1 in /usr/lib64/python3.9/site-packages (from ansible-core<2.16->-r reductionist-rs/deployment/requirements.txt (line 1)) (5.4.1)
Requirement already satisfied: cryptography in ./.local/lib/python3.9/site-packages (from ansible-core<2.16->-r reductionist-rs/deployment/requirements.txt (line 1)) (42.0.5)
Requirement already satisfied: packaging in ./.local/lib/python3.9/site-packages (from ansible-core<2.16->-r reductionist-rs/deployment/requirements.txt (line 1)) (24.0)
Requirement already satisfied: resolvelib<1.1.0,>=0.5.3 in ./.local/lib/python3.9/site-packages (from ansible-core<2.16->-r reductionist-rs/deployment/requirements.txt (line 1)) (1.0.1)
Requirement already satisfied: importlib-resources<5.1,>=5.0 in ./.local/lib/python3.9/site-packages (from ansible-core<2.16->-r reductionist-rs/deployment/requirements.txt (line 1)) (5.0.7)
Requirement already satisfied: MarkupSafe>=2.0 in ./.local/lib/python3.9/site-packages (from jinja2>=3.0.0->ansible-core<2.16->-r reductionist-rs/deployment/requirements.txt (line 1)) (2.1.5)
Requirement already satisfied: cffi>=1.12 in ./.local/lib/python3.9/site-packages (from cryptography->ansible-core<2.16->-r reductionist-rs/deployment/requirements.txt (line 1)) (1.16.0)
Requirement already satisfied: pycparser in ./.local/lib/python3.9/site-packages (from cffi>=1.12->cryptography->ansible-core<2.16->-r reductionist-rs/deployment/requirements.txt (line 1)) (2.22)

🍺

@markgoddard
Author

@valeriupredoi I'd diff your new inventory against your old one. I expect you don't want to deploy minio.

@markgoddard
Author

Do you have activeh in /etc/hosts on activeh? Perhaps previously you were referring to it as localhost?

@valeriupredoi

yessir! Here's my hosts file:

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.171.169.xxx activeh
192.168.3.xxx active2
192.168.3.xxx active3

with actual numbers, not xxx - no, I was using exactly the same inventory as now, but with three machines listed under HAProxy; /etc/hosts has not changed either

@valeriupredoi

let me try rerunning with my previous configuration, see if that goes through (with the error at the end), so we can isolate the issue

@valeriupredoi

right! So the thing now hangs with my old setup from yesterday as well 🤦‍♂️ Need to see what's happened in the meantime

@valeriupredoi

@markgoddard apols for the tardiness: this is the bit that's hanging:

TASK [Gathering Facts] *****************************************************************************************************************************
task path: /home/vpredoi/reductionist-rs/deployment/site.yml:4
<activeh> ESTABLISH SSH CONNECTION FOR USER: None
<activeh> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 -o 'ControlPath="/home/vpredoi/.ansible/cp/33982bffcc"' activeh '/bin/sh -c '"'"'echo ~ && sleep 0'"'"''

-> the funny thing is that if I run that command myself, all is fine...am very confused 😖

@valeriupredoi

valeriupredoi commented Apr 17, 2024

figured it out thanks to @RosalynHatcher whom I owe a massive 🍺 - but am back to the Bootstrap CA issue now:

TASK [Bootstrap CA] ********************************************************************************************************************************
skipping: [activeh]
fatal: [active2]: FAILED! => {"changed": true, "cmd": ["step", "ca", "bootstrap", "--ca-url", "https://192.168.3.212:9999", "--fingerprint", "01e997c7513a7b35e51f58a653607b092b501d84d2ef998c42ea2e30bbca121c", "--install"], "delta": "0:00:00.016529", "end": "2024-04-17 15:35:25.308223", "msg": "non-zero return code", "rc": 1, "start": "2024-04-17 15:35:25.291694", "stderr": "error downloading root certificate: client GET https://192.168.3.212:9999/root/01e997c7513a7b35e51f58a653607b092b501d84d2ef998c42ea2e30bbca121c failed: dial tcp 192.168.3.212:9999: connect: no route to host", "stderr_lines": ["error downloading root certificate: client GET https://192.168.3.212:9999/root/01e997c7513a7b35e51f58a653607b092b501d84d2ef998c42ea2e30bbca121c failed: dial tcp 192.168.3.212:9999: connect: no route to host"], "stdout": "", "stdout_lines": []}
fatal: [active3]: FAILED! => {"changed": true, "cmd": ["step", "ca", "bootstrap", "--ca-url", "https://192.168.3.212:9999", "--fingerprint", "01e997c7513a7b35e51f58a653607b092b501d84d2ef998c42ea2e30bbca121c", "--install"], "delta": "0:00:00.014800", "end": "2024-04-17 15:35:25.340400", "msg": "non-zero return code", "rc": 1, "start": "2024-04-17 15:35:25.325600", "stderr": "error downloading root certificate: client GET https://192.168.3.212:9999/root/01e997c7513a7b35e51f58a653607b092b501d84d2ef998c42ea2e30bbca121c failed: dial tcp 192.168.3.212:9999: connect: no route to host", "stderr_lines": ["error downloading root certificate: client GET https://192.168.3.212:9999/root/01e997c7513a7b35e51f58a653607b092b501d84d2ef998c42ea2e30bbca121c failed: dial tcp 192.168.3.212:9999: connect: no route to host"], "stdout": "", "stdout_lines": []}

😖

@valeriupredoi

note that the reductionist verification also fails, for activeh - which is kinda to be expected since the CA massage was skipped there:

TASK [Wait for reductionist server to be accessible via HAProxy] ***********************************************************************************
FAILED - RETRYING: [activeh]: Wait for reductionist server to be accessible via HAProxy (3 retries left).
FAILED - RETRYING: [activeh]: Wait for reductionist server to be accessible via HAProxy (2 retries left).
FAILED - RETRYING: [activeh]: Wait for reductionist server to be accessible via HAProxy (1 retries left).
fatal: [activeh]: FAILED! => {"attempts": 3, "changed": false, "elapsed": 0, "msg": "Status code was -1 and not [200]: Request failed: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1129)>", "redirected": false, "status": -1, "url": "https://192.168.3.212:8080/.well-known/reductionist-schema"}

🍺

@markgoddard
Author

Your bootstrapping is failing with network errors:

dial tcp 192.168.3.212:9999: connect: no route to host

Perhaps the port is not open in a firewall/security group?

@markgoddard
Author

Also looks like you are not using the changes in this PR because the task to get the fingerprint is skipped on activeh

@valeriupredoi

I am - I am using this branch, bud. Port 9999 doesn't actually exist afaik, does it?

@valeriupredoi

yes it is indeed skipped - the problem is with active2 and active3 for Bootstrap CA - but even activeh fails at the end with reductionist unreachable - very possible it's how @bnlawrence has configured active2 and 3?

@valeriupredoi

this is the full relevant section:

TASK [Check whether step has been bootstrapped] ****************************************************************************************************
ok: [active2]
ok: [active3]
ok: [activeh]

TASK [Get CA fingerprint] **************************************************************************************************************************
ok: [activeh]


TASK [Bootstrap CA] ********************************************************************************************************************************
skipping: [activeh]
fatal: [active2]: FAILED! => {"changed": true, "cmd": ["step", "ca", "bootstrap", "--ca-url", "https://192.168.3.212:9999", "--fingerprint", "01e997c7513a7b35e51f58a653607b092b501d84d2ef998c42ea2e30bbca121c", "--install"], "delta": "0:00:00.014868", "end": "2024-04-17 16:14:49.526672", "msg": "non-zero return code", "rc": 1, "start": "2024-04-17 16:14:49.511804", "stderr": "error downloading root certificate: client GET https://192.168.3.212:9999/root/01e997c7513a7b35e51f58a653607b092b501d84d2ef998c42ea2e30bbca121c failed: dial tcp 192.168.3.212:9999: connect: no route to host", "stderr_lines": ["error downloading root certificate: client GET https://192.168.3.212:9999/root/01e997c7513a7b35e51f58a653607b092b501d84d2ef998c42ea2e30bbca121c failed: dial tcp 192.168.3.212:9999: connect: no route to host"], "stdout": "", "stdout_lines": []}
fatal: [active3]: FAILED! => {"changed": true, "cmd": ["step", "ca", "bootstrap", "--ca-url", "https://192.168.3.212:9999", "--fingerprint", "01e997c7513a7b35e51f58a653607b092b501d84d2ef998c42ea2e30bbca121c", "--install"], "delta": "0:00:00.014415", "end": "2024-04-17 16:14:49.538207", "msg": "non-zero return code", "rc": 1, "start": "2024-04-17 16:14:49.523792", "stderr": "error downloading root certificate: client GET https://192.168.3.212:9999/root/01e997c7513a7b35e51f58a653607b092b501d84d2ef998c42ea2e30bbca121c failed: dial tcp 192.168.3.212:9999: connect: no route to host", "stderr_lines": ["error downloading root certificate: client GET https://192.168.3.212:9999/root/01e997c7513a7b35e51f58a653607b092b501d84d2ef998c42ea2e30bbca121c failed: dial tcp 192.168.3.212:9999: connect: no route to host"], "stdout": "", "stdout_lines": []}

TASK [Install root certificate to system] **********************************************************************************************************
skipping: [activeh]

...

@markgoddard
Author

> I am - I am using this branch, bud. Port 9999 doesn't actually exist afaik, does it?

Perhaps you still have the code changes you made previously? That task should not be skipped if using this branch.

@markgoddard
Author

> > I am - I am using this branch, bud. Port 9999 doesn't actually exist afaik, does it?
>
> Perhaps you still have the code changes you made previously? That task should not be skipped if using this branch.

ok, my mistake - it's the Get CA fingerprint task that should not be skipped, and it's not.

@markgoddard
Author

What do you mean port 9999 doesn't exist? TCP/UDP ports go up to 65,535

@markgoddard
Author

> note that the reductionist verification also fails, for activeh - which is kinda to be expected since the CA massage was skipped there:
>
> TASK [Wait for reductionist server to be accessible via HAProxy] ***********************************************************************************
> FAILED - RETRYING: [activeh]: Wait for reductionist server to be accessible via HAProxy (3 retries left).
> FAILED - RETRYING: [activeh]: Wait for reductionist server to be accessible via HAProxy (2 retries left).
> FAILED - RETRYING: [activeh]: Wait for reductionist server to be accessible via HAProxy (1 retries left).
> fatal: [activeh]: FAILED! => {"attempts": 3, "changed": false, "elapsed": 0, "msg": "Status code was -1 and not [200]: Request failed: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1129)>", "redirected": false, "status": -1, "url": "https://192.168.3.212:8080/.well-known/reductionist-schema"}
>
> 🍺

It's failing with certificate expired. Step CA uses short-lived certificates, so probably the renewal isn't working for some reason.

@valeriupredoi

valeriupredoi commented Apr 17, 2024

OK:

[vpredoi@active2 ~]$ sudo firewall-cmd --add-port=9999/tcp
Warning: ALREADY_ENABLED: '9999:tcp' already in 'public'
success

and same for active3, and regarding the current reductionist-rs:

[vpredoi@activeh ~]$ cd reductionist-rs/
[vpredoi@activeh reductionist-rs]$ git status
On branch deployment
Your branch is up to date with 'origin/deployment'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   deployment/inventory

no changes added to commit (use "git add" and/or "git commit -a")

where the inventory file is changed to have it for this deployment. At any rate, I just pulled the latest:

[vpredoi@activeh reductionist-rs]$ git pull origin deployment
From https://github.com/stackhpc/reductionist-rs
 * branch            deployment -> FETCH_HEAD
Already up to date.

Have you pushed any changes? 🍺

@valeriupredoi

> What do you mean port 9999 doesn't exist? TCP/UDP ports go up to 65,535

my bad, I thought the biggest one was 9090 🤣

@markgoddard
Author

Port 9999 needs to be accessible on activeh. Are you able to curl it (using HTTPS)?
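
A quick way to separate "port blocked" from "service broken" is a plain TCP connection test from the client host. The sketch below is only a reachability probe, not a replacement for the HTTPS curl check; the function name `port_open` is hypothetical.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Any OSError (connection refused, no route to host, timeout) counts
    as "not reachable".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run from active2 or active3, `port_open("192.168.3.212", 9999)` returning False would point at the firewall or routing rather than at the step CA itself.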

@markgoddard
Author

> > What do you mean port 9999 doesn't exist? TCP/UDP ports go up to 65,535
>
> my bad, I thought the biggest one was 9090 🤣

That's what they want you to believe

@valeriupredoi

valeriupredoi commented Apr 17, 2024

you, sir, are a life-saver! Completely forgot to open port 9999 on activeh - now, massive progress, but it stumbled right at the end:

TASK [Wait for reductionist server to be accessible via HAProxy] ***********************************************************************************
FAILED - RETRYING: [activeh]: Wait for reductionist server to be accessible via HAProxy (3 retries left).
FAILED - RETRYING: [activeh]: Wait for reductionist server to be accessible via HAProxy (2 retries left).
FAILED - RETRYING: [activeh]: Wait for reductionist server to be accessible via HAProxy (1 retries left).
fatal: [activeh]: FAILED! => {"attempts": 3, "changed": false, "elapsed": 0, "msg": "Status code was -1 and not [200]: Request failed: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1129)>", "redirected": false, "status": -1, "url": "https://192.168.3.xxx:8080/.well-known/reductionist-schema"}

PLAY RECAP *****************************************************************************************************************************************
active2                    : ok=20   changed=2    unreachable=0    failed=0    skipped=7    rescued=0    ignored=0   
active3                    : ok=20   changed=2    unreachable=0    failed=0    skipped=7    rescued=0    ignored=0   
activeh                    : ok=44   changed=1    unreachable=0    failed=1    skipped=12   rescued=0    ignored=0

Should I regenerate the step root_ca.crt?
🍻

@valeriupredoi

actually, hang on, just opened 8080 too (was not open 🤦‍♂️ ) now it clearly says cert is expired:

fatal: [activeh]: FAILED! => {"attempts": 3, "changed": false, "elapsed": 0, "msg": "Status code was -1 and not [200]: Request failed: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1129)>", "redirected": false, "status": -1, "url": "https://192.168.3.212:8080/.well-known/reductionist-schema"}

@valeriupredoi

valeriupredoi commented Apr 17, 2024

OK, that certificate is not expired:

[vpredoi@activeh ~]$ sudo step-cli certificate inspect root_ca.crt 
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: hidden 
    Signature Algorithm: ECDSA-SHA256
        Issuer: O=Smallstep,CN=Smallstep Root CA
        Validity
            Not Before: Apr 16 13:04:45 2024 UTC
            Not After : Apr 14 13:04:45 2034 UTC
        Subject: O=Smallstep,CN=Smallstep Root CA
        Subject Public Key Info:
            Public Key Algorithm: ECDSA
                Public-Key: (256 bit)

Mark, any clues why the pinger would think it expired?

@valeriupredoi

sorry bud, last message for today I promise (just about to go home): the deployed Reductionist on activeh is using the activeh's backend IP "url": "https://192.168.3.xxx:8080/.well-known/reductionist-schema" - that'll never work AFAIK since the public facing IP is needed (as I found out with the last deployment from a month ago)

@markgoddard
Author

> OK, that certificate is not expired:
>
> [vpredoi@activeh ~]$ sudo step-cli certificate inspect root_ca.crt
> Certificate:
>     Data:
>         Version: 3 (0x2)
>         Serial Number: hidden
>     Signature Algorithm: ECDSA-SHA256
>         Issuer: O=Smallstep,CN=Smallstep Root CA
>         Validity
>             Not Before: Apr 16 13:04:45 2024 UTC
>             Not After : Apr 14 13:04:45 2034 UTC
>         Subject: O=Smallstep,CN=Smallstep Root CA
>         Subject Public Key Info:
>             Public Key Algorithm: ECDSA
>                 Public-Key: (256 bit)
>
> Mark, any clues why the pinger would think it expired?

That is the root CA certificate, not the server certificate(s). Those are generated by the step CLI in ~/.config/reductionist/certs/ and have a much shorter life.
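
To check whether a given certificate has expired, the validity window can be read from the inspect output. The following is a small sketch that assumes the output format shown earlier in this thread (`Not Before:` / `Not After :` lines with a trailing `UTC`); the function names are hypothetical, and month-name parsing assumes an English/C locale.

```python
from datetime import datetime, timezone


def parse_validity(inspect_output):
    """Extract (not_before, not_after) as timezone-aware UTC datetimes.

    Assumes lines such as 'Not After : Apr 14 13:04:45 2034 UTC'.
    Raises KeyError if either line is missing.
    """
    found = {}
    for line in inspect_output.splitlines():
        label, sep, value = line.partition(":")
        key = label.strip()
        if sep and key in ("Not Before", "Not After"):
            stamp = datetime.strptime(value.strip(), "%b %d %H:%M:%S %Y UTC")
            found[key] = stamp.replace(tzinfo=timezone.utc)
    return found["Not Before"], found["Not After"]


def is_expired(inspect_output, now=None):
    """Return True if 'now' (default: current UTC time) is past Not After."""
    now = now or datetime.now(timezone.utc)
    _, not_after = parse_validity(inspect_output)
    return now > not_after
```

Feeding it the inspect output of a server certificate, rather than root_ca.crt, would show whether the short-lived certificate is the one that has lapsed.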

@markgoddard
Author

> sorry bud, last message for today I promise (just about to go home): the deployed Reductionist on activeh is using the activeh's backend IP "url": "https://192.168.3.xxx:8080/.well-known/reductionist-schema" - that'll never work AFAIK since the public facing IP is needed (as I found out with the last deployment from a month ago)

Perhaps we can talk about your exact network setup tomorrow, but there is a variable called reductionist_host in deployment/group_vars/reductionist that you can set for the frontend host on which HAProxy will expose the reductionist API.

@valeriupredoi

valeriupredoi commented Apr 17, 2024

both those two things - great clues, many thanks for taking the time with me, Mark, and my apologies for bombarding you with questions - I realize I am annoying, but, as you can see, am a total n00b at ansibles and its networking, and I want to get this done. I owe you a couple pints for sure! I reckon by tomorrow, given what you pointed me to, we'll get it to successfully deploy (and work) 🍻

@markgoddard
Author

> both those two things - great clues, many thanks for taking the time with me, Mark, and my apologies for bombarding you with questions - I realize I am annoying, but, as you can see, am a total n00b at ansibles and its networking, and I want to get this done. I owe you a couple pints for sure! I reckon by tomorrow, given what you pointed me to, we'll get it to successfully deploy (and work) 🍻

I've pushed some changes that should help. There is a fix for the wait task that would be necessary if you modify reductionist_host. There are also some docs changes that provide more info about how your hosts need to be set up, with required ports etc.

@valeriupredoi

Mark, you're a bloody wizard! I took the latest changes you made here, opened port 8081 on active2 and active3, and the thing ran with absolutely no hitch:

PLAY RECAP *****************************************************************************************************************************************
active2                    : ok=20   changed=1    unreachable=0    failed=0    skipped=8    rescued=0    ignored=0   
active3                    : ok=20   changed=1    unreachable=0    failed=0    skipped=8    rescued=0    ignored=0   
activeh                    : ok=46   changed=1    unreachable=0    failed=0    skipped=12   rescued=0    ignored=0

I'm actually gonna poke it about see what's what, but boy am I happy to see no fails and comms and certs not barking at me 😁 🍻

@valeriupredoi

we got "Hello world!" from a remote client (me laptop) to activeh

(base) valeriu@valeriu-PORTEGE-Z30-C:~$ curl -k https://192.171.169.xxx:8080/.well-known/reductionist-schema
Hello, world!

where 192... is activeh public facing IP, and biutiful Hello Worlds from activeh to the two active2 and active3 via backend IPs and ports 8081:

[vpredoi@activeh ~]$ curl -k https://192.168.3.xxx:8081/.well-known/reductionist-schema
Hello, world!
[vpredoi@activeh ~]$ curl -k https://192.168.3.xxx:8081/.well-known/reductionist-schema
Hello, world!

I am over the moon 😄

@valeriupredoi

just ran an actual PyActiveStorage test and it runs very well (let's not concern ourselves with the timings just yet) - reductionist process running on each of the three computers: activeh, active2, and active3 🥳

@valeriupredoi valeriupredoi left a comment

absolute legend @markgoddard 🍺 Very many thanks for this and your CH (continuous help) over the past couple days. One itty bitty mention I'd put in here, so others don't struggle like me: the ssh connection from the main node (activeh in my case) should not be init-ed with an eval of the ssh-agent, i.e. eval $(ssh-agent -s), because that results in Ansible needing a password to be entered, but it doesn't ask for it explicitly and instead hangs - this is what my lovely colleague @RosalynHatcher sorted me out with, me barely speaking any ssh. Apart from that, I owe you a couple pints, mate 🍺

@markgoddard
Author

> absolute legend @markgoddard 🍺 Very many thanks for this and your CH (continuous help) over the past couple days. One itty bitty mention I'd put in here, so others don't struggle like me: the ssh connection from the main node (activeh in my case) should not be init-ed with an eval of the ssh-agent, i.e. eval $(ssh-agent -s), because that results in Ansible needing a password to be entered, but it doesn't ask for it explicitly and instead hangs - this is what my lovely colleague @RosalynHatcher sorted me out with, me barely speaking any ssh. Apart from that, I owe you a couple pints, mate 🍺

I've added a note about the SSH agent issue. Will merge once CI goes green again. Thanks for trying out my changes!
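
For anyone hitting the same silent hang, one possible pre-flight check is to try SSH in batch mode, where OpenSSH fails instead of waiting for a passphrase or password prompt. This is a minimal sketch under that assumption, not something the playbook does; both function names are hypothetical.

```python
import subprocess


def ssh_check_command(host, connect_timeout=10):
    """Build an ssh command that never prompts (BatchMode=yes)."""
    return [
        "ssh",
        "-o", "BatchMode=yes",
        "-o", f"ConnectTimeout={connect_timeout}",
        host,
        "true",  # trivial remote command; only the exit status matters
    ]


def can_ssh_noninteractively(host, timeout=20):
    """Return True if ssh to host succeeds without any interactive prompt."""
    try:
        result = subprocess.run(
            ssh_check_command(host), capture_output=True, timeout=timeout
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False
    return result.returncode == 0
```

If `can_ssh_noninteractively("activeh")` returns False while an interactive `ssh activeh` works, the agent or key setup is a likely culprit for the "Gathering Facts" hang.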

@markgoddard markgoddard merged commit b6aeda6 into main Apr 19, 2024
8 checks passed
@markgoddard markgoddard deleted the deployment branch April 19, 2024 08:08
@valeriupredoi

🍻
