Replace telnetlib with scrapli #279

Open
wants to merge 17 commits into
base: scrapli-dev


@kaelemc kaelemc commented Nov 7, 2024

This PR replaces the usage of telnetlib in vrnetlab with scrapli.

Changes

  • telnetlib has been removed; the Scrapli base driver class now provides the underlying telnet functionality.

    • Previously telnetlib was also used to connect to the qemu monitor. I didn't see this used anywhere, so I removed that functionality as well.
    • All nodes have had old telnetlib functions swapped with new scrapli-based ones.
    • All Dockerfiles except OpenWRT (and maybe one or two more I'm forgetting) should be on debian:bookworm-slim from the Amazon ECR.
    • Dockerfiles have had git and pip added, along with a command that installs Scrapli from its git repo.
  • Cisco and Juniper nodes are in their own vendor-specific subdirectory.

    • Defined a quick-and-dirty VENDOR_SUBDIR variable in each node's Makefile, plus logic in makefile.include: when VENDOR_SUBDIR is 1, files are copied one directory level further to account for the vendor subdirectory.
  • Logger for VR class (inherited by all nodes) now uses a logger under the vrnetlab instance. This is because Scrapli uses the root logging instance, and outputs all channel output as 'debug' log level.

    • The root logger is now set to only write 'info' level logs.
    • self.logger will still write 'debug' logs to stdout (docker logs).
  • Coloured the logging levels so it's easier to see, looks nicer visually too.
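
A minimal sketch of the logger split described above, assuming a dedicated logger named "vrnetlab" (the function name `configure_logging` and the format string are illustrative, not taken from the PR). The root logger is capped at INFO, so loggers that inherit from it drop their debug records, while the dedicated logger keeps emitting debug output to stdout:

```python
import logging
import sys


def configure_logging(name: str = "vrnetlab") -> logging.Logger:
    """Cap the root logger at INFO and return a DEBUG-level named logger."""
    # Loggers without an explicit level inherit this one, so their debug
    # records are filtered out.
    logging.getLogger().setLevel(logging.INFO)

    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    # Do not hand records to root handlers; this logger writes on its own.
    logger.propagate = False

    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

With this in place, `configure_logging().debug("...")` still reaches docker logs, while a library logger that inherits its level from the root stays at INFO.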

Functions

read_until()

telnetlib had a read_until() function which was used by wait_write. This has been re-implemented under the VR class.

Args:

  • match_str: a string of the text to wait for
  • timeout: a float of how long to wait before exiting the function even if no match is found. Defaults to None.

Returns: All console output that was read up until the match, or function timeout.
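
As a rough illustration of the behaviour described above (not the PR's actual implementation), the loop can be sketched as follows. The `read` callable is a hypothetical stand-in for a non-blocking read on the telnet channel that returns `b""` when nothing is pending:

```python
import time
from typing import Callable, Optional


def read_until(read: Callable[[], bytes], match_str: str,
               timeout: Optional[float] = None) -> bytes:
    """Accumulate console output until match_str appears or timeout expires."""
    needle = match_str.encode()
    buf = b""
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        buf += read()  # hypothetical non-blocking read on the console
        if needle in buf:
            return buf  # everything read so far, including the match
        if deadline is not None and time.monotonic() >= deadline:
            return buf  # timed out: hand back whatever was read
        time.sleep(0.01)  # avoid spinning while the device is quiet
```

With `timeout=None` the sketch waits indefinitely, matching the documented default.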

wait_write()

Adapted to scrapli, functionality should be analogous to the telnetlib version.

  • Added argument timeout (float): passed through to read_until. Defaults to None.

expect()

telnetlib had the expect() function which was used in the launch.py entrypoint. This new adapted version in the VR class should function the same as the telnetlib one.

Args:

  • regex_list: a list of byte-strings which are used to match via regex on the console output.
  • timeout: a float of how long before the function should just timeout and stop waiting. Defaults to None.

Returns: a tuple of:

  • List index of the matched byte-string. Defaults to -1 if there was no match. (int)
  • The regex match object. (re.Match)
  • The console output up until that point (byte-string)
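
The return contract above can be sketched like this. It is an illustration under assumptions rather than the code in the PR: `read` is again a hypothetical non-blocking channel read, and a real implementation would likely keep any bytes after the match buffered instead of discarding them.

```python
import re
import time
from typing import Callable, List, Optional, Tuple


def expect(read: Callable[[], bytes], regex_list: List[bytes],
           timeout: Optional[float] = None
           ) -> Tuple[int, Optional[re.Match], bytes]:
    """Wait until one of the byte-string regexes matches the console output."""
    patterns = [re.compile(rx) for rx in regex_list]
    buf = b""
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        buf += read()  # hypothetical non-blocking read on the console
        for index, pattern in enumerate(patterns):
            match = pattern.search(buf)
            if match:
                # Output up to and including the end of the match.
                return index, match, buf[: match.end()]
        if deadline is not None and time.monotonic() >= deadline:
            return -1, None, buf  # no pattern matched in time
        time.sleep(0.01)
```

Patterns are tried in list order, so when two of them match the same buffer the lower index is returned.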

print()

The VR class now has a print function which is used to simply write to stdout and instantly flush. It's used so that the telnet console can output nicely on docker logs.

If the console output was printed via the logger, the formatting would make it difficult to interpret the output.

Args:

  • bytes: byte-string of what to write to stdout.
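
A minimal sketch of such a helper, written as a free function with an optional `stream` parameter so it can be exercised without a real console (both the name `console_print` and the `stream` parameter are illustrative assumptions, not part of the PR):

```python
import sys
from typing import BinaryIO, Optional


def console_print(data: bytes, stream: Optional[BinaryIO] = None) -> None:
    """Write raw console bytes to stdout and flush immediately."""
    # sys.stdout.buffer is the binary layer under stdout, so the bytes are
    # written as-is without decoding or logger formatting.
    out = stream if stream is not None else sys.stdout.buffer
    out.write(data)
    out.flush()
```

Flushing after every write keeps the output visible in docker logs as it arrives rather than when a buffer fills.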

Usage

Relevant changes to make a node work:

  • Ensure git, pip and scrapli are installed in the Dockerfile

  • Replace self.tn.expect() with self.expect()

  • Use self.print() to print any console output; do not use logging for this. Do not decode()/convert the byte-string returned by expect(): self.print() doesn't accept strings, and converting would result in malformed output.

  • Logging levels for any log output should be changed to be more appropriate.

  • INFO is Green

  • WARNING is Yellow

  • ERROR/DEBUG is Red
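
The colour mapping listed above could be implemented with a small formatter along these lines. This is a sketch under assumptions: the class name `ColourFormatter` and the use of ANSI escape codes are illustrative, not confirmed by the PR.

```python
import logging

RESET = "\x1b[0m"
# ANSI colour codes per level, following the mapping listed above.
COLOURS = {
    logging.DEBUG: "\x1b[31m",    # red
    logging.INFO: "\x1b[32m",     # green
    logging.WARNING: "\x1b[33m",  # yellow
    logging.ERROR: "\x1b[31m",    # red
}


class ColourFormatter(logging.Formatter):
    """Wrap the level name in an ANSI colour code."""

    def format(self, record: logging.LogRecord) -> str:
        colour = COLOURS.get(record.levelno)
        if colour is None:
            return super().format(record)
        # Format a copy so other handlers see the original level name.
        coloured = logging.makeLogRecord(record.__dict__)
        coloured.levelname = f"{colour}{record.levelname}{RESET}"
        return super().format(coloured)
```

It can be attached with `handler.setFormatter(ColourFormatter("%(levelname)s %(message)s"))`.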

Other

Only Cisco nodes have been tested working (csr1000v, cat8kv, cat9kv, ASAv, FTDv, vIOS, n9kv, XRv, XRv9k). I have not tested nxos.

  • XRv9k, CSR, cat8kv and cat9kv nodes use their relevant Scrapli driver.
  • XRv didn't play nicely with the XR driver, I assume due to poor performance and old code version (XRv is deprecated and only runs 6.x code, XRv9k is 7.x)
  • ASAv was not working correctly, had to make fixes for that.

I don't expect other nodes to 'plug and play' and work perfectly from the get-go. I'm sure I have made errors or overlooked something, so some work is required on the other nodes to ensure functionality.

  • For other nodes I think the Scrapli drivers can probably increase reliability and make it easier to send configs.

Collaboration from others is required so we can confirm all nodes work reliably (as well as to fix some other possible issues in the way I've implemented the changes). I'm open to feedback 😊.

- Add python3-pip, git and scrapli install via pip to all node Dockerfiles.

- Replace all nodes' telnetlib functions with new adapted scrapli functions. (wait_write, expect, read_until)

- Move Cisco and Juniper nodes to their own vendor subdirectories, fix makefiles to ensure functionality works.

- Other things, which should be outlined in the PR.
@kaelemc
Author

kaelemc commented Nov 8, 2024

It might be worth making another scrapli branch and then I can change this PR to merge into that branch (instead of master).

This way other contributors can submit PRs for any changes to other nodes and once all/enough nodes work, that branch can be merged into master?

@hellt
Owner

hellt commented Nov 8, 2024

sound idea @kaelemc
I have created the scrapli-dev branch

@kaelemc kaelemc changed the base branch from master to scrapli-dev November 8, 2024 09:49
…nfigurations

- Startup and bootstrap configurations use scrapli with IOSXEDriver
- Changed variable name from 'csr' to 'cat8kv' in cat8kv install function.
- Reverted change in bootstrap_spin so that console output is evaluated against empty byte-string instead of regular string.
@kaelemc
Author

kaelemc commented Nov 9, 2024

The IOS-XE nodes (CSR1000V, Cat8kv and Cat9kv) now use the Scrapli XE driver. I had some issues getting it to work with IOS-XE 16.x, but that was an error on my end. The configs (both bootstrap and startup) now apply nicely.

- existing XRv node uses vmdk image only. This one will use the qcow2 image.
@kaelemc
Author

kaelemc commented Nov 9, 2024

Maybe this belongs in its own PR, but I've added an xrv_qemu directory. It is the same as XRv but with some modifications to make XRv work when the user provides the qcow2 image.

The existing xrv directory requires vmdk images. Most users who import images from CML or other places on the web will most likely get a qcow2, so this saves them from having to mess around with qemu-img. (Even the vmdk I have was converted from a qcow2 via qemu-img.)

I tried one more time to use the scrapli IOSXRDriver with XRv but it still seems unreliable, and I get some weird behavior from the XRv (don't think this is scrapli's fault).

- Alter the log levels for some logs from debug->error
- Move the VM startup log message and add information about qemu smp and startup RAM.
@kaelemc kaelemc marked this pull request as ready for review November 9, 2024 10:39
@hellt hellt mentioned this pull request Nov 9, 2024
@jbemmel

jbemmel commented Nov 9, 2024

Not a trivial change, but I would suggest to create 1 base Dockerfile for all of vrnetlab, and then derive all platform images from that

It doesn't make sense to do 100x "install scrapli" in all those separate files - that becomes unmanageable

@hellt
Owner

hellt commented Nov 9, 2024

yeah we can definitely create the base image hosted on ghcr.io with base packages that every vr system relies on

@kaelemc
Author

kaelemc commented Nov 9, 2024

@jbemmel Excellent idea, certainly is/was unmanageable for me

@kaelemc
Author

kaelemc commented Nov 10, 2024

So this is a difficult one to balance: in labs of different sizes the nodes take varying times to boot (depending on system specs etc.). I'm wondering if we should bump all the scrapli timeouts to something very large to make the connection timing out a non-issue?

Currently I have XE devices on 10 minutes. I have a feeling people with lower-spec systems might hit timeouts, meaning they could never boot the device. Should we increase to something very large just to be safe?

The base Scrapli Driver (underlying telnet connection for wait_write, expect and read_until functions) uses a timeout of 1hr.

@hellt
Owner

hellt commented Nov 10, 2024

I agree, let's have a long enough timeout so that you don't have to guess the time down to minutes.

We should leave the timeout configurable via an env var that can then be set in the topology file if it needs tuning.

@kaelemc
Author

kaelemc commented Nov 10, 2024

Ok 👍, having timeouts as an env var is a good idea

- New env var SCRAPLI_TIMEOUT added, defaults to 3600 seconds (1 hour).
- It's used to enable the user to modify the operation, transport and socket timeout for the Scrapli driver used to apply the config to the CSR.
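
A hedged sketch of how such an env var might be read and fanned out to the three timeouts. The helper name `scrapli_timeout_kwargs` and the fallback to the default on an invalid value are assumptions; `timeout_ops`, `timeout_transport` and `timeout_socket` are the Scrapli driver keyword arguments for the operation, transport and socket timeouts.

```python
import os
from typing import Dict, Mapping

DEFAULT_SCRAPLI_TIMEOUT = 3600.0  # seconds (1 hour), the default stated above


def scrapli_timeout_kwargs(environ: Mapping[str, str] = os.environ) -> Dict[str, float]:
    """Build Scrapli timeout kwargs from the SCRAPLI_TIMEOUT env var."""
    raw = environ.get("SCRAPLI_TIMEOUT", "")
    try:
        timeout = float(raw)
    except ValueError:
        # Assumption: unset or malformed values fall back to the default.
        timeout = DEFAULT_SCRAPLI_TIMEOUT
    if timeout <= 0:
        timeout = DEFAULT_SCRAPLI_TIMEOUT
    return {
        "timeout_ops": timeout,
        "timeout_transport": timeout,
        "timeout_socket": timeout,
    }
```

The resulting dict could then be unpacked into the driver constructor, for example `IOSXEDriver(**device, **scrapli_timeout_kwargs())`, where `device` is a hypothetical dict of connection settings.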
@kaelemc
Author

kaelemc commented Nov 13, 2024

I had to remove NETCONF configuration on CSR as different versions of IOS-XE 16.x and 17.x have different behaviors when this command is entered in the config.

In 16.x a prompt will appear asking for some confirmation about the NETCONF configuration. I figured it's best to just remove this instead of adding extra complexity to handle the prompt for that single command.

Thoughts?

- Remove env var printing in SROS launch.py (this is done in vrnetlab now)
- Comment out the clean buffer parameter default (this isn't supported anymore), but keep it in case we want to add it back.
@kaelemc
Author

kaelemc commented Nov 13, 2024

SROS seems to be working fine with minimal changes.

I just had to comment out the clean_buffer=true default in wait_write() (since this is not supported anymore). If I run into issues I'll implement the functionality back in.

I also removed the env var printing from the SROS side, since this is done as part of vrnetlab now.


Development

Successfully merging this pull request may close these issues.

vendor telnetlib
3 participants